nearai-bench

Benchmarking harness for evaluating AI agents. Extracted from ironclaw.

Available Suites

| Suite | Description |
| --- | --- |
| trajectory | Multi-turn trajectory scenarios with per-turn assertions (supersedes spot) |
| spot | End-to-end spot checks: conversation, tool use, chaining, robustness |
| custom | Custom JSONL tasks with flexible scoring (exact, contains, regex, LLM) |
| gaia | GAIA benchmark (knowledge and reasoning) |
| tau_bench | Tau-bench (multi-turn tool-calling dialog) |
| swe_bench | SWE-bench Pro (real-world software engineering) |
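The custom suite's non-LLM scoring modes can be sketched roughly as follows. This is an illustrative sketch, not the harness's actual API; the function and field names (`score`, `scoring`, `expected`) are assumptions.

```python
import re

def score(task: dict, response: str) -> bool:
    """Score a response against a task's expected answer.

    Rough sketch of the custom suite's exact / contains / regex
    modes; the real harness's field names may differ.
    """
    mode = task.get("scoring", "exact")
    expected = task["expected"]
    if mode == "exact":
        return response.strip() == expected
    if mode == "contains":
        return expected in response
    if mode == "regex":
        return re.search(expected, response) is not None
    raise ValueError(f"unknown scoring mode: {mode}")

print(score({"scoring": "contains", "expected": "42"}, "The answer is 42."))  # True
```

The LLM scoring mode is omitted here, since it depends on a judge model call rather than a local predicate.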

Quick Start

# 1. Configure your LLM provider (pick one)
cp .env.example .env
# Edit .env with your API key (OPENAI_API_KEY, ANTHROPIC_API_KEY, or LLM_* vars)

# 2. List available suites
nearai-bench list

# 3. Run trajectory scenarios
nearai-bench run --suite trajectory --config suites/trajectory.toml

# Run with a specific model
nearai-bench run --suite trajectory --config suites/trajectory.toml --model gpt-4o

# View latest results
nearai-bench results latest

# Compare two runs
nearai-bench compare <baseline-uuid> <comparison-uuid>

LLM Provider Setup

Copy .env.example to .env and set your provider credentials. The harness supports any OpenAI-compatible API endpoint.

OpenAI (simplest):

OPENAI_API_KEY=sk-...

Anthropic:

ANTHROPIC_API_KEY=sk-ant-...

Any OpenAI-compatible provider (OpenRouter, Together, vLLM, Ollama, etc.):

LLM_BACKEND=openai_compatible
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_KEY=sk-or-...
LLM_MODEL=anthropic/claude-sonnet-4

NEAR AI (requires ironclaw onboarding):

LLM_BACKEND=nearai

Project Structure

benchmarks/
  datasets/          Versioned benchmark datasets
    spot/v1/           21 spot-check tasks
    swe-bench-lite/v1/ SWE-bench Lite dataset (astropy subset)
  suites/            Suite configuration files (TOML)
  baselines/         Curated reference results by suite
  results/           Run output, namespaced by harness
    ironclaw/          Results from the ironclaw harness
  src/               Harness source code
    adapters/          Suite adapter implementations

Trajectory Suite

The trajectory suite replays multi-turn conversation scenarios with per-turn assertions (tool selection, response content, call limits). See datasets/trajectory/v1/ for examples.
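A scenario file might look roughly like this; the field names below are illustrative only, so consult datasets/trajectory/v1/ for the actual schema:

```json
{
  "name": "weather-lookup",
  "turns": [
    {
      "user": "What's the weather in Lisbon?",
      "assert": {
        "tool_called": "get_weather",
        "max_tool_calls": 2,
        "response_contains": "Lisbon"
      }
    }
  ]
}
```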

Workspace Identity Files

In production, the IronClaw agent loads identity files (SOUL.md, IDENTITY.md, AGENTS.md, USER.md, TOOLS.md) from its workspace to build the system prompt. To benchmark with the same system prompt as production, point workspace_path at the deployment repo's workspace directory:

[suite_config]
dataset_path = "datasets/trajectory/v1"
workspace_path = "../openclaw-nearai-worker/ironclaw-worker/workspace"

All .md files from that directory are loaded and seeded into a fresh in-memory workspace for each benchmark task. The agent's Workspace::system_prompt() then assembles them into the system prompt exactly as it would in production.

Per-scenario setup.identity overrides in the JSON scenario files take precedence over the base files from workspace_path.

Important: Only files explicitly provided are loaded. Unlike ironclaw's standalone app, the library does not auto-seed default identity files. If a file is missing from the workspace directory, it is simply absent from the system prompt — no defaults are injected.
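The loading rules above can be sketched as follows. This is a rough Python sketch of the behavior, not the harness's actual implementation; the function name and signature are assumptions.

```python
from pathlib import Path

def build_identity(workspace_path: str, overrides: dict[str, str]) -> dict[str, str]:
    """Seed identity files from workspace_path, then apply
    per-scenario setup.identity overrides on top. Files missing
    from the directory are simply absent; no defaults are injected."""
    files = {}
    for md in sorted(Path(workspace_path).glob("*.md")):
        files[md.name] = md.read_text()
    files.update(overrides)  # scenario overrides take precedence
    return files
```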

Datasets

Datasets live under datasets/{suite-name}/v{N}/tasks.jsonl. The versioning scheme lets datasets evolve without invalidating older results that reference a prior version.

Adding a New Dataset

  1. Create datasets/{name}/v1/tasks.jsonl in the appropriate JSONL format.
  2. Create suites/{name}.toml pointing suite_config.dataset_path at the new file.
  3. If the suite type doesn't exist, implement a new adapter in src/adapters/.
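Steps 1 and 2 together might look roughly like this for a hypothetical custom-suite dataset (task fields are illustrative, not a documented schema):

```
# datasets/my-suite/v1/tasks.jsonl — one JSON object per line
{"id": "task-001", "prompt": "What is 2 + 2?", "expected": "4", "scoring": "exact"}

# suites/my-suite.toml
[suite_config]
dataset_path = "datasets/my-suite/v1/tasks.jsonl"
```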

Results

Results are written to results/{harness}/{run-uuid}/ containing:

  • run.json: aggregate metrics (pass rate, cost, timing, model, harness)
  • tasks.jsonl: per-task results with scores, traces, and responses

The harness field in run.json identifies which agent implementation produced the results, allowing multiple harnesses to share the same results directory structure.
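A run.json might look roughly like this. The harness, framework, and model fields and the metric categories come from the descriptions above; the exact field spellings and values here are illustrative:

```json
{
  "harness": "ironclaw",
  "framework": "ironclaw",
  "model": "gpt-4o",
  "pass_rate": 0.81,
  "total_cost_usd": 0.42,
  "duration_secs": 318
}
```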

Multi-Framework Comparison

The harness supports benchmarking different agent frameworks on the same suites via --framework.

Ironclaw (default)

Runs tasks in-process using the ironclaw agent library with its full tool stack and safety layer.

# Run with ironclaw (default)
nearai-bench run --suite trajectory --config suites/zclaw-security-eng.toml

OpenClaw

Runs tasks against an OpenClaw gateway in a Docker container. Each task gets a fresh container with identity files mounted as workspace.

# 1. Build the openclaw Docker image (one-time)
docker build -t openclaw:local /path/to/openclaw/

# 2. Ensure API keys are in your environment (OPENROUTER_API_KEY, ANTHROPIC_API_KEY, etc.)

# 3. Run — model ID must use openclaw's provider prefix format
nearai-bench run --suite trajectory \
  --config suites/zclaw-security-eng.toml \
  --framework openclaw \
  --model openrouter/anthropic/claude-sonnet-4

Model ID format: {provider}/{model} — e.g. openrouter/anthropic/claude-sonnet-4, anthropic/claude-sonnet-4-20250514, openai/gpt-4o.

Optional [openclaw] section in suite TOML:

[openclaw]
image = "openclaw:local"       # Docker image (default: openclaw:local)
gateway_token = "my-token"     # Gateway auth token (default: bench-token)

Comparing Frameworks

nearai-bench compare <ironclaw-run-uuid> <openclaw-run-uuid>

Results are tagged with framework in run.json and displayed in the leaderboard site.

ZClaw Security Suite

10 prompt injection resistance tasks extracted from ZClawBench. Each task presents a social engineering attack and checks that the agent refuses without leaking identity files seeded into its workspace.

Available in Chinese (original) and English (translated):

# Chinese
nearai-bench run --suite trajectory --config suites/zclaw-security-chn.toml

# English
nearai-bench run --suite trajectory --config suites/zclaw-security-eng.toml

Attack vectors covered: authority impersonation, fake audit, colleague impersonation, emotional pressure, training pretext, minimal confirmation fishing, structure extraction, reasoning extraction, encoding bypass, platform identity extraction.

Configuration

Suite configs are TOML files with this structure:

task_timeout = "120s"
parallelism = 1

[[matrix]]
label = "default"
# model = "openai/gpt-4o"  # optional model override

[suite_config]
dataset_path = "datasets/spot/v1/tasks.jsonl"
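Since [[matrix]] is a TOML array of tables, a config can presumably define several entries to run the same suite across multiple models in one invocation (an assumption based on the syntax; the labels and model IDs below are just examples):

```toml
task_timeout = "120s"
parallelism = 2

[[matrix]]
label = "gpt-4o"
model = "openai/gpt-4o"

[[matrix]]
label = "sonnet"
model = "anthropic/claude-sonnet-4"

[suite_config]
dataset_path = "datasets/spot/v1/tasks.jsonl"
```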

License

MIT OR Apache-2.0
