Benchmarking harness for evaluating AI agents. Extracted from ironclaw.
| Suite | Description |
|---|---|
| trajectory | Multi-turn trajectory scenarios with per-turn assertions (supersedes spot) |
| spot | End-to-end spot checks: conversation, tool use, chaining, robustness |
| custom | Custom JSONL tasks with flexible scoring (exact, contains, regex, LLM) |
| gaia | GAIA benchmark (knowledge and reasoning) |
| tau_bench | Tau-bench (multi-turn tool-calling dialog) |
| swe_bench | SWE-bench Pro (real-world software engineering) |
```shell
# 1. Configure your LLM provider (pick one)
cp .env.example .env
# Edit .env with your API key (OPENAI_API_KEY, ANTHROPIC_API_KEY, or LLM_* vars)

# 2. List available suites
nearai-bench list

# 3. Run trajectory scenarios
nearai-bench run --suite trajectory --config suites/trajectory.toml

# Run with a specific model
nearai-bench run --suite trajectory --config suites/trajectory.toml --model gpt-4o

# View latest results
nearai-bench results latest

# Compare two runs
nearai-bench compare <baseline-uuid> <comparison-uuid>
```

Copy `.env.example` to `.env` and set your provider credentials. The harness supports any OpenAI-compatible API endpoint.
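The credential lookup described above can be sketched as follows. The precedence here (explicit `LLM_*` backend first, then OpenAI, then Anthropic) is an assumption for illustration, not the harness's documented resolution order:

```python
def resolve_provider(env: dict[str, str]) -> dict:
    """Pick a provider configuration from environment variables.

    ASSUMPTION: the real harness may resolve these differently; this only
    illustrates the three credential styles the README describes.
    """
    if env.get("LLM_BACKEND"):
        # Explicit backend selection via LLM_* variables wins.
        return {
            "backend": env["LLM_BACKEND"],
            "base_url": env.get("LLM_BASE_URL"),
            "api_key": env.get("LLM_API_KEY"),
            "model": env.get("LLM_MODEL"),
        }
    if env.get("OPENAI_API_KEY"):
        return {"backend": "openai", "api_key": env["OPENAI_API_KEY"]}
    if env.get("ANTHROPIC_API_KEY"):
        return {"backend": "anthropic", "api_key": env["ANTHROPIC_API_KEY"]}
    raise RuntimeError("no provider credentials found in environment")
```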
OpenAI (simplest):

```
OPENAI_API_KEY=sk-...
```

Anthropic:

```
ANTHROPIC_API_KEY=sk-ant-...
```

Any OpenAI-compatible provider (OpenRouter, Together, vLLM, Ollama, etc.):

```
LLM_BACKEND=openai_compatible
LLM_BASE_URL=https://openrouter.ai/api/v1
LLM_API_KEY=sk-or-...
LLM_MODEL=anthropic/claude-sonnet-4
```

NEAR AI (requires ironclaw onboarding):

```
LLM_BACKEND=nearai
```

Repository layout:

```
benchmarks/
  datasets/              Versioned benchmark datasets
    spot/v1/             21 spot-check tasks
    swe-bench-lite/v1/   SWE-bench Lite dataset (astropy subset)
  suites/                Suite configuration files (TOML)
  baselines/             Curated reference results by suite
  results/               Run output, namespaced by harness
    ironclaw/            Results from the ironclaw harness
  src/                   Harness source code
    adapters/            Suite adapter implementations
```
The trajectory suite replays multi-turn conversation scenarios with per-turn assertions
(tool selection, response content, call limits). See datasets/trajectory/v1/ for examples.
In production, the IronClaw agent loads identity files (SOUL.md, IDENTITY.md, AGENTS.md,
USER.md, TOOLS.md) from its workspace to build the system prompt. To benchmark with the
same system prompt as production, point workspace_path at the deployment repo's workspace
directory:
```toml
[suite_config]
dataset_path = "datasets/trajectory/v1"
workspace_path = "../openclaw-nearai-worker/ironclaw-worker/workspace"
```

All `.md` files from that directory are loaded and seeded into a fresh in-memory workspace
for each benchmark task. The agent's `Workspace::system_prompt()` then assembles them into
the system prompt exactly as it would in production.
Per-scenario `setup.identity` overrides in the JSON scenario files take precedence over the
base files from `workspace_path`.

Important: only files explicitly provided are loaded. Unlike ironclaw's standalone app, the
library does not auto-seed default identity files. If a file is missing from the workspace
directory, it is simply absent from the system prompt; no defaults are injected.
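The seeding behavior described above can be sketched as follows. `load_workspace_files` and the concatenation order are illustrative assumptions, not the actual `Workspace::system_prompt()` implementation:

```python
from pathlib import Path

def load_workspace_files(workspace_path: str) -> dict[str, str]:
    """Load every .md file actually present in the workspace directory.

    Mirrors the documented behavior: nothing is auto-seeded, so a missing
    identity file is simply absent from the result.
    """
    workspace = Path(workspace_path)
    return {p.name: p.read_text() for p in sorted(workspace.glob("*.md"))}

def build_system_prompt(files: dict[str, str]) -> str:
    """ASSUMPTION: simple header-per-file concatenation for illustration;
    the real assembly order and formatting may differ."""
    sections = [f"## {name}\n\n{body}" for name, body in files.items()]
    return "\n\n".join(sections)
```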
Datasets live under `datasets/{suite-name}/v{N}/tasks.jsonl`. The versioning scheme lets
datasets evolve without invalidating older results that reference a prior version.

To add a new suite:

- Create `datasets/{name}/v1/tasks.jsonl` in the appropriate JSONL format.
- Create `suites/{name}.toml` pointing `suite_config.dataset_path` at the new file.
- If the suite type doesn't exist, implement a new adapter in `src/adapters/`.
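As a sketch of the first step, here is a hypothetical two-task JSONL file. The field names (`id`, `prompt`, `expected`, `scoring`) are assumptions for illustration; check an existing dataset under `datasets/` for the real schema:

```python
import json
from pathlib import Path

# Hypothetical task records; the harness's actual fields may differ.
tasks = [
    {"id": "greet-001", "prompt": "Say hello.", "expected": "hello", "scoring": "contains"},
    {"id": "math-001", "prompt": "What is 2+2?", "expected": "4", "scoring": "exact"},
]

def write_tasks(path: Path, tasks: list[dict]) -> None:
    """Write one JSON object per line, as the JSONL format requires."""
    path.write_text("\n".join(json.dumps(t) for t in tasks) + "\n")

def read_tasks(path: Path) -> list[dict]:
    """Parse a JSONL file back into a list of task dicts."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
```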
Results are written to `results/{harness}/{run-uuid}/` containing:

- `run.json`: aggregate metrics (pass rate, cost, timing, model, harness)
- `tasks.jsonl`: per-task results with scores, traces, and responses

The `harness` field in `run.json` identifies which agent implementation produced the results,
allowing multiple harnesses to share the same results directory structure.
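The aggregation from per-task results into run-level metrics can be sketched like this; the exact field set of the real `run.json` may differ, but pass rate is simply passed tasks over total:

```python
def aggregate(task_results: list[dict]) -> dict:
    """Compute run-level metrics from per-task results.

    ASSUMPTION: a task counts as passed when its score is 1.0; the real
    harness's scoring threshold and field names may differ.
    """
    total = len(task_results)
    passed = sum(1 for t in task_results if t.get("score", 0.0) >= 1.0)
    return {
        "tasks": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "cost_usd": sum(t.get("cost_usd", 0.0) for t in task_results),
    }
```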
The harness supports benchmarking different agent frameworks on the same suites via `--framework`.

ironclaw (default): runs tasks in-process using the ironclaw agent library with its full tool stack and safety layer.

```shell
# Run with ironclaw (default)
nearai-bench run --suite trajectory --config suites/zclaw-security-eng.toml
```

openclaw: runs tasks against an OpenClaw gateway in a Docker container. Each task gets a fresh container with identity files mounted as workspace.
```shell
# 1. Build the openclaw Docker image (one-time)
docker build -t openclaw:local /path/to/openclaw/

# 2. Ensure API keys are in your environment (OPENROUTER_API_KEY, ANTHROPIC_API_KEY, etc.)

# 3. Run — model ID must use openclaw's provider prefix format
nearai-bench run --suite trajectory \
  --config suites/zclaw-security-eng.toml \
  --framework openclaw \
  --model openrouter/anthropic/claude-sonnet-4
```

Model ID format: `{provider}/{model}` — e.g. `openrouter/anthropic/claude-sonnet-4`,
`anthropic/claude-sonnet-4-20250514`, `openai/gpt-4o`.

Optional `[openclaw]` section in suite TOML:

```toml
[openclaw]
image = "openclaw:local"        # Docker image (default: openclaw:local)
gateway_token = "my-token"      # Gateway auth token (default: bench-token)
```

Compare runs across frameworks:

```shell
nearai-bench compare <ironclaw-run-uuid> <openclaw-run-uuid>
```

Results are tagged with `framework` in `run.json` and displayed in the leaderboard site.
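A minimal sketch of what a cross-framework comparison reports, assuming each `run.json` carries the `framework` and `pass_rate` fields described above (the real `nearai-bench compare` output format may differ):

```python
def compare_runs(baseline: dict, comparison: dict) -> dict:
    """Report the pass-rate delta between two runs.

    ASSUMPTION: illustrative only; field names follow the run.json
    description in this README, not a verified schema.
    """
    return {
        "baseline": baseline.get("framework", "unknown"),
        "comparison": comparison.get("framework", "unknown"),
        "delta": comparison["pass_rate"] - baseline["pass_rate"],
    }
```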
10 prompt injection resistance tasks extracted from ZClawBench. Each task presents a social engineering attack and checks that the agent refuses without leaking identity files seeded into its workspace.
Available in Chinese (original) and English (translated):
```shell
# Chinese
nearai-bench run --suite trajectory --config suites/zclaw-security-chn.toml

# English
nearai-bench run --suite trajectory --config suites/zclaw-security-eng.toml
```

Attack vectors covered: authority impersonation, fake audit, colleague impersonation, emotional pressure, training pretext, minimal confirmation fishing, structure extraction, reasoning extraction, encoding bypass, platform identity extraction.
Suite configs are TOML files with this structure:
```toml
task_timeout = "120s"
parallelism = 1

[[matrix]]
label = "default"
# model = "openai/gpt-4o"   # optional model override

[suite_config]
dataset_path = "datasets/spot/v1/tasks.jsonl"
```

License: MIT OR Apache-2.0