10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -6,15 +6,23 @@ The format is based on Keep a Changelog and this project follows Semantic Versio

## [Unreleased]

### Added

- **Long-context Python function retrieval benchmark** — added seven built-in context-window templates linked to file-backed Python function-retrieval datasets, with front/middle/late function placement, two-function retrieval, and a negative control.
- **Long-context Python needle benchmark** — added seven built-in context-window templates linked to file-backed Python positional-recall datasets, with front/middle/late needle placement, 4k-256k context sizes, two-fact retrieval, and a negative control.
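
To make the front/middle/late placement concrete, here is a minimal sketch of how a positional-recall prompt could be assembled. The `place_needle` helper, the filler lines, the needle text, and the depth fractions are all illustrative assumptions, not the generator this project ships.

```python
# Hypothetical sketch: insert a "needle" line at a relative depth in filler text.
# POSITIONS and place_needle are assumptions made for illustration only.
POSITIONS = {"front": 0.0, "middle": 0.5, "late": 0.9}


def place_needle(filler_lines: list[str], needle: str, position: str) -> list[str]:
    """Return a copy of filler_lines with the needle inserted at the given depth."""
    index = int(len(filler_lines) * POSITIONS[position])
    return filler_lines[:index] + [needle] + filler_lines[index:]


filler = [f"# filler line {i}" for i in range(10)]
needle = "SECRET_TOKEN = 'amber-42'"

for name in POSITIONS:
    placed = place_needle(filler, needle, name)
    print(name, placed.index(needle))
```

Under the same assumptions, a negative control would simply pass the filler through unchanged, so the correct answer is that the needle is absent.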

### Changed

- **Changelog category workflow** — `AGENTS.md` now requires changelog updates to preserve Keep a Changelog category headings and place entries under the appropriate `Added`, `Changed`, `Fixed`, `Removed`, or `Security` section instead of flattening release notes.
- **Run fatal upstream errors** — Run-created benchmark profiles now cancel on the first fatal upstream error, context-window retrieval stops on the first failed item, and HTTP diagnostics preserve upstream provider codes such as `prefill_memory_exceeded`.
- **Run template capability filtering** — Run now disables benchmark templates that exceed a selected model's declared context window or require tool calling when the selected model/server is not tool-capable.
- **Run audit and functional checks split** — Run now separates pipeline execution health from functional benchmark checks, and treats missing required terms as a visible functional failure when exact matching is disabled.
- **Run functional failure clue** — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
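
The capability filtering described above can be sketched as a small predicate. This is a rough illustration only: `ModelInfo`, `TemplateInfo`, `disabled_reason`, and the reason strings are hypothetical names, and treating an undeclared context window as "no limit" is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelInfo:
    # None means the model declares no context window (assumed: not filtered).
    context_window_tokens: Optional[int]
    supports_tools: bool


@dataclass
class TemplateInfo:
    template_id: str
    context_window_tokens: int
    requires_tools: bool = False


def disabled_reason(template: TemplateInfo, model: ModelInfo) -> Optional[str]:
    """Return why the template is disabled for this model, or None if it can run."""
    declared = model.context_window_tokens
    if declared is not None and template.context_window_tokens > declared:
        return "exceeds declared context window"
    if template.requires_tools and not model.supports_tools:
        return "requires tool calling"
    return None


model = ModelInfo(context_window_tokens=32000, supports_tools=False)
for template in (
    TemplateInfo("context-16k", 16000),
    TemplateInfo("context-128k", 128000),
    TemplateInfo("tool-call", 4000, requires_tools=True),
):
    print(template.template_id, disabled_reason(template, model))
```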

## [0.10.0] - 2026-06-19

### Added

- **Run functional failure clue** — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
- **Datasets editor checkpoint** — added a Datasets page and JSONL dataset-file API for creating, editing, saving, and deleting dataset item files under `INFERHARNESS_BENCHMARK_DATASET_ROOT`, with synced `dataset_manifest` documents, copy-down editing for repeated fields, and clamped long-prompt display.
- **Tool-call assertion metric** — benchmark tool-call templates now include `tool_call_assertion_pass`, a single-turn pass/fail metric requiring exact expected tool selection and structurally matching arguments while keeping assertion failures as quality metrics rather than execution failures.
- **Tool-call assertion UI** — Run now promotes tool-call assertion pass/fail as the primary correctness verdict, and Templates groups metrics with readable labels while auto-adding the assertion metric when tool calling is enabled.
4 changes: 3 additions & 1 deletion README.md
@@ -81,7 +81,7 @@ This means a result is more than a screenshot or a manually copied answer. It is
Register local or remote inference servers, discover available models, and maintain a model catalog with provider, format, quantization, capabilities, and base-model metadata.

**Reusable test definitions**
Start with built-in benchmark templates, then create tests for one prompt, a dataset loop, tool-calling behavior, structured output, or multi-model comparisons.
Start with built-in benchmark templates, then create tests for one prompt, a dataset loop, tool-calling behavior, long-context needle or function retrieval, structured output, or multi-model comparisons.

Benchmark documents are persisted as JSON in a file-backed library and indexed into SQLite for runtime use. Built-in documents ship with the app, while user-created templates, datasets, runtime profiles, and plans are written to a local library directory so they can be restored if the database is rebuilt.

@@ -91,6 +91,8 @@ Use the Templates page agent as the primary authoring flow to challenge underspe
**Benchmark runs**
Run the same test against one model, many models, or the same model served by different inference servers.
When a selected template has a unique linked `dataset_manifest`, Run uses that manifest automatically instead of creating a prompt or file-backed dataset manifest.
Run disables templates that exceed a selected model's declared context window or require unsupported tool calling.
Run separates execution health from functional checks so a technically completed pipeline can still show failed retrieval, schema, or tool-call assertions.
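
As a rough illustration of keeping execution health apart from functional checks, the sketch below reports two separate verdicts for one item. `RunVerdict`, `evaluate`, the clue strings, and the case-insensitive term matching are assumptions chosen for the example and do not describe the project's actual evaluator.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunVerdict:
    execution_ok: bool
    # None means the functional check was not evaluated (execution failed first).
    functional_ok: Optional[bool]
    clues: list[str] = field(default_factory=list)


def evaluate(
    response_text: Optional[str],
    upstream_error: Optional[str],
    required_terms: list[str],
) -> RunVerdict:
    """Report pipeline health and functional correctness as separate fields."""
    if upstream_error is not None:
        return RunVerdict(False, None, [f"upstream error: {upstream_error}"])
    lowered = (response_text or "").lower()
    missing = [term for term in required_terms if term.lower() not in lowered]
    clues = [f"missing required term: {term}" for term in missing]
    return RunVerdict(True, not missing, clues)


# A technically completed run that still fails its functional check.
print(evaluate("The function is parse_frame.", None, ["parse_frame", "decode_header"]))
# A run that fails at the pipeline level, carrying the upstream provider code.
print(evaluate(None, "prefill_memory_exceeded", ["parse_frame"]))
```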

**Automated metrics**
Capture time to first token, total latency, prefill/decode timing, prompt tokens, completion tokens, and tokens per second.
5 changes: 5 additions & 0 deletions backend/data/datasets/context-function-retrieval-128k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-function-retrieval-16k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-function-retrieval-256k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-function-retrieval-32k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-function-retrieval-4k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-function-retrieval-64k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-function-retrieval-8k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-needle-128k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-needle-16k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-needle-256k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-needle-32k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-needle-64k.jsonl

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions backend/data/datasets/context-needle-8k.jsonl

Large diffs are not rendered by default.

15 changes: 5 additions & 10 deletions backend/data/datasets/positional-recall-python.jsonl

Large diffs are not rendered by default.

@@ -0,0 +1,46 @@
{
"kind": "dataset_manifest",
"schema_version": "benchmark_dataset_manifest_v1",
"dataset_id": "dataset-model-context-function-retrieval-128k-v1",
"source": {
"source_type": "file",
"format": "jsonl",
"path": "data/datasets/context-function-retrieval-128k.jsonl"
},
"canonicalization_version": "dataset_canonical_v1",
"snapshot_policy": "manifest_only",
"dataset_hash": "sha256:79e2c6b53c00e6ca288911567ded0d814063a57164ca813697f8469caa3496cd",
"item_count": 5,
"item_hashes": [
{
"item_id": "function-front-128k",
"hash": "sha256:89ee905cb4cc217124513eeb69a1ad9b6943ad25c2483a435630396f4dc01d77"
},
{
"item_id": "function-middle-128k",
"hash": "sha256:e436ebd5b3d2ac6e99a9a14916f2f8bfcdf4200b8d4da3134450563fe4966f26"
},
{
"item_id": "function-late-128k",
"hash": "sha256:a3c2f11070896c8160823a94162f32dfd73ef0c8a1a03fda10f71a8b4e5541cb"
},
{
"item_id": "function-two-blocks-128k",
"hash": "sha256:c0d36aa2a4fece4db0cc7c85eb3ba57ec4ab770e3bcb593e012c13250853cf85"
},
{
"item_id": "function-negative-control-128k",
"hash": "sha256:0a4a3ba3a168b9fbc378dcb2d53b7aab7174d1377df5a74599e253de1a08dd58"
}
],
"item_manifest_ref": null,
"snapshot_blob_ref": null,
"metadata": {
"source": "built-in-context-library",
"template_id": "model-context-function-retrieval-128k-v1",
"source_file": "backend/data/datasets/frame.py",
"dataset_file": "backend/data/datasets/context-function-retrieval-128k.jsonl",
"dataset_family": "function_retrieval",
"context_window_tokens": 128000
}
}
@@ -0,0 +1,46 @@
{
"kind": "dataset_manifest",
"schema_version": "benchmark_dataset_manifest_v1",
"dataset_id": "dataset-model-context-function-retrieval-16k-v1",
"source": {
"source_type": "file",
"format": "jsonl",
"path": "data/datasets/context-function-retrieval-16k.jsonl"
},
"canonicalization_version": "dataset_canonical_v1",
"snapshot_policy": "manifest_only",
"dataset_hash": "sha256:09236e8d77ff47bac3e7169528c1796d64352b374d68ba5a68b06615e812f658",
"item_count": 5,
"item_hashes": [
{
"item_id": "function-front-16k",
"hash": "sha256:ccba4364f62e5bb1b3978480ea1fc44f8b09fac5f072fb8751a92b9d293f686d"
},
{
"item_id": "function-middle-16k",
"hash": "sha256:c0fe9f547fe73a296dc9aa9b374438cfbbd05fd77542bddcfbb0f545b404c524"
},
{
"item_id": "function-late-16k",
"hash": "sha256:53ea3ec7ecb0c4a6590a11927148b8b412711c67be97ed0d910e4854fd5d5ed4"
},
{
"item_id": "function-two-blocks-16k",
"hash": "sha256:d3d523a69337e9ab6737385c2aecb1f81f0b163de97b8d6ba1ad725ea097bdd0"
},
{
"item_id": "function-negative-control-16k",
"hash": "sha256:740e4305fe70ec13cbf964ec1609b53002de9361e3c9ce74674a141b73e0f698"
}
],
"item_manifest_ref": null,
"snapshot_blob_ref": null,
"metadata": {
"source": "built-in-context-library",
"template_id": "model-context-function-retrieval-16k-v1",
"source_file": "backend/data/datasets/frame.py",
"dataset_file": "backend/data/datasets/context-function-retrieval-16k.jsonl",
"dataset_family": "function_retrieval",
"context_window_tokens": 16000
}
}
@@ -0,0 +1,46 @@
{
"kind": "dataset_manifest",
"schema_version": "benchmark_dataset_manifest_v1",
"dataset_id": "dataset-model-context-function-retrieval-256k-v1",
"source": {
"source_type": "file",
"format": "jsonl",
"path": "data/datasets/context-function-retrieval-256k.jsonl"
},
"canonicalization_version": "dataset_canonical_v1",
"snapshot_policy": "manifest_only",
"dataset_hash": "sha256:02d2521f1fdee676be076b4340172a885638c4ea423083449822ff8d8a68b3fe",
"item_count": 5,
"item_hashes": [
{
"item_id": "function-front-256k",
"hash": "sha256:0e85afd5e91a8bc215138347cfdfecc699988c8371b815d33c598d68c86e6e5a"
},
{
"item_id": "function-middle-256k",
"hash": "sha256:cb47668518b5ec9e941573ea3ffe2e1f0b689210084c9b594212fd3b327fb05f"
},
{
"item_id": "function-late-256k",
"hash": "sha256:3f44c8c34f7296c932eb0bd88049598c07a4317ff1e805fefd732671b00ac257"
},
{
"item_id": "function-two-blocks-256k",
"hash": "sha256:2e7076c23277f5d9c058f8c722bdf177fbcb4b56f22b65675b209111a63356dd"
},
{
"item_id": "function-negative-control-256k",
"hash": "sha256:3d54549df5d2bc058f05d74fdd9634024300b609e14429985e45281ff60de882"
}
],
"item_manifest_ref": null,
"snapshot_blob_ref": null,
"metadata": {
"source": "built-in-context-library",
"template_id": "model-context-function-retrieval-256k-v1",
"source_file": "backend/data/datasets/frame.py",
"dataset_file": "backend/data/datasets/context-function-retrieval-256k.jsonl",
"dataset_family": "function_retrieval",
"context_window_tokens": 256000
}
}
@@ -0,0 +1,46 @@
{
"kind": "dataset_manifest",
"schema_version": "benchmark_dataset_manifest_v1",
"dataset_id": "dataset-model-context-function-retrieval-32k-v1",
"source": {
"source_type": "file",
"format": "jsonl",
"path": "data/datasets/context-function-retrieval-32k.jsonl"
},
"canonicalization_version": "dataset_canonical_v1",
"snapshot_policy": "manifest_only",
"dataset_hash": "sha256:98cf5203642c463fabb29612a54cf706d9f576cd2c5f094d73c9f3b1ee1522b7",
"item_count": 5,
"item_hashes": [
{
"item_id": "function-front-32k",
"hash": "sha256:68ca37b68f58d12e23141a9c28a1a9a325aac1d801bc2d8447302ab37f0c987e"
},
{
"item_id": "function-middle-32k",
"hash": "sha256:8056c08c6e62bc679a875414d74ed37c896fca27ee0db269460eb20e6b0a484e"
},
{
"item_id": "function-late-32k",
"hash": "sha256:1fe797370a033041244642da97f0e753c0a70c351496790684a8f5b2dc775918"
},
{
"item_id": "function-two-blocks-32k",
"hash": "sha256:2d7622f5f996710ce57f45d1ee421ba2976bac35980d56e7fdc61817bf1bdc07"
},
{
"item_id": "function-negative-control-32k",
"hash": "sha256:ffa07c1f7adc7825f317eaf4f53df3a63da8524ca1e4d4f91104bd3c9096d57b"
}
],
"item_manifest_ref": null,
"snapshot_blob_ref": null,
"metadata": {
"source": "built-in-context-library",
"template_id": "model-context-function-retrieval-32k-v1",
"source_file": "backend/data/datasets/frame.py",
"dataset_file": "backend/data/datasets/context-function-retrieval-32k.jsonl",
"dataset_family": "function_retrieval",
"context_window_tokens": 32000
}
}
@@ -0,0 +1,46 @@
{
"kind": "dataset_manifest",
"schema_version": "benchmark_dataset_manifest_v1",
"dataset_id": "dataset-model-context-function-retrieval-4k-v1",
"source": {
"source_type": "file",
"format": "jsonl",
"path": "data/datasets/context-function-retrieval-4k.jsonl"
},
"canonicalization_version": "dataset_canonical_v1",
"snapshot_policy": "manifest_only",
"dataset_hash": "sha256:8f8a71a734d9cbb8a69072f467c3c673cad01d536fa2859f36881ccdb51e4f1b",
"item_count": 5,
"item_hashes": [
{
"item_id": "function-front-4k",
"hash": "sha256:6e451c4c3f7a8ae153c5d87a12e8e89b8926796837061ba88b6d92033a4b2153"
},
{
"item_id": "function-middle-4k",
"hash": "sha256:11c61d18fb7c07348d863b346d58f95cd6f38abfae9120f32f81070eb1a13828"
},
{
"item_id": "function-late-4k",
"hash": "sha256:7b9750f61e53e247af61a613962198bf21d582be01cdf00c35c99a7843d72eea"
},
{
"item_id": "function-two-blocks-4k",
"hash": "sha256:ec32f76349e8e849f8b1b724e84d6508b9eba8a00f7f8fde693760508bd842bb"
},
{
"item_id": "function-negative-control-4k",
"hash": "sha256:24356452d9944f42a44afe705133ec909caae784e3a10125bb084898a5d02c6b"
}
],
"item_manifest_ref": null,
"snapshot_blob_ref": null,
"metadata": {
"source": "built-in-context-library",
"template_id": "model-context-function-retrieval-4k-v1",
"source_file": "backend/data/datasets/frame.py",
"dataset_file": "backend/data/datasets/context-function-retrieval-4k.jsonl",
"dataset_family": "function_retrieval",
"context_window_tokens": 4000
}
}
@@ -0,0 +1,46 @@
{
"kind": "dataset_manifest",
"schema_version": "benchmark_dataset_manifest_v1",
"dataset_id": "dataset-model-context-function-retrieval-64k-v1",
"source": {
"source_type": "file",
"format": "jsonl",
"path": "data/datasets/context-function-retrieval-64k.jsonl"
},
"canonicalization_version": "dataset_canonical_v1",
"snapshot_policy": "manifest_only",
"dataset_hash": "sha256:4226782d48c4cf0a4e881353cd2b351700388a55eebe541a33f615e72ddd14c7",
"item_count": 5,
"item_hashes": [
{
"item_id": "function-front-64k",
"hash": "sha256:c016db5396e11fa24f5f26ed5578d368f0cf1cdbcf08c7d12bd87a010c1eed60"
},
{
"item_id": "function-middle-64k",
"hash": "sha256:aaabb6be8b5d035d539b962d13816708c173537c58b2a91eab298b22f323d773"
},
{
"item_id": "function-late-64k",
"hash": "sha256:079b1e26a928739a3c2efd18b8f390435b557d8ff27de973f7f28e773e10825e"
},
{
"item_id": "function-two-blocks-64k",
"hash": "sha256:4c686a8c640dc80fb3b5b43e2790aad97667be6e52c8d97cdba1dd6082dc0021"
},
{
"item_id": "function-negative-control-64k",
"hash": "sha256:dc9b91faf80539d11ff62cd8867794a2a14b84770501225e0b1655ea2bff90ff"
}
],
"item_manifest_ref": null,
"snapshot_blob_ref": null,
"metadata": {
"source": "built-in-context-library",
"template_id": "model-context-function-retrieval-64k-v1",
"source_file": "backend/data/datasets/frame.py",
"dataset_file": "backend/data/datasets/context-function-retrieval-64k.jsonl",
"dataset_family": "function_retrieval",
"context_window_tokens": 64000
}
}
@@ -0,0 +1,46 @@
{
"kind": "dataset_manifest",
"schema_version": "benchmark_dataset_manifest_v1",
"dataset_id": "dataset-model-context-function-retrieval-8k-v1",
"source": {
"source_type": "file",
"format": "jsonl",
"path": "data/datasets/context-function-retrieval-8k.jsonl"
},
"canonicalization_version": "dataset_canonical_v1",
"snapshot_policy": "manifest_only",
"dataset_hash": "sha256:7630f14fcfae4de49e9bbd3fca5796c168d5bab42b602b83ee1c1646f16450ec",
"item_count": 5,
"item_hashes": [
{
"item_id": "function-front-8k",
"hash": "sha256:96854137cc3fcbb21fe6de14c7de0aa189360a8f0f099eb29febc9d2a041cc8d"
},
{
"item_id": "function-middle-8k",
"hash": "sha256:ad326f29edc6e1e941ea2729ffd69c684a9ddf99984b2e6c1e4810cd2bf7fbfc"
},
{
"item_id": "function-late-8k",
"hash": "sha256:782c77eeef23dddfb468b97d1ea292c31121923511ff8318be30b6d6b85ab58a"
},
{
"item_id": "function-two-blocks-8k",
"hash": "sha256:e48475a792a884755f4aa69f74adffecd1107501985e06943adc44f4e372ef86"
},
{
"item_id": "function-negative-control-8k",
"hash": "sha256:03ed19dd6d4f7151ca223a46a1880fd1f770d69abbb90f45c50a5705c5a10fe8"
}
],
"item_manifest_ref": null,
"snapshot_blob_ref": null,
"metadata": {
"source": "built-in-context-library",
"template_id": "model-context-function-retrieval-8k-v1",
"source_file": "backend/data/datasets/frame.py",
"dataset_file": "backend/data/datasets/context-function-retrieval-8k.jsonl",
"dataset_family": "function_retrieval",
"context_window_tokens": 8000
}
}
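
The manifests above share one shape, so a reviewer could sanity-check them structurally without knowing how `dataset_canonical_v1` computes its hashes. The sketch below is a hypothetical helper (`manifest_problems` is not part of the project) that only checks what is visible in the diff: `item_count` matches the number of `item_hashes`, item IDs are unique, and every hash has the `sha256:` prefix followed by 64 hex digits. It does not recompute any hash, and the sample manifest uses placeholder hash values.

```python
import re

# Matches the hash format seen in the manifests: "sha256:" plus 64 lowercase hex digits.
HASH_PATTERN = re.compile(r"^sha256:[0-9a-f]{64}$")


def manifest_problems(manifest: dict) -> list[str]:
    """Return a list of structural problems found in a dataset manifest dict."""
    problems = []
    items = manifest.get("item_hashes", [])
    if manifest.get("item_count") != len(items):
        problems.append("item_count does not match number of item_hashes")
    ids = [item.get("item_id") for item in items]
    if len(ids) != len(set(ids)):
        problems.append("duplicate item_id")
    labelled = [("dataset_hash", manifest.get("dataset_hash"))]
    labelled += [(item.get("item_id"), item.get("hash")) for item in items]
    for label, value in labelled:
        if not isinstance(value, str) or not HASH_PATTERN.match(value):
            problems.append(f"malformed hash for {label}")
    return problems


# Placeholder manifest with dummy hashes, for illustration only.
sample = {
    "kind": "dataset_manifest",
    "dataset_hash": "sha256:" + "0" * 64,
    "item_count": 2,
    "item_hashes": [
        {"item_id": "function-front-4k", "hash": "sha256:" + "1" * 64},
        {"item_id": "function-middle-4k", "hash": "sha256:" + "2" * 64},
    ],
}
print(manifest_problems(sample))
```

To check a real file, the same function could be applied to the result of `json.load` on a manifest document.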