Fango2007 · Jun 20, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,15 +6,23 @@ The format is based on Keep a Changelog and this project follows Semantic Versio
 
 ## [Unreleased]
 
+### Added
+
+- **Long-context Python function retrieval benchmark** — added seven built-in context-window templates linked to file-backed Python function-retrieval datasets, with front/middle/late function placement, two-function retrieval, and a negative control.
+- **Long-context Python needle benchmark** — added seven built-in context-window templates linked to file-backed Python positional-recall datasets, with front/middle/late needle placement, 4k-256k context sizes, two-fact retrieval, and a negative control.
+
 ### Changed
 
 - **Changelog category workflow** — `AGENTS.md` now requires changelog updates to preserve Keep a Changelog category headings and place entries under the appropriate `Added`, `Changed`, `Fixed`, `Removed`, or `Security` section instead of flattening release notes.
+- **Run fatal upstream errors** — Run-created benchmark profiles now cancel on the first fatal upstream error, context-window retrieval stops on the first failed item, and HTTP diagnostics preserve upstream provider codes such as `prefill_memory_exceeded`.
+- **Run template capability filtering** — Run now disables benchmark templates that exceed a selected model's declared context window or require tool calling when the selected model/server is not tool-capable.
+- **Run audit and functional checks split** — Run now separates pipeline execution health from functional benchmark checks, and treats missing required terms as a visible functional failure when exact matching is disabled.
+- **Run functional failure clue** — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
 
 ## [0.10.0] - 2026-06-19
 
 ### Added
 
-- **Run functional failure clue** — Run now surfaces a benchmark assertion failure line when quality metrics fail despite a technically completed run, with categories such as invalid tool arguments or missing tool calls.
 - **Datasets editor checkpoint** — added a Datasets page and JSONL dataset-file API for creating, editing, saving, and deleting dataset item files under `INFERHARNESS_BENCHMARK_DATASET_ROOT`, with synced `dataset_manifest` documents, copy-down editing for repeated fields, and clamped long-prompt display.
 - **Tool-call assertion metric** — benchmark tool-call templates now include `tool_call_assertion_pass`, a single-turn pass/fail metric requiring exact expected tool selection and structurally matching arguments while keeping assertion failures as quality metrics rather than execution failures.
 - **Tool-call assertion UI** — Run now promotes tool-call assertion pass/fail as the primary correctness verdict, and Templates groups metrics with readable labels while auto-adding the assertion metric when tool calling is enabled.
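The "Run audit and functional checks split" entry above separates pipeline execution health from functional benchmark checks. A minimal sketch of that idea follows, assuming a per-item verdict with two independent flags. `ItemVerdict`, `judge_item`, and the case-insensitive substring match are hypothetical illustrations, not the project's actual API or matching rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ItemVerdict:
    """Hypothetical per-item result: execution and functional health kept apart."""
    execution_ok: bool
    functional_ok: bool
    failure_clue: Optional[str] = None


def judge_item(response_text: Optional[str], required_terms: List[str]) -> ItemVerdict:
    """Judge one benchmark item when exact matching is disabled.

    Assumption: a missing response is an execution failure, while a response
    that lacks a required term is a functional failure of a completed run.
    """
    if response_text is None:
        return ItemVerdict(False, False, "execution failure: no response")
    haystack = response_text.casefold()
    missing = [term for term in required_terms if term.casefold() not in haystack]
    if missing:
        return ItemVerdict(True, False, "missing required terms: " + ", ".join(missing))
    return ItemVerdict(True, True)


# "checksum_frame" is an invented required term used only for this example.
print(judge_item("I could not find that function.", ["checksum_frame"]))
```

Keeping the two flags separate lets a technically completed run still report a failed check, which is the behaviour the changelog entry describes.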

diff --git a/README.md b/README.md
@@ -81,7 +81,7 @@ This means a result is more than a screenshot or a manually copied answer. It is
 Register local or remote inference servers, discover available models, and maintain a model catalog with provider, format, quantization, capabilities, and base-model metadata.
 
 **Reusable test definitions**
-Start with built-in benchmark templates, then create tests for one prompt, a dataset loop, tool-calling behavior, structured output, or multi-model comparisons.
+Start with built-in benchmark templates, then create tests for one prompt, a dataset loop, tool-calling behavior, long-context needle or function retrieval, structured output, or multi-model comparisons.
 
 Benchmark documents are persisted as JSON in a file-backed library and indexed into SQLite for runtime use. Built-in documents ship with the app, while user-created templates, datasets, runtime profiles, and plans are written to a local library directory so they can be restored if the database is rebuilt.
 
@@ -91,6 +91,8 @@ Use the Templates page agent as the primary authoring flow to challenge underspe
 **Benchmark runs**
 Run the same test against one model, many models, or the same model served by different inference servers.
 When a selected template has a unique linked `dataset_manifest`, Run uses that manifest automatically instead of creating a prompt or file-backed dataset manifest.
+Run disables templates that exceed a selected model's declared context window or that require tool calling the selected model or server does not support.
+Run separates execution health from functional checks so a technically completed pipeline can still show failed retrieval, schema, or tool-call assertions.
 
 **Automated metrics**
 Capture time to first token, total latency, prefill/decode timing, prompt tokens, completion tokens, and tokens per second.
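The "Benchmark runs" paragraph above says Run disables templates that exceed a model's declared context window or need tool calling the model cannot provide. The rule can be sketched as below. `TemplateNeeds`, `ModelCapabilities`, and `disabled_reason` are hypothetical names, and treating an undeclared context window as "do not filter" is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemplateNeeds:
    """Hypothetical summary of what a benchmark template requires."""
    template_id: str
    context_window_tokens: int
    requires_tool_calling: bool = False


@dataclass(frozen=True)
class ModelCapabilities:
    """Hypothetical summary of what the selected model/server declares."""
    declared_context_window: Optional[int]  # None: no declared limit (assumed: do not filter)
    tool_capable: bool = False


def disabled_reason(template: TemplateNeeds, model: ModelCapabilities) -> Optional[str]:
    """Return why a template should be disabled, or None if it may run."""
    limit = model.declared_context_window
    if limit is not None and template.context_window_tokens > limit:
        return f"needs {template.context_window_tokens} tokens, model declares {limit}"
    if template.requires_tool_calling and not model.tool_capable:
        return "requires tool calling, which the model/server does not support"
    return None


small_model = ModelCapabilities(declared_context_window=32000, tool_capable=False)
print(disabled_reason(TemplateNeeds("model-context-function-retrieval-16k-v1", 16000), small_model))
print(disabled_reason(TemplateNeeds("model-context-function-retrieval-128k-v1", 128000), small_model))
```

Returning a reason string rather than a boolean is a design choice for the sketch: it gives a UI something to show next to a disabled template.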

diff --git a/backend/data/datasets/context-function-retrieval-128k.jsonl b/backend/data/datasets/context-function-retrieval-128k.jsonl
diff --git a/backend/data/datasets/context-function-retrieval-16k.jsonl b/backend/data/datasets/context-function-retrieval-16k.jsonl
diff --git a/backend/data/datasets/context-function-retrieval-256k.jsonl b/backend/data/datasets/context-function-retrieval-256k.jsonl
diff --git a/backend/data/datasets/context-function-retrieval-32k.jsonl b/backend/data/datasets/context-function-retrieval-32k.jsonl
diff --git a/backend/data/datasets/context-function-retrieval-4k.jsonl b/backend/data/datasets/context-function-retrieval-4k.jsonl
diff --git a/backend/data/datasets/context-function-retrieval-64k.jsonl b/backend/data/datasets/context-function-retrieval-64k.jsonl
diff --git a/backend/data/datasets/context-function-retrieval-8k.jsonl b/backend/data/datasets/context-function-retrieval-8k.jsonl
diff --git a/backend/data/datasets/context-needle-128k.jsonl b/backend/data/datasets/context-needle-128k.jsonl
diff --git a/backend/data/datasets/context-needle-16k.jsonl b/backend/data/datasets/context-needle-16k.jsonl
diff --git a/backend/data/datasets/context-needle-256k.jsonl b/backend/data/datasets/context-needle-256k.jsonl
diff --git a/backend/data/datasets/context-needle-32k.jsonl b/backend/data/datasets/context-needle-32k.jsonl
diff --git a/backend/data/datasets/context-needle-64k.jsonl b/backend/data/datasets/context-needle-64k.jsonl
diff --git a/backend/data/datasets/context-needle-8k.jsonl b/backend/data/datasets/context-needle-8k.jsonl
diff --git a/backend/data/datasets/positional-recall-python.jsonl b/backend/data/datasets/positional-recall-python.jsonl
diff --git a/...-library/documents/dataset_manifest/dataset-model-context-function-retrieval-128k-v1.json b/...-library/documents/dataset_manifest/dataset-model-context-function-retrieval-128k-v1.json
@@ -0,0 +1,46 @@
+{
+  "kind": "dataset_manifest",
+  "schema_version": "benchmark_dataset_manifest_v1",
+  "dataset_id": "dataset-model-context-function-retrieval-128k-v1",
+  "source": {
+    "source_type": "file",
+    "format": "jsonl",
+    "path": "data/datasets/context-function-retrieval-128k.jsonl"
+  },
+  "canonicalization_version": "dataset_canonical_v1",
+  "snapshot_policy": "manifest_only",
+  "dataset_hash": "sha256:79e2c6b53c00e6ca288911567ded0d814063a57164ca813697f8469caa3496cd",
+  "item_count": 5,
+  "item_hashes": [
+    {
+      "item_id": "function-front-128k",
+      "hash": "sha256:89ee905cb4cc217124513eeb69a1ad9b6943ad25c2483a435630396f4dc01d77"
+    },
+    {
+      "item_id": "function-middle-128k",
+      "hash": "sha256:e436ebd5b3d2ac6e99a9a14916f2f8bfcdf4200b8d4da3134450563fe4966f26"
+    },
+    {
+      "item_id": "function-late-128k",
+      "hash": "sha256:a3c2f11070896c8160823a94162f32dfd73ef0c8a1a03fda10f71a8b4e5541cb"
+    },
+    {
+      "item_id": "function-two-blocks-128k",
+      "hash": "sha256:c0d36aa2a4fece4db0cc7c85eb3ba57ec4ab770e3bcb593e012c13250853cf85"
+    },
+    {
+      "item_id": "function-negative-control-128k",
+      "hash": "sha256:0a4a3ba3a168b9fbc378dcb2d53b7aab7174d1377df5a74599e253de1a08dd58"
+    }
+  ],
+  "item_manifest_ref": null,
+  "snapshot_blob_ref": null,
+  "metadata": {
+    "source": "built-in-context-library",
+    "template_id": "model-context-function-retrieval-128k-v1",
+    "source_file": "backend/data/datasets/frame.py",
+    "dataset_file": "backend/data/datasets/context-function-retrieval-128k.jsonl",
+    "dataset_family": "function_retrieval",
+    "context_window_tokens": 128000
+  }
+}
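The manifest above carries several fields that can be cross-checked against each other without knowing how the hashes were computed. The sketch below shows such a structural check. `manifest_problems` is a hypothetical helper, the relationships it tests (`dataset_id` equals `dataset-` plus `metadata.template_id`, `metadata.dataset_file` equals `backend/` plus `source.path`) are inferred from the manifests in this diff, and the example uses placeholder hashes. It does not recompute `dataset_hash`, because the `dataset_canonical_v1` canonicalization is not shown here.

```python
import re
from typing import List

HASH_RE = re.compile(r"sha256:[0-9a-f]{64}")


def manifest_problems(manifest: dict) -> List[str]:
    """Return a list of structural inconsistencies found in a dataset manifest."""
    problems = []
    items = manifest.get("item_hashes", [])
    if manifest.get("item_count") != len(items):
        problems.append("item_count does not match len(item_hashes)")
    ids = [item["item_id"] for item in items]
    if len(ids) != len(set(ids)):
        problems.append("duplicate item_id")
    hashes = [manifest.get("dataset_hash", "")] + [item["hash"] for item in items]
    for value in hashes:
        if not HASH_RE.fullmatch(value):
            problems.append(f"malformed hash: {value!r}")
    meta = manifest.get("metadata", {})
    if manifest.get("dataset_id") != f"dataset-{meta.get('template_id')}":
        problems.append("dataset_id does not match metadata.template_id")
    if meta.get("dataset_file") != "backend/" + manifest.get("source", {}).get("path", ""):
        problems.append("metadata.dataset_file does not match source.path")
    return problems


PLACEHOLDER = "sha256:" + "0" * 64  # placeholder, not a real digest
example = {
    "dataset_id": "dataset-model-context-function-retrieval-128k-v1",
    "source": {"path": "data/datasets/context-function-retrieval-128k.jsonl"},
    "dataset_hash": PLACEHOLDER,
    "item_count": 2,
    "item_hashes": [
        {"item_id": "function-front-128k", "hash": PLACEHOLDER},
        {"item_id": "function-middle-128k", "hash": PLACEHOLDER},
    ],
    "metadata": {
        "template_id": "model-context-function-retrieval-128k-v1",
        "dataset_file": "backend/data/datasets/context-function-retrieval-128k.jsonl",
    },
}
print(manifest_problems(example))  # → []
```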
diff --git a/...k-library/documents/dataset_manifest/dataset-model-context-function-retrieval-16k-v1.json b/...k-library/documents/dataset_manifest/dataset-model-context-function-retrieval-16k-v1.json
@@ -0,0 +1,46 @@
+{
+  "kind": "dataset_manifest",
+  "schema_version": "benchmark_dataset_manifest_v1",
+  "dataset_id": "dataset-model-context-function-retrieval-16k-v1",
+  "source": {
+    "source_type": "file",
+    "format": "jsonl",
+    "path": "data/datasets/context-function-retrieval-16k.jsonl"
+  },
+  "canonicalization_version": "dataset_canonical_v1",
+  "snapshot_policy": "manifest_only",
+  "dataset_hash": "sha256:09236e8d77ff47bac3e7169528c1796d64352b374d68ba5a68b06615e812f658",
+  "item_count": 5,
+  "item_hashes": [
+    {
+      "item_id": "function-front-16k",
+      "hash": "sha256:ccba4364f62e5bb1b3978480ea1fc44f8b09fac5f072fb8751a92b9d293f686d"
+    },
+    {
+      "item_id": "function-middle-16k",
+      "hash": "sha256:c0fe9f547fe73a296dc9aa9b374438cfbbd05fd77542bddcfbb0f545b404c524"
+    },
+    {
+      "item_id": "function-late-16k",
+      "hash": "sha256:53ea3ec7ecb0c4a6590a11927148b8b412711c67be97ed0d910e4854fd5d5ed4"
+    },
+    {
+      "item_id": "function-two-blocks-16k",
+      "hash": "sha256:d3d523a69337e9ab6737385c2aecb1f81f0b163de97b8d6ba1ad725ea097bdd0"
+    },
+    {
+      "item_id": "function-negative-control-16k",
+      "hash": "sha256:740e4305fe70ec13cbf964ec1609b53002de9361e3c9ce74674a141b73e0f698"
+    }
+  ],
+  "item_manifest_ref": null,
+  "snapshot_blob_ref": null,
+  "metadata": {
+    "source": "built-in-context-library",
+    "template_id": "model-context-function-retrieval-16k-v1",
+    "source_file": "backend/data/datasets/frame.py",
+    "dataset_file": "backend/data/datasets/context-function-retrieval-16k.jsonl",
+    "dataset_family": "function_retrieval",
+    "context_window_tokens": 16000
+  }
+}
diff --git a/...-library/documents/dataset_manifest/dataset-model-context-function-retrieval-256k-v1.json b/...-library/documents/dataset_manifest/dataset-model-context-function-retrieval-256k-v1.json
@@ -0,0 +1,46 @@
+{
+  "kind": "dataset_manifest",
+  "schema_version": "benchmark_dataset_manifest_v1",
+  "dataset_id": "dataset-model-context-function-retrieval-256k-v1",
+  "source": {
+    "source_type": "file",
+    "format": "jsonl",
+    "path": "data/datasets/context-function-retrieval-256k.jsonl"
+  },
+  "canonicalization_version": "dataset_canonical_v1",
+  "snapshot_policy": "manifest_only",
+  "dataset_hash": "sha256:02d2521f1fdee676be076b4340172a885638c4ea423083449822ff8d8a68b3fe",
+  "item_count": 5,
+  "item_hashes": [
+    {
+      "item_id": "function-front-256k",
+      "hash": "sha256:0e85afd5e91a8bc215138347cfdfecc699988c8371b815d33c598d68c86e6e5a"
+    },
+    {
+      "item_id": "function-middle-256k",
+      "hash": "sha256:cb47668518b5ec9e941573ea3ffe2e1f0b689210084c9b594212fd3b327fb05f"
+    },
+    {
+      "item_id": "function-late-256k",
+      "hash": "sha256:3f44c8c34f7296c932eb0bd88049598c07a4317ff1e805fefd732671b00ac257"
+    },
+    {
+      "item_id": "function-two-blocks-256k",
+      "hash": "sha256:2e7076c23277f5d9c058f8c722bdf177fbcb4b56f22b65675b209111a63356dd"
+    },
+    {
+      "item_id": "function-negative-control-256k",
+      "hash": "sha256:3d54549df5d2bc058f05d74fdd9634024300b609e14429985e45281ff60de882"
+    }
+  ],
+  "item_manifest_ref": null,
+  "snapshot_blob_ref": null,
+  "metadata": {
+    "source": "built-in-context-library",
+    "template_id": "model-context-function-retrieval-256k-v1",
+    "source_file": "backend/data/datasets/frame.py",
+    "dataset_file": "backend/data/datasets/context-function-retrieval-256k.jsonl",
+    "dataset_family": "function_retrieval",
+    "context_window_tokens": 256000
+  }
+}
diff --git a/...k-library/documents/dataset_manifest/dataset-model-context-function-retrieval-32k-v1.json b/...k-library/documents/dataset_manifest/dataset-model-context-function-retrieval-32k-v1.json
@@ -0,0 +1,46 @@
+{
+  "kind": "dataset_manifest",
+  "schema_version": "benchmark_dataset_manifest_v1",
+  "dataset_id": "dataset-model-context-function-retrieval-32k-v1",
+  "source": {
+    "source_type": "file",
+    "format": "jsonl",
+    "path": "data/datasets/context-function-retrieval-32k.jsonl"
+  },
+  "canonicalization_version": "dataset_canonical_v1",
+  "snapshot_policy": "manifest_only",
+  "dataset_hash": "sha256:98cf5203642c463fabb29612a54cf706d9f576cd2c5f094d73c9f3b1ee1522b7",
+  "item_count": 5,
+  "item_hashes": [
+    {
+      "item_id": "function-front-32k",
+      "hash": "sha256:68ca37b68f58d12e23141a9c28a1a9a325aac1d801bc2d8447302ab37f0c987e"
+    },
+    {
+      "item_id": "function-middle-32k",
+      "hash": "sha256:8056c08c6e62bc679a875414d74ed37c896fca27ee0db269460eb20e6b0a484e"
+    },
+    {
+      "item_id": "function-late-32k",
+      "hash": "sha256:1fe797370a033041244642da97f0e753c0a70c351496790684a8f5b2dc775918"
+    },
+    {
+      "item_id": "function-two-blocks-32k",
+      "hash": "sha256:2d7622f5f996710ce57f45d1ee421ba2976bac35980d56e7fdc61817bf1bdc07"
+    },
+    {
+      "item_id": "function-negative-control-32k",
+      "hash": "sha256:ffa07c1f7adc7825f317eaf4f53df3a63da8524ca1e4d4f91104bd3c9096d57b"
+    }
+  ],
+  "item_manifest_ref": null,
+  "snapshot_blob_ref": null,
+  "metadata": {
+    "source": "built-in-context-library",
+    "template_id": "model-context-function-retrieval-32k-v1",
+    "source_file": "backend/data/datasets/frame.py",
+    "dataset_file": "backend/data/datasets/context-function-retrieval-32k.jsonl",
+    "dataset_family": "function_retrieval",
+    "context_window_tokens": 32000
+  }
+}
diff --git a/...rk-library/documents/dataset_manifest/dataset-model-context-function-retrieval-4k-v1.json b/...rk-library/documents/dataset_manifest/dataset-model-context-function-retrieval-4k-v1.json
@@ -0,0 +1,46 @@
+{
+  "kind": "dataset_manifest",
+  "schema_version": "benchmark_dataset_manifest_v1",
+  "dataset_id": "dataset-model-context-function-retrieval-4k-v1",
+  "source": {
+    "source_type": "file",
+    "format": "jsonl",
+    "path": "data/datasets/context-function-retrieval-4k.jsonl"
+  },
+  "canonicalization_version": "dataset_canonical_v1",
+  "snapshot_policy": "manifest_only",
+  "dataset_hash": "sha256:8f8a71a734d9cbb8a69072f467c3c673cad01d536fa2859f36881ccdb51e4f1b",
+  "item_count": 5,
+  "item_hashes": [
+    {
+      "item_id": "function-front-4k",
+      "hash": "sha256:6e451c4c3f7a8ae153c5d87a12e8e89b8926796837061ba88b6d92033a4b2153"
+    },
+    {
+      "item_id": "function-middle-4k",
+      "hash": "sha256:11c61d18fb7c07348d863b346d58f95cd6f38abfae9120f32f81070eb1a13828"
+    },
+    {
+      "item_id": "function-late-4k",
+      "hash": "sha256:7b9750f61e53e247af61a613962198bf21d582be01cdf00c35c99a7843d72eea"
+    },
+    {
+      "item_id": "function-two-blocks-4k",
+      "hash": "sha256:ec32f76349e8e849f8b1b724e84d6508b9eba8a00f7f8fde693760508bd842bb"
+    },
+    {
+      "item_id": "function-negative-control-4k",
+      "hash": "sha256:24356452d9944f42a44afe705133ec909caae784e3a10125bb084898a5d02c6b"
+    }
+  ],
+  "item_manifest_ref": null,
+  "snapshot_blob_ref": null,
+  "metadata": {
+    "source": "built-in-context-library",
+    "template_id": "model-context-function-retrieval-4k-v1",
+    "source_file": "backend/data/datasets/frame.py",
+    "dataset_file": "backend/data/datasets/context-function-retrieval-4k.jsonl",
+    "dataset_family": "function_retrieval",
+    "context_window_tokens": 4000
+  }
+}
diff --git a/...k-library/documents/dataset_manifest/dataset-model-context-function-retrieval-64k-v1.json b/...k-library/documents/dataset_manifest/dataset-model-context-function-retrieval-64k-v1.json
@@ -0,0 +1,46 @@
+{
+  "kind": "dataset_manifest",
+  "schema_version": "benchmark_dataset_manifest_v1",
+  "dataset_id": "dataset-model-context-function-retrieval-64k-v1",
+  "source": {
+    "source_type": "file",
+    "format": "jsonl",
+    "path": "data/datasets/context-function-retrieval-64k.jsonl"
+  },
+  "canonicalization_version": "dataset_canonical_v1",
+  "snapshot_policy": "manifest_only",
+  "dataset_hash": "sha256:4226782d48c4cf0a4e881353cd2b351700388a55eebe541a33f615e72ddd14c7",
+  "item_count": 5,
+  "item_hashes": [
+    {
+      "item_id": "function-front-64k",
+      "hash": "sha256:c016db5396e11fa24f5f26ed5578d368f0cf1cdbcf08c7d12bd87a010c1eed60"
+    },
+    {
+      "item_id": "function-middle-64k",
+      "hash": "sha256:aaabb6be8b5d035d539b962d13816708c173537c58b2a91eab298b22f323d773"
+    },
+    {
+      "item_id": "function-late-64k",
+      "hash": "sha256:079b1e26a928739a3c2efd18b8f390435b557d8ff27de973f7f28e773e10825e"
+    },
+    {
+      "item_id": "function-two-blocks-64k",
+      "hash": "sha256:4c686a8c640dc80fb3b5b43e2790aad97667be6e52c8d97cdba1dd6082dc0021"
+    },
+    {
+      "item_id": "function-negative-control-64k",
+      "hash": "sha256:dc9b91faf80539d11ff62cd8867794a2a14b84770501225e0b1655ea2bff90ff"
+    }
+  ],
+  "item_manifest_ref": null,
+  "snapshot_blob_ref": null,
+  "metadata": {
+    "source": "built-in-context-library",
+    "template_id": "model-context-function-retrieval-64k-v1",
+    "source_file": "backend/data/datasets/frame.py",
+    "dataset_file": "backend/data/datasets/context-function-retrieval-64k.jsonl",
+    "dataset_family": "function_retrieval",
+    "context_window_tokens": 64000
+  }
+}
diff --git a/...rk-library/documents/dataset_manifest/dataset-model-context-function-retrieval-8k-v1.json b/...rk-library/documents/dataset_manifest/dataset-model-context-function-retrieval-8k-v1.json
@@ -0,0 +1,46 @@
+{
+  "kind": "dataset_manifest",
+  "schema_version": "benchmark_dataset_manifest_v1",
+  "dataset_id": "dataset-model-context-function-retrieval-8k-v1",
+  "source": {
+    "source_type": "file",
+    "format": "jsonl",
+    "path": "data/datasets/context-function-retrieval-8k.jsonl"
+  },
+  "canonicalization_version": "dataset_canonical_v1",
+  "snapshot_policy": "manifest_only",
+  "dataset_hash": "sha256:7630f14fcfae4de49e9bbd3fca5796c168d5bab42b602b83ee1c1646f16450ec",
+  "item_count": 5,
+  "item_hashes": [
+    {
+      "item_id": "function-front-8k",
+      "hash": "sha256:96854137cc3fcbb21fe6de14c7de0aa189360a8f0f099eb29febc9d2a041cc8d"
+    },
+    {
+      "item_id": "function-middle-8k",
+      "hash": "sha256:ad326f29edc6e1e941ea2729ffd69c684a9ddf99984b2e6c1e4810cd2bf7fbfc"
+    },
+    {
+      "item_id": "function-late-8k",
+      "hash": "sha256:782c77eeef23dddfb468b97d1ea292c31121923511ff8318be30b6d6b85ab58a"
+    },
+    {
+      "item_id": "function-two-blocks-8k",
+      "hash": "sha256:e48475a792a884755f4aa69f74adffecd1107501985e06943adc44f4e372ef86"
+    },
+    {
+      "item_id": "function-negative-control-8k",
+      "hash": "sha256:03ed19dd6d4f7151ca223a46a1880fd1f770d69abbb90f45c50a5705c5a10fe8"
+    }
+  ],
+  "item_manifest_ref": null,
+  "snapshot_blob_ref": null,
+  "metadata": {
+    "source": "built-in-context-library",
+    "template_id": "model-context-function-retrieval-8k-v1",
+    "source_file": "backend/data/datasets/frame.py",
+    "dataset_file": "backend/data/datasets/context-function-retrieval-8k.jsonl",
+    "dataset_family": "function_retrieval",
+    "context_window_tokens": 8000
+  }
+}
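The seven function-retrieval manifests above share one naming pattern and differ only in size label, hashes, and token count. The sketch below reproduces that pattern as observed in this diff. `expected_manifest_fields` is a hypothetical helper, not part of the project. Note that the manifests record `context_window_tokens` as the size label times 1000 (for example 128000), not times 1024.

```python
from typing import Dict, List, Union

SIZES_K = (4, 8, 16, 32, 64, 128, 256)
ITEM_KINDS = ("front", "middle", "late", "two-blocks", "negative-control")


def expected_manifest_fields(size_k: int) -> Dict[str, Union[str, int, List[str]]]:
    """Derive the size-dependent manifest fields for one context-window size."""
    label = f"{size_k}k"
    template_id = f"model-context-function-retrieval-{label}-v1"
    return {
        "dataset_id": f"dataset-{template_id}",
        "template_id": template_id,
        "path": f"data/datasets/context-function-retrieval-{label}.jsonl",
        "item_ids": [f"function-{kind}-{label}" for kind in ITEM_KINDS],
        "context_window_tokens": size_k * 1000,
    }


for size_k in SIZES_K:
    fields = expected_manifest_fields(size_k)
    print(fields["dataset_id"], fields["context_window_tokens"])
```

A table like this could back a test that fails when a manifest drifts from the convention, for instance if one file were edited by hand.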