From 3b459ee7516c718bfca6d777ec576e3c62138616 Mon Sep 17 00:00:00 2001 From: Avijit Ghosh Date: Sat, 25 Apr 2026 09:52:06 -0400 Subject: [PATCH 01/20] add blog post: AI evals are becoming the new compute bottleneck Field-guide essay by Avijit Ghosh on evaluation costs across static, agentic, and training-in-the-loop benchmarks. Preserves the responsive CSS bar charts from the source HTML and rethemes them with the evaleval design tokens (Inter/IBM Plex Mono, --fg/--accent/--border) so dark mode and the prose layout work without conflicts. --- _posts/2026-04-25-eval-costs-bottleneck.md | 641 +++++++++++++++++++++ 1 file changed, 641 insertions(+) create mode 100644 _posts/2026-04-25-eval-costs-bottleneck.md diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md new file mode 100644 index 0000000..b1139b8 --- /dev/null +++ b/_posts/2026-04-25-eval-costs-bottleneck.md @@ -0,0 +1,641 @@ +--- +layout: post +title: "AI evals are becoming the new compute bottleneck" +date: 2026-04-25 +published: true +category: Research +authors: + - name: "Avijit Ghosh" +tags: + - "AI Evaluation" + - "Cost" + - "Benchmarks" + - "Agents" + - "Reliability" + - "Compute" +description: "A field guide to evaluation costs: where the money goes, why old compression tricks break, and why agentic evals, training-in-the-loop benchmarks, and reliability measures are starting to break the bank." +--- + +
+ + + +

Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.

+ + + +

Making static LLM benchmarks cheaper

+ +

The cost problem started before agents. When Stanford's CRFM released HELM in 2022, full-coverage evaluation already required roughly $10,000 or 4,000+ GPU-hours per model. Perlitz et al. (2023) restate that figure, and IBM Research notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Multiplied across HELM's 30 models and 42 scenarios, the aggregate ran into the high six figures.

+ +

The more striking observation came from EleutherAI's Pythia paper. Pythia released 154 checkpoints across 16 model sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turned eval into a multiplier on training. As Perlitz et al. put it, evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation became the dominant compute line item across the whole development cycle.

+ +

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information. It was confirming rankings the field could have inferred much more cheaply.
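The coarse-to-fine idea is simple enough to sketch. This is not Flash-HELM's actual schedule, just a minimal illustration: `evaluate(model, n_items)` is a hypothetical callback returning an accuracy estimate from an `n_items` subsample, and the budgets and keep fraction are invented.

```python
import math

def coarse_to_fine_rank(models, evaluate, budgets=(50, 500, 5000), keep_frac=0.5):
    """Rank candidates Flash-HELM-style: score everyone on a cheap
    subset, keep the top fraction, and re-score only the survivors
    at higher resolution. Budgets and keep_frac are illustrative."""
    survivors = list(models)
    for n_items in budgets:
        scores = {m: evaluate(m, n_items) for m in survivors}
        survivors = sorted(survivors, key=scores.get, reverse=True)
        if n_items != budgets[-1]:  # prune before the next, pricier round
            survivors = survivors[: max(1, math.ceil(len(survivors) * keep_frac))]
    return survivors  # best-first ranking of the finalists
```

With 8 candidates this schedule spends 8×50 + 4×500 + 2×5,000 = 12,400 item-evaluations instead of 40,000 at full resolution, and deeper schedules widen the gap further.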

+ +

Other work reached the same conclusion from different angles. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 77 LLMs on GLUE. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking fidelity can survive aggressive subsampling.
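The anchor-item trick can be sketched without the Item Response Theory machinery. Below is a simplified stand-in for the clustering step these methods use: pick items whose correctness patterns across models are maximally distinct (farthest-first instead of k-medoids), then estimate each model's full score from the anchors, weighted by how many items each anchor represents. All data shapes here are illustrative.

```python
def hamming(u, v):
    # Number of models that disagree on two items' correctness patterns.
    return sum(a != b for a, b in zip(u, v))

def pick_anchors(item_cols, k):
    """Farthest-first selection of k anchor items from the
    model-by-item correctness matrix (one column per item)."""
    anchors = [0]
    while len(anchors) < k:
        nxt = max(range(len(item_cols)),
                  key=lambda j: min(hamming(item_cols[j], item_cols[a])
                                    for a in anchors))
        anchors.append(nxt)
    return anchors

def estimate_scores(matrix, anchors):
    """Estimate each model's full-benchmark accuracy from anchor items,
    weighting each anchor by the number of items nearest to it."""
    items = list(zip(*matrix))  # columns = items
    weights = [0] * len(anchors)
    for col in items:
        nearest = min(range(len(anchors)),
                      key=lambda i: hamming(col, items[anchors[i]]))
        weights[nearest] += 1
    n = len(items)
    return [sum(row[a] * w for a, w in zip(anchors, weights)) / n
            for row in matrix]
```

When items cluster into a few redundant correctness patterns, as the static-benchmark results suggest they do, a handful of anchors recovers full-benchmark accuracy almost exactly.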

+ +

That trick weakened sharply once benchmarks moved from static predictions to agents.

+ +

Agent evals are messier

+ +

The cleanest public accounting of agent evaluation comes from the Holistic Agent Leaderboard (Kapoor et al., ICLR 2026). HAL runs standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026 the leaderboard had grown to 26,597 rollouts. Ndzomga's independent reproduction arrives at a similar figure: $46,000 across 242 agent runs.

+ +

The aggregate number hides the important part: the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks.

+ +
+
Per-run cost spread on agent benchmarks
+
USD per one agent configuration on the full benchmark, log scale, HAL April 2026
+
+Maximum below $1,000 +Maximum at or above $1,000 +
+ +
Figure 1. Each bar shows the minimum-to-maximum cost across HAL configurations on a single benchmark. Highlighted bars cross the round $1,000-per-run threshold. A "run" is one full agent evaluation across all tasks. Within-benchmark spread reflects the model × scaffold × token-budget product. Source: live HAL leaderboard, April 2026.
+
+ +

Behind these numbers is a blunt pricing fact. Claude Opus 4.1 charges $15 per million input tokens and $75 per million output. Gemini 2.0 Flash charges $0.10 and $0.40, a two-order-of-magnitude spread on input alone. Agent benchmarks rarely benchmark "the model" in isolation. They benchmark a model × scaffold × token-budget product, and small scaffold choices can multiply costs 10×.
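Those list prices turn into benchmark costs through nothing more than token counts. A sketch, using the per-million-token prices quoted above; the 2M-input / 200k-output rollout budget and the 300-task count are invented for illustration:

```python
# Per-million-token prices in USD, as quoted in the post.
PRICE = {
    "claude-opus-4.1":  {"in": 15.00, "out": 75.00},
    "gemini-2.0-flash": {"in": 0.10,  "out": 0.40},
}

def rollout_cost(model, in_tokens, out_tokens):
    """USD for one agent rollout at a given token budget."""
    p = PRICE[model]
    return (in_tokens * p["in"] + out_tokens * p["out"]) / 1e6

def benchmark_cost(model, n_tasks, in_tokens, out_tokens):
    """One agent configuration: n_tasks rollouts at a fixed
    (assumed) token budget per rollout."""
    return n_tasks * rollout_cost(model, in_tokens, out_tokens)
```

At that assumed budget, one 300-task configuration costs $13,500 on Opus and $84 on Flash, a ~160× spread before any scaffold even changes the token budget itself.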

+ +

Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes "a 9× difference in cost despite just a two-percentage-point difference in accuracy." On GAIA, a HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. Across 6 SOTA agents on 300 enterprise tasks, CLEAR finds that "accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives" with comparable real-world performance.

+ +

The static-era toolkit should have helped, but it has only gone so far. Ndzomga's mid-difficulty filter, which selects tasks with 30 to 70% historical pass rates, achieves a 2× to 3.5× reduction while preserving rank fidelity under scaffold and temporal shifts. That is useful, but it falls far short of the 100× to 200× gains available for static benchmarks. The mechanics explain why: when each item is a multi-turn rollout with its own variance, the expensive object is not the test question. It is the whole trajectory.
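The mid-difficulty filter itself is a one-liner; the expensive part is collecting the historical pass rates it needs. A minimal sketch, with the 30 to 70% band from the text and an invented task map:

```python
def mid_difficulty_subset(pass_rates, lo=0.30, hi=0.70):
    """Keep tasks whose historical pass rate falls in [lo, hi].
    Tasks that (nearly) every agent solves or fails carry little
    ranking signal, so dropping them is where the 2-3.5x saving
    comes from. `pass_rates` maps task id -> historical pass rate."""
    return [task for task, rate in pass_rates.items() if lo <= rate <= hi]
```

Note what the filter cannot do: it shrinks the number of trajectories, not the length or variance of each one, which is why its ceiling sits so far below the static-benchmark compression factors.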

+ +

Some evals are just training

+ +

Some benchmarks escape the API-cost framing altogether because their evaluation protocol trains models from scratch.

+ +

The Well (NeurIPS 2024 D&B) gives the cleanest current example. It bundles 16 scientific machine-learning datasets spanning biological systems, fluid dynamics, magnetohydrodynamics, supernova explosions, viscoelastic instability, and active matter, totaling 15 TB. The protocol leaves little room to economize: train each baseline model for 12 hours on a single H100, try five learning rates per (model, dataset) pair, repeat across four architectures and 16 datasets. The full sweep consumes 3,840 H100-hours, or roughly $7,700 to $11,500 under the conversion assumptions below. A single new architecture still costs about 960 H100-hours.
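The arithmetic behind those H100-hour figures is worth making explicit. A sketch of the protocol's cost model, using the post's own conversion assumption of $2.50 per H100-hour:

```python
def well_gpu_hours(n_architectures=1, n_datasets=16, n_lrs=5, hours_per_run=12):
    """H100-hours under The Well's protocol: every (architecture,
    dataset, learning-rate) cell is a fresh 12-hour training run."""
    return n_architectures * n_datasets * n_lrs * hours_per_run

def usd(gpu_hours, rate_per_hour=2.50):
    # $2.50/H100-hr is the conversion assumption used elsewhere in the post.
    return gpu_hours * rate_per_hour
```

One new architecture is 16 × 5 × 12 = 960 H100-hours (~$2,400 at the assumed rate); the four-baseline sweep is 3,840 H100-hours (~$9,600), inside the $7,700 to $11,500 range quoted for plausible rates.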

+ +

This is the asymmetry that makes The Well important. Training one neural operator can take a single 12-hour H100 run, while evaluating it across the benchmark requires 80 such trainings. In this corner of ML, evaluation compute exceeds training compute by roughly two orders of magnitude, reversing the old deep-learning mental model.

+ +

The same pattern recurs across SciML. PDEBench covers 11 PDE families with per-submission training in the 50 to 200 GPU-hour range per architecture. MLE-Bench (OpenAI) sits between agent and training regimes. Each agent attempt at one of 75 Kaggle competitions runs 24 hours on a single A10 GPU, training real ML pipelines. The paper is explicit: "A single run of our main experiment setup of 24 hours per competition attempt requires 24 hours × 75 competitions = 1,800 GPU hours of compute," plus o1-preview consuming 127.5M input and 15M output tokens per seed. Three seeds × six models for a comparison study lands comfortably in six figures.

+ +

METR's RE-Bench caps each of seven research engineering environments at 8 hours on 1 to 6 H100s. A single-agent eval across the suite runs about 500 to 600 H100-hours; the human baseline, with 71 expert attempts, raises the implicit budget much further. Because the benchmark gives agents and humans the same wall-clock compute, a real-time training process sets the cost floor. A token budget no longer bounds it from above.

+ +

ResearchGym (ICLR 2026) makes the agent run actual ML research. Five test tasks (39 sub-tasks) drawn from ICML, ICLR, ACL, and CVPR orals, with the proposed methods withheld. The agent has to propose hypotheses, train models, and beat the original authors' baselines. The budget is tight: $10 in API plus 12 to 24 hours on a single GPU under 24 GB per task. A full pass (5 tasks × 24h × 3 seeds) consumes about 360 GPU-hours per agent.

+ +

PaperBench is where the cost picture turns brutal. Twenty ICML 2024 Spotlight or Oral papers must be replicated from scratch, graded against rubric trees with 8,316 leaf-node criteria. Each rollout uses an A10 GPU for 12 hours. The costs are easy to state and hard to absorb:

+ + + +

PaperBench Code-Dev drops execution on purpose. That choice halves rollout cost to about $4,000 and cuts grading to $10 per paper (85% lower). OpenAI built the variant because many groups cannot afford the full benchmark; the paper says so directly.

+ +

The historical precedent is NAS-Bench-101, whose tabular construction required over 100 TPU-years of training. Without that one-time investment, every NAS algorithm comparison would have cost 1 to 100+ GPU-hours per run, which would have made comparison pricier than the algorithms themselves.
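The tabular construction amortizes that one-time investment into free lookups. A toy sketch of the idea, not NAS-Bench-101's actual schema; the architecture keys and metric values are invented:

```python
# A tabular benchmark trains every architecture in the search space
# once up front, so each later "evaluation" is a dictionary lookup
# instead of 1-100+ GPU-hours of training. Keys and metrics invented.
PRECOMPUTED = {
    "conv3x3-conv3x3-maxpool": {"test_acc": 0.931, "train_hours": 1.2},
    "conv1x1-conv3x3-maxpool": {"test_acc": 0.918, "train_hours": 0.9},
}

def evaluate_architecture(arch_key):
    """O(1) stand-in for a full training run."""
    return PRECOMPUTED[arch_key]["test_acc"]
```

The trade-off is fixed in advance: the table is exhaustive over its search space and silent about everything outside it, which is exactly the "narrowing what the benchmark measures" cost discussed later.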

+ +
+
Cost per single evaluation, training-in-the-loop benchmarks
+
USD per one model or agent through the full benchmark protocol, log scale
+
+Below $5,000 +$5,000 or more +
+ +
Figure 2. All values in USD per single evaluation of one model or agent through the full benchmark protocol. GPU costs converted at $2.50/H100-hr, $1.50/A10-hr; API and grading costs included where applicable. Highlighted bars denote benchmarks costing at least the round $5,000-per-evaluation threshold. The most expensive of these match the most expensive agent benchmarks (Figure 1) but require GPU compute that has no API substitute.
+
+ +

These benchmarks have a hard floor because compression changes what they measure. If you shrink them by 200×, you no longer test the original premise. A neural operator cannot demonstrate generalization to a Navier-Stokes regime on 5% of the dataset, because the model has to be retrained. The HELM toolkit does not transfer.

+ +

As benchmarks move closer to real work, compression gets harder: static prediction leaves room for large savings, agent rollouts leave less, and in-the-loop training leaves almost none.

+ +
+
Compression factors achievable by benchmark type
+
Maximum reduction in evaluation compute that preserves model-rank fidelity, log scale
+
+Measured compression +No general compression method +
+ +
Figure 3. The toolkit for compressing evaluation does not transfer as benchmarks become more complex. Solid bars show measured compression ranges. The highlighted bar is not a cost threshold; it flags the ~1× baseline where no general compression method exists. Static benchmarks routinely compress 100–200× without losing rankings. Agent benchmarks compress 2–3.5× at best. Training-in-the-loop benchmarks resist subsampling because the unit being evaluated is the trained model.
+
+ +

Reliability is the expensive part

+ +

Most of the costs above buy only single-run measurements with limited statistical power. When you measure reliability across repeated runs, static benchmarks, agent benchmarks, and training-in-the-loop benchmarks all become more expensive.

+ +

Agent reliability can fall hard when you stop treating one run as evidence. The best-known example comes from Yao et al.'s τ-bench, later reframed in CLEAR (Mehta, 2025): performance can drop from 60% on a single run to 25% under 8-run consistency. Kapoor et al.'s "AI Agents That Matter" found that simple baseline agents Pareto-dominate complex SOTA agents (Reflexion, LDB, LATS) on HumanEval at 50× lower cost, and that 7 of 10 popular agent benchmarks lacked adequate holdout sets. The HAL paper notes that a "do-nothing" agent passes 38% of τ-bench airline tasks under the original construction. HAL's own log analysis revealed data leakage in the TAU-bench Few Shot scaffold, forcing its removal in December 2025.

+ +

The most recent reliability accounting comes from Rabanser, Kapoor et al.'s "Towards a Science of AI Agent Reliability", which proposes twelve metrics across consistency, robustness, predictability, and safety. Their finding: "recent capability gains have only yielded small improvements in reliability." HAL's internal analysis shows how much fragility hides behind aggregate accuracy. On SciCode and CORE-Bench, agents almost never completed a run without a tool-calling failure. On AssistantBench and CORE-Bench, environmental errors occurred in roughly 40% of runs. Agents violated explicit benchmark instructions in their final answer over 60% of the time on failed tasks.

+ +
+The reliability multiplier in practice. A statistically credible HAL-style evaluation with k = 8 reruns per cell takes the $40K aggregate to roughly $320K. The same multiplier on PaperBench's $9,500-per-run cost pushes a single agent's evaluation past $75K. On The Well, a multi-seed protocol takes the per-architecture cost from ~960 H100-hours to several thousand. Reliability doesn't require new cost categories. It inflates the ones that already exist. +
+ +

HAL has paused new model evaluations to focus on reliability, which makes the issue plain: the field's headline numbers still carry too much noise, and reducing that noise costs real money. The figures above are floors, not ceilings, and those floors already exclude many evaluators.
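The reliability multiplier above, and the pass^k consistency metric it pays for, fit in a few lines. A sketch with illustrative numbers; `results` maps each task to the outcomes of its k reruns:

```python
def pass_hat_k(results, k):
    """Fraction of tasks solved on all k repeated runs.
    `results` maps task id -> list of at-least-k booleans."""
    return sum(all(runs[:k]) for runs in results.values()) / len(results)

def rerun_cost(single_run_usd, k):
    # Reliability adds no new cost category: k reruns multiply the old one.
    return single_run_usd * k
```

With a toy suite where one task always passes, one is flaky, and one always fails, single-run accuracy reads 2/3 while pass^8 reads 1/3, the same kind of gap τ-bench showed between 60% and 25%.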

+ +

What this means for ML as a field

+ +

The data point to three consequences that reinforce one another.

+ +

Eval cost is now an accountability barrier

+ +

Academic groups, AI Safety Institutes, and journalists now hit the budget constraint before the technical one when they try to evaluate frontier agents independently. A single GAIA run can exceed an annual graduate student travel budget. A single PaperBench evaluation, including the LLM judge, runs about $9,500. Three-seed comparisons of six models, the kind of study one might publish, push above $150,000. The established practice of "running a benchmark once and reporting the accuracy number" is no more rigorous than crash-testing one car in perfect weather, but moving past it requires money the academic system does not currently allocate as research compute.

+ +

The compute divide now includes evaluation

+ +

Ahmed, Wahed and Thompson (Science 2023) documented that industry models in 2021 were 29× larger than academic ones by parameter count, and that about 70% of AI PhDs went to industry in 2020 versus 21% in 2004. The original "compute divide" story mostly ignored evaluation because evaluation used to look cheap next to training. Many benchmarks have reversed that relationship. A lab that can fine-tune a 7B model can no longer assume it can afford the benchmarks the field takes seriously.

+ +

Cost-blind leaderboards reward waste

+ +

When leaderboards report raw accuracy and omit cost, researchers can rationally pour tokens into a problem until the number ticks up. The HAL paper finds that higher reasoning effort actually reduces accuracy in the majority of runs, which exposes the deeper pathology: extra inference compute does not reliably improve even the metric it is supposed to optimize. Pareto frontiers fix the comparison by ranking accuracy against cost. HAL implements them, but most leaderboards still do not.
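The Pareto-front comparison is mechanically simple; HAL's implementation details aside, a minimal sketch over (name, cost, accuracy) tuples looks like:

```python
def pareto_frontier(configs):
    """Return configs on the cost-accuracy Pareto front: no other
    config is both no pricier and at least as accurate (and different).
    `configs` is a list of (name, cost_usd, accuracy) tuples."""
    front = []
    for name, cost, acc in configs:
        dominated = any(c2 <= cost and a2 >= acc and (c2, a2) != (cost, acc)
                        for _, c2, a2 in configs)
        if not dominated:
            front.append(name)
    return front
```

Fed the Online Mind2Web numbers from earlier, the $1,577/40% configuration drops off the front while the $171/42% one stays, which is precisely the comparison a raw-accuracy leaderboard hides.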

+ +

If only frontier-lab compute budgets can produce statistically reliable benchmark numbers on the highest-cost agentic and scientific benchmarks, the social process of evaluating AI systems becomes concentrated inside the same labs that build them, rendering external validation partial, and sometimes absent, unless someone subsidizes the cost directly.

+ +

Cost summary across benchmark types

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Benchmark | Type | USD per single evaluation | What "one evaluation" means |
|---|---|---|---|
| HELM (per LLM, 2022) | Static LLM | ~$8,000 – $10,000 | One LLM through full HELM (~4,000 GPU-hrs) |
| ScienceAgentBench | Agentic, science | $0.19 – $77 | One agent config across 102 tasks |
| TAU-bench Airline | Agentic | $0.31 – $180 | One agent across all airline tasks |
| SciCode | Agentic, science | $0.12 – $625 | One agent across 338 sub-problems |
| CORE-Bench Hard | Agentic, replication | $2 – $510 | One agent across 45 papers |
| SWE-bench Verified Mini | Agentic, coding | $4 – $1,600 | One agent across 50 issues |
| Online Mind2Web | Agentic, web | $5 – $1,610 | One agent across 300 web tasks |
| GAIA | Agentic, multimodal | $7.80 – $2,829 | One agent across GAIA tasks |
| ResearchGym (per seed) | ML research, training | $540 – $1,260 | 5 tasks × 24h GPU + API |
| RE-Bench (full agent) | ML R&D, training | $1,200 – $1,800 | 7 environments × 8h on H100 |
| The Well (per architecture) | SciML, training | $1,920 – $2,880 | 5 LRs × 16 datasets × 12h H100 |
| MLE-Bench (1 seed) | ML R&D, training | ~$2,700 – $3,000 | 75 Kaggle competitions × 24h on A10 |
| PaperBench Code-Dev | Scientific, code only | ~$4,200 | One agent across 20 papers, no execution |
| The Well (full sweep) | SciML, training | $7,700 – $11,500 | 4 architectures, full protocol |
| PaperBench (full) | Scientific | ~$9,500 | One agent across 20 papers, full protocol |
| HAL aggregate | 9 benchmarks × 9 models | ~$40,000 | All 81 cells, single seed each |
+
+ +

All figures normalized to USD per single evaluation. GPU compute converted at $2.50/H100-hour, $1.50/A10-hour; API and grading costs included where applicable. Pythia ("eval can exceed pretraining"), PDEBench (50–200 GPU-hours per architecture without specified hardware), and NAS-Bench-101's 100 TPU-year construction cost are excluded because they do not normalize cleanly to a per-evaluation USD figure.

+ +

Where this leaves us

+ +

The economics have changed. Not long ago, training was expensive and evaluation was cheap. For frontier LLMs trained at $50 million to $100 million, evaluation still looks like a rounding error, but that rounding error now costs tens of thousands of dollars per benchmark run and often leaves noisy results behind. For neural operators, ML research agents, and replication benchmarks, the ratio has flipped: a credible evaluation can cost more than training the candidate model.

+ +

The field already knows how to make static evaluation cheaper. Flash-HELM, tinyBenchmarks, and Anchor Points work. Agent evaluation has only partial fixes: mid-difficulty filtering helps, and Pareto-front leaderboards help, but the toolkit remains thin. Training-in-the-loop evaluation has no general compression method; tabular precomputation and tight budget caps can reduce cost only by narrowing what the benchmark measures. Reliability adds another layer because repeated runs raise the price of every protocol.

+ +

The field still talks as if capability sets the main constraint, but evaluation points to reliability as the tighter one. Governance institutions should want to measure the gap between single-run accuracy and pass^k consistency, yet that gap costs the most to measure. Static-benchmark compression does not transfer to agent or training-in-the-loop benchmarks, and mid-difficulty filtering remains the only credible partial substitute. Cost-blind leaderboards now mislead by design, because they reward extra spending without reporting what that spending bought.

+ +

Evaluation now has its own compute budgets, statistical methods, and failure modes. Its price also shapes who gets to evaluate powerful systems in the first place. Whoever can pay for the evaluation gets to write the leaderboard.

+ +
+ +
+Sources: Perlitz et al. arXiv:2308.11696; Garikaparthi et al. arXiv:2602.15112; Starace et al. arXiv:2504.01848 (PaperBench); Mehta arXiv:2511.14136 (CLEAR); Ndzomga arXiv:2603.23749; Kapoor et al. arXiv:2510.11977 (HAL); Tian et al. arXiv:2407.13168 (SciCode); Chen et al. arXiv:2410.05080 (ScienceAgentBench); Siegel et al. arXiv:2409.11363 (CORE-Bench); Chan et al. arXiv:2410.07095 (MLE-Bench); METR arXiv:2411.15114 (RE-Bench); Ohana et al. arXiv:2412.00568 (The Well); Polo et al. arXiv:2402.14992 (tinyBenchmarks); Vivek et al. arXiv:2309.08638 (Anchor Points); Rabanser et al. arXiv:2602.16666; live HAL leaderboard hal.cs.princeton.edu. +
+ +
From e64e4ccec6a02bc80aca70f81e0caadbe98a8e8d Mon Sep 17 00:00:00 2001 From: Avijit Ghosh Date: Sat, 25 Apr 2026 09:56:11 -0400 Subject: [PATCH 02/20] fix eval-costs post: drop inline TOC, use accent blue, escape MathJax - Remove the inline roadmap nav (sidebar TOC already covers it). - Retheme the highlight color from brown to var(--accent) so the bars match the evaleval blue identity in both light and dark modes. - Add tex2jax_ignore on the article wrapper so MathJax stops parsing dollar-sign pairs in prose (the summary was rendering as italic math). --- _posts/2026-04-25-eval-costs-bottleneck.md | 43 ++-------------------- 1 file changed, 3 insertions(+), 40 deletions(-) diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md index b1139b8..e980acf 100644 --- a/_posts/2026-04-25-eval-costs-bottleneck.md +++ b/_posts/2026-04-25-eval-costs-bottleneck.md @@ -16,21 +16,18 @@ tags: description: "A field guide to evaluation costs: where the money goes, why old compression tricks break, and why agentic evals, training-in-the-loop benchmarks, and reliability measures are starting to break the bank." --- -
+
-

Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.

+

Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard (HAL) recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Now, just imagine comparing the effects of scaffolds vs. models, in Exgentic or scaling agentic steps to millions, like in UK-AISI. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.

Making static LLM benchmarks cheaper

-

The cost problem started before agents. When Stanford's CRFM released HELM in 2022, full-coverage evaluation already required roughly $10,000 or 4,000+ GPU-hours per model. Perlitz et al. (2023) restate that figure, and IBM Research notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Multiplied across HELM's 30 models and 42 scenarios, the aggregate ran into the high six figures.

+

The cost problem started before agents. When Stanford's CRFM released HELM in 2022, full-coverage evaluation already required roughly $10,000 or 4,000+ GPU-hours per model. Perlitz et al. (2023) restate that figure, and IBM Research notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Multiplied across HELM's 30 models and 42 scenarios, the aggregate ran into the high six figures.

-

The more striking observation came from Perlitz et al.'s analysis of EleutherAI's Pythia checkpoints. Pythia released 154 checkpoints across 16 model sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: Perlitz et al. (2024) found that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle.

+

The more striking observation came from Perlitz et al.'s analysis of EleutherAI's Pythia checkpoints, where developers pay for evaluation even more. Pythia released 154 checkpoints across 16 model sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: Perlitz et al. (2024) noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.

-

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information. It was confirming rankings the field could have inferred much more cheaply.

+

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information. It was confirming rankings that the field could have inferred much more cheaply.

-

Other work reached the same conclusion from different angles. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 77 LLMs on GLUE. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking fidelity can survive aggressive subsampling.

+

Other work reached the same conclusion from different angles. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 77 LLMs on GLUE, and others followed, reducing dataset sizes by 90%. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking can survive aggressive subsampling.

That trick weakened sharply once benchmarks moved from static predictions to agents.

Agent evals are messier

-

The cleanest public accounting of agent evaluation comes from the Holistic Agent Leaderboard (Kapoor et al., ICLR 2026). HAL runs standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026 the leaderboard had grown to 26,597 rollouts. Ndzomga's independent reproduction arrives at almost the same number: $46,000 across 242 agent runs.

+

The cleanest public accounting of agent evaluation comes from the Holistic Agent Leaderboard (Kapoor et al., ICLR 2026). HAL runs standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026, the leaderboard had grown to 26,597 rollouts. Ndzomga's independent reproduction arrives at almost the same number: $46,000 across 242 agent runs.

The aggregate number hides the important part: the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks.

@@ -497,7 +500,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c

Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes "a 9× difference in cost despite just a two-percentage-point difference in accuracy." On GAIA, an HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR finds across 6 SOTA agents on 300 enterprise tasks that "accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives" with comparable real-world performance.

-

The static-era toolkit should have helped, but it has only gone so far. Ndzomga's mid-difficulty filter, which selects tasks with 30 to 70% historical pass rates, achieves a 2× to 3.5× reduction while preserving rank fidelity under scaffold and temporal shifts. That is useful, but it falls far short of the 100× to 200× gains available for static benchmarks. The mechanics explain why: when each item is a multi-turn rollout with its own variance, the expensive object is not the test question. It is the whole trajectory.

+

The static-era toolkit should have helped, but it has only gone so far. Ndzomga's mid-difficulty filter, which selects tasks with 30 to 70% historical pass rates, achieves a 2× to 3.5× reduction while preserving rank fidelity under scaffold and temporal shifts. That is useful, but it falls far short of the 100× to 200× gains available for static benchmarks. The mechanics explain why: when each item is a multi-turn rollout with its own variance, the expensive object is not the number of questions. It is the unavoidably long trajectory behind each one.

Some evals are just training

From 2f708b7a30b1ed35debd6d9ad34c5658d5343402 Mon Sep 17 00:00:00 2001 From: Avijit Ghosh Date: Sat, 25 Apr 2026 21:28:29 -0400 Subject: [PATCH 14/20] complete author list in bibtex citation --- _posts/2026-04-25-eval-costs-bottleneck.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md index 34c46b7..05bc361 100644 --- a/_posts/2026-04-25-eval-costs-bottleneck.md +++ b/_posts/2026-04-25-eval-costs-bottleneck.md @@ -664,7 +664,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
BibTeX Citation
{% raw %}@misc{ghosh2026evalbottleneck,
-  author       = {Ghosh, Avijit},
+  author       = {Ghosh, Avijit and Mai, Yifan and Channing, Georgia and Choshen, Leshem},
   title        = {{AI} evals are becoming the new compute bottleneck},
   year         = {2026},
   month        = apr,

From f8c7eb57ffb79f2ae487106086dbc7fc88c87b2f Mon Sep 17 00:00:00 2001
From: Avijit Ghosh 
Date: Sat, 25 Apr 2026 21:30:51 -0400
Subject: [PATCH 15/20] smooth link phrasing in summary callout

---
 _posts/2026-04-25-eval-costs-bottleneck.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md
index 05bc361..0dd8b85 100644
--- a/_posts/2026-04-25-eval-costs-bottleneck.md
+++ b/_posts/2026-04-25-eval-costs-bottleneck.md
@@ -456,7 +456,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
 }
 
 
-

Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard (HAL) recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Now, just imagine comparing the effects of scaffolds vs. models, in Exgentic or scaling agentic steps to millions, like in UK-AISI. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.

+

Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard (HAL) recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Now imagine comparing the effects of scaffolds versus models, as Exgentic does, or scaling agentic steps into the millions, as UK-AISI recently did. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.

Making static LLM benchmarks cheaper

From 9cb682c67fc92c1ec2e4bd8f63ebf536d1da5d07 Mon Sep 17 00:00:00 2001
From: Avijit Ghosh
Date: Sat, 25 Apr 2026 21:34:04 -0400
Subject: [PATCH 16/20] fill in missing citations in sources block

---
 _posts/2026-04-25-eval-costs-bottleneck.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md
index 0dd8b85..cc5fcd5 100644
--- a/_posts/2026-04-25-eval-costs-bottleneck.md
+++ b/_posts/2026-04-25-eval-costs-bottleneck.md
@@ -657,7 +657,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
-Sources: Perlitz et al. arXiv:2308.11696; Garikaparthi et al. arXiv:2602.15112; Starace et al. arXiv:2504.01848 (PaperBench); Mehta arXiv:2511.14136 (CLEAR); Ndzomga arXiv:2603.23749; Kapoor et al. arXiv:2510.11977 (HAL); Tian et al. arXiv:2407.13168 (SciCode); Chen et al. arXiv:2410.05080 (ScienceAgentBench); Siegel et al. arXiv:2409.11363 (CORE-Bench); Chan et al. arXiv:2410.07095 (MLE-Bench); METR arXiv:2411.15114 (RE-Bench); Ohana et al. arXiv:2412.00568 (The Well); Polo et al. arXiv:2402.14992 (tinyBenchmarks); Vivek et al. arXiv:2309.08638 (Anchor Points); Rabanser et al. arXiv:2602.16666; live HAL leaderboard hal.cs.princeton.edu.
+Sources: Liang et al. arXiv:2211.09110 (HELM); Biderman et al. arXiv:2304.01373 (Pythia); Perlitz et al. arXiv:2308.11696; Polo et al. arXiv:2402.14992 (tinyBenchmarks); Vivek et al. arXiv:2309.08638 (Anchor Points); Li et al. arXiv:2511.04689 (Adaptive Testing); Garikaparthi et al. arXiv:2602.15112 (ResearchGym); Starace et al. arXiv:2504.01848 (PaperBench); Mehta arXiv:2511.14136 (CLEAR); Ndzomga arXiv:2603.23749; Kapoor et al. arXiv:2510.11977 (HAL); Kapoor et al. arXiv:2407.01502 (AI Agents That Matter); Tian et al. arXiv:2407.13168 (SciCode); Chen et al. arXiv:2410.05080 (ScienceAgentBench); Siegel et al. arXiv:2409.11363 (CORE-Bench); Chan et al. arXiv:2410.07095 (MLE-Bench); METR arXiv:2411.15114 (RE-Bench); Takamoto et al. arXiv:2210.07182 (PDEBench); Ohana et al. arXiv:2412.00568 (The Well); Ying et al. arXiv:1902.09635 (NAS-Bench-101); Rabanser et al. arXiv:2602.16666; Ahmed, Wahed and Thompson Science (2023); IBM Research "Efficient LLM Benchmarking"; UK AISI inference-scaling cyber blog; live HAL leaderboard hal.cs.princeton.edu.
From c81e2b6c527cb7646357d50fef3647d31aa5d071 Mon Sep 17 00:00:00 2001
From: Avijit Ghosh
Date: Sat, 25 Apr 2026 21:55:23 -0400
Subject: [PATCH 17/20] tighten prose, restructure sources as bibliography list

- restructure sources block as numbered bibliography (full titles, year, authors)
- correct RE-Bench attribution (Wijk et al., not METR)
- integrate reliability-multiplier callout into surrounding prose
- remove filler intros and 'X not Y' staccato patterns
- rephrase 'floors not ceilings' line

---
 _posts/2026-04-25-eval-costs-bottleneck.md | 72 ++++++++++++++++++----
 1 file changed, 60 insertions(+), 12 deletions(-)

diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md
index cc5fcd5..1afcdc1 100644
--- a/_posts/2026-04-25-eval-costs-bottleneck.md
+++ b/_posts/2026-04-25-eval-costs-bottleneck.md
@@ -394,6 +394,28 @@ description: "A field guide to evaluation costs: where the money goes, why old c
   letter-spacing: .1em;
   font-size: 11px;
 }
+.eval-cost-article .source-list {
+  list-style: decimal;
+  margin: 14px 0 0;
+  padding-left: 26px;
+  max-width: none;
+  font-size: 12.5px;
+  line-height: 1.55;
+  column-count: 1;
+}
+.eval-cost-article .source-list li {
+  margin-bottom: 5px;
+  padding-left: 2px;
+  break-inside: avoid;
+}
+.eval-cost-article .source-list li::marker {
+  color: var(--fg-subtle);
+  font-variant-numeric: tabular-nums;
+}
+.eval-cost-article .source-list cite {
+  font-style: italic;
+  color: var(--fg);
+}
 @media (max-width: 760px) {
   .eval-cost-article { font-size: 16.5px; line-height: 1.68; }
@@ -464,7 +486,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c

The more striking observation came from Perlitz et al.'s analysis of EleutherAI's Pythia checkpoints, developers pay for evaluation even more. Pythia released 154 checkpoints across 16 model sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: Perlitz et al. (2024) noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.

-

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information. It was confirming rankings that the field could have inferred much more cheaply.

+

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information; it was confirming rankings that the field could have inferred much more cheaply.

Other work reached the same conclusion from different angles. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 77 LLMs on GLUE, and others followed, reducing dataset sizes by 90%. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking can survive aggressive subsampling.
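The anchor-points idea is easy to sketch: cluster items by their response pattern across models, keep one representative item per cluster, and score models on the representatives alone. A minimal illustration, not the authors' code; the models × items correctness matrix is simulated here, and names like `acc` and `anchors` are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for a models x items correctness matrix built from
# historical eval logs (e.g. many LLMs scored on every GLUE item).
n_models, n_items, k = 40, 500, 10
ability = rng.normal(0, 1, (n_models, 1))
difficulty = rng.normal(0, 1, (1, n_items))
p_correct = 1 / (1 + np.exp(difficulty - ability))
acc = (rng.random((n_models, n_items)) < p_correct).astype(float)

# Cluster item columns by their response pattern across models (plain k-means),
# then keep one medoid item per cluster, weighted by cluster size.
cols = acc.T
centers = cols[rng.choice(n_items, size=k, replace=False)]
for _ in range(25):
    dists = ((cols[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(1)
    centers = np.stack([cols[assign == j].mean(0) if (assign == j).any() else centers[j]
                        for j in range(k)])

anchors, weights = [], []
for j in range(k):
    members = np.flatnonzero(assign == j)
    if members.size == 0:
        continue
    medoid = members[((cols[members] - centers[j]) ** 2).sum(-1).argmin()]
    anchors.append(medoid)
    weights.append(members.size / n_items)

# Score every model on the anchors alone and compare rankings to the full run.
est = acc[:, anchors] @ np.array(weights)
full = acc.mean(axis=1)
rank = lambda v: np.argsort(np.argsort(v))
rank_corr = np.corrcoef(rank(est), rank(full))[0, 1]
print(f"{len(anchors)} anchors stand in for {n_items} items; "
      f"Spearman vs full benchmark = {rank_corr:.2f}")
```

The exploit is exactly the weakness named above: because item columns are highly redundant, a handful of medoids carries most of the ranking signal.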

@@ -474,7 +496,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c

The cleanest public accounting of agent evaluation comes from the Holistic Agent Leaderboard (Kapoor et al., ICLR 2026). HAL runs standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026, the leaderboard had grown to 26,597 rollouts. Ndzomga's independent reproduction arrives at almost the same number: $46,000 across 242 agent runs.

-

The aggregate number hides the important part: the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks.

+

Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks.

Per-run cost spread on agent benchmarks
@@ -500,7 +522,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c

Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes "a 9× difference in cost despite just a two-percentage-point difference in accuracy." On GAIA, an HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR finds across 6 SOTA agents on 300 enterprise tasks that "accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives" with comparable real-world performance.

-

The static-era toolkit should have helped, but it has only gone so far. Ndzomga's mid-difficulty filter, which selects tasks with 30 to 70% historical pass rates, achieves a 2× to 3.5× reduction while preserving rank fidelity under scaffold and temporal shifts. That is useful, but it falls far short of the 100× to 200× gains available for static benchmarks. The mechanics explain why: when each item is a multi-turn rollout with its own variance, the expensive object is not the large number of questions. It is the unavoidable long trajectory per single question.

+

The static-era toolkit should have helped, but it has only gone so far. Ndzomga's mid-difficulty filter, which selects tasks with 30 to 70% historical pass rates, achieves a 2× to 3.5× reduction while preserving rank fidelity under scaffold and temporal shifts. That is useful, but it falls far short of the 100× to 200× gains available for static benchmarks. When each item is a multi-turn rollout with its own variance, the expensive object is not the large number of questions but the unavoidable long trajectory per single question.

Some evals are just training

@@ -508,7 +530,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c

The Well (NeurIPS 2024 D&B) gives the cleanest current example. It bundles 16 scientific machine-learning datasets spanning biological systems, fluid dynamics, magnetohydrodynamics, supernova explosions, viscoelastic instability, and active matter, totaling 15 TB. The protocol leaves little room to economize: train each baseline model for 12 hours on a single H100, try five learning rates per (model, dataset) pair, repeat across four architectures and 16 datasets. The full sweep consumes 3,840 H100-hours, or roughly $7,700 to $11,500 under the conversion assumptions below. A single new architecture still costs about 960 H100-hours.

-

This is the asymmetry that makes The Well important. Training one neural operator can take a single 12-hour H100 run, while evaluating it across the benchmark requires 80 such trainings. In this corner of ML, evaluation compute exceeds training compute by roughly two orders of magnitude, reversing the old deep-learning mental model.

+

Training one neural operator can take a single 12-hour H100 run, while evaluating it across the benchmark requires 80 such trainings — the asymmetry that makes The Well important. In this corner of ML, evaluation compute exceeds training compute by roughly two orders of magnitude, reversing the old deep-learning mental model.

The same pattern recurs across SciML. PDEBench covers 11 PDE families with per-submission training in the 50 to 200 GPU-hour range per architecture. MLE-Bench (OpenAI) sits between agent and training regimes. Each agent attempt at one of 75 Kaggle competitions runs 24 hours on a single A10 GPU, training real ML pipelines. The paper is explicit: "A single run of our main experiment setup of 24 hours per competition attempt requires 24 hours × 75 competitions = 1,800 GPU hours of compute," plus o1-preview consuming 127.5M input and 15M output tokens per seed. Three seeds × six models for a comparison study lands comfortably in six figures.
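The sweep arithmetic in the last few paragraphs reduces to multiplication, which makes it easy to sanity-check. A quick sketch; the $2.00 to $3.00 per H100-hour band is an assumption chosen to match the conversion range this post quotes:

```python
# Sanity-checking the quoted GPU-hour figures. Hourly rates are assumptions
# roughly matching the $7,700-$11,500 conversion range used in this post.
H100_RATE_LOW, H100_RATE_HIGH = 2.00, 3.00   # assumed $/H100-hr

# The Well: 4 architectures x 16 datasets x 5 learning rates x 12 h each
full_sweep_hrs = 4 * 16 * 5 * 12   # full four-baseline sweep, in H100-hours
one_arch_hrs = 16 * 5 * 12         # a single new architecture
dollar_low = full_sweep_hrs * H100_RATE_LOW
dollar_high = full_sweep_hrs * H100_RATE_HIGH

# MLE-Bench: 75 Kaggle competitions x 24 h on an A10, per seed
mle_hrs = 75 * 24

print(full_sweep_hrs, one_arch_hrs, mle_hrs)         # 3840 960 1800
print(f"${dollar_low:,.0f} to ${dollar_high:,.0f}")  # $7,680 to $11,520
```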

@@ -516,7 +538,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c

ResearchGym (ICLR 2026) makes the agent run actual ML research. Five test tasks (39 sub-tasks) drawn from ICML, ICLR, ACL, and CVPR orals, with the proposed methods withheld. The agent has to propose hypotheses, train models, and beat the original authors' baselines. The budget is tight: $10 in API plus 12 to 24 hours on a single GPU under 24 GB per task. A full pass (5 tasks × 24h × 3 seeds) consumes about 360 GPU-hours per agent.

-

PaperBench is where the cost picture turns brutal. Twenty ICML 2024 Spotlight or Oral papers must be replicated from scratch, graded against rubric trees with 8,316 leaf-node criteria. Each rollout uses an A10 GPU for 12 hours. The costs are easy to state and hard to absorb:

+

PaperBench is where the cost picture turns brutal. Twenty ICML 2024 Spotlight or Oral papers must be replicated from scratch, graded against rubric trees with 8,316 leaf-node criteria. Each rollout uses an A10 GPU for 12 hours, and the per-paper math is straightforward:

  • $400 in API per o1 IterativeAgent rollout, times 20 papers, comes to about $8,000 per evaluation.
  •
@@ -524,7 +546,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
  • Using o1 as judge would push grading to about $830 per paper.
-

PaperBench Code-Dev drops execution on purpose. That choice halves rollout cost to about $4,000 and cuts grading to $10 per paper (85% lower). OpenAI built the variant because many groups cannot afford the full benchmark; the paper says so directly.

+

PaperBench Code-Dev drops execution on purpose. That choice halves rollout cost to about $4,000 and cuts grading to $10 per paper (85% lower). OpenAI built the variant because many groups cannot afford the full benchmark.

The historical precedent is NAS-Bench-101, whose tabular construction required over 100 TPU-years of training. Without that one-time investment, every NAS algorithm comparison would have cost 1 to 100+ GPU-hours per run, which would have made comparison pricier than the algorithms themselves.
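The tabular trick is worth seeing concretely: once every architecture in the search space has been trained once, comparing NAS algorithms degenerates into dictionary lookups. A toy sketch; the entries and the `query` helper are invented for illustration, and real values come from the published NAS-Bench-101 tables:

```python
# Toy stand-in for a tabular NAS benchmark: metrics are precomputed once
# (the 100 TPU-year investment), so every later comparison is a lookup.
TABLE = {
    ("conv3x3", "conv1x1", "maxpool"): {"val_acc": 0.9432, "train_seconds": 1567},
    ("conv3x3", "conv3x3", "conv3x3"): {"val_acc": 0.9501, "train_seconds": 2190},
    ("conv1x1", "maxpool", "conv1x1"): {"val_acc": 0.9217, "train_seconds": 983},
}

def query(arch):
    """O(1) lookup replacing the 1 to 100+ GPU-hours a fresh training would cost."""
    return TABLE[arch]["val_acc"]

best = max(TABLE, key=query)
print(best, query(best))
```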

@@ -576,11 +598,9 @@ description: "A field guide to evaluation costs: where the money goes, why old c

The most recent reliability accounting comes from Rabanser, Kapoor et al.'s "Towards a Science of AI Agent Reliability", which proposes twelve metrics across consistency, robustness, predictability, and safety. Their finding: "recent capability gains have only yielded small improvements in reliability." HAL's internal analysis shows how much fragility hides behind aggregate accuracy. On SciCode and CORE-Bench, agents almost never completed a run without a tool-calling failure. On AssistantBench and CORE-Bench, environmental errors occurred in roughly 40% of runs. Agents violated explicit benchmark instructions in their final answer over 60% of the time on failed tasks.

-
-The reliability multiplier in practice. A statistically credible HAL-style evaluation with k = 8 reruns per cell takes the $40K aggregate to roughly $320K. The same multiplier on PaperBench's $9,500-per-run cost pushes a single agent's evaluation past $75K. On The Well, a multi-seed protocol takes the per-architecture cost from ~960 H100-hours to several thousand. Reliability doesn't require new cost categories. It inflates the ones that already exist. -
+

A statistically credible HAL-style evaluation with k = 8 reruns per cell takes the $40K aggregate to roughly $320K. The same multiplier on PaperBench's $9,500-per-run cost pushes a single agent's evaluation past $75K, and on The Well, a multi-seed protocol takes the per-architecture cost from ~960 H100-hours to several thousand. Reliability acts as a multiplier on every cost category above.
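The multiplier itself is plain arithmetic over the costs quoted above (k = 8 is the rerun count per cell; the dictionary labels are ours):

```python
# k reruns per cell scale every existing line item by k.
k = 8
base_costs = {
    "HAL full sweep ($)": 40_000,
    "PaperBench, one agent ($)": 9_500,
    "The Well, one architecture (H100-hrs)": 960,
}
for name, cost in base_costs.items():
    print(f"{name}: {cost:,} -> {cost * k:,} with k = {k} reruns")
```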

-

HAL has paused new model evaluations to focus on reliability, which makes the issue plain: the field's headline numbers still carry too much noise, and reducing that noise costs real money. The figures above are floors, not ceilings, and those floors already exclude many evaluators.

+

HAL has paused new model evaluations to focus on reliability: the field's headline numbers still carry too much noise, and reducing that noise costs real money. And the figures above are lower bounds; many evaluators are already priced out.

What this means for ML as a field

@@ -596,7 +616,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c

Cost-blind leaderboards reward waste

-

When leaderboards report raw accuracy and omit cost, researchers can rationally pour tokens into a problem until the number ticks up. The HAL paper finds that higher reasoning effort actually reduces accuracy in the majority of runs, which exposes the deeper pathology: extra inference compute does not reliably improve even the metric it is supposed to optimize. Pareto frontiers fix the comparison by ranking accuracy against cost. HAL implements them, but most leaderboards still do not.

+

When leaderboards report raw accuracy and omit cost, researchers can rationally pour tokens into a problem until the number ticks up. The HAL paper finds that higher reasoning effort actually reduces accuracy in the majority of runs: extra inference compute does not reliably improve even the metric it is supposed to optimize. Pareto frontiers fix the comparison by ranking accuracy against cost. HAL implements them, but most leaderboards still do not.
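Computing such a frontier is simple enough that there is little excuse for leaderboards to omit it. A sketch using the GAIA and Online Mind2Web price points quoted earlier; the agent names are placeholders, and a configuration is dominated if some other configuration is both cheaper and at least as accurate:

```python
# Toy Pareto-frontier pass over (cost, accuracy) configurations.
def pareto_frontier(configs):
    """Keep configs that no other config beats on both cost and accuracy."""
    frontier = []
    for name, cost, acc in configs:
        dominated = any(other_cost <= cost and other_acc >= acc
                        and (other_cost, other_acc) != (cost, acc)
                        for _, other_cost, other_acc in configs)
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda cfg: cfg[1])   # cheapest first

configs = [
    ("agent-A", 2828, 0.285),   # pricey and weak: dominated
    ("agent-B", 1686, 0.576),
    ("agent-C",  171, 0.420),
    ("agent-D", 1577, 0.400),   # ~9x agent-C's cost for 2 points less: dominated
]
for name, cost, acc in pareto_frontier(configs):
    print(f"{name}: ${cost:,} at {acc:.1%}")
```

Two of the four configurations vanish immediately, which is the whole point: raw-accuracy leaderboards would still list all four.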

If only frontier-lab compute budgets can produce statistically reliable benchmark numbers on the highest-cost agentic and scientific benchmarks, the social process of evaluating AI systems becomes concentrated inside the same labs that build them, rendering external validation partial, and sometimes absent, unless someone subsidizes the cost directly.

@@ -657,9 +677,37 @@ description: "A field guide to evaluation costs: where the money goes, why old c
-Sources: Liang et al. arXiv:2211.09110 (HELM); Biderman et al. arXiv:2304.01373 (Pythia); Perlitz et al. arXiv:2308.11696; Polo et al. arXiv:2402.14992 (tinyBenchmarks); Vivek et al. arXiv:2309.08638 (Anchor Points); Li et al. arXiv:2511.04689 (Adaptive Testing); Garikaparthi et al. arXiv:2602.15112 (ResearchGym); Starace et al. arXiv:2504.01848 (PaperBench); Mehta arXiv:2511.14136 (CLEAR); Ndzomga arXiv:2603.23749; Kapoor et al. arXiv:2510.11977 (HAL); Kapoor et al. arXiv:2407.01502 (AI Agents That Matter); Tian et al. arXiv:2407.13168 (SciCode); Chen et al. arXiv:2410.05080 (ScienceAgentBench); Siegel et al. arXiv:2409.11363 (CORE-Bench); Chan et al. arXiv:2410.07095 (MLE-Bench); METR arXiv:2411.15114 (RE-Bench); Takamoto et al. arXiv:2210.07182 (PDEBench); Ohana et al. arXiv:2412.00568 (The Well); Ying et al. arXiv:1902.09635 (NAS-Bench-101); Rabanser et al. arXiv:2602.16666; Ahmed, Wahed and Thompson Science (2023); IBM Research "Efficient LLM Benchmarking"; UK AISI inference-scaling cyber blog; live HAL leaderboard hal.cs.princeton.edu.
+Sources
+
+1. Ying et al. (2019). NAS-Bench-101: Towards Reproducible Neural Architecture Search. arXiv:1902.09635.
+2. Liang et al. (2022). Holistic Evaluation of Language Models. arXiv:2211.09110.
+3. Takamoto et al. (2022). PDEBench: An Extensive Benchmark for Scientific Machine Learning. arXiv:2210.07182.
+4. Ahmed, Wahed and Thompson (2023). The growing influence of industry in AI research. Science 379(6635).
+5. Biderman et al. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. arXiv:2304.01373.
+6. IBM Research (2023). Efficient LLM Benchmarking. research.ibm.com.
+7. Perlitz et al. (2023). Efficient Benchmarking of Language Models. arXiv:2308.11696.
+8. Vivek et al. (2023). Anchor Points: Benchmarking Models with Much Fewer Examples. arXiv:2309.08638.
+9. Chan et al. (2024). MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv:2410.07095.
+10. Chen et al. (2024). ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery. arXiv:2410.05080.
+11. Kapoor et al. (2024). AI Agents That Matter. arXiv:2407.01502.
+12. Wijk et al. (METR, 2024). RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts. arXiv:2411.15114.
+13. Ohana et al. (2024). The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning. arXiv:2412.00568.
+14. Polo et al. (2024). tinyBenchmarks: evaluating LLMs with fewer examples. arXiv:2402.14992.
+15. Siegel et al. (2024). CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark. arXiv:2409.11363.
+16. Tian et al. (2024). SciCode: A Research Coding Benchmark Curated by Scientists. arXiv:2407.13168.
+17. Kapoor et al. (2025). Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation. arXiv:2510.11977.
+18. Li et al. (2025). Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks. arXiv:2511.04689.
+19. Mehta (2025). Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems. arXiv:2511.14136.
+20. Starace et al. (2025). PaperBench: Evaluating AI's Ability to Replicate AI Research. arXiv:2504.01848.
+21. UK AISI (2025). Evidence for inference scaling in AI cyber tasks: increased evaluation budgets reveal higher success rates. aisi.gov.uk.
+22. Garikaparthi et al. (2026). ResearchGym: Evaluating Language Model Agents on Real-World AI Research. arXiv:2602.15112.
+23. Ndzomga (2026). Efficient Benchmarking of AI Agents. arXiv:2603.23749.
+24. Rabanser et al. (2026). Towards a Science of AI Agent Reliability. arXiv:2602.16666.
+25. Holistic Agent Leaderboard (live). hal.cs.princeton.edu.
BibTeX Citation

From 287c09c8dca5437ca1f25f17372208a9e103eb50 Mon Sep 17 00:00:00 2001
From: Avijit Ghosh
Date: Tue, 28 Apr 2026 12:46:32 -0400
Subject: [PATCH 18/20] fix HELM numbers, chart axes, and unbroken gridlines
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Cross-checked HELM cost claims against Section 6 model table (p. 43):
replaced loose "$10K or 4,000+ GPU-hours per model" with actual range,
corrected aggregate from "high six figures" to ~$100K, and updated the
cost-summary table entry. Fixed Pythia "16 model sizes" → "16 models
spanning 8 sizes". Relabeled ResearchGym row to "full pass (3 seeds)"
so the dollars match the GPU-hours.

Chart fixes: axis labels now align with bar positions (flex
space-between instead of grid with centered labels). Figure 2 axis
converted to uniform decades ($100/$1k/$10k/$100k); all bars recomputed
and small ~1% errors corrected. Figure 3 caption clarifies that bars
show maximum compression, not ranges. Vertical gridlines are now
continuous across all rows (chart-body wrapper with absolute-positioned
::before instead of per-track backgrounds). Each figure sets its own
--grid-interval. Mobile keeps the per-track gradient.

Removed three stray image-markdown references accidentally pasted into
"consequences" in the closing section.

---
 _posts/2026-04-25-eval-costs-bottleneck.md | 78 ++++++++++++++--------
 1 file changed, 51 insertions(+), 27 deletions(-)

diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md
index 1afcdc1..879a15d 100644
--- a/_posts/2026-04-25-eval-costs-bottleneck.md
+++ b/_posts/2026-04-25-eval-costs-bottleneck.md
@@ -231,13 +231,12 @@ description: "A field guide to evaluation costs: where the money goes, why old c
   color: var(--fg-subtle);
 }
 .eval-cost-article .axis-scale {
-  display: grid;
-  grid-template-columns: repeat(6, 1fr);
+  display: flex;
+  justify-content: space-between;
   border-bottom: 1px solid var(--border-strong);
   padding-bottom: 4px;
 }
 .eval-cost-article .axis-scale span {
-  text-align: center;
   font-variant-numeric: tabular-nums;
   white-space: nowrap;
 }
@@ -269,15 +268,33 @@ description: "A field guide to evaluation costs: where the money goes, why old c
   font-family: 'IBM Plex Mono', monospace;
   font-size: 11.5px;
 }
+.eval-cost-article .chart-body {
+  position: relative;
+  --grid-interval: 20%;
+}
+.eval-cost-article .chart-body::before {
+  content: "";
+  position: absolute;
+  left: 228px;
+  right: 106px;
+  top: 0;
+  bottom: 0;
+  background-image: repeating-linear-gradient(to right,
+    transparent 0,
+    transparent calc(var(--grid-interval) - 1px),
+    var(--border) calc(var(--grid-interval) - 1px),
+    var(--border) var(--grid-interval));
+  pointer-events: none;
+  z-index: 0;
+}
+.eval-cost-article .chart-body .chart-row {
+  position: relative;
+  z-index: 1;
+}
 .eval-cost-article .range-track,
 .eval-cost-article .bar-track {
   position: relative;
   height: 22px;
-  background: repeating-linear-gradient(to right,
-    transparent 0,
-    transparent calc(20% - 1px),
-    var(--border) calc(20% - 1px),
-    var(--border) 20%);
 }
 .eval-cost-article .range-bar {
   position: absolute;
@@ -418,6 +435,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
 }
 @media (max-width: 760px) {
+  .eval-cost-article .chart-body::before { display: none; }
   .eval-cost-article { font-size: 16.5px; line-height: 1.68; }
   .eval-cost-article h2 { margin-top: 58px; letter-spacing: -0.025em; }
   .eval-cost-article .figure { margin: 36px auto; }
@@ -482,9 +500,9 @@ description: "A field guide to evaluation costs: where the money goes, why old c

Making static LLM benchmarks cheaper

-

The cost problem started before agents. When Stanford's CRFM released HELM in 2022, full-coverage evaluation already required roughly $10,000 or 4,000+ GPU-hours per model. Perlitz et al. (2023) restate that figure, and IBM Research notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Multiplied across HELM's 30 models and 42 scenarios, the aggregate ran into the high six figures.

+

The cost problem started before agents. When Stanford's CRFM released HELM in 2022, the paper's own per-model accounting (Section 6, p. 43) showed API costs ranging from $169 for OpenAI's ada (350M) to $10,926 for AI21's J1-Jumbo (178B), and 540 to 4,200 GPU-hours for the open models, with BLOOM (176B) and OPT (175B) at the top end. Perlitz et al. (2023) restate those figures, and IBM Research notes that putting Granite-13B through HELM "can consume as many as 1,000 GPU hours." Across HELM's 30 models and 42 scenarios, the aggregate of reported costs and GPU compute came to roughly $100,000.

-

The more striking observation came from Perlitz et al.'s analysis of EleutherAI's Pythia checkpoints, developers pay for evaluation even more. Pythia released 154 checkpoints across 16 model sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: Perlitz et al. (2024) noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.

+

The more striking observation came from Perlitz et al.'s analysis of EleutherAI's Pythia checkpoints, where developers pay even more for evaluation. Pythia released 154 checkpoints across 16 models spanning 8 sizes so the community could study training dynamics. Running the LM Evaluation Harness across all those checkpoints turns eval into a multiplier on training: Perlitz et al. (2024) noted that evaluation costs "may even surpass those of pretraining when evaluating checkpoints." For small models, evaluation becomes the dominant compute line item across the whole development cycle. When we scale inference-time compute, we scale evaluation costs.

Perlitz et al. then asked how much of HELM actually carried the rankings. The result was uncomfortable: a 100× to 200× reduction in compute preserved nearly the same ordering, and even a 400× reduction still grouped models into the same coarse tiers. Flash-HELM turned that finding into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM's compute was not discovering new information; it was confirming rankings that the field could have inferred much more cheaply.
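The coarse-to-fine schedule fits in a few lines. This is our paraphrase of the procedure, not Flash-HELM's code; `evaluate` simulates a benchmark whose cost scales with item count, and the subset sizes are illustrative:

```python
import random

def evaluate(model, n_items):
    """Stand-in for running a benchmark on n_items examples (cost ~ n_items)."""
    random.seed(hash(model) % 10_000)          # fixed latent skill per model
    true_skill = random.random()
    noisy_score = true_skill + random.gauss(0, 0.3 / n_items ** 0.5)
    return noisy_score, n_items

def coarse_to_fine(models, cheap=50, full=5_000, keep=3):
    # Pass 1: rank everyone on a cheap subsample.
    spent, coarse = 0, {}
    for m in models:
        score, cost = evaluate(m, cheap)
        coarse[m] = score
        spent += cost
    # Pass 2: spend full-resolution compute only on the leaders.
    finalists = sorted(coarse, key=coarse.get, reverse=True)[:keep]
    final = {}
    for m in finalists:
        score, cost = evaluate(m, full)
        final[m] = score
        spent += cost
    return final, spent

models = [f"model-{i}" for i in range(20)]
ranking, spent = coarse_to_fine(models)
naive = len(models) * 5_000
print(f"compute: {spent} vs naive {naive} ({naive / spent:.1f}x cheaper)")
```

With 20 models, the two-pass schedule spends 20 × 50 + 3 × 5,000 = 16,000 item-evaluations against 100,000 for the naive sweep, the same shape of saving the paragraph above describes at much larger scale.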

@@ -507,13 +525,15 @@ description: "A field guide to evaluation costs: where the money goes, why old c
Figure 1. Each bar shows the minimum-to-maximum cost across HAL configurations on a single benchmark. Highlighted bars cross the round $1,000-per-run threshold. A "run" is one full agent evaluation across all tasks. Within-benchmark spread reflects the model × scaffold × token-budget product. Source: live HAL leaderboard, April 2026.
@@ -558,14 +578,16 @@ description: "A field guide to evaluation costs: where the money goes, why old c
 $5,000 or more
Figure 2. All values in USD per single evaluation of one model or agent through the full benchmark protocol. GPU costs converted at $2.50/H100-hr, $1.50/A10-hr; API and grading costs included where applicable. Highlighted bars denote benchmarks costing at least the round $5,000-per-evaluation threshold. The most expensive of these match the most expensive agent benchmarks (Figure 1) but require GPU compute that has no API substitute.
@@ -578,16 +600,18 @@ description: "A field guide to evaluation costs: where the money goes, why old c
Compression factors achievable by benchmark type
Maximum reduction in evaluation compute that preserves model-rank fidelity, log scale
-Measured compression +Maximum measured compression No general compression method
+
Figure 3. The toolkit for compressing evaluation does not transfer as benchmarks become more complex. Bars show the maximum measured compression that preserves model-rank fidelity; labels give the published range. The highlighted bar flags the ~1× baseline where no general compression method exists. Static benchmarks routinely compress 100–200× without losing rankings. Agent benchmarks compress 2–3.5× at best. Training-in-the-loop benchmarks resist subsampling because the unit being evaluated is the trained model.

Reliability is the expensive part

@@ -633,7 +657,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
-HELM (per LLM, 2022) | Static LLM | ~$8,000 – $10,000 | One LLM through full HELM (~4,000 GPU-hrs)
+HELM (per LLM, 2022) | Static LLM | $85 – $10,926 API; 540 – 4,200 GPU-hrs open | One LLM through 42 scenarios; per-model table in HELM §6 p. 43
 ScienceAgentBench | Agentic, science | $0.19 – $77 | One agent config across 102 tasks
 TAU-bench Airline | Agentic | $0.31 – $180 | One agent across all airline tasks
 SciCode | Agentic, science | $0.12 – $625 | One agent across 338 sub-problems
@@ -641,7 +665,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
 SWE-bench Verified Mini | Agentic, coding | $4 – $1,600 | One agent across 50 issues
 Online Mind2Web | Agentic, web | $5 – $1,610 | One agent across 300 web tasks
 GAIA | Agentic, multimodal | $7.80 – $2,829 | One agent across GAIA tasks
-ResearchGym (per seed) | ML research, training | $540 – $1,260 | 5 tasks × 24h GPU + API
+ResearchGym (full pass) | ML research, training | $540 – $1,260 | 5 tasks × 24h × 3 seeds (~360 GPU-hrs) + API
 RE-Bench (full agent) | ML R&D, training | $1,200 – $1,800 | 7 environments × 8h on H100
 The Well (per architecture) | SciML, training | $1,920 – $2,880 | 5 LRs × 16 datasets × 12h H100
 MLE-Bench (1 seed) | ML R&D, training | ~$2,700 – $3,000 | 75 Kaggle competitions × 24h on A10

From 6c3d2c2e5e0c8b28704a33dc573b4d57318116dd Mon Sep 17 00:00:00 2001
From: Avijit Ghosh
Date: Tue, 28 Apr 2026 13:31:16 -0400
Subject: [PATCH 19/20] fold Exgentic numbers into summary, add Bandel et al.
 citation

---
 _posts/2026-04-25-eval-costs-bottleneck.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-25-eval-costs-bottleneck.md
index 879a15d..2b349e2 100644
--- a/_posts/2026-04-25-eval-costs-bottleneck.md
+++ b/_posts/2026-04-25-eval-costs-bottleneck.md
@@ -496,7 +496,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
 }
-Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard (HAL) recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Now imagine comparing the effects of scaffolds versus models, as Exgentic does, or scaling agentic steps into the millions, as UK-AISI recently did. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.
+Summary. AI evaluation has crossed a cost threshold that changes who can do it. The Holistic Agent Leaderboard (HAL) recently spent about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. A single GAIA run on a frontier model can cost $2,829 before caching. Exgentic's $22,000 sweep across agent configurations found a 33× cost spread on identical tasks, isolating scaffold choice as a first-order cost driver, and UK-AISI recently scaled agentic steps into the millions to study inference-time compute. In scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture and 3,840 H100-hours for a full four-baseline sweep. While compression techniques have been proposed for static benchmarks, new agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.

Making static LLM benchmarks cheaper

@@ -724,6 +724,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
  • Mehta (2025). Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems. arXiv:2511.14136.
  • Starace et al. (2025). PaperBench: Evaluating AI's Ability to Replicate AI Research. arXiv:2504.01848.
  • UK AISI (2025). Evidence for inference scaling in AI cyber tasks: increased evaluation budgets reveal higher success rates. aisi.gov.uk.
+  • Bandel et al. (2026). General Agent Evaluation. arXiv:2602.22953.
  • Garikaparthi et al. (2026). ResearchGym: Evaluating Language Model Agents on Real-World AI Research. arXiv:2602.15112.
  • Ndzomga (2026). Efficient Benchmarking of AI Agents. arXiv:2603.23749.
  • Rabanser et al. (2026). Towards a Science of AI Agent Reliability. arXiv:2602.16666.
From cdd8df21fa02c058b8f80abf9c69145ee823abf0 Mon Sep 17 00:00:00 2001
From: Avijit Ghosh
Date: Tue, 28 Apr 2026 21:52:35 -0400
Subject: [PATCH 20/20] push publication date to Apr 29 2026

---
 ...osts-bottleneck.md => 2026-04-29-eval-costs-bottleneck.md} | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
 rename _posts/{2026-04-25-eval-costs-bottleneck.md => 2026-04-29-eval-costs-bottleneck.md} (99%)

diff --git a/_posts/2026-04-25-eval-costs-bottleneck.md b/_posts/2026-04-29-eval-costs-bottleneck.md
index 2b349e2..d02e2cf 100644
--- a/_posts/2026-04-25-eval-costs-bottleneck.md
+++ b/_posts/2026-04-29-eval-costs-bottleneck.md
@@ -1,7 +1,7 @@
 ---
 layout: post
 title: "AI evals are becoming the new compute bottleneck"
-date: 2026-04-25
+date: 2026-04-29
 published: true
 category: Research
 image: /assets/img/blogs/eval-costs-bottleneck.png
@@ -742,7 +742,7 @@ description: "A field guide to evaluation costs: where the money goes, why old c
   year = {2026},
   month = apr,
   howpublished = {EvalEval Coalition Blog},
-  url = {https://evalevalai.com/research/2026/04/25/eval-costs-bottleneck/}
+  url = {https://evalevalai.com/research/2026/04/29/eval-costs-bottleneck/}
 }{% endraw %}