Eval costs blog #24

Merged
evijit merged 20 commits into master from post/eval-costs-blog on Apr 29, 2026

evijit (Member) commented Apr 25, 2026

Added a PR for a blog post detailing how evals have gotten extremely expensive, with a direct callout to Every Eval Ever.

evijit added 9 commits April 25, 2026 09:52
Field-guide essay by Avijit Ghosh on evaluation costs across static,
agentic, and training-in-the-loop benchmarks. Preserves the responsive
CSS bar charts from the source HTML and rethemes them with the
evaleval design tokens (Inter/IBM Plex Mono, --fg/--accent/--border)
so dark mode and the prose layout work without conflicts.
- Remove the inline roadmap nav (sidebar TOC already covers it).
- Retheme the highlight color from brown to var(--accent) so the bars
  match the evaleval blue identity in both light and dark modes.
- Add tex2jax_ignore on the article wrapper so MathJax stops parsing
  dollar-sign pairs in prose (the summary was rendering as italic math).
- Cap figures, table, and table-note at 760px so they share the same
  centered column as the body paragraphs (previously they spanned the
  full prose container, sticking out to the left of the text margin).
- Add mathjax_ignore alongside tex2jax_ignore on the wrapper.
  MathJax 3 ignores 'mathjax_ignore' by default; tex2jax_ignore was
  the v2 convention, so without this some prose paragraphs (e.g. the
  Claude Opus pricing line) had their dollar pairs rendered as math.
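
The dollar-pair problem described in the two MathJax commits above can be illustrated with a rough sketch. This is not MathJax's actual scanner: the regex is a simplified stand-in for single-dollar inline delimiters, the sample sentence is invented, and the names `INLINE_MATH`, `IGNORE_CLASSES`, and `spans_read_as_math` are hypothetical. It only shows why two prices in one paragraph can pair up as a math span, and how an ignore class on the wrapper avoids that.

```python
import re

# Simplified stand-in for "$...$" inline-math delimiters (an assumption,
# not MathJax's real matching logic).
INLINE_MATH = re.compile(r"\$(.+?)\$")

# The two class names mentioned in the commit messages.
IGNORE_CLASSES = {"tex2jax_ignore", "mathjax_ignore"}


def spans_read_as_math(text, wrapper_classes=()):
    """Return the text spans a naive dollar-pair scan would treat as math.

    If the wrapper carries one of the ignore classes, nothing is scanned.
    """
    if IGNORE_CLASSES & set(wrapper_classes):
        return []
    return INLINE_MATH.findall(text)


# Invented example sentence with two dollar amounts.
prose = "The fee is $5 per run and $20 per batch."

unprotected = spans_read_as_math(prose)
protected = spans_read_as_math(prose, ["article", "mathjax_ignore"])
print(unprotected)  # the text between the two dollar signs
print(protected)    # empty list: the wrapper opts out of scanning
```

In this sketch the text between the two prices is captured as a single "math" span, which matches the symptom the commits describe (prose rendered as italic math).
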
Mobile breakpoint was painting the entire track with var(--bg-subtle),
creating a visible grey strip behind every bar. Match the desktop
treatment instead: transparent background with subtle vertical tick
gradient (every third for a coarser mobile rhythm).
Give the BLUF a subtle bg-subtle fill, accent-colored left rule, and
soft border so it reads as a pull-quote at the top of the article
instead of a plain bordered span.
evijit self-assigned this Apr 25, 2026

evijit (Member, Author) commented Apr 25, 2026

Suggested coauthors: Leshem, Yifan, Georgia <more?>

evijit and others added 10 commits April 25, 2026 11:04
Added authors Yifan Mai, Georgia Channing, and Leshem Choshen to the article. Revised summary and various paragraphs for clarity and detail.
- restructure sources block as numbered bibliography (full titles, year, authors)
- correct RE-Bench attribution (Wijk et al., not METR)
- integrate reliability-multiplier callout into surrounding prose
- remove filler intros and 'X not Y' staccato patterns
- rephrase 'floors not ceilings' line
Cross-checked HELM cost claims against Section 6 model table (p. 43):
replaced loose "$10K or 4,000+ GPU-hours per model" with actual
range, corrected aggregate from "high six figures" to ~$100K, and
updated the cost-summary table entry. Fixed Pythia "16 model sizes"
→ "16 models spanning 8 sizes". Relabeled ResearchGym row to "full
pass (3 seeds)" so the dollars match the GPU-hours.

Chart fixes: axis labels now align with bar positions (flex
space-between instead of grid with centered labels). Figure 2 axis
converted to uniform decades ($100/$1k/$10k/$100k); all bars
recomputed and small ~1% errors corrected. Figure 3 caption
clarifies that bars show maximum compression, not ranges.

Vertical gridlines are now continuous across all rows (chart-body
wrapper with absolute-positioned ::before instead of per-track
backgrounds). Each figure sets its own --grid-interval. Mobile
keeps the per-track gradient.

Removed three stray image-markdown references accidentally pasted
into "consequences" in the closing section.
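
The Figure 2 change noted above (an axis of uniform decades, $100/$1k/$10k/$100k, with all bars recomputed) can be sketched as follows. This assumes bar lengths are expressed as a percentage of a log10 axis running from $100 to $100k; the function name `bar_fraction` and its parameters are hypothetical and not taken from the post's source.

```python
import math


def bar_fraction(cost_usd, axis_min=100.0, axis_max=100_000.0):
    """Return a bar's length as a fraction (0..1) of a log10 axis.

    Assumes the axis spans axis_min..axis_max in uniform decades.
    Values outside the axis are clamped to its ends.
    """
    if cost_usd <= 0:
        raise ValueError("cost must be positive on a log axis")
    span = math.log10(axis_max) - math.log10(axis_min)
    frac = (math.log10(cost_usd) - math.log10(axis_min)) / span
    return min(1.0, max(0.0, frac))


# Tick labels from the commit message: $100, $1k, $10k, $100k.
ticks = [100, 1_000, 10_000, 100_000]
tick_positions = [round(bar_fraction(t) * 100, 1) for t in ticks]
print(tick_positions)  # evenly spaced percentages, one decade apart
```

With three decades on the axis, each tick lands a third of the width apart, which is why evenly distributed labels line up with the bar positions once the bars use the same mapping.
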
evijit force-pushed the post/eval-costs-blog branch from e4206bf to 6c3d2c2 on April 28, 2026 17:39
evijit merged commit 4b722ca into master on Apr 29, 2026
1 check passed
2 participants