Field-guide essay by Avijit Ghosh on evaluation costs across static, agentic, and training-in-the-loop benchmarks. Preserves the responsive CSS bar charts from the source HTML and rethemes them with the evaleval design tokens (Inter/IBM Plex Mono, --fg/--accent/--border) so dark mode and the prose layout work without conflicts.
- Remove the inline roadmap nav (sidebar TOC already covers it).
- Retheme the highlight color from brown to var(--accent) so the bars match the evaleval blue identity in both light and dark modes.
- Add tex2jax_ignore on the article wrapper so MathJax stops parsing dollar-sign pairs in prose (the summary was rendering as italic math).
- Cap figures, table, and table-note at 760px so they share the same centered column as the body paragraphs (previously they spanned the full prose container, sticking out to the left of the text margin).
- Add mathjax_ignore alongside tex2jax_ignore on the wrapper. MathJax 3 ignores 'mathjax_ignore' by default; tex2jax_ignore was the v2 convention, so without this some prose paragraphs (e.g. the Claude Opus pricing line) had their dollar pairs rendered as math.
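For reviewers unfamiliar with the dollar-pair problem, here is a minimal sketch of why two prices in one paragraph can be read as inline math. It assumes the site enables single-dollar inline delimiters, and it uses a simplified regex as a stand-in, not MathJax's actual delimiter scanner. The function name and the sample prices are hypothetical.

```python
import re

# Simplified stand-in for single-dollar inline-math matching.
# Assumption: this is NOT MathJax's real scanner, only an illustration.
DOLLAR_PAIR = re.compile(r"\$(.+?)\$")


def spans_read_as_math(text: str) -> list[str]:
    """Return the substrings a naive $...$ matcher would treat as math."""
    return DOLLAR_PAIR.findall(text)


# Hypothetical prose with made-up prices (not figures from the article).
prose = "Input is priced at $5 per unit and output at $20 per unit."
print(spans_read_as_math(prose))
```

Everything between the first and second dollar sign is captured, which is the span that showed up as italic math. A paragraph with a single dollar sign has no pair and is left alone, which is why only some paragraphs were affected.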
Mobile breakpoint was painting the entire track with var(--bg-subtle), creating a visible grey strip behind every bar. Match the desktop treatment instead: transparent background with subtle vertical tick gradient (every third for a coarser mobile rhythm).
Give the BLUF a subtle bg-subtle fill, accent-colored left rule, and soft border so it reads as a pull-quote at the top of the article instead of a plain bordered span.
Suggested coauthors: Leshem, Yifan, Georgia <more?>
Added authors Yifan Mai, Georgia Channing, and Leshem Choshen to the article. Revised summary and various paragraphs for clarity and detail.
- restructure sources block as numbered bibliography (full titles, year, authors)
- correct RE-Bench attribution (Wijk et al., not METR)
- integrate reliability-multiplier callout into surrounding prose
- remove filler intros and 'X not Y' staccato patterns
- rephrase 'floors not ceilings' line
Cross-checked HELM cost claims against Section 6 model table (p. 43): replaced loose "$10K or 4,000+ GPU-hours per model" with actual range, corrected aggregate from "high six figures" to ~$100K, and updated the cost-summary table entry. Fixed Pythia "16 model sizes" → "16 models spanning 8 sizes". Relabeled ResearchGym row to "full pass (3 seeds)" so the dollars match the GPU-hours.
Chart fixes: axis labels now align with bar positions (flex space-between instead of grid with centered labels). Figure 2 axis converted to uniform decades ($100/$1k/$10k/$100k); all bars recomputed and small ~1% errors corrected. Figure 3 caption clarifies that bars show maximum compression, not ranges. Vertical gridlines are now continuous across all rows (chart-body wrapper with absolute-positioned ::before instead of per-track backgrounds). Each figure sets its own --grid-interval. Mobile keeps the per-track gradient.
Removed three stray image-markdown references accidentally pasted into "consequences" in the closing section.
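As a way to double-check the recomputed Figure 2 bars, here is a minimal sketch of how a cost maps to a bar width on a uniform-decade axis. It assumes bars start at the left edge of the axis and that the axis runs from $100 to $100k as described above. The function name and default parameters are hypothetical, not taken from the article's source.

```python
import math


def bar_width_percent(cost: float,
                      axis_min: float = 100.0,
                      axis_max: float = 100_000.0) -> float:
    """Width of a bar, as a percentage of the track, on a log10 axis.

    Assumption: bars grow from the axis minimum, and values outside
    the axis range are clamped to 0% or 100%.
    """
    if cost <= 0:
        raise ValueError("cost must be positive on a log axis")
    decades = math.log10(axis_max) - math.log10(axis_min)
    fraction = (math.log10(cost) - math.log10(axis_min)) / decades
    return max(0.0, min(1.0, fraction)) * 100.0


# Each decade label should land on an even third of the track.
for label in (100, 1_000, 10_000, 100_000):
    print(label, round(bar_width_percent(label), 2))
```

With three decades on the axis, each labeled tick sits one third of the track apart, which is what lets flex space-between labels line up with the bars.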
Added a PR for a blog post detailing how evals have gotten extremely expensive, with a direct callout to Every Eval Ever.