Calibrate reasoning: add reasoningFrequency, update defaults #124

Merged
jimmytacks merged 1 commit into main from experiment/094-reasoning-calibration
Apr 3, 2026

Conversation

@jimmytacks
Collaborator

Summary

  • Add reasoningFrequency parameter (0.0–1.0, default 0.47) — fraction of assistant turns that include reasoning output
  • Calibrate reasoningOutputSize default from 500 → 265 tokens (mean from 127 Models Agent conversations)
  • Use a Bresenham-style deterministic distribution to space reasoning turns evenly across cycles
  • Add frontend support to ParameterPanel and sweep metadata to PARAM_META
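
As an illustration of what "Bresenham-style" spacing means here, the following is a minimal sketch, not the code merged in this PR. The helper names `hasReasoning` and `reasoningSchedule` are hypothetical. It assumes turns are indexed from 0 and that a turn carries reasoning whenever the running total `floor((i + 1) * frequency)` advances past `floor(i * frequency)`, which spreads reasoning turns evenly and involves no randomness.

```typescript
// Hypothetical helper (not from the PR): does the 0-based turn `turnIndex`
// include reasoning output at the given frequency?
function hasReasoning(turnIndex: number, frequency: number): boolean {
  // Clamp to the documented 0.0-1.0 range.
  const f = Math.min(1, Math.max(0, frequency));
  // The turn gets reasoning when the integer part of the running total
  // advances, the same idea as Bresenham's error accumulator stepping a pixel.
  return Math.floor((turnIndex + 1) * f) > Math.floor(turnIndex * f);
}

// Hypothetical helper: the full on/off schedule for a conversation of `turns` turns.
function reasoningSchedule(turns: number, frequency: number): boolean[] {
  return Array.from({ length: turns }, (_, i) => hasReasoning(i, frequency));
}

// At frequency 0.5, every second turn carries reasoning.
console.log(reasoningSchedule(4, 0.5)); // [ false, true, false, true ]
```

Because the schedule is a pure function of the turn index and the frequency, repeated runs with the same parameters charge reasoning on the same turns, which keeps sweep results reproducible.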

Engine Change

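The sim previously overcharged reasoning output by ~3-4x (size overstatement × frequency overstatement): it charged 500 reasoning tokens on every assistant turn, whereas the calibration data shows reasoning on only 47% of turns at a mean of 265 tokens. Since reasoning is billed at the output price ($25/M tokens), this inflated all absolute cost numbers. Rankings are unaffected because the overcharge applies to all strategies equally, but absolute costs were misleading.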

Prior findings impact: All absolute cost numbers from prior experiments are overstated. Strategy rankings and relative comparisons remain valid since reasoning affects all strategies equally. FINDINGS.md will be updated after merge.
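
The size of the overstatement can be checked with a short calculation. This is a sketch under the assumption that expected reasoning tokens per assistant turn equal `reasoningOutputSize × reasoningFrequency`. The function names are hypothetical; the numbers (500, 265, 0.47, $25/M) come from this PR.

```typescript
// Output price in USD per million tokens, as stated in this PR.
const OUTPUT_PRICE_PER_MILLION = 25;

// Hypothetical helper: expected reasoning tokens charged per assistant turn.
function expectedReasoningTokens(size: number, frequency: number): number {
  return size * frequency;
}

// Hypothetical helper: expected reasoning cost per assistant turn, in USD.
function reasoningCostPerTurn(size: number, frequency: number): number {
  return (expectedReasoningTokens(size, frequency) * OUTPUT_PRICE_PER_MILLION) / 1e6;
}

// Old behaviour: 500 tokens on every turn. New defaults: 265 tokens on 47% of turns.
const oldTokens = expectedReasoningTokens(500, 1.0); // 500
const newTokens = expectedReasoningTokens(265, 0.47); // about 124.55
const overstatement = oldTokens / newTokens; // about 4.0 (1.9x size × 2.1x frequency)

console.log(overstatement.toFixed(2));
console.log(reasoningCostPerTurn(500, 1.0), reasoningCostPerTurn(265, 0.47));
```

With these defaults the factor comes out at roughly 4x, the upper end of the quoted ~3-4x range. The same multiplier applies to every strategy, which is why rankings do not change.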

Test plan

  • All 197 tests pass
  • Lint passes
  • Production build succeeds
  • Conversation tests updated for new defaults + 3 new frequency-specific tests
  • Summary growth tests pinned to explicit reasoning settings to isolate concerns

Closes #94

🤖 Generated with Claude Code

…aults

Calibrate reasoning output based on analysis of 127 Models Agent conversations:
- reasoningOutputSize: 500 → 265 (calibrated mean)
- New reasoningFrequency parameter (default 0.47): fraction of turns with
  reasoning output. Uses Bresenham-style distribution for deterministic,
  even spacing across cycles.

Previously the sim charged reasoning on every turn at 500 tokens — a ~3-4x
overstatement vs real data (only 47% of turns include thinking, at 265 avg).

Closes #94

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jimmytacks jimmytacks merged commit 9f22d3d into main Apr 3, 2026
1 check passed
@jimmytacks jimmytacks deleted the experiment/094-reasoning-calibration branch April 3, 2026 10:10

Development

Successfully merging this pull request may close these issues.

Calibrate reasoningOutputSize and add reasoning frequency parameter