feat(core): multi-sample runs + variance, folded into confidence #7

Merged
MerlijnW70 merged 1 commit into main from feat/multi-sample-variance
Jun 21, 2026

@MerlijnW70

Owner

Implements multi-sample runs + variance (roadmap item 2). It builds on #6's groundwork only conceptually; the change itself is independent.

What

Wires the long-reserved Scenario.samples field (plus a --samples CLI override) so a scenario can be measured N times. The per-sample total_cu values form a SampleStats distribution (count / min / median / max / variance / std-dev / CV) attached to the measurement, and the coefficient of variation (CV) folds into the confidence score. This implements the spec §12 "sample variance" factor, which was listed but unimplemented.
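For readers unfamiliar with the statistics involved, here is a minimal sketch of how such a distribution could be computed from per-sample values. The field names, the `u64` sample type, and the choice of population variance (dividing by N rather than N - 1) are assumptions for illustration; the actual `SampleStats::from_samples` in this PR may differ.

```rust
/// Illustrative summary of per-sample `total_cu` values.
/// Field names and types are assumptions, not the crate's real layout.
#[derive(Debug, Clone, PartialEq)]
pub struct SampleStats {
    pub count: usize,
    pub min: u64,
    pub median: f64,
    pub max: u64,
    pub variance: f64,
    pub std_dev: f64,
    /// Coefficient of variation: std_dev / mean (0.0 when the mean is 0).
    pub cv: f64,
}

impl SampleStats {
    /// Builds the summary; returns `None` for an empty slice.
    pub fn from_samples(samples: &[u64]) -> Option<Self> {
        if samples.is_empty() {
            return None;
        }
        let mut sorted = samples.to_vec();
        sorted.sort_unstable();
        let n = sorted.len();

        // Median: middle element, or the mean of the two middle elements.
        let median = if n % 2 == 1 {
            sorted[n / 2] as f64
        } else {
            (sorted[n / 2 - 1] as f64 + sorted[n / 2] as f64) / 2.0
        };

        let mean = sorted.iter().map(|&v| v as f64).sum::<f64>() / n as f64;

        // Population variance (divide by n). This is an assumption; a
        // sample variance would divide by n - 1 instead.
        let variance = sorted
            .iter()
            .map(|&v| {
                let d = v as f64 - mean;
                d * d
            })
            .sum::<f64>()
            / n as f64;
        let std_dev = variance.sqrt();
        let cv = if mean == 0.0 { 0.0 } else { std_dev / mean };

        Some(Self {
            count: n,
            min: sorted[0],
            median,
            max: sorted[n - 1],
            variance,
            std_dev,
            cv,
        })
    }
}
```

With samples `[100, 102, 98, 100]` this sketch gives a mean of 100, a variance of 2, and a CV of roughly 1.4%, which would sit just under the 2% demotion threshold described under Details.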

Honestly gated (no fake precision)

The shipping CLI uses the deterministic recorded backend, where running N times yields byte-identical results. New ExecutionBackend::is_deterministic() (true for recorded) makes the profiler run a deterministic backend exactly once — it never fabricates a run-to-run spread it did not observe. Multi-sampling therefore activates only for non-deterministic execution backends (e.g. Mollusk, once driven from the CLI). This is the design subtlety the research flagged, handled explicitly.
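The gating described above can be sketched roughly as follows. Only the `ExecutionBackend::is_deterministic()` name comes from this PR; the `run` method, the `collect_samples` helper, and both stub backends are hypothetical and exist only to show the control flow.

```rust
/// Simplified stand-in for the real trait. Only `is_deterministic`
/// is named in the PR; `run` is a hypothetical placeholder that
/// returns a single `total_cu` measurement.
pub trait ExecutionBackend {
    fn run(&mut self) -> u64;
    fn is_deterministic(&self) -> bool;
}

/// Hypothetical helper: a deterministic backend is run exactly once,
/// whatever `requested` says, so no spread is ever invented.
pub fn collect_samples<B: ExecutionBackend>(backend: &mut B, requested: usize) -> Vec<u64> {
    let runs = if backend.is_deterministic() {
        1
    } else {
        requested.max(1) // always take at least one measurement
    };
    (0..runs).map(|_| backend.run()).collect()
}

/// Stub that replays one recorded value (deterministic).
pub struct RecordedStub(pub u64);

impl ExecutionBackend for RecordedStub {
    fn run(&mut self) -> u64 {
        self.0
    }
    fn is_deterministic(&self) -> bool {
        true
    }
}

/// Stub whose result changes on every run (non-deterministic).
pub struct CountingStub {
    pub next: u64,
}

impl ExecutionBackend for CountingStub {
    fn run(&mut self) -> u64 {
        let v = self.next;
        self.next += 1;
        v
    }
    fn is_deterministic(&self) -> bool {
        false
    }
}
```

In this sketch, asking `RecordedStub` for five samples still yields a single value, while `CountingStub` yields one value per requested run.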

Details

  • SampleStats is Option + skip_serializing_if, so existing single-sample JSON reports are byte-identical (no golden churn).
  • Confidence demotion: CV ≥2% → Medium, ≥10% → Low, each with a reason.
  • samples is now settable from [scenario.<name>] samples = N and overridable with --samples.
  • Pure stats (SampleStats::from_samples) and the confidence fold are unit-tested; a VaryingBackend proves end-to-end sampling + variance + demotion, and a test asserts the recorded backend ignores samples.
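As an illustration of the demotion rule in the list above, the fold could look something like this. The `Confidence` enum, the `fold_variance` name, and the reason wording are assumptions; only the 2% and 10% thresholds come from this PR. The sketch only ever lowers confidence, so a score already at Low stays Low.

```rust
/// Hypothetical confidence levels, declared best-to-worst so the
/// derived ordering makes `max` pick the worse of two levels.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Confidence {
    High,
    Medium,
    Low,
}

/// Folds a coefficient of variation (as a fraction, e.g. 0.02 = 2%)
/// into an existing confidence level. Returns the possibly demoted
/// level and a reason when a threshold was reached.
pub fn fold_variance(current: Confidence, cv: Option<f64>) -> (Confidence, Option<String>) {
    let cv = match cv {
        Some(v) => v,
        // Single-sample or deterministic run: nothing to fold in.
        None => return (current, None),
    };

    let cap = if cv >= 0.10 {
        Confidence::Low
    } else if cv >= 0.02 {
        Confidence::Medium
    } else {
        return (current, None);
    };

    let reason = format!(
        "sample CV {:.1}% caps confidence at {:?}",
        cv * 100.0,
        cap
    );
    // Never promote: keep whichever level is worse.
    (current.max(cap), Some(reason))
}
```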

Docs

  • Reference §12 (variance factor) and §15 (--samples is now real, which fixes that doc drift); CHANGELOG.

Local gate: grade A (core 67 tests).

🤖 Generated with Claude Code

Wire the long-reserved `Scenario.samples` field (plus a `--samples`
override) so a scenario can be measured N times on a non-deterministic
backend. The per-sample `total_cu` values become a `SampleStats`
distribution (count/min/median/max/variance/std-dev/CV) on the
measurement, and the coefficient of variation folds into the confidence
score (CV >=2% -> Medium, >=10% -> Low) — implementing the spec §12
"sample variance" factor.

Crucially this is honestly gated: `ExecutionBackend::is_deterministic()`
(true for the recorded backend) makes the profiler run a deterministic
backend exactly once, so it never fabricates a run-to-run spread it did
not observe. The field is now settable from `[scenario.x] samples` and
overridable with `--samples`.

`SampleStats` serializes only when present (Option, skip-if-none), so
existing single-sample JSON reports are byte-identical.

Docs: reference §12/§15 updated; CHANGELOG. Gate: grade A.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MerlijnW70 MerlijnW70 force-pushed the feat/multi-sample-variance branch from cfc9da8 to 16dd8a0 Compare June 21, 2026 14:13
@MerlijnW70 MerlijnW70 merged commit 6a581e3 into main Jun 21, 2026
4 checks passed
@MerlijnW70 MerlijnW70 deleted the feat/multi-sample-variance branch June 21, 2026 14:14
@MerlijnW70 MerlijnW70 mentioned this pull request Jun 21, 2026
2 participants