feat(core): multi-sample runs + variance, folded into confidence #7
Merged
Wire the long-reserved `Scenario.samples` field (plus a `--samples` override) so a scenario can be measured N times on a non-deterministic backend. The per-sample `total_cu` values become a `SampleStats` distribution (count/min/median/max/variance/std-dev/CV) on the measurement, and the coefficient of variation folds into the confidence score (CV >=2% -> Medium, >=10% -> Low) — implementing the spec §12 "sample variance" factor. Crucially this is honestly gated: `ExecutionBackend::is_deterministic()` (true for the recorded backend) makes the profiler run a deterministic backend exactly once, so it never fabricates a run-to-run spread it did not observe. The field is now settable from `[scenario.x] samples` and overridable with `--samples`. `SampleStats` serializes only when present (Option, skip-if-none), so existing single-sample JSON reports are byte-identical. Docs: reference §12/§15 updated; CHANGELOG. Gate: grade A. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
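As a rough illustration of the statistics and the confidence fold described in the commit message, here is a minimal sketch. Only `SampleStats`, `from_samples`, and the 2% / 10% CV thresholds come from the text; the `Confidence` enum, the field names, the `fold_variance` helper, and the choice of population (not sample) variance are assumptions made for illustration and may differ from the merged code.

```rust
// Hypothetical sketch: everything except the names `SampleStats` and
// `from_samples` and the 2% / 10% thresholds is an assumption.

/// Ordered so that `Low < Medium < High`; `min` then means "demote".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Confidence {
    Low,
    Medium,
    High,
}

#[derive(Debug, Clone, PartialEq)]
struct SampleStats {
    count: usize,
    min: u64,
    median: f64,
    max: u64,
    variance: f64,
    std_dev: f64,
    /// Coefficient of variation as a fraction (0.02 == 2%).
    cv: f64,
}

impl SampleStats {
    /// Builds the distribution from per-sample `total_cu` values.
    /// Returns `None` for an empty slice.
    fn from_samples(samples: &[u64]) -> Option<Self> {
        if samples.is_empty() {
            return None;
        }
        let mut sorted = samples.to_vec();
        sorted.sort_unstable();
        let n = sorted.len();
        let mean = sorted.iter().map(|&v| v as f64).sum::<f64>() / n as f64;
        // Population variance (divide by n); an assumption of this sketch.
        let variance = sorted
            .iter()
            .map(|&v| {
                let d = v as f64 - mean;
                d * d
            })
            .sum::<f64>()
            / n as f64;
        let std_dev = variance.sqrt();
        let median = if n % 2 == 1 {
            sorted[n / 2] as f64
        } else {
            (sorted[n / 2 - 1] as f64 + sorted[n / 2] as f64) / 2.0
        };
        // Guard against a zero mean so the ratio is always finite.
        let cv = if mean > 0.0 { std_dev / mean } else { 0.0 };
        Some(Self { count: n, min: sorted[0], median, max: sorted[n - 1], variance, std_dev, cv })
    }
}

/// Caps `base` by the variance factor: CV >= 10% -> Low, CV >= 2% -> Medium.
/// With no stats (single-sample run) the score is left unchanged.
fn fold_variance(base: Confidence, stats: Option<&SampleStats>) -> Confidence {
    let cap = match stats {
        Some(s) if s.cv >= 0.10 => Confidence::Low,
        Some(s) if s.cv >= 0.02 => Confidence::Medium,
        _ => Confidence::High,
    };
    base.min(cap)
}
```

Under these assumptions, samples of `[97, 103]` give a CV of 3% and cap the score at Medium, while `[80, 120]` give a CV of 20% and cap it at Low. Because the fold only takes a minimum, a score already demoted by another factor is never raised.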
cfc9da8 to 16dd8a0
Implements multi-sample runs + variance (roadmap item 2). It builds on #6's groundwork only conceptually; this is an independent change.
What
Wires the long-reserved `Scenario.samples` field (plus a `--samples` CLI override) so a scenario can be measured N times. The per-sample `total_cu` values form a `SampleStats` distribution (count / min / median / max / variance / std-dev / CV) attached to the measurement, and the coefficient of variation folds into the confidence score, implementing the spec §12 "sample variance" factor that was listed but unimplemented.
Honestly gated (no fake precision)
The shipping CLI uses the deterministic recorded backend, where running N times yields byte-identical results. The new `ExecutionBackend::is_deterministic()` (true for recorded) makes the profiler run a deterministic backend exactly once, so it never fabricates a run-to-run spread it did not observe. Multi-sampling therefore activates only for non-deterministic execution backends (e.g. Mollusk, once driven from the CLI). This is the design subtlety the research flagged, handled explicitly.
Details
- `SampleStats` is `Option` + `skip_serializing_if`, so existing single-sample JSON reports are byte-identical (no golden churn).
- `samples` is now settable from `[scenario.<name>] samples = N` and overridable with `--samples`.
- `SampleStats::from_samples` and the confidence fold are unit-tested; a `VaryingBackend` proves end-to-end sampling + variance + demotion, and a test asserts the recorded backend ignores `samples`.
Docs
- Reference §12/§15 updated (`--samples` is now real, which fixes that doc drift); CHANGELOG.
Local gate: grade A (core 67 tests).
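The gating and override behaviour described above can be sketched roughly as follows. Only `ExecutionBackend::is_deterministic()`, the idea of a recorded backend, and the `VaryingBackend` name come from the description; the `run` method, the struct fields, the `resolve_samples` and `collect_samples` helpers, and the default of one sample are hypothetical stand-ins, not the crate's actual API.

```rust
// Hypothetical sketch of the determinism gate; signatures are assumptions.

trait ExecutionBackend {
    /// True when repeated runs are guaranteed to give identical results.
    fn is_deterministic(&self) -> bool;
    /// Executes the scenario once and returns its `total_cu` (assumed shape).
    fn run(&mut self) -> u64;
}

/// `--samples` (if given) overrides the scenario's configured value.
/// Falling back to 1 when neither is set is an assumption of this sketch.
fn resolve_samples(cli_override: Option<u32>, scenario: Option<u32>) -> u32 {
    cli_override.or(scenario).unwrap_or(1).max(1)
}

/// Runs a deterministic backend exactly once, whatever was requested,
/// so no run-to-run spread is ever reported that was not observed.
fn collect_samples<B: ExecutionBackend>(backend: &mut B, requested: u32) -> Vec<u64> {
    let n = if backend.is_deterministic() { 1 } else { requested.max(1) };
    (0..n).map(|_| backend.run()).collect()
}

/// Stand-in for the recorded backend: always replays the same value.
struct RecordedBackend {
    total_cu: u64,
}

impl ExecutionBackend for RecordedBackend {
    fn is_deterministic(&self) -> bool {
        true
    }
    fn run(&mut self) -> u64 {
        self.total_cu
    }
}

/// Stand-in for a test backend whose result drifts on every run.
struct VaryingBackend {
    next: u64,
}

impl ExecutionBackend for VaryingBackend {
    fn is_deterministic(&self) -> bool {
        false
    }
    fn run(&mut self) -> u64 {
        let value = self.next;
        self.next += 10;
        value
    }
}
```

In this sketch, asking the recorded stand-in for five samples still produces a single value, while the varying stand-in produces one value per requested sample. Putting the gate in the sampling loop rather than in the CLI means a caller cannot bypass it by passing a larger `--samples`.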
🤖 Generated with Claude Code