feat(engine): R2 eval quality — blind multi-judge, skill dedup, match replay, prompt-bank#13
Open
LeoLin990405 wants to merge 3 commits into
… replay, prompt-bank
New modules
- engine/v5/multi-judge.mjs: blind multi-judge aggregation
· anonymizePrompt() replaces regime names with Civ-A/B/C before each judge sees
the transcript, eliminating name-recognition bias
· parseScoreTable() handles three table formats (single-score with rank prefix,
triple-score with rank prefix, triple-score no prefix)
· aggregateJudgements() averages N judge outputs and de-anonymizes civ labels
· runMultiJudge() takes an injectable _runJudge for testing; never touches Gemini
- engine/v5/skill-quality.mjs: deterministic skill deduplication (zero LLM calls)
· skillFingerprint() = 16-char SHA-256 for exact-duplicate detection
· jaccardSimilarity() = word-level Jaccard (words > 3 chars) for near-duplicates
· findDuplicate() gates skill writes at threshold 0.6
· analyzeSkillsDir() produces full quality report: total, dup groups, topics, dates
- engine/v5/replay.mjs: re-run any past match with identical params
· Reads original meta.json (regime/backend/task), mints replay-* matchId,
writes replayOf lineage field, spawns run-v5.mjs
· Injectable _spawn/_runV5 for deterministic unit tests
- engine/prompts/governance-scenarios.json: 10 curated challenge scenarios
(military, political, economic, diplomatic, crisis) for --prompt-bank
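The near-duplicate gate described for skill-quality.mjs can be pictured with a minimal sketch like the one below. It is not the PR's code: the tokenizer (lowercase, split on non-alphanumeric runs), the helper names wordSet and isNearDuplicate, the treatment of two empty inputs, and the use of >= at the 0.6 threshold are all assumptions made for illustration.

```javascript
// Hypothetical sketch of word-level Jaccard similarity (not the actual skill-quality.mjs).
function wordSet(text) {
  // Assumed tokenizer: lowercase, split on non-alphanumeric runs, keep words longer than 3 chars.
  return new Set(
    text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 3)
  );
}

function jaccardSimilarity(a, b) {
  const setA = wordSet(a);
  const setB = wordSet(b);
  // Assumption: two texts with no qualifying words are not treated as duplicates.
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  for (const w of setA) if (setB.has(w)) shared += 1;
  // |A ∩ B| / |A ∪ B|
  return shared / (setA.size + setB.size - shared);
}

const DUP_THRESHOLD = 0.6; // threshold value taken from the commit message

// Assumption: a score equal to the threshold counts as a near-duplicate.
function isNearDuplicate(candidate, existing) {
  return jaccardSimilarity(candidate, existing) >= DUP_THRESHOLD;
}
```

For example, "secure grain supply routes before winter" and "secure grain supply routes before spring" share 5 of 7 distinct qualifying words, so the score is 5/7 (about 0.71) and the candidate would be gated under these assumptions.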
Modified
- engine/v5/skill-sediment.mjs: near-duplicate gate before writing skill file
- engine/v5/tournament.mjs: multi-judge + prompt-bank wired in; judge logic
refactored into buildJudgePrompt/judgeSingle/judgeMulti; pickScenario() added
- bin/civagent: replay subcommand, skills --stats, tournament --multi-judge /
--judges N / --prompt-bank / --seed N flags
Tests: 32 → 69 (multi-judge ×11, skill-quality ×16, replay ×7 new cases)
All 69 pass; lint:syntax clean.
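As a rough illustration of the averaging and de-anonymizing step attributed to aggregateJudgements(), here is a minimal sketch. The data shapes are assumptions: each judge output is taken to be an object mapping an anonymized label to a single numeric score, and labelToName is a hypothetical lookup from label back to regime id. The real module also handles triple-score tables, which this sketch ignores.

```javascript
// Hypothetical sketch: average per-label scores across judges, then map labels back to regime ids.
function aggregateJudgements(judgements, labelToName) {
  const sums = new Map();
  for (const scores of judgements) {
    for (const [label, score] of Object.entries(scores)) {
      const entry = sums.get(label) ?? { total: 0, count: 0 };
      entry.total += score;
      entry.count += 1;
      sums.set(label, entry);
    }
  }
  const result = {};
  for (const [label, { total, count }] of sums) {
    // Fall back to the anonymized label if no mapping is known (assumption).
    result[labelToName[label] ?? label] = total / count;
  }
  return result;
}

const averaged = aggregateJudgements(
  [
    { "Civ-A": 8, "Civ-B": 6 },
    { "Civ-A": 6, "Civ-B": 7 },
  ],
  { "Civ-A": "china/jin", "Civ-B": "china/jin-jurchen" }
);
console.log(averaged);
```

A label that only some judges scored is averaged over the judges that did score it, which is one possible policy among several.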
P1 — anonymizePrompt prefix-collision (multi-judge.mjs)
Replacing "china/jin" before "china/jin-jurchen" corrupted the longer name
into "Civ-A-jurchen". Fixed by sorting the replacement pass in descending
name-length order. The `map` index (Civ-A = civNames[0]) is unchanged; only
the substitution order is reordered. Test: "china/jin vs china/jin-jurchen"
verifies no partial corruption and correct label assignment.
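The ordering fix can be sketched as follows. This is an illustration under assumptions, not the code in multi-judge.mjs: the return shape ({ text, map }) and the split/join substitution are choices made here, and only the two ideas stated above are taken from the commit message (labels follow the civNames index, substitution runs longest name first).

```javascript
// Hypothetical sketch of prefix-safe anonymization.
function anonymizePrompt(prompt, civNames) {
  // Labels follow the caller's order: civNames[0] -> Civ-A, civNames[1] -> Civ-B, ...
  const map = new Map(
    civNames.map((name, i) => [name, `Civ-${String.fromCharCode(65 + i)}`])
  );
  // Substitute longest names first so "china/jin" cannot clobber "china/jin-jurchen".
  const byLengthDesc = [...civNames].sort((a, b) => b.length - a.length);
  let text = prompt;
  for (const name of byLengthDesc) {
    // split/join performs a literal replace-all without regex escaping concerns.
    text = text.split(name).join(map.get(name));
  }
  return { text, map };
}

const { text } = anonymizePrompt("china/jin vs china/jin-jurchen", [
  "china/jin",
  "china/jin-jurchen",
]);
console.log(text); // "Civ-A vs Civ-B"
```

Copying civNames before sorting keeps the label assignment independent of the substitution order, which is the property the test in the commit message checks.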
P1 — normalizeSkill banner blindspot (skill-quality.mjs)
skill-sediment.mjs prepends an HTML comment provenance banner before the
frontmatter when writing a skill to disk. normalizeSkill() stripped
frontmatter but not the banner, so a candidate (no banner) and the saved
file (has banner) produced different fingerprints — exact-dup detection
always missed. Fixed by stripping /<!--[\s\S]*?-->\s*/g before frontmatter.
Two new tests: banner-content stripped from normalised output; findDuplicate
recognises a banner-prefixed saved file as identical to a bare candidate.
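A minimal sketch of the normalisation order described above, assuming frontmatter is delimited by --- lines and that normalisation also collapses whitespace and lowercases (neither detail is confirmed by the commit message). The banner regex is the one quoted in the fix; the sample banner text is made up for the example.

```javascript
// Hypothetical sketch of normalizeSkill(): strip the banner first, then the frontmatter.
function normalizeSkill(raw) {
  return raw
    .replace(/<!--[\s\S]*?-->\s*/g, "")     // HTML comment provenance banner(s)
    .replace(/^---\n[\s\S]*?\n---\n?/, "")  // leading frontmatter (assumed ----delimited)
    .replace(/\s+/g, " ")                   // assumption: whitespace is collapsed
    .trim()
    .toLowerCase();                         // assumption: comparison is case-insensitive
}

const candidate = "---\ntitle: Siege\n---\nHold the river crossings.";
const saved = "<!-- sedimented by match-1 -->\n" + candidate; // banner text is illustrative

console.log(normalizeSkill(candidate) === normalizeSkill(saved)); // true
```

Because both inputs normalise to the same string, any fingerprint computed from the normalised text would match as well. If the banner were stripped after the frontmatter, the anchored frontmatter pattern would not match the saved file, since the banner precedes it.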
P2 — runMultiJudge N-fill from full chain (multi-judge.mjs)
slice(0, judgesN) ran only the first N providers even if some were
unavailable, leaving fewer than N successful judges. Now iterates the full
resolved chain and stops when successCount reaches judgesN, skipping
unavailable providers without wasting a quota slot. Two new tests: gap-fill
(p1 fails → p3 fills); early-stop (p3 never called when p1+p2 both succeed).
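The gap-fill loop can be sketched like this. The function name fillJudges and the result shape are hypothetical; the sketch only assumes that an unavailable provider surfaces as a rejected promise from the injected runJudge, which may not match how multi-judge.mjs signals unavailability.

```javascript
// Hypothetical sketch: walk the full provider chain, stop once judgesN judges have succeeded.
async function fillJudges(providers, judgesN, runJudge) {
  const results = [];
  let successCount = 0;
  for (const provider of providers) {
    if (successCount >= judgesN) break; // early stop: later providers are never called
    try {
      const output = await runJudge(provider);
      results.push({ provider, output });
      successCount += 1;
    } catch {
      // Assumption: a failed or unavailable provider is skipped and the next one fills the gap.
    }
  }
  return results;
}
```

With providers p1, p2, p3 and judgesN = 2, a failing p1 leads to results from p2 and p3, while a healthy p1 and p2 mean p3 is never invoked. The older slice(0, judgesN) approach would have returned only p2's result in the first case.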
P2 — backend removed from blind-eval section header (tournament.mjs)
buildJudgePrompt() included "backend ${r.backend}" in the Markdown heading
seen by every judge. The backend is often tightly correlated with the civ,
breaking the blind-eval guarantee. Heading now shows only exit code.
Tests: 69 → 72 (all pass).
Summary
- multi-judge.mjs: blind multi-judge aggregation. Civ names are anonymized to Civ-A/B/C before each judge sees transcripts; scores are parsed from three Markdown table formats and averaged across N providers; labels are de-anonymized back to regime ids in the final result. Never touches Gemini.
- skill-quality.mjs: deterministic dedup (zero LLM calls). SHA-256 fingerprint for exact matches, Jaccard similarity (threshold 0.6, words >3 chars) for near-duplicates; analyzeSkillsDir quality report with dup groups, unique topics, and sedimentation dates.
- replay.mjs: re-run any past match. Reads original meta.json, mints a replay-* matchId with replayOf lineage, spawns run-v5.mjs with injectable _spawn/_runV5 for deterministic tests.
- governance-scenarios.json: 10 curated challenge scenarios (military, political, economic, diplomatic, crisis) for --prompt-bank.
- skill-sediment.mjs: near-duplicate gate added before writing a skill file.
- tournament.mjs: multi-judge + prompt-bank wired in; judge logic refactored into buildJudgePrompt/judgeSingle/judgeMulti; pickScenario() for --prompt-bank.
- bin/civagent: replay subcommand, skills --stats, tournament --multi-judge / --judges N / --prompt-bank / --seed N.
Test plan
- npm run lint:syntax: all engine files + bash syntax clean
- npm test: 69/69 pass (was 32; +11 multi-judge, +16 skill-quality, +7 replay)