feat(engine): R2 eval quality — blind multi-judge, skill dedup, match replay, prompt-bank by LeoLin990405 · Pull Request #13 · LeoLin990405/civagent

LeoLin990405 · 2026-05-28T01:44:42Z

Summary

multi-judge.mjs — blind multi-judge aggregation: civ names anonymized to Civ-A/B/C before each judge sees transcripts; scores parsed from three Markdown table formats and averaged across N providers; de-anonymizes back to regime ids in the final result. Never touches Gemini.
skill-quality.mjs — deterministic dedup (zero LLM calls): SHA-256 fingerprint for exact matches, Jaccard similarity (threshold 0.6, words >3 chars) for near-duplicates; analyzeSkillsDir quality report with dup groups, unique topics, and sedimentation dates.
replay.mjs — re-run any past match: reads original meta.json, mints a replay-* matchId with replayOf lineage, spawns run-v5.mjs with injectable _spawn/_runV5 for deterministic tests.
governance-scenarios.json — 10 curated challenge scenarios (military, political, economic, diplomatic, crisis) for --prompt-bank.
skill-sediment.mjs — near-duplicate gate added before writing a skill file.
tournament.mjs — multi-judge + prompt-bank wired in; judge logic refactored into buildJudgePrompt/judgeSingle/judgeMulti; pickScenario() for --prompt-bank.
bin/civagent — replay subcommand, skills --stats, tournament --multi-judge / --judges N / --prompt-bank / --seed N.
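
The near-duplicate half of the dedup gate described above can be sketched as follows. This is a minimal illustration, not the code in skill-quality.mjs: the helper names (`wordSet`, `jaccard`, `findNearDuplicate`), the tokenization rule, and the choice of `>=` at the threshold are assumptions; only the 0.6 threshold and the "words longer than 3 characters" filter come from the summary. The SHA-256 exact-match path is left out.

```javascript
// Assumed constant: the summary states a 0.6 threshold for near-duplicates.
const NEAR_DUP_THRESHOLD = 0.6;

// Lowercase, split on anything that is not a letter or digit,
// and keep only words longer than 3 characters.
function wordSet(text) {
  const words = text.toLowerCase().split(/[^\p{L}\p{N}]+/u);
  return new Set(words.filter((w) => w.length > 3));
}

// Jaccard similarity |A ∩ B| / |A ∪ B| over the two word sets.
// Two empty sets are treated as dissimilar (0) to avoid dividing by zero.
function jaccard(a, b) {
  const setA = wordSet(a);
  const setB = wordSet(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  for (const w of setA) if (setB.has(w)) shared += 1;
  return shared / (setA.size + setB.size - shared);
}

// Return the first saved skill at or above the threshold, or null.
// `savedSkills` is a hypothetical array of { name, text } records.
function findNearDuplicate(candidate, savedSkills, threshold = NEAR_DUP_THRESHOLD) {
  for (const saved of savedSkills) {
    const score = jaccard(candidate, saved.text);
    if (score >= threshold) return { name: saved.name, score };
  }
  return null;
}
```

Because the comparison uses word sets only, it needs no LLM call and gives the same answer on every run, which matches the "deterministic" claim in the summary.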

Test plan

npm run lint:syntax — all engine files + bash syntax clean
npm test — 69/69 pass (was 32; +11 multi-judge, +16 skill-quality, +7 replay)
Codex review requested on this PR

… replay, prompt-bank

New modules
- engine/v5/multi-judge.mjs: blind multi-judge aggregation
  · anonymizePrompt() replaces regime names with Civ-A/B/C before each judge sees the transcript, eliminating name-recognition bias
  · parseScoreTable() handles three table formats (single-score with rank prefix, triple-score with rank prefix, triple-score no prefix)
  · aggregateJudgements() averages N judge outputs and de-anonymizes civ labels
  · runMultiJudge() injectable _runJudge for testing; never touches Gemini
- engine/v5/skill-quality.mjs: deterministic skill deduplication (zero LLM calls)
  · skillFingerprint() = 16-char SHA-256 for exact-duplicate detection
  · jaccardSimilarity() = word-level Jaccard (words > 3 chars) for near-duplicates
  · findDuplicate() gates skill writes at threshold 0.6
  · analyzeSkillsDir() produces full quality report: total, dup groups, topics, dates
- engine/v5/replay.mjs: re-run any past match with identical params
  · Reads original meta.json (regime/backend/task), mints replay-* matchId, writes replayOf lineage field, spawns run-v5.mjs
  · Injectable _spawn/_runV5 for deterministic unit tests
- engine/prompts/governance-scenarios.json: 10 curated challenge scenarios (military, political, economic, diplomatic, crisis) for --prompt-bank

Modified
- engine/v5/skill-sediment.mjs: near-duplicate gate before writing skill file
- engine/v5/tournament.mjs: multi-judge + prompt-bank wired in; judge logic refactored into buildJudgePrompt/judgeSingle/judgeMulti; pickScenario() added
- bin/civagent: replay subcommand, skills --stats, tournament --multi-judge / --judges N / --prompt-bank / --seed N flags

Tests: 32 → 69 (multi-judge ×11, skill-quality ×16, replay ×7 new cases)
All 69 pass; lint:syntax clean.
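
The averaging and de-anonymization step attributed to aggregateJudgements() might look roughly like the sketch below. The data shapes are assumptions made for illustration: each judgement is taken to be a plain object mapping an anonymized label to one numeric score, and `labelToRegime` is a hypothetical lookup from label back to regime id. The real module also handles triple-score tables, which this sketch does not.

```javascript
// Average per-label scores across judges, then map labels back to regime ids.
// `judgements`: hypothetical array of { "Civ-A": number, ... } objects.
// `labelToRegime`: hypothetical map such as { "Civ-A": "china/jin" }.
function aggregateScores(judgements, labelToRegime) {
  const totals = new Map(); // label -> { sum, count }
  for (const scores of judgements) {
    for (const [label, score] of Object.entries(scores)) {
      if (!Number.isFinite(score)) continue; // skip cells that failed to parse
      const t = totals.get(label) ?? { sum: 0, count: 0 };
      t.sum += score;
      t.count += 1;
      totals.set(label, t);
    }
  }
  const result = {};
  for (const [label, { sum, count }] of totals) {
    // Fall back to the anonymized label if no regime id is known for it.
    const regime = labelToRegime[label] ?? label;
    result[regime] = { mean: sum / count, judges: count };
  }
  return result;
}
```

Averaging only over the judges that produced a usable score for a label keeps one malformed table from dragging a civ's mean toward zero; whether the actual implementation makes the same choice is not stated in the PR.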

P1: anonymizePrompt prefix-collision (multi-judge.mjs)
Replacing "china/jin" before "china/jin-jurchen" corrupted the longer name into "Civ-A-jurchen". Fixed by running the replacement pass in descending name-length order. The `map` index (Civ-A = civNames[0]) is unchanged; only the substitution order changes.
Test: "china/jin vs china/jin-jurchen" verifies no partial corruption and correct label assignment.

P1: normalizeSkill banner blindspot (skill-quality.mjs)
skill-sediment.mjs prepends an HTML comment provenance banner before the frontmatter when writing a skill to disk. normalizeSkill() stripped frontmatter but not the banner, so a candidate (no banner) and the saved file (has banner) produced different fingerprints, and exact-dup detection always missed. Fixed by also stripping leading HTML comments (and the whitespace after them) ahead of the frontmatter.
Two new tests: banner content is stripped from normalised output; findDuplicate recognises a banner-prefixed saved file as identical to a bare candidate.

P2: runMultiJudge N-fill from full chain (multi-judge.mjs)
slice(0, judgesN) ran only the first N providers even if some were unavailable, leaving fewer than N successful judges. It now iterates the full resolved chain and stops when successCount reaches judgesN, skipping unavailable providers without wasting a quota slot.
Two new tests: gap-fill (p1 fails → p3 fills); early-stop (p3 never called when p1+p2 both succeed).

P2: backend removed from blind-eval section header (tournament.mjs)
buildJudgePrompt() included "backend ${r.backend}" in the Markdown heading seen by every judge. The backend is often tightly correlated with the civ, which breaks the blind-eval guarantee. The heading now shows only the exit code.

Tests: 69 → 72 (all pass).
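
Two of the fixes above, the longest-name-first substitution and the N-fill loop over the full provider chain, can be illustrated with the sketch below. It is not the code from multi-judge.mjs: `anonymize`, `collectJudgements`, `escapeRegExp`, and the `runJudge` callback are hypothetical names, and the sketch assumes a provider signals unavailability by throwing.

```javascript
// Escape regex metacharacters so names such as "china/jin" match literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Labels follow list order (civNames[0] -> Civ-A); only the substitution
// order is sorted, longest name first, so a short name cannot clobber a
// longer name that starts with it.
function anonymize(text, civNames) {
  const labelOf = new Map(
    civNames.map((name, i) => [name, `Civ-${String.fromCharCode(65 + i)}`])
  );
  const longestFirst = [...civNames].sort((a, b) => b.length - a.length);
  let out = text;
  for (const name of longestFirst) {
    out = out.replace(new RegExp(escapeRegExp(name), "g"), labelOf.get(name));
  }
  return { text: out, labelOf };
}

// Walk the whole provider chain and stop once `wanted` judges have succeeded.
// A provider that throws is skipped, so a later one can fill its slot.
async function collectJudgements(providers, wanted, runJudge) {
  const results = [];
  for (const provider of providers) {
    if (results.length >= wanted) break;
    try {
      results.push({ provider, output: await runJudge(provider) });
    } catch {
      // Assumed failure mode: unavailable providers reject; move on.
    }
  }
  return results;
}
```

With the names from the P1 test, `anonymize("china/jin vs china/jin-jurchen", ["china/jin", "china/jin-jurchen"]).text` yields `"Civ-A vs Civ-B"` rather than `"Civ-A vs Civ-A-jurchen"`.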

LeoLin990405 added 3 commits May 28, 2026 09:44

test(engine): cover PR13 review fixes

68786e8

LeoLin990405 wants to merge 3 commits into main from feat/r2-backend-eval
