feat(engine): R2 eval quality — blind multi-judge, skill dedup, match replay, prompt-bank#13

Open
LeoLin990405 wants to merge 3 commits into main from feat/r2-backend-eval

@LeoLin990405

Owner

Summary

  • multi-judge.mjs — blind multi-judge aggregation: civ names anonymized to Civ-A/B/C before each judge sees transcripts; scores parsed from three Markdown table formats and averaged across N providers; de-anonymizes back to regime ids in the final result. Never touches Gemini.
  • skill-quality.mjs — deterministic dedup (zero LLM calls): SHA-256 fingerprint for exact matches, Jaccard similarity (threshold 0.6, words >3 chars) for near-duplicates; analyzeSkillsDir quality report with dup groups, unique topics, and sedimentation dates.
  • replay.mjs — re-run any past match: reads original meta.json, mints a replay-* matchId with replayOf lineage, spawns run-v5.mjs with injectable _spawn/_runV5 for deterministic tests.
  • governance-scenarios.json — 10 curated challenge scenarios (military, political, economic, diplomatic, crisis) for --prompt-bank.
  • skill-sediment.mjs — near-duplicate gate added before writing a skill file.
  • tournament.mjs — multi-judge + prompt-bank wired in; judge logic refactored into buildJudgePrompt/judgeSingle/judgeMulti; pickScenario() for --prompt-bank.
  • bin/civagent — replay subcommand, skills --stats, tournament --multi-judge / --judges N / --prompt-bank / --seed N.

Test plan

  • npm run lint:syntax — all engine files + bash syntax clean
  • npm test — 69/69 pass (was 32; +11 multi-judge, +16 skill-quality, +7 replay)
  • Codex review requested on this PR

… replay, prompt-bank

New modules
- engine/v5/multi-judge.mjs: blind multi-judge aggregation
  · anonymizePrompt() replaces regime names with Civ-A/B/C before each judge sees
    the transcript, eliminating name-recognition bias
  · parseScoreTable() handles three table formats (single-score with rank prefix,
    triple-score with rank prefix, triple-score no prefix)
  · aggregateJudgements() averages N judge outputs and de-anonymizes civ labels
  · runMultiJudge() injectable _runJudge for testing; never touches Gemini
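
For reviewers, a minimal sketch of the averaging and de-anonymization step described above. It assumes each judge's parsed output is a plain object mapping a Civ-* label to a numeric score, and that the label-to-regime map is a plain object; both shapes are assumptions, and the real aggregateJudgements() may differ.

```javascript
// Hypothetical sketch; input/output shapes are assumed, not taken from multi-judge.mjs.
function aggregateJudgements(judgements, labelToRegime) {
  // Accumulate a running total and count per anonymized label.
  const sums = new Map();
  for (const scores of judgements) {
    for (const [label, score] of Object.entries(scores)) {
      const entry = sums.get(label) ?? { total: 0, count: 0 };
      entry.total += score;
      entry.count += 1;
      sums.set(label, entry);
    }
  }
  // Average, then map each label back to its regime id (fall back to the label).
  const result = {};
  for (const [label, { total, count }] of sums) {
    result[labelToRegime[label] ?? label] = total / count;
  }
  return result;
}

const averaged = aggregateJudgements(
  [
    { 'Civ-A': 8, 'Civ-B': 6 }, // judge 1
    { 'Civ-A': 6, 'Civ-B': 7 }, // judge 2
  ],
  { 'Civ-A': 'china/jin', 'Civ-B': 'china/jin-jurchen' },
);
console.log(averaged);
```

Averaging per label rather than per position means a judge that omits one civ does not shift the other scores.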

- engine/v5/skill-quality.mjs: deterministic skill deduplication (zero LLM calls)
  · skillFingerprint() = 16-char SHA-256 for exact-duplicate detection
  · jaccardSimilarity() = word-level Jaccard (words > 3 chars) for near-duplicates
  · findDuplicate() gates skill writes at threshold 0.6
  · analyzeSkillsDir() produces full quality report: total, dup groups, topics, dates
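
A rough sketch of the word-level Jaccard gate, for reviewers who want to sanity-check the 0.6 threshold. The tokenizer (lowercase, split on non-alphanumerics), the empty-set behavior, and the `>=` comparison are assumptions; only the "words > 3 chars" filter and the threshold value come from this PR.

```javascript
// Hypothetical sketch; the real jaccardSimilarity()/findDuplicate() may tokenize differently.
function wordSet(text) {
  // Keep only words longer than 3 characters, as stated in the PR description.
  return new Set(
    text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 3),
  );
}

function jaccardSimilarity(a, b) {
  const setA = wordSet(a);
  const setB = wordSet(b);
  if (setA.size === 0 && setB.size === 0) return 0; // assumption: nothing to compare
  let intersection = 0;
  for (const word of setA) {
    if (setB.has(word)) intersection += 1;
  }
  // |A ∩ B| / |A ∪ B|
  return intersection / (setA.size + setB.size - intersection);
}

function findDuplicate(candidate, existingSkills, threshold = 0.6) {
  // Return the first existing skill at or above the threshold, else null.
  for (const skill of existingSkills) {
    if (jaccardSimilarity(candidate, skill) >= threshold) return skill;
  }
  return null;
}

// 3 shared words out of 4 distinct ones gives 0.75, which clears the gate.
console.log(jaccardSimilarity('fortify northern border garrisons', 'fortify northern border'));
```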

- engine/v5/replay.mjs: re-run any past match with identical params
  · Reads original meta.json (regime/backend/task), mints replay-* matchId,
    writes replayOf lineage field, spawns run-v5.mjs
  · Injectable _spawn/_runV5 for deterministic unit tests
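
To illustrate the lineage fields and the injection point, here is a sketch under stated assumptions: `_runV5` is modeled as a function that receives the new meta object, `_now` is a hypothetical clock hook, and the timestamp suffix after `replay-` is made up. The real replay.mjs reads meta.json from disk and spawns run-v5.mjs, which this sketch leaves out.

```javascript
// Hypothetical sketch; only regime/backend/task, matchId, replayOf and the
// "replay-" prefix come from the PR description.
function replayMatch(originalMeta, { _runV5, _now = Date.now } = {}) {
  if (typeof _runV5 !== 'function') {
    throw new TypeError('_runV5 must be provided in this sketch');
  }
  const meta = {
    // Carry over the original run parameters unchanged.
    regime: originalMeta.regime,
    backend: originalMeta.backend,
    task: originalMeta.task,
    // Mint a new id and record which match this one replays.
    matchId: `replay-${_now()}`,
    replayOf: originalMeta.matchId,
  };
  _runV5(meta);
  return meta;
}

// In a test, both hooks are stubbed so the result is deterministic.
const launched = [];
const meta = replayMatch(
  { matchId: 'm-001', regime: 'china/jin', backend: 'example-backend', task: 'example task' },
  { _runV5: (m) => launched.push(m), _now: () => 42 },
);
console.log(meta.matchId, meta.replayOf);
```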

- engine/prompts/governance-scenarios.json: 10 curated challenge scenarios
  (military, political, economic, diplomatic, crisis) for --prompt-bank

Modified
- engine/v5/skill-sediment.mjs: near-duplicate gate before writing skill file
- engine/v5/tournament.mjs: multi-judge + prompt-bank wired in; judge logic
  refactored into buildJudgePrompt/judgeSingle/judgeMulti; pickScenario() added
- bin/civagent: replay subcommand, skills --stats, tournament --multi-judge /
  --judges N / --prompt-bank / --seed N flags
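
One way `--seed N` could make the `--prompt-bank` pick reproducible is sketched below. The PRNG choice (mulberry32), the `pickScenario(scenarios, seed)` signature, and the scenario object shape are all assumptions for illustration; the PR does not state how tournament.mjs implements the selection.

```javascript
// Hypothetical sketch; not the actual pickScenario() from tournament.mjs.
// mulberry32: a small, well-known 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pickScenario(scenarios, seed) {
  if (scenarios.length === 0) throw new Error('prompt bank is empty');
  // With a seed the pick is repeatable; without one it falls back to Math.random.
  const rand = seed === undefined ? Math.random : mulberry32(seed);
  return scenarios[Math.floor(rand() * scenarios.length)];
}

// Assumed entry shape; the categories are the ones listed in this PR.
const bank = ['military', 'political', 'economic', 'diplomatic', 'crisis'].map(
  (category, i) => ({ id: i, category }),
);
console.log(pickScenario(bank, 7) === pickScenario(bank, 7));
```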

Tests: 32 → 69 (multi-judge ×11, skill-quality ×16, replay ×7 new cases)
All 69 pass; lint:syntax clean.

P1 — anonymizePrompt prefix-collision (multi-judge.mjs)
  Replacing "china/jin" before "china/jin-jurchen" corrupted the longer name
  into "Civ-A-jurchen". Fixed by sorting the replacement pass in descending
  name-length order. The `map` index (Civ-A = civNames[0]) is unchanged; only
  the substitution order is reordered. Test: "china/jin vs china/jin-jurchen"
  verifies no partial corruption and correct label assignment.
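
The ordering fix can be sketched as follows. This is an illustration of the approach described above, not the code from multi-judge.mjs: the return shape and the use of split/join for literal replacement are assumptions.

```javascript
// Hypothetical sketch of length-sorted substitution; real anonymizePrompt() may differ.
const LABELS = ['Civ-A', 'Civ-B', 'Civ-C'];

function anonymizePrompt(text, civNames) {
  // Labels follow the caller's order: civNames[0] is always Civ-A.
  const map = new Map(civNames.map((name, i) => [name, LABELS[i]]));
  // Only the substitution order changes: longest names first, so a short
  // name that prefixes a longer one cannot corrupt it.
  const ordered = [...civNames].sort((a, b) => b.length - a.length);
  let out = text;
  for (const name of ordered) {
    out = out.split(name).join(map.get(name)); // literal replace-all
  }
  return { text: out, map };
}

const { text } = anonymizePrompt(
  'china/jin vs china/jin-jurchen',
  ['china/jin', 'china/jin-jurchen'],
);
console.log(text); // Civ-A vs Civ-B
```

Replacing in the caller's order instead would turn the second name into "Civ-A-jurchen", which is the bug this commit fixes.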

P1 — normalizeSkill banner blindspot (skill-quality.mjs)
  skill-sediment.mjs prepends an HTML comment provenance banner before the
  frontmatter when writing a skill to disk. normalizeSkill() stripped
  frontmatter but not the banner, so a candidate (no banner) and the saved
  file (has banner) produced different fingerprints — exact-dup detection
  always missed. Fixed by stripping /<!--[\s\S]*?-->\s*/g before frontmatter.
  Two new tests: banner-content stripped from normalised output; findDuplicate
  recognises a banner-prefixed saved file as identical to a bare candidate.
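
A sketch of the normalization order, to show why the banner must go first. The comment-stripping regex is the one quoted above; the frontmatter pattern (a `---` delimited block at the top of the file) and the final trim are assumptions about what normalizeSkill() does.

```javascript
// Hypothetical sketch; only the HTML-comment regex is taken from the commit message.
function normalizeSkill(raw) {
  return raw
    // 1. Drop HTML comments (the provenance banner) plus trailing whitespace.
    .replace(/<!--[\s\S]*?-->\s*/g, '')
    // 2. Drop a leading frontmatter block; it is only at position 0 once the
    //    banner is gone, which is why step 1 has to run first.
    .replace(/^---\n[\s\S]*?\n---\n?/, '')
    .trim();
}

const candidate = '---\ntitle: border defense\n---\nFortify the northern passes.';
const saved = '<!-- provenance banner -->\n' + candidate;

console.log(normalizeSkill(saved) === normalizeSkill(candidate)); // true
```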

P2 — runMultiJudge N-fill from full chain (multi-judge.mjs)
  slice(0, judgesN) ran only the first N providers even if some were
  unavailable, leaving fewer than N successful judges. Now iterates the full
  resolved chain and stops when successCount reaches judgesN, skipping
  unavailable providers without wasting a quota slot. Two new tests: gap-fill
  (p1 fails → p3 fills); early-stop (p3 never called when p1+p2 both succeed).
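
The gap-fill and early-stop behavior can be sketched like this. The signature, the result shape, and treating a thrown error as "provider unavailable" are assumptions; the real runMultiJudge() resolves its own provider chain.

```javascript
// Hypothetical sketch of "iterate the full chain, stop at N successes".
async function runMultiJudge(providers, judgesN, _runJudge) {
  const results = [];
  for (const provider of providers) {
    // Early stop: once N judges have succeeded, later providers are never called.
    if (results.length >= judgesN) break;
    try {
      results.push({ provider, output: await _runJudge(provider) });
    } catch {
      // Unavailable provider: skip it and let the next one fill the slot.
    }
  }
  return results;
}

// p1 is down, so p3 fills the second slot.
runMultiJudge(['p1', 'p2', 'p3'], 2, async (p) => {
  if (p === 'p1') throw new Error('unavailable');
  return `${p} verdict`;
}).then((results) => console.log(results.map((r) => r.provider)));
```

With the old `slice(0, judgesN)` approach the same input would have yielded a single successful judge (p2) instead of two.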

P2 — backend removed from blind-eval section header (tournament.mjs)
  buildJudgePrompt() included "backend ${r.backend}" in the Markdown heading
  seen by every judge. The backend is often tightly correlated with the civ,
  breaking the blind-eval guarantee. Heading now shows only exit code.

Tests: 69 → 72 (all pass).