Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 22 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,27 @@
# 📜 Changelog

## v5.2.0 (unreleased) — R2 Evaluation & Learning Quality 🏛️

Multi-judge blind evaluation, skill deduplication, match replay, and prompt-bank (iteration plan: [ITERATION_PLAN.md](./ITERATION_PLAN.md), Round 2).

### Added
- **`engine/v5/multi-judge.mjs`** — blind multi-judge aggregation for tournaments. Civ names are anonymized (Civ-A, Civ-B …) before each judge sees the transcript to prevent name-recognition bias. N independent providers run in sequence; their scores are parsed from Markdown tables and averaged. De-anonymization restores real regime ids in the final `Map<regime, scores>`. (`anonymizePrompt`, `parseScoreTable`, `aggregateJudgements`, `runMultiJudge`)
- **`engine/v5/skill-quality.mjs`** — deterministic skill deduplication (no LLM calls). SHA-256 fingerprints catch exact duplicates; word-level Jaccard similarity (threshold 0.6, words >3 chars) catches near-duplicates. `analyzeSkillsDir` produces a full quality report: total, duplicate groups, unique topics, first/last sedimentation dates.
- **`engine/v5/replay.mjs`** — re-run a past match with identical regime/backend/task. Reads `meta.json` from the original match, mints a new `matchId` with a `replay-` prefix, writes a `replayOf` lineage field, and spawns `run-v5.mjs` with injectable `_spawn`/`_runV5` for testability.
- **`engine/prompts/governance-scenarios.json`** — 10 curated governance challenge scenarios (military, political, economic, diplomatic, crisis) for use with `--prompt-bank`.
- **`civagent tournament --multi-judge [--judges N]`** — run N judges in blind mode; default N=2.
- **`civagent tournament --prompt-bank [--seed N]`** — pick a random (or seeded) scenario from the built-in prompt bank.
- **`civagent replay <matchId>`** — re-run any past match.
- **`civagent skills <regime> --stats`** — show dedup quality analysis for a regime's skill library.
- **Tests** — 32 → 55: `multi-judge.test.mjs` (11 cases), `skill-quality.test.mjs` (16 cases), `replay.test.mjs` (7 cases).

### Changed
- **`engine/v5/skill-sediment.mjs`** — near-duplicate gate added before writing a skill file; uses `findDuplicate` from `skill-quality.mjs`.
- **`engine/v5/tournament.mjs`** — judge logic refactored into `buildJudgePrompt`/`judgeSingle`/`judgeMulti`; `runTournament` accepts `multiJudge` and `judgesN`; `pickScenario` added for prompt-bank.
- **`package.json` `lint:syntax`** — includes `multi-judge.mjs`, `skill-quality.mjs`, `replay.mjs`.

---

## v5.1.0 (unreleased) — R1 Engine Robustness 🔧

Backend robustness pass (iteration plan: [ITERATION_PLAN.md](./ITERATION_PLAN.md), Round 1). Focus: correctness, concurrency safety, and removing the hard external dependency on Gemini.
Expand Down
8 changes: 5 additions & 3 deletions ITERATION_PLAN.md
Original file line number Diff line number Diff line change
Expand Up @@ -109,9 +109,11 @@ CivAgent v5 是跑在 Claude Code runtime 上的多智能体编排系统:57
- **验收**:`civagent setup` 不再提示 gemini,准确反映可用后端。

### Round 2 — 评测严谨性 + 学习闭环质量(前端②上线后并行)
- 多裁判 / 盲评:N 个非 gemini provider 各打分后聚合,消除单裁判偏差(解决 V5-DESIGN 里"裁判=单一 gemini"的可信度问题)。
- skill 沉淀质量度量:去重相似 skill、跨局重复 pattern 检测、沉淀产出率指标。
- 对局可复现:固定题库 + 记录 provider/版本/seed,支持 `civagent replay <matchId>`。
- ✅ 多裁判 / 盲评:N 个非 gemini provider 各打分后聚合,消除单裁判偏差(`multi-judge.mjs`,`--multi-judge` / `--judges N`)。
- ✅ skill 沉淀质量度量:去重相似 skill(SHA-256 精确 + Jaccard 近似,threshold 0.6)、跨局重复 pattern 检测、`--stats` 分析报告(`skill-quality.mjs`)。
- ✅ 对局可复现:固定题库 10 场景(`governance-scenarios.json`,`--prompt-bank / --seed`),支持 `civagent replay <matchId>`(`replay.mjs`,`replayOf` 血统字段)。
- ✅ 测试覆盖:multi-judge 11 用例、skill-quality 16 用例、replay 7 用例(共 34 新用例,累计 ~55)。
- **前端 R2(进行中)**:形态②政体可视化浏览器(PR 待提,由 antigravity 负责)。

### Round 3 — 控制台后端能力(前端③上线)
- 写 API:发起对局/锦标赛、在线编辑 regime(带 schema 校验 + 史实审稿入环)、管理 skill 库(启用/禁用/删除)。
Expand Down
79 changes: 72 additions & 7 deletions bin/civagent
Original file line number Diff line number Diff line change
Expand Up @@ -26,19 +26,27 @@ ${BOLD}Usage:${NC}
civagent switch <regime> Set active regime
civagent run [prompt] Launch CC with active regime's agents (v4 mode)
civagent run --v5 "task" Launch v5 with isolated HOME + skill sedimentation
civagent skills <regime> List learned skills for a regime
civagent skills <regime> [--stats] List learned skills (--stats: dedup analysis)
civagent match-log Recent match transcripts
civagent replay <matchId> Re-run a past match with identical params
civagent tournament --civs a,b,c,d "task"
Parallel match across civilizations + judge ranking
civagent tournament --civs a,b,c,d --multi-judge [--judges N] "task"
Blind multi-judge evaluation (default N=2)
civagent tournament --civs a,b,c,d --prompt-bank [--seed N]
Use a random scenario from the built-in prompt bank
civagent agents Show generated CC agents for active regime
civagent modes List 6 orchestration modes
civagent setup Check tool availability (CC, Codex, opencode, cn-cc)

${BOLD}Examples:${NC}
civagent switch china/tang
civagent run --v5 "重构这个模块的代码"
civagent replay 2024-01-15T10-30-00-000-ab12
civagent skills china/tang --stats
civagent tournament --civs china/tang,china/qin,global/athens,global/roman-republic \\
"how do we handle a famine on the eastern frontier?"
civagent tournament --civs china/tang,china/qin --multi-judge --judges 3 --prompt-bank

${BOLD}Regimes:${NC}
20 Chinese dynasties: xia, shang, zhou, qin, han, tang, song, ming, qing, ...
Expand Down Expand Up @@ -258,7 +266,15 @@ cmd_setup() {

cmd_skills() {
local regime="${1:-}"
[[ -n "$regime" ]] || { echo "Usage: civagent skills <region/regime-id>"; exit 1; }
local stats=false
shift || true
while [[ $# -gt 0 ]]; do
case "$1" in
--stats) stats=true; shift ;;
*) shift ;;
esac
done
[[ -n "$regime" ]] || { echo "Usage: civagent skills <region/regime-id> [--stats]"; exit 1; }
local dir="$REGIMES_DIR/$regime/skills"
if [[ ! -d "$dir" ]]; then
echo "No learned skills yet for $regime."
Expand All @@ -270,6 +286,23 @@ cmd_skills() {
echo -e " ${GREEN}•${NC} $(basename "$f")"
python3 -c "import re,sys;t=open(sys.argv[1]).read();m=re.search(r'description:\s*(.+)',t);print(' '+m.group(1)) if m else None" "$f" 2>/dev/null
done
if $stats; then
echo ""
echo -e "${BOLD}Quality analysis${NC}"
node --input-type=module <<EOF
import { analyzeSkillsDir } from '$ENGINE_DIR/v5/skill-quality.mjs';
const r = analyzeSkillsDir('$dir');
console.log(' Total skills : ' + r.total);
console.log(' Unique topics : ' + r.uniqueTopics.length);
if (r.stats.duplicateCount) console.log(' Near-duplicates : ' + r.stats.duplicateCount);
if (r.stats.firstSedimented) console.log(' First sedimented : ' + r.stats.firstSedimented);
if (r.stats.lastSedimented) console.log(' Last sedimented : ' + r.stats.lastSedimented);
if (r.duplicateGroups.length) {
console.log(' Duplicate groups :');
r.duplicateGroups.forEach(g => console.log(' ' + g.join(', ')));
}
EOF
fi
}

cmd_match_log() {
Expand All @@ -286,20 +319,51 @@ cmd_match_log() {
done
}

cmd_replay() {
local match_id="${1:-}"
[[ -n "$match_id" ]] || { echo "Usage: civagent replay <matchId>"; exit 1; }
exec node --input-type=module <<EOF
import { replayMatch } from '$ENGINE_DIR/v5/replay.mjs';
replayMatch('$match_id').then(({ newMatchId, exitCode }) => {
console.error('[replay] done — new match: ' + newMatchId);
process.exit(exitCode ?? 0);
}).catch(e => { console.error('[replay] error:', e.message); process.exit(1); });
EOF
}

cmd_tournament() {
local civs=""
local multi_judge=false
local judges_n=""
local prompt_bank=false
local seed=""
local rest=()
while [[ $# -gt 0 ]]; do
case "$1" in
--civs) civs="$2"; shift 2 ;;
*) rest+=("$1"); shift ;;
--civs) civs="$2"; shift 2 ;;
--multi-judge) multi_judge=true; shift ;;
--judges) judges_n="$2"; shift 2 ;;
--prompt-bank) prompt_bank=true; shift ;;
--seed) seed="$2"; shift 2 ;;
*) rest+=("$1"); shift ;;
esac
done
if [[ -z "$civs" || ${#rest[@]} -eq 0 ]]; then
if [[ -z "$civs" ]]; then
echo "Usage: civagent tournament --civs a/x,b/y,c/z [--multi-judge] [--judges N] [--prompt-bank] [--seed N] \"task\""
exit 1
fi
if ! $prompt_bank && [[ ${#rest[@]} -eq 0 ]]; then
echo "Usage: civagent tournament --civs a/x,b/y,c/z \"task prompt\""
echo " civagent tournament --civs a/x,b/y,c/z --prompt-bank"
exit 1
fi
exec node "$ENGINE_DIR/v5/tournament.mjs" --civs "$civs" "${rest[@]}"
local node_args=("$ENGINE_DIR/v5/tournament.mjs" --civs "$civs")
$multi_judge && node_args+=(--multi-judge)
[[ -n "$judges_n" ]] && node_args+=(--judges "$judges_n")
$prompt_bank && node_args+=(--prompt-bank)
[[ -n "$seed" ]] && node_args+=(--seed "$seed")
node_args+=("${rest[@]}")
exec node "${node_args[@]}"
}

# ── main ─────────────────────────────────────────────────────────────────────
Expand All @@ -312,8 +376,9 @@ case "${1:-}" in
agents) cmd_agents ;;
modes) cmd_modes ;;
setup) cmd_setup ;;
skills) cmd_skills "${2:-}" ;;
skills) shift; cmd_skills "${1:-}" "${@:2}" ;;
match-log) cmd_match_log ;;
replay) cmd_replay "${2:-}" ;;
tournament) shift; cmd_tournament "$@" ;;
help|--help|-h|"") usage ;;
*) echo "Unknown command: $1"; usage; exit 1 ;;
Expand Down
52 changes: 52 additions & 0 deletions engine/prompts/governance-scenarios.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
[
{
"id": "frontier-defense-01",
"category": "military",
"prompt": "Establish secure and robust border defense policies for agricultural frontiers facing seasonal tribal raids."
},
{
"id": "succession-crisis-01",
"category": "political",
"prompt": "The head of state has died without a designated heir. Multiple factions claim legitimacy. Design a governance process to resolve the succession crisis without civil war."
},
{
"id": "economic-collapse-01",
"category": "economic",
"prompt": "A sudden 40% drop in tax revenue threatens the state treasury. Propose emergency fiscal measures that preserve core governance functions while minimising unrest."
},
{
"id": "external-threat-01",
"category": "diplomacy",
"prompt": "A neighbouring power demands territorial concessions under threat of war. Your military capacity is roughly equal. Design a multi-channel response strategy."
},
{
"id": "internal-revolt-01",
"category": "internal",
"prompt": "A major province is threatening secession after three consecutive years of drought and perceived neglect from the centre. Draft a reconciliation and relief plan."
},
{
"id": "technological-disruption-01",
"category": "innovation",
"prompt": "A new military technology is spreading rapidly among rivals. Design an adoption strategy and a counter-proliferation policy that fits your governance structure."
},
{
"id": "plague-response-01",
"category": "crisis",
"prompt": "An epidemic is spreading through major cities, killing 5–10% of the population per month. Coordinate an immediate public-health response within your governance structure."
},
{
"id": "trade-route-disruption-01",
"category": "economic",
"prompt": "The main trade route has been cut off by a rival power. Design a diversification strategy to maintain economic stability within 12 months."
},
{
"id": "famine-relief-01",
"category": "crisis",
"prompt": "Crop failure across three provinces threatens mass starvation. Allocate grain reserves, coordinate relief logistics, and prevent hoarding and speculation."
},
{
"id": "religious-schism-01",
"category": "social",
"prompt": "A doctrinal split has divided the state's official religion. Factions are mobilising followers against each other. Draft a policy to maintain civil order and state authority."
}
]
Loading
Loading