feat(eval): add PersonaMem MCQ benchmark + full v1-32k raw eval (69.4%) #40
PersonaMem was the last registered benchmark with no report. It is a 4-option multiple-choice personalization benchmark, but was wired to free-text QA + LLM judge that never showed the model the options. Rewire it to exact-match MCQ letter accuracy (no LLM judge), matching the dataset authors' protocol.

- adapter: ast.literal_eval fallback for Python-repr `all_options` (recovers 303/589 rows json.loads silently dropped to 0 options); one scenario per (context, end_index) cut point so each question's haystack is exactly turns[:end_index] -- no future/cross-persona leak; carry options + gold letter in metadata.
- judge: select_choice() MCQ reader (temp 0, presents options).
- bench: per-cut-point partition ingest (mem_ prefix; partition ids must match ^mem_[a-z0-9_]+$), partition-scoped retrieval, letter exact-match, per-question-type accuracy; ingest + search retry so a transient timeout scores one question (counted in error_rate) instead of aborting the run.

Full v1-32k raw run (589 q, all-MiniLM-384 + bge-reranker-base, top-10, DeepSeek-V4-Pro reader): MCQ accuracy 69.4% (409/589), above the ~50-52% full-context frontier oracle and the 25% chance baseline. Report in eval/reports/personamem/v1/run-1/; public pages (EN + zh) rewritten from the stale free-text framing; README metric/version fixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review
This pull request implements the PersonaMem multiple-choice personalization benchmark, including its runner, dataset adapter, and an MCQ prompt/reader method. It also updates the documentation and adds evaluation reports showing a 69.4% accuracy. The review feedback suggests logging exceptions in `_ensure_partition` instead of swallowing them silently, and recommends replacing fragile manual string slicing with the robust `_PAREN_LETTER_RE` regex when parsing option letters.
    async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
        """Create the partition (idempotent — swallow already-exists errors)."""
        try:
            await client.create_partition(partition_id, name=partition_id)
        except Exception:
            pass
Swallowing all exceptions silently in `_ensure_partition` can hide genuine issues like network failures, authentication errors, or configuration problems, making them hard to debug. It is highly recommended to at least log the exception at a DEBUG or WARNING level so that unexpected failures are visible in the logs.
Suggested change:

    async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
        """Create the partition (idempotent — swallow already-exists errors)."""
        try:
            await client.create_partition(partition_id, name=partition_id)
        except Exception as e:
            logger.debug("Failed to ensure partition %s (it may already exist): %s", partition_id, e)
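Going one step further than logging, the handler could swallow only the already-exists case and let every other failure propagate. The following is a minimal sketch under stated assumptions: `PartitionExistsError` and `FakeClient` are hypothetical stand-ins, because the real `HebbClient` exception types are not shown in this PR.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class PartitionExistsError(Exception):
    """Hypothetical error type; the real HebbClient exception is not shown here."""


class FakeClient:
    """Hypothetical stand-in for HebbClient so the sketch can run on its own."""

    def __init__(self) -> None:
        self.partitions: set[str] = set()

    async def create_partition(self, partition_id: str, name: str) -> None:
        if partition_id in self.partitions:
            raise PartitionExistsError(partition_id)
        self.partitions.add(partition_id)


async def ensure_partition(client: FakeClient, partition_id: str) -> None:
    """Create the partition, ignoring only the already-exists case."""
    try:
        await client.create_partition(partition_id, name=partition_id)
    except PartitionExistsError:
        logger.debug("Partition %s already exists", partition_id)
    # Anything else (network, auth, config) propagates to the caller.


async def demo() -> set[str]:
    client = FakeClient()
    await ensure_partition(client, "mem_demo")
    await ensure_partition(client, "mem_demo")  # second call is a no-op
    return client.partitions


result = asyncio.run(demo())
```

Whether this is preferable to a blanket `except Exception` with logging depends on whether the client exposes a distinguishable already-exists error, which this PR does not show.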
    valid = {
        o.strip()[1].lower()
        for o in options
        if len(o.strip()) > 2 and o.strip()[0] == "("
    }
The manual string slicing `o.strip()[1]` to extract the option letter is fragile and assumes a strict format. Since `_PAREN_LETTER_RE` is already defined in this file, it should be used to robustly parse the option letters and avoid potential issues if the option format has leading spaces or other minor variations.
Suggested change:

    valid = {
        m.group(1).lower()
        for o in options
        if (m := _PAREN_LETTER_RE.match(o.strip()))
    }
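To make the difference concrete, here is a small side-by-side sketch. The pattern assigned to `PAREN_LETTER_RE` is an assumption: the real `_PAREN_LETTER_RE` is defined elsewhere in the file and may differ. The option strings are made up for illustration.

```python
import re

# Assumed pattern; the real _PAREN_LETTER_RE is not shown in this PR.
PAREN_LETTER_RE = re.compile(r"^\(([A-Za-z])\)")


def valid_letters_sliced(options: list[str]) -> set[str]:
    """Slicing approach: accept whatever character follows the opening paren."""
    return {
        o.strip()[1].lower()
        for o in options
        if len(o.strip()) > 2 and o.strip()[0] == "("
    }


def valid_letters_regex(options: list[str]) -> set[str]:
    """Regex approach: accept only a single letter wrapped in parentheses."""
    return {
        m.group(1).lower()
        for o in options
        if (m := PAREN_LETTER_RE.match(o.strip()))
    }


# Illustrative options, including two that are not real "(x)" labels.
options = ["(a) Tea", " (B) Coffee", "(see note) Water", "(1) Juice"]
sliced = valid_letters_sliced(options)
matched = valid_letters_regex(options)
```

With these inputs the slicing version also admits `s` and `1` as "letters", while the regex version keeps only `a` and `b`.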
    chosen_text = next(
        (o for o in options if o.strip()[1:2].lower() == chosen),
        f"({chosen})",
    )
Similarly, using `o.strip()[1:2]` to find the matching option is fragile. Using the existing `_PAREN_LETTER_RE` regex makes this extraction much more robust and consistent with the rest of the parsing logic.
Suggested change:

    chosen_text = next(
        (o for o in options if (m := _PAREN_LETTER_RE.match(o.strip())) and m.group(1).lower() == chosen),
        f"({chosen})",
    )
Summary
PersonaMem was the last registered benchmark with no report. It is a 4-option multiple-choice personalization benchmark, but the wired-up bench inherited free-text QA + LLM judge that never showed the model the options. This PR rewires it to exact-match MCQ letter accuracy (no LLM judge) — the dataset authors' own protocol ("No LLM judges are involved") — and runs the full v1-32k split.
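As a rough illustration of exact-match letter scoring with a per-question-type breakdown, consider the sketch below. The row keys (`question_type`, `predicted`, `gold`) and the `score` helper are illustrative assumptions, not the names used in `personamem_bench.py`, and the rows are made up.

```python
from collections import defaultdict


def score(rows: list[dict]) -> dict:
    """Exact-match letter accuracy, overall and per question type."""
    # Maps question type -> [correct, total].
    per_type: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        hit = row["predicted"].strip().lower() == row["gold"].strip().lower()
        bucket = per_type[row["question_type"]]
        bucket[0] += int(hit)
        bucket[1] += 1
    correct = sum(c for c, _ in per_type.values())
    total = sum(t for _, t in per_type.values())
    return {
        "accuracy": correct / total if total else 0.0,
        "per_type": {k: c / t for k, (c, t) in per_type.items()},
    }


# Illustrative rows only; not taken from the actual run.
rows = [
    {"question_type": "recall", "predicted": "a", "gold": "A"},
    {"question_type": "recall", "predicted": "b", "gold": "b"},
    {"question_type": "suggest", "predicted": "c", "gold": "d"},
    {"question_type": "suggest", "predicted": "d", "gold": "d"},
]
report = score(rows)
```

The headline figure follows the same arithmetic: 409 correct out of 589 questions rounds to 69.4%.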
Fixes that made it runnable & faithful
- Adapter (`eval/datasets/personamem.py`): `ast.literal_eval` fallback for Python-repr `all_options` (recovers 303/589 rows that `json.loads` silently dropped to 0 options); one scenario per `(context, end_index)` cut point so each question's haystack is exactly `turns[:end_index]`, with no future-turn or cross-persona leakage; carries options + gold letter in metadata.
- Judge (`eval/judge.py`): `select_choice()` MCQ reader (temp 0, presents the 4 options).
- Bench (`eval/benchmarks/personamem_bench.py`): per-cut-point partition ingest (`mem_` prefix; partition ids must match `^mem_[a-z0-9_]+$`), partition-scoped retrieval, letter exact-match, per-question-type accuracy; ingest + search retry so a transient timeout scores one question (counted in `error_rate`) instead of aborting all 589.

Result: `eval/reports/personamem/v1/run-1/` (589 q, shipped default = all-MiniLM-384 + bge-reranker-base rerank, top-10, DeepSeek-V4-Pro reader temp 0)
MCQ accuracy: 69.4% (409/589).
Above the 25% random baseline and the ~50–52% full-context frontier oracle (arXiv:2504.14225), while reading only the top-10 retrieved memories. Strongest on recalling why a preference changed (88.9%), weakest on the generative "suggest new ideas" type (39.8%), matching the paper's difficulty curve.
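The adapter's parsing fallback described above can be sketched roughly as follows. This is an illustration, not the code in `eval/datasets/personamem.py`: the `parse_options` name and the sample strings are assumptions.

```python
import ast
import json


def parse_options(raw: str) -> list[str]:
    """Parse a serialized option list, accepting JSON or Python-repr text."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        try:
            # Python repr uses single quotes, which json.loads rejects.
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            return []
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(o) for o in parsed]


# Illustrative inputs: one JSON-encoded row and one Python-repr row.
json_row = '["(a) Tea", "(b) Coffee"]'
repr_row = "['(a) Tea', \"(b) Don't know\"]"

from_json = parse_options(json_row)
from_repr = parse_options(repr_row)
unparseable = parse_options("not a list")
```

`ast.literal_eval` only evaluates literals, so it is a reasonably safe fallback for dataset text, unlike `eval`.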
Public pages (EN + `zh/` mirror) rewritten from the stale free-text "67.6% QA (37q)" framing to the correct MCQ framing with the right anchors; `eval/README.md` table row fixed (was "QA judge / —").

Test plan
- `pytest tests/eval tests/test_eval`

Note (separate issue)
The `^mem_[a-z0-9_]+$` partition validation also rejects MemBench's `membench_…` partition ids, so a MemBench re-run today would 404 on ingest (its existing reports predate the validation). Not fixed here.

🤖 Generated with Claude Code
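The partition-id constraint from the note can be checked directly. The regex is the one quoted in this PR; the `is_valid_partition_id` helper and the example ids are hypothetical, chosen only to show which shapes pass.

```python
import re

# Pattern quoted in the PR description.
PARTITION_ID_RE = re.compile(r"^mem_[a-z0-9_]+$")


def is_valid_partition_id(partition_id: str) -> bool:
    """Return True if the id satisfies the quoted partition-id pattern."""
    return PARTITION_ID_RE.fullmatch(partition_id) is not None


# Hypothetical ids for illustration only.
checks = {
    "mem_personamem_ctx1_42": is_valid_partition_id("mem_personamem_ctx1_42"),
    "membench_run1": is_valid_partition_id("membench_run1"),
    "mem_Upper": is_valid_partition_id("mem_Upper"),
}
```

An id beginning with `membench_` fails because the pattern requires the literal prefix `mem_`, and the character after `mem` there is `b`, not an underscore.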