feat(eval): add PersonaMem MCQ benchmark + full v1-32k raw eval (69.4%) by ch-liuzhide · Pull Request #40 · afx-team/hebb-mind

ch-liuzhide · 2026-06-18T07:17:17Z

Summary

PersonaMem was the last registered benchmark with no report. It is a 4-option multiple-choice personalization benchmark, but the wired-up bench inherited free-text QA + LLM judge that never showed the model the options. This PR rewires it to exact-match MCQ letter accuracy (no LLM judge) — the dataset authors' own protocol ("No LLM judges are involved") — and runs the full v1-32k split.

Fixes that made it runnable & faithful

Adapter (eval/datasets/personamem.py): ast.literal_eval fallback for Python-repr all_options (recovers 303/589 rows json.loads silently dropped to 0 options); one scenario per (context, end_index) cut point so each question's haystack is exactly turns[:end_index] — no future-turn or cross-persona leakage; carries options + gold letter in metadata.
Judge (eval/judge.py): select_choice() MCQ reader (temp 0, presents the 4 options).
Bench (eval/benchmarks/personamem_bench.py): per-cut-point partition ingest (mem_ prefix — partition ids must match ^mem_[a-z0-9_]+$), partition-scoped retrieval, letter exact-match, per-question-type accuracy; ingest + search retry so a transient timeout scores one question (counted in error_rate) instead of aborting all 589.
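
The adapter fix above can be illustrated with a minimal sketch of a JSON-first parser that falls back to ast.literal_eval. The function name parse_options and the sample rows are hypothetical and are not taken from eval/datasets/personamem.py; only the json.loads / ast.literal_eval fallback idea comes from the PR.

```python
import ast
import json


def parse_options(raw: str) -> list[str]:
    """Parse a serialized option list that may be JSON or a Python repr."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Python-repr strings use single quotes, which JSON rejects.
        try:
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            return []  # unparseable row: report it rather than score it
    return [str(o) for o in parsed] if isinstance(parsed, list) else []


# Hypothetical rows: one JSON-encoded, one Python-repr-encoded.
json_row = '["(a) tea", "(b) coffee", "(c) juice", "(d) water"]'
repr_row = "['(a) tea', '(b) coffee', \"(c) it's juice\", '(d) water']"

print(len(parse_options(json_row)))  # JSON path
print(len(parse_options(repr_row)))  # literal_eval fallback path
```

A JSON-only parser would fail on the second row, which is the kind of row the PR describes as having been silently dropped to 0 options.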

Result — eval/reports/personamem/v1/run-1/ (589 q, shipped default = all-MiniLM-384 + bge-reranker-base rerank, top-10, DeepSeek-V4-Pro reader temp 0):

Metric	Value
MCQ accuracy	69.4% (409/589)
answered-only (excl. 8 infra timeouts)	70.4%
valid_choice_rate	98.6%
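
The figures in the table can be cross-checked from the raw counts. This is a worked-arithmetic sketch, not the report code; it assumes the 8 infra timeouts are the only questions without a valid letter, which is consistent with the reported valid_choice_rate but is not stated explicitly in the PR.

```python
total = 589        # questions in the v1-32k split
correct = 409      # exact-match letter hits
infra_errors = 8   # timeouts, counted as wrong in the headline metric

answered = total - infra_errors          # 581 questions with a reader answer

accuracy = correct / total               # headline MCQ accuracy
answered_only = correct / answered       # accuracy excluding timeouts
valid_choice_rate = answered / total     # assumption: timeouts are the only invalid rows
random_baseline = 1 / 4                  # 4 options per question

print(f"accuracy          {accuracy:.1%}")
print(f"answered-only     {answered_only:.1%}")
print(f"valid_choice_rate {valid_choice_rate:.1%}")
print(f"random baseline   {random_baseline:.0%}")
```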

The result is above the 25% random baseline and the ~50–52% full-context frontier oracle (arXiv:2504.14225), while reading only the top-10 retrieved memories. Strongest on recalling why a preference changed (88.9%), weakest on the generative "suggest new ideas" type (39.8%), matching the paper's difficulty curve.

Public pages (EN + zh/ mirror) rewritten from the stale free-text "67.6% QA (37q)" framing to correct MCQ framing with the right anchors; eval/README.md table row fixed (was "QA judge / —").

A 3-lens adversarial verification (correctness / methodology / isolation) confirmed the metric is faithful to the official protocol and the isolation is leak-free.
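
The isolation property being verified, one scenario per (context, end_index) with a haystack of exactly turns[:end_index], can be sketched as follows. The field names (context_id, end_index, haystack) and the sample data are hypothetical; the actual adapter may structure its records differently.

```python
from collections import defaultdict


def build_scenarios(questions, contexts):
    """Group questions by cut point; each group sees only turns[:end_index]."""
    grouped = defaultdict(list)
    for q in questions:
        grouped[(q["context_id"], q["end_index"])].append(q)

    scenarios = []
    for (context_id, end_index), qs in grouped.items():
        scenarios.append({
            "context_id": context_id,
            "end_index": end_index,
            # Slice, never the full context: no turn at or after end_index leaks in.
            "haystack": contexts[context_id][:end_index],
            "questions": qs,
        })
    return scenarios


# Hypothetical data: one context with four turns, three questions, two cut points.
contexts = {"ctx1": ["t0", "t1", "t2", "t3"]}
questions = [
    {"id": "q1", "context_id": "ctx1", "end_index": 2},
    {"id": "q2", "context_id": "ctx1", "end_index": 2},
    {"id": "q3", "context_id": "ctx1", "end_index": 4},
]

for s in build_scenarios(questions, contexts):
    print(s["context_id"], s["end_index"], s["haystack"], [q["id"] for q in s["questions"]])
```

Keying on the pair rather than on the context alone is what keeps a question with an early cut point from seeing turns that were ingested for a later question on the same persona.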

Test plan

All 31 eval tests pass (pytest tests/eval tests/test_eval)
Adapter loads 589 questions across 222 scenarios; every question has 4 options + gold letter present
Full 589-q run completes; report + per-category breakdown render
(optional, not in this PR) same-reader full-context control to isolate retrieval's contribution

Note (separate issue)

The ^mem_[a-z0-9_]+$ partition validation also rejects MemBench's membench_… partition ids, so a MemBench re-run today would 404 on ingest (its existing reports predate the validation). Not fixed here.
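
Why the pattern rejects MemBench-style ids can be seen by running it directly. The pattern is the one quoted in this PR; the sample ids below are hypothetical and only illustrate the prefix mismatch.

```python
import re

# Pattern quoted in the PR description.
PARTITION_ID_RE = re.compile(r"^mem_[a-z0-9_]+$")


def is_valid_partition_id(partition_id: str) -> bool:
    """Return True if the id satisfies the server-side partition pattern."""
    return PARTITION_ID_RE.fullmatch(partition_id) is not None


# Hypothetical ids for illustration only.
for pid in ["mem_personamem_ctx1_cut40", "membench_example", "mem_Upper"]:
    print(pid, is_valid_partition_id(pid))
```

An id starting with membench_ fails because the fourth character must be an underscore, not "b"; prefixing such ids with mem_ would be one possible fix, which this PR does not attempt.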

🤖 Generated with Claude Code

PersonaMem was the last registered benchmark with no report. It is a 4-option multiple-choice personalization benchmark, but was wired to free-text QA + LLM judge that never showed the model the options. Rewire it to exact-match MCQ letter accuracy (no LLM judge), matching the dataset authors' protocol.

- adapter: ast.literal_eval fallback for Python-repr `all_options` (recovers 303/589 rows json.loads silently dropped to 0 options); one scenario per (context, end_index) cut point so each question's haystack is exactly turns[:end_index] -- no future/cross-persona leak; carry options + gold letter in metadata.
- judge: select_choice() MCQ reader (temp 0, presents options).
- bench: per-cut-point partition ingest (mem_ prefix; partition ids must match ^mem_[a-z0-9_]+$), partition-scoped retrieval, letter exact-match, per-question-type accuracy; ingest + search retry so a transient timeout scores one question (counted in error_rate) instead of aborting the run.

Full v1-32k raw run (589 q, all-MiniLM-384 + bge-reranker-base, top-10, DeepSeek-V4-Pro reader): MCQ accuracy 69.4% (409/589), above the ~50-52% full-context frontier oracle and the 25% chance baseline. Report in eval/reports/personamem/v1/run-1/; public pages (EN + zh) rewritten from the stale free-text framing; README metric/version fixed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements the PersonaMem multiple-choice personalization benchmark, including its runner, dataset adapter, and an MCQ prompt/reader method. It also updates the documentation and adds evaluation reports showing a 69.4% accuracy. The review feedback suggests logging exceptions in _ensure_partition instead of swallowing them silently, and recommends replacing fragile manual string slicing with the robust _PAREN_LETTER_RE regex when parsing option letters.


gemini-code-assist · 2026-06-18T07:20:30Z

+async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
+    """Create the partition (idempotent — swallow already-exists errors)."""
+    try:
+        await client.create_partition(partition_id, name=partition_id)
+    except Exception:
+        pass


Swallowing all exceptions silently in _ensure_partition can hide genuine issues like network failures, authentication errors, or configuration problems, making them hard to debug. It is highly recommended to at least log the exception at a DEBUG or WARNING level so that unexpected failures are visible in the logs.

Suggested change

-async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
-    """Create the partition (idempotent — swallow already-exists errors)."""
-    try:
-        await client.create_partition(partition_id, name=partition_id)
-    except Exception:
-        pass
+async def _ensure_partition(client: HebbClient, partition_id: str) -> None:
+    """Create the partition (idempotent — swallow already-exists errors)."""
+    try:
+        await client.create_partition(partition_id, name=partition_id)
+    except Exception as e:
+        logger.debug("Failed to ensure partition %s (it may already exist): %s", partition_id, e)

gemini-code-assist · 2026-06-18T07:20:31Z

+            valid = {
+                o.strip()[1].lower()
+                for o in options
+                if len(o.strip()) > 2 and o.strip()[0] == "("
+            }


The manual string slicing o.strip()[1] to extract the option letter is fragile and assumes a strict format. Since _PAREN_LETTER_RE is already defined in this file, it should be used to robustly parse the option letters and avoid potential issues if the option format has leading spaces or other minor variations.

Suggested change

-            valid = {
-                o.strip()[1].lower()
-                for o in options
-                if len(o.strip()) > 2 and o.strip()[0] == "("
-            }
+            valid = {
+                m.group(1).lower()
+                for o in options
+                if (m := _PAREN_LETTER_RE.match(o.strip()))
+            }

gemini-code-assist · 2026-06-18T07:20:31Z

+                    chosen_text = next(
+                        (o for o in options if o.strip()[1:2].lower() == chosen),
+                        f"({chosen})",
+                    )


Similarly, using o.strip()[1:2] to find the matching option is fragile. Using the existing _PAREN_LETTER_RE regex makes this extraction much more robust and consistent with the rest of the parsing logic.

Suggested change

-                    chosen_text = next(
-                        (o for o in options if o.strip()[1:2].lower() == chosen),
-                        f"({chosen})",
-                    )
+                    chosen_text = next(
+                        (o for o in options if (m := _PAREN_LETTER_RE.match(o.strip())) and m.group(1).lower() == chosen),
+                        f"({chosen})",
+                    )
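
The difference between the two parsing approaches in these review comments can be shown on malformed options. This is a sketch only: the definition of _PAREN_LETTER_RE is not shown in this PR, so the pattern below (PAREN_LETTER_RE, a single letter in parentheses at the start of the string) is an assumption, and the sample options are hypothetical.

```python
import re

# Assumed pattern; the repository's _PAREN_LETTER_RE may differ.
PAREN_LETTER_RE = re.compile(r"^\(([a-zA-Z])\)")


def letters_by_slicing(options):
    """Original approach: take the character right after the opening paren."""
    return {
        o.strip()[1].lower()
        for o in options
        if len(o.strip()) > 2 and o.strip()[0] == "("
    }


def letters_by_regex(options):
    """Suggested approach: accept only a well-formed '(x)' prefix."""
    return {
        m.group(1).lower()
        for o in options
        if (m := PAREN_LETTER_RE.match(o.strip()))
    }


# Two well-formed options and two malformed ones.
options = ["(a) tea", " (b) coffee", "(12) a dozen", "( c) juice"]

print(sorted(letters_by_slicing(options)))
print(sorted(letters_by_regex(options)))
```

Slicing treats "1" and a space as valid option letters for the malformed rows, while the regex skips them, so the set of valid letters cannot be polluted by formatting noise.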

gemini-code-assist Bot reviewed Jun 18, 2026

ch-liuzhide merged commit 0c57bc2 into main Jun 18, 2026
18 checks passed

ch-liuzhide deleted the eval/personamem-mcq branch June 18, 2026 11:44

ch-liuzhide merged 1 commit into main from eval/personamem-mcq
