Canonical identity and schema upgrade tooling#116
yananlong wants to merge 26 commits into evaleval:main from
Conversation
Add metric identity and deterministic aggregate-instance linkage fields
Co-authored-by: yananlong <25546626+yananlong@users.noreply.github.com>
…or content_hash is present Co-authored-by: yananlong <25546626+yananlong@users.noreply.github.com>
Remove unsafe hash algorithms from dedupe_identity block
Require hash_algorithm when run_fingerprint or content_hash is present in dedupe_identity
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add optional dedupe_identity block to aggregate schema
- Extend augmentation rules to cover remaining benchmark families and metric phrases
- Add check-canonical-identity package command and CLI wiring
- Add regression tests for new augmentation rules and checker behavior
- Document canonical identity backfill/audit commands in README

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only count JSON files as aggregate candidates when they contain an evaluation_results field, so datastore-root scans avoid false positives while still flagging malformed aggregate payloads. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve schema-path conflict by keeping the upstream symlinked eval schema entry and dropping the fork-side legacy standalone schema file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enforce strict 0.2.2 schemas and add schema upgrader CLI
Extend canonical backfill and path normalization coverage
```diff
         extra='forbid',
     )
-    schema_version: str = Field(
+    schema_version: Literal['0.2.2'] = Field(
```
Do you think it is a good idea to hardcode this value? It would be ideal to change the version in only one place after a new version is released, to make it as painless as possible. Ideally these literals would be fetched when eval_types and instance_level_types are rebuilt from the JSON schemas.
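One way to keep the version in a single place, sketched here with a plain dataclass rather than the project's actual pydantic models (the class name and validation hook are hypothetical), is a module-level constant checked at construction time:

```python
from dataclasses import dataclass

# Hypothetical single source of truth; in practice this could be read from the
# JSON schema when eval_types / instance_level_types are regenerated.
SCHEMA_VERSION = '0.2.2'

@dataclass
class EvaluationLog:
    schema_version: str = SCHEMA_VERSION

    def __post_init__(self) -> None:
        # Enforce the pinned version at runtime instead of hardcoding a
        # Literal type in every generated model.
        if self.schema_version != SCHEMA_VERSION:
            raise ValueError(
                f'expected schema_version {SCHEMA_VERSION!r}, '
                f'got {self.schema_version!r}'
            )
```

Bumping `SCHEMA_VERSION` after a release would then update every model that imports it.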
```diff
         extra='forbid',
     )
-    schema_version: str = Field(
+    schema_version: Literal['instance_level_eval_0.2.2'] = Field(
```
The same case as in eval_types.py.
```python
from every_eval_ever.validate import expand_paths

AGGREGATE_SCHEMA_VERSION = '0.2.2'
```
There should be some global state for these version variables (like in `every_eval_ever/converters/__init__.py`) to avoid hardcoding them in all places.
I'm flagging it because this script could be somewhat useful for fixing data in the future.
# Conflicts:
# every_eval_ever/converters/inspect/adapter.py
# every_eval_ever/converters/inspect/instance_level_adapter.py
Pull request overview
Copilot reviewed 20 out of 21 changed files in this pull request and generated 4 comments.
```json
"files_scanned": 6448,
"results_scanned": 49659,
"missing": {
    "evaluation_result_id": 37071,
    "metric_id": 37071,
    "metric_name": 37071,
    "metric_kind": 37071,
    "metric_unit": 37071
},
"malformed": {
    "evaluation_result_id_pattern": 12588
},
```
This file looks like generated audit output rather than a source artifact. Consider moving these before/after audit snapshots into documentation (e.g., README) or removing them from version control to keep the repo lean and avoid stale data.
Suggested change:

```diff
 "files_scanned": 6448,
 "results_scanned": 49659,
 "missing": {
     "evaluation_result_id": 37071,
     "metric_id": 37071,
     "metric_name": 37071,
     "metric_kind": 37071,
     "metric_unit": 37071
 },
 "malformed": {
     "evaluation_result_id_pattern": 12588
 },
+"_notice": "Generated audit snapshot content should not be maintained as a source artifact. Move before/after audit summaries to project documentation or regenerate them locally when needed.",
```
```python
def upgrade_instance_file(
    file_path: Path, *, write_changes: bool = False
) -> SchemaVersionUpgradeReport:
    rows: list[dict] = []
    old_versions: set[str | None] = set()
    changed_rows = 0

    with file_path.open(encoding='utf-8') as handle:
        for raw_line in handle:
            stripped = raw_line.strip()
            if not stripped:
                continue
            row = json.loads(stripped)
            if not isinstance(row, dict):
                raise ValueError(
                    f'Instance payload line is not a JSON object: {file_path}'
                )
            old_version = row.get('schema_version')
            if old_version != INSTANCE_SCHEMA_VERSION:
                old_versions.add(old_version)
                row['schema_version'] = INSTANCE_SCHEMA_VERSION
                changed_rows += 1
            rows.append(row)

    if write_changes and changed_rows:
        _write_jsonl(file_path, rows)
```
upgrade_instance_file reads the entire JSONL file into memory before writing. For large instance-level files this can be a significant memory spike. Consider streaming: write each (possibly updated) row to a temporary file and then replace the original when --write is set, while still counting rows_changed/rows_scanned.
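A streaming variant might look like the following sketch (the function name and `INSTANCE_SCHEMA_VERSION` constant here are assumptions mirroring the PR, not its actual API):

```python
import json
import os
import tempfile
from pathlib import Path

# Assumed constant mirroring the PR's instance schema version.
INSTANCE_SCHEMA_VERSION = 'instance_level_eval_0.2.2'

def upgrade_instance_file_streaming(
    file_path: Path, *, write_changes: bool = False
) -> tuple[int, int]:
    """Rewrite schema_version row by row through a temp file, avoiding the
    full-file buffer; returns (rows_scanned, rows_changed)."""
    rows_scanned = rows_changed = 0
    tmp_fd, tmp_name = tempfile.mkstemp(dir=file_path.parent, suffix='.tmp')
    try:
        with file_path.open(encoding='utf-8') as src, \
                os.fdopen(tmp_fd, 'w', encoding='utf-8') as dst:
            for raw_line in src:
                stripped = raw_line.strip()
                if not stripped:
                    continue
                row = json.loads(stripped)
                rows_scanned += 1
                if row.get('schema_version') != INSTANCE_SCHEMA_VERSION:
                    row['schema_version'] = INSTANCE_SCHEMA_VERSION
                    rows_changed += 1
                dst.write(json.dumps(row) + '\n')
        if write_changes and rows_changed:
            os.replace(tmp_name, file_path)  # atomic swap on POSIX
        else:
            os.unlink(tmp_name)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return rows_scanned, rows_changed
```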
```diff
     def __init__(
         self,
         evaulation_id: str,
         format: str,
         hash_algorithm: str,
         evaluation_dir: str,
+        file_stem: str | None = None,
     ):
         _require_helm_dependencies()
         self.evaluation_id = evaulation_id
         self.format = format
         self.hash_algorithm = hash_algorithm
         self.evaluation_dir = evaluation_dir
-        self.path = f'{evaluation_dir}/{evaulation_id}.{format}'
+        self.path = f'{evaluation_dir}/{file_stem or evaulation_id}.{format}'
```
The constructor parameter is misspelled as evaulation_id. This is easy to miss and prevents using this adapter with keyword arguments (e.g. evaluation_id=...). Rename it to evaluation_id throughout the signature/assignments and update any call sites accordingly.
```python
metric_name: str | None = None
metric_kind: str | None = None
metric_unit: str | None = None
metric_parameters: dict[str, str | float | bool | None] | None = None
```
CanonicalPatch.metric_parameters is annotated as dict[str, str | float | bool | None], but the code assigns ints (e.g., {'k': 1} / {'n': 4}), which will trip static type checkers and can lead to inconsistent expectations. Consider either allowing int in the union type or explicitly casting these parameter values to float/str at assignment time.
Suggested change:

```diff
-metric_parameters: dict[str, str | float | bool | None] | None = None
+metric_parameters: dict[str, str | int | float | bool | None] | None = None
```
evijit left a comment:
Codex review: I left a few inline comments. I ran the new focused tests successfully, then ran the full suite and found one regression tied to the instance schema-version strictness change.
```diff
         extra='forbid',
     )
-    schema_version: str = Field(
+    schema_version: Literal['instance_level_eval_0.2.2'] = Field(
```
Codex: making this field a strict literal exposes a missed producer outside every_eval_ever/converters. The full suite now fails in tests/test_openeval_adapter.py::test_include_instances_writes_valid_jsonl_sidecar and test_non_binary_instance_scores_do_not_claim_correctness because utils/openeval/adapter.py::make_instance_log still passes the aggregate SCHEMA_VERSION value (0.2.2) into InstanceLevelEvaluationLog. Please update that adapter to use instance_level_eval_0.2.2 as well, otherwise any OpenEval instance export path is broken by this change.
We shouldn't hardcode the version, right?
```python
result['evaluation_result_id'] = evaluation_result_id
changed = True

sample_updates[raw_evaluation_name] = {
```
Codex: this mapping is keyed only by the raw evaluation_name, so files with multiple aggregate results for the same eval slice but different metrics overwrite earlier entries. For example, HELM-style results with Acc on DatasetA and F1 on DatasetA produce two distinct evaluation_result_ids, but sample_updates['DatasetA'] keeps only the last one, and every sample row gets linked to that one metric. That silently corrupts the aggregate-to-sample foreign key for multi-metric evals. The updater needs a disambiguator, or it should duplicate/emit one instance row per aggregate result when a sample contributes to multiple metrics.
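A minimal fix, assuming the surrounding updater code, is to widen the key to a (evaluation_name, metric identifier) tuple so each aggregate result keeps its own link (the names below are illustrative, not the PR's actual identifiers):

```python
# Keyed by (raw_evaluation_name, metric_id) so two metrics on the same eval
# slice no longer overwrite each other's evaluation_result_id.
sample_updates: dict[tuple[str, str], dict[str, str]] = {}

def record_update(
    raw_evaluation_name: str, metric_id: str, evaluation_result_id: str
) -> None:
    sample_updates[(raw_evaluation_name, metric_id)] = {
        'evaluation_result_id': evaluation_result_id,
    }

record_update('DatasetA', 'accuracy', 'result-acc')
record_update('DatasetA', 'f1', 'result-f1')
assert len(sample_updates) == 2  # both metrics survive the backfill
```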
```python
    main as augment_canonical_identity_main,
)

forwarded_args = [*args.paths]
```
Codex: augment_canonical_identity.build_parser() added --skip-path-normalization, but the top-level every_eval_ever CLI does not expose or forward it here. As a result python -m every_eval_ever augment-canonical-identity --skip-path-normalization ... exits with unrecognized arguments, even though the underlying command supports the option. Please add the flag to this parser and forward it alongside --write / --skip-samples.
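The missing wiring would look roughly like this (the parser layout is a simplified sketch of the top-level CLI, not the repo's exact code):

```python
import argparse

parser = argparse.ArgumentParser(prog='every_eval_ever')
subparsers = parser.add_subparsers(dest='command')

augment = subparsers.add_parser('augment-canonical-identity')
augment.add_argument('paths', nargs='+')
augment.add_argument('--write', action='store_true')
augment.add_argument('--skip-samples', action='store_true')
# The flag the top-level CLI currently fails to expose:
augment.add_argument('--skip-path-normalization', action='store_true')

def forward(args: argparse.Namespace) -> list[str]:
    # Rebuild the argv forwarded to augment_canonical_identity_main.
    forwarded_args = [*args.paths]
    if args.write:
        forwarded_args.append('--write')
    if args.skip_samples:
        forwarded_args.append('--skip-samples')
    if args.skip_path_normalization:
        forwarded_args.append('--skip-path-normalization')
    return forwarded_args
```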
```python
if benchmark_family == 'theory_of_mind':
    return _theory_of_mind_patch(raw_evaluation_name, metric_config)

return None
```
This leaves the repo's checked-in HAL data uncovered by the backfill. On a copy of the current PR checkout, augment-canonical-identity data/ --write updated all 246 HAL aggregate files, but check-canonical-identity data/ --fail-on-issues still failed with 342 rows missing metric_id, metric_name, metric_kind, and metric_unit. The HAL adapter emits single-score accuracy-style metrics in utils/hal/adapter.py, so either the backfill needs a hal-* rule or the HAL producer should be updated; otherwise the new audit cannot pass on the repo's own data/ directory.
```python
def main(argv: list[str] | None = None) -> int:
    args = build_parser().parse_args(argv)
    file_paths = expand_paths(args.paths)
    reports = [
```
This command assumes every path returned by expand_paths() is an EEE datastore file. That is true for the checked-in data/ directory, but not for common repo-scoped invocations: upgrade-schema-version . walks schema JSON and fixture JSON under tests/data, then crashes on list-shaped HELM fixture JSON; with --write, dict-shaped non-aggregate JSON can also be rewritten with a datastore schema version. Please filter JSON files by aggregate shape, or skip/report non-aggregate JSON instead of passing every .json to upgrade_aggregate_file.
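A shape filter along these lines (the function name is hypothetical) would keep schema files and fixtures out of the upgrade path:

```python
import json
from pathlib import Path

def is_aggregate_candidate(path: Path) -> bool:
    """Treat a JSON file as an EEE aggregate only when it parses to a dict
    carrying an evaluation_results field; skip list-shaped fixtures and
    schema JSON instead of crashing on or rewriting them."""
    if path.suffix != '.json':
        return False
    try:
        payload = json.loads(path.read_text(encoding='utf-8'))
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(payload, dict) and 'evaluation_results' in payload
```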
```python
if write_changes and path_changed:
    if sample_exists and target_sample_file_path is not None:
        _move_file(sample_file_path, target_sample_file_path)
```
Moving the sample sidecar before the aggregate can leave the datastore split if the aggregate move fails. I reproduced this with an over-nested aggregate plus sidecar where the normalized aggregate target already existed: the sidecar was moved to the normalized directory, then _move_file(file_path, target_file_path) raised FileExistsError, leaving the aggregate in the old path and its samples in the new path. Preflight both target paths before moving either file, or move through a rollback-safe sequence.
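One rollback-safe shape, sketched here independently of the module's actual _move_file helper, is to validate every source/target pair before touching the filesystem:

```python
from pathlib import Path

def preflight_moves(moves: list[tuple[Path, Path]]) -> None:
    """Check all (source, target) pairs up front so a FileExistsError on the
    aggregate cannot strand an already-moved sample sidecar."""
    for source, target in moves:
        if not source.exists():
            raise FileNotFoundError(source)
        if target.exists():
            raise FileExistsError(target)
    # Only after every pair passes do we perform any move.
    for source, target in moves:
        target.parent.mkdir(parents=True, exist_ok=True)
        source.rename(target)
```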