
Canonical identity and schema upgrade tooling #116

Open
yananlong wants to merge 26 commits into evaleval:main from yananlong:main

Conversation

@yananlong
Contributor

Summary

  • add canonical identity backfill and validation tooling
  • add strict schema upgrade support for 0.2.2 records
  • extend converter output validation and tests
  • remove the tracked temp datastore gitlink from the fork branch

Validation

  • .venv/bin/python -m pytest

yananlong and others added 23 commits February 23, 2026 17:48
Add metric identity and deterministic aggregate-instance linkage fields
Co-authored-by: yananlong <25546626+yananlong@users.noreply.github.com>
…or content_hash is present

Co-authored-by: yananlong <25546626+yananlong@users.noreply.github.com>
Remove unsafe hash algorithms from dedupe_identity block
Require hash_algorithm when run_fingerprint or content_hash is present in dedupe_identity
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add optional dedupe_identity block to aggregate schema
- Extend augmentation rules to cover remaining benchmark families and metric phrases
- Add check-canonical-identity package command and CLI wiring
- Add regression tests for new augmentation rules and checker behavior
- Document canonical identity backfill/audit commands in README

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only count JSON files as aggregate candidates when they contain an evaluation_results field, so datastore-root scans avoid false positives while still flagging malformed aggregate payloads.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve schema-path conflict by keeping the upstream symlinked eval schema entry and dropping the fork-side legacy standalone schema file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enforce strict 0.2.2 schemas and add schema upgrader CLI
Extend canonical backfill and path normalization coverage
@yananlong yananlong marked this pull request as ready for review April 25, 2026 10:12
Copilot AI review requested due to automatic review settings April 25, 2026 10:12
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

extra='forbid',
)
schema_version: str = Field(
schema_version: Literal['0.2.2'] = Field(
Collaborator


Do you think it is a good idea to hardcode this value? It would be ideal to change the version in only one place after the release of a new version, to make it as painless as possible. In this case it would be ideal to have these literals fetched when eval_types and instance_level_types are rebuilt from the JSON schemas.
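A minimal sketch of the reviewer's suggestion, assuming the JSON schemas pin the version as a `const` on the `schema_version` property (the helper name and lookup path here are assumptions, not the repo's actual layout):

```python
import json
from pathlib import Path


def load_schema_version(schema_path: Path) -> str:
    """Read the pinned version out of a JSON schema instead of hardcoding it.

    Assumes the schema declares `properties.schema_version.const`; adjust the
    lookup if the real schemas pin the version differently.
    """
    schema = json.loads(schema_path.read_text(encoding='utf-8'))
    return schema['properties']['schema_version']['const']
```

The generator that rebuilds eval_types and instance_level_types could then splice the fetched value into the generated `Literal[...]` annotation, so a release bumps the version in one place only.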

extra='forbid',
)
schema_version: str = Field(
schema_version: Literal['instance_level_eval_0.2.2'] = Field(
Collaborator


Same case as in eval_types.py.


from every_eval_ever.validate import expand_paths

AGGREGATE_SCHEMA_VERSION = '0.2.2'
Collaborator

@damian1996 damian1996 Apr 27, 2026


There should be some global state for these version variables (like in every_eval_ever/converters/__init__.py) to avoid hardcoding them in all places.

I'm flagging it because this script could be somewhat useful in fixing data in the future.
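That shared state could be as small as one module that both the converters and the fixer scripts import; a sketch, with a hypothetical module name:

```python
# every_eval_ever/schema_versions.py -- hypothetical single source of truth,
# imported by converters, the upgrader CLI, and one-off data-fixing scripts.
AGGREGATE_SCHEMA_VERSION = '0.2.2'
INSTANCE_SCHEMA_VERSION = f'instance_level_eval_{AGGREGATE_SCHEMA_VERSION}'
```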

# Conflicts:
#	every_eval_ever/converters/inspect/adapter.py
#	every_eval_ever/converters/inspect/instance_level_adapter.py
@yananlong yananlong marked this pull request as draft April 28, 2026 11:31
@yananlong yananlong requested a review from Copilot April 28, 2026 11:32
@yananlong yananlong marked this pull request as ready for review April 28, 2026 11:33
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 4 comments.



Comment thread .tmp/audit_before.json
Comment on lines +2 to +13
"files_scanned": 6448,
"results_scanned": 49659,
"missing": {
"evaluation_result_id": 37071,
"metric_id": 37071,
"metric_name": 37071,
"metric_kind": 37071,
"metric_unit": 37071
},
"malformed": {
"evaluation_result_id_pattern": 12588
},

Copilot AI Apr 28, 2026


This file looks like generated audit output rather than a source artifact. Consider moving these before/after audit snapshots into documentation (e.g., README) or removing them from version control to keep the repo lean and avoid stale data.

Suggested change
"files_scanned": 6448,
"results_scanned": 49659,
"missing": {
"evaluation_result_id": 37071,
"metric_id": 37071,
"metric_name": 37071,
"metric_kind": 37071,
"metric_unit": 37071
},
"malformed": {
"evaluation_result_id_pattern": 12588
},
"_notice": "Generated audit snapshot content should not be maintained as a source artifact. Move before/after audit summaries to project documentation or regenerate them locally when needed.",

Comment on lines +65 to +90
def upgrade_instance_file(
file_path: Path, *, write_changes: bool = False
) -> SchemaVersionUpgradeReport:
rows: list[dict] = []
old_versions: set[str | None] = set()
changed_rows = 0

with file_path.open(encoding='utf-8') as handle:
for raw_line in handle:
stripped = raw_line.strip()
if not stripped:
continue
row = json.loads(stripped)
if not isinstance(row, dict):
raise ValueError(
f'Instance payload line is not a JSON object: {file_path}'
)
old_version = row.get('schema_version')
if old_version != INSTANCE_SCHEMA_VERSION:
old_versions.add(old_version)
row['schema_version'] = INSTANCE_SCHEMA_VERSION
changed_rows += 1
rows.append(row)

if write_changes and changed_rows:
_write_jsonl(file_path, rows)

Copilot AI Apr 28, 2026


upgrade_instance_file reads the entire JSONL file into memory before writing. For large instance-level files this can be a significant memory spike. Consider streaming: write each (possibly updated) row to a temporary file and then replace the original when --write is set, while still counting rows_changed/rows_scanned.
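The streaming rewrite suggested above could look like the following sketch (a hypothetical replacement, not the PR's code; it assumes the pinned version constant and same-filesystem renames):

```python
import json
import os
import tempfile
from pathlib import Path

# Assumed pinned version; in the package this would come from a shared constant.
INSTANCE_SCHEMA_VERSION = 'instance_level_eval_0.2.2'


def upgrade_instance_file_streaming(
    file_path: Path, *, write_changes: bool = False
) -> tuple[int, int]:
    """Upgrade schema_version row by row without holding the file in memory.

    Each (possibly updated) row is written to a temp file in the same
    directory; the original is atomically replaced only when write_changes
    is set and something changed. Returns (rows_scanned, rows_changed).
    """
    rows_scanned = rows_changed = 0
    fd, tmp_name = tempfile.mkstemp(dir=file_path.parent, suffix='.tmp')
    try:
        with file_path.open(encoding='utf-8') as src, \
                os.fdopen(fd, 'w', encoding='utf-8') as dst:
            for raw_line in src:
                stripped = raw_line.strip()
                if not stripped:
                    continue
                row = json.loads(stripped)
                if not isinstance(row, dict):
                    raise ValueError(
                        f'Instance payload line is not a JSON object: {file_path}'
                    )
                rows_scanned += 1
                if row.get('schema_version') != INSTANCE_SCHEMA_VERSION:
                    row['schema_version'] = INSTANCE_SCHEMA_VERSION
                    rows_changed += 1
                dst.write(json.dumps(row) + '\n')
        if write_changes and rows_changed:
            os.replace(tmp_name, file_path)  # atomic on the same filesystem
        else:
            os.unlink(tmp_name)
    except BaseException:
        # Never leave a stray temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return rows_scanned, rows_changed
```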

Comment on lines 39 to +52
def __init__(
self,
evaulation_id: str,
format: str,
hash_algorithm: str,
evaluation_dir: str,
file_stem: str | None = None,
):
_require_helm_dependencies()
self.evaluation_id = evaulation_id
self.format = format
self.hash_algorithm = hash_algorithm
self.evaluation_dir = evaluation_dir
self.path = f'{evaluation_dir}/{evaulation_id}.{format}'
self.path = f'{evaluation_dir}/{file_stem or evaulation_id}.{format}'

Copilot AI Apr 28, 2026


The constructor parameter is misspelled as evaulation_id. This is easy to miss and prevents using this adapter with keyword arguments (e.g. evaluation_id=...). Rename it to evaluation_id throughout the signature/assignments and update any call sites accordingly.

metric_name: str | None = None
metric_kind: str | None = None
metric_unit: str | None = None
metric_parameters: dict[str, str | float | bool | None] | None = None

Copilot AI Apr 28, 2026


CanonicalPatch.metric_parameters is annotated as dict[str, str | float | bool | None], but the code assigns ints (e.g., {'k': 1} / {'n': 4}), which will trip static type checkers and can lead to inconsistent expectations. Consider either allowing int in the union type or explicitly casting these parameter values to float/str at assignment time.

Suggested change
metric_parameters: dict[str, str | float | bool | None] | None = None
metric_parameters: dict[str, str | int | float | bool | None] | None = None

Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Member

@evijit evijit left a comment


Codex review: I left a few inline comments. I ran the new focused tests successfully, then ran the full suite and found one regression tied to the instance schema-version strictness change.

extra='forbid',
)
schema_version: str = Field(
schema_version: Literal['instance_level_eval_0.2.2'] = Field(
Member


Codex: making this field a strict literal exposes a missed producer outside every_eval_ever/converters. The full suite now fails in tests/test_openeval_adapter.py::test_include_instances_writes_valid_jsonl_sidecar and test_non_binary_instance_scores_do_not_claim_correctness because utils/openeval/adapter.py::make_instance_log still passes the aggregate SCHEMA_VERSION value (0.2.2) into InstanceLevelEvaluationLog. Please update that adapter to use instance_level_eval_0.2.2 as well, otherwise any OpenEval instance export path is broken by this change.

Contributor Author


We shouldn't hard code the version, right?

result['evaluation_result_id'] = evaluation_result_id
changed = True

sample_updates[raw_evaluation_name] = {
Member


Codex: this mapping is keyed only by raw evaluation_name, so files with multiple aggregate results for the same eval slice but different metrics overwrite earlier entries. For example, HELM-style results DatasetA with Acc on DatasetA and F1 on DatasetA produce two distinct evaluation_result_ids, but sample_updates['DatasetA'] keeps only the last one, and every sample row gets linked to that one metric. That silently corrupts the aggregate-to-sample foreign key for multi-metric evals. The updater needs a disambiguator, or it should duplicate/emit one instance row per aggregate result when a sample contributes to multiple metrics.
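The disambiguation the reviewer asks for could key the mapping by evaluation name plus a per-metric identifier; a sketch with illustrative names (not the PR's actual data structures):

```python
# Key sample updates by (evaluation_name, metric_id) so multiple aggregate
# results on the same eval slice stop clobbering each other.
sample_updates: dict[tuple[str, str], dict] = {}


def record_update(raw_evaluation_name: str, metric_id: str,
                  evaluation_result_id: str) -> None:
    sample_updates[(raw_evaluation_name, metric_id)] = {
        'evaluation_result_id': evaluation_result_id,
    }


# Acc and F1 on the same dataset now keep two distinct entries instead of
# the second overwriting the first.
record_update('DatasetA', 'accuracy', 'rid-acc')
record_update('DatasetA', 'f1', 'rid-f1')
```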

Comment thread every_eval_ever/cli.py
main as augment_canonical_identity_main,
)

forwarded_args = [*args.paths]
Member


Codex: augment_canonical_identity.build_parser() added --skip-path-normalization, but the top-level every_eval_ever CLI does not expose or forward it here. As a result python -m every_eval_ever augment-canonical-identity --skip-path-normalization ... exits with unrecognized arguments, even though the underlying command supports the option. Please add the flag to this parser and forward it alongside --write / --skip-samples.
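The missing wiring could look roughly like this argparse sketch (flag names follow the comment above; the surrounding parser structure is assumed):

```python
import argparse

# Hypothetical top-level wiring: mirror the subcommand's flag on the
# every_eval_ever CLI and forward it verbatim, alongside --write and
# --skip-samples.
parser = argparse.ArgumentParser(prog='every_eval_ever')
sub = parser.add_subparsers(dest='command', required=True)
aug = sub.add_parser('augment-canonical-identity')
aug.add_argument('paths', nargs='+')
aug.add_argument('--write', action='store_true')
aug.add_argument('--skip-samples', action='store_true')
aug.add_argument('--skip-path-normalization', action='store_true')

args = parser.parse_args(
    ['augment-canonical-identity', 'data/', '--skip-path-normalization']
)
forwarded_args = [*args.paths]
for flag, enabled in [
    ('--write', args.write),
    ('--skip-samples', args.skip_samples),
    ('--skip-path-normalization', args.skip_path_normalization),
]:
    if enabled:
        forwarded_args.append(flag)
```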

if benchmark_family == 'theory_of_mind':
return _theory_of_mind_patch(raw_evaluation_name, metric_config)

return None
Contributor


This leaves the repo's checked-in HAL data uncovered by the backfill. On a copy of the current PR checkout, augment-canonical-identity data/ --write updated all 246 HAL aggregate files, but check-canonical-identity data/ --fail-on-issues still failed with 342 rows missing metric_id, metric_name, metric_kind, and metric_unit. The HAL adapter emits single-score accuracy-style metrics in utils/hal/adapter.py, so either the backfill needs a hal-* rule or the HAL producer should be updated; otherwise the new audit cannot pass on the repo's own data/ directory.

def main(argv: list[str] | None = None) -> int:
args = build_parser().parse_args(argv)
file_paths = expand_paths(args.paths)
reports = [
Contributor


This command assumes every path returned by expand_paths() is an EEE datastore file. That is true for the checked-in data/ directory, but not for common repo-scoped invocations: upgrade-schema-version . walks schema JSON and fixture JSON under tests/data, then crashes on list-shaped HELM fixture JSON; with --write, dict-shaped non-aggregate JSON can also be rewritten with a datastore schema version. Please filter JSON files by aggregate shape, or skip/report non-aggregate JSON instead of passing every .json to upgrade_aggregate_file.
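The shape filter the reviewer proposes (and which the PR already applies elsewhere for datastore-root scans) could be sketched like this; the helper name is illustrative:

```python
import json
from pathlib import Path


def is_aggregate_candidate(file_path: Path) -> bool:
    """Treat a .json file as an EEE aggregate candidate only when it parses
    to an object carrying an `evaluation_results` field, so repo-wide scans
    skip schema files and list-shaped test fixtures instead of crashing on
    or rewriting them."""
    if file_path.suffix != '.json':
        return False
    try:
        payload = json.loads(file_path.read_text(encoding='utf-8'))
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(payload, dict) and 'evaluation_results' in payload
```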


if write_changes and path_changed:
if sample_exists and target_sample_file_path is not None:
_move_file(sample_file_path, target_sample_file_path)
Contributor


Moving the sample sidecar before the aggregate can leave the datastore split if the aggregate move fails. I reproduced this with an over-nested aggregate plus sidecar where the normalized aggregate target already existed: the sidecar was moved to the normalized directory, then _move_file(file_path, target_file_path) raised FileExistsError, leaving the aggregate in the old path and its samples in the new path. Preflight both target paths before moving either file, or move through a rollback-safe sequence.
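The preflight variant suggested above could be sketched as follows (illustrative helper, assuming same-filesystem renames rather than the PR's `_move_file`):

```python
import os
from pathlib import Path


def move_pair_safely(moves: list[tuple[Path, Path]]) -> None:
    """Check every destination before touching any file, so a failure on the
    aggregate move cannot leave its sample sidecar already relocated."""
    for _source, target in moves:
        if target.exists():
            raise FileExistsError(f'Refusing to overwrite existing file: {target}')
    for source, target in moves:
        target.parent.mkdir(parents=True, exist_ok=True)
        os.replace(source, target)
```

Because both targets are validated before the first rename, the FileExistsError from the scenario described above fires before either file moves, leaving the datastore consistent.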
