Canonical identity and schema upgrade tooling#116
yananlong wants to merge 26 commits into evaleval:main from
Conversation
Add metric identity and deterministic aggregate-instance linkage fields
Co-authored-by: yananlong <25546626+yananlong@users.noreply.github.com>
…or content_hash is present Co-authored-by: yananlong <25546626+yananlong@users.noreply.github.com>
Remove unsafe hash algorithms from dedupe_identity block
Require hash_algorithm when run_fingerprint or content_hash is present in dedupe_identity
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add optional dedupe_identity block to aggregate schema
- Extend augmentation rules to cover remaining benchmark families and metric phrases
- Add check-canonical-identity package command and CLI wiring
- Add regression tests for new augmentation rules and checker behavior
- Document canonical identity backfill/audit commands in README

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only count JSON files as aggregate candidates when they contain an evaluation_results field, so datastore-root scans avoid false positives while still flagging malformed aggregate payloads. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Resolve schema-path conflict by keeping the upstream symlinked eval schema entry and dropping the fork-side legacy standalone schema file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enforce strict 0.2.2 schemas and add schema upgrader CLI
Extend canonical backfill and path normalization coverage
```diff
         extra='forbid',
     )
-    schema_version: str = Field(
+    schema_version: Literal['0.2.2'] = Field(
```
Do you think it is a good idea to hardcode this value? It would be ideal to change the version in only one place after a new version is released, to make it as painless as possible. Ideally these literals would be fetched when eval_types and instance_level_types are rebuilt from the JSON schemas.
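One way to keep the version in a single place, sketched here with a plain dataclass rather than the project's actual pydantic models (the class name and validation hook are hypothetical), is a module-level constant checked at construction time:

```python
from dataclasses import dataclass

# Hypothetical single source of truth; in practice this could be read from the
# JSON schema when eval_types / instance_level_types are regenerated.
SCHEMA_VERSION = '0.2.2'

@dataclass
class EvaluationLog:
    schema_version: str = SCHEMA_VERSION

    def __post_init__(self) -> None:
        # Enforce the pinned version at runtime instead of hardcoding a
        # Literal type in every generated model.
        if self.schema_version != SCHEMA_VERSION:
            raise ValueError(
                f'expected schema_version {SCHEMA_VERSION!r}, '
                f'got {self.schema_version!r}'
            )
```

Bumping `SCHEMA_VERSION` after a release would then update every model that imports it.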
```diff
         extra='forbid',
     )
-    schema_version: str = Field(
+    schema_version: Literal['instance_level_eval_0.2.2'] = Field(
```
The same case as in eval_types.py.
```python
from every_eval_ever.validate import expand_paths

AGGREGATE_SCHEMA_VERSION = '0.2.2'
```
There should be some global state for these version variables (like in `every_eval_ever/converters/__init__.py`) to avoid hardcoding them in all places.
I'm flagging it because this script could be somewhat useful for fixing data in the future.
# Conflicts:
# every_eval_ever/converters/inspect/adapter.py
# every_eval_ever/converters/inspect/instance_level_adapter.py
Pull request overview
Copilot reviewed 20 out of 21 changed files in this pull request and generated 4 comments.
```json
"files_scanned": 6448,
"results_scanned": 49659,
"missing": {
    "evaluation_result_id": 37071,
    "metric_id": 37071,
    "metric_name": 37071,
    "metric_kind": 37071,
    "metric_unit": 37071
},
"malformed": {
    "evaluation_result_id_pattern": 12588
},
```
This file looks like generated audit output rather than a source artifact. Consider moving these before/after audit snapshots into documentation (e.g., README) or removing them from version control to keep the repo lean and avoid stale data.
Suggested change:

```diff
 "files_scanned": 6448,
 "results_scanned": 49659,
 "missing": {
     "evaluation_result_id": 37071,
     "metric_id": 37071,
     "metric_name": 37071,
     "metric_kind": 37071,
     "metric_unit": 37071
 },
 "malformed": {
     "evaluation_result_id_pattern": 12588
 },
+"_notice": "Generated audit snapshot content should not be maintained as a source artifact. Move before/after audit summaries to project documentation or regenerate them locally when needed.",
```
```python
def upgrade_instance_file(
    file_path: Path, *, write_changes: bool = False
) -> SchemaVersionUpgradeReport:
    rows: list[dict] = []
    old_versions: set[str | None] = set()
    changed_rows = 0

    with file_path.open(encoding='utf-8') as handle:
        for raw_line in handle:
            stripped = raw_line.strip()
            if not stripped:
                continue
            row = json.loads(stripped)
            if not isinstance(row, dict):
                raise ValueError(
                    f'Instance payload line is not a JSON object: {file_path}'
                )
            old_version = row.get('schema_version')
            if old_version != INSTANCE_SCHEMA_VERSION:
                old_versions.add(old_version)
                row['schema_version'] = INSTANCE_SCHEMA_VERSION
                changed_rows += 1
            rows.append(row)

    if write_changes and changed_rows:
        _write_jsonl(file_path, rows)
```
upgrade_instance_file reads the entire JSONL file into memory before writing. For large instance-level files this can be a significant memory spike. Consider streaming: write each (possibly updated) row to a temporary file and then replace the original when --write is set, while still counting rows_changed/rows_scanned.
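A streaming variant might look like the following sketch (the function name and `INSTANCE_SCHEMA_VERSION` constant here are assumptions mirroring the PR, not its actual API):

```python
import json
import os
import tempfile
from pathlib import Path

# Assumed constant mirroring the PR's instance schema version.
INSTANCE_SCHEMA_VERSION = 'instance_level_eval_0.2.2'

def upgrade_instance_file_streaming(
    file_path: Path, *, write_changes: bool = False
) -> tuple[int, int]:
    """Rewrite schema_version row by row through a temp file, avoiding the
    full-file buffer; returns (rows_scanned, rows_changed)."""
    rows_scanned = rows_changed = 0
    tmp_fd, tmp_name = tempfile.mkstemp(dir=file_path.parent, suffix='.tmp')
    try:
        with file_path.open(encoding='utf-8') as src, \
                os.fdopen(tmp_fd, 'w', encoding='utf-8') as dst:
            for raw_line in src:
                stripped = raw_line.strip()
                if not stripped:
                    continue
                row = json.loads(stripped)
                rows_scanned += 1
                if row.get('schema_version') != INSTANCE_SCHEMA_VERSION:
                    row['schema_version'] = INSTANCE_SCHEMA_VERSION
                    rows_changed += 1
                dst.write(json.dumps(row) + '\n')
        if write_changes and rows_changed:
            os.replace(tmp_name, file_path)  # atomic swap on POSIX
        else:
            os.unlink(tmp_name)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return rows_scanned, rows_changed
```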
```diff
     def __init__(
         self,
         evaulation_id: str,
         format: str,
         hash_algorithm: str,
         evaluation_dir: str,
+        file_stem: str | None = None,
     ):
         _require_helm_dependencies()
         self.evaluation_id = evaulation_id
         self.format = format
         self.hash_algorithm = hash_algorithm
         self.evaluation_dir = evaluation_dir
-        self.path = f'{evaluation_dir}/{evaulation_id}.{format}'
+        self.path = f'{evaluation_dir}/{file_stem or evaulation_id}.{format}'
```
The constructor parameter is misspelled as evaulation_id. This is easy to miss and prevents using this adapter with keyword arguments (e.g. evaluation_id=...). Rename it to evaluation_id throughout the signature/assignments and update any call sites accordingly.
```python
metric_name: str | None = None
metric_kind: str | None = None
metric_unit: str | None = None
metric_parameters: dict[str, str | float | bool | None] | None = None
```
CanonicalPatch.metric_parameters is annotated as dict[str, str | float | bool | None], but the code assigns ints (e.g., {'k': 1} / {'n': 4}), which will trip static type checkers and can lead to inconsistent expectations. Consider either allowing int in the union type or explicitly casting these parameter values to float/str at assignment time.
Suggested change:

```diff
-metric_parameters: dict[str, str | float | bool | None] | None = None
+metric_parameters: dict[str, str | int | float | bool | None] | None = None
```
evijit left a comment:
Codex review: I left a few inline comments. I ran the new focused tests successfully, then ran the full suite and found one regression tied to the instance schema-version strictness change.
```diff
         extra='forbid',
     )
-    schema_version: str = Field(
+    schema_version: Literal['instance_level_eval_0.2.2'] = Field(
```
Codex: making this field a strict literal exposes a missed producer outside every_eval_ever/converters. The full suite now fails in tests/test_openeval_adapter.py::test_include_instances_writes_valid_jsonl_sidecar and test_non_binary_instance_scores_do_not_claim_correctness because utils/openeval/adapter.py::make_instance_log still passes the aggregate SCHEMA_VERSION value (0.2.2) into InstanceLevelEvaluationLog. Please update that adapter to use instance_level_eval_0.2.2 as well, otherwise any OpenEval instance export path is broken by this change.
We shouldn't hardcode the version, right?
```python
result['evaluation_result_id'] = evaluation_result_id
changed = True

sample_updates[raw_evaluation_name] = {
```
Codex: this mapping is keyed only by the raw evaluation_name, so files with multiple aggregate results for the same eval slice but different metrics overwrite earlier entries. For example, HELM-style results with Acc on DatasetA and F1 on DatasetA produce two distinct evaluation_result_ids, but sample_updates['DatasetA'] keeps only the last one, and every sample row gets linked to that one metric. That silently corrupts the aggregate-to-sample foreign key for multi-metric evals. The updater needs a disambiguator, or it should duplicate/emit one instance row per aggregate result when a sample contributes to multiple metrics.
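A minimal fix, assuming the surrounding updater code, is to widen the key to a (evaluation_name, metric identifier) tuple so each aggregate result keeps its own link (the names below are illustrative, not the PR's actual identifiers):

```python
# Keyed by (raw_evaluation_name, metric_id) so two metrics on the same eval
# slice no longer overwrite each other's evaluation_result_id.
sample_updates: dict[tuple[str, str], dict[str, str]] = {}

def record_update(
    raw_evaluation_name: str, metric_id: str, evaluation_result_id: str
) -> None:
    sample_updates[(raw_evaluation_name, metric_id)] = {
        'evaluation_result_id': evaluation_result_id,
    }

record_update('DatasetA', 'accuracy', 'result-acc')
record_update('DatasetA', 'f1', 'result-f1')
assert len(sample_updates) == 2  # both metrics survive the backfill
```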
```python
    main as augment_canonical_identity_main,
)

forwarded_args = [*args.paths]
```
Codex: augment_canonical_identity.build_parser() added --skip-path-normalization, but the top-level every_eval_ever CLI does not expose or forward it here. As a result python -m every_eval_ever augment-canonical-identity --skip-path-normalization ... exits with unrecognized arguments, even though the underlying command supports the option. Please add the flag to this parser and forward it alongside --write / --skip-samples.
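The missing wiring would look roughly like this (the parser layout is a simplified sketch of the top-level CLI, not the repo's exact code):

```python
import argparse

parser = argparse.ArgumentParser(prog='every_eval_ever')
subparsers = parser.add_subparsers(dest='command')

augment = subparsers.add_parser('augment-canonical-identity')
augment.add_argument('paths', nargs='+')
augment.add_argument('--write', action='store_true')
augment.add_argument('--skip-samples', action='store_true')
# The flag the top-level CLI currently fails to expose:
augment.add_argument('--skip-path-normalization', action='store_true')

def forward(args: argparse.Namespace) -> list[str]:
    # Rebuild the argv forwarded to augment_canonical_identity_main.
    forwarded_args = [*args.paths]
    if args.write:
        forwarded_args.append('--write')
    if args.skip_samples:
        forwarded_args.append('--skip-samples')
    if args.skip_path_normalization:
        forwarded_args.append('--skip-path-normalization')
    return forwarded_args
```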
```python
if benchmark_family == 'theory_of_mind':
    return _theory_of_mind_patch(raw_evaluation_name, metric_config)

return None
```
This leaves the repo's checked-in HAL data uncovered by the backfill. On a copy of the current PR checkout, augment-canonical-identity data/ --write updated all 246 HAL aggregate files, but check-canonical-identity data/ --fail-on-issues still failed with 342 rows missing metric_id, metric_name, metric_kind, and metric_unit. The HAL adapter emits single-score accuracy-style metrics in utils/hal/adapter.py, so either the backfill needs a hal-* rule or the HAL producer should be updated; otherwise the new audit cannot pass on the repo's own data/ directory.
```python
def main(argv: list[str] | None = None) -> int:
    args = build_parser().parse_args(argv)
    file_paths = expand_paths(args.paths)
    reports = [
```
This command assumes every path returned by expand_paths() is an EEE datastore file. That is true for the checked-in data/ directory, but not for common repo-scoped invocations: upgrade-schema-version . walks schema JSON and fixture JSON under tests/data, then crashes on list-shaped HELM fixture JSON; with --write, dict-shaped non-aggregate JSON can also be rewritten with a datastore schema version. Please filter JSON files by aggregate shape, or skip/report non-aggregate JSON instead of passing every .json to upgrade_aggregate_file.
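A shape filter along these lines (the function name is hypothetical) would keep schema files and fixtures out of the upgrade path:

```python
import json
from pathlib import Path

def is_aggregate_candidate(path: Path) -> bool:
    """Treat a JSON file as an EEE aggregate only when it parses to a dict
    carrying an evaluation_results field; skip list-shaped fixtures and
    schema JSON instead of crashing on or rewriting them."""
    if path.suffix != '.json':
        return False
    try:
        payload = json.loads(path.read_text(encoding='utf-8'))
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(payload, dict) and 'evaluation_results' in payload
```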
```python
if write_changes and path_changed:
    if sample_exists and target_sample_file_path is not None:
        _move_file(sample_file_path, target_sample_file_path)
```
Moving the sample sidecar before the aggregate can leave the datastore split if the aggregate move fails. I reproduced this with an over-nested aggregate plus sidecar where the normalized aggregate target already existed: the sidecar was moved to the normalized directory, then _move_file(file_path, target_file_path) raised FileExistsError, leaving the aggregate in the old path and its samples in the new path. Preflight both target paths before moving either file, or move through a rollback-safe sequence.
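One rollback-safe shape, sketched here independently of the module's actual _move_file helper, is to validate every source/target pair before touching the filesystem:

```python
from pathlib import Path

def preflight_moves(moves: list[tuple[Path, Path]]) -> None:
    """Check all (source, target) pairs up front so a FileExistsError on the
    aggregate cannot strand an already-moved sample sidecar."""
    for source, target in moves:
        if not source.exists():
            raise FileNotFoundError(source)
        if target.exists():
            raise FileExistsError(target)
    # Only after every pair passes do we perform any move.
    for source, target in moves:
        target.parent.mkdir(parents=True, exist_ok=True)
        source.rename(target)
```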