Add unified version registry for GCS/HF dataset versioning#601
Add unified version registry for GCS/HF dataset versioning#601
Conversation
Introduces a single version_manifest.json that tracks all dataset versions across both GCS and Hugging Face, enabling programmatic version queries and rollback. Adds consumer API (get_data_version, get_data_manifest) exported at package level. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Rename the module to reflect that both GCS and HF are equal backends, not just GCS. Add a constants block for all backend config, remove optional HF parameters from public API, and extract step-based procedural blocks into named helper functions. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move pytest fixtures to conftest.py for auto-discovery (fixes F401/F811), remove unused imports from data_upload.py, apply ruff formatting. Co-Authored-By: Claude Opus 4.6 <[email protected]>
juaristi22
left a comment
There was a problem hiding this comment.
PR #601 Review: Unified Version Manifest System
Overall: Solid architecture, but has one real bug and a few gaps.
Must Fix
-
Semver sorting is broken —
list_versions()uses Python string sort, which means"1.9.0" > "1.10.0". Need to usepackaging.version.Versionor similar for proper ordering. -
Broad exception catching in rollback tag creation — catches
Exceptionand checks"already exists" in str(e). Should catch the specific HF exception type (HfHubHTTPErroror similar).
Should Fix
-
upload_files_to_gcsreturn type changed — now returns a dict instead of None. Any external callers will silently get unexpected return types. Needs an audit of callers. -
No test for rollback when
hf=None— the code handles this case (line checksif old_manifest.hf else None) but no test verifies it. -
Temp file cleanup — uses
NamedTemporaryFile(delete=False)+ manualos.unlink, but if the upload fails between creation and cleanup, the temp file leaks. Should wrap in try/finally.
Minor / Nice to Have
-
Changelog fragment doesn't mention the two new public APIs (
get_data_version,get_data_manifest) that consumers can use. -
No semver sorting tests — needs edge cases like
"1.9.0"vs"1.10.0","2.0.0"vs"1.100.0". -
Constant duplication —
GCS_BUCKET_NAMEandHF_REPO_NAMEare defined in bothdata_upload.pyandversion_manifest.py.
What's Good
- Clean dataclass hierarchy (
HFVersionInfo/GCSVersionInfo/VersionManifest/VersionRegistry) - Append-only registry pattern is the right call — good for audit trails
- 39 tests with solid coverage of serialization, registry I/O, and consumer API
- Rollback as a new entry with
special_operationmetadata rather than destructive deletion
Resolves conflict in data_upload.py by combining PR #601's version manifest integration with main's new upload_from_hf_staging_to_gcs function and expanded subdirectory docstring. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Merged main into this branchPR #538 (calibration pipeline improvements) just merged to main, so I merged main into Conflict resolutionThere was one conflict in
The core of your PR (version manifest creation, return types for tracking GCS generations and HF commit OIDs) is fully preserved. Please review the merge commit to make sure nothing looks off. |
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fixes #600
Summary
gcs_version.pywith typed dataclasses (VersionRegistry,VersionManifest,HFVersionInfo,GCSVersionInfo) and registry I/O separated into_upload_registry_to_gcs()/_upload_registry_to_hf()data_upload.pyto capture GCS generation numbers and HF commit SHAs, build aVersionManifest, and append it to the unified registry on both backendsget_data_version()andget_data_manifest()from package root for public consumer access (fetches from HF, no credentials needed)special_operation="roll-back"metadataTest plan
pytest policyengine_us_data/tests/test_gcs_version.py -v— 39 tests passblack . -l 79 --check— formatting clean🤖 Generated with Claude Code