Skip to content

Add unified version registry for GCS/HF dataset versioning#601

Open
anth-volk wants to merge 7 commits intomainfrom
fix/fix-versioning
Open

Add unified version registry for GCS/HF dataset versioning#601
anth-volk wants to merge 7 commits intomainfrom
fix/fix-versioning

Conversation

@anth-volk
Copy link
Collaborator

Fixes #600

Summary

  • Adds gcs_version.py with typed dataclasses (VersionRegistry, VersionManifest, HFVersionInfo, GCSVersionInfo) and registry I/O separated into _upload_registry_to_gcs() / _upload_registry_to_hf()
  • Modifies data_upload.py to capture GCS generation numbers and HF commit SHAs, build a VersionManifest, and append it to the unified registry on both backends
  • Exports get_data_version() and get_data_manifest() from package root for public consumer access (fetches from HF, no credentials needed)
  • Includes rollback support that treats rollback as a new release with special_operation="roll-back" metadata
  • 39 tests covering serialization, registry I/O, query functions, rollback, and consumer API

Test plan

  • pytest policyengine_us_data/tests/test_gcs_version.py -v — 39 tests pass
  • black . -l 79 --check — formatting clean
  • CI passes on this branch

🤖 Generated with Claude Code

anth-volk and others added 5 commits March 11, 2026 18:40
Introduces a single version_manifest.json that tracks all dataset
versions across both GCS and Hugging Face, enabling programmatic
version queries and rollback. Adds consumer API (get_data_version,
get_data_manifest) exported at package level.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Rename the module to reflect that both GCS and HF are equal backends,
not just GCS. Add a constants block for all backend config, remove
optional HF parameters from public API, and extract step-based
procedural blocks into named helper functions.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Move pytest fixtures to conftest.py for auto-discovery (fixes F401/F811),
remove unused imports from data_upload.py, apply ruff formatting.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@anth-volk anth-volk marked this pull request as ready for review March 11, 2026 19:52
@anth-volk anth-volk requested a review from juaristi22 March 11, 2026 19:52
Copy link
Collaborator

@juaristi22 juaristi22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #601 Review: Unified Version Manifest System

Overall: Solid architecture, but has one real bug and a few gaps.

Must Fix

  1. Semver sorting is brokenlist_versions() uses Python string sort, which means "1.9.0" > "1.10.0". Need to use packaging.version.Version or similar for proper ordering.

  2. Broad exception catching in rollback tag creation — catches Exception and checks "already exists" in str(e). Should catch the specific HF exception type (HfHubHTTPError or similar).

Should Fix

  1. upload_files_to_gcs return type changed — now returns a dict instead of None. Any external callers will silently get unexpected return types. Needs an audit of callers.

  2. No test for rollback when hf=None — the code handles this case (line checks if old_manifest.hf else None) but no test verifies it.

  3. Temp file cleanup — uses NamedTemporaryFile(delete=False) + manual os.unlink, but if the upload fails between creation and cleanup, the temp file leaks. Should wrap in try/finally.

Minor / Nice to Have

  1. Changelog fragment doesn't mention the two new public APIs (get_data_version, get_data_manifest) that consumers can use.

  2. No semver sorting tests — needs edge cases like "1.9.0" vs "1.10.0", "2.0.0" vs "1.100.0".

  3. Constant duplicationGCS_BUCKET_NAME and HF_REPO_NAME are defined in both data_upload.py and version_manifest.py.

What's Good

  • Clean dataclass hierarchy (HFVersionInfo / GCSVersionInfo / VersionManifest / VersionRegistry)
  • Append-only registry pattern is the right call — good for audit trails
  • 39 tests with solid coverage of serialization, registry I/O, and consumer API
  • Rollback as a new entry with special_operation metadata rather than destructive deletion

Resolves conflict in data_upload.py by combining PR #601's version
manifest integration with main's new upload_from_hf_staging_to_gcs
function and expanded subdirectory docstring.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@baogorek
Copy link
Collaborator

Merged main into this branch

PR #538 (calibration pipeline improvements) just merged to main, so I merged main into fix/fix-versioning to bring it up to date.

Conflict resolution

There was one conflict in policyengine_us_data/utils/data_upload.py. Here's how it was resolved:

Area Resolution
Imports Kept hf_hub_download from main (needed by new function). Kept PR's removal of unused RevisionNotFoundError and json — confirmed main's new code doesn't use them either.
upload_data_files() Used PR's version (version manifest integration) — main didn't change this.
upload_files_to_hf() / upload_files_to_gcs() Used PR's version with return types (str / Dict[str, int]).
upload_local_area_file() docstring Merged both: PR's -> int return type + Returns section, combined with main's expanded subdirectory list ("states/, districts/, cities/, and national/").
upload_from_hf_staging_to_gcs() Kept main's new function at the bottom (uses hf_hub_download to transfer from HF staging to GCS).

The core of your PR (version manifest creation, return types for tracking GCS generations and HF commit OIDs) is fully preserved. Please review the merge commit to make sure nothing looks off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add unified version registry for GCS and HF dataset versioning

3 participants