
Add entity-level HDFStore output format alongside h5py#568

Open
anth-volk wants to merge 4 commits into main from add-hdfstore-output

Conversation


@anth-volk anth-volk commented Mar 4, 2026

Fixes #567

Related to PolicyEngine/policyengine-us#7700

Summary

  • stacked_dataset_builder.py now produces a Pandas HDFStore file (.hdfstore.h5) alongside the existing h5py file, with one table per entity and an embedded uprating manifest
  • Upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/)
  • Comparison test validates both formats contain identical data for all ~183 variables

Test plan

  • Run stacked_dataset_builder on a single CD/state and confirm both .h5 and .hdfstore.h5 files are created
  • Run pytest test_format_comparison.py --h5py-path STATE.h5 --hdfstore-path STATE.hdfstore.h5 and confirm all variables match
  • Verify HDFStore contains _variable_metadata manifest with correct entity and uprating columns
  • Verify all 6 entity tables are present with correct row counts

🤖 Generated with Claude Code

The stacked_dataset_builder now produces a Pandas HDFStore file
(.hdfstore.h5) in addition to the existing h5py file. The HDFStore
contains one table per entity (person, household, tax_unit, spm_unit,
family, marital_unit) plus an embedded _variable_metadata manifest
recording each variable's entity and uprating parameter path.

The upload pipeline uploads HDFStore files to dedicated subdirectories
(states_hdfstore/, districts_hdfstore/, cities_hdfstore/).

A comparison test (test_format_comparison.py) validates that both
formats contain identical data for all variables.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
anth-volk and others added 3 commits March 5, 2026 17:47
Replaces the two-file-input test with a self-contained roundtrip
script that takes only an h5py file path, generates an HDFStore
using inlined splitting logic, then compares both formats. Handles
entity-level h5py files and yearly/ETERNITY/monthly period keys.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@anth-volk anth-volk requested a review from juaristi22 March 5, 2026 18:39
@anth-volk anth-volk marked this pull request as ready for review March 5, 2026 18:39

@juaristi22 juaristi22 left a comment


Review: HDFStore output format

Overall the feature makes sense — producing an entity-level HDFStore alongside the variable-centric h5py file is a clean step toward the API v2 alpha format. A few structural and testing concerns below.


1. Extract HDFStore logic into its own module

The three new functions (_split_into_entity_dfs, _build_uprating_manifest, _save_hdfstore) are a general-purpose h5py → HDFStore format transform. They don't depend on calibration, CD stacking, or anything specific to stacked_dataset_builder.py.

Embedding them in stacked_dataset_builder.py creates two problems:

  • Merge conflict with #538. That PR reorganizes calibration modules from datasets/cps/local_area_calibration/ to calibration/. This PR adds ~150 lines of new functions to the same file. The merge will be painful.
  • Coupling to a build pipeline. The HDFStore format is consumed by API v2 and extend_single_year_dataset(). Those consumers shouldn't need to trace into a calibration build module for the serialization logic.

Suggestion: Move these to a standalone utility, e.g. policyengine_us_data/utils/hdfstore.py:

# policyengine_us_data/utils/hdfstore.py
def split_into_entity_dfs(arrays, system, vars_to_save): ...
def build_uprating_manifest(vars_to_save, system): ...
def save_hdfstore(entity_dfs, manifest_df, path, time_period): ...

Then stacked_dataset_builder.py and any future consumer just imports from it.


2. The test reimplements the logic instead of testing it

test_format_comparison.py contains its own copy of _split_into_entity_dfs, _build_uprating_manifest, and _save_hdfstore. This means the test validates its own implementation, not the production code. If the production _split_into_entity_dfs has a bug (e.g., incorrect __period suffix stripping, a wrong person_X_id → X_id rename), this test cannot catch it.

Note also that the two implementations already diverge:

  • Production works on a combined_df with variable__period columns and handles the person_X_id → X_id rename.
  • Test works on raw h5py arrays with no period suffix and doesn't handle that rename.
  • Production's _save_hdfstore does not deduplicate columns; the test version does (df.loc[:, ~df.columns.duplicated()]).

Suggestion: If the conversion logic lives in a shared module (point 1), the test can import it directly instead of reimplementing it. The test would then validate the actual production code path.


3. Nice-to-have test improvements

The current tests check value equality and structural completeness, which covers the basics well. A few additional checks that could be worth adding down the road if you want to harden things further:

  • Weight conservation — verifying that household_weight, person_weight, etc. sum to the same totals across formats. This is probably the highest-value addition since it's the property that matters most for microsimulation correctness.
  • Referential integrity — checking that every person_household_id in the person table references a valid household_id in the household table (and likewise for other entities). Silent FK violations won't fail value comparisons but would break simulations.
  • Row count consistency — e.g. n_unique(person_household_id) == len(household_df).
  • Group entity dedup correctness — the current comparison for group entities falls back to np.unique set comparison when array lengths differ, which can mask cases where dedup picks the wrong representative row. Not urgent but worth being aware of.
  • Bidirectional column check — currently checks h5py → HDFStore but not the reverse (no extra/spurious columns in HDFStore).
  • _time_period roundtrip — the _time_period metadata table is written but never read back or verified.

None of these are blockers — just ideas for making the validation more robust over time.


4. Minor: manifest entity mismatch

_build_uprating_manifest classifies unknown variables (not in system.variables) as entity "unknown", but _split_into_entity_dfs classifies them as "household". The manifest entity won't match where the variable actually ends up. Should be consistent — probably both "household".


5. Upload pattern in publish_local_area.py

The HDFStore upload uses os.path.exists(hdfstore_path) checks repeated across states/districts/cities. This is fine for now but could be a helper like _upload_hdfstore_if_exists(output_path, category) to reduce the copy-paste.
