Add entity-level HDFStore output format alongside h5py #568
Conversation
The stacked_dataset_builder now produces a Pandas HDFStore file (.hdfstore.h5) in addition to the existing h5py file. The HDFStore contains one table per entity (person, household, tax_unit, spm_unit, family, marital_unit) plus an embedded _variable_metadata manifest recording each variable's entity and uprating parameter path. The upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/). A comparison test (test_format_comparison.py) validates that both formats contain identical data for all variables. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replaces the two-file-input test with a self-contained roundtrip script that takes only an h5py file path, generates an HDFStore using inlined splitting logic, then compares both formats. Handles entity-level h5py files and yearly/ETERNITY/monthly period keys. Co-Authored-By: Claude Opus 4.6 <[email protected]>
juaristi22
left a comment
Review: HDFStore output format
Overall the feature makes sense — producing an entity-level HDFStore alongside the variable-centric h5py file is a clean step toward the API v2 alpha format. A few structural and testing concerns below.
1. Extract HDFStore logic into its own module
The three new functions (_split_into_entity_dfs, _build_uprating_manifest, _save_hdfstore) are a general-purpose h5py → HDFStore format transform. They don't depend on calibration, CD stacking, or anything specific to stacked_dataset_builder.py.
Embedding them in stacked_dataset_builder.py creates two problems:
- Merge conflict with #538. That PR reorganizes calibration modules from datasets/cps/local_area_calibration/ to calibration/. This PR adds ~150 lines of new functions to the same file. The merge will be painful.
- Coupling to a build pipeline. The HDFStore format is consumed by API v2 and extend_single_year_dataset(). Those consumers shouldn't need to trace into a calibration build module for the serialization logic.
Suggestion: Move these to a standalone utility, e.g. policyengine_us_data/utils/hdfstore.py:
```python
# policyengine_us_data/utils/hdfstore.py
def split_into_entity_dfs(arrays, system, vars_to_save): ...
def build_uprating_manifest(vars_to_save, system): ...
def save_hdfstore(entity_dfs, manifest_df, path, time_period): ...
```

Then stacked_dataset_builder.py and any future consumer just imports from it.
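To make the shape of the extracted module concrete, here is a minimal sketch of the splitting function. This is not the production code: the entity names come from the PR description, the `entity_of` callable stands in for whatever lookup the real code does against `system.variables`, and the real implementation also handles `__period` suffixes and the `person_X_id` rename discussed below.

```python
import pandas as pd

# Entity names as listed in the PR description; the production code
# would derive these from the PolicyEngine system object instead.
ENTITIES = ["person", "household", "tax_unit", "spm_unit", "family", "marital_unit"]

def split_into_entity_dfs(arrays, entity_of):
    """Group flat variable arrays into one DataFrame per entity.

    arrays: dict mapping variable name -> 1-D array of values.
    entity_of: callable mapping a variable name to its entity key
               (a placeholder for the real system.variables lookup).
    """
    by_entity = {entity: {} for entity in ENTITIES}
    for name, values in arrays.items():
        # setdefault tolerates entity keys outside ENTITIES.
        by_entity.setdefault(entity_of(name), {})[name] = values
    return {
        entity: pd.DataFrame(columns)
        for entity, columns in by_entity.items()
        if columns
    }
```

A consumer such as stacked_dataset_builder.py would then import this function rather than carrying its own copy.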
2. The test reimplements the logic instead of testing it
test_format_comparison.py contains its own copy of _split_into_entity_dfs, _build_uprating_manifest, and _save_hdfstore. This means the test validates its own implementation, not the production code. If the production _split_into_entity_dfs has a bug (e.g., incorrect __period suffix stripping, wrong person_X_id → X_id rename), this test cannot catch it.
Note also that the two implementations already diverge:
- Production works on a combined_df with variable__period columns and handles the person_X_id → X_id rename.
- Test works on raw h5py arrays with no period suffix and doesn't handle that rename.
- Production's _save_hdfstore does not deduplicate columns; the test version does (df.loc[:, ~df.columns.duplicated()]).
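For reference, the deduplication idiom in the test version keeps only the first occurrence of each repeated column name, since `Index.duplicated()` marks later occurrences as duplicates:

```python
import pandas as pd

# Two columns share the name "person_id"; the boolean mask from
# df.columns.duplicated() is True for the second occurrence, so
# negating it keeps only the first copy of each column name.
df = pd.DataFrame([[1, 2, 3]], columns=["person_id", "age", "person_id"])
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # ['person_id', 'age']
```

If the production _save_hdfstore can ever receive duplicate column names, it would silently differ from the test version here.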
Suggestion: If the conversion logic lives in a shared module (point 1), the test can import it directly instead of reimplementing it. The test would then validate the actual production code path.
3. Nice-to-have test improvements
The current tests check value equality and structural completeness, which covers the basics well. A few additional checks that could be worth adding down the road if you want to harden things further:
- Weight conservation — verifying that household_weight, person_weight, etc. sum to the same totals across formats. This is probably the highest-value addition since it's the property that matters most for microsimulation correctness.
- Referential integrity — checking that every person_household_id in the person table references a valid household_id in the household table (and likewise for other entities). Silent FK violations won't fail value comparisons but would break simulations.
- Row count consistency — e.g. n_unique(person_household_id) == len(household_df).
- Group entity dedup correctness — the current comparison for group entities falls back to np.unique set comparison when array lengths differ, which can mask cases where dedup picks the wrong representative row. Not urgent but worth being aware of.
- Bidirectional column check — currently checks h5py → HDFStore but not the reverse (no extra/spurious columns in HDFStore).
- _time_period roundtrip — the _time_period metadata table is written but never read back or verified.
None of these are blockers — just ideas for making the validation more robust over time.
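The first three checks above could be sketched roughly as follows. Table and column names follow the entity layout described in the PR; treat the function and its signature as assumptions, not the shape of the actual test file.

```python
import numpy as np
import pandas as pd

def check_cross_format_invariants(person_df, household_df, h5py_arrays):
    """Hypothetical extra assertions for test_format_comparison.py."""
    # Weight conservation: totals must match across formats.
    assert np.isclose(
        household_df["household_weight"].sum(),
        np.asarray(h5py_arrays["household_weight"]).sum(),
    )
    # Referential integrity: every person must point at a real household.
    assert set(person_df["person_household_id"]).issubset(
        set(household_df["household_id"])
    )
    # Row count consistency: no more distinct household references
    # than there are household rows.
    assert person_df["person_household_id"].nunique() <= len(household_df)
```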
4. Minor: manifest entity mismatch
_build_uprating_manifest classifies unknown variables (not in system.variables) as entity "unknown", but _split_into_entity_dfs classifies them as "household". The manifest entity won't match where the variable actually ends up. Should be consistent — probably both "household".
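One way to guarantee consistency is a single shared fallback that both functions call. This is a sketch only; it assumes `system.variables` behaves like a dict of variable objects with an `entity.key` attribute, as in PolicyEngine country packages.

```python
# Hypothetical shared helper so the manifest and the entity split
# can never disagree on where an unknown variable lands.
FALLBACK_ENTITY = "household"

def entity_for_variable(name, variables):
    """Return the entity key for `name`, defaulting to the household table."""
    var = variables.get(name)
    return var.entity.key if var is not None else FALLBACK_ENTITY
```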
5. Upload pattern in publish_local_area.py
The HDFStore upload uses os.path.exists(hdfstore_path) checks repeated across states/districts/cities. This is fine for now but could be a helper like _upload_hdfstore_if_exists(output_path, category) to reduce the copy-paste.
Fixes #567
Related to PolicyEngine/policyengine-us#7700
Summary
- stacked_dataset_builder.py now produces a Pandas HDFStore file (.hdfstore.h5) alongside the existing h5py file, with one table per entity and an embedded uprating manifest
- The upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/)

Test plan
- Run stacked_dataset_builder on a single CD/state and confirm both .h5 and .hdfstore.h5 files are created
- Run pytest test_format_comparison.py --h5py-path STATE.h5 --hdfstore-path STATE.hdfstore.h5 and confirm all variables match
- Inspect the _variable_metadata manifest and confirm correct entity and uprating columns

🤖 Generated with Claude Code