Add entity-level HDFStore output format alongside h5py #568
Conversation
The stacked_dataset_builder now produces a Pandas HDFStore file (.hdfstore.h5) in addition to the existing h5py file. The HDFStore contains one table per entity (person, household, tax_unit, spm_unit, family, marital_unit) plus an embedded _variable_metadata manifest recording each variable's entity and uprating parameter path. The upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/). A comparison test (test_format_comparison.py) validates that both formats contain identical data for all variables. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Replaces the two-file-input test with a self-contained roundtrip script that takes only an h5py file path, generates an HDFStore using inlined splitting logic, then compares both formats. Handles entity-level h5py files and yearly/ETERNITY/monthly period keys. Co-Authored-By: Claude Opus 4.6 <[email protected]>
juaristi22
left a comment
Review: HDFStore output format
Overall the feature makes sense — producing an entity-level HDFStore alongside the variable-centric h5py file is a clean step toward the API v2 alpha format. A few structural and testing concerns below.
1. Extract HDFStore logic into its own module
The three new functions (_split_into_entity_dfs, _build_uprating_manifest, _save_hdfstore) are a general-purpose h5py → HDFStore format transform. They don't depend on calibration, CD stacking, or anything specific to stacked_dataset_builder.py.
Embedding them in stacked_dataset_builder.py creates two problems:
- Merge conflict with #538. That PR reorganizes calibration modules from datasets/cps/local_area_calibration/ to calibration/. This PR adds ~150 lines of new functions to the same file. The merge will be painful.
- Coupling to a build pipeline. The HDFStore format is consumed by API v2 and extend_single_year_dataset(). Those consumers shouldn't need to trace into a calibration build module for the serialization logic.
Suggestion: Move these to a standalone utility, e.g. policyengine_us_data/utils/hdfstore.py:
```python
# policyengine_us_data/utils/hdfstore.py
def split_into_entity_dfs(arrays, system, vars_to_save): ...
def build_uprating_manifest(vars_to_save, system): ...
def save_hdfstore(entity_dfs, manifest_df, path, time_period): ...
```

Then stacked_dataset_builder.py and any future consumer just imports from it.
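To make the shape of the extracted module concrete, here is a minimal sketch of the splitting function. This is not the production code: the entity names come from the PR description, the `entity_of` callable stands in for whatever lookup the real code does against `system.variables`, and the real implementation also handles `__period` suffixes and the `person_X_id` rename discussed below.

```python
import pandas as pd

# Entity names as listed in the PR description; the production code
# would derive these from the PolicyEngine system object instead.
ENTITIES = ["person", "household", "tax_unit", "spm_unit", "family", "marital_unit"]

def split_into_entity_dfs(arrays, entity_of):
    """Group flat variable arrays into one DataFrame per entity.

    arrays: dict mapping variable name -> 1-D array of values.
    entity_of: callable mapping a variable name to its entity key
               (a placeholder for the real system.variables lookup).
    """
    by_entity = {entity: {} for entity in ENTITIES}
    for name, values in arrays.items():
        # setdefault tolerates entity keys outside ENTITIES.
        by_entity.setdefault(entity_of(name), {})[name] = values
    return {
        entity: pd.DataFrame(columns)
        for entity, columns in by_entity.items()
        if columns
    }
```

A consumer such as stacked_dataset_builder.py would then import this function rather than carrying its own copy.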
2. The test reimplements the logic instead of testing it
test_format_comparison.py contains its own copy of _split_into_entity_dfs, _build_uprating_manifest, and _save_hdfstore. This means the test validates its own implementation, not the production code. If the production _split_into_entity_dfs has a bug (e.g., incorrect __period suffix stripping, wrong person_X_id → X_id rename), this test cannot catch it.
Note also that the two implementations already diverge:
- Production works on a combined_df with variable__period columns and handles the person_X_id → X_id rename.
- Test works on raw h5py arrays with no period suffix and doesn't handle that rename.
- Production's _save_hdfstore does not deduplicate columns; the test version does (df.loc[:, ~df.columns.duplicated()]).
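For reference, the deduplication idiom in the test version keeps only the first occurrence of each repeated column name, since `Index.duplicated()` marks later occurrences as duplicates:

```python
import pandas as pd

# Two columns share the name "person_id"; the boolean mask from
# df.columns.duplicated() is True for the second occurrence, so
# negating it keeps only the first copy of each column name.
df = pd.DataFrame([[1, 2, 3]], columns=["person_id", "age", "person_id"])
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # ['person_id', 'age']
```

If the production _save_hdfstore can ever receive duplicate column names, it would silently differ from the test version here.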
Suggestion: If the conversion logic lives in a shared module (point 1), the test can import it directly instead of reimplementing it. The test would then validate the actual production code path.
3. Nice-to-have test improvements
The current tests check value equality and structural completeness, which covers the basics well. A few additional checks that could be worth adding down the road if you want to harden things further:
- Weight conservation — verifying that household_weight, person_weight, etc. sum to the same totals across formats. This is probably the highest-value addition since it's the property that matters most for microsimulation correctness.
- Referential integrity — checking that every person_household_id in the person table references a valid household_id in the household table (and likewise for other entities). Silent FK violations won't fail value comparisons but would break simulations.
- Row count consistency — e.g. n_unique(person_household_id) == len(household_df).
- Group entity dedup correctness — the current comparison for group entities falls back to np.unique set comparison when array lengths differ, which can mask cases where dedup picks the wrong representative row. Not urgent but worth being aware of.
- Bidirectional column check — currently checks h5py → HDFStore but not the reverse (no extra/spurious columns in HDFStore).
- _time_period roundtrip — the _time_period metadata table is written but never read back or verified.
None of these are blockers — just ideas for making the validation more robust over time.
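The first three checks above could be sketched roughly as follows. Table and column names follow the entity layout described in the PR; treat the function and its signature as assumptions, not the shape of the actual test file.

```python
import numpy as np
import pandas as pd

def check_cross_format_invariants(person_df, household_df, h5py_arrays):
    """Hypothetical extra assertions for test_format_comparison.py."""
    # Weight conservation: totals must match across formats.
    assert np.isclose(
        household_df["household_weight"].sum(),
        np.asarray(h5py_arrays["household_weight"]).sum(),
    )
    # Referential integrity: every person must point at a real household.
    assert set(person_df["person_household_id"]).issubset(
        set(household_df["household_id"])
    )
    # Row count consistency: no more distinct household references
    # than there are household rows.
    assert person_df["person_household_id"].nunique() <= len(household_df)
```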
4. Minor: manifest entity mismatch
_build_uprating_manifest classifies unknown variables (not in system.variables) as entity "unknown", but _split_into_entity_dfs classifies them as "household". The manifest entity won't match where the variable actually ends up. Should be consistent — probably both "household".
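One way to guarantee consistency is a single shared fallback that both functions call. This is a sketch only; it assumes `system.variables` behaves like a dict of variable objects with an `entity.key` attribute, as in PolicyEngine country packages.

```python
# Hypothetical shared helper so the manifest and the entity split
# can never disagree on where an unknown variable lands.
FALLBACK_ENTITY = "household"

def entity_for_variable(name, variables):
    """Return the entity key for `name`, defaulting to the household table."""
    var = variables.get(name)
    return var.entity.key if var is not None else FALLBACK_ENTITY
```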
5. Upload pattern in publish_local_area.py
The HDFStore upload uses os.path.exists(hdfstore_path) checks repeated across states/districts/cities. This is fine for now but could be a helper like _upload_hdfstore_if_exists(output_path, category) to reduce the copy-paste.
Fixes #567
Related to PolicyEngine/policyengine-us#7700
Summary
- stacked_dataset_builder.py now produces a Pandas HDFStore file (.hdfstore.h5) alongside the existing h5py file, with one table per entity and an embedded uprating manifest
- The upload pipeline uploads HDFStore files to dedicated subdirectories (states_hdfstore/, districts_hdfstore/, cities_hdfstore/)

Test plan
- Run stacked_dataset_builder on a single CD/state and confirm both .h5 and .hdfstore.h5 files are created
- Run pytest test_format_comparison.py --h5py-path STATE.h5 --hdfstore-path STATE.hdfstore.h5 and confirm all variables match
- Inspect the _variable_metadata manifest and confirm correct entity and uprating columns

🤖 Generated with Claude Code