Skip to content

Replace batched QRF with sequential fit_predict()#594

Merged
baogorek merged 5 commits intomainfrom
microimpute-fit-predict
Mar 9, 2026
Merged

Replace batched QRF with sequential fit_predict()#594
baogorek merged 5 commits intomainfrom
microimpute-fit-predict

Conversation

@MaxGhenis
Copy link
Contributor

Summary

  • Replace _batch_qrf() (groups of 10 variables, destroying covariance) with _sequential_qrf() using microimpute's fit_predict() — single sequential QRF preserving full joint distribution across all 85+ PUF income variables
  • Replace manual sampling/gc in _impute_weeks_unemployed() and _impute_retirement_contributions() with max_train_samples parameter
  • Bump microimpute dependency to >=1.15.1

Fixes the covariance destruction from batching (related: #590). Upstream features added in PolicyEngine/microimpute#170.

Test plan

  • CI passes (unit tests with synthetic data)
  • Full dataset build produces reasonable distributions
  • Verify income variable correlations are higher than batched approach

🤖 Generated with Claude Code

@baogorek
Copy link
Collaborator

baogorek commented Mar 9, 2026

The old _batch_qrf explicitly zero-filled variables missing from the training data and included them in the returned dict. The new _sequential_qrf returns only predictions.columns, so if fit_predict() drops any requested variable, the callers at lines 511, 514, and 537 will KeyError — they do bare y_full[variable] lookups with no guard.

Rather than silently zero-filling like before, I'd suggest failing loud:

result = {var: predictions[var].values for var in predictions.columns}
missing = set(output_vars) - set(result)
if missing:
    raise ValueError(
        f"{len(missing)} variables requested but not returned by "
        f"fit_predict(): {sorted(missing)[:10]}"
    )
return result

@MaxGhenis
Copy link
Contributor Author

Good catch — added the explicit ValueError in d2114a6. microimpute's fit_predict() does zero-fill missing vars so they'll always be in predictions.columns, but better to fail loud if something unexpected happens.

Also found that _drop_formula_variables was silently dropping 6 PUF-imputed adds variables (taxable_pension_income, interest_deduction, tax_exempt_pension_income, pre_tax_contributions, self_employed_pension_contribution_ald, self_employed_health_insurance_ald) whose sub-components aren't separately stored. Added them to _KEEP_FORMULA_VARS in the same commit.

Copy link
Collaborator

@baogorek baogorek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

MaxGhenis and others added 5 commits March 9, 2026 15:07
The old _batch_qrf() split 85+ PUF income variables into batches of
10, each with a fresh QRF model. This destroyed covariance between
variables in different batches (e.g. employment_income and
long_term_capital_gains).

Now uses microimpute's fit_predict() which runs a single sequential
QRF: each variable is conditioned on all previously imputed variables,
preserving the full joint distribution. Also replaces manual sampling
and gc cleanup in weeks_unemployed and retirement imputation with
max_train_samples parameter.

Requires microimpute>=1.15.1 (fit_predict + max_train_samples).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Instead of whitelisting formula vars in _KEEP_FORMULA_VARS, rename
them to their leaf input equivalents so _drop_formula_variables works
correctly and the engine can recompute aggregates from formulas:

- taxable_pension_income -> taxable_private_pension_income
- tax_exempt_pension_income -> tax_exempt_private_pension_income
- interest_deduction -> deductible_mortgage_interest
- self_employed_pension_contribution_ald -> _person
- self_employed_health_insurance_ald -> _person

Co-Authored-By: Claude Opus 4.6 <[email protected]>
HRSWK and A_HRS1 are already weekly measures from the CPS — they don't
need to be multiplied by WKSWORK/52. The old code deflated part-year
workers' weekly hours (e.g. someone working 40 hrs/wk for 26 weeks
showed as 20 hrs/wk).

Fixes part of #561

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The modal run command failed because local_area.py now has two
local entrypoints (main and main_promote), requiring explicit
::main disambiguation. Also temporarily disables push and
repository_dispatch triggers to prevent blocking other workflows.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@MaxGhenis MaxGhenis force-pushed the microimpute-fit-predict branch from 4e5eb36 to 979c382 Compare March 9, 2026 19:07
@baogorek baogorek merged commit ae4059c into main Mar 9, 2026
6 checks passed
@baogorek baogorek deleted the microimpute-fit-predict branch March 9, 2026 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants