Replace batched QRF with sequential fit_predict()#594
Conversation
|
The old Rather than silently zero-filling like before, I'd suggest failing loud: result = {var: predictions[var].values for var in predictions.columns}
missing = set(output_vars) - set(result)
if missing:
raise ValueError(
f"{len(missing)} variables requested but not returned by "
f"fit_predict(): {sorted(missing)[:10]}"
)
return result |
|
Good catch — added the explicit ValueError in d2114a6. microimpute's fit_predict() does zero-fill missing vars so they'll always be in predictions.columns, but better to fail loud if something unexpected happens. Also found that |
The old _batch_qrf() split 85+ PUF income variables into batches of 10, each with a fresh QRF model. This destroyed covariance between variables in different batches (e.g. employment_income and long_term_capital_gains). Now uses microimpute's fit_predict() which runs a single sequential QRF: each variable is conditioned on all previously imputed variables, preserving the full joint distribution. Also replaces manual sampling and gc cleanup in weeks_unemployed and retirement imputation with max_train_samples parameter. Requires microimpute>=1.15.1 (fit_predict + max_train_samples). Co-Authored-By: Claude Opus 4.6 <[email protected]>
Instead of whitelisting formula vars in _KEEP_FORMULA_VARS, rename them to their leaf input equivalents so _drop_formula_variables works correctly and the engine can recompute aggregates from formulas: - taxable_pension_income -> taxable_private_pension_income - tax_exempt_pension_income -> tax_exempt_private_pension_income - interest_deduction -> deductible_mortgage_interest - self_employed_pension_contribution_ald -> _person - self_employed_health_insurance_ald -> _person Co-Authored-By: Claude Opus 4.6 <[email protected]>
HRSWK and A_HRS1 are already weekly measures from the CPS — they don't need to be multiplied by WKSWORK/52. The old code deflated part-year workers' weekly hours (e.g. someone working 40 hrs/wk for 26 weeks showed as 20 hrs/wk). Fixes part of #561 Co-Authored-By: Claude Opus 4.6 <[email protected]>
The modal run command failed because local_area.py now has two local entrypoints (main and main_promote), requiring explicit ::main disambiguation. Also temporarily disables push and repository_dispatch triggers to prevent blocking other workflows. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
4e5eb36 to
979c382
Compare
Summary
_batch_qrf()(groups of 10 variables, destroying covariance) with_sequential_qrf()using microimpute'sfit_predict()— single sequential QRF preserving full joint distribution across all 85+ PUF income variables_impute_weeks_unemployed()and_impute_retirement_contributions()withmax_train_samplesparameterFixes the covariance destruction from batching (related: #590). Upstream features added in PolicyEngine/microimpute#170.
Test plan
🤖 Generated with Claude Code