Problem
When building the extended CPS, the PUF clone half receives CPS-only variables (like retirement contributions) either by:
- Direct duplication from the CPS donor record (`else` branch in `puf_clone_dataset`)
- PUF override via `OVERRIDDEN_IMPUTED_VARIABLES` (e.g. `pre_tax_contributions`)

Neither approach preserves the relationship between these variables and income. A PUF clone with $0 wages can end up with $50k in 401(k) contributions, because there's no model linking contributions to the income variables that are common between CPS and PUF.
This creates implausible records and makes calibration harder: you can't calibrate away a structural data quality issue.

Proposed solution
For variables that exist in CPS but not PUF, train predictive models using the CPS half to predict these variables from features common to both CPS and PUF.

Common predictors (available in both datasets):
- `employment_income`

Variables to model (examples):
- `pre_tax_contributions` (retirement contributions; see Add calibration targets for retirement contributions #553)
- `traditional_401k_contributions`
- `traditional_ira_contributions`
- `roth_401k_contributions`
- `self_employed_pension_contributions`
- Other CPS-only variables currently in `OVERRIDDEN_IMPUTED_VARIABLES` that should respect income relationships

Approach
- On the CPS half (which has both income variables and CPS-only variables), train lightweight models (e.g. quantile regression, gradient boosting) predicting each CPS-only variable from the common features
- Apply these models to the PUF clone half, using the PUF-derived income values as inputs
- This ensures that a PUF clone with high wages gets plausible retirement contributions, and one with $0 wages gets $0 contributions
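
To make the train-on-CPS, apply-to-PUF-clone idea concrete, here is a minimal sketch in standard-library Python. It is not the repository's implementation: the helper names `fit_binned_imputer` and `impute`, the synthetic donor data, and the rule that non-positive income maps to zero contributions are all assumptions for illustration. It also swaps in a simple income-binned donor draw (a conditional hot deck) as a stand-in for the quantile regression or gradient boosting models mentioned above.

```python
import bisect
import random


def fit_binned_imputer(incomes, targets, n_bins=10):
    """Group donor target values into income-quantile bins (positive income only)."""
    positive = sorted(x for x in incomes if x > 0)
    if not positive:
        raise ValueError("need at least one donor with positive income")
    # Interior bin edges at evenly spaced quantiles of positive donor income.
    edges = [positive[(len(positive) * k) // n_bins] for k in range(1, n_bins)]
    bins = {}
    for x, y in zip(incomes, targets):
        if x > 0:
            bins.setdefault(bisect.bisect_right(edges, x), []).append(y)
    return edges, bins


def impute(model, incomes, rng):
    """Draw a donor value from the matching income bin for each recipient."""
    edges, bins = model
    imputed = []
    for x in incomes:
        if x <= 0:
            # Assumption: no employment income implies no contributions.
            imputed.append(0.0)
            continue
        target_bin = bisect.bisect_right(edges, x)
        if target_bin not in bins:
            # Fall back to the nearest populated bin.
            target_bin = min(bins, key=lambda k: abs(k - target_bin))
        imputed.append(rng.choice(bins[target_bin]))
    return imputed


# Synthetic stand-in for the CPS half (hypothetical data, not real CPS records).
rng = random.Random(0)
cps_income, cps_contrib = [], []
for _ in range(5000):
    income = 0.0 if rng.random() < 0.2 else rng.lognormvariate(10.8, 0.8)
    participates = income > 0 and rng.random() < 0.5
    contrib = min(income * rng.uniform(0.02, 0.10), 23_000.0) if participates else 0.0
    cps_income.append(income)
    cps_contrib.append(contrib)

# Fit on the CPS half, then impute for PUF clones using their PUF-derived income.
model = fit_binned_imputer(cps_income, cps_contrib)
puf_clone_income = [0.0, 30_000.0, 250_000.0]
imputed = impute(model, puf_clone_income, rng)
print(list(zip(puf_clone_income, imputed)))
```

Because each clone draws from donors with similar income, a zero-wage clone cannot inherit a large contribution, and imputed values keep the spread seen among comparable CPS records rather than collapsing to a conditional mean. A real model would use more of the common predictors than income alone.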
Related