111 changes: 111 additions & 0 deletions statvar_imports/pennsylvania/pennsylvania_education/README.md
@@ -0,0 +1,111 @@
#### Copyright 2025 Google LLC
####
#### Licensed under the Apache License, Version 2.0 (the "License");
#### you may not use this file except in compliance with the License.
####
#### http://www.apache.org/licenses/LICENSE-2.0
####
#### Unless required by applicable law or agreed to in writing, software
#### distributed under the License is distributed on an "AS IS" BASIS,
#### WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#### See the License for the specific language governing permissions and
#### limitations under the License.
-----


## Pennsylvania_education Import

Dataset related to Pennsylvania's Education at county level.
-----

**Provenance Description:**
Data assets within this catalog are authored and maintained by individual Commonwealth agencies, which serve as the authoritative sources for their respective domains. The portal, managed by the Office of Administration, provides a transparent audit trail by documenting original publication dates, metadata update frequencies, and the specific departmental "stewards" responsible for the data's accuracy and integrity.

### How to Use

The workflow for this data import involves two main steps: downloading the necessary files and then processing them.

#### Step 1: Download the Data

- **Source:** [Pennsylvania_Education](https://data.pa.gov/browse?sortBy=relevance&page=1&pageSize=20)
- **Description:** The provided URL links to the Education data category within the Commonwealth of Pennsylvania’s open data repository. This portal serves as a centralized clearinghouse for public records, statistics, and geospatial data managed by the Pennsylvania Department of Education (PDE) and related agencies.

To fetch the necessary data files, run the download script `download_script.py`.

The script downloads the files listed below into the `input_files` folder. Within this folder, there are four sub-folders, each containing categorized data for both adults and children:

- educational_attainment_by_age_range_and_gender

- post_secondary_completions_total_awards_degrees

- public_school_enrollment_by_county_grade_and_race

- undergraduate_stem_enrollment
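
A quick way to confirm the download step finished is to check that every dataset listed above has a non-empty CSV. The sketch below is illustrative only: the helper name `missing_inputs` is hypothetical, and it assumes the flat `input_files/<name>.csv` layout that `download_script.py` in this change writes, rather than one sub-folder per dataset.

```python
import os

# Dataset names as listed above. The flat <name>.csv layout is an assumption
# taken from download_script.py in this change, not from this README.
DATASETS = [
    'educational_attainment_by_age_range_and_gender',
    'post_secondary_completions_total_awards_degrees',
    'public_school_enrollment_by_county_grade_and_race',
    'undergraduate_stem_enrollment',
]


def missing_inputs(input_dir, datasets=DATASETS):
    """Return the dataset names whose CSV is absent or empty in input_dir."""
    missing = []
    for name in datasets:
        path = os.path.join(input_dir, f'{name}.csv')
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

Calling `missing_inputs('input_files')` after the download should return an empty list; any names it returns point at datasets to fetch again.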


### Auto refresh Type

This import will be refreshed in a fully automated manner.

-----

#### Step 2: Process the Files

After downloading the files, you can process them to generate the final output. To do this:

**Option A: Use the `run_processing.sh` script**

The `run_processing.sh` script automates the processing of all the downloaded files.

**Run the following command:**

```bash
sh run.sh
Review comment (Contributor; medium):

The command to run the processing script is incorrect. The script is named run_processing.sh, not run.sh.

Suggested change:
- sh run.sh
+ sh run_processing.sh

```

**Option B: Manually Execute the Processing Script**

You can also run the `stat_var_processor.py` script individually for each file. This script is located in the `data/tools/statvar_importer/` directory.

Here are the specific commands for each file:

```bash
python3 stat_var_processor.py \
--input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender/*.csv \
--pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv \
--config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv \
--output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
--places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_places_resolver.csv
```
Comment on lines +73 to +80
Review comment (Contributor; medium):

This command has two issues:

  1. The --input_data argument uses a glob pattern (*.csv), which is inconsistent with the run_processing.sh script. It should be the specific filename.
  2. It incorrectly includes a --places_resolved_csv argument. This dataset does not have a corresponding places_resolved.csv file, and the run_processing.sh script does not use this argument for this dataset.
Suggested change:
- python3 stat_var_processor.py
- --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender/*.csv"
- --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv"
- --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv"
- --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output"
- --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf
- --places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_places_resolver.csv"
- ```
+ python3 stat_var_processor.py \
+ --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender.csv" \
+ --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" \
+ --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" \
+ --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" \
+ --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf


```bash
python3 stat_var_processor.py \
--input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/post_secondary_completions_total_awards_degrees/*.csv \
--pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/post_secondary_completions_total_awards_degrees_pvmap.csv \
--config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv \
--output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/post_secondary_completions_total_awards_degrees_output \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
--places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/post_secondary_completions_total_awards_degrees_places_resolver.csv
```

```bash
python3 stat_var_processor.py \
--input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race/*.csv \
Review comment (Contributor; medium):

The --input_data argument uses a glob pattern (*.csv), which is inconsistent with the run_processing.sh script and with the download script, which creates a single file per dataset. This should be the specific filename.

Suggested change:
- --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race/*.csv"
+ --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race.csv"

--pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/public_school_enrollment_by_county_grade_and_race_pvmap.csv \
--config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv \
--output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/public_school_enrollment_by_county_grade_and_race_output \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
--places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/public_school_enrollment_by_county_grade_and_race_places_resolver.csv
```

```bash
python3 stat_var_processor.py \
--input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment/*.csv \
Review comment (Contributor; medium):

The --input_data argument uses a glob pattern (*.csv), which is inconsistent with the run_processing.sh script and with the download script, which creates a single file per dataset. This should be the specific filename.

Suggested change:
- --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment/*.csv"
+ --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment.csv"

--pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/undergraduate_stem_enrollment_pvmap.csv \
--config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv \
--output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/undergraduate_stem_enrollment_output \
--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf \
--places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/undergraduate_stem_enrollment_places_resolver.csv
```
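
The four manual commands differ only in the dataset name, so they can be generated from one template. The sketch below is a hedged illustration, not part of this change: `build_command` is a hypothetical helper, it assumes the single-file `input_files/<name>.csv` layout that the review comments describe, and it leaves out `--places_resolved_csv` on the assumption that no resolver file is needed. The flags themselves are the ones used in the commands above.

```python
import shlex

# Relative path used by the commands above, as seen from
# data/tools/statvar_importer/.
IMPORT_DIR = '../../statvar_imports/pennsylvania/pennsylvania_education'

DATASETS = [
    'educational_attainment_by_age_range_and_gender',
    'post_secondary_completions_total_awards_degrees',
    'public_school_enrollment_by_county_grade_and_race',
    'undergraduate_stem_enrollment',
]


def build_command(name, import_dir=IMPORT_DIR):
    """Return the argument list for one stat_var_processor.py run."""
    return [
        'python3', 'stat_var_processor.py',
        # Assumption: one CSV per dataset, named after the dataset.
        f'--input_data={import_dir}/input_files/{name}.csv',
        f'--pv_map={import_dir}/{name}_pvmap.csv',
        f'--config_file={import_dir}/common_metadata.csv',
        f'--output_path={import_dir}/output_files/{name}_output',
        '--existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf',
    ]


if __name__ == '__main__':
    # Print the commands for review instead of running them.
    for dataset in DATASETS:
        print(shlex.join(build_command(dataset)))
```

To execute a command rather than print it, pass the list to `subprocess.run(..., check=True)` from the `data/tools/statvar_importer/` directory; passing a list avoids the shell quoting and line-continuation pitfalls of the hand-written commands.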

@@ -0,0 +1,5 @@
parameter,value
#places_within,
output_columns,"observationAbout,observationDate,value,variableMeasured"
header_rows,1
url,https://data.pa.gov/browse?sortBy=relevance&page=1&pageSize=20
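
For readers unfamiliar with this `parameter,value` layout, here is a minimal sketch of how such a file could be read into a dictionary. It is an illustration only: `load_config` is a hypothetical helper, and treating rows that start with `#` as disabled is an assumption about how the processor handles the file, not something this change confirms.

```python
import csv
import io


def load_config(text):
    """Parse 'parameter,value' rows into a dict, skipping '#' rows."""
    config = {}
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the 'parameter,value' header row
    for row in reader:
        # Assumption: a leading '#' marks a parameter as commented out.
        if not row or row[0].startswith('#'):
            continue
        config[row[0]] = row[1] if len(row) > 1 else ''
    return config


SAMPLE = '''parameter,value
#places_within,
output_columns,"observationAbout,observationDate,value,variableMeasured"
header_rows,1
'''

config = load_config(SAMPLE)
# The quoted value is one CSV cell that itself holds a comma-separated list.
columns = config['output_columns'].split(',')
```

The quoting on the `output_columns` row matters: without the double quotes, the column list would be split into separate CSV cells and only `observationAbout` would be read as the value.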
@@ -0,0 +1,32 @@
import os
import requests

def download_file(url, output_path):
    print(f'Downloading {url} to {output_path}...')
    response = requests.get(url, stream=True)
    response.raise_for_status()

    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print('Download complete.')

def main():
    base_path = os.path.dirname(os.path.abspath(__file__))
    input_files_dir = os.path.join(base_path, 'input_files')

    datasets = {
        'educational_attainment_by_age_range_and_gender': 'xwn6-8rmw',
        'post_secondary_completions_total_awards_degrees': 'jqcu-bcsg',
        'public_school_enrollment_by_county_grade_and_race': 'wb8u-h3s8',
        'undergraduate_stem_enrollment': 'r75w-4bue'
    }

    for folder, data_id in datasets.items():
        url = f'https://data.pa.gov/api/views/{data_id}/rows.csv?accessType=DOWNLOAD'
        output_path = os.path.join(input_files_dir, f'{folder}.csv')
        download_file(url, output_path)

if __name__ == '__main__':
    main()
@@ -0,0 +1,23 @@
key,,,p1,v1,p2,v2
County FIPS Code,observationAbout,geoId/{Data},populationType,Person,statType,measuredValue
Total Population,measuredProperty,count,value,{Number},,
No High School Diploma,educationalAttainment,NoDiploma,value,{Number},,
High School Diploma Or Equivalent,educationalAttainment,HighSchoolDiplomaIncludesEquivalency,value,{Number},,
Some College No Degree,educationalAttainment,SomeCollegeNoDegree,value,{Number},,
Associate's Degree,educationalAttainment,AssociatesDegree,value,{Number},,
Bachelor's Degree,educationalAttainment,BachelorsDegree,value,{Number},,
Graduate or Professional Degree,educationalAttainment,GraduateOrProfessionalDegree,value,{Number},,
Total Post-Secondary Degrees,educationalAttainment,PostSecondaryDegree,value,{Number},,
Male,gender,Male,,,,
Female,gender,Female,,,,
35 to 44 Years,age,[35 44 Years],,,,
25 to 34 Years,age,[25 34 Years],,,,
45 to 64 Years,age,[45 64 Years],,,,
,,,,,,
2010,observationDate,2010,,,,
2011,observationDate,2011,,,,
2012,observationDate,2012,,,,
2013,observationDate,2013,,,,
2014,observationDate,2014,,,,
2015,observationDate,2015,,,,
2016,observationDate,2016,,,,
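
Each row of this map pairs a key (a column header or cell value in the source CSV) with one or more property/value pairs, where `{Data}` and `{Number}` are placeholders for the cell's content. The sketch below shows one simplified reading of that layout. It is not the logic of `stat_var_processor.py`: `load_pvmap` and `resolve` are hypothetical helpers, plain string substitution stands in for whatever parsing the real tool performs, and the FIPS value is only an example input.

```python
import csv
import io


def load_pvmap(text):
    """Map each key to a dict of property -> value from a pvmap CSV."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # header row: key,,,p1,v1,p2,v2
    pvmap = {}
    for row in reader:
        if not row or not row[0]:
            continue  # skip blank separator rows
        cells = row[1:]
        # Cells after the key alternate property, value, property, value...
        pairs = zip(cells[0::2], cells[1::2])
        pvmap[row[0]] = {prop: val for prop, val in pairs if prop}
    return pvmap


def resolve(pvs, cell):
    """Fill the {Data} and {Number} placeholders with a cell's text."""
    return {
        prop: val.replace('{Data}', cell).replace('{Number}', cell)
        for prop, val in pvs.items()
    }


SAMPLE = '''key,,,p1,v1,p2,v2
County FIPS Code,observationAbout,geoId/{Data},populationType,Person,statType,measuredValue
Bachelor's Degree,educationalAttainment,BachelorsDegree,value,{Number},,
,,,,,,
2016,observationDate,2016,,,,
'''

pvmap = load_pvmap(SAMPLE)
place = resolve(pvmap['County FIPS Code'], '42001')
```

Here `place['observationAbout']` becomes `geoId/42001`, while the constant pairs (`populationType`, `statType`) pass through unchanged.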
34 changes: 34 additions & 0 deletions statvar_imports/pennsylvania/pennsylvania_education/manifest.json
@@ -0,0 +1,34 @@
{
  "import_specifications": [
    {
      "import_name": "Pennsylvania_Education",
      "curator_emails": ["[email protected]"],
      "provenance_url": "https://data.pa.gov/",
      "provenance_description": "Dataset related to the pennsylvania's Education at country level.",
Review comment (Contributor; medium):

There are a couple of typos in the provenance_description. "country" should be "county", and "pennsylvania's" should be "Pennsylvania's".

Suggested change:
- "provenance_description": "Dataset related to the pennsylvania's Education at country level.",
+ "provenance_description": "Dataset related to Pennsylvania's Education at county level.",

      "scripts": ["download_script.py", "run_processing.sh"],
      "source_files": [
        "input_files/*.csv"
      ],
      "import_inputs": [
        {
          "template_mcf": "output_files/educational_attainment_by_age_range_and_gender/educational_attainment_by_age_range_and_gender_output.tmcf",
          "cleaned_csv": "output_files/educational_attainment_by_age_range_and_gender/educational_attainment_by_age_range_and_gender_output.csv"
        },
        {
          "template_mcf": "output_files/post_secondary_completions_total_awards_degrees/post_secondary_completions_total_awards_degrees_output.tmcf",
          "cleaned_csv": "output_files/post_secondary_completions_total_awards_degrees/post_secondary_completions_total_awards_degrees_output.csv"
        },
        {
          "template_mcf": "output_files/public_school_enrollment_by_county_grade_and_race/public_school_enrollment_by_county_grade_and_race_output.tmcf",
          "cleaned_csv": "output_files/public_school_enrollment_by_county_grade_and_race/public_school_enrollment_by_county_grade_and_race_output.csv"
        },
        {
          "template_mcf": "output_files/undergraduate_stem_enrollment/undergraduate_stem_enrollment_output.tmcf",
          "cleaned_csv": "output_files/undergraduate_stem_enrollment/undergraduate_stem_enrollment_output.csv"
        }
      ],
      "cron_schedule": "0 02 * * 2"
    }
  ]
}
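
Each `import_inputs` entry in this manifest pairs a `.tmcf` template with a `.csv` that shares its path stem. A small check like the one below could catch a mistyped path before the import runs. This is a sketch under assumptions: `check_import_inputs` is a hypothetical helper, and the matching-stem rule is inferred from the four entries in this manifest, not from a documented requirement of the import tooling.

```python
import json


def check_import_inputs(manifest):
    """Return a list of problems found in the template/CSV pairs."""
    problems = []
    for spec in manifest.get('import_specifications', []):
        for item in spec.get('import_inputs', []):
            tmcf = item.get('template_mcf', '')
            cleaned = item.get('cleaned_csv', '')
            if not tmcf.endswith('.tmcf') or not cleaned.endswith('.csv'):
                problems.append(f'unexpected extension: {item}')
            # Assumption: both files share the same path minus extension.
            elif tmcf[:-len('.tmcf')] != cleaned[:-len('.csv')]:
                problems.append(f'mismatched pair: {tmcf} vs {cleaned}')
    return problems


SAMPLE = json.loads('''
{
  "import_specifications": [
    {
      "import_name": "Pennsylvania_Education",
      "import_inputs": [
        {
          "template_mcf": "output_files/undergraduate_stem_enrollment/undergraduate_stem_enrollment_output.tmcf",
          "cleaned_csv": "output_files/undergraduate_stem_enrollment/undergraduate_stem_enrollment_output.csv"
        }
      ]
    }
  ]
}
''')

problems = check_import_inputs(SAMPLE)
```

An empty `problems` list means every pair follows the convention; the same function can be pointed at the full file with `json.load(open('manifest.json'))`.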
