-
Notifications
You must be signed in to change notification settings - Fork 132
Pennsylvania education #1890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Pennsylvania education #1890
Conversation
Summary of ChangesHello @smarthg-gi, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request establishes a new, automated pipeline for importing a wide array of education-related statistics for Pennsylvania. The initiative aims to enrich the data platform with detailed county-level and institutional education metrics, facilitating a deeper understanding of educational trends and demographics within the state. The implemented solution ensures that the data can be regularly updated with minimal manual intervention. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds a new data import for Pennsylvania education statistics. The changes include a download script, processing scripts, configuration files, and test data. I've identified several areas for improvement, primarily in the documentation and data mapping files, to ensure correctness and clarity. Key issues include incorrect commands and file paths in the README, typos and malformed entries in the manifest and CSV files, and potentially non-standard enum values in a pvmap file. Addressing these points will improve the robustness and maintainability of the import.
| 210748,ipedsId/210748,Altoona Beauty School Inc,,,,,, | ||
| 210775,ipedsId/210775,Alvernia University,,,,,, | ||
| 211088,ipedsId/211088,Arcadia University,,,,,, | ||
| 211006,ipedsId/211006,Automotive Training Center-Exton,,,,,,` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Westmoreland,geoId/42129 | ||
| Wyoming,geoId/42131 | ||
| York,geoId/42133 | ||
| Statewide [cite: Data], No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is malformed. It seems to be an incomplete entry for "Statewide" and contains a note [cite: Data] which should not be in the data. This line will likely be ignored or cause an error during processing. Please either complete the entry with a valid dcid or remove the line. Also, the file is missing a newline at the end.
| Pennsylvania State University-Penn State Shenango,ipedsId/214634 | ||
| Pennsylvania State University-World Campus,ipedsId/479956 | ||
| Pennsylvania College of Technology,ipedsId/366252 | ||
| The Pennsylvania State University, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| **Run the following command:** | ||
|
|
||
| ```bash | ||
| sh run.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| python3 stat_var_processor.py | ||
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender/*.csv" | ||
| --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" | ||
| --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" | ||
| --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" | ||
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | ||
| --places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_places_resolver.csv" | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command has two issues:
- The
--input_dataargument uses a glob pattern (*.csv), which is inconsistent with therun_processing.shscript. It should be the specific filename. - It incorrectly includes a
--places_resolved_csvargument. This dataset does not have a correspondingplaces_resolved.csvfile, and therun_processing.shscript does not use this argument for this dataset.
| python3 stat_var_processor.py | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender/*.csv" | |
| --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" | |
| --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" | |
| --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" | |
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf | |
| --places_resolved_csv=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_places_resolver.csv" | |
| ``` | |
| python3 stat_var_processor.py \ | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/educational_attainment_by_age_range_and_gender.csv" \ | |
| --pv_map=../../statvar_imports/pennsylvania/pennsylvania_education/educational_attainment_by_age_range_and_gender_pvmap.csv" \ | |
| --config_file=../../statvar_imports/pennsylvania/pennsylvania_education/common_metadata.csv" \ | |
| --output_path=../../statvar_imports/pennsylvania/pennsylvania_education/output_files/educational_attainment_by_age_range_and_gender_output" \ | |
| --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf |
|
|
||
| ```bash | ||
| python3 stat_var_processor.py | ||
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race/*.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The --input_data argument uses a glob pattern (*.csv), which is inconsistent with the run_processing.sh script and the download script which creates a single file per dataset. This should be the specific filename.
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race/*.csv" | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/public_school_enrollment_by_county_grade_and_race.csv" |
|
|
||
| ```bash | ||
| python3 stat_var_processor.py | ||
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment/*.csv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The --input_data argument uses a glob pattern (*.csv), which is inconsistent with the run_processing.sh script and the download script which creates a single file per dataset. This should be the specific filename.
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment/*.csv" | |
| --input_data=../../statvar_imports/pennsylvania/pennsylvania_education/input_files/undergraduate_stem_enrollment.csv" |
| "import_name": "Pennsylvania_Education", | ||
| "curator_emails": ["[email protected]"], | ||
| "provenance_url": "https://data.pa.gov/", | ||
| "provenance_description": "Dataset related to the pennsylvania's Education at country level.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of typos in the provenance_description. "country" should be "county", and "pennsylvania's" should be "Pennsylvania's".
| "provenance_description": "Dataset related to the pennsylvania's Education at country level.", | |
| "provenance_description": "Dataset related to Pennsylvania's Education at county level.", |
| Engineering Technologies and Engineering-Related Fields,enrollmentLevel,Graduate,bachelorsDegreeMajor,EngineeringMajor,,,,,,,, | ||
| Physical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,PhysicalSciences,,,,,,,, | ||
| Mathematics and Statistics,enrollmentLevel,Graduate,bachelorsDegreeMajor,MathAndStatisticsMajor,,,,,,,, | ||
| Biological and Biomedical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,BiologicalAndBiomedicalSciencesMajor,,,,,,,, | ||
| Science Technologies/Technicians,enrollmentLevel,Graduate,bachelorsDegreeMajor,ScienceAndTechnologiesMajor,,,,,,,, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the values for bachelorsDegreeMajor seem to be non-standard. Please check if these are correct Data Commons enums:
EngineeringMajor(line 4) should likely beEngineering.MathAndStatisticsMajor(line 6) should likely beMathematicsAndStatistics.ScienceAndTechnologiesMajor(line 8) should likely beScienceAndTechnology.
Using incorrect enum values will lead to issues during data import.
Engineering Technologies and Engineering-Related Fields,enrollmentLevel,Graduate,bachelorsDegreeMajor,Engineering,,,,,,,,
Physical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,PhysicalSciences,,,,,,,,
Mathematics and Statistics,enrollmentLevel,Graduate,bachelorsDegreeMajor,MathematicsAndStatistics,,,,,,,,
Biological and Biomedical Sciences,enrollmentLevel,Graduate,bachelorsDegreeMajor,BiologicalAndBiomedicalSciencesMajor,,,,,,,,
Science Technologies/Technicians,enrollmentLevel,Graduate,bachelorsDegreeMajor,ScienceAndTechnology,,,,,,,,
No description provided.