This repository serves as a template for creating Reproducible Analytical Pipeline (RAP) Python projects, complete with fundamental tooling, CI/CD integration, and policy compliance. It is designed to help RAP developers and data engineers get started quickly with standardised tooling, letting them focus on writing analytical code. The template takes care of directory structures, tool configurations, automated testing, and compliance requirements.
This template is generated using Copier, an open source tool for rendering project from templates and natively supports updating projects as the original template matures.
See this demo repository for an example created from this template.
- Features
- Getting Started
- Updating Project with Template Changes
- Structure
- Design Decisions
- Alternatives Software/Tools
- Future Plans
- Contributing
- License
This template includes comprehensive features to help you get started developing RAP Python projects quickly:
- Modern Python with Poetry for dependency management
- Standardised on Poetry only for consistency across ONS projects
- Lockfile support for reproducible builds
- Modular ETL pipeline components (extract, transform, load)
- Example data processing workflows
- Comprehensive logging and error handling
- Configurable data transformation rules
- Ruff for fast linting and formatting
- MyPy for static type checking
- pytest with coverage reporting
- Pre-commit hooks for automated quality checks
- Bandit security scanning
- Secret detection with detect-secrets
- Enhanced .gitignore with security patterns
- Full ONS Policy Compliance:
- GitHub Usage Policy compliance
- Software Coding Policy adherence
- Software Development Guidelines alignment
- Comprehensive GitHub Actions workflows
- VS Code devcontainer with recommended extensions
- Automated testing, linting, security scanning, and type checking
- Branch protection and code review requirements
- Architectural Decision Records (ADRs) for significant choices
- Comprehensive documentation templates
- British spelling and ONS standards
- govcookiecutter alignment
You have two options for project generation from this template:
- GitHub Template Feature: Utilise the Use this template feature on GitHub to create a new repository based on this template directly from the web interface. While convenient and fast, this method offers limited customisation options compared to local generation.
- Running Copier Locally: Use Copier locally to tailor the template to your specific requirements. This method allows for further customisation according to your project's needs and automatically set up the repository and branch protection.
Note
DO NOT FORK this repository. Instead, use the Use this template feature.
To get started:
- Click on Use this template
- Name your new repository and provide a description, then click Create repository. Note: the repository name
should be lowercase and use hyphens (
-) instead of underscores. - GitHub will now copy the contents over and GitHub Actions will process the template and commit to your new repository shortly after you click Create repository.
- Wait until the "Rename Project from Template" job in GitHub Actions finished running!
- Once the Rename Project from Template action has run, you can clone your new repository and start working on your project. π
- Some GitHub Actions workflows will fail on the first run post-clone since the repository will not be fully configured until the "Rename Project from Template" job has finished running. This is expected behaviour and can be safely ignored. Subsequent runs will not have this issue.
-
Python 3.10+: We recommend using pyenv for managing Python versions.
-
Copier: Install Copier using pip or pipx.
pip install --user copier # OR # Install pipx and add it to your PATH and then install Copier pip install --user pipx && pipx ensurepath pipx install copier
-
Operation System: Ubuntu/MacOS
-
Git: Ensure Git is installed and configured.
-
GitHub CLI: [OPTIONAL] Ensure GitHub CLI is installed and you are authenticated (
gh auth login) if you would like to automate the repository creation and configuration like branch protection.
Copier will ask you a series of questions to customise the project to your needs. Once you have answered all the questions, Copier will generate the project for you.
To generate the project run:
copier copy --trust gh:ONS-Innovation/python-rap-template /path/to/your/new/projectReplace /path/to/your/new/project with the path to the directory where you want to create your new project. This
directory should match the name of the repository you want to create.
This step is only required if you answered No to the Do you want to set up the git repository? question.
Otherwise, this would
have been automatically done for you.
-
Go to your project directory, and initialise a git repository and make the initial commit
cd /path/to/your/new/project git init -b main git add . git commit -m "Initial commit"
-
Create a new repo in GitHub. See [GitHub How-to](https://docs.github.com/en/repositories/creating-and-managing-repositories/quickstart-for-repositories]
-
Push your project to the repository on GitHub:
git remote add origin https://github.com/<repository_owner>/<repository_name>.git git push -u origin main
Now you can start working on your project. π
To update your project when the template changes, see Updating Project with Template Changes
There are a few steps you should take after cloning your new repository to ensure it is fully configured and ready for use.
If your repository is private/internal, you should update the PIRR.md file in the root of your repository with the
reasoning for the private/internal status of the repository.
Familiarise yourself with the ONS GitHub Policy and ensure your repository is compliant with the policy. Few key points to note are:
- Branch Protection: Ensure
the
mainor any other primary branch is protected. - Signed Commits: Use GPG keys to sign your commits.
- Security Alerts: Make use of Secret scanning and Push protection. Dependabot alerts will be enabled by default when using this template.
If you answered Yes to the Do you want to set up the git repository? question, then these settings would have been
automatically configured for you. However, it is recommended to review these settings to ensure they meet your
requirements.
This template helps ensure compliance with the ONS GitHub Usage Policy by automatically including:
- CODEOWNERS file: Automatically created with the specified code owners for the repository
- Repository naming validation: Enforces lowercase, hyphen/underscore naming conventions
- Private/Internal Repository Reasoning Record (PIRR): Generated for non-public repositories with guidance for completion
- Compliance checklist: Added to the generated README to guide developers through required steps
- Enhanced .gitignore: Includes patterns to prevent accidental commit of sensitive files
Caution
CURRENTLY UNSUPPORTED: This is currently unsupported due to an upstream issue with Copier. Once the issue is resolved, this section will be updated with instructions on how to update your project with changes.
View Details
You can update your project with changes made to the template since you generated your project. This is useful to keep your project up to date with the latest tooling and configuration.
If you always used Copier with this project, getting last updates with Copier is simple:
cd ~/path/to/your/project
make copier-updateCopier will ask you all questions again, but default values will be those you answered last time. Just hit Enter to
accept those defaults, or change them if needed or you can use poetry run copier update --force instead to avoid
answering the questions again.
For more see Copier docs and poetry run copier --help-all.
The structure of the templated repo is as follows:
βββ .github # Contains GitHub-specific configurations, including Actions workflows for CI/CD processes.
β βββ workflows # Directory for GitHub Actions workflows.
β β βββ ci.yml # Workflow for Continuous Integration, running tests and other checks on commits to `main` and on pull requests.
β β βββ codeql.yml # CodeQL workflow for automated identification of security vulnerabilities in the codebase. (Public Repos Only)
β β βββ security-scan.yml # Security scan workflow for running Bandit on the project.
β βββ dependabot.yml # Configuration for Dependabot, which automatically checks for outdated dependencies and creates pull requests to update them.
β βββ ISSUE_TEMPLATE.md # Template for issues raised in the repository.
β βββ PULL_REQUEST_TEMPLATE.md # Template for pull requests raised in the repository.
β βββ release.yml # Configuration on how to categorise changes into a structured changelog when using 'Generate release notes' feature.
βββ {module_name}/ # Main Python package directory containing RAP pipeline code.
β βββ __init__.py # Initialises the directory as a Python package with ETLPipeline class.
β βββ extract.py # Data extraction functionality with DataExtractor class.
β βββ transform.py # Data transformation and cleaning with DataTransformer class.
β βββ load.py # Data loading and output functionality with DataLoader class.
βββ tests # Contains all test files.
βββ e2e # Directory for end-to-end tests.
β βββ test_etl_workflow.py # End-to-end tests for the ETL workflow.
βββ unit # Directory for unit tests, containing tests for individual components of the project.
βββ test_extract.py # Unit tests for the extract module.
βββ test_transform.py # Unit tests for the transform module.
βββ test_load.py # Unit tests for the load module.
βββ .copier-answers.yml # Configuration file for Copier, specifying the answers to prompts when generating the project. Required for project updates.
βββ .editorconfig # Configuration file for maintaining consistent coding styles for multiple developers working on the same project across various editors and IDEs.
βββ .gitattributes # Git attributes file for defining attributes per path, such as line endings and merge strategies.
βββ .gitignore # Specifies intentionally untracked files to ignore when using Git, like build outputs and temporary files.
βββ .python-version # Specifies the Python version to be used with pyenv.
βββ .pre-commit-config.yaml # Configuration file for pre-commit hooks, used to run linting and formatting on the project.
βββ CODE_OF_CONDUCT.md # A code of conduct for the project, outlining the standards of behaviour for contributors.
βββ CONTRIBUTING.md # Guidelines for contributing to the project, including information on how to raise issues and submit pull requests.
βββ LICENSE # The license under which the project is made available.
βββ Makefile # A script used with the make build automation tool, containing commands to automate common tasks.
βββ PIRR.md # Private Internal Reasoning Record (PIRR) for the repository, documenting the reasoning for the private/internal status of the repository. (Private/Internal Repos Only)
βββ poetry.lock # Lock file for Poetry, pinning exact versions of dependencies to ensure consistent builds.
βββ pyproject.toml # Central project configuration file for Python, used by Poetry and tools like Ruff, MyPy, etc.
βββ run_etl.py # Example script demonstrating RAP pipeline usage with multiple execution methods.
βββ README.md # The main README file providing an overview of the project, setup instructions, and other essential information.
βββ SECURITY.md # A security policy for the project, providing information on how to report security vulnerabilities.
Although this template is opinionated, there are many alternatives to the tools used in this template which you may prefer. See the Alternatives Software/Tools section for more information.
1. Why use Poetry exclusively?
- Poetry is a modern Python package management tool that simplifies dependency management and packaging. It is also a build tool that can be used to package your project into a distributable format.
- Poetry provides robust dependency resolution, lockfile support, and integrated virtual environment management.
- By standardising on Poetry only, we ensure consistency across all ONS RAP projects and eliminate configuration complexity.
- Poetry aligns with modern Python best practices and is increasingly adopted in the data science community.
2. What is Ruff and why use Ruff?
- Ruff is a newer all-in-one alternative to tools such as flake8, isort, pydocstyle, pyupgrade, and autoflake. It is designed to be a more modern and user-friendly alternative to these tools while being extremely fast since it is written in Rust.
- Ruff is also designed to be more extensible and configurable than the tools it replaces.
4. Why use pytest for testing instead of unittest?
- pytest is a modern testing framework for Python that is designed to be easy to use and understand.
- pytest is more developer-friendly than unittest and has a more extensive ecosystem of plugins and extensions.
5. Why is MegaLinter not used for Python?
- While MegaLinter provides convenience by bundling multiple linters into a single package, opting for individual tools allows for greater flexibility and customisation to match project-specific requirements and coding standards.
- You are not able to control the versions of the linters used in MegaLinter, which can lead to issues with compatibility and consistency.
- Although it can easily run in CI, it requires Docker to run locally. For a basic repository with small amounts of Python it might be sufficient, but for more complex projects, tooling managed via your chosen package manager is encouraged.
6. Why not use SuperLinter?
- SuperLinter is a similar tool to MegaLinter, but it is not as developer-friendly and does not have as extensive documentation.
- SuperLinter does not allow auto-fixing of issues, which is a feature of MegaLinter.
7. Why does this focus on RAP/ETL patterns rather than web applications?
- This template is specifically designed for Reproducible Analytical Pipeline (RAP) development, which is the primary use case for data analysis projects at ONS.
- The included sample code demonstrates common data processing patterns: extraction, transformation, and loading of data.
- For web applications, other ONS templates or frameworks like Flask/FastAPI would be more appropriate.
- The RAP focus ensures that data analysts and researchers have a solid foundation for analytical work.
8. My projects do not have a CodeQL workflow. Why?
- CodeQL is only available for public repositories. If your repository is private/internal, the CodeQL workflow will not be included as it requires GitHub Advanced Security Enterprise plan which is currently not available for our organisation.
- CodeQL will attempt to run when you first clone the repo, however it will fail if the repo is private/internal. You can safely ignore this failure and or remove the CodeQL workflows run from your repo
9. Why is Secret Scanning and Push Protection not enabled?
- Secret scanning and push protection are enabled for public repositories.
- Private/Internal repositories cannot use these without GitHub Advanced Security Enterprise plan which is currently not available for our organisation.
- Enhanced RAP-specific examples and workflows
- Additional data source connectors (databases, APIs, cloud storage)
- More comprehensive data validation and quality checks
- Integration with ONS data platforms and services
- Advanced statistical analysis templates
- Enhanced documentation and developer guidance
- Ability to update projects with the latest template changes
:TODO: Add instructions for development
See CONTRIBUTING.md for details.
See LICENSE for details.