Stop spending weeks on boilerplate. This PySpark project template for Databricks gives you medallion architecture, Python packaging, unit + integration + load tests, CI/CD via Declarative Automation Bundles, DQX data quality, and service-principal-based production deploys, all wired together and ready to ship. Whether you're starting a new Databricks ETL project or looking for a reference implementation of production-ready PySpark pipelines, fork this and go.
If this saves you time, a star helps others find it. Let's connect on LinkedIn.
- Databricks Free Edition (Serverless)
- Databricks Runtime 18.0 LTS
- Databricks Unity Catalog
- Databricks Declarative Automation Bundles (formerly Asset Bundles)
- Databricks CLI
- Databricks Python SDK
- Databricks DQX
- Databricks AI Dev Kit
- Databricks Dashboards
- Claude Code
- PySpark 4.1
- Spark Declarative Pipelines (SDP)
- Python 3.12+
- GitHub Actions
- Pytest
This project template demonstrates how to:
- use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a `CLAUDE.md` that documents the project's conventions.
- structure PySpark code inside classes/packages, deploy it as a Python wheel (instead of notebooks), and manage the project with uv.
- package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use GitHub Actions to automate the CI/CD pipeline.
- utilize Databricks Lakeflow Jobs to execute a DAG. Yes, you don't need Airflow to manage your DAGs here! Generate job definitions that run with environment-specific conditions using the Databricks SDK.
- isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
- separate deploy-time config (environment variables, CI secrets) from runtime config (job parameters overridable from the Databricks UI), keeping jobs flexible without coupling them to the build process.
- utilize job tags to track issues, costs, and ownership.
- use a Lakeflow Spark Declarative Pipeline to run the same ETL logic side-by-side with the PySpark job, demonstrating both paradigms from one codebase.
- use the medallion architecture to organize your data.
- run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
- run integration tests by setting the input data and validating the output data.
- run load tests to exercise both the initial bulk load and incremental daily updates, validating that the pipeline handles production-scale data volumes without regressions.
- use Databricks AI/BI Dashboards to visualize the gold layer.
- utilize the coverage package to generate test coverage reports.
- use structured logging with a per-run `log_level` override and run-scoped correlation ID on every line, giving you full observability during incidents without a code change.
- lint and format code with ruff and pre-commit.
- use a Makefile to automate repetitive tasks.
- utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
- utilize service principals to run production code.
- utilize the Databricks SDK for Python to manage catalogs, schemas, workspaces, and accounts. Refer to the `scripts` folder for examples.
- utilize Databricks Unity Catalog to manage permissions and get data lineage.
- utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
- enforce production guardrails out of the box: identity-locked CI deploys, a health-check task that runs before any data is touched, wheel version pinning, per-task timeouts, schema-drift guards, queued runs, and on-call alerting that doesn't page on manual cancellations.
Agentic development:
- Claude Code: 5 Essentials for Data Engineering
- Mastering Claude Code in 30 minutes
- Introducing Databricks AI Dev Kit - Skills, MCP server, Builder App
Debates on the use of notebooks vs. Python packaging:
- The Rise of The Notebook Engineer
- Please don't make me use Databricks notebooks
- this LinkedIn thread by Daniel Beach
- this LinkedIn thread by Ryan Chynoweth
- this LinkedIn thread by Jaco van Gelder
Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:
- CI/CD for Databricks: Advanced Asset Bundles and GitHub Actions
- Deploying Databricks Asset Bundles (DABs) at Scale
- A Prescription for Success: Leveraging DABs for Faster Deployment and Better Patient Outcomes
Other resources:
- Goodbye Pip and Poetry. Why UV Might Be All You Need
- The Spark Revolution You Didn't See Coming: How Apache Spark 4.0 in Databricks Just Changed Everything
databricks-template/
│
├── .github/                              # CI/CD automation
│   └── workflows/
│       └── onpush.yml                    # GitHub Actions pipeline
│
├── src/                                  # Main source code
│   └── template/                         # Python package
│       ├── main.py                       # Entry point with CLI (argparse)
│       ├── config.py                     # Configuration management
│       ├── baseTask.py                   # Base class for all tasks
│       ├── commonSchemas.py              # Shared PySpark schemas
│       └── job1/                         # Job-specific tasks
│           ├── extract_source1.py
│           ├── extract_source2.py        # DQX validation + quarantine
│           ├── generate_orders.py
│           ├── generate_orders_agg.py
│           ├── health_check.py           # Prod smoke task (runs first)
│           └── seed_sources.py           # Idempotent daily seeder (prod integration)
│
├── tests/                                # Unit and integration tests
│   └── job1/
│       ├── unit_test.py                  # Pytest unit tests
│       ├── unit_test_sdp.py              # SDP pipeline unit tests
│       ├── integration_setup.py          # Integration test setup (seed data)
│       └── integration_validate.py       # Integration test validation
│
├── resources/                            # Databricks workflow templates
│   └── jobs.yml                          # Generated job definition (auto-created)
│
├── scripts/                              # Helper scripts
│   ├── sdk_generate_template_job.py      # Job definition generator (Databricks SDK)
│   ├── sdk_init_workspace.py             # Workspace initialization (SP, catalogs, schemas, grants)
│   ├── sdk_truncate_tables.py            # Truncate all medallion tables in a target environment
│   ├── sdk_analyze_job_costs.py          # Cost analysis script
│   ├── sdk_workspace_and_account.py      # Workspace and account management
│   └── _sdk_sql.py                       # SQL warehouse helpers (used by other scripts)
│
├── docs/                                 # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                                 # Build artifacts (Python wheel)
├── coverage_reports/                     # Test coverage reports
│
├── databricks.yml                        # Declarative Automation Bundle config
├── pyproject.toml                        # Python project configuration (uv)
├── Makefile                              # Build automation
├── .pre-commit-config.yaml               # Pre-commit hooks (ruff)
└── README.md                             # This file
- (Optional) Install Databricks AI Dev Kit and Claude Code.

- Create a Databricks Free Edition workspace.

- Install and configure the Databricks CLI on your local machine. Check the current version in `databricks.yml`. Follow the instructions here.

- Set up the Python environment and run unit tests on your local machine.

      make sync && make test

- Initialize the workspace. Create an external location in Databricks and update the `storage-root` parameter in the Makefile. This step will create the catalogs, schemas, service principal, and the required grants. For more details, see Overview of external locations. Then run:

      make init

- Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

      [dev]
      host = https://xxxx.cloud.databricks.com/
      token = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

      [staging]
      host = https://xxxx.cloud.databricks.com/
      client_id = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
      client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

      [prod]
      host = https://xxxx.cloud.databricks.com/
      client_id = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
      client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

- Deploy and execute on the dev workspace.

      make deploy env=dev

  Deploy-time environment variables (

- Configure CI/CD automation with the service principal ID and secret. Configure GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).

- (Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
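Before deploying, it can help to confirm that each profile in `.databrickscfg` uses the authentication style the steps above expect (a token for dev, service-principal credentials for staging and prod). The following is only a sketch using the standard-library `configparser`; the helper name `auth_type` and the placeholder values in `SAMPLE` are made up for the example.

```python
import configparser

# Placeholder profile file mirroring the shape shown in the setup steps.
SAMPLE = """
[dev]
host = https://xxxx.cloud.databricks.com/
token = placeholder-token

[staging]
host = https://xxxx.cloud.databricks.com/
client_id = placeholder-client-id
client_secret = placeholder-secret

[prod]
host = https://xxxx.cloud.databricks.com/
client_id = placeholder-client-id
client_secret = placeholder-secret
"""


def auth_type(cfg: configparser.ConfigParser, profile: str) -> str:
    """Hypothetical helper: classify a profile by the credentials it carries."""
    section = cfg[profile]
    if "client_id" in section and "client_secret" in section:
        return "service-principal"
    if "token" in section:
        return "token"
    return "unknown"


if __name__ == "__main__":
    cfg = configparser.ConfigParser()
    cfg.read_string(SAMPLE)  # use cfg.read(path) for a real file
    for profile in cfg.sections():
        print(profile, auth_type(cfg, profile))
```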
These are defined as `JobParameterDefinition` in `scripts/sdk_generate_template_job.py` and threaded into every task as CLI args via `{{job.parameters.*}}`. Operators can override them for a single run using the Databricks Jobs UI "Run with different parameters" dialog; no code change or redeployment is needed.
| Parameter | CLI arg | Purpose | Default (dev/staging) | Default (prod) |
|---|---|---|---|---|
| `log_level` | `--log-level` | DEBUG / INFO / WARNING. Bump to DEBUG for a single prod run during incident response. | `INFO` | `INFO` |
| `quarantine_fail_ratio` | `--quarantine-fail-ratio` | Hard-fail `extract_source2` if more than this fraction of rows are quarantined by DQX. Defaults to disabled so demo seed data still ingests. | `1.0` | `0.1` |
| `seed_date` | `--seed-date` | ISO-8601 date (e.g. `2024-03-15`) for the `seed_sources` task. Empty string (default) resolves to today's date at runtime. Override per-run to backfill a specific day. | `""` (resolves to today) | `""` (resolves to today) |
Read by `scripts/sdk_generate_template_job.py` when generating `resources/jobs.yml`, never on Databricks serverless compute.
| Variable | Purpose | Default |
|---|---|---|
| `TEMPLATE_ALERT_EMAILS` | Comma-separated recipients for prod `JobEmailNotifications` (`on_failure` + `on_duration_warning`). Wired from CI secret of the same name. | `data-platform-oncall@example.com` |
| `TEMPLATE_SP_APP_ID` | Override the service principal `application_id` looked up by display name. Used by CI to avoid the SCIM lookup. | resolved from `SP_DISPLAY_NAME` |
- `databricks.yml` sets `mode: production` on the prod target: DABs enforces that the deployer identity equals the run-as identity (the SP). `make deploy env=prod` from a developer's local machine will fail by design; only CI can push prod.
- `run_as` and `permissions` on every staging/prod job are pinned to the service principal's `application_id` (not `${workspace.current_user.userName}`), wired by `scripts/sdk_generate_template_job.py`.
- `health_check` task runs first in prod and fails fast on a broken wheel, missing grant, or unreachable SQL warehouse, before any medallion table is touched.
- Wheel version pinning: `_project_version()` reads `pyproject.toml` to produce the exact wheel filename in the bundle's `JobEnvironment.dependencies`, so a forgotten rebuild can't silently deploy an old wheel.
- Per-environment retries: 0 in dev (fast feedback), 2 in staging/prod (transient failure resilience). Retries on staging/prod back off `MIN_RETRY_INTERVAL_MS` (60s) before re-attempting, giving transient lock/metastore blips time to clear.
- Per-task timeouts: each task has its own `timeout_seconds` (300s for health-check, 900s for extracts, 1800s for transforms) so a single hung task can't consume the whole job budget.
- Schema-drift guard: all writes use `overwriteSchema=false` so an upstream change in column type or order fails the task loudly instead of silently propagating bad data.
- Queued runs, not skipped: prod has `max_concurrent_runs=1` paired with `queue.enabled=true`; if a run is still in flight when the next 5 a.m. tick arrives, the new run queues rather than getting silently dropped.
- Health-rule-backed duration alert: the `on_duration_warning_threshold_exceeded` email is wired to a `JobsHealthRule` on `RUN_DURATION_SECONDS > 1800` (30 min). Without that rule, the email would be wired to an event that can never fire.
- Cancelled/skipped runs don't page: `notification_settings.no_alert_for_canceled_runs` and `no_alert_for_skipped_runs` are both `true`, so manual cancellations or upstream-condition skips don't generate failure alerts.






