databricks-template — agentic development for Databricks + production-ready ETL

🚀 Overview

Stop spending weeks on boilerplate. This PySpark project template for Databricks gives you medallion architecture, Python packaging, unit + integration + load tests, CI/CD via Declarative Automation Bundles, DQX data quality, and service-principal-based production deploys — all wired together and ready to ship. Whether you're starting a new Databricks ETL project or looking for a reference implementation of production-ready PySpark pipelines, fork this and go.

If this saves you time, a star helps others find it. Let's connect on LinkedIn.

🧪 Technologies

Databricks Free Edition (Serverless)
Databricks Runtime 18.0 LTS
Databricks Unity Catalog
Databricks Declarative Automation Bundles (formerly Asset Bundles)
Databricks CLI
Databricks Python SDK
Databricks DQX
Databricks AI Dev Kit
Databricks Dashboards
Claude Code
PySpark 4.1
Spark Declarative Pipelines (SDP)
Python 3.12+
GitHub Actions
Pytest

📦 Features

This project template demonstrates how to:

use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a CLAUDE.md that documents the project's conventions.
structure PySpark code inside classes/packages, deploy it as a Python wheel (instead of notebooks), and manage the project with uv.
package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use GitHub Actions to automate the CI/CD pipeline.
utilize Databricks Lakeflow Jobs to execute a DAG (you don't need Airflow to manage your DAGs here). Generate job definitions with environment-specific conditions using the Databricks SDK.
isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
separate deploy-time config (environment variables, CI secrets) from runtime config (job parameters overridable from the Databricks UI), keeping jobs flexible without coupling them to the build process.
utilize job tags to track issues, costs, and ownership.
use a Lakeflow Spark Declarative Pipeline to run the same ETL logic side-by-side with the PySpark job, demonstrating both paradigms from one codebase.
use the medallion architecture to organize your data.
run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
run integration tests by setting the input data and validating the output data.
run load tests to exercise both the initial bulk load and incremental daily updates, validating that the pipeline handles production-scale data volumes without regressions.
use Databricks AI/BI Dashboards to visualize the gold layer.
utilize the coverage package to generate test coverage reports.
use structured logging with a per-run log_level override and run-scoped correlation ID on every line, giving you full observability during incidents without a code change.
lint and format code with ruff and pre-commit.
use a Makefile to automate repetitive tasks.
utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
utilize service principals to run production code.
utilize the Databricks SDK for Python to manage catalogs, schemas, workspaces, and accounts. Refer to the scripts folder for examples.
utilize Databricks Unity Catalog to manage permissions and get data lineage.
utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
enforce production guardrails out of the box — identity-locked CI deploys, a health-check task that runs before any data is touched, wheel version pinning, per-task timeouts, schema-drift guards, queued runs, and on-call alerting that doesn't page on manual cancellations.
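
The structured-logging item above can be pictured with the standard library alone. The following is a minimal sketch, not the template's actual logger: the names `JsonFormatter` and `build_logger`, the JSON field names, and the use of a UUID as the run-scoped correlation ID are all assumptions made for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical format)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # LoggerAdapter passes its `extra` dict through as record attributes.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name, log_level="INFO", correlation_id=None, stream=None):
    """Return a logger adapter that stamps every line with one correlation ID."""
    handler = logging.StreamHandler(stream)  # stream=None falls back to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(log_level.upper())  # per-run override, e.g. "DEBUG"
    run_id = correlation_id or uuid.uuid4().hex
    return logging.LoggerAdapter(logger, {"correlation_id": run_id})


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger("template.job1", log_level="DEBUG", stream=buffer)
    log.debug("extract started")
    print(buffer.getvalue(), end="")
```

Because the level is an argument rather than a constant, a value arriving from a job parameter can raise verbosity for a single run without touching the code.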

🧠 Resources

Agentic development:

Debates on the use of notebooks vs. Python packaging:

Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other resources:

📁 Folder Structure

databricks-template/
│
├── .github/                       # CI/CD automation
│   └── workflows/
│       └── onpush.yml             # GitHub Actions pipeline
│
├── src/                           # Main source code
│   └── template/                  # Python package
│       ├── main.py                # Entry point with CLI (argparse)
│       ├── config.py              # Configuration management
│       ├── baseTask.py            # Base class for all tasks
│       ├── commonSchemas.py       # Shared PySpark schemas
│       └── job1/                  # Job-specific tasks
│           ├── extract_source1.py
│           ├── extract_source2.py        # DQX validation + quarantine
│           ├── generate_orders.py
│           ├── generate_orders_agg.py
│           ├── health_check.py           # Prod smoke task (runs first)
│           └── seed_sources.py           # Idempotent daily seeder (prod integration)
│
├── tests/                          # Unit and integration tests
│   └── job1/
│       ├── unit_test.py            # Pytest unit tests
│       ├── unit_test_sdp.py        # SDP pipeline unit tests
│       ├── integration_setup.py    # Integration test setup (seed data)
│       └── integration_validate.py # Integration test validation
│
├── resources/                      # Databricks workflow templates
│   └── jobs.yml                    # Generated job definition (auto-created)
│
├── scripts/                              # Helper scripts
│   ├── sdk_generate_template_job.py      # Job definition generator (Databricks SDK)
│   ├── sdk_init_workspace.py             # Workspace initialization (SP, catalogs, schemas, grants)
│   ├── sdk_truncate_tables.py            # Truncate all medallion tables in a target environment
│   ├── sdk_analyze_job_costs.py          # Cost analysis script
│   ├── sdk_workspace_and_account.py      # Workspace and account management
│   └── _sdk_sql.py                       # SQL warehouse helpers (used by other scripts)
│
├── docs/                           # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                        # Build artifacts (Python wheel)
├── coverage_reports/            # Test coverage reports
│
├── databricks.yml               # Declarative Automation Bundle config
├── pyproject.toml               # Python project configuration (uv)
├── Makefile                     # Build automation
├── .pre-commit-config.yaml      # Pre-commit hooks (ruff)
└── README.md                    # This file

Dashboard

Development Lifecycle

Databricks Jobs

Logging

Data Lineage (Unity Catalog)

Quarantine table (generated by Databricks DQX)

Instructions

(Optional) Install Databricks AI Dev Kit and Claude Code.
Create a Databricks Free Edition workspace.
Install and configure the Databricks CLI on your local machine. Check the current version in databricks.yml. Follow the instructions here.
Set up the Python environment and run unit tests on your local machine.
```
 make sync && make test
```
Initialize the workspace. First, create an external location in Databricks and update the storage-root parameter in the Makefile (for more details, see Overview of external locations). The command below then creates the catalogs, schemas, service principal, and the required grants:
```
 make init
```

Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

 [dev]
 host          = https://xxxx.cloud.databricks.com/
 token         = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
                 
 [staging]
 host          = https://xxxx.cloud.databricks.com/
 client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
 client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

 [prod]
 host          = https://xxxx.cloud.databricks.com/
 client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
 client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
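
A quick way to catch a mistyped profile before deploying is to parse the file and check each section for a host plus one authentication method. This is a sketch that relies only on the INI layout shown above; the function name `check_profiles` and the rules it applies are assumptions, not part of the template or the Databricks CLI.

```python
import configparser


def check_profiles(cfg_text, required=("dev", "staging", "prod")):
    """Map each required profile name to a list of problems (empty list = OK)."""
    # interpolation=None keeps '%' characters in secrets from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    problems = {}
    for name in required:
        if not parser.has_section(name):
            problems[name] = ["profile missing"]
            continue
        section = parser[name]
        issues = []
        if not section.get("host"):
            issues.append("host missing")
        has_token = bool(section.get("token"))
        has_oauth = bool(section.get("client_id")) and bool(section.get("client_secret"))
        if not (has_token or has_oauth):
            issues.append("no token or client_id/client_secret pair")
        problems[name] = issues
    return problems
```

Passing the text of your own `.databrickscfg` to `check_profiles` returns an empty list for every profile that looks complete.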

Deploy and execute on the dev workspace.
```
 make deploy env=dev
```

Configure CI/CD automation with the service principal ID and secret. Configure GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).
(Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.

Job-level parameters (runtime, overridable per-run)

These are defined as JobParameterDefinition in scripts/sdk_generate_template_job.py and threaded into every task as CLI args via {{job.parameters.*}}. Operators can override them for a single run using the Databricks Jobs UI "Run with different parameters" dialog — no code change or redeployment needed.
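
To make the threading concrete, here is a small sketch of how a generator could turn job parameter names into task CLI arguments that reference `{{job.parameters.*}}`. The helper `cli_args_for` is hypothetical, and the snake_case-to-flag mapping is inferred from the parameter table in this section rather than taken from `scripts/sdk_generate_template_job.py`.

```python
def cli_args_for(parameter_names):
    """Build a flat argv-style list of flags and parameter references."""
    args = []
    for name in parameter_names:
        flag = "--" + name.replace("_", "-")
        # Databricks resolves this reference at run time, so a per-run
        # override in the Jobs UI reaches the task without a redeploy.
        args.extend([flag, "{{job.parameters." + name + "}}"])
    return args


if __name__ == "__main__":
    print(cli_args_for(["log_level", "quarantine_fail_ratio", "seed_date"]))
```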

| Parameter | CLI arg | Purpose | Default (dev/staging) | Default (prod) |
| --- | --- | --- | --- | --- |
| `log_level` | `--log-level` | `DEBUG` / `INFO` / `WARNING`. Bump to `DEBUG` for a single prod run during incident response. | `INFO` | `INFO` |
| `quarantine_fail_ratio` | `--quarantine-fail-ratio` | Hard-fail `extract_source2` if more than this fraction of rows are quarantined by DQX. Defaults to disabled so demo seed data still ingests. | `1.0` | `0.1` |
| `seed_date` | `--seed-date` | ISO-8601 date (e.g. `2024-03-15`) for the `seed_sources` task. Empty string (default) resolves to today's date at runtime. Override per-run to backfill a specific day. | `""` → today | `""` → today |
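
On the task side, the three parameters could be consumed roughly as follows. This is a sketch under assumptions: `parse_args`, `resolve_seed_date`, and `exceeds_quarantine_ratio` are illustrative names, and the real `main.py` may structure its argparse setup differently.

```python
import argparse
from datetime import date


def parse_args(argv):
    """Parse the runtime job parameters passed to a task as CLI args."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--log-level", default="INFO", choices=["DEBUG", "INFO", "WARNING"])
    parser.add_argument("--quarantine-fail-ratio", type=float, default=1.0)
    parser.add_argument("--seed-date", default="")
    return parser.parse_args(argv)


def resolve_seed_date(raw, today=None):
    """Empty string means 'today'; anything else must be an ISO-8601 date."""
    if not raw:
        return today or date.today()
    return date.fromisoformat(raw)


def exceeds_quarantine_ratio(quarantined, total, fail_ratio):
    """True when more than `fail_ratio` of the rows were quarantined."""
    if total == 0:
        return False
    return quarantined / total > fail_ratio


if __name__ == "__main__":
    args = parse_args(["--seed-date", "2024-03-15", "--quarantine-fail-ratio", "0.1"])
    print(resolve_seed_date(args.seed_date), args.quarantine_fail_ratio)
```

With a ratio of `1.0` the comparison can never be true, which is why that value behaves as "disabled" in dev and staging.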

Deploy-time environment variables (CI/build machine only)

Read by scripts/sdk_generate_template_job.py when generating resources/jobs.yml — never on Databricks serverless compute.

| Variable | Purpose | Default |
| --- | --- | --- |
| `TEMPLATE_ALERT_EMAILS` | Comma-separated recipients for prod `JobEmailNotifications` (on_failure + on_duration_warning). Wired from CI secret of the same name. | `data-platform-oncall@example.com` |
| `TEMPLATE_SP_APP_ID` | Override the service principal `application_id` looked up by display name. Used by CI to avoid the SCIM lookup. | resolved from `SP_DISPLAY_NAME` |
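
Reading the recipients variable on the build machine might look like the sketch below. The function name `alert_emails` is hypothetical, and the fallback address simply mirrors the default listed in the table; only the variable name `TEMPLATE_ALERT_EMAILS` comes from the template.

```python
import os

DEFAULT_ALERT_EMAILS = "data-platform-oncall@example.com"


def alert_emails(environ=None):
    """Split TEMPLATE_ALERT_EMAILS on commas, dropping blanks and whitespace."""
    env = os.environ if environ is None else environ
    # An unset or empty variable falls back to the documented default.
    raw = env.get("TEMPLATE_ALERT_EMAILS") or DEFAULT_ALERT_EMAILS
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    print(alert_emails())
```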

Production guardrails

databricks.yml sets mode: production on the prod target — DABs enforces that the deployer identity equals the run-as identity (the SP). make deploy env=prod from a developer's local machine will fail by design; only CI can push prod.
run_as and permissions on every staging/prod job are pinned to the service principal's application_id (not ${workspace.current_user.userName}), wired by scripts/sdk_generate_template_job.py.
health_check task runs first in prod and fails fast on a broken wheel, missing grant, or unreachable SQL warehouse — before any medallion table is touched.
Wheel version pinning: _project_version() reads pyproject.toml to produce the exact wheel filename in the bundle's JobEnvironment.dependencies, so a forgotten rebuild can't silently deploy an old wheel.
Per-environment retries: 0 in dev (fast feedback), 2 in staging/prod (transient failure resilience). Retries on staging/prod back off for MIN_RETRY_INTERVAL_MS (60s) before re-attempting, giving transient lock/metastore blips time to clear.
Per-task timeouts: each task has its own timeout_seconds (300s for health-check, 900s for extracts, 1800s for transforms) so a single hung task can't consume the whole job budget.
Schema-drift guard: all writes use overwriteSchema=false so an upstream change in column type or order fails the task loudly instead of silently propagating bad data.
Queued runs, not skipped: prod has max_concurrent_runs=1 paired with queue.enabled=true — if a run is still in flight when the next 5 a.m. tick arrives, the new run queues rather than getting silently dropped.
Health-rule-backed duration alert: the on_duration_warning_threshold_exceeded email is wired to a JobsHealthRule on RUN_DURATION_SECONDS > 1800 (30 min). Without that rule, the email would be wired to an event that can never fire.
Cancelled/skipped runs don't page: notification_settings.no_alert_for_canceled_runs and no_alert_for_skipped_runs are both true, so manual cancellations or upstream-condition skips don't generate failure alerts.
