andre-salvati/databricks-template


databricks-template: agentic development for Databricks + production-ready ETL


🚀 Overview

Stop spending weeks on boilerplate. This PySpark project template for Databricks gives you medallion architecture, Python packaging, unit + integration + load tests, CI/CD via Declarative Automation Bundles, DQX data quality, and service-principal-based production deploys, all wired together and ready to ship. Whether you're starting a new Databricks ETL project or looking for a reference implementation of production-ready PySpark pipelines, fork this and go.

If this saves you time, a star helps others find it. Let's connect on LinkedIn.

🧪 Technologies

  • Databricks Free Edition (Serverless)
  • Databricks Runtime 18.0 LTS
  • Databricks Unity Catalog
  • Databricks Declarative Automation Bundles (formerly Asset Bundles)
  • Databricks CLI
  • Databricks Python SDK
  • Databricks DQX
  • Databricks AI Dev Kit
  • Databricks Dashboards
  • Claude Code
  • PySpark 4.1
  • Spark Declarative Pipelines (SDP)
  • Python 3.12+
  • GitHub Actions
  • Pytest

📦 Features

This project template demonstrates how to:

  • use agentic development (with Databricks AI Dev Kit and Claude Code) in data projects. The template ships with a CLAUDE.md that documents the project's conventions.
  • structure PySpark code inside classes/packages, deploy it as a Python wheel (instead of notebooks), and manage the project with uv.
  • package and deploy code with Declarative Automation Bundles to different environments (dev, staging, prod). Use GitHub Actions to automate the CI/CD pipeline.
  • utilize Databricks Lakeflow Jobs to execute a DAG, so you don't need Airflow to manage your DAGs here. Generate job definitions that run with environment-specific conditions using the Databricks SDK.
  • isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
  • separate deploy-time config (environment variables, CI secrets) from runtime config (job parameters overridable from the Databricks UI), keeping jobs flexible without coupling them to the build process.
  • utilize job tags to track issues, costs, and ownership.
  • use a Lakeflow Spark Declarative Pipeline to run the same ETL logic side-by-side with the PySpark job, demonstrating both paradigms from one codebase.
  • use the medallion architecture to organize your data.
  • run unit tests on transformations with the pytest package. Set up VS Code to run tests on your local machine.
  • run integration tests by setting the input data and validating the output data.
  • run load tests to exercise both the initial bulk load and incremental daily updates, validating that the pipeline handles production-scale data volumes without regressions.
  • use Databricks AI/BI Dashboards to visualize the gold layer.
  • utilize the coverage package to generate test coverage reports.
  • use structured logging with a per-run log_level override and run-scoped correlation ID on every line, giving you full observability during incidents without a code change.
  • lint and format code with ruff and pre-commit.
  • use a Makefile to automate repetitive tasks.
  • utilize Databricks DQX to enforce data quality rules, such as null checks, uniqueness, thresholds, and schema validation, and filter bad data into quarantine tables.
  • utilize service principals to run production code.
  • utilize the Databricks SDK for Python to manage catalogs, schemas, workspaces, and accounts. Refer to the scripts folder for examples.
  • utilize Databricks Unity Catalog to manage permissions and get data lineage.
  • utilize serverless job clusters on Databricks Free Edition to deploy your pipelines.
  • enforce production guardrails out of the box: identity-locked CI deploys, a health-check task that runs before any data is touched, wheel version pinning, per-task timeouts, schema-drift guards, queued runs, and on-call alerting that doesn't page on manual cancellations.

🧠 Resources

Agentic development:

Debates on the use of notebooks vs. Python packaging:

Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other resources:

πŸ“ Folder Structure

databricks-template/
│
├── .github/                       # CI/CD automation
│   └── workflows/
│       └── onpush.yml             # GitHub Actions pipeline
│
├── src/                           # Main source code
│   └── template/                  # Python package
│       ├── main.py                # Entry point with CLI (argparse)
│       ├── config.py              # Configuration management
│       ├── baseTask.py            # Base class for all tasks
│       ├── commonSchemas.py       # Shared PySpark schemas
│       └── job1/                  # Job-specific tasks
│           ├── extract_source1.py
│           ├── extract_source2.py        # DQX validation + quarantine
│           ├── generate_orders.py
│           ├── generate_orders_agg.py
│           ├── health_check.py           # Prod smoke task (runs first)
│           └── seed_sources.py           # Idempotent daily seeder (prod integration)
│
├── tests/                          # Unit and integration tests
│   └── job1/
│       ├── unit_test.py            # Pytest unit tests
│       ├── unit_test_sdp.py        # SDP pipeline unit tests
│       ├── integration_setup.py    # Integration test setup (seed data)
│       └── integration_validate.py # Integration test validation
│
├── resources/                      # Databricks workflow templates
│   └── jobs.yml                    # Generated job definition (auto-created)
│
├── scripts/                              # Helper scripts
│   ├── sdk_generate_template_job.py      # Job definition generator (Databricks SDK)
│   ├── sdk_init_workspace.py             # Workspace initialization (SP, catalogs, schemas, grants)
│   ├── sdk_truncate_tables.py            # Truncate all medallion tables in a target environment
│   ├── sdk_analyze_job_costs.py          # Cost analysis script
│   ├── sdk_workspace_and_account.py      # Workspace and account management
│   └── _sdk_sql.py                       # SQL warehouse helpers (used by other scripts)
│
├── docs/                           # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                        # Build artifacts (Python wheel)
├── coverage_reports/            # Test coverage reports
│
├── databricks.yml               # Declarative Automation Bundle config
├── pyproject.toml               # Python project configuration (uv)
├── Makefile                     # Build automation
├── .pre-commit-config.yaml      # Pre-commit hooks (ruff)
└── README.md                    # This file

Dashboard



Development Lifecycle



Databricks Jobs



Logging



Data Lineage (Unity Catalog)



Quarantine table (generated by Databricks DQX)



Instructions

  1. (Optional) Install Databricks AI Dev Kit and Claude Code.

  2. Create a Databricks Free Edition workspace.

  3. Install and configure the Databricks CLI on your local machine. Check the current version in databricks.yml. Follow the instructions here.

  4. Set up the Python environment and run unit tests on your local machine.

     make sync && make test
    
  5. Initialize the workspace. First create an external location in Databricks and update the storage-root parameter in the Makefile (for more details, see Overview of external locations). Then run the command below, which creates the catalogs, schemas, service principal, and the required grants:

     make init
    
  6. Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

     [dev]
     host          = https://xxxx.cloud.databricks.com/
     token         = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
                     
     [staging]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
     [prod]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
  7. Deploy and execute on the dev workspace.

     make deploy env=dev
    
  8. Configure CI/CD automation with the service principal ID and secret by setting the GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).

  9. (Optional) You can also execute unit tests from your preferred IDE. Here's a screenshot from VS Code with Microsoft's Python extension installed.
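
Before deploying, it can help to confirm that the profiles shown in step 6 carry the keys each environment needs. The sketch below is an assumed helper, not part of the template: `missing_profile_keys` and `REQUIRED_KEYS` are hypothetical names, and the expected keys simply mirror the example configuration (token for dev, service principal credentials for staging and prod).

```python
import configparser

# Keys each profile is expected to carry, mirroring the example configuration:
# dev authenticates with a token, staging/prod with the service principal.
REQUIRED_KEYS = {
    "dev": {"host", "token"},
    "staging": {"host", "client_id", "client_secret"},
    "prod": {"host", "client_id", "client_secret"},
}


def missing_profile_keys(cfg_text: str) -> dict[str, set[str]]:
    """Return, per profile, the expected keys that are absent from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    problems: dict[str, set[str]] = {}
    for profile, required in REQUIRED_KEYS.items():
        # A profile that is missing entirely is reported with all of its keys.
        present = set(parser[profile]) if parser.has_section(profile) else set()
        missing = required - present
        if missing:
            problems[profile] = missing
    return problems
```

Passing the contents of `~/.databrickscfg` to `missing_profile_keys` returns an empty dict when all three profiles are complete. Only key names are inspected, so secret values are never printed.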

Job-level parameters (runtime, overridable per-run)

These are defined as JobParameterDefinition in scripts/sdk_generate_template_job.py and threaded into every task as CLI args via {{job.parameters.*}}. Operators can override them for a single run using the Databricks Jobs UI "Run with different parameters" dialog, with no code change or redeployment needed.

| Parameter | CLI arg | Purpose | Default (dev/staging) | Default (prod) |
|---|---|---|---|---|
| log_level | --log-level | DEBUG / INFO / WARNING. Bump to DEBUG for a single prod run during incident response. | INFO | INFO |
| quarantine_fail_ratio | --quarantine-fail-ratio | Hard-fail extract_source2 if more than this fraction of rows are quarantined by DQX. Defaults to disabled so demo seed data still ingests. | 1.0 | 0.1 |
| seed_date | --seed-date | ISO-8601 date (e.g. 2024-03-15) for the seed_sources task. Empty string (default) resolves to today's date at runtime. Override per-run to backfill a specific day. | "" → today | "" → today |

Deploy-time environment variables (CI/build machine only)

Read by scripts/sdk_generate_template_job.py when generating resources/jobs.yml, never on Databricks serverless compute.

| Variable | Purpose | Default |
|---|---|---|
| TEMPLATE_ALERT_EMAILS | Comma-separated recipients for prod JobEmailNotifications (on_failure + on_duration_warning). Wired from CI secret of the same name. | data-platform-oncall@example.com |
| TEMPLATE_SP_APP_ID | Override the service principal application_id looked up by display name. Used by CI to avoid the SCIM lookup. | resolved from SP_DISPLAY_NAME |
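
A generator script could read these two variables roughly as follows. This is a sketch under stated assumptions: `alert_recipients` and `sp_application_id` are hypothetical helpers, and `lookup_by_display_name` is a placeholder callable standing in for the SCIM lookup, which is not reproduced here.

```python
import os
from collections.abc import Callable, Mapping

DEFAULT_ALERT_EMAILS = "data-platform-oncall@example.com"


def alert_recipients(env: Mapping[str, str] = os.environ) -> list[str]:
    """Split TEMPLATE_ALERT_EMAILS on commas, falling back to the documented default."""
    raw = env.get("TEMPLATE_ALERT_EMAILS", DEFAULT_ALERT_EMAILS)
    return [item.strip() for item in raw.split(",") if item.strip()]


def sp_application_id(
    env: Mapping[str, str],
    lookup_by_display_name: Callable[[], str],
) -> str:
    """Prefer the TEMPLATE_SP_APP_ID override; otherwise fall back to the lookup."""
    override = env.get("TEMPLATE_SP_APP_ID")
    if override:
        return override
    return lookup_by_display_name()  # placeholder for the lookup by display name
```

Passing the environment as a mapping keeps both helpers easy to unit test without touching the real process environment.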

Production guardrails

  • databricks.yml sets mode: production on the prod target, so DABs enforces that the deployer identity equals the run-as identity (the SP). make deploy env=prod from a developer's local machine will fail by design; only CI can push prod.
  • run_as and permissions on every staging/prod job are pinned to the service principal's application_id (not ${workspace.current_user.userName}), wired by scripts/sdk_generate_template_job.py.
  • health_check task runs first in prod and fails fast on a broken wheel, missing grant, or unreachable SQL warehouse β€” before any medallion table is touched.
  • Wheel version pinning: _project_version() reads pyproject.toml to produce the exact wheel filename in the bundle's JobEnvironment.dependencies, so a forgotten rebuild can't silently deploy an old wheel.
  • Per-environment retries: 0 in dev (fast feedback), 2 in staging/prod (transient failure resilience). Retries on staging/prod back off MIN_RETRY_INTERVAL_MS (60s) before re-attempting, giving transient lock/metastore blips time to clear.
  • Per-task timeouts: each task has its own timeout_seconds (300s for health-check, 900s for extracts, 1800s for transforms) so a single hung task can't consume the whole job budget.
  • Schema-drift guard: all writes use overwriteSchema=false so an upstream change in column type or order fails the task loudly instead of silently propagating bad data.
  • Queued runs, not skipped: prod has max_concurrent_runs=1 paired with queue.enabled=true, so if a run is still in flight when the next 5 a.m. tick arrives, the new run queues rather than getting silently dropped.
  • Health-rule-backed duration alert: the on_duration_warning_threshold_exceeded email is wired to a JobsHealthRule on RUN_DURATION_SECONDS > 1800 (30 min). Without that rule, the email would be wired to an event that can never fire.
  • Cancelled/skipped runs don't page: notification_settings.no_alert_for_canceled_runs and no_alert_for_skipped_runs are both true, so manual cancellations or upstream-condition skips don't generate failure alerts.
