Merged
1 change: 1 addition & 0 deletions .github/workflows/onpush.yml
@@ -9,6 +9,7 @@ on:
paths-ignore:
- 'README.md'
- 'CLAUDE.md'
- 'CHANGELOG.md'
- 'docs/**'
# Manual trigger for re-running CI without a new commit (e.g. after a transient
# GitHub Actions hiccup that silently drops a push event):
16 changes: 11 additions & 5 deletions CHANGELOG.md
@@ -2,13 +2,19 @@

---

## [#34](https://github.com/andre-salvati/databricks-template/pull/34) · 2026-06-05 · feat: standardize silver/gold field names, fix dashboard KPIs, add total_orders

Dropped `ds_kpi` from the dashboard: all three KPI counters (Total Value, Total Orders, Number of Customers) now bind to `ds_orders` with aggregate expressions, so all five filters update them; added a third KPI tile for Total Orders (`COUNT DISTINCT order_id`).
Standardized field names across silver (`curated.order_enriched`) and gold (`report.order_agg`) following four rules: `{entity}_id` suffix, entity-qualified names, `item_*` prefix for item-level fields, no abbreviations; `date` is now cast to `DateType` in silver.
Added `order_enriched_schema` and `order_agg_schema` to `commonSchemas.py` as canonical schemas for silver and gold; all tests and the integration validator import from there instead of inlining definitions.
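
The KPI rebinding described in this entry can be sketched as a small helper that builds a counter widget bound to `ds_orders`. The dictionary layout mirrors the widget entries in `_build_dashboard_json` from this PR; the helper name `counter_widget` and its parameters are hypothetical and not part of the repository.

```python
def counter_widget(name: str, field: str, expression: str, title: str) -> dict:
    """Build a counter widget over ds_orders (hypothetical helper, not in the repo)."""
    return {
        "name": name,
        "queries": [
            {
                "name": "main_query",
                "query": {
                    "datasetName": "ds_orders",
                    "fields": [{"name": field, "expression": expression}],
                    # The aggregate lives in the widget expression instead of
                    # a separate pre-aggregated dataset such as ds_kpi.
                    "disaggregated": False,
                },
            }
        ],
        "spec": {
            "version": 2,
            "widgetType": "counter",
            "encodings": {"value": {"fieldName": field, "displayName": title}},
            "frame": {"title": title, "showTitle": True},
        },
    }


total_orders = counter_widget(
    "kpi-total-orders", "total_orders", "SUM(`total_orders`)", "Total Orders"
)
```

Building the three tiles from one function keeps the dataset name and the `disaggregated` flag in a single place, which is the part of the spec this PR changes for every KPI.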

---

## [#33](https://github.com/andre-salvati/databricks-template/pull/33) · 2026-06-04 · feat: AI/BI dashboard, country in gold layer, randomized seed data

Added `country` to `curated.order_enriched` and `report.order_agg` (and their SDP equivalents) so the gold layer carries the full customer dimension needed for country-based reporting; unit tests updated accordingly.
Added an AI/BI (Lakeview) dashboard with three line charts (total value by date × country, by date × product, by date × category) and a global filter page (date range, country, customer, product, category); uses `make truncate env=X yes=--yes` before first post-deploy run to handle the schema change to `report.order_agg`.
Dashboard JSON (`resources/orders_dashboard.lvdash.json`) and its DAB resource entry are generated by `sdk_generate_template_job.py` at deploy time with the target catalog embedded — both files are gitignored.
Completed README documentation: added "Databricks Dashboards" to the Technologies section, added a dashboard screenshot block, and replaced the placeholder dashboard Features bullet with a full description of the charts and filter panel.
Improved seed data chart visibility: customers are assigned a non-uniform country distribution (US=200, UK=100, DE=50, FR=50, BR=30, CA=25, AU=20, JP=15, MX=7, IN=3) so country lines are clearly separated; `total_item` now scales with `prod_category_id` (category × $15 base + $10 noise), producing a ~6× spread across categories visible in the category chart.
Added `country` to `curated.order_enriched` and `report.order_agg` (and SDP equivalents) so the gold layer carries the full customer dimension needed for country-based reporting.
Added an AI/BI (Lakeview) dashboard with three line charts (total value by date × country, product, and category) and a global filter page; dashboard JSON is generated by `sdk_generate_template_job.py` at deploy time with the target catalog embedded and is gitignored.
Improved seed data chart visibility with a non-uniform country distribution and `total_item` scaling with `prod_category_id` (category × $15 base + $10 noise), producing a ~6× spread across categories.

---

4 changes: 2 additions & 2 deletions CLAUDE.md
@@ -95,8 +95,8 @@ Medallion schemas (`MEDALLION_SCHEMAS` in `config.py`):

Each task's input/output tables are **hardcoded** in the task module (e.g. `raw.customer` → `curated.order_enriched`). The medallion layer is a semantic contract, not a runtime parameter — this is the dbt `ref()` pattern. Don't parameterize the layer; if a task genuinely needs a configurable target, that's a different task.

`curated.order_enriched` columns: `name, country, id_customer, id_order, total, date, product_id, prod_category_id, seq, desc_item, qty, total_item`
`report.order_agg` columns: `name, country, date, product_id, prod_category_id, total_qty, total_value`
`curated.order_enriched` columns: `customer_name, country, customer_id, order_id, order_total, order_date (DateType), product_id, product_category_id, item_seq, item_description, item_quantity, item_total`
`report.order_agg` columns: `customer_name, country, order_date (DateType), product_id, product_category_id, total_quantity, total_value, total_orders`
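
The silver rename can be sketched as a plain mapping from the old column list to the new one, both taken from the lines above. The names `SILVER_RENAMES`, `OLD_ABBREVIATIONS`, `rename_columns`, and `violates_rules` are hypothetical, and the rule check covers only the `id_` prefix and the abbreviations that appear in the old names.

```python
# Old -> new names for curated.order_enriched; country and product_id are unchanged.
SILVER_RENAMES = {
    "name": "customer_name",
    "id_customer": "customer_id",
    "id_order": "order_id",
    "total": "order_total",
    "date": "order_date",
    "prod_category_id": "product_category_id",
    "seq": "item_seq",
    "desc_item": "item_description",
    "qty": "item_quantity",
    "total_item": "item_total",
}

# Abbreviations present in the old names (assumed list, for illustration only).
OLD_ABBREVIATIONS = {"prod", "desc", "qty"}


def rename_columns(columns: list[str]) -> list[str]:
    """Apply the rename map, leaving unmapped columns as they are."""
    return [SILVER_RENAMES.get(c, c) for c in columns]


def violates_rules(column: str) -> bool:
    """True if the name starts with `id_` or contains an old abbreviation."""
    parts = column.split("_")
    return parts[0] == "id" or bool(OLD_ABBREVIATIONS & set(parts))


old_columns = [
    "name", "country", "id_customer", "id_order", "total", "date",
    "product_id", "prod_category_id", "seq", "desc_item", "qty", "total_item",
]
new_columns = rename_columns(old_columns)
```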

### Job-level parameters (runtime, overridable per-run)

69 changes: 18 additions & 51 deletions scripts/sdk_generate_template_job.py
@@ -480,41 +480,15 @@ def _build_dashboard_json(catalog: str) -> dict:
"""
return {
"datasets": [
{
"name": "ds_kpi",
"displayName": "KPIs",
"queryLines": [
f"SELECT ROUND(SUM(total_item), 2) AS total_value, "
f"COUNT(DISTINCT id_order) AS num_orders, "
f"COUNT(DISTINCT id_customer) AS num_customers "
f"FROM {catalog}.curated.order_enriched "
f"WHERE date BETWEEN :date_range.min AND :date_range.max"
],
"parameters": [
{
"keyword": "date_range",
"displayName": "Date Range",
"dataType": "DATE",
"complexType": "RANGE",
"defaultSelection": {
"range": {
"dataType": "DATE",
"min": {"value": "now-1y"},
"max": {"value": "now"},
}
},
}
],
},
{
"name": "ds_orders",
"displayName": "Orders",
"queryLines": [
f"SELECT CAST(date AS DATE) AS order_date, country, name AS customer, "
f"CAST(product_id AS STRING) AS product_id, CAST(prod_category_id AS STRING) AS category_id, "
f"SUM(total_value) AS total_value "
f"SELECT order_date, country, customer_name AS customer, "
f"CAST(product_id AS STRING) AS product_id, CAST(product_category_id AS STRING) AS category_id, "
f"SUM(total_value) AS total_value, SUM(total_orders) AS total_orders "
f"FROM {catalog}.report.order_agg "
f"WHERE date BETWEEN :date_range.min AND :date_range.max "
f"WHERE order_date BETWEEN :date_range.min AND :date_range.max "
f"GROUP BY 1, 2, 3, 4, 5"
],
"parameters": [
@@ -566,9 +540,9 @@ def _build_dashboard_json(catalog: str) -> dict:
{
"name": "main_query",
"query": {
"datasetName": "ds_kpi",
"fields": [{"name": "total_value", "expression": "`total_value`"}],
"disaggregated": True,
"datasetName": "ds_orders",
"fields": [{"name": "total_value", "expression": "SUM(`total_value`)"}],
"disaggregated": False,
},
}
],
@@ -583,22 +557,22 @@ def _build_dashboard_json(catalog: str) -> dict:
},
{
"widget": {
"name": "kpi-num-orders",
"name": "kpi-total-orders",
"queries": [
{
"name": "main_query",
"query": {
"datasetName": "ds_kpi",
"fields": [{"name": "num_orders", "expression": "`num_orders`"}],
"disaggregated": True,
"datasetName": "ds_orders",
"fields": [{"name": "total_orders", "expression": "SUM(`total_orders`)"}],
"disaggregated": False,
},
}
],
"spec": {
"version": 2,
"widgetType": "counter",
"encodings": {"value": {"fieldName": "num_orders", "displayName": "Number of Orders"}},
"frame": {"title": "Number of Orders", "showTitle": True},
"encodings": {"value": {"fieldName": "total_orders", "displayName": "Total Orders"}},
"frame": {"title": "Total Orders", "showTitle": True},
},
},
"position": {"x": 2, "y": 2, "width": 2, "height": 3},
@@ -610,9 +584,11 @@ def _build_dashboard_json(catalog: str) -> dict:
{
"name": "main_query",
"query": {
"datasetName": "ds_kpi",
"fields": [{"name": "num_customers", "expression": "`num_customers`"}],
"disaggregated": True,
"datasetName": "ds_orders",
"fields": [
{"name": "num_customers", "expression": "COUNT(DISTINCT `customer`)"}
],
"disaggregated": False,
},
}
],
@@ -784,22 +760,13 @@ def _build_dashboard_json(catalog: str) -> dict:
"disaggregated": False,
},
},
{
"name": "q_date_kpi",
"query": {
"datasetName": "ds_kpi",
"parameters": [{"name": "date_range", "keyword": "date_range"}],
"disaggregated": False,
},
},
],
"spec": {
"version": 2,
"widgetType": "filter-date-range-picker",
"encodings": {
"fields": [
{"parameterName": "date_range", "queryName": "q_date"},
{"parameterName": "date_range", "queryName": "q_date_kpi"},
]
},
"frame": {"showTitle": True, "title": "Date Range"},
33 changes: 33 additions & 0 deletions src/template/commonSchemas.py
@@ -1,6 +1,9 @@
from pyspark.sql.types import (
DateType,
DoubleType,
FloatType,
IntegerType,
LongType,
StringType,
StructField,
StructType,
@@ -34,3 +37,33 @@
StructField("total_item", FloatType(), True),
]
)

order_enriched_schema = StructType(
[
StructField("customer_name", StringType(), True),
StructField("country", StringType(), True),
StructField("customer_id", IntegerType(), True),
StructField("order_id", IntegerType(), True),
StructField("order_total", FloatType(), True),
StructField("order_date", DateType(), True),
StructField("product_id", IntegerType(), True),
StructField("product_category_id", IntegerType(), True),
StructField("item_seq", IntegerType(), True),
StructField("item_description", StringType(), True),
StructField("item_quantity", IntegerType(), True),
StructField("item_total", FloatType(), True),
]
)

order_agg_schema = StructType(
[
StructField("customer_name", StringType(), True),
StructField("country", StringType(), True),
StructField("order_date", DateType(), True),
StructField("product_id", IntegerType(), True),
StructField("product_category_id", IntegerType(), True),
StructField("total_quantity", LongType(), True),
StructField("total_value", DoubleType(), True),
StructField("total_orders", LongType(), True),
]
)
20 changes: 10 additions & 10 deletions src/template/job1/generate_orders.py
@@ -12,18 +12,18 @@ def enrich_order(self, df_customer, df_order, df_order_item):
df_order_item.join(df_order, df_order_item["id_order"] == df_order["id"])
.join(df_customer, df_order["id_customer"] == df_customer["id"])
.select(
"name",
df_customer["name"].alias("customer_name"),
"country",
"id_customer",
"id_order",
"total",
"date",
df_order["id_customer"].alias("customer_id"),
df_order_item["id_order"].alias("order_id"),
df_order["total"].alias("order_total"),
df_order["date"].cast("date").alias("order_date"),
"product_id",
"prod_category_id",
"seq",
"desc_item",
"qty",
"total_item",
df_order["prod_category_id"].alias("product_category_id"),
df_order_item["seq"].alias("item_seq"),
df_order_item["desc_item"].alias("item_description"),
df_order_item["qty"].alias("item_quantity"),
df_order_item["total_item"].alias("item_total"),
)
)

9 changes: 5 additions & 4 deletions src/template/job1/generate_orders_agg.py
@@ -1,4 +1,4 @@
from pyspark.sql.functions import sum
from pyspark.sql.functions import countDistinct, sum

from ..baseTask import BaseTask

@@ -10,9 +10,10 @@ def __init__(self, config):
def aggregate_orders(self, df_order):
# TODO code your transformations here...

return df_order.groupBy("name", "country", "date", "product_id", "prod_category_id").agg(
sum("qty").alias("total_qty"),
sum("total_item").alias("total_value"),
return df_order.groupBy("customer_name", "country", "order_date", "product_id", "product_category_id").agg(
sum("item_quantity").alias("total_quantity"),
sum("item_total").alias("total_value"),
countDistinct("order_id").alias("total_orders"),
)

def run(self):
34 changes: 18 additions & 16 deletions src/template/job1_sdp/transforms.py
@@ -26,24 +26,25 @@ def enrich_order(df_customer: DataFrame, df_order: DataFrame, df_order_item: Dat

Returns:
Enriched DataFrame with columns:
name, country, id_customer, id_order, total, date, product_id, prod_category_id, seq, desc_item, qty, total_item
customer_name, country, customer_id, order_id, order_total, order_date, product_id,
product_category_id, item_seq, item_description, item_quantity, item_total
"""
return (
df_order_item.join(df_order, df_order_item["id_order"] == df_order["id"])
.join(df_customer, df_order["id_customer"] == df_customer["id"])
.select(
"name",
df_customer["name"].alias("customer_name"),
"country",
"id_customer",
"id_order",
"total",
"date",
df_order["id_customer"].alias("customer_id"),
df_order_item["id_order"].alias("order_id"),
df_order["total"].alias("order_total"),
df_order["date"].cast("date").alias("order_date"),
"product_id",
"prod_category_id",
"seq",
"desc_item",
"qty",
"total_item",
df_order["prod_category_id"].alias("product_category_id"),
df_order_item["seq"].alias("item_seq"),
df_order_item["desc_item"].alias("item_description"),
df_order_item["qty"].alias("item_quantity"),
df_order_item["total_item"].alias("item_total"),
)
)

@@ -58,10 +59,11 @@ def aggregate_orders(df_order_enriched: DataFrame) -> DataFrame:
df_order_enriched: curated.order_enriched

Returns:
DataFrame with columns: name, country, date, product_id, prod_category_id,
total_qty (LongType), total_value (DoubleType)
DataFrame with columns: customer_name, country, order_date, product_id,
product_category_id, total_quantity (LongType), total_value (DoubleType), total_orders (LongType)
"""
return df_order_enriched.groupBy("name", "country", "date", "product_id", "prod_category_id").agg(
F.sum("qty").alias("total_qty"),
F.sum("total_item").alias("total_value"),
return df_order_enriched.groupBy("customer_name", "country", "order_date", "product_id", "product_category_id").agg(
F.sum("item_quantity").alias("total_quantity"),
F.sum("item_total").alias("total_value"),
F.countDistinct("order_id").alias("total_orders"),
)
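
`total_orders` is a distinct count per group, and the dashboard's `ds_orders` query and KPI tile then sum it across groups. A plain-Python sketch with made-up rows shows how that combination behaves: an order spanning several products is counted once in each product group, so the summed figure can exceed the number of distinct orders. The rows below are hypothetical, and grouping by `product_id` alone is a simplification of the five-column group key used in `aggregate_orders`.

```python
from collections import defaultdict

# Hypothetical (order_id, product_id) rows; order 1 contains two products.
rows = [(1, "A"), (1, "B"), (2, "A"), (3, "B")]

# Collect the distinct order ids seen in each group.
orders_per_group = defaultdict(set)
for order_id, product_id in rows:
    orders_per_group[product_id].add(order_id)

# Per-group distinct count, analogous to countDistinct("order_id") after groupBy.
total_orders_by_group = {k: len(v) for k, v in orders_per_group.items()}

# Summing the per-group counts, analogous to SUM(total_orders) in the dashboard.
summed = sum(total_orders_by_group.values())

# Distinct orders over all rows, for comparison.
distinct = len({order_id for order_id, _ in rows})
```

Here `summed` is 4 while `distinct` is 3, because order 1 appears in both groups. If each order in the real data held a single product and category, the two figures would agree.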