
fix(db): bump preview CNPG storage 10Gi→20Gi (reconcile live patch)#159

Open
mikestankavich wants to merge 1 commit into main from fix/preview-db-storage-20gi



@mikestankavich mikestankavich commented Jun 20, 2026

Contributor

What

Bumps the preview CNPG cluster (trakrf-db-preview) storage from 10Gi → 20Gi via a new per-env dbCluster.storageSize override. Prod is unchanged (no override → chart's 10Gi default).

Why

On 2026-06-20 preview went offline (app.preview.trakrf.id → HTTP 503 "no available server").

Root cause: preview's single data+WAL PVC was full (9.5G). CNPG logged "no free disk space for WALs" and refused to start Postgres → trakrf-db-preview-1 CrashLoopBackOff → backend 0/1 Ready → Traefik served 503. Not a build/deploy issue (CI was green).

The fill is tag_scans = 8.6 GB (91% of the DB) — the raw-read firehose from the ongoing geofence read test (uncompressed, 30-day retention, ~1.1 KB/row jsonb). Application tables total only ~30 MB. 20Gi is headroom, not the fix — the durable fix is Timescale compression + shorter retention on tag_scans, tracked in TRA-921 (likely 7-day retention + compress after 1–2 days).
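As a rough sanity check on the numbers above, the arithmetic can be sketched as follows. This is a back-of-envelope estimate, not a measurement: it assumes GB means 10^9 bytes, that tag_scans already holds a full 30 days of data, and that ingest is constant. If the table holds fewer than 30 days, the real daily rate is higher than this estimate. The 7-day figure is uncompressed and ignores any savings from Timescale compression.

```python
# Back-of-envelope sizing for tag_scans, using the figures quoted in this PR.
# Assumptions (not from the PR): GB = 10^9 bytes, the table already spans the
# full 30-day retention window, and the ingest rate is constant.

TABLE_BYTES = 8.6e9      # tag_scans size
ROW_BYTES = 1.1e3        # ~1.1 KB per row (jsonb)
RETENTION_DAYS = 30      # current retention

# Approximate row count currently stored.
rows = TABLE_BYTES / ROW_BYTES

# Lower bound on daily ingest: total size spread over the retention window.
daily_bytes = TABLE_BYTES / RETENTION_DAYS


def steady_state_gb(retention_days: float) -> float:
    """Uncompressed steady-state table size (GB) for a given retention."""
    return daily_bytes * retention_days / 1e9


print(f"rows stored:          ~{rows / 1e6:.1f} million")
print(f"daily ingest (floor): ~{daily_bytes / 1e6:.0f} MB/day")
print(f"7-day retention:      ~{steady_state_gb(7):.1f} GB uncompressed")
```

Under these assumptions, shortening retention alone would cut the table to roughly a quarter of its current size, which is consistent with treating 20Gi as headroom and TRA-921 as the durable fix.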

Fix sequence

  1. Live remediation (already applied): patched the CNPG Cluster CR to 20Gi; CNPG didn't propagate the new size to the PVC, so also patched the PVC's requests.storage directly to trigger the CSI resize. Volume grew to 20Gi, Postgres started, cluster returned to "Cluster in healthy state", backend 1/1, app root → HTTP 200.
  2. This PR: reconciles git so ArgoCD stops showing drift. PVCs can't shrink, so 20Gi is the floor regardless of merge timing.
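For reference, the two live patches in step 1 could look roughly like the merge patches built below. This is a sketch, not a record of the exact commands that were run: the namespace `preview` is hypothetical, the PVC name `trakrf-db-preview-1` is assumed to match the instance pod name, and the script only prints the `kubectl` commands rather than executing them. Online PVC expansion also depends on the StorageClass allowing volume expansion.

```python
import json
import shlex

NAMESPACE = "preview"            # hypothetical; the PR does not name the namespace
CLUSTER = "trakrf-db-preview"    # CNPG cluster name from this PR
PVC = "trakrf-db-preview-1"      # assumed: PVC named after the instance pod
NEW_SIZE = "20Gi"


def cluster_patch(size: str) -> dict:
    """Merge patch setting the CNPG Cluster's spec.storage.size."""
    return {"spec": {"storage": {"size": size}}}


def pvc_patch(size: str) -> dict:
    """Merge patch raising the PVC's spec.resources.requests.storage."""
    return {"spec": {"resources": {"requests": {"storage": size}}}}


def kubectl_patch(kind: str, name: str, patch: dict) -> str:
    """Format a shell-safe `kubectl patch --type merge` command line."""
    args = [
        "kubectl", "-n", NAMESPACE, "patch", kind, name,
        "--type", "merge",
        "-p", json.dumps(patch, separators=(",", ":")),
    ]
    return " ".join(shlex.quote(a) for a in args)


commands = [
    kubectl_patch("clusters.postgresql.cnpg.io", CLUSTER, cluster_patch(NEW_SIZE)),
    kubectl_patch("pvc", PVC, pvc_patch(NEW_SIZE)),
]

for command in commands:
    print(command)
```

Printing instead of executing keeps the sketch safe to run anywhere; the output can be reviewed and pasted into a shell that has cluster access.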

Notes

  • Per-env override plumbed through the root app's inlineValues. Verified with helm template argocd/root --set cluster=gke: preview renders size: "20Gi", prod renders no size key.

Follow-up tickets

  • TRA-921 — Timescale compression + retention tuning for tag_scans/asset_scans (the actual fix; the geofence firehose will refill 20Gi otherwise).
  • TRA-1017 — monitoring + alerting on CNPG DB disk usage for preview and prod (prod runs the same chart at the same 10Gi default; no alert fired this time).
  • TRA-1016 — periodic prune of preview e2e cruft (minor; demoted — it was not the disk driver).

🤖 Generated with Claude Code

Preview's single data+WAL PVC filled. CNPG logged "no free disk space
for WALs" and refused to start Postgres, so the backend went 0/1 Ready
and app.preview.trakrf.id served HTTP 503 "no available server".

Restored service by live-patching the Cluster CR + PVC to 20Gi. This
commit reconciles git so ArgoCD stops showing drift (PVCs can't shrink,
so this is the floor regardless).

Adds a per-env dbCluster.storageSize override plumbed through the root
app's inlineValues; unset on prod, which keeps the chart's 10Gi default.
Preview fills from e2e/Playwright churn — 20Gi is headroom, not a fix;
periodic prune + disk monitoring tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>