From c313663c46099860bb3a3d83e862392e39b3819a Mon Sep 17 00:00:00 2001 From: Mike Stankavich Date: Wed, 10 Jun 2026 13:48:14 -0500 Subject: [PATCH 1/3] docs(tra-972): design spec for RESEND_API_KEY PreSync guard Co-Authored-By: Claude Opus 4.8 (1M context) --- ...-tra-972-email-key-presync-guard-design.md | 65 +++++++++++++++++++ 1 file changed, 65 insertions(+) create mode 100644 docs/superpowers/specs/2026-06-10-tra-972-email-key-presync-guard-design.md diff --git a/docs/superpowers/specs/2026-06-10-tra-972-email-key-presync-guard-design.md b/docs/superpowers/specs/2026-06-10-tra-972-email-key-presync-guard-design.md new file mode 100644 index 0000000..986fd6b --- /dev/null +++ b/docs/superpowers/specs/2026-06-10-tra-972-email-key-presync-guard-design.md @@ -0,0 +1,65 @@ +# TRA-972 — RESEND_API_KEY PreSync guard (infra hardening) + +**Date:** 2026-06-10 +**Ticket:** TRA-972 — RESEND_API_KEY empty on preview + prod, all transactional email broken (GKE cutover gate) +**Related:** infra#154 (chart plumbing: omit-when-empty + `ignoreDifferences`), TRA-860 (JWT_SECRET fail-fast pattern), TRA-375 (ESO+GSM future) + +## Problem + +Transactional email (org invites, password resets) silently failed on preview + prod GKE because `RESEND_API_KEY` was empty in the `trakrf-backend` Secret. The backend logs the Resend "API key is invalid" error but returns `201`/`200` anyway (send is best-effort, non-fatal), so the UI shows success while nothing is delivered. An empty key is **invisible** — nothing fails loudly until a user reports a missing email. + +infra#154 stops ArgoCD from clobbering the key (omit-when-empty + per-env `ignoreDifferences`), and the real key is injected out-of-band. But nothing **detects** the regression class: a future env that forgets the injection, a deleted key, or a botched cutover would silently ship broken email again. 
+ +## Goal + +Add an infra-layer guard that turns "empty `RESEND_API_KEY` in a deployed env" from a silent runtime warning into a **loud sync failure**, surfaced in the ArgoCD UI before the broken state goes live. This is the cutover gate named in the ticket title. + +## Design + +A helm-templated `Job` in the `trakrf-backend` chart, run as a **PreSync hook**, that fails the sync when `RESEND_API_KEY` is empty or missing. + +### Component: `templates/email-guard-job.yaml` + +- **Trigger:** `helm.sh/hook: pre-install,pre-upgrade` (ArgoCD maps both to **PreSync**, consistent with the existing `migrate-job.yaml`). `hook-weight: "-10"` so it runs *before* the migrate job (`-5`) — a missing key fails the sync fast, before migrations or pod rollout. `hook-delete-policy: before-hook-creation,hook-succeeded`. +- **Check:** the container receives `RESEND_API_KEY` from the `trakrf-backend` Secret via `secretKeyRef` with `optional: true`, then: + ```sh + if [ -z "$RESEND_API_KEY" ]; then + echo "FATAL: RESEND_API_KEY is empty or missing in this deployed env." + echo "Transactional email (invites, password resets) will silently fail." + echo "Inject the key out-of-band (see TRA-972 / TRA-375) before syncing." + exit 1 + fi + echo "RESEND_API_KEY present (len ${#RESEND_API_KEY})." + ``` + `optional: true` is load-bearing: with infra#154 the key is *omitted* from the Secret when empty, so a hard `secretKeyRef` would error at pod-create with an opaque `CreateContainerConfigError`. `optional: true` yields an empty env var instead, which the check catches with a clear message. It also catches a present-but-empty-string key. +- **Validity vs presence:** the guard checks **presence only**, not whether Resend accepts the key. A send-scoped `re_...` key cannot be read-only validated (Resend's `/domains` returns `401 restricted_api_key`); the only validity check is an actual send, which a sync hook must not do. Presence is the high-value, zero-side-effect check. 
+- **Image:** `busybox` (pinned, multi-arch — **arm64 matters**, GKE nodes are ARM T2A/Axion; the backend's own Go image is shell-less so it can't run the check). Overridable via `emailGuard.image`. +- **Scheduling:** reuses the chart's `nodeSelector`/`affinity`/`tolerations` passthrough (same as `migrate-job`) so it tolerates the GKE `arch=arm64:NoSchedule` taint. Hardened pod (`runAsNonRoot`, drop ALL caps, `readOnlyRootFilesystem`), tiny resources. + +### Gating: `emailGuard.enabled` + +- New chart value `emailGuard.enabled`, **default `false`** — so local dev, AKS, and EKS overlays render nothing (the guard is GKE-prod-fleet hardening, and CI's eks/aks template runs stay clean). +- Flipped `true` for **preview + prod** via `argocd/root/templates/trakrf-backend.yaml` inlineValues (same mechanism as `mqttEnabled`). + +## Rollout ordering (critical) + +preview + prod currently hold an **empty** key, so an active guard would block their sync. Required order: + +1. Merge infra#154 (omit-when-empty + `ignoreDifferences`). +2. Inject the real `re_...` key out-of-band into preview (then prod, on Mike's timing). +3. Merge **this** guard PR. + +The guard PR is held-for-review, so it naturally lands last — no deadlock. On a brand-new env's very first sync the Secret/key must be pre-provisioned (consistent with the out-of-band injection model); documented, not solved here. + +## Out of scope + +- Backend-side loud failure (boot guard / metric in `resend.go`) — platform repo, handed to platform if wanted. +- ESO + GCP Secret Manager (TRA-375). +- Key *validity* checking (would require a live send). + +## Verification + +- `helm template trakrf-backend` with `emailGuard.enabled=false` → no guard Job (default / eks / aks). +- `helm template trakrf-backend --set emailGuard.enabled=true` → guard Job present, hook annotations correct, `RESEND_API_KEY` `secretKeyRef` with `optional: true`. 
+- `helm template argocd/root` → preview + prod backend apps carry `emailGuard.enabled: true`; others absent/false. +- `helm lint` clean; CI eks/aks matrix green. From e01c226450faf4434d5af90ef79a2525bf01f044 Mon Sep 17 00:00:00 2001 From: Mike Stankavich Date: Wed, 10 Jun 2026 13:51:05 -0500 Subject: [PATCH 2/3] feat(backend): RESEND_API_KEY PreSync guard for deployed envs (TRA-972) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Turns an empty/missing RESEND_API_KEY in a deployed env from a silent, best-effort runtime warning (the backend logs Resend "API key is invalid" but still returns 2xx, so the UI shows success and no email is delivered) into a LOUD ArgoCD sync failure, surfaced before the broken state goes live. This is the cutover gate named in TRA-972. Belt-and-suspenders to infra#154 (omit-when-empty + ignoreDifferences): #154 PREVENTS one cause (ArgoCD clobbering the key to ""); this DETECTS the whole regression class — a missed out-of-band injection, a deleted key, or a botched cutover. - helm/trakrf-backend/templates/email-guard-job.yaml: a pre-install/ pre-upgrade hook Job (ArgoCD PreSync, like migrate-job; weight -10 so it runs first). Reads RESEND_API_KEY from the trakrf-backend Secret via secretKeyRef optional:true (load-bearing: #154 OMITS the key when empty, so a hard ref would die with an opaque CreateContainerConfigError; optional yields an empty env the check catches, and also catches a present-but-empty key). Fails with a clear message + exit 1 when empty. Checks PRESENCE not validity (a send-scoped key can't be read-only validated; a hook must not send a real email). busybox pinned + multi- arch (GKE nodes are arm64; the backend Go image is shell-less). Hardened pod (runAsNonRoot, drop ALL, RO rootfs), tolerations passthrough so it schedules on the ARM-tainted pool. - emailGuard.{enabled,image} chart value, default disabled. 
- argocd/root: emailGuardEnabled per-env (preview=true now; prod=false with a flip-after-injection comment — an active guard would block live prod syncs until the key is present). Mirrors the mqttEnabled staging convention. Rollout order (critical): merge infra#154 -> inject the real key into preview (then prod, on Mike's timing) -> merge THIS PR. The guard PR is held for review so it lands last; no deadlock. Verified: helm template omits the Job when disabled (default/eks/aks) and renders it correctly when enabled (PreSync hook, optional secretRef, -z check, pinned busybox); guard shell logic exits 1 on empty/unset and 0 when set; argocd/root gives only the preview backend app emailGuard.enabled:true; helm lint + template clean on eks+aks (CI parity). Co-Authored-By: Claude Opus 4.8 (1M context) --- argocd/root/templates/trakrf-backend.yaml | 6 ++ argocd/root/values.yaml | 7 ++ .../templates/email-guard-job.yaml | 96 +++++++++++++++++++ helm/trakrf-backend/values.yaml | 12 +++ 4 files changed, 121 insertions(+) create mode 100644 helm/trakrf-backend/templates/email-guard-job.yaml diff --git a/argocd/root/templates/trakrf-backend.yaml b/argocd/root/templates/trakrf-backend.yaml index d6c5fed..cd1ff9b 100644 --- a/argocd/root/templates/trakrf-backend.yaml +++ b/argocd/root/templates/trakrf-backend.yaml @@ -31,6 +31,12 @@ {{- if $cfg.mqttEnabled }} {{- $base = printf "%smqtt:\n host: mqtt.%s.gke.trakrf.id\n clientId: trakrf-backend-%s-%s\n" $base $env $.Values.cluster $env }} {{- end }} +{{- /* TRA-972: RESEND_API_KEY PreSync guard. Enabled per-env (preview now; prod + only AFTER its key is injected, or the guard blocks live prod syncs). The + guard image comes from the chart default (busybox). 
*/ -}} +{{- if $cfg.emailGuardEnabled }} +{{- $base = printf "%semailGuard:\n enabled: true\n" $base }} +{{- end }} {{- $ingress := "ingress:\n enabled: false\n" }} {{- if $cfg.ingressEnabled }} {{- $ingressCtx := dict diff --git a/argocd/root/values.yaml b/argocd/root/values.yaml index 77cdb88..56336fa 100644 --- a/argocd/root/values.yaml +++ b/argocd/root/values.yaml @@ -35,6 +35,11 @@ namespaces: # until cutover (see TRA-920 prod checklist). GKE-only # (broker host is .gke.); harmless if ever set on a # cluster whose backend image predates TRA-900. +# emailGuardEnabled — TRA-972: renders the RESEND_API_KEY PreSync guard +# Job, which fails the sync loudly if the email key +# is empty/missing in this env. preview=true; prod +# stays false until its key is injected (an active +# guard blocks live prod syncs). Default off. # imageTag — trakrf-backend image tag override. Empty falls # back to chart default (helm/trakrf-backend/values-.yaml). # On GKE BOTH envs are image-updater-tracked @@ -80,6 +85,7 @@ envs: appTrakrfIdHost: app.preview.trakrf.id mqttIp: "" mqttEnabled: true + emailGuardEnabled: true imageTag: "" environmentLabel: preview jwtExpirationSeconds: "900" @@ -95,6 +101,7 @@ envs: appTrakrfIdHost: app.trakrf.id mqttIp: "" mqttEnabled: false # flip to true at prod cutover — see TRA-920 prod checklist + emailGuardEnabled: false # flip to true AFTER prod RESEND_API_KEY is injected — else the PreSync guard blocks live prod syncs (TRA-972) # imageTag="" — prod is image-updater-tracked (TRA-351 option b): the # trakrf-backend-prod ImageUpdater CR pins the resolved :prod DIGEST into # the Application (immutable ref → kubelet pulls the exact build, fixing diff --git a/helm/trakrf-backend/templates/email-guard-job.yaml b/helm/trakrf-backend/templates/email-guard-job.yaml new file mode 100644 index 0000000..ba867b8 --- /dev/null +++ b/helm/trakrf-backend/templates/email-guard-job.yaml @@ -0,0 +1,96 @@ +{{- if .Values.emailGuard.enabled }} +{{- /* + RESEND_API_KEY 
PreSync guard (TRA-972). Turns an empty/missing email key in a + deployed env from a silent best-effort warning (the backend logs Resend "API + key is invalid" but still returns 2xx, so the UI shows success and nothing is + delivered) into a LOUD sync failure surfaced in the ArgoCD UI before the broken + state goes live. Belt-and-suspenders to infra#154's omit-when-empty + + ignoreDifferences: this DETECTS the regression class (missed injection, deleted + key, botched cutover) that #154 only prevents one cause of. + + pre-install,pre-upgrade => ArgoCD PreSync (same mapping as migrate-job). Runs at + weight -10, before migrate (-5), so a missing key fails fast before migrations + or pod rollout. Enabled only for deployed envs (preview/prod) via the root + chart; default off so local/eks/aks render nothing. + + Checks PRESENCE only, not validity: a send-scoped re_... key can't be read-only + validated (Resend /domains => 401 restricted_api_key) and a hook must not send a + real email. optional:true on the secretKeyRef is load-bearing — #154 OMITS the + key from the Secret when empty, so a hard ref would die with an opaque + CreateContainerConfigError; optional yields an empty env the check catches with + a clear message (and also catches a present-but-empty-string key). +*/}} +apiVersion: batch/v1 +kind: Job +metadata: + name: {{ include "trakrf-backend.fullname" . }}-email-guard + labels: + {{- include "trakrf-backend.labels" . | nindent 4 }} + app.kubernetes.io/component: email-guard + annotations: + helm.sh/hook: pre-install,pre-upgrade + helm.sh/hook-weight: "-10" + helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded +spec: + backoffLimit: 0 + ttlSecondsAfterFinished: 600 + template: + metadata: + labels: + {{- include "trakrf-backend.labels" . 
| nindent 8 }} + app.kubernetes.io/component: email-guard + spec: + restartPolicy: Never + securityContext: + runAsNonRoot: true + runAsUser: 65534 + seccompProfile: + type: RuntimeDefault + containers: + - name: email-guard + image: {{ .Values.emailGuard.image | quote }} + imagePullPolicy: IfNotPresent + securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + capabilities: + drop: + - ALL + env: + - name: RESEND_API_KEY + valueFrom: + secretKeyRef: + name: {{ include "trakrf-backend.fullname" . }} + key: RESEND_API_KEY + optional: true + command: + - /bin/sh + - -c + - | + if [ -z "$RESEND_API_KEY" ]; then + echo "FATAL: RESEND_API_KEY is empty or missing in this deployed env." + echo "Transactional email (invites, password resets) will silently fail." + echo "Inject the key out-of-band (see TRA-972 / TRA-375) before syncing." + exit 1 + fi + echo "RESEND_API_KEY present (len ${#RESEND_API_KEY}). Email guard passed." + resources: + requests: + cpu: 10m + memory: 16Mi + limits: + cpu: 50m + memory: 32Mi + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 8 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 8 }} + {{- end }} +{{- end }} diff --git a/helm/trakrf-backend/values.yaml b/helm/trakrf-backend/values.yaml index 6060ef9..5601e08 100644 --- a/helm/trakrf-backend/values.yaml +++ b/helm/trakrf-backend/values.yaml @@ -156,6 +156,18 @@ secrets: resendApiKey: "" sentryDsn: "" +# RESEND_API_KEY PreSync guard (TRA-972). When enabled, a pre-install/pre-upgrade +# hook Job fails the sync (loudly, in the ArgoCD UI) if RESEND_API_KEY is empty or +# missing in this env's Secret — so a missed out-of-band injection can't silently +# ship broken transactional email. Default off; the root chart flips it true for +# deployed envs (preview + prod) only, so local/eks/aks render nothing. 
+# Checks presence, not validity (see templates/email-guard-job.yaml). busybox is +# pinned + multi-arch because GKE nodes are arm64 and the backend image is +# shell-less. +emailGuard: + enabled: false + image: busybox:1.37.0 + nodeSelector: {} tolerations: [] affinity: {} From c7c194e22839300b8acbb92f4d4b74b100915038 Mon Sep 17 00:00:00 2001 From: Mike Stankavich Date: Wed, 10 Jun 2026 14:11:14 -0500 Subject: [PATCH 3/3] feat(backend): enable RESEND_API_KEY guard for prod too (TRA-972) Prod RESEND_API_KEY was injected out-of-band 2026-06-10 (Mike green-lit deploy to both preview + prod), so the PreSync guard now passes in prod. Flip prod emailGuardEnabled false->true so the cutover gate protects prod as well as preview. Co-Authored-By: Claude Opus 4.8 (1M context) --- argocd/root/values.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/argocd/root/values.yaml b/argocd/root/values.yaml index 56336fa..3706557 100644 --- a/argocd/root/values.yaml +++ b/argocd/root/values.yaml @@ -101,7 +101,7 @@ envs: appTrakrfIdHost: app.trakrf.id mqttIp: "" mqttEnabled: false # flip to true at prod cutover — see TRA-920 prod checklist - emailGuardEnabled: false # flip to true AFTER prod RESEND_API_KEY is injected — else the PreSync guard blocks live prod syncs (TRA-972) + emailGuardEnabled: true # prod RESEND_API_KEY injected 2026-06-10, so the PreSync guard passes (TRA-972) # imageTag="" — prod is image-updater-tracked (TRA-351 option b): the # trakrf-backend-prod ImageUpdater CR pins the resolved :prod DIGEST into # the Application (immutable ref → kubelet pulls the exact build, fixing
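Reviewer note: the commit message for patch 2 states that the guard shell logic exits 1 on empty/unset and 0 when set. A minimal local sketch of that check follows, assuming only a POSIX `sh`; it is not part of the chart. `guard_check` is a hypothetical helper name, `re_placeholder` is a dummy value rather than a real key, and `${RESEND_API_KEY:-}` is used in place of the Job's plain `$RESEND_API_KEY` only so the sketch also survives `set -u`.

```sh
#!/bin/sh
# Sketch of the email-guard presence check, runnable outside the cluster.
# guard_check is a hypothetical name; the Job runs this logic inline.
guard_check() {
  # Same -z test as the Job; the :- default only guards against `set -u`.
  if [ -z "${RESEND_API_KEY:-}" ]; then
    echo "FATAL: RESEND_API_KEY is empty or missing in this deployed env."
    return 1
  fi
  echo "RESEND_API_KEY present (len ${#RESEND_API_KEY}). Email guard passed."
}

# Case 1: key omitted from the Secret, so the variable never reaches the shell.
unset RESEND_API_KEY
rc_unset=0
guard_check || rc_unset=$?

# Case 2: key present in the Secret but set to the empty string.
RESEND_API_KEY=""
rc_empty=0
guard_check || rc_empty=$?

# Case 3: a non-empty dummy value (presence only; validity is never checked).
RESEND_API_KEY="re_placeholder"
rc_set=0
guard_check || rc_set=$?

echo "unset=$rc_unset empty=$rc_empty set=$rc_set"
```

The last line should read `unset=1 empty=1 set=0`, which matches the behaviour the patch relies on: both the omitted-key and empty-string cases fail the hook, and any non-empty value passes regardless of whether Resend would accept it.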