6 changes: 6 additions & 0 deletions argocd/root/templates/trakrf-backend.yaml
@@ -31,6 +31,12 @@
{{- if $cfg.mqttEnabled }}
{{- $base = printf "%smqtt:\n host: mqtt.%s.gke.trakrf.id\n clientId: trakrf-backend-%s-%s\n" $base $env $.Values.cluster $env }}
{{- end }}
{{- /* TRA-972: RESEND_API_KEY PreSync guard. Enabled per-env (preview now; prod
only AFTER its key is injected, or the guard blocks live prod syncs). The
guard image comes from the chart default (busybox). */ -}}
{{- if $cfg.emailGuardEnabled }}
{{- $base = printf "%semailGuard:\n enabled: true\n" $base }}
{{- end }}
{{- $ingress := "ingress:\n enabled: false\n" }}
{{- if $cfg.ingressEnabled }}
{{- $ingressCtx := dict
7 changes: 7 additions & 0 deletions argocd/root/values.yaml
@@ -35,6 +35,11 @@ namespaces:
# until cutover (see TRA-920 prod checklist). GKE-only
# (broker host is .gke.); harmless if ever set on a
# cluster whose backend image predates TRA-900.
# emailGuardEnabled — TRA-972: renders the RESEND_API_KEY PreSync guard
# Job, which fails the sync loudly if the email key
# is empty/missing in this env. Enable an env only
# after its key is injected (an active guard blocks
# that env's syncs while the key is empty). Default off.
# imageTag — trakrf-backend image tag override. Empty falls
# back to chart default (helm/trakrf-backend/values-<cluster>.yaml).
# On GKE BOTH envs are image-updater-tracked
@@ -80,6 +85,7 @@ envs:
appTrakrfIdHost: app.preview.trakrf.id
mqttIp: ""
mqttEnabled: true
emailGuardEnabled: true
imageTag: ""
environmentLabel: preview
jwtExpirationSeconds: "900"
@@ -95,6 +101,7 @@ envs:
appTrakrfIdHost: app.trakrf.id
mqttIp: ""
mqttEnabled: false # flip to true at prod cutover — see TRA-920 prod checklist
emailGuardEnabled: true # prod RESEND_API_KEY injected 2026-06-10, so the PreSync guard passes (TRA-972)
# imageTag="" — prod is image-updater-tracked (TRA-351 option b): the
# trakrf-backend-prod ImageUpdater CR pins the resolved :prod DIGEST into
# the Application (immutable ref → kubelet pulls the exact build, fixing
@@ -0,0 +1,65 @@
# TRA-972 — RESEND_API_KEY PreSync guard (infra hardening)

**Date:** 2026-06-10
**Ticket:** TRA-972 — RESEND_API_KEY empty on preview + prod, all transactional email broken (GKE cutover gate)
**Related:** infra#154 (chart plumbing: omit-when-empty + `ignoreDifferences`), TRA-860 (JWT_SECRET fail-fast pattern), TRA-375 (ESO+GSM future)

## Problem

Transactional email (org invites, password resets) silently failed on preview + prod GKE because `RESEND_API_KEY` was empty in the `trakrf-backend` Secret. The backend logs the Resend "API key is invalid" error but returns `201`/`200` anyway (send is best-effort, non-fatal), so the UI shows success while nothing is delivered. An empty key is **invisible** — nothing fails loudly until a user reports a missing email.

infra#154 stops ArgoCD from clobbering the key (omit-when-empty + per-env `ignoreDifferences`), and the real key is injected out-of-band. But nothing **detects** the regression class: a future env that forgets the injection, a deleted key, or a botched cutover would silently ship broken email again.

## Goal

Add an infra-layer guard that turns "empty `RESEND_API_KEY` in a deployed env" from a silent runtime warning into a **loud sync failure**, surfaced in the ArgoCD UI before the broken state goes live. This is the cutover gate named in the ticket title.

## Design

A helm-templated `Job` in the `trakrf-backend` chart, run as a **PreSync hook**, that fails the sync when `RESEND_API_KEY` is empty or missing.

### Component: `templates/email-guard-job.yaml`

- **Trigger:** `helm.sh/hook: pre-install,pre-upgrade` (ArgoCD maps both to **PreSync**, consistent with the existing `migrate-job.yaml`). `hook-weight: "-10"` so it runs *before* the migrate job (`-5`) — a missing key fails the sync fast, before migrations or pod rollout. `hook-delete-policy: before-hook-creation,hook-succeeded`.
- **Check:** the container receives `RESEND_API_KEY` from the `trakrf-backend` Secret via `secretKeyRef` with `optional: true`, then:
```sh
if [ -z "$RESEND_API_KEY" ]; then
  echo "FATAL: RESEND_API_KEY is empty or missing in this deployed env."
  echo "Transactional email (invites, password resets) will silently fail."
  echo "Inject the key out-of-band (see TRA-972 / TRA-375) before syncing."
  exit 1
fi
echo "RESEND_API_KEY present (len ${#RESEND_API_KEY})."
```
`optional: true` is load-bearing: with infra#154 the key is *omitted* from the Secret when empty, so a hard `secretKeyRef` would error at pod-create with an opaque `CreateContainerConfigError`. `optional: true` yields an empty env var instead, which the check catches with a clear message. It also catches a present-but-empty-string key.
- **Validity vs presence:** the guard checks **presence only**, not whether Resend accepts the key. A send-scoped `re_...` key cannot be read-only validated (Resend's `/domains` returns `401 restricted_api_key`); the only validity check is an actual send, which a sync hook must not do. Presence is the high-value, zero-side-effect check.
- **Image:** `busybox` (pinned, multi-arch — **arm64 matters**, GKE nodes are ARM T2A/Axion; the backend's own Go image is shell-less so it can't run the check). Overridable via `emailGuard.image`.
- **Scheduling:** reuses the chart's `nodeSelector`/`affinity`/`tolerations` passthrough (same as `migrate-job`) so it tolerates the GKE `arch=arm64:NoSchedule` taint. Hardened pod (`runAsNonRoot`, drop ALL caps, `readOnlyRootFilesystem`), tiny resources.
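
The presence check can be exercised locally, without a cluster, for the three states the Secret can be in (key absent, key present but empty, key present). This is a minimal sketch under assumptions: `email_guard_check` is a hypothetical helper name that the chart does not ship, and its argument stands in for the env var that `optional: true` would hand the container.

```sh
#!/bin/sh
# Hypothetical local harness: email_guard_check is an illustrative name only.
# It mirrors the guard Job's presence test on a value passed as $1.
email_guard_check() {
  # An absent Secret key (with optional: true) and an empty-string key both
  # reach the container as an empty value, so one -z test covers both.
  if [ -z "${1:-}" ]; then
    echo "FATAL: RESEND_API_KEY is empty or missing"
    return 1
  fi
  # Report only the length so the key itself never lands in logs.
  echo "RESEND_API_KEY present (len ${#1})"
}
```

Calling `email_guard_check ""` or `email_guard_check` with no argument returns non-zero, which is the condition that makes the Job (and therefore the sync) fail.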

### Gating: `emailGuard.enabled`

- New chart value `emailGuard.enabled`, **default `false`** — so local dev, AKS, and EKS overlays render nothing (the guard is GKE-prod-fleet hardening, and CI's eks/aks template runs stay clean).
- Flipped `true` for **preview + prod** via `argocd/root/templates/trakrf-backend.yaml` inlineValues (same mechanism as `mqttEnabled`).

## Rollout ordering (critical)

preview + prod currently hold an **empty** key, so an active guard would block their sync. Required order:

1. Merge infra#154 (omit-when-empty + `ignoreDifferences`).
2. Inject the real `re_...` key out-of-band into preview (then prod, on Mike's timing).
3. Merge **this** guard PR.

The guard PR is held-for-review, so it naturally lands last — no deadlock. On a brand-new env's very first sync the Secret/key must be pre-provisioned (consistent with the out-of-band injection model); documented, not solved here.
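
Before step 3, it can help to confirm that step 2 actually landed in the live Secret. The following is a sketch, not part of this PR. It assumes the Secret is named `trakrf-backend` (the real name is the chart's fullname and may differ per release), takes the namespace as a caller-supplied argument, and uses hypothetical function names.

```sh
#!/bin/sh
# Sketch of a pre-merge check for rollout step 2. Function names are
# illustrative; the Secret name and namespace are assumptions.

# Print the decoded byte length of a base64 value read from stdin.
decoded_len() {
  base64 -d | wc -c | tr -d ' '
}

# Succeed only if RESEND_API_KEY exists in the Secret and is non-empty.
# An absent key makes the jsonpath print nothing, which decodes to length 0.
key_is_injected() {
  ns="$1"
  len=$(kubectl get secret trakrf-backend -n "$ns" \
    -o jsonpath='{.data.RESEND_API_KEY}' | decoded_len)
  [ "$len" -gt 0 ]
}
```

Running `key_is_injected <namespace>` for each env before merging the guard PR gives the same yes/no answer the PreSync Job will give, without waiting for a blocked sync.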

## Out of scope

- Backend-side loud failure (boot guard / metric in `resend.go`) — platform repo, handed to platform if wanted.
- ESO + GCP Secret Manager (TRA-375).
- Key *validity* checking (would require a live send).

## Verification

- `helm template trakrf-backend` with `emailGuard.enabled=false` → no guard Job (default / eks / aks).
- `helm template trakrf-backend --set emailGuard.enabled=true` → guard Job present, hook annotations correct, `RESEND_API_KEY` `secretKeyRef` with `optional: true`.
- `helm template argocd/root` → preview + prod backend apps carry `emailGuard.enabled: true`; others absent/false.
- `helm lint` clean; CI eks/aks matrix green.
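
The rendered-manifest checks above can be scripted as small filters over `helm template` output. This is a sketch under assumptions: the function names are hypothetical, and the chart path `helm/trakrf-backend` in the usage comments is taken from this PR's file paths.

```sh
#!/bin/sh
# Illustrative filters over a rendered manifest on stdin; names are hypothetical.

# Succeeds if the manifest contains the guard Job's component label.
has_email_guard() {
  grep -q 'app.kubernetes.io/component: email-guard'
}

# Succeeds if some RESEND_API_KEY secretKeyRef is immediately followed by
# optional: true, matching the line order in email-guard-job.yaml.
guard_ref_is_optional() {
  grep -A1 'key: RESEND_API_KEY' | grep -q 'optional: true'
}

# Usage (requires helm, run from the repo root):
#   helm template helm/trakrf-backend --set emailGuard.enabled=true | has_email_guard
#   helm template helm/trakrf-backend | has_email_guard   # expected to fail: default off
```

Matching on the component label rather than the hook annotation is deliberate, since `migrate-job` carries the same `pre-install,pre-upgrade` hook and would make an annotation-based check pass even with the guard disabled.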
96 changes: 96 additions & 0 deletions helm/trakrf-backend/templates/email-guard-job.yaml
@@ -0,0 +1,96 @@
{{- if .Values.emailGuard.enabled }}
{{- /*
RESEND_API_KEY PreSync guard (TRA-972). Turns an empty/missing email key in a
deployed env from a silent best-effort warning (the backend logs Resend "API
key is invalid" but still returns 2xx, so the UI shows success and nothing is
delivered) into a LOUD sync failure surfaced in the ArgoCD UI before the broken
state goes live. Belt-and-suspenders to infra#154's omit-when-empty +
ignoreDifferences: this DETECTS the regression class (missed injection, deleted
key, botched cutover) that #154 only prevents one cause of.

pre-install,pre-upgrade => ArgoCD PreSync (same mapping as migrate-job). Runs at
weight -10, before migrate (-5), so a missing key fails fast before migrations
or pod rollout. Enabled only for deployed envs (preview/prod) via the root
chart; default off so local/eks/aks render nothing.

Checks PRESENCE only, not validity: a send-scoped re_... key can't be read-only
validated (Resend /domains => 401 restricted_api_key) and a hook must not send a
real email. optional:true on the secretKeyRef is load-bearing — #154 OMITS the
key from the Secret when empty, so a hard ref would die with an opaque
CreateContainerConfigError; optional yields an empty env the check catches with
a clear message (and also catches a present-but-empty-string key).
*/}}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "trakrf-backend.fullname" . }}-email-guard
  labels:
    {{- include "trakrf-backend.labels" . | nindent 4 }}
    app.kubernetes.io/component: email-guard
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "-10"
    helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 600
  template:
    metadata:
      labels:
        {{- include "trakrf-backend.labels" . | nindent 8 }}
        app.kubernetes.io/component: email-guard
    spec:
      restartPolicy: Never
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: email-guard
          image: {{ .Values.emailGuard.image | quote }}
          imagePullPolicy: IfNotPresent
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          env:
            - name: RESEND_API_KEY
              valueFrom:
                secretKeyRef:
                  name: {{ include "trakrf-backend.fullname" . }}
                  key: RESEND_API_KEY
                  optional: true
          command:
            - /bin/sh
            - -c
            - |
              if [ -z "$RESEND_API_KEY" ]; then
                echo "FATAL: RESEND_API_KEY is empty or missing in this deployed env."
                echo "Transactional email (invites, password resets) will silently fail."
                echo "Inject the key out-of-band (see TRA-972 / TRA-375) before syncing."
                exit 1
              fi
              echo "RESEND_API_KEY present (len ${#RESEND_API_KEY}). Email guard passed."
          resources:
            requests:
              cpu: 10m
              memory: 16Mi
            limits:
              cpu: 50m
              memory: 32Mi
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
{{- end }}
12 changes: 12 additions & 0 deletions helm/trakrf-backend/values.yaml
@@ -156,6 +156,18 @@ secrets:
resendApiKey: ""
sentryDsn: ""

# RESEND_API_KEY PreSync guard (TRA-972). When enabled, a pre-install/pre-upgrade
# hook Job fails the sync (loudly, in the ArgoCD UI) if RESEND_API_KEY is empty or
# missing in this env's Secret — so a missed out-of-band injection can't silently
# ship broken transactional email. Default off; the root chart flips it true for
# deployed envs (preview + prod) only, so local/eks/aks render nothing.
# Checks presence, not validity (see templates/email-guard-job.yaml). busybox is
# pinned + multi-arch because GKE nodes are arm64 and the backend image is
# shell-less.
emailGuard:
  enabled: false
  image: busybox:1.37.0

nodeSelector: {}
tolerations: []
affinity: {}