build/20260608 195531/0040 transient gate failure resilience #124

Merged
cahenesy merged 26 commits into master from build/20260608-195531/0040-transient-gate-failure-resilience
Jun 10, 2026

cahenesy (Owner) commented Jun 9, 2026

  • test(failing): ci-checks retry-once recovers a flake + knob/non-numeric guards (TDD 0040 §1, FR-15/NFR-4)
  • step(1): ci-checks retry-once in run_ci_checks (TDD 0040 Component 1, FR-15/NFR-4)
  • fix(step 1): add result-tally/exit-code epilogue so eval assertions are enforceable
  • test(failing): retries-exhausted FAIL must log an explicit FAILED-after-N line (TDD 0040 §1, NFR-4)
  • step(1): log an explicit FAILED-after-N line on ci-checks retry exhaustion (TDD 0040 §1, NFR-4)
  • revert(step 1): back out ci-checks retry impl to re-derive it test-first
  • step(1): ci-checks retry-once re-derived from the now-genuinely-failing eval (TDD 0040 Component 1, FR-15/NFR-4)
  • test(failing): gate-unobservable enum membership + resume-first action + status render (TDD 0040 §6, Component 3)
  • step(2): add gate-unobservable to the closed halt-cause enum + status render mirror (TDD 0040 Component 3, FR-57/NFR-4)
  • test(failing): review/verify no-verdict → gate-unobservable; observed BLOCK untouched (TDD 0040 §3-§5, Component 2)
  • step(3): no-verdict review subprocess → resumable gate-unobservable (TDD 0040 Component 2, FR-57/NFR-4/ADR 0006)
  • fix(step 3): correct comment-vs-code — verify-runtime call site is NOT rewired
  • step(4): finalize the transient-gate-resilience eval — TDD Failure-modes coverage (TDD 0040 §4)
  • fix(step 4): guard tgr_build_output + git rev-parse call sites (FR-74 #1 fail-loud)
  • test(failing): aggregator must register the transient-gate-resilience eval (TDD 0040 §5, TDD 0038 §3 wire-in rule)
  • step(5): wire transient-gate-resilience eval into the implement-gate aggregator (TDD 0040 §5)
  • chore: give §E3 captured gate_one output diagnostic use (clear SC2034)
  • test(failing): §7 set_halt_cause write-failure → _classify_gate_no_verdict fails loud (TDD 0040 §7)
  • fix: fail-loud on set_halt_cause write failure in _classify_gate_no_verdict (TDD 0040 §7)
  • mark 0040-transient-gate-failure-resilience implemented (verified + reviewed)

cahenesy and others added 26 commits June 9, 2026 09:02
… FR-15/NFR-4)

On a ci-checks failure, re-run up to THROUGHLINE_CI_CHECKS_RETRIES (default 1)
more times in the same worktree; the first passing run wins and a recovered
flake is logged (not silent). RETRIES=0 restores no-retry; a non-numeric value
falls back to the default and warns. Signature unchanged so the gate_one call
site is untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…re enforceable

The test file exited 0 unconditionally (no epilogue reading RESULTS), making
every §1-§2 assertion non-enforceable. Add the standard PASS/FAIL tally + final
[ "$FAIL" -eq 0 ] so a single failing assertion exits the script non-zero
(FR-74 #1 fail-loud; lets the step-5 aggregator wire-in catch regressions).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
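
A minimal sketch of the tally-plus-exit-code epilogue described in the commit above. Only `bad`, `FAIL`, and the final `[ "$FAIL" -eq 0 ]` test come from the commit messages; `ok`, `PASS`, and `summarize` are hypothetical names, and the real eval file may record results differently (for example in a RESULTS file).

```shell
# Hypothetical counters and helpers; only bad() and FAIL are named in the source.
PASS=0
FAIL=0
ok()  { PASS=$((PASS + 1)); echo "PASS: $1"; }
bad() { FAIL=$((FAIL + 1)); echo "FAIL: $1"; }

# Epilogue: print the tally, then let the final test decide the exit status.
# Wrapped in a function here so the sketch can be sourced; in a test script the
# last line would simply be: [ "$FAIL" -eq 0 ]
summarize() {
  echo "results: $PASS passed, $FAIL failed"
  [ "$FAIL" -eq 0 ]
}
```

Without an epilogue of this kind the script's exit status is that of its last command, which is how a file full of failing assertions can still exit 0.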
…er-N line (TDD 0040 §1, NFR-4)

Genuine red→green for step 1: with the enforcement epilogue now present, this
assertion exits the suite non-zero against the current implementation (which
returns silently on exhaustion). Re-establishes the failing-test-first
discipline the prior vacuous test(failing) commit lacked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…stion (TDD 0040 §1, NFR-4)

A retries-exhausted real FAIL now records a telemetry line as visible as the
recovered-flake line, so the gate log distinguishes a retries-exhausted FAIL
from a single-shot one. Completes the genuine red→green for step 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The initial test(failing) commit (0cd8e0d) was vacuous — it lacked the
enforcement epilogue, so it exited 0 against pre-impl code and never drove the
core retry behavior with a genuine red. History cannot be rewritten (divergence
guard), so re-derive honestly: with the epilogue now present, backing out the
implementation makes the §1 core-retry assertions genuinely RED (suite exits
non-zero), and the next commit re-implements to green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ng eval (TDD 0040 Component 1, FR-15/NFR-4)

Re-implements run_ci_checks retry-once after the test-first backout (0f210a7):
the §1 core-retry assertions were genuinely RED against the backed-out code and
this commit greens them. On a ci-checks failure, re-run up to
THROUGHLINE_CI_CHECKS_RETRIES (default 1) more times; first pass wins, a
recovered flake and a retries-exhausted FAIL are both logged explicitly (NFR-4).
RETRIES=0 disables retry; a non-numeric value falls back to the default and
warns. Signature unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
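
The retry behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `run_ci_checks`: `THROUGHLINE_CI_CHECKS_RETRIES` and its default of 1 come from the commit message, while `run_ci_checks_sketch`, `ci_checks_once`, `CI_CHECKS_CMD`, and the exact log wording are hypothetical.

```shell
# Stand-in for a single ci-checks run (assumption: a command name in CI_CHECKS_CMD).
ci_checks_once() { "${CI_CHECKS_CMD:-true}"; }

run_ci_checks_sketch() {
  local retries="${THROUGHLINE_CI_CHECKS_RETRIES:-1}"
  local attempt=0

  # Non-numeric knob: fall back to the default and warn rather than abort.
  case "$retries" in
    *[!0-9]*)
      echo "WARN: THROUGHLINE_CI_CHECKS_RETRIES='$retries' is not numeric; using 1" >&2
      retries=1
      ;;
  esac

  while :; do
    if ci_checks_once; then
      # A pass on a retry is a recovered flake; log it so it is not silent.
      if [ "$attempt" -gt 0 ]; then
        echo "ci-checks: recovered flake on retry $attempt" >&2
      fi
      return 0
    fi
    # Bounded: stop once the retry budget is spent (not retry-until-green).
    if [ "$attempt" -ge "$retries" ]; then
      break
    fi
    attempt=$((attempt + 1))
  done

  # Distinguish a retries-exhausted FAIL from a single-shot FAIL in the log.
  if [ "$retries" -gt 0 ]; then
    echo "ci-checks: FAILED after $retries retries" >&2
  fi
  return 1
}
```

With the default of 1, a check that fails once and then passes returns 0 after two runs; with `THROUGHLINE_CI_CHECKS_RETRIES=0` the first failure is final.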
…n + status render (TDD 0040 §6, Component 3)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… render mirror (TDD 0040 Component 3, FR-57/NFR-4)

state.sh _next_actions_for_cause gains a gate-unobservable arm with a
resume-first action list (no revision precondition — a no-verdict gate is safe
to re-run), admitting it to the closed FR-63 enum so set_halt_cause accepts it
and the blocked fragment is auto-resumable via _resume_from's blocked arm.
status.sh _halt_cause_known gains the mirror so it renders without an
unknown-cause warning. No schema change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
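
As a rough illustration of how a `case` arm can double as membership in a closed enum, consider the sketch below. The `_sketch` function names, the `review-blocked` cause, the action strings, and the `HALT_CAUSE` variable are all hypothetical; only `gate-unobservable` and the resume-first idea come from the commit message, and the real `state.sh` may be structured differently.

```shell
# Each known cause has an arm; anything else is rejected by the default arm.
_next_actions_for_cause_sketch() {
  case "$1" in
    gate-unobservable) echo "resume" ;;              # resume-first, no revision precondition
    review-blocked)    echo "revise then resume" ;;  # hypothetical pre-existing cause
    *)                 return 1 ;;                   # not in the closed enum
  esac
}

# Accept a cause only if the action table knows it (closed-enum check).
set_halt_cause_sketch() {
  if ! _next_actions_for_cause_sketch "$1" >/dev/null; then
    echo "unknown halt cause: $1" >&2
    return 1
  fi
  HALT_CAUSE="$1"
}
```

Deriving acceptance from the action table means adding one arm both admits the new cause and gives it a next action.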
… BLOCK untouched (TDD 0040 §3-§5, Component 2)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…TDD 0040 Component 2, FR-57/NFR-4/ADR 0006)

Add the gate-agnostic _classify_gate_no_verdict helper (set_halt_cause
gate-unobservable + _terminal_state blocked, in that order) and _gate_output_tail
(stderr/output-tail detail), and rewire both no-verdict paths in _rework_loop
(the rc!=0-no-fresh-verdict path and the neither-PASS-nor-BLOCK crash guard) from
the old terminal 'failed' to the resumable gate-unobservable halt. The
discriminator is verdict-presence, never exit code — an observed BLOCK/PASS is
untouched. The verify-runtime call site (gate_one in lib/resume.sh, outside this
TDD's declared ## Touched files) reuses the same gate-agnostic classifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
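
The verdict-presence discriminator can be sketched as below. This is a hedged illustration, not the real `_classify_gate_no_verdict`: the `VERDICT: PASS` / `VERDICT: BLOCK` line format and the function name `classify_gate_output_sketch` are assumptions made for the example.

```shell
# Classify a gate run by whether a well-formed verdict line is present.
# The exit code (rc) is reported for diagnostics but never decides the outcome.
classify_gate_output_sketch() {
  local rc="$1" output="$2" verdict

  # Take the last line that looks like a verdict (assumed format).
  verdict="$(printf '%s\n' "$output" | grep -E '^VERDICT: ' | tail -n 1)"

  case "$verdict" in
    'VERDICT: PASS')  echo "pass" ;;
    'VERDICT: BLOCK') echo "block" ;;
    *)
      # Missing, truncated, or malformed verdict: could not observe, never guess.
      echo "no verdict observed (rc=$rc)" >&2
      echo "gate-unobservable"
      ;;
  esac
}
```

Under this scheme a non-zero exit with an observed `VERDICT: BLOCK` still classifies as a block, while a zero exit with no verdict classifies as `gate-unobservable`.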
…T rewired

The _classify_gate_no_verdict docstring and the §4/header eval comments claimed
(present tense) the verify-runtime call site reuses the classifier. It does not:
resume.sh is outside this TDD's ## Touched files and its no-verdict path still
records terminal 'failed'. Restate accurately: the helper is gate-AGNOSTIC and
ready, but only the review gate drives it in this TDD; wiring the verify-runtime
call site is a follow-up within resume.sh's scope (ADR 0006 honest comments).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…des coverage (TDD 0040 §4)

Add the remaining ## Failure modes & edge cases assertions over behavior already
delivered in steps 1-3 (no production-code change): double-flake bounded to FAIL
(retry-once is not retry-until-green), RETRIES=2 raises the bound, and a
malformed/truncated verdict resolves to gate-unobservable (NFR-4: ambiguity is
couldn't-observe, never a guessed verdict). Pure test-coverage hardening — the
production code is unchanged, so this step is legitimately no-new-behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
 fail-loud)

A setup failure in tgr_build_output (or the preceding git rev-parse) would
silently drop the downstream assertion with no bad() record. Guard all three
call sites (§3, §5, §E) with || { bad ...; exit 0; } so a fixture failure is
surfaced loud, not swallowed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
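
A sketch of the `|| { bad ...; exit 0; }` guard, assuming each eval section runs in a subshell and records results to a shared file. `build_fixture`, `run_section`, `ok`, and the `RESULTS` file layout are hypothetical stand-ins; only `bad` and the guard shape are taken from the commit message.

```shell
RESULTS="$(mktemp)"
ok()  { echo "PASS $1" >>"$RESULTS"; }
bad() { echo "FAIL $1" >>"$RESULTS"; }

# Hypothetical fixture builder: fails (prints nothing) when asked for "broken".
build_fixture() { [ "$1" != "broken" ] && echo "fixture:$1"; }

# Each section runs in a subshell, so `exit 0` ends only that section.
run_section() (
  # Guard: a setup failure is recorded via bad() instead of silently skipping
  # the assertion below.
  out="$(build_fixture "$1")" || { bad "$1: fixture setup failed"; exit 0; }

  if [ "$out" = "fixture:$1" ]; then
    ok "$1"
  else
    bad "$1: unexpected fixture output"
  fi
)
```

The `exit 0` keeps the section itself from aborting the run; the failure still reaches the epilogue because it was written to the results record first.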
… eval (TDD 0040 §5, TDD 0038 §3 wire-in rule)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aggregator (TDD 0040 §5)

Register the eval (run it) and add [ "$TGR_FAIL" -eq 0 ] to the final
AND-chain so ci-checks regression-gates Components 1-3. New gating behavior
driven red→green by the eval's §W dogfood (05ae5f0) per the TDD 0038 §3 wire-in
rule.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
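
The wire-in has two halves: run the eval, and add its failure flag to the final AND-chain. A minimal sketch, assuming each eval is a command whose exit status signals pass or fail; `eval_fail_flag`, `aggregate_sketch`, and `OTHER_FAIL` are hypothetical, and only `TGR_FAIL` and the `[ "$TGR_FAIL" -eq 0 ]` test come from the commit message.

```shell
# Run one eval command and print 0 if it passed, 1 if it failed.
eval_fail_flag() {
  if "$@" >/dev/null 2>&1; then echo 0; else echo 1; fi
}

# Aggregate: every registered eval must pass for the whole gate to pass.
aggregate_sketch() {
  local OTHER_FAIL TGR_FAIL
  OTHER_FAIL="$(eval_fail_flag "$1")"   # stand-in for the existing evals
  TGR_FAIL="$(eval_fail_flag "$2")"     # the transient-gate-resilience eval

  # Final AND-chain: omitting a flag here would run the eval without gating on it.
  [ "$OTHER_FAIL" -eq 0 ] && [ "$TGR_FAIL" -eq 0 ]
}
```

For example, `aggregate_sketch true false` returns non-zero because the second eval failed, even though the first passed.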
The §E3 'st' capture of gate_one's output was unused (SC2034). Reference it in
the failure diagnostic. The remaining TDDS=() (consumed by the sourced
implement.sh via dynamic scope, required by the SOURCE_ONLY guard) and TGR_FAIL=1
(consumed inside eval "$chain" in the §W dogfood, mirroring TFP §8) are
necessary idioms shellcheck-static cannot see — present in every sibling eval.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cahenesy merged commit 370a933 into master Jun 10, 2026
1 check passed
cahenesy deleted the build/20260608-195531/0040-transient-gate-failure-resilience branch June 10, 2026 12:18