build/20260608 195531/0038 mechanical per step test first enforcement #117
Merged
cahenesy merged 19 commits on Jun 9, 2026
cahenesy (Owner) commented on Jun 9, 2026
- test(failing): per-step mechanical test-first pre-check (TDD 0038 / FR-15a)
- step(1): mechanical per-step test-first pre-check (TDD 0038 / FR-15a)
- test(failing): build-prompt self-gate + aggregator wire-in rule (TDD 0038 §2/§3)
- step(2): preventive self-gate + aggregator wire-in rule in build-prompt (TDD 0038 / FR-15a)
- test(failing): four per-step-loop fixtures opt out of default-on gate (TDD 0038 §4)
- step(3): reconcile four per-step-loop fixtures for default-on enforcement (TDD 0038 §4 / FR-15a)
- test(failing): aggregator wire-in propagates the new eval's failure (TDD 0038 §3 dogfood)
- step(4): wire test-first-per-step eval into the CI aggregator (TDD 0038 §3 / FR-15a)
- docs(0038): note per-step test-first enforcement in README + bump plugin 3.20.0 -> 3.21.0
- fix(0038): commit the streaming interval before the deterministic test-first BLOCK (TDD 0038 §1 / FR-15a)
- rework: extract sentinel line before TEST_FIRST_SKIPPED: check to prevent prose match
- rework: add §9 prose-mention test for the anchored skip-token (review:1 blocker)
- rework: use tail -1 for _tf_sentinel to match step_id/sha extraction
- rework: anchor _tf_sentinel grep to ^STEP_COMMIT: so prose lines after the sentinel cannot set tf_skip=1
- test(failing): anchored skip-token residual — same-line + multi-sentinel bypass
- fix: anchor skip-token to the last STEP_COMMIT sentinel match (gates.sh:979)
- mark 0038-mechanical-per-step-test-first-enforcement implemented (verified + reviewed)
…R-15a) Drives _per_step_review_loop with impl-first / test(failing)-precursor / TEST_FIRST_SKIPPED / extended-sentinel / knob-off cases. §1 (deterministic test-first BLOCK with no model spawn) and §3 (skip telemetry) are RED — the pre-check does not exist yet.
Component 1: factor _test_first_ok_range from test_first_ok (shared git-history + skip predicate, ADR 0006) and wire a deterministic per-step pre-check into _per_step_review_loop's STEP_COMMIT branch. An impl-first step (no test(failing): precursor in last-cleared..sha, no per-step TEST_FIRST_SKIPPED: token) gets a fixed STEP_REVIEW: BLOCK with NO model review spawned; pass-through otherwise. Honors THROUGHLINE_REQUIRE_TEST_FIRST. Rides the existing per-step BLOCK path (ADR 0007; no new halt type). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
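The pre-check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in gates.sh: the function name `test_first_ok_subjects` and its stdin interface are hypothetical, and it assumes that any `test(failing):` subject in the `last-cleared..sha` range counts as the precursor. Only `THROUGHLINE_REQUIRE_TEST_FIRST` and the `test(failing):` prefix come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the per-step test-first predicate.
# Reads commit subjects on stdin, one per line, as they might come from:
#   git log --reverse --format=%s "$last_cleared..$sha"
# $1 is 1 when the step carried a per-step skip token, else 0 (assumed interface).
# Returns 0 when the step may pass through, 1 when it should get the fixed BLOCK.
test_first_ok_subjects() {
  local skip_token="${1:-0}" subject

  # Knob off: enforcement is disabled, so every step passes through.
  if [ "${THROUGHLINE_REQUIRE_TEST_FIRST:-1}" = "0" ]; then
    cat >/dev/null
    return 0
  fi

  # A per-step skip token waives the precursor requirement.
  if [ "$skip_token" = "1" ]; then
    cat >/dev/null
    return 0
  fi

  # Otherwise the range must contain a test(failing): commit.
  while IFS= read -r subject; do
    case "$subject" in
      'test(failing):'*) return 0 ;;
    esac
  done
  return 1
}
```

A caller would emit the deterministic `STEP_REVIEW: BLOCK` and skip the model review whenever this returns non-zero, for example `printf '%s\n' 'step(1): impl only' | test_first_ok_subjects 0 || echo 'STEP_REVIEW: BLOCK'`.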
…0038 §2/§3) §6 greps build-prompt.md for the preventive self-gate bullet and the aggregator wire-in rule; both absent (RED).
…pt (TDD 0038 / FR-15a) Component 2: a self-verification bullet in the STEP_COMMIT handshake — the build checks test-first ordering (and the optional STEP_COMMIT TEST_FIRST_SKIPPED: token) before emitting the sentinel, turning reactive catch-and-revert into prevention at the source; the §1 mechanical pre-check is the backstop. Component 3: an aggregator wire-in rule under FAILING TEST FIRST — wiring a new eval into tests/implement-gate.test.sh is new gating behavior requiring a failing wire-in test, not TEST_FIRST: SKIPPED-eligible. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… (TDD 0038 §4) §7 greps each of continuous-in-build-review / build-defensive-norms / step-commit-protocol / coproc-verdict-resilience for the THROUGHLINE_REQUIRE_TEST_FIRST=0 export; absent (RED).
…ment (TDD 0038 §4 / FR-15a) Component 4: each fixture drives _per_step_review_loop with impl-only step(N): commits to exercise coproc/handshake/protocol/review mechanics, for which test-first ordering is orthogonal. Under default-on per-step enforcement (§1) those would hit the deterministic BLOCK before the path under test. Export THROUGHLINE_REQUIRE_TEST_FIRST=0 once at file scope (with a WHY comment) so the orthogonal gate is disabled; the dedicated eval is the sole knob-ON exerciser. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…TDD 0038 §3 dogfood) §8 drives the real implement-gate.test.sh final AND-chain with the new eval (TFP_FAIL) stubbed to fail and asserts the overall exit goes non-zero; also asserts the eval is registered. Both RED — the eval is not yet wired in.
…38 §3 / FR-15a) Register tests/test-first-per-step.test.sh as a sub-eval and add its TFP_FAIL term to the final AND-chain, so the per-step pre-check, the build-prompt edits, and the four-fixture non-regression are gated by ci-checks. Per the §3 wire-in rule this is new gating behavior; the §8 dogfood (committed test(failing):) drove the AND-chain term red→green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
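The wire-in and its §8 dogfood can be pictured with a small stand-alone sketch. It is an assumption-laden illustration, not the contents of tests/implement-gate.test.sh: `eval_status`, `final_and_chain`, and `OTHER_FAIL` are hypothetical names, and only `TFP_FAIL` and the idea of a final AND-chain come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of an aggregator whose final AND-chain includes TFP_FAIL.

# Run one sub-eval command; print 0 on pass, 1 on failure.
eval_status() {
  if "$@" >/dev/null 2>&1; then echo 0; else echo 1; fi
}

# Final AND-chain: succeeds only when every registered term is 0.
final_and_chain() {
  local term
  for term in "$@"; do
    [ "$term" -eq 0 ] || return 1
  done
  return 0
}

OTHER_FAIL=$(eval_status true)    # stand-in for an already-registered sub-eval
TFP_FAIL=$(eval_status false)     # the new eval, stubbed to fail as in the dogfood

if final_and_chain "$OTHER_FAIL" "$TFP_FAIL"; then
  echo "gate: pass"
else
  echo "gate: fail"
fi
```

If the `TFP_FAIL` term were left out of the chain, the stubbed failure would not change the overall result, which is the gap the failing wire-in test is meant to expose.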
…gin 3.20.0 -> 3.21.0 README gate-1 description now states failing-test-first is enforced per step (deterministic STEP_REVIEW: BLOCK before any model review) as well as whole-build, and lists the new test-first-per-step eval. Version bump for the functional gates.sh + build-prompt.md change (TDD 0038 / FR-15a). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t-first BLOCK (TDD 0038 §1 / FR-15a) Self-review finding: the deterministic per-step test-first BLOCK reset interval_start without committing the streaming interval to build_active_seconds (it mirrored the protocol-error correction path). But the protocol-error path is bounded by a 2-attempt COUNT budget, whereas the test-first BLOCK has none, so a build looping on impl-first re-emits would discard every interval and evade the overall_active watchdog indefinitely. Align with the review-verdict path (a deterministic BLOCK substitutes for the review): commit the interval first so repeated BLOCKs accumulate active seconds and the watchdog bounds the loop. The observable surface (BLOCK written, no model review spawned) is unchanged and remains covered by tests/test-first-per-step.test.sh §1; this is an internal active-time accounting fix (no flaky timing test added, per the eval's own L-001/L-002 robustness guidance). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
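The accounting difference this fix describes can be shown with a toy model. The variable names `build_active_seconds` and `interval_start` come from the commit message; the functions `commit_interval`, `reset_interval_only`, and `watchdog_tripped`, and the explicit timestamp argument, are hypothetical and exist only to make the sketch deterministic.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of active-time accounting around a deterministic BLOCK.
build_active_seconds=0   # active time committed so far
interval_start=0         # start of the current streaming interval

# Fixed behaviour: commit the open interval, then start a new one.
commit_interval() {
  local now=$1
  build_active_seconds=$((build_active_seconds + now - interval_start))
  interval_start=$now
}

# Pre-fix behaviour: restart the interval and discard the elapsed time.
reset_interval_only() {
  interval_start=$1
}

# Watchdog check: succeeds once committed active time reaches the budget ($1).
watchdog_tripped() {
  [ "$build_active_seconds" -ge "$1" ]
}
```

With three BLOCKs at t=30, 60, and 90, `commit_interval` accumulates 90 active seconds and a 60-second budget trips, whereas `reset_interval_only` leaves the total at 0 no matter how many times the build re-emits.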
…31/0038-mechanical-per-step-test-first-enforcement
…:1 blocker) Verifies a TEST_FIRST_SKIPPED: mention in prose on a NON-sentinel line of a multi-line assistant event does NOT bypass the deterministic test-first BLOCK (the anchored grep -m1 'STEP_COMMIT:' extraction reads the token off the sentinel line only). Negative-control confirmed: fails against the unanchored predecessor, passes against the fix. Eval now 23/0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…31/0038-mechanical-per-step-test-first-enforcement
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r the sentinel cannot set tf_skip=1
…nel bypass §9 (strengthened from the vacuous different-line case) and new §10 drive the two residual bypasses the post-rework grep '^STEP_COMMIT:' impl still has: a TEST_FIRST_SKIPPED: token in trailing same-line prose after the sha, and a stale prior STEP_COMMIT line's token leaking onto a later token-less sentinel (the grep-all-lines vs step_id/sha tail -1 divergence the review flagged). Both must still produce the deterministic test-first BLOCK. Fails against the current code (4 red); the gates.sh fix follows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sh:979) Replace the divergent `grep '^STEP_COMMIT:'` (which returned ALL matching lines and matched a TEST_FIRST_SKIPPED: token anywhere on them) with the TDD-specified extraction: the same `grep -aoE 'STEP_COMMIT:…' | tail -1` the step_id/sha parse uses, EXTENDED with an optional trailing `( …TEST_FIRST_SKIPPED:…)?` group. The token is now honored only when it immediately follows the sha on the LAST sentinel — consistent with step_id/sha (tail -1), closing the same-line-prose and stale-prior-sentinel bypasses (review:1 majors). Covered by §9 (same-line) and §10 (multi-sentinel) added test-first in the preceding commit; eval 25/0; the four reconciled per-step-loop fixtures still pass; shellcheck clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
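The extraction this commit describes can be approximated as below. The real pattern is elided in the commit message, so the sentinel layout used here (`STEP_COMMIT: <step_id> <sha>` with an optional ` TEST_FIRST_SKIPPED: <reason>` directly after the sha) is an assumption, and `extract_last_sentinel` and `sentinel_has_skip` are hypothetical helper names. What the sketch keeps from the source is the shape: `grep -aoE … | tail -1` with an optional trailing group.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: honor the skip token only when it immediately follows
# the sha on the LAST STEP_COMMIT sentinel in the event text (read from stdin).

# Print the last sentinel match, including the optional skip-token group.
# The sentinel layout in this regex is assumed, not taken from gates.sh.
extract_last_sentinel() {
  grep -aoE 'STEP_COMMIT: [0-9]+ [0-9a-f]{7,40}( TEST_FIRST_SKIPPED:.*)?' | tail -1
}

# Succeed when the extracted sentinel carries the skip token.
sentinel_has_skip() {
  case "$1" in
    *' TEST_FIRST_SKIPPED:'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Because `grep -o` prints only the matched text, prose between the sha and a later `TEST_FIRST_SKIPPED:` mention falls outside the match, and `tail -1` discards tokens attached to earlier sentinels. For instance, `printf '%s\n' 'STEP_COMMIT: 1 abc1234 TEST_FIRST_SKIPPED: docs' 'STEP_COMMIT: 2 def5678' | extract_last_sentinel` yields only the second, token-less sentinel.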
…ified + reviewed)