Skip to content

fix: carry PrepareQC lock across consecutive timeouts in voteTimeout#3029

Merged
wen-coding merged 6 commits intomainfrom
wen/load_prepareqc_from_timeoutqc
Mar 6, 2026
Merged

fix: carry PrepareQC lock across consecutive timeouts in voteTimeout#3029
wen-coding merged 6 commits intomainfrom
wen/load_prepareqc_from_timeoutqc

Conversation

@wen-coding
Copy link
Contributor

@wen-coding wen-coding commented Mar 5, 2026

Summary

  • Safety bug fix: voteTimeout only attached i.PrepareQC (current view) to TimeoutVotes. Since pushTimeoutQC clears i.PrepareQC on view change, consecutive timeouts (e.g. offline leader) would lose the PrepareQC lock. This produced a TimeoutQC with no latestPrepareQC, allowing a later leader to propose a conflicting block instead of being forced to repropose — breaking the safety invariant (two CommitQCs possible at the same RoadIndex).
  • Fix: If no PrepareQC was observed in the current view (!pqc.IsPresent()), voteTimeout now inherits the one from i.TimeoutQC.LatestPrepareQC(). The lock propagates through the TimeoutQC chain indefinitely until a CommitQC advances the RoadIndex.
  • No new state or messages: the fix reads an existing field (TimeoutQC.LatestPrepareQC()) that was already persisted; no schema or protocol changes needed.

Attack scenario (before fix)

  1. View (M, 0): PrepareQC forms for P. CommitVotes sent. Timer fires → TimeoutQC(M,0) carries the PrepareQC.
  2. View (M, 1): pushTimeoutQC clears i.PrepareQC. Leader is offline. Timer fires → TimeoutVotes carry no PrepareQC. TimeoutQC(M,1) has latestPrepareQC = None.
  3. View (M, 2): No reproposal forced. Leader proposes conflicting P'. CommitQC for P' forms. Old CommitVotes from step 1 also form CommitQC for P. Two CommitQCs at RoadIndex M.

Why fix #1 alone is sufficient (no lock check in pushProposal needed)

With the fix, every honest validator's TimeoutVote carries the inherited PrepareQC. Any quorum of 2f+1 timeout votes includes ≥ f+1 honest validators carrying it, so NewTimeoutQC always selects it. FullProposal.Verify already enforces that if TimeoutQC.reproposal() returns a proposal, the leader must repropose it — so conflicting proposals are rejected by all honest validators.

Test plan

Tests exercise the real State.pushTimeoutQC, State.pushPrepareQC, and State.voteTimeout methods via scope.Run + s.Run(ctx) (no mirrored production logic):

  • TestVoteTimeoutPrepareQC_BothNone — both None → None
  • TestVoteTimeoutPrepareQC_OnlyCurrentView — current view PrepareQC, no TimeoutQC → uses current
  • TestVoteTimeoutPrepareQC_InheritedFromTimeoutQC — core safety test: None + inherited → uses inherited, chains through 3 consecutive timeouts
  • TestVoteTimeoutPrepareQC_CurrentViewHigherThanInherited — reproposal succeeds (both present, current higher) → uses current
  • TestVoteTimeoutPrepareQC_CurrentViewPresentInheritedNone — current present, TimeoutQC has no PrepareQC → uses current
  • TestVoteTimeoutPrepareQC_PersistedRestart — restart regression: persisted TimeoutQC survives restart, voteTimeout inherits PrepareQC from it
  • TestNewTimeoutQC_MixedPrepareQCs — quorum-intersection: single vote carries PrepareQC among None votes → selected and Verify passes
  • TestNewTimeoutQC_AllNone — all votes carry None → LatestPrepareQC is None, Verify passes
  • TestTimeoutQCVerify_HighestPrepareQCSelected — votes carry PrepareQCs at different views → highest selected, Verify passes
  • Verified test fails when fix is reverted (InheritedFromTimeoutQC catches the bug)
  • All existing consensus and types tests pass

@github-actions
Copy link

github-actions bot commented Mar 5, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMar 6, 2026, 6:28 PM

@wen-coding wen-coding requested a review from pompon0 March 6, 2026 00:13
@codecov
Copy link

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.27%. Comparing base (40146a1) to head (e4dda3a).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3029      +/-   ##
==========================================
- Coverage   58.32%   58.27%   -0.06%     
==========================================
  Files        2079     2077       -2     
  Lines      171702   171284     -418     
==========================================
- Hits       100144    99812     -332     
+ Misses      62631    62579      -52     
+ Partials     8927     8893      -34     
Flag Coverage Δ
sei-chain-pr 78.89% <100.00%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ei-tendermint/internal/autobahn/consensus/inner.go 64.44% <100.00%> (+1.22%) ⬆️

... and 28 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

return types.NewTimeoutQC(votes)
}

// simulatePushTimeoutQC mirrors the production pushTimeoutQC: keeps CommitQC
Copy link
Contributor

@pompon0 pompon0 Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you simulate it on State rather than inner, so that we don't have to "mirror" anything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@wen-coding wen-coding requested a review from stevenlanders March 6, 2026 14:57
// startRunOutputs launches runOutputs in the background and returns
// a context that is cancelled when the test finishes.
func startRunOutputs(t *testing.T, s *State) context.Context {
ctx, cancel := context.WithCancel(t.Context())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't make sense. t.Context() is always canceled when the test finishes.
Also, why not s.Run()? The fact that s.runOutputs is suffcient rn is an implementation detail that the test should not need to rely on.
Also the s.Run() output should be checked for (non-cancellation) error, so I'd recommend scope.Run(func(){ s.SpawnBg(s.Run()) })

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

voteTimeout only attached i.PrepareQC (current view) to TimeoutVotes.
Since pushTimeoutQC clears i.PrepareQC on view change, consecutive
timeouts (e.g. offline leader) would lose the PrepareQC lock, producing
a TimeoutQC with no latestPrepareQC. This allowed a later leader to
propose a conflicting block instead of being forced to repropose,
breaking the safety invariant (two CommitQCs at the same RoadIndex).

Fix: voteTimeout now uses the higher of i.PrepareQC and the PrepareQC
inherited from i.TimeoutQC.LatestPrepareQC(). This ensures the lock
propagates through the TimeoutQC chain indefinitely until a CommitQC
advances the RoadIndex.

Made-with: Cursor
i.PrepareQC is always from the current view (strictly higher than any
inherited PrepareQC), so a simple presence check is equivalent to the
view comparison. Suggested in review.

Made-with: Cursor
Tests now call State.pushTimeoutQC / State.pushPrepareQC / State.voteTimeout
directly instead of mirroring production logic in helper functions.

Made-with: Cursor
Replace the ad-hoc startRunOutputs helper (which redundantly wrapped
t.Context() and exposed the runOutputs implementation detail) with the
established scope.Run + SpawnBg + IgnoreCancel pattern. Assertions
inside the closure use error returns instead of require (goroutine
safety).

Made-with: Cursor
@wen-coding wen-coding force-pushed the wen/load_prepareqc_from_timeoutqc branch from b97fff3 to 06c58ae Compare March 6, 2026 16:50
@wen-coding wen-coding requested a review from masih March 6, 2026 16:50
Copy link
Collaborator

@masih masih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for getting this sorted so swiftly 🙌

return nil
})
require.NoError(t, err)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker: It would be great to also add a test cases for

  • the exact quorum-intersection behavior in NewTimeoutQC plus TimeoutQC.Verify.
  • the restart regression test with persisted TimeoutQC just to verify voteTimeout is carried over to prepare.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- types: test NewTimeoutQC with mixed PrepareQCs (quorum-intersection),
  all-None, and highest-selected scenarios, each verified with Verify.
- consensus: test that voteTimeout inherits PrepareQC from a persisted
  TimeoutQC after restart (two-session scope.Run).

Made-with: Cursor
@wen-coding wen-coding merged commit f5844b5 into main Mar 6, 2026
41 checks passed
@wen-coding wen-coding deleted the wen/load_prepareqc_from_timeoutqc branch March 6, 2026 18:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants