Raft Cluster: log compaction and InstallSnapshot #4008
Add snapshot_term field to persistent raft state. It tracks the term of the entry at last_applied, persisted as snapshotTerm in the nodes.conf vars line. This will be used by raftLogTermAt to verify the leader's prev_log_term at the log compaction boundary.
Extract the apply loop (last_applied → commit_index) into a single raftApplyCommitted() function, replacing three identical inline loops.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
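A minimal sketch of what a single apply loop could look like. The struct layout, the callback interface, and the names `raft_state`, `apply_fn`, `raft_apply_committed`, and `term_from_table` are hypothetical stand-ins for illustration, not the code in this PR.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified raft state; the real struct has more fields. */
typedef struct {
    uint64_t commit_index;  /* highest index known to be committed */
    uint64_t last_applied;  /* highest index applied to the state machine */
    uint64_t snapshot_term; /* term of the entry at last_applied */
} raft_state;

/* Assumed interface: applies the entry at `index` and returns its term. */
typedef uint64_t (*apply_fn)(uint64_t index, void *ctx);

/* Apply every committed but not yet applied entry, in order.
 * Returns the number of entries applied. */
static uint64_t raft_apply_committed(raft_state *rs, apply_fn apply, void *ctx) {
    assert(rs->last_applied <= rs->commit_index);
    uint64_t applied = 0;
    while (rs->last_applied < rs->commit_index) {
        uint64_t index = rs->last_applied + 1;
        /* Keep snapshot_term in step with last_applied. */
        rs->snapshot_term = apply(index, ctx);
        rs->last_applied = index;
        applied++;
    }
    return applied;
}

/* Example callback: looks up the term in a table indexed by log index. */
static uint64_t term_from_table(uint64_t index, void *ctx) {
    const uint64_t *terms = ctx;
    return terms[index];
}
```

Updating `snapshot_term` inside the loop is one way to keep it consistent with `last_applied`; the actual patch may update it elsewhere.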
When the entry at prev_log_index is not in the in-memory log (already applied and compacted), return snapshot_term if the index matches last_applied. This allows the AE consistency check to pass after a restart without probing all the way back to index 0.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
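The lookup described here can be sketched as follows. The `raft_log_view` struct and the function name `raft_log_term_at` are hypothetical, and treating index 0 as term 0 is the usual Raft convention, assumed here rather than taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read-only view of the in-memory log after compaction. */
typedef struct {
    uint64_t first_index;   /* log index of terms[0] */
    size_t len;             /* number of in-memory entries */
    const uint64_t *terms;  /* term of each in-memory entry */
    uint64_t last_applied;  /* compaction boundary */
    uint64_t snapshot_term; /* term of the entry at last_applied */
} raft_log_view;

/* Stores the term at `index` in *term_out and returns 1 if it is known,
 * otherwise returns 0 (the entry was trimmed and is not the boundary). */
static int raft_log_term_at(const raft_log_view *log, uint64_t index, uint64_t *term_out) {
    if (index == 0) { /* assumed convention: empty prefix has term 0 */
        *term_out = 0;
        return 1;
    }
    if (index >= log->first_index && index - log->first_index < log->len) {
        *term_out = log->terms[index - log->first_index];
        return 1;
    }
    if (index == log->last_applied) { /* entry compacted, term persisted */
        *term_out = log->snapshot_term;
        return 1;
    }
    return 0;
}
```

After a restart the in-memory log is empty, so only the boundary index resolves; any older prev_log_index is reported as unknown.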
Trim committed entries from the front of the in-memory log once the number of applied entries exceeds 2x RAFT_LOG_TRIM_KEEP (64). This keeps memory bounded while retaining enough history for slow followers to catch up via normal AE. Followers that fall behind the trim point will need InstallSnapshot (next commit).
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
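A rough sketch of the trim decision, under two assumptions that the commit message does not spell out: that RAFT_LOG_TRIM_KEEP itself is 64, and that a trim leaves exactly RAFT_LOG_TRIM_KEEP applied entries in memory. The function name `raft_log_trim_point` is hypothetical.

```c
#include <stdint.h>

/* Assumed value, read from the "(64)" in the commit message. */
#define RAFT_LOG_TRIM_KEEP 64

/* Given the index of the first in-memory entry and last_applied, return the
 * index that should become the first in-memory entry. Returning first_index
 * unchanged means "do not trim yet". */
static uint64_t raft_log_trim_point(uint64_t first_index, uint64_t last_applied) {
    if (last_applied < first_index) return first_index; /* nothing applied in memory */
    uint64_t applied_in_memory = last_applied - first_index + 1;
    if (applied_in_memory <= 2 * RAFT_LOG_TRIM_KEEP) return first_index;
    /* Keep the newest RAFT_LOG_TRIM_KEEP applied entries (assumption). */
    return last_applied - RAFT_LOG_TRIM_KEEP + 1;
}
```

Trimming only after 2x the keep count has accumulated means the front of the log is cut in batches rather than on every applied entry.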
When the leader's next_index for a follower is behind the first available log entry (trimmed), send INSTALL_SNAPSHOT instead of AE. The snapshot contains the cluster node descriptions (same format as nodes.conf node lines).
The follower resets its cluster state, parses the snapshot node lines using the new clusterParseNodeLine() helper (extracted from clusterLoadConfig), and advances last_applied/commit_index to the snapshot point.
The leader strips the CLUSTER_NODE_MYSELF flag from the snapshot so the receiver doesn't adopt the leader's identity.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
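The leader-side decision and the flag stripping could be sketched like this. The enum, the function names, and the flag bit values are illustrative assumptions; the real definitions live in the cluster headers.

```c
#include <stdint.h>

/* Illustrative bit values only; use the definitions from the cluster header. */
#define CLUSTER_NODE_PRIMARY (1u << 0)
#define CLUSTER_NODE_MYSELF  (1u << 4)

typedef enum {
    RAFT_SEND_APPEND_ENTRIES,
    RAFT_SEND_INSTALL_SNAPSHOT
} raft_send_kind;

/* If the follower needs an entry that has already been trimmed from the
 * front of the log, normal AE cannot serve it and a snapshot is required. */
static raft_send_kind raft_choose_message(uint64_t next_index, uint64_t first_log_index) {
    return next_index < first_log_index ? RAFT_SEND_INSTALL_SNAPSHOT
                                        : RAFT_SEND_APPEND_ENTRIES;
}

/* Flags as written into a snapshot: the sender's "myself" marker is removed
 * so that the receiver does not mistake the leader's node line for its own. */
static unsigned snapshot_node_flags(unsigned flags) {
    return flags & ~CLUSTER_NODE_MYSELF;
}
```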
Document snapshotTerm, in-memory log trimming, InstallSnapshot protocol, and no-op entry on election. Explain why snapshotTerm is needed (entry at lastApplied is not in the log after compaction).
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Filter CLUSTER_NODE_MEET (in addition to CLUSTER_NODE_HANDSHAKE) when writing nodes.conf. MEET-flagged nodes represent incomplete join attempts: in gossip they haven't been contacted yet (peer would reject our PING since it doesn't know us), in raft they haven't been committed to the log. In both cases, persisting them is incorrect: on reload the MEET flag is lost, leaving a dead node in the config (gossip) or an uncommitted node counted as a voter (raft).
Use the same filter in InstallSnapshot generation.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Never filter the local node from clusterGenNodesDescription, regardless of flags. This ensures myself is always in nodes.conf and in snapshots sent to peers.
On startup, if the node is alone (no peers), always set MEET flag and reset size to 0. A singleton is not yet in the raft log and must not commit entries until it joins a cluster via NODE_JOIN. Previously, a singleton that restarted would lose its MEET flag (size became 1 from the persisted node line) and could commit entries prematurely.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
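Taken together, the filtering rules from the two commits above (skip HANDSHAKE and MEET nodes, but never skip the local node) amount to a predicate along these lines. The function name `node_line_should_be_written` is hypothetical and the flag bit values are illustrative, not copied from the source tree.

```c
#include <stdbool.h>

/* Illustrative bit values only; use the definitions from the cluster header. */
#define CLUSTER_NODE_MYSELF    (1u << 4)
#define CLUSTER_NODE_HANDSHAKE (1u << 5)
#define CLUSTER_NODE_MEET      (1u << 7)

/* Decide whether a node line belongs in nodes.conf or in a snapshot. */
static bool node_line_should_be_written(unsigned flags) {
    /* The local node is always written, whatever its flags are. */
    if (flags & CLUSTER_NODE_MYSELF) return true;
    /* Incomplete joins are never persisted. */
    return (flags & (CLUSTER_NODE_HANDSHAKE | CLUSTER_NODE_MEET)) == 0;
}
```

The first check matters for a freshly started singleton, which carries the MEET flag itself and would otherwise be filtered out of its own config.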
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Squashed commit of the following:
Raft: fix size_t overflow in PUBLISH message bounds check
Add overflow guard on chan_len + msg_len before comparing against
available buffer space. Without this, a malicious peer could send
lengths that overflow to a small sum, bypassing the bounds check.
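A minimal sketch of this kind of guard, assuming both lengths are `size_t` values read from the message header. The function name `publish_lengths_fit` and the parameter `available` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true only if chan_len + msg_len can be computed without wrapping
 * around and the sum fits in the available buffer space. */
static bool publish_lengths_fit(size_t chan_len, size_t msg_len, size_t available) {
    /* Overflow guard: checked by subtraction so the check itself cannot wrap. */
    if (chan_len > SIZE_MAX - msg_len) return false;
    return chan_len + msg_len <= available;
}
```

Without the first check, `chan_len = SIZE_MAX` and `msg_len = 2` would sum to 1 and pass a naive `chan_len + msg_len <= available` comparison.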
Raft: disconnect stale outbound links
When the leader has outbound links that haven't received AE_ACK within
node_timeout/2, disconnect them so clusterConnectNodes can re-establish
fresh connections. This handles half-open connections from peers that
were paused or killed without closing the TCP connection.
On nodes other than the leader, disconnect stale outbound links after
node_timeout/2 if we've never even completed the HI handshake, meaning the kernel
of the peer accepted the connection but it never reached the process
(e.g. if it's stopped with SIGSTOP).
In other cases, such as between two followers, it's not an error that
a connection is idle for a long time, so they're not included in this
check.
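The two staleness rules can be sketched as one predicate. The `link_view` struct, its field names, and the choice to measure from link creation when the leader has not yet seen any AE_ACK are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of one outbound link. */
typedef struct {
    bool we_are_leader;     /* local node is the raft leader */
    bool hi_done;           /* HI handshake has completed on this link */
    int64_t created_ms;     /* when the link was established */
    int64_t last_ae_ack_ms; /* last AE_ACK received, 0 if none yet */
} link_view;

/* Returns true if the outbound link should be disconnected so that a fresh
 * connection can be established. */
static bool link_is_stale(const link_view *l, int64_t now_ms, int64_t node_timeout_ms) {
    int64_t limit = node_timeout_ms / 2;
    if (l->we_are_leader) {
        /* Leader: every peer is expected to acknowledge AE regularly. */
        int64_t since = l->last_ae_ack_ms ? l->last_ae_ack_ms : l->created_ms;
        return now_ms - since > limit;
    }
    /* Non-leader: only a link that never got past the handshake is stale. */
    if (!l->hi_done) return now_ms - l->created_ms > limit;
    /* An idle link between, say, two followers is not an error. */
    return false;
}
```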
Raft: fix proposal retry flooding
Remove the unconditional clusterRaftRetryProposals call from cron that
ran every 100ms, flooding the leader with duplicate PROPOSE messages.
Proposals are now retried only on specific events:
- On leader change (AE from new leader): forward to new leader.
- On reconnect to same leader (HI): re-forward in case the previous
send was lost on the dead connection.
- On becoming leader: append locally.
This eliminates duplicate log entries from repeated forwarding while
still ensuring proposals reach the leader after connectivity changes.
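The event-driven retry policy above can be summarized as a small mapping from event to action. The enum and function names here are hypothetical and exist only to make the three cases explicit.

```c
/* Hypothetical events that could trigger a proposal retry. */
typedef enum {
    RETRY_EV_CRON_TICK,             /* periodic cron, no longer a trigger */
    RETRY_EV_LEADER_CHANGED,        /* AE received from a new leader */
    RETRY_EV_RECONNECTED_TO_LEADER, /* HI on a new link to the same leader */
    RETRY_EV_BECAME_LEADER          /* local node won the election */
} retry_event;

typedef enum {
    RETRY_ACT_NONE,
    RETRY_ACT_FORWARD_TO_LEADER,
    RETRY_ACT_APPEND_LOCALLY
} retry_action;

/* What to do with pending proposals when `ev` occurs. */
static retry_action proposal_retry_action(retry_event ev) {
    switch (ev) {
    case RETRY_EV_LEADER_CHANGED:
    case RETRY_EV_RECONNECTED_TO_LEADER:
        return RETRY_ACT_FORWARD_TO_LEADER;
    case RETRY_EV_BECAME_LEADER:
        return RETRY_ACT_APPEND_LOCALLY;
    case RETRY_EV_CRON_TICK:
    default:
        return RETRY_ACT_NONE;
    }
}
```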
Raft: use base node-timeout for leader lease instead of randomized election timeout
The leader lease check in clusterRaftCanGrantVote denied pre-votes
while last_heartbeat was within the randomized election_timeout
[T, 2T). This meant a node whose election timeout fired early
could have its pre-vote denied by a peer whose randomized timeout
was longer — the peer still considered the leader alive.
Use the base cluster-node-timeout for the lease duration. The
randomization is for staggering elections, not for lease duration.
This ensures that once node-timeout has elapsed since the last
heartbeat, all nodes agree the leader is dead and grant pre-votes.
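As an illustration of the lease rule, a sketch that considers only the lease part of the vote check. The function names are hypothetical, and the real clusterRaftCanGrantVote takes other conditions into account that are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* The lease is active while less than the base node timeout has elapsed
 * since the last heartbeat. The randomized election timeout in [T, 2T) is
 * deliberately not used here. */
static bool leader_lease_active(int64_t now_ms, int64_t last_heartbeat_ms, int64_t node_timeout_ms) {
    return now_ms - last_heartbeat_ms < node_timeout_ms;
}

/* Lease portion of the pre-vote decision only. */
static bool lease_allows_prevote(int64_t now_ms, int64_t last_heartbeat_ms, int64_t node_timeout_ms) {
    return !leader_lease_active(now_ms, last_heartbeat_ms, node_timeout_ms);
}
```

With a node timeout of 1000 ms, a peer whose own randomized election timeout happens to be 1700 ms no longer denies a pre-vote at 1200 ms after the last heartbeat, because only the base timeout is compared.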
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
ca0e126 to c5b4afe
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## cluster-v2 #4008 +/- ##
==============================================
+ Coverage 76.47% 76.66% +0.18%
==============================================
Files 166 166
Lines 82871 83008 +137
==============================================
+ Hits 63373 63634 +261
+ Misses 19498 19374 -124
Implement Raft log snapshots so the leader can bring lagging or newly joined followers up to speed without needing the full log history.
Changes:
Tests:
Includes #4015 (various fixes, squashed) - required to make the tests pass. (I'll update this PR when that one is merged.)
Closes #3858