Skip to content

Raft Cluster: log compaction and InstallSnapshot#4008

Draft
zuiderkwast wants to merge 9 commits into
valkey-io:cluster-v2from
zuiderkwast:raft-snapshot
Draft

Raft Cluster: log compaction and InstallSnapshot#4008
zuiderkwast wants to merge 9 commits into
valkey-io:cluster-v2from
zuiderkwast:raft-snapshot

Conversation

@zuiderkwast

@zuiderkwast zuiderkwast commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Implement Raft log snapshots so the leader can bring lagging or newly joined followers up to speed without needing the full log history.

Changes:

  • Persist snapshotTerm in nodes.conf (the term of the entry at lastApplied, equivalent to Raft's lastIncludedTerm)
  • Use snapshotTerm in the AE consistency check when the entry at prev_log_index has been compacted
  • Trim committed entries from the in-memory log (keep 64 for slow followers)
  • Send INSTALL_SNAPSHOT when a follower needs entries that have been trimmed — the snapshot is the cluster node descriptions (same format as nodes.conf)
  • Handle snapshot install on the follower: load node lines, remove stale nodes, promote joiner→follower, recompute quorum
  • Append a no-op entry on leader election to commit prior-term entries (Raft §5.4.2)
  • Extract clusterLoadNodeLine() from clusterLoadConfig() for reuse
  • Fix raftLogGet/raftLogLastIndex/raftLogLastTerm to handle empty log after compaction

Tests:

  • Follower catches up via snapshot after leader trims log
  • New node joins a leader with compacted log (via cluster restart)
  • New node joins when leader has trimmed log (via runtime trimming)

Includes #4015 (various fixes, squashed) - required to make the tests pass. (I'll update this PR when that one is merged.)

Closes #3858

Add snapshot_term field to persistent raft state. It tracks the term of
the entry at last_applied, persisted as snapshotTerm in the nodes.conf
vars line. This will be used by raftLogTermAt to verify the leader's
prev_log_term at the log compaction boundary.

Extract the apply loop (last_applied → commit_index) into a single
raftApplyCommitted() function, replacing three identical inline loops.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When the entry at prev_log_index is not in the in-memory log (already
applied and compacted), return snapshot_term if the index matches
last_applied. This allows the AE consistency check to pass after a
restart without probing all the way back to index 0.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Trim committed entries from the front of the in-memory log once the
number of applied entries exceeds 2x RAFT_LOG_TRIM_KEEP (64). This
keeps memory bounded while retaining enough history for slow followers
to catch up via normal AE. Followers that fall behind the trim point
will need InstallSnapshot (next commit).

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 019c1c57-effb-467c-a4c0-147f076ab972

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

When the leader's next_index for a follower is behind the first
available log entry (trimmed), send INSTALL_SNAPSHOT instead of AE.
The snapshot contains the cluster node descriptions (same format as
nodes.conf node lines).

The follower resets its cluster state, parses the snapshot node lines
using the new clusterParseNodeLine() helper (extracted from
clusterLoadConfig), and advances last_applied/commit_index to the
snapshot point.

The leader strips the CLUSTER_NODE_MYSELF flag from the snapshot so the
receiver doesn't adopt the leader's identity.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Document snapshotTerm, in-memory log trimming, InstallSnapshot
protocol, and no-op entry on election. Explain why snapshotTerm
is needed (entry at lastApplied is not in the log after compaction).

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Filter CLUSTER_NODE_MEET (in addition to CLUSTER_NODE_HANDSHAKE) when
writing nodes.conf. MEET-flagged nodes represent incomplete join
attempts: in gossip they haven't been contacted yet (peer would reject
our PING since it doesn't know us), in raft they haven't been committed
to the log. In both cases, persisting them is incorrect — on reload the
MEET flag is lost, leaving a dead node in the config (gossip) or an
uncommitted node counted as a voter (raft).

Use the same filter in InstallSnapshot generation.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Never filter the local node from clusterGenNodesDescription, regardless
of flags. This ensures myself is always in nodes.conf and in snapshots
sent to peers.

On startup, if the node is alone (no peers), always set MEET flag and
reset size to 0. A singleton is not yet in the raft log and must not
commit entries until it joins a cluster via NODE_JOIN. Previously, a
singleton that restarted would lose its MEET flag (size became 1 from
the persisted node line) and could commit entries prematurely.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Squashed commit of the following:

Raft: fix size_t overflow in PUBLISH message bounds check

    Add overflow guard on chan_len + msg_len before comparing against
    available buffer space. Without this, a malicious peer could send
    lengths that overflow to a small sum, bypassing the bounds check.

Raft: disconnect stale outbound links

    When the leader has outbound links that haven't received AE_ACK within
    node_timeout/2, disconnect them so clusterConnectNodes can re-establish
    fresh connections. This handles half-open connections from peers that
    were paused or killed without closing the TCP connection.

    On nodes other than the leader, disconnect stale outbound links after
    node_timeout/2 if we've never even the HI handshake, meaning the kernel
    of the peer accepted the connection but it never reached the process
    (e.g. if it's stopped with SIGSTOP).

    In other cases, such as between two followers, it's not an error that
    a connection is idle for a long time, so we they're not included in this
    check.

Raft: fix proposal retry flooding

    Remove the unconditional clusterRaftRetryProposals call from cron that
    ran every 100ms, flooding the leader with duplicate PROPOSE messages.

    Proposals are now retried only on specific events:
    - On leader change (AE from new leader): forward to new leader.
    - On reconnect to same leader (HI): re-forward in case the previous
      send was lost on the dead connection.
    - On becoming leader: append locally.

    This eliminates duplicate log entries from repeated forwarding while
    still ensuring proposals reach the leader after connectivity changes.

Raft: use base node-timeout for leader lease instead of randomized election timeout

    The leader lease check in clusterRaftCanGrantVote denied pre-votes
    while last_heartbeat was within the randomized election_timeout
    [T, 2T). This meant a node whose election timeout fired early
    could have its pre-vote denied by a peer whose randomized timeout
    was longer — the peer still considered the leader alive.

    Use the base cluster-node-timeout for the lease duration. The
    randomization is for staggering elections, not for lease duration.
    This ensures that once node-timeout has elapsed since the last
    heartbeat, all nodes agree the leader is dead and grant pre-votes.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast linked an issue Jun 22, 2026 that may be closed by this pull request
@zuiderkwast zuiderkwast requested a review from murphyjacob4 June 22, 2026 16:58
@codecov

codecov Bot commented Jun 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.06349% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.66%. Comparing base (f176b3d) to head (c5b4afe).

Files with missing lines Patch % Lines
src/cluster_raft.c 93.41% 11 Missing ⚠️
src/cluster_nodes.c 81.81% 4 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##           cluster-v2    #4008      +/-   ##
==============================================
+ Coverage       76.47%   76.66%   +0.18%     
==============================================
  Files             166      166              
  Lines           82871    83008     +137     
==============================================
+ Hits            63373    63634     +261     
+ Misses          19498    19374     -124     
Files with missing lines Coverage Δ
src/cluster_nodes.c 77.27% <81.81%> (+1.18%) ⬆️
src/cluster_raft.c 71.13% <93.41%> (+5.23%) ⬆️

... and 25 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Raft Cluster: Log Compaction and Snapshotting

1 participant