server: skip campaign until region syncer history phase is done#10844
server: skip campaign until region syncer history phase is done#10844tharanga wants to merge 3 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @tharanga. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Welcome @tharanga! |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRegion syncer adds durable committed-region-count storage, in-memory catch-up with recordNoPersist/commit/rollback and explicit end-of-history markers, sticky sync-session flags, periodic leader publishing of committed counts, and a gate preventing unsynced members from campaigning when durable committed count is non-zero. ChangesRegion Syncer History Catch-Up and Leadership Election Gating
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/syncer/server.go`:
- Around line 507-520: The end-of-history empty SyncRegionResponse is being sent
too early using lastIndex (snapshot row count), which causes
handleRegionSyncResponse to commit the wrong index; change the logic so the
server sends exactly one trailing end-of-history marker after retained history
replay completes (not before), using the correct index from
retainedRecordsFrom/retainIndex (or nextIndex when there are no retained
records) rather than lastIndex; locate the marker creation and stream.Send calls
in pkg/syncer/server.go (the block that builds pdpb.SyncRegionResponse with
StartIndex: uint64(lastIndex)) and defer or remove them so that only
syncHistoryRecords emits the final marker with the proper
StartIndex/nextIndex/retainIndex, and apply the same fix to the duplicate block
referenced at the other similar section.
In `@server/server.go`:
- Around line 1783-1789: The current branch in the startup gate returns true
when s.storage.LoadRegionSyncerCommittedRegionCount() fails, which can allow a
stale/empty follower to campaign; change the error handling so that when
LoadRegionSyncerCommittedRegionCount() returns an error (i.e., committed count
cannot be read), the method blocks campaigning by returning false (do not open
the gate), while keeping the existing warning log (use the same log.Warn with
errs.ZapError(err)); ensure this logic sits in the same code path guarded by
HasAttemptedSync() and IsHistorySynced() so that a read/parse failure prevents
using s.electionClient when s.client metadata lookup failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 884e4d58-fdff-435e-8114-76151c8fee64
📒 Files selected for processing (12)
pkg/storage/endpoint/region_syncer.gopkg/storage/endpoint/region_syncer_test.gopkg/storage/storage.gopkg/syncer/client.gopkg/syncer/client_test.gopkg/syncer/history_buffer.gopkg/syncer/history_buffer_test.gopkg/syncer/server.gopkg/syncer/server_test.gopkg/utils/keypath/absolute_key_path.goserver/server.gotests/server/region_syncer/region_syncer_test.go
|
/ok-to-test |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/syncer/server.go (1)
264-319:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPublish the committed region count before waiting for the first tick.
The first durable write is delayed by
committedRegionCountInterval. If this leader dies during that first second, the stored committed count is still0, so the new campaign gate can misclassify an unsynced follower as already caught up. Publish once synchronously whenRunServerstarts, then keep the ticker for refreshes.Suggested fix
keepAliveTicker := time.NewTicker(syncerKeepAliveInterval) shrinkTicker := time.NewTicker(historyBufferShrinkInterval) committedCountTicker := time.NewTicker(committedRegionCountInterval) // -1 forces an initial publish on the first tick. lastPublishedRegionCount := -1 + publishCommittedRegionCount := func() { + if count := s.server.GetBasicCluster().GetTotalRegionCount(); count != lastPublishedRegionCount { + if err := s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)); err != nil { + log.Warn("failed to persist committed region count", errs.ZapError(err)) + } else { + lastPublishedRegionCount = count + } + } + } + publishCommittedRegionCount() @@ case <-committedCountTicker.C: - // Publish the region count this leader is serving so that a member - // that has not finished a history sync can still decide, after this - // leader is gone, whether it is caught up enough to campaign. Only - // the leader runs RunServer, so only the leader writes this. - if count := s.server.GetBasicCluster().GetTotalRegionCount(); count != lastPublishedRegionCount { - if err := s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)); err != nil { - log.Warn("failed to persist committed region count", errs.ZapError(err)) - } else { - lastPublishedRegionCount = count - } - } + // Publish the region count this leader is serving so that a member + // that has not finished a history sync can still decide, after this + // leader is gone, whether it is caught up enough to campaign. Only + // the leader runs RunServer, so only the leader writes this. + publishCommittedRegionCount()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/syncer/server.go` around lines 264 - 319, The committed region count isn't persisted until the first committedCountTicker tick, which can leave storage at 0 if the leader dies early; call the same save logic once synchronously before entering the for/select loop: compute count via s.server.GetBasicCluster().GetTotalRegionCount(), call s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)), handle the error like in the ticker branch (log Warn on error, otherwise set lastPublishedRegionCount = count), and then proceed to start the existing committedCountTicker for periodic refreshes.
♻️ Duplicate comments (1)
pkg/syncer/server.go (1)
502-510:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRestore the explicit end-of-history marker after replay.
The client-side state machine only flips
historySyncedwhen it sees the empty end-of-history response. These paths return after replaying records, so a fully caught-up follower stays "unsynced" until a later keepalive or region update. If the leader disappears first, that member remains blocked from campaigning even though it already consumed the full history. This regresses the earlier end-of-history marker fix.Suggested fix
func (*RegionSyncer) syncHistoryRecordsLocked(startIndex uint64, records []*core.RegionInfo, stream *regionSyncStream) error { for start := 0; start < len(records); start += maxSyncRegionBatchSize { end := min(start+maxSyncRegionBatchSize, len(records)) if err := stream.sendStreamIfOpen(buildSyncRegionResponse(startIndex+uint64(start), records[start:end])); err != nil { return err } } - return nil + return stream.sendStreamIfOpen(&pdpb.SyncRegionResponse{ + Header: &pdpb.ResponseHeader{ClusterId: keypath.ClusterID()}, + StartIndex: startIndex + uint64(len(records)), + }) } @@ if len(records) > 0 { if err := s.syncHistoryRecordsLocked(syncStartIndex, records, syncStream); err != nil { return err } syncStream.advanceSendIndexLocked(len(records)) + } else { + if err := syncStream.sendStreamIfOpen(&pdpb.SyncRegionResponse{ + Header: &pdpb.ResponseHeader{ClusterId: keypath.ClusterID()}, + StartIndex: nextIndex, + }); err != nil { + return err + } } return nil }Also applies to: 575-587
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/syncer/server.go` around lines 502 - 510, The loop in syncHistoryRecordsLocked sends all batched history chunks but never emits the explicit empty end-of-history marker the client expects to flip historySynced; after the for loop return an explicit empty end-of-history response by calling stream.sendStreamIfOpen with buildSyncRegionResponse for the next index and an empty records slice (propagate any error), and apply the same fix to the other similar replay path mentioned so the client receives the empty marker and can set historySynced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/syncer/server.go`:
- Around line 264-319: The committed region count isn't persisted until the
first committedCountTicker tick, which can leave storage at 0 if the leader dies
early; call the same save logic once synchronously before entering the
for/select loop: compute count via
s.server.GetBasicCluster().GetTotalRegionCount(), call
s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)),
handle the error like in the ticker branch (log Warn on error, otherwise set
lastPublishedRegionCount = count), and then proceed to start the existing
committedCountTicker for periodic refreshes.
---
Duplicate comments:
In `@pkg/syncer/server.go`:
- Around line 502-510: The loop in syncHistoryRecordsLocked sends all batched
history chunks but never emits the explicit empty end-of-history marker the
client expects to flip historySynced; after the for loop return an explicit
empty end-of-history response by calling stream.sendStreamIfOpen with
buildSyncRegionResponse for the next index and an empty records slice (propagate
any error), and apply the same fix to the other similar replay path mentioned so
the client receives the empty marker and can set historySynced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4a8c0535-6550-491a-a400-447182505657
📒 Files selected for processing (5)
pkg/syncer/client.gopkg/syncer/client_test.gopkg/syncer/history_buffer.gopkg/syncer/history_buffer_test.gopkg/syncer/server.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/syncer/history_buffer_test.go
- pkg/syncer/history_buffer.go
- pkg/syncer/client.go
|
/retest |
1 similar comment
|
/retest |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10844 +/- ##
==========================================
+ Coverage 79.14% 79.20% +0.05%
==========================================
Files 536 541 +5
Lines 74033 74566 +533
==========================================
+ Hits 58594 59057 +463
- Misses 11311 11347 +36
- Partials 4128 4162 +34
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
Thanks for the PR. I checked the latest master, and the underlying issue still exists: region sync completion is not used as a guard before PD leader campaign, so a newly joined member with empty region metadata can still become leader. The fix direction looks right, but I think this PR is too broad for #10843 as-is. The main concern is the 10s grace fallback. With the default timing, other healthy members may only try to recover etcd leadership after roughly 40s, so an unsynced member can still campaign after the 10s grace and become PD leader with empty or partial region metadata. The new regression test observes only a 3s window, so it does not prove the behavior after the grace period. I suggest splitting this into a smaller first fix:
I would also avoid adding the new etcd-persisted committed-region-count key in the first fix. The core issue is local election eligibility: whether this member has completed region sync before campaigning. That should preferably be derived from the sync completion signal and local durable sync state, not from a new cluster-level marker. If we still need a cluster-level etcd marker, I think it should be justified and reviewed separately, because it adds upgrade/rollback semantics and another source of truth. The leader-transfer target preference/refusal logic also looks like a separate policy change. It may be useful, but I would prefer it in a follow-up PR after the core election guard is fixed. Also, the branch currently conflicts with master and CI is failing, so it needs a rebase before final review. |
Signed-off-by: tharanga <tharanga@gmail.com> re-pro of a new PD node acquiring the leadership with no regions fixing the test to fail without any changes Signed-off-by: tharanga <tharanga@gmail.com> server: skip campaign until region syncer history phase done Signed-off-by: tharanga <tharanga@gmail.com>
Signed-off-by: tharanga <tharanga@gmail.com>
069ce6f to
73f1b58
Compare
@okJiang thanks for the feedback. I've rebased my branch to fix conflicts, and preped to cleanly remove additional changes you pointed out. I will address your comments, and ping you once it's ready to review. |
73f1b58 to
007a7dd
Compare
@okJiang the behind-case gate is purely local ( |
|
/retest |
1 similar comment
|
/retest |
Signed-off-by: tharanga <tharanga@gmail.com>
|
@tharanga: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: Close #10843
When a node with an empty region cache joins the PD cluster (either a new node or an existing node that lost its LevelDB state), it can become the leader by entering the campaign loop without waiting for asynchronous region sync to complete. When there are hundreds of thousands of regions, the new leader is unable to rebuild the region cache in a timely manner, causing a spike in
GetRegionrequests, increasing region heartbeat andGetRegionlatency, and ultimately resulting in query errors and elevated latencies.What is changed and how does it work?
When a new member joins the PD cluster, it waits until the historical region sync phase is done before start campaigning for the leadership.
Check List
[x] - Unit and integration tests
Tests
Code changes
etcdSide effects
-- Increasing the state space of node/cluster startup/bootstrap state machine
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests