Skip to content

server: skip campaign until region syncer history phase is done#10844

Open
tharanga wants to merge 3 commits into
tikv:masterfrom
tharanga:tharanga/new_member_region_catchup
Open

server: skip campaign until region syncer history phase is done#10844
tharanga wants to merge 3 commits into
tikv:masterfrom
tharanga:tharanga/new_member_region_catchup

Conversation

@tharanga

@tharanga tharanga commented Jun 8, 2026

Copy link
Copy Markdown

What problem does this PR solve?

Issue Number: Close #10843

When a node with an empty region cache joins the PD cluster (either a new node or an existing node that lost its LevelDB state), it can become the leader by entering the campaign loop without waiting for asynchronous region sync to complete. When there are hundreds of thousands of regions, the new leader is unable to rebuild the region cache in a timely manner, causing a spike in GetRegion requests, increasing region heartbeat and GetRegion latency, and ultimately resulting in query errors and elevated latencies.

What is changed and how does it work?

When a new member joins the PD cluster, it waits until the historical region sync phase is done before start campaigning for the leadership.

server: skip campaign until region syncer history phase is done

Check List

[x] - Unit and integration tests

Tests

Unit tests (no failpoints needed — run directly)

  # pkg/storage/endpoint — RegionSyncerStorage Load/Save (3 tests)
  go test ./pkg/storage/endpoint/ -count=1 -v \
    -run 'TestRegionSyncerCommittedRegionCount'

  # pkg/syncer — historySynced durable reload + catch-up commit/rollback (2 tests)
  go test ./pkg/syncer/ -count=1 -v \
    -run 'TestHistorySyncedInitFromDurableState|TestCatchupCommitRollback'

 Integration tests (require failpoints enabled)

  These call failpoint.Enable(...), so the failpoints must be rewritten in first — otherwise the inject sites are no-ops and the tests won't behave
  correctly.

  # enable once for the failpoint-dependent runs
  make failpoint-enable

  # tests/server/region_syncer — unsynced member must not win leadership
  go test ./tests/server/region_syncer/ -count=1 -v -timeout 300s \
    -run 'TestUnsyncedMemberRefusesLeadership'

  # cluster-formation fuzzer (opt-in via env; honors PD_FUZZ_SEED / PD_FUZZ_ITERATIONS)
  PD_RUN_FUZZ=1 go test ./tests/server/region_syncer/ -count=1 -v -timeout 1200s \
    -run 'TestClusterFormationFuzz'

  # restore when done
  make failpoint-disable

Code changes

  • Has persistent data change
    • New storage endpoint to save region counts to etcd

Side effects

  • Increased code complexity
    -- Increasing the state space of node/cluster startup/bootstrap state machine

Release note

When a new member joins the PD cluster, it waits until the historical region sync phase is done before start campaigning for the leadership. This avoids 

Summary by CodeRabbit

  • New Features

    • Persisted region-sync state (committed region count & history sync markers) and sticky sync status to survive restarts.
    • Protocol sends explicit end-of-history markers to improve catch-up behavior.
    • Leadership gate prevents promotion until region sync state is safe.
  • Bug Fixes

    • Prevents stale history offsets on leadership changes by rolling back uncommitted catch-up state.
  • Tests

    • New tests covering persistence, catch-up commit/rollback, and leadership eligibility.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. labels Jun 8, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rleungx for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added contribution This PR is from a community contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Jun 8, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Hi @tharanga. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot

ti-chi-bot Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Welcome @tharanga!

It looks like this is your first PR to tikv/pd 🎉.

I'm the bot to help you request reviewers, add labels and more, See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to tikv/pd. 😃

@ti-chi-bot ti-chi-bot Bot added the first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. label Jun 8, 2026
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Region syncer adds durable committed-region-count storage, in-memory catch-up with recordNoPersist/commit/rollback and explicit end-of-history markers, sticky sync-session flags, periodic leader publishing of committed counts, and a gate preventing unsynced members from campaigning when durable committed count is non-zero.

Changes

Region Syncer History Catch-Up and Leadership Election Gating

Layer / File(s) Summary
Storage contract and path for committed region count
pkg/storage/endpoint/region_syncer.go, pkg/storage/storage.go, pkg/utils/keypath/absolute_key_path.go
RegionSyncerStorage interface and StorageEndpoint implementations load/save committed region count as base-10 uint64 strings; keypath helper formats durable etcd key under cluster namespace.
Storage tests for committed region count
pkg/storage/endpoint/region_syncer_test.go
In-memory StorageEndpoint tests: absent-key loads as zero with no error; round-trip preserves uint64↔string (including 0 and math.MaxUint64); corrupt values return parse errors.
History buffer catch-up (in-memory apply / commit / rollback)
pkg/syncer/history_buffer.go, pkg/syncer/history_buffer_test.go
recordNoPersist appends in-memory without persisting index; commit persists next index; rollback clears in-memory buffer and reloads persisted index; TestCatchupCommitRollback validates atomicity and invariants.
Client syncer: sync-session state, end-of-history, and branched recording
pkg/syncer/client.go, pkg/syncer/client_test.go
Adds sticky atomic attemptedSync and historySynced with getters; treats empty regions response as end-of-history (commits and sets flag); uses recordNoPersist during catch-up and record after sync; rolls back on stop when sync incomplete; TestHistorySyncedInitFromDurableState verifies initialization from durable state.
Server syncer: committed region count periodic publishing
pkg/syncer/server.go
Initializes historySynced from durable state; creates committedCountTicker in RunServer to periodically publish/persist leader's committed region count via SaveRegionSyncerCommittedRegionCount when it changes; stops ticker on shutdown; sync methods now send explicit end-of-history markers.
Server syncer tests updated for end-of-history markers
pkg/syncer/server_test.go
Adjusts mocked channel sizing and response-count expectations to include explicit empty end-of-history marker responses and verifies ordering/content.
Leader election gating based on syncer catch-up
server/server.go, tests/server/region_syncer/region_syncer_test.go
canCampaignAsRegionSyncerCaughtUp prevents leadership campaigning when region storage is enabled, syncer attempted sync but history not synced, and durable committed-region count > 0; leaderLoop waits until gate passes; TestUnsyncedMemberRefusesLeadership regression test verifies a newly joined unsynced PD does not win leadership while blocked by failpoint.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • tikv/pd#10672: Modifies syncer response/handling code in pkg/syncer/client.go/pkg/syncer/server.go, overlapping with end-of-history and history handling changes.
  • tikv/pd#10733: Changes handleRegionSyncResponse around end-of-history and shutdown flows that intersect this PR's history-sync logic.
  • tikv/pd#10716: Alters history buffering and persistence semantics in pkg/syncer/history_buffer.go, related to this PR's commit/rollback/recordNoPersist work.

Suggested labels

lgtm, approved, needs-cherry-pick-release-8.5

Suggested reviewers

  • okJiang
  • rleungx
  • bufferflies

Poem

🐰 I buffered tales in memory tight,
Saved the count by candlelight,
When empty echoes called the end,
I committed, then could safely send.
Now leaders wait until we're right.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: preventing campaign/leadership acquisition until region syncer history phase completes.
Description check ✅ Passed The description provides issue reference, problem statement, solution summary, and comprehensive test instructions, closely following the repository template.
Linked Issues check ✅ Passed Code changes directly address issue #10843 objectives: preventing empty-region-cache nodes from becoming leader by blocking campaign until region syncer history phase completes.
Out of Scope Changes check ✅ Passed All changes are scoped to the issue: storage endpoint for region counts, history sync state tracking, commit/rollback semantics, leadership campaign gating, and corresponding tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 8, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/syncer/server.go`:
- Around line 507-520: The end-of-history empty SyncRegionResponse is being sent
too early using lastIndex (snapshot row count), which causes
handleRegionSyncResponse to commit the wrong index; change the logic so the
server sends exactly one trailing end-of-history marker after retained history
replay completes (not before), using the correct index from
retainedRecordsFrom/retainIndex (or nextIndex when there are no retained
records) rather than lastIndex; locate the marker creation and stream.Send calls
in pkg/syncer/server.go (the block that builds pdpb.SyncRegionResponse with
StartIndex: uint64(lastIndex)) and defer or remove them so that only
syncHistoryRecords emits the final marker with the proper
StartIndex/nextIndex/retainIndex, and apply the same fix to the duplicate block
referenced at the other similar section.

In `@server/server.go`:
- Around line 1783-1789: The current branch in the startup gate returns true
when s.storage.LoadRegionSyncerCommittedRegionCount() fails, which can allow a
stale/empty follower to campaign; change the error handling so that when
LoadRegionSyncerCommittedRegionCount() returns an error (i.e., committed count
cannot be read), the method blocks campaigning by returning false (do not open
the gate), while keeping the existing warning log (use the same log.Warn with
errs.ZapError(err)); ensure this logic sits in the same code path guarded by
HasAttemptedSync() and IsHistorySynced() so that a read/parse failure prevents
using s.electionClient when s.client metadata lookup failed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 884e4d58-fdff-435e-8114-76151c8fee64

📥 Commits

Reviewing files that changed from the base of the PR and between 9f1c47b and e3001b0.

📒 Files selected for processing (12)
  • pkg/storage/endpoint/region_syncer.go
  • pkg/storage/endpoint/region_syncer_test.go
  • pkg/storage/storage.go
  • pkg/syncer/client.go
  • pkg/syncer/client_test.go
  • pkg/syncer/history_buffer.go
  • pkg/syncer/history_buffer_test.go
  • pkg/syncer/server.go
  • pkg/syncer/server_test.go
  • pkg/utils/keypath/absolute_key_path.go
  • server/server.go
  • tests/server/region_syncer/region_syncer_test.go

Comment thread pkg/syncer/server.go Outdated
Comment thread server/server.go Outdated
@okJiang

okJiang commented Jun 11, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@ti-chi-bot ti-chi-bot Bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Jun 11, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/syncer/server.go (1)

264-319: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Publish the committed region count before waiting for the first tick.

The first durable write is delayed by committedRegionCountInterval. If this leader dies during that first second, the stored committed count is still 0, so the new campaign gate can misclassify an unsynced follower as already caught up. Publish once synchronously when RunServer starts, then keep the ticker for refreshes.

Suggested fix
 	keepAliveTicker := time.NewTicker(syncerKeepAliveInterval)
 	shrinkTicker := time.NewTicker(historyBufferShrinkInterval)
 	committedCountTicker := time.NewTicker(committedRegionCountInterval)
 	// -1 forces an initial publish on the first tick.
 	lastPublishedRegionCount := -1
+	publishCommittedRegionCount := func() {
+		if count := s.server.GetBasicCluster().GetTotalRegionCount(); count != lastPublishedRegionCount {
+			if err := s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)); err != nil {
+				log.Warn("failed to persist committed region count", errs.ZapError(err))
+			} else {
+				lastPublishedRegionCount = count
+			}
+		}
+	}
+	publishCommittedRegionCount()
@@
 		case <-committedCountTicker.C:
-			// Publish the region count this leader is serving so that a member
-			// that has not finished a history sync can still decide, after this
-			// leader is gone, whether it is caught up enough to campaign. Only
-			// the leader runs RunServer, so only the leader writes this.
-			if count := s.server.GetBasicCluster().GetTotalRegionCount(); count != lastPublishedRegionCount {
-				if err := s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)); err != nil {
-					log.Warn("failed to persist committed region count", errs.ZapError(err))
-				} else {
-					lastPublishedRegionCount = count
-				}
-			}
+			// Publish the region count this leader is serving so that a member
+			// that has not finished a history sync can still decide, after this
+			// leader is gone, whether it is caught up enough to campaign. Only
+			// the leader runs RunServer, so only the leader writes this.
+			publishCommittedRegionCount()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server.go` around lines 264 - 319, The committed region count
isn't persisted until the first committedCountTicker tick, which can leave
storage at 0 if the leader dies early; call the same save logic once
synchronously before entering the for/select loop: compute count via
s.server.GetBasicCluster().GetTotalRegionCount(), call
s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)),
handle the error like in the ticker branch (log Warn on error, otherwise set
lastPublishedRegionCount = count), and then proceed to start the existing
committedCountTicker for periodic refreshes.
♻️ Duplicate comments (1)
pkg/syncer/server.go (1)

502-510: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore the explicit end-of-history marker after replay.

The client-side state machine only flips historySynced when it sees the empty end-of-history response. These paths return after replaying records, so a fully caught-up follower stays "unsynced" until a later keepalive or region update. If the leader disappears first, that member remains blocked from campaigning even though it already consumed the full history. This regresses the earlier end-of-history marker fix.

Suggested fix
 func (*RegionSyncer) syncHistoryRecordsLocked(startIndex uint64, records []*core.RegionInfo, stream *regionSyncStream) error {
 	for start := 0; start < len(records); start += maxSyncRegionBatchSize {
 		end := min(start+maxSyncRegionBatchSize, len(records))
 		if err := stream.sendStreamIfOpen(buildSyncRegionResponse(startIndex+uint64(start), records[start:end])); err != nil {
 			return err
 		}
 	}
-	return nil
+	return stream.sendStreamIfOpen(&pdpb.SyncRegionResponse{
+		Header:     &pdpb.ResponseHeader{ClusterId: keypath.ClusterID()},
+		StartIndex: startIndex + uint64(len(records)),
+	})
 }
@@
 	if len(records) > 0 {
 		if err := s.syncHistoryRecordsLocked(syncStartIndex, records, syncStream); err != nil {
 			return err
 		}
 		syncStream.advanceSendIndexLocked(len(records))
+	} else {
+		if err := syncStream.sendStreamIfOpen(&pdpb.SyncRegionResponse{
+			Header:     &pdpb.ResponseHeader{ClusterId: keypath.ClusterID()},
+			StartIndex: nextIndex,
+		}); err != nil {
+			return err
+		}
 	}
 	return nil
 }

Also applies to: 575-587

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/syncer/server.go` around lines 502 - 510, The loop in
syncHistoryRecordsLocked sends all batched history chunks but never emits the
explicit empty end-of-history marker the client expects to flip historySynced;
after the for loop return an explicit empty end-of-history response by calling
stream.sendStreamIfOpen with buildSyncRegionResponse for the next index and an
empty records slice (propagate any error), and apply the same fix to the other
similar replay path mentioned so the client receives the empty marker and can
set historySynced.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/syncer/server.go`:
- Around line 264-319: The committed region count isn't persisted until the
first committedCountTicker tick, which can leave storage at 0 if the leader dies
early; call the same save logic once synchronously before entering the
for/select loop: compute count via
s.server.GetBasicCluster().GetTotalRegionCount(), call
s.server.GetStorage().SaveRegionSyncerCommittedRegionCount(uint64(count)),
handle the error like in the ticker branch (log Warn on error, otherwise set
lastPublishedRegionCount = count), and then proceed to start the existing
committedCountTicker for periodic refreshes.

---

Duplicate comments:
In `@pkg/syncer/server.go`:
- Around line 502-510: The loop in syncHistoryRecordsLocked sends all batched
history chunks but never emits the explicit empty end-of-history marker the
client expects to flip historySynced; after the for loop return an explicit
empty end-of-history response by calling stream.sendStreamIfOpen with
buildSyncRegionResponse for the next index and an empty records slice (propagate
any error), and apply the same fix to the other similar replay path mentioned so
the client receives the empty marker and can set historySynced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4a8c0535-6550-491a-a400-447182505657

📥 Commits

Reviewing files that changed from the base of the PR and between 9f73d32 and 9a5c1e6.

📒 Files selected for processing (5)
  • pkg/syncer/client.go
  • pkg/syncer/client_test.go
  • pkg/syncer/history_buffer.go
  • pkg/syncer/history_buffer_test.go
  • pkg/syncer/server.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/syncer/history_buffer_test.go
  • pkg/syncer/history_buffer.go
  • pkg/syncer/client.go

@tharanga

Copy link
Copy Markdown
Author

/retest

1 similar comment
@tharanga

Copy link
Copy Markdown
Author

/retest

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.61789% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.20%. Comparing base (6e3d72f) to head (2f8afc1).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10844      +/-   ##
==========================================
+ Coverage   79.14%   79.20%   +0.05%     
==========================================
  Files         536      541       +5     
  Lines       74033    74566     +533     
==========================================
+ Hits        58594    59057     +463     
- Misses      11311    11347      +36     
- Partials     4128     4162      +34     
Flag Coverage Δ
unittests 79.20% <88.61%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tharanga

Copy link
Copy Markdown
Author

@okJiang thoughts on #10843 and this fix? Appreciate your feedback!

@okJiang

okJiang commented Jun 22, 2026

Copy link
Copy Markdown
Member

Thanks for the PR. I checked the latest master, and the underlying issue still exists: region sync completion is not used as a guard before PD leader campaign, so a newly joined member with empty region metadata can still become leader.

The fix direction looks right, but I think this PR is too broad for #10843 as-is.

The main concern is the 10s grace fallback. With the default timing, other healthy members may only try to recover etcd leadership after roughly 40s, so an unsynced member can still campaign after the 10s grace and become PD leader with empty or partial region metadata. The new regression test observes only a 3s window, so it does not prove the behavior after the grace period.

I suggest splitting this into a smaller first fix:

  1. Keep a sticky "history sync completed" signal.
  2. Gate PD leader campaign on that signal when region storage is enabled and sync has been attempted.
  3. Preserve the bootstrap/no-region path.
  4. Add a test that covers the behavior after the fallback window, or remove/adjust the fallback so an unsynced member does not win before a caught-up member can take over.

I would also avoid adding the new etcd-persisted committed-region-count key in the first fix. The core issue is local election eligibility: whether this member has completed region sync before campaigning. That should preferably be derived from the sync completion signal and local durable sync state, not from a new cluster-level marker.

If we still need a cluster-level etcd marker, I think it should be justified and reviewed separately, because it adds upgrade/rollback semantics and another source of truth.

The leader-transfer target preference/refusal logic also looks like a separate policy change. It may be useful, but I would prefer it in a follow-up PR after the core election guard is fixed.

Also, the branch currently conflicts with master and CI is failing, so it needs a rebase before final review.

tharanga added 2 commits June 22, 2026 11:01
Signed-off-by: tharanga <tharanga@gmail.com>

re-pro of a new PD node acquiring the leadership with no regions
fixing the test to fail without any changes

Signed-off-by: tharanga <tharanga@gmail.com>

server: skip campaign until region syncer history phase done

Signed-off-by: tharanga <tharanga@gmail.com>
Signed-off-by: tharanga <tharanga@gmail.com>
@tharanga tharanga force-pushed the tharanga/new_member_region_catchup branch from 069ce6f to 73f1b58 Compare June 22, 2026 19:04
@tharanga

Copy link
Copy Markdown
Author

Thanks for the PR. I checked the latest master, and the underlying issue still exists: region sync completion is not used as a guard before PD leader campaign, so a newly joined member with empty region metadata can still become leader.

The fix direction looks right, but I think this PR is too broad for #10843 as-is.

The main concern is the 10s grace fallback. With the default timing, other healthy members may only try to recover etcd leadership after roughly 40s, so an unsynced member can still campaign after the 10s grace and become PD leader with empty or partial region metadata. The new regression test observes only a 3s window, so it does not prove the behavior after the grace period.

I suggest splitting this into a smaller first fix:

  1. Keep a sticky "history sync completed" signal.
  2. Gate PD leader campaign on that signal when region storage is enabled and sync has been attempted.
  3. Preserve the bootstrap/no-region path.
  4. Add a test that covers the behavior after the fallback window, or remove/adjust the fallback so an unsynced member does not win before a caught-up member can take over.

I would also avoid adding the new etcd-persisted committed-region-count key in the first fix. The core issue is local election eligibility: whether this member has completed region sync before campaigning. That should preferably be derived from the sync completion signal and local durable sync state, not from a new cluster-level marker.

If we still need a cluster-level etcd marker, I think it should be justified and reviewed separately, because it adds upgrade/rollback semantics and another source of truth.

The leader-transfer target preference/refusal logic also looks like a separate policy change. It may be useful, but I would prefer it in a follow-up PR after the core election guard is fixed.

Also, the branch currently conflicts with master and CI is failing, so it needs a rebase before final review.

@okJiang thanks for the feedback. I've rebased my branch to fix conflicts, and preped to cleanly remove additional changes you pointed out. I will address your comments, and ping you once it's ready to review.

@tharanga tharanga force-pushed the tharanga/new_member_region_catchup branch from 73f1b58 to 007a7dd Compare June 24, 2026 22:02
@tharanga

tharanga commented Jun 25, 2026

Copy link
Copy Markdown
Author

I would also avoid adding the new etcd-persisted committed-region-count key in the first fix. The core issue is local election eligibility: whether this member has completed region sync before campaigning. That should preferably be derived from the sync completion signal and local durable sync state, not from a new cluster-level marker.

If we still need a cluster-level etcd marker, I think it should be justified and reviewed separately, because it adds upgrade/rollback semantics and another source of truth.

@okJiang the behind-case gate is purely local (historySynced), but the bootstrap/empty cluster scenario is provably not locally decidable because in the region-storage mode, the region data is per-node LevelDB with no etcd region count. It looks like stores/bootstrap markers can't distinguish a freshly-bootstrapped cluster from a behind follower. Is there another way to address the bootstrap/empty cluster scenario without etcd-persisted key? (a boolean flag is sufficient)

@tharanga

Copy link
Copy Markdown
Author

/retest

1 similar comment
@tharanga

Copy link
Copy Markdown
Author

/retest

Signed-off-by: tharanga <tharanga@gmail.com>
@ti-chi-bot

ti-chi-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@tharanga: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-2 1933192 link true /test pull-unit-test-next-gen-2
pull-unit-test-next-gen-3 1933192 link true /test pull-unit-test-next-gen-3

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-triage-completed first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

High query error rate when a new member becomes leader without region metadata

2 participants