Skip to content

checker: prefer missing TiFlash learner before offline peer#10928

Merged
ti-chi-bot[bot] merged 3 commits into
tikv:masterfrom
rleungx:fix-10924
Jun 29, 2026
Merged

checker: prefer missing TiFlash learner before offline peer#10928
ti-chi-bot[bot] merged 3 commits into
tikv:masterfrom
rleungx:fix-10924

Conversation

@rleungx

@rleungx rleungx commented Jun 23, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #10924

When a region both has an offline TiKV peer and misses a TiFlash learner, rule checker may replace the offline TiKV peer first because the default TiKV placement rule is checked before the TiFlash learner rule. Since a region can only have one operator at a time, this can delay TiFlash learner creation during large offline-store operations.

What is changed and how does it work?

Prefer directly adding a missing TiFlash learner before replacing an offline peer in rule checker.

The priority change is intentionally narrow:

  • it only applies to offline peers, not down peers;
  • it only applies to learner rules constrained to engine=tiflash;
  • it only takes effect when a direct TiFlash learner add operator can be created;
  • it does not use the existing swap-fit fallback before offline peer replacement.

After the TiFlash learner is added, the next rule-check round can still replace the offline TiKV peer.

Check List

Tests

  • Unit test

Commands:

go test ./pkg/schedule/checker -run 'TestRuleCheckerTestSuite/Test(PreferAddTiFlashLearnerOverOfflinePeer|DoNotPreferTiFlashSwapFitOverOfflinePeer|DoNotPreferWideEngineLearnerRuleOverOfflinePeer|FixOfflinePeer)$' -count=1
go test ./pkg/schedule/checker -count=1

Release note

Improve PD rule checker to prioritize adding missing TiFlash learner peers before replacing offline TiKV peers, reducing TiFlash learner creation delays during large offline-store operations.

Summary by CodeRabbit

  • New Features

    • Improved scheduling decisions to prioritize adding missing TiFlash learner replicas before replacing offline peers.
    • Added tracking for a new TiFlash learner-related scheduling event.
  • Bug Fixes

    • Refined peer-repair behavior so offline or down peers are handled more consistently with TiFlash learner placement constraints.
    • Reduced unnecessary swap/replacement actions when TiFlash learner constraints are involved.
  • Tests

    • Expanded coverage with new cases for TiFlash learner prioritization and down/offline peer handling, including ensuring regions aren’t added to pending lists when not needed.

Signed-off-by: Ryan Leung <rleungx@gmail.com>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4fdec4c5-87a2-4193-91ab-6ea9985c7cc6

📥 Commits

Reviewing files that changed from the base of the PR and between e6fb5eb and 5036aca.

📒 Files selected for processing (3)
  • pkg/schedule/checker/metrics.go
  • pkg/schedule/checker/rule_checker.go
  • pkg/schedule/checker/rule_checker_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/schedule/checker/metrics.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/schedule/checker/rule_checker_test.go

📝 Walkthrough

Walkthrough

Rule checker now tries to add a missing TiFlash learner before replacing an offline peer. The add-peer path was refactored around a shared builder, a TiFlash-specific counter handle was added, and tests cover the updated operator selection order.

Changes

TiFlash Learner Priority in Offline Peer Handling

Layer / File(s) Summary
TiFlash learner helpers and add-peer builder
pkg/schedule/checker/rule_checker.go, pkg/schedule/checker/metrics.go
Adds TiFlash learner rule detection, refactors add-peer operator creation through addRulePeerWithOptions, and adds the cached add-tiflash-learner counter handle.
Offline-peer handling priority
pkg/schedule/checker/rule_checker.go
fixRulePeer tries TiFlash learner repair before replaceUnexpectedRulePeer when an offline peer is found.
TiFlash priority operator-selection tests
pkg/schedule/checker/rule_checker_test.go
Adds tests for TiFlash learner priority over offline peers, the no-preference cases for existing TiFlash or wide/TiFlash-compute learner rules, and the down-peer case.

Sequence Diagram(s)

sequenceDiagram
  participant fixRulePeer
  participant fixMissingTiFlashLearnerPeer
  participant addTiFlashLearnerPeer
  participant addRulePeerWithOptions
  participant replaceUnexpectedRulePeer

  fixRulePeer->>fixMissingTiFlashLearnerPeer: inspect offline peer case first
  fixMissingTiFlashLearnerPeer->>addTiFlashLearnerPeer: add missing TiFlash learner
  addTiFlashLearnerPeer->>addRulePeerWithOptions: select target store
  addRulePeerWithOptions-->>addTiFlashLearnerPeer: operator or ErrNoStoreToAdd
  addTiFlashLearnerPeer-->>fixRulePeer: operator or nil
  alt TiFlash operator returned
    fixRulePeer-->>fixRulePeer: return add-rule-peer
  else no TiFlash operator
    fixRulePeer->>replaceUnexpectedRulePeer: replace offline peer
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

A bunny saw a TiFlash glow,
and hopped where offline pebbles go.
First add the learner, then make room,
and test the path from bloom to boom. 🐰

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: prioritizing missing TiFlash learner creation before offline-peer replacement.
Description check ✅ Passed The description covers the problem, the change, tests, issue linkage, and release note, with only non-critical template sections omitted.
Linked Issues check ✅ Passed The code changes match #10924 by adding missing TiFlash learners first for offline-peer cases and leaving down-peer behavior unchanged.
Out of Scope Changes check ✅ Passed No clearly unrelated changes are present; the metric addition and tests support the TiFlash learner priority fix.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 23, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
pkg/schedule/checker/rule_checker.go (1)

169-178: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider distinguishing the expected ErrNoStoreToAdd from unexpected errors.

fixMissingTiFlashLearnerPeer returns errs.ErrNoStoreToAdd on the no-store path, which is the expected "skip and fall back to offline replacement" signal. Right now any error (including a genuine CreateAddPeerOperator failure) is logged at the same Debug level and silently dropped. Using errors.Is would let you keep the quiet fall-through for the expected case while surfacing real failures at a higher level.

♻️ Optional: discriminate the expected sentinel
 		if c.isOfflinePeer(peer) {
 			op, err := c.fixMissingTiFlashLearnerPeer(region, fit)
-			if err != nil {
-				log.Debug("fail to fix missing TiFlash learner peer before replacing offline peer", errs.ZapError(err))
+			if err != nil && !errors.Is(err, errs.ErrNoStoreToAdd) {
+				log.Warn("fail to fix missing TiFlash learner peer before replacing offline peer", errs.ZapError(err))
 			} else if op != nil {
 				return op, nil
 			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/schedule/checker/rule_checker.go` around lines 169 - 178, The error
handling in the block after calling fixMissingTiFlashLearnerPeer does not
distinguish between the expected ErrNoStoreToAdd error (indicating no suitable
store is available and the function should silently fall through to offline
replacement) and unexpected errors (genuine failures in CreateAddPeerOperator).
Use errors.Is to check if err is specifically errs.ErrNoStoreToAdd - if it is,
allow the code to continue without logging, but if it's a different error, log
it at a higher level (Warn or Error) instead of Debug to properly surface real
failures while keeping expected cases quiet.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/schedule/checker/rule_checker.go`:
- Around line 169-178: The error handling in the block after calling
fixMissingTiFlashLearnerPeer does not distinguish between the expected
ErrNoStoreToAdd error (indicating no suitable store is available and the
function should silently fall through to offline replacement) and unexpected
errors (genuine failures in CreateAddPeerOperator). Use errors.Is to check if
err is specifically errs.ErrNoStoreToAdd - if it is, allow the code to continue
without logging, but if it's a different error, log it at a higher level (Warn
or Error) instead of Debug to properly surface real failures while keeping
expected cases quiet.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1c314507-143b-42c6-982a-1f412459a30d

📥 Commits

Reviewing files that changed from the base of the PR and between 4ae1408 and cb03756.

📒 Files selected for processing (2)
  • pkg/schedule/checker/rule_checker.go
  • pkg/schedule/checker/rule_checker_test.go

@codecov

codecov Bot commented Jun 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 74.07407% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.24%. Comparing base (4ae1408) to head (5036aca).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10928      +/-   ##
==========================================
+ Coverage   79.15%   79.24%   +0.08%     
==========================================
  Files         540      541       +1     
  Lines       74801    75112     +311     
==========================================
+ Hits        59212    59520     +308     
- Misses      11405    11410       +5     
+ Partials     4184     4182       -2     
Flag Coverage Δ
unittests 79.24% <74.07%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 23, 2026
@rleungx

rleungx commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

/retest

Signed-off-by: Ryan Leung <rleungx@gmail.com>
@rleungx

rleungx commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

/retest

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 23, 2026
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 23, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/tso/keyspace_group_manager.go (1)

1249-1283: 🩺 Stability & Availability | 🟡 Minor

kgm.httpClient has no timeout set, allowing hung DELETE requests to indefinitely pin the finishing marker.

The httpClient passed to KeyspaceGroupManager is created by CreateHTTPClient() in pkg/utils/etcdutil/etcdutil.go:327, which creates &http.Client{} with no timeout set. The function even includes a FIXME comment acknowledging this limitation. Since apiutil.DoDelete builds requests without a context deadline, any hung DELETE in finishSplitKeyspaceGroup will stall indefinitely, leaving the finishing marker set and silently suppressing subsequent finish retries.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tso/keyspace_group_manager.go` around lines 1249 - 1283, The
finishSplitKeyspaceGroup function calls apiutil.DoDelete with an httpClient that
has no timeout configured, allowing hung DELETE requests to block indefinitely
and leave the finishing marker set. Fix this by adding a timeout context to the
apiutil.DoDelete call. Create a context with an appropriate timeout (using
context.WithTimeout) before invoking apiutil.DoDelete to ensure the request
completes or fails within a bounded time period, preventing indefinite blocking
of the keyspace group split completion process.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tso/keyspace_group_manager_test.go`:
- Around line 129-136: The defer order in the test creates a potential deadlock:
if the test fails while the handler is waiting on releaseResponse, defer
server.Close() executes before the deferred channel close due to LIFO defer
execution order, causing server.Close() to block on the in-flight request. Move
the defer closeReleaseResponse.Do(func() { close(releaseResponse) }) statement
to after the defer server.Close() statement so that the releaseResponse channel
is closed before the server is closed when defers unwind in reverse order.

---

Outside diff comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 1249-1283: The finishSplitKeyspaceGroup function calls
apiutil.DoDelete with an httpClient that has no timeout configured, allowing
hung DELETE requests to block indefinitely and leave the finishing marker set.
Fix this by adding a timeout context to the apiutil.DoDelete call. Create a
context with an appropriate timeout (using context.WithTimeout) before invoking
apiutil.DoDelete to ensure the request completes or fails within a bounded time
period, preventing indefinite blocking of the keyspace group split completion
process.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 77957973-9104-4035-a072-45c2103d6f8f

📥 Commits

Reviewing files that changed from the base of the PR and between 18db7a8 and 8ffeed7.

📒 Files selected for processing (2)
  • pkg/tso/keyspace_group_manager.go
  • pkg/tso/keyspace_group_manager_test.go

Comment thread pkg/tso/keyspace_group_manager_test.go Outdated
@rleungx

rleungx commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

/retest

2 similar comments
@rleungx

rleungx commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

/retest

@rleungx

rleungx commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

/retest

@rleungx rleungx requested review from JmPotato and bufferflies June 23, 2026 10:04
@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jun 25, 2026
}
for _, constraint := range rule.LabelConstraints {
if constraint.Key == core.EngineKey && constraint.Op == placement.In &&
len(constraint.Values) == 1 && constraint.Values[0] == core.EngineTiFlash {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it need to ignore if the engine is EngineTiFlashCompute?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, TiFlash compute is intentionally excluded. The priority path only matches classic TiFlash learner rules (engine in [tiflash]), and TestDoNotPreferTiFlashComputeLearnerRuleOverOfflinePeer covers the fallback behavior.

return c.addRulePeerWithOptions(region, fit, rf, true, true)
}

func (c *RuleChecker) addRulePeerWithOptions(region *core.RegionInfo, fit *placement.RegionFit, rf *placement.RuleFit, updateFilterState, allowReplacement bool) (*operator.Operator, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see only the addRulePeer call. Why does it need two more args?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was refactored in the latest head: addRulePeer keeps the original behavior and TiFlash uses a separate addTiFlashLearnerPeer wrapper over addRulePeerWithOptions, so the extra options only apply to the special fallback path.

if err != nil {
log.Debug("fail to fix missing TiFlash learner peer before replacing offline peer", errs.ZapError(err))
} else if op != nil {
return op, nil

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a counter here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added ruleCheckerAddTiFlashLearnerCounter for this path.

}
}
if c.isOfflinePeer(peer) {
op, err := c.fixMissingTiFlashLearnerPeer(region, fit)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we only need to fix the missing TiFlash learner peer while offlining? What about when it is down?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For down peers, we keep the existing fast replacement behavior and do not prioritize adding the TiFlash learner first. TestDoNotPreferTiFlashLearnerOverDownPeer covers this.

@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 26, 2026
Signed-off-by: Ryan Leung <rleungx@gmail.com>
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 26, 2026
@rleungx

rleungx commented Jun 26, 2026

Copy link
Copy Markdown
Member Author

/retest

1 similar comment
@rleungx

rleungx commented Jun 26, 2026

Copy link
Copy Markdown
Member Author

/retest

@ti-chi-bot ti-chi-bot Bot added the lgtm label Jun 29, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JmPotato, lhy1024

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 29, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

[LGTM Timeline notifier]

Timeline:

  • 2026-06-25 08:31:05.380748832 +0000 UTC m=+195549.043974345: ☑️ agreed by lhy1024.
  • 2026-06-29 05:56:28.381257254 +0000 UTC m=+18930.081636687: ☑️ agreed by JmPotato.

@ti-chi-bot ti-chi-bot Bot merged commit 6f795b1 into tikv:master Jun 29, 2026
40 of 43 checks passed
@rleungx rleungx deleted the fix-10924 branch June 29, 2026 06:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rule checker may delay TiFlash learner creation when many TiKV peers are offline

4 participants