checker: prefer missing TiFlash learner before offline peer#10928
Conversation
Signed-off-by: Ryan Leung <rleungx@gmail.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughRule checker now tries to add a missing TiFlash learner before replacing an offline peer. The add-peer path was refactored around a shared builder, a TiFlash-specific counter handle was added, and tests cover the updated operator selection order. ChangesTiFlash Learner Priority in Offline Peer Handling
Sequence Diagram(s)sequenceDiagram
participant fixRulePeer
participant fixMissingTiFlashLearnerPeer
participant addTiFlashLearnerPeer
participant addRulePeerWithOptions
participant replaceUnexpectedRulePeer
fixRulePeer->>fixMissingTiFlashLearnerPeer: inspect offline peer case first
fixMissingTiFlashLearnerPeer->>addTiFlashLearnerPeer: add missing TiFlash learner
addTiFlashLearnerPeer->>addRulePeerWithOptions: select target store
addRulePeerWithOptions-->>addTiFlashLearnerPeer: operator or ErrNoStoreToAdd
addTiFlashLearnerPeer-->>fixRulePeer: operator or nil
alt TiFlash operator returned
fixRulePeer-->>fixRulePeer: return add-rule-peer
else no TiFlash operator
fixRulePeer->>replaceUnexpectedRulePeer: replace offline peer
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/schedule/checker/rule_checker.go (1)
169-178: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider distinguishing the expected
ErrNoStoreToAddfrom unexpected errors.
fixMissingTiFlashLearnerPeerreturnserrs.ErrNoStoreToAddon the no-store path, which is the expected "skip and fall back to offline replacement" signal. Right now any error (including a genuineCreateAddPeerOperatorfailure) is logged at the sameDebuglevel and silently dropped. Usingerrors.Iswould let you keep the quiet fall-through for the expected case while surfacing real failures at a higher level.♻️ Optional: discriminate the expected sentinel
if c.isOfflinePeer(peer) { op, err := c.fixMissingTiFlashLearnerPeer(region, fit) - if err != nil { - log.Debug("fail to fix missing TiFlash learner peer before replacing offline peer", errs.ZapError(err)) + if err != nil && !errors.Is(err, errs.ErrNoStoreToAdd) { + log.Warn("fail to fix missing TiFlash learner peer before replacing offline peer", errs.ZapError(err)) } else if op != nil { return op, nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/schedule/checker/rule_checker.go` around lines 169 - 178, The error handling in the block after calling fixMissingTiFlashLearnerPeer does not distinguish between the expected ErrNoStoreToAdd error (indicating no suitable store is available and the function should silently fall through to offline replacement) and unexpected errors (genuine failures in CreateAddPeerOperator). Use errors.Is to check if err is specifically errs.ErrNoStoreToAdd - if it is, allow the code to continue without logging, but if it's a different error, log it at a higher level (Warn or Error) instead of Debug to properly surface real failures while keeping expected cases quiet.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/schedule/checker/rule_checker.go`:
- Around line 169-178: The error handling in the block after calling
fixMissingTiFlashLearnerPeer does not distinguish between the expected
ErrNoStoreToAdd error (indicating no suitable store is available and the
function should silently fall through to offline replacement) and unexpected
errors (genuine failures in CreateAddPeerOperator). Use errors.Is to check if
err is specifically errs.ErrNoStoreToAdd - if it is, allow the code to continue
without logging, but if it's a different error, log it at a higher level (Warn
or Error) instead of Debug to properly surface real failures while keeping
expected cases quiet.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1c314507-143b-42c6-982a-1f412459a30d
📒 Files selected for processing (2)
pkg/schedule/checker/rule_checker.gopkg/schedule/checker/rule_checker_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10928 +/- ##
==========================================
+ Coverage 79.15% 79.24% +0.08%
==========================================
Files 540 541 +1
Lines 74801 75112 +311
==========================================
+ Hits 59212 59520 +308
- Misses 11405 11410 +5
+ Partials 4184 4182 -2
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
/retest |
Signed-off-by: Ryan Leung <rleungx@gmail.com>
|
/retest |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tso/keyspace_group_manager.go (1)
1249-1283: 🩺 Stability & Availability | 🟡 Minor
kgm.httpClienthas no timeout set, allowing hung DELETE requests to indefinitely pin the finishing marker.The
httpClientpassed toKeyspaceGroupManageris created byCreateHTTPClient()inpkg/utils/etcdutil/etcdutil.go:327, which creates&http.Client{}with no timeout set. The function even includes a FIXME comment acknowledging this limitation. Sinceapiutil.DoDeletebuilds requests without a context deadline, any hung DELETE infinishSplitKeyspaceGroupwill stall indefinitely, leaving the finishing marker set and silently suppressing subsequent finish retries.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tso/keyspace_group_manager.go` around lines 1249 - 1283, The finishSplitKeyspaceGroup function calls apiutil.DoDelete with an httpClient that has no timeout configured, allowing hung DELETE requests to block indefinitely and leave the finishing marker set. Fix this by adding a timeout context to the apiutil.DoDelete call. Create a context with an appropriate timeout (using context.WithTimeout) before invoking apiutil.DoDelete to ensure the request completes or fails within a bounded time period, preventing indefinite blocking of the keyspace group split completion process.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tso/keyspace_group_manager_test.go`:
- Around line 129-136: The defer order in the test creates a potential deadlock:
if the test fails while the handler is waiting on releaseResponse, defer
server.Close() executes before the deferred channel close due to LIFO defer
execution order, causing server.Close() to block on the in-flight request. Move
the defer closeReleaseResponse.Do(func() { close(releaseResponse) }) statement
to after the defer server.Close() statement so that the releaseResponse channel
is closed before the server is closed when defers unwind in reverse order.
---
Outside diff comments:
In `@pkg/tso/keyspace_group_manager.go`:
- Around line 1249-1283: The finishSplitKeyspaceGroup function calls
apiutil.DoDelete with an httpClient that has no timeout configured, allowing
hung DELETE requests to block indefinitely and leave the finishing marker set.
Fix this by adding a timeout context to the apiutil.DoDelete call. Create a
context with an appropriate timeout (using context.WithTimeout) before invoking
apiutil.DoDelete to ensure the request completes or fails within a bounded time
period, preventing indefinite blocking of the keyspace group split completion
process.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 77957973-9104-4035-a072-45c2103d6f8f
📒 Files selected for processing (2)
pkg/tso/keyspace_group_manager.gopkg/tso/keyspace_group_manager_test.go
|
/retest |
2 similar comments
|
/retest |
|
/retest |
| } | ||
| for _, constraint := range rule.LabelConstraints { | ||
| if constraint.Key == core.EngineKey && constraint.Op == placement.In && | ||
| len(constraint.Values) == 1 && constraint.Values[0] == core.EngineTiFlash { |
There was a problem hiding this comment.
Does it need to ignore if the engine is EngineTiFlashCompute?
There was a problem hiding this comment.
Yes, TiFlash compute is intentionally excluded. The priority path only matches classic TiFlash learner rules (engine in [tiflash]), and TestDoNotPreferTiFlashComputeLearnerRuleOverOfflinePeer covers the fallback behavior.
| return c.addRulePeerWithOptions(region, fit, rf, true, true) | ||
| } | ||
|
|
||
| func (c *RuleChecker) addRulePeerWithOptions(region *core.RegionInfo, fit *placement.RegionFit, rf *placement.RuleFit, updateFilterState, allowReplacement bool) (*operator.Operator, error) { |
There was a problem hiding this comment.
I see only the addRulePeer call. Why does it need two more args?
There was a problem hiding this comment.
This was refactored in the latest head: addRulePeer keeps the original behavior and TiFlash uses a separate addTiFlashLearnerPeer wrapper over addRulePeerWithOptions, so the extra options only apply to the special fallback path.
| if err != nil { | ||
| log.Debug("fail to fix missing TiFlash learner peer before replacing offline peer", errs.ZapError(err)) | ||
| } else if op != nil { | ||
| return op, nil |
There was a problem hiding this comment.
Added ruleCheckerAddTiFlashLearnerCounter for this path.
| } | ||
| } | ||
| if c.isOfflinePeer(peer) { | ||
| op, err := c.fixMissingTiFlashLearnerPeer(region, fit) |
There was a problem hiding this comment.
Do we only need to fix the missing TiFlash learner peer while offlining? What about when it is down?
There was a problem hiding this comment.
For down peers, we keep the existing fast replacement behavior and do not prioritize adding the TiFlash learner first. TestDoNotPreferTiFlashLearnerOverDownPeer covers this.
Signed-off-by: Ryan Leung <rleungx@gmail.com>
|
/retest |
1 similar comment
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JmPotato, lhy1024 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What problem does this PR solve?
Issue Number: close #10924
When a region both has an offline TiKV peer and misses a TiFlash learner, rule checker may replace the offline TiKV peer first because the default TiKV placement rule is checked before the TiFlash learner rule. Since a region can only have one operator at a time, this can delay TiFlash learner creation during large offline-store operations.
What is changed and how does it work?
Prefer directly adding a missing TiFlash learner before replacing an offline peer in rule checker.
The priority change is intentionally narrow:
After the TiFlash learner is added, the next rule-check round can still replace the offline TiKV peer.
Check List
Tests
Commands:
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests