pkg/unsaferecovery: optimize empty region overlap checks#10639
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a plan-based execution timeout and an optional paranoid overlap check to unsafe recovery; exposes both via API and CLI; migrates internal timing to the new timeout, batches create-empty-region planning with an indexed overlap check, and updates tests and status reporting. ChangesUnsafe Recovery: plan-timeout, paranoid check, and batching
Sequence Diagram(s)sequenceDiagram
autonumber
participant CLI as pd-ctl (Client)
participant HTTP as PD HTTP API
participant Handler as PD Handler
participant Controller as UnsafeRecoveryController
participant Reports as PeerReports/RegionIndex
CLI->>HTTP: POST /admin/unsafe/remove-failed-stores {stores, timeout?, plan-execution-timeout?, disable-paranoid-check?}
HTTP->>Handler: parsePlanExecutionTimeout, parseDisableParanoidCheck
Handler->>Controller: RemoveFailedStoresWithOptions(stores, timeout, autoDetect, options)
Controller->>Reports: buildUpFromReports()
Controller->>Controller: generateCreateEmptyRegionPlan() (batch, use emptyRegionIndex)
Controller->>Reports: apply batched create plans to stores
Controller-->>Handler: return status (includes plan timeout / paranoid info)
Handler-->>HTTP: HTTP response
HTTP-->>CLI: print response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
96da8be to
c1fe103
Compare
4ee7a33 to
ad42324
Compare
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
ad42324 to
6e7529d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/unsaferecovery/unsafe_recovery_controller.go (1)
1314-1385:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winInclude the tail create in the paranoid overlap pass.
emptyRegionsonly collects interior gaps. The final[lastEnd, +∞)create — and the full-range create whennewestRegionTree.size() == 0— is still added afterfindOverlapPeerWithEmptyRegions, so a stale peer that overlaps that tail range will now slip past the paranoid check. Please route that last create through the samecreatePlans/emptyRegionspath so it is checked and recorded consistently.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/unsaferecovery/unsafe_recovery_controller.go` around lines 1314 - 1385, The tail create (the newRegion for [lastEnd, +∞) or full-range when newestRegionTree.size() == 0) is currently added directly to the recovery plan after the paranoid check, so it bypasses the overlap verification and the same bookkeeping used for interior gaps; change the logic so the tail/new full-range region is created via createRegion and appended to createPlans and emptyRegions (just like interior gaps), using the same store selection logic (respecting tiflash exclusion via getRandomStoreID and lastStoreID fallback) before the paranoid findOverlapPeerWithEmptyRegions call, and then let the existing loop that processes createPlans (u.getRecoveryPlan, storeRecoveryPlan.Creates append, u.recordAffectedRegion, u.newlyCreatedRegions) handle the tail region so it is checked and recorded consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/api/unsafe_operation.go`:
- Around line 79-95: The parsing of planExecutionTimeout and
disableParanoidCheck must validate input and return HTTP 400 on bad input
instead of silently using defaults: explicitly check input keys
"plan-execution-timeout"/"plan_execution_timeout" are present and of the correct
numeric type, ensure planExecutionTimeout is non-negative and fits as a
time.Duration (reject NaN/inf/negative), and ensure
"disable-paranoid-check"/"disable_paranoid_check" are booleans if provided; on
any invalid type or value respond with a 400 error from the handler and do not
call rc.GetUnsafeRecoveryController().RemoveFailedStoresWithOptions; then pass
the validated PlanExecutionTimeout and DisableParanoidCheck into
RemoveFailedStoresWithOptions.
---
Outside diff comments:
In `@pkg/unsaferecovery/unsafe_recovery_controller.go`:
- Around line 1314-1385: The tail create (the newRegion for [lastEnd, +∞) or
full-range when newestRegionTree.size() == 0) is currently added directly to the
recovery plan after the paranoid check, so it bypasses the overlap verification
and the same bookkeeping used for interior gaps; change the logic so the
tail/new full-range region is created via createRegion and appended to
createPlans and emptyRegions (just like interior gaps), using the same store
selection logic (respecting tiflash exclusion via getRandomStoreID and
lastStoreID fallback) before the paranoid findOverlapPeerWithEmptyRegions call,
and then let the existing loop that processes createPlans (u.getRecoveryPlan,
storeRecoveryPlan.Creates append, u.recordAffectedRegion, u.newlyCreatedRegions)
handle the tail region so it is checked and recorded consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: faefc0af-8ed8-4cbf-b340-fb84fe7b1454
📒 Files selected for processing (6)
pkg/unsaferecovery/unsafe_recovery_controller.gopkg/unsaferecovery/unsafe_recovery_controller_test.goserver/api/unsafe_operation.gotests/server/api/unsafe_operation_test.gotools/pd-ctl/pdctl/command/unsafe_command.gotools/pd-ctl/tests/unsafe/unsafe_operation_test.go
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
|
/test pull-unit-test-next-gen-2 |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10639 +/- ##
==========================================
+ Coverage 78.96% 79.01% +0.04%
==========================================
Files 532 532
Lines 71883 72078 +195
==========================================
+ Hits 56766 56951 +185
- Misses 11093 11099 +6
- Partials 4024 4028 +4
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
|
/test pull-unit-test-next-gen-3 |
|
Checked the latest review feedback against current HEAD
No additional code change was needed, so I did not add a follow-up commit. |
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
|
Addressed the still-valid review item on |
|
Checked the latest review feedback and current HEAD already contains the timeout validation / alias parsing fixes under review, so I did not add another code change here. I reran the failed GitHub Actions jobs once to classify the remaining red CI. The failures are still in the unrelated
|
|
/retest |
1 similar comment
|
/retest |
|
@v01dstar: adding LGTM is restricted to approvers and reviewers in OWNERS files. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lhy1024, nolouch, v01dstar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest |
1 similar comment
|
/retest |
|
In response to a cherrypick label: new pull request created to branch |
|
In response to a cherrypick label: new pull request created to branch |
What problem does this PR solve?
Issue Number: Close #10638
Unsafe recovery creates empty regions to fill range holes. The paranoid overlap check currently scans every collected peer for each generated empty region, which can make plan generation O(number of holes * number of peer reports). On large clusters this work happens while the unsafe recovery controller lock is held, delaying schedulers/checkers and related unsafe-recovery operations.
PD also re-dispatches a store recovery plan after a fixed 40s execution window. When TiKV spends longer executing create-empty-region plans, PD may repeatedly log
unsafe recovery store recovery plan execution timeout, retryand send duplicate plans.What is changed and how does it work?
Collect newly created empty regions while scanning the newest region tree. These empty regions are generated in key order and do not overlap with each other, so they can be used as a small sorted index. Then scan collected peer reports once and binary-search the empty-region index to detect any overlap.
This changes the paranoid overlap check from O(empty regions * peer reports) to O(newest regions + empty regions + peer reports * log(empty regions)). When there is no generated empty region, the extra check returns immediately. For multiple peer reports of the same region, duplicate same-version same-range reports reuse the first no-overlap result and skip another binary search.
The check uses half-open region range semantics and keeps the existing error path when an overlap is found.
Before/after comparison
For a recovery round with about 1.5M existing regions and 5k generated empty regions:
The exact wall time still depends on key length, CPU, report fanout, the existing newest-region tree scan, and ID allocation for generated regions, so the numbers above are rough estimates rather than a strict production SLA.
This PR also extends the default plan execution timeout from 40s to 60s, and adds request-level options for large unsafe recovery operations:
{ "stores": [1, 2, 3], "timeout": 600, "plan-execution-timeout": 600, "disable-paranoid-check": true }The same options are exposed in
pd-ctl:plan-execution-timeoutcontrols how long PD waits for a store to execute a dispatched recovery plan before re-dispatching it. The default is 60s.disable-paranoid-checkskips the empty-region overlap paranoid check. The default remains enabled.When either option is set, the recovery status output records it, for example
plan execution timeout 10m0sandparanoid check disabled.Check List
Tests
Release note
Summary by CodeRabbit
New Features
Improvements
Tests