pkg/unsaferecovery: optimize empty region overlap checks (#10639) #10944
ti-chi-bot wants to merge 1 commit
close tikv#10638

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@Connor1996 This PR has conflicts, so I have put it on hold.
/approve cancel

[APPROVALNOTIFIER] This PR is NOT APPROVED
This is an automated cherry-pick of #10639
What problem does this PR solve?
Issue Number: Close #10638
Unsafe recovery creates empty regions to fill range holes. The paranoid overlap check currently scans every collected peer for each generated empty region, which can make plan generation O(number of holes * number of peer reports). On large clusters this work happens while the unsafe recovery controller lock is held, delaying schedulers/checkers and related unsafe-recovery operations.
PD also re-dispatches a store recovery plan after a fixed 40s execution window. When TiKV spends longer executing create-empty-region plans, PD may repeatedly log `unsafe recovery store recovery plan execution timeout, retry` and send duplicate plans.
What is changed and how does it work?
Collect newly created empty regions while scanning the newest region tree. These empty regions are generated in key order and do not overlap with each other, so they can be used as a small sorted index. Then scan collected peer reports once and binary-search the empty-region index to detect any overlap.
This changes the paranoid overlap check from O(empty regions * peer reports) to O(newest regions + empty regions + peer reports * log(empty regions)). When there is no generated empty region, the extra check returns immediately. For multiple peer reports of the same region, duplicate same-version same-range reports reuse the first no-overlap result and skip another binary search.
The check uses half-open region range semantics and keeps the existing error path when an overlap is found.
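The indexed check described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `keyRange`, `peerReport`, `reportKey`, `findOverlap`, and `checkReports` are hypothetical names, and an empty end key is assumed to mean "unbounded".

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// keyRange is a hypothetical half-open range [Start, End).
// Assumption: an empty End means the range is unbounded on the right.
type keyRange struct {
	Start, End []byte
}

// peerReport is a hypothetical, simplified view of one collected peer report.
type peerReport struct {
	RegionID uint64
	Version  uint64
	Range    keyRange
}

// reportKey identifies reports with the same region, version, and range.
type reportKey struct {
	regionID, version uint64
	start, end        string
}

// overlaps reports whether two half-open ranges intersect.
// Ranges that merely touch (a.End == b.Start) do not overlap.
func overlaps(a, b keyRange) bool {
	aBeforeB := len(a.End) != 0 && bytes.Compare(a.End, b.Start) <= 0
	bBeforeA := len(b.End) != 0 && bytes.Compare(b.End, a.Start) <= 0
	return !aBeforeB && !bBeforeA
}

// findOverlap binary-searches empties, which must be sorted by Start and
// pairwise disjoint, for a range that overlaps r.
func findOverlap(empties []keyRange, r keyRange) (keyRange, bool) {
	// Find the first empty range whose end lies strictly after r.Start.
	// Earlier ranges end at or before r.Start, so they cannot overlap r.
	i := sort.Search(len(empties), func(i int) bool {
		e := empties[i]
		return len(e.End) == 0 || bytes.Compare(e.End, r.Start) > 0
	})
	// Later ranges start even further right, so only candidate i matters.
	if i < len(empties) && overlaps(empties[i], r) {
		return empties[i], true
	}
	return keyRange{}, false
}

// checkReports scans the peer reports once and returns an error on the
// first overlap with a generated empty range.
func checkReports(empties []keyRange, reports []peerReport) error {
	if len(empties) == 0 {
		return nil // nothing was generated, so there is nothing to check
	}
	seen := make(map[reportKey]struct{}, len(reports))
	for _, rp := range reports {
		k := reportKey{rp.RegionID, rp.Version, string(rp.Range.Start), string(rp.Range.End)}
		if _, ok := seen[k]; ok {
			continue // identical report already passed the check
		}
		if e, found := findOverlap(empties, rp.Range); found {
			return fmt.Errorf("region %d [%q, %q) overlaps empty region [%q, %q)",
				rp.RegionID, rp.Range.Start, rp.Range.End, e.Start, e.End)
		}
		seen[k] = struct{}{}
	}
	return nil
}
```

With empty ranges `[b, d)` and `[m, p)`, a report covering `[d, m)` passes because touching boundaries do not count as overlap under half-open semantics, while a report covering `[c, e)` is rejected. Because the empty ranges are sorted and disjoint, each lookup needs to inspect only one candidate after the binary search.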
Before/after comparison
For a recovery round with about 1.5M existing regions and 5k generated empty regions:
The exact wall time still depends on key length, CPU, report fanout, the existing newest-region tree scan, and ID allocation for generated regions, so the numbers above are rough estimates rather than a strict production SLA.
This PR also extends the default plan execution timeout from 40s to 60s, and adds request-level options for large unsafe recovery operations:

    { "stores": [1, 2, 3], "timeout": 600, "plan-execution-timeout": 600, "disable-paranoid-check": true }

The same options are exposed in pd-ctl:
- `plan-execution-timeout` controls how long PD waits for a store to execute a dispatched recovery plan before re-dispatching it. The default is 60s.
- `disable-paranoid-check` skips the empty-region overlap paranoid check. The default remains enabled.

When either option is set, the recovery status output records it, for example `plan execution timeout 10m0s` and `paranoid check disabled`.
Check List
Tests
Release note