Skip to content

pkg/unsaferecovery: optimize empty region overlap checks (#10639)#10944

Open
ti-chi-bot wants to merge 1 commit into
tikv:release-8.1from
ti-chi-bot:cherry-pick-10639-to-release-8.1
Open

pkg/unsaferecovery: optimize empty region overlap checks (#10639)#10944
ti-chi-bot wants to merge 1 commit into
tikv:release-8.1from
ti-chi-bot:cherry-pick-10639-to-release-8.1

Conversation

@ti-chi-bot

Copy link
Copy Markdown
Member

This is an automated cherry-pick of #10639

What problem does this PR solve?

Issue Number: Close #10638

Unsafe recovery creates empty regions to fill range holes. The paranoid overlap check currently scans every collected peer for each generated empty region, which can make plan generation O(number of holes * number of peer reports). On large clusters this work happens while the unsafe recovery controller lock is held, delaying schedulers/checkers and related unsafe-recovery operations.

PD also re-dispatches a store recovery plan after a fixed 40s execution window. When TiKV spends longer executing create-empty-region plans, PD may repeatedly log unsafe recovery store recovery plan execution timeout, retry and send duplicate plans.

What is changed and how does it work?

pkg/unsaferecovery: optimize empty region overlap checks

Collect newly created empty regions while scanning the newest region tree. These empty regions are generated in key order and do not overlap with each other, so they can be used as a small sorted index. Then scan collected peer reports once and binary-search the empty-region index to detect any overlap.

This changes the paranoid overlap check from O(empty regions * peer reports) to O(newest regions + empty regions + peer reports * log(empty regions)). When there is no generated empty region, the extra check returns immediately. For multiple peer reports of the same region, duplicate same-version same-range reports reuse the first no-overlap result and skip another binary search.

The check uses half-open region range semantics and keeps the existing error path when an overlap is found.

Before/after comparison

For a recovery round with about 1.5M existing regions and 5k generated empty regions:

Item Before this PR After this PR
Paranoid check algorithm For every generated empty region, scan every collected initialized peer report. Build an index from generated empty regions, then scan peer reports once and binary-search the empty-region index.
Complexity O(empty regions * peer reports). O(newest regions + empty regions + peer reports * log(empty regions)).
Work for 1.5M regions / 5k holes Estimated from the original data-structure benchmark: about 36 minutes to more than 1 hour locally, and could become several hours in production while holding the unsafe recovery controller lock. Expected paranoid-check cost is sub-second at this scale; remaining wall time mostly depends on the existing newest-region tree scan and ID allocation for generated regions.
Lock impact The O(empty regions * peer reports) scan runs while the unsafe recovery controller lock is held, so large recoveries can delay schedulers, checkers, and related unsafe-recovery handling. The lock is still held during plan generation, but the expensive paranoid-check part is reduced from a several-hour worst-case risk to sub-second scale in the estimate above.

The exact wall time still depends on key length, CPU, report fanout, the existing newest-region tree scan, and ID allocation for generated regions, so the numbers above are rough estimates rather than a strict production SLA.

This PR also extends the default plan execution timeout from 40s to 60s, and adds request-level options for large unsafe recovery operations:

{
  "stores": [1, 2, 3],
  "timeout": 600,
  "plan-execution-timeout": 600,
  "disable-paranoid-check": true
}

The same options are exposed in pd-ctl:

pd-ctl unsafe remove-failed-stores 1,2,3 --timeout 600 --plan-execution-timeout 600 --disable-paranoid-check
  • plan-execution-timeout controls how long PD waits for a store to execute a dispatched recovery plan before re-dispatching it. The default is 60s.
  • disable-paranoid-check skips the empty-region overlap paranoid check. The default remains enabled.

When either option is set, the recovery status output records it, for example plan execution timeout 10m0s and paranoid check disabled.

Check List

Tests

  • Unit test

Release note

Optimize unsafe recovery empty-region plan generation when many regions and gaps exist

Summary by CodeRabbit

  • New Features

    • Configurable plan execution timeout for unsafe recovery operations.
    • Option to disable the paranoid empty-region overlap check.
    • API and CLI support to pass these options for removal operations.
  • Improvements

    • Batched empty-region creation with overlap guarding; tracks created and affected regions.
    • Status output now reports paranoid-check state and configured plan execution timeout.
  • Tests

    • Added validation and parsing tests for timeouts and paranoid-check flags; CLI validation tests.

close tikv#10638

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-8.1 This PR is cherry-picked to release-8.1 from a source PR. labels Jun 25, 2026
@ti-chi-bot

Copy link
Copy Markdown
Member Author

@Connor1996 This PR has conflicts, I have hold it.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@ti-chi-bot

ti-chi-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ti-chi-bot: ## If you want to know how to resolve it, please read the guide in TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2ea6cf5e-bdb3-4690-9d02-17ed8e52c365

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot

ti-chi-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ti-chi-bot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-build-release-8.1 271bb18 link true /test pull-build-release-8.1

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

"net/http"
"time"

<<<<<<< HEAD

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz fix conflict

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 26, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

[LGTM Timeline notifier]

Timeline:

  • 2026-06-26 03:11:47.621096914 +0000 UTC m=+262791.284322417: ☑️ agreed by lhy1024.

@ti-chi-bot ti-chi-bot Bot added the approved label Jun 26, 2026
@lhy1024

lhy1024 commented Jun 26, 2026

Copy link
Copy Markdown
Member

/approve cancel

@ti-chi-bot

ti-chi-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from connor1996. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added cherry-pick-approved Cherry pick PR approved by release team. and removed approved do-not-merge/cherry-pick-not-approved labels Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick-approved Cherry pick PR approved by release team. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. type/cherry-pick-for-release-8.1 This PR is cherry-picked to release-8.1 from a source PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants