Skip to content

pkg/unsaferecovery: optimize empty region overlap checks#10639

Merged
ti-chi-bot[bot] merged 6 commits into
tikv:masterfrom
Connor1996:codex/unsafe-recovery-overlap-index
May 12, 2026
Merged

pkg/unsaferecovery: optimize empty region overlap checks#10639
ti-chi-bot[bot] merged 6 commits into
tikv:masterfrom
Connor1996:codex/unsafe-recovery-overlap-index

Conversation

@Connor1996

@Connor1996 Connor1996 commented May 5, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: Close #10638

Unsafe recovery creates empty regions to fill range holes. The paranoid overlap check currently scans every collected peer for each generated empty region, which can make plan generation O(number of holes * number of peer reports). On large clusters this work happens while the unsafe recovery controller lock is held, delaying schedulers/checkers and related unsafe-recovery operations.

PD also re-dispatches a store recovery plan after a fixed 40s execution window. When TiKV spends longer executing create-empty-region plans, PD may repeatedly log unsafe recovery store recovery plan execution timeout, retry and send duplicate plans.

What is changed and how does it work?

pkg/unsaferecovery: optimize empty region overlap checks

Collect newly created empty regions while scanning the newest region tree. These empty regions are generated in key order and do not overlap with each other, so they can be used as a small sorted index. Then scan collected peer reports once and binary-search the empty-region index to detect any overlap.

This changes the paranoid overlap check from O(empty regions * peer reports) to O(newest regions + empty regions + peer reports * log(empty regions)). When there is no generated empty region, the extra check returns immediately. For multiple peer reports of the same region, duplicate same-version same-range reports reuse the first no-overlap result and skip another binary search.

The check uses half-open region range semantics and keeps the existing error path when an overlap is found.

Before/after comparison

For a recovery round with about 1.5M existing regions and 5k generated empty regions:

Item Before this PR After this PR
Paranoid check algorithm For every generated empty region, scan every collected initialized peer report. Build an index from generated empty regions, then scan peer reports once and binary-search the empty-region index.
Complexity O(empty regions * peer reports). O(newest regions + empty regions + peer reports * log(empty regions)).
Work for 1.5M regions / 5k holes Estimated from the original data-structure benchmark: about 36 minutes to more than 1 hour locally, and could become several hours in production while holding the unsafe recovery controller lock. Expected paranoid-check cost is sub-second at this scale; remaining wall time mostly depends on the existing newest-region tree scan and ID allocation for generated regions.
Lock impact The O(empty regions * peer reports) scan runs while the unsafe recovery controller lock is held, so large recoveries can delay schedulers, checkers, and related unsafe-recovery handling. The lock is still held during plan generation, but the expensive paranoid-check part is reduced from a several-hour worst-case risk to sub-second scale in the estimate above.

The exact wall time still depends on key length, CPU, report fanout, the existing newest-region tree scan, and ID allocation for generated regions, so the numbers above are rough estimates rather than a strict production SLA.

This PR also extends the default plan execution timeout from 40s to 60s, and adds request-level options for large unsafe recovery operations:

{
  "stores": [1, 2, 3],
  "timeout": 600,
  "plan-execution-timeout": 600,
  "disable-paranoid-check": true
}

The same options are exposed in pd-ctl:

pd-ctl unsafe remove-failed-stores 1,2,3 --timeout 600 --plan-execution-timeout 600 --disable-paranoid-check
  • plan-execution-timeout controls how long PD waits for a store to execute a dispatched recovery plan before re-dispatching it. The default is 60s.
  • disable-paranoid-check skips the empty-region overlap paranoid check. The default remains enabled.

When either option is set, the recovery status output records it, for example plan execution timeout 10m0s and paranoid check disabled.

Check List

Tests

  • Unit test

Release note

Optimize unsafe recovery empty-region plan generation when many regions and gaps exist

Summary by CodeRabbit

  • New Features

    • Configurable plan execution timeout for unsafe recovery operations.
    • Option to disable the paranoid empty-region overlap check.
    • API and CLI support to pass these options for removal operations.
  • Improvements

    • Batched empty-region creation with overlap guarding; tracks created and affected regions.
    • Status output now reports paranoid-check state and configured plan execution timeout.
  • Tests

    • Added validation and parsing tests for timeouts and paranoid-check flags; CLI validation tests.

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. labels May 5, 2026
@coderabbitai

coderabbitai Bot commented May 5, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a plan-based execution timeout and an optional paranoid overlap check to unsafe recovery; exposes both via API and CLI; migrates internal timing to the new timeout, batches create-empty-region planning with an indexed overlap check, and updates tests and status reporting.

Changes

Unsafe Recovery: plan-timeout, paranoid check, and batching

Layer / File(s) Summary
Data Shape / API
pkg/unsaferecovery/unsafe_recovery_controller.go, server/api/unsafe_operation.go
Add RemoveFailedStoresOptions{PlanExecutionTimeout time.Duration, DisableParanoidCheck bool} and RemoveFailedStoresWithOptions(...). RemoveFailedStores delegates to the options-based API. Add defaultPlanExecutionTimeout and controller fields planExecutionTimeout, disableParanoidCheck.
Core Behavior
pkg/unsaferecovery/unsafe_recovery_controller.go
Controller applies inbound options to state; per-store plan expiry and global timeout extension use planExecutionTimeout instead of storeRequestInterval. Status text reports configured timeout and paranoid-check state.
Algorithm / Performance
pkg/unsaferecovery/unsafe_recovery_controller.go
Rework create-empty-region flow: introduce createPlan batching and emptyRegions collection; add emptyRegionIndex and overlap helpers; perform paranoid overlap checks using the index (gated by disableParanoidCheck) and batch-apply creates.
Controller Wiring / Error Paths
pkg/unsaferecovery/unsafe_recovery_controller.go
Use planExecutionTimeout when extending global deadlines on errors and when setting per-store plan expiries.
API Parsing / Validation
server/api/unsafe_operation.go, server/api/unsafe_operation_test.go, tests/server/api/unsafe_operation_test.go
Parse and validate plan-execution-timeout (and alias), parse disable-paranoid-check (and alias), validate bounds, and pass options into RemoveFailedStoresWithOptions. Tests added/updated for parsing and request behavior.
CLI / Client
tools/pd-ctl/pdctl/command/unsafe_command.go, tools/pd-ctl/tests/unsafe/unsafe_operation_test.go, tools/pd-ctl/pdctl/command/unsafe_command_test.go
Add --plan-execution-timeout and --disable-paranoid-check flags; validate timeout bounds (max constant); include flags in POST body when set; update tests to cover validation and new flags.
Tests / Test Helpers
pkg/unsaferecovery/unsafe_recovery_controller_test.go, server/api/unsafe_operation_test.go, tests/server/api/unsafe_operation_test.go, tools/pd-ctl/*_test.go
Add region-item test helpers that allow custom epochs/initialized state. Add unit tests for parsing/validation of plan-execution-timeout and disable-paranoid-check. Update integration tests to exercise new flags and assert status text includes configured values.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as pd-ctl (Client)
    participant HTTP as PD HTTP API
    participant Handler as PD Handler
    participant Controller as UnsafeRecoveryController
    participant Reports as PeerReports/RegionIndex

    CLI->>HTTP: POST /admin/unsafe/remove-failed-stores {stores, timeout?, plan-execution-timeout?, disable-paranoid-check?}
    HTTP->>Handler: parsePlanExecutionTimeout, parseDisableParanoidCheck
    Handler->>Controller: RemoveFailedStoresWithOptions(stores, timeout, autoDetect, options)
    Controller->>Reports: buildUpFromReports()
    Controller->>Controller: generateCreateEmptyRegionPlan() (batch, use emptyRegionIndex)
    Controller->>Reports: apply batched create plans to stores
    Controller-->>Handler: return status (includes plan timeout / paranoid info)
    Handler-->>HTTP: HTTP response
    HTTP-->>CLI: print response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm

Suggested reviewers

  • JmPotato
  • lhy1024

Poem

"I'm a rabbit who batches and hops with cheer,
Timeouts tuned so plans finish clear.
Overlaps checked with an index I keep,
Paranoid guard or skip—your choice to reap.
Hooray—recovery runs lighter, swift, and dear 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.90% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly summarizes the main optimization: improving empty region overlap checks in unsafe recovery.
Description check ✅ Passed The PR description fully covers problem statement, detailed solution explanation, before/after comparison with metrics, configuration options, and includes release note.
Linked Issues check ✅ Passed All code changes directly address the core objective from #10638: reducing the paranoid overlap check from O(holes * peer reports) to O(newest regions + empty regions + peer reports * log(empty regions)) while preserving safety semantics.
Out of Scope Changes check ✅ Passed All changes are scoped to the stated objectives: paranoid check optimization, plan execution timeout extension, and request-level options for large recoveries.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 5, 2026
@Connor1996 Connor1996 force-pushed the codex/unsafe-recovery-overlap-index branch 3 times, most recently from 96da8be to c1fe103 Compare May 5, 2026 21:51
@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 5, 2026
@Connor1996 Connor1996 force-pushed the codex/unsafe-recovery-overlap-index branch 6 times, most recently from 4ee7a33 to ad42324 Compare May 5, 2026 22:22
@Connor1996 Connor1996 changed the base branch from release-8.5-20260121-v8.5.5 to master May 5, 2026 22:26
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 5, 2026
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996 Connor1996 force-pushed the codex/unsafe-recovery-overlap-index branch from ad42324 to 6e7529d Compare May 5, 2026 22:31
@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/needs-triage-completed labels May 5, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/unsaferecovery/unsafe_recovery_controller.go (1)

1314-1385: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Include the tail create in the paranoid overlap pass.

emptyRegions only collects interior gaps. The final [lastEnd, +∞) create — and the full-range create when newestRegionTree.size() == 0 — is still added after findOverlapPeerWithEmptyRegions, so a stale peer that overlaps that tail range will now slip past the paranoid check. Please route that last create through the same createPlans/emptyRegions path so it is checked and recorded consistently.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/unsaferecovery/unsafe_recovery_controller.go` around lines 1314 - 1385,
The tail create (the newRegion for [lastEnd, +∞) or full-range when
newestRegionTree.size() == 0) is currently added directly to the recovery plan
after the paranoid check, so it bypasses the overlap verification and the same
bookkeeping used for interior gaps; change the logic so the tail/new full-range
region is created via createRegion and appended to createPlans and emptyRegions
(just like interior gaps), using the same store selection logic (respecting
tiflash exclusion via getRandomStoreID and lastStoreID fallback) before the
paranoid findOverlapPeerWithEmptyRegions call, and then let the existing loop
that processes createPlans (u.getRecoveryPlan, storeRecoveryPlan.Creates append,
u.recordAffectedRegion, u.newlyCreatedRegions) handle the tail region so it is
checked and recorded consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/api/unsafe_operation.go`:
- Around line 79-95: The parsing of planExecutionTimeout and
disableParanoidCheck must validate input and return HTTP 400 on bad input
instead of silently using defaults: explicitly check input keys
"plan-execution-timeout"/"plan_execution_timeout" are present and of the correct
numeric type, ensure planExecutionTimeout is non-negative and fits as a
time.Duration (reject NaN/inf/negative), and ensure
"disable-paranoid-check"/"disable_paranoid_check" are booleans if provided; on
any invalid type or value respond with a 400 error from the handler and do not
call rc.GetUnsafeRecoveryController().RemoveFailedStoresWithOptions; then pass
the validated PlanExecutionTimeout and DisableParanoidCheck into
RemoveFailedStoresWithOptions.

---

Outside diff comments:
In `@pkg/unsaferecovery/unsafe_recovery_controller.go`:
- Around line 1314-1385: The tail create (the newRegion for [lastEnd, +∞) or
full-range when newestRegionTree.size() == 0) is currently added directly to the
recovery plan after the paranoid check, so it bypasses the overlap verification
and the same bookkeeping used for interior gaps; change the logic so the
tail/new full-range region is created via createRegion and appended to
createPlans and emptyRegions (just like interior gaps), using the same store
selection logic (respecting tiflash exclusion via getRandomStoreID and
lastStoreID fallback) before the paranoid findOverlapPeerWithEmptyRegions call,
and then let the existing loop that processes createPlans (u.getRecoveryPlan,
storeRecoveryPlan.Creates append, u.recordAffectedRegion, u.newlyCreatedRegions)
handle the tail region so it is checked and recorded consistently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: faefc0af-8ed8-4cbf-b340-fb84fe7b1454

📥 Commits

Reviewing files that changed from the base of the PR and between f941a7e and 6e7529d.

📒 Files selected for processing (6)
  • pkg/unsaferecovery/unsafe_recovery_controller.go
  • pkg/unsaferecovery/unsafe_recovery_controller_test.go
  • server/api/unsafe_operation.go
  • tests/server/api/unsafe_operation_test.go
  • tools/pd-ctl/pdctl/command/unsafe_command.go
  • tools/pd-ctl/tests/unsafe/unsafe_operation_test.go

Comment thread server/api/unsafe_operation.go Outdated
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996

Copy link
Copy Markdown
Member Author

/test pull-unit-test-next-gen-2
/test pull-unit-test-next-gen-3

@codecov

codecov Bot commented May 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 89.44099% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.01%. Comparing base (b21a183) to head (e7df29d).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10639      +/-   ##
==========================================
+ Coverage   78.96%   79.01%   +0.04%     
==========================================
  Files         532      532              
  Lines       71883    72078     +195     
==========================================
+ Hits        56766    56951     +185     
- Misses      11093    11099       +6     
- Partials     4024     4028       +4     
Flag Coverage Δ
unittests 79.01% <89.44%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Connor1996

Copy link
Copy Markdown
Member Author

/test pull-unit-test-next-gen-3

Comment thread pkg/unsaferecovery/unsafe_recovery_controller.go
@Connor1996

Copy link
Copy Markdown
Member Author

Checked the latest review feedback against current HEAD 57dfc7f.

  • The two server/api nitpicks are already covered by the current helper/test updates in this branch.
  • The earlier tail/full-range paranoid-check concern is already handled by the current appendCreatePlan flow, and covered by TestCreateEmptyRegionTailParanoidCheck plus TestCreateEmptyRegionFullRangeParanoidCheck.

No additional code change was needed, so I did not add a follow-up commit.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996

Copy link
Copy Markdown
Member Author

Addressed the still-valid review item on timeout parsing in server/api: the handler now rejects non-integer, non-positive, and oversized timeout values with 400 instead of casting directly to uint64, and I added unit plus API coverage for those invalid inputs. The two smaller parsing-helper nitpicks from the same review were already present on the PR head before this follow-up.

@Connor1996

Copy link
Copy Markdown
Member Author

Checked the latest review feedback and current HEAD already contains the timeout validation / alias parsing fixes under review, so I did not add another code change here.

I reran the failed GitHub Actions jobs once to classify the remaining red CI. The failures are still in the unrelated chunks (9, Microservice Integration(!TSO)) resource-manager suite, and the symptom changed between runs:

  • First run: tests/integrations/mcs/resourcemanager/resource_manager_test.go:1832 TestCheckBackgroundJobs -> Condition never satisfied
  • Rerun: tests/integrations/mcs/resourcemanager/resource_manager_test.go:1587 standalone-resource-manager-with-client-discovery/TestResourceManagerClientFailover -> ErrGroupNotExists for failover_test

report-coverage is only cascading because covprofile_9 is missing after chunks (9) fails.

@Connor1996

Copy link
Copy Markdown
Member Author

/retest

1 similar comment
@Connor1996

Copy link
Copy Markdown
Member Author

/retest

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels May 7, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

@v01dstar: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot added the lgtm label May 11, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024, nolouch, v01dstar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 11, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

[LGTM Timeline notifier]

Timeline:

  • 2026-05-07 06:07:11.322444048 +0000 UTC m=+334304.195794020: ☑️ agreed by lhy1024.
  • 2026-05-11 23:01:30.614678353 +0000 UTC m=+133859.147457662: ☑️ agreed by nolouch.

@AndreMouche

Copy link
Copy Markdown
Member

/retest

1 similar comment
@lhy1024

lhy1024 commented May 12, 2026

Copy link
Copy Markdown
Member

/retest

@ti-chi-bot

Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #10829.
But this PR has conflicts, please resolve them!

@ti-chi-bot ti-chi-bot Bot added the needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. label Jun 25, 2026
@ti-chi-bot

Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #10944.
But this PR has conflicts, please resolve them!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unsafe recovery create-empty-region plan generation can be O(holes * peer reports)

6 participants