pkg/unsaferecovery: add unsafe recovery abort by Connor1996 · Pull Request #10641 · tikv/pd

Connor1996 · 2026-05-05T22:38:14Z

What problem does this PR solve?

Issue Number: Close #10640

Unsafe recovery currently cannot be explicitly aborted once it starts. Operators can only wait for the current run to finish or time out.

PD also resets unsafe recovery step values from small numbers for each run. Since TiKV echoes the plan step in StoreReport, a delayed report from a previous run can match a later run if the same step is reused.

What is changed and how does it work?

Add a best-effort abort path for unsafe failed-store recovery.

The new API and pd-ctl command move the current controller run through the existing ExitForceLeader cleanup path. Already delivered TiKV plans are not synchronously cancelled, but PD stops the current recovery and dispatches empty recovery plans so TiKV can exit force leader where needed.

Unsafe recovery step IDs are now allocated from a process-wide increasing counter seeded by time, so reports from previous recovery runs no longer match a new run that reuses the same local stage number.

Check List

Tests

Unit test
Integration test

Code changes

Has HTTP APIs changed

Tests

CGO_ENABLED=0 go test ./pkg/unsaferecovery -count=1
CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1
CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1 from tools/

Release note

Add an unsafe recovery abort API and pd-ctl command.

Summary by CodeRabbit

New Features
- Added an unsafe admin HTTP endpoint to abort an in-progress failed stores removal flow.
- Added pd-ctl unsafe remove-failed-stores abort to trigger the cancellation.
- Improved unsafe recovery progress tracking with unique per-run step IDs and safer handling of stale heartbeats.
Bug Fixes
- Updated region cache invalidation to depend on recovery progress since the run began.
Tests
- Expanded API and CLI tests to cover abort success, invalid-state errors, and cleared recovery plan behavior.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

ti-chi-bot · 2026-05-05T22:38:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign siddontang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-05T22:38:27Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8fe2384a-1f02-4a55-89f9-eac0278964e1

📥 Commits

Reviewing files that changed from the base of the PR and between cfef6f4 and d2a8c21.

📒 Files selected for processing (1)

server/api/unsafe_operation.go

🚧 Files skipped from review as they are similar to previous changes (1)

server/api/unsafe_operation.go

📝 Walkthrough

Walkthrough

Adds explicit abort support for unsafe recovery, makes recovery step IDs unique across runs, and wires the abort path through the HTTP API, pd-ctl command, and tests.

Changes

Unsafe Recovery Abort and Step Tracking

Layer / File(s)	Summary
Controller step tracking `pkg/unsaferecovery/unsafe_recovery_controller.go`	Adds atomic recovery-step generation, stores `recoveryStartStep` in controller state, initializes both values at run start, and updates cache invalidation and cleanup to use the new step progression.
Abort behavior `pkg/unsaferecovery/unsafe_recovery_controller.go`	Adds `AbortFailedStoresRemoval()` to reject aborts when no recovery is running, no-op in `ExitForceLeader`, and otherwise route through existing error handling.
HTTP and CLI wiring `server/api/router.go`, `server/api/unsafe_operation.go`, `tools/pd-ctl/pdctl/command/unsafe_command.go`	Registers `/admin/unsafe/remove-failed-stores/abort`, adds the HTTP handler for abort responses, and adds the matching pd-ctl `abort` subcommand and POST request.
Controller tests `pkg/unsaferecovery/unsafe_recovery_controller_test.go`, `tests/server/api/unsafe_operation_test.go`, `tools/pd-ctl/tests/unsafe/unsafe_operation_test.go`	Updates heartbeat expectations to use the dynamic step value and adds coverage for abort handling, unique step values across runs, and the abort endpoint/command path.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

tikv/pd#10639 — Changes the unsafe recovery controller flow around RemoveFailedStoresWithOptions, which this PR extends with abort and unique step tracking.

Suggested reviewers

lhy1024
v01dstar
nolouch

Poem

🐰 I tapped abort and hopped away,
Fresh little steps now lead the way.
Old reports can’t sneak back in line,
New recovery hops are now just fine.
A tidy burrow, bright and spry ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the main change: adding unsafe recovery abort support.
Description check	✅ Passed	The PR description follows the template with problem, changes, checklist, tests, and release note sections filled in.
Linked Issues check	✅ Passed	The code adds explicit abort handling, uses the ExitForceLeader cleanup path, and makes recovery steps unique across runs as requested.
Out of Scope Changes check	✅ Passed	The changes stay focused on unsafe recovery abort and step-uniqueness support, with related API, CLI, and tests in scope.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/unsaferecovery/unsafe_recovery_controller.go`:
- Around line 51-55: The controller currently only seeds globalRecoveryStep once
and then advances per-run state via u.step locally, allowing later runs to reuse
step numbers; change all places that advance or set u.step (including
initialization) to allocate a fresh value from the process-wide counter by
calling nextRecoveryStep() and storing that into u.step (replace any u.step++ or
local increments with atomic nextRecoveryStep() assignments), and ensure
comparisons that use GetStep() continue to compare against the updated u.step so
every stage in every run uses a unique process-wide step value (symbols to
update: globalRecoveryStep, nextRecoveryStep(), u.step, GetStep(), and any code
paths that call StoreReport or increment step).

In `@server/api/unsafe_operation.go`:
- Around line 85-101: The handler AbortFailedStoresRemoval currently maps all
errors from rc.GetUnsafeRecoveryController().AbortFailedStoresRemoval() to 500;
update the handler to detect the controller's specific "no ongoing unsafe
recovery" / invalid-state error (e.g. compare with a sentinel error like
ErrNoOngoingUnsafeRecovery or an exported invalid-request error via errors.Is)
and return a 4xx (400 or 409) for that case while keeping 500 for real server
failures, and also update the Swagger comment above AbortFailedStoresRemoval to
include the new 4xx Failure case; locate the logic in AbortFailedStoresRemoval
(server/api/unsafe_operation.go) and the controller method
AbortFailedStoresRemoval() to identify the appropriate error value to match.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d2aaad4d-f0b0-4223-bf36-71d61ae99696

📥 Commits

Reviewing files that changed from the base of the PR and between f941a7e and 740e85c.

📒 Files selected for processing (7)

pkg/unsaferecovery/unsafe_recovery_controller.go
pkg/unsaferecovery/unsafe_recovery_controller_test.go
server/api/router.go
server/api/unsafe_operation.go
tests/server/api/unsafe_operation_test.go
tools/pd-ctl/pdctl/command/unsafe_command.go
tools/pd-ctl/tests/unsafe/unsafe_operation_test.go

codecov · 2026-05-05T22:49:16Z

Codecov Report

❌ Patch coverage is 94.44444% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.22%. Comparing base (b36c4da) to head (d2a8c21).

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10641      +/-   ##
==========================================
+ Coverage   79.18%   79.22%   +0.03%     
==========================================
  Files         541      541              
  Lines       75312    75346      +34     
==========================================
+ Hits        59638    59691      +53     
+ Misses      11445    11427      -18     
+ Partials     4229     4228       -1

Flag	Coverage Δ
unittests	`79.22% <94.44%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

lhy1024 · 2026-05-07T06:11:57Z

 	}

 	u.timeout = time.Now().Add(time.Duration(timeout) * time.Second)
+	u.step = nextRecoveryStep()


Can we allocate the step from the global monotonic counter on every stage transition, or add a recovery epoch, so delayed reports from a previous run cannot pass the new run's step check?

Updated in 30e6fcd94: changeStage now allocates a fresh step from the process-wide monotonic counter on every transition, so delayed reports from earlier stages or earlier runs can no longer match the current step.

lhy1024 · 2026-05-07T06:12:26Z

+		return errs.ErrUnsafeRecoveryInvalidInput.FastGenByArgs("no ongoing unsafe recovery")
+	}
+
+	u.handleErr(errors.New("aborted by operator"))


Could we make abort a no-op when stage == ExitForceLeader, or keep the controller in ExitForceLeader and add a repeated-abort test?

Adjusted in 30e6fcd94: abort is now a no-op once the controller is already in ExitForceLeader, and TestAbortFailedStoresRemoval covers the repeated-abort path.

Rechecked on current head cfef6f4c: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, and TestAbortFailedStoresRemoval exercises the repeated-abort path after entering that stage. I also reran CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1 locally after confirming the branch is already up to date with upstream/master.

Rechecked on the current head d2a8c217e: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, keeps the stage there, and TestAbortFailedStoresRemoval still covers the repeated-abort path plus the empty recovery plan dispatch afterward. I also reran CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1 on this head.

Rechecked on the current head d2a8c217e: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, so repeated aborts remain a no-op and keep the stage there.

TestAbortFailedStoresRemoval still covers that repeated-abort path, the empty recovery plan dispatched on the next heartbeat, and the matching report afterward. I also reran these checks on this head:

CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1 from tools/

Signed-off-by: Connor1996 <zbk602423539@gmail.com> # Conflicts: # pkg/unsaferecovery/unsafe_recovery_controller.go # server/api/unsafe_operation.go

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Connor1996 · 2026-06-29T22:56:44Z

Rechecked the latest CodeRabbit feedback on the current head after merging upstream/master.

Both findings are already satisfied in the merged code: changeStage allocates a fresh process-wide recovery step on each stage transition, and the abort handler returns 400 for no ongoing unsafe recovery while keeping 500 for unexpected failures.

Also reran the narrow unsafe-recovery checks locally after the merge:

CGO_ENABLED=0 go test ./pkg/unsaferecovery -count=1
CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1
CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Connor1996 · 2026-06-29T23:08:57Z

Pushed follow-up commit d2a8c217e to fix the gci import grouping in server/api/unsafe_operation.go, which was the only failing item in the current statics run after the upstream/master merge already on this branch.

Revalidated on the updated head with:

./.tools/bin/golangci-lint run --verbose ./server/api --allow-parallel-runners
CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1 (from tools/)

Connor1996 · 2026-06-29T23:26:18Z

/retest

Connor1996 · 2026-06-29T23:40:48Z

/retest

pkg/unsaferecovery: add unsafe recovery abort

740e85c

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

ti-chi-bot Bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label May 5, 2026

ti-chi-bot Bot added do-not-merge/needs-triage-completed dco-signoff: yes Indicates the PR's author has signed the dco. labels May 5, 2026

ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 5, 2026

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

Comment thread pkg/unsaferecovery/unsafe_recovery_controller.go

Comment thread server/api/unsafe_operation.go

lhy1024 reviewed May 7, 2026

View reviewed changes

Connor1996 added 2 commits June 29, 2026 15:45

Merge remote-tracking branch 'upstream/master' into HEAD

30e6fcd

Signed-off-by: Connor1996 <zbk602423539@gmail.com> # Conflicts: # pkg/unsaferecovery/unsafe_recovery_controller.go # server/api/unsafe_operation.go

Merge remote-tracking branch 'upstream/master' into HEAD

cfef6f4

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Jun 29, 2026

server/api: fix unsafe operation import order

d2a8c21

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

Uh oh!

Conversation

Connor1996 commented May 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Check List

Tests

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot Bot commented May 5, 2026

Uh oh!

coderabbitai Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated Code Review Effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lhy1024 May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Connor1996 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

lhy1024 May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Connor1996 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Connor1996 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Connor1996 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Connor1996 Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Connor1996 commented Jun 29, 2026

Uh oh!

Connor1996 commented Jun 29, 2026

Uh oh!

Connor1996 commented Jun 29, 2026

Uh oh!

Connor1996 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Connor1996 commented May 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 5, 2026 •

edited

Loading

codecov Bot commented May 5, 2026 •

edited

Loading