Skip to content

pkg/unsaferecovery: add unsafe recovery abort#10641

Open
Connor1996 wants to merge 4 commits into
tikv:masterfrom
Connor1996:codex/unsafe-recovery-abort-master
Open

pkg/unsaferecovery: add unsafe recovery abort#10641
Connor1996 wants to merge 4 commits into
tikv:masterfrom
Connor1996:codex/unsafe-recovery-abort-master

Conversation

@Connor1996

@Connor1996 Connor1996 commented May 5, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: Close #10640

Unsafe recovery currently cannot be explicitly aborted once it starts. Operators can only wait for the current run to finish or time out.

PD also resets unsafe recovery step values from small numbers for each run. Since TiKV echoes the plan step in StoreReport, a delayed report from a previous run can match a later run if the same step is reused.

What is changed and how does it work?

Add a best-effort abort path for unsafe failed-store recovery.

The new API and pd-ctl command move the current controller run through the existing ExitForceLeader cleanup path. Already delivered TiKV plans are not synchronously cancelled, but PD stops the current recovery and dispatches empty recovery plans so TiKV can exit force leader where needed.

Unsafe recovery step IDs are now allocated from a process-wide increasing counter seeded by time, so reports from previous recovery runs no longer match a new run that reuses the same local stage number.

Check List

Tests

  • Unit test
  • Integration test

Code changes

  • Has HTTP APIs changed

Tests

  • CGO_ENABLED=0 go test ./pkg/unsaferecovery -count=1
  • CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1
  • CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1 from tools/

Release note

Add an unsafe recovery abort API and pd-ctl command.

Summary by CodeRabbit

  • New Features
    • Added an unsafe admin HTTP endpoint to abort an in-progress failed stores removal flow.
    • Added pd-ctl unsafe remove-failed-stores abort to trigger the cancellation.
    • Improved unsafe recovery progress tracking with unique per-run step IDs and safer handling of stale heartbeats.
  • Bug Fixes
    • Updated region cache invalidation to depend on recovery progress since the run began.
  • Tests
    • Expanded API and CLI tests to cover abort success, invalid-state errors, and cleared recovery plan behavior.

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@ti-chi-bot ti-chi-bot Bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label May 5, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign siddontang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed dco-signoff: yes Indicates the PR's author has signed the dco. labels May 5, 2026
@coderabbitai

coderabbitai Bot commented May 5, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8fe2384a-1f02-4a55-89f9-eac0278964e1

📥 Commits

Reviewing files that changed from the base of the PR and between cfef6f4 and d2a8c21.

📒 Files selected for processing (1)
  • server/api/unsafe_operation.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/api/unsafe_operation.go

📝 Walkthrough

Walkthrough

Adds explicit abort support for unsafe recovery, makes recovery step IDs unique across runs, and wires the abort path through the HTTP API, pd-ctl command, and tests.

Changes

Unsafe Recovery Abort and Step Tracking

Layer / File(s) Summary
Controller step tracking
pkg/unsaferecovery/unsafe_recovery_controller.go
Adds atomic recovery-step generation, stores recoveryStartStep in controller state, initializes both values at run start, and updates cache invalidation and cleanup to use the new step progression.
Abort behavior
pkg/unsaferecovery/unsafe_recovery_controller.go
Adds AbortFailedStoresRemoval() to reject aborts when no recovery is running, no-op in ExitForceLeader, and otherwise route through existing error handling.
HTTP and CLI wiring
server/api/router.go, server/api/unsafe_operation.go, tools/pd-ctl/pdctl/command/unsafe_command.go
Registers /admin/unsafe/remove-failed-stores/abort, adds the HTTP handler for abort responses, and adds the matching pd-ctl abort subcommand and POST request.
Controller tests
pkg/unsaferecovery/unsafe_recovery_controller_test.go, tests/server/api/unsafe_operation_test.go, tools/pd-ctl/tests/unsafe/unsafe_operation_test.go
Updates heartbeat expectations to use the dynamic step value and adds coverage for abort handling, unique step values across runs, and the abort endpoint/command path.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tikv/pd#10639 — Changes the unsafe recovery controller flow around RemoveFailedStoresWithOptions, which this PR extends with abort and unique step tracking.

Suggested reviewers

  • lhy1024
  • v01dstar
  • nolouch

Poem

🐰 I tapped abort and hopped away,
Fresh little steps now lead the way.
Old reports can’t sneak back in line,
New recovery hops are now just fine.
A tidy burrow, bright and spry ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: adding unsafe recovery abort support.
Description check ✅ Passed The PR description follows the template with problem, changes, checklist, tests, and release note sections filled in.
Linked Issues check ✅ Passed The code adds explicit abort handling, uses the ExitForceLeader cleanup path, and makes recovery steps unique across runs as requested.
Out of Scope Changes check ✅ Passed The changes stay focused on unsafe recovery abort and step-uniqueness support, with related API, CLI, and tests in scope.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 5, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/unsaferecovery/unsafe_recovery_controller.go`:
- Around line 51-55: The controller currently only seeds globalRecoveryStep once
and then advances per-run state via u.step locally, allowing later runs to reuse
step numbers; change all places that advance or set u.step (including
initialization) to allocate a fresh value from the process-wide counter by
calling nextRecoveryStep() and storing that into u.step (replace any u.step++ or
local increments with atomic nextRecoveryStep() assignments), and ensure
comparisons that use GetStep() continue to compare against the updated u.step so
every stage in every run uses a unique process-wide step value (symbols to
update: globalRecoveryStep, nextRecoveryStep(), u.step, GetStep(), and any code
paths that call StoreReport or increment step).

In `@server/api/unsafe_operation.go`:
- Around line 85-101: The handler AbortFailedStoresRemoval currently maps all
errors from rc.GetUnsafeRecoveryController().AbortFailedStoresRemoval() to 500;
update the handler to detect the controller's specific "no ongoing unsafe
recovery" / invalid-state error (e.g. compare with a sentinel error like
ErrNoOngoingUnsafeRecovery or an exported invalid-request error via errors.Is)
and return a 4xx (400 or 409) for that case while keeping 500 for real server
failures, and also update the Swagger comment above AbortFailedStoresRemoval to
include the new 4xx Failure case; locate the logic in AbortFailedStoresRemoval
(server/api/unsafe_operation.go) and the controller method
AbortFailedStoresRemoval() to identify the appropriate error value to match.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d2aaad4d-f0b0-4223-bf36-71d61ae99696

📥 Commits

Reviewing files that changed from the base of the PR and between f941a7e and 740e85c.

📒 Files selected for processing (7)
  • pkg/unsaferecovery/unsafe_recovery_controller.go
  • pkg/unsaferecovery/unsafe_recovery_controller_test.go
  • server/api/router.go
  • server/api/unsafe_operation.go
  • tests/server/api/unsafe_operation_test.go
  • tools/pd-ctl/pdctl/command/unsafe_command.go
  • tools/pd-ctl/tests/unsafe/unsafe_operation_test.go

Comment thread pkg/unsaferecovery/unsafe_recovery_controller.go
Comment thread server/api/unsafe_operation.go
@codecov

codecov Bot commented May 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.44444% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.22%. Comparing base (b36c4da) to head (d2a8c21).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10641      +/-   ##
==========================================
+ Coverage   79.18%   79.22%   +0.03%     
==========================================
  Files         541      541              
  Lines       75312    75346      +34     
==========================================
+ Hits        59638    59691      +53     
+ Misses      11445    11427      -18     
+ Partials     4229     4228       -1     
Flag Coverage Δ
unittests 79.22% <94.44%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

}

u.timeout = time.Now().Add(time.Duration(timeout) * time.Second)
u.step = nextRecoveryStep()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we allocate the step from the global monotonic counter on every stage transition, or add a recovery epoch, so delayed reports from a previous run cannot pass the new run's step check?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 30e6fcd94: changeStage now allocates a fresh step from the process-wide monotonic counter on every transition, so delayed reports from earlier stages or earlier runs can no longer match the current step.

return errs.ErrUnsafeRecoveryInvalidInput.FastGenByArgs("no ongoing unsafe recovery")
}

u.handleErr(errors.New("aborted by operator"))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make abort a no-op when stage == ExitForceLeader, or keep the controller in ExitForceLeader and add a repeated-abort test?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted in 30e6fcd94: abort is now a no-op once the controller is already in ExitForceLeader, and TestAbortFailedStoresRemoval covers the repeated-abort path.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rechecked on current head cfef6f4c: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, and TestAbortFailedStoresRemoval exercises the repeated-abort path after entering that stage. I also reran CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1 locally after confirming the branch is already up to date with upstream/master.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rechecked on the current head d2a8c217e: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, keeps the stage there, and TestAbortFailedStoresRemoval still covers the repeated-abort path plus the empty recovery plan dispatch afterward. I also reran CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1 on this head.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rechecked on the current head d2a8c217e: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, so repeated aborts remain a no-op and keep the stage there.

TestAbortFailedStoresRemoval still covers that repeated-abort path, the empty recovery plan dispatched on the next heartbeat, and the matching report afterward. I also reran these checks on this head:

  • CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1 from tools/

Signed-off-by: Connor1996 <zbk602423539@gmail.com>

# Conflicts:
#	pkg/unsaferecovery/unsafe_recovery_controller.go
#	server/api/unsafe_operation.go
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996

Copy link
Copy Markdown
Member Author

Rechecked the latest CodeRabbit feedback on the current head after merging upstream/master.

Both findings are already satisfied in the merged code: changeStage allocates a fresh process-wide recovery step on each stage transition, and the abort handler returns 400 for no ongoing unsafe recovery while keeping 500 for unexpected failures.

Also reran the narrow unsafe-recovery checks locally after the merge:

  • CGO_ENABLED=0 go test ./pkg/unsaferecovery -count=1
  • CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1
  • CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996

Copy link
Copy Markdown
Member Author

Pushed follow-up commit d2a8c217e to fix the gci import grouping in server/api/unsafe_operation.go, which was the only failing item in the current statics run after the upstream/master merge already on this branch.

Revalidated on the updated head with:

  • ./.tools/bin/golangci-lint run --verbose ./server/api --allow-parallel-runners
  • CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1\n- CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1 (from tools/)

@Connor1996

Copy link
Copy Markdown
Member Author

/retest

1 similar comment
@Connor1996

Copy link
Copy Markdown
Member Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

unsafe recovery should support explicit abort and ignore stale reports

2 participants