pkg/unsaferecovery: add unsafe recovery abort#10641
Conversation
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds explicit abort support for unsafe recovery, makes recovery step IDs unique across runs, and wires the abort path through the HTTP API, pd-ctl command, and tests. ChangesUnsafe Recovery Abort and Step Tracking
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/unsaferecovery/unsafe_recovery_controller.go`:
- Around line 51-55: The controller currently only seeds globalRecoveryStep once
and then advances per-run state via u.step locally, allowing later runs to reuse
step numbers; change all places that advance or set u.step (including
initialization) to allocate a fresh value from the process-wide counter by
calling nextRecoveryStep() and storing that into u.step (replace any u.step++ or
local increments with atomic nextRecoveryStep() assignments), and ensure
comparisons that use GetStep() continue to compare against the updated u.step so
every stage in every run uses a unique process-wide step value (symbols to
update: globalRecoveryStep, nextRecoveryStep(), u.step, GetStep(), and any code
paths that call StoreReport or increment step).
In `@server/api/unsafe_operation.go`:
- Around line 85-101: The handler AbortFailedStoresRemoval currently maps all
errors from rc.GetUnsafeRecoveryController().AbortFailedStoresRemoval() to 500;
update the handler to detect the controller's specific "no ongoing unsafe
recovery" / invalid-state error (e.g. compare with a sentinel error like
ErrNoOngoingUnsafeRecovery or an exported invalid-request error via errors.Is)
and return a 4xx (400 or 409) for that case while keeping 500 for real server
failures, and also update the Swagger comment above AbortFailedStoresRemoval to
include the new 4xx Failure case; locate the logic in AbortFailedStoresRemoval
(server/api/unsafe_operation.go) and the controller method
AbortFailedStoresRemoval() to identify the appropriate error value to match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d2aaad4d-f0b0-4223-bf36-71d61ae99696
📒 Files selected for processing (7)
pkg/unsaferecovery/unsafe_recovery_controller.gopkg/unsaferecovery/unsafe_recovery_controller_test.goserver/api/router.goserver/api/unsafe_operation.gotests/server/api/unsafe_operation_test.gotools/pd-ctl/pdctl/command/unsafe_command.gotools/pd-ctl/tests/unsafe/unsafe_operation_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10641 +/- ##
==========================================
+ Coverage 79.18% 79.22% +0.03%
==========================================
Files 541 541
Lines 75312 75346 +34
==========================================
+ Hits 59638 59691 +53
+ Misses 11445 11427 -18
+ Partials 4229 4228 -1
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
| } | ||
|
|
||
| u.timeout = time.Now().Add(time.Duration(timeout) * time.Second) | ||
| u.step = nextRecoveryStep() |
There was a problem hiding this comment.
Can we allocate the step from the global monotonic counter on every stage transition, or add a recovery epoch, so delayed reports from a previous run cannot pass the new run's step check?
There was a problem hiding this comment.
Updated in 30e6fcd94: changeStage now allocates a fresh step from the process-wide monotonic counter on every transition, so delayed reports from earlier stages or earlier runs can no longer match the current step.
| return errs.ErrUnsafeRecoveryInvalidInput.FastGenByArgs("no ongoing unsafe recovery") | ||
| } | ||
|
|
||
| u.handleErr(errors.New("aborted by operator")) |
There was a problem hiding this comment.
Could we make abort a no-op when stage == ExitForceLeader, or keep the controller in ExitForceLeader and add a repeated-abort test?
There was a problem hiding this comment.
Adjusted in 30e6fcd94: abort is now a no-op once the controller is already in ExitForceLeader, and TestAbortFailedStoresRemoval covers the repeated-abort path.
There was a problem hiding this comment.
Rechecked on current head cfef6f4c: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, and TestAbortFailedStoresRemoval exercises the repeated-abort path after entering that stage. I also reran CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1 locally after confirming the branch is already up to date with upstream/master.
There was a problem hiding this comment.
Rechecked on the current head d2a8c217e: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, keeps the stage there, and TestAbortFailedStoresRemoval still covers the repeated-abort path plus the empty recovery plan dispatch afterward. I also reran CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1 on this head.
There was a problem hiding this comment.
Rechecked on the current head d2a8c217e: AbortFailedStoresRemoval() still returns nil once the controller is already in ExitForceLeader, so repeated aborts remain a no-op and keep the stage there.
TestAbortFailedStoresRemoval still covers that repeated-abort path, the empty recovery plan dispatched on the next heartbeat, and the matching report afterward. I also reran these checks on this head:
CGO_ENABLED=0 go test ./pkg/unsaferecovery -run 'TestAbortFailedStoresRemoval|TestUnsafeRecoveryStepIsUniqueAcrossRuns' -count=1\n-CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1\n-CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1fromtools/
Signed-off-by: Connor1996 <zbk602423539@gmail.com> # Conflicts: # pkg/unsaferecovery/unsafe_recovery_controller.go # server/api/unsafe_operation.go
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
|
Rechecked the latest CodeRabbit feedback on the current head after merging Both findings are already satisfied in the merged code: Also reran the narrow unsafe-recovery checks locally after the merge:
|
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
|
Pushed follow-up commit Revalidated on the updated head with:
|
|
/retest |
1 similar comment
|
/retest |
What problem does this PR solve?
Issue Number: Close #10640
Unsafe recovery currently cannot be explicitly aborted once it starts. Operators can only wait for the current run to finish or time out.
PD also resets unsafe recovery step values from small numbers for each run. Since TiKV echoes the plan step in
StoreReport, a delayed report from a previous run can match a later run if the same step is reused.What is changed and how does it work?
Check List
Tests
Code changes
Tests
CGO_ENABLED=0 go test ./pkg/unsaferecovery -count=1CGO_ENABLED=0 go test -tags without_dashboard ./tests/server/api -run TestUnsafeOperationTestSuite -count=1CGO_ENABLED=0 go test -tags without_dashboard ./pd-ctl/tests/unsafe -run TestRemoveFailedStores -count=1fromtools/Release note
Summary by CodeRabbit
pd-ctl unsafe remove-failed-stores abortto trigger the cancellation.