server, schedule: make config updates atomic#10942
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughConfig writes now use callback-based update helpers with serialized persistence and rollback. HTTP and dashboard config paths were updated to use those helpers, scheduler config reconciliation now filters invalid entries, and PD leader lookup retries until timeout. ChangesAtomic config updates and PD leader retries
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/server.go`:
- Around line 1304-1305: The scheduling allowance update in UpdateScheduleConfig
is racing with newer writes because SetSchedulingAllowanceStatus is called after
the serialized persist helper releases its lock. Move the halt-status update
into the same critical section as the schedule config persist, or otherwise
guard it with the same serialization used by persist helper so an older
UpdateScheduleConfig cannot overwrite a newer HaltScheduling value; use the
UpdateScheduleConfig and SetSchedulingAllowanceStatus paths as the fix points.
- Around line 1582-1584: Normalize the dashboard address before calling
cfg.Validate in PDServerConfig setup so values like 127.0.0.1:2379 are converted
into a scheme-bearing form first. Move the scheme-prefixing logic that currently
sits later in the server startup flow to happen before the Validate call, using
the same dashboard-address handling path in server.go so Validate sees the
normalized value. Keep the existing special handling for auto and none
unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0fe704dc-0f90-4e93-bcc8-ee0724b8046c
📒 Files selected for processing (9)
pkg/dashboard/adapter/manager.gopkg/mcs/scheduling/server/config/config.gopkg/schedule/config/config_provider.gopkg/schedule/coordinator.gopkg/schedule/coordinator_test.goserver/api/config.goserver/config/config_test.goserver/config/persist_options.goserver/server.go
8cec60e to
3a711ee
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/mcs/scheduling/server/cluster.go`:
- Around line 414-415: The error returned from the ctx.Done() branch in
cluster.go uses uppercase text and should follow the repository rule for
lowercase error strings. Update the error message in the
scheduling/server/cluster.go path within the cluster-related function handling
the ctx.Done() case so the returned errors.New message is entirely lowercase and
keeps the same meaning.
In `@pkg/mcs/scheduling/server/server.go`:
- Around line 211-219: The member probe calls in updatePDLeader can block
forever because they reuse the loop context without a deadline. Wrap the
external calls to GetClient().Status and GetDelegateClient in
context.WithTimeout using a short per-call timeout, and make sure each derived
context is canceled with defer. Keep the existing leader-check flow in
updatePDLeader, but apply the timeout around each endpoint probe so a slow or
unresponsive member cannot stall the whole scheduling loop.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bba85821-30b0-4592-bff7-e15a0476e558
📒 Files selected for processing (11)
pkg/dashboard/adapter/manager.gopkg/mcs/scheduling/server/cluster.gopkg/mcs/scheduling/server/config/config.gopkg/mcs/scheduling/server/server.gopkg/schedule/config/config_provider.gopkg/schedule/coordinator.gopkg/schedule/coordinator_test.goserver/api/config.goserver/config/config_test.goserver/config/persist_options.goserver/server.go
🚧 Files skipped from review as they are similar to previous changes (8)
- pkg/schedule/config/config_provider.go
- server/config/config_test.go
- pkg/dashboard/adapter/manager.go
- server/config/persist_options.go
- pkg/schedule/coordinator.go
- pkg/mcs/scheduling/server/config/config.go
- server/server.go
- server/api/config.go
Signed-off-by: Ryan Leung <rleungx@gmail.com>
3a711ee to
abc4684
Compare
Signed-off-by: Ryan Leung <rleungx@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@server/config/config_test.go`:
- Around line 454-470: The test around opt.Persist and the storage
synchronization should fail immediately if the serialization invariant is broken
instead of silently accepting the race. Update the logic in the test that waits
on storage.secondStarted so that seeing it before closing storage.releaseFirst
triggers a test failure, and add an explicit timeout around the wait on
storage.firstStarted so regressions don’t hang. Use the existing test helpers
and symbols opt.Persist, storage.firstStarted, storage.secondStarted, and
storage.releaseFirst to keep the fix localized.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e55afd3e-4596-4e6c-97b9-1e50ecbe0679
📒 Files selected for processing (2)
server/config/config_test.goserver/config/persist_options.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #10942 +/- ##
==========================================
- Coverage 79.26% 79.24% -0.03%
==========================================
Files 540 541 +1
Lines 74954 75189 +235
==========================================
+ Hits 59412 59580 +168
- Misses 11371 11430 +59
- Partials 4171 4179 +8
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: Ryan Leung <rleungx@gmail.com>
|
/test pull-unit-test-next-gen-3 |
| } | ||
| notifier <- struct{}{} | ||
| select { | ||
| case notifier <- struct{}{}: |
There was a problem hiding this comment.
It will lose this update if the notifier is full.
| var ( | ||
| _ *sc.ScheduleConfig = nil | ||
| resourceManagerControllerConfigURL = "/resource-manager/api/v1/config/controller" | ||
| errConfigNotUpdated = errors.New("config is not updated") |
There was a problem hiding this comment.
How about renaming it to config updated failed?
What problem does this PR solve?
Issue Number: Close #7464
What is changed and how does it work?
Check List
Tests
Verified with:
Code changes
Related changes
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests