Skip to content

server, schedule: make config updates atomic#10942

Open
rleungx wants to merge 5 commits into
tikv:masterfrom
rleungx:fix-7464-atomic-config-update-master
Open

server, schedule: make config updates atomic#10942
rleungx wants to merge 5 commits into
tikv:masterfrom
rleungx:fix-7464-atomic-config-update-master

Conversation

@rleungx

@rleungx rleungx commented Jun 25, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: Close #7464

What is changed and how does it work?

Make schedule and PD server config updates read-modify-write the latest config under a shared lock before persisting, instead of persisting stale cloned snapshots.

Use the atomic update path in config API handlers, full schedule config updates, dashboard address updates, and scheduler initialization cleanup so concurrent config changes are preserved.

In MCS scheduling, make schedule config updates atomic and let PD leader client lookup retry within the request timeout to avoid startup races in operator requests.

Check List

Tests

  • Unit test
  • Integration test

Verified with:

go test ./server/config ./pkg/schedule ./pkg/mcs/scheduling/server/config ./pkg/dashboard/adapter -run Test -timeout 2m
go test ./pkg/mcs/scheduling/server -run 'TestAllocID|TestAskBatchSplit|TestSplit' -count=1 -timeout 2m
go test ./tests/server/api -run 'TestOperatorTestSuite/TestTransferRegionWithPlacementRule' -count=1 -timeout 2m

Code changes

  • Has the configuration change

Related changes

  • Need to cherry-pick to the release branch (release-8.5 PR: tidbcloud/pd-cse#554)

Release note

Fix a race where dynamic config updates could overwrite concurrent changes with stale config snapshots.

Summary by CodeRabbit

  • New Features

    • Schedule and PD-server config updates now use callback-based atomic “update + persist” flows.
    • Dashboard address updates are supported directly, including normalization of “bare” values.
  • Bug Fixes

    • Improved concurrency safety for schedule/config mutations (including non-blocking scheduler update notifications).
    • Persistence failures now reliably roll back in-memory changes to the prior state.
    • PD leader discovery now retries within a bounded timeout before failing.
  • Tests

    • Added/updated coverage for rollback, scheduler removal with concurrent updates, and dashboard address normalization.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-linked-issue dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 25, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign likidu, overvenus for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Config writes now use callback-based update helpers with serialized persistence and rollback. HTTP and dashboard config paths were updated to use those helpers, scheduler config reconciliation now filters invalid entries, and PD leader lookup retries until timeout.

Changes

Atomic config updates and PD leader retries

Layer / File(s) Summary
Config persistence helpers
pkg/schedule/config/config_provider.go, server/config/persist_options.go, server/config/config_test.go
UpdateScheduleConfig now accepts callback-based mutations and returns errors, while persisted schedule and PD-server writes are serialized with mutexes, restored on persist failure, and covered by field-preservation and concurrency tests.
Scheduling server config
pkg/mcs/scheduling/server/config/config.go
PersistConfig now serializes schedule updates, makes scheduler-update notifications non-blocking, and adds callback-based schedule-config updates with rollback.
Server update paths
server/server.go, server/api/config.go, pkg/dashboard/adapter/manager.go, tests/server/server_test.go
SetScheduleConfig and SetPDServerConfig now delegate to atomic update helpers, the API handlers mutate configs through those helpers, setNewAddress calls SetDashboardAddress, and the dashboard-address test follows the normalized update path.
Scheduler reconciliation
pkg/schedule/coordinator.go, pkg/schedule/coordinator_test.go
InitSchedulers snapshots the original scheduler configs, filters invalid entries by deep equality, persists the filtered set through UpdateScheduleConfig, and the new test checks concurrent updates remain.
PD leader retry flow
pkg/mcs/scheduling/server/cluster.go, pkg/mcs/scheduling/server/server.go
AllocID now binds leader acquisition to a timeout context, and leader lookup/reselection retries through membership checks until a leader is found or the context expires.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • okJiang
  • JmPotato

Poem

🐰 I hop through configs, neat and small,
With mutex paws I guard them all.
The leader waits, then finds its way,
And dashboard updates land today.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 9.09% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: making server and schedule config updates atomic.
Description check ✅ Passed The PR description covers the issue, changes, tests, related changes, and release note with the required sections.
Linked Issues check ✅ Passed The changes address #7464 by preventing stale config writes from overwriting concurrent updates with atomic read-modify-write persistence.
Out of Scope Changes check ✅ Passed The additional MCS and scheduler changes are consistent with the stated PR objectives and do not appear out of scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/server.go`:
- Around line 1304-1305: The scheduling allowance update in UpdateScheduleConfig
is racing with newer writes because SetSchedulingAllowanceStatus is called after
the serialized persist helper releases its lock. Move the halt-status update
into the same critical section as the schedule config persist, or otherwise
guard it with the same serialization used by persist helper so an older
UpdateScheduleConfig cannot overwrite a newer HaltScheduling value; use the
UpdateScheduleConfig and SetSchedulingAllowanceStatus paths as the fix points.
- Around line 1582-1584: Normalize the dashboard address before calling
cfg.Validate in PDServerConfig setup so values like 127.0.0.1:2379 are converted
into a scheme-bearing form first. Move the scheme-prefixing logic that currently
sits later in the server startup flow to happen before the Validate call, using
the same dashboard-address handling path in server.go so Validate sees the
normalized value. Keep the existing special handling for auto and none
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0fe704dc-0f90-4e93-bcc8-ee0724b8046c

📥 Commits

Reviewing files that changed from the base of the PR and between 2e12f48 and 8cec60e.

📒 Files selected for processing (9)
  • pkg/dashboard/adapter/manager.go
  • pkg/mcs/scheduling/server/config/config.go
  • pkg/schedule/config/config_provider.go
  • pkg/schedule/coordinator.go
  • pkg/schedule/coordinator_test.go
  • server/api/config.go
  • server/config/config_test.go
  • server/config/persist_options.go
  • server/server.go

Comment thread server/server.go Outdated
Comment thread server/server.go
@rleungx rleungx force-pushed the fix-7464-atomic-config-update-master branch from 8cec60e to 3a711ee Compare June 25, 2026 08:17

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/mcs/scheduling/server/cluster.go`:
- Around line 414-415: The error returned from the ctx.Done() branch in
cluster.go uses uppercase text and should follow the repository rule for
lowercase error strings. Update the error message in the
scheduling/server/cluster.go path within the cluster-related function handling
the ctx.Done() case so the returned errors.New message is entirely lowercase and
keeps the same meaning.

In `@pkg/mcs/scheduling/server/server.go`:
- Around line 211-219: The member probe calls in updatePDLeader can block
forever because they reuse the loop context without a deadline. Wrap the
external calls to GetClient().Status and GetDelegateClient in
context.WithTimeout using a short per-call timeout, and make sure each derived
context is canceled with defer. Keep the existing leader-check flow in
updatePDLeader, but apply the timeout around each endpoint probe so a slow or
unresponsive member cannot stall the whole scheduling loop.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bba85821-30b0-4592-bff7-e15a0476e558

📥 Commits

Reviewing files that changed from the base of the PR and between 8cec60e and 3a711ee.

📒 Files selected for processing (11)
  • pkg/dashboard/adapter/manager.go
  • pkg/mcs/scheduling/server/cluster.go
  • pkg/mcs/scheduling/server/config/config.go
  • pkg/mcs/scheduling/server/server.go
  • pkg/schedule/config/config_provider.go
  • pkg/schedule/coordinator.go
  • pkg/schedule/coordinator_test.go
  • server/api/config.go
  • server/config/config_test.go
  • server/config/persist_options.go
  • server/server.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • pkg/schedule/config/config_provider.go
  • server/config/config_test.go
  • pkg/dashboard/adapter/manager.go
  • server/config/persist_options.go
  • pkg/schedule/coordinator.go
  • pkg/mcs/scheduling/server/config/config.go
  • server/server.go
  • server/api/config.go

Comment thread pkg/mcs/scheduling/server/cluster.go Outdated
Comment thread pkg/mcs/scheduling/server/server.go Outdated
@rleungx rleungx changed the title server: make config updates atomic server, schedule: make config updates atomic Jun 25, 2026
@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 25, 2026
Signed-off-by: Ryan Leung <rleungx@gmail.com>
@rleungx rleungx force-pushed the fix-7464-atomic-config-update-master branch from 3a711ee to abc4684 Compare June 25, 2026 08:36
@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 25, 2026
Signed-off-by: Ryan Leung <rleungx@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/config/config_test.go`:
- Around line 454-470: The test around opt.Persist and the storage
synchronization should fail immediately if the serialization invariant is broken
instead of silently accepting the race. Update the logic in the test that waits
on storage.secondStarted so that seeing it before closing storage.releaseFirst
triggers a test failure, and add an explicit timeout around the wait on
storage.firstStarted so regressions don’t hang. Use the existing test helpers
and symbols opt.Persist, storage.firstStarted, storage.secondStarted, and
storage.releaseFirst to keep the fix localized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e55afd3e-4596-4e6c-97b9-1e50ecbe0679

📥 Commits

Reviewing files that changed from the base of the PR and between abc4684 and b033d02.

📒 Files selected for processing (2)
  • server/config/config_test.go
  • server/config/persist_options.go

Comment thread server/config/config_test.go Outdated
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.71429% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.24%. Comparing base (4335bc3) to head (5c2ea3a).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10942      +/-   ##
==========================================
- Coverage   79.26%   79.24%   -0.03%     
==========================================
  Files         540      541       +1     
  Lines       74954    75189     +235     
==========================================
+ Hits        59412    59580     +168     
- Misses      11371    11430      +59     
- Partials     4171     4179       +8     
Flag Coverage Δ
unittests 79.24% <85.71%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

rleungx added 3 commits June 25, 2026 17:51
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: Ryan Leung <rleungx@gmail.com>
Signed-off-by: Ryan Leung <rleungx@gmail.com>
@rleungx

rleungx commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

/test pull-unit-test-next-gen-3

}
notifier <- struct{}{}
select {
case notifier <- struct{}{}:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will lose this update if the notifier is full.

Comment thread server/api/config.go
var (
_ *sc.ScheduleConfig = nil
resourceManagerControllerConfigURL = "/resource-manager/api/v1/config/controller"
errConfigNotUpdated = errors.New("config is not updated")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming it to config updated failed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Updating config is not atomic

2 participants