Skip to content

split scatter: stabilize pending and avoid read-pool pressure#10856

Open
lhy1024 wants to merge 12 commits into
tikv:masterfrom
lhy1024:lhy/split-scatter-stable-pending
Open

split scatter: stabilize pending and avoid read-pool pressure#10856
lhy1024 wants to merge 12 commits into
tikv:masterfrom
lhy1024:lhy/split-scatter-stable-pending

Conversation

@lhy1024

@lhy1024 lhy1024 commented Jun 9, 2026

Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: Ref #10830

This PR fixes load-based split-scatter behavior around pending-region lifecycle and read CPU pressure:

  • Pending split-scatter entries can be retried or removed with stale region state if they are not tied to the expected split result version.
  • Internal split-scatter can keep selecting leader targets whose TiKV unified read pool is already under high CPU pressure.
  • When read CPU is already balanced, split-scatter should be delayed and kept pending instead of being treated as a completed no-op.

What is changed and how does it work?

split scatter: stabilize pending regions

Record the source region and new regions in the same pending batch, and bind each pending entry to the expected post-split RegionEpoch version.

The dispatcher waits until both the pending region and the source region have observed the expected split version, removes stale pending entries whose region version has already advanced beyond that version, and rechecks pending identity before delaying or deleting entries.

Pending entries now use a 10s retry backoff and a 30min TTL so transient schedule/store/replica/read-CPU gates do not drop split-scatter too early.

split scatter: guard hot read pool targets

Use TiKV StoreConfig sync to read readpool.unified.max-thread-count, track each store's recent 3-minute max unified-read CPU, and filter internal split-scatter leader target stores whose read CPU is close to the unified read-pool limit.

The read-pool pressure guard is only used by PD-internal split-scatter; admin/API scatter keeps its existing behavior.

split scatter: keep balanced read CPU pending

When internal split-scatter chooses a different target leader but moving the leader would not reduce read CPU pressure enough, return a dedicated balanced-read-CPU error. The checker treats this as a transient gate, delays the pending entry, and retries later instead of deleting it as no-op.

If the selected target leader is the current leader, balanced-read-CPU does not block dispatch.

Expose the same store config and recent read CPU data in scheduling service mode so the guard also works in MCS.

Check List

Tests

  • Unit test

Test commands run:

  • go test ./pkg/schedule/checker -run 'TestDispatchSplitScatterKeepsPendingWhenReadCPUIsBalanced|TestDispatchSplitScatterBacksOff|TestCheckSplitScatterRegionsCreatesScatterOperator|TestDispatchSplitScatterKeepsPendingUntilSplitHeartbeat|TestDispatchSplitScatterUsesRangeScatterGroup|TestDispatchSplitScatterKeepsStableGroupWhenRegionSplitsAgain' -count=1
  • go test ./pkg/schedule/scatter -run 'TestInternalScatterSkipsWhenReadCPUIsBalanced|TestInternalScatterAllowsWhenReadCPUIsImbalanced|TestInternalScatterBalancedReadCPUDoesNotBlockSameLeader|TestInternalScatterLeaderFiltersReadPoolPressure|TestInternalScatterPeerSelection|TestInternalScatterLeaderSelection|TestSeedGroupDistributionByRangeAppliesNetChange|TestSeededDistributionSkipsUpdateWhenAddOperatorRejected' -count=1
  • go test ./server/cluster ./pkg/mcs/scheduling/server -run 'TestHandleAskBatchSplit|TestAskBatchSplitRecordsSplitScatterInSchedulingService' -count=1
  • go test ./pkg/schedule/filter ./pkg/schedule/config -run 'TestReadPoolPressureFilterUsesRecentMaxReadCPU|TestStoreConfigEqualChecksReadPool' -count=1
  • git diff --check HEAD~1..HEAD

Side effects

  • Internal split-scatter may skip leader target stores whose TiKV unified read pool is already under high CPU pressure.
  • Internal split-scatter may keep balanced-read-CPU cases pending and retry them later instead of completing them immediately.

Release note

Fix PD load split-scatter pending lifecycle and avoid scattering leaders to TiKV stores with saturated or already balanced read pools.

@ti-chi-bot

ti-chi-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. labels Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR adds recent-max read CPU tracking to rolling store statistics, models TiKV's unified read pool configuration in StoreConfig, and introduces a ReadPoolPressureFilter that excludes high-CPU stores from internal leader scatter targets. Split-scatter pending handling is strengthened with explicit epoch wait-version checks to ensure regions have completed required splits before scheduling, with increased retry backoff and comprehensive test coverage.

Changes

Read Pool Pressure Filter and Split-Scatter Epoch Guards

Layer / File(s) Summary
Recent-max read CPU statistics tracking and exposure
pkg/statistics/store.go, server/cluster/scheduling_controller.go, pkg/mcs/scheduling/server/cluster.go, pkg/mock/mockcluster/mockcluster.go
RollingStoreStats gains a readCPURecentMax rolling-window field initialized via newRollingStoreStats, updated during Observe and Set with computed read CPU, exposed via GetReadCPURecentMax and StoresStats.GetStoreReadCPURecentMax, then forwarded through server scheduling controller, mcs cluster, and mock cluster wrappers.
Unified read pool config in StoreConfig and provider interfaces
pkg/schedule/config/store_config.go, pkg/schedule/config/config_provider.go, pkg/schedule/config/store_config_test.go, server/config/persist_options.go, pkg/mcs/scheduling/server/config/config.go
StoreConfig adds ReadPool/UnifiedReadPool structs and extends Equal to cover ReadPool; GetUnifiedReadPoolMaxThreadCount is added to StoreConfigProvider interface, PersistOptions, and mcs PersistConfig; a unit test confirms Equal detects ReadPool changes.
Read pool pressure filter implementation and registration
pkg/schedule/filter/read_pool_pressure_filter.go, pkg/schedule/filter/counter.go, pkg/schedule/filter/counter_test.go, pkg/schedule/filter/filters_test.go
Adds readPoolPressure filter type constant and string label, introduces NewReadPoolPressureFilter with Target logic that returns statusStoreBusy when CPU ≥ threadCount × usagePerCore × threshold, and validates the filter behavior with unit tests.
Internal scatter leader filtering with read pool pressure
pkg/schedule/scatter/region_scatterer.go, pkg/schedule/scatter/region_scatterer_test.go
region_scatterer adds imports, provider interfaces, and splitScatterReadCPUByStore helper; scatterRegionWithType derives readCPUByStore and passes it to filterAllowedLeaderCandidateStores, which applies ReadPoolPressureFilter to exclude high-pressure candidates; tests verify the pressured store is excluded in favor of remaining candidates.
Split-scatter epoch wait-version guard and stale cleanup
pkg/schedule/checker/split_scatter.go, pkg/schedule/checker/split_scatter_test.go, server/cluster/split_scatter_test.go
splitScatterPendingItem gains regionWaitVersion field; splitScatterRetryBackoff scales from 1 second to 10 seconds; splitScatterRegionVersion helper safely extracts region epoch; recordSplitScatterBatch derives regionWaitVersion from caller-provided sourceWaitVersion or source epoch+1, storing it in both child and source pending entries; collectTopPendingSplitScatter filters items using epoch check; dispatch rechecks and skips when mismatch detected; new test helpers verify pending tracking and back-off behavior; server tests clone split regions with WithIncVersion for heartbeat processing.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(173, 216, 230, 0.5)
    Note over scatterRegionWithType,ReadPoolPressureFilter: Internal scatter leader candidate filtering
    scatterRegionWithType->>splitScatterReadCPUByStore: StoreKindLoads + recentMaxProvider
    splitScatterReadCPUByStore-->>scatterRegionWithType: readCPUByStore map
    scatterRegionWithType->>filterAllowedLeaderCandidateStores: candidates + readCPUByStore
    filterAllowedLeaderCandidateStores->>ReadPoolPressureFilter: Target(store) for each candidate
    ReadPoolPressureFilter-->>filterAllowedLeaderCandidateStores: statusStoreBusy or statusOK
    filterAllowedLeaderCandidateStores-->>scatterRegionWithType: filtered allowed candidates
  end
Loading
sequenceDiagram
  rect rgba(144, 238, 144, 0.5)
    Note over recordBatch,dispatch: Split-scatter epoch-version guard lifecycle
    recordBatch->>pending map: store pending item with regionWaitVersion
    collectTopPending->>pending map: read pending under lock
    collectTopPending->>collectTopPending: compare live region version vs regionWaitVersion
    collectTopPending->>collectTopPending: skip if not yet at minimum version
    dispatch->>pending map: recheck live region against regionWaitVersion
    dispatch->>dispatch: verify live region version >= regionWaitVersion
    dispatch->>pending map: skip dispatch if epoch requirement not met
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • tikv/pd#10621: Introduces the load-based internal split-scatter pending and dispatch flow in pkg/schedule/checker/split_scatter.go that this PR builds epoch/version-based staleness guards on top of.
  • tikv/pd#10652: Forwards the split reason so LOAD triggers RecordSplitScatterBatch, and this PR updates recordSplitScatterBatch/split-scatter pending dispatch gating based on stored batch metadata.
  • tikv/pd#10691: Modifies split-scatter pending item tracking and dispatch filtering in the same split_scatter.go code paths where this PR adds epoch wait-version recording and back-off adjustments.

Suggested labels

size/XL, needs-cherry-pick-release-8.5

Suggested reviewers

  • rleungx
  • okJiang
  • bufferflies
  • ystaticy

Poem

🐇 Hoppity hop through the CPU lane,
Read pool pressure won't scatter in vain!
Wait-versions track the epochs that grow,
Stale pending items now quickly must go.
Each store knows its thermal design fate,
Internal scatter picks leaders first-rate! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately summarizes the main changes: stabilizing split-scatter pending regions and avoiding read-pool pressure when selecting leader targets.
Description check ✅ Passed The description follows the template with all required sections: problem statement with issue reference, detailed explanation of changes, test coverage, and release notes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 9, 2026
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch 6 times, most recently from 582d591 to 2a8c7d6 Compare June 11, 2026 12:48
@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 73.83966% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.12%. Comparing base (6e3d72f) to head (01702d5).
⚠️ Report is 14 commits behind head on master.

❌ Your patch check has failed because the patch coverage (73.83%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10856      +/-   ##
==========================================
- Coverage   79.14%   79.12%   -0.02%     
==========================================
  Files         536      541       +5     
  Lines       74033    75044    +1011     
==========================================
+ Hits        58594    59380     +786     
- Misses      11311    11464     +153     
- Partials     4128     4200      +72     
Flag Coverage Δ
unittests 79.12% <73.83%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch 4 times, most recently from dea3b17 to 130a58c Compare June 12, 2026 03:06
@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 12, 2026
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch 3 times, most recently from dd50464 to d08f184 Compare June 12, 2026 05:21
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch 2 times, most recently from 882b374 to a5f5e83 Compare June 12, 2026 05:52
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch 2 times, most recently from 9f27430 to c24e3d9 Compare June 12, 2026 08:44
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch from 0bd7327 to 68d3151 Compare June 17, 2026 06:21
lhy1024 added 2 commits June 17, 2026 15:20
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>

// GetUnifiedReadPoolMaxThreadCount returns TiKV unified read pool max thread count.
func (o *PersistConfig) GetUnifiedReadPoolMaxThreadCount() uint64 {
return o.GetStoreConfig().GetUnifiedReadPoolMaxThreadCount()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to configure it? Just get it from tikv?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still sourced from TiKV. Root PD syncs StoreConfig from a TiKV /config endpoint and persists/propagates it; the scheduling service reads the same StoreConfig through the config watcher. This method only exposes the synced value through StoreConfigProvider so the scatterer can use it in both monolithic PD and scheduling-service mode.

CheckRegionKeys(uint64, uint64) error
IsEnableRegionBucket() bool
IsRaftKV2() bool
GetUnifiedReadPoolMaxThreadCount() uint64

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do if not all stores are the same?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This follows the existing StoreConfig model in PD: PD syncs StoreConfig from one available TiKV and treats it as cluster-level config, as we already do for region split size, region bucket, and raft-kv2. TiKV configs are expected to be homogeneous in a cluster. Supporting heterogeneous read-pool sizes would require a per-store StoreConfig/heartbeat model, which is larger than this PR.

Comment thread pkg/schedule/filter/counter_test.go Outdated
}{
{int(storeStateTombstone), "store-state-tombstone-filter"},
{int(filtersLen - 1), "store-state-slow-trend-filter"},
{int(readPoolPressure), "read-pool-pressure-filter"},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should add a new item, not remove the latest.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I will keep the existing latest filter case and add readPoolPressure as a new case instead of replacing it.

}

// Target filters stores whose read CPU is close to the unified read-pool limit.
func (f *readPoolPressureFilter) Target(_ config.SharedConfigProvider, store *core.StoreInfo) *plan.Status {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only pick the target store if the source store is only busy?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This filter is intentionally target-side only. Internal split-scatter has already been triggered by the load-split flow; here we are not deciding whether the source is busy enough to scatter, but whether a target store still has enough read-pool headroom to receive a new leader.

The target store may not be busy yet, but moving a leader to it can immediately increase its read load. So this filter acts as a proactive brake: once the target read CPU approaches the unified read-pool capacity threshold, we avoid selecting it as the new leader target before it actually becomes busy. Source-side transferability is checked separately by isAllowedLeaderSource. If all candidate targets are under read-pool pressure, no target leader will be selected and the internal scatter will be skipped/retried later.

@bufferflies bufferflies left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

lhy1024 added 2 commits June 23, 2026 21:30
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
(cherry picked from commit 36ba64aa01908b19d6c14369a8cf2c40662782e2)
@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 23, 2026
lhy1024 added 2 commits June 23, 2026 23:11
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch from 87d7734 to eb243c5 Compare June 23, 2026 16:29
@lhy1024

lhy1024 commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

/retest

@lhy1024 lhy1024 changed the title split scatter: stabilize pending regions and filter read-pool pressure stores split scatter: stabilize pending and avoid read-pool pressure Jun 24, 2026
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch from 01702d5 to e49f324 Compare June 24, 2026 03:46
Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
@lhy1024

lhy1024 commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

Case1

After introducing split scatter, for test cases where the read pool is already under significant and relatively uniform pressure, unnecessary scattering will cause performance degradation.
image

Case2

Therefore, we introduced a target CPU filter mechanism. If the CPU usage of a target exceeds 70% of the unified read pool CPU, we will no longer select it as a target. This restores the QPS to normal, but the tail latency will still be high due to task backlog.
image
Strangely, after further slowing down the speed, it was found that even after the scatter operator stopped, the subsequent wait duration still increased after a period of scatter operation.
image
image

Case3

Therefore, we introduced another mechanism: if the CPU usage difference between the source and target is less than 10%, we will no longer perform scheduling. This aligns with the design philosophy, as the goal of split scatter is to distribute the results of load base splitting, and the ultimate goal of this distribution is to prevent CPU overruns.
image
img_v3_0212v_7f3dfe89-ec5f-4f65-a9d2-cc0ebbba32bg
img_v3_0212v_f3bd8f76-eefa-4121-a9e1-4b7af4044e3g

@lhy1024 lhy1024 requested review from bufferflies and rleungx June 24, 2026 05:35
@ti-chi-bot

ti-chi-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bufferflies
Once this PR has been reviewed and has the lgtm label, please assign niubell for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 24, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

[LGTM Timeline notifier]

Timeline:

  • 2026-06-24 06:51:03.810404288 +0000 UTC m=+103147.473629801: ☑️ agreed by bufferflies.

@ti-chi-bot

ti-chi-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@lhy1024: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-3 ec6f7cd link true /test pull-unit-test-next-gen-3

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Signed-off-by: lhy1024 <19542290+lhy1024@users.noreply.github.com>
@lhy1024 lhy1024 force-pushed the lhy/split-scatter-stable-pending branch from ec6f7cd to 33bf4e1 Compare June 24, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants