Skip to content

schedule: skip slow/evicted stores in hot region scheduling#10946

Open
LykxSassinator wants to merge 1 commit into
tikv:masterfrom
LykxSassinator:lucasliang/no_scheduling_on_slow_node
Open

schedule: skip slow/evicted stores in hot region scheduling#10946
LykxSassinator wants to merge 1 commit into
tikv:masterfrom
LykxSassinator:lucasliang/no_scheduling_on_slow_node

Conversation

@LykxSassinator

@LykxSassinator LykxSassinator commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: Close #10936

Hot-region scheduling can still choose a destination store that has already
been marked as unsuitable for hot traffic, such as slow stores, stopping
stores, or stores evicted by slow-trend detection.

What is changed and how does it work?

add a hot-region-specific target filter that rejects stores evicted as
slow-store, stopping-store, or slow-trend

apply the filter to hot-region destination selection in the balance-hot-region
solver, grant-hot-region scheduler, and shuffle-hot-region scheduler

add unit tests for the new filter and scheduler regressions covering hot-read
move-leader fallback and hot-read peer scheduling

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

Hot-region schedulers now skip stores evicted by slow-store, stopping-store,
or slow-trend checks when selecting scheduling targets.

Deep Review

Problem Summary

  • Issue #10936 is about hot-region scheduling still selecting destination stores that PD has already marked as unsuitable for receiving hot-region traffic.
  • The practical requirement is narrower than generic region movement: hot-region schedulers should not move hot peers or hot leaders onto stores evicted due to slow-store, stopping-store, or slow-trend signals.
  • The earlier broader fix path was risky because putting this policy into shared StoreStateFilter regionTarget logic would also change unrelated MoveRegion users such as replica repair and generic balance-region flows.

Solution Walkthrough

  • pkg/schedule/filter/filters.go:572-600 adds NewHotRegionEvictedTargetFilter, a target-only filter that rejects stores when any of these store-state flags are set:
    • store.EvictedAsSlowStore()
    • store.EvictedAsStoppingStore()
    • store.IsEvictedAsSlowTrend()
  • pkg/schedule/schedulers/hot_region_solver.go:564-569 wires the new filter into the hot-region movePeer destination path, which covers hot peer relocation.
  • pkg/schedule/schedulers/hot_region_solver.go:585-588 also adds the filter to the hot-read move-leader fallback candidate set, which is the non-obvious path that can still materialize as a MoveLeader-style choice inside hot scheduling.
  • pkg/schedule/schedulers/grant_hot_region.go:301-304 applies the same filter to grant-hot-region destination selection.
  • pkg/schedule/schedulers/shuffle_hot_region.go:156-160 applies it to shuffle-hot-region destination selection.
  • pkg/schedule/filter/filters_test.go:304-323 adds direct coverage for the new filter and verifies all three eviction states.
  • pkg/schedule/schedulers/hot_region_test.go:1333-1422 adds scheduler regressions for:
    • hot-read move-leader fallback skipping evicted targets
    • hot-read peer scheduling skipping evicted targets
  • Importantly, the patch does not modify shared StoreStateFilter regionTarget behavior, so non-hot MoveRegion callers keep their previous semantics.

Findings (ordered by severity)

  • No concrete correctness, compatibility, or robustness bugs found in the current scoped implementation.

Costs and Negative Impacts

  • Correctness:
    • The current patch fixes the hot-region scheduling gap without changing global MoveRegion behavior.
    • Coverage is broader than the original practical suggestion because it consistently honors slow-store, stopping-store, and slow-trend eviction states, not only slowStoreEvicted.
  • Security:
    • None found.
  • Compatibility:
    • The behavioral change is intentionally limited to hot-region schedulers and should not affect replica checker, balance-region, or other non-hot placement flows.
  • Robustness:
    • The main hot-region target-selection paths are covered in code and tests.
    • Residual risk is mostly future regression from missed filter wiring in less-direct hot schedulers, not a currently observed logic hole.
  • Cognitive Load:
    • Slight increase from introducing one hot-region-specific filter.
    • There is also a small naming mismatch in the internal filter counter: the new filter reports leaderEvicted / "leader-evicted-filter" in pkg/schedule/filter/counter.go:55-86, while the exported constructor name is now NewHotRegionEvictedTargetFilter. This is a maintainability and observability clarity cost, not a correctness bug.
  • CPU:
    • Negligible. The filter adds a few boolean checks per hot-region target candidate.
  • Memory:
    • Negligible.
  • Log Volume:
    • No meaningful change.

Engineering Rules Check

  • None.

Questions and Assumptions

  • Assumed the intended scope is exactly the hot-region scheduler family, based on the issue context and the user’s requested practical fix direction.
  • Assumed the current worktree diff is the authoritative patch under review.

Suggested Tests / Validation

  • The added unit coverage in pkg/schedule/filter/filters_test.go:304-323 and pkg/schedule/schedulers/hot_region_test.go:1333-1422 is directionally correct and exercises the two highest-risk solver paths.
  • Residual test gap: add a scheduler-level negative test for grant-hot-region rejecting an evicted destination store.
  • Residual test gap: add a scheduler-level negative test for shuffle-hot-region rejecting an evicted destination store.

Summary by CodeRabbit

  • New Features

    • Added support for avoiding destination stores that have been marked for hot-region eviction during hot leader and region scheduling.
    • Expanded scheduling logic so these stores are skipped when selecting target stores.
  • Bug Fixes

    • Improved hot read scheduling to prevent transfers to stores that should no longer receive leadership or peer moves.
    • Added coverage to verify scheduling behaves correctly across multiple eviction states.

Signed-off-by: lucasliang <nkcs_lykx@hotmail.com>
@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-triage-completed labels Jun 26, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bufferflies for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 06825b08-f927-4a54-917a-a2bfef42b18c

📥 Commits

Reviewing files that changed from the base of the PR and between 6a29eb9 and 1c49f48.

📒 Files selected for processing (7)
  • pkg/schedule/filter/counter.go
  • pkg/schedule/filter/filters.go
  • pkg/schedule/filter/filters_test.go
  • pkg/schedule/schedulers/grant_hot_region.go
  • pkg/schedule/schedulers/hot_region_solver.go
  • pkg/schedule/schedulers/hot_region_test.go
  • pkg/schedule/schedulers/shuffle_hot_region.go

📝 Walkthrough

Walkthrough

Adds a new hot-region target filter that rejects evicted stores, registers its filter type name, wires it into hot-region scheduler target selection, and adds unit and scheduler tests for eviction cases.

Changes

Hot-region target eviction filtering

Layer / File(s) Summary
Filter type and rejection logic
pkg/schedule/filter/counter.go, pkg/schedule/filter/filters.go
Adds the leaderEvicted filter type, its string label, and NewHotRegionEvictedTargetFilter with source acceptance and target rejection logic.
Scheduler target selection
pkg/schedule/schedulers/grant_hot_region.go, pkg/schedule/schedulers/hot_region_solver.go, pkg/schedule/schedulers/shuffle_hot_region.go
Adds the new filter to hot-region destination candidate lists in grant, solver, and shuffle scheduling paths.
Unit and scheduler coverage
pkg/schedule/filter/filters_test.go, pkg/schedule/schedulers/hot_region_test.go
Adds unit coverage for the new filter and scheduler tests covering evicted target stores in hot-read scheduling.

Sequence Diagram(s)

sequenceDiagram
  participant scheduler as "grantHotRegionScheduler.transfer"
  participant filter as "hotRegionEvictedTargetFilter"
  participant store as "core.StoreInfo"
  scheduler->>filter: Target(store)
  filter->>store: EvictedAsSlowStore() / EvictedAsStoppingStore() / IsEvictedAsSlowTrend()
  store-->>filter: eviction state
  filter-->>scheduler: statusOK / statusStoreRejectLeader
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I hopped through hot-region fields today 🐰
Dodged slow-score carrots along the way
Evicted burrows? No target for me
Only the lively stores get a tea
Hop-hop—schedule stays spry!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main change: skipping slow or evicted stores in hot-region scheduling.
Description check ✅ Passed The description matches the template well, includes the issue number, change summary, tests, and a release note.
Linked Issues check ✅ Passed The patch aligns with #10936 by preventing hot-region targets from using stores marked slow or otherwise evicted.
Out of Scope Changes check ✅ Passed The changes stay within hot-region scheduling filters and tests, with no clear unrelated or extraneous modifications.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 26, 2026
@LykxSassinator

Copy link
Copy Markdown
Contributor Author

/cc @rleungx @bufferflies

@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79.17%. Comparing base (4335bc3) to head (1c49f48).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10946      +/-   ##
==========================================
- Coverage   79.26%   79.17%   -0.09%     
==========================================
  Files         540      541       +1     
  Lines       74954    75103     +149     
==========================================
+ Hits        59412    59465      +53     
- Misses      11371    11438      +67     
- Partials     4171     4200      +29     
Flag Coverage Δ
unittests 79.17% <94.44%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


// Target filters stores when select them as schedule target.
func (*hotRegionEvictedTargetFilter) Target(_ config.SharedConfigProvider, store *core.StoreInfo) *plan.Status {
if store.EvictedAsSlowStore() || store.EvictedAsStoppingStore() || store.IsEvictedAsSlowTrend() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only filters stores that have already been marked as evicted; a store with a high slow score but no eviction flag can still pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PD continues to schedule hot regions on nodes with high slow scores, causing spikes upon node recovery

2 participants