Skip to content

Improves segmented top-k test compilation times #9404

Merged
elstehle merged 3 commits into
NVIDIA:mainfrom
elstehle:enh/topk-test-compilation-times
Jun 12, 2026
Merged

Improves segmented top-k test compilation times #9404
elstehle merged 3 commits into
NVIDIA:mainfrom
elstehle:enh/topk-test-compilation-times

Conversation

@elstehle

Copy link
Copy Markdown
Contributor

Description

The sort we use for verification contributes a majority of the compilation times. A host-only sort for verification would cut the compilation time in half but runtime would exploude to 1 min and above. DeviceSegmentedRadixSort is a reasonable alternative, cutting Improves top-k test compilation times

Before / after

Test Metric Before (DeviceSegmentedSort) After (DeviceSegmentedRadixSort) Δ
keys compile 1:48.7 (108.7 s) 1:21.8 (81.8 s) −26.9 s (−25%)
runtime 3.46 / 3.41 s 3.54 s +0.1 s
pairs compile 2:29.7 (149.7 s) 1:46.0 (106.0 s) −43.6 s (−29%)
runtime 1.88 / 1.84 s 1.95 s +0.07 s

@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 11, 2026
@elstehle elstehle force-pushed the enh/topk-test-compilation-times branch from 4a1a735 to abc9e62 Compare June 11, 2026 14:43
@elstehle elstehle marked this pull request as ready for review June 11, 2026 14:45
@elstehle elstehle requested a review from a team as a code owner June 11, 2026 14:45
@elstehle elstehle requested a review from pauleonix June 11, 2026 14:45
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a26ea090-9984-43e8-b5e6-1aa8a7d9d14d

📥 Commits

Reviewing files that changed from the base of the PR and between 4cbbe8c and b075075.

📒 Files selected for processing (3)
  • cub/test/catch2_test_device_segmented_topk_keys.cu
  • cub/test/catch2_test_device_segmented_topk_pairs.cu
  • cub/test/catch2_test_device_topk_common.cuh
✅ Files skipped from review due to trivial changes (2)
  • cub/test/catch2_test_device_segmented_topk_pairs.cu
  • cub/test/catch2_test_device_segmented_topk_keys.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cub/test/catch2_test_device_topk_common.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

Summary

This PR reduces compilation time for segmented top-k tests by switching the verification reference sort from cub::DeviceSegmentedSort to cub::DeviceSegmentedRadixSort. Measured results show ~25–29% faster compile times for the affected tests with only a small runtime increase (~0.07–0.1 s).

Changes

  • cub/test/catch2_test_device_topk_common.cuh

    • Replaced include of DeviceSegmentedSort with DeviceSegmentedRadixSort.
    • Switched the verification implementation to use cub::DeviceSegmentedRadixSort::SortKeys / SortKeysDescending, preserving the two-step temp-storage query/allocation + execution flow.
    • Updated inline comment/TODO to explain rationale (radix sort compiles much faster with minimal runtime cost).
  • cub/test/catch2_test_device_segmented_topk_keys.cu

    • Documentation-only tweak: clarified comments that per-segment k is passed to batched_topk_keys as a deferred (not immediate) sequence. No behavioral changes.
  • cub/test/catch2_test_device_segmented_topk_pairs.cu

    • Documentation-only tweak: clarified comments that per-segment k is passed to batched_topk_pairs as a deferred (not immediate) sequence. No behavioral changes.

Metrics

  • keys:
    • Compile time: DeviceSegmentedSort 108.7 s → DeviceSegmentedRadixSort 81.8 s (−26.9 s, −25%)
    • Runtime: 3.46 / 3.41 s → 3.54 s (+~0.1 s)
  • pairs:
    • Compile time: 149.7 s → 106.0 s (−43.6 s, −29%)
    • Runtime: 1.88 / 1.84 s → 1.95 s (+~0.07 s)

Review notes / follow-ups

Impact

  • Significant reduction in test compilation time for segmented top-k verification.
  • Negligible runtime overhead.
  • No changes to public APIs or observable test behavior; two files contain comment clarifications only.

suggestion:

Walkthrough

This PR switches the reference segmented sort in top-k tests from DeviceSegmentedSort to DeviceSegmentedRadixSort and updates four inline test comments to state that per-segment k is passed as a deferred sequence to batched top-k operations.

Changes

Top-K Test Infrastructure Update

Layer / File(s) Summary
Reference sort implementation switch
cub/test/catch2_test_device_topk_common.cuh
Include directive changed from device_segmented_sort.cuh to device_segmented_radix_sort.cuh. The segmented_sort_keys function now queries temp storage and calls DeviceSegmentedRadixSort::SortKeys and DeviceSegmentedRadixSort::SortKeysDescending (two-step temp-storage query/allocation and execution). In-function comment/TODO updated.
Test comment corrections
cub/test/catch2_test_device_segmented_topk_keys.cu, cub/test/catch2_test_device_segmented_topk_pairs.cu
Inline comments in four test cases (fixed-size and variable-size segment tests for both topk_keys and topk_pairs) corrected to document that per-segment k parameters are passed as deferred sequences to the batched operations.

Possibly related PRs

  • NVIDIA/cccl#9311: Added per-segment k test coverage and deferred/immediate sequence forms in the same batched_topk_* call sites that this PR documents.

Suggested reviewers

  • gevtushenko

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (4)
cub/test/catch2_test_device_segmented_topk_keys.cu (2)

359-359: ⚡ Quick win

suggestion: Grammar: "an deferred" should be "a deferred".


458-458: ⚡ Quick win

suggestion: Grammar: "an deferred" should be "a deferred".

cub/test/catch2_test_device_segmented_topk_pairs.cu (2)

501-501: ⚡ Quick win

suggestion: Grammar: "an deferred" should be "a deferred".


621-621: ⚡ Quick win

suggestion: Grammar: "an deferred" should be "a deferred".


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9e11938b-cf83-4294-87a3-86c15cc3befa

📥 Commits

Reviewing files that changed from the base of the PR and between 9cd6640 and abc9e62.

📒 Files selected for processing (3)
  • cub/test/catch2_test_device_segmented_topk_keys.cu
  • cub/test/catch2_test_device_segmented_topk_pairs.cu
  • cub/test/catch2_test_device_topk_common.cuh

@github-actions

This comment has been minimized.

@fbusato

fbusato commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

question. We use DeviceSegmentedRadixSort several times in our tests. What about compile it in a dedicated file and use extern template for the common case?

@elstehle

elstehle commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

question. We use DeviceSegmentedRadixSort several times in our tests. What about compile it in a dedicated file and use extern template for the common case?

Yeah, I think that's worth exploring. Like, we'd extern-template a segmented_sort_keys helper for what we'll identify to be common key/offset types. I will file an issue to explore the option as a follow-up and keep this PR focused on the immediate gains from switching to DeviceSegmentedRadixSort.

I've opened this issue to explore the option: #9413

@elstehle elstehle force-pushed the enh/topk-test-compilation-times branch from 4cbbe8c to b075075 Compare June 12, 2026 05:23
@elstehle elstehle enabled auto-merge (squash) June 12, 2026 05:24
@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 38m 22s: Pass: 100%/242 | Total: 1d 21h | Max: 31m 54s | Hits: 99%/146538

See results here.

@elstehle elstehle merged commit 43c30c2 into NVIDIA:main Jun 12, 2026
263 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants