Vectorize output store in ublkcp DeviceTransform kernel#9481
Vectorize output store in ublkcp DeviceTransform kernel#9481nanan-nvidia wants to merge 14 commits into
Conversation
12991ad to
c6559fb
Compare
|
/ok to test c6559fb |
|
/ok to test 365bf3a |
365bf3a to
c2a253d
Compare
|
/ok to test c2a253d |
|
@nanan-nvidia can you please post the output of |
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Important: Once we approve the general mechanics, we should update the per-function documentation.
| // didn't merge the changes. The problem was mostly a 25% increase in integer instructions, as shown by ncu. | ||
| template <int threads_per_block, | ||
| int UnrollFactor, | ||
| int OutputAlign, |
This comment was marked as outdated.
This comment was marked as outdated.
Sorry, something went wrong.
| // When the caller guarantees aligned_size_t<N> num_items, i.e. the output pointer is N-byte aligned and the element | ||
| // count is a multiple of N, if 1. there are no predicates, 2. memory layout is contiguous, 3. semantically we can |
There was a problem hiding this comment.
Critical:
the element count is a multiple of N
This is only true for elements with power of two size. Think of aligned_size_t<16>{n} and int3*, which is valid for every n that is a multiple of 4.
There was a problem hiding this comment.
In vectorize_store we require both input element size and output element size to be pow2, so this should not be a bug.
From the definition,
explicit constexpr aligned_size_t(size_t __s) : value(__s) {
_CCCL_ASSERT(value % align == 0,
"aligned_size_t must be constructed with a size that is a multiple of the alignment");
}
It seems aligned_size_t<16>(n) just mean n % 16 == 0? Does it actually mean (sizeof(T) * n) % 16 == 0 semantically?
There was a problem hiding this comment.
aligned_size_t refers to a size in bytes, not in elements, at least according to what we currently document.
|
@nanan-nvidia the results look promising. I think the implementation needs a bit more work though. Please check out the notes I have from a discussion with DevTech working on PyTorch: https://github.com/NVIDIA-dev/cccl_private/issues/598. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
/ok to test 6b64983 |
ublkcp kernel for deviceTransform when user promises cuda::aligned_size_t<16>…type-erased iterator)
…default-constructible outputs)
…ar3 is_aligned assert)
a1d2946 to
c9a53d7
Compare
|
/ok to test c9a53d7 |
|
/ok to test 897a0a4 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
OverviewThis PR improves performance of CUB Key Changes1) New tuning policy:
|
| Layer / File(s) | Summary |
|---|---|
TransformAsyncCopyPolicy store_vec_size field cub/cub/device/dispatch/tuning/tuning_transform.cuh |
Adds store_vec_size field (default 0 = auto) to TransformAsyncCopyPolicy, extends operator== to include it in equality comparison, and updates operator<< to display it in policy output. |
Runtime can_vectorize computation in dispatcher cub/cub/device/dispatch/dispatch_transform.cuh |
Adds cuda/__cmath/pow2.h include; replaces hardcoded false with runtime can_vectorize computed from output element size, candidate vector width clamped to 16 bytes, power-of-two checks on output and all input element sizes, and kernel_source.CanVectorize(...) call result. |
UBLKCP kernel vectorized-store implementation cub/cub/device/dispatch/kernels/kernel_transform.cuh |
Extends transform_kernel_ublkcp with StoreVecSize template parameter and can_vectorize runtime parameter. Adds fast path gated on compile-time vectorize_eligible (contiguous output, trivially relocatable, power-of-two sizes ≤16, always_true predicate) and runtime can_vectorize: performs grouped shared-memory loads, unrolled f(...) evaluation per lane, packed store_t vectorized stores, unconditional scalar tail loop, and early return. Falls through to existing pred-guarded scalar path if conditions unmet. Updates ublkcp dispatch site to instantiate with policy.async_copy.store_vec_size and pass can_vectorize. |
Vectorized store Catch2 tests cub/test/catch2_test_device_transform_vectorized.cu |
Test harness with cast_to functor providing __host__ __device__ static cast for reference generation. Defines GENERIC_COUNTS with non-16B-multiple sizes to exercise scalar tail. Two parameterized C2H_TEST cases cover narrowing to uint8_t and widening from uint8_t, comparing device output to host std::transform reference. Custom ublkcp_store_vec_size_2_selector tuning policy with store_vec_size = 2 enables fallback test verifying correctness with tuned execution environment. |
Comment @coderabbitai help to get the list of available commands.
There was a problem hiding this comment.
Actionable comments posted: 4
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0a2cef84-bae0-4e12-9f34-a71fdf6138d3
📒 Files selected for processing (4)
cub/cub/device/dispatch/dispatch_transform.cuhcub/cub/device/dispatch/kernels/kernel_transform.cuhcub/cub/device/dispatch/tuning/tuning_transform.cuhcub/test/catch2_test_device_transform_vectorized.cu
…vec regression test
|
/ok to test aff5ab5 |
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eae71fb4-8cb5-4722-8188-2a3cf5a8adc1
📒 Files selected for processing (2)
cub/cub/device/dispatch/kernels/kernel_transform.cuhcub/test/catch2_test_device_transform_vectorized.cu
🚧 Files skipped from review as they are similar to previous changes (1)
- cub/cub/device/dispatch/kernels/kernel_transform.cuh
|
/ok to test 71cefee |
| if (can_vectorize) | ||
| { | ||
| // store_vec (S) output elements per STG.128/64/.../8, defaulting to vec_size (= 16 / sizeof(output), today's | ||
| // 16-byte store). Shrinking S narrows the store but also reduces the number of fully-unrolled lambda calls per | ||
| // store, which bounds register pressure for heavy functors (whose stores aren't the bottleneck anyway). res[] is | ||
| // indexed only by the fully-unrolled k, i.e. compile-time, so it stays in registers and never spills to local | ||
| // memory regardless of S. | ||
| using store_t = decltype(load_store_type<store_vec * out_size>()); | ||
| auto* out_vec = reinterpret_cast<store_t*>(out); | ||
| const int num_groups = valid_items / store_vec; | ||
| for (int g = threadIdx.x; g < num_groups; g += threads_per_block) | ||
| { | ||
| char* smem = smem_base; | ||
| auto load_chunk = [&](auto aligned_ptr) { | ||
| using T = typename decltype(aligned_ptr)::value_type; | ||
| // on blackwell, head_padding should always be zero | ||
| // on hopper, bulk_copy_alignment is 128 bytes, head_padding could be 112 bytes for example | ||
| // alignof(T) will always be powers of 2 per C++ standard | ||
| const T* base = reinterpret_cast<const T*>(smem + aligned_ptr.head_padding); | ||
| smem += tile_padding + int{sizeof(T)} * tile_size; | ||
| // Gather this input's vec_size elements for output-vector v into a register array. we take the maximal | ||
| // alignment out of alignof(T) and 16 bytes. If input is narrower, we will waste a few (0-16) registers | ||
| constexpr ::cuda::std::size_t chunk_align = (::cuda::std::max) (alignof(T), alignof(int4)); | ||
| ::cuda::__uninitialized_array<T, store_vec, chunk_align> elems; | ||
| constexpr int chunk_bytes = int{sizeof(T)} * store_vec; | ||
| // if same width or narrowing (e.g. int32 -> int8), we split it up into multiple 16 byte reads | ||
| // CAREFUL: the byte width sizeof(T) * vec_size can exceed 16 when the input is wider than the output. | ||
| // However, since input both input type size and output size is pow2, when the input is wider, it has to be | ||
| // pow2 times wider. Therefore, chunk_bytes = input size * vec_size is always divisible by 16 | ||
| // (recall 16 = output size * vec_size) , i.e. we can read it as multiple int4 loads | ||
| if constexpr (chunk_bytes % int{sizeof(int4)} == 0) | ||
| { | ||
| constexpr int n = chunk_bytes / int{sizeof(int4)}; | ||
| const int4* s = reinterpret_cast<const int4*>(base) + g * n; | ||
| _CCCL_PRAGMA_UNROLL_FULL() | ||
| for (int i = 0; i < n; ++i) | ||
| { | ||
| reinterpret_cast<int4*>(elems.data())[i] = s[i]; | ||
| } | ||
| } | ||
| // if widening (e.g. int8 -> int32), just load it in one go. recall chunk_bytes = input size * vec_size, and | ||
| // vec_size = 16 / output size. Since output size is pow2, vec_size is pow2. Hence chunk_bytes is always pow2. | ||
| // this ensures load_store_type<chunk_bytes> will never fail. | ||
| else | ||
| { | ||
| using sub_t = decltype(load_store_type<chunk_bytes>()); | ||
| *reinterpret_cast<sub_t*>(elems.data()) = reinterpret_cast<const sub_t*>(base)[g]; | ||
| } | ||
| return elems; | ||
| }; | ||
| auto chunks = ::cuda::std::tuple{load_chunk(aligned_ptrs)...}; | ||
|
|
||
| // must fully unroll to take full advantage of ILP. otherwise perf regress by half | ||
| ::cuda::__uninitialized_array<output_t, store_vec, sizeof(output_t) * store_vec> res; | ||
| _CCCL_PRAGMA_UNROLL_FULL() | ||
| for (int k = 0; k < store_vec; ++k) | ||
| { | ||
| res[k] = ::cuda::std::apply( | ||
| [&](auto&... c) { | ||
| return f(c[k]...); | ||
| }, | ||
| chunks); | ||
| } | ||
| out_vec[g] = *reinterpret_cast<const store_t*>(res.data()); | ||
| } | ||
|
|
||
| // scalar tail: the up to (store_vec - 1) trailing elements not covered by a whole store group. can_vectorize | ||
| // implies an always_true predicate, so we store unconditionally. | ||
| for (int idx = num_groups * store_vec + threadIdx.x; idx < valid_items; idx += threads_per_block) | ||
| { | ||
| char* smem = smem_base; | ||
| auto fetch_operand = [&](auto aligned_ptr) { | ||
| using T = typename decltype(aligned_ptr)::value_type; | ||
| const int head_padding = alignof(T) < bulk_copy_alignment ? aligned_ptr.head_padding : 0; | ||
| const char* src = smem + head_padding; | ||
| smem += tile_padding + int{sizeof(T)} * tile_size; | ||
| return reinterpret_cast<const T*>(src)[idx]; | ||
| }; | ||
| out[idx] = ::cuda::std::apply( | ||
| [&](auto... values) { | ||
| return f(values...); | ||
| }, | ||
| ::cuda::std::tuple<InTs...>{fetch_operand(aligned_ptrs)...}); | ||
| } | ||
| return; | ||
| } |
There was a problem hiding this comment.
Nitpick: This is a nontrivial amount of code, I believe we should extract it into a function.
There was a problem hiding this comment.
not sure about this one: this code does not seem to be that reusable across different kernels
| // Runtime check whether this launch can take the ublkcp kernel's vectorized (STG.128) store path. The output value | ||
| // type must pack into a 16-byte vector and all pointers must be suitably aligned. The kernel additionally gates on | ||
| // compile-time eligibility (contiguous, trivially relocatable, power-of-two element sizes, no predicate). | ||
| bool can_vectorize = false; | ||
| if constexpr (THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorOut>) | ||
| { | ||
| using output_t = it_value_t<RandomAccessIteratorOut>; | ||
| constexpr int out_size = int{size_of<output_t>}; | ||
| constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1; | ||
| if constexpr (vec_size > 1 && ::cuda::is_power_of_two(out_size) | ||
| && (... && ::cuda::is_power_of_two(int{sizeof(it_value_t<RandomAccessIteratorsIn>)}))) | ||
| { | ||
| can_vectorize = kernel_source.CanVectorize(vec_size, out, ::cuda::std::get<Is>(in)...); | ||
| } | ||
| } |
There was a problem hiding this comment.
Question: Why is this a runtime value? I dont see anything that should not be a compile time value
There was a problem hiding this comment.
out is the runtime output pointer and we are checking it's runtime alignment inside CanVectorize.
This comment has been minimized.
This comment has been minimized.
| int unroll_factor = 1; //!< The unroll factor for the transformation loop in the kernel. The value 0 retains the | ||
| //!< compiler's default unrolling (specifying no unroll pragma), 1 means no unrolling. | ||
| // Vectorized store width for the ublkcp kernel. 0 means "auto": store_vec = 16 / sizeof(output) (a 16-byte STG.128). | ||
| // Setting it smaller narrows the store but also reduces the number of fully-unrolled lambda calls per store, which | ||
| // bounds register pressure for heavy functors (their stores aren't the bottleneck anyway). | ||
| int store_vec = 0; //!< Output elements per vectorized store (S). 0 = auto (16 / sizeof(output)). |
There was a problem hiding this comment.
Important question: Why do we need a new tuning parameter, and cannot use unroll_factor?
There was a problem hiding this comment.
I agree with the analysis that in the vectorized path, unroll_factor happens to be the same value as store_vec_size (I'm taking @miscco 's suggestion on renaming store_vec).
I think they should still be different parameters in the policy, because
unroll_factoralready has its own meaning in the regular scalar store. And even just semantically, unrolling something does not mean we are going to store it vectorized. Also, it is set to 1 by default, which is the best default documented for the scalar path. For the vectorized path, the best default would not be 1.store_vec_sizehas its own semantic meaning in the sense that:
a. It means the width at which we store, and in that path we fully unroll the loop as a consequence of a certainstore_vec_size.
b. now we can express auto-vectorization, vectorization to a custom number, or disable vectorization all at once.
| // Runtime check whether this launch can take the ublkcp kernel's vectorized (STG.128) store path. The output value | ||
| // type must pack into a 16-byte vector and all pointers must be suitably aligned. The kernel additionally gates on | ||
| // compile-time eligibility (contiguous, trivially relocatable, power-of-two element sizes, no predicate). | ||
| bool can_vectorize = false; | ||
| if constexpr (THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorOut>) | ||
| { | ||
| using output_t = it_value_t<RandomAccessIteratorOut>; | ||
| constexpr int out_size = int{size_of<output_t>}; | ||
| constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1; | ||
| if constexpr (vec_size > 1 && ::cuda::is_power_of_two(out_size) | ||
| && (... && ::cuda::is_power_of_two(int{sizeof(it_value_t<RandomAccessIteratorsIn>)}))) | ||
| { | ||
| can_vectorize = kernel_source.CanVectorize(vec_size, out, ::cuda::std::get<Is>(in)...); | ||
| } | ||
| } |
There was a problem hiding this comment.
out is the runtime output pointer and we are checking it's runtime alignment inside CanVectorize.
|
|
||
| using output_t = it_value_t<RandomAccessIteratorOut>; | ||
| constexpr int out_size = int{size_of<output_t>}; | ||
| constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1; |
There was a problem hiding this comment.
Q: Why disallow 32 byte wide stores?
There was a problem hiding this comment.
This is because:
- benchmarks on B200s showed no improvement across all cases with 256 bit store
- adding this would require
a. having two code paths for sm90 and sm100 since sm90 doesnt have it;
b. nvcc will only generateSTG.E.ENL2.256when it can prove the pointer is in global memory, so we need to provide extra hints to nvcc, which I feel is unnecessary for zero performance gain
| }; | ||
| auto chunks = ::cuda::std::tuple{load_chunk(aligned_ptrs)...}; | ||
|
|
||
| // must fully unroll to take full advantage of ILP. otherwise perf regress by half |
There was a problem hiding this comment.
This comment is a strong indicator that the vectorization size equals the unroll factor in the tuning policy.
There was a problem hiding this comment.
see my reply on unroll factor
| // scalar tail: the up to (store_vec - 1) trailing elements not covered by a whole store group. can_vectorize | ||
| // implies an always_true predicate, so we store unconditionally. | ||
| for (int idx = num_groups * store_vec + threadIdx.x; idx < valid_items; idx += threads_per_block) |
There was a problem hiding this comment.
Q: Have you tried scaling the items per thread in the host-side dispatch as multiples of the vector store width, so we never need a tail logic?
There was a problem hiding this comment.
Could you elaborate what you meant by scaling?
The main motive for adding a tail logic was so that, if the output pointer is aligned to 16 bytes, but the amount of data we store are not aligned to 16 bytes (i.e. 10,000,001 int8 elements), we can still get the performance benefit of doing the first 10,000,000 vectorized stores.
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 54b773bf-2cab-483b-bb8d-289f48fab0d4
📒 Files selected for processing (3)
cub/cub/device/dispatch/kernels/kernel_transform.cuhcub/cub/device/dispatch/tuning/tuning_transform.cuhcub/test/catch2_test_device_transform_vectorized.cu
🚧 Files skipped from review as they are similar to previous changes (1)
- cub/cub/device/dispatch/kernels/kernel_transform.cuh
|
/ok to test 54db169 |
This comment has been minimized.
This comment has been minimized.
🥳 CI Workflow Results🟩 Finished in 5h 00m: Pass: 100%/287 | Total: 11d 19h | Max: 2h 27m | Hits: 19%/952946See results here. |
As per #9210, we noticed that the
ublkcpkernel forDeviceTransformhas unvectorized stores, leaving some performance on the table. This PR vectorizes the output store (STG.128) whenever it's safe. There is no change to existing call sites.The store width is controlled by a tunable policy knob
store_vec(output elements per store):0= auto (a 16-byte STG.128, the default),1= disable vectorization (the branch is compiled out, useful for register-heavy functors), andN= useNelements, silently capped to 16 bytes.Benchmarks (B200)
Ref = stock CUB, Cmp = this PR. FAST = this PR faster.
For heavy and complex_cmp, we use the opt-out path (
store_vec=1), since in those cases we should disable the vectorized store. We also show the regressions at the end if vectorization is enabled for those cases.babelstream: before vs this PR (auto)
pytorch: before vs this PR (auto)
fill: before vs this PR (auto)
fib: before vs this PR (auto)
grayscale: before vs this PR (auto)
heavy: before vs this PR (
store_vec=1)complex_cmp — before vs this PR (store_vec=1 opt-out)