Skip to content

Vectorize output store in ublkcp DeviceTransform kernel#9481

Open
nanan-nvidia wants to merge 14 commits into
NVIDIA:mainfrom
nanan-nvidia:device_transform_align_16
Open

Vectorize output store in ublkcp DeviceTransform kernel#9481
nanan-nvidia wants to merge 14 commits into
NVIDIA:mainfrom
nanan-nvidia:device_transform_align_16

Conversation

@nanan-nvidia

@nanan-nvidia nanan-nvidia commented Jun 16, 2026

Copy link
Copy Markdown

As per #9210, we noticed that the ublkcp kernel for DeviceTransform has unvectorized stores, leaving some performance on the table. This PR vectorizes the output store (STG.128) whenever it's safe. There is no change to existing call sites.

The store width is controlled by a tunable policy knob store_vec (output elements per store): 0 = auto (a 16-byte STG.128, the default), 1 = disable vectorization (the branch is compiled out, useful for register-heavy functors), and N = use N elements, silently capped to 16 bytes.

Benchmarks (B200)

Ref = stock CUB, Cmp = this PR. FAST = this PR faster.
For heavy and complex_cmp, we use the opt-out path (store_vec=1), since in those cases we should disable the vectorized store. We also show the regressions at the end if vectorization is enabled for those cases.

babelstream: before vs this PR (auto)

# mul

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      2^16      |   6.306 us |       5.91% |   6.356 us |       7.02% |    0.049 us |   0.78% |  🔵 SAME  |
|   I8    |      2^20      |   7.736 us |       7.16% |   7.619 us |       7.42% |   -0.117 us |  -1.51% |  🔵 SAME  |
|   I8    |      2^24      |  14.505 us |       2.78% |  12.218 us |       3.87% |   -2.287 us | -15.76% |  🟢 FAST  |
|   I8    |      2^28      | 118.335 us |       0.52% |  84.602 us |       0.94% |  -33.732 us | -28.51% |  🟢 FAST  |
|   I8    |      2^32      |   1.751 ms |       0.02% |   1.234 ms |       0.14% | -517.062 us | -29.53% |  🟢 FAST  |
|   I16   |      2^16      |   6.501 us |       7.46% |   6.713 us |       7.82% |    0.212 us |   3.26% |  🔵 SAME  |
|   I16   |      2^20      |   8.039 us |       5.20% |   7.900 us |       6.25% |   -0.140 us |  -1.74% |  🔵 SAME  |
|   I16   |      2^24      |  18.302 us |       2.39% |  17.484 us |       5.54% |   -0.818 us |  -4.47% |  🟢 FAST  |
|   I16   |      2^28      | 163.566 us |       0.39% | 161.485 us |       0.64% |   -2.081 us |  -1.27% |  🟢 FAST  |
|   I16   |      2^32      |   2.498 ms |       0.50% |   2.456 ms |       0.10% |  -42.237 us |  -1.69% |  🟢 FAST  |
|   F32   |      2^16      |   6.293 us |       6.31% |   6.564 us |       9.42% |    0.270 us |   4.30% |  🔵 SAME  |
|   F32   |      2^20      |   8.285 us |       3.87% |   8.420 us |       5.23% |    0.135 us |   1.63% |  🔵 SAME  |
|   F32   |      2^24      |  26.804 us |       1.79% |  26.927 us |       2.06% |    0.123 us |   0.46% |  🔵 SAME  |
|   F32   |      2^28      | 314.114 us |       0.34% | 314.360 us |       0.34% |    0.246 us |   0.08% |  🔵 SAME  |
|   F32   |      2^32      |   4.916 ms |       0.06% |   4.920 ms |       0.06% |    3.894 us |   0.08% |  🔴 SLOW  |
|   F64   |      2^16      |   6.451 us |       9.96% |   6.671 us |      11.92% |    0.220 us |   3.41% |  🔵 SAME  |
|   F64   |      2^20      |  10.087 us |       4.52% |  10.080 us |       4.52% |   -0.007 us |  -0.07% |  🔵 SAME  |
|   F64   |      2^24      |  46.580 us |       2.12% |  46.577 us |       2.08% |   -0.003 us |  -0.01% |  🔵 SAME  |
|   F64   |      2^28      | 621.230 us |       0.23% | 621.556 us |       0.23% |    0.326 us |   0.05% |  🔵 SAME  |
|   F64   |      2^32      |   9.829 ms |       0.04% |   9.834 ms |       0.03% |    5.393 us |   0.05% |  🔴 SLOW  |
|  I128   |      2^16      |   6.833 us |      13.34% |   6.970 us |      12.07% |    0.137 us |   2.01% |  🔵 SAME  |
|  I128   |      2^20      |  12.313 us |       2.62% |  12.414 us |       5.36% |    0.101 us |   0.82% |  🔵 SAME  |
|  I128   |      2^24      |  84.485 us |       1.00% |  84.444 us |       0.98% |   -0.041 us |  -0.05% |  🔵 SAME  |
|  I128   |      2^28      |   1.236 ms |       0.15% |   1.236 ms |       0.14% |   -0.017 us |  -0.00% |  🔵 SAME  |
|  I128   |      2^32      |  19.671 ms |       0.05% |  19.672 ms |       0.05% |    0.724 us |   0.00% |  🔵 SAME  |

# add

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      2^16      |   6.357 us |       7.93% |   6.587 us |       9.30% |    0.230 us |   3.61% |  🔵 SAME  |
|   I8    |      2^20      |   8.027 us |       5.08% |   8.067 us |       4.31% |    0.039 us |   0.49% |  🔵 SAME  |
|   I8    |      2^24      |  16.526 us |       2.13% |  15.160 us |       6.30% |   -1.366 us |  -8.27% |  🟢 FAST  |
|   I8    |      2^28      | 147.173 us |       0.43% | 122.853 us |       0.30% |  -24.320 us | -16.52% |  🟢 FAST  |
|   I8    |      2^32      |   2.200 ms |       0.28% |   1.784 ms |       0.09% | -415.559 us | -18.89% |  🟢 FAST  |
|   I16   |      2^16      |   6.288 us |       5.92% |   6.372 us |       8.32% |    0.085 us |   1.35% |  🔵 SAME  |
|   I16   |      2^20      |   8.317 us |       4.00% |   8.404 us |       5.50% |    0.088 us |   1.05% |  🔵 SAME  |
|   I16   |      2^24      |  23.393 us |       4.15% |  22.659 us |       1.86% |   -0.734 us |  -3.14% |  🟢 FAST  |
|   I16   |      2^28      | 241.646 us |       0.15% | 235.356 us |       0.21% |   -6.289 us |  -2.60% |  🟢 FAST  |
|   I16   |      2^32      |   3.703 ms |       0.02% |   3.594 ms |       0.02% | -108.967 us |  -2.94% |  🟢 FAST  |
|   F32   |      2^16      |   6.329 us |       6.99% |   6.362 us |       7.96% |    0.033 us |   0.53% |  🔵 SAME  |
|   F32   |      2^20      |   9.412 us |      10.12% |   9.383 us |      10.26% |   -0.029 us |  -0.31% |  🔵 SAME  |
|   F32   |      2^24      |  37.355 us |       2.17% |  37.071 us |       1.41% |   -0.284 us |  -0.76% |  🔵 SAME  |
|   F32   |      2^28      | 455.122 us |       0.17% | 454.877 us |       0.22% |   -0.244 us |  -0.05% |  🔵 SAME  |
|   F32   |      2^32      |   7.107 ms |       0.01% |   7.200 ms |       0.05% |   93.829 us |   1.32% |  🔴 SLOW  |
|   F64   |      2^16      |   6.794 us |      13.02% |   7.004 us |      13.44% |    0.210 us |   3.09% |  🔵 SAME  |
|   F64   |      2^20      |  10.454 us |       5.53% |  10.468 us |       5.71% |    0.014 us |   0.13% |  🔵 SAME  |
|   F64   |      2^24      |  65.620 us |       0.53% |  65.603 us |       0.56% |   -0.016 us |  -0.02% |  🔵 SAME  |
|   F64   |      2^28      | 902.886 us |       0.19% | 904.178 us |       0.18% |    1.292 us |   0.14% |  🔵 SAME  |
|   F64   |      2^32      |  14.389 ms |       0.05% |  14.400 ms |       0.04% |   11.482 us |   0.08% |  🔴 SLOW  |
|  I128   |      2^16      |   7.998 us |       6.13% |   8.030 us |       5.71% |    0.031 us |   0.39% |  🔵 SAME  |
|  I128   |      2^20      |  16.283 us |       2.26% |  16.307 us |       2.08% |    0.024 us |   0.15% |  🔵 SAME  |
|  I128   |      2^24      | 121.849 us |       0.80% | 121.846 us |       0.82% |   -0.003 us |  -0.00% |  🔵 SAME  |
|  I128   |      2^28      |   1.805 ms |       0.11% |   1.805 ms |       0.10% |   -0.012 us |  -0.00% |  🔵 SAME  |

# triad

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      2^16      |   6.291 us |       6.57% |   6.331 us |       7.56% |    0.041 us |   0.65% |  🔵 SAME  |
|   I8    |      2^20      |   8.070 us |       4.47% |   8.086 us |       4.09% |    0.017 us |   0.21% |  🔵 SAME  |
|   I8    |      2^24      |  17.959 us |       4.40% |  14.726 us |       5.12% |   -3.233 us | -18.00% |  🟢 FAST  |
|   I8    |      2^28      | 155.595 us |       0.22% | 122.837 us |       0.26% |  -32.758 us | -21.05% |  🟢 FAST  |
|   I8    |      2^32      |   2.348 ms |       0.03% |   1.785 ms |       0.12% | -563.678 us | -24.00% |  🟢 FAST  |
|   I16   |      2^16      |   6.277 us |       6.26% |   6.445 us |       9.86% |    0.169 us |   2.69% |  🔵 SAME  |
|   I16   |      2^20      |   8.279 us |       4.00% |   8.332 us |       6.04% |    0.053 us |   0.64% |  🔵 SAME  |
|   I16   |      2^24      |  24.565 us |       1.11% |  23.114 us |       3.50% |   -1.451 us |  -5.91% |  🟢 FAST  |
|   I16   |      2^28      | 243.850 us |       0.16% | 237.883 us |       0.28% |   -5.967 us |  -2.45% |  🟢 FAST  |
|   I16   |      2^32      |   3.754 ms |       0.02% |   3.660 ms |       0.02% |  -93.749 us |  -2.50% |  🟢 FAST  |
|   F32   |      2^16      |   6.436 us |       9.39% |   6.536 us |      11.17% |    0.100 us |   1.55% |  🔵 SAME  |
|   F32   |      2^20      |   9.200 us |      10.53% |   8.637 us |       9.30% |   -0.563 us |  -6.12% |  🔵 SAME  |
|   F32   |      2^24      |  37.625 us |       2.49% |  37.313 us |       2.10% |   -0.311 us |  -0.83% |  🔵 SAME  |
|   F32   |      2^28      | 454.663 us |       0.08% | 453.885 us |       0.28% |   -0.778 us |  -0.17% |  🟢 FAST  |
|   F32   |      2^32      |   7.160 ms |       0.48% |   7.208 ms |       0.05% |   48.151 us |   0.67% |  🔴 SLOW  |
|   F64   |      2^16      |   6.693 us |      12.46% |   6.782 us |      13.09% |    0.090 us |   1.34% |  🔵 SAME  |
|   F64   |      2^20      |  11.402 us |       8.54% |  11.669 us |       7.66% |    0.267 us |   2.34% |  🔵 SAME  |
|   F64   |      2^24      |  67.454 us |       1.31% |  67.513 us |       1.27% |    0.059 us |   0.09% |  🔵 SAME  |
|   F64   |      2^28      | 898.814 us |       0.29% | 902.516 us |       0.31% |    3.702 us |   0.41% |  🔴 SLOW  |
|   F64   |      2^32      |  14.389 ms |       0.04% |  14.401 ms |       0.04% |   11.958 us |   0.08% |  🔴 SLOW  |
|  I128   |      2^16      |   7.654 us |      10.63% |   7.966 us |       5.63% |    0.312 us |   4.07% |  🔵 SAME  |
|  I128   |      2^20      |  15.802 us |       5.44% |  15.801 us |       3.72% |   -0.002 us |  -0.01% |  🔵 SAME  |
|  I128   |      2^24      | 121.290 us |       0.64% | 121.232 us |       0.63% |   -0.058 us |  -0.05% |  🔵 SAME  |
|  I128   |      2^28      |   1.805 ms |       0.14% |   1.805 ms |       0.14% |    0.223 us |   0.01% |  🔵 SAME  |

# nstream

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      2^16      |   6.323 us |       7.57% |   6.547 us |      10.52% |    0.224 us |   3.55% |  🔵 SAME  |
|   I8    |      2^20      |   8.196 us |       4.49% |   8.002 us |       5.83% |   -0.194 us |  -2.37% |  🔵 SAME  |
|   I8    |      2^24      |  20.610 us |       1.59% |  18.502 us |       2.12% |   -2.108 us | -10.23% |  🟢 FAST  |
|   I8    |      2^28      | 211.379 us |       0.37% | 164.128 us |       0.38% |  -47.251 us | -22.35% |  🟢 FAST  |
|   I8    |      2^32      |   3.256 ms |       0.03% |   2.455 ms |       0.03% | -800.878 us | -24.60% |  🟢 FAST  |
|   I16   |      2^16      |   6.294 us |       6.67% |   7.343 us |      12.85% |    1.048 us |  16.66% |  🔴 SLOW  |
|   I16   |      2^20      |   8.312 us |       4.33% |   8.314 us |       4.17% |    0.002 us |   0.02% |  🔵 SAME  |
|   I16   |      2^24      |  28.795 us |       1.16% |  28.779 us |       1.26% |   -0.016 us |  -0.05% |  🔵 SAME  |
|   I16   |      2^28      | 322.669 us |       0.31% | 313.207 us |       0.15% |   -9.462 us |  -2.93% |  🟢 FAST  |
|   I16   |      2^32      |   5.031 ms |       0.04% |   4.830 ms |       0.02% | -200.962 us |  -3.99% |  🟢 FAST  |
|   F32   |      2^16      |   6.688 us |      12.29% |   6.913 us |      13.26% |    0.225 us |   3.36% |  🔵 SAME  |
|   F32   |      2^20      |  10.239 us |       3.25% |  10.240 us |       3.18% |    0.000 us |   0.00% |  🔵 SAME  |
|   F32   |      2^24      |  48.060 us |       2.02% |  47.642 us |       1.75% |   -0.418 us |  -0.87% |  🔵 SAME  |
|   F32   |      2^28      | 602.729 us |       0.14% | 600.761 us |       0.15% |   -1.969 us |  -0.33% |  🟢 FAST  |
|   F32   |      2^32      |   9.469 ms |       0.01% |   9.433 ms |       0.01% |  -35.498 us |  -0.37% |  🟢 FAST  |
|   F64   |      2^16      |   7.565 us |      11.63% |   7.762 us |       9.82% |    0.198 us |   2.62% |  🔵 SAME  |
|   F64   |      2^20      |  12.406 us |       2.64% |  12.427 us |       2.80% |    0.021 us |   0.17% |  🔵 SAME  |
|   F64   |      2^24      |  84.059 us |       0.41% |  84.016 us |       0.39% |   -0.043 us |  -0.05% |  🔵 SAME  |
|   F64   |      2^28      |   1.185 ms |       0.08% |   1.185 ms |       0.08% |    0.027 us |   0.00% |  🔵 SAME  |
|   F64   |      2^32      |  18.767 ms |       0.01% |  18.767 ms |       0.01% |    0.467 us |   0.00% |  🔵 SAME  |
|  I128   |      2^16      |   8.036 us |       4.86% |   8.078 us |       4.20% |    0.041 us |   0.51% |  🔵 SAME  |
|  I128   |      2^20      |  18.347 us |       2.18% |  18.356 us |       2.06% |    0.009 us |   0.05% |  🔵 SAME  |
|  I128   |      2^24      | 158.384 us |       0.57% | 158.321 us |       0.56% |   -0.063 us |  -0.04% |  🔵 SAME  |
|  I128   |      2^28      |   2.357 ms |       0.05% |   2.357 ms |       0.04% |    0.058 us |   0.00% |  🔵 SAME  |

# Summary

- Total Matches: 97
  - Pass    (diff <= min_noise): 63
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  34

pytorch: before vs this PR (auto)

# relu

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   F16   |      2^16      |   6.251 us |       5.07% |   6.985 us |       5.73% |  0.734 us |  11.74% |  🔴 SLOW  |
|   F16   |      2^20      |   7.846 us |       7.12% |   7.254 us |       6.07% | -0.592 us |  -7.54% |  🟢 FAST  |
|   F16   |      2^24      |  18.462 us |       1.85% |  17.246 us |       3.81% | -1.216 us |  -6.59% |  🟢 FAST  |
|   F16   |      2^28      | 169.508 us |       0.98% | 161.801 us |       0.59% | -7.706 us |  -4.55% |  🟢 FAST  |
|  BF16   |      2^16      |   6.281 us |       6.12% |   7.112 us |       3.32% |  0.831 us |  13.24% |  🔴 SLOW  |
|  BF16   |      2^20      |   7.828 us |       7.24% |   7.389 us |       7.35% | -0.439 us |  -5.61% |  🔵 SAME  |
|  BF16   |      2^24      |  18.473 us |       1.89% |  17.332 us |       3.35% | -1.141 us |  -6.18% |  🟢 FAST  |
|  BF16   |      2^28      | 168.713 us |       0.60% | 161.744 us |       0.63% | -6.969 us |  -4.13% |  🟢 FAST  |
|   F32   |      2^16      |   6.293 us |       6.63% |   6.857 us |       7.49% |  0.563 us |   8.95% |  🔴 SLOW  |
|   F32   |      2^20      |   8.331 us |       4.52% |   9.176 us |       2.51% |  0.845 us |  10.14% |  🔴 SLOW  |
|   F32   |      2^24      |  26.904 us |       2.30% |  27.517 us |       1.83% |  0.614 us |   2.28% |  🔴 SLOW  |
|   F32   |      2^28      | 316.146 us |       0.37% | 316.735 us |       0.34% |  0.589 us |   0.19% |  🔵 SAME  |

# sigmoid

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.319 us |       6.94% |   7.246 us |       4.69% |   0.928 us |  14.68% |  🔴 SLOW  |
|   F16   |      2^20      |   8.261 us |       5.25% |   8.801 us |       5.75% |   0.540 us |   6.54% |  🔴 SLOW  |
|   F16   |      2^24      |  26.984 us |       2.58% |  23.545 us |       0.27% |  -3.439 us | -12.74% |  🟢 FAST  |
|   F16   |      2^28      | 307.080 us |       0.13% | 248.142 us |       0.24% | -58.938 us | -19.19% |  🟢 FAST  |
|  BF16   |      2^16      |   6.307 us |       6.39% |   7.290 us |       5.42% |   0.983 us |  15.58% |  🔴 SLOW  |
|  BF16   |      2^20      |   8.264 us |       4.77% |   8.904 us |       5.29% |   0.640 us |   7.75% |  🔴 SLOW  |
|  BF16   |      2^24      |  26.896 us |       2.23% |  23.058 us |       3.48% |  -3.838 us | -14.27% |  🟢 FAST  |
|  BF16   |      2^28      | 307.085 us |       0.13% | 247.726 us |       0.15% | -59.359 us | -19.33% |  🟢 FAST  |
|   F32   |      2^16      |   6.388 us |       8.56% |   6.953 us |      13.46% |   0.565 us |   8.85% |  🔴 SLOW  |
|   F32   |      2^20      |   8.336 us |       4.59% |   8.292 us |       4.03% |  -0.044 us |  -0.52% |  🔵 SAME  |
|   F32   |      2^24      |  30.801 us |       1.13% |  27.370 us |       3.24% |  -3.430 us | -11.14% |  🟢 FAST  |
|   F32   |      2^28      | 375.154 us |       1.85% | 327.321 us |       2.59% | -47.833 us | -12.75% |  🟢 FAST  |

# tanh

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.397 us |       7.22% |   6.542 us |       9.56% |   0.145 us |   2.27% |  🔵 SAME  |
|   F16   |      2^20      |   8.115 us |       4.41% |   7.865 us |       7.06% |  -0.250 us |  -3.08% |  🔵 SAME  |
|   F16   |      2^24      |  24.654 us |       1.09% |  21.242 us |       2.30% |  -3.412 us | -13.84% |  🟢 FAST  |
|   F16   |      2^28      | 270.441 us |       0.13% | 215.834 us |       1.03% | -54.607 us | -20.19% |  🟢 FAST  |
|  BF16   |      2^16      |   6.313 us |       7.05% |   6.787 us |       9.01% |   0.475 us |   7.52% |  🔴 SLOW  |
|  BF16   |      2^20      |   8.108 us |       4.58% |   8.086 us |       4.42% |  -0.022 us |  -0.27% |  🔵 SAME  |
|  BF16   |      2^24      |  24.639 us |       0.98% |  21.287 us |       3.40% |  -3.352 us | -13.60% |  🟢 FAST  |
|  BF16   |      2^28      | 270.455 us |       0.13% | 219.780 us |       0.24% | -50.675 us | -18.74% |  🟢 FAST  |
|   F32   |      2^16      |   6.330 us |       7.55% |   6.814 us |       8.55% |   0.485 us |   7.66% |  🔴 SLOW  |
|   F32   |      2^20      |   8.297 us |       4.13% |   8.372 us |       5.03% |   0.074 us |   0.90% |  🔵 SAME  |
|   F32   |      2^24      |  29.270 us |       3.04% |  26.774 us |       1.69% |  -2.496 us |  -8.53% |  🟢 FAST  |
|   F32   |      2^28      | 365.789 us |       2.01% | 313.951 us |       0.50% | -51.838 us | -14.17% |  🟢 FAST  |

# gelu

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.423 us |       7.86% |   7.042 us |      10.81% |   0.619 us |   9.64% |  🔴 SLOW  |
|   F16   |      2^20      |   8.322 us |       4.34% |   8.765 us |       5.82% |   0.443 us |   5.33% |  🔴 SLOW  |
|   F16   |      2^24      |  31.121 us |       2.36% |  27.478 us |       1.42% |  -3.643 us | -11.71% |  🟢 FAST  |
|   F16   |      2^28      | 378.934 us |       0.10% | 307.751 us |       0.18% | -71.183 us | -18.78% |  🟢 FAST  |
|  BF16   |      2^16      |   6.375 us |       7.26% |   7.134 us |       8.80% |   0.760 us |  11.91% |  🔴 SLOW  |
|  BF16   |      2^20      |   8.368 us |       4.70% |   8.501 us |       5.64% |   0.133 us |   1.59% |  🔵 SAME  |
|  BF16   |      2^24      |  31.057 us |       2.12% |  26.718 us |       1.13% |  -4.338 us | -13.97% |  🟢 FAST  |
|  BF16   |      2^28      | 378.825 us |       0.10% | 307.487 us |       0.22% | -71.338 us | -18.83% |  🟢 FAST  |
|   F32   |      2^16      |   6.376 us |       7.91% |   6.403 us |       9.32% |   0.027 us |   0.43% |  🔵 SAME  |
|   F32   |      2^20      |   8.341 us |       4.67% |   8.271 us |       3.80% |  -0.070 us |  -0.83% |  🔵 SAME  |
|   F32   |      2^24      |  34.875 us |       1.12% |  26.810 us |       1.98% |  -8.065 us | -23.13% |  🟢 FAST  |
|   F32   |      2^28      | 427.480 us |       2.32% | 343.895 us |       4.74% | -83.585 us | -19.55% |  🟢 FAST  |

# sin

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.279 us |       5.78% |   6.655 us |      12.25% |   0.376 us |   5.99% |  🔴 SLOW  |
|   F16   |      2^20      |   8.320 us |       4.40% |   8.874 us |       5.45% |   0.554 us |   6.65% |  🔴 SLOW  |
|   F16   |      2^24      |  30.057 us |       2.97% |  27.866 us |       1.64% |  -2.191 us |  -7.29% |  🟢 FAST  |
|   F16   |      2^28      | 352.673 us |       0.21% | 328.425 us |       0.17% | -24.248 us |  -6.88% |  🟢 FAST  |
|  BF16   |      2^16      |   6.309 us |       6.91% |   7.606 us |      11.06% |   1.298 us |  20.57% |  🔴 SLOW  |
|  BF16   |      2^20      |   8.312 us |       4.49% |   8.280 us |       3.68% |  -0.032 us |  -0.39% |  🔵 SAME  |
|  BF16   |      2^24      |  28.788 us |       1.10% |  24.539 us |       1.31% |  -4.249 us | -14.76% |  🟢 FAST  |
|  BF16   |      2^28      | 336.019 us |       0.11% | 265.292 us |       0.37% | -70.727 us | -21.05% |  🟢 FAST  |
|   F32   |      2^16      |   6.332 us |       7.35% |   6.899 us |      10.58% |   0.567 us |   8.96% |  🔴 SLOW  |
|   F32   |      2^20      |   8.388 us |       5.08% |   8.357 us |       4.85% |  -0.032 us |  -0.38% |  🔵 SAME  |
|   F32   |      2^24      |  32.723 us |       1.16% |  28.582 us |       1.29% |  -4.141 us | -12.65% |  🟢 FAST  |
|   F32   |      2^28      | 395.562 us |       2.06% | 350.783 us |       2.19% | -44.779 us | -11.32% |  🟢 FAST  |

# exp

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.296 us |       5.95% |   6.469 us |      10.35% |   0.173 us |   2.75% |  🔵 SAME  |
|   F16   |      2^20      |   8.100 us |       4.54% |   8.146 us |       4.09% |   0.046 us |   0.57% |  🔵 SAME  |
|   F16   |      2^24      |  22.461 us |       1.48% |  21.952 us |       4.00% |  -0.509 us |  -2.27% |  🟢 FAST  |
|   F16   |      2^28      | 233.685 us |       0.22% | 225.232 us |       0.16% |  -8.453 us |  -3.62% |  🟢 FAST  |
|  BF16   |      2^16      |   6.253 us |       5.24% |   6.424 us |       9.73% |   0.171 us |   2.74% |  🔵 SAME  |
|  BF16   |      2^20      |   8.092 us |       4.79% |   8.091 us |       4.06% |  -0.001 us |  -0.01% |  🔵 SAME  |
|  BF16   |      2^24      |  20.473 us |       1.62% |  16.707 us |       4.22% |  -3.766 us | -18.39% |  🟢 FAST  |
|  BF16   |      2^28      | 196.764 us |       0.22% | 159.394 us |       0.68% | -37.370 us | -18.99% |  🟢 FAST  |
|   F32   |      2^16      |   6.296 us |       6.52% |   6.310 us |       7.17% |   0.014 us |   0.22% |  🔵 SAME  |
|   F32   |      2^20      |   8.307 us |       4.26% |   8.273 us |       3.79% |  -0.034 us |  -0.41% |  🔵 SAME  |
|   F32   |      2^24      |  26.766 us |       1.63% |  26.821 us |       1.91% |   0.055 us |   0.21% |  🔵 SAME  |
|   F32   |      2^28      | 335.347 us |       3.03% | 316.159 us |       0.36% | -19.188 us |  -5.72% |  🟢 FAST  |

# add

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   F16   |      2^16      |   6.310 us |       6.83% |   6.469 us |      10.29% |  0.159 us |   2.53% |  🔵 SAME  |
|   F16   |      2^20      |   8.685 us |       5.96% |   8.279 us |       3.84% | -0.406 us |  -4.68% |  🟢 FAST  |
|   F16   |      2^24      |  24.154 us |       2.65% |  22.821 us |       2.48% | -1.332 us |  -5.52% |  🟢 FAST  |
|   F16   |      2^28      | 246.429 us |       1.28% | 236.778 us |       0.62% | -9.651 us |  -3.92% |  🟢 FAST  |
|  BF16   |      2^16      |   6.410 us |       7.50% |   6.452 us |       9.79% |  0.042 us |   0.66% |  🔵 SAME  |
|  BF16   |      2^20      |   8.337 us |       4.40% |   8.281 us |       3.92% | -0.056 us |  -0.67% |  🔵 SAME  |
|  BF16   |      2^24      |  24.241 us |       2.82% |  22.769 us |       2.23% | -1.472 us |  -6.07% |  🟢 FAST  |
|  BF16   |      2^28      | 246.159 us |       1.35% | 236.480 us |       0.55% | -9.679 us |  -3.93% |  🟢 FAST  |
|   F32   |      2^16      |   6.420 us |       9.17% |   6.551 us |       9.76% |  0.132 us |   2.05% |  🔵 SAME  |
|   F32   |      2^20      |   9.174 us |      10.62% |   9.231 us |       8.20% |  0.057 us |   0.62% |  🔵 SAME  |
|   F32   |      2^24      |  37.029 us |       1.22% |  37.425 us |       1.67% |  0.397 us |   1.07% |  🔵 SAME  |
|   F32   |      2^28      | 458.986 us |       0.55% | 455.817 us |       0.18% | -3.168 us |  -0.69% |  🟢 FAST  |

# sub

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.312 us |       6.75% |   6.566 us |      10.34% |   0.254 us |   4.02% |  🔵 SAME  |
|   F16   |      2^20      |   8.295 us |       4.23% |   8.229 us |       4.37% |  -0.066 us |  -0.79% |  🔵 SAME  |
|   F16   |      2^24      |  23.967 us |       3.75% |  22.714 us |       2.13% |  -1.253 us |  -5.23% |  🟢 FAST  |
|   F16   |      2^28      | 246.956 us |       1.54% | 236.744 us |       0.61% | -10.212 us |  -4.14% |  🟢 FAST  |
|  BF16   |      2^16      |   6.334 us |       7.39% |   6.567 us |      11.20% |   0.233 us |   3.68% |  🔵 SAME  |
|  BF16   |      2^20      |   8.282 us |       4.19% |   8.217 us |       4.23% |  -0.065 us |  -0.78% |  🔵 SAME  |
|  BF16   |      2^24      |  24.036 us |       3.64% |  22.787 us |       2.40% |  -1.249 us |  -5.20% |  🟢 FAST  |
|  BF16   |      2^28      | 245.817 us |       1.32% | 236.226 us |       0.48% |  -9.591 us |  -3.90% |  🟢 FAST  |
|   F32   |      2^16      |   6.419 us |       9.57% |   6.603 us |      11.21% |   0.184 us |   2.86% |  🔵 SAME  |
|   F32   |      2^20      |   9.930 us |       6.79% |   9.296 us |       3.74% |  -0.633 us |  -6.38% |  🟢 FAST  |
|   F32   |      2^24      |  37.092 us |       1.49% |  37.123 us |       1.40% |   0.031 us |   0.08% |  🔵 SAME  |
|   F32   |      2^28      | 458.741 us |       0.57% | 455.879 us |       0.17% |  -2.862 us |  -0.62% |  🟢 FAST  |

# mul

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.321 us |       7.35% |   7.034 us |       6.95% |   0.713 us |  11.27% |  🔴 SLOW  |
|   F16   |      2^20      |   8.282 us |       3.93% |   8.908 us |       7.67% |   0.626 us |   7.56% |  🔴 SLOW  |
|   F16   |      2^24      |  24.164 us |       3.33% |  23.544 us |       0.72% |  -0.620 us |  -2.57% |  🟢 FAST  |
|   F16   |      2^28      | 247.112 us |       1.56% | 236.788 us |       0.64% | -10.324 us |  -4.18% |  🟢 FAST  |
|  BF16   |      2^16      |   6.271 us |       5.57% |   7.050 us |       5.53% |   0.779 us |  12.42% |  🔴 SLOW  |
|  BF16   |      2^20      |   8.298 us |       4.16% |   8.882 us |       7.43% |   0.583 us |   7.03% |  🔴 SLOW  |
|  BF16   |      2^24      |  24.289 us |       2.90% |  23.556 us |       0.46% |  -0.732 us |  -3.01% |  🟢 FAST  |
|  BF16   |      2^28      | 245.992 us |       1.31% | 235.868 us |       0.46% | -10.123 us |  -4.12% |  🟢 FAST  |
|   F32   |      2^16      |   6.402 us |       9.00% |   6.754 us |      10.40% |   0.353 us |   5.51% |  🔵 SAME  |
|   F32   |      2^20      |   9.365 us |      10.30% |   9.280 us |       6.28% |  -0.086 us |  -0.91% |  🔵 SAME  |
|   F32   |      2^24      |  37.138 us |       1.67% |  37.690 us |       1.38% |   0.552 us |   1.49% |  🔴 SLOW  |
|   F32   |      2^28      | 459.053 us |       0.55% | 455.800 us |       0.19% |  -3.253 us |  -0.71% |  🟢 FAST  |

# div

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.361 us |       7.93% |   7.127 us |       4.21% |   0.765 us |  12.03% |  🔴 SLOW  |
|   F16   |      2^20      |   8.305 us |       4.15% |   8.437 us |       5.47% |   0.132 us |   1.59% |  🔵 SAME  |
|   F16   |      2^24      |  26.525 us |       1.37% |  24.665 us |       1.18% |  -1.860 us |  -7.01% |  🟢 FAST  |
|   F16   |      2^28      | 293.718 us |       1.93% | 270.563 us |       1.98% | -23.155 us |  -7.88% |  🟢 FAST  |
|  BF16   |      2^16      |   6.379 us |       8.19% |   6.954 us |       9.07% |   0.576 us |   9.02% |  🔴 SLOW  |
|  BF16   |      2^20      |   8.310 us |       4.12% |   9.001 us |       4.70% |   0.691 us |   8.32% |  🔴 SLOW  |
|  BF16   |      2^24      |  26.716 us |       1.20% |  24.162 us |       2.31% |  -2.554 us |  -9.56% |  🟢 FAST  |
|  BF16   |      2^28      | 303.896 us |       0.31% | 253.096 us |       1.72% | -50.800 us | -16.72% |  🟢 FAST  |
|   F32   |      2^16      |   6.541 us |      10.92% |   7.248 us |      10.24% |   0.708 us |  10.82% |  🔴 SLOW  |
|   F32   |      2^20      |   9.613 us |       9.27% |   9.553 us |       9.61% |  -0.060 us |  -0.63% |  🔵 SAME  |
|   F32   |      2^24      |  38.943 us |       0.91% |  38.010 us |       2.57% |  -0.933 us |  -2.40% |  🟢 FAST  |
|   F32   |      2^28      | 491.279 us |       2.08% | 464.526 us |       0.82% | -26.752 us |  -5.45% |  🟢 FAST  |

# le

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.323 us |       6.83% |   6.907 us |       8.48% |   0.584 us |   9.24% |  🔴 SLOW  |
|   F16   |      2^20      |   8.307 us |       4.12% |   9.098 us |       4.87% |   0.791 us |   9.52% |  🔴 SLOW  |
|   F16   |      2^24      |  24.441 us |       1.79% |  23.510 us |       1.25% |  -0.932 us |  -3.81% |  🟢 FAST  |
|   F16   |      2^28      | 249.291 us |       0.73% | 236.563 us |       0.22% | -12.728 us |  -5.11% |  🟢 FAST  |
|  BF16   |      2^16      |   6.357 us |       7.25% |   6.574 us |       9.71% |   0.216 us |   3.40% |  🔵 SAME  |
|  BF16   |      2^20      |   8.287 us |       3.93% |   8.255 us |       3.76% |  -0.032 us |  -0.38% |  🔵 SAME  |
|  BF16   |      2^24      |  24.438 us |       1.79% |  22.832 us |       2.88% |  -1.607 us |  -6.57% |  🟢 FAST  |
|  BF16   |      2^28      | 248.557 us |       0.43% | 236.697 us |       0.41% | -11.860 us |  -4.77% |  🟢 FAST  |
|   F32   |      2^16      |   6.393 us |       8.51% |   6.424 us |       9.42% |   0.031 us |   0.48% |  🔵 SAME  |
|   F32   |      2^20      |   9.274 us |      10.56% |   9.183 us |      10.81% |  -0.091 us |  -0.98% |  🔵 SAME  |
|   F32   |      2^24      |  37.217 us |       1.91% |  37.010 us |       1.13% |  -0.206 us |  -0.55% |  🔵 SAME  |
|   F32   |      2^28      | 455.716 us |       0.30% | 456.017 us |       0.22% |   0.301 us |   0.07% |  🔵 SAME  |

# ge

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.295 us |       6.18% |   6.837 us |      10.69% |   0.542 us |   8.61% |  🔴 SLOW  |
|   F16   |      2^20      |   8.278 us |       3.74% |   8.687 us |       6.22% |   0.408 us |   4.93% |  🔴 SLOW  |
|   F16   |      2^24      |  24.531 us |       1.27% |  23.073 us |       3.58% |  -1.458 us |  -5.94% |  🟢 FAST  |
|   F16   |      2^28      | 248.326 us |       0.37% | 236.616 us |       0.34% | -11.710 us |  -4.72% |  🟢 FAST  |
|  BF16   |      2^16      |   6.313 us |       6.87% |   6.994 us |       7.14% |   0.681 us |  10.79% |  🔴 SLOW  |
|  BF16   |      2^20      |   8.325 us |       4.15% |   8.565 us |       5.99% |   0.239 us |   2.88% |  🔵 SAME  |
|  BF16   |      2^24      |  24.459 us |       1.77% |  23.538 us |       1.16% |  -0.921 us |  -3.77% |  🟢 FAST  |
|  BF16   |      2^28      | 247.964 us |       0.18% | 236.545 us |       0.07% | -11.419 us |  -4.60% |  🟢 FAST  |
|   F32   |      2^16      |   6.345 us |       7.37% |   6.725 us |       8.40% |   0.379 us |   5.98% |  🔵 SAME  |
|   F32   |      2^20      |   9.473 us |       9.79% |   9.255 us |       7.30% |  -0.219 us |  -2.31% |  🔵 SAME  |
|   F32   |      2^24      |  37.209 us |       1.79% |  37.015 us |       1.13% |  -0.194 us |  -0.52% |  🔵 SAME  |
|   F32   |      2^28      | 455.627 us |       0.27% | 456.012 us |       0.27% |   0.385 us |   0.08% |  🔵 SAME  |

# fmin

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.403 us |       7.90% |   6.518 us |      10.77% |   0.115 us |   1.80% |  🔵 SAME  |
|   F16   |      2^20      |   8.280 us |       6.50% |   8.195 us |       4.66% |  -0.085 us |  -1.03% |  🔵 SAME  |
|   F16   |      2^24      |  24.485 us |       1.62% |  23.072 us |       3.52% |  -1.414 us |  -5.77% |  🟢 FAST  |
|   F16   |      2^28      | 259.245 us |       2.14% | 236.786 us |       0.60% | -22.460 us |  -8.66% |  🟢 FAST  |
|  BF16   |      2^16      |   6.318 us |       7.04% |   6.521 us |      10.70% |   0.203 us |   3.22% |  🔵 SAME  |
|  BF16   |      2^20      |   8.275 us |       4.27% |   8.240 us |       4.20% |  -0.034 us |  -0.42% |  🔵 SAME  |
|  BF16   |      2^24      |  24.356 us |       2.21% |  23.071 us |       3.51% |  -1.284 us |  -5.27% |  🟢 FAST  |
|  BF16   |      2^28      | 256.561 us |       1.45% | 236.415 us |       0.54% | -20.146 us |  -7.85% |  🟢 FAST  |
|   F32   |      2^16      |   7.005 us |       6.07% |   6.437 us |       9.53% |  -0.569 us |  -8.12% |  🟢 FAST  |
|   F32   |      2^20      |   9.382 us |       8.58% |   9.300 us |      10.55% |  -0.082 us |  -0.87% |  🔵 SAME  |
|   F32   |      2^24      |  37.740 us |       1.44% |  37.062 us |       1.36% |  -0.678 us |  -1.80% |  🟢 FAST  |
|   F32   |      2^28      | 458.709 us |       0.50% | 455.697 us |       0.20% |  -3.012 us |  -0.66% |  🟢 FAST  |

# fmax

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F16   |      2^16      |   6.600 us |       8.96% |   6.598 us |      11.14% |  -0.003 us |  -0.04% |  🔵 SAME  |
|   F16   |      2^20      |   8.914 us |       5.27% |   8.298 us |       4.05% |  -0.616 us |  -6.91% |  🟢 FAST  |
|   F16   |      2^24      |  23.713 us |       2.43% |  22.921 us |       3.13% |  -0.792 us |  -3.34% |  🟢 FAST  |
|   F16   |      2^28      | 258.364 us |       2.31% | 236.720 us |       0.61% | -21.644 us |  -8.38% |  🟢 FAST  |
|  BF16   |      2^16      |   6.355 us |       8.02% |   6.516 us |      10.75% |   0.160 us |   2.52% |  🔵 SAME  |
|  BF16   |      2^20      |   8.682 us |       5.92% |   8.300 us |       3.99% |  -0.383 us |  -4.41% |  🟢 FAST  |
|  BF16   |      2^24      |  23.720 us |       2.48% |  22.928 us |       3.02% |  -0.793 us |  -3.34% |  🟢 FAST  |
|  BF16   |      2^28      | 255.967 us |       1.44% | 236.279 us |       0.48% | -19.689 us |  -7.69% |  🟢 FAST  |
|   F32   |      2^16      |   6.915 us |       8.34% |   6.645 us |      12.04% |  -0.270 us |  -3.90% |  🔵 SAME  |
|   F32   |      2^20      |   9.345 us |       9.06% |   9.377 us |      10.28% |   0.032 us |   0.35% |  🔵 SAME  |
|   F32   |      2^24      |  38.702 us |       1.48% |  37.065 us |       1.40% |  -1.637 us |  -4.23% |  🟢 FAST  |
|   F32   |      2^28      | 459.056 us |       0.58% | 455.624 us |       0.21% |  -3.432 us |  -0.75% |  🟢 FAST  |

# Summary

- Total Matches: 168
  - Pass    (diff <= min_noise): 55
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  113

fill: before vs this PR (auto)

# fill

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |      2^16      |   6.157 us |       6.84% |   6.205 us |      14.33% |  0.048 us |   0.79% |  🔵 SAME  |
|   I8    |      2^20      |   6.177 us |       9.18% |   6.239 us |       9.76% |  0.062 us |   1.00% |  🔵 SAME  |
|   I8    |      2^24      |   8.333 us |       4.55% |   8.283 us |       3.95% | -0.050 us |  -0.60% |  🔵 SAME  |
|   I8    |      2^28      |  41.174 us |       1.31% |  41.107 us |       1.18% | -0.067 us |  -0.16% |  🔵 SAME  |
|   I8    |      2^32      | 569.329 us |       0.16% | 569.344 us |       0.15% |  0.015 us |   0.00% |  🔵 SAME  |
|   I16   |      2^16      |   6.132 us |       9.77% |   6.149 us |       9.05% |  0.017 us |   0.28% |  🔵 SAME  |
|   I16   |      2^20      |   6.262 us |       5.27% |   6.259 us |       5.24% | -0.004 us |  -0.06% |  🔵 SAME  |
|   I16   |      2^24      |  10.712 us |       7.08% |  10.287 us |       3.32% | -0.425 us |  -3.97% |  🟢 FAST  |
|   I16   |      2^28      |  76.594 us |       0.65% |  75.944 us |       0.67% | -0.649 us |  -0.85% |  🟢 FAST  |
|   I16   |      2^32      |   1.133 ms |       0.18% |   1.133 ms |       0.17% |  0.109 us |   0.01% |  🔵 SAME  |
|   I32   |      2^16      |   6.570 us |      11.53% |   6.210 us |       5.42% | -0.361 us |  -5.49% |  🟢 FAST  |
|   I32   |      2^20      |   6.577 us |       7.77% |   6.241 us |       5.04% | -0.336 us |  -5.11% |  🟢 FAST  |
|   I32   |      2^24      |  14.593 us |       4.20% |  14.595 us |       4.32% |  0.003 us |   0.02% |  🔵 SAME  |
|   I32   |      2^28      | 147.251 us |       0.42% | 147.345 us |       0.36% |  0.094 us |   0.06% |  🔵 SAME  |
|   I32   |      2^32      |   2.260 ms |       0.13% |   2.260 ms |       0.14% | -0.428 us |  -0.02% |  🔵 SAME  |
|   I64   |      2^16      |   6.233 us |       5.23% |   6.243 us |       5.33% |  0.010 us |   0.16% |  🔵 SAME  |
|   I64   |      2^20      |   7.227 us |      12.11% |   7.257 us |      13.39% |  0.030 us |   0.42% |  🔵 SAME  |
|   I64   |      2^24      |  24.504 us |       2.36% |  24.553 us |       1.57% |  0.048 us |   0.20% |  🔵 SAME  |
|   I64   |      2^28      | 288.622 us |       0.31% | 288.638 us |       0.29% |  0.016 us |   0.01% |  🔵 SAME  |
|   I64   |      2^32      |   4.518 ms |       0.10% |   4.518 ms |       0.11% |  0.342 us |   0.01% |  🔵 SAME  |

# Summary

- Total Matches: 20
  - Pass    (diff <= min_noise): 16
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  4

fib: before vs this PR (auto)

# fibonacci

## [0] NVIDIA B200

|  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|      2^16      |   7.936 us |       7.08% |   8.335 us |       4.16% |   0.399 us |   5.02% |  🔴 SLOW  |
|      2^20      |  13.949 us |       3.92% |  13.896 us |       4.00% |  -0.052 us |  -0.38% |  🔵 SAME  |
|      2^24      |  90.213 us |       0.32% |  91.262 us |       1.05% |   1.050 us |   1.16% |  🔴 SLOW  |
|      2^28      |   1.317 ms |       0.03% |   1.330 ms |       0.04% |  12.292 us |   0.93% |  🔴 SLOW  |
|      2^32      |  20.950 ms |       0.00% |  21.148 ms |       0.00% | 198.453 us |   0.95% |  🔴 SLOW  |

# Summary

- Total Matches: 5
  - Pass    (diff <= min_noise): 1
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  4

grayscale: before vs this PR (auto)

# grayscale

## [0] NVIDIA B200

|  T{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   F32   |      2^16      |   6.577 us |      10.30% |   6.544 us |      10.81% | -0.033 us |  -0.51% |  🔵 SAME  |
|   F32   |      2^20      |  10.044 us |       6.26% |  10.034 us |       7.95% | -0.010 us |  -0.10% |  🔵 SAME  |
|   F32   |      2^24      |  47.214 us |       0.98% |  47.212 us |       0.96% | -0.002 us |  -0.00% |  🔵 SAME  |
|   F32   |      2^28      | 603.674 us |       0.44% | 603.902 us |       0.45% |  0.228 us |   0.04% |  🔵 SAME  |
|   F32   |      2^32      |   9.624 ms |       0.51% |   9.625 ms |       0.50% |  1.684 us |   0.02% |  🔵 SAME  |
|   F64   |      2^16      |   7.213 us |      11.39% |   7.355 us |      10.92% |  0.142 us |   1.97% |  🔵 SAME  |
|   F64   |      2^20      |  12.376 us |       2.65% |  12.380 us |       2.54% |  0.004 us |   0.03% |  🔵 SAME  |
|   F64   |      2^24      |  83.780 us |       0.73% |  83.764 us |       0.75% | -0.016 us |  -0.02% |  🔵 SAME  |
|   F64   |      2^28      |   1.184 ms |       0.09% |   1.184 ms |       0.08% | -0.148 us |  -0.01% |  🔵 SAME  |

# Summary

- Total Matches: 9
  - Pass    (diff <= min_noise): 9
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  0

heavy: before vs this PR (store_vec=1)

# heavy

## [0] NVIDIA B200

|  Heaviness{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|-----------------|----------------|------------|-------------|------------|-------------|--------------|---------|----------|
|       32        |      2^16      |   8.051 us |       4.96% |   7.773 us |       7.69% |    -0.278 us |  -3.45% |  🔵 SAME  |
|       32        |      2^20      |  24.530 us |       2.59% |  24.545 us |       2.45% |     0.015 us |   0.06% |  🔵 SAME  |
|       32        |      2^24      | 252.037 us |       0.15% | 252.040 us |       0.14% |     0.002 us |   0.00% |  🔵 SAME  |
|       32        |      2^28      |   3.860 ms |       0.02% |   3.860 ms |       0.02% |     0.070 us |   0.00% |  🔵 SAME  |
|       32        |      2^32      |  61.585 ms |       0.00% |  61.586 ms |       0.00% |     0.320 us |   0.00% |  🔵 SAME  |
|       64        |      2^16      |   9.076 us |       8.52% |   9.144 us |       6.94% |     0.068 us |   0.75% |  🔵 SAME  |
|       64        |      2^20      |  43.157 us |       0.91% |  43.196 us |       0.94% |     0.039 us |   0.09% |  🔵 SAME  |
|       64        |      2^24      | 512.204 us |       0.08% | 512.164 us |       0.08% |    -0.040 us |  -0.01% |  🔵 SAME  |
|       64        |      2^28      |   7.884 ms |       0.01% |   7.884 ms |       0.01% |     0.036 us |   0.00% |  🔵 SAME  |
|       64        |      2^32      | 125.804 ms |       0.00% | 125.804 ms |       0.00% |    -0.278 us |  -0.00% |  🔵 SAME  |
|       128       |      2^16      |  11.861 us |       4.83% |  12.189 us |       2.75% |     0.328 us |   2.76% |  🔴 SLOW  |
|       128       |      2^20      |  79.807 us |       1.26% |  79.841 us |       0.42% |     0.035 us |   0.04% |  🔵 SAME  |
|       128       |      2^24      |   1.043 ms |       0.05% |   1.043 ms |       0.07% |    -0.468 us |  -0.04% |  🔵 SAME  |
|       128       |      2^28      |  16.219 ms |       0.00% |  16.219 ms |       0.00% |     0.701 us |   0.00% |  🔴 SLOW  |
|       128       |      2^32      | 259.078 ms |       0.00% | 259.079 ms |       0.00% |     0.949 us |   0.00% |  🔵 SAME  |
|       256       |      2^16      |  18.499 us |       1.82% |  18.498 us |       1.80% |    -0.001 us |  -0.01% |  🔵 SAME  |
|       256       |      2^20      | 204.865 us |       4.63% | 203.131 us |       5.36% |    -1.734 us |  -0.85% |  🔵 SAME  |
|       256       |      2^24      |   2.744 ms |       1.24% |   2.739 ms |       1.20% |    -5.325 us |  -0.19% |  🔵 SAME  |
|       256       |      2^28      |  41.877 ms |       0.40% |  41.894 ms |       0.50% |    16.977 us |   0.04% |  🔵 SAME  |
|       256       |      2^32      | 707.687 ms |       1.16% | 706.294 ms |       1.00% | -1392.844 us |  -0.20% |  🔵 SAME  |

# Summary

- Total Matches: 20
  - Pass    (diff <= min_noise): 18
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  2

complex_cmp — before vs this PR (store_vec=1 opt-out)

# compare_complex

## [0] NVIDIA B200

|  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|----------------|------------|-------------|------------|-------------|-----------|---------|----------|
|      2^16      |   8.037 us |       4.89% |   8.070 us |       4.34% |  0.033 us |   0.41% |  🔵 SAME  |
|      2^20      |  12.606 us |       3.55% |  12.644 us |       3.63% |  0.038 us |   0.30% |  🔵 SAME  |
|      2^24      |  89.987 us |       0.46% |  89.994 us |       0.45% |  0.007 us |   0.01% |  🔵 SAME  |
|      2^28      |   1.307 ms |       0.04% |   1.307 ms |       0.04% | -0.005 us |  -0.00% |  🔵 SAME  |
|      2^32      |  20.784 ms |       0.00% |  20.784 ms |       0.00% | -0.263 us |  -0.00% |  🔵 SAME  |

# Summary

- Total Matches: 5
  - Pass    (diff <= min_noise): 5
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  0

@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Jun 16, 2026
@nanan-nvidia nanan-nvidia force-pushed the device_transform_align_16 branch from 12991ad to c6559fb Compare June 16, 2026 05:44
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test c6559fb

@nanan-nvidia nanan-nvidia self-assigned this Jun 16, 2026
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 365bf3a

@nanan-nvidia nanan-nvidia force-pushed the device_transform_align_16 branch from 365bf3a to c2a253d Compare June 16, 2026 06:45
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test c2a253d

@bernhardmgruber

Copy link
Copy Markdown
Contributor

@nanan-nvidia can you please post the output of nvbench_compare.py --no-color as well? I think they are easier to read.

@nanan-nvidia

This comment was marked as outdated.

Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh
Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh
Comment thread cub/test/catch2_test_device_transform_vectorized.cu Outdated
Comment thread cub/test/catch2_test_device_transform_aligned.cu Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: Once we approve the general mechanics, we should update the per-function documentation.

Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh Outdated
// didn't merge the changes. The problem was mostly a 25% increase in integer instructions, as shown by ncu.
template <int threads_per_block,
int UnrollFactor,
int OutputAlign,

This comment was marked as outdated.

Comment on lines +902 to +903
// When the caller guarantees aligned_size_t<N> num_items, i.e. the output pointer is N-byte aligned and the element
// count is a multiple of N, if 1. there are no predicates, 2. memory layout is contiguous, 3. semantically we can

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical:

the element count is a multiple of N

This is only true for elements with power of two size. Think of aligned_size_t<16>{n} and int3*, which is valid for every n that is a multiple of 4.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In vectorize_store we require both input element size and output element size to be pow2, so this should not be a bug.

From the definition,

explicit constexpr aligned_size_t(size_t __s) : value(__s) {
  _CCCL_ASSERT(value % align == 0,
               "aligned_size_t must be constructed with a size that is a multiple of the alignment");
}

It seems aligned_size_t<16>(n) just mean n % 16 == 0? Does it actually mean (sizeof(T) * n) % 16 == 0 semantically?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aligned_size_t refers to a size in bytes, not in elements, at least according to what we currently document.

Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh Outdated
@bernhardmgruber

Copy link
Copy Markdown
Contributor

@nanan-nvidia the results look promising. I think the implementation needs a bit more work though. Please check out the notes I have from a discussion with DevTech working on PyTorch: https://github.com/NVIDIA-dev/cccl_private/issues/598.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 6b64983

@nanan-nvidia nanan-nvidia changed the title Vectorize store in ublkcp kernel for deviceTransform when user promises cuda::aligned_size_t<16> Vectorize output store in ublkcp DeviceTransform kernel Jun 23, 2026
@nanan-nvidia nanan-nvidia force-pushed the device_transform_align_16 branch from a1d2946 to c9a53d7 Compare June 23, 2026 03:08
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test c9a53d7

@nanan-nvidia nanan-nvidia moved this from In Progress to In Review in CCCL Jun 23, 2026
@nanan-nvidia nanan-nvidia marked this pull request as ready for review June 23, 2026 04:36
@nanan-nvidia nanan-nvidia requested a review from a team as a code owner June 23, 2026 04:36
@nanan-nvidia nanan-nvidia requested a review from fbusato June 23, 2026 04:36
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 897a0a4

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Note: CodeRabbit is enabled on this repository as a convenience for maintainers and contributors. Use your best judgment when considering its review comments and suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what ultimately matter for merging.

Overview

This PR improves performance of CUB ublkcp-based async DeviceTransform by adding an optional vectorized output-store fast path (up to 16 bytes / STG.128) when it is provably safe, while preserving existing behavior by default.

Key Changes

1) New tuning policy: store_vec_size

In cub/cub/device/dispatch/tuning/tuning_transform.cuh, TransformAsyncCopyPolicy adds a store_vec_size knob (default 0):

  • store_vec_size = 0: auto (default vector width choice: 16 / sizeof(output))
  • store_vec_size = 1: disable vectorization (scalar path; useful for register-heavy functors)
  • store_vec_size = N: request N elements per vectorized store (N is silently capped so the store width does not exceed 16 bytes)

TransformAsyncCopyPolicy equality (operator==) and operator<< printing are updated to include .store_vec_size.

2) Runtime eligibility: compute can_vectorize in the dispatcher

In cub/cub/device/dispatch/dispatch_transform.cuh (async ublkcp launch path):

  • It no longer hard-disables the vectorized-store eligibility flag.
  • When the output iterator is contiguous, it computes a runtime can_vectorize by:
    • deriving the output value size
    • selecting a candidate vector width (up to 16 bytes)
    • verifying power-of-two constraints for:
      • output element/value size
      • each input element size
      • the chosen vector width
  • The resulting can_vectorize is passed into kernel_source.CanVectorize(...) and forwarded to the kernel launcher.

3) Kernel support: vectorized stores gated on compile-time safety + runtime can_vectorize

In cub/cub/device/dispatch/kernels/kernel_transform.cuh, transform_kernel_ublkcp:

  • Adds a StoreVecSize template parameter and accepts a runtime [[maybe_unused]] bool can_vectorize.
  • Computes a compile-time vectorize_eligible condition that requires (among other checks):
    • no predicate usage (Predicate is ::cuda::always_true)
    • contiguous output iterator layout
    • trivially relocatable output element type
    • output element size is power-of-two and ≤ 16 bytes
    • all input element sizes are power-of-two
  • If vectorize_eligible && can_vectorize:
    • executes a vectorized/grouped store path using store_vec_size-sized groups
    • fully unrolls per-lane f(...) evaluation over the loaded operands
    • stores packed results via a store_t-typed cast
    • performs an explicit scalar tail loop for elements not covered by whole store_vec_size groups and returns early
  • Otherwise, the kernel falls back to the existing pred-guarded scalar/tiled logic.

The transform_kernel ublkcp branch is updated to instantiate this kernel with policy.async_copy.store_vec_size and forward can_vectorize.

Tests: correctness for vectorized and non-vectorized paths

A new cub/test/catch2_test_device_transform_vectorized.cu adds coverage for cub::DeviceTransform::Transform / transform_many when vectorized-store casting paths are possible:

  • Narrowing: generate uint16_t/uint32_t/uint64_t, cast to uint8_t, compare to std::transform
  • Widening: generate uint8_t, cast to uint16_t/uint32_t/uint64_t, compare to std::transform
  • Uses GENERIC_COUNTS chosen to exercise both vectorized coverage and scalar tail behavior (including counts not divisible by the 16-byte vectorized store width)
  • Adds an explicit store_vec_size test that sets async.store_vec_size = 2 (with a policy selector) for a uint32_t -> uint8_t transform and verifies correctness vs host reference.

Safety/Robustness Notes

Vectorized stores are enabled only when both:

  • compile-time safety holds (contiguous layout, trivially relocatable output, and power-of-two element sizes with output width ≤ 16 bytes), and
  • the dispatcher’s runtime can_vectorize says the chosen vector width and iterator/value sizes are compatible.

This prevents problematic non-power-of-two element cases (e.g., sizes like uchar3) from taking the vectorized path.

Performance

Benchmarks on NVIDIA B200 show substantial gains on memory-bound workloads where output element sizing permits vectorization, including:

  • babelstream: mul with I8 up to 28.51% (2^32 elements) and add with I8 up to 18.89%
  • PyTorch-like transcendental ops (sigmoid/tanh/gelu): improvements up to 24.76%
  • fill operations: improvements up to ~5.49%

Some minor regressions may occur for certain larger floating-point types/smaller problem sizes (within measurement noise or when vectorization offers minimal benefit). Performance-sensitive kernels can opt out via store_vec_size = 1.

Review Process / Iteration

Reviewer feedback requested benchmark output formatted with nvbench_compare.py --no-color; updated results were provided in that format. Iterative refinement addressed edge cases and integration concerns around type-erased iterators, preventing unwanted default construction, gating alignment/eligibility checks on power-of-two widths, and adding regression coverage for non-power-of-two element sizes.

Walkthrough

Adds a vectorized-store fast path to the UBLKCP transform kernel by introducing store_vec_size tuning parameter, computing runtime can_vectorize eligibility via power-of-two checks and CanVectorize() call, extending the kernel to perform grouped packed stores with scalar tail when eligible, and validating with narrowing, widening, and tuned Catch2 tests.

Changes

Vectorized Store Fast Path

Layer / File(s) Summary
TransformAsyncCopyPolicy store_vec_size field
cub/cub/device/dispatch/tuning/tuning_transform.cuh
Adds store_vec_size field (default 0 = auto) to TransformAsyncCopyPolicy, extends operator== to include it in equality comparison, and updates operator<< to display it in policy output.
Runtime can_vectorize computation in dispatcher
cub/cub/device/dispatch/dispatch_transform.cuh
Adds cuda/__cmath/pow2.h include; replaces hardcoded false with runtime can_vectorize computed from output element size, candidate vector width clamped to 16 bytes, power-of-two checks on output and all input element sizes, and kernel_source.CanVectorize(...) call result.
UBLKCP kernel vectorized-store implementation
cub/cub/device/dispatch/kernels/kernel_transform.cuh
Extends transform_kernel_ublkcp with StoreVecSize template parameter and can_vectorize runtime parameter. Adds fast path gated on compile-time vectorize_eligible (contiguous output, trivially relocatable, power-of-two sizes ≤16, always_true predicate) and runtime can_vectorize: performs grouped shared-memory loads, unrolled f(...) evaluation per lane, packed store_t vectorized stores, unconditional scalar tail loop, and early return. Falls through to existing pred-guarded scalar path if conditions unmet. Updates ublkcp dispatch site to instantiate with policy.async_copy.store_vec_size and pass can_vectorize.
Vectorized store Catch2 tests
cub/test/catch2_test_device_transform_vectorized.cu
Test harness with cast_to functor providing __host__ __device__ static cast for reference generation. Defines GENERIC_COUNTS with non-16B-multiple sizes to exercise scalar tail. Two parameterized C2H_TEST cases cover narrowing to uint8_t and widening from uint8_t, comparing device output to host std::transform reference. Custom ublkcp_store_vec_size_2_selector tuning policy with store_vec_size = 2 enables fallback test verifying correctness with tuned execution environment.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0a2cef84-bae0-4e12-9f34-a71fdf6138d3

📥 Commits

Reviewing files that changed from the base of the PR and between f150d51 and c9a53d7.

📒 Files selected for processing (4)
  • cub/cub/device/dispatch/dispatch_transform.cuh
  • cub/cub/device/dispatch/kernels/kernel_transform.cuh
  • cub/cub/device/dispatch/tuning/tuning_transform.cuh
  • cub/test/catch2_test_device_transform_vectorized.cu

Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh Outdated
Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh
Comment thread cub/test/catch2_test_device_transform_vectorized.cu
Comment thread cub/test/catch2_test_device_transform_vectorized.cu
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test aff5ab5

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: eae71fb4-8cb5-4722-8188-2a3cf5a8adc1

📥 Commits

Reviewing files that changed from the base of the PR and between 897a0a4 and aff5ab5.

📒 Files selected for processing (2)
  • cub/cub/device/dispatch/kernels/kernel_transform.cuh
  • cub/test/catch2_test_device_transform_vectorized.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cub/cub/device/dispatch/kernels/kernel_transform.cuh

Comment thread cub/test/catch2_test_device_transform_vectorized.cu Outdated
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 71cefee

Comment thread cub/cub/device/dispatch/kernels/kernel_transform.cuh Outdated
Comment on lines +918 to +1003
if (can_vectorize)
{
// store_vec (S) output elements per STG.128/64/.../8, defaulting to vec_size (= 16 / sizeof(output), today's
// 16-byte store). Shrinking S narrows the store but also reduces the number of fully-unrolled lambda calls per
// store, which bounds register pressure for heavy functors (whose stores aren't the bottleneck anyway). res[] is
// indexed only by the fully-unrolled k, i.e. compile-time, so it stays in registers and never spills to local
// memory regardless of S.
using store_t = decltype(load_store_type<store_vec * out_size>());
auto* out_vec = reinterpret_cast<store_t*>(out);
const int num_groups = valid_items / store_vec;
for (int g = threadIdx.x; g < num_groups; g += threads_per_block)
{
char* smem = smem_base;
auto load_chunk = [&](auto aligned_ptr) {
using T = typename decltype(aligned_ptr)::value_type;
// on blackwell, head_padding should always be zero
// on hopper, bulk_copy_alignment is 128 bytes, head_padding could be 112 bytes for example
// alignof(T) will always be powers of 2 per C++ standard
const T* base = reinterpret_cast<const T*>(smem + aligned_ptr.head_padding);
smem += tile_padding + int{sizeof(T)} * tile_size;
// Gather this input's vec_size elements for output-vector v into a register array. we take the maximal
// alignment out of alignof(T) and 16 bytes. If input is narrower, we will waste a few (0-16) registers
constexpr ::cuda::std::size_t chunk_align = (::cuda::std::max) (alignof(T), alignof(int4));
::cuda::__uninitialized_array<T, store_vec, chunk_align> elems;
constexpr int chunk_bytes = int{sizeof(T)} * store_vec;
// if same width or narrowing (e.g. int32 -> int8), we split it up into multiple 16 byte reads
// CAREFUL: the byte width sizeof(T) * vec_size can exceed 16 when the input is wider than the output.
// However, since input both input type size and output size is pow2, when the input is wider, it has to be
// pow2 times wider. Therefore, chunk_bytes = input size * vec_size is always divisible by 16
// (recall 16 = output size * vec_size) , i.e. we can read it as multiple int4 loads
if constexpr (chunk_bytes % int{sizeof(int4)} == 0)
{
constexpr int n = chunk_bytes / int{sizeof(int4)};
const int4* s = reinterpret_cast<const int4*>(base) + g * n;
_CCCL_PRAGMA_UNROLL_FULL()
for (int i = 0; i < n; ++i)
{
reinterpret_cast<int4*>(elems.data())[i] = s[i];
}
}
// if widening (e.g. int8 -> int32), just load it in one go. recall chunk_bytes = input size * vec_size, and
// vec_size = 16 / output size. Since output size is pow2, vec_size is pow2. Hence chunk_bytes is always pow2.
// this ensures load_store_type<chunk_bytes> will never fail.
else
{
using sub_t = decltype(load_store_type<chunk_bytes>());
*reinterpret_cast<sub_t*>(elems.data()) = reinterpret_cast<const sub_t*>(base)[g];
}
return elems;
};
auto chunks = ::cuda::std::tuple{load_chunk(aligned_ptrs)...};

// must fully unroll to take full advantage of ILP. otherwise perf regress by half
::cuda::__uninitialized_array<output_t, store_vec, sizeof(output_t) * store_vec> res;
_CCCL_PRAGMA_UNROLL_FULL()
for (int k = 0; k < store_vec; ++k)
{
res[k] = ::cuda::std::apply(
[&](auto&... c) {
return f(c[k]...);
},
chunks);
}
out_vec[g] = *reinterpret_cast<const store_t*>(res.data());
}

// scalar tail: the up to (store_vec - 1) trailing elements not covered by a whole store group. can_vectorize
// implies an always_true predicate, so we store unconditionally.
for (int idx = num_groups * store_vec + threadIdx.x; idx < valid_items; idx += threads_per_block)
{
char* smem = smem_base;
auto fetch_operand = [&](auto aligned_ptr) {
using T = typename decltype(aligned_ptr)::value_type;
const int head_padding = alignof(T) < bulk_copy_alignment ? aligned_ptr.head_padding : 0;
const char* src = smem + head_padding;
smem += tile_padding + int{sizeof(T)} * tile_size;
return reinterpret_cast<const T*>(src)[idx];
};
out[idx] = ::cuda::std::apply(
[&](auto... values) {
return f(values...);
},
::cuda::std::tuple<InTs...>{fetch_operand(aligned_ptrs)...});
}
return;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: This is a nontrivial amount of code, I believe we should extract it into a function.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about this one: this code does not seem to be that reusable across different kernels

Comment on lines +318 to +332
// Runtime check whether this launch can take the ublkcp kernel's vectorized (STG.128) store path. The output value
// type must pack into a 16-byte vector and all pointers must be suitably aligned. The kernel additionally gates on
// compile-time eligibility (contiguous, trivially relocatable, power-of-two element sizes, no predicate).
bool can_vectorize = false;
if constexpr (THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorOut>)
{
using output_t = it_value_t<RandomAccessIteratorOut>;
constexpr int out_size = int{size_of<output_t>};
constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1;
if constexpr (vec_size > 1 && ::cuda::is_power_of_two(out_size)
&& (... && ::cuda::is_power_of_two(int{sizeof(it_value_t<RandomAccessIteratorsIn>)})))
{
can_vectorize = kernel_source.CanVectorize(vec_size, out, ::cuda::std::get<Is>(in)...);
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why is this a runtime value? I dont see anything that should not be a compile time value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out is the runtime output pointer and we are checking it's runtime alignment inside CanVectorize.

@github-actions

This comment has been minimized.

Comment on lines +146 to +151
int unroll_factor = 1; //!< The unroll factor for the transformation loop in the kernel. The value 0 retains the
//!< compiler's default unrolling (specifying no unroll pragma), 1 means no unrolling.
// Vectorized store width for the ublkcp kernel. 0 means "auto": store_vec = 16 / sizeof(output) (a 16-byte STG.128).
// Setting it smaller narrows the store but also reduces the number of fully-unrolled lambda calls per store, which
// bounds register pressure for heavy functors (their stores aren't the bottleneck anyway).
int store_vec = 0; //!< Output elements per vectorized store (S). 0 = auto (16 / sizeof(output)).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important question: Why do we need a new tuning parameter, and cannot use unroll_factor?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the analysis that in the vectorized path, unroll_factor happens to be the same value as store_vec_size (I'm taking @miscco 's suggestion on renaming store_vec).

I think they should still be different parameters in the policy, because

  1. unroll_factor already has its own meaning in the regular scalar store. And even just semantically, unrolling something does not mean we are going to store it vectorized. Also, it is set to 1 by default, which is the best default documented for the scalar path. For the vectorized path, the best default would not be 1.
  2. store_vec_size has its own semantic meaning in the sense that:
    a. It means the width at which we store, and in that path we fully unroll the loop as a consequence of a certain store_vec_size.
    b. now we can express auto-vectorization, vectorization to a custom number, or disable vectorization all at once.

Comment on lines +318 to +332
// Runtime check whether this launch can take the ublkcp kernel's vectorized (STG.128) store path. The output value
// type must pack into a 16-byte vector and all pointers must be suitably aligned. The kernel additionally gates on
// compile-time eligibility (contiguous, trivially relocatable, power-of-two element sizes, no predicate).
bool can_vectorize = false;
if constexpr (THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorOut>)
{
using output_t = it_value_t<RandomAccessIteratorOut>;
constexpr int out_size = int{size_of<output_t>};
constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1;
if constexpr (vec_size > 1 && ::cuda::is_power_of_two(out_size)
&& (... && ::cuda::is_power_of_two(int{sizeof(it_value_t<RandomAccessIteratorsIn>)})))
{
can_vectorize = kernel_source.CanVectorize(vec_size, out, ::cuda::std::get<Is>(in)...);
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out is the runtime output pointer and we are checking it's runtime alignment inside CanVectorize.


using output_t = it_value_t<RandomAccessIteratorOut>;
constexpr int out_size = int{size_of<output_t>};
constexpr int vec_size = (out_size > 0 && out_size <= 16) ? 16 / out_size : 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why disallow 32 byte wide stores?

@nanan-nvidia nanan-nvidia Jun 23, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because:

  1. benchmarks on B200s showed no improvement across all cases with 256 bit store
  2. adding this would require
    a. having two code paths for sm90 and sm100 since sm90 doesnt have it;
    b. nvcc will only generate STG.E.ENL2.256 when it can prove the pointer is in global memory, so we need to provide extra hints to nvcc, which I feel is unnecessary for zero performance gain

};
auto chunks = ::cuda::std::tuple{load_chunk(aligned_ptrs)...};

// must fully unroll to take full advantage of ILP. otherwise perf regress by half

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a strong indicator that the vectorization size equals the unroll factor in the tuning policy.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my reply on unroll factor

Comment on lines +984 to +986
// scalar tail: the up to (store_vec - 1) trailing elements not covered by a whole store group. can_vectorize
// implies an always_true predicate, so we store unconditionally.
for (int idx = num_groups * store_vec + threadIdx.x; idx < valid_items; idx += threads_per_block)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Have you tried scaling the items per thread in the host-side dispatch as multiples of the vector store width, so we never need a tail logic?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate what you meant by scaling?

The main motive for adding a tail logic was so that, if the output pointer is aligned to 16 bytes, but the amount of data we store are not aligned to 16 bytes (i.e. 10,000,001 int8 elements), we can still get the performance benefit of doing the first 10,000,000 vectorized stores.

Comment thread cub/test/catch2_test_device_transform_vectorized.cu Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 54b773bf-2cab-483b-bb8d-289f48fab0d4

📥 Commits

Reviewing files that changed from the base of the PR and between 71cefee and 54db169.

📒 Files selected for processing (3)
  • cub/cub/device/dispatch/kernels/kernel_transform.cuh
  • cub/cub/device/dispatch/tuning/tuning_transform.cuh
  • cub/test/catch2_test_device_transform_vectorized.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cub/cub/device/dispatch/kernels/kernel_transform.cuh

Comment thread cub/test/catch2_test_device_transform_vectorized.cu
@nanan-nvidia

Copy link
Copy Markdown
Author

/ok to test 54db169

@github-actions

This comment has been minimized.

@github-actions

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 5h 00m: Pass: 100%/287 | Total: 11d 19h | Max: 2h 27m | Hits: 19%/952946

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

3 participants