[common] Add support for cuBLASLt GEMM for GroupedTensor #2502

pggPL · 2025-12-10T14:31:13Z

Description

Adds nvte_grouped_gemm API using cuBLASLt grouped matmul for batched GEMM on tensors with varying shapes. A GPU kernel (setup_grouped_gemm_kernel) converts NVTEGroupedTensor format (contiguous buffer + offsets) to cuBLAS requirements (pointer arrays + per-matrix M/N/K).

New API

void nvte_grouped_gemm(int transa, int transb, const NVTETensor alpha, const NVTEGroupedTensor A,
                       const NVTEGroupedTensor B, const NVTETensor beta, const NVTEGroupedTensor C,
                       NVTEGroupedTensor D, NVTETensor workspace_setup, NVTETensor workspace_cublas,
                       NVTEGroupedMatmulConfig config, cudaStream_t stream);

Computes D = alpha * op(A) @ op(B) + beta * C for groups of matrices with potentially different shapes.
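To make the semantics concrete, here is a minimal host-side reference of the formula above. It is only a sketch and not the cuBLASLt path: it assumes float data, row-major storage, no transposes, one alpha/beta scalar per group, and groups packed back to back in a single buffer. `GroupShape` and `grouped_gemm_reference` are hypothetical names that do not exist in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-group shape: A is M x K, B is K x N, C and D are M x N.
struct GroupShape {
  std::size_t m, n, k;
};

// Reference: D[g] = alpha[g] * A[g] @ B[g] + beta[g] * C[g] (row-major, no transposes).
// Every operand stores its groups back to back in one contiguous buffer.
std::vector<float> grouped_gemm_reference(const std::vector<GroupShape>& shapes,
                                          const std::vector<float>& alpha,
                                          const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          const std::vector<float>& beta,
                                          const std::vector<float>& c) {
  assert(alpha.size() == shapes.size() && beta.size() == shapes.size());
  std::size_t total_d = 0;
  for (const auto& s : shapes) total_d += s.m * s.n;
  std::vector<float> d(total_d, 0.0f);

  std::size_t a_off = 0, b_off = 0, d_off = 0;  // running element offsets per operand
  for (std::size_t g = 0; g < shapes.size(); ++g) {
    const GroupShape& s = shapes[g];
    for (std::size_t i = 0; i < s.m; ++i) {
      for (std::size_t j = 0; j < s.n; ++j) {
        float acc = 0.0f;
        for (std::size_t p = 0; p < s.k; ++p) {
          acc += a[a_off + i * s.k + p] * b[b_off + p * s.n + j];
        }
        // An empty C is treated as "no C operand"; that only makes sense when beta is 0.
        const float c_val = c.empty() ? 0.0f : c[d_off + i * s.n + j];
        d[d_off + i * s.n + j] = alpha[g] * acc + beta[g] * c_val;
      }
    }
    a_off += s.m * s.k;
    b_off += s.k * s.n;
    d_off += s.m * s.n;
  }
  return d;
}
```

A reference like this is mainly useful for reasoning about shapes and offsets; the tests in the PR compare against nvte_multi_tensor_gemm instead.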

Type of change

New feature (non-breaking change which adds functionality)

Changes

GPU setup kernel computing pointers/dims from grouped tensor metadata
FP8 support with scale_inv handling and TN layout selection on Hopper
GroupedGemmSetupWorkspace struct for cuBLAS workspace layout
Tests in test_grouped_gemm.cu comparing against nvte_multi_tensor_gemm (FP8/BF16, various shapes and transpose layouts)
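The setup step in the first bullet can be pictured with a host-side analogue. The sketch below is an assumption-laden illustration, not the kernel from the PR: it handles only operand A, uses float pointers, and loops on the CPU where the GPU kernel would presumably use one thread per group. The M/K mapping follows the convention quoted later in this thread (A stored as {K, M} when not transposed). `GroupedOperandArgs` and `setup_operand_a` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-group arrays a grouped matmul consumes.
struct GroupedOperandArgs {
  std::vector<const float*> ptrs;  // one base pointer per matrix
  std::vector<int> m;              // per-matrix M
  std::vector<int> k;              // per-matrix K
};

// base:       contiguous buffer holding every A matrix
// offsets:    element offset of each matrix inside `base`
// first_dims: stored first dimension of each matrix
// last_dims:  stored last dimension of each matrix
GroupedOperandArgs setup_operand_a(const float* base,
                                   const std::vector<std::int64_t>& offsets,
                                   const std::vector<std::int64_t>& first_dims,
                                   const std::vector<std::int64_t>& last_dims,
                                   bool transa) {
  const std::size_t n = offsets.size();
  assert(first_dims.size() == n && last_dims.size() == n);
  GroupedOperandArgs out;
  out.ptrs.resize(n);
  out.m.resize(n);
  out.k.resize(n);
  for (std::size_t idx = 0; idx < n; ++idx) {  // a GPU kernel would map idx to a thread
    out.ptrs[idx] = base + offsets[idx];
    // Assumed convention: stored as {K, M} when !transa, {M, K} when transa.
    out.m[idx] = static_cast<int>(transa ? first_dims[idx] : last_dims[idx]);
    out.k[idx] = static_cast<int>(transa ? last_dims[idx] : first_dims[idx]);
  }
  return out;
}
```

Doing this on the device keeps the offsets and dimensions out of host memory, which is what allows the grouped tensor metadata to stay on the GPU.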

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Pawel Gadzinski <[email protected]>

for more information, see https://pre-commit.ci

- Add FP8 scale_inv pointer handling in nvte_grouped_gemm for proper FP8 GEMM
- Fix random padding in tests to ensure 16-byte alignment for all dtypes
- Reorder GroupedGemmSetupWorkspace members for natural alignment
- Remove debug prints

Signed-off-by: Pawel Gadzinski <[email protected]>


Signed-off-by: Pawel Gadzinski <[email protected]>


pggPL · 2025-12-10T21:42:35Z

/te-ci L0

greptile-apps · 2025-12-10T21:49:47Z

Greptile Summary

Adds nvte_grouped_gemm API leveraging cuBLASLt grouped matmul for efficient batched GEMM on tensors with varying shapes. The implementation includes:

Core functionality: GPU kernel (setup_grouped_gemm_kernel) bridges NVTEGroupedTensor format (contiguous buffer + offsets) to cuBLAS requirements (pointer arrays + per-matrix dimensions)
FP8 support: Handles scale_inv pointers and enforces TN layout selection on Hopper GPUs via operand selection logic
Configuration: GroupedMatmulConfig with optional avg_m/avg_n/avg_k hints for algorithm heuristics
Testing: Comprehensive tests compare against nvte_multi_tensor_gemm baseline across FP8/BF16, various shapes, and transpose configurations
Test infrastructure: New build_grouped_tensor helper with RAII memory management and random padding for alignment testing

The implementation follows established patterns from the codebase, includes proper validation, and has thorough test coverage.

Confidence Score: 4/5

This PR is safe to merge with minor considerations
Score reflects solid implementation with comprehensive testing, but has one logical issue in config.h attr_sizes that needs addressing. The code follows established patterns, includes proper validation and error handling, and has extensive test coverage comparing against baseline implementation. The const_cast usage and non-deterministic test padding are acceptable design choices.
Pay attention to transformer_engine/common/gemm/config.h for the attr_sizes calculation issue with std::optional

Important Files Changed

Filename	Overview
transformer_engine/common/gemm/cublaslt_grouped_gemm.cu	New cuBLASLt grouped GEMM implementation with FP8 support; includes GPU setup kernel, operand selection, and workspace management. Well-structured with comprehensive validation.
transformer_engine/common/include/transformer_engine/gemm.h	API header additions for grouped GEMM: new config types, attributes, and main `nvte_grouped_gemm` function declaration with clear documentation.
tests/cpp/operator/test_grouped_gemm.cu	Comprehensive test suite comparing grouped GEMM against multi-tensor GEMM baseline. Tests FP8/BF16, various shapes, transpose combinations, and null C case.
tests/cpp/test_common.cu	Adds `build_grouped_tensor` helper with RAII memory management. Handles uniform/varying dimensions, computes offsets, includes random padding for alignment testing.

Sequence Diagram

sequenceDiagram
    participant User
    participant nvte_grouped_gemm
    participant Validation
    participant Operand Selection
    participant Setup Kernel
    participant cuBLASLt

    User->>nvte_grouped_gemm: Call with A, B, C, D, alpha, beta
    nvte_grouped_gemm->>Validation: Check SM >= 100 (Blackwell)
    nvte_grouped_gemm->>Validation: validate_grouped_gemm_inputs()
    Validation-->>nvte_grouped_gemm: OK
    
    nvte_grouped_gemm->>Operand Selection: select_grouped_operand(A, transa)
    Operand Selection->>Operand Selection: Check FP8 TN layout requirements
    Operand Selection->>Operand Selection: Choose row-wise vs column-wise data
    Operand Selection-->>nvte_grouped_gemm: A_sel (dptr, dtype, trans, use_columnwise)
    
    nvte_grouped_gemm->>Operand Selection: select_grouped_operand(B, transb)
    Operand Selection-->>nvte_grouped_gemm: B_sel (dptr, dtype, trans, use_columnwise)
    
    nvte_grouped_gemm->>Setup Kernel: Allocate setup workspace
    nvte_grouped_gemm->>Setup Kernel: launch_grouped_gemm_setup()
    Setup Kernel->>Setup Kernel: setup_grouped_gemm_kernel<<<blocks, threads>>>
    Note over Setup Kernel: Per-tensor computation:<br/>- Compute A/B/C/D pointers from offsets<br/>- Compute M/N/K from dimensions<br/>- Fill alpha_ptrs, beta_ptrs arrays
    Setup Kernel-->>nvte_grouped_gemm: Workspace populated
    
    nvte_grouped_gemm->>cuBLASLt: init_matrix_layouts(descA, descB, descC, descD)
    nvte_grouped_gemm->>cuBLASLt: init_matmul_desc(op_A, op_B)
    nvte_grouped_gemm->>cuBLASLt: set_fp8_scale_pointers() if FP8
    nvte_grouped_gemm->>cuBLASLt: select_grouped_gemm_algo() with avg hints
    cuBLASLt-->>nvte_grouped_gemm: Algorithm selected
    
    nvte_grouped_gemm->>cuBLASLt: cublasLtMatmul()
    Note over cuBLASLt: Execute grouped GEMM:<br/>D[i] = alpha[i] * op(A[i]) @ op(B[i]) + beta[i] * C[i]
    cuBLASLt-->>nvte_grouped_gemm: GEMM complete
    
    nvte_grouped_gemm-->>User: Return

greptile-apps

Additional Comments (4)

tests/cpp/operator/test_grouped_gemm.cu, line 75 (link)

logic: missing columnwise_data in move assignment

tests/cpp/operator/test_grouped_gemm.cu, line 336-337 (link)

logic: kSameFirst and kSameLast test cases use identical shape tuples. Should kSameFirst and kSameLast have different shape patterns to properly test the respective scenarios?

tests/cpp/operator/test_grouped_gemm.cu, line 367-378 (link)

logic: missing case for InputCase::kFP8Delayed

transformer_engine/common/include/transformer_engine/gemm.h, line 265-266 (link)

style: The avg_m, avg_n, avg_k parameters are not documented in the function comment

What do these average dimension parameters represent and how should they be computed?

4 files reviewed, 4 comments

- Remove unused alignment parameter from GroupedGemmSetupWorkspace::from_buffers
- Simplify select_grouped_operand by removing dead code branches
- Add GroupedOperandSelection.tensor field to avoid passing tensor separately
- Extract set_fp8_scale_pointers and init_matrix_layouts helpers
- Add safety check for FP8 on Hopper column-wise fallback
- Support NULL C tensor when beta=0 (uses D as placeholder)
- Remove unused get_scale_inv() from test
- Add use_null_c test parameter and test case
- Fix documentation: alpha/beta are single element tensors only

Signed-off-by: Piotr Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>


greptile-apps

Additional Comments (1)

transformer_engine/common/include/transformer_engine/gemm.h, line 266 (link)

syntax: Documentation incorrectly states alpha[i] and beta[i]. The implementation uses a single alpha and beta value for all matrices in the group (batch stride is 1 in init_matmul_desc at cublaslt_gemm.cu:1404), not per-matrix scaling.

4 files reviewed, 1 comment

- Change alpha/beta from single values to per-matrix arrays
- Validate alpha/beta have exactly num_tensors elements
- Update kernel to index alpha_ptr[idx] and beta_ptr[idx]
- Move alpha/beta validation to validate_grouped_gemm_inputs
- Update tests to use per-matrix alpha/beta arrays
- Update documentation

Signed-off-by: Piotr Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>


Signed-off-by: Piotr Gadzinski <[email protected]> Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL · 2025-12-11T11:28:23Z

/te-ci

timmoon10 · 2025-12-16T22:07:15Z

transformer_engine/common/include/transformer_engine/gemm.h

+                       NVTEMatmulConfig config, cudaStream_t stream, const int64_t *avg_m,
+                       const int64_t *avg_n, const int64_t *avg_k);


The average sizes seem like advanced configs that would be better to leave out of the top-level API. Can we move them inside NVTEMatmulConfig?

It was a suggestion from @ptrendx: customers may want to use them if they know more about the shapes. For example, if there are multiple tensors with K dimension D and one tensor with K dimension 1, telling cuBLAS that the average dimension is D may well result in better performance.

We should definitely expose these options, but it would be better to put them in NVTEMatmulConfig rather than the top-level function signature. If you look at nvte_cublas_gemm:

TransformerEngine/transformer_engine/common/include/transformer_engine/gemm.h

Lines 109 to 112 in 97a09c2

void nvte_cublas_gemm(const NVTETensor A, const NVTETensor B, NVTETensor D, const NVTETensor bias,
                      NVTETensor pre_gelu_out, bool transa, bool transb, bool grad,
                      NVTETensor workspace, bool accumulate, bool use_split_accumulator,
                      int math_sm_count, cudaStream_t stream);

You see advanced options like pre_gelu_out and use_split_accumulator that are only needed for specialized cases. It's even worse for use_split_accumulator, since that's only relevant for FP8 on Hopper and now it's sitting uselessly in the API forever. Compare with the v2 API, which is much closer to the original GEMM API:

TransformerEngine/transformer_engine/common/include/transformer_engine/gemm.h

Lines 131 to 133 in 97a09c2

void nvte_cublas_gemm_v2(int transa, int transb, const float *alpha, const NVTETensor A,
                         const NVTETensor B, const float *beta, const NVTETensor C, NVTETensor D,
                         NVTETensor workspace, NVTEMatmulConfig config, cudaStream_t stream);

The idea of NVTEMatmulConfig is to hold these advanced, kernel-specific options in a way where we can add or deprecate them easily without breaking API changes.

I added an analogous NVTEGroupedMatmulConfig.

It's worth noting that cuBLAS does not support pre_gelu_out and use_split_accumulator for grouped GEMM.

transformer_engine/common/include/transformer_engine/gemm.h

transformer_engine/common/gemm/cublaslt_gemm.cu

Signed-off-by: Pawel Gadzinski <[email protected]>

greptile-apps

Additional Comments (2)

transformer_engine/common/include/transformer_engine/gemm.h, line 256 (link)

syntax: Documentation mentions a non-existent config parameter; this parameter is not in the function signature on lines 276-280

tests/cpp/operator/test_grouped_gemm.cu, line 141 (link)

logic: Alignment calculation appears incorrect - multiplying 16 bytes by 8 bits gives 128 bits, not the intended 16-byte alignment
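For context on the units involved: 16 bytes and 128 bits describe the same boundary, so whether the flagged line is wrong depends on whether the surrounding code counts in bits or in bytes, which is not visible in this thread. The sketch below shows one way to express the padding in bits so that sub-byte element types can share the same helper. It assumes the element width divides 128 bits; `elems_per_alignment` and `align_offset_elems` are hypothetical names, not the test's actual helpers.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kAlignmentBytes = 16;
constexpr std::size_t kAlignmentBits = kAlignmentBytes * 8;  // 128 bits

// Number of elements that span one 16-byte unit for an element of `bits_per_elem` bits.
constexpr std::size_t elems_per_alignment(std::size_t bits_per_elem) {
  assert(bits_per_elem != 0 && kAlignmentBits % bits_per_elem == 0);
  return kAlignmentBits / bits_per_elem;
}

// Round an element offset up to the next offset that starts on a 16-byte boundary.
constexpr std::size_t align_offset_elems(std::size_t offset_elems, std::size_t bits_per_elem) {
  const std::size_t unit = elems_per_alignment(bits_per_elem);
  return (offset_elems + unit - 1) / unit * unit;
}
```

With this formulation a BF16 tensor (16 bits per element) is padded to multiples of 8 elements and an FP8 tensor to multiples of 16 elements.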

6 files reviewed, 2 comments

Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL · 2025-12-19T11:32:06Z

/te-ci L0

Signed-off-by: Pawel Gadzinski <[email protected]>

greptile-apps

Additional Comments (1)

tests/cpp/operator/test_grouped_gemm.cu, line 485 (link)

syntax: incorrect version check - should be 130100 not 130200

The API requires cuBLAS 13.1+ (version 130100), but this conditional check uses 130200. This mismatch means tests will be skipped even on cuBLAS 13.1.
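A small sketch of why the two constants behave differently, assuming the usual CUBLAS_VERSION encoding of major * 10000 + minor * 100 + patch. It deliberately avoids including the cuBLAS headers so it can be compiled anywhere; `encode_cublas_version`, `kMinGroupedGemmVersion` and `grouped_gemm_available` are hypothetical names.

```cpp
#include <cassert>

// Assumed encoding, mirroring CUBLAS_VERSION: major * 10000 + minor * 100 + patch.
constexpr int encode_cublas_version(int major, int minor, int patch) {
  assert(major >= 0 && minor >= 0 && minor < 100 && patch >= 0 && patch < 100);
  return major * 10000 + minor * 100 + patch;
}

// cuBLAS 13.1, the minimum stated in the review comment above.
constexpr int kMinGroupedGemmVersion = 130100;

constexpr bool grouped_gemm_available(int cublas_version) {
  return cublas_version >= kMinGroupedGemmVersion;
}

// cuBLAS 13.1.0 satisfies a 130100 gate but not a 130200 gate.
static_assert(grouped_gemm_available(encode_cublas_version(13, 1, 0)), "13.1 is supported");
static_assert(encode_cublas_version(13, 1, 0) < 130200, "a 130200 gate would skip 13.1");
```

Keeping the threshold in one named constant shared by the library and the tests would make this kind of mismatch harder to introduce.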

8 files reviewed, 1 comment

Signed-off-by: Pawel Gadzinski <[email protected]>


pggPL · 2025-12-22T10:44:25Z

/te-ci

Signed-off-by: Pawel Gadzinski <[email protected]>


greptile-apps

Additional Comments (4)

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 445 (link)

style: Use const_cast sparingly - casting away const from a_base could be risky if the caller expects the data to remain unchanged

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 451-455 (link)

style: The comment mentions "Test stores A" but this is implementation code, not test code - update comment for clarity

tests/cpp/operator/test_grouped_gemm.cu, line 282-283 (link)

style: Commented code should be removed before merging

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 313-318 (link)

style: Variable naming could be clearer - rowa/cola/rowb/colb could be confused with actual row/column counts vs storage dimensions

Consider renaming to lda_rows/lda_cols etc. or adding clarifying comments

8 files reviewed, 4 comments

Signed-off-by: Pawel Gadzinski <[email protected]>


transformer_engine/common/gemm/config.h

transformer_engine/common/gemm/cublaslt_grouped_gemm.cuh

transformer_engine/common/include/transformer_engine/gemm.h

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

tests/cpp/operator/test_grouped_gemm.cu

zhongbozhu · 2026-01-06T19:00:25Z

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu

+inline int64_t compute_avg_first_dim(const transformer_engine::GroupedTensor *t) {
+  // logical_shape[0] is either num_tensors*M (uniform) or sum_of_M (varying first)
+  // In both cases, dividing by num_tensors gives the average
+  return static_cast<int64_t>(t->logical_shape.data[0]) / static_cast<int64_t>(t->num_tensors);


What is logical_shape.data[0]? Do we have access to this field if we want CUDA graphs (i.e., without a D2H copy)?

logical_shape represents the total shape of all tensors in the grouped tensor. For example, we can have a grouped tensor of logical shape [10, 10] holding tensors of shape [3, 10] and [7, 10]. The logical shape must be constant to use CUDA graphs.

I see, this could be an issue for future CUDA graph support of dynamic-shape MoE. The way it works is that for an input buffer of shape [10, 10], that shape is static, but there is an on-device tensor saying the splits are [[3, 10], [3, 10]], so the actual valid shape is [6, 10]. The excess rows would be wasted GEMM performance (?)

That being said, this function only feeds a cuBLAS heuristic, so it's nothing fundamental. Can we support injecting static values directly here? Users who know their model well could set those values manually and avoid needing these shapes on the host, since that would mean a D2H copy and therefore no CUDA graph.

The only problem would be that the wrong statistics are computed, not any CUDA errors or incorrect memory reads, am I right?

There is support for custom statistic injection here.
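To illustrate how the derived average and an injected value can coexist, here is a minimal sketch under stated assumptions: `GroupedShapeInfo` is a hypothetical, trimmed-down stand-in for the grouped tensor metadata, and `resolve_avg_m` is a hypothetical helper, not a function from the PR. Only `compute_avg_first_dim` mirrors the snippet quoted at the top of this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the host-side metadata of a grouped tensor.
struct GroupedShapeInfo {
  std::int64_t logical_first_dim;  // num_tensors * M (uniform) or sum of M (varying)
  std::int64_t num_tensors;
};

// Average first dimension from host-side metadata only; no device reads are needed.
inline std::int64_t compute_avg_first_dim(const GroupedShapeInfo& t) {
  assert(t.num_tensors > 0);
  return t.logical_first_dim / t.num_tensors;
}

// Prefer a user-supplied hint; otherwise fall back to the derived average.
inline std::int64_t resolve_avg_m(const std::optional<std::int64_t>& user_hint,
                                  const GroupedShapeInfo& t) {
  return user_hint.has_value() ? *user_hint : compute_avg_first_dim(t);
}
```

For the example above (logical shape [10, 10] holding [3, 10] and [7, 10]), the derived average first dimension is 5. In the padded MoE case with splits [[3, 10], [3, 10]], the derived value would still be 5, and a caller who knows better could pass 3 as the hint. Either way the value only steers the heuristic.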

Wrong statistics can be avoided if we zero-pad, but that's another story; we need to check with the cuBLAS folks.

By the way, does this PR have BF16 grouped GEMM? Is the grouped GEMM optimization SM100-only for now?

It supports BF16, and there are plans to support other devices as well.

Co-authored-by: Przemyslaw Tredak <[email protected]> Signed-off-by: Paweł Gadziński <[email protected]>


greptile-apps

Additional Comments (4)

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 449-453 (link)

style: const_cast removes const correctness. While the comment says cuBLAS won't modify the data, this breaks type safety guarantees.

Consider if cuBLASLt API could accept const void** for input matrices, or document why this is necessary.

tests/cpp/operator/test_grouped_gemm.cu, line 136-145 (link)

style: Random padding for alignment testing is good, but the random seed is hardcoded (12345).

Verify tests are deterministic and document why this specific padding strategy was chosen.

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 582-585 (link)

style: Fast accumulator mode is always enabled for FP8, but split accumulator isn't supported by cuBLASLt grouped GEMM API.

Check that this accuracy trade-off is documented in the public API so users are aware.

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 514 (link)

style: SM100 (Blackwell) requirement is very restrictive.

Verify this is a hard requirement from cuBLASLt grouped GEMM API and document the limitation clearly for users.

8 files reviewed, 4 comments

Signed-off-by: Pawel Gadzinski <[email protected]>


Signed-off-by: Pawel Gadzinski <[email protected]>

greptile-apps

Additional Comments (3)

transformer_engine/common/gemm/config.h, line 47-48 (link)

logic: attr_sizes calculation for optional fields includes full std::optional size rather than just int64_t

When nvte_get_grouped_matmul_config_attribute copies the value to the buffer, it only copies sizeof(int64_t) (line 150-160 in config.cpp), but attr_sizes declares sizeof(avg_m), which is sizeof(std::optional<int64_t>) (typically 16 bytes, not 8).

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 449 (link)

style: const_cast removes const qualifier from input pointers A, B, C

While the comment notes this is safe because cuBLAS won't modify the data, this relies on cuBLAS internal behavior and could break if the cuBLAS implementation changes - consider verifying this is documented in the cuBLAS API.

tests/cpp/test_common.cu, line 1093-1102 (link)

style: random padding logic may generate non-deterministic test behavior

While the seed is fixed (12345), random padding affects memory layout, which could expose alignment issues inconsistently across test runs - consider documenting that this is intentional for broader coverage.

9 files reviewed, 3 comments

greptile-apps · 2026-01-07T16:28:22Z

Additional Comments (5)

transformer_engine/common/gemm/config.h, line 48
attr_sizes incorrectly uses sizeof(avg_m) which is sizeof(std::optional<int64_t>) (typically 16 bytes), but the get/set functions treat it as sizeof(int64_t) (8 bytes)

  static constexpr size_t attr_sizes[] = {sizeof(int64_t), sizeof(int64_t), sizeof(int64_t),
                                          sizeof(int)};
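The size mismatch is easy to reproduce in isolation. The following is a minimal sketch, not the code from config.h: `GroupedMatmulConfigSketch` and `get_avg_m` are hypothetical names, and only one attribute is modelled. It shows the attribute size declared as the payload size and a getter that copies exactly that many bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical config with one optional attribute.
struct GroupedMatmulConfigSketch {
  std::optional<std::int64_t> avg_m;
  // Size seen through the C API: the payload, not the std::optional wrapper.
  static constexpr std::size_t avg_m_attr_size = sizeof(std::int64_t);
};

// std::optional stores an engaged flag next to the payload, so it is strictly larger.
static_assert(sizeof(std::optional<std::int64_t>) > sizeof(std::int64_t),
              "sizeof(avg_m) would overstate the attribute size");

// Copy the payload into the caller's buffer.
// Returns false if the attribute is unset or the buffer is too small.
inline bool get_avg_m(const GroupedMatmulConfigSketch& cfg, void* buf, std::size_t buf_size) {
  assert(buf != nullptr);
  if (!cfg.avg_m.has_value() || buf_size < GroupedMatmulConfigSketch::avg_m_attr_size) {
    return false;
  }
  const std::int64_t value = *cfg.avg_m;
  std::memcpy(buf, &value, sizeof(value));
  return true;
}
```

If the declared size were sizeof(avg_m), a caller sizing its buffer from that value would allocate more than the getter writes, and a size check based on it would reject correctly sized 8-byte buffers.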

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 514
SM100+ check inconsistent with documentation. Documentation states "Blackwell (SM100) or newer" but code checks >= 100. This allows SM100 (generic) but excludes SM100a/103a (actual Blackwell variants). The version-gated API requires CUBLAS_VERSION >= 130100, suggesting Blackwell-class hardware.

Check if >= 100 is intentional or should be checking specific Blackwell variants

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 314
Potential mismatched dimension logic for column-wise storage. When use_columnwise=true, rows_A and cols_A are computed as ws.M and ws.K respectively, but for the transposed case the dimensions appear flipped (line 313-314 vs 317-318).

Verify the dimension logic is correct for both row-wise and column-wise layouts with all transpose combinations

transformer_engine/common/gemm/cublaslt_grouped_gemm.cu, line 459
Comment states "Input A is stored as {K,M} when !transa" but the dimension extraction logic doesn't match the comment. When transa=false, M[idx] = a_last (line 457), which means a_last represents M. However, if A is stored as {K,M} with !transa, a_last should be the second dimension which is M. The logic appears correct, but the comment is confusing about storage order vs logical order

tests/cpp/operator/test_grouped_gemm.cu, line 94
Test comment mislabeled. Comment says "Same M (first dim)" but the shapes show M=64 (same), N varying (80→96→112), and K varying (32→48→64). This is NOT same M with varying N and K - it's same M only. The test actually tests same-M-varying-N-and-K which is valid, but the comment is misleading

code drop

bf3ebc2

Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL changed the title ~~[common] Add support for cublasLt GEMM for GroupedTensor~~ [common] Add support for cuBLASLt GEMM for GroupedTensor Dec 10, 2025

pre-commit-ci bot and others added 3 commits December 10, 2025 14:32

[pre-commit.ci] auto fixes from pre-commit.com hooks

76293d4


[pre-commit.ci] auto fixes from pre-commit.com hooks

785df34


ptrendx added the MoE label Dec 10, 2025

ptrendx linked an issue Dec 10, 2025 that may be closed by this pull request

GroupedGemm: FP8 per-tensor via cuBLAS #2452

Open

pggPL and others added 2 commits December 10, 2025 22:34

fix

1329b37

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

47c58be


pggPL marked this pull request as ready for review December 10, 2025 21:43

greptile-apps bot reviewed Dec 10, 2025


ptrendx requested a review from timmoon10 December 10, 2025 22:35

pggPL and others added 2 commits December 11, 2025 11:56

[pre-commit.ci] auto fixes from pre-commit.com hooks

3b2fcdf


greptile-apps bot reviewed Dec 11, 2025


pggPL and others added 3 commits December 11, 2025 12:16

[pre-commit.ci] auto fixes from pre-commit.com hooks

101766b


Fix alpha/beta numel - use SimpleTensor::numel()

1167f75

Signed-off-by: Piotr Gadzinski <[email protected]> Signed-off-by: Pawel Gadzinski <[email protected]>

This comment was marked as resolved.


timmoon10 reviewed Dec 16, 2025


Refactor: move grouped GEMM to separate file and cleanup API

e4a80a3

Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL force-pushed the grouped_gemm branch from 29d99aa to e4a80a3 Compare December 19, 2025 08:58

Merge branch 'main' into grouped_gemm

db1e177

greptile-apps bot reviewed Dec 19, 2025


fix

047a9f9

Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL force-pushed the grouped_gemm branch from e47f433 to 047a9f9 Compare December 19, 2025 11:31

Require Blackwell (SM100) and cuBLAS 13.1+ for grouped GEMM

77b422a

Signed-off-by: Pawel Gadzinski <[email protected]>

greptile-apps bot reviewed Dec 22, 2025


fix

9c8158e

Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL and others added 2 commits December 30, 2025 09:19

fixes

ed540c8

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

359a9f5


greptile-apps bot reviewed Dec 30, 2025


pggPL added 2 commits December 30, 2025 12:08

fixes

a702426

Signed-off-by: Pawel Gadzinski <[email protected]>

fix

fb027d0

Signed-off-by: Pawel Gadzinski <[email protected]>

pggPL force-pushed the grouped_gemm branch from 010df2d to fb027d0 Compare December 30, 2025 11:15

[pre-commit.ci] auto fixes from pre-commit.com hooks

ae85415
