[VL] Add KeyGroupedPartitioning support to columnar shuffle #12084
Open
minni31 wants to merge 1 commit into
CONTEXT
`KeyGroupedPartitioning` is a Spark partitioning scheme used by V2 data source connectors (e.g., Iceberg, Paimon) where data is partitioned by specific key expressions with known unique partition values. Currently, Gluten's columnar shuffle exchange does not handle this partitioning type, causing a fallback to vanilla Spark for any query involving V2 sources with key-grouped partitioning.

WHAT
Adds `KeyGroupedPartitioning` support to the columnar shuffle exchange path in the Velox backend. The implementation reuses the existing JVM-side partition ID computation pattern (same mechanism as `RangePartitioning`):

- Adds `KeyGroupedPartitioning` to the validation whitelist in `ColumnarShuffleExchangeExecBase`, allowing the columnar shuffle to accept this partitioning type.
- Builds a `KeyGroupedPartitioner` from the partitioning's `uniquePartitionValues`, mapping each partition key to its index.
- Computes the partition ID (pid) for each row by evaluating the partitioning key expressions (bound with `BindReferences`) and looking up the result in the `KeyGroupedPartitioner`. The pid column is prepended to each batch so the native shuffle writer can read it directly.
- Uses `RangePartitioningShortName` for the native partitioning descriptor, since both Range and KeyGrouped use the same JVM-side pid prepend pattern: the native shuffle writer reads the prepended column rather than computing partition IDs natively.
- Uses `ArraySeq` to avoid aliasing issues with mutable array reuse.

Tests
Unit tests: `VeloxShufflePartitioningSuite`

Note: End-to-end `KeyGroupedPartitioning` tests require V2 data source connectors (Iceberg/Paimon), which are not available in this test module. The KeyGrouped unit tests validate key extraction, `KeyGroupedPartitioner` construction, and the full key-extraction-to-partition-lookup flow.
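
For reviewers, the key-extraction-to-partition-lookup flow described above can be illustrated with a minimal, Spark-free sketch. This is not the code in this PR: `SketchKeyGroupedPartitioner`, `KeyGroupedSketch`, `extractKey`, and `prependPid` are hypothetical names. Rows are modeled as plain `Array[Any]` instead of columnar batches, the hash fallback for unknown keys is an assumption, and Scala 2.13 is assumed for `scala.collection.immutable.ArraySeq`.

```scala
import scala.collection.immutable.ArraySeq

// Hypothetical stand-in for a key-grouped partitioner: maps each known
// partition key to its index in the list of unique partition values.
final class SketchKeyGroupedPartitioner(uniquePartitionValues: Seq[Seq[Any]]) {
  private val valueToIndex: Map[Seq[Any], Int] =
    uniquePartitionValues.zipWithIndex.toMap

  val numPartitions: Int = uniquePartitionValues.size

  // Assumption: keys that are not in the map fall back to a
  // non-negative hash modulo the partition count.
  def getPartition(key: Seq[Any]): Int =
    valueToIndex.getOrElse(key, Math.floorMod(key.hashCode, numPartitions))
}

object KeyGroupedSketch {
  // Copy the key columns out of a (possibly reused) row buffer into an
  // immutable ArraySeq, so later writes to the buffer cannot change the key.
  def extractKey(row: Array[Any], keyOrdinals: Seq[Int]): Seq[Any] =
    ArraySeq.from(keyOrdinals.map(i => row(i)))

  // Compute the pid for every row and prepend it as column 0,
  // mirroring the "pid column first" layout described above.
  def prependPid(
      rows: Seq[Array[Any]],
      keyOrdinals: Seq[Int],
      partitioner: SketchKeyGroupedPartitioner): Seq[Array[Any]] =
    rows.map { row =>
      val pid = partitioner.getPartition(extractKey(row, keyOrdinals))
      Array[Any](pid) ++ row
    }
}
```

Because the extracted key is an immutable copy with structural equality and hashing, it remains a valid map key even if the source buffer is overwritten for the next row. This is the kind of aliasing problem the `ArraySeq` bullet refers to.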