[AutoSparkUT] Add RapidsDataFrameJoinSuite #14654

Open
wjxiz1992 wants to merge 1 commit into NVIDIA:main from wjxiz1992:fix/autosparkut-dataframe-join-suite

Conversation

@wjxiz1992
Collaborator

Summary

Migrates Spark DataFrameJoinSuite (19 tests) to RAPIDS using the minimal-inheritance pattern:

class RapidsDataFrameJoinSuite
  extends DataFrameJoinSuite with RapidsSQLTestsTrait {}

Closes Tier-A1 of our local suite-migration queue (core join workload — complements RapidsJoinSuite).

  • Suite source: spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala at Spark 3.3.0 (tag v3.3.0), full class (19 test(...) blocks). Pinned permalink · master reference
  • RAPIDS wrapper: tests/src/test/spark330/scala/org/apache/spark/sql/rapids/suites/RapidsDataFrameJoinSuite.scala
  • Registered in RapidsTestSettings.scala with one KNOWN_ISSUE exclusion.
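
As a rough illustration of what the registration and exclusion amount to, here is a self-contained sketch. The names `enableSuite`, `exclude`, and `KNOWN_ISSUE` mirror the calls referenced in this PR. Everything else (`SketchTestSettings`, `SuiteSettings`, `ExcludeReason`, and the empty stand-in classes) is hypothetical and exists only so the snippet compiles without Spark or spark-rapids on the classpath. It is not the actual contents of RapidsTestSettings.scala.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical stand-ins for the Spark parent suite and the RAPIDS trait.
class DataFrameJoinSuite
trait RapidsSQLTestsTrait
class RapidsDataFrameJoinSuite extends DataFrameJoinSuite with RapidsSQLTestsTrait {}

// Reason attached to an excluded test, modeled on the KNOWN_ISSUE(...) call in this PR.
sealed trait ExcludeReason
final case class KNOWN_ISSUE(url: String) extends ExcludeReason

// Per-suite record of excluded test names and why each one is excluded.
final class SuiteSettings(val suiteName: String) {
  private val excluded = mutable.LinkedHashMap.empty[String, ExcludeReason]

  def exclude(testName: String, reason: ExcludeReason): SuiteSettings = {
    excluded(testName) = reason
    this // returned so exclusions can be chained
  }

  def isExcluded(testName: String): Boolean = excluded.contains(testName)
  def reasonFor(testName: String): Option[ExcludeReason] = excluded.get(testName)
}

// Hypothetical settings object: one entry per enabled suite, keyed by class name.
object SketchTestSettings {
  private val suites = mutable.LinkedHashMap.empty[String, SuiteSettings]

  def enableSuite[T: ClassTag]: SuiteSettings = {
    val name = implicitly[ClassTag[T]].runtimeClass.getName
    suites.getOrElseUpdate(name, new SuiteSettings(name))
  }

  // The registration this PR describes: enable the suite, exclude one test.
  enableSuite[RapidsDataFrameJoinSuite]
    .exclude(
      "SPARK-24690 enables star schema detection even if CBO disabled",
      KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/14653"))
}

object Demo {
  def main(args: Array[String]): Unit = {
    val settings = SketchTestSettings.enableSuite[RapidsDataFrameJoinSuite]
    val name = "SPARK-24690 enables star schema detection even if CBO disabled"
    println(s"excluded: ${settings.isExcluded(name)}, reason: ${settings.reasonFor(name)}")
  }
}
```

In this model the excluded test is matched by its full name, and the other 18 tests are untouched because nothing is registered for them.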

Per-test mapping (Spark 3.3.0)

All 19 tests come from the parent class unchanged. Line ranges below are for sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala at tag v3.3.0:

| # | Test | Lines | Status |
|---|------|-------|--------|
| 1 | join - join using | 40–66 | pass |
| 2 | join - join using multiple columns | 69–99 | pass |
| 3 | join - sorted columns not in join's outputSet | 101–118 | pass |
| 4 | join - join using multiple columns and specifying join type | 120–173 | pass |
| 5 | join - cross join | 175–198 | pass |
| 6 | broadcast join hint using broadcast function | 200–225 | pass |
| 7 | broadcast join hint using Dataset.hint | 227–244 | pass |
| 8 | join - outer join conversion | 246–275 | pass |
| 9 | process outer join results using the non-nullable columns in the join input | 277–282 | pass |
| 10 | SPARK-16991: Full outer join followed by inner join produces wrong results | 284–289 | pass |
| 11 | SPARK-17685: WholeStageCodegenExec throws IndexOutOfBoundsException | 291–292 | pass |
| 12 | SPARK-23087: don't throw Analysis Exception in CheckCartesianProduct when join condition is false or null | 294–296 | pass |
| 13 | SPARK-24385: Resolve ambiguity in self-joins with EqualNullSafe | ~293 | pass |
| 14 | SPARK-24690 enables star schema detection even if CBO disabled | 298–357 | excluded (see below) |
| 15 | Supports multi-part names for broadcast hint resolution | | pass |
| 16 | The same table name exists in two databases for broadcast hint resolution | | pass |
| 17 | SPARK-32693: Compare two dataframes with same schema except nullable property | | pass |
| 18 | SPARK-34527: Resolve common columns from USING JOIN | | pass |
| 19 | SPARK-39376: Hide duplicated columns in star expansion of subquery alias from USING JOIN | | pass |

Exclusion

One test is excluded as KNOWN_ISSUE:

  • SPARK-24690 enables star schema detection even if CBO disabled -> KNOWN_ISSUE("https://github.com/NVIDIA/spark-rapids/issues/14653")

Root cause: the proprietary com.nvidia.spark.rapids.optimizer.JoinReorderRule (shipped in rapids-4-spark-private_2.12) violates the Catalyst LogicalPlan structural-integrity invariant when reordering a 4-way star join under STARSCHEMA_DETECTION=true + CBO_ENABLED=false + PLAN_STATS_ENABLED=true. The failure occurs in the "Operator Optimization before Inferring Filters" batch, during logical-plan optimization, so the plan never reaches physical planning and CPU fallback cannot recover.
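
For anyone trying to reproduce the failing configuration outside the suite, the three flags above correspond to the following Spark SQL configuration keys. This is a minimal sketch: the object name `StarJoinReproConf` and the helper `asConfArgs` are hypothetical, and the settings alone are not a full reproduction, which would also need the plugin loaded and the star-schema tables that the upstream test creates.

```scala
// String keys behind the SQLConf constants named in the root-cause note.
object StarJoinReproConf {
  val settings: Map[String, String] = Map(
    "spark.sql.cbo.starSchemaDetection" -> "true",  // SQLConf.STARSCHEMA_DETECTION
    "spark.sql.cbo.enabled"             -> "false", // SQLConf.CBO_ENABLED
    "spark.sql.cbo.planStats.enabled"   -> "true"   // SQLConf.PLAN_STATS_ENABLED
  )

  // Renders the settings as "--conf key=value" pairs, sorted by key,
  // in the form accepted by spark-shell or spark-submit.
  def asConfArgs: Seq[String] =
    settings.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

Inside a Spark test the same three pairs would typically be applied for the duration of a single test body rather than globally.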

Contributes to #14653.

Local Maven validation

mvn package -pl tests -am -Dbuildver=330 \
  -Dmaven.repo.local=./.mvn-repo \
  -s jenkins/settings.xml -P mirror-apache-to-urm \
  -DwildcardSuites=org.apache.spark.sql.rapids.suites.RapidsDataFrameJoinSuite \
  -Drapids.test.gpu.allocFraction=0.3 \
  -Drapids.test.gpu.maxAllocFraction=0.3 \
  -Drapids.test.gpu.minAllocFraction=0

Result (with exclusion applied):

Run starting. Expected test count is: 19
RapidsDataFrameJoinSuite:
Run completed in 28 seconds, 53 milliseconds.
Tests: succeeded 18, failed 0, canceled 0, ignored 1, pending 0
BUILD SUCCESS

Checklist

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

Migrates Spark DataFrameJoinSuite (19 tests) to RAPIDS using the
minimal-inheritance pattern:

    class RapidsDataFrameJoinSuite
      extends DataFrameJoinSuite with RapidsSQLTestsTrait {}

Contributes test coverage for the core join workload on GPU,
complementing the existing RapidsJoinSuite.

Local Maven validation (spark330 shim, GPU allocFraction=0.3):

  Run starting. Expected test count is: 19
  RapidsDataFrameJoinSuite:
  Tests: succeeded 18, failed 0, canceled 0, ignored 1, pending 0

One test excluded as KNOWN_ISSUE:
  SPARK-24690 enables star schema detection even if CBO disabled
  -> NVIDIA#14653
Root cause is in the proprietary JoinReorderRule logical-plan rule
(in rapids-4-spark-private_2.12), which violates the Catalyst
structural-integrity invariant when reordering a 4-way star join
under STARSCHEMA_DETECTION=true + CBO_ENABLED=false +
PLAN_STATS_ENABLED=true. The failure occurs at the LogicalPlan
optimization stage, before any physical planning, so CPU fallback
cannot recover.

Contributes to NVIDIA#14653.

Signed-off-by: Allen Xu <[email protected]>
Copilot AI review requested due to automatic review settings April 23, 2026 07:02
Contributor

Copilot AI left a comment

Pull request overview

Adds a Spark 3.3.0 RAPIDS GPU test wrapper for Spark’s DataFrameJoinSuite, registering it in the spark330 test settings with a single KNOWN_ISSUE exclusion tied to an existing optimizer-rule bug.

Changes:

  • Introduce RapidsDataFrameJoinSuite as a minimal-inheritance wrapper (DataFrameJoinSuite + RapidsSQLTestsTrait) for Spark 3.3.0.
  • Register the new suite in RapidsTestSettings and exclude the failing SPARK-24690 test via KNOWN_ISSUE(https://github.com/NVIDIA/spark-rapids/issues/14653).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Enables the new join suite for spark330 and applies a single known-issue exclusion. |
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/suites/RapidsDataFrameJoinSuite.scala | Adds the GPU wrapper suite extending Spark’s DataFrameJoinSuite with RapidsSQLTestsTrait. |

@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

Adds RapidsDataFrameJoinSuite for Spark 3.3.0 using the minimal-inheritance pattern (extends DataFrameJoinSuite with RapidsSQLTestsTrait), migrating all 19 upstream tests. One test (SPARK-24690) is correctly excluded via KNOWN_ISSUE pointing to a tracked issue.

Confidence Score: 5/5

Safe to merge; the change is purely additive test infrastructure following established project patterns.

Both files are boilerplate wrappers consistent with existing suites. The only finding is a P2 placement style note — no correctness, data, or reliability concerns.

No files require special attention.

Important Files Changed

| Filename | Overview |
|----------|----------|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/suites/RapidsDataFrameJoinSuite.scala | New minimal-inheritance RAPIDS wrapper for Spark 3.3.0 DataFrameJoinSuite; correctly uses shim header, package, and RapidsSQLTestsTrait pattern consistent with other suites in the directory. |
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Registers RapidsDataFrameJoinSuite with one well-documented KNOWN_ISSUE exclusion; placement is slightly inconsistent with the alphabetical grouping of other RapidsDataFrame* suites. |

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class DataFrameJoinSuite {
        +test("join - join using")
        +test("join - join using multiple columns")
        +test("join - cross join")
        +test("broadcast join hint")
        +test("SPARK-24690 ...") ~~excluded~~
        +... 14 more tests
    }
    class RapidsSQLTestsTrait {
        +sparkConf: SparkConf
        +GPU config overrides
        +beforeAll()
        +afterAll()
    }
    class RapidsDataFrameJoinSuite {
        shim: spark330
    }
    DataFrameJoinSuite <|-- RapidsDataFrameJoinSuite
    RapidsSQLTestsTrait <|.. RapidsDataFrameJoinSuite
    class RapidsTestSettings {
        +enableSuite[RapidsDataFrameJoinSuite]
        +exclude("SPARK-24690", KNOWN_ISSUE)
    }
    RapidsTestSettings --> RapidsDataFrameJoinSuite : registers

Reviews (1): Last reviewed commit: "[AutoSparkUT] Add RapidsDataFrameJoinSui..."
