@ashrithb ashrithb commented Dec 18, 2025

What changes were proposed in this pull request?

This PR fixes the flaky test "SPARK-47148: AQE should avoid to submit shuffle job on cancellation" in AdaptiveQueryExecSuite.

The fix replaces the shallow exception chain search with a recursive traversal that finds the "coalesce test error" message at any depth in the exception hierarchy.
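As a rough illustration of what such a recursive traversal could look like, here is a minimal sketch. It is not the code from this PR: the object name `ExceptionSearch`, the helper `containsMessage`, and the identity-based visited set are all assumptions made for the example. It walks both `getCause` and `getSuppressed` at every level.

```scala
import java.util.{Collections, IdentityHashMap}

// Hypothetical helper, not the PR's actual implementation.
object ExceptionSearch {
  /** Returns true if `root`, or any exception reachable from it through
    * getCause or getSuppressed, has a message containing `fragment`. */
  def containsMessage(root: Throwable, fragment: String): Boolean = {
    // Identity-based visited set: guards against cyclic cause chains
    // without relying on equals/hashCode of the exceptions.
    val seen = Collections.newSetFromMap(
      new IdentityHashMap[Throwable, java.lang.Boolean]())

    def visit(t: Throwable): Boolean = {
      if (t == null || !seen.add(t)) {
        false
      } else {
        val here = t.getMessage != null && t.getMessage.contains(fragment)
        // Short-circuits as soon as a match is found at any depth.
        here || visit(t.getCause) || t.getSuppressed.exists(visit)
      }
    }

    visit(root)
  }
}
```

With a helper of this shape, the test assertion would reduce to something like `assert(ExceptionSearch.containsMessage(error, "coalesce test error"))`.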

Why are the changes needed?

The test was flaky because the original assertion only checked:

  • The top-level exception
  • Its direct cause
  • Its immediate suppressed exceptions

// OLD (flaky):
val errMsgList = (error :: error.getCause :: error.getSuppressed.toList)
  .filter(e => e != null && e.getMessage != null)
  .map(_.getMessage)

assert(errMsgList.exists(_.contains("coalesce test error")))

However, because of race conditions in AQE's concurrent execution, the "coalesce test error" exception can be wrapped at varying depths, depending on:

  • Thread scheduling/timing
  • Which AQE stage reports the error first
  • How exceptions are wrapped during job cancellation

When the error was nested deeper than that, for example as the cause of a cause, the assertion failed.
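The failure mode can be reproduced outside Spark with plain exceptions. The sketch below reuses the shape of the old check; the object name `ShallowCheckDemo` and the wrapper messages "job aborted" and "query cancelled" are made up for illustration and are not the actual messages AQE produces.

```scala
object ShallowCheckDemo {
  // Same shape as the old assertion: top-level exception, its direct
  // cause, and its direct suppressed exceptions only.
  def shallowMessages(error: Throwable): List[String] =
    (error :: error.getCause :: error.getSuppressed.toList)
      .filter(e => e != null && e.getMessage != null)
      .map(_.getMessage)

  def foundShallow(error: Throwable): Boolean =
    shallowMessages(error).exists(_.contains("coalesce test error"))

  def main(args: Array[String]): Unit = {
    val root = new RuntimeException("coalesce test error")
    // Hypothetical wrappers standing in for whatever AQE adds on cancellation.
    val oneLevel = new RuntimeException("job aborted", root)
    val twoLevels = new RuntimeException("query cancelled", oneLevel)

    println(foundShallow(oneLevel))  // prints "true": root is the direct cause
    println(foundShallow(twoLevels)) // prints "false": root is the cause of the cause
  }
}
```

The second case corresponds to the intermittent failure: the message is present in the chain, but one level below what the shallow check inspects.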

Does this PR introduce any user-facing change?

No. This is a test-only change.

How was this patch tested?

  • Ran the test 20+ consecutive times locally with a 100% pass rate
  • Before the fix, the test would intermittently fail in CI (see linked JIRA)

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Dec 18, 2025
@ashrithb ashrithb marked this pull request as ready for review December 18, 2025 04:02