[SPARK-53478] Resolve SparkFiles.get against root directory when job artifact UUID is set in local mode #56623
Open
wilmerdooley wants to merge 3 commits into
…le and SparkFiles.get in local
Signed-off-by: wilmerdooley <wilmerdooley1@gmail.com>
test_spark_files_get_with_sc_add_file ran its job on the shared default session of this ReusedSQLTestCase class, so when the full test_artifact module runs in CI order an earlier test that registers a my_pyfile.py artifact leaves the executor with a stale copy and the task fails. Run the verification on self.spark.newSession(), like the sibling add-file tests, which keeps the SparkContext.addFile resolution under test while isolating it from prior tests' session artifacts.
What changes were proposed in this pull request?
When running in local mode with a non-default job artifact UUID, SparkFiles.get resolved filenames against the per-session artifact directory, while files added via SparkContext.addFile were placed directly under the root directory. This made such files inaccessible from SQL-planned operations even though they were visible to code that ran outside a SQL session. This change falls back to the root directory when the job-specific path does not exist in local mode, applying the same behavior on the Scala, JVM Python worker, and PySpark Python sides.

- core/src/main/scala/org/apache/spark/SparkFiles.scala: in get, if the job-specific path does not exist and the master is local, fall back to the file directly under getRootDirectory().
- core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala: export SPARK_LOCAL_MODE to Python workers so they can apply the same fallback.
- python/pyspark/core/files.py: in SparkFiles.get, when running on a worker, the file is missing under the current root, the master is local, and the job artifact UUID is non-default, fall back to the file under the parent of the current root directory.

JIRA: https://issues.apache.org/jira/browse/SPARK-53478
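The PySpark-side fallback described above can be sketched roughly as follows. This is a simplified model of the decision logic only, not the code in python/pyspark/core/files.py: the function name resolve_spark_file, its parameters, and the "default" sentinel for the job artifact UUID are assumptions made for illustration, and the real implementation presumably derives the local-mode flag from the exported SPARK_LOCAL_MODE variable rather than taking it as an argument.

```python
from pathlib import Path

# Assumption: sentinel meaning "no session-specific job artifact UUID is set".
DEFAULT_JOB_ARTIFACT_UUID = "default"


def resolve_spark_file(current_root, filename, *, on_worker, is_local, job_artifact_uuid):
    """Hypothetical model of the lookup; not the actual PySpark code.

    current_root is the directory the worker currently treats as its root,
    which is the per-session artifact directory when a job artifact UUID is set.
    """
    primary = Path(current_root) / filename

    # Fall back only when every condition from the PR description holds:
    # running on a worker, local master, non-default UUID, and the file
    # is missing under the current (per-session) root.
    use_fallback = (
        on_worker
        and is_local
        and job_artifact_uuid != DEFAULT_JOB_ARTIFACT_UUID
        and not primary.exists()
    )
    if use_fallback:
        # The file added via SparkContext.addFile sits one level up,
        # directly under the real root directory.
        return str(Path(current_root).parent / filename)
    return str(primary)
```

Because the fallback is gated on the local-mode flag, a lookup on a real executor still returns the per-session path even when the file is missing there, which matches the stated goal of leaving session isolation unchanged outside local mode.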
Why are the changes needed?
In local mode with a non-default job artifact UUID, files added through SparkContext.addFile could not be resolved by SparkFiles.get from inside SQL-planned operations. SparkContext.addFile writes the file directly under the root directory, but SparkFiles.get looked under the per-session artifact directory, so the lookup missed the file and the path was unusable. The fallback restores the ability to read those files from SQL operations in local mode, while keeping the lookup scoped to local mode so session isolation semantics on real executors are unchanged.
Does this PR introduce any user-facing change?
Yes. In local mode with a non-default job artifact UUID, a file added via SparkContext.addFile is now resolvable through SparkFiles.get from SQL-planned operations, whereas before it was not found. There is no change on real executors or when the default artifact UUID is in use.
How was this patch tested?
Added regression tests that add a file via SparkContext.addFile and read it back through SparkFiles.get from a SQL-planned operation, on both the Scala and Python sides:

- sql/core/src/test/scala/org/apache/spark/sql/artifact/ArtifactManagerSuite.scala: SPARK-53478: SparkFiles.get resolves files added via SparkContext.addFile in local mode.
- python/pyspark/sql/tests/test_artifact.py: test_spark_files_get_with_sc_add_file.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code