[SPARK-54754][SQL] OrcSerializer should not parse the schema every time it is serialized #53525

leixm · 2025-12-18T09:20:20Z

What changes were proposed in this pull request?

For map, array, and struct types, OrcUtils.orcTypeDescription(dataType) is called on every serialization, which is quite time-consuming. We can build the TypeDescription once and reuse it.
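The idea can be illustrated with a minimal, self-contained sketch. It is not the actual patch: `ParsedType`, `SchemaParser`, and `ConverterSketch` are hypothetical stand-ins and do not use the ORC or Spark APIs. The sketch only shows the difference between rebuilding a type description inside a per-value closure and building it once when the converter is created.

```scala
// Hypothetical stand-in for a parsed type description (not the ORC class).
final case class ParsedType(description: String)

// Hypothetical stand-in for the expensive schema-parsing step.
object SchemaParser {
  var parseCount = 0 // counts how often the expensive step runs

  def parse(typeString: String): ParsedType = {
    parseCount += 1
    ParsedType(typeString.trim.toLowerCase)
  }
}

object ConverterSketch {
  type Converter = Seq[Int] => (ParsedType, Seq[Int])

  // Before: the type description is rebuilt for every value converted.
  def perValueConverter(typeString: String): Converter =
    values => (SchemaParser.parse(typeString), values)

  // After: parsed once when the converter is built, then reused.
  def cachedConverter(typeString: String): Converter = {
    val parsed = SchemaParser.parse(typeString)
    values => (parsed, values)
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val rows = Seq.fill(1000)(Seq(1, 2, 3))

    SchemaParser.parseCount = 0
    val slow = ConverterSketch.perValueConverter("array<int>")
    rows.foreach(slow)
    println(s"per-value parses: ${SchemaParser.parseCount}") // 1000

    SchemaParser.parseCount = 0
    val fast = ConverterSketch.cachedConverter("array<int>")
    rows.foreach(fast)
    println(s"cached parses: ${SchemaParser.parseCount}") // 1
  }
}
```

In this sketch the cached variant does the expensive step once per converter rather than once per value, which is the kind of saving the change aims for.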

I tested calling OrcSerializer#serialize on 10k/100k/1m rows of data with the following schema:

private val schema: StructType = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("attrs",
    MapType(StringType, MapType(StringType, IntegerType, valueContainsNull = false),
      valueContainsNull = true),
    nullable = true)
))

Test result:

| row count | master | pr | time saving |
|---|---|---|---|
| 10k | 16 ms | 8 ms | 50% |
| 100k | 201 ms | 143 ms | 29% |
| 1m | 1799 ms | 1056 ms | 41% |

The above benchmarks were run on my local machine.
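For reference, the time-saving column is consistent with (master - pr) / master, rounded to a whole percent. The short sketch below recomputes it from the reported timings; `TimeSaving` and `savingPercent` are names introduced here for illustration only.

```scala
object TimeSaving {
  // Relative saving in percent: (master - pr) / master * 100, rounded.
  def savingPercent(masterMs: Long, prMs: Long): Long =
    math.round((masterMs - prMs).toDouble / masterMs * 100)

  def main(args: Array[String]): Unit = {
    // (row count, master ms, pr ms) as reported in the table above.
    val results = Seq(("10k", 16L, 8L), ("100k", 201L, 143L), ("1m", 1799L, 1056L))
    results.foreach { case (rows, master, pr) =>
      println(s"$rows: ${savingPercent(master, pr)}%") // 50%, 29%, 41%
    }
  }
}
```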

Why are the changes needed?

Improve performance of OrcSerializer#serialize.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.
New benchmark (in a separate PR)

Was this patch authored or co-authored using generative AI tooling?

No.

leixm · 2025-12-19T07:02:39Z

ping @dongjoon-hyun Can you help review?

yaooqinn

+1, LGTM

yaooqinn · 2025-12-20T13:15:20Z

Merged to master, thank you @leixm.

BTW, user 'jerrylei' has been added to the Apache Spark contributor role on the JIRA side; please let me know if that's not you.

[SPARK-54754][SQL] OrcSerializer should not parse the schema every ti…

7db6d21

…me it is serialized

github-actions bot added the SQL label Dec 18, 2025

leixm added 3 commits December 18, 2025 17:21

fix

b3f7843

fix

f2de299

fix

fc8972f

yaooqinn approved these changes Dec 20, 2025

yaooqinn closed this in 00163b8 Dec 20, 2025
