@leixm
Contributor

@leixm leixm commented Dec 18, 2025

What changes were proposed in this pull request?

For map, array, and struct types, OrcUtils.orcTypeDescription(dataType) is called on every serialization, which is quite time-consuming. This PR reuses the TypeDescription instead of recomputing it each time.
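
To illustrate the general idea of reusing a derived description rather than rebuilding it per value, here is a minimal, self-contained sketch. It is not the code from this PR: `describe` is a hypothetical stand-in for OrcUtils.orcTypeDescription, a plain String stands in for an ORC TypeDescription, and the call counter exists only to make the difference observable.

```scala
// Hypothetical stand-ins, not Spark or ORC APIs:
// `describe` plays the role of OrcUtils.orcTypeDescription, and the
// returned String plays the role of an ORC TypeDescription.
object TypeDescriptionReuse {
  // Counts how often the (assumed expensive) description is built.
  var describeCalls = 0

  def describe(dataType: String): String = {
    describeCalls += 1
    s"orc<$dataType>"
  }

  // Before: the description is rebuilt for every value that is converted.
  def perCallConverter(dataType: String): Any => String =
    value => s"${describe(dataType)}:$value"

  // After: the description is built once, when the converter is created,
  // and the closure captures it for all later values.
  def cachedConverter(dataType: String): Any => String = {
    val description = describe(dataType)
    value => s"$description:$value"
  }
}
```

With three values converted, the per-call variant invokes `describe` three times, while the cached variant invokes it once, at construction. Nested types such as a map of maps would multiply the per-call cost, which is presumably why the benchmark schema uses one.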

I tested calling OrcSerializer#serialize with 10k, 100k, and 1m rows of data, using the following schema:

private val schema: StructType = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("attrs",
    MapType(StringType, MapType(StringType, IntegerType, valueContainsNull = false),
      valueContainsNull = true),
    nullable = true)
))

Test result:

| Row count | master | PR | Time saving |
|-----------|--------|----|-------------|
| 10k | 16 ms | 8 ms | 50% |
| 100k | 201 ms | 143 ms | 29% |
| 1m | 1799 ms | 1056 ms | 41% |

The above benchmarks were run on my local machine.
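
The time-saving column is consistent with (master - PR) / master, rounded to the nearest percent. That formula is an assumption inferred from the numbers, not stated in the PR, and `savingPercent` below is a hypothetical helper used only to show the arithmetic.

```scala
object TimeSaving {
  // Share of master's time that the PR saves, as a rounded percentage.
  // Assumed formula: (master - pr) / master * 100.
  def savingPercent(masterMs: Long, prMs: Long): Long =
    math.round((masterMs - prMs) * 100.0 / masterMs)

  def main(args: Array[String]): Unit = {
    // (row count, master ms, PR ms) taken from the table above.
    val results = Seq(("10k", 16L, 8L), ("100k", 201L, 143L), ("1m", 1799L, 1056L))
    results.foreach { case (rows, master, pr) =>
      println(s"$rows: ${savingPercent(master, pr)}%")
    }
  }
}
```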

Why are the changes needed?

Improve performance of OrcSerializer#serialize.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.
New benchmark (in a separate PR)

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Dec 18, 2025
@leixm
Contributor Author

leixm commented Dec 19, 2025

ping @dongjoon-hyun Can you help review?

Member

@yaooqinn yaooqinn left a comment

+1, LGTM

@yaooqinn yaooqinn closed this in 00163b8 Dec 20, 2025
@yaooqinn
Member

Merged to master, thank you @leixm.

BTW, user 'jerrylei' has been added to the Apache Spark contributor role on the JIRA side; please let me know if it's not you.
