[SPARK-57578][SQL] Fix UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on truncated trailing UTF-8 sequences by jubins · Pull Request #56625 · apache/spark

jubins · 2026-06-20T00:26:33Z

What is the purpose of the change

Fixes SPARK-57578 — UTF8String.codePointFrom, trimLeft(UTF8String), and trimRight(UTF8String) perform out-of-bounds native memory reads when the string ends in a truncated multi-byte UTF-8 sequence (i.e., a leading byte is present but one or more continuation bytes are missing).

The root cause is that all three methods call numBytesForFirstByte() to determine a character's byte width and then unconditionally read that many bytes forward, without checking whether the range stays within numBytes. For a string whose last byte is a lone 2/3/4-byte leader, this reads past the end of the backing buffer via Platform.getByte(base, offset + n) — which either silently returns adjacent memory or causes a JVM crash depending on the allocator and platform.

The companion method reverse() was fixed for the exact same class of bug in SPARK-57507 (line 1160) using Math.min(numBytesForFirstByte(getByte(i)), numBytes - i). That commit explicitly noted codePointFrom, trimLeft, and trimRight as unfixed siblings tracked in SPARK-57520. This PR applies the same clamping pattern to all three.

Brief change log

UTF8String.codePointFrom(): replaced int numBytes = numBytesForFirstByte(b) with int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes - byteIndex) so the case 2/3/4 branches never read continuation bytes that do not exist
UTF8String.trimLeft(UTF8String): clamped the copyUTF8String end argument from searchIdx + numBytesForFirstByte(...) - 1 to searchIdx + Math.min(numBytesForFirstByte(...), numBytes - searchIdx) - 1
UTF8String.trimRight(UTF8String): clamped stringCharLen[numChars] from numBytesForFirstByte(getByte(charIdx)) to Math.min(numBytesForFirstByte(getByte(charIdx)), numBytes - charIdx) so the subsequent copyUTF8String call cannot extend past the buffer
UTF8StringSuite: added unit tests for all three methods covering lone 2-, 3-, and 4-byte leaders as both the source string and the trim-set string

Verifying this change

This change is covered by unit tests in UTF8StringSuite:

trimRightWithTrimString / trimLeftWithTrimString: new assertions confirm that passing a truncated single-byte trim-set (lone 0xC3, 0xE4, 0xF0) does not crash and returns the input unchanged when no match exists; also covers a source string whose trailing byte is a lone multi-byte leader
testCodePointFrom: new assertions confirm that calling codePointFrom(0) on a 1-byte string whose only byte is a 2-byte leader, and codePointFrom(1) on a 2-byte string whose second byte is a 3- or 4-byte leader, completes without reading out of bounds

Does this pull request potentially affect one of the following parts

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public/@Evolving: no — UTF8String is an internal unsafe type
The serializers: no
The runtime per-record code paths (performance sensitive): yes — trimLeft, trimRight, and codePointFrom are hot paths, but the fix adds only a single Math.min call per character iteration, which is negligible
Anything that affects deployment or recovery: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

Yes — Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.
Generated-by: Claude Opus 4.8

…t-of-bounds read on truncated trailing UTF-8 sequences

uros-b

I think that @LuciferYang already has an ongoing PR for this issue.
@jubins please take a look at: #56585.

jubins added 2 commits June 19, 2026 17:25

[SPARK-57578][SQL] Fix UTF8String.codePointFrom/trimLeft/trimRight ou…

3b15968

…t-of-bounds read on truncated trailing UTF-8 sequences

Attempt to fix tests

e0763a6

uros-b reviewed Jun 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57578][SQL] Fix UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on truncated trailing UTF-8 sequences#56625

[SPARK-57578][SQL] Fix UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on truncated trailing UTF-8 sequences#56625
jubins wants to merge 2 commits into
apache:masterfrom
jubins:j-SPARK-57578-fix-utf8string

jubins commented Jun 20, 2026

Uh oh!

uros-b left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jubins commented Jun 20, 2026

What is the purpose of the change

Brief change log

Verifying this change

Does this pull request potentially affect one of the following parts

Documentation

Was generative AI tooling used to co-author this PR?

Uh oh!

uros-b left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants