Skip to content

[SPARK-57578][SQL] Fix UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on truncated trailing UTF-8 sequences#56625

Open
jubins wants to merge 2 commits into
apache:masterfrom
jubins:j-SPARK-57578-fix-utf8string
Open

[SPARK-57578][SQL] Fix UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on truncated trailing UTF-8 sequences#56625
jubins wants to merge 2 commits into
apache:masterfrom
jubins:j-SPARK-57578-fix-utf8string

Conversation

@jubins

@jubins jubins commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

What is the purpose of the change

Fixes SPARK-57578UTF8String.codePointFrom, trimLeft(UTF8String), and trimRight(UTF8String) perform out-of-bounds native memory reads when the string ends in a truncated multi-byte UTF-8 sequence (i.e., a leading byte is present but one or more continuation bytes are missing).

The root cause is that all three methods call numBytesForFirstByte() to determine a character's byte width and then unconditionally read that many bytes forward, without checking whether the range stays within numBytes. For a string whose last byte is a lone 2/3/4-byte leader, this reads past the end of the backing buffer via Platform.getByte(base, offset + n) — which either silently returns adjacent memory or causes a JVM crash depending on the allocator and platform.

The companion method reverse() was fixed for the exact same class of bug in SPARK-57507 (line 1160) using Math.min(numBytesForFirstByte(getByte(i)), numBytes - i). That commit explicitly noted codePointFrom, trimLeft, and trimRight as unfixed siblings tracked in SPARK-57520. This PR applies the same clamping pattern to all three.

Brief change log

  • UTF8String.codePointFrom(): replaced int numBytes = numBytesForFirstByte(b) with int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes - byteIndex) so the case 2/3/4 branches never read continuation bytes that do not exist
  • UTF8String.trimLeft(UTF8String): clamped the copyUTF8String end argument from searchIdx + numBytesForFirstByte(...) - 1 to searchIdx + Math.min(numBytesForFirstByte(...), numBytes - searchIdx) - 1
  • UTF8String.trimRight(UTF8String): clamped stringCharLen[numChars] from numBytesForFirstByte(getByte(charIdx)) to Math.min(numBytesForFirstByte(getByte(charIdx)), numBytes - charIdx) so the subsequent copyUTF8String call cannot extend past the buffer
  • UTF8StringSuite: added unit tests for all three methods covering lone 2-, 3-, and 4-byte leaders as both the source string and the trim-set string

Verifying this change

This change is covered by unit tests in UTF8StringSuite:

  • trimRightWithTrimString / trimLeftWithTrimString: new assertions confirm that passing a truncated single-byte trim-set (lone 0xC3, 0xE4, 0xF0) does not crash and returns the input unchanged when no match exists; also covers a source string whose trailing byte is a lone multi-byte leader
  • testCodePointFrom: new assertions confirm that calling codePointFrom(0) on a 1-byte string whose only byte is a 2-byte leader, and codePointFrom(1) on a 2-byte string whose second byte is a 3- or 4-byte leader, completes without reading out of bounds

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public/@Evolving: noUTF8String is an internal unsafe type
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yestrimLeft, trimRight, and codePointFrom are hot paths, but the fix adds only a single Math.min call per character iteration, which is negligible
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

  • Yes — Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.
    Generated-by: Claude Opus 4.8

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @LuciferYang already has an ongoing PR for this issue.
@jubins please take a look at: #56585.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants