Optimize stdlib string hot paths #812
Merged
stephenamar-db merged 1 commit into databricks:master on May 1, 2026
Summary
This is a small follow-up to the jrsonnet comparison work. jrsonnet's string-heavy stdlib paths generally stay close to raw byte/char operations and avoid allocator-heavy generic paths where Unicode semantics do not require them. This PR applies the same principle in a JVM-friendly way without copying Rust internals.
Changes:
- std.stripChars / std.lstripChars / std.rstripChars for common non-surrogate BMP delimiter sets
  - charAt comparison
  - java.util.BitSet membership instead of boxed Set[Int]
- std.substr when the requested range has no surrogate pairs before the end offset
  - substring for ASCII/BMP prefixes
  - codePointCount / offsetByCodePoints for full Unicode safety
- std.findSubstr for BMP-only haystacks
  - String.indexOf offsets are already Jsonnet codepoint offsets in that case
- std.parseInt / std.parseOctal / std.parseHex use charAt loops on the hot path
- StandardCharsets.UTF_8 in base64 string encode/decode paths
- findSubstr regression benchmark so this path is measured in bench.runRegressions

The intent is JIT/GC friendliness: fewer boxed sets, fewer temporary collections, fewer repeated codepoint scans, and tighter monomorphic loops on common ASCII/BMP stdlib inputs.
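To make the fast-path/fallback split concrete, here is a minimal Java sketch of three of the ideas listed above. It is not the code from this PR: the class name StringFastPaths and all method names are hypothetical, the argument handling is simplified (non-negative offsets are assumed), and the sketch still boxes results in a List for brevity. It only illustrates why char offsets can stand in for codepoint offsets when no surrogates are involved.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration; names and structure are assumptions, not sjsonnet code.
final class StringFastPaths {
    private StringFastPaths() {}

    // True when s has no surrogate code units, so char index == codepoint index.
    static boolean isBmpOnly(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) return false;
        }
        return true;
    }

    // Strip both ends; picks the char-based path when the delimiter set has no surrogates.
    static String strip(String s, String chars) {
        return isBmpOnly(chars) ? stripBmp(s, chars) : stripSlow(s, chars);
    }

    private static String stripBmp(String s, String chars) {
        int start = 0, end = s.length();
        if (chars.length() == 1) {
            char c = chars.charAt(0); // single delimiter: plain charAt comparison
            while (start < end && s.charAt(start) == c) start++;
            while (end > start && s.charAt(end - 1) == c) end--;
        } else {
            BitSet set = new BitSet(); // primitive membership, no boxed integers
            for (int i = 0; i < chars.length(); i++) set.set(chars.charAt(i));
            while (start < end && set.get(s.charAt(start))) start++;
            while (end > start && set.get(s.charAt(end - 1))) end--;
        }
        return s.substring(start, end);
    }

    // Codepoint-aware fallback for delimiter sets that contain surrogate pairs.
    private static String stripSlow(String s, String chars) {
        int start = 0, end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (chars.indexOf(cp) < 0) break;
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (chars.indexOf(cp) < 0) break;
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    // Codepoint-indexed substring; assumes from >= 0 and len >= 0.
    static String substr(String s, int from, int len) {
        int endChar = (int) Math.min((long) from + len, s.length());
        boolean fast = true;
        for (int i = 0; i < endChar; i++) {
            if (Character.isSurrogate(s.charAt(i))) { fast = false; break; }
        }
        if (fast) return from >= endChar ? "" : s.substring(from, endChar);
        // Fallback: translate codepoint offsets into char offsets.
        int total = s.codePointCount(0, s.length());
        if (from >= total) return "";
        int cpEnd = (int) Math.min((long) from + len, total);
        int a = s.offsetByCodePoints(0, from);
        int b = s.offsetByCodePoints(a, cpEnd - from);
        return s.substring(a, b);
    }

    // Codepoint offsets of every (possibly overlapping) occurrence of pat in str.
    static List<Integer> findSubstr(String pat, String str) {
        List<Integer> out = new ArrayList<>();
        if (pat.isEmpty()) return out;
        boolean bmp = isBmpOnly(str);
        int idx = str.indexOf(pat);
        while (idx >= 0) {
            // For a BMP-only haystack the indexOf result is already a codepoint offset.
            out.add(bmp ? idx : str.codePointCount(0, idx));
            idx = str.indexOf(pat, idx + 1);
        }
        return out;
    }
}
```

In this sketch each fast path pays for one linear scan for surrogates, and only inputs that actually contain surrogate pairs reach the codePointCount / offsetByCodePoints fallback.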
Correctness
Ran:
Result: passed.
Also ran:
./mill --no-server 'sjsonnet.jvm[3.3.7].reformat'
./mill --no-server bench.reformat
git diff --check
Result: passed.
Targeted JMH
Command on both master and this branch:
Lower is better.
Full JMH
Also ran the full suite on both master and this branch:
Result: passed on both branches.
The full one-shot run showed visible noise on unrelated benchmarks, so I would not treat sub-0.2ms movement there as strong signal. Examples of unrelated swings in the same run include reverse 6.635 -> 7.903 ms/op and member 0.635 -> 0.734 ms/op. For the directly touched paths, the full run was mixed:
Given that the focused run isolates the changed files better and the full run moved unrelated benchmarks by larger margins, the targeted JMH data is the more useful signal for this PR. The gains are modest, but the changes remove unnecessary allocation/Unicode machinery from common stdlib paths while keeping correctness fallbacks for non-BMP inputs.