Optimize stdlib string hot paths #812

Merged
stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/stdlib-string-hotpaths on May 1, 2026

He-Pin commented Apr 30, 2026

Summary

This is a small follow-up to the jrsonnet comparison work. jrsonnet's string-heavy stdlib paths generally stay close to raw byte/char operations and avoid allocator-heavy generic paths where Unicode semantics do not require them. This PR applies the same principle in a JVM-friendly way without copying Rust internals.

Changes:

  • specialize std.stripChars / std.lstripChars / std.rstripChars for common non-surrogate BMP delimiter sets
    • single delimiter: direct charAt comparison
    • multiple delimiters: java.util.BitSet membership instead of boxed Set[Int]
    • surrogate-containing delimiter sets still use the existing codepoint-safe path
  • specialize std.substr when the requested range has no surrogate pairs before the end offset
    • uses direct substring for ASCII/BMP prefixes
    • falls back to codePointCount / offsetByCodePoints for full Unicode safety
  • specialize std.findSubstr for BMP-only haystacks
    • String.indexOf offsets are already Jsonnet codepoint offsets in that case
    • fallback keeps the existing incremental codepoint accounting
  • make std.parseInt / std.parseOctal / std.parseHex use charAt loops on the hot path
    • preserves the existing official hex digit mapping behavior covered by tests
  • use StandardCharsets.UTF_8 in base64 string encode/decode paths
  • add a findSubstr regression benchmark so this path is measured in bench.runRegressions
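The strip specialization in the first bullet can be sketched roughly as follows. This is a minimal Java illustration of the idea, not the code merged in this PR: the class `StripCharsSketch` and its method names are hypothetical, and only the JDK calls (`java.util.BitSet`, `Character.isSurrogate`, `String.codePointAt`, `String.codePointBefore`) are real APIs.

```java
import java.util.BitSet;

/** Illustrative sketch only; class and method names are hypothetical. */
final class StripCharsSketch {
    private StripCharsSketch() {}

    /** True if any UTF-16 unit of s is a surrogate (s may hold non-BMP code points). */
    static boolean hasSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) return true;
        }
        return false;
    }

    /** Strips characters found in chars from the left and/or right end of str. */
    static String stripChars(String str, String chars, boolean left, boolean right) {
        if (str.isEmpty() || chars.isEmpty()) return str;
        // Delimiters containing surrogates take the code-point-safe path.
        if (hasSurrogate(chars)) return stripCodePoints(str, chars, left, right);
        int start = 0;
        int end = str.length();
        if (chars.length() == 1) {
            // Single delimiter: direct charAt comparison, no set at all.
            char c = chars.charAt(0);
            if (left) while (start < end && str.charAt(start) == c) start++;
            if (right) while (end > start && str.charAt(end - 1) == c) end--;
        } else {
            // Several delimiters: primitive bit membership instead of a boxed set.
            BitSet set = new BitSet();
            for (int i = 0; i < chars.length(); i++) set.set(chars.charAt(i));
            if (left) while (start < end && set.get(str.charAt(start))) start++;
            if (right) while (end > start && set.get(str.charAt(end - 1))) end--;
        }
        return str.substring(start, end);
    }

    /** Fallback that walks whole code points, so surrogate pairs are never split. */
    static String stripCodePoints(String str, String chars, boolean left, boolean right) {
        int start = 0;
        int end = str.length();
        if (left) {
            while (start < end) {
                int cp = str.codePointAt(start);
                if (!containsCodePoint(chars, cp)) break;
                start += Character.charCount(cp);
            }
        }
        if (right) {
            while (end > start) {
                int cp = str.codePointBefore(end);
                if (!containsCodePoint(chars, cp)) break;
                end -= Character.charCount(cp);
            }
        }
        return str.substring(start, end);
    }

    /** Linear membership test over the code points of chars. */
    static boolean containsCodePoint(String chars, int cp) {
        for (int i = 0; i < chars.length(); ) {
            int c = chars.codePointAt(i);
            if (c == cp) return true;
            i += Character.charCount(c);
        }
        return false;
    }
}
```

The fast path stays correct when the input string itself contains surrogate pairs: a delimiter set without surrogates can never match either half of a pair, so the scan stops at the pair instead of splitting it. In this sketch, `stripChars("xxhixy", "xy", true, false)` yields `"hixy"`.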

The intent is JIT/GC friendliness: fewer boxed sets, fewer temporary collections, fewer repeated codepoint scans, and tighter monomorphic loops on common ASCII/BMP stdlib inputs.
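The substr and findSubstr bullets rest on one observation: while a string prefix holds no surrogates, UTF-16 char offsets and code point offsets coincide. The following is a minimal Java sketch of that idea under stated assumptions, not the merged implementation. The class `CodePointFastPaths` and its method names are hypothetical, out-of-range offsets are assumed to clamp to the end of the string, overlapping matches are assumed to be reported, and inputs are assumed to be well-formed UTF-16.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; class and method names are hypothetical. */
final class CodePointFastPaths {
    private CodePointFastPaths() {}

    /** True if any UTF-16 unit in s[0, limit) is a surrogate. */
    static boolean hasSurrogateBefore(String s, long limit) {
        int n = (int) Math.min(limit, (long) s.length());
        for (int i = 0; i < n; i++) {
            if (Character.isSurrogate(s.charAt(i))) return true;
        }
        return false;
    }

    /** Substring of len code points starting at code point offset from (clamped). */
    static String substr(String s, int from, int len) {
        if (from < 0 || len < 0) {
            throw new IllegalArgumentException("negative offset or length");
        }
        long end = (long) from + len;
        if (!hasSurrogateBefore(s, end)) {
            // No surrogate before the end offset: char index == code point index.
            int n = s.length();
            return s.substring(Math.min(from, n), (int) Math.min(end, (long) n));
        }
        // Fallback: translate code point offsets into char indexes.
        int total = s.codePointCount(0, s.length());
        int f = Math.min(from, total);
        int e = (int) Math.min(end, (long) total);
        int startIdx = s.offsetByCodePoints(0, f);
        int endIdx = s.offsetByCodePoints(startIdx, e - f);
        return s.substring(startIdx, endIdx);
    }

    /** Code point offsets of every occurrence of pat in s. */
    static List<Integer> findSubstr(String pat, String s) {
        List<Integer> out = new ArrayList<>();
        if (pat.isEmpty()) return out;
        boolean bmpOnly = !hasSurrogateBefore(s, s.length());
        int cpOffset = 0; // code points counted so far, up to char index prev
        int prev = 0;
        for (int i = s.indexOf(pat); i >= 0; i = s.indexOf(pat, i + 1)) {
            if (bmpOnly) {
                // indexOf results are already code point offsets.
                out.add(i);
            } else {
                // Count only the code points since the previous match.
                cpOffset += s.codePointCount(prev, i);
                prev = i;
                out.add(cpOffset);
            }
        }
        return out;
    }
}
```

In this sketch the surrogate scan for `substr` stops at the requested end offset, so a non-BMP character later in the string does not force the slow path. The incremental counting in `findSubstr` keeps the fallback linear in the haystack length instead of rescanning from the start for each match.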

Correctness

Ran:

./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' \
  sjsonnet.StdStripCharsTests sjsonnet.UnicodeHandlingTests sjsonnet.StdParseTests \
  sjsonnet.Base64Tests sjsonnet.StdLibOfficialCompatibilityTests sjsonnet.FileTests \
  'sjsonnet.js[3.3.7].compile' 'sjsonnet.native[3.3.7].compile'

Result: passed.

Also ran:

./mill --no-server 'sjsonnet.jvm[3.3.7].reformat'
./mill --no-server bench.reformat
git diff --check

Result: passed.

Targeted JMH

Command on both master and this branch:

./mill --no-server bench.runRegressions \
  bench/resources/go_suite/stripChars.jsonnet \
  bench/resources/go_suite/lstripChars.jsonnet \
  bench/resources/go_suite/rstripChars.jsonnet \
  bench/resources/go_suite/substr.jsonnet \
  bench/resources/go_suite/findSubstr.jsonnet \
  bench/resources/go_suite/parseInt.jsonnet \
  bench/resources/go_suite/base64_stress.jsonnet

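Lower ms/op is better. Delta is the reduction relative to master, so a positive value means the PR is faster.

| Benchmark | master (ms/op) | PR (ms/op) | Delta |
| --- | --- | --- | --- |
| stripChars | 0.115 | 0.112 | +2.6% |
| lstripChars | 0.115 | 0.113 | +1.7% |
| rstripChars | 0.116 | 0.114 | +1.7% |
| substr | 0.056 | 0.055 | +1.8% |
| findSubstr | 0.053 | 0.052 | +1.9% |
| parseInt | 0.033 | 0.032 | +3.0% |
| base64_stress | 0.186 | 0.175 | +5.9% |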
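The Delta values are consistent with the relative reduction in ms/op, (master - PR) / master, rounded to one decimal place. That formula is inferred from the numbers rather than stated in the PR, and the helper below is a hypothetical sketch for reproducing the column:

```java
/** Illustrative sketch only; the class and method names are hypothetical. */
final class BenchDelta {
    private BenchDelta() {}

    /** Reduction in ms/op relative to master, in percent, rounded to one decimal. */
    static double improvementPercent(double masterMs, double prMs) {
        double pct = (masterMs - prMs) / masterMs * 100.0;
        return Math.round(pct * 10.0) / 10.0;
    }
}
```

For example, `improvementPercent(0.186, 0.175)` gives 5.9 for base64_stress. A negative result would indicate that the PR is slower than master.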

Full JMH

Also ran the full suite on both master and this branch:

./mill --no-server bench.runRegressions

Result: passed on both branches.

The full one-shot run showed visible noise on unrelated benchmarks, so I would not treat sub-0.2ms movement there as strong signal. Examples of unrelated swings in the same run include reverse 6.635 -> 7.903 ms/op and member 0.635 -> 0.734 ms/op. For the directly touched paths, the full run was mixed:

| Benchmark | master (ms/op) | PR (ms/op) |
| --- | --- | --- |
| base64 | 0.195 | 0.147 |
| base64Decode | 0.120 | 0.116 |
| findSubstr | 0.052 | 0.051 |
| substr | 0.056 | 0.064 |
| parseInt | 0.032 | 0.039 |
| stripChars | 0.115 | 0.129 |
| lstripChars | 0.113 | 0.114 |
| rstripChars | 0.118 | 0.131 |

Given that the focused run isolates the changed code paths better and the full run moved unrelated benchmarks by larger margins, the targeted JMH data is the more useful signal for this PR. The gains are modest, but the changes remove unnecessary allocation and Unicode machinery from common stdlib paths while keeping the correctness fallbacks for non-BMP inputs.

@He-Pin He-Pin marked this pull request as draft April 30, 2026 14:53
@He-Pin He-Pin marked this pull request as ready for review April 30, 2026 15:02
@stephenamar-db stephenamar-db merged commit 407fb18 into databricks:master May 1, 2026
5 checks passed