Optimize stdlib string hot paths by He-Pin · Pull Request #812 · databricks/sjsonnet

He-Pin · 2026-04-30T14:31:33Z

Summary

This is a small follow-up to the jrsonnet comparison work. jrsonnet's string-heavy stdlib paths generally stay close to raw byte/char operations and avoid allocation-heavy generic paths where Unicode semantics do not require them. This PR applies the same principle in a JVM-friendly way without copying Rust internals.

Changes:

- specialize std.stripChars / std.lstripChars / std.rstripChars for common non-surrogate BMP delimiter sets
  - single delimiter: direct charAt comparison
  - multiple delimiters: java.util.BitSet membership instead of boxed Set[Int]
  - surrogate-containing delimiter sets still use the existing codepoint-safe path
- specialize std.substr when the requested range has no surrogate pairs before the end offset
  - uses direct substring for ASCII/BMP prefixes
  - falls back to codePointCount / offsetByCodePoints for full Unicode safety
- specialize std.findSubstr for BMP-only haystacks
  - String.indexOf offsets are already Jsonnet codepoint offsets in that case
  - fallback keeps the existing incremental codepoint accounting
- make std.parseInt / std.parseOctal / std.parseHex use charAt loops on the hot path
  - preserves the existing official hex digit mapping behavior covered by tests
- use StandardCharsets.UTF_8 in base64 string encode/decode paths
- add a findSubstr regression benchmark so this path is measured in bench.runRegressions
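
For readers who want a concrete picture of the stripChars specialization, here is a minimal Java sketch of the idea. It is not the code in this PR (sjsonnet is written in Scala). The class name StripSketch and its method are hypothetical, and the sketch assumes well-formed UTF-16 input. It takes the charAt fast path for a single delimiter, uses a java.util.BitSet for several BMP delimiters, and falls back to a code-point walk when the delimiter set contains surrogates.

```java
import java.util.BitSet;

// Hypothetical illustration only; not sjsonnet's implementation.
final class StripSketch {
    // Strips leading and trailing characters of s that occur in chars.
    public static String stripChars(String s, String chars) {
        boolean bmpOnly = true;
        for (int i = 0; i < chars.length(); i++) {
            if (Character.isSurrogate(chars.charAt(i))) { bmpOnly = false; break; }
        }
        int start = 0, end = s.length();
        if (bmpOnly && chars.length() == 1) {
            // Single delimiter: plain char comparison, no set at all.
            char c = chars.charAt(0);
            while (start < end && s.charAt(start) == c) start++;
            while (end > start && s.charAt(end - 1) == c) end--;
        } else if (bmpOnly) {
            // Several BMP delimiters: primitive bit set instead of a boxed Set.
            BitSet set = new BitSet();
            for (int i = 0; i < chars.length(); i++) set.set(chars.charAt(i));
            while (start < end && set.get(s.charAt(start))) start++;
            while (end > start && set.get(s.charAt(end - 1))) end--;
        } else {
            // Surrogates in the delimiter set: walk whole code points.
            while (start < end) {
                int cp = s.codePointAt(start);
                if (chars.indexOf(cp) < 0) break;
                start += Character.charCount(cp);
            }
            while (end > start) {
                int cp = s.codePointBefore(end);
                if (chars.indexOf(cp) < 0) break;
                end -= Character.charCount(cp);
            }
        }
        return s.substring(start, end);
    }
}
```

The fast paths stay correct even when s itself contains surrogate pairs: a surrogate char never matches a non-surrogate delimiter, so stripping simply stops there.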

The intent is JIT/GC friendliness: fewer boxed sets, fewer temporary collections, fewer repeated codepoint scans, and tighter monomorphic loops on common ASCII/BMP stdlib inputs.
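
The "fewer repeated codepoint scans" point can be illustrated with a rough Java sketch of the substr and findSubstr fast paths. Everything here is an assumption made for illustration: the class CodePointFastPath, its method signatures, the non-negative from/len precondition, and the choice to report overlapping matches are not taken from the PR. The idea it shows is the one described above: when the relevant prefix or haystack has no surrogates, UTF-16 char offsets already equal Jsonnet code-point offsets, so substring and indexOf can be used directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not sjsonnet's implementation.
final class CodePointFastPath {
    // from and len are counted in code points; both assumed non-negative.
    public static String substr(String s, int from, int len) {
        int n = s.length();
        if (from >= n) return ""; // code point count never exceeds char count
        int charEnd = (int) Math.min((long) from + len, (long) n);
        boolean bmpPrefix = true;
        for (int i = 0; i < charEnd; i++) {
            if (Character.isSurrogate(s.charAt(i))) { bmpPrefix = false; break; }
        }
        // No surrogate before the end offset: char index == code point index.
        if (bmpPrefix) return s.substring(from, charEnd);
        // Fallback: translate code point offsets into char offsets.
        int cpCount = s.codePointCount(0, n);
        if (from >= cpCount) return "";
        int cpEnd = (int) Math.min((long) from + len, (long) cpCount);
        int b = s.offsetByCodePoints(0, from);
        int e = s.offsetByCodePoints(b, cpEnd - from);
        return s.substring(b, e);
    }

    // Returns code point offsets of every (possibly overlapping) match of pat.
    public static List<Integer> findSubstr(String pat, String str) {
        List<Integer> out = new ArrayList<>();
        if (pat.isEmpty()) return out;
        boolean bmpOnly = true;
        for (int i = 0; i < str.length(); i++) {
            if (Character.isSurrogate(str.charAt(i))) { bmpOnly = false; break; }
        }
        int cpIdx = 0, lastChar = 0;
        int idx = str.indexOf(pat);
        while (idx >= 0) {
            if (bmpOnly) {
                out.add(idx); // indexOf offset is already the code point offset
            } else {
                // Count only the code points since the previous match.
                cpIdx += str.codePointCount(lastChar, idx);
                lastChar = idx;
                out.add(cpIdx);
            }
            idx = str.indexOf(pat, idx + 1);
        }
        return out;
    }
}
```

The sketch returns a boxed List<Integer> for brevity; a version tuned for the JIT/GC goals above would more likely fill a primitive array.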

Correctness

Ran:

./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' \
  sjsonnet.StdStripCharsTests sjsonnet.UnicodeHandlingTests sjsonnet.StdParseTests \
  sjsonnet.Base64Tests sjsonnet.StdLibOfficialCompatibilityTests sjsonnet.FileTests \
  'sjsonnet.js[3.3.7].compile' 'sjsonnet.native[3.3.7].compile'

Result: passed.

Also ran:

./mill --no-server 'sjsonnet.jvm[3.3.7].reformat'
./mill --no-server bench.reformat
git diff --check

Result: passed.

Targeted JMH

Command on both master and this branch:

./mill --no-server bench.runRegressions \
  bench/resources/go_suite/stripChars.jsonnet \
  bench/resources/go_suite/lstripChars.jsonnet \
  bench/resources/go_suite/rstripChars.jsonnet \
  bench/resources/go_suite/substr.jsonnet \
  bench/resources/go_suite/findSubstr.jsonnet \
  bench/resources/go_suite/parseInt.jsonnet \
  bench/resources/go_suite/base64_stress.jsonnet

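Lower ms/op is better. Delta is the relative reduction versus master, (master - PR) / master, so a positive value means the PR is faster.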

Benchmark	master ms/op	PR ms/op	Delta
stripChars	0.115	0.112	+2.6%
lstripChars	0.115	0.113	+1.7%
rstripChars	0.116	0.114	+1.7%
substr	0.056	0.055	+1.8%
findSubstr	0.053	0.052	+1.9%
parseInt	0.033	0.032	+3.0%
base64_stress	0.186	0.175	+5.9%

Full JMH

Also ran the full suite on both master and this branch:

./mill --no-server bench.runRegressions

Result: passed on both branches.

The full one-shot run showed visible noise on unrelated benchmarks, so I would not treat sub-0.2 ms movement there as a strong signal. Examples of unrelated swings in the same run include reverse 6.635 -> 7.903 ms/op and member 0.635 -> 0.734 ms/op. For the directly touched paths, the full run was mixed:

Benchmark	master ms/op	PR ms/op
base64	0.195	0.147
base64Decode	0.120	0.116
findSubstr	0.052	0.051
substr	0.056	0.064
parseInt	0.032	0.039
stripChars	0.115	0.129
lstripChars	0.113	0.114
rstripChars	0.118	0.131

Given that the focused run isolates the changed files better and the full run moved unrelated benchmarks by larger margins, the targeted JMH data is the more useful signal for this PR. The gains are modest, but the changes remove unnecessary allocation/Unicode machinery from common stdlib paths while keeping correctness fallbacks for non-BMP inputs.

Optimize stdlib string hot paths

508c6bc

He-Pin marked this pull request as draft April 30, 2026 14:53

He-Pin marked this pull request as ready for review April 30, 2026 15:02

stephenamar-db merged commit 407fb18 into databricks:master May 1, 2026
5 checks passed

stephenamar-db merged 1 commit into databricks:master from He-Pin:perf/stdlib-string-hotpaths
