perf: improve strings handling by anthony-swirldslabs · Pull Request #636 · hashgraph/pbj

anthony-swirldslabs · 2025-10-01T00:32:07Z

Description:
This is the last part of the changes extracted from @jasperpotts's draft PoC #612. Models now store string fields internally as UTF-8 byte arrays. Codecs can parse the original UTF-8 bytes and place them directly into the model without UTF-8 decoding, which would otherwise convert the text into UTF-16 (or Latin-1), because that is how Java stores strings. The public API of the models is unchanged: whenever anyone calls a getter for a string field, the model decodes the bytes into a Java string on the fly.

The idea behind this optimization is that we rarely read or use strings in our business logic, so we might as well skip the decoding step when parsing models.
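A minimal sketch of the idea, assuming a hypothetical `Greeting` model with a single string field. The class, both constructors, and `nameUtf8ForWrite` are illustrative names only, not PBJ's generated API:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical model with one string field; not PBJ's generated code. */
final class Greeting {
    // Raw UTF-8 bytes, exactly as a codec would read them off the wire.
    private final byte[] nameUtf8;

    /** Conventional constructor: encodes the Java string to UTF-8 once. */
    Greeting(String name) {
        this.nameUtf8 = name.getBytes(StandardCharsets.UTF_8);
    }

    /** Codec-facing constructor: adopts the parsed bytes without decoding or copying. */
    Greeting(byte[] nameUtf8) {
        this.nameUtf8 = nameUtf8;
    }

    /** Unchanged public getter: decodes the bytes on every call. */
    String name() {
        return new String(nameUtf8, StandardCharsets.UTF_8);
    }

    /** Serialization can reuse the stored bytes, so parse-then-write never builds a String. */
    byte[] nameUtf8ForWrite() {
        return nameUtf8.clone();
    }
}

final class LazyStringDemo {
    public static void main(String[] args) {
        // "h\u00e9llo" is 6 bytes in UTF-8 because the e-acute takes two bytes.
        byte[] wire = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        Greeting parsed = new Greeting(wire);  // no decoding happens here
        System.out.println(parsed.name());     // decoding happens here, on demand
    }
}
```

In this sketch a model that is parsed and then re-serialized never pays for a `String` allocation; the cost moves to whichever caller first asks for the text.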

Three caveats come with the fix:

1. New public constructors are introduced that accept raw byte[] for strings and take an unused dummy argument to resolve a generic erasure clash. This looks a bit ugly. It is also not very safe, because it allows malicious code to create a mutable model instance (by retaining a reference to the byte[]) and then pass the object to code that expects models to be immutable. We could make the constructors package-private, but we'd have to move all the codecs into the same package as their models, which could be a breaking change.
2. If we end up reading the string multiple times, either directly or via log/toString(), either now or in the future, then this optimization will be defeated because we'll be decoding the bytes multiple times. We could work around this by caching the decoded strings, as we already do for hashCode/protobufSize. We'll also run performance tests once these changes are released.
3. The fuzz test thresholds had to be relaxed because PBJ no longer decodes UTF-8 strings when parsing models, so it now fails less often than Protoc. This also means that decoding errors may now surface inside our business logic code rather than at deserialization points as before. There isn't a simple solution for this caveat.
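The three caveats can be reproduced in a small sketch. `Memo` is a hypothetical model, not PBJ's generated code, and the strict `CharsetDecoder` in the getter is an assumption made for illustration; a plain `new String(bytes, UTF_8)` would silently substitute U+FFFD for malformed input instead of throwing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical model used to illustrate the caveats; not PBJ's generated code. */
final class Memo {
    private final byte[] textUtf8;
    // Cached decoded value. A benign race is acceptable here: String is immutable
    // and every thread would compute an equal value.
    private String text;

    /** Adopts the caller's array without copying, as a zero-copy codec path would. */
    Memo(byte[] textUtf8) {
        this.textUtf8 = textUtf8;
    }

    /** Decodes strictly on first use and caches the result. */
    String text() {
        String cached = text;
        if (cached == null) {
            try {
                cached = StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(textUtf8))
                        .toString();
            } catch (CharacterCodingException e) {
                // Malformed input surfaces here, in the caller's code, not in the parser.
                throw new IllegalStateException("field is not valid UTF-8", e);
            }
            text = cached;
        }
        return cached;
    }
}

final class CaveatsDemo {
    public static void main(String[] args) {
        // Caveat 1: the caller keeps a reference and mutates the "immutable" model.
        byte[] shared = "cat".getBytes(StandardCharsets.UTF_8);
        Memo aliased = new Memo(shared);
        shared[0] = 'b';
        System.out.println(aliased.text());

        // Caveat 2: with the cache, repeated reads return the same String instance.
        System.out.println(aliased.text() == aliased.text());

        // Caveat 3: construction succeeds, and only the first read fails.
        Memo malformed = new Memo(new byte[] {(byte) 0xC3}); // truncated two-byte sequence
        try {
            malformed.text();
        } catch (IllegalStateException e) {
            System.out.println("late failure: " + e.getMessage());
        }
    }
}
```

Copying the array in the constructor would close the aliasing hole, at the cost of the extra allocation that the zero-copy path is trying to avoid.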

Related issue(s):

Fixes #620

Notes for reviewer:
All tests should pass.

Checklist

Documented (Code comments, README, etc.)
Tested (unit, integration, etc.)

Signed-off-by: Anthony Petrov <[email protected]>

github-actions · 2025-10-01T00:33:55Z

JUnit Test Report

77 files ±0 | 77 suites ±0 | 6m 18s ⏱️ +3m 34s
1 328 tests ±0 | 1 324 ✅ ±0 | 4 💤 ±0 | 0 ❌ ±0
7 185 runs ±0 | 7 165 ✅ ±0 | 20 💤 ±0 | 0 ❌ ±0

Results for commit 67e2746. ± Comparison against base commit 756dcbb.

♻️ This comment has been updated with latest results.

github-actions · 2025-10-01T00:39:04Z

Integration Test Report

408 files +3 | 408 suites +3 | 14m 17s ⏱️ -2m 35s
114 854 tests +17 | 114 854 ✅ +17 | 0 💤 ±0 | 0 ❌ ±0
115 095 runs +17 | 115 095 ✅ +17 | 0 💤 ±0 | 0 ❌ ±0

Results for commit 67e2746. ± Comparison against base commit 756dcbb.

♻️ This comment has been updated with latest results.

jasperpotts · 2025-10-06T17:44:17Z

For concern (1), I have one idea, but I don't love it.

You could make the byte[] constructor protected and then make the model objects sealed. The codec could have a subclass of the model object that is allowed to use the protected constructor, and the model object could be sealed to permit only that single subclass. I have no idea whether it would have performance consequences. I hate it for being ugly, but maybe less than exposing byte[]; not sure.
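A rough sketch of that sealed-class idea, assuming Java 17 or later. `Account`, `AccountCodec`, and `Parsed` are hypothetical names, and everything sits in one package here only to keep the example self-contained. In real code the model and codec would live in different packages, which Java permits for a sealed hierarchy only when both are in the same named module.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical model; sealed so that only the codec's helper subclass may extend it. */
sealed class Account permits AccountCodec.Parsed {
    private final byte[] memoUtf8;

    /** Public constructor keeps the existing String-based API. */
    public Account(String memo) {
        this.memoUtf8 = memo.getBytes(StandardCharsets.UTF_8);
    }

    /** Raw-bytes constructor: reachable from subclasses (and from the model's own package). */
    protected Account(byte[] memoUtf8) {
        this.memoUtf8 = memoUtf8;
    }

    public String memo() {
        return new String(memoUtf8, StandardCharsets.UTF_8);
    }
}

final class AccountCodec {
    /** The single permitted subclass; it exists only to reach the protected constructor. */
    static final class Parsed extends Account {
        private Parsed(byte[] memoUtf8) {
            super(memoUtf8);
        }
    }

    /** Stand-in for the real parsing path: adopts the wire bytes without decoding them. */
    static Account parse(byte[] wireBytes) {
        return new Parsed(wireBytes);
    }
}
```

One thing to check with this layout: instances produced by the codec have a different runtime class than instances built through the public constructor, which matters if `equals()` compares `getClass()`.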

On concern (2), we should get a performance run with this branch from Alex to see how it affects CN performance before merging. If the gain is tiny or non-existent, then let's not merge.

anthony-swirldslabs · 2025-10-08T17:33:19Z

Closing the PR per #620 (comment).

perf: improve strings handling

4b876bc

Signed-off-by: Anthony Petrov <[email protected]>

anthony-swirldslabs requested a review from jasperpotts October 1, 2025 00:32

anthony-swirldslabs self-assigned this Oct 1, 2025

anthony-swirldslabs requested review from a team as code owners October 1, 2025 00:32

imalygin reviewed Oct 3, 2025

pbj-integration-tests/src/jmh/java/com/hedera/pbj/integration/jmh/utf8/Utf8ToolsV2.java (outdated, resolved)

imalygin previously approved these changes Oct 3, 2025

merge from upstream

df64a92

Signed-off-by: Anthony Petrov <[email protected]>

anthony-swirldslabs dismissed imalygin’s stale review via df64a92 October 3, 2025 18:12

address comments, fix previous merge, spotless

67e2746

Signed-off-by: Anthony Petrov <[email protected]>

imalygin approved these changes Oct 3, 2025

anthony-swirldslabs mentioned this pull request Oct 8, 2025

Revisit String performance #620

Open

anthony-swirldslabs closed this Oct 8, 2025

anthony-swirldslabs wants to merge 3 commits into main from 620-stringByteArray

3 participants
