Add an inlining stage for single callsite methods #12899

Open
roberttoyonaga wants to merge 4 commits into oracle:master from roberttoyonaga:SingleCallsiteInlining

@roberttoyonaga

Collaborator

Summary

This PR adds a new inlining stage for single callsite methods. Because such a method is called from only one place, inlining it moves the body rather than duplicating it, so we should be able to benefit from the inlining without paying the code area price. This stage runs completely independently of the normal Trivial inlining stage. It can be turned off using the AOTSingleCallsiteInline hosted option, similar to the existing AOTTrivialInline option.

This inlining stage works by first counting the callsites for each method in the universe, then inlining the methods that have exactly one callsite. Like the Trivial inlining stage, it proceeds in rounds, but usually only 1 or 2 rounds ever execute. An additional round executes only in the rare case that an inlining candidate exceeds the fallback threshold.
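
The counting step can be illustrated with a toy model. The sketch below is not the Native Image implementation: the class name, the string-based call graph, and the rule of skipping roots and directly self-recursive methods are all assumptions made for illustration, and it does not model rounds or the fallback threshold.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of candidate selection for single callsite inlining; not the Native Image code. */
class SingleCallsiteSketch {

    /** Counts callsites per callee. The universe maps each caller to its callees, one entry per callsite. */
    static Map<String, Integer> countCallsites(Map<String, List<String>> universe) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> callees : universe.values()) {
            for (String callee : callees) {
                counts.merge(callee, 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Returns the methods with exactly one callsite. Assumption: roots (entry points) are
     * skipped because they must be compiled anyway, and directly self-recursive methods are
     * skipped because their only callsite is inside themselves.
     */
    static Set<String> candidates(Map<String, List<String>> universe, Set<String> roots) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : countCallsites(universe).entrySet()) {
            String callee = e.getKey();
            boolean selfRecursive = universe.getOrDefault(callee, List.of()).contains(callee);
            if (e.getValue() == 1 && !roots.contains(callee) && !selfRecursive) {
                result.add(callee);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> universe = new HashMap<>();
        universe.put("main", List.of("parse", "log", "log")); // log has two callsites
        universe.put("parse", List.of("validate"));
        universe.put("validate", List.of());
        universe.put("log", List.of());
        System.out.println(candidates(universe, Set.of("main"))); // prints [parse, validate]
    }
}
```

In this model `log` is rejected because it has two callsites, while `parse` and `validate` each have exactly one and would be inlined into their sole caller.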

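Overall, adding this inlining stage improves performance: 7 of the 9 Renaissance benchmarks below and both Quarkus endpoints get faster, at the cost of slightly higher inline time, build time, and code area. See the benchmark results below.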


Results

Using Renaissance:

Test details:

  • Tested on Linux amd64
  • For stability during the build and during benchmark execution:
    • Intel turbo boost disabled
    • CPU frequency pinned to 2100000 kHz (2.1 GHz) with cpupower
    • cpupower frequency-set --governor performance
    • Caches dropped before building with sh -c 'echo 3 >/proc/sys/vm/drop_caches'
    • Each benchmark was run multiple times. The first few runs are discounted as warm-up.

Definitions:

  • new: The inliner with single callsite inlining implemented
  • old: The old inliner with default settings
  • duration: Execution time of the Renaissance benchmark
  • % improvement: Calculated as 100 * (old duration - new duration) / old duration
  • STDEV: Standard deviation of the benchmark run duration
  • inline time: Time taken only by the inlining operations
  • build time: Total time taken by the image builder
  • Peak Build RSS: Peak RSS reported by the image builder

| Benchmark | Duration (ms) [old] | Duration (ms) [new] | % improvement | STDEV (ms) [old] | STDEV (ms) [new] | Code Area (MB) [old] | Code Area (MB) [new] | File Size (MB) [old] | File Size (MB) [new] | Inline Time (s) [old] | Inline Time (s) [new] | Build Time (s) [old] | Build Time (s) [new] | Peak Build RSS (GB) [old] | Peak Build RSS (GB) [new] | Total runs | Warm up runs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mnemonics | 11127.89 | 10368.52 | 6.824 | 77.685 | 70.535 | 7.42 | 7.49 | 19.5 | 19.5 | 0.8 | 1.25 | 38.85 | 39.35 | 1.735 | 1.82 | 16 | 6 |
| reactors | 35201.1 | 33134.3 | 5.871 | 564.39 | 1057 | 7.86 | 7.94 | 21.07 | 21.13 | 0.8 | 1.3 | 40.1 | 41.2 | 1.81 | 1.93 | 10 | 3 |
| future-genetic | 2743.19 | 2684.55 | 2.138 | 16.91 | 19.525 | 7.54 | 7.6 | 19.63 | 19.69 | 0.75 | 1.3 | 39.05 | 39.65 | 1.74 | 1.815 | 25 | 5 |
| par-mnemonics | 9192.965 | 8634.455 | 6.075 | 89.955 | 60.565 | 7.43 | 7.49 | 19.5 | 19.5 | 0.8 | 1.2 | 38.7 | 39.35 | 1.74 | 1.8 | 16 | 6 |
| philosophers | 6818.26 | 6422.29 | 5.807 | 245.965 | 209.43 | 9.97 | 10.21 | 27.69 | 27.88 | 1 | 1.4 | 46.6 | 47.6 | 2.05 | 2.2 | 30 | 10 |
| scala-doku | 5911.5 | 5931.97 | -0.346 | 51.76 | 21.15 | 7.49 | 7.55 | 19.57 | 19.57 | 0.8 | 1.1 | 39.1 | 40 | 1.76 | 1.84 | 20 | 10 |
| fj-kmeans | 6422.1 | 6488.66 | -1.036 | 122.97 | 94.03 | 7.39 | 7.46 | 19.38 | 19.44 | 0.7 | 1.2 | 38.3 | 40 | 1.72 | 1.8 | 30 | 20 |
| akka-uct | 30300.175 | 30102.31 | 0.653 | 836.825 | 1013.41 | 9.52 | 9.59 | 24.32 | 24.32 | 1 | 1.5 | 45.2 | 45.7 | 2.03 | 2.07 | 10 | 5 |
| scala-kmeans | 722.883 | 616.075 | 14.775 | 3.39 | 4.1 | 7.37 | 7.46 | 19.38 | 19.44 | 0.8 | 1.3 | 38.5 | 39 | 1.74 | 1.79 | 25 | 5 |

Using a Quarkus hello-world rest benchmark:

This benchmark has 2 endpoints: "greeting" and "beer". Both return plaintext, but "beer" does a little more work.

Test details:

  • Tested on Linux amd64
  • For stability during the build and during benchmark execution:
    • Intel turbo boost disabled
    • CPU frequency pinned to 2100000 kHz (2.1 GHz) with cpupower
    • cpupower frequency-set --governor performance
    • Caches dropped before building and running with sh -c 'echo 3 >/proc/sys/vm/drop_caches'
    • Both the old and new configurations were built and run 5 times in an alternating fashion. The results were averaged.
    • Pinned the load driver (Hyperfoil) to 4 CPUs and the Quarkus app to 2 other CPUs

New Definitions:

  • Req/s: Quarkus app throughput in requests per second
  • % improvement: Calculated as 100 * (new throughput - old throughput) / old throughput

| Benchmark | Throughput (req/s) [old] | Throughput (req/s) [new] | % improvement | Code Area (MB) [old] | Code Area (MB) [new] | File Size (MB) [old] | File Size (MB) [new] | Inline Time (s) [old] | Inline Time (s) [new] | Build Time (s) [old] | Build Time (s) [new] | Peak Build RSS (GB) [old] | Peak Build RSS (GB) [new] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "greeting" endpoint | 48161.4 | 53484.8 | 11.053 | 20.14 | 20.52 | 46.82 | 47.19 | 1.388 | 2.371 | 81.5 | 82.429 | 2.704 | 2.83 |
| "beer" endpoint | 23598.25 | 26249.5 | 11.235 | Same as above | Same as above | Same as above | Same as above | Same as above | Same as above | Same as above | Same as above | Same as above | Same as above |

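The two "% improvement" definitions use opposite sign conventions, because a lower duration is better while a higher throughput is better. As a small sanity check, the sketch below recomputes one value of each kind from the numbers reported above. The class and method names are illustrative only and are not part of the PR.

```java
/** Recomputes the "% improvement" columns from their definitions; names are illustrative. */
class ImprovementSketch {

    /** Lower is better (durations): 100 * (old - new) / old. */
    static double durationImprovement(double oldMs, double newMs) {
        return 100.0 * (oldMs - newMs) / oldMs;
    }

    /** Higher is better (throughput): 100 * (new - old) / old. */
    static double throughputImprovement(double oldReqPerSec, double newReqPerSec) {
        return 100.0 * (newReqPerSec - oldReqPerSec) / oldReqPerSec;
    }

    public static void main(String[] args) {
        // mnemonics row of the Renaissance results
        System.out.printf("%.3f%n", durationImprovement(11127.89, 10368.52));
        // "greeting" row of the Quarkus results
        System.out.printf("%.3f%n", throughputImprovement(48161.4, 53484.8));
    }
}
```

With either convention a positive value means the new inliner did better, which is why scala-doku and fj-kmeans show up as negative in the Renaissance results.
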
Other notes

Improving the Trivial Inlining stage
I also tried improving the Trivial inlining stage by switching from raw node counting to estimatedNodeSize(). That should give a more accurate prediction of code area than the raw node count. However, this only resulted in an improvement in one Renaissance benchmark (par-mnemonics), so I am not sure whether this change is worth it. The code for that is in roberttoyonaga#4.
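
To illustrate why the two cost measures can disagree, here is a minimal sketch. The node kinds and their size weights are made up for the example and do not correspond to Graal's actual node size estimates. It only shows that a graph with few but expensive nodes can look cheaper than it is under raw counting.

```java
import java.util.List;

/** Toy comparison of raw node counting and a size-weighted estimate; all weights are hypothetical. */
class NodeCostSketch {

    /** Hypothetical node kinds with made-up size weights. */
    enum Kind {
        CONSTANT(0), ADD(1), FIELD_LOAD(2), INVOKE(16);

        final int estimatedSize;

        Kind(int estimatedSize) {
            this.estimatedSize = estimatedSize;
        }
    }

    /** Raw cost: every node counts as 1, regardless of how much machine code it produces. */
    static int rawNodeCount(List<Kind> graph) {
        return graph.size();
    }

    /** Weighted cost: the sum of the per-node size estimates. */
    static int estimatedSize(List<Kind> graph) {
        return graph.stream().mapToInt(k -> k.estimatedSize).sum();
    }

    public static void main(String[] args) {
        List<Kind> arithmetic = List.of(Kind.CONSTANT, Kind.CONSTANT, Kind.ADD);
        List<Kind> call = List.of(Kind.INVOKE);
        // Raw counting ranks "arithmetic" as the larger graph; the weighted estimate ranks "call" larger.
        System.out.println(rawNodeCount(arithmetic) + " vs " + rawNodeCount(call));
        System.out.println(estimatedSize(arithmetic) + " vs " + estimatedSize(call));
    }
}
```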

Unit tests
I was not able to find any existing tests for the Native Image Inliner. I have manually checked for correctness by building with debug info and checking the generated assembly. However, I don't think that approach translates well to unit tests. Another option I was considering is testing for correctness at the Graal IR level. However, I'm not sure of the best way to go about this, so I decided it was best to ask for advice here before investing too much in a particular approach.

@oracle-contributor-agreement (bot) added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) on Feb 2, 2026
@roberttoyonaga marked this pull request as ready for review on February 2, 2026 15:41
@roberttoyonaga

Collaborator Author

cc @christianhaeubl
This PR is based on something we talked about a few months ago at the GraalVM summit. It adds an inlining stage for single callsite methods to achieve more inlining without suffering the code area penalty.
We also talked about using estimatedNodeSize() in the Trivial inliner cost calculations. I experimented with this, but unfortunately it did not seem to have much of an impact (see roberttoyonaga#4).

@christianhaeubl

Member

@dougxc please assign someone from the compiler side as reviewer.

@dougxc

dougxc commented Feb 3, 2026

Member

@axel22 or @boris-spas, could you please have a look at this?

@roberttoyonaga

Collaborator Author

@axel22 or @boris-spas - Just a gentle ping to keep this on your radar. Have you had a chance to take a look at this?

@roberttoyonaga

Collaborator Author

@axel22 or @boris-spas, another ping to keep this on your radar.

Have you had time to start looking at this?

@thomaswue assigned thomaswue and unassigned axel22 and boris-spas on Mar 20, 2026
@thomaswue

Member

Hi @roberttoyonaga! Thank you for the PR. I will take this over for now and check how we can integrate. We do have larger changes to the inliner pending and I will see how this combines.

@roberttoyonaga

Collaborator Author

Hi @thomaswue just a gentle ping. Do you know if there have been any updates with regard to this? Thanks!

@roberttoyonaga

Collaborator Author

It would be great to get this in before the July release. Since this PR is not really moving, we are considering putting this into the GraalVM for JDK 25.0 maintenance repo (graalvm/graalvm-community-jdk25u#50).
@thomaswue Are there any updates regarding this?

@thomaswue

Member

I am benchmarking this today. Backporting into 25.0.x is an option to consider in any case, because if this PR is integrated here, it will land on the 25.1.x innovation release line.

@Karm

Karm commented Jun 16, 2026

Contributor

I integrated the optional inlining stage into graalvm/graalvm-community-jdk25u#50 (review) to be part of the July release.

@thomaswue

Member

There is a bad interaction between this PR and the -H:Preserve flag or other features that force a method to become a root compilation anyway. I see modest benefits. I played around with expanding this a bit for higher benefits by including non-leaf methods in some cases.

@roberttoyonaga force-pushed the SingleCallsiteInlining branch from 8876f3b to 7d9d5a8 on June 19, 2026 15:26
@roberttoyonaga force-pushed the SingleCallsiteInlining branch from 1cf21dc to 5986cb2 on June 19, 2026 15:40
@roberttoyonaga

Collaborator Author

I've rebased onto master, fixed conflicts, added a test (mx scismoketest), switched the feature OFF by default, and marked the feature as "experimental".

Labels

ibm-redhat-interest, native-image, OCA Verified (All contributors have signed the Oracle Contributor Agreement.), performance

8 participants