
Detailed Performance Metrics (Resolves Issue #104) #106

Closed
LegL0ngly wants to merge 8 commits into beehive-lab:main from LegL0ngly:feat/performance-metrics

Conversation

@LegL0ngly

Resolves #104

Added functionality to keep track of the fine-grained performance metrics described in Issue #104.

In LastRunMetrics:

  • Modified the record to accept all necessary variables (except for model load duration) as parameters and updated the record's documentation accordingly.
  • Added a static long for the load duration, because it does not change between prints. Also added a method for setting the load duration, to be called from ModelLoader.
  • Modified the existing print statement to print new parameters, and print durations in both nanoseconds and seconds. Also created if statements to print the model load duration only once, and the TornadoVM metrics only when using the GPU.
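Taken together, the LastRunMetrics changes above might look roughly like the following sketch (all names here are guesses based on this description, not the actual source):

```java
// Hypothetical sketch; record components and method names are inferred from
// the PR description, not copied from the actual source.
record LastRunMetrics(int totalTokens, long totalNanos, int promptEvalCount, long promptNanos,
                      int inferenceEvalCount, long inferenceNanos,
                      long tornadoCompileNanos, long tornadoWarmupNanos) {

    // Static, because the load duration is measured once by the model loader
    // and does not change between prints.
    private static long loadDurationNanos;
    private static LastRunMetrics latestMetrics;

    static void setLoadDurationNanos(long nanos) {
        loadDurationNanos = nanos;
    }

    static void setMetrics(int totalTokens, long totalNanos, int promptEvalCount, long promptNanos,
                           int inferenceEvalCount, long inferenceNanos,
                           long tornadoCompileNanos, long tornadoWarmupNanos) {
        latestMetrics = new LastRunMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos,
                inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);
    }

    static LastRunMetrics latest() {
        return latestMetrics;
    }

    static void printMetrics() {
        LastRunMetrics m = latestMetrics;
        System.err.printf("Total time: %d ns (%.2f s)%n",
                m.totalNanos(), m.totalNanos() / 1_000_000_000.0);
        if (loadDurationNanos > 0) {
            System.err.printf("Load time: %d ns%n", loadDurationNanos);
            loadDurationNanos = 0; // print the load duration only once
        }
        if (m.tornadoCompileNanos() > 0) { // only set on GPU runs
            System.err.printf("Compile: %d ns, Warmup: %d ns%n",
                    m.tornadoCompileNanos(), m.tornadoWarmupNanos());
        }
    }
}
```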

In InferenceEngine:

  • Sent new metric information to the setMetrics method in LastRunMetrics
  • TornadoVM metrics are retrieved from TornadoVMMasterPlan when using GPU, otherwise they are set to 0, so they are not printed.
  • Model load time is NOT retrieved here; the load duration is sent to LastRunMetrics from ModelLoader because it does not need to be retrieved after every response.
  • Added a timer for inference start in CPU Phi3, which was previously missing
  • Changed the name of the TornadoVMMasterPlan object in the GPU Granite method to be consistent with the naming of all other TornadoVMMasterPlan objects in InferenceEngine.

In ModelLoader:

  • Added simple model load duration timers and logic; the measured times are sent to LastRunMetrics.
  • Declared a new model instead of returning the result of modelType.loadModel immediately to time the loading duration properly.
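The loader change above can be sketched like this (the Supplier stands in for the real modelType.loadModel call and the consumer for the LastRunMetrics setter; all names here are illustrative):

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Illustrative only: the Supplier stands in for modelType.loadModel(...),
// and reportNanos for the call into LastRunMetrics.
final class LoadTimingSketch {
    static <M> M loadTimed(Supplier<M> loadModel, LongConsumer reportNanos) {
        long startNanos = System.nanoTime();
        // Assign to a local instead of returning loadModel.get() directly,
        // so the elapsed time covers the full load before returning.
        M model = loadModel.get();
        reportNanos.accept(System.nanoTime() - startNanos);
        return model;
    }
}
```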

In TornadoVMMasterPlan:

  • Added variables to keep track of compile and warmup duration, along with corresponding accessor and mutator methods. The accessors are called from InferenceEngine and their values passed to the setMetrics method in LastRunMetrics; the mutators are called when times are calculated within TornadoVMMasterPlan.
  • Timing variables were already in place, but were only written to when a debugging flag was set. They are now always written to; the if statements for the flag remain and still print the same information when the flag is used.
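The record-always, print-under-flag pattern described above can be sketched like this (class, flag, and method names are made up for illustration):

```java
// Illustrative names: the duration is always recorded, while printing stays
// behind the pre-existing debug flag as described above.
final class PlanTimingSketch {
    static final boolean DEBUG_TIMING = Boolean.getBoolean("sketch.debug.timing");

    private long compileDurationNanos;

    long getCompileDurationNanos() { return compileDurationNanos; }
    void setCompileDurationNanos(long nanos) { compileDurationNanos = nanos; }

    void compile(Runnable doCompile) {
        long startNanos = System.nanoTime();
        doCompile.run();
        long elapsed = System.nanoTime() - startNanos;
        setCompileDurationNanos(elapsed);   // always written now
        if (DEBUG_TIMING) {                 // unchanged debug output when flagged
            System.out.println("compile took " + elapsed + " ns");
        }
    }
}
```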

One last note not related to the changes:
In the contributing instructions, contributors are instructed to open PRs against the "develop" branch. It appears that branch no longer exists and the instructions are outdated, which might be confusing for new contributors.

@CLAassistant

CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

Copilot AI left a comment

Pull request overview

This PR implements fine-grained performance metrics (nanosecond precision) to address Issue #104, expanding runtime reporting to include model load, prompt evaluation, decode, and TornadoVM-specific initialization timings.

Changes:

  • Expanded LastRunMetrics to store/report detailed durations and token counts (plus TTFT and throughputs).
  • Instrumented ModelLoader to measure and report model load duration.
  • Added TornadoVM compile/warmup duration tracking and propagated these metrics through InferenceEngine.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlan.java Tracks and exposes TornadoVM compile/warmup durations for downstream metrics reporting.
src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java Measures model load time and forwards it to LastRunMetrics.
src/main/java/org/beehive/gpullama3/inference/InferenceEngine.java Collects per-run timing/token metrics (prompt vs decode) and forwards TornadoVM timings when applicable.
src/main/java/org/beehive/gpullama3/auxiliary/LastRunMetrics.java Stores expanded metrics and prints detailed performance output (including TTFT and throughputs).


Comment on lines +90 to +92
    // Calculate and Set Total Warmup Time
    long copyTime = System.nanoTime();
    tornadoVMPlan.setWarmupDurationNanos(copyTime - planCreationTime);

Copilot AI Apr 12, 2026

warmupDurationNanos is currently set as (copyTime - planCreationTime), which includes both the JIT warmup (withPreCompilation()) and the subsequent forceCopyInReadOnlyDataLayered() transfer. If the intent is to report the Issue #104 "task graph warmup duration" (warmup/execution until steady state), this should be measured against warmupTime (end of withPreCompilation) and exclude the copy/transfer step, or alternatively split into separate warmup vs copy/transfer metrics.

Suggested change
-    // Calculate and Set Total Warmup Time
-    long copyTime = System.nanoTime();
-    tornadoVMPlan.setWarmupDurationNanos(copyTime - planCreationTime);
+    // Calculate and set warmup time excluding the copy/transfer step
+    long copyTime = System.nanoTime();
+    tornadoVMPlan.setWarmupDurationNanos(warmupTime - planCreationTime);

Comment on lines 222 to +235
    // Calculate and print performance metrics
    long endNanos = System.nanoTime();
    double totalTimeSeconds = (endNanos - startNanos) / 1_000_000_000.0;
    if (inferenceStartNanos == 0) {
        inferenceStartNanos = endNanos; // Prevents negative time if no tokens were generated
    }
    int totalTokens = promptIndex + generatedTokens.size();

    LastRunMetrics.setMetrics(totalTokens, totalTimeSeconds);
    long totalNanos = (endNanos - startNanos);
    int promptEvalCount = promptIndex;
    long promptNanos = inferenceStartNanos - startNanos;
    int inferenceEvalCount = generatedTokens.size();
    long inferenceNanos = endNanos - inferenceStartNanos;
    long tornadoCompileNanos = 0;
    long tornadoWarmupNanos = 0;
    LastRunMetrics.setMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos, inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);

Copilot AI Apr 12, 2026

In the Qwen3 CPU path, promptNanos is derived from inferenceStartNanos - startNanos, but inferenceStartNanos is only set in the else branch of the main loop. The first generated token is produced in the same iteration when the last prompt token is ingested (the branch that does position++), so inferenceStartNanos ends up being recorded after at least one decode step has already happened. This will misattribute some decode time into promptNanos and undercount inferenceNanos; consider capturing inferenceStartNanos immediately before sampling the first generated token when prompt ingestion completes.
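The fix this comment suggests, capturing inferenceStartNanos as soon as prompt ingestion completes, can be sketched as follows (a simplified stand-in loop, not the actual engine code):

```java
// Simplified generation loop (illustrative, not the actual InferenceEngine
// code) showing where inferenceStartNanos should be captured.
final class TtftSketch {
    // Returns { promptNanos, inferenceNanos } for promptLen prompt tokens
    // followed by genLen generated tokens.
    static long[] run(int promptLen, int genLen) {
        long startNanos = System.nanoTime();
        long inferenceStartNanos = 0;
        for (int pos = 0; pos < promptLen + genLen; pos++) {
            if (pos == promptLen - 1) {
                // Prompt ingestion completes in this iteration: capture the
                // decode start BEFORE sampling the first generated token, so
                // the first decode step is not counted as prompt time.
                inferenceStartNanos = System.nanoTime();
            }
            // ... forward pass and token sampling would happen here ...
        }
        long endNanos = System.nanoTime();
        if (inferenceStartNanos == 0) {
            inferenceStartNanos = endNanos; // no tokens were generated
        }
        return new long[] { inferenceStartNanos - startNanos, endNanos - inferenceStartNanos };
    }
}
```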

Comment on lines +493 to +514
    // Calculate and Print Performance Metrics
    long endNanos = System.nanoTime();
    double totalTimeSeconds = (endNanos - startNanos) / 1_000_000_000.0;
    if (inferenceStartNanos == 0) {
        inferenceStartNanos = endNanos; // Prevents negative time if no tokens were generated
    }
    int totalTokens = promptIndex + generatedTokens.size();

    LastRunMetrics.setMetrics(totalTokens, totalTimeSeconds);
    long totalNanos = (endNanos - startNanos);
    int promptEvalCount = promptIndex;
    long promptNanos = inferenceStartNanos - startNanos;
    int inferenceEvalCount = generatedTokens.size();
    long inferenceNanos = endNanos - inferenceStartNanos;
    long tornadoCompileNanos = 0;
    long tornadoWarmupNanos = 0;
    // If statement to prevent inadvertent crashes from future features
    if (tornadoVMPlan != null) {
        tornadoCompileNanos = tornadoVMPlan.getCompileDurationNanos();
        tornadoWarmupNanos = tornadoVMPlan.getWarmupDurationNanos();
        // Reset values so they are only output if they are changed in tornadoVMMasterPlan.java
        tornadoVMPlan.setCompileDurationNanos(0);
        tornadoVMPlan.setWarmupDurationNanos(0);
    }
    LastRunMetrics.setMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos, inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);

Copilot AI Apr 12, 2026

In the Qwen3 GPU path, promptNanos is derived from inferenceStartNanos - startNanos, but inferenceStartNanos is only set when entering the post-prompt else branch. As with the CPU variant, the first generated token is produced in the iteration that consumes the final prompt token (the branch that increments position), so inferenceStartNanos is recorded after generation has already started. This skews promptNanos/inferenceNanos and derived throughput/TTFT; capture inferenceStartNanos right before sampling the first generated token when prompt ingestion completes.

Comment on lines +63 to +72
    double totalSeconds = latestMetrics.totalNanos() / 1_000_000_000.0;
    // Time to First Token = Load Time + Prompt Eval time (which includes first decode step)
    long ttftNanos = loadDurationNanos + latestMetrics.promptNanos();
    double ttftSeconds = ttftNanos / 1_000_000_000.0;
    double promptSeconds = latestMetrics.promptNanos() / 1_000_000_000.0;
    double prefillThroughput = latestMetrics.promptEvalCount() / promptSeconds;
    double inferenceSeconds = latestMetrics.inferenceNanos() / 1_000_000_000.0;
    double decodeThroughput = latestMetrics.inferenceEvalCount() / inferenceSeconds;
    double tokensPerSecond = latestMetrics.totalTokens() / totalSeconds;
    System.err.printf(
            "\n\nAchieved tok/s: %.2f. Total tokens: %d, Total time: %d ns (%.2f s), Time to first Token: %d ns (%.2f s)\n"
            + "Prefill throughput: %.2f tok/s, Prompt tokens: %d, Prompt time: %d ns (%.2f s)\n"
            + "Decode throughput: %.2f tok/s, Inference tokens: %d, Inference time: %d ns (%.2f s)\n",
            tokensPerSecond, latestMetrics.totalTokens(), latestMetrics.totalNanos(), totalSeconds,
            ttftNanos, ttftSeconds,
            prefillThroughput, latestMetrics.promptEvalCount(), latestMetrics.promptNanos(), promptSeconds,
            decodeThroughput, latestMetrics.inferenceEvalCount(), latestMetrics.inferenceNanos(), inferenceSeconds);

Copilot AI Apr 12, 2026

printMetrics() computes throughputs via division by promptSeconds, inferenceSeconds, and totalSeconds without guarding for zero durations. Several legitimate flows can yield promptNanos == 0 or inferenceNanos == 0 (e.g., when no inference tokens are generated and inferenceStartNanos is forced to endNanos), which will result in Infinity/NaN being printed. Add explicit checks (e.g., duration > 0 and count > 0) and print 0 or "N/A" when the corresponding duration is zero.
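One way to add the suggested guard is a small helper like the following (a hypothetical sketch, not code from the PR):

```java
// Hypothetical guard helper, not code from the PR: returns 0.0 instead of
// Infinity/NaN when the duration or the token count is zero.
final class ThroughputSketch {
    static double tokensPerSecond(int tokens, long nanos) {
        if (tokens <= 0 || nanos <= 0) {
            return 0.0;
        }
        return tokens / (nanos / 1_000_000_000.0);
    }
}
```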

@mikepapadim
Member

@LegL0ngly thank you for your contribution, but this feature was intended to be built on top of the changes we are adding in PR #102.

@LegL0ngly
Author

Ah well, I enjoyed working on it anyway. Thanks for getting back to me quickly.

@LegL0ngly LegL0ngly closed this Apr 12, 2026

Development

Successfully merging this pull request may close these issues.

Improve performance metrics collection

4 participants