
Detailed Performance Metrics (Resolves Issue #104) #106

Closed
LegL0ngly wants to merge 8 commits into beehive-lab:main from LegL0ngly:feat/performance-metrics

Conversation

@LegL0ngly

Resolves #104

Added functionality to keep track of the fine-grained performance metrics described in Issue #104.

In LastRunMetrics:

  • Modified the record to accept all necessary variables (except for model load duration) as parameters and updated the record's documentation accordingly.
  • Added a static long for the load duration, because it does not change between prints. Also added a method for setting the load duration, to be called from ModelLoader.
  • Modified the existing print statement to print new parameters, and print durations in both nanoseconds and seconds. Also created if statements to print the model load duration only once, and the TornadoVM metrics only when using the GPU.
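Taken together, the LastRunMetrics changes above might look roughly like the following sketch (all names here are guesses based on this description, not the actual source):

```java
// Hypothetical sketch; record components and method names are inferred from
// the PR description, not copied from the actual source.
record LastRunMetrics(int totalTokens, long totalNanos, int promptEvalCount, long promptNanos,
                      int inferenceEvalCount, long inferenceNanos,
                      long tornadoCompileNanos, long tornadoWarmupNanos) {

    // Static, because the load duration is measured once by the model loader
    // and does not change between prints.
    private static long loadDurationNanos;
    private static LastRunMetrics latestMetrics;

    static void setLoadDurationNanos(long nanos) {
        loadDurationNanos = nanos;
    }

    static void setMetrics(int totalTokens, long totalNanos, int promptEvalCount, long promptNanos,
                           int inferenceEvalCount, long inferenceNanos,
                           long tornadoCompileNanos, long tornadoWarmupNanos) {
        latestMetrics = new LastRunMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos,
                inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);
    }

    static LastRunMetrics latest() {
        return latestMetrics;
    }

    static void printMetrics() {
        LastRunMetrics m = latestMetrics;
        System.err.printf("Total time: %d ns (%.2f s)%n",
                m.totalNanos(), m.totalNanos() / 1_000_000_000.0);
        if (loadDurationNanos > 0) {
            System.err.printf("Load time: %d ns%n", loadDurationNanos);
            loadDurationNanos = 0; // print the load duration only once
        }
        if (m.tornadoCompileNanos() > 0) { // only set on GPU runs
            System.err.printf("Compile: %d ns, Warmup: %d ns%n",
                    m.tornadoCompileNanos(), m.tornadoWarmupNanos());
        }
    }
}
```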

In InferenceEngine:

  • Sent new metric information to the setMetrics method in LastRunMetrics
  • TornadoVM metrics are retrieved from TornadoVMMasterPlan when using GPU, otherwise they are set to 0, so they are not printed.
  • Model load time is NOT retrieved here; the load duration is sent to LastRunMetrics from ModelLoader because it does not need to be retrieved after every response.
  • Added a timer for inference start in CPU Phi3, which was previously missing
  • Changed the name of the TornadoVMMasterPlan object in the GPU Granite method to be consistent with the naming of all other TornadoVMMasterPlan objects in InferenceEngine.

In ModelLoader:

  • Added simple model load duration timers and logic; the measured times are sent to LastRunMetrics.
  • Declared a new model instead of returning the result of modelType.loadModel immediately to time the loading duration properly.
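The loader change above can be sketched like this (the Supplier stands in for the real modelType.loadModel call and the consumer for the LastRunMetrics setter; all names here are illustrative):

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Illustrative only: the Supplier stands in for modelType.loadModel(...),
// and reportNanos for the call into LastRunMetrics.
final class LoadTimingSketch {
    static <M> M loadTimed(Supplier<M> loadModel, LongConsumer reportNanos) {
        long startNanos = System.nanoTime();
        // Assign to a local instead of returning loadModel.get() directly,
        // so the elapsed time covers the full load before returning.
        M model = loadModel.get();
        reportNanos.accept(System.nanoTime() - startNanos);
        return model;
    }
}
```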

In TornadoVMMasterPlan:

  • Added variables to keep track of compile and warmup duration, along with corresponding accessor and mutator methods. The accessors are called from InferenceEngine and their values passed to the setMetrics method in LastRunMetrics; the mutators are called when times are calculated within TornadoVMMasterPlan.
  • Timing variables were already in place, but were only written to when a debugging flag was set. They are now always written to; the if statements for the flag remain and still print the same information when the flag is used.
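The record-always, print-under-flag pattern described above can be sketched like this (class, flag, and method names are made up for illustration):

```java
// Illustrative names: the duration is always recorded, while printing stays
// behind the pre-existing debug flag as described above.
final class PlanTimingSketch {
    static final boolean DEBUG_TIMING = Boolean.getBoolean("sketch.debug.timing");

    private long compileDurationNanos;

    long getCompileDurationNanos() { return compileDurationNanos; }
    void setCompileDurationNanos(long nanos) { compileDurationNanos = nanos; }

    void compile(Runnable doCompile) {
        long startNanos = System.nanoTime();
        doCompile.run();
        long elapsed = System.nanoTime() - startNanos;
        setCompileDurationNanos(elapsed);   // always written now
        if (DEBUG_TIMING) {                 // unchanged debug output when flagged
            System.out.println("compile took " + elapsed + " ns");
        }
    }
}
```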

One last note not related to the changes:
In the contributing instructions, contributors are instructed to open PRs against the "develop" branch. It appears that branch no longer exists and the instructions are outdated, which might be confusing for new contributors.

@CLAassistant

CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

Copilot AI left a comment

Pull request overview

This PR implements fine-grained performance metrics (nanosecond precision) to address Issue #104, expanding runtime reporting to include model load, prompt evaluation, decode, and TornadoVM-specific initialization timings.

Changes:

  • Expanded LastRunMetrics to store/report detailed durations and token counts (plus TTFT and throughputs).
  • Instrumented ModelLoader to measure and report model load duration.
  • Added TornadoVM compile/warmup duration tracking and propagated these metrics through InferenceEngine.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlan.java Tracks and exposes TornadoVM compile/warmup durations for downstream metrics reporting.
src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java Measures model load time and forwards it to LastRunMetrics.
src/main/java/org/beehive/gpullama3/inference/InferenceEngine.java Collects per-run timing/token metrics (prompt vs decode) and forwards TornadoVM timings when applicable.
src/main/java/org/beehive/gpullama3/auxiliary/LastRunMetrics.java Stores expanded metrics and prints detailed performance output (including TTFT and throughputs).


Comment on lines +90 to +92
    // Calculate and Set Total Warmup Time
    long copyTime = System.nanoTime();
    tornadoVMPlan.setWarmupDurationNanos(copyTime - planCreationTime);

Copilot AI Apr 12, 2026

warmupDurationNanos is currently set as (copyTime - planCreationTime), which includes both the JIT warmup (withPreCompilation()) and the subsequent forceCopyInReadOnlyDataLayered() transfer. If the intent is to report the Issue #104 "task graph warmup duration" (warmup/execution until steady state), this should be measured against warmupTime (end of withPreCompilation) and exclude the copy/transfer step, or alternatively split into separate warmup vs copy/transfer metrics.

Suggested change
-    // Calculate and Set Total Warmup Time
-    long copyTime = System.nanoTime();
-    tornadoVMPlan.setWarmupDurationNanos(copyTime - planCreationTime);
+    // Calculate and set warmup time excluding the copy/transfer step
+    long copyTime = System.nanoTime();
+    tornadoVMPlan.setWarmupDurationNanos(warmupTime - planCreationTime);

Comment on lines 222 to +235
    // Calculate and print performance metrics
    long endNanos = System.nanoTime();
    double totalTimeSeconds = (endNanos - startNanos) / 1_000_000_000.0;
    if (inferenceStartNanos == 0) {
        inferenceStartNanos = endNanos; // Prevents negative time if no tokens were generated
    }
    int totalTokens = promptIndex + generatedTokens.size();

    LastRunMetrics.setMetrics(totalTokens, totalTimeSeconds);
    long totalNanos = (endNanos - startNanos);
    int promptEvalCount = promptIndex;
    long promptNanos = inferenceStartNanos - startNanos;
    int inferenceEvalCount = generatedTokens.size();
    long inferenceNanos = endNanos - inferenceStartNanos;
    long tornadoCompileNanos = 0;
    long tornadoWarmupNanos = 0;
    LastRunMetrics.setMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos, inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);

Copilot AI Apr 12, 2026

In the Qwen3 CPU path, promptNanos is derived from inferenceStartNanos - startNanos, but inferenceStartNanos is only set in the else branch of the main loop. The first generated token is produced in the same iteration when the last prompt token is ingested (the branch that does position++), so inferenceStartNanos ends up being recorded after at least one decode step has already happened. This will misattribute some decode time into promptNanos and undercount inferenceNanos; consider capturing inferenceStartNanos immediately before sampling the first generated token when prompt ingestion completes.
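The fix this comment suggests, capturing inferenceStartNanos as soon as prompt ingestion completes, can be sketched as follows (a simplified stand-in loop, not the actual engine code):

```java
// Simplified generation loop (illustrative, not the actual InferenceEngine
// code) showing where inferenceStartNanos should be captured.
final class TtftSketch {
    // Returns { promptNanos, inferenceNanos } for promptLen prompt tokens
    // followed by genLen generated tokens.
    static long[] run(int promptLen, int genLen) {
        long startNanos = System.nanoTime();
        long inferenceStartNanos = 0;
        for (int pos = 0; pos < promptLen + genLen; pos++) {
            if (pos == promptLen - 1) {
                // Prompt ingestion completes in this iteration: capture the
                // decode start BEFORE sampling the first generated token, so
                // the first decode step is not counted as prompt time.
                inferenceStartNanos = System.nanoTime();
            }
            // ... forward pass and token sampling would happen here ...
        }
        long endNanos = System.nanoTime();
        if (inferenceStartNanos == 0) {
            inferenceStartNanos = endNanos; // no tokens were generated
        }
        return new long[] { inferenceStartNanos - startNanos, endNanos - inferenceStartNanos };
    }
}
```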

Comment on lines +493 to +514
    // Calculate and Print Performance Metrics
    long endNanos = System.nanoTime();
    double totalTimeSeconds = (endNanos - startNanos) / 1_000_000_000.0;
    if (inferenceStartNanos == 0) {
        inferenceStartNanos = endNanos; // Prevents negative time if no tokens were generated
    }
    int totalTokens = promptIndex + generatedTokens.size();

    LastRunMetrics.setMetrics(totalTokens, totalTimeSeconds);
    long totalNanos = (endNanos - startNanos);
    int promptEvalCount = promptIndex;
    long promptNanos = inferenceStartNanos - startNanos;
    int inferenceEvalCount = generatedTokens.size();
    long inferenceNanos = endNanos - inferenceStartNanos;
    long tornadoCompileNanos = 0;
    long tornadoWarmupNanos = 0;
    // If statement to prevent inadvertent crashes from future features
    if (tornadoVMPlan != null) {
        tornadoCompileNanos = tornadoVMPlan.getCompileDurationNanos();
        tornadoWarmupNanos = tornadoVMPlan.getWarmupDurationNanos();
        // Reset values so they are only output if they are changed in tornadoVMMasterPlan.java
        tornadoVMPlan.setCompileDurationNanos(0);
        tornadoVMPlan.setWarmupDurationNanos(0);
    }
    LastRunMetrics.setMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos, inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);

Copilot AI Apr 12, 2026

In the Qwen3 GPU path, promptNanos is derived from inferenceStartNanos - startNanos, but inferenceStartNanos is only set when entering the post-prompt else branch. As with the CPU variant, the first generated token is produced in the iteration that consumes the final prompt token (the branch that increments position), so inferenceStartNanos is recorded after generation has already started. This skews promptNanos/inferenceNanos and derived throughput/TTFT; capture inferenceStartNanos right before sampling the first generated token when prompt ingestion completes.

Comment on lines +63 to +72
    double totalSeconds = latestMetrics.totalNanos() / 1_000_000_000.0;
    // Time to First Token = Load Time + Prompt Eval time (which includes first decode step)
    long ttftNanos = loadDurationNanos + latestMetrics.promptNanos();
    double ttftSeconds = ttftNanos / 1_000_000_000.0;
    double promptSeconds = latestMetrics.promptNanos() / 1_000_000_000.0;
    double prefillThroughput = latestMetrics.promptEvalCount() / promptSeconds;
    double inferenceSeconds = latestMetrics.inferenceNanos() / 1_000_000_000.0;
    double decodeThroughput = latestMetrics.inferenceEvalCount() / inferenceSeconds;
    double tokensPerSecond = latestMetrics.totalTokens() / totalSeconds;
    System.err.printf(
            "\n\nAchieved tok/s: %.2f. Total tokens: %d, Total time: %d ns (%.2f s), Time to first Token: %d ns (%.2f s)\n"
            + "Prefill throughput: %.2f tok/s, Prompt tokens: %d, Prompt time: %d ns (%.2f s)\n"
            + "Decode throughput: %.2f tok/s, Inference tokens: %d, Inference time: %d ns (%.2f s)\n",
            tokensPerSecond, latestMetrics.totalTokens(), latestMetrics.totalNanos(), totalSeconds,
            ttftNanos, ttftSeconds,
            prefillThroughput, latestMetrics.promptEvalCount(), latestMetrics.promptNanos(), promptSeconds,
            decodeThroughput, latestMetrics.inferenceEvalCount(), latestMetrics.inferenceNanos(), inferenceSeconds);

Copilot AI Apr 12, 2026

printMetrics() computes throughputs via division by promptSeconds, inferenceSeconds, and totalSeconds without guarding for zero durations. Several legitimate flows can yield promptNanos == 0 or inferenceNanos == 0 (e.g., when no inference tokens are generated and inferenceStartNanos is forced to endNanos), which will result in Infinity/NaN being printed. Add explicit checks (e.g., duration > 0 and count > 0) and print 0 or "N/A" when the corresponding duration is zero.
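One way to add the suggested guard is a small helper like the following (a hypothetical sketch, not code from the PR):

```java
// Hypothetical guard helper, not code from the PR: returns 0.0 instead of
// Infinity/NaN when the duration or the token count is zero.
final class ThroughputSketch {
    static double tokensPerSecond(int tokens, long nanos) {
        if (tokens <= 0 || nanos <= 0) {
            return 0.0;
        }
        return tokens / (nanos / 1_000_000_000.0);
    }
}
```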

@mikepapadim
Member

@LegL0ngly thank you for your contribution, but this feature was intended to be built on top of the changes we are adding in PR #102.

@LegL0ngly
Author

Ah well, I enjoyed working on it anyway. Thanks for getting back to me quickly.

@LegL0ngly LegL0ngly closed this Apr 12, 2026

Development

Successfully merging this pull request may close these issues.

Improve performance metrics collection

4 participants