Detailed Performance Metrics (Resolves Issue #104) #106
LegL0ngly wants to merge 8 commits into beehive-lab:main from
Conversation
Added the requested metrics, in addition to the TornadoVM-related ones and the model load duration; they reference other scripts.
Pull request overview
This PR implements fine-grained performance metrics (nanosecond precision) to address Issue #104, expanding runtime reporting to include model load, prompt evaluation, decode, and TornadoVM-specific initialization timings.
Changes:
- Expanded `LastRunMetrics` to store/report detailed durations and token counts (plus TTFT and throughputs).
- Instrumented `ModelLoader` to measure and report model load duration.
- Added TornadoVM compile/warmup duration tracking and propagated these metrics through `InferenceEngine`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/main/java/org/beehive/gpullama3/tornadovm/TornadoVMMasterPlan.java | Tracks and exposes TornadoVM compile/warmup durations for downstream metrics reporting. |
| src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java | Measures model load time and forwards it to LastRunMetrics. |
| src/main/java/org/beehive/gpullama3/inference/InferenceEngine.java | Collects per-run timing/token metrics (prompt vs decode) and forwards TornadoVM timings when applicable. |
| src/main/java/org/beehive/gpullama3/auxiliary/LastRunMetrics.java | Stores expanded metrics and prints detailed performance output (including TTFT and throughputs). |
```java
// Calculate and Set Total Warmup Time
long copyTime = System.nanoTime();
tornadoVMPlan.setWarmupDurationNanos(copyTime - planCreationTime);
```
warmupDurationNanos is currently set as (copyTime - planCreationTime), which includes both the JIT warmup (withPreCompilation()) and the subsequent forceCopyInReadOnlyDataLayered() transfer. If the intent is to report the Issue #104 "task graph warmup duration" (warmup/execution until steady state), this should be measured against warmupTime (end of withPreCompilation) and exclude the copy/transfer step, or alternatively split into separate warmup vs copy/transfer metrics.
Suggested change:

```diff
-// Calculate and Set Total Warmup Time
+// Calculate and set warmup time excluding the copy/transfer step
 long copyTime = System.nanoTime();
-tornadoVMPlan.setWarmupDurationNanos(copyTime - planCreationTime);
+tornadoVMPlan.setWarmupDurationNanos(warmupTime - planCreationTime);
```
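An alternative to the suggestion above is the "split metrics" option the comment mentions: record warmup and copy/transfer as separate durations. The sketch below is a minimal, self-contained illustration; `PlanStub` and its fields stand in for the PR's `TornadoVMMasterPlan` (which only has a single `warmupDurationNanos` setter), so the `copyDurationNanos` field is an assumed addition, not existing API.

```java
public class WarmupTimingSketch {
    // Hypothetical stand-in for TornadoVMMasterPlan; the copyDurationNanos
    // field is an assumed extension for separate warmup vs copy metrics.
    static class PlanStub {
        long warmupDurationNanos;
        long copyDurationNanos;
        void withPreCompilation() { /* JIT warmup placeholder */ }
        void forceCopyInReadOnlyDataLayered() { /* data-transfer placeholder */ }
    }

    static PlanStub measure() {
        PlanStub plan = new PlanStub();
        long planCreationTime = System.nanoTime();
        plan.withPreCompilation();
        long warmupTime = System.nanoTime();   // end of JIT warmup
        plan.forceCopyInReadOnlyDataLayered();
        long copyTime = System.nanoTime();     // end of data transfer
        // Attribute each phase separately instead of lumping both into warmup.
        plan.warmupDurationNanos = warmupTime - planCreationTime;
        plan.copyDurationNanos = copyTime - warmupTime;
        return plan;
    }
}
```

This keeps the Issue #104 "task graph warmup duration" free of transfer time while still surfacing the copy cost.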
```diff
 // Calculate and print performance metrics
 long endNanos = System.nanoTime();
 double totalTimeSeconds = (endNanos - startNanos) / 1_000_000_000.0;
 if (inferenceStartNanos == 0) {
     inferenceStartNanos = endNanos; // Prevents negative time if no tokens were generated
 }
 int totalTokens = promptIndex + generatedTokens.size();

-LastRunMetrics.setMetrics(totalTokens, totalTimeSeconds);
+long totalNanos = (endNanos - startNanos);
+int promptEvalCount = promptIndex;
+long promptNanos = inferenceStartNanos - startNanos;
+int inferenceEvalCount = generatedTokens.size();
+long inferenceNanos = endNanos - inferenceStartNanos;
+long tornadoCompileNanos = 0;
+long tornadoWarmupNanos = 0;
+LastRunMetrics.setMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos, inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);
```
In the Qwen3 CPU path, promptNanos is derived from inferenceStartNanos - startNanos, but inferenceStartNanos is only set in the else branch of the main loop. The first generated token is produced in the same iteration when the last prompt token is ingested (the branch that does position++), so inferenceStartNanos ends up being recorded after at least one decode step has already happened. This will misattribute some decode time into promptNanos and undercount inferenceNanos; consider capturing inferenceStartNanos immediately before sampling the first generated token when prompt ingestion completes.
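The fix the comment describes can be sketched as a simplified generation loop. This is an illustrative simulation, not the PR's actual `InferenceEngine` code: the token-processing bodies are placeholders, and the point is only where `inferenceStartNanos` is captured — right when the last prompt token is consumed, before the first decode/sampling step.

```java
public class TimingSketch {
    // Returns { promptNanos, inferenceNanos } for a simulated run.
    static long[] run(int promptLen, int genLen) {
        long startNanos = System.nanoTime();
        long inferenceStartNanos = startNanos; // safe default for empty prompts
        for (int i = 0; i < promptLen + genLen; i++) {
            if (i < promptLen) {
                // ingest prompt token (placeholder)
                if (i == promptLen - 1) {
                    // Prompt fully ingested: mark decode start BEFORE sampling
                    // the first generated token, so no decode time leaks into
                    // the prompt-eval measurement.
                    inferenceStartNanos = System.nanoTime();
                }
            } else {
                // decode step: sampling would happen here (placeholder)
            }
        }
        long endNanos = System.nanoTime();
        return new long[] { inferenceStartNanos - startNanos, endNanos - inferenceStartNanos };
    }
}
```

Capturing the timestamp at that boundary keeps `promptNanos` strictly prompt ingestion and attributes every decode step, including the first, to `inferenceNanos`.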
```diff
 // Calculate and Print Performance Metrics
 long endNanos = System.nanoTime();
 double totalTimeSeconds = (endNanos - startNanos) / 1_000_000_000.0;
 if (inferenceStartNanos == 0) {
     inferenceStartNanos = endNanos; // Prevents negative time if no tokens were generated
 }
 int totalTokens = promptIndex + generatedTokens.size();

-LastRunMetrics.setMetrics(totalTokens, totalTimeSeconds);
+long totalNanos = (endNanos - startNanos);
+int promptEvalCount = promptIndex;
+long promptNanos = inferenceStartNanos - startNanos;
+int inferenceEvalCount = generatedTokens.size();
+long inferenceNanos = endNanos - inferenceStartNanos;
+long tornadoCompileNanos = 0;
+long tornadoWarmupNanos = 0;
+// If statement to prevent inadvertent crashes from future features
+if (tornadoVMPlan != null) {
+    tornadoCompileNanos = tornadoVMPlan.getCompileDurationNanos();
+    tornadoWarmupNanos = tornadoVMPlan.getWarmupDurationNanos();
+    // Reset values so they are only output if they are changed in tornadoVMMasterPlan.java
+    tornadoVMPlan.setCompileDurationNanos(0);
+    tornadoVMPlan.setWarmupDurationNanos(0);
+}
+LastRunMetrics.setMetrics(totalTokens, totalNanos, promptEvalCount, promptNanos, inferenceEvalCount, inferenceNanos, tornadoCompileNanos, tornadoWarmupNanos);
```
In the Qwen3 GPU path, promptNanos is derived from inferenceStartNanos - startNanos, but inferenceStartNanos is only set when entering the post-prompt else branch. As with the CPU variant, the first generated token is produced in the iteration that consumes the final prompt token (the branch that increments position), so inferenceStartNanos is recorded after generation has already started. This skews promptNanos/inferenceNanos and derived throughput/TTFT; capture inferenceStartNanos right before sampling the first generated token when prompt ingestion completes.
```java
double totalSeconds = latestMetrics.totalNanos() / 1_000_000_000.0;
// Time to First Token = Load Time + Prompt Eval time (which includes first decode step)
long ttftNanos = loadDurationNanos + latestMetrics.promptNanos();
double ttftSeconds = ttftNanos / 1_000_000_000.0;
double promptSeconds = latestMetrics.promptNanos() / 1_000_000_000.0;
double prefillThroughput = latestMetrics.promptEvalCount() / promptSeconds;
double inferenceSeconds = latestMetrics.inferenceNanos() / 1_000_000_000.0;
double decodeThroughput = latestMetrics.inferenceEvalCount() / inferenceSeconds;
double tokensPerSecond = latestMetrics.totalTokens() / totalSeconds;
System.err.printf("\n\nAchieved tok/s: %.2f. Total tokens: %d, Total time: %d ns (%.2f s), Time to first Token: %d ns (%.2f s)\nPrefill throughput: %.2f tok/s, Prompt tokens: %d, Prompt time: %d ns (%.2f s)\nDecode throughput: %.2f tok/s, Inference tokens: %d, Inference time: %d ns (%.2f s)\n", tokensPerSecond, latestMetrics.totalTokens(), latestMetrics.totalNanos(), totalSeconds, ttftNanos, ttftSeconds, prefillThroughput, latestMetrics.promptEvalCount(), latestMetrics.promptNanos(), promptSeconds, decodeThroughput, latestMetrics.inferenceEvalCount(), latestMetrics.inferenceNanos(), inferenceSeconds);
```
printMetrics() computes throughputs via division by promptSeconds, inferenceSeconds, and totalSeconds without guarding for zero durations. Several legitimate flows can yield promptNanos == 0 or inferenceNanos == 0 (e.g., when no inference tokens are generated and inferenceStartNanos is forced to endNanos), which will result in Infinity/NaN being printed. Add explicit checks (e.g., duration > 0 and count > 0) and print 0 or "N/A" when the corresponding duration is zero.
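The guard the comment asks for can be factored into a small helper. This is a sketch under the assumption that returning 0.0 (rather than printing "N/A") is acceptable; the helper name `safeThroughput` is invented for illustration and is not part of the PR.

```java
public class MetricsGuard {
    // Guarded throughput: returns 0.0 when the token count or duration is
    // non-positive, so the printed metrics never show Infinity or NaN.
    static double safeThroughput(int tokenCount, long durationNanos) {
        if (tokenCount <= 0 || durationNanos <= 0) {
            return 0.0;
        }
        return tokenCount / (durationNanos / 1_000_000_000.0);
    }
}
```

In `printMetrics()` each of the three divisions (prefill, decode, total) would go through this helper instead of dividing directly.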
@LegL0ngly thank you for your contribution, but this feature was intended to be built on top of the changes we are adding in PR #102.

Ah well, I enjoyed working on it anyway. Thanks for getting back to me quickly.
Resolves #104
Added functionality to keep track of the fine-grained performance metrics described in Issue #104.
In LastRunMetrics:
In InferenceEngine:
In ModelLoader:
In TornadoVMMasterPlan:
One last note not related to the changes:
In the contributing instructions, contributors are directed to open PRs against the "develop" branch. It appears that branch no longer exists and the instructions are outdated, which might be confusing for new contributors.