fix: use last streaming entry for output token count #108

Merged: GuyMoses merged 2 commits into main from fix/output-token-undercount on Jun 4, 2026

Conversation

GuyMoses commented Jun 3, 2026

Contributor

Summary

  • The transcript requestId deduplication was taking the first streaming entry per API call, which carries a partial output_tokens count (1–7 tokens from the streaming start). The last entry has the final count.
  • Verified against a real transcript: 132 out of 847 requestIds had differing usage across streaming entries. The first entry consistently had output_tokens: 1 while the last had the real value (100–1280).
  • Impact: ~15% undercount of output tokens (224,818 reported vs 263,699 actual across one session). Input tokens and cache tokens were unaffected.
  • Fix: changed from skip-if-seen to last-write-wins — store the last entry per requestId in a map, then sum across all requestIds after reading the transcript.
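
The last-write-wins approach described in the summary can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the `entry` and `usage` types, the flat JSON layout, and the function names `readEntries` and `sumLastPerRequest` are all assumptions made for the example. Only the `requestId` and `output_tokens` field names come from the PR description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// usage holds the token counts discussed in this PR.
// The exact JSON layout is an assumption for this sketch.
type usage struct {
	InputTokens  int `json:"input_tokens"`
	OutputTokens int `json:"output_tokens"`
}

// entry is a hypothetical transcript record, not the plugin's real type.
type entry struct {
	Type      string `json:"type"`
	RequestID string `json:"requestId"`
	Usage     usage  `json:"usage"`
}

// readEntries decodes a stream of JSON objects. json.Decoder accepts
// both one-object-per-line and pretty-printed input.
func readEntries(r io.Reader) ([]entry, error) {
	var entries []entry
	dec := json.NewDecoder(r)
	for {
		var e entry
		err := dec.Decode(&e)
		if errors.Is(err, io.EOF) {
			return entries, nil
		}
		if err != nil {
			return nil, err
		}
		entries = append(entries, e)
	}
}

// sumLastPerRequest keeps the last usage seen for each requestId and
// sums across requestIds only after every entry has been visited.
// Assumption: entries without a requestId are ignored in this sketch.
func sumLastPerRequest(entries []entry) usage {
	last := make(map[string]usage)
	for _, e := range entries {
		if e.Type != "assistant" || e.RequestID == "" {
			continue
		}
		last[e.RequestID] = e.Usage // later streaming entries overwrite earlier ones
	}
	var total usage
	for _, u := range last {
		total.InputTokens += u.InputTokens
		total.OutputTokens += u.OutputTokens
	}
	return total
}

func main() {
	transcript := `
{"type":"assistant","requestId":"req_1","usage":{"input_tokens":10,"output_tokens":1}}
{"type":"assistant","requestId":"req_1","usage":{"input_tokens":10,"output_tokens":150}}
{"type":"assistant","requestId":"req_2","usage":{"input_tokens":20,"output_tokens":40}}
`
	entries, err := readEntries(strings.NewReader(transcript))
	if err != nil {
		panic(err)
	}
	total := sumLastPerRequest(entries)
	fmt.Println(total.InputTokens, total.OutputTokens) // prints: 30 190
}
```

With skip-if-seen, `req_1` would have contributed `output_tokens: 1` instead of 150. Input tokens come out the same under either strategy in this example, consistent with the observation that only output tokens were undercounted.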

Test plan

  • TestReadTurnUsageDeduplicatesRequestID updated to use different output_tokens values across streaming entries (1 vs 150) and assert the final value is used
  • All existing transcript tests pass (single assistant, multi-iteration aggregation, turn reset, tool_result handling, pretty-printed JSON, partial fields)
  • Full test suite passes (go test ./...)
  • Deploy and compare output token totals against native claude_code.token.usage counter

The transcript deduplication was taking the first entry per requestId,
which carries a partial output_tokens count (1-7 tokens from the
streaming start). The final entry has the real count. This caused a ~15%
undercount of output tokens compared to native telemetry.

Change from skip-if-seen to last-write-wins: store the last entry per
requestId, then sum across all requestIds after reading the transcript.

GuyMoses commented Jun 3, 2026

Contributor Author

Plugin vs Native Token Gap — Investigation Summary

After fixing the streaming dedup bug in this PR, I investigated the remaining token gap between the plugin and native Claude Code telemetry.

Root cause

The transcript file does not record token usage for background API calls. Two types are missing:

| Background call | Frequency | Model | Token impact | Cost impact |
| --- | --- | --- | --- | --- |
| Title generation (ai-title entries) | ~35/session | Haiku | ~14K input, ~350 output | ~$0.02 (negligible) |
| Context compaction (away_summary entries) | ~10/session | Main model (Opus) | ~1M input (mostly cache reads), ~3K output | ~$0.50–1.50 |

Native tracing captures both as claude_code.llm_request spans with full token counts. The plugin only reads assistant transcript entries, which don't include these calls.

Magnitude

On a real 29-turn Opus session ($19.30 total):

  • Background gap: ~$0.50–1.50 (~3–8% of session cost)
  • Compaction dominates; title gen is <0.1%
  • Gap grows with session length (more compaction events)

SubagentStop events were silently dropped, losing 28-71% of session
tokens. Route SubagentStop through sendLLMTrace, reading token usage
from the sub-agent transcript file and emitting invoke_agent spans
with gen_ai.agent.name, parented under the Agent tool call span.
GuyMoses merged commit 1e64ae2 into main on Jun 4, 2026
4 checks passed
GuyMoses deleted the fix/output-token-undercount branch June 4, 2026 10:42