Summary
Gemini 2.5 Flash occasionally enters an infinite token-repetition loop during structured-output extraction, truncating the response mid-JSON at max_output_tokens. This is an upstream model bug, documented by Google since Oct 2025, with a ~30% hit rate on some reporters' workloads. It persists on both Flash and Pro as of April 2026, with no Google fix or ETA.
This issue documents the failure signature, the client-side mitigations we've shipped, and deferred follow-ups. Keep open until either (a) Google fixes upstream, or (b) all deferred items below ship.
Observed failure signatures
Two distinct failures that hit the same code path in nlp-service/services/llm_client.py :: extract_source():
1. Repetition loop → JSON truncation:
extract_llm_parse_failed: 1 validation error for LLMExtractionResponse
Invalid JSON: EOF while parsing a value at line 24 column 64119
[type=json_invalid, input_value='{"entities":[..."AI","AI","AI","AI",', ...]
Config that triggers it: temperature=0.0, max_output_tokens=32768, response_mime_type=application/json, response_schema=LLMExtractionResponse, thinking_budget=512.
2. Transient connection drops (unrelated root cause, same code path):
extract_llm_call_failed: Server disconnected without sending a response.
httpx RemoteProtocolError bubbling out of google-genai with no retry logic.
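For the first signature (repetition loop, then truncation), the text before the stuck tail is often still parseable once the open brackets are closed. The following is a minimal stdlib-only sketch of that idea, not the json-repair library and not the code in llm_client.py. The function name salvage_truncated_json and the max_tries parameter are hypothetical.

```python
import json
from typing import Any, List, Optional, Tuple

_CLOSER = {"{": "}", "[": "]"}


def salvage_truncated_json(text: str, max_tries: int = 50) -> Optional[Any]:
    """Return the longest parseable prefix of a truncated JSON document, or None.

    Hypothetical helper: it records a candidate cut point after every closed
    string, object, or array, then tries the latest candidates first, appending
    whatever closing brackets were still open at that point.
    """
    stack: List[str] = []            # currently open '{' / '[' characters
    cuts: List[Tuple[int, str]] = []  # (prefix length, brackets needed to close)
    in_string = False
    escaped = False

    def closers() -> str:
        return "".join(_CLOSER[c] for c in reversed(stack))

    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                cuts.append((i + 1, closers()))
        elif ch == '"':
            in_string = True
        elif ch in _CLOSER:
            stack.append(ch)
        elif ch in "}]":
            if not stack:
                break  # stray closer: stop scanning, keep earlier cut points
            stack.pop()
            cuts.append((i + 1, closers()))

    # Latest cut first; a cut right after an object key will fail to parse
    # and fall through to an earlier one.
    for end, tail in reversed(cuts[-max_tries:]):
        try:
            return json.loads(text[:end] + tail)
        except json.JSONDecodeError:
            continue
    return None


if __name__ == "__main__":
    print(salvage_truncated_json('{"entities":["Gemini","AI","AI","AI",'))
```

The salvaged value still contains whatever repeated entries were emitted before the cut, so a caller would likely want to deduplicate afterwards. Cut points are only recorded after strings and closing brackets, so a document truncated inside a bare number array (for example `[1, 2`) yields None in this sketch.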
Root cause
Repetition loop: Upstream sampler pathology on Gemini 2.5 Flash/Pro at temperature=0.0 with structured output. Well-documented by Google users; no Google fix.
Transient drops: google-genai==1.12.1 (our pinned version) has zero retry logic. HttpRetryOptions only landed in ≥1.65 — a 60-version jump we're not taking mid-debug.
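Since the pinned SDK does not retry, any retry has to live in the caller. The sketch below shows one way to do that with a bounded attempt count, a doubling backoff starting at 0.5 s, and a fresh rate-limit token per attempt. It is illustrative only: TransientError is a hypothetical stand-in for the httpx connect-class errors and google-genai API errors, and with_retry, acquire, and sleep are assumed names, not the shipped helper's signature.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Status codes treated as transient in this sketch.
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})


class TransientError(Exception):
    """Hypothetical stand-in for a dropped connection or an API error."""

    def __init__(self, message: str, status: Optional[int] = None) -> None:
        super().__init__(message)
        self.status = status  # None models a connection drop (no HTTP status)


def is_retryable(exc: BaseException) -> bool:
    if not isinstance(exc, TransientError):
        return False
    return exc.status is None or exc.status in RETRYABLE_STATUS


def with_retry(
    call: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    acquire: Callable[[], None] = lambda: None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call() up to `attempts` times, backing off between failures."""
    for attempt in range(attempts):
        acquire()  # take a rate-limit token on every attempt, retries included
        try:
            return call()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5 s, then 1 s, ...
    raise RuntimeError("attempts must be >= 1")
```

Passing acquire explicitly keeps retries inside the rate limit instead of bypassing it, and injecting sleep makes the backoff testable without waiting. A wrapper like this is only safe around calls that can be re-executed without side effects.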
Evaluated and confirmed NOT to work
- frequency_penalty / presence_penalty: NOT supported on 2.5 models; returns INVALID_ARGUMENT
- Pydantic max_length / max_items constraints: silently ignored
- Anti-repetition prompt hints alone: no reports of them working
- Lower top_p: no corroboration
- Streaming mid-body abort: streaming on 2.5 models has its own bugs (empty chunks, 400 errors)
- thinking_budget tuning: often ignored outright on 2.5 models
Applied workarounds
Shipped in two commits:
- 2234bcc fix(api): observability overhaul for compile pipeline failures
  - throwOnError helper in compile/run/route.ts replaces 8 inline status-only throws. Reads the sub-route response body (3 s race-capped), embeds first 2 000 chars in the thrown Error so failCompileProgress stores the real cause in compile_progress.error (visible in UI banner + Activity feed).
  - timed() wrapper logs [compile:<sid>] <step> <ms> on success, which makes cold-start vs steady-state latency visible.
  - .catch now console.error's raw err (stack trace), not just the stringified message.
  - compile/{extract,resolve,match,plan,crossref,draft,schema} now wraps its POST body in a top-level try/catch that logs the stack before returning 500. Next.js 16 production doesn't surface uncaught handler errors without explicit logging.
  - extract failed for {uuid}: 500 message).
  - api/health: bumped expected schema_version 15 → 16 (the v16 migration had been silently failing the healthcheck, leaving Docker flapping).
- 3ebb275 feat(nlp-service): Gemini retry wrapper + truncation salvage + usage logging
  - _with_retry helper: 3 attempts, exponential backoff (0.5/1/2 s), catches httpx connect-class errors + google-genai APIError with status in {408, 429, 500, 502, 503, 504}. Re-acquires the pyrate-limiter token per attempt so retries count against the 800 RPM cap (not bypass it).
  - extract, lint_scan, disambiguate, crossref, triage, generate_schema, select_pages. Skipped on draft_page, synthesize_answer, generate_digest: those commit downstream state and can't be safely re-executed after a partial response.
  - _log_usage helper: one stderr line per Gemini call with finish_reason + prompt_token_count / candidates_token_count / thoughts_token_count. Ground truth for distinguishing MAX_TOKENS truncation from other failure modes.
  - extract_source fallback chain on parse failure:
    1. json-repair salvage (new dep, json-repair>=0.59.4): recovers the valid JSON prefix from a repetition-loop truncation for free; ~90% of loops have extractable content before the stuck tail
    2. finish_reason == MAX_TOKENS and salvage failed → retry once with halved input; different prompt path usually escapes the loop
    3. extract_llm_parse_failed_after_fallback: first=...; second=... so compile_progress.error shows what each attempt hit
  - _EXTRACTION_SYSTEM_PROMPT ("Stop as soon as the source is covered — do NOT repeat names or pad the output"). Mixed evidence it helps; costs ~10 input tokens, kept as belt-and-suspenders.
Deferred follow-ups
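One follow-up tracked in this section is a rolling-window repetition detector that aborts a streamed response once the same slice of at least 15 characters has appeared at least 4 times in the recent output. A minimal sketch of that check is given here under stated assumptions. The window is measured in characters (400, a rough proxy for ~100 tokens), chunks is any iterable of text pieces standing in for streamed response text, and the names is_looping and collect_until_loop are hypothetical.

```python
from typing import Iterable, Tuple


def is_looping(
    text: str,
    *,
    slice_len: int = 15,
    min_repeats: int = 4,
    window_chars: int = 400,
) -> bool:
    """True if the last `slice_len` chars occur >= `min_repeats` times
    (non-overlapping) within the trailing `window_chars` of `text`."""
    window = text[-window_chars:]
    if len(window) < slice_len * min_repeats:
        return False  # not enough output yet to hold that many repeats
    tail = window[-slice_len:]
    return window.count(tail) >= min_repeats  # str.count is non-overlapping


def collect_until_loop(chunks: Iterable[str], **detector_kwargs) -> Tuple[str, bool]:
    """Accumulate streamed text; stop consuming as soon as a loop is detected.

    Returns (text_so_far, aborted).
    """
    text = ""
    for chunk in chunks:
        text += chunk
        if is_looping(text, **detector_kwargs):
            return text, True  # leaving the loop stops pulling from the stream
    return text, False


if __name__ == "__main__":
    stream = ['{"entities":["Gemini",'] + ['"AI",'] * 1000
    text, aborted = collect_until_loop(stream)
    print(aborted, len(text))
```

Counting occurrences of the trailing slice catches loops whose period does not divide the slice length, but it can also fire on legitimately repetitive JSON (many near-identical objects in a short span), so the thresholds would need tuning against real extraction output before use.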
- generate_content_stream IS supported with response_schema in 1.12.1. A rolling-window detector (same ≥15-char slice repeated ≥4× in last ~100 tokens) + break the iterator could abort at ~2-4k tokens instead of 32k. ~94% output-token savings on detection. ~1 day of work.
- gemini-2.5-flash-preview-09-2025 after two consecutive MAX_TOKENS failures. Forum reports it's more stable for the repetition bug specifically; explicitly non-GA.
- google-genai SDK upgrade to ≥1.65 for native HttpRetryOptions (our own _with_retry becomes redundant). 60-version jump from 1.12.1; needs its own investigation cycle.
References