Gemini 2.5 Flash structured-output repetition loop — upstream bug + client-side workaround #7

@tuirk

Summary

Gemini 2.5 Flash occasionally enters a token-repetition loop during structured-output extraction and keeps emitting the same tokens until it hits max_output_tokens, which truncates the response mid-JSON. This is an upstream model bug, documented by Google since Oct 2025, with a ~30% hit rate on some reporters' workloads. It persists on both Flash and Pro as of April 2026, with no Google fix or ETA.

This issue documents the failure signature, the client-side mitigations we've shipped, and deferred follow-ups. Keep open until either (a) Google fixes upstream, or (b) all deferred items below ship.

Observed failure signatures

Two distinct failures that hit the same code path in nlp-service/services/llm_client.py :: extract_source():

1. Repetition loop → JSON truncation:

extract_llm_parse_failed: 1 validation error for LLMExtractionResponse
  Invalid JSON: EOF while parsing a value at line 24 column 64119
  [type=json_invalid, input_value='{"entities":[..."AI","AI","AI","AI",', ...]

Config that triggers it: temperature=0.0, max_output_tokens=32768, response_mime_type=application/json, response_schema=LLMExtractionResponse, thinking_budget=512.

2. Transient connection drops (unrelated root cause, same code path):

extract_llm_call_failed: Server disconnected without sending a response.

httpx RemoteProtocolError bubbling out of google-genai with no retry logic.

Root cause

Repetition loop: Upstream sampler pathology on Gemini 2.5 Flash/Pro at temperature=0.0 with structured output. Well-documented by Google users; no Google fix.

Transient drops: google-genai==1.12.1 (our pinned version) has zero retry logic. HttpRetryOptions only landed in ≥1.65 — a 60-version jump we're not taking mid-debug.
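
Because the pinned SDK does not retry, transient drops have to be absorbed client-side. Below is a minimal sketch of a backoff wrapper in the spirit of the _with_retry helper listed under Applied workarounds; it is not the shipped code. TransientError, is_retryable, with_retry, and the acquire_token and sleep parameters are hypothetical names introduced for illustration. In the real service the exception check would presumably target httpx connect-class errors and google-genai's APIError, and acquire_token would stand in for the pyrate-limiter acquisition. The sketch also assumes that backoff sleeps happen only between attempts.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Status codes treated as transient, as listed in this issue.
RETRYABLE_STATUS = frozenset({408, 429, 500, 502, 503, 504})


class TransientError(Exception):
    """Hypothetical stand-in for an SDK/transport error with an optional HTTP status."""

    def __init__(self, message: str, status: Optional[int] = None) -> None:
        super().__init__(message)
        self.status = status


def is_retryable(exc: Exception) -> bool:
    """Retry connection drops (no status) and the listed HTTP statuses only."""
    if not isinstance(exc, TransientError):
        return False
    return exc.status is None or exc.status in RETRYABLE_STATUS


def with_retry(
    call: Callable[[], T],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    acquire_token: Callable[[], None] = lambda: None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call` up to `max_attempts` times with exponential backoff."""
    for attempt in range(max_attempts):
        # Assumption: the rate-limit token is re-acquired on every attempt,
        # so retries count against the request cap.
        acquire_token()
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * 2**attempt)  # delay doubles after each failed attempt
    raise AssertionError("unreachable when max_attempts >= 1")
```

Passing sleep and acquire_token as parameters keeps the wrapper testable without real delays or a real limiter. Only read-only calls should go through a wrapper like this, for the reason given under Applied workarounds.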

Evaluated and confirmed NOT to work

  • frequency_penalty / presence_penalty — NOT supported on 2.5 models; returns INVALID_ARGUMENT
  • Pydantic max_length / max_items constraints — silently ignored
  • Anti-repetition prompt hints alone — no reports of them working
  • Lower top_p — no corroboration
  • Streaming mid-body abort — streaming on 2.5 models has its own bugs (empty chunks, 400 errors)
  • thinking_budget tuning — often ignored outright on 2.5 models

Applied workarounds

Shipped in two commits:

  • 2234bcc — fix(api): observability overhaul for compile pipeline failures

    • Shared throwOnError helper in compile/run/route.ts replaces 8 inline status-only throws. Reads the sub-route response body (3 s race-capped), embeds first 2 000 chars in the thrown Error so failCompileProgress stores the real cause in compile_progress.error (visible in UI banner + Activity feed).
    • timed() wrapper logs [compile:<sid>] <step> <ms> on success — makes cold-start vs steady-state latency visible.
    • Top-level .catch now console.error's raw err (stack trace), not just the stringified message.
    • Each of compile/{extract,resolve,match,plan,crossref,draft,schema} now wraps its POST body in a top-level try/catch that logs the stack before returning 500. Next.js 16 production doesn't surface uncaught handler errors without explicit logging.
    • Side effect: UUIDs no longer leak into activity-feed error strings (they were baked into the old "extract failed for {uuid}: 500" message).
    • api/health: bumped expected schema_version 15 → 16 (the v16 migration had been silently failing the healthcheck, leaving Docker flapping).
  • 3ebb275 — feat(nlp-service): Gemini retry wrapper + truncation salvage + usage logging

    • _with_retry helper: 3 attempts, exponential backoff (0.5/1/2 s), catches httpx connect-class errors + google-genai APIError with status in {408, 429, 500, 502, 503, 504}. Re-acquires the pyrate-limiter token per attempt so retries count against the 800 RPM cap rather than bypassing it.
    • Applied to 7 read-only analyzers: extract, lint_scan, disambiguate, crossref, triage, generate_schema, select_pages. Skipped on draft_page, synthesize_answer, generate_digest — those commit downstream state and can't be safely re-executed after a partial response.
    • _log_usage helper: one stderr line per Gemini call with finish_reason + prompt_token_count / candidates_token_count / thoughts_token_count. Ground truth for distinguishing MAX_TOKENS truncation from other failure modes.
    • extract_source fallback chain on parse failure:
      1. json-repair salvage (new dep, json-repair>=0.59.4) — recovers the valid JSON prefix from a repetition-loop truncation for free; ~90% of loops have extractable content before the stuck tail
      2. If finish_reason == MAX_TOKENS and salvage failed → retry once with halved input; different prompt path usually escapes the loop
      3. On all-fail → chain both errors in extract_llm_parse_failed_after_fallback: first=...; second=... so compile_progress.error shows what each attempt hit
    • Anti-repetition hint added to _EXTRACTION_SYSTEM_PROMPT ("Stop as soon as the source is covered — do NOT repeat names or pad the output"). Mixed evidence it helps; costs ~10 input tokens, kept as belt-and-suspenders.
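
The extract_source fallback chain described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped implementation: extract_with_fallback, ExtractionError, call_model, and salvage are hypothetical names, call_model is assumed to return the raw text plus a finish-reason string, and "halved input" is assumed to mean keeping the first half of the source characters. The salvage step is injected as a callable so the sketch stays standard-library only; in the real service that role is played by json-repair.

```python
import json
from typing import Any, Callable, Optional, Tuple


class ExtractionError(Exception):
    """Hypothetical error type carrying the chained failure message."""


def extract_with_fallback(
    source: str,
    call_model: Callable[[str], Tuple[str, str]],  # returns (raw_text, finish_reason)
    salvage: Callable[[str], Optional[Any]],       # returns parsed data, or None on failure
) -> Any:
    raw, finish_reason = call_model(source)
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        first = exc

    # Step 1: try to recover the valid prefix of the truncated payload.
    salvaged = salvage(raw)
    if salvaged is not None:
        return salvaged

    # Step 2: retry once with a shorter input, but only for MAX_TOKENS truncation.
    if finish_reason != "MAX_TOKENS":
        raise ExtractionError(f"extract_llm_parse_failed: {first}") from first

    # Assumption: halving means truncating the source to its first half.
    raw_retry, _ = call_model(source[: len(source) // 2])
    try:
        return json.loads(raw_retry)
    except json.JSONDecodeError as second:
        # Step 3: surface both failures so the stored error shows each attempt.
        raise ExtractionError(
            f"extract_llm_parse_failed_after_fallback: first={first}; second={second}"
        ) from second
```

Keeping the salvage step ahead of the retry means a recoverable prefix never costs a second model call.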

Deferred follow-ups

  • Prompt reorder for implicit caching — Gemini 2.5 has implicit caching enabled (75% off cached input tokens on prefix match) but requires ≥1024 cached tokens. Our stable prefix is ~375 tokens — below the threshold. Padding it with substantive stable content (schema description, few-shot examples) would unlock the discount.
  • Streaming + mid-stream repetition detection: generate_content_stream IS supported with response_schema in 1.12.1. A rolling-window detector (the same ≥15-char slice repeated ≥4× in the last ~100 tokens) that breaks out of the iterator could abort at ~2-4k tokens instead of 32k, saving ~94% of output tokens on detection. ~1 day of work.
  • Env-flagged model-swap fallback to gemini-2.5-flash-preview-09-2025 after two consecutive MAX_TOKENS failures. Forum reports it's more stable for the repetition bug specifically; explicitly non-GA.
  • google-genai SDK upgrade to ≥1.65 for native HttpRetryOptions (our own _with_retry becomes redundant). 60-version jump from 1.12.1 — needs its own investigation cycle.
  • Batch API migration for non-interactive compiles — 50% discount on input + output tokens, 24h SLA. Session-compile is already fire-and-forget via n8n, matches the latency budget.
  • Disambiguate result cache — Layer 3 entity disambiguation re-runs on every compile retry even when prior aliases were persisted. Cache by (session_id, sorted-pair-hash) to avoid re-spend.
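
For the streaming follow-up, the rolling-window detector could look something like the sketch below. It operates on plain text chunks rather than on the SDK's stream objects, and has_repetition_loop, collect_until_loop, and the 400-character window (a rough stand-in for ~100 tokens, assuming about 4 characters per token) are all hypothetical choices made for illustration.

```python
from typing import Iterable, List, Tuple


def has_repetition_loop(tail: str, min_len: int = 15, min_repeats: int = 4) -> bool:
    """True if the last `min_len` chars occur at least `min_repeats` times in `tail`."""
    if len(tail) < min_len * min_repeats:
        return False
    probe = tail[-min_len:]
    # str.count counts non-overlapping occurrences, scanning left to right.
    return tail.count(probe) >= min_repeats


def collect_until_loop(chunks: Iterable[str], window_chars: int = 400) -> Tuple[str, bool]:
    """Accumulate streamed text, stopping early once the tail looks stuck.

    Returns (text_so_far, aborted).
    """
    parts: List[str] = []
    window = ""
    for chunk in chunks:
        parts.append(chunk)
        # Keep only the most recent characters; older text cannot affect the check.
        window = (window + chunk)[-window_chars:]
        if has_repetition_loop(window):
            return "".join(parts), True
    return "".join(parts), False
```

A count-based check like this can also fire on legitimately repetitive JSON (for example many entries sharing the same field values), so the thresholds would need tuning against real extraction payloads before the abort path is trusted. The text collected before the abort could then go through the same salvage step as a MAX_TOKENS truncation.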

Labels: bug, documentation
