Gemini 2.5 Flash structured-output repetition loop — upstream bug + client-side workaround

## Summary

Gemini 2.5 Flash occasionally enters an infinite token-repetition loop during structured-output extraction, truncating the response mid-JSON at `max_output_tokens`. This is an **upstream model bug** — documented by Google since Oct 2025, with ~30% hit rate on some reporters' workloads. Persists on both Flash and Pro as of April 2026, no Google fix or ETA.

This issue documents the failure signature, the client-side mitigations we've shipped, and deferred follow-ups. **Keep open** until either (a) Google fixes upstream, or (b) all deferred items below ship.

## Observed failure signatures

Two distinct failures that hit the same code path in `nlp-service/services/llm_client.py :: extract_source()`:

**1. Repetition loop → JSON truncation:**
```
extract_llm_parse_failed: 1 validation error for LLMExtractionResponse
  Invalid JSON: EOF while parsing a value at line 24 column 64119
  [type=json_invalid, input_value='{"entities":[..."AI","AI","AI","AI",', ...]
```
Config that triggers it: `temperature=0.0`, `max_output_tokens=32768`, `response_mime_type=application/json`, `response_schema=LLMExtractionResponse`, `thinking_budget=512`.

**2. Transient connection drops** (unrelated root cause, same code path):
```
extract_llm_call_failed: Server disconnected without sending a response.
```
httpx `RemoteProtocolError` bubbling out of `google-genai` with no retry logic.

## Root cause

**Repetition loop:** Upstream sampler pathology on Gemini 2.5 Flash/Pro at `temperature=0.0` with structured output. Well-documented by Google users; no Google fix:
- https://discuss.ai.google.dev/t/gemini-2-5-flash-repeats-tokens-until-max-tokens-reached-in-structured-output/107176
- https://discuss.ai.google.dev/t/gemini-2-5-pro-enters-infinite-b-repetition-loop-65k-tokens-billed-same-behavior-as-flash/115368
- https://discuss.ai.google.dev/t/truncated-response-issue-with-gemini-2-5-flash-preview/81258

**Transient drops:** `google-genai==1.12.1` (our pinned version) has zero retry logic. `HttpRetryOptions` only landed in ≥1.65 — a 60-version jump we're not taking mid-debug.

### Evaluated and confirmed NOT to work

- `frequency_penalty` / `presence_penalty` — NOT supported on 2.5 models; returns `INVALID_ARGUMENT`
- Pydantic `max_length` / `max_items` constraints — silently ignored
- Anti-repetition prompt hints alone — no reports of them working
- Lower `top_p` — no corroboration
- Streaming mid-body abort — streaming on 2.5 models has its own bugs (empty chunks, 400 errors)
- `thinking_budget` tuning — often ignored outright on 2.5 models

## Applied workarounds

Shipped in two commits:

- **`2234bcc` — fix(api): observability overhaul for compile pipeline failures**
  - Shared `throwOnError` helper in `compile/run/route.ts` replaces 8 inline status-only throws. Reads the sub-route response body (3 s race-capped), embeds first 2 000 chars in the thrown Error so `failCompileProgress` stores the real cause in `compile_progress.error` (visible in UI banner + Activity feed).
  - `timed()` wrapper logs `[compile:<sid>] <step> <ms>` on success — makes cold-start vs steady-state latency visible.
  - Top-level `.catch` now `console.error`'s raw `err` (stack trace), not just the stringified message.
  - Each of `compile/{extract,resolve,match,plan,crossref,draft,schema}` now wraps its POST body in a top-level try/catch that logs the stack before returning 500. Next.js 16 production doesn't surface uncaught handler errors without explicit logging.
  - Side effect: UUIDs no longer leak into activity-feed error strings (they were baked into the old `extract failed for {uuid}: 500` message).
  - `api/health`: bumped expected `schema_version` 15 → 16 (the v16 migration had been silently failing the healthcheck, leaving Docker flapping).

- **`3ebb275` — feat(nlp-service): Gemini retry wrapper + truncation salvage + usage logging**
  - `_with_retry` helper: 3 attempts, exponential backoff (0.5/1/2 s), catches httpx connect-class errors + google-genai `APIError` with status in {408, 429, 500, 502, 503, 504}. Re-acquires the pyrate-limiter token per attempt so retries count against the 800 RPM cap (not bypass it).
  - Applied to 7 read-only analyzers: `extract`, `lint_scan`, `disambiguate`, `crossref`, `triage`, `generate_schema`, `select_pages`. Skipped on `draft_page`, `synthesize_answer`, `generate_digest` — those commit downstream state and can't be safely re-executed after a partial response.
  - `_log_usage` helper: one stderr line per Gemini call with `finish_reason` + `prompt_token_count` / `candidates_token_count` / `thoughts_token_count`. Ground truth for distinguishing `MAX_TOKENS` truncation from other failure modes.
  - `extract_source` fallback chain on parse failure:
    1. `json-repair` salvage (new dep, `json-repair>=0.59.4`) — recovers the valid JSON prefix from a repetition-loop truncation for free; ~90% of loops have extractable content before the stuck tail
    2. If `finish_reason == MAX_TOKENS` and salvage failed → retry once with halved input; different prompt path usually escapes the loop
    3. On all-fail → chain both errors in `extract_llm_parse_failed_after_fallback: first=...; second=...` so `compile_progress.error` shows what each attempt hit
  - Anti-repetition hint added to `_EXTRACTION_SYSTEM_PROMPT` ("Stop as soon as the source is covered — do NOT repeat names or pad the output"). Mixed evidence it helps; costs ~10 input tokens, kept as belt-and-suspenders.

## Deferred follow-ups

- [ ] **Prompt reorder for implicit caching** — Gemini 2.5 has implicit caching enabled (75% off cached input tokens on prefix match) but requires ≥1024 cached tokens. Our stable prefix is ~375 tokens — below the threshold. Padding it with substantive stable content (schema description, few-shot examples) would unlock the discount.
- [ ] **Streaming + mid-stream repetition detection** — `generate_content_stream` IS supported with `response_schema` in 1.12.1. A rolling-window detector (same ≥15-char slice repeated ≥4× in last ~100 tokens) + `break` the iterator could abort at ~2-4k tokens instead of 32k. ~94% output-token savings on detection. ~1 day of work.
- [ ] **Env-flagged model-swap fallback** to `gemini-2.5-flash-preview-09-2025` after two consecutive MAX_TOKENS failures. Forum reports it's more stable for the repetition bug specifically; explicitly non-GA.
- [ ] **`google-genai` SDK upgrade** to ≥1.65 for native `HttpRetryOptions` (our own `_with_retry` becomes redundant). 60-version jump from 1.12.1 — needs its own investigation cycle.
- [ ] **Batch API migration** for non-interactive compiles — 50% discount on input + output tokens, 24h SLA. Session-compile is already fire-and-forget via n8n, matches the latency budget.
- [ ] **Disambiguate result cache** — Layer 3 entity disambiguation re-runs on every compile retry even when prior aliases were persisted. Cache by (session_id, sorted-pair-hash) to avoid re-spend.

## References

- [python-genai cookbook issue #1091 — retry.if_transient_error doesn't catch google-genai errors](https://github.com/google-gemini/cookbook/issues/1091)
- [python-genai #782 — thinking_budget often ignored on 2.5 models](https://github.com/googleapis/python-genai/issues/782)
- [python-genai #1984 — hallucinations/endless JSON (closed stale, no fix)](https://github.com/googleapis/python-genai/issues/1984)
- [Vertex AI Retry Strategy docs](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/retry-strategy)
- [Gemini 2.5 implicit caching announcement](https://developers.googleblog.com/en/gemini-2-5-models-now-support-implicit-caching/)
- [Gemini Batch API docs](https://ai.google.dev/gemini-api/docs/batch-api)
- [json-repair library](https://github.com/mangiucugna/json_repair)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gemini 2.5 Flash structured-output repetition loop — upstream bug + client-side workaround #7

Summary

Observed failure signatures

Root cause

Evaluated and confirmed NOT to work

Applied workarounds

Deferred follow-ups

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Gemini 2.5 Flash structured-output repetition loop — upstream bug + client-side workaround #7

Description

Summary

Observed failure signatures

Root cause

Evaluated and confirmed NOT to work

Applied workarounds

Deferred follow-ups

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions