You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Open scope. I sketched the design but won't be building it — leaving notes here in case someone wants to pick it up.
The problem
Compile spend dominates LLM cost. The hottest call sites — extract, disambiguate, crossref, draft_page — fire once per source or page and account for the bulk of token usage. All three major providers offer a batch/async submission mode at 50% of the standard rate (Gemini Batch API, OpenAI Batch API, Anthropic Message Batches API), with a 24-hour SLA ceiling — most batches return well before that, but it's not guaranteed.
Today every compile call is real-time. There's no way to opt into "I have 80 sources from a research dump, I don't care if it takes a few hours overnight, just halve the bill."
A Budget mode that routes eligible compile steps through whichever provider's batch endpoint exists would close this. The user-facing UX for the eligible paths already supports walking away.
Why the existing UX fits
The whole onboarding flow is already fire-and-forget. Confirmed by reading the code:
The progress page is a polling client only — 2s polls against /api/compile/progress (app/src/app/onboarding/progress/page.tsx:128-181). State rehydrates from the compile_progress row on every poll.
This Budget mode presupposes a provider abstraction. Each provider's adapter optionally implements a submitBatch method; providers without batch support fall back to real-time.
Edge cases
The list I came up with — there are probably more, and some may collapse once the provider abstraction lands.
Per-step eligibility.extract and disambiguate are read-only and easy to batch. draft_page is per-page and large; batchable in principle but mid-batch failures lose more work. crossref depends on draft outputs being committed first — does Budget mode batch each step independently, or pipeline a single mega-batch?
Mid-batch failure handling. Provider returns 95% successes + 5% failures. Fall back to real-time for the failed slice? Mark them and surface in the existing retry-failed flow? Abort the whole batch?
Cancellation semantics. The progress page has a Cancel button (app/src/app/onboarding/progress/page.tsx:715-728). Batch APIs don't all support cancel — Gemini does, OpenAI does, Anthropic only via batch-cancel endpoint. Define cancel as "best-effort upstream cancel + locally discard results" or just "discard results when they arrive"?
Polling cadence. Real-time compile pings n8n step-by-step. Batch submission polls a job ID. 1 min vs 5 min vs 15 min trades latency for API quota burn. Some providers also have polling rate limits.
Structured-output reliability in batch mode. Structured-output bugs are not Gemini-specific:
Claude / DeepSeek / Mistral — no documented bug as of May 2026
The salvage-and-retry logic shipped for Gemini 2.5 Flash structured-output repetition loop — upstream bug + client-side workaround #7 should generalize into a per-provider robustness layer in the abstraction. Batch mode amplifies the cost of these bugs because there's no streaming option to abort early — a runaway response burns the full budget before the failure is visible.
Cost-guard interaction. Existing daily budget cap counts pyrate-limiter tokens. Batched calls bypass the limiter entirely. Need a separate accounting path or a batch-cost preflight.
Mixed mode within a session. If extract goes through batch but draft_page stays sync, the session sits in "queued for batch" → "running real-time" → "queued for batch" — compile_progress step state semantics need to support this without breaking the progress UI.
UI affordance for Budget mode. Settings toggle (sticky, applies to all future compiles)? Per-compile checkbox on the onboarding review page? Auto-suggest at >N sources? "Use Budget mode for this compile" on the dashboard banner?
Design decisions any implementation has to commit to
Trigger. Settings toggle vs per-compile flag vs auto-threshold (e.g. "≥30 sources auto-suggests Budget mode")?
Granularity. Per-step batching, per-session batching, or pipeline-as-one-batch?
Failure fallback. Batch errors → real-time fallback for failed slice, manual retry-failed, or abort?
Polling. Cadence and where it lives — n8n schedule, Next.js setInterval, or per-session worker?
Cancel. Best-effort upstream + local discard, or local-only?
Cost telemetry. Add a separate batch_usage activity log or extend the existing per-call logging?
Provider parity. Require all providers to implement submitBatch, or let providers opt in (and the UI hides Budget mode for providers that don't)?
The problem
Compile spend dominates LLM cost. The hottest call sites —
extract,disambiguate,crossref,draft_page— fire once per source or page and account for the bulk of token usage. All three major providers offer a batch/async submission mode at 50% of the standard rate (Gemini Batch API, OpenAI Batch API, Anthropic Message Batches API), with a 24-hour SLA ceiling — most batches return well before that, but it's not guaranteed.Today every compile call is real-time. There's no way to opt into "I have 80 sources from a research dump, I don't care if it takes a few hours overnight, just halve the bill."
A Budget mode that routes eligible compile steps through whichever provider's batch endpoint exists would close this. The user-facing UX for the eligible paths already supports walking away.
Why the existing UX fits
The whole onboarding flow is already fire-and-forget. Confirmed by reading the code:
POST /api/onboarding/finalizetriggers n8n via webhook then returns 200 immediately (app/src/app/api/onboarding/finalize/route.ts:115). The 10s timeout in app/src/lib/trigger-n8n.ts:34 covers "did n8n receive the trigger," not compile completion./api/compile/progress(app/src/app/onboarding/progress/page.tsx:128-181). State rehydrates from thecompile_progressrow on every poll.Dashboard — always visible so the user can leave(app/src/app/onboarding/progress/page.tsx:732).kompl_active_compile(app/src/app/onboarding/progress/page.tsx:162) persists across navigation so the user sees compile state when they come back.A Budget-mode compile that takes hours is a worse-latency variant of an already-walk-away UX. The wait is opt-in.
Eligible call sites
All n8n-triggered, all already non-interactive:
/api/onboarding/finalize→/webhook/session-compile) — primary target, biggest spend/api/sources/[source_id]/recompile)/api/compile/retry-failed)lint-wiki.json— Mon 11:30 + manual/webhook/lint)weekly-digest.json— Mon 11:40)Not eligible: chat answer synthesis (
synthesize_answer), any user-facing synchronous LLM call. Those stay on the real-time path.Provider batch APIs to support
The shape is comparable across providers, but each has its own quirks:
response_schema(inherits #7 repetition bug)This Budget mode presupposes a provider abstraction. Each provider's adapter optionally implements a
submitBatchmethod; providers without batch support fall back to real-time.Edge cases
The list I came up with — there are probably more, and some may collapse once the provider abstraction lands.
Per-step eligibility.
extractanddisambiguateare read-only and easy to batch.draft_pageis per-page and large; batchable in principle but mid-batch failures lose more work.crossrefdepends on draft outputs being committed first — does Budget mode batch each step independently, or pipeline a single mega-batch?Mid-batch failure handling. Provider returns 95% successes + 5% failures. Fall back to real-time for the failed slice? Mark them and surface in the existing retry-failed flow? Abort the whole batch?
Cancellation semantics. The progress page has a Cancel button (app/src/app/onboarding/progress/page.tsx:715-728). Batch APIs don't all support cancel — Gemini does, OpenAI does, Anthropic only via batch-cancel endpoint. Define cancel as "best-effort upstream cancel + locally discard results" or just "discard results when they arrive"?
Polling cadence. Real-time compile pings n8n step-by-step. Batch submission polls a job ID. 1 min vs 5 min vs 15 min trades latency for API quota burn. Some providers also have polling rate limits.
Structured-output reliability in batch mode. Structured-output bugs are not Gemini-specific:
\n\n\n\npadding bug via Responses API (different mechanism, same output-token bloat shape)The salvage-and-retry logic shipped for Gemini 2.5 Flash structured-output repetition loop — upstream bug + client-side workaround #7 should generalize into a per-provider robustness layer in the abstraction. Batch mode amplifies the cost of these bugs because there's no streaming option to abort early — a runaway response burns the full budget before the failure is visible.
Cost-guard interaction. Existing daily budget cap counts pyrate-limiter tokens. Batched calls bypass the limiter entirely. Need a separate accounting path or a batch-cost preflight.
Mixed mode within a session. If
extractgoes through batch butdraft_pagestays sync, the session sits in "queued for batch" → "running real-time" → "queued for batch" —compile_progressstep state semantics need to support this without breaking the progress UI.UI affordance for Budget mode. Settings toggle (sticky, applies to all future compiles)? Per-compile checkbox on the onboarding review page? Auto-suggest at >N sources? "Use Budget mode for this compile" on the dashboard banner?
Design decisions any implementation has to commit to
batch_usageactivity log or extend the existing per-call logging?submitBatch, or let providers opt in (and the UI hides Budget mode for providers that don't)?Related
Out of scope