Skip to content

Budget mode for non-interactive compiles via provider batch/async APIs #55

@tuirk

Description

@tuirk

Open scope. I sketched the design but won't be building it — leaving notes here in case someone wants to pick it up.

The problem

Compile spend dominates LLM cost. The hottest call sites — extract, disambiguate, crossref, draft_page — fire once per source or page and account for the bulk of token usage. All three major providers offer a batch/async submission mode at 50% of the standard rate (Gemini Batch API, OpenAI Batch API, Anthropic Message Batches API), with a 24-hour SLA ceiling — most batches return well before that, but it's not guaranteed.

Today every compile call is real-time. There's no way to opt into "I have 80 sources from a research dump, I don't care if it takes a few hours overnight, just halve the bill."

A Budget mode that routes eligible compile steps through whichever provider's batch endpoint exists would close this. The user-facing UX for the eligible paths already supports walking away.

Why the existing UX fits

The whole onboarding flow is already fire-and-forget. Confirmed by reading the code:

A Budget-mode compile that takes hours is a worse-latency variant of an already-walk-away UX. The wait is opt-in.

Eligible call sites

All n8n-triggered, all already non-interactive:

  • Onboarding compile (/api/onboarding/finalize/webhook/session-compile) — primary target, biggest spend
  • Recompile after source update (/api/sources/[source_id]/recompile)
  • Retry-failed sweeps (/api/compile/retry-failed)
  • Lint pass (lint-wiki.json — Mon 11:30 + manual /webhook/lint)
  • Weekly digest generation (weekly-digest.json — Mon 11:40)

Not eligible: chat answer synthesis (synthesize_answer), any user-facing synchronous LLM call. Those stay on the real-time path.

Provider batch APIs to support

The shape is comparable across providers, but each has its own quirks:

Provider Endpoint Discount Completion
Gemini Batch API 50% Poll only; supports response_schema (inherits #7 repetition bug)
OpenAI Batch API 50% Poll or webhook
Anthropic Message Batches API 50% Poll only

This Budget mode presupposes a provider abstraction. Each provider's adapter optionally implements a submitBatch method; providers without batch support fall back to real-time.

Edge cases

The list I came up with — there are probably more, and some may collapse once the provider abstraction lands.

  1. Per-step eligibility. extract and disambiguate are read-only and easy to batch. draft_page is per-page and large; batchable in principle but mid-batch failures lose more work. crossref depends on draft outputs being committed first — does Budget mode batch each step independently, or pipeline a single mega-batch?

  2. Mid-batch failure handling. Provider returns 95% successes + 5% failures. Fall back to real-time for the failed slice? Mark them and surface in the existing retry-failed flow? Abort the whole batch?

  3. Cancellation semantics. The progress page has a Cancel button (app/src/app/onboarding/progress/page.tsx:715-728). Batch APIs don't all support cancel — Gemini does, OpenAI does, Anthropic only via batch-cancel endpoint. Define cancel as "best-effort upstream cancel + locally discard results" or just "discard results when they arrive"?

  4. Polling cadence. Real-time compile pings n8n step-by-step. Batch submission polls a job ID. 1 min vs 5 min vs 15 min trades latency for API quota burn. Some providers also have polling rate limits.

  5. Structured-output reliability in batch mode. Structured-output bugs are not Gemini-specific:

    The salvage-and-retry logic shipped for Gemini 2.5 Flash structured-output repetition loop — upstream bug + client-side workaround #7 should generalize into a per-provider robustness layer in the abstraction. Batch mode amplifies the cost of these bugs because there's no streaming option to abort early — a runaway response burns the full budget before the failure is visible.

  6. Cost-guard interaction. Existing daily budget cap counts pyrate-limiter tokens. Batched calls bypass the limiter entirely. Need a separate accounting path or a batch-cost preflight.

  7. Mixed mode within a session. If extract goes through batch but draft_page stays sync, the session sits in "queued for batch" → "running real-time" → "queued for batch" — compile_progress step state semantics need to support this without breaking the progress UI.

  8. UI affordance for Budget mode. Settings toggle (sticky, applies to all future compiles)? Per-compile checkbox on the onboarding review page? Auto-suggest at >N sources? "Use Budget mode for this compile" on the dashboard banner?

Design decisions any implementation has to commit to

  • Trigger. Settings toggle vs per-compile flag vs auto-threshold (e.g. "≥30 sources auto-suggests Budget mode")?
  • Granularity. Per-step batching, per-session batching, or pipeline-as-one-batch?
  • Failure fallback. Batch errors → real-time fallback for failed slice, manual retry-failed, or abort?
  • Polling. Cadence and where it lives — n8n schedule, Next.js setInterval, or per-session worker?
  • Cancel. Best-effort upstream + local discard, or local-only?
  • Cost telemetry. Add a separate batch_usage activity log or extend the existing per-call logging?
  • Provider parity. Require all providers to implement submitBatch, or let providers opt in (and the UI hides Budget mode for providers that don't)?

Related

Out of scope

  • Real-time / chat / synchronous LLM paths.
  • Cross-provider unified pricing math (let each adapter report its own).
  • Auto-routing between providers based on cost/availability.
  • Mid-batch streaming progress (batch is by definition not streaming).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestup-for-grabsMaintainer isn't actively working on this — open for anyone to pick up

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions