Maintained fork of OpenAI's experimental Symphony orchestrator, driven by production use dispatching Codex agents against the Spektra repo (Linear team Synchron). Upstream is an unmaintained reference implementation; this fork fixes the failure modes we hit running it continuously.
- `stop_continue_labels`: issues carrying a configured label are excluded from
  dispatch and retry, AND have their running agent terminated even while the
  issue sits in an active state. Fixes the merged-but-state-bounced zombie
  class (observed: one issue burned ~47M tokens re-reviewing already-merged
  code).
- Auto-decline MCP elicitation requests: headless agents wedged forever on
  `mcpServer/elicitation/request`; we answer `decline` and emit an event.
- Structured rate-limit handling: Linear reports `RATELIMITED` with HTTP 400
  and a shared 2500 req/hr window; the client now parses rate-limit headers,
  sleeps until reset, retries the request itself, and returns structured
  `{:rate_limited, reset_at}` errors.
- `linear_graphql` tool output capped at 16KB with narrow-your-query guidance:
  uncapped responses were the largest remaining token leak (50KB+ per broad
  query, persisting in thread context for every later turn).
- Rate-limit budget gate defaults flipped: all requests are non-critical and
  wait out a low budget; only adapter mutations/lookups (comments, labels,
  state transitions) pass `critical?: true`. Previously only the delta poll
  respected the gate.
- `delay_until_reset` takes a fallback (60s on the rate-limited retry path)
  for Linear's empty-headers RATELIMITED responses, which otherwise retried
  immediately into the same exhausted window.
- Issue-state batcher call timeout `:infinity` → 30s with structured
  `:issue_state_batch_timeout` error.
- Workflow-side: `max_turns` 20 → 40 (a tight turn cap was upstream's only
  spend bound; `max_tokens_per_issue` now does that job, and every max_turns
  rollover costs a cold thread).
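
The 16KB cap on `linear_graphql` output can be sketched roughly as follows. This is a minimal illustration, not the fork's actual code: the module name `OutputCapSketch`, the function `cap/2`, and the wording of the hint are assumptions; only the 16KB limit and the narrow-your-query guidance come from the list above.

```elixir
defmodule OutputCapSketch do
  @moduledoc "Hypothetical sketch of a byte cap on tool output."

  @max_bytes 16 * 1024
  # Assumed wording; the real guidance text may differ.
  @hint "\n[output truncated: narrow your query (fewer fields, tighter filters, smaller pages)]"

  # Returns the output unchanged when it fits, otherwise a truncated
  # prefix followed by the hint. The hint is appended on top of the cap.
  def cap(output, max_bytes \\ @max_bytes) when is_binary(output) do
    if byte_size(output) <= max_bytes do
      output
    else
      trimmed = output |> binary_part(0, max_bytes) |> drop_partial_codepoint()
      trimmed <> @hint
    end
  end

  # A byte-level cut can split a multi-byte UTF-8 character. Drop at most
  # three trailing bytes until the prefix is valid UTF-8 again.
  defp drop_partial_codepoint(bin, attempts \\ 3)
  defp drop_partial_codepoint(bin, 0), do: bin

  defp drop_partial_codepoint(bin, attempts) do
    if String.valid?(bin) do
      bin
    else
      drop_partial_codepoint(binary_part(bin, 0, byte_size(bin) - 1), attempts - 1)
    end
  end
end
```

Capping by bytes rather than characters keeps the bound predictable for token accounting; the trailing hint tells the agent why the response is short so that it narrows the query instead of repeating it.
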
Known accepted behaviors (reviewed, not bugs): token budget interrupts mid-turn by design (overshoot ≤ one notification); max_turns rollover still cold-starts a thread (mitigated by previous_attempt + workspace persistence; true thread resume blocked on codex app-server support — upstream reverted their attempt in #85); ledger grows one entry per issue (negligible at current volume).
Upstream #88–#90 have been ported into this fork's working tree, together with
their tests: opt-in `tracker.required_labels`, dashboard issue links, and
favicon/static asset versioning. The remaining Phase 0 work is
history reconciliation only: merge or rebase onto upstream/main after the
roadmap implementation is committed, resolving equivalent conflicts in favor of
the fork's current behavior so future upstream syncs do not replay the same
conflicts.
- Tracker-read failures must not kill agent runs (`agent_runner.ex`). Retry
  the refresh in place; if still failing, pause the run keeping the Codex
  thread + workspace alive. A bookkeeping read failing is not a reason to
  discard an agent's accumulated context.
- Structured rate-limit handling in `linear/client.ex` (currently has zero
  retry logic). Parse `X-RateLimit-Requests-Remaining` / `-Reset` into a
  shared budget gauge; delay low-priority calls when the budget is low; on
  `RATELIMITED` sleep until reset and retry the request itself. Return
  structured errors (`{:error, {:rate_limited, reset_at}}`).
- Delete the string-match heuristic (`rate_limited_error?/1`) once item 2
  lands: it currently misclassifies every HTTP 400 as a rate limit, so genuine
  bad requests back off 5 min and retry forever instead of failing fast.
- Per-issue token budget (`agent.max_tokens_per_issue`). Accumulation from
  `thread/tokenUsage/updated` already exists (orchestrator.ex ~1670–1706) and
  only feeds the dashboard. Enforce before each continuation; on breach:
  interrupt, comment on the issue with the final count, apply
  `symphony-budget-exceeded` (reuses stop_continue machinery).
- Per-issue dispatch/rework caps (`agent.max_dispatch_attempts`,
  `agent.max_rework_cycles`). On breach: `symphony-stuck` label + comment.
- Persist orchestrator state (new `ledger` module; SQLite/DETS/JSONL keyed by
  issue id: dispatch_count, rework_count, cumulative_tokens, blocked_reason,
  last_thread_id). The memory-only blocked map currently forgets wedges on
  restart and re-dispatches into them; also provides the counters for items
  4–5.
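
The token budget and the dispatch/rework caps reduce to one check against the persisted counters before each continuation or dispatch. The sketch below is an assumption about how that check could look, not the fork's implementation: `IssueGuardSketch`, `check/2`, the comment wording, and the choice of `>=` as the breach condition are all hypothetical; the config keys, ledger fields, and label names are taken from the list above.

```elixir
defmodule IssueGuardSketch do
  @moduledoc "Hypothetical pre-dispatch guard over one issue's ledger entry."

  # `entry` is a map with :cumulative_tokens, :dispatch_count, :rework_count.
  # `limits` is a map or keyword list; a missing limit means "unbounded".
  # Returns :ok, or {:halt, label, comment} naming the label to apply.
  def check(entry, limits) do
    cond do
      over?(entry.cumulative_tokens, limits[:max_tokens_per_issue]) ->
        {:halt, "symphony-budget-exceeded",
         "Token budget exceeded: #{entry.cumulative_tokens} tokens used."}

      over?(entry.dispatch_count, limits[:max_dispatch_attempts]) ->
        {:halt, "symphony-stuck",
         "Dispatch cap reached after #{entry.dispatch_count} attempts."}

      over?(entry.rework_count, limits[:max_rework_cycles]) ->
        {:halt, "symphony-stuck",
         "Rework cap reached after #{entry.rework_count} cycles."}

      true ->
        :ok
    end
  end

  # Assumption: reaching the limit counts as a breach.
  defp over?(_value, nil), do: false
  defp over?(value, limit), do: value >= limit
end
```

Returning the label alongside the comment lets the caller route both breaches through the existing stop_continue label machinery without a separate code path per cap.
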
- Same-thread stall recovery: on stall/turn-timeout send `thread/interrupt`
  and continue on the same thread instead of killing the process and starting
  a cold thread that re-reads the whole repo.
- Smarter stall detection: reset the stall clock on any item activity
  (running `commandExecution`, `outputDelta`), not just message traffic; a
  long silent test run is progress, not a stall.
- Previous-attempt context injection: template gets
  `previous_attempt: {last_agent_message, dirty_files, commits_ahead, turns_used, token_total}`
  on retries/rework, so a restart resumes from notes instead of starting over.
  (Distinct from upstream's reverted #84/#85 cross-restart Linear-comment
  resume; this is orchestrator-internal.)
- Per-state prompt templates (`prompt_template_by_state`) so implementer and
  reviewer agents each carry only their half of the protocol.
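
The stall-clock change can be illustrated with a small pure module. This is a sketch under assumptions: `StallClockSketch` and the atoms `:message`, `:command_execution`, and `:output_delta` are hypothetical stand-ins for however the orchestrator classifies app-server notifications; timestamps are passed in so the logic stays testable.

```elixir
defmodule StallClockSketch do
  @moduledoc "Hypothetical stall clock that counts item activity as progress."

  defstruct last_activity_ms: 0

  # Assumed event kinds; message traffic plus command and output activity.
  @activity_kinds [:message, :command_execution, :output_delta]

  def new(now_ms), do: %__MODULE__{last_activity_ms: now_ms}

  # Any recognised activity resets the clock; unknown kinds leave it alone.
  def observe(%__MODULE__{} = clock, kind, now_ms) when kind in @activity_kinds do
    %{clock | last_activity_ms: now_ms}
  end

  def observe(%__MODULE__{} = clock, _kind, _now_ms), do: clock

  # Stalled only when nothing recognised arrived within the timeout.
  def stalled?(%__MODULE__{last_activity_ms: last}, now_ms, timeout_ms) do
    now_ms - last > timeout_ms
  end
end
```

With this shape, a long test run that keeps emitting output deltas never trips the timeout, while a truly silent agent still does.
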
- Delta polling: steady-state poll filters `updatedAt > lastPollAt`; full
  re-query only on startup and explicit refresh. (~2,880 full-project GraphQL
  queries/day at idle today.)
- Batch per-turn issue-state refreshes through the existing
  `fetch_issue_states_by_ids/1` instead of one call per agent per turn.
- Retry jitter (±25%) + FIFO slot queue (replace the "No available slots…
  retrying again" timer spin).
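
The ±25% retry jitter amounts to scaling the base delay by a uniform random factor between 0.75 and 1.25. A minimal sketch, assuming a hypothetical `RetryJitterSketch.jittered/2` helper; the injectable random function is only there to make the behaviour checkable:

```elixir
defmodule RetryJitterSketch do
  @moduledoc "Hypothetical helper applying +/-25% jitter to a retry delay."

  # `rand_fun` must return a float in [0.0, 1.0); :rand.uniform/0 does.
  # The result is base_ms scaled by a factor in [0.75, 1.25), rounded.
  def jittered(base_ms, rand_fun \\ &:rand.uniform/0)
      when is_integer(base_ms) and base_ms >= 0 do
    factor = 0.75 + rand_fun.() * 0.5
    round(base_ms * factor)
  end
end
```

Jitter spreads out agents that failed at the same moment, so their retries do not land on the tracker in the same instant again.
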
- Workspace creation via local `--reference` mirror + `workspace.env` config
  for shared caches (REMOTION_BROWSER_EXECUTABLE, turbo/vite cache dirs).
  Today each workspace is ~2.1 GB incl. a private headless-Chrome download.
- Async workspace cleanup + retention policy (supervised Task on terminal
  state; `workspace.max_total_gb` / keep-last-N) instead of synchronous
  cleanup at startup that blocks the poll loop.
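
Workspace creation against a local mirror could look roughly like the following. `git clone --reference` is a standard git option that borrows objects from an existing local repository; everything else here (the module `WorkspaceCloneSketch`, its function names, and passing the shared-cache variables as a plain map) is an assumption for illustration rather than the fork's actual interface.

```elixir
defmodule WorkspaceCloneSketch do
  @moduledoc "Hypothetical workspace clone that borrows objects from a mirror."

  # Argument list for: git clone --reference <mirror> <remote> <dest>
  def clone_args(remote_url, mirror_path, dest) do
    ["clone", "--reference", mirror_path, remote_url, dest]
  end

  # Runs the clone. `env` is a map of string keys and values, e.g. shared
  # cache locations, exported to the git process.
  # Returns {output, exit_status} as System.cmd/3 does.
  def clone(remote_url, mirror_path, dest, env \\ %{}) do
    System.cmd("git", clone_args(remote_url, mirror_path, dest),
      env: Map.to_list(env),
      stderr_to_stdout: true
    )
  end
end
```

A workspace cloned this way depends on the mirror's object store, so the mirror has to outlive the workspaces that reference it.
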
- Per-issue metrics ledger: on terminal state emit
  `{issue, pr, tokens, turns, rework_cycles, retries, wall_time, merged_at}`
  (formalizes the manual `metrics.jsonl` convention).
- Status API: expose blocked reasons, per-issue token totals, retry counts,
  stall events via the observability controller; keep logging + port active in
  TUI mode (monitoring currently scrapes `tmux capture-pane`).
- Configurable elicitation policy (`codex.elicitation_policy: decline | block`)
  + record declined elicitations in the ledger.
- Repo map for agents: move the component→location table into `AGENTS.md`
  (or a generated `REPO_MAP.md`), since Codex never reads `CLAUDE.md`.
- Cache-friendly prompt ordering in `WORKFLOW.md`: static protocol first,
  `{{ issue.* }}` interpolation last, so concurrent agents share an OpenAI
  prompt-cache prefix.
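
The prompt-ordering point is easiest to see by comparing two rendered prompts. The sketch below is illustrative only: `PromptOrderSketch.render/2`, the `## Issue` section header, and the issue fields `identifier` and `title` are assumptions, not the actual `WORKFLOW.md` template.

```elixir
defmodule PromptOrderSketch do
  @moduledoc "Hypothetical renderer: static protocol first, issue data last."

  # The static protocol is emitted byte-for-byte before anything
  # issue-specific, so every agent's prompt starts with the same prefix.
  def render(static_protocol, issue) do
    static_protocol <> "\n\n## Issue\n" <> "#{issue.identifier}: #{issue.title}\n"
  end

  # Number of leading bytes two prompts have in common.
  def shared_prefix_bytes(prompt_a, prompt_b) do
    :binary.longest_common_prefix([prompt_a, prompt_b])
  end
end
```

If the issue fields were interpolated near the top instead, the shared prefix would end at the first differing byte and the rest of the protocol text would not be cacheable across agents.
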
- `main` tracks `upstream/main` + our patches, rebased when upstream moves.
- One branch + PR per roadmap item; all changes unit-tested against the
  in-memory tracker (`tracker/memory.ex`) and the app-server seam.
- Run `mix test` in `elixir/` before pushing (one test compares
  `System.tmp_dir!()` and fails under a sandboxed `TMPDIR`; that failure is
  environmental).