This file is a living checklist to evolve this repo into a recruiter-grade, production-credible LLM orchestration platform.
- Clear architecture: separation of API, orchestration, model providers, storage, eval
- Reliability: retries/backoff, rate-limit handling, timeouts, circuit breakers
- Performance: high-throughput fanout + concurrency control (Go service)
- Observability: structured logs, metrics, tracing, dashboards
- Reproducible evaluation: regression tests + scorecards + self-improvement loop
- Secure secrets handling: no keys in repo, safe env/config patterns
Goal: Move the hottest path (parallel LLM calls, rate-limit management, streaming aggregation) into Go.
- Define the contract between Python orchestrator and Go service
  - Acceptance: a written interface (request/response schema) and a small design doc section in README.
  - Options:
    - HTTP/JSON (fast to ship)
    - gRPC (better performance + typed contracts)
  - Output:
    - Contract doc: `recursion/go/fanout-service/CONTRACT.md`
    - Smoke test: `recursion/scripts/fanout_smoke_test.py`
- Implement Go service: `fanout-service`
  - Responsibilities:
    - per-provider concurrency limits (semaphores)
    - retry/backoff with jitter
    - deadline propagation
    - rate-limit detection + backoff
    - structured logging + metrics
  - Acceptance:
    - Go service exposes `/fanout` (or gRPC method) and can run locally.
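
The per-provider concurrency limit listed above can be illustrated with one semaphore per provider. This is a minimal sketch in Python (asyncio) of the technique only; the Go service would presumably do the equivalent with its own primitives. `ProviderLimiter`, `PROVIDER_LIMITS`, and the limit values are hypothetical names and numbers, not taken from the repo.

```python
import asyncio
from collections import defaultdict

# Hypothetical per-provider limits; real values would come from config.
PROVIDER_LIMITS = {"google": 2, "openai": 4}
DEFAULT_LIMIT = 1


class ProviderLimiter:
    """Caps in-flight calls per provider with one semaphore each."""

    def __init__(self, limits, default=DEFAULT_LIMIT):
        self._limits = dict(limits)
        self._default = default
        self._semaphores = {}
        self.in_flight = defaultdict(int)
        self.peak = defaultdict(int)  # highest concurrency observed per provider

    def _semaphore(self, provider):
        # Created lazily so the semaphore belongs to the running event loop.
        if provider not in self._semaphores:
            limit = self._limits.get(provider, self._default)
            self._semaphores[provider] = asyncio.Semaphore(limit)
        return self._semaphores[provider]

    async def run(self, provider, call):
        """Await `call()` once a slot for `provider` is free."""
        async with self._semaphore(provider):
            self.in_flight[provider] += 1
            self.peak[provider] = max(self.peak[provider], self.in_flight[provider])
            try:
                return await call()
            finally:
                self.in_flight[provider] -= 1


async def fake_call():
    await asyncio.sleep(0.01)  # stand-in for a provider request
    return "ok"


async def demo():
    limiter = ProviderLimiter(PROVIDER_LIMITS)
    results = await asyncio.gather(
        *(limiter.run("google", fake_call) for _ in range(6))
    )
    return results, limiter.peak["google"]
```

Running `asyncio.run(demo())` issues six calls against a provider limited to two slots, so callers queue on the semaphore instead of overloading the provider.
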
- Go service scaffold created at `recursion/go/fanout-service`
  - Run locally:
    - `cd recursion/go/fanout-service`
    - `go run .`
  - Endpoints:
    - `GET /healthz`
    - `POST /fanout` (implemented; returns per-call results)
- Integrate Python → Go call path
  - Acceptance:
    - A feature flag (env var) chooses Go path vs pure Python path.
    - Trace output shape remains consistent.
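
One way to satisfy the feature-flag criterion above is a single dispatch function that reads an env var and returns the same record shape from either path. This is a sketch only: the flag name `REX_USE_GO_FANOUT`, the function names, and the record fields are hypothetical, and both paths are placeholders rather than real calls.

```python
import os

# Hypothetical flag name; the repo's actual env var may differ.
GO_FANOUT_FLAG = "REX_USE_GO_FANOUT"


def use_go_fanout(env=None):
    """Return True when the flag is set to a truthy value."""
    env = os.environ if env is None else env
    value = env.get(GO_FANOUT_FLAG, "0").strip().lower()
    return value in {"1", "true", "yes", "on"}


def fanout_python(prompts):
    # Placeholder for the pure-Python path.
    return [{"prompt": p, "output": None, "path": "python"} for p in prompts]


def fanout_go(prompts):
    # Placeholder for an HTTP call to the Go service's POST /fanout.
    return [{"prompt": p, "output": None, "path": "go"} for p in prompts]


def fanout(prompts, env=None):
    """Dispatch to one path; both return the same record shape."""
    impl = fanout_go if use_go_fanout(env) else fanout_python
    return impl(prompts)
```

Keeping the switch in one place means callers never branch on the flag themselves, which makes it easier to check that the trace shape is identical on both paths.
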
- Add load test for the Go service
  - Acceptance:
    - A repeatable script produces throughput/latency results and saves to `results/`.
  - Run:
    - `python scripts/fanout_load_test.py`
  - Output:
    - Writes `results/fanout_load_*.json` and `results/fanout_load_*.txt`
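
As an illustration of the throughput/latency numbers such a load test reports, the sketch below reduces a list of per-request latencies to a small summary. It does not reproduce `scripts/fanout_load_test.py`; the function names, the summary fields, and the nearest-rank percentile method are assumptions.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_s, wall_time_s):
    """Reduce per-request latencies to the numbers a load report needs."""
    ordered = sorted(latencies_s)
    return {
        "requests": len(ordered),
        "throughput_rps": len(ordered) / wall_time_s,
        "p50_s": percentile(ordered, 50),
        "p95_s": percentile(ordered, 95),
        "max_s": ordered[-1],
    }
```

The resulting dictionary can be serialized with `json.dumps` for the JSON artifact and pretty-printed for the text one.
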
- Create provider interface layer (Python)
  - Example: `ProviderClient` with `complete(messages, model, timeout, ...)`
  - Acceptance: `routes.py` no longer contains provider-specific logic.
- Centralize model naming + routing
  - Acceptance:
    - one module maps `google/gemini-*` → the correct provider implementation.
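
The two items above (a `ProviderClient` interface and a single routing module) could fit together roughly as follows. This is a sketch under assumptions: only `ProviderClient`, `complete(messages, model, timeout, ...)`, and the `google/gemini-*` pattern come from this checklist. The echo clients, `ROUTES`, and `resolve_provider` are hypothetical stand-ins.

```python
from fnmatch import fnmatchcase
from typing import Protocol


class ProviderClient(Protocol):
    """Interface every provider implementation is expected to satisfy."""

    def complete(self, messages: list[dict], model: str, timeout: float) -> str: ...


class EchoGoogleClient:
    """Hypothetical stand-in; a real client would call the provider's API."""

    def complete(self, messages, model, timeout):
        return f"[google:{model}] {messages[-1]['content']}"


class EchoDefaultClient:
    """Hypothetical fallback used when no specific pattern matches."""

    def complete(self, messages, model, timeout):
        return f"[default:{model}] {messages[-1]['content']}"


# Ordered (pattern, client) pairs; the first matching pattern wins.
ROUTES: list[tuple[str, ProviderClient]] = [
    ("google/gemini-*", EchoGoogleClient()),
    ("*", EchoDefaultClient()),
]


def resolve_provider(model: str) -> ProviderClient:
    """Map a model name to its provider implementation."""
    for pattern, client in ROUTES:
        if fnmatchcase(model, pattern):
            return client
    raise LookupError(f"no provider registered for model {model!r}")
```

With this shape, a route handler only calls `resolve_provider(model).complete(...)`, so provider-specific branches have no reason to live in `routes.py`.
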
- Add hard timeouts and cancellation
  - Acceptance:
    - If a sub-call exceeds deadline, it is cancelled and the trace records the timeout.
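
A minimal sketch of the timeout criterion above, assuming the sub-calls are asyncio coroutines: `asyncio.wait_for` cancels the inner task when the deadline passes, and the wrapper records the outcome. The trace record fields and the `call_with_deadline` name are hypothetical.

```python
import asyncio


async def call_with_deadline(name, coro_factory, timeout_s, trace):
    """Run one sub-call; on timeout, record it in `trace` and return None."""
    try:
        result = await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the inner task at this point.
        trace.append({"call": name, "status": "timeout", "timeout_s": timeout_s})
        return None
    trace.append({"call": name, "status": "ok"})
    return result


async def slow():
    await asyncio.sleep(1.0)  # stand-in for a provider call that overruns
    return "late"


async def fast():
    await asyncio.sleep(0)
    return "done"


async def demo():
    trace = []
    results = await asyncio.gather(
        call_with_deadline("fast", fast, 0.2, trace),
        call_with_deadline("slow", slow, 0.05, trace),
    )
    return results, trace
```

Because the timeout is handled per sub-call, one slow provider does not prevent the other results from being returned.
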
- Add retry policies and error taxonomy
  - Acceptance:
    - transient errors retry; permanent errors don’t; trace contains reason codes.
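
The retry criterion above can be sketched with a two-class error taxonomy and exponential backoff with full jitter. The class names, reason codes, trace fields, and default delays below are assumptions for illustration, not the repo's actual taxonomy.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying, e.g. a rate limit or a dropped connection."""

    reason = "transient"


class PermanentError(Exception):
    """Retrying cannot help, e.g. an invalid request or a bad API key."""

    reason = "permanent"


def backoff_delay(attempt, base=0.5, cap=8.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(fn, trace, max_attempts=3, sleep=time.sleep):
    """Call `fn`, retrying transient failures and recording each attempt."""
    for attempt in range(max_attempts):
        try:
            result = fn()
        except PermanentError as exc:
            trace.append({"attempt": attempt, "outcome": "failed", "reason": exc.reason})
            raise
        except TransientError as exc:
            last = attempt == max_attempts - 1
            outcome = "failed" if last else "retry"
            trace.append({"attempt": attempt, "outcome": outcome, "reason": exc.reason})
            if last:
                raise
            sleep(backoff_delay(attempt))
        else:
            trace.append({"attempt": attempt, "outcome": "ok", "reason": None})
            return result
```

Passing `sleep` as a parameter keeps the policy testable without real delays.
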
- Add caching rules (optional but strong)
  - Acceptance:
    - cache key includes model + prompt + parameters, and can be disabled.
  - Notes:
    - Disable with `REX_CACHE_ENABLED=0` (or `CACHE_ENABLED=0`)
    - TTL via `REX_CACHE_TTL_SECONDS` (default 86400)
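
A cache key covering model, prompt, and parameters can be derived by hashing a canonical JSON encoding, as sketched below. The env var names and the 86400 default come from the notes above. The function names, the assumption that caching is on by default, and the choice to let `REX_CACHE_ENABLED` take precedence over `CACHE_ENABLED` are illustrative guesses.

```python
import hashlib
import json
import os


def cache_enabled(env=None):
    """Caching is treated as on unless the flag is set to 0 (assumed default)."""
    env = os.environ if env is None else env
    # Assumption: REX_CACHE_ENABLED wins when both variables are set.
    value = env.get("REX_CACHE_ENABLED", env.get("CACHE_ENABLED", "1"))
    return value.strip() != "0"


def cache_ttl_seconds(env=None):
    """Read the TTL, falling back to the documented default of 86400."""
    env = os.environ if env is None else env
    return int(env.get("REX_CACHE_TTL_SECONDS", "86400"))


def cache_key(model, prompt, params):
    """Stable SHA-256 over model + prompt + parameters."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys before hashing means two requests with the same parameters in a different order share a cache entry, while any change to a parameter value produces a different key.
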
- Structured JSON logging (Python + Go)
  - Acceptance:
    - each request has a request-id; logs include model, latency, outcome.
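
On the Python side, the logging criterion above could be met with a small `logging.Formatter` that emits one JSON object per line. This is a sketch using only the standard library; the field names (`request_id`, `model`, `latency_ms`, `outcome`) and helper names are assumptions rather than the repo's actual schema.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with selected extra fields."""

    FIELDS = ("request_id", "model", "latency_ms", "outcome")

    def format(self, record):
        entry = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry, sort_keys=True)


def new_request_id():
    """Generate an opaque id to attach to every log line of one request."""
    return uuid.uuid4().hex


def make_logger(name="rex", stream=None):
    """Build a logger whose only handler writes JSON lines to `stream`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `logger.info("provider call finished", extra={"request_id": rid, "model": model, "latency_ms": 12.5, "outcome": "ok"})` then produces a single machine-parseable line per event.
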
- Metrics (Prometheus)
  - Acceptance:
    - counters/histograms for latency, errors, tokens, fanout sizes.
  - Endpoints:
    - Go fanout-service: `GET /metrics` (default `http://127.0.0.1:8099/metrics`)
    - Python API: `GET /metrics` (default `http://127.0.0.1:8000/metrics`)
- Distributed tracing (OpenTelemetry)
  - Acceptance:
    - trace spans connect API → orchestrator → provider calls (and Go service spans).
  - Current:
    - Go fanout-service emits spans when `OTEL_ENABLED=1`
    - Python API wraps `/api/run` in a top-level span and can init tracing via `OTEL_ENABLED=1`
    - Trace context propagates from Python → Go (`traceparent` injected on fanout request)
    - Provider spans exist in Python (fanout request + per-model LiteLLM calls)
  - Run:
    - Python: set `OTEL_ENABLED=1` (optional: `OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4318`)
    - Go: set `OTEL_ENABLED=1` (same exporter envs if using OTLP)
- Create `eval/` harness
  - Acceptance:
    - can run `pytest -m eval` (or similar) and produce a score report.
  - Current:
    - Deterministic smoke dataset at `eval/dataset.jsonl`
    - Test runner at `tests/test_eval_harness.py`
- Regression suite from traces
  - Acceptance:
    - store a small curated dataset of prompts + expected properties.
  - Current:
    - Snapshot: `eval/regression_traces.jsonl` (generated from deterministic simulated traces)
    - Test: `tests/test_regression_traces.py` (runs under `-m eval`)
    - Generator: `scripts/generate_regression_traces.py`
- Self-improvement loop should be test-gated
  - Acceptance:
    - improvement changes must improve score or be rejected.
  - Current:
    - CI runs eval tests on every change.
    - Local gate script: `scripts/run_improvement_gated.py` (eval-before/after wrapper)
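
The "improve score or be rejected" rule above amounts to an eval-before/after comparison. The sketch below shows that logic in isolation; it does not reproduce `scripts/run_improvement_gated.py`, and the function names, the `min_gain` threshold, and the revert hook are hypothetical.

```python
def gate_improvement(score_before, score_after, min_gain=0.0):
    """Accept a change only if the score rises by more than `min_gain`."""
    gain = score_after - score_before
    return {
        "accepted": gain > min_gain,
        "before": score_before,
        "after": score_after,
        "gain": gain,
    }


def run_gated(evaluate, apply_change, revert_change, min_gain=0.0):
    """Evaluate, apply a candidate change, re-evaluate, and revert on no gain."""
    before = evaluate()
    apply_change()
    after = evaluate()
    verdict = gate_improvement(before, after, min_gain)
    if not verdict["accepted"]:
        revert_change()
    return verdict
```

A strict inequality means a change that leaves the score unchanged is rejected, which keeps neutral churn out of the pipeline.
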
- Add streaming responses (server-sent events or websockets) for partial results
- Incrementally synthesize as responses arrive (not only after all complete)
- Configurable quotas/rate limits per “workspace” (not auth—just API client identity)
- Isolation in cache keys / metrics labels
- GitHub Actions pipeline
  - Acceptance:
    - runs tests, builds frontend, builds Go service
- Lint + type checks
  - Current:
    - Ruff bug-gate in CI (fails on syntax/undefined-name class issues)
    - Mypy baseline gate in CI (lenient config; tighten over time)
    - golangci-lint in CI (govet baseline)
- Docker Compose for full stack (backend + go service + redis + frontend)
- Benchmark suite with saved baseline comparisons
- Canary mode for new pipeline versions
- Architecture diagram in README (components + data flow)
- “How it scales” section (concurrency control, backpressure, rate-limits)
- Performance report (before/after Go service) in `results/`
- Reliability report: error rates + retry behavior + timeout behavior
- Evaluation report: scorecards + regression history