Under Construction: This project is actively being developed and is not yet ready for production use. APIs and features may change without notice.
Episodic Memory & Semantic Cache for LLM Responses
Because nobody likes paying for the same token twice.
```
██████╗ ███████╗███████╗██╗     ███████╗██╗  ██╗
██╔══██╗██╔════╝██╔════╝██║     ██╔════╝╚██╗██╔╝
██████╔╝█████╗  █████╗  ██║     █████╗   ╚███╔╝
██╔══██╗██╔══╝  ██╔══╝  ██║     ██╔══╝   ██╔██╗
██║  ██║███████╗███████╗███████╗███████╗██╔╝ ██╗
╚═╝  ╚═╝╚══════╝╚══════╝╚══════╝╚══════╝╚═╝  ╚═╝
```
Reflex is an OpenAI-compatible HTTP cache for LLM responses: it sits between your agent/app and the provider, returning cached answers instantly and storing misses for later reuse. Cached responses are returned in Tauq format to reduce token overhead.
```bash
# 1. Start Qdrant (vector database)
docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant

# 2. Run Reflex (HTTP server)
cargo run -p reflex-server --release

# 3. Point your agent to localhost:8080
export OPENAI_BASE_URL=http://localhost:8080/v1
```
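Once the server is running, any OpenAI-style client can exercise the cache by pointing its base URL at Reflex. The sketch below is one way to do that from Rust; it assumes `reqwest` (with the `blocking` and `json` features) and `serde_json` as dependencies, the standard `/v1/chat/completions` route implied by OpenAI compatibility, and a placeholder model name.

```rust
// Minimal sketch: send the same request twice through Reflex.
// Assumes reqwest = { features = ["blocking", "json"] } and serde_json.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let body = json!({
        "model": "gpt-4o-mini", // placeholder model name
        "messages": [{ "role": "user", "content": "What is a semantic cache?" }]
    });

    for attempt in 1..=2 {
        // The first call is a miss and is forwarded to the provider;
        // the second, identical call should be answered from the cache.
        let resp = client
            .post("http://localhost:8080/v1/chat/completions")
            .bearer_auth(std::env::var("OPENAI_API_KEY")?)
            .json(&body)
            .send()?;
        println!("attempt {attempt}: {}", resp.status());
        println!("{}", resp.text()?); // cache hits come back in Tauq format
    }
    Ok(())
}
```

The first call should be forwarded to the provider; repeating it should come back from the cache, in Tauq format, without paying the provider again.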
```bash
# Run the library example (no HTTP server)
cargo run -p reflex-cache --example basic_lookup --features mock
```

Embed in your own app:

```toml
[dependencies]
reflex = { package = "reflex-cache", version = "x.x.x" }
```

- Server + binary (`reflex`): `crates/reflex-server`
- Core library (embedded use): `crates/reflex-cache` (docs.rs: https://docs.rs/reflex-cache)
```
Request → L1 (exact) → L2 (semantic) → L3 (rerank/verify) → Provider
```
- L1: exact match (fast, in-memory)
- L2: semantic retrieval (Qdrant vector search)
- L3: verification (cross-encoder rerank to avoid false positives)
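As a rough sketch of that flow (an illustration only, not Reflex's actual internals), a tiered lookup amounts to something like the following, with `semantic_search` and `rerank_accepts` standing in for the Qdrant query and the cross-encoder check:

```rust
// Conceptual sketch of the tiered lookup described above, not Reflex's
// actual implementation. `semantic_search` and `rerank_accepts` are
// stand-ins for the Qdrant query and the cross-encoder check.
use std::collections::HashMap;

struct TieredCache {
    exact: HashMap<String, String>, // L1: exact prompt -> cached response
}

impl TieredCache {
    fn lookup(&self, prompt: &str) -> Option<String> {
        // L1: exact match (fast, in-memory).
        if let Some(hit) = self.exact.get(prompt) {
            return Some(hit.clone());
        }
        // L2: semantic retrieval: nearest-neighbour search over embeddings,
        // returning candidate (stored prompt, cached response) pairs.
        let candidates = semantic_search(prompt);
        // L3: verification: rerank candidates so a merely similar-looking
        // prompt does not get served someone else's answer.
        candidates
            .into_iter()
            .find(|(stored_prompt, _)| rerank_accepts(prompt, stored_prompt))
            .map(|(_, response)| response)
        // On a miss the caller forwards the request to the provider and
        // stores the fresh response for later reuse.
    }
}

// Stand-ins for the vector store and the cross-encoder.
fn semantic_search(_prompt: &str) -> Vec<(String, String)> {
    Vec::new()
}
fn rerank_accepts(_query: &str, _stored: &str) -> bool {
    false
}

fn main() {
    let cache = TieredCache { exact: HashMap::new() };
    assert_eq!(cache.lookup("hello"), None); // empty cache: every lookup misses
}
```

The L3 verification step is what keeps a semantically close but genuinely different question from being served the wrong cached answer.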
Development checks:

```bash
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt -- --check
```

Reflex: Stop paying for the same token twice.
Built with Rust, Qdrant, and a healthy disdain for redundant API calls.