
Under Construction: This project is actively being developed and is not yet ready for production use. APIs and features may change without notice.

Episodic Memory & Semantic Cache for LLM Responses


Because nobody likes paying for the same token twice.

██████╗ ███████╗███████╗██╗     ███████╗██╗  ██╗
██╔══██╗██╔════╝██╔════╝██║     ██╔════╝╚██╗██╔╝
██████╔╝█████╗  █████╗  ██║     █████╗   ╚███╔╝
██╔══██╗██╔══╝  ██╔══╝  ██║     ██╔══╝   ██╔██╗
██║  ██║███████╗██║     ███████╗███████╗██╔╝ ██╗
╚═╝  ╚═╝╚══════╝╚═╝     ╚══════╝╚══════╝╚═╝  ╚═╝
  


What It Is

Reflex is an OpenAI-compatible HTTP cache for LLM responses: it sits between your agent/app and the provider, returning cached answers instantly and storing misses for later reuse. Cached responses are returned in Tauq format to reduce token overhead.


Quick Start (Server)

# 1. Start Qdrant (vector database; REST on 6333, gRPC on 6334)
docker run -d -p 6334:6334 -p 6333:6333 qdrant/qdrant

# 2. Run Reflex (HTTP server)
cargo run -p reflex-server --release

# 3. Point your agent to localhost:8080
export OPENAI_BASE_URL=http://localhost:8080/v1
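
For a quick smoke test, any OpenAI-compatible client will do once its base URL points at Reflex. Below is a minimal sketch using reqwest, tokio, and serde_json; the model name and prompt are placeholders, the endpoint is the standard /v1/chat/completions route implied by the base URL above, and the bearer token is assumed to be forwarded upstream to your provider.

// Sketch: POST a standard chat-completions request through Reflex.
// Assumes the server from step 2 is listening on localhost:8080; the
// model name is a placeholder for whatever your provider serves.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "model": "gpt-4o-mini",
        "messages": [{ "role": "user", "content": "What is a semantic cache?" }]
    });
    // First call is a miss and is forwarded to the provider; an
    // identical second call should be answered from the cache.
    let resp = reqwest::Client::new()
        .post("http://localhost:8080/v1/chat/completions")
        .bearer_auth(std::env::var("OPENAI_API_KEY").unwrap_or_default())
        .json(&body)
        .send()
        .await?;
    println!("{}", resp.text().await?);
    Ok(())
}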

Quick Start (Library)

# Run the library example (no HTTP server)
cargo run -p reflex-cache --example basic_lookup --features mock

Embed in your own app:

[dependencies]
reflex = { package = "reflex-cache", version = "x.x.x" }
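
The crate is under construction and its public API is still settling (see the note at the top), so treat the following as a hypothetical sketch rather than the real interface: Cache, Cache::connect, lookup, and insert are invented names for illustration only; consult the reflex-cache docs for the actual types.

// HYPOTHETICAL sketch: these names are invented for illustration and
// will differ from the real reflex-cache API.
use reflex::Cache; // dependency alias from the [dependencies] entry above

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Assumed: the cache talks to the Qdrant instance from the server
    // quick start (gRPC on 6334).
    let cache = Cache::connect("http://localhost:6334").await?;

    let prompt = "What is a semantic cache?";
    if let Some(hit) = cache.lookup(prompt).await? {
        println!("cache hit: {hit}");
    } else {
        let answer = call_provider(prompt).await; // your own provider call
        cache.insert(prompt, &answer).await?;
    }
    Ok(())
}

// Placeholder for an actual provider round trip.
async fn call_provider(_prompt: &str) -> String {
    todo!()
}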

Crates In This Repo

  • reflex-server — the OpenAI-compatible HTTP server (cargo run -p reflex-server)
  • reflex-cache — the embeddable caching library (imported above under the alias reflex)

How It Works (High Level)

Request → L1 (exact) → L2 (semantic) → L3 (rerank/verify) → Provider
  • L1: exact match (fast, in-memory)
  • L2: semantic retrieval (Qdrant vector search)
  • L3: verification (cross-encoder rerank to avoid false positives)
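
To make the tiering concrete, here is a self-contained sketch of the lookup policy (illustrative only, not Reflex's actual code): each tier is tried in order, and a candidate surfaced by the semantic tier must survive rerank verification before it counts as a hit.

// Illustrative three-tier lookup policy. The closures stand in for
// the real L1/L2/L3 components and the provider call.
use std::collections::HashMap;

fn cached_completion(
    l1: &HashMap<String, String>,              // L1: exact, in-memory
    semantic: impl Fn(&str) -> Option<String>, // L2: vector search (Qdrant)
    rerank_ok: impl Fn(&str, &str) -> bool,    // L3: cross-encoder verify
    call_provider: impl Fn(&str) -> String,    // fallback: real API call
    prompt: &str,
) -> String {
    // L1: a byte-for-byte match is cheapest, so try it first.
    if let Some(hit) = l1.get(prompt) {
        return hit.clone();
    }
    // L2: a semantically similar prompt may already be cached...
    if let Some(candidate) = semantic(prompt) {
        // ...but L3 must confirm it to avoid serving a near-miss.
        if rerank_ok(prompt, &candidate) {
            return candidate;
        }
    }
    // Miss at every tier: pay for the tokens once, then cache the
    // answer for next time (caching elided in this sketch).
    call_provider(prompt)
}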

Development

cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt -- --check

Reflex: Stop paying for the same token twice.

Built with Rust, Qdrant, and a healthy disdain for redundant API calls.