A from-scratch LLM inference server written in Rust with CUDA acceleration.
- OpenAI-compatible API: `/v1/chat/completions` with streaming (SSE) and non-streaming modes
- Continuous batching with chunked prefill for stable decode latency
- Structured output — JSON Schema → regex → DFA → token-level FSM constrained generation
- N-gram speculative decoding with adaptive draft length
- Paged KV cache with block-level memory management
- GPU-native attention — extract/mask/interleave ops run as CUDA kernels with no CPU roundtrip
- GGUF model loading with Q4_K_M and Q8_0 dequantization
- CUDA and CPU backends: CUDA support sits behind a Cargo feature, so CPU-only builds are possible
- Llama model family support (Llama 2/3, TinyLlama, etc.)
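
To make the n-gram speculative decoding item above concrete, here is a minimal sketch of one common way to draft tokens without a second model: find the most recent earlier occurrence of the last `n` tokens and copy what followed it. The function names `propose_draft` and `adapt_draft_len`, and the grow-by-one/shrink-by-one policy, are assumptions made for illustration; this README does not describe how forge-runtime actually implements drafting or adapts the draft length.

```rust
/// Propose up to `max_draft` tokens by locating the most recent earlier
/// occurrence of the last `n` context tokens and copying what followed it.
/// Hypothetical helper, not forge's API.
fn propose_draft(context: &[u32], n: usize, max_draft: usize) -> Vec<u32> {
    if n == 0 || context.len() <= n {
        return Vec::new();
    }
    let suffix = &context[context.len() - n..];
    // Scan start positions from newest to oldest; the range excludes the
    // suffix itself, so a match always has at least one following token.
    for start in (0..context.len() - n).rev() {
        if &context[start..start + n] == suffix {
            let follow = start + n;
            let end = (follow + max_draft).min(context.len());
            return context[follow..end].to_vec();
        }
    }
    Vec::new()
}

/// Assumed adaptation policy: grow the draft length by one when every
/// drafted token was accepted, otherwise shrink by one (never below 1).
fn adapt_draft_len(current: usize, accepted: usize, proposed: usize, max_len: usize) -> usize {
    if proposed > 0 && accepted == proposed {
        (current + 1).min(max_len)
    } else {
        current.saturating_sub(1).max(1)
    }
}
```

The drafted tokens still have to be verified by the model in a single forward pass; only the accepted prefix is kept, so a bad draft costs time but never changes the output.
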

| Crate | Description |
|---|---|
| `forge-core` | Shared traits (`Backend`, `Model`, `Scheduler`, `KvCache`), error types |
| `forge-backend` | Backend trait abstraction |
| `forge-backend-cpu` | CPU backend (OpenBLAS) |
| `forge-backend-cuda` | CUDA backend (cuBLAS + custom kernels) |
| `forge-kernels` | CUDA kernel sources (elementwise, norm, positional, memory, attention) |
| `forge-kvcache` | Naive and paged KV cache implementations |
| `forge-loader` | SafeTensors + GGUF model loaders |
| `forge-model-llama` | Llama model (attention, FFN, RoPE, GQA) |
| `forge-runtime` | Engine loop, sampling, FSM constraints, speculative decoding |
| `forge-scheduler` | Continuous batching scheduler |
| `forge-server` | Axum HTTP server, OpenAI-compatible API |
| `forge-transport` | In-process channel transport |
| `forge-quantize` | Quantization scaffolding |

- Rust stable (see `rust-toolchain.toml`)
- CUDA toolkit 12.x+ (for GPU backend)
- A Llama-family model in SafeTensors or GGUF format
# With CUDA (default)
cargo build --release
# CPU-only
cargo build --release --no-default-features

cargo run --release -- --model-path /path/to/model --port 8080

| Flag | Default | Description |
|---|---|---|
| `--model-path` | required | Path to SafeTensors model directory |
| `--port` | `8080` | HTTP server port |
| `--backend` | `cuda` | Backend: `cuda` or `cpu` |
| `--kv-cache` | `paged` | KV cache: `paged` or `naive` |
| `--max-batch-size` | `256` | Max concurrent sequences |
| `--max-prefill-tokens` | `4096` | Max prefill tokens per step |
| `--device` | `0` | CUDA device ordinal |
| `--block-size` | `auto` | Paged cache block size (tokens). `auto` picks 256 when CUDA + FA2 + F16 model conditions hold, else 16. An integer overrides it. |
| `--num-blocks` | `auto` | Total KV cache blocks. `auto` keeps ~32k token capacity: `ceil(32768 / block_size)`. |
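
The `auto` rules for `--block-size` and `--num-blocks` are simple arithmetic, sketched here so the resulting values are easy to check. The function names and the three boolean parameters are assumptions for illustration; only the constants (256, 16, 32768) and the ceiling division come from the flag descriptions.

```rust
/// Token capacity that `--num-blocks auto` aims for, per the flag description.
const TARGET_TOKENS: usize = 32_768;

/// `--block-size auto`: 256 when CUDA, FA2, and an F16 model all hold, else 16.
/// Hypothetical signature; how forge detects these conditions is not shown here.
fn auto_block_size(cuda: bool, fa2: bool, f16_model: bool) -> usize {
    if cuda && fa2 && f16_model { 256 } else { 16 }
}

/// `--num-blocks auto`: ceil(32768 / block_size), written as integer
/// ceiling division. `block_size` must be non-zero.
fn auto_num_blocks(block_size: usize) -> usize {
    (TARGET_TOKENS + block_size - 1) / block_size
}
```

With these rules, a block size of 256 yields 128 blocks and a block size of 16 yields 2048 blocks; both cover exactly 32768 tokens. A block size that does not divide 32768 rounds up, so capacity is never below the target.
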
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100,
"stream": false
}'

curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'

curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"messages": [{"role": "user", "content": "What is 2+2?"}],
"json_schema": {
"type": "object",
"properties": {
"answer": {"type": "integer"},
"explanation": {"type": "string"}
},
"required": ["answer", "explanation"]
}
}'

- `GET /v1/models`: List available models
- `GET /forge/v1/health`: Health check
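
For the `json_schema` request above, constrained generation comes down to one question at every decode step: which tokens keep the output inside the language the DFA accepts? The sketch below answers it for a toy DFA that accepts `[0-9]+`, the kind of fragment an `integer` field would compile to. The `Dfa` struct, `digits_dfa`, and `allowed_tokens` are hypothetical names; forge-runtime's real FSM representation is not shown in this README, and a production version would likely precompute the allowed set per state rather than walk the vocabulary on every step.

```rust
use std::collections::HashMap;

/// Tiny byte-level DFA: a missing (state, byte) entry means "reject".
/// Illustrative only, not forge's actual data structure.
struct Dfa {
    transitions: HashMap<(u32, u8), u32>,
}

impl Dfa {
    /// Feed every byte of `text` starting at `state`; None if any byte
    /// has no transition.
    fn walk(&self, state: u32, text: &str) -> Option<u32> {
        text.bytes()
            .try_fold(state, |s, b| self.transitions.get(&(s, b)).copied())
    }
}

/// Toy DFA for the regex `[0-9]+`: state 0 is the start, state 1 means
/// "at least one digit seen".
fn digits_dfa() -> Dfa {
    let mut transitions = HashMap::new();
    for b in b'0'..=b'9' {
        transitions.insert((0, b), 1);
        transitions.insert((1, b), 1);
    }
    Dfa { transitions }
}

/// For the given DFA state, list each allowed token id together with the
/// state reached after emitting that token. Empty tokens are skipped.
fn allowed_tokens(dfa: &Dfa, state: u32, vocab: &[&str]) -> Vec<(usize, u32)> {
    vocab
        .iter()
        .enumerate()
        .filter(|(_, tok)| !tok.is_empty())
        .filter_map(|(id, tok)| dfa.walk(state, tok).map(|next| (id, next)))
        .collect()
}
```

During sampling, the logits of every token id not in the allowed list would be masked out before the next token is chosen, and the engine would advance to the state paired with the sampled token.
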
# Unit tests (149 tests)
cargo test --workspace
# Integration test with a model
bash scripts/test_server.sh /path/to/model
# Benchmark (TTFT, ITL, throughput)
bash scripts/benchmark.sh /path/to/model

License: Apache-2.0