ML inference service for embeddings, chunking, reranking, classification, NER, OCR, transcription, text generation, and more — with two-tier caching (memory + singleflight).
Termite is the companion ML service for Antfly, the distributed search engine. It runs automatically in Antfly's swarm mode and can also be used standalone. If you're new, the Antfly quickstart is the fastest way to see everything working together.
- Embeddings — dense and sparse vectors, multimodal (text, images, audio)
- Chunking — semantic text segmentation
- Reranking — cross-encoder relevance scoring
- Classification — zero-shot text classification (NLI-based, 100+ languages)
- Recognition (NER) — named entity recognition, zero-shot labels, relation extraction
- Reading (OCR) — document understanding, OCR, visual question answering
- Transcription — speech-to-text (Whisper, Wav2Vec2)
- Extraction — schema-based structured data extraction
- Rewriting — paraphrasing, question generation (Seq2Seq)
- Generation — text generation with tool calling (OpenAI-compatible)
- Multiple backends — ONNX Runtime, XLA (TPU/CUDA), pure Go
- SIMD / SME acceleration — vector math uses hardware intrinsics via go-highway on x86 and ARM
- Native Go ML — XLA backend powered by GoMLX and GoLLMX, working toward making native Go ML/LLM inference a reality
- Kubernetes operator — autoscaling with TermitePool and TermiteRoute CRDs
```bash
# Standalone server
go run ./cmd/termite run
```

Termite supports multiple inference backends, selected via build tags. The omni build includes everything so you can pick at runtime.
| Build | Tags | Description | Use Case |
|---|---|---|---|
| Pure Go | (none) | No CGO, always works | Development, testing |
| ONNX | `onnx,ORT` | Fast CPU/GPU via ONNX Runtime | Production (recommended) |
| XLA | `xla,XLA` | TPU/CUDA via GoMLX | Cloud TPU, NVIDIA GPU |
| Omni | `onnx,ORT,xla,XLA` | All backends | Maximum flexibility |
Includes both ONNX and XLA backends — pick which one to use at runtime without recompiling.
# Download dependencies for all platforms
./scripts/download-onnxruntime.sh
./scripts/download-pjrt.sh
# Build omni binary
CGO_ENABLED=1 go build -tags="onnx,ORT,xla,XLA" -o termite ./pkg/termite/cmd
# Run with backend priority (tries in order until one works)
./termite run --backend-priority="onnx:cuda,xla:tpu,onnx:cpu,go"Dependencies:
- ONNX Runtime - download for your platform or install with homebrew
- Tokenizers - HuggingFace tokenizers bindings
```bash
# Download dependencies
./scripts/download-onnxruntime.sh

# Or manually (macOS with homebrew)
CGO_ENABLED=1 \
DYLD_LIBRARY_PATH=/opt/homebrew/opt/onnxruntime/lib \
CGO_LDFLAGS="-L$(pwd) -ltokenizers" \
go run -tags="onnx,ORT" ./pkg/termite/cmd run
```

For TPU or CUDA GPU acceleration via the GoMLX XLA backend. Hardware is autodetected.
Dependencies:
- PJRT CPU Binaries - prebuilt XLA PJRT plugins
- Tokenizers - HuggingFace tokenizers bindings
```bash
# Download dependencies
./scripts/download-pjrt.sh

# Build with XLA support
go build -tags="xla,XLA" -o termite ./pkg/termite/cmd

# Run with autodetection (TPU > CUDA > CPU)
./termite run
```

Autodetection:
- TPU: Detected via `libtpu.so`, `/dev/accel*` devices, or GKE TPU node metadata
- CUDA: Detected via `nvidia-smi` or `libcudart.so` in library path
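If autodetection picks the wrong device, the backend order can be pinned explicitly at startup with the same `--backend-priority` flag shown in the omni build example. A minimal sketch; adjust the device specifiers to the hardware actually present:

```bash
# Skip autodetection: try CUDA via XLA first, then fall back to CPU XLA
./termite run --backend-priority="xla:cuda,xla:cpu"
```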
Installing Additional PJRT Plugins:
The omni and XLA builds bundle a CPU PJRT plugin that's auto-discovered from lib/ next to the binary. For TPU or CUDA, install the right plugin:
```bash
# Install TPU plugin (for Google Cloud TPU)
go run github.com/gomlx/go-xla/cmd/pjrt_installer@latest -plugin=tpu

# Install CUDA plugin (for NVIDIA GPU)
go run github.com/gomlx/go-xla/cmd/pjrt_installer@latest -plugin=cuda

# Install to a specific location
go run github.com/gomlx/go-xla/cmd/pjrt_installer@latest -plugin=tpu -path=/usr/local/lib/go-xla
```

Installed plugins are found automatically via standard go-xla search paths. To override, set `PJRT_PLUGIN_LIBRARY_PATH`.
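For example, when a plugin was installed to a custom location as in the `-path` invocation above, the override can be set inline when starting the server (a sketch; the path is whatever you passed to the installer):

```bash
# Point Termite at a custom PJRT plugin directory before starting the server
PJRT_PLUGIN_LIBRARY_PATH=/usr/local/lib/go-xla ./termite run
```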
Platform Availability:
| Platform | PJRT CPU | Notes |
|---|---|---|
| linux-amd64 | Yes | |
| linux-arm64 | Yes | |
| darwin-arm64 | Yes | Apple Silicon |
| darwin-amd64 | No | Intel Mac not supported upstream |
Pull from registry:
```bash
termite pull bge-small-en-v1.5
termite pull mxbai-rerank-base-v1
termite pull chonky-mmbert-small-multilingual-1

# List available models
termite list --remote
```

Models are auto-discovered from `chunker_models_dir`, `embedder_models_dir`, and `reranker_models_dir`.
| Model | Size | Dims | Variants | Notes |
|---|---|---|---|---|
| `bge-small-en-v1.5` | 128MB | 384 | f16, i8 | Fast English embeddings |
| `all-MiniLM-L6-v2` | 87MB | 384 | f32, f16, i8 | Fastest, good quality |
| `all-mpnet-base-v2` | 418MB | 768 | f32, f16, i8 | Best sentence-transformers accuracy |
| `nomic-embed-text-v1.5` | 548MB | 768 | f16, i8 | 8K context, Matryoshka dims |
| `bge-m3` | 2.2GB | 1024 | f16, i8 | 100+ languages, 8K context |
| `gte-Qwen2-1.5B-instruct` | 6GB | 1536 | f16 | 32K context, instruction-following |
| `snowflake-arctic-embed-l-v2.0` | 1.3GB | 1024 | f16, i8 | Retrieval-optimized, Matryoshka |
| `stella_en_1.5B_v5` | 6GB | 1024 | f16 | Premium English, top MTEB scores |
| `embeddinggemma-300m-ONNX` | 1.2GB | 768 | f16, q4, q4f16 | Multilingual, edge-optimized |
| `splade-cocondenser-ensembledistil` | — | sparse | f32 | Sparse embeddings (SPLADE) |
| Model | Size | Dims | Variants | Notes |
|---|---|---|---|---|
| `clip-vit-base-patch32` | 584MB | 512 | f16, i8 | Text + image embeddings (CLIP) |
| `clipclap` | — | 512 | — | Text + image (CLIP variant) |
| `clap-htsat-unfused` | — | 512 | — | Audio + text embeddings (CLAP) |
| Model | Size | Variants |
|---|---|---|
| `mxbai-rerank-base-v1` | 713MB | f16, i8 |
| Model | Size | Variants |
|---|---|---|
| `chonky-mmbert-small-multilingual-1` | 570MB | f16, i8 |
| Model | Size | Variants | Notes |
|---|---|---|---|
| `mDeBERTa-v3-base-mnli-xnli` | — | f32, f16, i8 | Zero-shot, 100+ languages |
| `bart-large-mnli` | — | f32, f16, i8 | Zero-shot, English |
| Model | Size | Variants | Capabilities |
|---|---|---|---|
| `bert-base-NER` | 413MB | f32, f16, i8 | labels |
| `bert-large-NER` | 1.3GB | f32, f16, i8 | labels |
| `gliner_small-v2.1` | 199MB | f32, f16, i8 | labels, zeroshot |
| `gliner2-base-v1` | — | f32, f16, i8 | labels, zeroshot (improved) |
| `gliner-multitask-large-v0.5` | 1.3GB | f32, f16, i8 | labels, zeroshot, relations, answers |
| `rebel-large` | 3.0GB | - | relations |
| Model | Size | Variants | Notes |
|---|---|---|---|
| `trocr-base-printed` | — | — | Printed text OCR |
| `donut-base-finetuned-cord-v2` | — | — | Receipt/form parsing |
| `donut-base-finetuned-docvqa` | — | — | Document question answering |
| `moondream2` | — | — | General vision understanding |
| Model | Size | Variants | Notes |
|---|---|---|---|
| `whisper-tiny.en` | — | — | OpenAI Whisper, English |
| Model | Size | Variants | Notes |
|---|---|---|---|
| `gliner2-base-v1` | — | f32, f16, i8 | Schema-based field extraction |
| Model | Size | Variants |
|---|---|---|
| `flan-t5-small-squad-qg` | 569MB | - |
| `pegasus_paraphrase` | 4.5GB | - |
| Model | Size | Variants |
|---|---|---|
| `functiongemma-270m-it` | 1.1GB | - |
| `gemma-3-1b-it` | 3.7GB | - |
Models come in multiple precision variants, trading off size and speed for accuracy:
| Variant | File | Description |
|---|---|---|
| (default) | `model.onnx` | FP32 baseline - highest accuracy |
| `f16` | `model_f16.onnx` | FP16 - ~50% smaller, recommended for ARM64/M-series |
| `i8` | `model_i8.onnx` | INT8 dynamic quantization - smallest, fastest CPU inference |
Pull specific variants:
```bash
# Pull using variant suffix (recommended)
termite pull bge-small-en-v1.5-i8

# Or use --variants flag
termite pull --variants i8 bge-small-en-v1.5

# Pull multiple models with same variant
termite pull bge-small-en-v1.5-i8 mxbai-rerank-base-v1-i8

# Pull multiple variants for one model
termite pull --variants f16,i8 bge-small-en-v1.5
```

Use variants in config:
```yaml
embedder:
  provider: termite
  model: bge-small-en-v1.5-f16  # Use FP16 variant
```

Termite auto-selects the best available variant if not specified.
All endpoints accept JSON. See openapi.yaml for full schema details.
| Endpoint | Method | Description |
|---|---|---|
| `/api/embed` | POST | Generate dense and sparse embeddings (text, image, audio) |
| `/api/chunk` | POST | Chunk text into semantic segments |
| `/api/rerank` | POST | Rerank documents by relevance |
| `/api/classify` | POST | Zero-shot text classification |
| `/api/recognize` | POST | Named entity recognition |
| `/api/read` | POST | OCR and document understanding |
| `/api/transcribe` | POST | Speech-to-text transcription |
| `/api/extract` | POST | Schema-based structured data extraction |
| `/api/rewrite` | POST | Text rewriting (paraphrase, question generation) |
| `/api/generate` | POST | Text generation (OpenAI-compatible) |
| `/api/models` | GET | List available models |
| `/api/version` | GET | Version info |
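As a quick smoke test once the server is running, you can call `/api/embed` with curl. This is a minimal sketch: it assumes the default `api_url` of `http://localhost:11433` from the configuration example below, a pulled `bge-small-en-v1.5` model, and the OpenAI-style content format described next; see openapi.yaml for the authoritative request schema.

```bash
# Embed a single text input (request shape follows the content format shown below)
curl -s http://localhost:11433/api/embed \
  -H "Content-Type: application/json" \
  -d '{
        "model": "bge-small-en-v1.5",
        "input": [
          {"content": [{"type": "text", "text": "distributed search engines"}]}
        ]
      }'
```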
The /api/embed endpoint supports multimodal input using the OpenAI content format:
```json
{
  "model": "clip-vit-base-patch32",
  "input": [
    {
      "content": [
        {"type": "text", "text": "a photo of a cat"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
      ]
    }
  ]
}
```

Config via file (`termite.yaml`), flags, or environment variables (`TERMITE_` prefix):
```yaml
api_url: "http://localhost:11433"
models_dir: "./models"

# Backend priority with optional device specifiers
# Format: "backend" or "backend:device"
# Devices: auto (default), cuda, coreml, tpu, cpu
backend_priority:
  - onnx:cuda   # Try ONNX with CUDA first
  - xla:tpu     # Then XLA with TPU
  - onnx:cpu    # Fall back to ONNX CPU
  - go          # Pure Go fallback (always works)

keep_alive: "5m"
max_loaded_models: 3
allow_downloads: false

log:
  level: info
  style: terminal
```

The `backend_priority` setting controls which backends Termite tries, in order. Each entry can be:
- Backend only: `onnx`, `xla`, `go` - uses auto device detection
- Backend with device: `onnx:cuda`, `xla:tpu`, `onnx:coreml` - explicit device
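The same priority list can also be set without editing the config file, either with the `--backend-priority` flag shown earlier or through the environment. A small sketch: the flag is documented above, while the exact environment variable name is an assumption based on the `TERMITE_` prefix mapping and may differ in your build.

```bash
# Per-invocation override via the documented flag
./termite run --backend-priority="onnx:coreml,go"

# Assumed environment-variable form (TERMITE_ prefix); verify against your build
TERMITE_BACKEND_PRIORITY="onnx:coreml,go" ./termite run
```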
Available backends (depend on build tags):
| Backend | Build Tags | Devices Supported |
|---|---|---|
| `onnx` | `onnx,ORT` | cuda, coreml (macOS), cpu |
| `xla` | `xla,XLA` | tpu, cuda, cpu |
| `go` | (none) | cpu only |
Example configurations:
```yaml
# GPU-first with CPU fallback
backend_priority: ["onnx:cuda", "xla:cuda", "onnx:cpu", "go"]

# macOS with CoreML acceleration
backend_priority: ["onnx:coreml", "go"]

# Cloud TPU deployment
backend_priority: ["xla:tpu", "xla:cpu"]

# Simple auto-detection (default)
backend_priority: ["onnx", "xla", "go"]
```

Deploy on GKE with TPU support using the Termite Operator.
TermitePool: manages a pool of Termite replicas with autoscaling.
```yaml
apiVersion: termite.antfly.io/v1alpha1
kind: TermitePool
metadata:
  name: embeddings-pool
spec:
  workloadType: read-heavy
  models:
    preload:
      - name: bge-small-en-v1.5
        variant: i8
        priority: high
        strategy: eager           # Always loaded, never evicted
      - name: mxbai-rerank-base-v1
        variant: i8
        priority: high
        # strategy defaults to loadingStrategy (lazy)
    loadingStrategy: lazy         # Default for models without explicit strategy
    keepAlive: 5m                 # Idle timeout for lazy models
  replicas:
    min: 2
    max: 10
  hardware:
    accelerator: tpu-v5-lite-podslice
    topology: "2x2"
  autoscaling:
    enabled: true
    metrics:
      - type: queue-depth
        target: "50"
```

TermiteRoute: routes traffic to pools based on model or endpoint.
```bash
# Build operator
go build -o termite-operator ./cmd/termite-operator

# Generate CRDs and RBAC manifests
make generate
```

See `pkg/operator/` for CRD definitions and controller implementation. The model registry protocol is formally specified in TLA+.
Discord for questions, discussion, and updates.
Apache License 2.0