llama.cpp Studio is a local control plane for downloading, configuring, and serving LLMs from a single machine.
The project combines:
- a Vue 3 frontend
- a FastAPI backend
- YAML-backed state under `data/`
- a unified `llama-swap` OpenAI-compatible endpoint on port `2000`
Today, the app manages three runtime families:

- `llama.cpp` for GGUF models
- `ik_llama.cpp` for GGUF models
- LMDeploy for safetensors models
This README has been rebuilt to match the current repository layout and runtime behavior.
- Search Hugging Face for `gguf` and `safetensors` models
- Download GGUF quantizations, optional `mmproj` projector files, and safetensors bundles
- Store model and engine state in YAML instead of SQLite
- Build `llama.cpp` and `ik_llama.cpp` from source and manage multiple installed versions
- Install LMDeploy from PyPI or from source into a dedicated virtual environment
- Install CUDA Toolkit versions into the persistent app data directory
- Configure models per engine using a parameter catalog parsed from the active runtime binary
- Serve models through one OpenAI-compatible endpoint exposed by `llama-swap`
- Stream progress and notifications over Server-Sent Events
| Purpose | Docker / container | Local dev |
|---|---|---|
| Web UI + FastAPI API | http://localhost:8080 | frontend: http://localhost:5173, backend API: http://localhost:8081 |
| OpenAI-compatible model endpoint | http://localhost:2000 | http://localhost:2000 |
| OpenAPI docs | http://localhost:8080/docs | http://localhost:8081/docs |
| Raw schema | http://localhost:8080/openapi.json | http://localhost:8081/openapi.json |
In local dev, Vite proxies `/api` from `5173` to `127.0.0.1:8081`.
```text
Browser UI (Vue 3)
  -> FastAPI backend
       -> YAML config in data/config/
       -> Hugging Face downloads in data/models/ and data/hf-cache/
       -> engine installs in data/llama-cpp/ and data/lmdeploy/
       -> CUDA installs in data/cuda/
       -> llama-swap config in data/llama-swap-config.yaml
  -> llama-swap on :2000
       -> llama.cpp / ik_llama.cpp / LMDeploy runtimes
```
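The generated proxy configuration at `data/llama-swap-config.yaml` follows the upstream llama-swap format. A hypothetical excerpt is sketched below; the `models`/`cmd`/`proxy` field names follow upstream llama-swap conventions, and every path and model name is made up — the app writes the real file itself:

```yaml
# Illustrative only: the app generates this file. Do not edit it by hand.
models:
  "qwen2.5-7b-instruct-Q4_K_M":           # sanitized repo + quantization, or a custom alias
    cmd: >
      /app/data/llama-cpp/current/build/bin/llama-server
      -m /app/data/models/gguf/qwen2.5-7b-instruct-Q4_K_M.gguf
      --port ${PORT}
    proxy: "http://127.0.0.1:${PORT}"
```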
The backend starts llama-swap automatically when there is an active llama.cpp or ik_llama.cpp binary available.
1. Start the app.
2. Open **Engines**.
3. Build and activate a `llama.cpp` or `ik_llama.cpp` version for GGUF models.
4. If you want safetensors support, install and activate LMDeploy.
5. If you need gated Hugging Face access, set `HUGGINGFACE_API_KEY` or enter a token in the UI.
6. Open **Search**, find a model, and download it.
7. Open **Models**, configure the model, and choose its engine.
8. If the UI says the `llama-swap` config is stale, apply the pending config.
9. Start the model from the library.
10. Call it through `http://localhost:2000/v1/...`.
Important:
- Saving model config updates the YAML store immediately.
- Applying pending `llama-swap` config rewrites `data/llama-swap-config.yaml` and unloads models before regenerating proxy state.
- GGUF models require an active `llama.cpp` or `ik_llama.cpp` build.
- safetensors models require an active LMDeploy install.
- Docker
- Docker Compose
- For NVIDIA GPU use: NVIDIA drivers on the host plus the NVIDIA Container Toolkit
```bash
git clone <repo-url>
cd llama-cpp-studio
docker compose -f docker-compose.cpu.yml up --build
```

This mode:

- exposes `8080` for the UI/API
- exposes `2000` for `llama-swap`
- mounts `./data` to `/app/data`
- mounts `./backend` to `/app/backend`
- enables backend reload with `RELOAD=true`
This is the best Docker option for backend-focused development. The frontend is still the built bundle from the image, not a live Vite dev server.
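For orientation, the behavior described above corresponds to a compose service roughly like this sketch; the real `docker-compose.cpu.yml` in the repository is authoritative:

```yaml
services:
  llama-cpp-studio:
    build: .
    ports:
      - "8080:8080"    # web UI + FastAPI API
      - "2000:2000"    # llama-swap proxy
    volumes:
      - ./data:/app/data         # persistent state
      - ./backend:/app/backend   # live backend code for reload
    environment:
      - RELOAD=true              # backend auto-reload
```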
```bash
docker compose -f docker-compose.cuda.yml up --build -d
```

This mode:

- exposes the same ports: `8080` and `2000`
- mounts `./data` to `/app/data`
- reserves NVIDIA GPUs for the container
- disables backend reload
```bash
docker build -t llama-cpp-studio .
docker run -d \
  --name llama-cpp-studio \
  -p 8080:8080 \
  -p 2000:2000 \
  -v "$(pwd)/data:/app/data" \
  llama-cpp-studio
```

For NVIDIA GPUs, add `--gpus all`.
Open:

- UI: `http://localhost:8080`
- OpenAPI docs: `http://localhost:8080/docs`
- model endpoint: `http://localhost:2000/v1/models`
- Node.js 20+
- Python 3
- a virtual environment tool such as `venv`
- a `python` executable on `PATH` if you want to use the provided `npm` scripts as-is
If you want to build runtimes on the host instead of inside Docker, you will also need native build tooling such as:
- `cmake`
- `build-essential`
- `git`
- `pkg-config`
- `libopenblas-dev` for OpenBLAS-backed CPU builds
The repository scripts use `python`, so if your system only provides `python3` you should either add a `python` alias or translate the commands below accordingly.
```bash
npm install
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
```

```bash
npm run dev:all
```

That starts:

- frontend Vite dev server on `5173`
- backend on `8081`

You can also run them separately:

```bash
npm run dev:frontend
npm run dev:backend
```

Build the frontend, then let FastAPI serve the built assets:

```bash
npm run build
python backend/main.py
```

When running outside Docker, the app stores persistent state under `./data`.
```bash
npm run test
```

Or run suites individually:

```bash
npm run test:frontend
python -m pytest backend/tests -q
```

The app is built around a persistent writable data directory. In Docker that is `/app/data`. Outside Docker it is `./data`.
Typical layout:
```text
data/
  config/
    models.yaml
    engines.yaml
    settings.yaml
    engine_params_catalog.yaml
  models/
    gguf/
    safetensors/
  hf-cache/
  llama-cpp/
  lmdeploy/
  cuda/
  logs/
  temp/
  llama-swap-config.yaml
```
What these are used for:
- `config/models.yaml`: downloaded models and per-model configuration
- `config/engines.yaml`: installed engine versions, active versions, and build settings
- `config/settings.yaml`: app settings such as Hugging Face token and proxy port
- `config/engine_params_catalog.yaml`: parsed CLI parameter catalog used by the model config UI
- `models/`: downloaded GGUF and safetensors assets
- `hf-cache/`: Hugging Face cache
- `llama-cpp/`: source checkouts and build artifacts for `llama.cpp` and `ik_llama.cpp`
- `lmdeploy/`: LMDeploy virtual environments and source installs
- `cuda/`: CUDA Toolkit installs managed by the app
- `logs/`: installer and background task logs
- `llama-swap-config.yaml`: generated proxy configuration
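As a sketch of what YAML-backed state means in practice, here is how a store like `config/models.yaml` could be read with PyYAML. The schema shown is hypothetical — inspect your own file for the real field names:

```python
# Hypothetical models.yaml schema, inlined for illustration.
import yaml  # PyYAML, already a backend dependency for the YAML store

sample = """
models:
  - id: qwen2.5-7b-instruct-q4_k_m
    repo: Qwen/Qwen2.5-7B-Instruct-GGUF
    engine: llama.cpp
    quant: Q4_K_M
"""

store = yaml.safe_load(sample)
ids = [m["id"] for m in store["models"]]
print(ids)  # ['qwen2.5-7b-instruct-q4_k_m']
```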
GGUF models are managed as quantized entries grouped by Hugging Face repo. They run through either:
- `llama.cpp`
- `ik_llama.cpp`
Current engine management behavior:
- multiple versions can be installed and retained
- versions can be activated or deleted
- updates build the latest source ref with the saved build settings
- parameter support is discovered by scanning the active binary's `--help` output
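The `--help` scan can be pictured as a small text-parsing pass. This sketch is illustrative, not the app's actual parser, and the help excerpt is abbreviated from typical `llama-server --help` output:

```python
# Discover long-form flags by scanning a runtime binary's --help text.
import re

help_text = """
  -m,    --model FNAME        model path
  -c,    --ctx-size N         size of the prompt context
         --flash-attn         enable Flash Attention
"""

# Capture flags like --ctx-size; short flags (-m, -c) are ignored here.
flags = set(re.findall(r"--([a-z][a-z0-9-]*)", help_text))
print(sorted(flags))  # ['ctx-size', 'flash-attn', 'model']
```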
safetensors repos are managed as logical model bundles and run through LMDeploy.
Current LMDeploy flows:
- install latest or specific version from PyPI
- install from a source repository and branch
- keep multiple installs in the engine registry
- activate or remove installs from the UI
CUDA is managed as a persistent install under `data/cuda/`.
The Docker image is prepared to use:
- `CUDA_HOME`
- `CUDA_PATH`
- `LD_LIBRARY_PATH`
- NCCL-related include and library paths
The app can install multiple CUDA versions and keeps a `current` symlink for the active one.
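The `current` symlink is the classic pattern for switching an active versioned install without copying files. A minimal sketch with made-up paths (the app manages `data/cuda/` itself):

```python
# Switch an "active" version by repointing a current symlink.
import os
import tempfile

root = tempfile.mkdtemp()
for version in ("12.4", "12.6"):
    os.makedirs(os.path.join(root, version))

def activate(root: str, version: str) -> None:
    """Point root/current at root/<version>, replacing any previous link."""
    link = os.path.join(root, "current")
    tmp = link + ".tmp"
    os.symlink(os.path.join(root, version), tmp)
    os.replace(tmp, link)  # rename over the old link; atomic on POSIX

activate(root, "12.4")
activate(root, "12.6")  # switching versions just repoints the link
print(os.path.basename(os.readlink(os.path.join(root, "current"))))  # 12.6
```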
Each model stores configuration per engine. In practice that means:
- switching a model from `llama.cpp` to `ik_llama.cpp` or LMDeploy does not destroy the other engine sections
- the UI shows parameters based on the scanned catalog for the active engine
- unsupported flags can be hidden in the config view
- raw custom CLI args can be appended
- the saved `llama-swap` command can be previewed from the UI and API
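The per-engine sections can be pictured as independent keys within one model document, so changing the active engine never rewrites the others. Field names below are hypothetical:

```python
# Per-engine config sections: switching the active engine touches one key.
from copy import deepcopy

model_cfg = {
    "active_engine": "llama.cpp",
    "engines": {
        "llama.cpp":    {"ctx_size": 8192, "flash_attn": True},
        "ik_llama.cpp": {"ctx_size": 4096},
        "lmdeploy":     {"tp": 1},
    },
}

def switch_engine(cfg: dict, engine: str) -> dict:
    """Return a copy with a new active engine; other sections are preserved."""
    out = deepcopy(cfg)
    out["active_engine"] = engine
    return out

switched = switch_engine(model_cfg, "lmdeploy")
print(switched["engines"]["llama.cpp"]["ctx_size"])  # 8192 -- still intact
```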
Useful model-related routes:
- `GET /api/models`
- `POST /api/models/search`
- `POST /api/models/download`
- `GET /api/models/{id}/config`
- `PUT /api/models/{id}/config`
- `GET /api/models/{id}/saved-llama-swap-cmd`
- `POST /api/models/{id}/preview-llama-swap-cmd`
- `POST /api/models/{id}/start`
- `POST /api/models/{id}/stop`
The user-facing inference endpoint is the llama-swap proxy on port 2000.
Useful requests:
```bash
curl http://localhost:2000/v1/models
```

```bash
curl http://localhost:2000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "replace-with-a-model-id-from-v1-models",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }'
```

The `model` value should come from `GET /v1/models`. It may be a sanitized Hugging Face repo plus quantization, or a custom alias if you set one in model config.
The FastAPI app exposes a small number of main route groups:
- `/api/models`: model library, downloads, config, start/stop, metadata
- `/api/llama-versions`: engine versions, build settings, source builds, CUDA actions
- `/api/lmdeploy`: LMDeploy install/remove/status/update checks
- `/api/status`: system status and proxy health
- `/api/gpu-info`: GPU and CPU capability information
- `/api/events`: Server-Sent Events for progress and notifications
- `/api/llama-swap`: stale/apply/pending proxy configuration endpoints
OpenAPI docs are available at `/docs`.
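The `/api/events` stream uses standard Server-Sent Events framing: `event:` and `data:` lines, with a blank line terminating each event. A minimal parser sketch, with made-up event names and payloads:

```python
# Parse SSE framing into (event, data) pairs.
def parse_sse(raw: str):
    """Yield (event, data) pairs from an SSE-formatted string."""
    event, data = "message", []
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":  # blank line terminates one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []

stream = (
    'event: download-progress\ndata: {"pct": 42}\n\n'
    "event: notification\ndata: done\n\n"
)
events = list(parse_sse(stream))
print(events[0][0])  # download-progress
```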
Most users only need a few environment variables:
| Variable | Purpose |
|---|---|
| `HUGGINGFACE_API_KEY` | Access gated Hugging Face models and authenticated downloads |
| `HF_HUB_ENABLE_HF_TRANSFER=1` | Enable faster Hugging Face transfer support when available |
| `HF_HOME` | Base Hugging Face cache directory |
| `HUGGINGFACE_HUB_CACHE` | Hugging Face hub cache directory |
| `CUDA_VISIBLE_DEVICES` | Limit or disable visible GPUs |
| `RELOAD` | Enable or disable backend auto-reload |
| `BACKEND_CORS_ORIGINS` | Comma-separated allowed origins |
| `BACKEND_CORS_ALLOW_CREDENTIALS` | Toggle credentialed CORS requests |
| `CPU_ONLY_MODE` | Force GPU detection into CPU-only mode |
Advanced / less common:
| Variable | Purpose |
|---|---|
| `LMDEPLOY_BIN` | Override the LMDeploy executable path used by the backend |
| `CMAKE` or `CMAKE_EXECUTABLE` | Override the CMake executable used for source builds |
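Boolean-ish toggles such as `RELOAD` and `CPU_ONLY_MODE` are typically interpreted loosely. This sketch shows one common convention; the backend's exact parsing may differ:

```python
# Interpret 1/true/yes (any case) as on; anything else as off.
import os

def env_flag(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes"}

os.environ["RELOAD"] = "true"
print(env_flag("RELOAD"))         # True
print(env_flag("CPU_ONLY_MODE"))  # False (unset -> default)
```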
The app needs write access to the mounted data directory. If the container logs complain about `/app/data` permissions, fix ownership on the host volume before continuing.
Check these in order:
- an engine version is installed and active
- the model was downloaded successfully
- pending `llama-swap` changes were applied
- `http://localhost:2000/health` is reachable
Saving model config updates the YAML store immediately, but the generated proxy config may still be stale. Use the UI's apply flow or call:
```bash
curl -X POST http://localhost:8080/api/llama-swap/apply-config
```

Use the Engines page action to rescan CLI parameters for the active engine. The backend builds the parameter registry from the runtime binary's `--help` output.