diff --git a/docs/inference/vllm.mdx b/docs/inference/vllm.mdx
index aab625d..026ffd9 100644
--- a/docs/inference/vllm.mdx
+++ b/docs/inference/vllm.mdx
@@ -17,11 +17,24 @@ vLLM offers significantly higher throughput than [Transformers](/docs/inference/
 
 ## Installation
 
-You need to install [`vLLM`](https://github.com/vllm-project/vllm) v0.14 or a more recent version:
+
+
+    Install [`vLLM`](https://github.com/vllm-project/vllm) v0.14 or a more recent version:
 
-```bash
-uv pip install vllm==0.14
-```
+    ```bash
+    uv pip install vllm==0.14
+    ```
+
+
+    vLLM provides a prebuilt Docker image that serves an OpenAI-compatible API:
+
+    ```bash
+    docker pull vllm/vllm-openai:latest
+    ```
+
+    This image requires NVIDIA GPU access. See the [OpenAI-Compatible Server](#openai-compatible-server) section below for the full `docker run` command.
+
+
 
 ## Basic Usage
 
@@ -108,19 +121,42 @@ for i, output in enumerate(outputs):
 
 ## OpenAI-Compatible Server
 
-vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries:
-
-```bash
-vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
-    --host 0.0.0.0 \
-    --port 8000 \
-    --dtype auto
-```
-
-Optional parameters:
-
-* `--max-model-len L`: Set maximum context length
-* `--gpu-memory-utilization 0.9`: Set GPU memory usage (0.0-1.0)
+vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries.
+
+
+
+    ```bash
+    vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
+        --host 0.0.0.0 \
+        --port 8000 \
+        --dtype auto
+    ```
+
+    Optional parameters:
+
+    * `--max-model-len L`: Set maximum context length
+    * `--gpu-memory-utilization 0.9`: Set GPU memory usage (0.0-1.0)
+
+
+    ```bash
+    docker run --runtime nvidia --gpus all \
+        -v ~/.cache/huggingface:/root/.cache/huggingface \
+        --env "HF_TOKEN=$HF_TOKEN" \
+        -p 8000:8000 \
+        --ipc=host \
+        vllm/vllm-openai:latest \
+        --model LiquidAI/LFM2.5-1.2B-Instruct
+    ```
+
+    Key flags:
+    * `--runtime nvidia --gpus all`: GPU access (required)
+    * `--ipc=host`: Shared memory for tensor parallelism
+    * `-v ~/.cache/huggingface:/root/.cache/huggingface`: Cache models on host
+    * `HF_TOKEN`: Set this env var if using gated models
+
+    **Note:** The Docker image does not include optional dependencies. If you need them, build a custom image from the [vLLM Dockerfile](https://docs.vllm.ai/en/stable/deployment/docker/).
+
+
 
 ### Chat Completions
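Reviewer note on exercising the server hunk above: whether started via `vllm serve` or the Docker container, the endpoint on port 8000 accepts standard OpenAI-style Chat Completions requests. A minimal sketch of building such a request body (the helper name, prompt, and the commented-out URL call are illustrative assumptions, not part of this diff):

```python
import json
import urllib.request


def chat_completions_body(prompt: str,
                          model: str = "LiquidAI/LFM2.5-1.2B-Instruct",
                          max_tokens: int = 128) -> bytes:
    # Standard OpenAI-style chat payload; vLLM serves it at /v1/chat/completions.
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload).encode("utf-8")


body = chat_completions_body("Summarize what vLLM does in one sentence.")

# With a server from the hunk above listening locally, the request would be:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

The same payload also works with the official `openai` client by pointing `base_url` at `http://localhost:8000/v1`.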