70 changes: 53 additions & 17 deletions docs/inference/vllm.mdx
@@ -1,6 +1,6 @@
---
title: "vLLM"
description: "vLLM is a high-throughput and memory-efficient inference engine for LLMs. It supports efficient serving with PagedAttention, continuous batching, and optimized CUDA kernels."

---

<Tip>
@@ -17,11 +17,24 @@

## Installation

<Tabs>
  <Tab title="pip">
    Install [`vLLM`](https://github.com/vllm-project/vllm) v0.14 or a more recent version:

    ```bash
    uv pip install "vllm>=0.14"
    ```
  </Tab>
  <Tab title="Docker">
    vLLM provides a prebuilt Docker image that serves an OpenAI-compatible API:

    ```bash
    docker pull vllm/vllm-openai:latest
    ```

    This image requires NVIDIA GPU access. See the [OpenAI-Compatible Server](#openai-compatible-server) section below for the full `docker run` command.
  </Tab>
</Tabs>

## Basic Usage

@@ -52,7 +65,7 @@
Control text generation behavior using [`SamplingParams`](https://docs.vllm.ai/en/stable/). Key parameters:

* **`temperature`** (`float`, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
* **`top_p`** (`float`, default 1.0): Nucleus sampling - keeps the smallest set of top tokens whose cumulative probability reaches `top_p`. Typical range: 0.1-1.0
* **`top_k`** (`int`, default -1): Limits to top-k most probable tokens (-1 = disabled). Typical range: 1-100
* **`min_p`** (`float`): Minimum token probability threshold. Typical range: 0.01-0.2
* **`max_tokens`** (`int`): Maximum number of tokens to generate
@@ -108,19 +121,42 @@

## OpenAI-Compatible Server

vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries.

<Tabs>
<Tab title="vllm serve">
```bash
vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype auto
```

Optional parameters:

* `--max-model-len`: Maximum context length in tokens (defaults to the model's maximum)
* `--gpu-memory-utilization`: Fraction of GPU memory vLLM preallocates, between 0.0 and 1.0 (default 0.9)
</Tab>
<Tab title="Docker">
```bash
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model LiquidAI/LFM2.5-1.2B-Instruct
```

Key flags:
* `--runtime nvidia --gpus all`: GPU access (required)
* `--ipc=host`: Shared memory for tensor parallelism
* `-v ~/.cache/huggingface:/root/.cache/huggingface`: Cache models on host
* `HF_TOKEN`: Set this env var if using gated models

**Note:** The Docker image does not include optional dependencies. If you need them, build a custom image from the [vLLM Dockerfile](https://docs.vllm.ai/en/stable/deployment/docker/).
</Tab>
</Tabs>
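
For long-running deployments, the `docker run` flags above translate naturally to Docker Compose. The following is a sketch, not an official vLLM-provided file; the service name is arbitrary, and the `deploy` GPU reservation syntax assumes Compose v2 with the NVIDIA Container Toolkit installed:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host
    ports:
      - "8000:8000"
    environment:
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: --model LiquidAI/LFM2.5-1.2B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with `docker compose up -d`; the server then listens on port 8000 just like the `docker run` invocation.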

### Chat Completions

@@ -185,7 +221,7 @@

### Installation for Vision Models

To use LFM Vision Models with vLLM, install the precompiled wheel along with the required transformers version:


```bash
VLLM_PRECOMPILED_WHEEL_COMMIT=72506c98349d6bcd32b4e33eec7b5513453c1502 VLLM_USE_PRECOMPILED=1 uv pip install git+https://github.com/vllm-project/vllm.git
```