mstar-project/mstar

M*

A universal serving system for composite, any-to-any multimodal models

Models are dataflow graphs  ·  requests are Walks  ·  one runtime serves them all

Quickstart  ·  Models  ·  How it works  ·  Docs  ·  Blog  ·  Paper



[Figure: M* performance across models and modalities]

One runtime that matches or beats state-of-the-art inference systems. Full methodology and current numbers in the blog post and paper.

What is M*?

M* (pronounced "M-star") is a serving system for the new generation of composite multimodal models — models built from structurally distinct components (vision encoders, transformer backbones, diffusion and flow heads, audio codecs, action generators, world-model predictors) whose execution path changes with the input and the task.

LLM serving stacks assume inference is a single autoregressive loop. Composite models broke that assumption. M*'s core idea is the Walk Graph: a model is a dataflow graph of its components, and every request is a Walk over that graph. A single runtime serves unified multimodal models, omni models, speech LMs, vision-language-action policies, and world models — at or above the performance of engines specialized for each.

Fast — per-component fast paths, matched to each component's bottleneck:

  • Paged attention (FlashInfer) and continuous batching for autoregressive backbones
  • CUDA-graph capture for encoders and decode
  • Classifier-free-guidance parallelism for diffusion / flow
  • Sliding-window chunk streaming for audio codecs
  • Component-level disaggregation with pluggable tensor transport (shared memory, TCP, RDMA)
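As an illustration of the sliding-window chunk streaming item above, here is a minimal sketch of how a stream of codec tokens could be cut into overlapping windows. The function name `sliding_chunks` and its `chunk` and `context` parameters are hypothetical and are not taken from the M* codebase; the real implementation may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def sliding_chunks(
    tokens: Sequence[int], chunk: int, context: int
) -> Iterator[Tuple[List[int], int]]:
    """Yield (window, n_new) pairs over a token sequence.

    Each window holds up to `context` previously emitted tokens followed by
    up to `chunk` new tokens. A codec decoder would run on the whole window
    but keep only the output that corresponds to the last `n_new` tokens.
    (Hypothetical helper for illustration, not an M* API.)
    """
    if chunk <= 0 or context < 0:
        raise ValueError("chunk must be positive and context non-negative")
    for start in range(0, len(tokens), chunk):
        end = min(start + chunk, len(tokens))
        left = max(0, start - context)  # reach back for left context
        yield list(tokens[left:end]), end - start


# Ten tokens, four new tokens per chunk, two tokens of left context.
windows = list(sliding_chunks(list(range(10)), chunk=4, context=2))
for window, n_new in windows:
    print(window, n_new)
```

The overlap lets each chunk be decoded as soon as it arrives while still giving the decoder some history, which is what makes streaming output possible without waiting for the full sequence.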

Flexible — the abstraction mirrors the model:

  • One small Python file per model declares its component graph and its Walks
  • A YAML file maps components to GPUs at per-component, per-walk granularity — arbitrary disaggregation, no code changes
  • Text, image, audio, video, and robot actions, in and out
  • A Python SDK, an OpenAI-compatible API, and a native streaming endpoint
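To make the placement idea concrete, the sketch below uses a plain Python dict as a stand-in for the YAML mapping and finds the graph edges that would need tensor transport between GPUs. The component names, the dict layout, and `cross_device_edges` are all assumptions for illustration; the actual YAML schema is described in the project docs, and per-walk placement is not modeled here.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

# Hypothetical component graph for a unified image + text model.
EDGES: List[Edge] = [
    ("vision_encoder", "backbone"),
    ("backbone", "diffusion_head"),
]


def cross_device_edges(edges: List[Edge], placement: Dict[str, int]) -> List[Edge]:
    """Return the edges whose endpoints are placed on different GPUs."""
    missing = {name for edge in edges for name in edge} - set(placement)
    if missing:
        raise KeyError(f"no placement for components: {sorted(missing)}")
    return [(src, dst) for src, dst in edges if placement[src] != placement[dst]]


# Same graph, two placements: only the mapping changes, not the model code.
single_gpu = {"vision_encoder": 0, "backbone": 0, "diffusion_head": 0}
disaggregated = {"vision_encoder": 0, "backbone": 1, "diffusion_head": 2}

print(cross_device_edges(EDGES, single_gpu))      # no transport needed
print(cross_device_edges(EDGES, disaggregated))   # every edge crosses GPUs
```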

Roadmap. M* is evolving toward many-model, agentic multimodal serving — routing requests across many models and tools within one graph-scheduled runtime.

Quickstart

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --torch-backend=auto -e .[all]      # install M*
mstar serve bagel          # one command — launch a server (default: http://localhost:8000)

To enable flash-attn support (required for Qwen3-Omni, recommended for BAGEL), install the prebuilt wheel that matches the CUDA version your torch was built for:

# torch built for CUDA 12.x (cu12)
uv pip install \
  "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"

# torch built for CUDA 13.x (cu13)
uv pip install \
  "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu13torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
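If you are unsure which wheel applies, a small helper along these lines can pick the URL from the CUDA version torch reports. The helper name `flash_attn_wheel` is hypothetical; the two URLs are the ones listed above, and their filenames also pin torch 2.9, CPython 3.12, and linux_x86_64, which this sketch does not check.

```python
from typing import Optional

_BASE = "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/"

# CUDA major version -> wheel URL (the two URLs listed above).
FLASH_ATTN_WHEELS = {
    "12": _BASE + "flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl",
    "13": _BASE + "flash_attn-2.8.3+cu13torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl",
}


def flash_attn_wheel(cuda_version: Optional[str]) -> str:
    """Map a CUDA version string such as "12.8" to the matching wheel URL."""
    if not cuda_version:
        raise ValueError("torch reports no CUDA version (CPU-only build?)")
    major = cuda_version.split(".")[0]
    if major not in FLASH_ATTN_WHEELS:
        raise ValueError(f"no prebuilt wheel listed for CUDA {cuda_version}")
    return FLASH_ATTN_WHEELS[major]


# With torch installed, pass the version it was built against:
#   import torch
#   print(flash_attn_wheel(torch.version.cuda))
print(flash_attn_wheel("12.8"))
```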

Other models: mstar serve qwen3_omni · mstar serve orpheus · mstar serve pi05 · mstar serve vjepa2

Python SDK — works for every model (text, image, audio, video):

from mstar import MStarClient
client = MStarClient("http://localhost:8000")

client.chat("What is the capital of France?").text          # text
client.generate_image("a cat in a hat")                     # → PNG bytes   (BAGEL)
client.tts("Hello there", voice="tara").to_wav("out.wav")   # → speech      (Orpheus)

for event in client.chat("Tell me a story", stream=True):   # streaming
    print(getattr(event, "text", ""), end="", flush=True)

OpenAI-compatible API — drop-in for bagel, qwen3_omni, and orpheus:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

client.chat.completions.create(model="bagel", messages=[{"role": "user", "content": "hi"}])
client.audio.speech.create(model="orpheus", input="hi", voice="tara")   # text-to-speech
client.images.generate(model="bagel", prompt="a cat")                   # image generation

Runnable scripts and curl examples live in examples/. Power users can launch any deployment with an explicit config: mstar-serve --config configs/<model>.yaml.

Note: The first request(s) on a fresh environment can be slow, often tens of seconds to a few minutes. M* compiles the model with torch.compile on first use, and that compilation happens lazily, on the first request that exercises each path.
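Because compilation is lazy per path, one way to keep that latency away from real traffic is to send a throwaway request through each path right after the server starts. The sketch below is a generic pattern, not an M* feature: `warm_up` is a hypothetical helper that takes one zero-argument callable per path and reports how long each first call took.

```python
import time
from typing import Callable, Dict


def warm_up(paths: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Call each request function once; return seconds taken per path."""
    timings: Dict[str, float] = {}
    for name, send in paths.items():
        start = time.perf_counter()
        send()  # the first call on a path pays the compile cost
        timings[name] = time.perf_counter() - start
    return timings


# Against a running server, the callables would wrap the SDK calls shown above:
#   from mstar import MStarClient
#   client = MStarClient("http://localhost:8000")
#   warm_up({
#       "chat": lambda: client.chat("hi"),
#       "image": lambda: client.generate_image("a cat in a hat"),
#   })

# Stand-in callables so the sketch runs without a server.
timings = warm_up({"chat": lambda: "ok", "image": lambda: b""})
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```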

Supported models

Model            Family                  Input → Output                            Endpoints
BAGEL            Unified multimodal      text, image → text, image                 /v1/chat/completions, /v1/images/generations
Qwen3-Omni       Omni                    text, image, audio, video → text, speech  /v1/chat/completions
Orpheus          Speech LM               text → speech                             /v1/audio/speech
Pi0.5            Vision-language-action  text, image, state → robot actions        /generate
V-JEPA 2 / 2-AC  World model             video (+ actions) → latents, rollouts     /generate

Every model is reachable through the SDK and the native /generate endpoint; the OpenAI-compatible routes cover the chat, speech, and image models.

How it works

HTTP / SDK  →  API Server  →  Conductor  →  Workers (one per GPU)  →  streaming results
                                  │              │
                          walks the graph,   own subgraphs; route tensors
                          schedules walks    directly to one another

A model declares a computation graph of components and a set of named Walks (e.g. prefill, decode, image_gen). The Conductor turns each request into a walk over that graph and schedules it; Workers each own a subgraph on their GPU and stream tensors directly to one another. Logical graph structure is decoupled from physical placement, so the same model runs single-GPU or fully disaggregated by changing only the YAML node_groups. Four composable primitives — Sequential, Parallel, Loop, and a cross-partition StreamingGraphEdge — express every model family above. See the paper for the full design.
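The following toy sketch shows how Walks built from such primitives could select different components of one graph. The class definitions, field names, and the `trace` function are assumptions made for illustration; they are not the M* API, the cross-partition StreamingGraphEdge is not modeled, and a fixed iteration count stands in for a real stopping condition.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Component:
    name: str


@dataclass(frozen=True)
class Sequential:
    steps: Tuple["Node", ...]


@dataclass(frozen=True)
class Parallel:
    branches: Tuple["Node", ...]


@dataclass(frozen=True)
class Loop:
    body: "Node"
    times: int  # simplification: real loops would stop on a condition


Node = Union[Component, Sequential, Parallel, Loop]


def trace(node: Node) -> List[str]:
    """List the components a walk executes, in declaration order."""
    if isinstance(node, Component):
        return [node.name]
    if isinstance(node, Sequential):
        return [name for step in node.steps for name in trace(step)]
    if isinstance(node, Parallel):
        # Branches are listed in order here; a runtime could run them concurrently.
        return [name for branch in node.branches for name in trace(branch)]
    if isinstance(node, Loop):
        return trace(node.body) * node.times
    raise TypeError(f"unknown node type: {type(node).__name__}")


vit = Component("vision_encoder")
llm = Component("backbone")
head = Component("diffusion_head")

# Hypothetical walks, reusing the walk names mentioned above.
WALKS = {
    "prefill": Sequential((vit, llm)),
    "decode": Loop(llm, times=3),
    "image_gen": Sequential((llm, Loop(head, times=2))),
}

for walk_name, walk in WALKS.items():
    print(walk_name, trace(walk))
```

In this toy model the `decode` walk never touches the vision encoder or the diffusion head, which mirrors the idea that a request only runs the components its walk names.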

Performance

Across every model we benchmark, M* matches or beats the system specialized for that family: unified models (BAGEL), omni and speech models (Qwen3-Omni, Orpheus), and world models (V-JEPA 2). It does so by executing only the components each request needs and giving each component its own fast path: paged attention and continuous batching for autoregressive backbones, classifier-free-guidance parallelism for diffusion, chunk streaming for audio codecs, and persistent-cache loops for world-model rollouts.
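For readers unfamiliar with classifier-free guidance, the sketch below shows the standard combination rule, guided = uncond + scale * (cond - uncond), with both branches evaluated in a single batched call rather than two sequential ones. This is only one possible reading of "parallelism"; M* might instead place the two branches on separate GPUs, which is not stated here. The `cfg_step` and `toy_denoiser` names are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Denoiser = Callable[[Sequence[Vector], Sequence[Optional[str]]], List[Vector]]


def cfg_step(denoise: Denoiser, latent: Vector, prompt: str, scale: float) -> Vector:
    """One guided step: run both branches as one batch, then combine them."""
    # Batch of two: the conditional branch (with prompt) and the
    # unconditional branch (prompt dropped), evaluated in a single call.
    cond, uncond = denoise([latent, latent], [prompt, None])
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def toy_denoiser(
    latents: Sequence[Vector], prompts: Sequence[Optional[str]]
) -> List[Vector]:
    """Stand-in model: shifts the latent by 1.0 when a prompt is present."""
    return [
        [x + (1.0 if prompt is not None else 0.0) for x in latent]
        for latent, prompt in zip(latents, prompts)
    ]


guided = cfg_step(toy_denoiser, [0.0, 2.0], "a cat", scale=3.0)
print(guided)
```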

Benchmark numbers shift as systems evolve — ours and everyone else's — so rather than freeze figures here that go stale, we keep the current results and full methodology in the blog post and the paper.

Contributing

Issues and pull requests are welcome. Found a bug, or want a model or feature supported? Open an issue. To add a model yourself, follow the Adding a New Model guide. PRs to main go through review and CI (ruff).

Citation

If you use M* in your research, please cite:

@article{mstar2026,
  title  = {M*: A Modular, Extensible, Serving System for Multimodal Models},
  author = {Jha, Atindra and Sagan, Naomi and Kamahori, Keisuke and Sivgin, Irmak and
            Sanda, Rohan and Gao, Steven and Horowitz, Mark and Zettlemoyer, Luke and
            Hsu, Olivia and Leskovec, Jure and Kasikci, Baris and Wang, Stephanie},
  year   = {2026},
  eprint = {2606.12688},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}

From Stanford University & the University of Washington. Correspondence: atindra@cs.stanford.edu.

Acknowledgments

M* builds on ideas and proven primitives from the open-source community — paged attention and continuous batching (vLLM), FlashInfer kernels, streaming speech serving (VoxServe), and RDMA tensor transport (Mooncake).

License

Apache License 2.0.