KV handoff with DMA slicing APIs to avoid KV input/output copies. by quic-akuruvil · Pull Request #1039 · quic/efficient-transformers

quic-akuruvil · 2026-06-04T17:14:44Z

Problem

Without DMA slicing, the QPC in disaggregated serving expects the KV cache for every batch slot as input. For example, if decode runs at BS=32 and 4 slots become free, the QPC and LRT still expect KV caches for all 32 slots. DMA buffer slicing fixes this: the user slices the DMA buffer into N batch slots and writes the KV cache for each slot individually by indexing that specific slot.
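The per-slot write described above can be illustrated with plain numpy. This is only a host-side sketch under assumed shapes: `kv_buffer`, `slot_byte_range`, and `write_slot` are hypothetical names, and nothing here calls the QAIC runtime. It shows that each batch slot maps to a fixed (offset, size) byte range, so refilling freed slots does not require touching the other slots.

```python
from __future__ import annotations

import numpy as np

# Hypothetical shapes, chosen only for illustration.
BATCH, HEADS, CTX_LEN, HEAD_DIM = 32, 2, 16, 8

# One contiguous buffer holding the KV cache for every decode batch slot.
kv_buffer = np.zeros((BATCH, HEADS, CTX_LEN, HEAD_DIM), dtype=np.float16)


def slot_byte_range(buffer: np.ndarray, slot: int) -> tuple[int, int]:
    """Return (offset, size) in bytes of one batch slot in a C-contiguous buffer."""
    if not buffer.flags.c_contiguous:
        raise ValueError("buffer must be C-contiguous")
    if not 0 <= slot < buffer.shape[0]:
        raise IndexError(f"slot {slot} out of range")
    slot_bytes = buffer.strides[0]  # bytes between consecutive batch slots
    return slot * slot_bytes, slot_bytes


def write_slot(buffer: np.ndarray, slot: int, kv: np.ndarray) -> None:
    """Write the KV cache of a single request into its batch slot in place."""
    view = buffer[slot]  # basic indexing returns a view, not a copy
    if kv.shape != view.shape:
        raise ValueError(f"expected shape {view.shape}, got {kv.shape}")
    view[...] = kv


# Only the freed slots are refilled; the other 28 slots are left untouched.
freed_slots = [4, 9, 17, 30]
for s in freed_slots:
    new_kv = np.full((HEADS, CTX_LEN, HEAD_DIM), s, dtype=np.float16)
    write_slot(kv_buffer, s, new_kv)
```

In the PR itself the slicing is done on a DMA buffer through the runtime rather than on a numpy array; the offsets computed here only mirror the idea.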

Idea

Disaggregated serving pipeline on QAIC with zero-copy KV cache handoff.
The prefill-to-decode KV transfer goes through host shared memory.
Shared memory is used so that the KV cache is not copied when it moves from prefill to the host.
The prefill devices dump the KV cache into shared memory on the host, and the pointer to that shared memory is passed to the decode instance, which loads the KV cache directly from those host buffers.
This is useful in a disaggregated setting with any large KV footprint. Because DMA buffer slicing is used, KV no longer has to be passed as inputs between the prefill and decode sessions.
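As a rough host-side analogy for the shared-memory handoff, the sketch below uses Python's `multiprocessing.shared_memory` and numpy. The helper names `prefill_publish` and `decode_attach` and the shape constants are hypothetical, and the explicit array assignment stands in for the device writing into the host buffer. The point is only that the consumer attaches to the same memory by name and reads it through a view, without a second copy.

```python
from __future__ import annotations

from multiprocessing import shared_memory

import numpy as np

# Hypothetical KV layout for one request, for illustration only.
KV_SHAPE = (2, 16, 8)
KV_DTYPE = np.float16


def prefill_publish(kv: np.ndarray) -> shared_memory.SharedMemory:
    """Producer side: place the KV cache in a named shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=kv.nbytes)
    staged = np.ndarray(kv.shape, dtype=kv.dtype, buffer=shm.buf)
    # Stand-in for the prefill device dumping KV into the host buffer.
    staged[...] = kv
    return shm


def decode_attach(
    name: str, shape: tuple[int, ...], dtype: type
) -> tuple[shared_memory.SharedMemory, np.ndarray]:
    """Consumer side: attach by name and view the same bytes without copying."""
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    return shm, view
```

A caller should drop every numpy view before calling `close()` on the block, and the producer should call `unlink()` once the decode side is done with it.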

Optimization

Adds a new temporary QAICInferenceSession class (cloud_infer_kv_slice.py) that enables zero-copy KV-cache handoff between disaggregated prefill and decode sessions using shared DMA buffers and the QAICRT API setDataWithSlices(). On the last prefill chunk, KV outputs are wired directly into the decode session's input slots via a sliced DMA descriptor, which eliminates the Python/numpy copy at the prefill→decode boundary.

cluster_id="prefill" gives a pool of stages+1 slots for concurrent chunk pipelining; cluster_id="decode" gives a single fixed slot because decode is strictly sequential.
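The slot sizing rule can be sketched as a small pool. `SlotPool` is a hypothetical class written for illustration, not the class added in this PR; it only encodes the stated rule (stages+1 slots for prefill, one slot for decode) on top of a standard thread-safe queue.

```python
from __future__ import annotations

import queue


class SlotPool:
    """Hypothetical buffer-slot pool sized by cluster role."""

    def __init__(self, cluster_id: str, stages: int = 1) -> None:
        if cluster_id == "prefill":
            # One slot per pipeline stage plus one, so chunks can overlap.
            size = stages + 1
        elif cluster_id == "decode":
            # Decode is strictly sequential, so a single fixed slot suffices.
            size = 1
        else:
            raise ValueError(f"unknown cluster_id: {cluster_id!r}")
        self.size = size
        self._free: queue.Queue[int] = queue.Queue()
        for slot in range(size):
            self._free.put(slot)

    def acquire(self, timeout: float | None = None) -> int:
        """Take a free slot index; raises queue.Empty if the timeout expires."""
        return self._free.get(timeout=timeout)

    def release(self, slot: int) -> None:
        """Return a slot index to the pool."""
        self._free.put(slot)
```

With `stages=4`, a prefill pool built this way holds five slots, so a new chunk can be staged while four are still in flight.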

Sample Example Script

Also adds an end-to-end example (examples/disagg_serving/qwen3moe_disagg_mode_with_chunking_kvslice.py) demonstrating the full disaggregated serving flow for Qwen3-MoE with chunked prefill, PP (stages), TS, and DMA-sliced KV handoff.

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

quic-mohmeh and others added 5 commits June 4, 2026 22:24

Added MDP generation to QEff Compile

16833df

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Formatting and Linting

bc006dd

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Add compiler options - 'stages'

7a0d651

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Added support for layerwise export

8193f30

Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>

Added inference serving with DMA slicing for KV handoff

fe974d0

Signed-off-by: Ann <quic_akuruvil@quicinc.com>

quic-akuruvil requested review from anujgupt-github, quic-hemagnih, quic-rishinr and vbaddi June 4, 2026 17:16

quic-akuruvil assigned ochougul and quic-akuruvil and unassigned ochougul Jun 4, 2026

quic-akuruvil requested a review from ochougul June 4, 2026 17:20

quic-akuruvil wants to merge 5 commits into quic:release/v1.22.0_tmp from quic-akuruvil:dma_slice
