KV handoff with DMA slicing APIs to avoid KV input/output copies.#1039
Open
quic-akuruvil wants to merge 5 commits into
Signed-off-by: Mohit Mehta <mohmeh@qti.qualcomm.com>
Signed-off-by: Ann <quic_akuruvil@quicinc.com>
Problem
Without DMA slicing, in disaggregated serving the QPC expects the KV cache for every batch slot as input. For example, if decode runs with BS=32 and, say, 4 of those batch slots become free, the QPC and LRT would still expect KV caches for all 32 slots again. DMA buffer slicing fixes this: the user slices the DMA buffer into N per-batch regions and writes the KV cache for one batch slot at a time by indexing that specific slot.
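The per-slot indexing described above can be illustrated with a plain host buffer. This is only a sketch of the idea, assuming a fixed number of bytes per batch slot; `SLOT_BYTES`, `slot_view`, and `write_kv_for_slot` are hypothetical names, and nothing here calls the QAIC runtime or its DMA slicing API.

```python
# Conceptual sketch only: a bytearray stands in for the decode-side DMA buffer.
BATCH_SIZE = 32     # decode batch size (BS=32 in the description)
SLOT_BYTES = 1024   # hypothetical KV size per batch slot, in bytes

kv_buffer = bytearray(BATCH_SIZE * SLOT_BYTES)


def slot_view(buffer: bytearray, slot: int, slot_bytes: int = SLOT_BYTES) -> memoryview:
    """Return a zero-copy view of the region that belongs to one batch slot."""
    num_slots = len(buffer) // slot_bytes
    if not 0 <= slot < num_slots:
        raise IndexError(f"slot {slot} out of range for {num_slots} slots")
    start = slot * slot_bytes
    return memoryview(buffer)[start:start + slot_bytes]


def write_kv_for_slot(buffer: bytearray, slot: int, kv_bytes: bytes) -> None:
    """Write the KV payload for a single slot, leaving all other slots untouched."""
    view = slot_view(buffer, slot)
    if len(kv_bytes) != len(view):
        raise ValueError("KV payload does not match the slot size")
    view[:] = kv_bytes


# Only the freed slot is refilled; the other 31 slots are not rewritten.
write_kv_for_slot(kv_buffer, 4, b"\x01" * SLOT_BYTES)
```

Because `slot_view` returns a `memoryview` slice, the write lands directly in the backing buffer without an intermediate copy of the whole batch.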
Idea
Disaggregated serving pipeline on QAIC with zero‑copy KV cache handoff.
The prefill-to-decode KV transfer goes through the host (shared memory).
Shared memory is used so that the KV cache is not copied when it is transferred from prefill to the host.
The KV cache is dumped from the prefill devices to shared memory on the host, and a pointer to that shared memory is passed to the decode instance, which loads the KV cache directly from those host buffers.
This can be useful in the disaggregated setting for any large KV footprint. Because DMA buffer slicing is used, KV does not have to be passed as inputs between the prefill and decode sessions.
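The host-side handoff in the list above can be sketched with Python's standard `multiprocessing.shared_memory` module. This is an analogy under stated assumptions, not the PR's implementation: `prefill_dump_kv`, `decode_attach_kv`, and `KV_BYTES` are hypothetical, both sides run in one process for brevity, and no QAIC device or runtime call is involved.

```python
from multiprocessing import shared_memory

KV_BYTES = 4096  # hypothetical KV cache size in bytes


def prefill_dump_kv(kv_bytes: bytes) -> shared_memory.SharedMemory:
    """Prefill side: place the KV cache in a named shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=len(kv_bytes))
    # The block may be rounded up to a page size, so write only our payload.
    shm.buf[:len(kv_bytes)] = kv_bytes
    return shm


def decode_attach_kv(name: str) -> shared_memory.SharedMemory:
    """Decode side: attach to the same block by name instead of receiving a copy."""
    return shared_memory.SharedMemory(name=name)


producer = prefill_dump_kv(b"\x07" * KV_BYTES)
consumer = decode_attach_kv(producer.name)   # only the name (the "pointer") is passed

first_byte = consumer.buf[0]                 # decode reads what prefill wrote
consumer.buf[1] = 9                          # a write through one handle...
seen_by_producer = producer.buf[1]           # ...is visible through the other

# Release both handles, then remove the block from the system.
consumer.close()
producer.close()
producer.unlink()
```

Both handles map the same memory, which is why the write through `consumer` is visible through `producer` without any data being sent between them.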
Optimization
Adds a new temporary QAICInferenceSession class (cloud_infer_kv_slice.py) that enables zero-copy KV-cache handoff between disaggregated prefill and decode sessions using shared DMA buffers and the QAICRT API setDataWithSlices(). On the last prefill chunk, KV outputs are wired directly into the decode session's input slots via a sliced DMA descriptor, eliminating the Python/numpy copy at the prefill→decode boundary.
cluster_id="prefill" gives a pool of stages+1 slots for concurrent chunk pipelining; cluster_id="decode" gives a single fixed slot because decode is strictly sequential.
Sample Example Script
Also adds an end-to-end example (examples/disagg_serving/qwen3moe_disagg_mode_with_chunking_kvslice.py) demonstrating the full disaggregated serving flow for Qwen3-MoE with chunked prefill, PP (stages), TS, and DMA-sliced KV handoff.