†Indicates Corresponding Author
- [2026-05-07]: 🚀 Codes released.
Abstract: Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
The data download scripts are in data.sh. Note that these datasets are very large, you may want to try downloading them once at a time if you have limited storage on your platform.
hf download JoeLeelyf/OVO-Bench --repo-type dataset --local-dir ./ovo
python unzip_ovo.py
hf download mjuicem/StreamingBench --repo-type dataset --local-dir ./StreamingBench
python unzip_streaming.py
hf download MCG-NJU/ODV-Bench --repo-type dataset --local-dir ./odv
python unzip_odv.pyWe use three different benchmarks in this project, you can refer to OVO-Bench, StreamingBench and ODV-Bench for further details.
We use uv to manage packages in this project, the scripts are in setup.sh. You can directly run the setup scripts or run following commands, if you want to do some modifications to the environment.
uv venv --python=python3.11
source .venv/bin/activate
uv pip install -e models/qwen2-5-vl
uv pip install -e models/qwen-vl-utils
uv pip install ffmpeg-python==0.2.0 moviepy==1.0.3 # for StreamingBench / OVO-Bench
uv pip install torch==2.8.0 torchvision==0.23.0
uv pip install transformers==4.49
uv pip install flash_attn --no-build-isolation
uv pip install decord==0.6.0The evaluation scripts for three different benchmarks are in evaluation folder, you can run the scripts in each folder to reproduce experimental results reported in our paper.
bash evaluation/ovobench/ovobench.sh
bash evaluation/streamingbench/streamingbench.sh
bash evaluation/odvbench/odvbench.shCore codes of our proposed SAVEMem is in savemem.py.
If you find our project useful, we hope you can star our repo and cite our paper as follows:
@article{wu2026semantic,
title={Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding},
author={Wu, Hang and Mathews, Sherin Mary and Cai, Yujun and Yang, Ming-Hsuan and Wang, Yiwei},
journal={arXiv preprint arXiv:2605.07897},
year={2026}
}
We sincerely appreciate the contributions of the open-source community. The related projects are as follows:
- Framework: FluxMem, Qwen2.5-VL
- Evaluation: OVO-Bench, StreamingBench and ODV-Bench
This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. This project is intended for academic and research purposes only. Any commercial use is strictly prohibited without prior written consent.



