Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding

Hang Wu¹ Sherin Mary Mathews² Yujun Cai³†
Ming-Hsuan Yang¹ Yiwei Wang¹

¹University of California, Merced ²US Bank ³The University of Queensland
†Indicates corresponding author

🔥 Update

[2026-05-07]: 🚀 Code released.

🎯 Overview

Abstract: Online streaming video understanding requires models to process continuous visual inputs and respond to user queries in real time, where the unbounded stream and unpredictable query timing turn memory management into a central challenge. Existing methods typically compress visual tokens via visual similarity heuristics, or augment compression with KV-cache-level retrieval. However, compression decisions rarely incorporate semantic signals, and retrieval is often added after compression is finalized, making the two stages hard to coordinate. We present SAVEMem, a training-free dual-stage framework that brings semantic awareness into memory generation and lets the retrieval scope adapt per query. In Stage 1, SAVEMem builds a three-tier streaming memory online under a constant memory budget. A fixed pseudo-question bank provides a lightweight semantic prior, so that long-term retention is shaped by semantic salience rather than visual similarity alone. In Stage 2, SAVEMem performs query-aware retrieval over this memory. An anchor-conditioned recency gate adapts the retrieval scope from short-term to mid- and long-term memory based on whether the query targets the present or the distant past. Within this scope, late interaction between query and memory tokens selects candidate frames for answering. Applied to Qwen2.5-VL without training, SAVEMem improves the OVO-Bench overall score from 52.27 to 62.69 and yields consistent gains on StreamingBench and ODV-Bench, while reducing peak GPU memory by 48% at 128 frames over the backbone.
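To make the late-interaction step in Stage 2 more concrete, here is a minimal sketch, assuming a ColBERT-style MaxSim score: each query token takes its best cosine match among a frame's tokens, and those maxima are summed. This is a generic illustration and not the code in savemem.py. The function names, the dictionary layout of the memory, and the toy 2-D embeddings are all hypothetical, and the recency gate that narrows the retrieval scope beforehand is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim_score(query_tokens, frame_tokens):
    """Late-interaction score: every query token keeps its best-matching
    frame token, and the per-token maxima are summed.
    Assumes frame_tokens is non-empty."""
    return sum(max(cosine(q, t) for t in frame_tokens) for q in query_tokens)


def select_frames(query_tokens, memory, top_k):
    """Return the ids of the top_k frames ranked by MaxSim score.
    `memory` maps a (hypothetical) frame id to its list of token embeddings."""
    ranked = sorted(
        memory,
        key=lambda frame_id: maxsim_score(query_tokens, memory[frame_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy 2-D embeddings, purely illustrative.
query = [[1.0, 0.0], [0.0, 1.0]]
memory = {
    "frame_a": [[1.0, 0.0], [0.0, 1.0]],    # matches both query tokens
    "frame_b": [[1.0, 0.0], [1.0, 0.0]],    # matches only the first query token
    "frame_c": [[-1.0, 0.0], [0.0, -1.0]],  # matches neither
}
print(select_frames(query, memory, top_k=2))  # prints ['frame_a', 'frame_b']
```

Scoring at the token level rather than on a single pooled vector per frame lets a frame rank highly when only a small part of it is relevant to the query.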

⚙️ Framework

🕹️ Usage

Data Preparation

The data download scripts are in data.sh. These datasets are very large, so if storage on your platform is limited, you may want to download them one at a time.

hf download JoeLeelyf/OVO-Bench --repo-type dataset --local-dir ./ovo
python unzip_ovo.py

hf download mjuicem/StreamingBench --repo-type dataset --local-dir ./StreamingBench
python unzip_streaming.py

hf download MCG-NJU/ODV-Bench --repo-type dataset --local-dir ./odv
python unzip_odv.py

We use three benchmarks in this project; refer to OVO-Bench, StreamingBench, and ODV-Bench for further details.
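If storage is tight, the download commands above can be driven one benchmark at a time. The following is a minimal sketch of that idea, not part of the repository: the helper names `commands_for` and `prepare` and the `dry_run` flag are hypothetical, while the repo ids, local directories, and unzip scripts are copied from the commands above.

```python
import subprocess

# (dataset repo id, local directory, unzip script), copied from the commands above.
BENCHMARKS = [
    ("JoeLeelyf/OVO-Bench", "./ovo", "unzip_ovo.py"),
    ("mjuicem/StreamingBench", "./StreamingBench", "unzip_streaming.py"),
    ("MCG-NJU/ODV-Bench", "./odv", "unzip_odv.py"),
]


def commands_for(repo_id, local_dir, unzip_script):
    """Build the download and unzip commands for one benchmark."""
    return [
        ["hf", "download", repo_id, "--repo-type", "dataset", "--local-dir", local_dir],
        ["python", unzip_script],
    ]


def prepare(benchmarks, dry_run=True):
    """Handle benchmarks sequentially so only one download is in flight.
    With dry_run=True the commands are printed but not executed."""
    planned = []
    for repo_id, local_dir, unzip_script in benchmarks:
        for cmd in commands_for(repo_id, local_dir, unzip_script):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)  # raise on the first failing command
            planned.append(cmd)
    return planned


# Dry run for OVO-Bench only; pass dry_run=False to actually download and unzip.
prepare(BENCHMARKS[:1])
```

Passing a single-element slice, as in the last line, lets you finish and evaluate one benchmark before fetching the next.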

Environment Setup

We use uv to manage packages in this project; the setup commands are collected in setup.sh. You can run that script directly, or run the following commands individually if you want to modify the environment.

uv venv --python=python3.11
source .venv/bin/activate

uv pip install -e models/qwen2-5-vl
uv pip install -e models/qwen-vl-utils

uv pip install ffmpeg-python==0.2.0 moviepy==1.0.3   # for StreamingBench / OVO-Bench

uv pip install torch==2.8.0 torchvision==0.23.0
uv pip install transformers==4.49
uv pip install flash_attn --no-build-isolation

uv pip install decord==0.6.0
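After installation, it can help to confirm that the pinned versions are the ones actually present in the virtual environment. The snippet below is a minimal sketch using only the standard library; it is not part of setup.sh, and the helper names (`matches`, `installed_version`, `find_mismatches`) are hypothetical. The pins are copied from the commands above. flash_attn is left out because no version is pinned for it.

```python
from importlib import metadata

# Version pins copied from the install commands above.
PINNED = {
    "ffmpeg-python": "0.2.0",
    "moviepy": "1.0.3",
    "torch": "2.8.0",
    "torchvision": "0.23.0",
    "transformers": "4.49",
    "decord": "0.6.0",
}


def matches(found, expected):
    """Compare versions, ignoring a local build label such as '+cu128'
    and trailing '.0' components (so '4.49.0' matches the pin '4.49')."""
    def release(version):
        parts = version.split("+", 1)[0].split(".")
        while len(parts) > 1 and parts[-1] == "0":
            parts.pop()
        return parts
    return release(found) == release(expected)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for missing or mismatched packages."""
    problems = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found is None or not matches(found, expected):
            problems[package] = (expected, found)
    return problems


# An empty dict means every pinned package is installed at the expected version.
print(find_mismatches(PINNED))
```

The `lookup` parameter exists only so the comparison logic can be exercised without the packages installed.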

Evaluation

The evaluation scripts for the three benchmarks are in the evaluation folder. Run the script in each subfolder to reproduce the experimental results reported in our paper.

bash evaluation/ovobench/ovobench.sh
bash evaluation/streamingbench/streamingbench.sh
bash evaluation/odvbench/odvbench.sh

The core code of our proposed SAVEMem is in savemem.py.

📊 Experimental Results

📑 Citation

If you find our project useful, please consider starring our repo and citing our paper:

@article{wu2026semantic,
  title={Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding},
  author={Wu, Hang and Mathews, Sherin Mary and Cai, Yujun and Yang, Ming-Hsuan and Wang, Yiwei},
  journal={arXiv preprint arXiv:2605.07897},
  year={2026}
}

📝 Acknowledgements

We sincerely appreciate the contributions of the open-source community. The related projects are as follows:

Framework: FluxMem, Qwen2.5-VL
Evaluation: OVO-Bench, StreamingBench and ODV-Bench

License

This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. This project is intended for academic and research purposes only. Any commercial use is strictly prohibited without prior written consent.

