Multi-Stream Streaming Understanding Benchmark
While video streaming understanding has made significant strides, real-world applications such as live sports broadcasting, autonomous driving, and multi-screen collaboration inherently demand continuous, multi-stream interactions. Existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning.
We introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. It comprises 4,220 rigorously curated QA pairs across 932 videos and evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios.
X-Stream uses a dual-verification pipeline to prevent over-reliance on a single stream. We further conceptualize MLLMs as naive multiplexers and evaluate spatial, time, and semantic division strategies under a fixed token budget. Current state-of-the-art MLLMs still struggle with concurrent streams: the best model reaches an overall score of only about 50%, and proactive ability is poor across the board.
X-Stream evaluates online understanding over concurrent streams from multi-window, multi-view, and multi-device settings.
Sufficiency and necessity verification filter pseudo multi-stream questions and retain samples requiring joint evidence.
The benchmark exposes trade-offs among spatial, time, and semantic division multiplexing for future multi-stream agents.
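The sufficiency and necessity checks mentioned above can be sketched as a simple filter. This is a minimal illustration, not the benchmark's actual pipeline: the `answerable` callback, the function names, and the reading of "necessity" as "no single stream is enough on its own" are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical oracle: returns True if the question can be answered correctly
# when only the listed streams are visible (e.g. a model or human check).
Answerable = Callable[[str, Sequence[str]], bool]


def requires_joint_evidence(
    question: str, streams: Sequence[str], answerable: Answerable
) -> bool:
    """Keep a question only if it truly needs more than one stream."""
    # Sufficiency (assumed definition): all streams together support the answer.
    if not answerable(question, streams):
        return False
    # Necessity (assumed definition): no single stream is enough by itself.
    return not any(answerable(question, [stream]) for stream in streams)


def filter_questions(
    samples: Sequence[Tuple[str, Sequence[str]]], answerable: Answerable
) -> List[Tuple[str, Sequence[str]]]:
    """Drop pseudo multi-stream questions from (question, streams) pairs."""
    return [
        (question, streams)
        for question, streams in samples
        if requires_joint_evidence(question, streams, answerable)
    ]
```

Under these assumptions, a question answerable from one camera alone is discarded even if the other streams are also shown, which is the "pseudo multi-stream" case the filter is meant to catch.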
X-Stream is built from 857 hours of raw multi-stream data drawn from more than 20 sources. After preprocessing and pairing, 160 hours are retained across 451 takes, each containing 2 to 5 streams.
| Tasks | Dataset | QA | Videos | Videos / Take | Duration (h) | Cross Stream | Open Ended | Streaming | Proactive |
|---|---|---|---|---|---|---|---|---|---|
| Multi-View and Multi-Video | EgoLife-Eval | 0.3K | 1 | 6 | 20 | No | No | No | No |
| Multi-View and Multi-Video | ProMQA-Assembly | 0.4K | 0.2K | 2 | 7 | No | Yes | No | No |
| Multi-View and Multi-Video | WaymoQA | 6.4K | ~1K | 2 | ~2 | Yes | Yes | No | No |
| Multi-View and Multi-Video | MVU-Bench | 1.8K | 5K | 3-5 | 15 | No | No | No | No |
| Multi-View and Multi-Video | VidDiff | 4.5K | 0.5K | 2 | 3 | No | Yes | No | No |
| Streaming | OVO-Bench | 2.8K | 0.6K | 1 | 85 | No | No | Yes | Yes |
| Streaming | StreamingBench | 4.5K | 0.9K | 1 | 136 | No | No | Yes | Yes |
| Streaming | Inf-Streams-Eval | 2.5K | 0.5K | 1 | 42 | No | Yes | Yes | No |
| Streaming | LiveSports | 1.2K | 0.8K | 1 | 40 | No | Yes | Yes | Yes |
| Streaming | ProactiveVideoQA | 1.4K | 1.4K | 1 | 49 | No | Yes | Yes | Yes |
| Streaming | OmniMMI | 2.3K | 1.1K | 1 | 100 | No | Yes | Yes | Yes |
| Streaming | MMDuet | 2.0K | 2.0K | 1 | 100 | No | Yes | Yes | Yes |
| Streaming | ESTP-Bench | 2.3K | 1.2K | 1 | 80 | No | Yes | Yes | Yes |
| Streaming | PhoStream | 5.6K | 0.6K | 1 | 92 | No | Yes | Yes | Yes |
| Multi-Stream | X-Stream (Ours) | 4.2K | 0.9K | 2-5 | 160 | Yes | Yes | Yes | Yes |
Since MLLMs can process only one token stream at a time, X-Stream studies how to integrate multiple video streams under a fixed average video token rate of C_max = 250 tokens per video second.
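To make the fixed budget concrete, the sketch below compares how spatial and time division might share 250 tokens per second among N streams. The equal split, the round-robin schedule, and all names (`StreamShare`, `spatial_division`, `time_division`) are assumptions for illustration; the benchmark's own allocation may differ. Semantic division is omitted because it depends on stream content.

```python
from dataclasses import dataclass

C_MAX = 250  # tokens per video second, the budget stated in the text


@dataclass(frozen=True)
class StreamShare:
    tokens_per_observed_second: float  # detail available while a stream is visible
    observed_fraction: float           # share of wall-clock time the stream is visible

    @property
    def average_rate(self) -> float:
        """Average tokens per second this stream contributes to the budget."""
        return self.tokens_per_observed_second * self.observed_fraction


def spatial_division(num_streams: int, c_max: float = C_MAX) -> StreamShare:
    """Assumed equal split: every stream is always visible at reduced detail."""
    return StreamShare(c_max / num_streams, 1.0)


def time_division(num_streams: int, c_max: float = C_MAX) -> StreamShare:
    """Assumed round-robin: one stream at a time gets the whole budget."""
    return StreamShare(float(c_max), 1.0 / num_streams)
```

With five streams, this spatial split leaves each stream 50 tokens per second at all times, while the time split gives a stream all 250 tokens but only one second in five. The average rate per stream is the same in both cases, so the trade-off is between per-frame detail and gaps in coverage.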
X-Stream uses online inference and LLM-as-a-Judge evaluation. The main comparison uses spatial division multiplexing and reports instant, backward, forward, comprehensive, timing, and multi-stream ability scores.
| Model | Overall | Instant | Backward | Forward | Compre. | ER ↓ | NR ↓ | Single Stream | Multi Coop. | Cross Ref. | Cross Inter. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human (video, audio) | 91.84 | 91.73 | 95.19 | 85.10 | 97.50 | 9.50 | 2.55 | 94.12 | 92.05 | 90.10 | 98.55 |
| Proprietary Multimodal Models | | | | | | | | | | | |
| Gemini 3 Pro (video, audio) | 49.60 | 73.38 | 72.23 | 20.77 | 82.04 | 73.13 | 0.23 | 72.45 | 71.16 | 74.79 | 66.96 |
| GPT-5 (video) | 27.78 | 44.28 | 37.18 | 6.51 | 59.83 | 81.73 | 1.14 | 39.08 | 44.12 | 52.75 | 45.65 |
| GPT-4o (video) | 22.46 | 37.28 | 32.72 | 4.05 | 47.01 | 87.14 | 0.74 | 34.83 | 34.90 | 43.77 | 37.52 |
| Doubao-Seed-1.8 (video) | 36.79 | 55.49 | 57.18 | 14.52 | 59.13 | 66.19 | 3.95 | 47.55 | 35.69 | 56.52 | 60.82 |
| Open-source Multimodal Models | | | | | | | | | | | |
| Qwen2.5-VL-7B (video) | 25.49 | 40.02 | 36.02 | 8.34 | 45.28 | 68.10 | 11.36 | 43.80 | 41.43 | 42.72 | 40.01 |
| Qwen2.5-Omni-7B (video, audio) | 26.82 | 41.96 | 41.17 | 9.03 | 45.04 | 53.19 | 22.51 | 38.60 | 40.80 | 41.86 | 44.35 |
| Qwen3-VL-8B (video) | 26.78 | 43.41 | 33.30 | 7.53 | 51.01 | 78.40 | 6.50 | 49.88 | 43.41 | 33.30 | 51.01 |
| Qwen3-Omni-30B-A3B (video, audio) | 34.28 | 63.92 | 53.40 | 0.61 | 69.16 | 98.81 | 0.27 | 63.41 | 55.68 | 66.08 | 56.58 |
| Qwen3-VL-30B-A3B (video) | 34.19 | 52.09 | 38.54 | 14.46 | 57.26 | 73.91 | 1.18 | 54.68 | 57.90 | 65.98 | 65.22 |
| Open-source Streaming Models | | | | | | | | | | | |
| Dispider (video) | 15.44 | 21.71 | 19.29 | 8.09 | 23.90 | 55.63 | 7.26 | 38.01 | 21.97 | 23.65 | 31.37 |
| VideoLLM-online-8B (video) | 8.48 | 15.00 | 15.53 | 0.03 | 17.67 | 99.10 | 0.66 | 13.15 | 16.90 | 10.15 | 6.70 |
| MMDuet2 (video) | 6.79 | 11.76 | 10.37 | 1.44 | 11.27 | 31.49 | 54.11 | 15.84 | 14.44 | 9.16 | 4.96 |
The strongest model, Gemini 3 Pro, reaches 49.60 overall, compared with 91.84 for human experts.
Proactive timing is difficult: even Gemini 3 Pro reaches only 20.77 on forward questions, and every other model scores below 15.
Spatial, time, and semantic division differ in detail preservation, latency, and high stream-count scalability.
| Model | Visual Grd. | Audio Grd. | Temporal Grd. | Object Count. | Saliency Detect. | 3D Spa. | Counterfactual | Causal | Common Sense | Anomaly | Decision |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro | 66.72 | 64.82 | 68.93 | 76.37 | 63.61 | 70.82 | 75.00 | 41.79 | 69.35 | 70.52 | 44.18 |
| GPT-5 | 36.68 | 21.63* | 36.64 | 42.52 | 53.55 | 52.28 | 15.00 | 37.65 | 44.77 | 45.54 | 28.74 |
| GPT-4o | 31.99 | 22.15* | 32.83 | 36.19 | 42.14 | 40.74 | 40.00 | 33.33 | 38.04 | 37.86 | 24.53 |
| Doubao-Seed-1.8 | 49.87 | 29.91 | 49.31 | 52.75 | 61.13 | 59.14 | 85.00 | 52.45 | 57.92 | 54.29 | 37.10 |
| Qwen3-Omni-30B-A3B | 53.61 | 52.15 | 64.56 | 64.77 | 68.61 | 63.29 | 90.00 | 53.14 | 64.08 | 60.18 | 27.14 |
| Qwen3-VL-30B-A3B | 64.33 | 29.83* | 54.55 | 54.68 | 60.96 | 60.45 | 40.00 | 40.22 | 50.84 | 41.25 | 31.59 |
| Dispider | 19.58 | 13.80* | 18.67 | 25.10 | 22.50 | 22.12 | 50.00 | 17.20 | 21.65 | 17.32 | 10.94 |
| VideoLLM-online-8B | 12.62 | 17.97* | 21.47 | 12.90 | 22.64 | 21.94 | 0.00 | 18.14 | 21.53 | 21.13 | 14.44 |
| MMDuet2 | 22.27 | 18.35* | 22.24 | 36.10 | 24.14 | 23.63 | 10.00 | 17.90 | 22.16 | 20.42 | 9.12 |
| Dataset | Type | License |
|---|---|---|
| brain4cars | Driving | BSD 2-Clause |
| Waymo-E2E | Driving | Waymo Dataset License |
| Apidis-Basketball | Sports | Apidis Academic License |
| e-Sports (Self-recorded) | Sports | CC BY-NC 4.0 |
| Split-screen Game (YouTube) | Sports | CC BY-NC 4.0 / YouTube Standard License |
| DROID | Robot | CC BY 4.0 |
| UAV-VisLoc | Robot | Apache 2.0 |
| EgoExo4D | Daily Routine | Ego4D License |
| EgoLife | Daily Routine | MIT License |
| Seamless-interaction | Chat | CC BY-NC 4.0 |
| WILDTRACK | Surveillance | None |
| All-Day | Surveillance | CC BY-NC 4.0 |
| FaceEngage | Live Streaming | CC BY-NC 4.0 / YouTube Standard License |
| Streamer-React (YouTube) | Live Streaming | CC BY-NC 4.0 / YouTube Standard License |
| Map-Street (Baidu/Google Map API) | Interfaces | API Terms of Service |
| Comma2K-19 with dashboard | Interfaces | MIT License |
@inproceedings{sun2026xstream,
title = {X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding},
author = {Sun, Peiwen and Lu, Xudong and Liu, Huadai and Bo, Yang and Wu, Dongming and Guan, Huankang and Cai, Minghong and Chen, Jinpeng and Guo, Xintong and Li, Shuhan and Liu, Rui and Yue, Xiangyu},
booktitle = {European Conference on Computer Vision},
year = {2026}
}