Multi-Stream Streaming Understanding Benchmark

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun¹, Xudong Lu¹, Huadai Liu³, Yang Bo², Dongming Wu¹, Huankang Guan², Minghong Cai¹, Jinpeng Chen², Xintong Guo², Shuhan Li², Rui Liu², Xiangyu Yue¹
¹MMLab, The Chinese University of Hong Kong; ²Huawei Inc.; ³Independent
Figure 1. X-Stream covers multi-angle, multi-view, and multi-device streaming scenarios.

X-Stream evaluates whether MLLMs can continuously integrate multiple synchronized streams instead of taking single-stream shortcuts.

4,220 curated QA pairs
932 videos
451 multi-stream takes
160 h retained data
11 progressive subtasks
2-5 streams per take

Overview

While video streaming understanding has made significant strides, real-world applications such as live sports broadcasting, autonomous driving, and multi-screen collaboration inherently demand continuous, multi-stream interactions. Existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning.

We introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. It comprises 4,220 rigorously curated QA pairs across 932 videos and evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios.

X-Stream uses a dual-verification pipeline to prevent over-reliance on a single stream. We further conceptualize MLLMs as naive multiplexers and evaluate spatial-, time-, and semantic-division multiplexing strategies under a fixed token budget. Current state-of-the-art MLLMs still struggle with concurrent streams, achieving an overall score of only about 50% and showing poor proactive ability.

Figure 2. Multi-stream tasks require synchronized videos with temporal constraints and alignment, while preserving online streaming properties.

First Multi-Stream Streaming Benchmark

X-Stream evaluates online understanding over concurrent streams from multi-window, multi-view, and multi-device settings.

Shortcut-Resistant Data Protocol

Sufficiency and necessity verification filters out pseudo multi-stream questions and retains only samples that require joint evidence from several streams.

MLLMs as Multiplexers

The benchmark exposes trade-offs among spatial, time, and semantic division multiplexing for future multi-stream agents.

Benchmark and Dataset

X-Stream is built from 857 hours of raw multi-stream data drawn from more than 20 sources. After preprocessing and pairing, 160 hours (about 19% of the raw footage) are retained across 451 takes, each with 2 to 5 streams.

Figure 3. Four multi-stream abilities and 11 progressive subtasks.

Benchmark Comparison

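| Tasks | Dataset | QA | Videos | Videos / Take | Duration (h) | Cross Stream | Open Ended | Streaming | Proactive |
|---|---|---|---|---|---|---|---|---|---|
| Multi-View and Multi-Video | EgoLife-Eval | 0.3K | 1 | 6 | 20 | No | No | No | No |
| Multi-View and Multi-Video | ProMQA-Assembly | 0.4K | 0.2K | 2 | 7 | No | Yes | No | No |
| Multi-View and Multi-Video | WaymoQA | 6.4K | ~1K | 2 | ~2 | Yes | Yes | No | No |
| Multi-View and Multi-Video | MVU-Bench | 1.8K | 5K | 3-5 | 15 | No | No | No | No |
| Multi-View and Multi-Video | VidDiff | 4.5K | 0.5K | 2 | 3 | No | Yes | No | No |
| Streaming | OVO-Bench | 2.8K | 0.6K | 1 | 85 | No | No | Yes | Yes |
| Streaming | StreamingBench | 4.5K | 0.9K | 1 | 136 | No | No | Yes | Yes |
| Streaming | Inf-Streams-Eval | 2.5K | 0.5K | 1 | 42 | No | Yes | Yes | No |
| Streaming | LiveSports | 1.2K | 0.8K | 1 | 40 | No | Yes | Yes | Yes |
| Streaming | ProactiveVideoQA | 1.4K | 1.4K | 1 | 49 | No | Yes | Yes | Yes |
| Streaming | OmniMMI | 2.3K | 1.1K | 1 | 100 | No | Yes | Yes | Yes |
| Streaming | MMDuet | 2.0K | 2.0K | 1 | 100 | No | Yes | Yes | Yes |
| Streaming | ESTP-Bench | 2.3K | 1.2K | 1 | 80 | No | Yes | Yes | Yes |
| Streaming | PhoStream | 5.6K | 0.6K | 1 | 92 | No | Yes | Yes | Yes |
| Multi-Stream | X-Stream (Ours) | 4.2K | 0.9K | 2-5 | 160 | Yes | Yes | Yes | Yes |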

Data Construction and Multiplexing

Dual-Verification Pipeline

  1. Preprocessing: resample raw videos to 2 FPS and split them into chunks under storage and transmission limits.
  2. QA generation: combine MLLM generation, curated templates, and Gemini-3-Pro refinement with timestamps and rationales.
  3. Multi-stream sufficiency: keep only questions answerable from the complete synchronized clip.
  4. Multi-stream necessity: reject samples that can be answered from isolated single streams.
  5. Human verification: 31 expert annotators perform two rounds of review, correction, and filtering.
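
The sufficiency and necessity checks in steps 3 and 4 can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: `QASample`, `answerable_full`, and `answerable_single` are hypothetical names, and the sketch assumes an upstream verifier has already recorded whether each question can be answered from the complete synchronized clip and from each isolated stream.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QASample:
    """Hypothetical record for one candidate question (not the released schema)."""
    question: str
    # True if the verifier answered correctly when given all synchronized streams.
    answerable_full: bool
    # Per-stream result: True if that stream alone was enough to answer.
    answerable_single: Dict[str, bool] = field(default_factory=dict)


def passes_dual_verification(sample: QASample) -> bool:
    """Keep a sample only if it is sufficient AND necessary in the multi-stream sense."""
    # Sufficiency: the complete synchronized clip must allow a correct answer.
    sufficient = sample.answerable_full
    # Necessity: no isolated stream may be enough on its own.
    necessary = not any(sample.answerable_single.values())
    return sufficient and necessary


def filter_samples(samples: List[QASample]) -> List[QASample]:
    """Drop pseudo multi-stream questions, preserving the input order."""
    return [s for s in samples if passes_dual_verification(s)]


if __name__ == "__main__":
    candidates = [
        QASample("joint evidence needed", True, {"cam_a": False, "cam_b": False}),
        QASample("single-stream shortcut", True, {"cam_a": True, "cam_b": False}),
        QASample("unanswerable", False, {"cam_a": False, "cam_b": False}),
    ]
    for kept in filter_samples(candidates):
        print(kept.question)
```

Only the first candidate survives: the second can be answered from `cam_a` alone, and the third fails the sufficiency check.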

MLLMs as Naive Multiplexers

Because an MLLM processes only one token stream at a time, X-Stream studies ways to integrate multiple video streams under a fixed average video token rate of C_max = 250 tokens per video second.
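
To make the budget concrete, the arithmetic below estimates how many visual tokens each frame could receive. Two assumptions are made purely for illustration and are not stated by the source: the 250 tokens per video second are treated as a total shared evenly by all streams, and frames are sampled at the 2 FPS used in preprocessing. `tokens_per_frame` is a hypothetical helper.

```python
def tokens_per_frame(num_streams: int, fps: float = 2.0, c_max: float = 250.0) -> float:
    """Tokens available per frame per stream under an evenly shared budget.

    Assumption: c_max is a total budget per video second, divided equally
    across streams and across the frames sampled in that second.
    """
    if num_streams < 1 or fps <= 0:
        raise ValueError("num_streams must be >= 1 and fps must be positive")
    return c_max / (fps * num_streams)


if __name__ == "__main__":
    # X-Stream takes contain 2 to 5 streams.
    for n in range(2, 6):
        print(n, "streams ->", tokens_per_frame(n), "tokens per frame")
```

Under these assumptions a 2-stream take leaves 62.5 tokens per frame and a 5-stream take leaves 25, which illustrates why scalability to higher stream counts matters.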

Spatial Division: stitch concurrent frames into a single visual input.
Time Division: interleave streams while preserving temporal alignment and stream identifiers.
Semantic Division: retain salient tokens before interleaving, improving scalability for higher stream counts.
Method. X-Stream studies spatial, time, and semantic division multiplexing as practical ways to combine multiple video streams into one token stream under a fixed token budget.
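
The difference between spatial and time division can be sketched on toy data. This is an illustrative sketch under stated assumptions, not the benchmark's implementation: frames are stand-in strings rather than images, `time_division` and `spatial_division` are hypothetical names, and semantic division is omitted because the source does not describe how salient tokens are selected.

```python
from collections import defaultdict


def time_division(streams):
    """Interleave frames from all streams into one sequence.

    `streams` maps a stream id to a list of (timestamp, frame) pairs.
    Items are ordered by timestamp, then by stream id, and each item keeps
    its stream identifier so the sources stay distinguishable.
    """
    tagged = [
        (ts, stream_id, frame)
        for stream_id, frames in streams.items()
        for ts, frame in frames
    ]
    return sorted(tagged, key=lambda item: (item[0], item[1]))


def spatial_division(streams):
    """Group frames that share a timestamp into one composite input.

    A real system would stitch pixels into a single image; here the
    composite is simply a tuple of frames in stream-id order.
    """
    by_time = defaultdict(dict)
    for stream_id, frames in streams.items():
        for ts, frame in frames:
            by_time[ts][stream_id] = frame
    return [
        (ts, tuple(by_time[ts][sid] for sid in sorted(by_time[ts])))
        for ts in sorted(by_time)
    ]


if __name__ == "__main__":
    demo = {
        "front": [(0.0, "f0"), (0.5, "f1")],
        "rear": [(0.0, "r0"), (0.5, "r1")],
    }
    print(time_division(demo))
    print(spatial_division(demo))
```

Time division yields one item per frame per stream, so the sequence grows with the stream count; spatial division yields one composite per timestamp, so each source gets less resolution as streams are added.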

Experimental Results

X-Stream uses online inference and LLM-as-a-Judge evaluation. The main comparison uses spatial division multiplexing and reports instant, backward, forward, comprehensive, timing, and multi-stream ability scores.

Main Streaming Performance

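| Model | Overall | Instant | Backward | Forward | Compre. | ER ↓ | NR ↓ | Single Stream | Multi Coop. | Cross Ref. | Cross Inter. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human (video, audio) | 91.84 | 91.73 | 95.19 | 85.10 | 97.50 | 9.50 | 2.55 | 94.12 | 92.05 | 90.10 | 98.55 |
| Proprietary Multimodal Models | | | | | | | | | | | |
| Gemini 3 Pro (video, audio) | 49.60 | 73.38 | 72.23 | 20.77 | 82.04 | 73.13 | 0.23 | 72.45 | 71.16 | 74.79 | 66.96 |
| GPT-5 (video) | 27.78 | 44.28 | 37.18 | 6.51 | 59.83 | 81.73 | 1.14 | 39.08 | 44.12 | 52.75 | 45.65 |
| GPT-4o (video) | 22.46 | 37.28 | 32.72 | 4.05 | 47.01 | 87.14 | 0.74 | 34.83 | 34.90 | 43.77 | 37.52 |
| Doubao-Seed-1.8 (video) | 36.79 | 55.49 | 57.18 | 14.52 | 59.13 | 66.19 | 3.95 | 47.55 | 35.69 | 56.52 | 60.82 |
| Open-source Multimodal Models | | | | | | | | | | | |
| Qwen2.5-VL-7B (video) | 25.49 | 40.02 | 36.02 | 8.34 | 45.28 | 68.10 | 11.36 | 43.80 | 41.43 | 42.72 | 40.01 |
| Qwen2.5-Omni-7B (video, audio) | 26.82 | 41.96 | 41.17 | 9.03 | 45.04 | 53.19 | 22.51 | 38.60 | 40.80 | 41.86 | 44.35 |
| Qwen3-VL-8B (video) | 26.78 | 43.41 | 33.30 | 7.53 | 51.01 | 78.40 | 6.50 | 49.88 | 43.41 | 33.30 | 51.01 |
| Qwen3-Omni-30B-A3B (video, audio) | 34.28 | 63.92 | 53.40 | 0.61 | 69.16 | 98.81 | 0.27 | 63.41 | 55.68 | 66.08 | 56.58 |
| Qwen3-VL-30B-A3B (video) | 34.19 | 52.09 | 38.54 | 14.46 | 57.26 | 73.91 | 1.18 | 54.68 | 57.90 | 65.98 | 65.22 |
| Open-source Streaming Models | | | | | | | | | | | |
| Dispider (video) | 15.44 | 21.71 | 19.29 | 8.09 | 23.90 | 55.63 | 7.26 | 38.01 | 21.97 | 23.65 | 31.37 |
| VideoLLM-online-8B (video) | 8.48 | 15.00 | 15.53 | 0.03 | 17.67 | 99.10 | 0.66 | 13.15 | 16.90 | 10.15 | 6.70 |
| MMDuet2 (video) | 6.79 | 11.76 | 10.37 | 1.44 | 11.27 | 31.49 | 54.11 | 15.84 | 14.44 | 9.16 | 4.96 |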

Far from human performance

The strongest model, Gemini 3 Pro, reaches 49.60 overall, compared with 91.84 for human experts.

Forward questions bottleneck

Proactive timing is difficult: Gemini 3 Pro reaches only 20.77 on forward questions, and every other model scores below 15.

Multiplexing has trade-offs

Spatial, time, and semantic division differ in detail preservation, latency, and high stream-count scalability.

Performance Across 11 Core Tasks

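| Model | Visual Grd. | Audio Grd. | Temporal Grd. | Object Count. | Saliency Detect. | 3D Spa. | Counterfactual | Causal | Common Sense | Anomaly | Decision |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro | 66.72 | 64.82 | 68.93 | 76.37 | 63.61 | 70.82 | 75.00 | 41.79 | 69.35 | 70.52 | 44.18 |
| GPT-5 | 36.68 | 21.63* | 36.64 | 42.52 | 53.55 | 52.28 | 15.00 | 37.65 | 44.77 | 45.54 | 28.74 |
| GPT-4o | 31.99 | 22.15* | 32.83 | 36.19 | 42.14 | 40.74 | 40.00 | 33.33 | 38.04 | 37.86 | 24.53 |
| Doubao-Seed-1.8 | 49.87 | 29.91 | 49.31 | 52.75 | 61.13 | 59.14 | 85.00 | 52.45 | 57.92 | 54.29 | 37.10 |
| Qwen3-Omni-30B-A3B | 53.61 | 52.15 | 64.56 | 64.77 | 68.61 | 63.29 | 90.00 | 53.14 | 64.08 | 60.18 | 27.14 |
| Qwen3-VL-30B-A3B | 64.33 | 29.83* | 54.55 | 54.68 | 60.96 | 60.45 | 40.00 | 40.22 | 50.84 | 41.25 | 31.59 |
| Dispider | 19.58 | 13.80* | 18.67 | 25.10 | 22.50 | 22.12 | 50.00 | 17.20 | 21.65 | 17.32 | 10.94 |
| VideoLLM-online-8B | 12.62 | 17.97* | 21.47 | 12.90 | 22.64 | 21.94 | 0.00 | 18.14 | 21.53 | 21.13 | 14.44 |
| MMDuet2 | 22.27 | 18.35* | 22.24 | 36.10 | 24.14 | 23.63 | 10.00 | 17.90 | 22.16 | 20.42 | 9.12 |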

Resources

Resource Preview. Representative multi-stream examples from the X-Stream release.

Data Source Licenses

| Dataset | Type | License |
|---|---|---|
| brain4cars | Driving | BSD 2-Clause |
| Waymo-E2E | Driving | Waymo Dataset License |
| Apidis-Basketball | Sports | Apidis Academic License |
| e-Sports (Self-record) | Sports | CC BY-NC 4.0 |
| Split-screen Game (Youtube) | Sports | CC BY-NC 4.0 / YouTube Standard License |
| DROID | Robot | CC BY 4.0 |
| UAV-VisLoc | Robot | Apache 2.0 |
| EgoExo4D | Daily Routine | Ego4D License |
| EgoLife | Daily Routine | MIT License |
| Seamless-interaction | Chat | CC BY-NC 4.0 |
| WILDTRACK | Surveillance | None |
| All-Day | Surveillance | CC BY-NC 4.0 |
| FaceEngage | Live Streaming | CC BY-NC 4.0 / YouTube Standard License |
| Streamer-React (Youtube) | Live Streaming | CC BY-NC 4.0 / YouTube Standard License |
| Map-Street (Baidu/Google Map API) | Interfaces | API Terms of Service |
| Comma2K-19 w/. dashboard | Interfaces | MIT License |

BibTeX

@inproceedings{sun2026xstream,
  title     = {X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding},
  author    = {Sun, Peiwen and Lu, Xudong and Liu, Huadai and Bo, Yang and Wu, Dongming and Guan, Huankang and Cai, Minghong and Chen, Jinpeng and Guo, Xintong and Li, Shuhan and Liu, Rui and Yue, Xiangyu},
  booktitle = {European Conference on Computer Vision},
  year      = {2026}
}