Multi-Stream Streaming Understanding Benchmark

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun¹*, Xudong Lu¹*, Huadai Liu³*, Yang Bo², Dongming Wu¹, Huankang Guan², Minghong Cai¹, Jinpeng Chen², Xintong Guo², Shuhan Li², Fang Liu², Rui Liu², Xiangyu Yue¹
¹MMLab, The Chinese University of Hong Kong; ²Huawei Inc.; ³Independent
Figure: X-Stream multi-stream scenarios. X-Stream covers multi-angle, multi-view, and multi-device streaming scenarios.

X-Stream evaluates whether MLLMs can continuously integrate multiple synchronized streams instead of taking single-stream shortcuts.

4,220 curated QA pairs
932 videos
451 multi-stream takes
160 hours of retained data
11 progressive subtasks
2-5 streams per take

Overview

While video streaming understanding has made significant strides, real-world applications such as live sports broadcasting, autonomous driving, and multi-screen collaboration inherently demand continuous, multi-stream interactions. Existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning.

We introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. It comprises 4,220 rigorously curated QA pairs across 932 videos and evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios.

X-Stream uses a dual-verification pipeline to prevent over-reliance on a single stream. We further conceptualize MLLMs as naive multiplexers and evaluate spatial, time, and semantic division strategies under a fixed token budget. Current state-of-the-art MLLMs still struggle with concurrent streams: the best model reaches an overall score of only about 50%, and proactive ability is poor across the board.

Figure: Multi-stream task illustration. Multi-stream tasks require synchronized videos with temporal constraints and alignment, while preserving online streaming properties.

First Multi-Stream Streaming Benchmark

X-Stream evaluates online understanding over concurrent streams from multi-window, multi-view, and multi-device settings.

Shortcut-Resistant Data Protocol

Sufficiency and necessity checks filter out pseudo multi-stream questions and retain only samples that require joint evidence across streams.

MLLMs as Multiplexers

The benchmark exposes trade-offs among spatial, time, and semantic division multiplexing for future multi-stream agents.

Benchmark and Dataset

X-Stream is built from 857 hours of raw multi-stream data across more than 20 sources. After preprocessing and pairing, 160 hours are retained across 451 takes with 2 to 5 streams.

Figure: X-Stream ability taxonomy. Four multi-stream abilities and 11 progressive subtasks.

Benchmark Comparison

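| Tasks | Dataset | QA | Videos | Videos / Take | Duration (h) | Cross Stream | Open Ended | Streaming | Proactive |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Multi-View and Multi-Video | EgoLife-Eval | 0.3K | 1 | 6 | 20 | No | No | No | No |
| Multi-View and Multi-Video | ProMQA-Assembly | 0.4K | 0.2K | 2 | 7 | No | Yes | No | No |
| Multi-View and Multi-Video | WaymoQA | 6.4K | ~1K | 2 | ~2 | Yes | Yes | No | No |
| Multi-View and Multi-Video | MVU-Bench | 1.8K | 5K | 3-5 | 15 | No | No | No | No |
| Multi-View and Multi-Video | VidDiff | 4.5K | 0.5K | 2 | 3 | No | Yes | No | No |
| Streaming | OVO-Bench | 2.8K | 0.6K | 1 | 85 | No | No | Yes | Yes |
| Streaming | StreamingBench | 4.5K | 0.9K | 1 | 136 | No | No | Yes | Yes |
| Streaming | Inf-Streams-Eval | 2.5K | 0.5K | 1 | 42 | No | Yes | Yes | No |
| Streaming | LiveSports | 1.2K | 0.8K | 1 | 40 | No | Yes | Yes | Yes |
| Streaming | ProactiveVideoQA | 1.4K | 1.4K | 1 | 49 | No | Yes | Yes | Yes |
| Streaming | OmniMMI | 2.3K | 1.1K | 1 | 100 | No | Yes | Yes | Yes |
| Streaming | MMDuet | 2.0K | 2.0K | 1 | 100 | No | Yes | Yes | Yes |
| Streaming | ESTP-Bench | 2.3K | 1.2K | 1 | 80 | No | Yes | Yes | Yes |
| Streaming | PhoStream | 5.6K | 0.6K | 1 | 92 | No | Yes | Yes | Yes |
| Multi-Stream | X-Stream (Ours) | 4.2K | 0.9K | 2-5 | 160 | Yes | Yes | Yes | Yes |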

Data Construction and Multiplexing

Dual-Verification Pipeline

  1. Preprocessing: resample raw videos to 2 FPS and split them into chunks that fit storage and transmission limits.
  2. QA generation: combine MLLM generation, curated templates, and Gemini-3-Pro refinement with timestamps and rationales.
  3. Multi-stream sufficiency: keep only questions answerable from the complete synchronized clip.
  4. Multi-stream necessity: reject samples that can be answered from isolated single streams.
  5. Human verification: 31 expert annotators perform two rounds of review, correction, and filtering.
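
The sufficiency and necessity checks in steps 3 and 4 can be read as a two-part filter. The sketch below is only an illustration of that logic, not the authors' implementation: `QASample`, `Answerable`, `passes_dual_verification`, `filter_samples`, and `make_toy_oracle` are hypothetical names, and the judge that decides whether a question is answerable from a given set of streams is replaced by a toy oracle.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class QASample:
    question: str
    answer: str
    stream_ids: tuple[str, ...]  # all streams in the synchronized take


# Hypothetical judge: True if the question can be answered correctly
# when only the listed streams are visible (e.g. an MLLM plus a grader).
Answerable = Callable[[QASample, Sequence[str]], bool]


def passes_dual_verification(sample: QASample, answerable: Answerable) -> bool:
    # Sufficiency: the complete synchronized clip must answer the question.
    if not answerable(sample, sample.stream_ids):
        return False
    # Necessity: no isolated single stream may answer it on its own.
    return not any(answerable(sample, (sid,)) for sid in sample.stream_ids)


def filter_samples(samples: Sequence[QASample], answerable: Answerable) -> list[QASample]:
    return [s for s in samples if passes_dual_verification(s, answerable)]


def make_toy_oracle(required: dict[str, set[str]]) -> Answerable:
    """Toy stand-in for the judge: answerable iff every required stream is visible."""
    def answerable(sample: QASample, visible: Sequence[str]) -> bool:
        return required[sample.question] <= set(visible)
    return answerable


samples = [
    QASample("q_joint", "yes", ("cam_a", "cam_b")),         # needs both streams
    QASample("q_single", "red", ("cam_a", "cam_b")),        # one stream is enough
    QASample("q_unanswerable", "n/a", ("cam_a", "cam_b")),  # evidence not in the take
]
oracle = make_toy_oracle({
    "q_joint": {"cam_a", "cam_b"},
    "q_single": {"cam_a"},
    "q_unanswerable": {"cam_c"},
})
kept = filter_samples(samples, oracle)
print([s.question for s in kept])
```

In this toy run only the question that needs both cameras survives: the single-stream question fails the necessity check and the unanswerable one fails the sufficiency check.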

MLLMs as Naive Multiplexers

Since an MLLM consumes only one token sequence at a time, X-Stream studies how to merge multiple video streams under a fixed average video token rate of C_max = 250 tokens per video second.
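
To get a feel for how tight this budget is, the following sketch computes the tokens left for each frame of each stream. It rests on two assumptions that the text does not state: that the 250 tokens per second are shared evenly by all streams in a take, and that frames arrive at the 2 FPS used in preprocessing. The function name `tokens_per_frame` is hypothetical.

```python
C_MAX = 250  # tokens per video second, as stated in the text


def tokens_per_frame(num_streams: int, fps: float = 2.0, c_max: int = C_MAX) -> int:
    """Tokens available for one frame of one stream.

    Assumes (not confirmed by the source) that the budget is split evenly
    across streams and across the frames sampled in each second.
    """
    if num_streams < 1 or fps <= 0:
        raise ValueError("num_streams must be >= 1 and fps must be positive")
    return int(c_max // (num_streams * fps))


for n in range(2, 6):  # X-Stream takes contain 2 to 5 streams
    print(n, "streams ->", tokens_per_frame(n), "tokens per frame")
```

Under these assumptions the per-frame budget falls from 62 tokens with two streams to 25 tokens with five, which is one way to see why scalability to higher stream counts matters.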

Spatial Division: stitch concurrent frames into a single visual input.
Time Division: interleave streams while preserving temporal alignment and stream identifiers.
Semantic Division: retain salient tokens before interleaving, improving scalability for higher stream counts.
Figure: Multiplexing strategies. X-Stream studies spatial, time, and semantic division multiplexing as practical ways to combine multiple video streams into one token stream under a fixed token budget.
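
The three strategies can be illustrated with a toy sketch. It is a simplified reading of the descriptions above, not the benchmark's code: frames are plain nested lists of pixel values, tokens carry a hypothetical `saliency` score, and the names `Token`, `spatial_division`, `time_division`, and `semantic_division` are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    stream: str      # stream identifier, kept so the model can tell streams apart
    t: float         # timestamp in seconds
    saliency: float  # hypothetical salience score (higher = more important)


def spatial_division(frames: list[list[list[int]]]) -> list[list[int]]:
    """Stitch concurrent, same-height frames side by side into one wider frame."""
    height = len(frames[0])
    if any(len(frame) != height for frame in frames):
        raise ValueError("all frames must have the same height")
    return [[px for frame in frames for px in frame[row]] for row in range(height)]


def time_division(streams: dict[str, list[Token]]) -> list[Token]:
    """Interleave tokens by timestamp; ties are ordered by stream identifier."""
    merged = [tok for toks in streams.values() for tok in toks]
    return sorted(merged, key=lambda tok: (tok.t, tok.stream))


def semantic_division(streams: dict[str, list[Token]], budget_per_step: int) -> list[Token]:
    """Keep the most salient tokens at each timestamp, then interleave as above."""
    by_time: dict[float, list[Token]] = {}
    for tok in time_division(streams):
        by_time.setdefault(tok.t, []).append(tok)
    kept: list[Token] = []
    for t in sorted(by_time):
        group = by_time[t]
        ranked = sorted(range(len(group)), key=lambda i: -group[i].saliency)
        keep = set(ranked[:budget_per_step])
        # Preserve the interleaved order among the surviving tokens.
        kept.extend(tok for i, tok in enumerate(group) if i in keep)
    return kept


streams = {
    "a": [Token("a", 0.0, 0.9), Token("a", 0.0, 0.1), Token("a", 0.5, 0.4)],
    "b": [Token("b", 0.0, 0.5), Token("b", 0.5, 0.8)],
}
print(spatial_division([[[1, 2], [3, 4]], [[5], [6]]]))
print([(tok.stream, tok.t) for tok in time_division(streams)])
print([(tok.stream, tok.saliency) for tok in semantic_division(streams, budget_per_step=2)])
```

Spatial division acts on pixels before encoding, while the other two act on tokens after encoding. In this sketch, semantic division is the only strategy whose output length stays bounded by `budget_per_step` per timestamp as more streams are added.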

Experimental Results

X-Stream uses online inference and LLM-as-a-Judge evaluation. The main comparison uses spatial division multiplexing and reports instant, backward, forward, comprehensive, timing, and multi-stream ability scores.

Main Streaming Performance

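| Model | Overall | Instant | Backward | Forward | Compre. | ER ↓ | NR ↓ | Single Stream | Multi Coop. | Cross Ref. | Cross Inter. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Human (video, audio) | 91.84 | 91.73 | 95.19 | 85.10 | 97.50 | 9.50 | 2.55 | 94.12 | 92.05 | 90.10 | 98.55 |
| Proprietary Multimodal Models | | | | | | | | | | | |
| Gemini 3 Pro (video, audio) | 49.60 | 73.38 | 72.23 | 20.77 | 82.04 | 73.13 | 0.23 | 72.45 | 71.16 | 74.79 | 66.96 |
| GPT-5 (video) | 27.78 | 44.28 | 37.18 | 6.51 | 59.83 | 81.73 | 1.14 | 39.08 | 44.12 | 52.75 | 45.65 |
| GPT-4o (video) | 22.46 | 37.28 | 32.72 | 4.05 | 47.01 | 87.14 | 0.74 | 34.83 | 34.90 | 43.77 | 37.52 |
| Doubao-Seed-1.8 (video) | 36.79 | 55.49 | 57.18 | 14.52 | 59.13 | 66.19 | 3.95 | 47.55 | 35.69 | 56.52 | 60.82 |
| Open-source Multimodal Models | | | | | | | | | | | |
| Qwen2.5-VL-7B (video) | 25.49 | 40.02 | 36.02 | 8.34 | 45.28 | 68.10 | 11.36 | 43.80 | 41.43 | 42.72 | 40.01 |
| Qwen2.5-Omni-7B (video, audio) | 26.82 | 41.96 | 41.17 | 9.03 | 45.04 | 53.19 | 22.51 | 38.60 | 40.80 | 41.86 | 44.35 |
| Qwen3-VL-8B (video) | 26.78 | 43.41 | 33.30 | 7.53 | 51.01 | 78.40 | 6.50 | 49.88 | 43.41 | 33.30 | 51.01 |
| Qwen3-Omni-30B-A3B (video, audio) | 34.28 | 63.92 | 53.40 | 0.61 | 69.16 | 98.81 | 0.27 | 63.41 | 55.68 | 66.08 | 56.58 |
| Qwen3-VL-30B-A3B (video) | 34.19 | 52.09 | 38.54 | 14.46 | 57.26 | 73.91 | 1.18 | 54.68 | 57.90 | 65.98 | 65.22 |
| Open-source Streaming Models | | | | | | | | | | | |
| Dispider (video) | 15.44 | 21.71 | 19.29 | 8.09 | 23.90 | 55.63 | 7.26 | 38.01 | 21.97 | 23.65 | 31.37 |
| VideoLLM-online-8B (video) | 8.48 | 15.00 | 15.53 | 0.03 | 17.67 | 99.10 | 0.66 | 13.15 | 16.90 | 10.15 | 6.70 |
| MMDuet2 (video) | 6.79 | 11.76 | 10.37 | 1.44 | 11.27 | 31.49 | 54.11 | 15.84 | 14.44 | 9.16 | 4.96 |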

Far from human performance

The strongest model, Gemini 3 Pro, reaches 49.60 overall, compared with 91.84 for human experts, a gap of more than 42 points.

Forward questions bottleneck

Proactive timing is difficult: Gemini 3 Pro leads with only 20.77 on forward questions, and every other evaluated model scores below 15.

Multiplexing has trade-offs

Spatial, time, and semantic division differ in detail preservation, latency, and high stream-count scalability.

Performance Across 11 Core Tasks

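| Model | Visual Grd. | Audio Grd. | Temporal Grd. | Object Count. | Saliency Detect. | 3D Spa. | Counterfactual | Causal | Common Sense | Anomaly | Decision |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3 Pro | 66.72 | 64.82 | 68.93 | 76.37 | 63.61 | 70.82 | 75.00 | 41.79 | 69.35 | 70.52 | 44.18 |
| GPT-5 | 36.68 | 21.63* | 36.64 | 42.52 | 53.55 | 52.28 | 15.00 | 37.65 | 44.77 | 45.54 | 28.74 |
| GPT-4o | 31.99 | 22.15* | 32.83 | 36.19 | 42.14 | 40.74 | 40.00 | 33.33 | 38.04 | 37.86 | 24.53 |
| Doubao-Seed-1.8 | 49.87 | 29.91 | 49.31 | 52.75 | 61.13 | 59.14 | 85.00 | 52.45 | 57.92 | 54.29 | 37.10 |
| Qwen3-Omni-30B-A3B | 53.61 | 52.15 | 64.56 | 64.77 | 68.61 | 63.29 | 90.00 | 53.14 | 64.08 | 60.18 | 27.14 |
| Qwen3-VL-30B-A3B | 64.33 | 29.83* | 54.55 | 54.68 | 60.96 | 60.45 | 40.00 | 40.22 | 50.84 | 41.25 | 31.59 |
| Dispider | 19.58 | 13.80* | 18.67 | 25.10 | 22.50 | 22.12 | 50.00 | 17.20 | 21.65 | 17.32 | 10.94 |
| VideoLLM-online-8B | 12.62 | 17.97* | 21.47 | 12.90 | 22.64 | 21.94 | 0.00 | 18.14 | 21.53 | 21.13 | 14.44 |
| MMDuet2 | 22.27 | 18.35* | 22.24 | 36.10 | 24.14 | 23.63 | 10.00 | 17.90 | 22.16 | 20.42 | 9.12 |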

Resources

Figure: Representative multi-stream examples from the X-Stream release.

Data Source Licenses

| Dataset | Type | License |
| --- | --- | --- |
| brain4cars | Driving | BSD 2-Clause |
| Waymo-E2E | Driving | Waymo Dataset License |
| Apidis-Basketball | Sports | Apidis Academic License |
| e-Sports (Self-record) | Sports | CC BY-NC 4.0 |
| Split-screen Game (Youtube) | Sports | CC BY-NC 4.0 / YouTube Standard License |
| DROID | Robot | CC BY 4.0 |
| UAV-VisLoc | Robot | Apache 2.0 |
| EgoExo4D | Daily Routine | Ego4D License |
| EgoLife | Daily Routine | MIT License |
| Seamless-interaction | Chat | CC BY-NC 4.0 |
| WILDTRACK | Surveillance | None |
| All-Day | Surveillance | CC BY-NC 4.0 |
| FaceEngage | Live Streaming | CC BY-NC 4.0 / YouTube Standard License |
| Streamer-React (Youtube) | Live Streaming | CC BY-NC 4.0 / YouTube Standard License |
| Map-Street (Baidu/Google Map API) | Interfaces | API Terms of Service |
| Comma2K-19 w/. dashboard | Interfaces | MIT License |

BibTeX

@article{sun2026x,
  title={X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding},
  author={Sun, Peiwen and Lu, Xudong and Liu, Huadai and Bo, Yang and Wu, Dongming and Guan, Huankang and Cai, Minghong and Chen, Jinpeng and Guo, Xintong and Li, Shuhan and others},
  journal={arXiv preprint arXiv:2606.02482},
  year={2026}
}