I am Peiwen Sun, a Ph.D. student at the Multimedia Laboratory (MMLab), The Chinese University of Hong Kong (CUHK), advised by Prof. Xiangyu Yue. I received my B.Eng. and M.Eng. degrees from Beijing University of Posts and Telecommunications (BUPT).
My research focuses on multimodal learning. I am particularly interested in audio-visual understanding and generation. My goal is to build systems that perceive, reason about, and generate content across audio, vision, and language in the physical world.
🔥 News
- 2026.06: 🎉 X-Stream was accepted to ECCV 2026. See you in Malmö.
- 2026.04: 🎉 SpaceVista was accepted to ICML 2026. See you in Seoul.
🧭 Research Journey
My research has moved through three connected chapters.
📝 Selected Publications

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
Peiwen Sun*, Xudong Lu*, Huadai Liu*, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Rui Liu, Xiangyu Yue
ECCV 2026
- The first exploration of multi-stream streaming understanding, framing MLLMs as multiplexers over concurrent video streams.

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Peiwen Sun*, Shiqiang Lang*, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue
ICML 2026
- SpaceVista-1M and SpaceVista-7B for all-scale spatial reasoning across five spatial scales with scale-aware experts.

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo
🎉🎉🎉 ICLR 2025 Spotlight 🎉🎉🎉
- BEWO-1M dataset and the SpatialSonic model for language-driven, controllable stereo (spatial) audio generation.

Unveiling and Mitigating Bias in Audio Visual Segmentation
Peiwen Sun, Honggang Zhang, Di Hu
🎉🎉🎉 ACM MM 2024 Oral 🎉🎉🎉
- Identifies and mitigates “audio priming bias” and “visual prior” in audio-visual segmentation via active queries and contrastive debiasing.

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
Yaoting Wang*, Peiwen Sun*, Dongzhan Zhou*, Guangyao Li, Honggang Zhang, Di Hu
ECCV 2024
- A new task and benchmark that segments objects in videos from natural-language expressions enriched with audio-visual cues.

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue
AAAI 2025
- X-Codec injects semantic features into the codec to improve audio language models across speech, music, and sound.

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu
ECCV 2024
- A two-stage progressive training strategy that decouples localization from semantics for audio-visual semantic segmentation.

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?
Yaoting Wang*, Peiwen Sun*, Yuanchao Li, Honggang Zhang, Di Hu
ECCV 2024
- Leverages textual semantics to strengthen audio guidance and mitigate the sounding-object segmentation preference in AVS.

A Method of Audio-Visual Person Verification by Mining Connections between Time Series
Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu
Interspeech 2023
- Mines temporal connections between audio and visual streams for robust audio-visual person verification.
Selected Other Publications
- CVPR 2026 | OneThinker: All-in-one Reasoning Model for Image and Video. Kaituo Feng, Manyuan Zhang, …, Peiwen Sun, et al., Xiangyu Yue. 📄 Paper | 💻 Code
- ICML 2025 | OmniAudio: Generating Spatial Audio from 360-Degree Video. Huadai Liu, Tianyi Luo, …, Peiwen Sun, et al. 📄 Paper | 🌐 Page | 💻 Code
- ICLR 2026 | PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation. Huadai Liu, Kaicheng Luo, …, Peiwen Sun, et al., Wei Xue. 📄 Paper | 🌐 Page | 💻 Code
- ICML 2026 | PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios. Xudong Lu, Huankang Guan, …, Peiwen Sun, et al. 📄 Paper | 💻 Code
- ACM MM 2024 | FlashSpeech: Efficient Zero-Shot Speech Synthesis. Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, et al. 📄 Paper | 🌐 Page | 💻 Code
- Interspeech 2026 | Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation. Zhen Ye, Xu Tan, …, Peiwen Sun, et al. 📄 Paper
- ICASSP 2026 | Free2frame: A Training-Free Framework for Video Understanding with Memory Boosting. Shiqiang Lang, Peiwen Sun, et al., Honggang Zhang. 📄 Paper
- CVPR 2026 Findings | Evolve Vision-Language-Action Model into an Agent with On-the-fly Tool-use. Ding Yi, Yanzhao Yu, …, Peiwen Sun, et al., Xiangyu Yue. 📄 Paper
- arXiv 2026 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams. Xudong Lu, Yang Bo, …, Peiwen Sun, et al., Hongsheng Li. 📄 Paper
- arXiv 2026 | LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video. Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, et al. 📄 Paper
- arXiv 2026 | Benchmark Everything Everywhere All at Once. Shiyun Xiong, Dongming Wu, Peiwen Sun, et al., Xiangyu Yue. 📄 Paper | 🌐 Page | 💻 Code
- arXiv 2026 | Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling. Zhen Ye, Xu Tan, …, Peiwen Sun, et al., Wei Xue. 📄 Paper
🎖 Honors and Awards
- Outstanding Graduate 2025, BUPT.
- Outstanding Graduate 2022, BUPT.
- China National Scholarship.
- MCM/ICM Finalist Award (Mathematical Contest in Modeling).
🧑‍🔬 Academic Service
- ICLR (2023 – Now), ICML (2024 – Now), NeurIPS (2024 – Now)
- CVPR (2023 – Now), ECCV (2024 – Now), ICCV (2025 – Now)
- ACM MM (2024 – Now), COLING (2024 – Now), ICASSP (2024 – Now), Interspeech (2024 – Now)
📖 Education
- 2025 - now, Ph.D. Student, Multimedia Laboratory (MMLab), The Chinese University of Hong Kong.
- 2022 - 2025, M.Eng., Beijing University of Posts and Telecommunications (BUPT).
- 2018 - 2022, B.Eng., Beijing University of Posts and Telecommunications (BUPT).
💻 Internships
- 2026, Research Intern, Huawei Hong Kong Research Institute.
- 2025, Research Intern, Astribot Inc.
- 2024, Research Intern, HKUST.
- 2021, Research Intern, Tencent.
- 2020, Research Intern, Megvii.
🎨 Hobbies
Life outside research keeps me curious: photography, small side projects, and time outdoors.
📷 Photography
Capturing everyday moments and the places I travel to.
Gear I've shot with
- Camera: Sony α7 V (ILCE-7M5) · Sony FE 24-70mm F2.8 GM II · Tamron 70-300mm F/4.5-6.3 Di III RXD
- Action camera: Insta360 X5 · DJI Osmo Action 5 Pro · GoPro HERO11
- Gimbal camera: DJI Osmo Pocket 3 · DJI Pocket 2
💻 Coding for daily life
Building small, handy tools that solve everyday problems.
- WhoGoesConf — find which of a researcher’s Scholar co-authors also have a paper at a target conference (e.g., ECCV 2026).
🏔️ Outdoor sports
- Cycling — 600 km in 6 days
- Motorcycling — Beijing to Tibet, 5000 km
- Hiking — Annapurna Base Camp (ABC) Trek, Nepal
- Scuba diving — 100+ dives across Southeast Asia
- Freediving — AIDA 3 diver & spearfishing
- Surfing — I can do this all day
- Sailing — RYA (Royal Yachting Association) certified