I am Peiwen Sun, a Ph.D. student at the Multimedia Laboratory (MMLab), The Chinese University of Hong Kong (CUHK), advised by Prof. Xiangyu Yue. I received my B.Eng. and M.Eng. degrees from Beijing University of Posts and Telecommunications (BUPT).

My research focuses on multimodal learning. I am particularly interested in audio-visual understanding and generation. My goal is to build systems that perceive, reason about, and generate content across audio, vision, and language in the physical world.

News

2026.06: 🎉 X-Stream was accepted to ECCV 2026. See you in Malmö.
2026.04: 🎉 SpaceVista was accepted to ICML 2026. See you in Seoul.

Research Journey

My research has moved through three connected chapters.

Selected Publications

ECCV 2026

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Peiwen Sun^*, Xudong Lu^*, Huadai Liu^*, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Rui Liu, Xiangyu Yue

European Conference on Computer Vision (ECCV) 2026

📄 Paper | 🌐 Page | 💻 Code

The first exploration of multi-stream streaming understanding, framing MLLMs as multiplexers over concurrent video streams.

ICML 2026

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

Peiwen Sun^*, Shiqiang Lang^*, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

International Conference on Machine Learning (ICML) 2026

📄 Paper | 🌐 Page | 💻 Code

SpaceVista-1M and SpaceVista-7B for all-scale spatial reasoning across five spatial scales with scale-aware experts.

ICLR 2025 Spotlight

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Peiwen Sun^*, Sitong Cheng^*, Xiangtai Li^*, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo

International Conference on Learning Representations (ICLR) 2025 ✨ Spotlight ✨

📄 Paper | 🌐 Page | 💻 Code

BEWO-1M dataset and the SpatialSonic model for language-driven, controllable stereo (spatial) audio generation.

ACM MM 2024 Oral

Unveiling and Mitigating Bias in Audio Visual Segmentation

Peiwen Sun, Honggang Zhang, Di Hu

ACM International Conference on Multimedia (ACM MM) 2024 🎤 Oral 🎤

📄 Paper | 🌐 Page | 💻 Code

Identifies and mitigates “audio priming bias” and “visual prior” in audio-visual segmentation via active queries and contrastive debiasing.

ACM MM 2026 Oral

Discrete Coding and Masked Modeling for Text-to-Stereo Audio Generation

Yiming Li, Peiwen Sun, Zhen Ye, Jiahao Pan, Sitong Cheng, Boyi Kang, Jia Dai, Kai Li, Lie Lu, Wei Xue

ACM International Conference on Multimedia (ACM MM) 2026 🎤 Oral 🎤

🌐 Page

Introduces StereoCodec for spatially faithful discrete stereo representations, StereoGen for text-conditioned masked generation, and StereoCLAP for evaluating semantic-spatial consistency.

ECCV 2024

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang^*, Peiwen Sun^*, Dongzhan Zhou^*, Guangyao Li, Honggang Zhang, Di Hu

European Conference on Computer Vision (ECCV) 2024

📄 Paper | 🌐 Page | 💻 Code

A new task and benchmark for segmenting objects in videos from natural-language expressions enriched with audio-visual cues.

AAAI 2025

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue

AAAI Conference on Artificial Intelligence (AAAI) 2025

📄 Paper | 🌐 Page | 💻 Code

X-Codec injects semantic features into the codec to improve audio language models across speech, music, and sound.

ECCV 2024

Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu

European Conference on Computer Vision (ECCV) 2024

📄 Paper | 🌐 Page | 💻 Code

A two-stage progressive training strategy that decouples localization from semantics for audio-visual semantic segmentation.

ECCV 2024

Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Yaoting Wang^*, Peiwen Sun^*, Yuanchao Li, Honggang Zhang, Di Hu

European Conference on Computer Vision (ECCV) 2024

📄 Paper | 💻 Code

Leverages textual semantics to strengthen audio guidance and mitigate the sounding-object segmentation preference in AVS.

Interspeech 2023

A Method of Audio-Visual Person Verification by Mining Connections between Time Series

Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu

Interspeech 2023

📄 Paper

Mines temporal connections between audio and visual streams for robust audio-visual person verification.

Selected Other Publications

CVPR 2026 | OneThinker: All-in-one Reasoning Model for Image and Video. Kaituo Feng, Manyuan Zhang, …, Peiwen Sun, et al., Xiangyu Yue. 📄 Paper | 💻 Code
ICML 2025 | OmniAudio: Generating Spatial Audio from 360-Degree Video. Huadai Liu, Tianyi Luo, …, Peiwen Sun, et al. 📄 Paper | 🌐 Page | 💻 Code
ICLR 2026 | PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation. Huadai Liu, Kaicheng Luo, …, Peiwen Sun, et al., Wei Xue. 📄 Paper | 🌐 Page | 💻 Code
ICML 2026 | PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios. Xudong Lu, Huankang Guan, …, Peiwen Sun, et al. 📄 Paper | 💻 Code
ACM MM 2024 | FlashSpeech: Efficient Zero-Shot Speech Synthesis. Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, et al. 📄 Paper | 🌐 Page | 💻 Code
Interspeech 2026 | Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation. Zhen Ye, Xu Tan, …, Peiwen Sun, et al. 📄 Paper
ICASSP 2026 | Free2frame: A Training-Free Framework for Video Understanding with Memory Boosting. Shiqiang Lang, Peiwen Sun, et al., Honggang Zhang. 📄 Paper
CVPR 2026 Findings | Evolve Vision-Language-Action Model into an Agent with On-the-fly Tool-use. Ding Yi, Yanzhao Yu, …, Peiwen Sun, et al., Xiangyu Yue. 📄 Paper
ACM MM 2026 | Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling. Zhen Ye, Xu Tan, …, Peiwen Sun, et al., Wei Xue. 📄 Paper
arXiv 2026 | AURA: Always-On Understanding and Real-Time Assistance via Video Streams. Xudong Lu, Yang Bo, …, Peiwen Sun, et al., Hongsheng Li. 📄 Paper
arXiv 2026 | LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video. Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, et al. 📄 Paper
arXiv 2026 | Benchmark Everything Everywhere All at Once. Shiyun Xiong, Dongming Wu, Peiwen Sun, et al., Xiangyu Yue. 📄 Paper | 🌐 Page | 💻 Code

Honors and Awards

Outstanding Graduate, BUPT.
China National Scholarship.
MCM/ICM Finalist Award (Mathematical Contest in Modeling).

Academic Service

ICLR (2023 – Now), ICML (2024 – Now), NeurIPS (2024 – Now)
CVPR (2023 – Now), ECCV (2024 – Now), ICCV (2025 – Now)
ACM MM (2024 – Now), COLING (2024 – Now), ICASSP (2024 – Now), Interspeech (2024 – Now)

Education

2025 - now, Ph.D. Student, Multimedia Laboratory (MMLab) — The Chinese University of Hong Kong (current)
2022 - 2025, M.Eng. — Beijing University of Posts and Telecommunications
2018 - 2022, B.Eng. — Beijing University of Posts and Telecommunications

Internships

2026, Research Intern — Huawei Hong Kong Research Center (current)
2025, Research Intern — Astribot
2024, Research Intern — HKUST
2021, Research Intern — Tencent
2020, Research Intern — MEGVII

Hobbies

Life outside research keeps me curious: photography, small side projects, and time outdoors.

📷 Photography — an interactive photo map

Capturing everyday moments and the places I travel to.

Gear I've shot with

Camera: Sony α7 V (ILCE-7M5) · Sony FE 24-70mm F2.8 GM II · Tamron 70-300mm F/4.5-6.3 Di III RXD
Action camera: Insta360 X5 · DJI Osmo Action 5 Pro · GoPro HERO11
Gimbal camera: DJI Osmo Pocket 3 · DJI Pocket 2

💻 Coding for daily life — small side projects & tools

Building small, handy tools that solve everyday problems.

WhoGoesConf — find which of a researcher’s Scholar co-authors also have a paper at a target conference (e.g., ECCV 2026).

🏔️ Outdoor sports — from 5000 km rides to 100+ dives

Cycling — 600 km in 6 days
Motorcycling — Beijing to Tibet, 5000 km
Hiking — Annapurna Base Camp (ABC) Trek, Nepal
Scuba diving — 100+ dives across Southeast Asia
Freediving — AIDA 3 diver & spearfishing
Surfing — I can do this all day
Sailing — RYA (Royal Yachting Association) certified