📝Text
“An engine sound occurs and moves left.”
“A motorcycle is moving fast from right to left.”
A duck is quacking on the left.
A dog is barking on the left.
A man is speaking on the left.
A duck is quacking on the front left.
A dog is barking on the front left.
A man is speaking on the front left.
A duck is quacking in the front.
A dog is barking in the front.
A man is speaking in the front.
A duck is quacking on the front right.
A dog is barking on the front right.
A man is speaking on the front right.
A duck is quacking on the right.
A dog is barking on the right.
A man is speaking on the right.
A woman is speaking on the left, while a piano is playing on the right.
A cat is meowing in front, while a guitar is playing on the left.
A bird is chirping in front, while a dog is barking on the left.
A woman is speaking on the right, while a piano is playing on the left.
A cat is meowing in front, while a guitar is playing on the right.
A bird is chirping in front, while a dog is barking on the right.
An engine sound is moving quickly to the left.
A dog is barking and moving from right to left.
Horse hoofbeats are heard from front to left.
An engine sound is moving quickly to the right.
A dog is barking and moving from left to right.
Horse hoofbeats are heard from front to right.
Recently, diffusion models have achieved great success in mono-channel audio generation. Stereo audio generation is harder: soundscapes often contain complex scenes with multiple objects and directions, and controlling stereo audio with spatial context remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work is the first attempt to address these issues. We first construct a large-scale, simulation-based, GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions, including moving and multiple sources. Beyond the text modality, we also acquire a set of images and reasonably paired stereo audio clips through retrieval to advance multimodal generation. Existing audio generation models tend to generate spatial audio with rather random spatial placement. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model, which uses spatial-aware encoders and azimuth state matrices to provide reasonable spatial guidance. By leveraging this spatial guidance, our unified model generates immersive and controllable spatial audio from text and images, and also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method and highlight its capability to generate spatial audio that adheres to physical rules. Our code, model, and dataset will be released soon.
The goal of spatial audio generation is to produce audio that adheres to the spatial context given in multimodal guidance. For example, given an image of a moving motorcycle, the model needs to recognize the motorcycle and generate the sound of its engine in motion.
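To make the "engine in motion" example concrete, the following is a minimal sketch of how a mono signal can be turned into a stereo signal whose apparent direction changes over time. It uses plain constant-power panning, which is a deliberate simplification and not the simulation pipeline or the SpatialSonic model described on this page. The function names, the azimuth convention (-90° hard left, 0° front, +90° hard right), and the toy 100 Hz "engine" tone are all assumptions made for illustration.

```python
import math


def pan_gains(azimuth_deg: float) -> tuple[float, float]:
    """Constant-power (left, right) gains for an azimuth in [-90, 90] degrees.

    Assumed convention: -90 is hard left, 0 is front, +90 is hard right.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must be within [-90, 90] degrees")
    # Map the azimuth to an angle in [0, pi/2] so that left^2 + right^2 == 1.
    theta = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(theta), math.sin(theta)


def render_moving_source(mono, start_deg, end_deg):
    """Pan a mono signal linearly from start_deg to end_deg.

    Returns two lists (left, right) with the same length as the input.
    """
    n = len(mono)
    left, right = [], []
    for i, sample in enumerate(mono):
        t = i / (n - 1) if n > 1 else 0.0  # progress through the clip, 0..1
        g_l, g_r = pan_gains(start_deg + t * (end_deg - start_deg))
        left.append(g_l * sample)
        right.append(g_r * sample)
    return left, right


# Toy "engine": one second of a 100 Hz tone at 16 kHz, moving right to left.
sr = 16000
mono = [math.sin(2 * math.pi * 100 * i / sr) for i in range(sr)]
left, right = render_moving_source(mono, start_deg=90.0, end_deg=-90.0)
```

In this sketch the right channel dominates at the start of the clip and the left channel dominates at the end, which is the level cue a listener would associate with "moving from right to left". Real spatial audio also involves interaural time differences, distance, and room acoustics, none of which are modeled here.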
To better facilitate the advancement of multimodal guided spatial audio generation models, we have developed a dual-channel audio dataset named Both Ears Wide Open 1M (BEWO-1M) through rigorous simulations and GPT-assisted caption transformation.
In total, we constructed 2.8k hours of training audio with more than 1M audio-text pairs, and approximately 17 hours of validation data with 6.2k pairs.
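As a quick sanity check on these figures, the short calculation below derives the implied average clip length per audio-text pair. It uses only the rounded numbers stated above (2.8k hours and 1M pairs for training, 17 hours and 6.2k pairs for validation), so the results are rough estimates; because the training set has "more than" 1M pairs, the training figure is an upper bound. The helper name `mean_clip_seconds` is introduced here for illustration only.

```python
def mean_clip_seconds(total_hours: float, num_pairs: int) -> float:
    """Average audio duration per pair, in seconds."""
    return total_hours * 3600.0 / num_pairs


# Rounded figures taken from the dataset description above.
train_avg = mean_clip_seconds(2800, 1_000_000)
val_avg = mean_clip_seconds(17, 6200)

print(f"training:   at most about {train_avg:.1f} s per pair")  # 10.1
print(f"validation: about {val_avg:.1f} s per pair")            # 9.9
```

Both splits therefore average roughly ten seconds of audio per caption.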
Single Stationary
A train on the left blows its horn and rings its bells, accompanied by vibrations.
A man speaks as birds chirp and dogs bark, with the objects directly in front.
A man laughs while an infant cries, positioned at the front right of the scene.
Single Dynamic
A fire truck siren blasts from the right to the front left slowly.
A cat meows a few times while moving from left to right.
The sound of a motorboat engine running moves from right to left.
Double Stationary
The dog barking is on the left side, while a boat engine is running on the right side.
A man speaks far away on the left, while a high-pitched bell rings rapidly on the right.
The ringing bell stands to the left, while a train horn blares directly ahead in the scene.
Mixed
A motor with drum music and a man speaking is moving from left to right.
The sounds of a bird chirping and a dog barking are on the left, while water is splashing on the right.
A man is speaking on the right, and horse hoofbeats are from left to right.
Pigeons coo and rustle.
Tick-tocking by a clock.
A person is snoring steadily.
An audience clapping.
A woman is speaking while food is frying and sizzling.
A nearby insect buzzes with nearby vibrations.
Footsteps then a woman speaks followed by a door clanging.
A man and a woman talking then the crinkling of paper.
A medium-pitched, metal bell is ringing.
A dog is barking nearby on the left.
Water is splashing nearby on the right.
A saxophone is playing in the near front.
A distant dog barks on the left.
Water is splashing from afar on the right.
A saxophone plays softly in the distance ahead.
A woman talks on the left outdoors.
A bell rings on the right outdoors.
A baby is crying in front of the studio.
A woman talks on the left in a large room.
A bell rings on the right in an echoey room.
A baby is crying in the front, the sound echoing.
A man is snoring from 60° right ahead.
Exciting bass rock music from 30° left ahead.
Whistling can be heard from 60° to the left of straight ahead.
Waves are crashing against the shore on the right.
Peaceful and calming music on the right with piano.
A baby keeps crying loudly on the left side.
Finger snapping loudly on the right.
A gurgling stream located on the right side of the scene.
Snare drum on the left side.
A high pitched engine moving from left to front right at a fast speed.
Loud snoring person shifts from right to left slowly.
A person whistles and moves from right to directly front.
@article{sun2024both,
title={Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation},
author={Sun, Peiwen and Cheng, Sitong and Li, Xiangtai and Ye, Zhen and Liu, Huadai and Zhang, Honggang and Xue, Wei and Guo, Yike},
journal={arXiv preprint arXiv:2410.10676},
year={2024}
}