Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

1Hong Kong University of Science and Technology, 2Beijing University of Posts and Telecommunications, 3Nanyang Technological University, 4Zhejiang University

The language-driven spatial audio generation dataset and method for immersive soundscapes.



Before listening, follow these steps for an immersive experience.

🤩🤩🤩Step-1: Put on headphones🎧 and ensure they fit snugly over both ears.

🥳🥳🥳Step-2: Move the slider🕹️ for spatial control.

🥰🥰🥰Step-3: Close your eyes🏖️ and enjoy spatial audio.

Single Context Control

Source direction: Completely Left

A duck is quacking on the left.

A dog is barking on the left.

A man is speaking on the left.

A duck is quacking on the front left.

A dog is barking on the front left.

A man is speaking on the front left.

A duck is quacking in the front.

A dog is barking in the front.

A man is speaking in the front.

A duck is quacking on the front right.

A dog is barking on the front right.

A man is speaking on the front right.

A duck is quacking on the right.

A dog is barking on the right.

A man is speaking on the right.

Multiple Contexts Control

Direction State: Maintained Directions

A woman is speaking on the left, while a piano is playing on the right.

A cat is meowing in front, while a guitar is playing on the left.

A bird is chirping in front, while a dog is barking on the left.

A woman is speaking on the right, while a piano is playing on the left.

A cat is meowing in front, while a guitar is playing on the right.

A bird is chirping in front, while a dog is barking on the right.

Moving Context Control

Moving State: Maintained Directions

An engine sound is moving quickly to the left.

A dog is barking and moving from right to left.

Horse hoofbeats are heard from front to left.

An engine sound is moving quickly to the right.

A dog is barking and moving from left to right.

Horse hoofbeats are heard from front to right.

Generated samples with the five common directions in the description.

Abstract

Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random spatial audio. To provide accurate guidance for Latent Diffusion Models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objectives of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules. Our code, model, and dataset will be released soon.
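
As a rough illustration of the azimuth state matrix idea above (not the paper's exact construction), one can discretize azimuth into bins and mark, per time frame, which bin the source occupies, interpolating between a start and an end azimuth for a moving source. The bin count, frame count, and linear motion model in this sketch are our assumptions:

```python
import numpy as np

def azimuth_state_matrix(start_deg, end_deg, n_frames=64, n_bins=9):
    """Toy azimuth state matrix: one row per time frame, one-hot over
    azimuth bins spanning -90 (hard left) to +90 (hard right) degrees.
    n_frames, n_bins, and linear motion are illustrative assumptions."""
    azimuths = np.linspace(start_deg, end_deg, n_frames)   # linear motion path
    edges = np.linspace(-90.0, 90.0, n_bins + 1)           # bin boundaries
    idx = np.clip(np.digitize(azimuths, edges) - 1, 0, n_bins - 1)
    state = np.zeros((n_frames, n_bins), dtype=np.float32)
    state[np.arange(n_frames), idx] = 1.0                  # one-hot per frame
    return state

# "A dog is barking and moving from right to left."
M = azimuth_state_matrix(90, -90)   # shape (64, 9)
```

A frame-aligned matrix like this gives the diffusion model an explicit per-step direction target, which plain text embeddings alone do not provide.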

Overview of SpatialSonic

Task Objective

📝Text

“An engine sound occurs and moves left.”

“A motorcycle is moving fast from right to left.”

📷Image


🔲Bounding Box


🖱️Click


The goal of spatial audio generation is to produce audio that adheres to the spatial context in the multimodal guidance. For example, the model needs to interpret the image of the motorcycle and generate the sound of its engine in motion.

A Novel Dataset: BEWO-1M

Overview of BEWO-1M

To better facilitate the advancement of multimodal guided spatial audio generation models, we have developed a dual-channel audio dataset named Both Ears Wide Open 1M (BEWO-1M) through rigorous simulations and GPT-assisted caption transformation.

In total, we constructed 2.8k hours of training audio with more than 1M audio-text pairs, and approximately 17 hours of validation data with 6.2k pairs.
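
As a minimal sketch of how one stationary stereo clip could be simulated, here is an example using pyroomacoustics, a common room-acoustics simulator (the room size, absorption, microphone spacing, and source position below are illustrative assumptions, not the authors' settings):

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
mono = np.random.randn(fs * 2)  # placeholder for a mono source clip

# A simple shoebox room with moderate absorption.
room = pra.ShoeBox([6.0, 4.0, 3.0], fs=fs,
                   materials=pra.Material(0.4), max_order=10)

# Two-microphone "ears" roughly 17 cm apart near the room center.
mics = np.array([[2.9, 3.1],    # x coordinates (left ear, right ear)
                 [2.0, 2.0],    # y
                 [1.5, 1.5]])   # z
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

# Source placed to the listener's left.
room.add_source([1.0, 2.5, 1.5], signal=mono)

room.simulate()
stereo = room.mic_array.signals  # shape: (2, n_samples)
```

Sweeping the source position across azimuths and having GPT rephrase the resulting layout into natural language is the kind of pipeline that yields direction-aware audio-caption pairs at this scale.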

Text Guided Generation (2-C Audio)

Corresponding to Tab. 3 in our paper.

SS-set

Single Stationary

A train on the left blows its horn and rings its bells, accompanied by vibrations.

A man speaks as birds chirp and dogs bark, with the objects directly in front.

A man laughs while an infant cries, positioned at the front right of the scene.


SD-set

Single Dynamic

A fire truck siren blasts from the right to the front left slowly.

A cat meows a few times while moving from left to right.

The sound of a motorboat engine running moves from right to left.


DS-set

Double Stationary

The dog barking is on the left side, while a boat engine is running on the right side.

A man speaks far away on the left, while a high-pitched bell rings rapidly on the right.

The ringing bell stands to the left, while a train horn blares directly ahead in the scene.


M-set

Mixed

A motor with drum music and a man speaking is moving from left to right.

Sounds of a bird chirping and a dog barking are on the left, while water is splashing on the right.

A man is speaking on the right, and horse hoofbeats move from left to right.


Visually Guided Generation (2-C Audio)

Corresponding to Tab. 4 and Tab. 5 in our paper.

Image

Bounding Box

Click

Applying a filter to convert mono audio into stereo audio results in decreased quality and channel-wise bias.
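
For reference, the sketch below shows what such a gain-only (constant-power panning) conversion looks like; this is our illustrative baseline, not the exact filter used in the comparison. Because it only scales the two channels, it conveys a level difference but none of the timing or reverberation cues of real binaural audio, consistent with the quality loss and channel-wise bias noted above:

```python
import numpy as np

def pan_mono_to_stereo(mono, azimuth_deg):
    """Constant-power pan: azimuth_deg in [-90 (left), +90 (right)]."""
    theta = (azimuth_deg + 90.0) / 180.0 * (np.pi / 2.0)  # map to [0, pi/2]
    left, right = np.cos(theta) * mono, np.sin(theta) * mono
    return np.stack([left, right])  # shape: (2, n_samples)
```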



Although See-2-Sound is primarily designed for spatial environmental sounds, we also identify some image-to-audio (I2A) capabilities within it and include it in our comparison as one of the few existing works in this area.



Text Guided Generation (1-C Audio)

Corresponding to Tab. 2 in our paper.

Pigeons coo and rustle.

Tick-tocking by a clock.

A person is snoring steadily.


An audience clapping.

A woman is speaking while food is frying and sizzling.

A nearby insect buzzes with nearby vibrations.


Footsteps then a woman speaks followed by a door clanging.

A man and a woman talking then the crinkling of paper.

A medium-pitched, metal bell is ringing.


Visually Guided Generation (1-C Audio)

Corresponding to Tab. G18 in the appendix.


Important Attributes Control

When Distance Meets Direction

Distance State: Near

A dog is barking nearby on the left.

Nearby water is splashing on the right.

A saxophone is playing nearby in front.

A distant dog barks on the left.

Water is splashing from afar on the right.

A saxophone plays softly in the distance ahead.
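
Distance cues like those above can be roughly approximated on top of direction. Below is a minimal sketch, assuming constant-power panning and the standard free-field inverse-distance (1/r) gain; the dataset's actual simulation models distance through room acoustics rather than a bare gain:

```python
import numpy as np

def place_source(mono, azimuth_deg, distance_m, ref_m=1.0):
    """Constant-power pan plus inverse-distance attenuation.
    The 1/r gain (relative to a 1 m reference) is a standard free-field
    assumption, not the dataset's exact distance model."""
    theta = (azimuth_deg + 90.0) / 180.0 * (np.pi / 2.0)
    gain = ref_m / max(distance_m, ref_m)   # -6 dB per doubling of distance
    return gain * np.stack([np.cos(theta) * mono, np.sin(theta) * mono])

fs = 16000
bark = np.random.randn(fs)                  # placeholder dog bark
near = place_source(bark, -90, 1.0)         # "barking nearby on the left"
far  = place_source(bark, -90, 8.0)         # "a distant dog barks on the left"
```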

When Acoustic Environment Meets Direction

Acoustic Environment: Reverberating space

A woman talks on the left outdoors.

A bell rings on the right outdoors.

A baby is crying in the front in a studio.

A woman talks on the left in a large room.

A bell rings on the right in an echoey room.

A baby is crying in the front, the sound echoing.

When Precise Description Meets Direction

Humans are not sensitive to minor changes in direction.

A man is snoring from 60° to the right of straight ahead.

Exciting bass rock music from 30° to the left of straight ahead.

Whistling can be heard from 60° to the left of straight ahead.

When Longer Time Meets Direction

Generating longer (30-sec) audio.

Waves are crashing against the shore on the right.

Peaceful and calming music on the right with piano.

A baby keeps crying loudly on the left side.

BEWO-1M Simulation Examples

The data engine is powered by detailed simulation and GPT induction. It uses the two basic scenarios depicted below to create intricate soundscapes; a sketch of how a dynamic source can be approximated follows the examples.

Stationary

Finger snapping loudly on the right.

A gurgling stream located on the right side of the scene.

Snare drum on the left side.

Dynamic

A high-pitched engine moving from left to front right at a fast speed.

A loudly snoring person shifts from right to left slowly.

A person whistles and moves from right to directly front.
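
As referenced above, a dynamic source can be approximated by re-simulating short chunks of the clip from positions interpolated along a motion path. The chunking granularity, the straight left-to-right path, and the room settings in this sketch are our illustrative assumptions, not the authors' exact engine:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
mono = np.random.randn(fs * 3)  # placeholder source clip

# Split the clip into chunks; simulate each chunk from a position
# interpolated along a left-to-right path across the room.
starts = np.linspace(0, len(mono), 9, dtype=int)
path = np.linspace([1.0, 2.5, 1.5], [5.0, 2.5, 1.5], len(starts) - 1)

chunks = []
for (a, b), pos in zip(zip(starts[:-1], starts[1:]), path):
    room = pra.ShoeBox([6.0, 4.0, 3.0], fs=fs,
                       materials=pra.Material(0.4), max_order=10)
    mics = np.array([[2.9, 3.1], [2.0, 2.0], [1.5, 1.5]])
    room.add_microphone_array(pra.MicrophoneArray(mics, fs))
    room.add_source(list(pos), signal=mono[a:b])
    room.simulate()
    chunks.append(room.mic_array.signals[:, : b - a])  # trim reverb tail

# A short crossfade at chunk boundaries would avoid audible clicks.
stereo = np.concatenate(chunks, axis=1)
```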

BibTeX

@article{sun2024both,
  title={Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation},
  author={Sun, Peiwen and Cheng, Sitong and Li, Xiangtai and Ye, Zhen and Liu, Huadai and Zhang, Honggang and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2410.10676},
  year={2024}
}