Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

1Hong Kong University of Science and Technology, 2Beijing University of Posts and Telecommunications, 3Nanyang Technological University, 4Zhejiang University

The language-driven spatial audio generation dataset and method for immersive soundscapes.



Before listening, follow these steps for an immersive experience.

🤩🤩🤩Step-1: Put on headphones🎧 and ensure they fit snugly over both ears.

🥳🥳🥳Step-2: Move the slider🕹️ for spatial control.

🥰🥰🥰Step-3: Close your eyes🏖️ and enjoy spatial audio.

Single Context Control

Source direction: Completely Left

A duck is quacking on the left.

A dog is barking on the left.

A man is speaking on the left.

A duck is quacking on the front left.

A dog is barking on the front left.

A man is speaking on the front left.

A duck is quacking in the front.

A dog is barking in the front.

A man is speaking in the front.

A duck is quacking on the front right.

A dog is barking on the front right.

A man is speaking on the front right.

A duck is quacking on the right.

A dog is barking on the right.

A man is speaking on the right.

Multiple Contexts Control

Direction State: Maintained Directions

A woman is speaking on the left, while a piano is playing on the right.

A cat is meowing in front, while a guitar is playing on the left.

A bird is chirping in front, while a dog is barking on the left.

A woman is speaking on the right, while a piano is playing on the left.

A cat is meowing in front, while a guitar is playing on the right.

A bird is chirping in front, while a dog is barking on the right.

Moving Context Control

Moving State: Maintained Directions

An engine sound is moving quickly to the left.

A dog is barking and moving from right to left.

Horse hoofbeats are heard from front to left.

An engine sound is moving quickly to the right.

A dog is barking and moving from left to right.

Horse hoofbeats are heard from front to right.

Generated samples with the five common directions in the description.

Abstract

Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random spatial audio. To provide accurate guidance for Latent Diffusion Models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objectives of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules. Our code, model, and dataset will be released soon.
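
As a rough illustration of the azimuth state matrix idea above (not the paper's exact construction), one can discretize azimuth into bins and mark, per time frame, which bin the source occupies, interpolating between a start and an end azimuth for a moving source. The bin count, frame count, and linear motion model in this sketch are our assumptions:

```python
import numpy as np

def azimuth_state_matrix(start_deg, end_deg, n_frames=64, n_bins=9):
    """Toy azimuth state matrix: one row per time frame, one-hot over
    azimuth bins spanning -90 (hard left) to +90 (hard right) degrees.
    n_frames, n_bins, and linear motion are illustrative assumptions."""
    azimuths = np.linspace(start_deg, end_deg, n_frames)   # linear motion path
    edges = np.linspace(-90.0, 90.0, n_bins + 1)           # bin boundaries
    idx = np.clip(np.digitize(azimuths, edges) - 1, 0, n_bins - 1)
    state = np.zeros((n_frames, n_bins), dtype=np.float32)
    state[np.arange(n_frames), idx] = 1.0                  # one-hot per frame
    return state

# "A dog is barking and moving from right to left."
M = azimuth_state_matrix(90, -90)   # shape (64, 9)
```

A frame-aligned matrix like this gives the diffusion model an explicit per-step direction target, which plain text embeddings alone do not provide.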

Overview of SpatialSonic

Task Objective

📝Text

“An engine sound occurs and moves left.”

“A motorcycle is moving fast from right to left.”

📷Image


🔲Bounding Box


🖱️Click


The goal of spatial audio generation is to produce audio that adheres to the spatial context in the multimodal guidance. For example, the model needs to interpret the image of the motorcycle and generate the sound of its engine in motion.

A Novel Dataset: BEWO-1M

Overview of BEWO-1M

To better facilitate the advancement of multimodal guided spatial audio generation models, we have developed a dual-channel audio dataset named Both Ears Wide Open 1M (BEWO-1M) through rigorous simulations and GPT-assisted caption transformation.

In total, we constructed 2.8k hours of training audio with more than 1M audio-text pairs, and approximately 17 hours of validation data with 6.2k pairs.
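
As a minimal sketch of how one stationary stereo clip could be simulated, here is an example using pyroomacoustics, a common room-acoustics simulator (the room size, absorption, microphone spacing, and source position below are illustrative assumptions, not the authors' settings):

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
mono = np.random.randn(fs * 2)  # placeholder for a mono source clip

# A simple shoebox room with moderate absorption.
room = pra.ShoeBox([6.0, 4.0, 3.0], fs=fs,
                   materials=pra.Material(0.4), max_order=10)

# Two-microphone "ears" roughly 17 cm apart near the room center.
mics = np.array([[2.9, 3.1],    # x coordinates (left ear, right ear)
                 [2.0, 2.0],    # y
                 [1.5, 1.5]])   # z
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

# Source placed to the listener's left.
room.add_source([1.0, 2.5, 1.5], signal=mono)

room.simulate()
stereo = room.mic_array.signals  # shape: (2, n_samples)
```

Sweeping the source position across azimuths and having GPT rephrase the resulting layout into natural language is the kind of pipeline that yields direction-aware audio-caption pairs at this scale.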

Text Guided Generation (2-C Audio)

Corresponding to Tab. 3 in our paper.

SS-set

Single Stationary

A train on the left blows its horn and rings its bells, accompanied by vibrations.

A man speaks as birds chirp and dogs bark, with the objects directly in front.

A man laughs while an infant cries, positioned at the front right of the scene.


SD-set

Single Dynamic

A fire truck siren blasts from the right to the front left slowly.

A cat meows a few times while moving from left to right.

The sound of a motorboat engine running moves from right to left.


DS-set

Double Stationary

The dog barking is on the left side, while a boat engine is running on the right side.

A man speaks far away on the left, while a high-pitched bell rings rapidly on the right.

The ringing bell stands to the left, while a train horn blares directly ahead in the scene.


M-set

Mixed

A motor with drum music and a man speaking is moving from left to right.

Sounds of a bird chirping and a dog barking are on the left, while water is splashing on the right.

A man is speaking on the right, and horse hoofbeats move from left to right.


Visually Guided Generation (2-C Audio)

Corresponding to Tab. 4 and Tab. 5 in our paper.

Image

Bounding Box

Click

Applying a filter to convert mono audio into stereo audio results in decreased quality and channel-wise bias.
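
For reference, the sketch below shows what such a gain-only (constant-power panning) conversion looks like; this is our illustrative baseline, not the exact filter used in the comparison. Because it only scales the two channels, it conveys a level difference but none of the timing or reverberation cues of real binaural audio, consistent with the quality loss and channel-wise bias noted above:

```python
import numpy as np

def pan_mono_to_stereo(mono, azimuth_deg):
    """Constant-power pan: azimuth_deg in [-90 (left), +90 (right)]."""
    theta = (azimuth_deg + 90.0) / 180.0 * (np.pi / 2.0)  # map to [0, pi/2]
    left, right = np.cos(theta) * mono, np.sin(theta) * mono
    return np.stack([left, right])  # shape: (2, n_samples)
```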



Although See-2-Sound is primarily designed for spatial environmental sounds, we also identify some image-to-audio (I2A) capabilities within it and include it in our comparison as one of the few existing works in this area.



Text Guided Generation (1-C Audio)

Corresponding to Tab. 2 in our paper.

Pigeons coo and rustle.

Tick-tocking by a clock.

A person is snoring steadily.


An audience clapping.

A woman is speaking while food is frying and sizzling.

A nearby insect buzzes with nearby vibrations.


Footsteps then a woman speaks followed by a door clanging.

A man and a woman talking then the crinkling of paper.

A medium-pitched, metal bell is ringing.


Visually Guided Generation (1-C Audio)

Corresponding to Tab. G18 in the appendix.


Important Attributes Control

When Distance Meets Direction

Distance State: Near

A dog is barking nearby on the left.

Nearby water is splashing on the right.

A saxophone is playing nearby in front.

A distant dog barks on the left.

Water is splashing from afar on the right.

A saxophone plays softly in the distance ahead.
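
Distance cues like those above can be roughly approximated on top of direction. Below is a minimal sketch, assuming constant-power panning and the standard free-field inverse-distance (1/r) gain; the dataset's actual simulation models distance through room acoustics rather than a bare gain:

```python
import numpy as np

def place_source(mono, azimuth_deg, distance_m, ref_m=1.0):
    """Constant-power pan plus inverse-distance attenuation.
    The 1/r gain (relative to a 1 m reference) is a standard free-field
    assumption, not the dataset's exact distance model."""
    theta = (azimuth_deg + 90.0) / 180.0 * (np.pi / 2.0)
    gain = ref_m / max(distance_m, ref_m)   # -6 dB per doubling of distance
    return gain * np.stack([np.cos(theta) * mono, np.sin(theta) * mono])

fs = 16000
bark = np.random.randn(fs)                  # placeholder dog bark
near = place_source(bark, -90, 1.0)         # "barking nearby on the left"
far  = place_source(bark, -90, 8.0)         # "a distant dog barks on the left"
```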

When Acoustic Environment Meets Direction

Acoustic Environment: Reverberating space

A woman talks on the left outdoors.

A bell rings on the right outdoors.

A baby is crying in the front in a studio.

A woman talks on the left in a large room.

A bell rings on the right in an echoey room.

A baby is crying in the front, the sound echoing.

When Precise Description Meets Direction

Humans are not sensitive to minor changes in direction.

A man is snoring from 60° to the right of straight ahead.

Exciting bass rock music from 30° to the left of straight ahead.

Whistling can be heard from 60° to the left of straight ahead.

When Longer Time Meets Direction

Generating longer (30-sec) audio.

Waves are crashing against the shore on the right.

Peaceful and calming music on the right with piano.

A baby keeps crying loudly on the left side.

BEWO-1M Simulation Examples

The data engine is powered by detailed simulation and GPT induction. It uses the two basic scenarios depicted below to create intricate soundscapes; a sketch of how a dynamic source can be approximated follows the examples.

Stationary

Finger snapping loudly on the right.

A gurgling stream located on the right side of the scene.

Snare drum on the left side.

Dynamic

A high-pitched engine moving from left to front right at a fast speed.

A loudly snoring person shifts from right to left slowly.

A person whistles and moves from right to directly front.
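
As referenced above, a dynamic source can be approximated by re-simulating short chunks of the clip from positions interpolated along a motion path. The chunking granularity, the straight left-to-right path, and the room settings in this sketch are our illustrative assumptions, not the authors' exact engine:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
mono = np.random.randn(fs * 3)  # placeholder source clip

# Split the clip into chunks; simulate each chunk from a position
# interpolated along a left-to-right path across the room.
starts = np.linspace(0, len(mono), 9, dtype=int)
path = np.linspace([1.0, 2.5, 1.5], [5.0, 2.5, 1.5], len(starts) - 1)

chunks = []
for (a, b), pos in zip(zip(starts[:-1], starts[1:]), path):
    room = pra.ShoeBox([6.0, 4.0, 3.0], fs=fs,
                       materials=pra.Material(0.4), max_order=10)
    mics = np.array([[2.9, 3.1], [2.0, 2.0], [1.5, 1.5]])
    room.add_microphone_array(pra.MicrophoneArray(mics, fs))
    room.add_source(list(pos), signal=mono[a:b])
    room.simulate()
    chunks.append(room.mic_array.signals[:, : b - a])  # trim reverb tail

# A short crossfade at chunk boundaries would avoid audible clicks.
stereo = np.concatenate(chunks, axis=1)
```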

BibTeX

@article{sun2024both,
  title={Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation},
  author={Sun, Peiwen and Cheng, Sitong and Li, Xiangtai and Ye, Zhen and Liu, Huadai and Zhang, Honggang and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2410.10676},
  year={2024}
}