The IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an annual flagship conference organized by the IEEE Signal Processing Society and the world’s largest and most comprehensive technical conference focused on signal processing and its applications. It offers a comprehensive technical program presenting the latest developments in research and technology in the field, attracting thousands of professionals. In this blog series, we introduce our research papers presented at ICASSP 2025. Here is a list of them:
#4. Text-aware adapter for few-shot keyword spotting (AI Center - Seoul)
#6. Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning (Samsung R&D Institute United Kingdom)
#8. Globally Normalizing the Transducer for Streaming Speech Recognition (AI Center - Cambridge)
A Text-To-Music (TTM) generation model generates music tracks from text descriptions such as “A rock and roll song played by guitar”. Diffusion models have been successfully applied to the TTM task, producing high-fidelity music (Evans et al., 2024). Typically, a text description is processed by a pre-trained representation model to create text embeddings, which are fed into a diffusion model that generates a latent space vector; this vector is then decoded (through models such as a VAE and HiFi-GAN) into music.
To improve the quality of generated music, recent studies have explored combining multiple text encoders to produce different types of text embedding for conditioning a music diffusion model. However, using multiple text encoders introduces a large number of parameters into the overall TTM model and leads to high inference latency.
In this blog, we present our text-to-music diffusion model, in which the UNet diffusion model is conditioned on both global and local text embeddings through lightweight modules. To further reduce the number of parameters, we propose modifications that extract both global and local representations from the same text encoder (e.g., T5 (Raffel et al., 2019)) through pooling mechanisms. Our model reduces the number of model parameters while maintaining competitive music generation quality.
Figure 1. Overview of our text-to-music diffusion model architecture conditioned over multiple text encoders
Figure 1 shows the overall pipeline of the proposed model. Our model includes a pre-trained autoencoder, a HiFi-GAN vocoder, a text conditioner, and a UNet-based latent diffusion model. The pre-trained autoencoder encodes mel-spectrograms into compressed latent space vectors. The HiFi-GAN vocoder reconstructs waveform signals from mel-spectrograms. The text conditioner converts text prompts into embeddings that condition the UNet-based latent diffusion model to generate the music latent representation. The UNet architecture is composed of encoder and decoder blocks, each of which is constructed from a ResNet layer and a spatial transformer layer.
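For readers who want a concrete picture, below is a minimal sketch of this inference pipeline in PyTorch-style Python. The component interfaces (conditioner, UNet, noise scheduler, autoencoder, vocoder) and the latent shape are illustrative assumptions, not the actual implementation.

```python
# Hedged sketch of the generation pipeline described above. All component
# interfaces here are hypothetical placeholders for illustration.
import torch

def generate_music(prompt, conditioner, unet, scheduler, autoencoder, vocoder,
                   latent_shape=(8, 250), num_steps=200):
    global_emb, local_emb = conditioner(prompt)      # global + local text embeddings
    z = torch.randn(1, *latent_shape)                # start from Gaussian noise
    for t in scheduler.timesteps(num_steps):         # reverse diffusion loop
        pred = unet(z, t, global_emb, local_emb)     # UNet conditioned on both embeddings
        z = scheduler.step(pred, t, z)               # denoising update
    mel = autoencoder.decode(z)                      # latent -> mel-spectrogram
    return vocoder(mel)                              # mel -> waveform via HiFi-GAN
```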
The proposed music generation model uses both global and local text embeddings to condition the diffusion UNet. Given a text prompt y, the global text embedding Gy ∈ R^dG is passed through a Feature-wise Linear Modulation (FiLM) layer (Perez et al., 2017). The output of FiLM is concatenated with the time embedding, and the result is used to bias the intermediate representations of the UNet.
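As a rough illustration of this global-conditioning path, here is a small PyTorch sketch. It follows one plausible reading of the description above (FiLM applied to the global embedding, then fused with the time embedding and added as a per-channel bias); the layer sizes, tensor shapes, and exact fusion point are assumptions, and the wiring in the actual model may differ.

```python
# Illustrative sketch of FiLM-based global conditioning (assumed wiring).
import torch
import torch.nn as nn

class GlobalConditioner(nn.Module):
    def __init__(self, d_global, d_time, d_hidden):
        super().__init__()
        self.film = nn.Linear(d_global, 2 * d_global)    # predicts feature-wise scale and shift
        self.proj = nn.Linear(d_global + d_time, d_hidden)

    def forward(self, g, t_emb, h):
        # g: global text embedding (B, d_global), t_emb: time embedding (B, d_time)
        # h: UNet intermediate features, assumed (B, C, T) for simplicity
        scale, shift = self.film(g).chunk(2, dim=-1)
        g_mod = scale * g + shift                        # FiLM-modulated global embedding
        cond = torch.cat([g_mod, t_emb], dim=-1)         # fuse with the time embedding
        bias = self.proj(cond)                           # per-channel bias
        return h + bias[..., None]                       # bias the UNet representation

# Example with random tensors:
cond = GlobalConditioner(d_global=512, d_time=256, d_hidden=128)
out = cond(torch.randn(2, 512), torch.randn(2, 256), torch.randn(2, 128, 250))
```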
A local text encoder processes y to output the local text embedding Fy ∈ R^(M×dF). The local embedding is injected into the intermediate layers of the UNet via a cross-attention mechanism:
Attention(Q(i), K(i), V(i)) = softmax(Q(i) K(i)ᵀ / √d) · V(i), with Q(i) = WQ(i) ψi(zt), K(i) = WK(i) Fy, V(i) = WV(i) Fy,
where ψi(zt) denotes the hidden representation of the i-th layer of the UNet, zt is the noisy latent representation at timestep t corresponding to a pre-defined noise scheduler, and WQ(i), WK(i), and WV(i) are learnable projection matrices.
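The cross-attention injection can be sketched as follows; the dimensions and single-head formulation are chosen for clarity, whereas the actual model uses spatial transformer blocks inside the UNet.

```python
# Minimal single-head cross-attention between UNet hidden states and
# local text embeddings, following the equation above.
import math
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    def __init__(self, d_hidden, d_text, d_attn):
        super().__init__()
        self.w_q = nn.Linear(d_hidden, d_attn, bias=False)  # W_Q^(i)
        self.w_k = nn.Linear(d_text, d_attn, bias=False)    # W_K^(i)
        self.w_v = nn.Linear(d_text, d_attn, bias=False)    # W_V^(i)
        self.out = nn.Linear(d_attn, d_hidden, bias=False)

    def forward(self, h, f_y):
        # h:   UNet hidden states psi_i(z_t), shape (B, N, d_hidden)
        # f_y: local text embeddings F_y,     shape (B, M, d_text)
        q, k, v = self.w_q(h), self.w_k(f_y), self.w_v(f_y)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        return h + self.out(attn @ v)                        # residual injection

# Example: 250 latent positions attending over M=50 text token embeddings
layer = TextCrossAttention(d_hidden=128, d_text=768, d_attn=64)
out = layer(torch.randn(2, 250, 128), torch.randn(2, 50, 768))
```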
To further reduce the parameter overhead of using multiple text encoders, we propose to extract both the global and local text embeddings from the same text encoder. The global text embedding can be obtained by applying a pooling method to the local embedding. We consider two pooling strategies, sketched in code after the list below:
Mean pooling: the global embedding is computed as the average of the M local embeddings.
Self-attention pooling (SAP): the global embedding is computed as a weighted sum of the local embeddings, with weights produced by a learned attention module.
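A minimal sketch of the two pooling strategies is given below. Mean pooling is parameter-free; the SAP parameterisation shown (a single learned scoring layer) is an assumption about one common way to implement it.

```python
# Deriving a global embedding G_y from the M local embeddings F_y of shape (B, M, d).
import torch
import torch.nn as nn

def mean_pooling(f_y):
    # G_y = (1/M) * sum over the M local embeddings
    return f_y.mean(dim=1)

class SelfAttentionPooling(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)                 # learned attention score per token

    def forward(self, f_y):
        w = torch.softmax(self.score(f_y), dim=1)    # weights over the M tokens
        return (w * f_y).sum(dim=1)                  # weighted sum -> global embedding

f_y = torch.randn(2, 50, 768)                        # e.g. M=50 T5 token embeddings
g_mean = mean_pooling(f_y)                           # (2, 768)
g_sap = SelfAttentionPooling(768)(f_y)               # (2, 768)
```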
Our proposed music generation models are trained on publicly available datasets, including MTG, the Free Music Archive, and 10k high-quality commercial recordings from Pond5. The MusicBench dataset is used for validation and the MusicCaps dataset for evaluation. The models are trained to generate 10-second audio tracks sampled at 16 kHz. The diffusion model is trained to optimise a v-objective function. We evaluate models using the Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence metrics.
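For reference, the v-objective trains the UNet to predict v = α_t·ε − σ_t·x0 rather than the noise ε directly. The sketch below uses a simple cosine schedule and a dummy model purely for illustration; it is not the training code of the actual system.

```python
# Hedged sketch of a v-objective training loss (cosine schedule assumed).
import torch
import torch.nn.functional as F

def v_objective_loss(model, x0, cond, t):
    # x0: clean latent (B, C, T), t: diffusion time in [0, 1], shape (B,)
    alpha = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1)   # signal coefficient
    sigma = torch.sin(0.5 * torch.pi * t).view(-1, 1, 1)   # noise coefficient
    eps = torch.randn_like(x0)                             # Gaussian noise
    z_t = alpha * x0 + sigma * eps                         # noisy latent at time t
    v_target = alpha * eps - sigma * x0                    # v-prediction target
    v_pred = model(z_t, t, cond)                           # UNet predicts v
    return F.mse_loss(v_pred, v_target)

# Toy usage with a dummy model that returns zeros:
loss = v_objective_loss(lambda z, t, c: torch.zeros_like(z),
                        torch.randn(2, 8, 250), None, torch.rand(2))
```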
We compare our method against the publicly available checkpoints of AudioLDM (Liu et al., 2023), AudioLDM2 (Liu et al., 2024), MusicLDM (Chen et al., 2024), and Stable Audio Open (Evans et al., 2024).
Table 1. Objective evaluation of the proposed methods and existing models
Table 1 shows the overall performance of existing models from the literature and the proposed model. The proposed models (last two rows) achieve better FAD and KL scores compared to models conditioned on either the CLAP global embeddings (Wu et al., 2023) or the T5 local embeddings. This indicates that the proposed model can effectively combine and exploit both global and local text embeddings extracted from the two different types of text encoder.
Table 2. Objective evaluation of different strategies for providing the global text embedding
Table 2 presents the results of using different strategies to provide the global text embedding. Averaging the local T5 embeddings to create the global text embedding improved performance compared to using the single T5 model alone. The mean pooling method achieves performance competitive with the model that uses both CLAP and T5 for conditioning, while having far fewer parameters than the dual text encoder model.
Table 3. Performance of different language models to provide local text embedding
We further verify the benefits of the proposed pooling method by evaluating two language models with different capacities, namely T5-base and FLAN-T5-large (Chung et al., 2022). Table 3 shows the results. The proposed mean pooling method with FLAN-T5-large achieves better FAD and KL scores than the mean pooling method using the T5-base model. This indicates that the pooling-based method benefits from a language model that provides a stronger global text representation, and that this benefit transfers to improved music generation performance.
In this blog, we presented a simple and efficient text-to-music model that conditions a diffusion UNet on global and local text representations at different levels. The proposed model effectively combines and exploits the two types of text representation to improve the quality of the generated music. We also explored different pre-trained language models and pooling methods for obtaining the global text representation. The mean pooling method achieves music generation results competitive with a model conditioned on multiple text encoders, which highlights the parameter efficiency of the mean pooling technique.
https://ieeexplore.ieee.org/document/10889289
[1] Chen, K. et al. “MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies.” ICASSP 2024
[2] Chung, Hyung Won et al. “Scaling Instruction-Finetuned Language Models.” ArXiv abs/2210.11416 (2022)
[3] Evans, Zach et al. “Stable Audio Open.” ArXiv abs/2407.14358 (2024)
[4] Liu, Haohe et al. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.” International Conference on Machine Learning (2023).
[5] Liu, Haohe et al. “AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 32 (2024).
[6] Perez, Ethan et al. “FiLM: Visual Reasoning with a General Conditioning Layer.” AAAI Conference on Artificial Intelligence (2017).
[7] Raffel, Colin et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” J. Mach. Learn. Res. 21 (2019)
[8] Wu, Yusong et al. “Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation.” ICASSP 2023