Redefining Music AI: The Power of Sony’s SoniDo as a Versatile Foundation Model

A foundation model is a model pre-trained on extensive datasets and designed to adapt to a broad range of downstream tasks. These models have garnered widespread attention and are increasingly integrated into everyday applications. However, the field of music production still lacks a powerful foundation model capable of addressing its diverse downstream tasks.

In a new paper Music Foundation Model as Generic Booster for Music Downstream Tasks, a Sony research team presents SoniDo, a groundbreaking music foundation model (MFM). SoniDo is designed to extract hierarchical features from target music samples, offering a robust framework for improving the effectiveness and accessibility of music processing.

SoniDo employs a generative architecture based on a multi-level transformer coupled with a hierarchical encoder. After careful preprocessing and data augmentation, its intermediate representations are used as features for task-specific models across a variety of music-related tasks.
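To make that pipeline concrete, here is a minimal preprocessing and augmentation sketch in PyTorch. The sampling rate, segment length, and the specific augmentations (random gain and circular time shift) are illustrative assumptions, not details drawn from the paper.

```python
# Illustrative preprocessing/augmentation sketch. The sampling rate, segment
# length, and augmentations are assumptions, not the authors' implementation.
import torch

SAMPLE_RATE = 44_100       # assumed model sampling rate
SEGMENT_SECONDS = 4.0      # assumed segment length fed to the encoder

def chunk_waveform(wav: torch.Tensor, seconds: float = SEGMENT_SECONDS) -> torch.Tensor:
    """Split a mono waveform of shape (T,) into fixed-length segments (N, L)."""
    seg_len = int(seconds * SAMPLE_RATE)
    n_full = wav.shape[-1] // seg_len
    return wav[: n_full * seg_len].reshape(n_full, seg_len)

def augment(segment: torch.Tensor) -> torch.Tensor:
    """Lightweight augmentation: random gain (+/- 6 dB) and circular time shift."""
    gain = 10 ** (torch.empty(1).uniform_(-6.0, 6.0) / 20.0)
    shift = int(torch.randint(0, segment.shape[-1], (1,)))
    return torch.roll(segment * gain, shifts=shift, dims=-1)

# Example: 30 seconds of synthetic audio -> a batch of augmented segments
waveform = torch.randn(30 * SAMPLE_RATE)
segments = torch.stack([augment(s) for s in chunk_waveform(waveform)])
print(segments.shape)  # torch.Size([7, 176400])
```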

The model’s encoder design draws inspiration from Jukebox, but it distinguishes itself by incorporating a hierarchical structure. Using a framework called the hierarchically quantized VAE (HQ-VAE), SoniDo enforces a fine-to-coarse conditioning mechanism within its representations. A multi-level, transformer-based autoregressive model is then employed to model the probability distribution of the HQ-VAE embeddings. To extract features, input audio is encoded into tokens, passed through the transformer, and the intermediate outputs of selected layers are taken as features.
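The following sketch illustrates this extraction idea under assumed sizes: a token sequence is passed through a stack of transformer layers, and activations from a few intermediate layers are collected as features. The vocabulary size, layer count, and tapped layer indices are hypothetical, and the frozen HQ-VAE tokenizer is replaced here by random tokens.

```python
# Minimal sketch of the feature-extraction path: token sequences are run
# through a transformer and activations from chosen intermediate layers are
# kept as features. Sizes, layer indices, and the fake tokenizer are assumptions.
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LAYERS = 2048, 512, 8   # assumed token vocabulary / model size
FEATURE_LAYERS = {3, 6}                   # assumed layers to tap for features

class TokenTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
            for _ in range(N_LAYERS)
        )

    def forward(self, tokens):
        x = self.embed(tokens)
        feats = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in FEATURE_LAYERS:
                feats.append(x)           # intermediate activations as features
        return x, feats

# Stand-in for the frozen hierarchical encoder: fake a token sequence directly.
tokens = torch.randint(0, VOCAB, (1, 256))          # (batch, sequence length)
model = TokenTransformer().eval()
with torch.no_grad():
    _, features = model(tokens)
print([f.shape for f in features])                  # two (1, 256, 512) tensors
```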

By leveraging hierarchical intermediate features, SoniDo effectively controls information granularity, enabling superior performance in a wide range of downstream tasks. These include both understanding tasks, such as music tagging and transcription, and generative tasks, such as source separation and mixing.
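As a rough illustration of how such features could feed a downstream model, the sketch below shows a small music-tagging head that pools features from two hierarchy levels over time. The head architecture and number of tags are assumptions for illustration only, not the paper's task-specific models.

```python
# Hedged sketch of a task-specific model consuming the extracted features,
# using music tagging as the example. Head sizes and tag count are assumptions.
import torch
import torch.nn as nn

class TaggingHead(nn.Module):
    """Pools foundation-model features over time and predicts multi-label tags."""
    def __init__(self, feat_dim: int = 512, n_levels: int = 2, n_tags: int = 50):
        super().__init__()
        self.proj = nn.Linear(feat_dim * n_levels, 256)
        self.out = nn.Linear(256, n_tags)

    def forward(self, features):                     # list of (batch, time, feat_dim)
        pooled = [f.mean(dim=1) for f in features]   # temporal average pooling
        x = torch.relu(self.proj(torch.cat(pooled, dim=-1)))
        return torch.sigmoid(self.out(x))            # per-tag probabilities

# Example with dummy features shaped like the previous sketch's output
features = [torch.randn(1, 256, 512) for _ in range(2)]
print(TaggingHead()(features).shape)                 # torch.Size([1, 50])
```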

Experimental evaluations demonstrate that SoniDo’s extracted features significantly enhance the training of downstream models, achieving state-of-the-art performance across multiple tasks. These findings underscore the potential of music foundation models like SoniDo to act as powerful boosters for downstream applications.

Beyond improving existing task-specific models, SoniDo also addresses challenges in scenarios with limited data, providing a transformative solution for music processing. This innovation paves the way for more efficient and accessible tools in the domain of music production.

The paper Music Foundation Model as Generic Booster for Music Downstream Tasks is on .


Author: Hecate He | Editor: Chain Zhang


