Audio2Vec is an audio embedding model that maps sentiment into a 256-dimensional space using spectrograms extracted from raw waveforms. Built entirely on the plane to SF! It's trained on the RAVDESS dataset, which pairs audio waveforms with sentiment labels.

At the highest level, the Audio2Vec pipeline is shown below. I tried experimenting with Claude 3.5 Sonnet’s Artifacts to draw it, and here’s what it produced! I think it outlines the ML pipeline to a reasonable degree.

*Figure: the Audio2Vec pipeline, from raw waveform to spectrogram to 256-dimensional embedding.*

The first step is extracting spectrograms from the raw waveforms using a library called librosa. This transforms our messy waveform into a representation of signal strength, i.e. loudness, across the various frequencies in the waveform over time. Spectrograms, however, are still messy and high-dimensional. So we build a model that represents them as vectors and uses our sentiment labels to self-validate, adjusting the embedding parameters during training.
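As a rough sketch of that step (the sample rate and mel-band count below are common librosa defaults I'm assuming, not necessarily what the notebook uses), the conversion might look like this:

```python
import librosa
import numpy as np

def waveform_to_log_mel(path, sr=22050, n_mels=128):
    """Load a raw waveform and convert it to a log-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)                      # raw waveform
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)         # compress to dB scale
    return log_mel                                         # shape: (n_mels, n_frames)
```

The log scaling matters: raw power spectrograms span many orders of magnitude, and the dB compression makes the "texture" of the signal much easier for a convolutional model to pick up.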

The first round of feature extraction uses a 2D convolutional layer to capture low-level features like textures. We then reduce the spatial dimensions (via pooling) and feed the result into a second, more complex 2D convolutional layer, then reduce again. Finally, we flatten these feature maps into a 1D vector and project it into our target 256 dimensions, where a classification head over the sentiment labels nudges the embeddings into clusters.
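Here's a minimal sketch of what that architecture could look like, assuming PyTorch, a 128×128 log-mel input, and RAVDESS's eight emotion classes; the layer sizes are illustrative, not the notebook's exact values:

```python
import torch
import torch.nn as nn

class Audio2Vec(nn.Module):
    """Two conv blocks -> flatten -> 256-d embedding -> sentiment classifier."""
    def __init__(self, n_mels=128, n_frames=128, embed_dim=256, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),    # low-level textures
            nn.ReLU(),
            nn.MaxPool2d(2),                               # reduce spatial dims
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # higher-level features
            nn.ReLU(),
            nn.MaxPool2d(2),                               # reduce again
        )
        flat = 32 * (n_mels // 4) * (n_frames // 4)
        self.embed = nn.Linear(flat, embed_dim)            # 256-d embedding
        self.classify = nn.Linear(embed_dim, n_classes)    # sentiment head

    def forward(self, x):                                  # x: (B, 1, n_mels, n_frames)
        z = self.features(x).flatten(1)
        emb = self.embed(z)
        return emb, self.classify(emb)
```

Training with a cross-entropy loss on the classifier output is what "self-validates" the embeddings: the 256-d vector before the head is the embedding we keep, and the label signal is what pulls same-sentiment clips together in that space.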

When reducing this 256-dimensional space into 2D and 3D, we get the plots below (a rough sketch of the projection code follows them). Since we have labels, we can colour each point by sentiment, and those colours start to form clusters. The higher the projection dimension, the more obvious they become.

*Figure: 2D projection of the embedding space, coloured by sentiment label.*

*Figure: 3D projection of the embedding space, coloured by sentiment label.*
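The projections above could be produced with something like the sketch below; I'm assuming PCA via scikit-learn and integer-encoded sentiment labels, though the notebook may well use t-SNE or another reduction method instead:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: registers the 3D projection

def plot_embeddings(embeddings, labels, n_components=2):
    """Project 256-d embeddings down to 2D/3D and colour points by label."""
    reduced = PCA(n_components=n_components).fit_transform(embeddings)
    fig = plt.figure()
    if n_components == 3:
        ax = fig.add_subplot(projection="3d")
        ax.scatter(reduced[:, 0], reduced[:, 1], reduced[:, 2],
                   c=labels, cmap="tab10", s=10)            # labels: int-encoded
    else:
        ax = fig.add_subplot()
        ax.scatter(reduced[:, 0], reduced[:, 1],
                   c=labels, cmap="tab10", s=10)
    plt.show()
```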

The audio2vec Jupyter notebook can be found here!

Github Repository Twitter Link