Part 1
Class 1: Course Introduction
When: January 8th at 17:00-18:00h (online).
What:
- Introduction to the course, its structure, philosophy, and evaluation.
- Discussion about the history of generative music, ethical aspects, music representation, and generative techniques.
Before taking this class, students are expected to have watched the following videos from The Sound of AI’s Generative Music AI Course:
- What’s Generative Music?
- History of Generative Music
- Use Cases
- Ethical Implications
- Symbolic Vs Audio Generation
- Generative Techniques
- Limitations and Future Vision
Class 2: Genetic Algorithms
When: January 13th at 16:00-17:10h (in person).
What:
- Genetic algorithms for music generation (see the code sketch below).
- Real-world experience / challenges implementing this technique.
- Discuss GenJam system.
- Exercises and practical challenges.
Before taking this class, students are expected to have watched the following videos and coded along the code walkthrough from The Sound of AI’s Generative Music AI Course:
- Genetic Algorithms [video] [slides]
- Melody Harmonization with Genetic Algorithms [video] [code]
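If you want to warm up before class, here is a minimal sketch (not the course walkthrough code, and not GenJam) of the genetic-algorithm loop applied to melody generation. The fitness function (rewarding in-scale pitches and stepwise motion), the pitch range, and all hyperparameters are arbitrary placeholders.

```python
import random

# A melody is a list of MIDI pitches; this toy fitness rewards stepwise motion
# and staying inside C major. All parameters here are illustrative placeholders.
SCALE = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of C major
MELODY_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 16, 50, 200, 0.1

def random_melody():
    return [random.randint(60, 72) for _ in range(MELODY_LEN)]

def fitness(melody):
    in_scale = sum(1 for p in melody if p % 12 in SCALE)
    smooth = sum(1 for a, b in zip(melody, melody[1:]) if abs(a - b) <= 2)
    return in_scale + smooth

def crossover(a, b):
    cut = random.randint(1, MELODY_LEN - 1)  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(melody):
    return [random.randint(60, 72) if random.random() < MUTATION_RATE else p
            for p in melody]

def select(population):
    # Tournament selection: keep the fitter of two random individuals.
    a, b = random.sample(population, 2)
    return a if fitness(a) > fitness(b) else b

population = [random_melody() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("Best melody (MIDI pitches):", best)
```

The structure to internalize is the loop itself: evaluate fitness, select parents, recombine, mutate, repeat.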
Class 3: Markov Chains
When: January 13th at 17:20-18:30h (in person).
What:
- Markov chain models for music generation (see the code sketch below).
- Real-world experience / challenges implementing this technique.
- Discuss GEDMAS system.
- Exercises and practical challenges.
Before taking this class, students are expected to have watched the following videos and coded along the code walkthrough from The Sound of AI’s Generative Music AI Course:
- Markov Chains [video] [slides]
- Melody Generation with Markov Chains [video] [code]
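As an optional warm-up, here is a minimal sketch of first-order Markov-chain melody generation: count pitch-to-pitch transitions in a toy melody, then sample a random walk from those counts. The training melody and the dead-end fallback are arbitrary choices, not taken from the course code or from GEDMAS.

```python
import random
from collections import defaultdict

# Toy training melody as MIDI pitches (illustrative only, not course data).
training_melody = [60, 62, 64, 65, 64, 62, 60, 62, 64, 62, 60, 64, 65, 67, 65, 64]

# Build first-order transition counts: P(next_pitch | current_pitch).
transitions = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(training_melody, training_melody[1:]):
    transitions[current][nxt] += 1

def sample_next(pitch):
    candidates = transitions[pitch]
    if not candidates:                      # dead end: restart from any training pitch
        return random.choice(training_melody)
    pitches, counts = zip(*candidates.items())
    return random.choices(pitches, weights=counts, k=1)[0]

# Generate a new melody by walking the chain.
melody = [training_melody[0]]
for _ in range(15):
    melody.append(sample_next(melody[-1]))
print("Generated melody (MIDI pitches):", melody)
```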
Class 4: RNN/LSTMs
When: January 14th at 16:00-18:30h (in person).
What:
- RNN/LSTMs for music generation (see the code sketch below).
- Real-world experience / challenges implementing this technique.
- Discuss BachBot system.
- Exercises and practical challenges.
Before taking this class, students are expected to have watched the following videos and coded along the code walkthroughs from The Sound of AI YouTube channel:
- Recurrent Neural Networks Explained Easily [video] [slides]
- Long Short Term Memory (LSTM) Networks Explained Easily [video] [slides]
- Generating Melodies with LSTM Nets Course:
- Video lectures (theory + implementation)
- Code + slides
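For orientation before the code walkthroughs, here is a minimal next-note LSTM sketch. PyTorch is used purely for brevity (the course materials may use a different framework), and the vocabulary size, network dimensions, and the random training batch are placeholders rather than the settings used in the course.

```python
import torch
import torch.nn as nn

# Minimal next-token LSTM for symbolic melodies. All sizes are placeholders.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, SEQ_LEN = 130, 64, 256, 64

class MelodyLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=2, batch_first=True)
        self.head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        x = self.embed(tokens)
        out, _ = self.lstm(x)
        return self.head(out)                  # logits: (batch, seq_len, vocab)

model = MelodyLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One dummy training step on random data, just to show the shapes involved.
tokens = torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict the next token
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()
optimizer.step()
print("dummy loss:", loss.item())
```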
Class 5: Transformers
When: January 15th at 16:00-18:30h (in person).
What:
- Transformers for music generation (see the code sketch below).
- Real-world experience / challenges implementing this technique.
- Focus on Music Transformer.
- Exercises and practical challenges.
Before taking this class, students are expected to have watched the following videos and coded along the code walkthrough from The Sound of AI’s Generative Music AI Course:
- Transformers Explained Easily: Part 1 [video] [slides]
- Transformers Explained Easily: Part 2 [video] [slides]
- Melody Generation with Transformers [video] [code]
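As an optional reference point, here is a minimal decoder-only transformer over symbolic music tokens, with a causal mask and a sampling loop. All sizes are toy placeholders, and the relative attention mechanism that distinguishes Music Transformer is deliberately omitted.

```python
import torch
import torch.nn as nn

# Tiny decoder-only transformer for symbolic music tokens (toy sizes).
VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS, MAX_LEN = 130, 128, 4, 2, 256

class TinyMusicTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_embed = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS,
                                           dim_feedforward=256, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):                          # tokens: (batch, seq)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.token_embed(tokens) + self.pos_embed(pos)
        # Causal mask: True = "not allowed to attend" (no peeking at the future).
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                            device=tokens.device), diagonal=1)
        x = self.blocks(x, mask=causal_mask)
        return self.head(x)

@torch.no_grad()
def generate(model, prompt, steps=32, temperature=1.0):
    tokens = prompt.clone()
    for _ in range(steps):
        logits = model(tokens)[:, -1] / temperature     # last-position logits
        next_token = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens

model = TinyMusicTransformer()
prompt = torch.randint(0, VOCAB_SIZE, (1, 8))           # random "melody" prompt
print(generate(model, prompt).shape)                    # -> torch.Size([1, 40])
```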
Class 6: Assignments Review
When: January 27th at 16:00-18:30h (in person).
What:
- Discuss four code assignments.
- Check solutions together.
Class 7: Paper Implementation
When: January 28th at 16:30-18:30h (in person).
What:
- How to implement a generative AI music paper.
- Check paper implementation together.
Class 8: Creative Reverse Engineering + Wrap-up
When: January 29th at 9:30-11:30h (in person).
What:
- Reverse engineer the output of a generative music system.
- During the class, working in groups of three, design a generative music system that could produce the presented musical output.
- Reflect on Part 1 of the course, ask questions, and get tips on finding a job as a generative AI music engineer.
Class 9: Inference and Fine-Tuning with Hugging Face Transformers
When: TBD (1.5 hours, in person).
What:
- Inference and fine-tuning with Hugging Face Transformers (see the code sketch below).
- Using pre-trained symbolic models as a user rather than a “researcher” (what they do, not how they work).
- Running on one of the cloud providers (AWS, Azure, or GCP).
Before taking this class, students are expected to have completed the following setup and coded along the code walkthroughs:
- Set up an account on cloud computing platforms [AWS/GCP/Azure (TBD)]
- Hugging Face Transformers [blog]
- Hugging Face MuPT [blog]
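As a preview of the kind of code we will run in this class, here is a hedged sketch of loading a pretrained symbolic model with Hugging Face Transformers and sampling a continuation. The checkpoint name is a placeholder (take the exact MuPT model id from the blog/model card linked above), the ABC prompt is only illustrative, and the generation settings are arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint: replace with the exact MuPT model id from the
# MuPT blog / Hugging Face model card linked above.
MODEL_ID = "m-a-p/MuPT-v1-8192-190M"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# MuPT models work on ABC-notation text, so the prompt is the start of a tune.
prompt = "X:1\nL:1/8\nM:4/4\nK:C\n"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=256,   # length of the continuation
    do_sample=True,       # sample rather than greedy decoding
    temperature=0.9,
    top_p=0.95,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```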
Part 2
Slides [CMC_0_Intro]
Week 1: Audio Modeling Introduction; Sound Model Factory
When:
February 3 at 16:00-17:30h.
February 5 at 17:00-18:00h.
What:
- Introduction to the second part of the course on generative audio.
- Discussion about the main ideas, audio representations, and architectures commonly used.
- Sound Model Factory approach to creating playable audio models (see the code sketch below).
Class preparation:
- Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). GANSynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710. [Link]
- Wyse, L., Kamath, P., & Gupta, C. (2022, April). Sound model factory: An integrated system architecture for generative audio modelling. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar) (pp. 308-322). Cham: Springer International Publishing. [Link]
Preclass Quiz: [Link]
Slides [CMC_1_DataDrivenSoundModeling.pdf]
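Not part of the assigned reading, but to make the “playable model” idea concrete: GANSynth-style generators and the Sound Model Factory both rely on moving through a generator's latent space, so here is a small sketch of spherical interpolation between two latent vectors. The latent vectors are random stand-ins, the dimensionality is arbitrary, and the generator itself is only indicated in a comment.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors z0 and z1."""
    z0_n, z1_n = z0 / np.linalg.norm(z0), z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0_n, z1_n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return z0
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Stand-ins for two latent vectors that a trained generator would map to two
# different timbres; in practice they come from the model's latent space.
rng = np.random.default_rng(0)
z_a, z_b = rng.standard_normal(256), rng.standard_normal(256)

# A "playable" control axis: 11 points along the path between the two timbres.
path = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 11)]
# Each point would then be decoded to sound, e.g. audio = generator(z).
print(len(path), path[0].shape)
```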
Week 2: Representation & Codecs
When:
February 10 at 16:00-17:30h.
February 12 at 17:00-18:00h.
What:
- From audio representations to codecs (see the code sketch below)
Class preparation:
- Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., & Kumar, K. (2024). High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36. [Link] - The Descript Audio Codec (DAC) that we will look at more closely next week.
- Garcia, H. F., Seetharaman, P., Kumar, R., & Pardo, B. (2023). VampNet: Music generation via masked acoustic token modeling. arXiv preprint arXiv:2307.04686. [Link] - Uses the DAC in fun and interesting ways; helps to understand and motivate tokenization.
Optional:
- Van den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in Neural Information Processing Systems, 30. [Link] - 5000 citations; a historically important paper, with a good figure that Kumar et al. should really have included, and a section specifically on audio.
Preclass Quiz: [Link]
Play! Come to class Wednesday with something to show/discuss in this Colab notebook for exploring codec issues (using the Descript Audio Codec): [playground]
You can also check out the notebooks I was using in class: https://github.com/lonce/DACodecMorphing
Slides [CMC_2a_Representation&SoundModeling.pdf]
Slides [CMC_2b_Representation&SoundModeling.pdf]
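To make the residual vector quantization (RVQ) idea behind DAC concrete, here is a toy quantizer with tiny, untrained codebooks. Everything here is a placeholder: real codecs learn their codebooks and quantize encoder frames, and the shrinking scale of the later codebooks below only mimics the finer detail that training would give them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual vector quantization (RVQ): each stage quantizes whatever the
# previous stages failed to capture. Dimensions and codebook sizes are tiny
# and arbitrary; later codebooks are simply scaled down as a stand-in for the
# finer residual detail that trained codebooks would capture.
N_STAGES, CODEBOOK_SIZE, DIM = 4, 64, 2
codebooks = [(0.5 ** s) * rng.standard_normal((CODEBOOK_SIZE, DIM))
             for s in range(N_STAGES)]

def rvq_encode(frame):
    """Return one code index per stage for a single latent frame."""
    residual, codes = frame.copy(), []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)                       # nearest codeword to the residual
        residual = residual - codebook[idx]     # the leftover goes to the next stage
    return codes

def rvq_decode(codes):
    """Sum the chosen codewords across stages to reconstruct the frame."""
    return sum(codebooks[s][i] for s, i in enumerate(codes))

frame = rng.standard_normal(DIM)                # stand-in for one encoder frame
for n in range(1, N_STAGES + 1):                # using more codes typically lowers error
    codes = rvq_encode(frame)[:n]
    err = np.linalg.norm(frame - rvq_decode(codes))
    print(f"{n} codebook(s): codes={codes}, reconstruction error={err:.3f}")
```

Keeping only the first few codes per frame is how RVQ codecs trade bitrate for quality, and it is one reason token-based generators often treat coarse and fine codebooks differently.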
Week 3: DDSP and RAVE
When:
February 17 at 16:00-17:30h.
February 19 at 17:00-18:00h.
What:
- Fast learning, small datasets, real-time inference, and differentiability (see the code sketch below)
Class preparation:
- Engel, J., Hantrakul, L., Gu, C., & Roberts, A. (2020). DDSP: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643. [Link]
- Caillon, A., & Esling, P. (2021). RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arXiv preprint arXiv:2111.05011. [Link]
(Optional) You might also be interested in:
- Barahona-Ríos, A., & Collins, T. (2024). NoiseBandNet: Controllable time-varying neural synthesis of sound effects using filterbanks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32, 1573-1585.
Play! Come to class Wednesday with something to show/discuss in this Colab notebook for DDSP style transfer learning using a pretrained violin and/or bassoon model: [playground1], and this Colab notebook for a DDSP + NoiseBandNet implementation (by Blazej Kotowski): [playground2]
Slides [CMC3_DDSP.RAVE.pdf]
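For intuition about what DDSP makes learnable, here is a tiny harmonic-plus-filtered-noise synthesizer in plain NumPy. Nothing here is learned or differentiable in the autodiff sense: the f0 glide, amplitude envelope, harmonic rolloff, and the crude moving-average “filter” are arbitrary stand-ins for the controls a DDSP decoder would predict from audio features.

```python
import numpy as np

SR, DURATION = 16000, 1.0                      # sample rate (Hz), length (s)
n = int(SR * DURATION)
rng = np.random.default_rng(0)

# Time-varying controls a DDSP decoder would normally predict:
f0 = np.linspace(220.0, 330.0, n)              # gliding fundamental (Hz)
amplitude = np.hanning(n)                      # overall loudness envelope
harmonic_weights = 1.0 / np.arange(1, 9)       # 8 harmonics, 1/k rolloff

# Harmonic (additive) part: sum of sinusoids at integer multiples of f0,
# with phase obtained by integrating the instantaneous frequency.
phase = 2 * np.pi * np.cumsum(f0) / SR
harmonic = sum(w * np.sin(k * phase) for k, w in enumerate(harmonic_weights, 1))
harmonic *= amplitude / np.sum(harmonic_weights)

# Filtered-noise part: white noise shaped by a crude moving-average lowpass.
noise = np.convolve(rng.standard_normal(n), np.ones(64) / 64, mode="same")
audio = harmonic + 0.05 * amplitude * noise
print(audio.shape, float(np.max(np.abs(audio))))
```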
Week 4: Transformers for Audio
When:
February 24 at 16:00-17:30h.
February 26 at 17:00-18:00h.
What:
- Core transformer architecture and considerations for audio (see the code sketch below)
- Synthformer: a detailed walkthrough of (my own) Synthformer for interactive audio generation.
Class preparation:
- Video: Peter Bloem, Lecture 12.1: Transformers (20 minutes) [Link]
- Video: Peter Bloem, Lecture 12.2: Transformers (20 minutes) [Link]
- Video: Visualizing transformers and attention (60 minutes; no need to watch the Q&A) [Link]
- Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., … & Défossez, A. (2023). Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 47704-47720. [Link] (This is the “MusicGen” paper from Meta.)
The videos are a “review” of the fundamentals of Transformers. You’ve looked at Transformers before, I know, but they are here because you may not have all the details clear in your mind, and they are excellent (Bloem for clear explanation, and 3Blue1Brown for visualization).
The paper is a classic. It is actually text-to-audio, but it uses a token-based autoregressive Transformer network at its core, with language as conditioning. Pretty cool, and a good transition to the more “proper” text-to-audio models that we will look at next week.
Slides [CMC4_TransformersForAudio.pdf]
GitHub code for [Synthformer]
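One detail from the MusicGen paper that is worth seeing in code is the codebook-interleaving “delay” pattern, which lets a single autoregressive transformer emit one token per codebook at each step. Below is a toy illustration with made-up token values; the real model works on EnCodec codes and handles the padded positions with special tokens.

```python
import numpy as np

# MusicGen-style "delay" interleaving of codec tokens: codebook k is shifted
# k steps to the right, so at each transformer step the model emits one token
# per codebook while still respecting their dependency order.
K, T, PAD = 4, 8, -1                           # codebooks, frames, pad value
codes = np.arange(K * T).reshape(K, T)         # made-up tokens: (codebooks, frames)

delayed = np.full((K, T + K - 1), PAD)
for k in range(K):
    delayed[k, k:k + T] = codes[k]             # shift codebook k by k steps

print(delayed)

# Undo the delay after generation to recover time-aligned codec frames.
recovered = np.stack([delayed[k, k:k + T] for k in range(K)])
assert np.array_equal(recovered, codes)
```

Roughly, the trade-off discussed in the paper: flattening all codebooks into one long sequence is exact but much slower, while the delay pattern keeps generation to about one step per audio frame.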
Week 5: Text2Audio & Evaluation for generative models
When:
March 3 at 16:00-17:30h.
March 5 at 17:00-18:00h.
What:
- Overview of diffusion and transformer models for text-to-audio, including CLAP
- Objective and subjective approaches to evaluating generative audio (see the code sketch below)
Class preparation:
- Valle, R., Badlani, R., Kong, Z., Lee, S. G., Goel, A., Santos, J. F., … & Catanzaro, B. Fugatto 1: Foundational Generative Audio Transformer Opus 1. In The Thirteenth International Conference on Learning Representations. [Link]
(I consider this state-of-the-art. It is from NVIDIA, but no code is available.)
Optional, but worth a look for understanding CLAP:
- Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., & Dubnov, S. (2023, June). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE. [Link]
Last class: 5-minute presentations of your audio transformer explorations.
Preclass Quiz: [Link]
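To ground the objective-evaluation discussion, here is a sketch of a Fréchet-distance score of the kind used by Fréchet Audio Distance (FAD): fit a Gaussian to embeddings of real audio and another to embeddings of generated audio, then compare the two. The embeddings below are random stand-ins; in practice they come from a pretrained audio model (VGGish in the original FAD formulation, or CLAP-style embeddings).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_embeddings, generated_embeddings):
    """Fréchet distance between two sets of embeddings (rows = examples)."""
    mu_r, mu_g = real_embeddings.mean(0), generated_embeddings.mean(0)
    sigma_r = np.cov(real_embeddings, rowvar=False)
    sigma_g = np.cov(generated_embeddings, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):               # drop tiny imaginary parts from
        covmean = covmean.real                 # numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2 * covmean))

# Random stand-ins for embeddings that would normally come from a pretrained
# audio model applied to real and generated clips.
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 128))
generated = rng.standard_normal((500, 128)) + 0.1   # slight distribution shift
print("FAD-style score:", frechet_distance(real, generated))
```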