My fork with experiments: https://github.com/spsanps/diffwave-sashimi

S4 → Sashimi (S4-based audio architecture, here used as a diffusion backbone) → "It's Raw! Audio Generation with State-Space Models" by Karan Goel, Albert Gu, Chris Donahue, Christopher Ré

Sashimi → https://github.com/albertfgu/diffwave-sashimi → unconditional generation and spectrogram-conditioned generation (both diffusion-based)

S4 models → Sashimi → Can it do something like Deep Performer? (score → audio)

Only used the Bach Violin Dataset (small, aligned, good for experiments).

Used 1 s samples, at either 8 kHz or 16 kHz.
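
A minimal sketch of how such clips could be prepared, assuming torchaudio; the file path, mono mixdown, and target rate below are illustrative placeholders, not necessarily my exact preprocessing:

```python
import torchaudio

def make_clips(path, target_sr=16000, clip_seconds=1.0):
    """Load audio, resample to target_sr, and slice into fixed-length clips."""
    wav, sr = torchaudio.load(path)                  # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)              # mix down to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=target_sr)
    clip_len = int(target_sr * clip_seconds)
    n_clips = wav.shape[-1] // clip_len              # drop the trailing remainder
    return wav[:, : n_clips * clip_len].reshape(n_clips, 1, clip_len)

# clips = make_clips("some_recording.wav")  # → (N, 1, 16000) at 16 kHz
```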

Tried (from worst to best):

No Diffusion (direct waveform → waveform regression):

MIDI → synthesized waveform (see the sketch below)

Example synthesized waveform: emil-telmanyi_bwv1001_mov1_syn.wav
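
A sketch of the MIDI → waveform step using pretty_midi; whether this matches how the dataset's synthesized files were actually rendered is an assumption, and the file names are placeholders:

```python
import numpy as np
import pretty_midi
import soundfile as sf

SR = 16000  # match the experiment's sample rate

midi = pretty_midi.PrettyMIDI("bwv1001_mov1.mid")  # placeholder path

# Simple sine-wave synthesis; with a soundfont installed,
# midi.fluidsynth(fs=SR) gives a more realistic rendering.
audio = midi.synthesize(fs=SR, wave=np.sin)

sf.write("bwv1001_mov1_syn.wav", audio, SR)
```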

Synthesized waveform → U-Net (S4) → original recording

This didn’t work! The model wasn’t really learning. (Though maybe the model was too small, or something else was off in my setup.)
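
For concreteness, a minimal sketch of this no-diffusion baseline as a training step; the Conv1d stack is just a stand-in for the S4 U-Net, and the loss and optimizer settings are illustrative assumptions, not my exact configuration:

```python
import torch
import torch.nn as nn

# Stand-in for the S4 U-Net: any model mapping
# (batch, 1, samples) → (batch, 1, samples) could be dropped in here.
model = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=9, padding=4), nn.GELU(),
    nn.Conv1d(64, 64, kernel_size=9, padding=4), nn.GELU(),
    nn.Conv1d(64, 1, kernel_size=9, padding=4),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(synth, real):
    """One direct-regression step: synthesized clip in, real recording as target."""
    pred = model(synth)
    loss = nn.functional.l1_loss(pred, real)  # L1 here is an assumption; L2 also plausible
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# synth, real: aligned 1 s clips, shape (batch, 1, 16000) at 16 kHz
```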