Masked Audio Generation using a Single Non-Autoregressive Transformer
Paper • 2401.04577
300M model, text-to-music, generates 10-second samples.
1.5B model, text-to-music, generates 10-second samples.
300M model, text-to-music, generates 30-second samples.
1.5B model, text-to-music, generates 30-second samples.
300M model, text-to-sound-effects.
1.5B model, text-to-sound-effects.