Masked Audio Generation using a Single Non-Autoregressive Transformer
Published on 2024-01-09
Authors: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi
Abstract
We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT consists of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which are then used in later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is decoded in parallel. We demonstrate the efficiency of MAGNeT for the tasks of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines while being significantly faster (7x faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT and point to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.
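The inference procedure described in the abstract (start fully masked, then commit tokens over several decoding steps, optionally re-ranking with an external model) can be illustrated with a toy sketch. This is not the authors' implementation: it simplifies to a single token stream and per-token (not span) masking, and `toy_model`, `toy_rescorer`, the cosine schedule, and the convex-combination `rescore_weight` are all hypothetical stand-ins chosen for illustration.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for the transformer.

    Returns one (predicted_token, confidence) pair per position. A real model
    would condition on the text prompt and on the already-decoded tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def toy_rescorer(predictions, rng):
    """Hypothetical external model: its own confidence for each prediction."""
    return [rng.random() for _ in predictions]


def iterative_masked_decode(seq_len, vocab_size, steps, rescore_weight=0.0, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / steps))
        predictions = toy_model(tokens, vocab_size, rng)
        scores = [conf for _, conf in predictions]
        if rescore_weight > 0:
            # Assumed rescoring rule: convex combination of both confidences.
            external = toy_rescorer(predictions, rng)
            scores = [
                (1 - rescore_weight) * s + rescore_weight * e
                for s, e in zip(scores, external)
            ]
        # Rank the still-masked positions and commit the most confident ones.
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: scores[i], reverse=True)
        n_commit = len(masked) - keep_masked
        for i in masked[:n_commit]:
            tokens[i] = predictions[i][0]
    return tokens


decoded = iterative_masked_decode(seq_len=16, vocab_size=1024, steps=4, rescore_weight=0.5)
```

Because the schedule reaches zero at the last step, every position is committed after `steps` passes, no matter how long the sequence is. That fixed number of passes is the source of the latency advantage over token-by-token autoregressive decoding.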
Community
Please release the weights
@timothelaborie
We are on it :) very very soon!
demo space?
The model weights have now been released: https://huggingface.co/collections/facebook/magnet-659ef0ceb62804e6f41d1466 (go check it out!) ❤️
Temporary demo space (waiting for the fix to be merged into the main branch): https://huggingface.co/spaces/fffiloni/MAGNet
Will you release a demo space for the Sound-Effect Generator also?
Models citing this paper: 8
Datasets citing this paper: 0