We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training. https://huggingface.co/blog/cosmopedia
Here are some key takeaways (a minimal prompt-building sketch follows this list):
- Prompt curation is crucial: we want to cover many topics with few duplicates.
- You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
- A good technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.
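For illustration only, here is a minimal sketch of the prompt-diversity idea, not the actual Cosmopedia pipeline: the seed snippets, styles, and audiences are made-up placeholders, but crossing the three axes shows how a small set of ingredients yields many distinct prompts.

```python
# Illustrative sketch (not the Cosmopedia pipeline): combine seed snippets,
# generation styles, and target audiences into a diverse set of prompts.
from itertools import product

# Hypothetical seed data; in practice these would be excerpts from web or curated sources.
seed_snippets = [
    "Photosynthesis converts light energy into chemical energy.",
    "A hash map offers average O(1) lookups.",
]
styles = ["textbook chapter", "blog post", "short story"]
audiences = ["young children", "college students", "professionals"]

def build_prompt(seed: str, style: str, audience: str) -> str:
    """Turn one (seed, style, audience) combination into a generation prompt."""
    return (
        f"Write a {style} aimed at {audience}, inspired by the following extract:\n"
        f'"{seed}"\n'
        "Stay close to the topic of the extract but do not copy it verbatim."
    )

prompts = [build_prompt(s, st, a) for s, st, a in product(seed_snippets, styles, audiences)]
print(f"{len(prompts)} prompts; example:\n{prompts[0]}")
```

With 2 seeds, 3 styles, and 3 audiences this already gives 18 distinct prompts; scaling the seed pool is what drives topic coverage while keeping duplicates low.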
Today we're releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:
- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similarly sized models.
- The Stack v2 is a dataset 4x larger than The Stack v1, resulting in 900B unique code tokens.

As always, we released everything from models and datasets to curation code; a quick sketch for trying the models follows below. Enjoy!
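As a quick way to try the models, here is a hedged sketch using the transformers library. The checkpoint name bigcode/starcoder2-3b is an assumption about the Hub repo id; the 7B and 15B variants should follow the same naming pattern.

```python
# Minimal code-completion sketch with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumed Hub repo id; swap for the 7B or 15B variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The model continues the function body from this prefix.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```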