Today, we're excited to share the full data processing script used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training.
The pipeline consists of four stages:
1. Initial data cleaning
2. Near deduplication
3. Exact deduplication
4. Second round of data cleaning
Special attention was given to cleaning data in South-East Asian (SEA) languages.
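For orientation, here is a minimal sketch of how such a four-stage flow fits together. The function names, thresholds, and the no-op near-deduplication step are illustrative placeholders, not the repo's actual API:

```python
import hashlib
import re

def passes_cleaning_rules(doc: str) -> bool:
    """Stages 1 and 4: simple rule-based filtering (thresholds are illustrative)."""
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return len(doc) >= 200 and alpha_ratio > 0.5

def near_deduplicate(docs):
    """Stage 2 placeholder: in practice this uses MinHash/LSH-style near
    deduplication (e.g. the all-in-one deduplication tool credited below)."""
    return docs

def exact_deduplicate(docs):
    """Stage 3: drop documents whose normalized text hashes to an already-seen value."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(re.sub(r"\s+", " ", doc).strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def run_pipeline(documents):
    docs = [d for d in documents if passes_cleaning_rules(d)]  # 1. initial cleaning
    docs = near_deduplicate(docs)                              # 2. near deduplication
    docs = exact_deduplicate(docs)                             # 3. exact deduplication
    return [d for d in docs if passes_cleaning_rules(d)]       # 4. second cleaning pass
```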
# Use Case
With this codebase, you can clean your own dataset and:
- Get filtered data counts after each processing stage
- Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay); a rough illustration follows below
- Investigate what data was removed at each processing stage
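As a rough illustration of what a language-specific rule and per-stage counting might look like, here is a hypothetical configuration and filter; the field names and thresholds are made up for this sketch and do not reflect the repo's actual schema:

```python
from collections import Counter

# Hypothetical per-language rule; field names and thresholds are illustrative only.
THAI_RULES = {
    "min_chars": 200,          # drop very short pages
    "max_symbol_ratio": 0.3,   # drop punctuation/symbol-heavy pages
    "banned_substrings": ["casino"],
}

def keep(doc: str, rules: dict) -> bool:
    """Return True if the document passes the (illustrative) language-specific rules."""
    if len(doc) < rules["min_chars"]:
        return False
    if any(s in doc.lower() for s in rules["banned_substrings"]):
        return False
    symbols = sum(not (c.isalnum() or c.isspace()) for c in doc)
    return symbols / max(len(doc), 1) <= rules["max_symbol_ratio"]

def filter_stage(docs, rules, stage_name, counts: Counter):
    """Apply the rules and record how many documents the stage removed."""
    kept = [d for d in docs if keep(d, rules)]
    counts[stage_name] += len(docs) - len(kept)   # filtered-data count per stage
    return kept
```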
# Acknowledgement
The main credit goes to @dreamerdeo, the first author of our Sailor paper! He put tremendous effort into the data processing pipeline, enabling the model's strong performance. We believe this mini repo will be a valuable resource for researchers working on dataset curation for large language models.
Sharing the recipe openly aligns with our commitment to open language model development. This repo would not have been possible without contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao, and the deduplication project from Google.
# What's Next
Share your thoughts or leave a comment on what you'd like the Sailor models to do! We also have some exciting news coming soon, so please stay tuned.
# Sailor: A New Multilingual Open LLM for South-East Asia
Last month we released a new family of multilingual language models called **Sailor**, ranging from 0.5B to 7B parameters and continually pre-trained from the Qwen1.5 models. Based on our extensive benchmarking, the Sailor models demonstrate exceptional performance on South-East Asian languages, taking us one step closer to multilingual LLMs that can serve the diverse needs of the region and beyond.
Today, we're more than excited to share the key technical details behind the Sailor models!
**Key highlights**:
- **Data curation**: Merging short examples, document-level code-switching, and aggressive data cleaning and deduplication.
- **Tokenization robustness**: We find that BPE dropout is highly effective for handling prompt variations (a minimal sketch follows below).
- **Optimizing data mixture**: We propose a new approach to automatically balance capabilities across different languages!
- **Recipe for continual pre-training**: We discover a powerful metric that helps predict how well the Sailor models will perform on the original domain (e.g., English) after continual pre-training.
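To give a feel for BPE dropout, here is a minimal sketch using the Hugging Face `tokenizers` library with a tiny made-up vocabulary; a real setup would instead load the production tokenizer files, and the dropout value of 0.1 is an assumption for illustration, not the value used for Sailor:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

# Tiny illustrative vocab/merges; a real setup would load the actual tokenizer files.
vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "hell": 6, "hello": 7}
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]

# dropout=0.1: during encoding, each BPE merge is skipped with probability 0.1,
# so the same text yields different segmentations across passes.
tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges, dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()

print(tokenizer.encode("hello hello").tokens)  # e.g. ['hello', 'he', 'll', 'o']
print(tokenizer.encode("hello hello").tokens)  # usually a different split
```

Because the model sees many sub-optimal segmentations of the same text during training, it becomes less sensitive to how a prompt happens to be tokenized at inference time.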
We are thrilled to share these technical details with the community and invite you to explore the Sailor models. We hope Sailor takes us one step closer to truly multilingual LLMs!
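If you'd like to try the models right away, a quick start with Hugging Face Transformers might look like the sketch below; the hub id assumes the sail/Sailor-* naming, and the prompt and generation settings are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor-7B"  # assumed hub id; adjust to the model size you want
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Jakarta adalah ibu kota"  # Indonesian: "Jakarta is the capital"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```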