Announcing that we are on our way to solving a long-standing issue of document processing: correction of OCR mistakes. Pleias publishes the largest dataset to date with automated OCR correction: 1 billion words in English, French, German and Italian.
OCR quality is a long-standing issue of digitization. Cultural heritage texts are especially affected, both because the primary sources are old documents (with many artifacts, blots, and degradation) and because of the limitations of OCR technology for historical scripts. When we released Common Corpus, a 500-billion-word corpus in the public domain, this was the primary criticism.
A recent breakthrough in post-OCR correction has been made possible by progress in open LLM research and several months of dedicated training and alignment by Pleias, as well as HPC resources from GENCI–IDRIS (Grant 2023-AD011014736) on Jean-Zay.
Announcing today the release of Common Corpus, the largest collection of fully open corpora on HuggingFace: nearly 500 billion words (600-700 billion tokens) in the public domain.
Common Corpus is an international initiative coordinated by @pleias_fr with the support of LANGU:IA, a French state start-up (start-up d'État) backed by the French Ministry of Culture and DINUM, and with the involvement of the open-science LLM community (Occiglot, EleutherAI) and cultural heritage researchers.
We aim to create, at the pretraining stage, the same kind of ecosystem that now exists for fine-tuning, by building a strong commons without copyright issues or "trade secret" gatekeeping. Contrary to what many AI companies claim, Common Corpus shows it is possible to train large language models on fully open corpora. Due to the complexity of copyright checks, we have released only part of the text we hold, and will release much more in the coming months.
Common Corpus is multilingual. It includes the largest open collections to date in French (110 billion words), German (30 billion words), Spanish (23 billion words), Dutch (18 billion words), and Italian (10 billion words), as well as a very long tail of mid- to low-resource languages.
Our conviction is that open corpora make future models more inclusive, democratic, and respectful of cultural diversity, as well as of higher quality. Common Corpus holds many long, edited, book-length texts with reasoning-rich content that has never before been used for LLM pretraining.
Common Corpus is a work in progress and still needs to be enhanced and completed. Sharing is caring: Common Corpus needs more care to become a commons like Wikipedia or Wikisource.