# Dataset Card for Palmyra v1.4
## Dataset Summary
The Palmyra v1.4 dataset is a clean-room dataset. This HuggingFace repository contains a 1-billion-token sample of the dataset. The full dataset has the following token counts and is available upon request.
| Dataset                | Token Count |
|------------------------|-------------|
| Commoncrawl (Filtered) | 790 Billion |
| C4 (Filtered)          | 121 Billion |
| GitHub                 | 31 Billion  |
| Books (Filtered)       | 16 Billion  |
| ArXiv                  | 28 Billion  |
| Wikipedia              | 24 Billion  |
## Languages
Primarily English, though the Wikipedia slice contains multiple languages.
## Dataset Structure

Each record has the following structure:

```json
{
  "text": "...",
  "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...}
}
```
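A record can be loaded and inspected with the `datasets` library. Below is a minimal sketch, assuming the sample is published as a standard HuggingFace dataset; the repository id is a placeholder, not the actual repo id.

```python
from datasets import load_dataset

# Placeholder repo id; substitute the id of this repository.
# Streaming avoids downloading the full 1-billion-token sample up front.
ds = load_dataset("org/palmyra-v1.4-sample", split="train", streaming=True)

record = next(iter(ds))
print(record["text"][:200])        # document text
print(record["meta"]["source"])    # originating corpus, e.g. Commoncrawl
print(record["meta"]["language"])  # language code, e.g. "en"
```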
## Dataset Creation
The Writer Linguistics team created this dataset to rely on business data and copyright-free content as much as possible.
### Source Data
#### Commoncrawl
We downloaded five dumps from Commoncrawl and ran them through the official cc_net pipeline. We filtered out low-quality data and kept only data that is distributed free of any copyright restrictions.
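The exact filtering rules live in the cc_net pipeline and are not reproduced here; as a rough illustration only, a document-level heuristic filter could look like the sketch below. The thresholds are invented placeholders, not the values used for Palmyra.

```python
def keep_document(text: str, min_words: int = 50, max_mean_word_len: float = 10.0) -> bool:
    """Toy quality heuristic; the real filtering is done by cc_net."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be a useful training document
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len <= max_mean_word_len  # very long "words" suggest boilerplate or noise
```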
#### C4
C4 is downloaded from HuggingFace. We filter out low-quality data and keep only data that is distributed free of any copyright restrictions.
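For reference, the upstream (unfiltered) C4 corpus can be streamed from HuggingFace as shown below; `allenai/c4` is the public mirror, not the filtered slice used here.

```python
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
sample = next(iter(c4))
print(sample["url"], sample["timestamp"])  # C4 rows carry text, url, and timestamp
```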
#### GitHub
The raw GitHub data is downloaded from Google BigQuery. We deduplicate at the file level, filter out low-quality files, and keep only projects that are distributed under the MIT, BSD, or Apache license.
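A minimal sketch of that step, assuming each BigQuery row exposes `content` and `license` fields; the field names and license spellings are assumptions, not the actual schema.

```python
import hashlib

ALLOWED_LICENSES = {"mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0"}

def dedupe_and_filter(rows):
    """Yield permissively licensed files, dropping exact duplicates."""
    seen = set()
    for row in rows:
        if row["license"].lower() not in ALLOWED_LICENSES:
            continue  # keep only MIT/BSD/Apache projects
        digest = hashlib.sha256(row["content"].encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate at the file level
        seen.add(digest)
        yield row
```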
#### Wikipedia
We use the Wikipedia dataset available on HuggingFace, which is based on the Wikipedia dump from 2023-03-20 and contains text in 20 different languages. The dataset comes preprocessed, so hyperlinks, comments, and other formatting boilerplate have been removed.
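One way to load that dump with the `datasets` library, assuming the script-based `wikipedia` loader is available in your `datasets` version; dates without prebuilt files additionally require `apache_beam`.

```python
from datasets import load_dataset

# The date matches the 2023-03-20 dump named above; building it locally
# requires apache_beam via a beam runner.
wiki_en = load_dataset("wikipedia", language="en", date="20230320",
                       beam_runner="DirectRunner", split="train")
print(wiki_en[0]["title"], wiki_en[0]["text"][:200])
```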
#### Gutenberg and Public Domain Books
We use the PG19 subset of the Gutenberg Project together with other public-domain books.
#### ArXiv
ArXiv data is downloaded from Amazon S3 via the arxiv requester-pays bucket. We keep only LaTeX source files and remove preambles, comments, macros, and bibliographies.
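Since the bucket is requester-pays, downloads must opt in to being billed. A minimal boto3 sketch follows; the object key is a placeholder, and valid AWS credentials are assumed.

```python
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="arxiv",
    Key="src/arXiv_src_2303_001.tar",   # placeholder key; list the bucket for real names
    Filename="arXiv_src_2303_001.tar",
    ExtraArgs={"RequestPayer": "requester"},  # the requester's account pays for transfer
)
```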