# Dataset Card for Palmyra v1.4
## Dataset Summary
The Palmyra v1.4 dataset is a clean-room dataset. This HuggingFace repository contains a 1-billion-token sample of the dataset. The full dataset has the following token counts and is available upon request.
| Dataset                | Token Count |
|------------------------|-------------|
| Commoncrawl (Filtered) | 790 Billion |
| C4 (Filtered)          | 121 Billion |
| GitHub                 | 31 Billion  |
| Books (Filtered)       | 16 Billion  |
| ArXiv                  | 28 Billion  |
| Wikipedia              | 24 Billion  |
## Languages
Primarily English, though the Wikipedia slice contains multiple languages.
## Dataset Structure

Each record has the following structure:

```json
{
  "text": "...",
  "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...}
}
```
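A record can be loaded and inspected with the `datasets` library. Below is a minimal sketch, assuming the sample is published as a standard HuggingFace dataset; the repository id is a placeholder, not the actual repo id.

```python
from datasets import load_dataset

# Placeholder repo id; substitute the id of this repository.
# Streaming avoids downloading the full 1-billion-token sample up front.
ds = load_dataset("org/palmyra-v1.4-sample", split="train", streaming=True)

record = next(iter(ds))
print(record["text"][:200])        # document text
print(record["meta"]["source"])    # originating corpus, e.g. Commoncrawl
print(record["meta"]["language"])  # language code, e.g. "en"
```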
## Dataset Creation
The Writer Linguistics team created this dataset to rely on business data and copyright-free content as much as possible.
### Source Data
#### Commoncrawl
We downloaded five dumps from Commoncrawl and ran them through the official cc_net pipeline. We filtered out low-quality data and kept only data that is distributed free of any copyright restrictions.
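The exact filtering rules live in the cc_net pipeline and are not reproduced here; as a rough illustration only, a document-level heuristic filter could look like the sketch below. The thresholds are invented placeholders, not the values used for Palmyra.

```python
def keep_document(text: str, min_words: int = 50, max_mean_word_len: float = 10.0) -> bool:
    """Toy quality heuristic; the real filtering is done by cc_net."""
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be a useful training document
    mean_len = sum(len(w) for w in words) / len(words)
    return mean_len <= max_mean_word_len  # very long "words" suggest boilerplate or noise
```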
#### C4
C4 is downloaded from HuggingFace. We filter out low-quality data and keep only data that is distributed free of any copyright restrictions.
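For reference, the upstream (unfiltered) C4 corpus can be streamed from HuggingFace as shown below; `allenai/c4` is the public mirror, not the filtered slice used here.

```python
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
sample = next(iter(c4))
print(sample["url"], sample["timestamp"])  # C4 rows carry text, url, and timestamp
```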
#### GitHub
The raw GitHub data is downloaded from Google BigQuery. We deduplicate at the file level, filter out low-quality files, and keep only projects that are distributed under the MIT, BSD, or Apache license.
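A minimal sketch of that step, assuming each BigQuery row exposes `content` and `license` fields; the field names and license spellings are assumptions, not the actual schema.

```python
import hashlib

ALLOWED_LICENSES = {"mit", "bsd-2-clause", "bsd-3-clause", "apache-2.0"}

def dedupe_and_filter(rows):
    """Yield permissively licensed files, dropping exact duplicates."""
    seen = set()
    for row in rows:
        if row["license"].lower() not in ALLOWED_LICENSES:
            continue  # keep only MIT/BSD/Apache projects
        digest = hashlib.sha256(row["content"].encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate at the file level
        seen.add(digest)
        yield row
```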
#### Wikipedia
We use the Wikipedia dataset available on HuggingFace, which is based on the Wikipedia dump from 2023-03-20 and contains text in 20 different languages. The dataset comes preprocessed, so hyperlinks, comments, and other formatting boilerplate have been removed.
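One way to load that dump with the `datasets` library, assuming the script-based `wikipedia` loader is available in your `datasets` version; dates without prebuilt files additionally require `apache_beam`.

```python
from datasets import load_dataset

# The date matches the 2023-03-20 dump named above; building it locally
# requires apache_beam via a beam runner.
wiki_en = load_dataset("wikipedia", language="en", date="20230320",
                       beam_runner="DirectRunner", split="train")
print(wiki_en[0]["title"], wiki_en[0]["text"][:200])
```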
#### Gutenberg and Public Domain Books
We use the PG19 subset of the Gutenberg Project together with other public-domain books.
#### ArXiv
ArXiv data is downloaded from Amazon S3 via the arxiv requester-pays bucket. We keep only LaTeX source files and remove preambles, comments, macros, and bibliographies.
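Since the bucket is requester-pays, downloads must opt in to being billed. A minimal boto3 sketch follows; the object key is a placeholder, and valid AWS credentials are assumed.

```python
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="arxiv",
    Key="src/arXiv_src_2303_001.tar",   # placeholder key; list the bucket for real names
    Filename="arXiv_src_2303_001.tar",
    ExtraArgs={"RequestPayer": "requester"},  # the requester's account pays for transfer
)
```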