RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models
Today, we’re releasing a new version of the RedPajama dataset, with 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.
Over the past half year, we have been pleased to see that RedPajama-1T, which we released in March, has ignited the creation of many new language models. The community has downloaded this 5TB dataset more than 190,000 times and has used it in wonderfully creative ways! RedPajama-1T consists of 1 trillion high-quality English tokens, but it was only the first step. Today, with the release of RedPajama-V2, we are taking a further step towards the development of open datasets by releasing a massive, 30 trillion token web dataset. This is, to the best of our knowledge, the largest public dataset released specifically for LLM training. Even more excitingly, we include 40+ pre-computed quality annotations, allowing the community to further filter and weight the data. Specifically, this release includes:
- Over 100 billion text documents with 100+ trillion raw tokens from 84 CommonCrawl dumps;
- 40+ of the most widely used quality annotations, pre-computed for a deduplicated subset of 30 trillion tokens;
- Five languages: English, French, Spanish, German, and Italian;
- All data processing scripts are open source and available on GitHub; all data are available on HuggingFace.
Why RedPajama-Data-v2 and How to Use it?
A central ingredient of state-of-the-art open LLMs like Llama, Mistral, Falcon, MPT, and the RedPajama models is the large amount of high-quality data that these models are trained on. For example, Llama 2 was trained on 2 trillion carefully curated tokens. The most prominent data sources are the crawls made publicly available by CommonCrawl. However, this data is crude and not ideal for direct use in LLM training, due to artifacts arising from the conversion of HTML to plain text, sources of generally low quality, and biases inherent to the distribution of content on the web. Getting the right dataset and data mixture is painful, and any LLM developer has to go through the laborious, time-consuming, energy-intensive, and expensive steps of processing and filtering this crude data. Although there have been several community projects around this effort, such as C4, RedPajama-1T, RefinedWeb (Falcon), Dolma (AI2), and SlimPajama (Cerebras), many of them only cover a small portion of the CommonCrawl crawls; moreover, each represents a very specific way in which data are filtered.
With RedPajama-Data-v2, our goal is to lift this burden off the community and to provide a pool of web data that serves as a base from which high-quality datasets for LLM training can be extracted, and on which LLM training data can be thoroughly researched. It provides, to the best of our knowledge, the most complete coverage of CommonCrawl (with 84 dumps processed). More importantly, we provide 40+ quality annotations: the outputs of different ML classifiers on data quality, MinHash signatures that can be used for fuzzy deduplication, and heuristics such as “the fraction of words that contain no alphabetical character”. We provide our best-effort implementations of the quality annotations used in C4, Gopher, Pretrainer’s Guide, RefinedWeb, and Data Selection for Language Models via Importance Resampling (DSIR). These annotations give LLM developers an easy way to slice and filter the data, combining the signals into new data quality pipelines to create their own pre-training datasets.
Here are some examples! The following code snippets show how commonly used filtering rules can be implemented on top of the RedPajama-V2 dataset. For example, implementing the Gopher rules and using them to filter out non-compliant documents is as easy as:
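The snippet below is a minimal sketch: it loads the small “sample” config from the Hugging Face Hub and checks three of the Gopher rules against the pre-computed signals (the signal names follow the published schema; see the dataset card for the full list of rules and keys):

```python
import json

from datasets import load_dataset

# load the small "sample" config of the dataset
ds = load_dataset("togethercomputer/RedPajama-Data-V2", name="sample")

def gopher_rules_pass(sample) -> bool:
    """Return True if the sample complies with a subset of the Gopher rules."""
    signals = json.loads(sample["quality_signals"])

    # rule: number of words between 50 and 100,000
    word_count = signals["rps_doc_word_count"][0][2]
    if word_count < 50 or word_count > 100_000:
        return False

    # rule: mean word length between 3 and 10 characters
    mean_word_length = signals["rps_doc_mean_word_length"][0][2]
    if mean_word_length < 3 or mean_word_length > 10:
        return False

    # rule: symbol-to-word ratio below 0.1
    symbol_word_ratio = signals["rps_doc_symbol_to_word_ratio"][0][2]
    if symbol_word_ratio > 0.1:
        return False

    return True

filtered_ds = ds["train"].filter(gopher_rules_pass)
```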
In the above snippet, we used the “sample” config to load just a subset of the dataset. If you want to load the full dataset for, e.g., the 2023-14 snapshot in English, you can run:
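(A sketch based on our reading of the dataset card; the `partition` argument selects the head and middle buckets.)

```python
from datasets import load_dataset

# load the full "default" config for the English part of the 2023-14 snapshot
ds = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="default",
    partition="head_middle",
    snapshots=["2023-14"],
    languages=["en"],
)
```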
We can also use the rules used in RedPajama-v1 or C4:
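Here is a hedged sketch expressing a few C4 and RedPajama-v1 style rules against the pre-computed signals; the thresholds are illustrative choices, not prescriptions:

```python
import json

def c4_rpv1_rules_pass(sample) -> bool:
    """Return True if the sample passes a few C4 / RedPajama-v1 style rules."""
    signals = json.loads(sample["quality_signals"])

    # C4 rule: drop documents that contain "lorem ipsum" placeholder text
    if signals["rps_doc_lorem_ipsum"][0][2] > 0:
        return False

    # C4 rule: drop documents containing curly brackets (often code, not prose)
    if signals["rps_doc_curly_bracket"][0][2] > 0:
        return False

    # RedPajama-v1 style rule: keep documents that the fastText classifier
    # scores as similar to Wikipedia references (0.25 is an illustrative cutoff)
    if signals["rps_doc_ml_wikiref_score"][0][2] < 0.25:
        return False

    return True

# reuse the dataset loaded in the snippets above
filtered_ds = ds["train"].filter(c4_rpv1_rules_pass)
```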
In the current release, we include 40+ quality annotations, but we very much view this as a “living” project where new additions will be made over time as the field moves towards a better understanding of LLM training data. We hope the community provides feedback, and we are looking forward to continuing to enrich our current pool of annotations.
Data Processing Steps
RedPajama-V2 focuses on CommonCrawl. Other data sources, such as Wikipedia, are available in RedPajama-V1. We also encourage you to enrich your data mixture with The Stack (by BigCode) for code and S2ORC (by AI2) for scientific articles. RedPajama-V2 is built from the ground up on publicly available web data, consisting of 84 crawls provided by CommonCrawl. The core components of this dataset are the source data (plain text), 40+ quality annotations, and deduplication clusters.
Creating the Source Data
The first processing step in building this dataset is to pass each CommonCrawl snapshot through the CCNet pipeline. We choose this pipeline due to its light processing, aligning with our guiding principle of preserving as much information in the raw dataset as possible and allowing downstream model developers to filter or reweight the dataset. We use the language filter in CCNet and keep five languages in this release: English, French, Spanish, German and Italian. This processing step produces 100 billion individual text documents.
Quality Annotations
In addition to the text documents processed by CCNet, we compute over 40 of the most widely used quality annotations for the “head” and “middle” buckets. The primary purpose of these annotations is to allow downstream model developers to filter or reweight the dataset based on their criteria, and to foster research into how these annotations should be used. In addition, we also plan, with the help of the community, to include more quality signals over time. With this release, we publish a first set of quality annotations, which consists of our implementations of the most common quality annotations that are described in C4, Gopher, Pretrainer’s Guide, RefinedWeb, in addition to several signals described in other papers. These annotations fall into the following categories:
- Quality signals indicating how natural a given piece of text is. This includes simple heuristic measures such as the number of sentences, the number of words, the fraction of all-caps words, among others.
- Quality signals indicating how repetitive a given piece of text is. Here we follow the Gopher rules (Rae et al.) and compute the fraction of characters that appear in duplicated word n-grams and the fraction of characters in the most frequent word n-gram appearing in the document.
- Content-based quality signals take the content itself into account, such as the density of words appearing in a list of blocked words (similar to C4), or whether a document comes from a domain flagged as containing potentially harmful or otherwise offensive content.
- ML-based quality signals revolve around the idea of measuring how similar a given text is to a high-quality domain. Here we use fastText classifiers trained on various high-quality domains such as Wikipedia, as well as importance weights as proposed by Xie et al.
- Deduplication signals: pre-computed MinHash signatures (with 128 permutations) that can be used for fuzzy deduplication at different similarity levels.
In addition to these MinHash signatures, we conduct exact deduplication with a Bloom filter over the SHA-1 digest of each document. Rather than removing duplicates, we store them as a separate quality annotation file, so that the original distribution, including duplicates, can be recovered, facilitating research in this direction.
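To make this step concrete, here is a minimal, self-contained sketch of exact deduplication with a Bloom filter over SHA-1 digests. This is not our production implementation, and the filter sizes are illustrative:

```python
import hashlib

class BloomFilter:
    """A toy Bloom filter over SHA-1 digests (sizes are illustrative)."""

    def __init__(self, num_bits: int = 1 << 27, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, digest: bytes):
        # derive num_hashes bit positions from the document's SHA-1 digest
        for i in range(self.num_hashes):
            h = hashlib.sha1(digest + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add_and_check(self, text: str) -> bool:
        """Insert a document; return True if it was (probably) seen before."""
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        seen = True
        for pos in self._positions(digest):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

bloom = BloomFilter()
is_duplicate = bloom.add_and_check("some document text")
```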
Dataset Statistics
For RedPajama-v2, we processed 84 CommonCrawl crawls, resulting in 113B documents across the five languages (English, German, French, Spanish, and Italian). While we also keep the tail partition of the resulting data, consisting of an estimated 80B documents, we compute the number of documents and tokens only for the head and middle partitions (before and after deduplication). Interestingly, while restricting the dataset to the head and middle partitions reduces the token count by 60%, the number of documents decreases disproportionately more, by 71%, indicating that tail documents are generally shorter.
We further deduplicated the head+middle documents using a Bloom filter, which reduces the dataset size by roughly 40%. In the following figure, we show how the number of documents in the head+middle partition develops as a function of the time of the crawl. What stands out is the relatively stable number of documents per crawl since 2018, and the significantly smaller number of documents for the crawls between 2014 and 2016 (up to 10x smaller for, e.g., German). It is also worth noting how the number of unique documents develops over time. Specifically, since we ran the deduplication from the newest snapshot to the oldest, one expects an increasingly smaller number of unique documents in the older parts of the corpus, which can be observed in the figure below (note the log scale). However, it is worth pointing out the sudden drop in unique documents for the crawls between 2014 and 2017. We believe this can be explained by a change in the list of seeds used by the CommonCrawl web crawler during that period.
In the next figure, we show the distribution of the number of tokens per document, for the tail and the head+middle partitions. With a median per-document token count of 380, the tail documents are considerably shorter than the head+middle documents where the median is 741.
While the raw documents provide the basis for the RedPajama-V2 corpus, a further central component is the set of quality signals, which we computed for all documents in the head+middle partition. In the figure below, we show the distribution of the quality signals computed for documents from the 2023-06 snapshot.
Dataset Structure
The core of the dataset is composed of the text documents, accompanied by the quality annotations and deduplication clusters. The structure largely follows the one defined by CCNet. Specifically, the documents for a given CommonCrawl snapshot (say, 2018-43) are partitioned into 5k shards, where the filename key indicates the shard, the language of the document, and the perplexity bucket (partition). The quality annotations and duplicates follow the same logic and “mirror” the source filenames:
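For example, the files for one shard of the 2018-43 snapshot are laid out roughly as follows (an illustrative listing; consult the repository for the exact paths):

```
documents/2018-43/0000/en_head.json.gz
quality_signals/2018-43/0000/en_head.signals.json.gz
duplicates/2018-43/0000/en_head.duplicates.parquet
minhash/2018-43/0000/en_head.minhash.parquet
```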
The document files are left untouched and correspond 1-to-1 to the CCNet output, including the metadata fields. The quality signals, on the other hand, include document ids, metadata, and the quality signals themselves:
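Schematically, a single quality-signals record looks like the following (field values are placeholders, and only two of the 40+ signals are shown):

```json
{
  "id": "2018-43/0000/en_head.json.gz/0",
  "metadata": {
    "url": "...",
    "source_domain": "...",
    "language": "en",
    "snapshot_id": "2018-43"
  },
  "quality_signals": {
    "rps_doc_word_count": [[0, 1024, 327.0]],
    "rps_lines_ending_with_terminal_punctuation_mark": [[0, 27, 0.0], [27, 104, 1.0]]
  }
}
```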
Since some quality signals characterise the text at the line level (e.g., whether a line ends in a terminal punctuation mark) and others at the document level, we adopt the representation used by Dolma, which allows different types of signals to be expressed uniformly. Specifically, each signal corresponds to an array of tuples `(start, end, score)`, where `start` and `end` delimit the span of the document string to which the `score` applies.
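As a quick illustration of how to consume these spans, the sketch below aggregates a line-level signal into a document-level fraction (the signal key is our illustrative choice; check the dataset card for the exact names):

```python
import json

def frac_lines_with_terminal_punct(sample) -> float:
    """Aggregate a line-level signal into a document-level fraction."""
    signals = json.loads(sample["quality_signals"])
    # each (start, end, score) tuple covers one line of the raw document text;
    # the key below is illustrative; see the dataset card for exact names
    spans = signals["rps_lines_ending_with_terminal_punctuation_mark"]
    if not spans:
        return 0.0
    return sum(score for _, _, score in spans) / len(spans)
```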
A “Living” Dataset
We envision the release of this dataset as the start of a larger, community-driven development of large-scale datasets for LLMs. On the data axis, we hope to continuously grow this pool and enrich it with additional domains and new snapshots over time. On the data quality axis, we view the current set of quality signals as an initial base that we hope to grow with new additions. In that sense, RedPajama-v2 should be seen as a pool that expands over time as the community learns more about harnessing the data for training performant language models. In the future, we plan to add more quality annotations, such as contamination annotations against popular LLM benchmarks, topic modelling and classification annotations for each document, and other annotations the community is excited about!
Model Building at Together
Together is building open models based on RedPajama-Dataset-V2, and we also help companies and organizations build custom models using principled mixes of open and proprietary datasets. If you are evaluating solutions for building models, please contact us here.
Acknowledgments
We are grateful to the many partners and collaborators who, together with us, are pushing forward the frontier of open LLMs.
- Thank you to the OLMo team at AI2 and our friends at OpenGPT-X for the insightful discussions about datasets and data quality! Thanks also to everyone building on the RedPajama dataset, including Cerebras for their SlimPajama efforts, and the open-source AI community, which has built over 500 models on RedPajama to date.
- We are grateful to the great team at EleutherAI for paving the path on open training datasets with The Pile and for open-sourcing code we use in training some of the RedPajama models.
- Thank you to our partners of RedPajama-v1, including Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de Montréal, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.
Q: Should I use the RedPajama-V2 Dataset out of the box?
RedPajama-V2 is conceptualized as a pool of data that serves as a foundation for creating high-quality datasets. It is thus not intended to be used out of the box; depending on the application, data should be filtered using the quality signals that accompany it. With this dataset, we take the view that the optimal filtering of data depends on the intended use. Our goal is to provide all the signals and tooling that enable this.