RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens

Foundation models such as GPT-4 have driven rapid improvement in AI. However, the most powerful models are closed commercial models or only partially open. RedPajama is a project to create a set of leading, fully open-source models. Today, we are excited to announce the completion of the first step of this project: the reproduction of the LLaMA training dataset of over 1.2 trillion tokens.

The most capable foundation models today are closed behind commercial APIs, which limits research, customization, and their use with sensitive data. Fully open-source models hold the promise of removing these limitations, if the open community can close the quality gap between open and closed models. Recently, there has been much progress along this front. In many ways, AI is having its Linux moment. Stable Diffusion showed that open-source can not only rival the quality of commercial offerings like DALL-E but can also lead to incredible creativity from broad participation by communities around the world. A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.

We are launching RedPajama, an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research. RedPajama has three key components:

Pre-training data, which needs to be both high quality and have broad coverage
Base models, which are trained at scale on this data
Instruction tuning data and models, which improve the base model to make it usable and safe

Today, we are releasing the first component, pre-training data.

“The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.”

Our starting point is LLaMA, which is the leading suite of open base models for two reasons: First, LLaMA was trained on a very large (1.2 trillion tokens) dataset that was carefully filtered for quality. Second, the 7 billion parameter LLaMA model is trained for much longer, well beyond the Chincilla-optimal point, to ensure the best quality at that model size. A 7 billion parameter model is particularly valuable for the open community as it can run on a wide variety of GPUs, including many consumer grade GPUs. However, LLaMA and all its derivatives (including Alpaca, Vicuna, and Koala) are only available for non-commercial research purposes. We aim to create a fully open-source reproduction of LLaMA, which would be available for commercial applications, and provide a more transparent pipeline for research.

The RedPajama base dataset

The full RedPajama 1.2 trillion token dataset and a smaller, more consumable random sample can be downloaded through Hugging Face. The full dataset is ~5TB unzipped on disk and ~3TB to download compressed.

RedPajama-Data-1T consists of seven data slices:

CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters including a linear classifier that selects for Wikipedia-like pages.
C4: Standard C4 dataset
GitHub: GitHub data, filtered by licenses and quality
arXiv: Scientific articles removing boilerplate
Books: A corpus of open books, deduplicated by content similarity
Wikipedia: A subset of Wikipedia pages, removing boilerplate
StackExchange: A subset of popular websites under StackExchange, removing boilerplate

For each data slice, we conduct careful data pre-processing and filtering, and tune our quality filters to roughly match the number of tokens as reported by Meta AI in the LLaMA paper:

	RedPajama	LLaMA*
CommonCrawl	878 billion	852 billion
C4	175 billion	190 billion
Github	59 billion	100 billion
Books	26 billion	25 billion
ArXiv	28 billion	33 billion
Wikipedia	24 billion	25 billion
StackExchange	20 billion	27 billion
Total	1.2 trillion	1.25 trillion

* estimated from Table 1 in https://arxiv.org/abs/2302.13971

We are making all data pre-processing and quality filters openly available on Github. Anyone can follow the data preparation recipe and reproduce RedPajama-Data-1T.

Interactively analyzing the RedPajama base dataset

In collaboration with the Meerkat project, we are releasing a Meerkat dashboard and embeddings for exploring the Github subset of the corpus. The image below shows a preview of the dashboard.

Scatterplot of PCA 1 vs PCA 2 colored by search value, showing gradient from purple to yellow. — Interactively explore the data in the RedPajama base dataset and view matching records using Meerkat dashboard.

You can find instructions on how to install and use the dashboard on Github.

Up next: Models, instructions & OpenChatKit

Having reproduced the pre-training data, the next step is to train a strong base model. As part of the INCITE program, with support from Oak Ridge Leadership Computing Facility (OLCF), we are training a full suite of models, with the first becoming available in the coming weeks.

With a strong base model in hand, we are excited to instruction tune the models. Alpaca illustrated the power of instruction tuning – with merely 50K high-quality, diverse instructions, it was able to unlock dramatically improved capabilities. Via OpenChatKit, we received hundreds of thousands of high-quality natural user instructions, which will be used to release instruction-tuned versions of the RedPajama models.

Acknowledgements

We are appreciative to the work done by the growing open-source AI community that made this project possible.

That includes:

Participants in building the RedPajama dataset including Ontocord.ai, ETH DS3Lab, Université de Montréal, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research research group and LAION.
Meta AI — Their inspiring work on LLaMA shows a concrete path towards building strong language models, and it is the original source for our dataset replication.
EleutherAI — This project is built on the backs of the great team at EleutherAI — including the source code they provided for training GPT-NeoX.
An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility (OLCF), which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

‍