RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens

April 17, 2023

By Together

Foundation models such as GPT-4 have driven rapid improvement in AI. However, the most powerful models are closed commercial models or only partially open. RedPajama is a project to create a set of leading, fully open-source models. Today, we are excited to announce the completion of the first step of this project: the reproduction of the LLaMA training dataset of over 1.2 trillion tokens.


The most capable foundation models today are closed behind commercial APIs, which limits research, customization, and their use with sensitive data. Fully open-source models hold the promise of removing these limitations, if the open community can close the quality gap between open and closed models. Recently, there has been much progress along this front. In many ways, AI is having its Linux moment. Stable Diffusion showed that open-source can not only rival the quality of commercial offerings like DALL-E but can also lead to incredible creativity from broad participation by communities around the world. A similar movement has now begun around large language models with the recent release of semi-open models like LLaMA, Alpaca, Vicuna, and Koala; as well as fully-open models like Pythia, OpenChatKit, Open Assistant and Dolly.

We are launching RedPajama, an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research. RedPajama has three key components:

  1. Pre-training data, which needs to be both high quality and have broad coverage
  2. Base models, which are trained at scale on this data
  3. Instruction tuning data and models, which improve the base model to make it usable and safe

Today, we are releasing the first component, pre-training data.

“The RedPajama base dataset is a 1.2 trillion token fully-open dataset created by following the recipe described in the LLaMA paper.”

Our starting point is LLaMA, which is the leading suite of open base models for two reasons: First, LLaMA was trained on a very large (1.2 trillion tokens) dataset that was carefully filtered for quality. Second, the 7 billion parameter LLaMA model was trained for much longer, well beyond the Chinchilla-optimal point, to ensure the best quality at that model size. A 7 billion parameter model is particularly valuable for the open community as it can run on a wide variety of GPUs, including many consumer-grade GPUs. However, LLaMA and all its derivatives (including Alpaca, Vicuna, and Koala) are only available for non-commercial research purposes. We aim to create a fully open-source reproduction of LLaMA, which would be available for commercial applications, and provide a more transparent pipeline for research.

The RedPajama base dataset

The full RedPajama 1.2 trillion token dataset and a smaller, more consumable random sample can be downloaded through Hugging Face. The full dataset is ~5TB unzipped on disk and ~3TB to download compressed.
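
If you want to inspect the data before committing to a multi-terabyte download, the sample can be loaded with the Hugging Face datasets library. The snippet below is a minimal sketch; the repository id and loading options are assumptions, so check the Hugging Face hub for the exact name before running.

```python
# Minimal sketch: stream the RedPajama sample from the Hugging Face hub.
# The repository id below is an assumption; verify it on the hub before running.
from datasets import load_dataset

sample = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",
    split="train",
    streaming=True,  # avoids materializing the whole sample on disk
)

for i, record in enumerate(sample):
    # Each record carries the raw text plus metadata about its source slice.
    print(record["text"][:200])
    if i == 2:
        break
```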

RedPajama-Data-1T consists of seven data slices:

  • CommonCrawl: Five dumps of CommonCrawl, processed using the CCNet pipeline, and filtered via several quality filters, including a linear classifier that selects for Wikipedia-like pages (see the sketch after this list)
  • C4: Standard C4 dataset
  • GitHub: GitHub data, filtered by licenses and quality
  • arXiv: Scientific articles, with boilerplate removed
  • Books: A corpus of open books, deduplicated by content similarity
  • Wikipedia: A subset of Wikipedia pages, with boilerplate removed
  • StackExchange: A subset of popular StackExchange sites, with boilerplate removed
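
To give a concrete sense of the CommonCrawl filtering step, here is a small, illustrative sketch of a linear "Wikipedia-like page" classifier. It is not the actual RedPajama filter (those live in the GitHub repository), and it uses scikit-learn's hashing vectorizer plus logistic regression rather than the fastText classifier the LLaMA paper describes; the training examples below are placeholders.

```python
# Illustrative only: a linear "looks-like-a-Wikipedia-reference" classifier in the
# spirit of the CommonCrawl quality filter described above. The real RedPajama
# filters are published on GitHub; the training data here is hypothetical.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positives: text resembling pages cited by Wikipedia; negatives: low-quality crawl text.
positives = ["The mitochondrion is an organelle found in most eukaryotic cells ..."]
negatives = ["CLICK HERE for the best deals!!! limited time offer buy now ..."]

clf = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
clf.fit(positives + negatives, [1] * len(positives) + [0] * len(negatives))

def keep_page(text: str, threshold: float = 0.5) -> bool:
    """Keep a CCNet-processed page if it scores as 'Wikipedia-like'."""
    return clf.predict_proba([text])[0][1] >= threshold
```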

For each data slice, we conduct careful data pre-processing and filtering, and tune our quality filters to roughly match the number of tokens as reported by Meta AI in the LLaMA paper:

                 RedPajama       LLaMA*
  CommonCrawl    878 billion     852 billion
  C4             175 billion     190 billion
  GitHub          59 billion     100 billion
  Books           26 billion      25 billion
  ArXiv           28 billion      33 billion
  Wikipedia       24 billion      25 billion
  StackExchange   20 billion      27 billion
  Total           1.2 trillion    1.25 trillion

* estimated from Table 1 in https://arxiv.org/abs/2302.13971
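
As a quick sanity check, the per-slice counts in the table sum to the stated totals. A small, self-contained snippet (all figures in billions of tokens, taken from the table above):

```python
# Verify that the slice-level token counts add up to the reported totals.
redpajama = {"CommonCrawl": 878, "C4": 175, "GitHub": 59, "Books": 26,
             "ArXiv": 28, "Wikipedia": 24, "StackExchange": 20}
llama = {"CommonCrawl": 852, "C4": 190, "GitHub": 100, "Books": 25,
         "ArXiv": 33, "Wikipedia": 25, "StackExchange": 27}

print(sum(redpajama.values()) / 1000)  # -> 1.21, i.e. ~1.2 trillion tokens
print(sum(llama.values()) / 1000)      # -> 1.25 trillion tokens
```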

We are making all data pre-processing and quality filters openly available on GitHub. Anyone can follow the data preparation recipe and reproduce RedPajama-Data-1T.

Interactively analyzing the RedPajama base dataset

In collaboration with the Meerkat project, we are releasing a Meerkat dashboard and embeddings for exploring the GitHub subset of the corpus. The image below shows a preview of the dashboard.

Interactively explore the data in the RedPajama base dataset and view matching records using the Meerkat dashboard.

You can find instructions on how to install and use the dashboard on GitHub.

Up next: Models, instructions & OpenChatKit

Having reproduced the pre-training data, the next step is to train a strong base model. As part of the INCITE program, with support from Oak Ridge Leadership Computing Facility (OLCF), we are training a full suite of models, with the first becoming available in the coming weeks.

With a strong base model in hand, we are excited to instruction tune the models. Alpaca illustrated the power of instruction tuning – with merely 50K high-quality, diverse instructions, it was able to unlock dramatically improved capabilities. Via OpenChatKit, we received hundreds of thousands of high-quality natural user instructions, which will be used to release instruction-tuned versions of the RedPajama models.

Acknowledgements

We appreciate the work done by the growing open-source AI community that made this project possible.

