Research

Preparing for the era of 32K context: Early learnings and explorations

July 28, 2023

By Together

Today, we’re releasing LLaMA-2-7B-32K, a 32K-context model built using Position Interpolation and Together AI’s data recipe and system optimizations, including FlashAttention-2. You can fine-tune the model for targeted, long-context tasks such as multi-document understanding, summarization, and QA, and run inference and fine-tuning at 32K context with up to 3x speedup.

LLaMA-2-7B-32K making completions of a book in the Together Playground. Try it yourself at api.together.ai.

In the last few months, we have witnessed the rapid progress of the open-source ecosystem for LLMs — from the original LLaMA model that triggered the “LLaMA moment”, to efforts such as RedPajama, MPT, Falcon, and the recent LLaMA-2 release, open-source models have been catching up with closed-source models. We believe the upcoming opportunity for open-source models is to extend the context length of open models to the regime of 32K-128K, matching that of state-of-the-art closed-source models. We have already seen some exciting efforts here such as MPT-7B-8K and LLongMA-2 (8K).

Today, we’re sharing with the community some recent learnings and explorations at Together AI in the direction of building long-context models with high quality and efficiency. Specifically:

  • LLaMA-2-7B-32K: We extend LLaMA-2-7B to 32K long context, using Meta’s recipe of interpolation and continued pre-training. We share our current data recipe, consisting of a mixture of long context pre-training and instruction tuning data.
  • Examples of building your own long-context models: We share two examples of how to fine-tune LLaMA-2-7B-32K to build specific applications, including book summarization and long-context question answering.
  • Software support: We updated both the inference and training stack to allow for efficient inference and fine-tuning with 32K context, using the recently released FlashAttention-2 and a range of other optimizations. This allows one to create their own 32K context model and conduct inference efficiently.
  • Try it yourself:
      • Go to the Together API and run LLaMA-2-7B-32K for inference.
      • Use OpenChatKit to fine-tune a 32K model over LLaMA-2-7B-32K for your own long-context applications.
      • Go to Hugging Face and try out LLaMA-2-7B-32K (see the loading sketch below).
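For a quick start, here is a minimal sketch of loading the checkpoint with Hugging Face Transformers (the repository id follows the model card; the prompt and generation settings are illustrative, and trust_remote_code is only needed if your Transformers version relies on the repo's custom modeling code):

```python
# Minimal sketch: load LLaMA-2-7B-32K with Hugging Face Transformers.
# Assumes a GPU with enough memory to hold the 7B model in fp16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/LLaMA-2-7B-32K"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",        # place weights on the available GPU(s)
    trust_remote_code=True,   # the repo may ship custom modeling code
)

prompt = "Once upon a time,"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```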

Long-context models are already crucial for document understanding, summarization, and retrieval augmented generation. We are excited to share this work with the open-source community and make sustained progress towards better, longer-context models.


Extending LLaMA-2 to 32K context

LLaMA-2 has a context length of 4K tokens. To extend it to 32K context, three things need to come together: modeling, data, and system optimizations.

On the modeling side, we follow Meta’s recent paper and use linear interpolation to extend the context length. This provides a powerful way to extend the context length for models with rotary positional embeddings. We take the LLaMA-2 checkpoint, and continue pre-training/fine-tuning it with linear interpolation for 1.5B tokens.
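For intuition, here is a minimal sketch (not the training code itself) of linear position interpolation for rotary embeddings: positions in the extended 32K window are rescaled by 4096/32768 = 1/8 so that they fall back into the positional range the base model was pre-trained on. The helper names are ours.

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard rotary-embedding inverse frequencies.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def interpolated_rope_angles(seq_len: int, head_dim: int,
                             orig_ctx: int = 4096, new_ctx: int = 32768) -> torch.Tensor:
    # Linear position interpolation: squeeze positions 0..seq_len-1 of the
    # extended context back into the 0..orig_ctx range seen during pre-training.
    scale = orig_ctx / new_ctx                 # 1/8 for 4K -> 32K
    positions = torch.arange(seq_len).float() * scale
    return torch.outer(positions, rope_inv_freq(head_dim))   # (seq_len, head_dim/2)

# The cos/sin of these angles replace the standard RoPE tables:
angles = interpolated_rope_angles(seq_len=32768, head_dim=128)
cos, sin = angles.cos(), angles.sin()
```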

But this alone is not enough. What data should we use to improve the base model? Instead of simply fine-tuning on generic language datasets such as the Pile and RedPajama, as in Meta’s recent recipe, we find that two factors matter here and both require care. First, we need generic long-context language data so the model learns how to handle the interpolated positional embeddings; second, we need instruction data to encourage the model to actually take advantage of the information in the long context. Having both appears to be the key.

Our current data recipe consists of the following mixture of data:

  • In the first phase of continued pre-training, our data mixture contains 25% RedPajama Book, 25% RedPajama ArXiv (including abstracts), 25% other data from RedPajama, and 25% from the UL2 Oscar Data, which is part of OIG (Open-Instruction-Generalist) and asks the model to fill in missing chunks or complete the text. To enhance long-context capabilities, we exclude sequences shorter than 2K tokens. The UL2 Oscar Data encourages the model to capture long-range dependencies.
  • We then fine-tune the model to focus on its few-shot capability with long contexts, using 20% Natural Instructions (NI), 20% Public Pool of Prompts (P3), and 20% the Pile. To mitigate forgetting, we further incorporate 20% RedPajama Book and 20% RedPajama ArXiv with abstracts. We decontaminated all data against HELM core scenarios (see a precise protocol here). We teach the model to leverage in-context examples by packing as many examples as possible into one 32K-token sequence (a packing sketch follows below).
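As a rough illustration of the packing step in the second phase, the sketch below greedily concatenates tokenized examples into 32K-token sequences; the separator token and the exact packing policy used for the released model are assumptions.

```python
from typing import Iterable, List

def pack_examples(tokenized_examples: Iterable[List[int]],
                  max_len: int = 32768,
                  sep_id: int = 2) -> List[List[int]]:
    """Greedily pack tokenized examples into sequences of at most max_len tokens."""
    packed, current = [], []
    for tokens in tokenized_examples:
        if len(tokens) + 1 > max_len:                         # cannot fit even alone; skip
            continue
        if current and len(current) + len(tokens) + 1 > max_len:
            packed.append(current)                            # flush a full 32K sequence
            current = []
        current = current + tokens + [sep_id]                 # sep_id: illustrative separator
    if current:
        packed.append(current)
    return packed
```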

We evaluate the model in two ways: (1) its normalized perplexity under various sequence lengths on PG-19, and (2) its HELM v1.0 scores over 16 core scenarios (evaluated on the same context length that fits LLaMA-2). We see that LLaMA-2-7B-32K incurs reasonable perplexity, comparable to the original LLaMA-2 model. Moreover, on HELM v1.0, LLaMA-2-7B-32K achieves comparable, if not better, quality relative to the original LLaMA-2-7B base model.

Model 2K 4K 8K 16K 32K
LLaMA-2 1.759 1.747 N/A N/A N/A
LLaMA-2-7B-32K 1.768 1.758 1.750 1.746 1.742

Perplexity-per-byte for various context lengths: $\exp\left(\frac{1}{N_{\text{bytes}}} \sum_{i=1}^{N_{\text{tokens}}} \text{loss}_i\right)$
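Concretely, this metric can be computed from the per-token losses and the raw byte length of the text, roughly as follows (the helper name is ours):

```python
import math
from typing import Sequence

def perplexity_per_byte(token_losses: Sequence[float], num_bytes: int) -> float:
    # token_losses: per-token negative log-likelihoods (natural log) for a document
    # num_bytes: length of the same document in UTF-8 bytes
    return math.exp(sum(token_losses) / num_bytes)
```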

Scenario - Metric LLaMA-2-7B LLaMA-2-7B-32K
AVG 0.489 0.522
MMLU - EM 0.435 0.435
BoolQ - EM 0.746 0.784
NarrativeQA - F1 0.483 0.548
NaturalQuestions (closed-book) - F1 0.322 0.299
NaturalQuestions (open-book) - F1 0.622 0.692
QuAC - F1 0.355 0.343
HellaSwag - EM 0.759 0.748
OpenbookQA - EM 0.570 0.533
TruthfulQA - EM 0.290 0.294
MS MARCO (regular) - RR@10 0.250 0.419
MS MARCO (TREC) - NDCG@10 0.469 0.710
CNN/DailyMail - ROUGE-2 0.155 0.151
XSUM - ROUGE-2 0.144 0.129
IMDB - EM 0.951 0.965
CivilComments - EM 0.577 0.601
RAFT - EM 0.684 0.699

Quality of 16 Core Scenarios in HELM v1.0 (evaluated on the same context length that fits LLaMA-2)

Building long-context applications via fine-tuning

The power of LLaMA-2-7B-32K is that it forms a strong base model that you can fine-tune to build your own applications. We illustrate two such examples below.

Long-context QA. As an example, we take the multi-document question answering task from the paper “Lost in the Middle: How Language Models Use Long Contexts.” The input to the model consists of (1) a question that requires an answer and (2) k documents, which are passages extracted from Wikipedia. Notably, only one of these documents contains the answer to the question, while the remaining k − 1 "distractor" documents do not. To perform this task successfully, the model must identify and use the answer-bearing document within its input context. One potential use case is seamless integration between LLMs and document or vector databases, where the latter fetches relevant information (the context) and the former answers questions from the user.

To fine-tune a model that performs better at long-context QA, we prepare the data in the following format:

```
Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [1] (Title: Email retargeting) on sending personalized e-mail to an anonymous website visitor...

Document [2] (Title: Opt-in email) of 2003 does not require an opt-in approach, only an easy opt-out system...

Document [3] (Title: Email marketing) to send direct promotional messages to, or they rent a list of email addresses ...

...

Question: which is the most common use of opt-in e-mail marketing

Answer: a newsletter sent to an advertising firm's customers

```

Our preprocessing procedure mirrors the one used in the aforementioned paper, and we derived our training set from the NaturalQuestions dataset. training/finetune_LLaMA-2-7b-32k-mqa.sh illustrates how to pass this dataset to OpenChatKit (OCK) in order to fine-tune LLaMA-2-7B-32K.
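For reference, a hypothetical helper that assembles training examples in this layout might look like the following (the function and field names are ours, not part of the released preprocessing code):

```python
from typing import Dict, List

def build_mqa_example(question: str,
                      documents: List[Dict[str, str]],   # each: {"title": ..., "text": ...}
                      answer: str) -> str:
    """Format a multi-document QA example in the prompt layout shown above."""
    header = ("Write a high-quality answer for the given question using only "
              "the provided search results (some of which might be irrelevant).")
    doc_lines = [
        f"Document [{i}] (Title: {doc['title']}) {doc['text']}"
        for i, doc in enumerate(documents, start=1)
    ]
    return "\n\n".join([header, *doc_lines,
                        f"Question: {question}",
                        f"Answer: {answer}"])
```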

We measure quality while varying the number of documents packed into the context from 20 to 100, which corresponds on average to 2.9K to 14.8K tokens in the model input. As the table shows, fine-tuning LLaMA-2-7B-32K on this task yields a significant improvement in quality.

Model 20 (Avg 2.9K tokens) 30 (Avg 4.4K tokens) 50 (Avg 7.4K tokens) 100 (Avg 14.8K tokens)
LLaMA-2 0.245 0.238* 0.215* 0.193*
LLaMA-2-7B-32K 0.315 0.293 0.246 0.223
LLaMA-2-7B-32K (fine-tuned) 0.466 0.453 0.427 0.372

Accuracy of multi-document question answering under various # documents. * For LLaMA-2, we truncate the input when it does not fit into the 4K context.

Long-context summarization. We use BookSum, a unique dataset designed to address the challenges of long-form narrative summarization. The dataset features source documents from the literature domain, including novels, plays, and stories, and offers human-written, highly abstractive summaries. Here we focus on chapter-level data. BookSum poses a unique set of challenges, requiring the model to read through each chapter comprehensively.

We prepare the data in the following format:

```

Chapter: "Mother, Mother, I am so happy!" whispered the girl, burying her face in the lap of the faded, tired-looking woman who, with back turned to the shrill intrusive light, was sitting in the one arm-chair that their dingy sitting-room contained.  "I am so happy!" she repeated, "and you must be happy, too!"...

Q: Can you write an appropriate summary of the above paragraphs?

A: The following day, Sibyl Vane and her mother discuss the girl's relationship with "Prince Charming." Sibyl is elated and wants her mother to share her joy. She is in love. Mrs. Vane's attitude is more realistic and down-to-earth. She wants her daughter to think of her career...

```

We can fine-tune LLaMA-2-7B-32K on this dataset with training/finetune_LLaMA-2-7b-32k-booksum.sh.

The test data comprises approximately 4,500 tokens on average. To evaluate and compare performance, we compute three metrics: ROUGE-1, ROUGE-2, and ROUGE-L. For LLaMA-2, we truncate the input when it does not fit into the 4K context. The fine-tuned model achieves higher scores on all three metrics.
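As a reference for how such numbers can be reproduced, the snippet below scores a generated summary against the reference with the rouge-score package; it is an illustrative evaluation sketch, not the exact harness used here.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The following day, Sibyl Vane and her mother discuss ..."
generated = "Sibyl tells her mother that she is in love with Prince Charming ..."

scores = scorer.score(reference, generated)   # target first, prediction second
for name, result in scores.items():
    print(f"{name}: {result.fmeasure:.3f}")
```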

Model R1 R2 RL
LLaMA-2 0.063* 0.008* 0.042*
LLaMA-2-7B-32K 0.179 0.032 0.114
LLaMA-2-7B-32K (fine-tuned) 0.355 0.072 0.175

ROUGE scores on BookSum. * For LLaMA-2, we truncate the input when it does not fit into the 4K context.

System optimizations

One unique challenge for building long-context models is that the increasing context length requires system optimizations.

We release an updated training and inference stack integrating the recently released FlashAttention-2 by our Chief Scientist Tri Dao, together with a series of other optimizations:

  • The current OCK repo now supports fine-tuning with 32K context. With the latest optimizations, we achieve up to a 1.6x speedup over a well-optimized OCK baseline with FlashAttention-1.
  • We also integrate FlashAttention-2 into the inference stack, which one can run with Hugging Face Transformers; at 32K context, it provides up to a 3x improvement in inference throughput compared to state-of-the-art models.
Configuration 2K 4K 8K 16K 32K
Baseline OCK 1x 0.99x OOM OOM OOM
Baseline OCK + FA1 2.25x 2.12x 1.64x 1.13x 0.60x
+ FA2 2.36x 2.34x 2.07x 1.57x 0.99x

Training throughput on 8x A100 GPUs for various context lengths, relative to the baseline OCK at 2K context.

Model 2K 4K 8K 32K
LLaMA 2 (HF, 4.31.0) 41.4 tokens/s 35.5 tokens/s 21.6 tokens/s OOM
MPT-7B-8K (HF, fastest configuration) 45.2 tokens/s 30.4 tokens/s 18.2 tokens/s 4.96 tokens/s
LLaMA-2-7B-32K + FlashAttention-1 41.5 tokens/s 39.6 tokens/s 34.5 tokens/s 10.0 tokens/s
LLaMA-2-7B-32K + FlashAttention-2 48.5 tokens/s 46.4 tokens/s 39.1 tokens/s 13.5 tokens/s

Inference performance (tokens/s) on a single A100.
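The numbers above come from Together's inference stack. As a rough approximation with recent Hugging Face Transformers (4.36 or later), FlashAttention-2 can be requested at load time as sketched below, assuming an Ampere-or-newer GPU and the flash-attn package installed:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/LLaMA-2-7B-32K",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```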

Looking ahead

Building long-context models is a challenging task, and we are just at the beginning. While we are excited to share what we have learned so far, there is much more to understand, together with the community:

  • Build more models with longer context: We are in the process of applying a similar recipe to other models, including those in the LLaMA-2 family (13B and 70B) and models such as RedPajama-3B, and exploring ways to build models with longer context and better quality.
  • Prepare better data for long-context tasks: How can we train the model such that it uses its context more effectively? This might require us to enrich the training process with more targeted design in data and tasks.
  • Better system support for long-context training and inference: Can we further improve hardware utilization for both training and inference in the long-context regime? At Together AI, we are working hard to further optimize the system. Stay tuned for an upcoming release of our fine-tuning and inference API!
