Llama-2-7B-32K-Instruct — and fine-tuning for Llama-2 models with Together API
We’re excited to release Llama-2-7B-32K-Instruct, a long-context instruction model fine-tuned using Together API! Llama-2-7B-32K-Instruct achieves state-of-the-art performance on long-context tasks such as summarization and multi-document question answering (QA), while maintaining performance similar to Llama-2-7B at shorter context lengths. We are also releasing the full recipe we used to distill, train, test, and deploy the model.
We built Llama-2-7B-32K-Instruct using the Together API, and today we’re making fine-tuning of Llama-2 publicly available through the Together API!
Last month, we released Llama-2-7B-32K, which extended the context length of Llama-2 for the first time from 4K to 32K — giving developers the ability to use open-source AI for long-context tasks such as document understanding, summarization, and QA.
To provide an example of this fine-tuning capability, we’re introducing Llama-2-7B-32K-Instruct, a long-context instruction-tuned model that we built with fewer than 200 lines of Python using the Together API. We fine-tuned this model on a mixture of three data sources: 1) a set of single- and multi-round conversations generated from human instructions and Llama-2-70B-Chat outputs, collected using the Together Inference API; 2) summarization data from the BookSum dataset; and 3) a multi-document QA dataset.
The Llama-2-7B-32K-Instruct recipe: Four main steps to build custom models
The code used to implement this recipe using the Together API, including the data preparation, is available on GitHub. The four main steps are outlined below: Distill, Train, Test, and Deploy.
Step 1: Distill
As mentioned, we fine-tuned our new model on a mixture of three data sources. The details and our pre-processing steps for BookSum and multi-document QA can be found in our previous blog post and our complete data recipe. In this blog post, we share the detailed steps we took to build the single- and multi-round conversation dataset.
We follow the distillation paradigm used by Alpaca, Vicuna, WizardLM, and Orca, generating training data by querying a powerful LLM, which in our case is Llama-2-70B-Chat. We do this by leveraging the Together Inference API. Given an instruction, say a request for a table of national parks in the US, we formulate a prompt in the form "[INST] {instruction} [/INST]", the standard prompt format of Llama-2-70B-Chat, and query the Together Inference API with a few lines of code:
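The snippet below is a minimal sketch of such a query using Python and the requests library; the endpoint URL, model identifier, and request fields are assumptions based on the public Together Inference REST API, not necessarily the exact call in our script.

```python
import os

import requests

# Illustrative instruction; any collected human instruction can be substituted.
instruction = "Create a table about national parks in the US"

# Wrap the instruction in the Llama-2 chat format: [INST] {instruction} [/INST].
prompt = f"[INST] {instruction} [/INST]"

# Endpoint URL, model name, and payload fields are assumptions about the
# Together Inference REST API.
res = requests.post(
    "https://api.together.xyz/inference",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "togethercomputer/llama-2-70b-chat",
        "prompt": prompt,
        "max_tokens": 1024,
        "temperature": 0.7,
    },
)
res.raise_for_status()
```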
We then parse the result with:
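Assuming the response nests generations under an output/choices structure (again an assumption about the API's JSON schema), the generated text can be extracted like this:

```python
# Assumed response schema: generations nested under output -> choices.
answer = res.json()["output"]["choices"][0]["text"].strip()
print(answer)
```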
In this particular case, it returns:
Sure, here's a table about national parks in the US:
Note: This table lists some of the most well-known national parks in the US, but there are many others that are also worth visiting. The area of each park is approximate and may vary slightly depending on the source.
To build Llama-2-7B-32K-Instruct, we collect instructions from 19K human inputs extracted from ShareGPT-90K (only using human inputs, not ChatGPT outputs). The actual script handles multi-turn conversations and also supports restarting and caching via a SQLite database. You can find the full script here, with merely 122 lines!
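The linked script is the authoritative version; purely as an illustration of the restart-and-cache idea (not our actual code), a response cache keyed by instruction can be kept in SQLite so that an interrupted run resumes without re-querying the API:

```python
import sqlite3
from typing import Callable


def cached_answer(db: sqlite3.Connection, instruction: str,
                  generate: Callable[[str], str]) -> str:
    """Return a cached response if present, otherwise query the model and store it."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS cache (instruction TEXT PRIMARY KEY, answer TEXT)"
    )
    row = db.execute(
        "SELECT answer FROM cache WHERE instruction = ?", (instruction,)
    ).fetchone()
    if row is not None:
        return row[0]  # answered in a previous (possibly interrupted) run
    answer = generate(instruction)  # e.g. the Together Inference call shown above
    db.execute("INSERT INTO cache VALUES (?, ?)", (instruction, answer))
    db.commit()
    return answer
```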
Finally, we write all the instructions and answers to a single jsonl file, with each line corresponding to one conversation:
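For example, appending one conversation per line could look like the following sketch; the single "text" field holding the [INST]-formatted conversation is our assumption about the expected schema, so check the fine-tuning docs for the exact format.

```python
import json

# One JSON object per line; the "text" field contains the full
# [INST]-formatted conversation (assumed schema).
record = {
    "text": "[INST] Create a table about national parks in the US [/INST] "
            "Sure, here's a table about national parks in the US: ..."
}

with open("instructions.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```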
This is also the accepted training format for the Together Fine-tuning API.
Step 2: Train
The second step is to fine-tune the Llama-2-7B-32K model using our data mixture.
First, upload the dataset. Suppose the instruction data is stored in instructions.jsonl; it can be uploaded with a single command (or the equivalent Python call):
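Below is a minimal sketch using the together Python client rather than the CLI; the Files.upload method name and the response fields are assumptions about the client interface, so adapt them to the current documentation.

```python
import os

import together

together.api_key = os.environ["TOGETHER_API_KEY"]

# Upload the training file; the response includes a file ID used when
# launching the fine-tuning job (method name and fields are assumed).
resp = together.Files.upload(file="instructions.jsonl")
file_id = resp["id"]
print(file_id)
```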
The data uploads to the Together Cloud, making it available for your fine-tuning jobs. Note that Together AI does not retain any customer or training data unless you explicitly give consent for its use.
Second, start a fine-tuning job using the file ID provided in the response to the file upload command:
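Continuing the sketch above (the Finetune.create call, its parameters, and the model identifier are assumptions about the together Python client; the hyperparameters are illustrative only):

```python
# Start a fine-tuning job from the uploaded file.
job = together.Finetune.create(
    training_file=file_id,
    model="togethercomputer/LLaMA-2-7B-32K",
    n_epochs=3,  # illustrative hyperparameter
)
print(job["id"])  # use this job ID to track progress on the Jobs page
```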
Third, track the progress of your fine-tuning job in the Jobs page on api.together.ai. You can view event logs and checkpoints.
Step 3: Test
Once your fine-tuning job is completed, which for us took just a few hours, the fine-tuned model will automatically appear on the Models page. Click play to start an instance, and begin testing your model in the Together Playgrounds.
Step 4: Deploy
Now that you’ve tested the model in the Together Playground, you can integrate the model into your application! Query the model using the Together Inference API. Simply click “<>” in the Playground to see examples of how to query it via the API.
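For instance, a query to the fine-tuned model might look like the sketch below; the model name is hypothetical (your fine-tuned model appears under your own account), and the endpoint and response fields are the same assumptions as in Step 1.

```python
import os

import requests

res = requests.post(
    "https://api.together.xyz/inference",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "yourname/llama-2-7b-32k-instruct-ft",  # hypothetical model name
        "prompt": "[INST] Summarize the following documents ... [/INST]",
        "max_tokens": 512,
    },
)
print(res.json()["output"]["choices"][0]["text"])
```

The same call works with prompts tens of thousands of tokens long, which is where the 32K context window pays off.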
One immediate example is to benchmark the model that you just trained! This involves querying the Inference API over a collection of evaluation examples. In this script, we provide one example of how to prepare results for Alpaca Eval. We see that for short contexts, we get quality comparable to Llama-2-7B-Chat.
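A rough sketch of that preparation step (not the linked script itself) is shown below; it assumes alpaca_eval consumes a JSON list of records with "instruction", "output", and "generator" fields, and the generate helper is a hypothetical stand-in for a Together Inference API call.

```python
import json
from typing import Callable, Iterable


def prepare_alpaca_eval_outputs(instructions: Iterable[str],
                                generate: Callable[[str], str],
                                path: str = "model_outputs.json") -> None:
    """Collect model outputs into the (assumed) Alpaca Eval input format."""
    outputs = [
        {
            "instruction": instruction,
            "output": generate(instruction),
            "generator": "Llama-2-7B-32K-Instruct",
        }
        for instruction in instructions
    ]
    with open(path, "w") as f:
        json.dump(outputs, f, indent=2)
```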
Table 1: Alpaca Eval Result of Llama-2-7B-32K-Instruct. Refer to https://github.com/tatsu-lab/alpaca_eval#limitations for a list of limitations and potential biases of automatic evaluation.
The power of Llama-2-7B-32K-Instruct is the ability to provide instructions that are much longer than the base Llama-2 models can accept. In addition to the summarization example we illustrated at the beginning, the following is another example of multi-document QA:
Model Evaluation
We follow the protocol from our previous blog post, and evaluate Llama-2-7B-32K-Instruct on two tasks — long context summarization and multi-document QA, as summarized below:
Table 2: ROUGE scores on BookSum.
* For Llama-2-7b-chat, we truncate the inputs when they do not fit into the 4K context.
Table 3: Accuracy of multi-document question answering with varying numbers of documents.
* For Llama-2-7b-chat, we truncate the inputs when they do not fit into the 4K context.
We observe that our fine-tuned Llama-2-7B-32K-Instruct consistently outperforms other baseline models, including GPT-3.5-Turbo-16k, Llama-2-7b-chat, Longchat-7b-16k, and Longchat-7b-v1.5-32k. This suggests that the model delivered by the Together API is robust across long-context benchmarks. Comparing Llama-2-7B-32K-Instruct with GPT-3.5-Turbo-16K, we observe that Llama-2-7B-32K-Instruct produces comparable, and sometimes better, results on summarization and in the long-context regime of QA (50 documents).
Start fine-tuning Llama-2
The Together API makes it easy to fine-tune Llama-2 with your own data. As outlined in this post, simply upload a file, specify your hyperparameters, and start training. Full documentation is available here.
We can’t wait to see what you’ll build!