Together AI partners with Meta to release Llama 3.1 models for inference and fine-tuning with accelerated performance at full accuracy
Today marks an inflection point for open source AI with the launch of Meta Llama 3.1 405B, the largest openly available foundation model and one that rivals the best closed-source models. We are excited to partner with Meta to bring all of the Llama 3.1 models (8B, 70B, 405B, and LlamaGuard) to the Together Inference and Fine-tuning Platform. Together Inference delivers horizontal scalability with industry-leading performance: up to 80 tokens per second for Llama 3.1 405B and up to 400 tokens per second for Llama 3.1 8B, which is 1.9x to 4.5x faster than vLLM while maintaining full accuracy against Meta's reference implementation across all models. This empowers developers and enterprises to build Generative AI applications at production scale in their chosen environment: Together Cloud (serverless or dedicated endpoints) or private clouds.
Built on our leading inference optimization research, the Together Inference Engine leverages innovations including FlashAttention-3 kernels, custom-built speculators based on RedPajama, and the most accurate quantization techniques available on the market. These advancements enable you to build and run Generative AI applications on Llama 3.1 models with unmatched performance, accuracy, and cost-efficiency at production scale.
Over 100,000 developers and companies like Zomato, DuckDuckGo, and the Washington Post build and run their Generative AI applications on the Together Platform.
Available serverless endpoints: Llama 3.1 8B, 70B, 405B, and LlamaGuard.
The future of AI is open source
Llama 3.1 405B is the first openly available model to offer state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation that rival the best closed-source models, combined with the flexibility and control of open weights and safety tools for responsible development. These advancements will rapidly accelerate the adoption of open-source AI among developers and enterprises.
At Together AI, we believe the future of Generative AI depends on open research, open science, and trust between researchers, developers, and enterprises.
Our vision is to bring innovations from research to production faster than anyone else. Our team has invented methods such as FlashAttention-3, Mixture of Agents, Medusa, Sequoia, Hyena, Mamba, and CocktailSGD at the intersection of AI and systems research, leading to faster development velocity, faster time to market, and cutting-edge benefits for our customers.
As the launch partner for the Llama 3.1 models, we're thrilled for customers to leverage the best performance, accuracy, and cost for their Generative AI workloads on the Together Platform while allowing them to keep ownership of their models and their data secure.
Llama 3.1 models available through API and dedicated endpoints
Today, all Llama 3.1 models (8B, 70B, 405B, and LlamaGuard) are available for inference through the Together API and the most capable 405B model is available for QLoRA fine-tuning for enterprises. Developers and enterprises can also fine-tune the models with their proprietary data to achieve higher accuracy and performance, and continue to maintain ownership of their data and models. Function calling is supported natively by each of the models, and JSON mode is available for the 8B and 70B models (coming soon for the 405B model).
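As a minimal sketch of what calling one of these serverless endpoints looks like, the snippet below assembles a request for Together's OpenAI-compatible chat completions API. The endpoint path, model identifier string, and `response_format` field for JSON mode are assumptions based on the naming used in this post; check the Together API documentation for the exact values.

```python
import json
import os
import urllib.request

# Assumed endpoint path for Together's OpenAI-compatible chat API.
API_URL = "https://api.together.xyz/v1/chat/completions"


def build_chat_request(model, messages, json_mode=False):
    """Assemble the JSON body for a chat completion call."""
    body = {"model": model, "messages": messages, "max_tokens": 256}
    if json_mode:
        # JSON mode (available for 8B/70B per this announcement)
        # constrains the model to emit valid JSON.
        body["response_format"] = {"type": "json_object"}
    return body


def chat(model, messages, api_key):
    """Send the request; requires a valid Together API key and network access."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(model, messages)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Build an example payload locally (no network call is made here);
# the model id string is an assumption.
body = build_chat_request(
    "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
    [{"role": "user", "content": "Summarize Llama 3.1 in one sentence."}],
)
```

In practice you would read the key from an environment variable (for example `os.environ["TOGETHER_API_KEY"]`) and call `chat(...)` with it; the payload-building step is separated out so the request shape is easy to inspect and test.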
Together Turbo endpoints let businesses prioritize performance, quality, and price without compromise. They use the most accurate quantization available for Llama 3.1 models, closely matching Meta's full-precision FP16 reference models, which makes Together Inference the fastest engine for NVIDIA GPUs and the most cost-effective solution for building with Llama 3.1 at scale.
Performance: Together Turbo delivers best-in-class throughput, ranging from 1.9x to 4.5x that of vLLM across all Llama 3.1 models.

Quality: Together Turbo achieves this performance while maintaining full accuracy relative to Meta's reference implementation across all models; Llama-3.1-405B-Instruct-Turbo matches the accuracy of Meta's reference model.

Llama Guard
Developers can now also use Llama Guard on Together AI as a moderation model, either as a standalone classifier or as a filter to safeguard model responses. This lets developers and enterprises easily moderate both prompts and model outputs, screening for the categories of potentially unsafe content that Llama Guard classifies.
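The filter pattern described above can be sketched as a thin wrapper around any moderation classifier. In the sketch below, `classify` is a stand-in: in production it would call the Llama Guard endpoint and return its verdict, and the banned phrase used here is purely illustrative, not Llama Guard's actual taxonomy.

```python
REFUSAL = "Sorry, I can't help with that."


def classify(text):
    """Stub moderation check; replace with a real Llama Guard call."""
    banned = ("how to build a weapon",)  # illustrative placeholder only
    return "unsafe" if any(b in text.lower() for b in banned) else "safe"


def guarded_chat(prompt, generate):
    """Screen the prompt, generate a reply, then screen the reply."""
    if classify(prompt) != "safe":
        return REFUSAL
    reply = generate(prompt)
    if classify(reply) != "safe":
        return REFUSAL
    return reply


# Usage with a trivial stand-in model:
echo = lambda p: f"You said: {p}"
ok = guarded_chat("What is Llama 3.1?", echo)          # passes both checks
blocked = guarded_chat("how to build a weapon", echo)  # rejected at the prompt
```

Screening both sides of the exchange, as shown, is what lets a single classifier act as both a prompt filter and a response safeguard.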
Llama 3.1 includes a number of other significant advancements:
- Llama 3.1 models expand context length to 128K, add support for eight languages, and include Meta Llama 3.1 405B, the largest openly available foundation model.
- Llama 3.1 405B offers unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed source models. This model will enable the community to unlock new capabilities, such as synthetic data generation and model distillation.
- Tools for creating custom agents and novel agentic behaviors.
- New security and safety tools for responsible development.
Get started today!
- Build applications with the new Llama 3.1 serverless endpoints.
- Contact us to deploy a dedicated instance on Together Cloud or run Llama 3.1 models in your VPC.
Together, these innovations empower you to scale applications with the performance, quality, and cost-efficiency your business needs. We can't wait to see what you build!
Run Meta Llama 3.1 for your production traffic
Deploy a dedicated instance on Together Cloud or run Llama 3.1 models in your VPC.