
Together AI partners with Meta to release Llama 3.1 models for inference and fine-tuning with accelerated performance at full accuracy

July 23, 2024

By Together AI

Today marks an inflection point for open source AI with the launch of Meta Llama 3.1 405B, the largest openly available foundation model and one that rivals the best closed source models in AI. We are excited to partner with Meta to bring all the Llama 3.1 models (8B, 70B, 405B, and LlamaGuard) to the Together Inference and Fine-tuning Platform. Together Inference delivers horizontal scalability with industry-leading performance of up to 80 tokens per second for Llama 3.1 405B and up to 400 tokens per second for Llama 3.1 8B, which is 1.9x to 4.5x faster than vLLM while maintaining full accuracy against Meta's reference implementation across all models. This empowers developers and enterprises to build Generative AI applications at production scale in their chosen environment: Together Cloud (serverless or dedicated endpoints) or private clouds.

Built on our leading inference optimization research, the Together Inference Engine leverages innovations including FlashAttention-3 kernels, custom-built speculators based on RedPajama, and the most accurate quantization techniques available on the market. These advancements enable you to build and run Generative AI applications on Llama 3.1 models with unmatched performance, accuracy, and cost-efficiency at production scale.

Over 100,000 developers and companies like Zomato, DuckDuckGo, and the Washington Post build and run their Generative AI applications on the Together Platform.


The future of AI is open source

Llama 3.1 405B is the first openly available model with state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation that rival the best closed source models, while offering the flexibility, control, and safety tools needed for responsible development. These advancements will rapidly accelerate the adoption of open source AI among developers and enterprises.

At Together AI, we believe the future of Generative AI depends on open research, open science, and trust between researchers, developers, and enterprises. 

Our vision is to bring innovations from research to production faster than anyone else. Our team has invented methods like FlashAttention-3, Mixture of Agents, Medusa, Sequoia, Hyena, Mamba, and CocktailSGD at the intersection of AI and systems research, leading to faster velocity, faster time to market, and cutting-edge benefits for customers.

As the launch partner for the Llama 3.1 models, we're thrilled for customers to get the best performance, accuracy, and cost for their Generative AI workloads on the Together Platform, all while keeping ownership of their models and keeping their data secure.

Llama 3.1 models available through API and dedicated endpoints

Today, all Llama 3.1 models (8B, 70B, 405B, and LlamaGuard) are available for inference through the Together API and the most capable 405B model is available for QLoRA fine-tuning for enterprises. Developers and enterprises can also fine-tune the models with their proprietary data to achieve higher accuracy and performance, and continue to maintain ownership of their data and models. Function calling is supported natively by each of the models, and JSON mode is available for the 8B and 70B models (coming soon for the 405B model).
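As an illustration, here is a minimal sketch of calling a Llama 3.1 serverless endpoint through the Together Python SDK (the chat completions API). The model slug shown is an assumption; check the Together models page for the exact identifiers available to your account.

```python
# Minimal sketch, assuming the Together Python SDK (pip install together).
# The model slug below is an assumption; check the Together models page
# for the exact Llama 3.1 identifiers.
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

# Chat completion against a Llama 3.1 serverless endpoint.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",  # assumed slug
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Plan a three-day trip to Lisbon."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Function calling and JSON mode use this same API surface via the standard tool and response-format request options; see the Together documentation for the exact parameters supported per model.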

Together Turbo endpoints empower businesses to prioritize performance, quality, and price without compromise. They provide the most accurate quantization available for Llama 3.1 models, closely matching full-precision FP16 models. These advancements make Together Inference the fastest engine for NVIDIA GPUs and the most cost-effective solution for building with Llama 3.1 at scale.

Together Turbo endpoints provide fast performance while maintaining quality that matches Meta's FP16 reference models, making them the most accurate, cost-efficient, and performant endpoints available.

Performance: Together Turbo provides best-in-class throughput, ranging from 1.9x up to 4.5x that of vLLM across all Llama 3.1 models.

Quality: Together Turbo achieves this throughput while maintaining full accuracy relative to Meta's reference implementation across all models, including Llama-3.1-405B-Instruct-Turbo.

Llama Guard

Developers can now also use Llama Guard on Together AI as a moderation model, either as a standalone classifier or as a filter to safeguard responses. This lets developers and enterprises easily moderate both prompts and model outputs, screening for the categories of potentially unsafe content that Llama Guard classifies.
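For example, here is a minimal sketch of using Llama Guard as a standalone classifier to screen a prompt before forwarding it to a chat model. The model slug and the verdict parsing are assumptions; Llama Guard responds in text with a safe/unsafe verdict and, when unsafe, the violated category codes.

```python
# Minimal sketch: Llama Guard as a standalone prompt classifier.
# The model slug and verdict parsing are assumptions; Llama Guard
# replies in text with "safe" or "unsafe" plus violated category codes.
import os
from together import Together

client = Together(api_key=os.environ["TOGETHER_API_KEY"])

def is_safe(user_message: str) -> bool:
    """Return True if Llama Guard classifies the message as safe."""
    result = client.chat.completions.create(
        model="meta-llama/Meta-Llama-Guard-3-8B",  # assumed slug
        messages=[{"role": "user", "content": user_message}],
    )
    return result.choices[0].message.content.strip().lower().startswith("safe")

prompt = "How do I bake sourdough bread?"
if is_safe(prompt):
    print("Prompt passed moderation; forward it to the main model.")
else:
    print("Prompt flagged; block or review it.")
```

The same pattern can be run on model outputs to filter responses before they reach end users.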

Llama 3.1 includes a number of other significant advancements:

  • Llama 3.1 models expand the context length to 128K, add support for eight languages, and include Meta Llama 3.1 405B, the largest openly available foundation model.
  • Llama 3.1 405B offers unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed source models. This model will enable the community to unlock new capabilities, such as synthetic data generation and model distillation.
  • Tools for creating custom agents and novel agentic behaviors.
  • New security and safety tools for responsible development.

Get started today!

  • Build applications with the new Llama 3.1 serverless endpoints.
  • Contact us to deploy your dedicated instance on Together Cloud or run Llama 3.1 models in VPC.

Together, these innovations empower you to scale applications with the performance, quality, and cost-efficiency your business needs. We can't wait to see what you build!


