Salesforce, Zoom, InVideo Train Faster with Together AI Turbocharged with NVIDIA Blackwell
At NVIDIA GTC, we shared our bold plans to scale out thousands of NVIDIA Blackwell GPUs as an NVIDIA Cloud Partner, launch self-serve Instant GPU Clusters, and offer seamless deployment of NVIDIA NIM microservices from build.nvidia.com.
As we’ve brought NVIDIA Blackwell systems online over these past several weeks, we’ve invited a number of pioneering AI companies to take a free test drive of NVIDIA Blackwell on Together AI. Today, we’re excited to share what Zoom, Salesforce, and InVideo discovered when they tried NVIDIA Blackwell infrastructure, turbocharged by the Together Training Stack and Kernel Collection.
The Results
In collaboration with NVIDIA, we have worked hand in hand with customers to move them onto NVIDIA HGX™ B200, accelerating both training and inference workloads. Below, we highlight some of the results across AI-native companies and tech-forward enterprise customers. This work would not have been possible without Together AI's expertise in AI research and systems optimization, combined with NVIDIA's cutting-edge accelerated computing platform.
Salesforce: Bringing Agentforce to life
Salesforce leverages Together AI across the entire AI journey: from training to fine-tuning to inference of the models that deliver Agentforce. Salesforce Research is at the cutting edge of innovation in building agentic frameworks and was keen to experiment with NVIDIA Blackwell GPUs to accelerate its training pipelines.
Training a Mistral 24B model, Salesforce saw a 2x improvement in training speed when upgrading from NVIDIA HGX H200 to HGX B200. This enables Salesforce to rapidly train a variety of models and accelerate the integration of research results into Agentforce, enhancing product velocity.
Zoom: Accelerating the amazing Zoom AI Companion
1.2 million people use Zoom AI Companion, which offers AI-powered tools such as real-time transcription, meeting summaries, and phone call analysis. Zoom has partnered with Together AI to leverage our research and deliver accelerated performance when training the models powering various AI Companion features. Recently, they took it a step further by trying out Together GPU Clusters accelerated by NVIDIA HGX B200.
Out of the box, Zoom saw a 1.9x improvement in training speed over previous-generation NVIDIA Hopper GPUs. The teams look forward to taking it a step further by profiling for additional optimizations.
InVideo: Bringing ideas to life through video
InVideo has generated millions of videos, helping its users tell stories like never before through its generative video foundation model. Considering some of the intricacies of current software-stack support on NVIDIA Blackwell, the team was initially uncertain whether they would see the gains needed to justify the leap to the new architecture.
However, during initial tests with NVIDIA HGX B200, InVideo immediately saw a 25% improvement on a training job compared to NVIDIA HGX H200. Then, in partnership with our researchers, the team made further optimizations and more than doubled this improvement, making the step up to the NVIDIA Blackwell platform even more appealing. This level of performance gain is largely unheard of for modalities outside of text and speaks volumes about the expertise of the teams involved. We share some of those optimizations later in this blog.
The Together Training Stack
The Together AI research team has custom-built a training container that gives developers the best representation of the hardware’s capabilities and potential. This container features a co-optimized Llama 3 70B golden model, achieving state-of-the-art (SOTA) Model FLOPS Utilization (MFU).
The stack includes Together AI researchers’ tools for debugging and running diagnostics at scale across many nodes and thousands of processes. These tools deliver:
- Comprehensive MFU benchmarks at various levels (e.g., GEMM, ThunderKittens (TK)-based attention kernels)
- Full bandwidth benchmarking toolkits
- Collective communication diagnostics toolkits for performance analysis and debugging
Getting the most out of the hardware
Price-performance is widely considered the most important metric for GPU cloud infrastructure. Together AI specializes in delivering higher tokens/sec/node and overall MFU than other providers on the same hardware. This section covers some of the optimizations we've found on the NVIDIA Blackwell platform.
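To make the MFU metric concrete, here is a minimal sketch of how it can be estimated from training throughput. All numbers are hypothetical placeholders for illustration, not measurements or official hardware specs, and the 6-FLOPs-per-parameter-per-token rule is the standard approximation for dense transformers.

```python
# Illustrative Model FLOPS Utilization (MFU) calculation:
# achieved model FLOPS divided by the hardware's peak FLOPS.

def model_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPS per token for a dense transformer
    (forward + backward ~ 6 * parameter count)."""
    return 6.0 * n_params

def mfu(tokens_per_sec_per_gpu: float, n_params: float,
        peak_flops_per_gpu: float) -> float:
    """Fraction of the GPU's peak FLOPS the training run actually uses."""
    achieved = tokens_per_sec_per_gpu * model_flops_per_token(n_params)
    return achieved / peak_flops_per_gpu

# Hypothetical example: a 70B-parameter model at 800 tokens/sec/GPU on a
# GPU with 2e15 peak FLOPS (placeholder value, not a published spec).
print(round(mfu(800, 70e9, 2e15), 3))  # 0.168
```

Comparing this ratio across hardware generations and software stacks is what lets throughput gains be attributed to real utilization improvements rather than raw FLOPS alone.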
FlashAttention
FlashAttention-3 is a key optimization that speeds up attention computation in LLM training and inference. It relies on specific memory-access patterns and Tensor Core optimizations that are now supported in the latest version of NVIDIA cuDNN for the Blackwell architecture.
This support includes FP8 FlashAttention optimized for Blackwell, using Blackwell's FP8 precision and decompression engines for 4x higher throughput vs. FP16 on H100. By fusing several training operations together, we significantly reduce bottlenecks in training to gain performance advantages. cuDNN's FP8 FlashAttention matches FA3's FP16 performance while using 50% less memory.
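In PyTorch, fused attention backends such as FlashAttention and cuDNN are reached through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to the fastest kernel available for the device and dtype. The sketch below shows the call shape only; it falls back to the math implementation on CPU, and the tensor sizes are arbitrary examples.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: scaled_dot_product_attention picks a fused attention
# kernel (FlashAttention / cuDNN) when one is available for the current
# device and dtype, otherwise it falls back to a math implementation.
# Tensor layout: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Causal masking is fused into the kernel rather than materialized
# as a full (seq_len, seq_len) attention mask.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```

Because the dispatch is automatic, training code written against this API benefits from new fused kernels (such as Blackwell-optimized cuDNN attention) without source changes.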
Computation / Graph Optimization
Another key optimization technique we leverage is enabling torch.compile for graph-level performance improvements: it compiles PyTorch models into optimized NVIDIA CUDA graphs, reducing Python overhead and kernel-launch latency. We work closely with customers, examining their end-to-end profiles, identifying critical segments, and adjusting the model to make the best use of torch.compile.
Parallelism Optimization
By tuning Distributed Data Parallel (DDP) settings and overlapping device-to-device (D2D) copies, we overlap gradient synchronization with backward passes. We leverage CUDA streams to overlap D2D transfers (e.g., GPU-to-GPU sharding) with computation. Combined with reductions in communication overhead and logging costs, overall system throughput improves.
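The DDP side of this tuning can be sketched with two of PyTorch's knobs: `bucket_cap_mb`, which sets the gradient-bucket size used to overlap all-reduce communication with the backward pass, and `gradient_as_bucket_view`, which avoids an extra gradient copy. The single-process gloo setup, tiny model, and bucket size below are placeholders so the sketch runs on a laptop; real jobs use NCCL across many GPUs with tuned values.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative single-process setup so the sketch is runnable on CPU;
# multi-node training would use the NCCL backend and a real rendezvous.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(
    torch.nn.Linear(32, 32),
    bucket_cap_mb=25,             # bucket size: tune for comm/compute overlap
    gradient_as_bucket_view=True,  # gradients alias comm buckets (no copy)
)

# During backward, DDP launches all-reduce on each bucket as soon as its
# gradients are ready, overlapping synchronization with remaining compute.
loss = model(torch.randn(8, 32)).sum()
loss.backward()
print(model.module.weight.grad is not None)  # True

dist.destroy_process_group()
```

Smaller buckets start communication earlier but add launch overhead; larger buckets amortize overhead but delay overlap, which is why the right value is workload-dependent and found by profiling.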
Get started with Together GPU Clusters accelerated by the NVIDIA Blackwell platform
Together AI was recently recognized as a ClusterMAX™ Gold provider by SemiAnalysis, a leading independent research and analysis company specializing in the Semiconductor and AI industries.
Outside of strong GPU price-performance, Together AI shines in its overall GPU Cluster offering:
- Infrastructure and Security
  - Strong, one-click Slurm and Kubernetes support
  - AI-native storage options, such as VAST and WEKA
  - Industry-leading reliability commitments to customers
- Technical Expertise and Support
  - Deep research expertise on GPU performance
  - Strong technical collaboration with NVIDIA
- Business Model
  - Flexible consumption models, including new self-service Instant GPU Clusters
  - GPU availability across current and next-gen hardware needs
If you are interested in a free test drive of Together GPU Clusters accelerated by the NVIDIA Blackwell platform, please contact us. And if you'd like to try our new Instant GPU Clusters, with self-service provisioning, please request access at together.ai/instant.