Salesforce, Zoom, InVideo Train Faster with Together AI Turbocharged with NVIDIA Blackwell
At NVIDIA GTC, we shared our bold plans to scale out thousands of NVIDIA Blackwell GPUs as an NVIDIA Cloud Partner, launch self-serve Instant GPU Clusters, and offer seamless deployment of NVIDIA NIM microservices from build.nvidia.com.
As we’ve brought NVIDIA Blackwell systems online over these past several weeks, we’ve invited a number of pioneering AI companies to take a free test drive of NVIDIA Blackwell on Together AI. Today, we’re excited to share what Zoom, Salesforce, and InVideo discovered when they tried NVIDIA Blackwell infrastructure, turbocharged by the Together Training Stack and Kernel Collection.
The Results
In collaboration with NVIDIA, we have worked hand in hand with customers to move them onto NVIDIA HGX™ B200, accelerating both training and inference workloads. Below, we highlight some of the results across AI-native companies and tech-forward enterprise customers. This work would not have been possible without Together AI's expertise in AI research and systems optimization, combined with NVIDIA's cutting-edge accelerated computing platform.
Salesforce: Bringing Agentforce to life
Salesforce leverages Together AI across the entire AI journey: from training to fine-tuning to inference of the models that deliver Agentforce. Salesforce Research is at the cutting edge of innovation in building agentic frameworks and was keen to experiment with NVIDIA Blackwell GPUs to accelerate its training pipelines.
Training a Mistral 24B model, Salesforce saw a 2x improvement in training speed when upgrading from NVIDIA HGX H200 to HGX B200. This enables Salesforce to rapidly train a variety of models and accelerate the integration of research results into Agentforce, enhancing product velocity.
Zoom: Accelerating the amazing Zoom AI Companion
1.2 million people use Zoom AI Companion, which offers AI-powered tools such as real-time transcription, meeting summaries, and phone call analysis. Zoom has partnered with Together AI to leverage our research and deliver accelerated performance when training the models powering various AI Companion features. Recently, they took it a step further by trying out Together GPU Clusters accelerated by NVIDIA HGX B200.
Out of the box, Zoom saw a 1.9x improvement in training speed over previous-generation NVIDIA Hopper GPUs. The teams look forward to taking it a step further by profiling for additional optimizations.
InVideo: Bringing ideas to life through video
InVideo has generated millions of videos, helping its users tell stories like never before through its generative video foundation model. Considering some of the intricacies of current software-stack support on NVIDIA Blackwell, the team was initially uncertain whether they would see the gains needed to justify the leap to the new architecture.
However, during initial tests with NVIDIA HGX B200, InVideo immediately saw a 25% improvement on a training job compared to NVIDIA HGX H200. Then, in partnership with our researchers, the team made further optimizations and more than doubled this improvement, making the step up to the NVIDIA Blackwell platform even more appealing. This level of performance gain is largely unheard of for modalities outside of text and speaks volumes about the expertise of the teams involved. We share some of those optimizations later in this blog.
The Together Training Stack
The Together AI research team has custom-built a training container that gives developers the best representation of the hardware’s capabilities and potential. This container features a co-optimized Llama 3 70B golden model, achieving state-of-the-art (SOTA) Model FLOPS Utilization (MFU).
The stack includes Together AI researchers’ tools for debugging and running diagnostics at scale across many nodes and thousands of processes. These tools deliver:
- Comprehensive MFU benchmarks at various levels (e.g., GEMM, ThunderKittens (TK)-based attention kernels)
- Full bandwidth benchmarking toolkits
- Collective communication diagnostics toolkits for performance analysis and debugging
Getting the most out of the hardware
Price-performance is widely considered the most important metric for GPU cloud infrastructure. Together AI specializes in delivering higher tokens/sec/node and overall MFU than other providers on the same hardware. This section covers some of the optimizations we've found on the NVIDIA Blackwell platform.
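To make the MFU metric concrete, here is a minimal sketch of how it can be estimated from training throughput. All numbers are hypothetical placeholders for illustration, not measurements or official hardware specs, and the 6-FLOPs-per-parameter-per-token rule is the standard approximation for dense transformers.

```python
# Illustrative Model FLOPS Utilization (MFU) calculation:
# achieved model FLOPS divided by the hardware's peak FLOPS.

def model_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPS per token for a dense transformer
    (forward + backward ~ 6 * parameter count)."""
    return 6.0 * n_params

def mfu(tokens_per_sec_per_gpu: float, n_params: float,
        peak_flops_per_gpu: float) -> float:
    """Fraction of the GPU's peak FLOPS the training run actually uses."""
    achieved = tokens_per_sec_per_gpu * model_flops_per_token(n_params)
    return achieved / peak_flops_per_gpu

# Hypothetical example: a 70B-parameter model at 800 tokens/sec/GPU on a
# GPU with 2e15 peak FLOPS (placeholder value, not a published spec).
print(round(mfu(800, 70e9, 2e15), 3))  # 0.168
```

Comparing this ratio across hardware generations and software stacks is what lets throughput gains be attributed to real utilization improvements rather than raw FLOPS alone.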
FlashAttention
FlashAttention-3 is a key optimization that speeds up attention computation in LLM training and inference. It relies on specific memory-access patterns and Tensor Core optimizations that are now supported in the latest version of NVIDIA cuDNN for the Blackwell architecture.
This support includes FP8 FlashAttention optimized for Blackwell, using Blackwell's FP8 precision and decompression engines for 4x higher throughput vs. FP16 on H100. By fusing several training operations together, we significantly reduce bottlenecks in training to gain performance advantages. cuDNN's FP8 FlashAttention matches FA3's FP16 performance while using 50% less memory.
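In PyTorch, fused attention backends such as FlashAttention and cuDNN are reached through `torch.nn.functional.scaled_dot_product_attention`, which dispatches to the fastest kernel available for the device and dtype. The sketch below shows the call shape only; it falls back to the math implementation on CPU, and the tensor sizes are arbitrary examples.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: scaled_dot_product_attention picks a fused attention
# kernel (FlashAttention / cuDNN) when one is available for the current
# device and dtype, otherwise it falls back to a math implementation.
# Tensor layout: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Causal masking is fused into the kernel rather than materialized
# as a full (seq_len, seq_len) attention mask.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```

Because the dispatch is automatic, training code written against this API benefits from new fused kernels (such as Blackwell-optimized cuDNN attention) without source changes.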
Computation / Graph Optimization
Another key optimization technique we leverage is enabling torch.compile for graph-level performance improvements: it compiles PyTorch models into optimized NVIDIA CUDA graphs, reducing Python overhead and kernel-launch latency. We work closely with customers, examining their end-to-end profiles, identifying critical segments, and adjusting the model to make the best use of torch.compile.
Parallelism Optimization
By tuning Distributed Data Parallel (DDP) settings and overlapping device-to-device (D2D) copies, we overlap gradient synchronization with backward passes. We leverage CUDA streams to overlap D2D transfers (e.g., GPU-to-GPU sharding) with computation. Combined with reductions in communication overhead and logging costs, overall system throughput improves.
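The DDP side of this tuning can be sketched with two of PyTorch's knobs: `bucket_cap_mb`, which sets the gradient-bucket size used to overlap all-reduce communication with the backward pass, and `gradient_as_bucket_view`, which avoids an extra gradient copy. The single-process gloo setup, tiny model, and bucket size below are placeholders so the sketch runs on a laptop; real jobs use NCCL across many GPUs with tuned values.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Illustrative single-process setup so the sketch is runnable on CPU;
# multi-node training would use the NCCL backend and a real rendezvous.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(
    torch.nn.Linear(32, 32),
    bucket_cap_mb=25,             # bucket size: tune for comm/compute overlap
    gradient_as_bucket_view=True,  # gradients alias comm buckets (no copy)
)

# During backward, DDP launches all-reduce on each bucket as soon as its
# gradients are ready, overlapping synchronization with remaining compute.
loss = model(torch.randn(8, 32)).sum()
loss.backward()
print(model.module.weight.grad is not None)  # True

dist.destroy_process_group()
```

Smaller buckets start communication earlier but add launch overhead; larger buckets amortize overhead but delay overlap, which is why the right value is workload-dependent and found by profiling.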
Get started with Together GPU Clusters accelerated by the NVIDIA Blackwell platform
Together AI was recently recognized as a ClusterMAX™ Gold provider by SemiAnalysis, a leading independent research and analysis company specializing in the Semiconductor and AI industries.
Outside of strong GPU price-performance, Together AI shines in its overall GPU Cluster offering:
- Infrastructure and Security
  - Strong, one-click Slurm and Kubernetes support
  - AI-native storage options, such as VAST and WEKA
  - Industry-leading reliability commitments to customers
- Technical Expertise and Support
  - Deep research expertise on GPU performance
  - Strong technical collaboration with NVIDIA
- Business Model
  - Flexible consumption models, including new self-service Instant GPU Clusters
  - GPU availability across current and next-gen hardware needs
If you are interested in a free test drive of Together GPU Clusters accelerated by the NVIDIA Blackwell platform, please contact us. And if you'd like to try our new Instant GPU Clusters, with self-service provisioning, please request access at together.ai/instant.