Deploy DeepSeek-R1 at scale: Fast, secure serverless APIs and large-scale Together Reasoning Clusters
Since the launch of DeepSeek-R1, a surge of companies has approached us about integrating reasoning models into their production applications. To support this demand, we're expanding our infrastructure beyond our ultra-fast Together Serverless API for DeepSeek-R1 (2x faster than any other API on the market) with Together Reasoning Clusters: dedicated, high-performance compute designed for large-scale, low-latency inference of reasoning models.
Companies now have multiple options to run DeepSeek-R1 in production on Together AI, each optimized for performance, cost, and scalability:
- Together Serverless APIs – The fastest DeepSeek-R1 671B endpoint, delivering low-latency, production-grade inference with seamless scalability.
- Together Reasoning Clusters – Dedicated GPU infrastructure for high-throughput, low-latency inference, optimized for variable, token-heavy reasoning workloads. Our Reasoning Clusters can achieve up to 110 tokens per second decoding speed.
Whether you're just starting out with reasoning models, rapidly scaling a new application, or running mission-critical reasoning workloads, Together AI provides best-in-class performance at every stage.
{{custom-cta-1}}
The fastest serverless API for DeepSeek-R1
Speed is critical for reasoning models, whether you're building responsive user experiences or multi-step agentic workflows where latency compounds across each step. Together AI offers the fastest serverless API for the full DeepSeek-R1 671B model, 2x faster than any other provider, and we also host the R1 distilled models (70B, 14B, 1.5B).
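To make that compounding concrete, here is a back-of-envelope sketch. The step count and tokens-per-step are hypothetical; 110 tokens/s is the dedicated-endpoint speed cited above, and 55 tokens/s stands in for a provider running at half that speed.

```python
# Back-of-envelope sketch of how decode speed compounds across an
# agentic workflow. STEPS and TOKENS_PER_STEP are hypothetical;
# 110 tokens/s is the dedicated-endpoint speed cited in this post.

STEPS = 8                # hypothetical number of chained model calls
TOKENS_PER_STEP = 1_500  # hypothetical reasoning tokens generated per step

for label, tokens_per_sec in [("110 tok/s", 110), ("2x slower, 55 tok/s", 55)]:
    total_seconds = STEPS * TOKENS_PER_STEP / tokens_per_sec
    print(f"{label}: ~{total_seconds:.0f}s of decode time end to end")

# 110 tok/s: ~109s of decode time end to end
# 2x slower, 55 tok/s: ~218s of decode time end to end
```

Halving decode speed doubles the end-to-end wait, and every additional agent step widens the gap.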

Running DeepSeek-R1 on Together Serverless offers:
- Fastest inference available: Our endpoint is the fastest for DeepSeek-R1 671B, outperforming all other providers, including the model creator’s own API.
- Instant scalability: No infrastructure management required. Our OpenAI-compatible APIs make it easy to integrate DeepSeek-R1 into your applications (see the Python sketch below).
- Flexible, pay-as-you-go pricing: Suitable for variable workloads with no long-term commitments.
- Higher rate limits: 9000 RPM on the scale tier, significantly higher than other serverless inference providers.
- Secure hosting and privacy controls: Models are hosted in Together AI’s data centers, ensuring full privacy with no data shared back to DeepSeek.
💡 Want to explore reasoning models? We also offer a free endpoint for DeepSeek-R1 70B distilled, so you can get started with no upfront cost.
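As a minimal sketch of that OpenAI compatibility: the snippet below assumes the standard `openai` Python package, Together's public base URL, and the `deepseek-ai/DeepSeek-R1` model identifier; check the Together model listing for the exact model string before relying on it.

```python
# Minimal sketch: calling DeepSeek-R1 on Together Serverless through
# the OpenAI-compatible chat completions API. Assumes the `openai`
# package is installed and TOGETHER_API_KEY is set in the environment.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
)

# Model string assumed from Together's catalog; verify before use. To
# experiment at no cost, a DeepSeek-R1-Distill-Llama-70B model ID can
# be substituted here for the free distilled endpoint mentioned above.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "How many primes are there below 100?"}],
    stream=True,  # stream tokens as they decode, useful for long reasoning traces
)

for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Streaming is worth enabling by default for reasoning models, since R1 often emits a long chain of thought before its final answer.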
Try the full DeepSeek-R1 model on Together Serverless today →
Together Reasoning Clusters: Dedicated GPU infrastructure optimized for reasoning models
For teams running large-scale DeepSeek-R1 workloads, Together Reasoning Clusters provide dedicated infrastructure optimized for speed, cost efficiency, and security. These clusters run our proprietary Together Inference stack, the same stack that powers our serverless infrastructure, which we've measured to be 2.5x faster than open-source engines like SGLang.
Key benefits of Together Reasoning Clusters:
- High-performance NVIDIA H200 GPUs: Optimized for token-heavy workloads, providing the throughput needed to process large and continuous requests efficiently.
- Consistent, low-latency performance: Powered by the Together Inference Engine and proprietary optimizations, including the Together Kernel Collection, custom-trained speculators, and full-stack tuning, our infrastructure delivers the fastest speeds for DeepSeek-R1, with dedicated endpoints approaching 110 tokens/second. Because the infrastructure is dedicated, there are no rate limits or resource sharing, so performance stays consistent even under variable token loads.
- Customized for your traffic profile: Our expert research team fine-tunes key parameters like batch sizes, prompt caching, and resource allocation to help balance throughput and latency based on your workload’s needs.
- Cost-effective scaling: For high-volume workloads, dedicated infrastructure is more cost-effective than token-based pricing. Together Reasoning Clusters provide predictable, contract-based pricing that helps control costs as you scale.
- Secure, isolated environments: Run workloads on dedicated infrastructure in North American data centers, ensuring privacy, compliance, and full control over your data.
- Enterprise support and SLAs: Benefit from 99.9% uptime guarantees and performance optimizations tailored for reasoning models in production.
Maximizing efficiency with fewer GPUs
The Together Inference Engine enables significantly higher throughput, both tokens per second and queries per second, than open-source inference engines like SGLang. We've observed that the same hardware running the Together Inference Engine processes 2.5-3x more tokens per second than when running SGLang, across a range of batch sizes. At 2.5x the per-GPU throughput, matching a given load requires only 1/2.5 = 40% of the GPUs, i.e. 60% fewer, reducing infrastructure costs while improving performance for high-volume inference workloads.
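A quick sketch of that sizing arithmetic; only the 2.5x speedup ratio comes from the measurements above, while the target throughput and per-GPU baseline are hypothetical placeholders.

```python
import math

# Hypothetical sizing example: only the 2.5x speedup ratio is from
# the measurements above; the absolute numbers are illustrative.
target_tokens_per_sec = 10_000  # aggregate decode throughput you need
baseline_per_gpu = 40           # tokens/s per GPU on an open-source engine (hypothetical)
together_per_gpu = baseline_per_gpu * 2.5  # 2.5x measured speedup

gpus_baseline = math.ceil(target_tokens_per_sec / baseline_per_gpu)  # 250
gpus_together = math.ceil(target_tokens_per_sec / together_per_gpu)  # 100

print(f"Open-source engine: {gpus_baseline} GPUs")
print(f"Together Inference Engine: {gpus_together} GPUs "
      f"({1 - gpus_together / gpus_baseline:.0%} fewer)")
```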
For companies requiring high-speed, dedicated infrastructure for DeepSeek-R1, Together Reasoning Clusters provide a scalable, cost-effective solution optimized for efficiency and performance.
Talk to us about deploying Together Reasoning Clusters →
Why choose Together Reasoning Clusters?
Reasoning Clusters are great for:
- Applications requiring consistent, low-latency performance to meet production demands.
- High-throughput workloads where predictable costs and performance are critical.
- Organizations managing sensitive data that need dedicated, compliant environments.
- Companies looking to optimize costs by moving from token-based pricing to contract-based infrastructure.
DeepSeek-R1 Together Reasoning Cluster Pricing
Pricing in $ per M tokens is based on 16k input tokens and 1k output tokens per query, at an output speed above 30 tokens/s.
Larger cluster sizes and longer contract commitments provide lower hourly rates, offering substantial savings for companies scaling their production workloads.
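As a rough illustration of how to compare hourly cluster pricing against per-token pricing: all dollar figures and throughputs in the sketch below are hypothetical placeholders, not Together AI's actual rates, since your effective rate depends on cluster size, contract length, and sustained utilization.

```python
# Hypothetical comparison of contract-based cluster pricing vs.
# token-based pricing. All figures below are illustrative
# placeholders, not Together AI's actual rates.
cluster_hourly_rate = 25.0        # $/hour for a dedicated cluster (hypothetical)
sustained_tokens_per_sec = 3_000  # aggregate throughput you actually sustain (hypothetical)

tokens_per_hour = sustained_tokens_per_sec * 3600
cluster_cost_per_m = cluster_hourly_rate / (tokens_per_hour / 1e6)

serverless_cost_per_m = 7.0       # blended $/M tokens on pay-as-you-go (hypothetical)

print(f"Cluster: ${cluster_cost_per_m:.2f} per M tokens at this utilization")
print(f"Serverless: ${serverless_cost_per_m:.2f} per M tokens")
# The higher and steadier your utilization, the more a fixed hourly
# rate wins over per-token pricing.
```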
Deploy DeepSeek-R1 in production with Together AI
Whether you need fast, reliable serverless inference or dedicated high-performance infrastructure for large-scale reasoning applications, Together AI provides the best platform for running DeepSeek-R1 securely in production.
Contact us to discuss pricing, setup, and custom configurations.
Interested in running DeepSeek-R1 in production?
Get in touch to discuss your best option to deploy reasoning models at scale, across serverless and dedicated reasoning clusters.