DeepSeek

Deploy DeepSeek V3 and R1 on Together AI: transparent reasoning, frontier performance, and up to 90% cost savings vs. closed-source alternatives.

Why DeepSeek on Together AI?

Designed for production workloads that need consistent performance and operational control.

Frontier performance at a fraction of the cost

DeepSeek’s mixture-of-experts (MoE) architecture delivers GPT-4-class performance at roughly one-tenth the price, and Together AI’s infrastructure brings 70–90% cost savings over closed-source alternatives.

Transparent reasoning, no black boxes

DeepSeek R1 exposes its complete chain-of-thought in <think> tags. Debug, verify, and trust your model’s reasoning process at every step.
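
As a concrete illustration, here is a minimal sketch (standard-library Python) of separating the reasoning trace from the final answer; the sample output string is made up, and the parsing assumes the <think> block precedes the answer:

    import re

    def split_reasoning(text: str) -> tuple[str, str]:
        # R1 emits its chain-of-thought inside <think>...</think>,
        # followed by the user-facing answer.
        match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
        if match is None:
            return "", text.strip()
        return match.group(1).strip(), text[match.end():].strip()

    # Made-up R1 output for illustration.
    reasoning, answer = split_reasoning(
        "<think>2 + 2 is basic arithmetic: 4.</think>The answer is 4."
    )
    print(reasoning)  # 2 + 2 is basic arithmetic: 4.
    print(answer)     # The answer is 4.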

Enterprise-ready from day one

SOC 2 Type II certified, HIPAA compliant, and deployed on US-based infrastructure. Full model ownership with no data retention by default.

Meet the DeepSeek family

Explore DeepSeek’s top-performing chat, code, and reasoning models.

  • DeepSeek-V3.2-Exp (Code, New)

  • DeepSeek-V3.1 (Chat, New)

  • DeepSeek-R1-0528 Throughput (Chat)

  • DeepSeek-V3-0324 (Chat)

  • DeepSeek-R1-0528 (Chat)

  • DeepSeek R1 Distilled Llama 70B (Chat)

  • DeepSeek R1 Distilled Llama 70B Free (Chat)

  • DeepSeek R1 Distilled Qwen 14B (Chat)

Deployment options

Choose a deployment option based on your latency needs, traffic patterns, and how much infrastructure control you want.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume.

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads
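
A first serverless call takes only a few lines with the Together Python SDK (pip install together, with TOGETHER_API_KEY set in your environment). The model ID below is an assumption, so confirm the exact string on the model page:

    from together import Together

    client = Together()  # reads TOGETHER_API_KEY from the environment

    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",  # assumed ID; confirm on the model page
        messages=[
            {"role": "user", "content": "In two sentences, what is a mixture-of-experts model?"}
        ],
    )
    print(response.choices[0].message.content)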

Batch

Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% less cost.

Best for

Classifying large datasets

Offline summarization

Synthetic data generation
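
Batch jobs are submitted as a JSONL file in which each line is one self-contained request. Here is a minimal sketch of building that file; the field names follow the common OpenAI-style batch schema and the model ID is an assumption, so confirm both against Together’s Batch API docs before uploading:

    import json

    documents = ["First article text...", "Second article text..."]

    with open("summarize_batch.jsonl", "w") as f:
        for i, doc in enumerate(documents):
            request = {
                # custom_id matches each result back to its input
                # when the batch completes.
                "custom_id": f"doc-{i}",
                "body": {
                    "model": "deepseek-ai/DeepSeek-V3",  # assumed model ID
                    "messages": [
                        {"role": "user", "content": f"Summarize:\n\n{doc}"}
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")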

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads
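
Latency-sensitive applications usually stream tokens as they are generated. A sketch of streaming from a dedicated endpoint, where "acme/deepseek-r1-dedicated" is a placeholder for the model ID shown on your endpoint’s dashboard:

    from together import Together

    client = Together()

    # Placeholder endpoint ID; dedicated endpoints are called through
    # the same chat completions API as serverless models.
    stream = client.chat.completions.create(
        model="acme/deepseek-r1-dedicated",
        messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)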

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines