Together Dedicated Endpoints
Deploy models on custom hardware with unmatched price-performance at scale.
Performance and reliability for production scale
Together Dedicated Endpoints allow you to customize your own dedicated GPU instance to reliably deploy models and achieve cost savings at scale.
Get full control of your deployment
Customize your single-tenant deployment powered by the latest NVIDIA GPU hardware and optimized by innovations like speculative decoding.
Handle traffic spikes seamlessly
Thanks to our flexible vertical and horizontal scaling options, your deployment will always meet traffic demands, even during spikes.
Achieve cost savings at scale
Leverage per-minute billing to achieve economies of scale for workloads generally above 130,000 tokens/minute, as the rough comparison below illustrates.
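For a back-of-the-envelope sense of the break-even point, here is a small Python sketch; the H100 hourly rate comes from the hardware specs further down this page, while the serverless per-token rate is an assumed figure for illustration only:

```python
# Back-of-the-envelope: when does a dedicated GPU beat per-token billing?
# The H100 rate is from the hardware specs on this page; the serverless
# per-token rate is an assumed figure for illustration only.
gpu_hourly = 3.36                # $/hour, HGX H100-80GB SXM (from specs)
per_million_tokens = 0.88       # $/1M tokens, assumed serverless rate
tokens_per_minute = 130_000     # workload volume cited above

serverless_hourly = tokens_per_minute * 60 / 1e6 * per_million_tokens
print(f"serverless ~${serverless_hourly:.2f}/hour vs dedicated ${gpu_hourly:.2f}/hour")
# ~$6.86/hour in per-token charges vs $3.36/hour for a dedicated H100,
# so sustained volumes at this level favor the dedicated endpoint.
```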

"Together AI’s Dedicated Endpoints give us precise control over latency, throughput, and concurrency—allowing us to serve more than 10 million monthly active users with BLACKBOX AI autonomous coding agents. The flexibility of autoscaling, combined with exceptional engineering support, has been crucial in accelerating our growth."
- Robert Rizk, Co-Founder and CEO of BLACKBOX AI
Available now for the top open-source models
DeepSeek R1 (Chat)
Open-source reasoning model rivaling OpenAI o1, excelling in math, code, reasoning, and cost efficiency.

Llama 3.3 70B (Chat)
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction-tuned generative model in 70B (text in/text out). The Llama 3.3 instruction-tuned, text-only model is optimized for multilingual dialogue use cases and outperforms many of the available open-source and closed chat models on common industry benchmarks.

Qwen2.5 72B (Chat)
Powerful decoder-only model developed by Alibaba Cloud's Qwen team for advanced language processing, available in 7B and 72B variants.

Llama 3.1 8B (Chat)
The Meta Llama 3.1 collection of multilingual large language models (LLMs) comprises pre-trained and instruction-tuned generative models in 8B, 70B, and 405B sizes that outperform many available open-source and closed chat models on common industry benchmarks.

Mixtral 8x7B Instruct v0.1 (Chat)
The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture-of-Experts model.

Llama Guard 3 8B (Language)
8B Llama 3.1 model fine-tuned for content safety, moderating prompts and responses in 8 languages with MLCommons alignment.

Mixtral-8x22B Instruct v0.1 (Chat)
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of Mixtral-8x22B-v0.1.

Llama 3.1 70B (Chat)
The Meta Llama 3.1 collection of multilingual large language models (LLMs) comprises pre-trained and instruction-tuned generative models in 8B, 70B, and 405B sizes that outperform many available open-source and closed chat models on common industry benchmarks.
Pick the deployment that fits your needs
Together AI offers the most comprehensive deployment options for inference, ensuring you get the right balance of flexibility, performance, and cost-efficiency.
Serverless Endpoints
The simplest way to run inference.
✔ API ready for 200+ models.
✔ Simplest setup.
✔ Highest flexibility.
✔ Pay-per-token pricing.
On-Demand Dedicated
Configurable dedicated GPU instances.
✔ Guaranteed performance (single-tenant).
✔ Support for custom models.
✔ Full control & customizability.
✔ Pay for GPU runtime.
Monthly Reserved
Reserved GPUs with discounts at scale.
✔ Reserved capacity for 1+ months.
✔ Fully custom setup.
✔ Secure enterprise deployments.
✔ Discounted upfront payment.
Configure and deploy in seconds with our API
Use our API and CLI to set up, deploy and manage your dedicated endpoints without worrying about the GPU infrastructure.
Access top open-source models
Use our CLI to quickly get a list of the available open-source models you can deploy on a dedicated endpoint.
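For illustration, a minimal Python sketch against the public REST API; /v1/models is Together's model-listing route, but treat the exact response shape as something to verify in the API reference:

```python
import os
import requests

# List the models available on Together (GET /v1/models).
# Assumes TOGETHER_API_KEY is set in your environment.
resp = requests.get(
    "https://api.together.xyz/v1/models",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
)
resp.raise_for_status()
for model in resp.json():
    print(model["id"])
```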
Select from a wide range of the top-performing NVIDIA GPUs
Get a comprehensive list of all available GPU configurations for any given model. Select from leading NVIDIA GPUs including H200, H100, and many more.
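A minimal sketch of that lookup, assuming a /v1/hardware route that accepts a model filter; confirm the exact path, parameter, and response shape against the dedicated-endpoints API reference:

```python
import os
import requests

# List hardware configurations available for a given model.
# The /v1/hardware route and its `model` filter are assumptions here --
# verify both in the dedicated-endpoints API reference.
resp = requests.get(
    "https://api.together.xyz/v1/hardware",
    params={"model": "meta-llama/Llama-3.3-70B-Instruct-Turbo"},
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
)
resp.raise_for_status()
for hw in resp.json().get("data", []):
    print(hw.get("id"))
```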
Configure the endpoint for your needs
Customize your endpoint with options that give you full control over your deployment: ensure consistent capacity by setting the GPU count, define replica ranges for automatic horizontal scaling, and enable or disable optimizations, as in the sketch below.
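A hedged sketch of endpoint creation, assuming a /v1/endpoints route with autoscaling fields along the lines of the public API reference; the display name and hardware ID below are hypothetical placeholders:

```python
import os
import requests

# Create a dedicated endpoint: model + hardware + autoscaling range.
# Field names are assumptions based on the public endpoints API --
# verify them (and the hardware ID format) in the API reference.
payload = {
    "display_name": "my-llama-endpoint",                  # hypothetical
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "hardware": "2x_nvidia_h100_80gb_sxm",                # hypothetical ID
    "autoscaling": {"min_replicas": 1, "max_replicas": 3},
}
resp = requests.post(
    "https://api.together.xyz/v1/endpoints",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
)
resp.raise_for_status()
print(resp.json())
```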
Get guaranteed performance with our highly optimized hardware
- 01
Leverage cutting-edge GPUs
Build your custom endpoint by picking from a wide selection of powerful NVIDIA GPU hardware (including H200 and H100) and tailor the deployment to match your needs.
- 02
Customize your deployment
Configure a deployment that fits your needs by setting the number of GPUs, defining replicas for horizontal scaling, and enabling or disabling performance optimizations.
- 03
Enable speculative decoding
Speculative decoding uses a small draft model to propose several tokens ahead, which the main model then verifies in a single pass, typically increasing throughput without changing output quality.
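To make the idea concrete, here is a toy draft-then-verify loop; this is a conceptual sketch with stand-in token functions, not Together's implementation:

```python
def draft_next(seq):
    # Toy stand-in for a small, cheap draft model.
    return (seq[-1] + 1) % 10

def target_next(seq):
    # Toy stand-in for the large target model (disagrees occasionally).
    return (seq[-1] + 2) % 10 if seq[-1] % 5 == 0 else (seq[-1] + 1) % 10

def speculative_decode(prompt, n_tokens=10, k=4):
    """Greedy draft-then-verify: the draft model proposes k tokens, the
    target model checks them; accepted runs let the target verify up to
    k tokens per pass instead of generating one at a time."""
    seq = list(prompt)
    produced = 0
    while produced < n_tokens:
        # Draft phase: cheaply propose up to k tokens ahead.
        ctx, proposal = list(seq), []
        for _ in range(min(k, n_tokens - produced)):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # Verify phase: accept while the target agrees; on the first
        # mismatch, emit the target's own token and restart drafting.
        for tok in proposal:
            expected = target_next(seq)
            seq.append(expected)
            produced += 1
            if tok != expected:
                break
    return seq

print(speculative_decode([0]))
```

Because the target model verifies every accepted token, the output is identical to decoding with the target model alone; only the latency profile changes.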
Hardware specs
- 01
HGX H200-141GB SXM
- NVIDIA HGX H200 architecture
- 141GB of HBM3 VRAM
- GPU link: SXM5 NVLink
- Starting at $4.99/hour
- 02
HGX H100-80GB SXM
- NVIDIA HGX H100 architecture
- 80GB of VRAM
- GPU link: SXM5 NVLink
- Starting at $3.36/hour
- 03
A100-80GB SXM
- NVIDIA A100 architecture
- 80GB of VRAM
- GPU link: SXM4 NVLink
- Starting at $2.59/hour
- 04
A100-80GB PCIE
- NVIDIA A100 architecture
- 80GB of VRAM
- GPU link: PCIE
- Starting at $2.40/hour
- 05
A100-40GB SXM
- NVIDIA A100 architecture
- 40GB of VRAM
- GPU link: SXM4 NVLink
- Starting at $2.40/hour
- 06
L40-48GB PCIE
- NVIDIA L40 architecture
- 48GB of VRAM
- GPU link: PCIE
- Starting at $1.49/hour