
200+ generative AI models

Build with open-source and specialized multimodal models for chat, images, code, and more. Migrate from closed models with OpenAI-compatible APIs.
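Because the API is OpenAI-compatible, migrating often amounts to swapping the base URL and API key. A minimal sketch using the official openai Python SDK; the model name here is just an illustrative pick from the library:

import os
from openai import OpenAI

# Point the standard OpenAI client at Together's OpenAI-compatible API.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative model choice
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)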


Inference that is fast and simple, and that scales as you grow.

import os, requests

url = "https://api.together.xyz/v1/chat/completions"

payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    # The chat completions endpoint requires a messages array.
    "messages": [
        {"role": "user", "content": "What are the key benefits of open-source models?"}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "repetition_penalty": 1
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    # Read the key from the environment rather than hard-coding it.
    "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"
}

response = requests.post(url, json=payload, headers=headers)

print(response.text)

#  Sign up to get your API key here: https://api.together.ai/
#  Documentation for API usage: https://docs.together.ai/
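For interactive use, the same endpoint can stream tokens as they are generated. A minimal sketch, assuming the OpenAI-style server-sent-events format (lines of the form data: {...}, terminated by data: [DONE]):

import os, json, requests

url = "https://api.together.xyz/v1/chat/completions"
payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
    "max_tokens": 512,
    "stream": True,  # ask the server to stream tokens as SSE
}
headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

with requests.post(url, json=payload, headers=headers, stream=True) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        delta = json.loads(chunk)["choices"][0]["delta"]
        print(delta.get("content") or "", end="", flush=True)

When traffic is steady and latency-sensitive, the same models can instead run on dedicated GPU endpoints, created from the CLI with explicit autoscaling bounds: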
together endpoints create \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--gpu h100 \
--min-replicas 1 \
--max-replicas 8 \
--gpu-count 2 \
--display-name "My Endpoint" \
--wait

Powered by the Together Inference Stack

Built by AI researchers for AI innovators, the Together Inference Stack gives you the fastest NVIDIA GPUs running our proprietary Inference Engine, optimized by the Together Kernel Collection and customized to your traffic.

  • Fast Inference Engine

    Running on the latest NVIDIA GPUs with custom optimizations, our Inference Engine offers inference that's 4x faster than vLLM.

  • Together Kernel Collection

    The Together Kernel Collection, from our Chief Scientist and FlashAttention creator Tri Dao, provides up to 10% faster training and 75% faster inference.

  • Customized to Your Traffic Profile

    Together AI’s Research team will fine-tune and optimize your deployment using our proprietary optimizations such as custom speculators (a conceptual sketch of speculative decoding follows this list).
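Speculative decoding, the idea behind those speculators, uses a small draft model to propose several tokens cheaply, then has the large target model verify them in a single forward pass, keeping only the prefix it agrees with. A toy sketch of the greedy accept/reject loop; draft_propose and target_verify are hypothetical stand-ins for model calls, not Together APIs:

# Toy illustration of greedy speculative decoding.
# draft_propose and target_verify are hypothetical placeholders.
def speculative_decode(prompt_ids, max_new_tokens, k=4):
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        draft = draft_propose(tokens, k)         # k cheap draft tokens
        verified = target_verify(tokens, draft)  # one target-model pass
        # Keep the longest prefix where the target agrees with the draft.
        accepted = 0
        for d, v in zip(draft, verified):
            if d != v:
                break
            accepted += 1
        tokens.extend(draft[:accepted])
        if accepted < len(draft):
            tokens.append(verified[accepted])    # target's correction
    return tokens

When the draft model agrees with the target most of the time, several tokens are accepted per large-model pass, which is where the latency win comes from.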

""Together AI offers optimized performance at scale, and at a lower cost than closed-source providers – all while maintaining strict privacy standards."

- Vineet Khosla, CTO, The Washington Post

Performance

You get more tokens per second, higher throughput, and lower time to first token. These efficiencies let us offer you compute at a lower cost.

SPEED RELATIVE TO vLLM

4x FASTER

LLAMA 3.3 70B DEPLOYED ON 2x H100

6,100 TOKENS/SEC

COST RELATIVE TO GPT-4o

11x LOWER COST

Enterprise-grade security and data privacy

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.

Learn more

Flexible deployment options

  • Together Cloud

    • Get started quickly with fully managed serverless endpoints with pay-per-token pricing
    • Dedicated GPU endpoints with autoscaling for consistent performance
  • Your Cloud

    • Dedicated serverless deployments in your cloud provider
    • VPC deployment available for additional security
    • Use your existing cloud spend
  • Together GPU Clusters

    • For large-scale inference workloads or foundation model training
    • NVIDIA H100 and H200 clusters with InfiniBand and NVLink
    • Available with Together Training and Inference Engines for up to 25% faster training and 75% faster inference than PyTorch

Run inference with the best price-performance at scale

Explore our model library