together.models
Build with 200+ open-source and specialized multimodal models for chat, images, code, and more.
Featured models
Run any model on the fastest endpoints
Use our API to deploy any open-source model on the fastest inference stack available, with optimal cost efficiency.
Scale into a dedicated deployment anytime, with a custom number of instances, to get optimal throughput.
RUN INFERENCE (cURL)
curl -X POST "https://api.together.xyz/v1/chat/completions" \
-H "Authorization: Bearer $TOGETHER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-Vision-Free",
"messages": [{"role": "user", "content": "What are some fun things to do in New York?"}]
}'
RUN INFERENCE (Python)
from together import Together

# Reads the TOGETHER_API_KEY environment variable by default
client = Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-Vision-Free",
    messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
)
print(response.choices[0].message.content)
RUN INFERENCE (TypeScript)
import Together from "together-ai";

const together = new Together({ apiKey: process.env.TOGETHER_API_KEY });

const response = await together.chat.completions.create({
  messages: [{ "role": "user", "content": "What are some fun things to do in New York?" }],
  model: "meta-llama/Llama-Vision-Free",
});
console.log(response.choices[0].message.content);
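All three examples read the reply from the same OpenAI-style chat-completion schema (`choices[0].message.content`). As a minimal sketch, here is how that field can be extracted from a raw JSON response body; the payload below is illustrative placeholder data, not real API output:

```python
import json

# Illustrative payload mirroring the chat-completion response schema
# used in the examples above; field values are placeholders.
raw = """
{
  "id": "example-id",
  "model": "meta-llama/Llama-Vision-Free",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Visit Central Park."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 5, "total_tokens": 17}
}
"""

data = json.loads(raw)

# The assistant's reply lives in the first choice's message
reply = data["choices"][0]["message"]["content"]
print(reply)
```

The same path works whether the response comes back from cURL, the Python SDK, or the TypeScript SDK, since all return this shared schema.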
- Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. MosaicML comparison based on published numbers in the MosaicML blog. Detailed results and methodology published here.
- Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference. Detailed results and methodology published here.
- Based on published pricing as of November 8th, 2023, comparing OpenAI GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless Endpoints. Assumes an equal number of input and output tokens.
- Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
- Testing methodology and results published in this research paper.
- Based on published pricing as of November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps InfiniBand networking configuration.