
Together Evaluations

The fastest way to know if your model is good enough for production.

Confidently Evaluate Your Model—Before You Ship

Together makes it easy to understand how well your model performs—no manual evals, no spreadsheets, no guesswork.

  • Evaluate the Best LLMs Today

    Test serverless models today. Support for custom models and commercial APIs coming soon.

  • Built for Developers

    Run evaluations via the UI or the Evaluations API. No complex pipelines or infra required.

  • Ship Faster with Confidence

    Validate improvements, catch regressions, and confidently push the next model to production.

LLM-as-a-Judge at Your Fingertips

Evaluate responses with LLMs as judges—faster, more consistent, and more scalable than manual review.

Choose from a wide range of judge models, including the latest top-tier LLMs, to match your quality bar and use case. Teams can also use benchmark datasets to test against standardized tasks or real-world scenarios.

Together gives you full control over how evaluations are run—whether you prioritize automation, accuracy, or alignment.
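
To make this concrete, a judge is typically configured with a short rubric-style system prompt. The template below is a hypothetical sketch; the wording, scoring scale, and {{...}} placeholder names are illustrative assumptions, not a fixed Together format.

# Hypothetical rubric-style prompt for an LLM judge. The rubric wording and
# the {{...}} placeholders are illustrative assumptions, not a fixed Together format.
summarization_judge_template = """You are an impartial judge comparing two summaries of the same source text.
Score each summary from 1 to 5 for faithfulness and conciseness, then state which summary is better overall.

Source text:
{{text}}

Summary A:
{{response_a}}

Summary B:
{{response_b}}
"""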

Evaluate the Best Serverless Models

Evaluate a hand-picked collection of the best open-source LLMs, with coverage growing fast toward every model that runs on Together AI.

Use your own prompts, logs, or benchmarks to compare models like Kimi K2 and Llama 4 Maverick with fast, confident results.
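
For example, an evaluation set can be a plain JSONL file with one prompt per line. The file name and the "text" field below are illustrative assumptions; the field simply needs to match the {{text}} placeholder used in your prompt templates.

import json

# Hypothetical evaluation set: one JSON object per line, with a "text" field
# that maps to the {{text}} placeholder in the model and judge templates.
examples = [
    {"text": "The city council approved the new transit budget on Tuesday..."},
    {"text": "Researchers released a dataset of 10,000 annotated legal contracts..."},
]

with open("summarization_eval.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")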

The Evaluations API, Made for Builders

Integrate evaluations directly into your workflow with Together’s intuitive API.

The Evaluations API makes it easy to programmatically test models and prompts—no complex setup required. Send your own data, get structured results, and automate evaluations as part of your CI pipeline.

Simple to start, powerful at scale—so you can focus on shipping better models, faster.

# Example system prompt for the summarization task
# (assumed here; replace with your own task instructions)
summarization_generation_template = (
    "You are a helpful assistant. Summarize the provided text in two to three sentences."
)

# Candidate models to compare
MODEL_A_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
MODEL_B_NAME = "Qwen/Qwen2.5-72B-Instruct-Turbo"

# Model that judges the two candidates' outputs
JUDGE_MODEL_NAME = "deepseek-ai/DeepSeek-V3"

# Generation configuration for each candidate model;
# {{text}} is replaced with each row of the evaluation dataset
model_a_config = {
    "model_name": MODEL_A_NAME,
    "system_template": summarization_generation_template,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}

model_b_config = {
    "model_name": MODEL_B_NAME,
    "system_template": summarization_generation_template,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}
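
With the configurations above, a comparison run can then be submitted programmatically. The sketch below is a minimal, hypothetical request: the endpoint path, payload fields, and response shape are assumptions for illustration, so check the Evaluations API reference for the exact schema.

import os
import requests

# Hypothetical request sketch: the endpoint path, payload fields, and response
# shape are illustrative assumptions, not the confirmed Evaluations API schema.
# In practice the JSONL dataset is typically uploaded first and referenced by ID.
response = requests.post(
    "https://api.together.xyz/v1/evaluation",      # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "type": "compare",                         # A/B comparison scored by an LLM judge
        "judge_model_name": JUDGE_MODEL_NAME,
        "model_a": model_a_config,
        "model_b": model_b_config,
        "input_data_file_id": "file-abc123",       # placeholder ID for the uploaded dataset
    },
)
print(response.json())  # e.g. per-example judgments and an aggregate win rate (assumed shape)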

Scalable Evals, Powered by our Batch API

Together Evaluations runs on our high-throughput Batch API—so every comparison, judgment, and output comes back quickly and cost-efficiently.

By processing evaluations in batch, you get results fast—typically at 50% lower cost than real-time inference—without sacrificing quality or scale.

Perfect for large experiments, frequent model updates, and production-grade workflows.