
Together Evaluations

The fastest way to know if your model is good enough for production.

Confidently Evaluate Your Model—Before You Ship

Together makes it easy to understand how well your model performs—no manual evals, no spreadsheets, no guesswork.

  • Evaluate the Best LLMs Today

    Test serverless models today. Support for custom models and commercial APIs coming soon.

  • Built for Developers

    Run evaluations via the UI or the Evaluations API. No complex pipelines or infra required.

  • Ship Faster with Confidence

    Validate improvements, catch regressions, and confidently push the next model to production.

LLM-as-a-Judge at Your Fingertips

Evaluate responses with LLMs as judges—faster, more consistent, and more scalable than manual review.

Choose from a wide range of judge models, including the latest top-tier LLMs, to match your quality bar and use case. Teams can also use benchmark datasets to test against standardized tasks or real-world scenarios.

Together gives you full control over how evaluations are run—whether you prioritize automation, accuracy, or alignment.
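
To make this concrete, a judge is typically configured with a short rubric-style system prompt. The template below is a hypothetical sketch; the wording, scoring scale, and {{...}} placeholder names are illustrative assumptions, not a fixed Together format.

# Hypothetical rubric-style prompt for an LLM judge. The rubric wording and
# the {{...}} placeholders are illustrative assumptions, not a fixed Together format.
summarization_judge_template = """You are an impartial judge comparing two summaries of the same source text.
Score each summary from 1 to 5 for faithfulness and conciseness, then state which summary is better overall.

Source text:
{{text}}

Summary A:
{{response_a}}

Summary B:
{{response_b}}
"""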

Evaluate the Best Serverless Models

Evaluate a hand-picked collection of the best open-source LLMs, with coverage growing fast toward every model that runs on Together AI.

Use your own prompts, logs, or benchmarks to compare models like Kimi K2 and Llama 4 Maverick with fast, confident results.
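
For example, an evaluation set can be a plain JSONL file with one prompt per line. The file name and the "text" field below are illustrative assumptions; the field simply needs to match the {{text}} placeholder used in your prompt templates.

import json

# Hypothetical evaluation set: one JSON object per line, with a "text" field
# that maps to the {{text}} placeholder in the model and judge templates.
examples = [
    {"text": "The city council approved the new transit budget on Tuesday..."},
    {"text": "Researchers released a dataset of 10,000 annotated legal contracts..."},
]

with open("summarization_eval.jsonl", "w") as f:
    for row in examples:
        f.write(json.dumps(row) + "\n")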

The Evaluations API, Made for Builders

Integrate evaluations directly into your workflow with Together’s intuitive API.

The Evaluations API makes it easy to programmatically test models and prompts—no complex setup required. Send your own data, get structured results, and automate evaluations as part of your CI pipeline.

Simple to start, powerful at scale—so you can focus on shipping better models, faster.

# Example system prompt for the summarization task
# (assumed here; replace with your own task instructions)
summarization_generation_template = (
    "You are a helpful assistant. Summarize the provided text in two to three sentences."
)

# Candidate models to compare
MODEL_A_NAME = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
MODEL_B_NAME = "Qwen/Qwen2.5-72B-Instruct-Turbo"

# Model that judges the two candidates' outputs
JUDGE_MODEL_NAME = "deepseek-ai/DeepSeek-V3"

# Generation configuration for each candidate model;
# {{text}} is replaced with each row of the evaluation dataset
model_a_config = {
    "model_name": MODEL_A_NAME,
    "system_template": summarization_generation_template,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}

model_b_config = {
    "model_name": MODEL_B_NAME,
    "system_template": summarization_generation_template,
    "input_template": "{{text}}",
    "max_tokens": 1024,
    "temperature": 0.5
}
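
With the configurations above, a comparison run can then be submitted programmatically. The sketch below is a minimal, hypothetical request: the endpoint path, payload fields, and response shape are assumptions for illustration, so check the Evaluations API reference for the exact schema.

import os
import requests

# Hypothetical request sketch: the endpoint path, payload fields, and response
# shape are illustrative assumptions, not the confirmed Evaluations API schema.
# In practice the JSONL dataset is typically uploaded first and referenced by ID.
response = requests.post(
    "https://api.together.xyz/v1/evaluation",      # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "type": "compare",                         # A/B comparison scored by an LLM judge
        "judge_model_name": JUDGE_MODEL_NAME,
        "model_a": model_a_config,
        "model_b": model_b_config,
        "input_data_file_id": "file-abc123",       # placeholder ID for the uploaded dataset
    },
)
print(response.json())  # e.g. per-example judgments and an aggregate win rate (assumed shape)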

Scalable Evals, Powered by our Batch API

Together Evaluations runs on our high-throughput Batch API—so every comparison, judgment, and output comes back quickly and cost-efficiently.

By processing evaluations in batch, you get results fast—typically at 50% lower cost than real-time inference—without sacrificing quality or scale.

Perfect for large experiments, frequent model updates, and production-grade workflows.