Inference

Published 10/28/2025

Dynamic AI agent testing for the real world with Collinear Simulations and Together Evals

TL;DR

Build measurably robust AI agents for real-world scenarios with Collinear Simulations and Together Evals API.

Building robust AI agents requires realistic evaluations that represent the real-world diversity of human interactions. 

Meet TraitMix, Collinear’s simulations product that enables teams to generate dynamic, persona-driven AI agent interactions you can plug into your workflows and evaluate seamlessly using Together Evals.

Why dynamic evals?

Most evaluations assume a static user: clear, polite, and consistent. Real users are not. They’re impatient, skeptical, emotional, incoherent, or even contradictory — and these behavioral shifts often break even the most capable models.

TraitMix closes that gap by making agentic testing continuous, compositional, and human-realistic, producing feedback-rich data (that can be looped into retraining) and enabling cross-model comparison on identical simulation seeds. With the Together Evals integration, the entire pipeline — from simulation to judgment — runs reproducibly and at scale. TraitMix enables dynamic evals where a user can:

  • Mix and compose user traits (like impatience, confusion, sarcasm, trust, curiosity)
  • Generate rich, multi-turn conversational data between AI agents and human-like simulators
  • Automatically judge agent interactions using open and reproducible Together Evals infrastructure

The result: high-diversity, high-fidelity data that exposes how your agents truly perform under human variability.

How it works

TraitMix uses a lightweight, model-agnostic method for representing controllable user traits in activation space. You can read more about it in the paper

Using TraitMix, you can:

  1. Select traits & personas: Define behavioral dimensions for simulated users (e.g., “impatient + confused” or “friendly + skeptical”).
  2. Choose domains: Run across verticals like support, retail, healthcare, finance, or open QA.
  3. Simulate dialogues: Generate realistic multi-turn conversations using frontier or custom models.
  4. Judge automatically: Leverage Together Evals’ standardized evaluation pipelines to score outcomes on helpfulness, safety, or factuality.

Every simulation run produces structured outputs that could be used beyond evaluations — ideal for RLHF data generation, regression testing, or failure mode discovery.

Collinear TraitMix (Simulations) API

Collinear Simulations API lets you generate dynamic conversations using concepts from mechanistic interpretability: define user intents, select persona traits (and optional attributes), and choose an AI agent model to get hundreds of long, mult-turn realistic conversations in minutes. The agent can be any model or agent (including the ones hosted on Together). The generated conversations can then be used for evaluations or even improving agents via post-training.

Together Evaluations API

Together Evaluations API lets you benchmark LLMs using an LLM-as-a-judge approach: define a task-specific rubric, choose a strong judge model, and run Evaluations to get aggregate metrics plus row-level rationales in minutes. It works with Together-hosted models as well as pre-generated outputs from any provider, including Collinear, so you can upload CSV or JSONL files and evaluate without re-running inference. This makes it easy to A/B test prompts and models, and to flexibly assess properties including helpfulness, accuracy, tone, and safety (via classify/score/compare). You can then feed these insights back into your workflows. In combination with the Collinear API, it also allows you to test your workflow’s behavior across different user traits.

Getting started

You can make your evaluations dynamic with three simple steps:

  1. Sign up for Collinear and Together AI to get your API keys.
  2. Plug your agent details in the cookbook config here.
  3. Run the cookbook and monitor them on your Together evals dashboard.

Developers, researchers, and enterprise teams can now finally close the loop between interaction, evaluation, and improvement, all within a single ecosystem. AI alignment doesn’t stop at good answers. It starts with good reactions.

With TraitMix, we’re giving every model the chance to meet its most human users, before the real ones do.