Build real-time voice agents on Together AI

Summary

Together AI, the AI Native Cloud, announced a full suite of capabilities for building real-time voice agents — co-located STT, LLM, and TTS on one cloud, eliminating inter-vendor network hops for end-to-end pipeline latency under 500ms
Cartesia Sonic-3 (TTS) and Deepgram (STT) are now natively hosted on Together infrastructure, expanding model choice on the co-located stack
One API, one billing surface, one deployment surface — with zero data retention, SOC 2 Type II, HIPAA, and dedicated data residency for enterprise deployments

Want to talk to an assistant instead of reading this blog? Call (847) 851-4323 and ask our assistant about models, pricing, or how to deploy a voice agent. Then interrupt it mid-sentence. The demo is designed for real-time turn-taking, with responses fast enough to feel conversational. That kind of experience is hard to deliver across fragmented voice stacks, and is exactly what Together AI is built to deliver at production scale.

Today, Together AI, the AI Native Cloud, launched a full suite of capabilities to deliver the industry’s first unified solution for building real-time voice agents on co-located STT, LLM, and TTS infrastructure. Together keeps the entire voice stack on one cloud, reducing latency, simplifying deployment, and improving reliability. We also added more choices for builders through native integrations with industry-leading voice models from Cartesia (TTS) and Deepgram (STT).

Production voice systems usually force teams to trade off between speed, model choice, and operational stability. Together’s unified solution is built to remove that tradeoff: low-latency infrastructure for live conversation, flexible model support across the voice stack, and a secure, production-ready platform that takes teams from evaluation to deployment without rebuilding across providers.

The architecture: Unified modular vs. multi-provider stacks

The demo you just called stays real-time because it keeps the handoffs between STT, the LLM, and TTS inside one Together AI cluster, served from regions that keep round-trip latency to the caller low. Every component (STT, LLM, and TTS) runs on dedicated endpoints with pre-warmed capacity, and end-to-end latency, measured from the end of user speech to the first audio token, stays under 500ms.
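If you want to check a number like this against your own pipeline, the measurement itself is simple: record the moment the caller stops speaking, then time how long the first audio chunk takes to arrive. Here is a minimal sketch under stated assumptions. The `fake_pipeline` generator is a hypothetical stand-in for a streaming response, not a Together AI API, and the 50 ms sleep is simulated work rather than a measured value.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(
    audio_chunks: Iterable[bytes],
    speech_end: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, Iterator[bytes]]:
    """Return (latency_ms, stream), where latency runs from speech_end
    to the arrival of the first audio chunk.

    The returned stream still yields every chunk, including the first.
    """
    iterator = iter(audio_chunks)
    first = next(iterator)  # blocks until the pipeline emits audio
    latency_ms = (clock() - speech_end) * 1000.0

    def replay() -> Iterator[bytes]:
        yield first
        yield from iterator

    return latency_ms, replay()


def fake_pipeline() -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming STT -> LLM -> TTS response."""
    time.sleep(0.05)  # simulated processing before the first audio chunk
    yield b"\x00\x01"
    yield b"\x02\x03"


speech_end = time.perf_counter()  # moment the caller stops speaking
latency_ms, stream = time_to_first_audio(fake_pipeline(), speech_end)
chunks = list(stream)
print(f"time to first audio: {latency_ms:.0f} ms, chunks: {len(chunks)}")
```

In a real deployment, `speech_end` would come from your end-of-turn detection, and the chunks would come from whatever streaming client you use.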

Building a voice pipeline across multiple vendors usually means routing audio and text between separate STT, LLM, and TTS systems over the public internet. Each handoff adds latency, increases operational complexity, and makes the stack harder to manage in production. Together AI’s unified modular architecture is built to remove that overhead by co-locating STT, the LLM, and TTS within the same cluster and serving them from regions close to the end user.
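To see why the handoffs matter, it helps to write the latency budget down. The sketch below adds per-stage processing time to the network cost of each handoff. Every number in it is a hypothetical placeholder chosen for illustration, not a benchmark of Together AI or of any vendor, and the `Budget` class is our own illustrative helper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Budget:
    stage_ms: Dict[str, float]  # processing time per stage
    hop_ms: float               # network cost of one handoff
    hops: int                   # number of handoffs between stages

    def total_ms(self) -> float:
        """Sum of stage processing time plus all handoff costs."""
        return sum(self.stage_ms.values()) + self.hop_ms * self.hops


# Hypothetical numbers for illustration only, not measurements.
stages = {"stt": 120.0, "llm_first_token": 180.0, "tts_first_audio": 90.0}

# Two handoffs in both cases: STT -> LLM and LLM -> TTS.
multi_vendor = Budget(stages, hop_ms=40.0, hops=2)  # over the public internet
co_located = Budget(stages, hop_ms=1.0, hops=2)     # inside one cluster

print(f"multi-vendor: {multi_vendor.total_ms():.0f} ms")
print(f"co-located:   {co_located.total_ms():.0f} ms")
```

The stage times are identical in both cases, so the entire difference comes from the handoffs. That is the part of the budget co-location is meant to shrink.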

That architecture matters for three reasons:

Speed: Voice agents usually slow down when STT, the LLM, and TTS hand off audio and text across multiple cloud providers. Those inter-vendor network hops add latency at every stage of the pipeline and make real-time conversation harder to maintain. Together keeps those handoffs on local datacenter networking instead of the public internet, enabling end-to-end latency below 500 milliseconds and the responsiveness required for natural turn-taking.
Flexibility: Different use cases call for different models. Together hosts leading voice models for transcription, reasoning, and synthesis in a single place, giving teams the speed and control to configure the stack they need for each use case. And unlike opaque speech-to-speech systems, Together’s modular design preserves access to the intermediate transcript and response text, so teams can inspect, modify, and route data mid-stream as part of their own application logic.
Reliability: Production demands reliability. Together’s zero-trust architecture is designed for sensitive voice data, while its streamlined end-to-end management experience removes the complexity of stitching together disparate solutions from multiple vendors. Running the full pipeline on one platform gives teams one API, one authentication layer, one billing surface, and unified metrics — reducing operational overhead, lowering latency, and making costs more predictable in production.
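The flexibility point is easiest to see in code. Below is a minimal sketch of a modular pipeline with hooks on the intermediate transcript and response text. It assumes nothing about Together AI's actual SDK: `VoicePipeline`, the hook names, and the three stub stages are all hypothetical, and in practice each stage would call a hosted STT, LLM, or TTS model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

TextHook = Callable[[str], str]


@dataclass
class VoicePipeline:
    stt: Callable[[bytes], str]   # audio -> transcript
    llm: Callable[[str], str]     # transcript -> response text
    tts: Callable[[str], bytes]   # response text -> audio
    on_transcript: Optional[TextHook] = None  # inspect or edit the transcript
    on_response: Optional[TextHook] = None    # inspect or edit the response

    def run(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        if self.on_transcript:
            transcript = self.on_transcript(transcript)
        response = self.llm(transcript)
        if self.on_response:
            response = self.on_response(response)
        return self.tts(response)


def redact_digits(text: str) -> str:
    """Example hook: mask runs of four or more digits before the LLM sees them."""
    return re.sub(r"\d{4,}", "[redacted]", text)


# Stub stages so the sketch runs on its own. Real stages would call models.
pipeline = VoicePipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda transcript: f"You said: {transcript}",
    tts=lambda text: text.encode("utf-8"),
    on_transcript=redact_digits,
)

audio_out = pipeline.run(b"my account is 12345678")
print(audio_out.decode("utf-8"))
```

A speech-to-speech model gives you no place to put `redact_digits`. With a modular stack, the transcript and the response are ordinary strings your application can log, filter, or route.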

For enterprise deployments, the platform is built to support strict requirements, including zero data retention, SOC 2 Type II, HIPAA, and dedicated data residency options.

Model choice on one platform

Teams often patch together different providers for speed, expressive synthesis, or voice cloning, turning their infrastructure into a fragile multi-vendor patchwork. Together AI replaces this with a single, model-agnostic platform. Developers can configure the exact STT, LLM, and TTS stack they need, and swap models without rebuilding integrations.

Through one platform, teams can route audio and text through models like Whisper Large v3, MiniMax Speech 2.6 Turbo, Rime Arcana, Kokoro, and the full Together LLM catalog. View our complete catalog of supported models in our STT and TTS docs.
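One way to picture swapping models without rebuilding integrations is to treat the stack as configuration. The sketch below is illustrative only. `VoiceStackConfig` is a hypothetical structure, and the strings are display names from this post or placeholders, not verified API model identifiers. Check the STT and TTS docs for the real ones.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceStackConfig:
    stt_model: str
    llm_model: str
    tts_model: str


# Display names and placeholders only, not verified API identifiers.
baseline = VoiceStackConfig(
    stt_model="Whisper Large v3",
    llm_model="<llm-from-catalog>",
    tts_model="Kokoro",
)

# Swap only the TTS stage. The STT and LLM choices carry over unchanged.
expressive = replace(baseline, tts_model="Sonic-3")

print(baseline)
print(expressive)
```

Because the config is frozen, each variant is its own immutable value, which makes it straightforward to run two stacks side by side during evaluation.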

To expand our model library, we are launching native integrations for Deepgram (STT) and Cartesia (TTS). Hosting these models directly on Together AI means teams get industry-leading transcription and synthesis while keeping every handoff securely inside the co-located stack.

Cartesia brings Sonic-3 and Sonic-2 to the platform, delivering expressive, ultra-low-latency TTS purpose-built for voice agents and production deployments.

“At Cartesia, we’re driven to push the limits of real-time, expressive voice AI. By working with Together AI, we’re excited to bring this technology to more developers building the next generation of voice applications.” - Arjun Desai, Co-Founder, Cartesia

Deepgram brings Nova-3, Nova-3 Multilingual (STT), Flux (conversational STT), and Aura-2 (TTS) to the platform, covering everything from real-time transcription to enterprise-grade voice synthesis.

"Voice agents live or die by latency, and every network hop between providers is a place where the experience breaks down. By hosting Deepgram's STT natively on Together AI's infrastructure, we're giving developers production-grade transcription without the tradeoff. Fast, accurate, and co-located with the rest of the pipeline." - Abe Pursell, VP of Partnerships, Deepgram

Voice agent in production: Decagon

Decagon runs concierge voice agents in production on this stack. Their agents handle complex enterprise customer support workflows — like resolving billing inquiries, managing account updates, and executing technical troubleshooting — at scale. In these environments, transcription accuracy, low latency, and infrastructure uptime are critical to maintaining a natural customer experience. Running the pipeline on Together AI provides the strict latency bounds required for fluid conversation, all while preserving Decagon's control over the intermediate transcript and response text.

Get started

→ Call the Demo - (847) 851-4323

→ Read the demo guide

→ Read TTS Documentation

→ Read STT Documentation

→ Learn more