Summary
- Together AI, the AI Native Cloud, announced a full suite of capabilities for building real-time voice agents — co-located STT, LLM, and TTS on one cloud, eliminating inter-vendor network hops for end-to-end pipeline latency under 500ms
- Cartesia Sonic-3 (TTS) and Deepgram (STT) are now natively hosted on Together infrastructure, expanding model choice on the co-located stack
- One API, one billing surface, one deployment surface — with zero data retention, SOC 2 Type II, HIPAA, and dedicated data residency for enterprise deployments
Want to talk to an assistant instead of reading this blog? Call (847) 851-4323 and ask our assistant about models, pricing, or how to deploy a voice agent. Then interrupt it mid-sentence. The demo is designed for real-time turn-taking, with responses fast enough to feel conversational. That kind of experience is hard to deliver across fragmented voice stacks, and is exactly what Together AI is built to deliver at production scale.
Today, Together AI, the AI Native Cloud, launched a full suite of capabilities to deliver the industry’s first unified solution for building real-time voice agents on co-located STT, LLM, and TTS infrastructure. Together keeps the entire voice stack on one cloud, reducing latency, simplifying deployment, and improving reliability. We also added more choices for builders through native integrations with industry-leading voice models from Cartesia (TTS) and Deepgram (STT).
Production voice systems usually force teams to trade off between speed, model choice, and operational stability. Together’s unified solution is built to remove that tradeoff: low-latency infrastructure for live conversation, flexible model support across the voice stack, and a secure, production-ready platform that takes teams from evaluation to deployment without rebuilding across providers.
The architecture: Unified modular vs. multi-provider stacks
The demo you just called stays real time because it keeps the handoffs between STT, LLM, and TTS inside one Together AI cluster, served from regions that keep round-trip latency to the caller low. Every component (STT, LLM, and TTS) runs on dedicated endpoints with pre-warmed capacity, and end-to-end latency stays under 500ms, measured from the end of user speech to the first audio token.
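To make that measurement concrete, here is a minimal sketch of timing the gap between the end of user speech and the first audio a pipeline returns. The function name `ms_to_first_audio`, its parameters, and the idea of passing in the end-of-speech timestamp are assumptions made for illustration; this is not Together AI's instrumentation or API.

```python
import time
from typing import Callable, Iterable, Optional


def ms_to_first_audio(
    audio_chunks: Iterable[bytes],
    end_of_speech: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Milliseconds from end_of_speech until the first non-empty audio chunk.

    `audio_chunks` stands in for whatever streaming TTS output your pipeline
    yields (hypothetical), and `end_of_speech` is a timestamp taken from the
    same clock at the moment the caller stopped talking.
    """
    for chunk in audio_chunks:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - end_of_speech) * 1000.0
    return None  # the stream ended without producing audio
```

Taking the clock as a parameter keeps the helper easy to test with a fake clock. In a live agent you would record `end_of_speech` when your voice activity detection fires and pass the pipeline's audio stream as `audio_chunks`.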
Building a voice pipeline across multiple vendors usually means routing audio and text between separate STT, LLM, and TTS systems over the public internet. Each handoff adds latency, increases operational complexity, and makes the stack harder to manage in production. Together AI’s unified modular architecture is built to remove that overhead by co-locating STT, the LLM, and TTS within the same cluster and serving them from regions close to the end user.
That architecture matters for three reasons:
- Speed: Voice agents usually slow down when STT, the LLM, and TTS hand off audio and text across multiple cloud providers. Those inter-vendor network hops add latency at every stage of the pipeline and make real-time conversation harder to maintain. Together keeps those handoffs on local datacenter networking instead of the public internet, enabling end-to-end latency below 500 milliseconds and the responsiveness required for natural turn-taking.
- Flexibility: Model flexibility demands choice. Together hosts leading voice models for transcription, reasoning, and synthesis in a single place, giving teams the speed and control to configure the stack they need for each use case. And unlike opaque speech-to-speech systems, Together’s modular design preserves access to the intermediate transcript and response text, so teams can inspect, modify, and route data mid-stream as part of their own application logic.
- Reliability: Production demands reliability. Together’s zero-trust architecture is designed for sensitive voice data, while its streamlined end-to-end management experience removes the complexity of stitching together disparate solutions from multiple vendors. Running the full pipeline on one platform gives teams one API, one authentication layer, one billing surface, and unified metrics — reducing operational overhead, lowering latency, and making costs more predictable in production.
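The speed argument above is arithmetic: every handoff adds network time on top of model time. The sketch below sums a per-stage budget for two layouts. Every number in it is a hypothetical placeholder chosen for illustration, not a measurement of Together AI or of any vendor, and the `Stage` type and `end_to_end_ms` helper are invented for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Stage:
    name: str
    compute_ms: float  # time for the model to produce its first output
    hop_ms: float      # network time spent handing data to this stage


def end_to_end_ms(stages: Iterable[Stage]) -> float:
    """Total latency for one turn: model time plus handoff time per stage."""
    return sum(s.compute_ms + s.hop_ms for s in stages)


# Placeholder values only: same model time, different handoff cost.
multi_vendor = [
    Stage("stt", compute_ms=150, hop_ms=60),
    Stage("llm", compute_ms=200, hop_ms=60),
    Stage("tts", compute_ms=100, hop_ms=60),
]
co_located = [
    Stage("stt", compute_ms=150, hop_ms=2),
    Stage("llm", compute_ms=200, hop_ms=2),
    Stage("tts", compute_ms=100, hop_ms=2),
]

if __name__ == "__main__":
    print("multi-vendor:", end_to_end_ms(multi_vendor), "ms")
    print("co-located:  ", end_to_end_ms(co_located), "ms")
```

With identical model times, the only difference between the two totals is the handoff cost, which is the part of the budget that co-location targets. Substitute your own measured values to see where a given pipeline lands relative to a 500ms target.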
For enterprise deployments, the platform is built to support strict requirements, including zero data retention, SOC 2 Type II, HIPAA, and dedicated data residency options.

Model choice on one platform
Teams often patch together different providers for speed, expressive synthesis, or voice cloning, turning their infrastructure into a fragile multi-vendor patchwork. Together AI replaces this with a single, model-agnostic platform. Developers can configure the exact STT, LLM, and TTS stack they need, and swap models without rebuilding integrations.
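One way to picture a swappable, modular stack is a pipeline whose three stages are plain callables, with hooks on the intermediate transcript and response text. The sketch below is an assumption-laden illustration: `VoicePipeline`, its hook names, and the stub stages are hypothetical and do not represent the Together AI SDK or any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

Transcribe = Callable[[bytes], str]   # STT: audio in, text out
Respond = Callable[[str], str]        # LLM: transcript in, reply text out
Synthesize = Callable[[str], bytes]   # TTS: reply text in, audio out
TextHook = Callable[[str], str]       # inspect or rewrite intermediate text


def _identity(text: str) -> str:
    return text


@dataclass
class VoicePipeline:
    stt: Transcribe
    llm: Respond
    tts: Synthesize
    on_transcript: TextHook = _identity  # runs between STT and LLM
    on_response: TextHook = _identity    # runs between LLM and TTS

    def run_turn(self, audio: bytes) -> bytes:
        transcript = self.on_transcript(self.stt(audio))
        reply = self.on_response(self.llm(transcript))
        return self.tts(reply)


# Stub stages standing in for real models (hypothetical).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def redact_digits(text: str) -> str:
    """Example hook: mask digits before the transcript reaches the LLM."""
    return "".join("#" if ch.isdigit() else ch for ch in text)


pipeline = VoicePipeline(stt=fake_stt, llm=fake_llm, tts=fake_tts)
redacting = VoicePipeline(
    stt=fake_stt, llm=fake_llm, tts=fake_tts, on_transcript=redact_digits
)
```

Because each stage is only a callable with a fixed shape, swapping a model means passing a different function, and the rest of the application code stays unchanged. The hooks are where application logic such as redaction, routing, or logging would attach.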
Through one platform, teams can route audio and text through models like Whisper Large v3, Minimax Speech 2.6 Turbo, Rime Arcana, Kokoro, and the full Together LLM catalog. View our complete catalog of supported models in our STT and TTS docs.
To expand our model library, we are launching native integrations for Deepgram (STT) and Cartesia (TTS). Hosting these models directly on Together AI means teams get industry-leading transcription and synthesis while keeping every handoff securely inside the co-located stack.
Cartesia brings Sonic-3 and Sonic-2 to the platform, delivering expressive, ultra-low-latency TTS purpose-built for voice agents and production deployments.
“At Cartesia, we’re driven to push the limits of real-time, expressive voice AI. By working with Together AI, we’re excited to bring this technology to more developers building the next generation of voice applications.” - Arjun Desai, Co-Founder, Cartesia
Deepgram brings Nova-3, Nova-3 Multilingual (STT), Flux (conversational STT), and Aura-2 (TTS) to the platform, covering everything from real-time transcription to enterprise-grade voice synthesis.
"Voice agents live or die by latency, and every network hop between providers is a place where the experience breaks down. By hosting Deepgram's STT natively on Together AI's infrastructure, we're giving developers production-grade transcription without the tradeoff. Fast, accurate, and co-located with the rest of the pipeline." - Abe Pursell, VP of Partnerships, Deepgram
Voice agent in production: Decagon
Decagon runs concierge voice agents in production on this stack. Their agents handle complex enterprise customer support workflows — like resolving billing inquiries, managing account updates, and executing technical troubleshooting — at scale. In these environments, transcription accuracy, low latency, and infrastructure uptime are critical to maintaining a natural customer experience. Running the pipeline on Together AI provides the strict latency bounds required for fluid conversation, all while preserving Decagon's control over the intermediate transcript and response text.
Get started
→ Call the Demo - (847) 851-4323
→ Read the demo guide
→ Read TTS Documentation
→ Read STT Documentation
