Scaling AI Companions: How Dippy AI Reached Over 4 Million Tokens/Minute with Together Dedicated Endpoints
In early 2025, over 4 million Dippy AI users were creating and publishing more than 200,000 unique AI characters and exchanging over 300 million messages with them using Dippy AI’s web and mobile apps. As their user base grew, Dippy AI’s team faced a new challenge: securing reliable inference infrastructure at scale so they could focus fully on building user-facing features.
In this article, we share Dippy AI’s growth journey and how they partnered with Together AI engineers to deploy their custom models on Together Dedicated Endpoints. This highly optimized GPU infrastructure seamlessly handles volumes of 4M+ tokens/minute at optimal “throughput per dollar,” without Dippy’s team having to spend time managing it.
Meet Dippy AI
Imagine an AI companion that is always available to chat with you—whether you're seeking someone to talk to late at night, crafting fan-fiction adventures, or practicing important conversations. That's the vision behind Dippy AI, founded by Akshat Jagga and Angad Arneja in April 2024.

As is often the case with startups, Dippy AI’s founders quickly found their users turning to these AI companions for much more than originally anticipated: some find comfort chatting with an AI character at late hours to combat loneliness, while others spend hours role-playing adventures inspired by their favorite movies and TV series.
“We've seen cases of people who use Dippy at 3 AM. They might not have anyone to text, so they chat with an AI therapist for comfort. Others create spin-offs of characters from their favorite series and will chat with them for hours.” – Akshat Jagga, CEO
AI Inference: The Core of Dippy AI
AI inference—the technology powering conversations between users and their AI companions—is central to Dippy’s experience. Without inference, the interactive conversations users love simply wouldn't exist.
“Inference is the core piece—without LLM inference, we essentially wouldn’t have a product.” – Manav Shah, Founding Engineer
Initially, Dippy used large, general-purpose AI models with 100B+ parameters. These models were effective but expensive and challenging to scale, especially once Dippy realized that their product usage was quite cyclical throughout the day, fluctuating according to their core users’ schedules.
As the team learned more about user behaviors, they began transitioning toward smaller, specialized AI models that were equally engaging but easier and more cost-effective to manage.
Preparing for Scale
Dippy AI originally managed its AI inference infrastructure independently. However, rapid user growth quickly led to challenges: engineering resources got tied up managing infrastructure issues instead of improving the app and enhancing the user experience.
“We were hosting ourselves initially, and when we hit scale, we needed to offload inference optimization to focus on features and improving the AI.” – Manav Shah, Founding Engineer
Latency and cost quickly became key concerns for the business, so Dippy AI looked for a partner who would provide the right infrastructure and guidance to get the best price-performance ratio at scale.
Finding the Right Partner
After learning about Together AI’s end-to-end AI platform, including its own GPU clusters and an inference engine optimized by Together AI researchers, Dippy AI saw an opportunity to work with a partner that deeply understands these challenges and was ready to handle and continuously optimize this infrastructure.
Coming in with their custom model, Dippy worked closely with Together AI engineers to figure out the ideal Together Dedicated Endpoints deployment to meet their needs. After experimenting with different configurations, we found that NVIDIA HGX H100 GPUs provided the optimal “throughput per dollar” ratio for Dippy’s specific use case, volume, throughput, and usage patterns.
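To make the “throughput per dollar” criterion concrete, here is a minimal sketch of the comparison arithmetic. The configuration names, throughput figures, and prices below are entirely hypothetical (real numbers depend on the model, batch size, and provider pricing); only the metric itself is from the article.

```python
# Hypothetical comparison of GPU deployment configurations by
# "throughput per dollar". All figures below are made up for
# illustration -- they are not Dippy's or Together AI's numbers.
configs = {
    "config_a": {"tokens_per_sec": 9_000, "usd_per_hour": 3.0},
    "config_b": {"tokens_per_sec": 15_000, "usd_per_hour": 4.5},
}

def tokens_per_dollar(cfg):
    # Tokens produced in one hour divided by the hourly cost.
    return cfg["tokens_per_sec"] * 3600 / cfg["usd_per_hour"]

# Pick the configuration that maximizes the metric.
best = max(configs, key=lambda name: tokens_per_dollar(configs[name]))
print(best)  # → config_b
```

The same calculation generalizes to any set of candidate deployments once measured throughput and pricing are known.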
Unlike the other providers Dippy AI tested, Together AI was uniquely prepared to provide the LLM optimizations that let them quickly test and refine their smaller, specialized models.
“We chose Together Dedicated Endpoints due to a combination of cost, latency, and having dedicated optimization without needing internal resources.” – Akshat Jagga, CEO
Handling Peak Volumes
Dippy AI’s highly optimized LLMs reach global peak volumes of over 4 million tokens per minute. Given their cyclical traffic patterns and potentially spiky usage, they also needed infrastructure that could seamlessly handle peak volumes.
With the out-of-the-box auto-scaling of Together Dedicated Endpoints, Dippy experienced predictable, steady availability, with no capacity issues. As a result, their users experience consistent, uninterrupted interactions, even during busy periods.
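The capacity arithmetic behind autoscaling for cyclical demand can be sketched as follows. The demand profile and per-replica throughput below are assumed values chosen only to illustrate the calculation; they do not reflect Dippy’s real traffic or Together’s replica sizing.

```python
import math

# Hypothetical cyclical demand profile (tokens/minute, sampled every
# 4 hours) and an assumed sustained per-replica throughput. Neither
# figure is from Dippy or Together -- illustration only.
demand_tpm = [600_000, 300_000, 900_000, 2_500_000, 4_100_000, 1_800_000]
PER_REPLICA_TPM = 500_000  # assumed tokens/minute one replica can sustain

def replicas_needed(demand, capacity):
    """Minimum replica count to cover each demand sample (at least 1)."""
    return [max(1, math.ceil(d / capacity)) for d in demand]

print(replicas_needed(demand_tpm, PER_REPLICA_TPM))  # → [2, 1, 2, 5, 9, 4]
```

An autoscaler effectively runs this calculation continuously, adding replicas ahead of the daily peak and releasing them during quiet hours so that capacity (and cost) tracks the cyclical curve.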
Reliability, Throughput, and Focus
Once Dippy AI started serving their optimized models through Together Dedicated Endpoints in production, they consistently met and improved their KPIs:
- Time to First Token (TTFT): Reduced to 0.4 seconds (median).
- Throughput: Managed peak volumes up to 4.1 million tokens/minute (99th percentile).
- Latency: Reduced to an average of 3.44 seconds.
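As a rough illustration of how KPIs like these can be derived from request logs, here is a minimal sketch. The log format and all numbers are hypothetical, not Dippy’s or Together’s actual telemetry; it only shows the arithmetic behind median TTFT, average latency, and peak tokens/minute.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical per-request log entries:
# (start_s, first_token_s, end_s, tokens_generated).
# Timestamps are seconds since an arbitrary epoch; values are made up.
requests = [
    (0.0, 0.4, 3.2, 400),
    (10.0, 10.3, 13.6, 520),
    (70.0, 70.5, 73.4, 450),
]

def kpis(reqs):
    # Time to first token: delay until streaming begins.
    ttft_median = median(first - start for start, first, _, _ in reqs)
    # End-to-end latency: full generation time per request.
    latency_avg = mean(end - start for start, _, end, _ in reqs)
    # Throughput: bucket completed tokens by wall-clock minute.
    per_minute = defaultdict(int)
    for _, _, end, tokens in reqs:
        per_minute[int(end // 60)] += tokens
    peak_tpm = max(per_minute.values())
    return ttft_median, latency_avg, peak_tpm
```

A production setup would compute percentiles (e.g. p99 throughput) over many such buckets rather than a simple peak, but the bucketing logic is the same.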
Beyond the immediate positive impact these improvements had on their product KPIs, these gains allowed Dippy AI’s team to focus fully on improving the product and enhancing the user experience, without having to worry about their infrastructure.
“Latency wasn't initially our main concern; it was more about throughput, cost, uptime, and reliability. Together AI delivered on these, allowing us to focus on building user-facing features.” – Manav Shah, Founding Engineer
Dippy also valued Together’s quick and responsive support. From joining discussions on Discord to creating tailored analytics dashboards, Together’s hands-on approach simplified Dippy's scaling process.
Future Developments
We're excited to see Dippy AI continue leveraging Together Dedicated Endpoints to support upcoming innovations, such as letting users talk to their AI companions over voice calls powered by state-of-the-art AI audio models.
Together AI remains committed to enabling Dippy to deliver richer, more meaningful experiences to their growing community of millions.
Deploy your Dedicated Endpoint today!
Configure your custom endpoint with the fastest GPUs to get maximum performance, control, and cost savings at scale.