Optimizing inference speed and costs: Lessons learned from large-scale deployments

How can teams reduce inference latency without massive costs?

Achieving faster inference doesn't always mean paying more for a bigger cluster. At Together AI, we’ve seen teams that consistently deliver both low latency and low cost share these key habits:

They maximize the usable work extracted from every GPU
They actively eliminate invisible compute stalls
They strategically select decoding techniques based on their specific traffic patterns
They view performance tuning as an ongoing discipline, not a one-time configuration task

By excelling in these areas, your cluster can provide faster responses while simultaneously reducing the cost per token.

Why inference cost efficiency matters

AI products are getting more competitive by the week — and user expectations are rising just as fast.

For leading AI-native companies — like Cursor, who needs massive throughput without compromising speed, and Decagon who needs real-time responses despite unpredictable traffic patterns — the pressure is the same everywhere:

Be fast. Sub-500ms TTFT and fast decoding speed
Be predictable. No surprise tail latencies
Be affordable. GPU bills can’t scale linearly with traffic
Be ready for spikes. Because traffic never behaves the way you expect

Across customers, we consistently see the same imperative: deliver sub-second responses, without doubling the GPU bill. The good news? You don’t need exotic architectures or hundreds of extra GPUs to maintain inference cost efficiency.

Most teams get meaningful wins by optimizing how their inference runs, not purely how much hardware they buy.

How inference optimization works

Here are the levers that reliably move both speed and cost in the right direction.

1. Start at the model level: quantization and distillation

Quantization

Dropping precision (FP16 → FP8 → FP4) makes the model lighter on memory and faster to run — with virtually no quality loss when done well, like how we do it here at Together.

This unlocks:

Noticeably faster tokens/sec
Bigger batch sizes at the same GPU footprint
Lower cost per token
Smoother scaling for real-time workloads

We’ve seen in many production deployments that FP8 or FP4 quantization delivers 20–40% throughput improvement, without harming output quality.

Distillation

Not every workload needs the full weight of a frontier model. Distillation trains a smaller model to mimic a larger one, preserving reasoning patterns while cutting compute cost dramatically.

DeepSeek-R1 is a great example. Its distilled variants are fast, lightweight, and still excellent at reasoning — making them perfect for:

Interactive chat
Coding assistants
Routing and classification
High-volume enterprise workloads
Inference at the edge or under tight latency budgets

You can see how teams deploy R1 and its distilled variants securely on Together AI in this post.

Distilled R1 variants deliver a quality-to-latency ratio that’s extremely compelling for production workloads — often enabling 2–5× lower cost at similar quality bands for many tasks.

Together, quantization and distillation offer some of the largest cost reductions available before touching hardware or cluster architecture.

2. Reduce network latency at the edge (regional inference proxies)

Sometimes the biggest latency win isn’t compute, but geography. Even with extremely fast models, network distance is often the slowest part of the request path.

Dropping a lightweight proxy in the same region as your inference cluster cuts out long round-trip paths before generation even starts.

This alone can shave 50–100 ms off TTFT, and make tail latency far more predictable.

3. Reduce unnecessary compute (memory stalls, KV inefficiencies, fragmentation)

Most models aren’t slow— the pipelines around them are. So your GPU spends a lot of time doing nothing and just… waiting. The biggest culprits tend to be:

Kernels that don’t work together efficiently, forcing the GPU to pause between prefill, attention, and decoding
MoE layers that spend more time waiting on memory than doing useful work, especially when expert routing is unbalanced
Prefill paths that struggle with long prompts, leading to slow starts and uneven performance
Batching or scheduling gaps that leave portions of the GPU idle while work is still available

At Together AI, we’ve run benchmarks across Llama, Qwen, Mistral, and DeepSeek families (highlighted in our fastest inference for the top open-source models blog) which show that kernel fusion, smarter MoE execution, streamlined tokenization, and better scheduling can eliminate wasted time, unlocking faster responses and higher throughput.

4. Use the right decoding optimization (MTP, speculative decoding, draft models)

Decoding is where a lot of time gets lost — and also where some of the easiest wins live.

MTP: Predicts multiple tokens at once, increasing decode speed and GPU efficiency
Speculative decoding: Uses a small “draft” model to accelerate generation for predictable workloads
- Traditional speculative decoding uses a fixed drafting strategy, but modern engines allow teams to optimize for their specific traffic distribution — maximizing speed while minimizing quality regressions. We did this with our own speculator, ATLAS.
- We break down these strategies in detail in our customized speculative decoding post.

When tuned properly, these techniques often deliver 20–50% faster decoding and significantly higher throughput per GPU.

5. Pick the right hardware for your workload (and use the right parallelism)

With a new hardware type that comes out every year or so, hardware choice increasingly shapes both cost and latency.

Blackwell GPUs offer major improvements in per-token throughput and attention kernel speed.
NVIDIA Grace Blackwell (GB200) systems tightly pair CPU + GPU memory, reducing data movement overhead and boosting throughput for large batch sizes and long contexts.

But to fully benefit from this hardware, large models need to be split and scheduled intelligently across devices. That’s where parallelism strategies come in:

Tensor parallelism splits individual layers across GPUs, letting very large models run efficiently without becoming memory-bound.
Expert parallelism distributes different experts in MoE models across GPUs, so each GPU specializes in a subset of experts instead of doing everything.

Teams running billions of tokens per day typically see a clear drop in cost-per-token when moving heavy workloads to NVIDIA Blackwell-class hardware with the right parallelism strategy.

6. Dynamically shift GPU capacity across endpoints

Traffic is rarely evenly distributed across all services. For example:

Feature: Dynamic scaling between endpoints based on real-time concurrency and demand
How it works: GPUs automatically reassign to the busiest endpoints, while idle endpoints relinquish capacity.
Outcome: Higher total utilization, fewer idle GPUs, and the ability to handle spikes without overprovisioning.

This is especially valuable for customers with mixed workloads: coding, chat, RAG, batch, and long-form generation. Together AI allows customers to update capacity on their endpoints via a simple API call.

What you can do with better inference optimization

Teams that implement these optimizations unlock:

Lower TTFT and faster decoding
Higher GPU utilization and fewer idle cycles
Reduced cost per token
Improved predictability and tail latencies
Better user experience across interactive and real-time products

Getting started

Here’s a practical, low-friction way to begin:

Measure your baseline (TTFT, decode TPS, TPM/GPU, network RTT)
Deploy a regional proxy if requests originate far from your inference cluster
Enable adaptive continuous batching and monitor GPU utilization
Turn on MTP or speculative decoding depending on workload
Rebalance endpoints by dynamically shifting GPU capacity as traffic changes

FAQ

Do throughput optimizations increase latency?

No. Continuous batching + fused kernels let you raise throughput while lowering latency.

Is NVIDIA Blackwell only for huge workloads?

No. Any workload with meaningful concurrency or long contexts benefits from its bandwidth and memory improvements.

How can I tell if my GPUs are under-utilized?

Look for low decode TPS, small active batch sizes, or long gaps between token generations.