Google

Deploy Google's Gemma 3 models on Together AI. Lightweight, state-of-the-art open models built from the same technology that powers Gemini.

Why Google on Together AI?

Designed for production workloads that need consistent performance and operational control.

Gemini technology, open and deployable

Gemma models are built from the same research and architecture that powers Gemini. Deploy frontier AI you can fine-tune, own, and run without restrictions.

The best open model on a single GPU

Gemma 3 27B is the most capable open model that fits on a single NVIDIA H100, with multimodal vision, support for over 140 languages, and a 128K-token context window.
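As a rough back-of-envelope check of the single-GPU claim (weights only; the KV cache and activations need additional headroom):

```python
# Rough memory estimate for Gemma 3 27B weights in bf16 on one H100.
# Weights only; KV cache and activation memory are not included.
params = 27e9        # parameter count
bytes_per_param = 2  # bf16 stores 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9

h100_hbm_gb = 80     # H100 HBM capacity
print(f"weights: {weights_gb:.0f} GB of {h100_hbm_gb} GB HBM")
# -> weights: 54 GB of 80 GB HBM, leaving headroom for the KV cache
```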

From cloud to edge, one model family

Sizes from 270M to 27B run everywhere — cloud infrastructure, workstations, and mobile. SOC 2 Type II certified and HIPAA compliant on Together AI's US-based infrastructure.

Meet the Google family

Explore top-performing Google models across chat, image, and video.

Image

  • Nano Banana Pro (Gemini 3 Pro Image) (New)
  • Gemini 3.1 Flash Image (Nano Banana 2) (New)
  • Gemini Flash Image 2.5 (Nano Banana) (New)
  • Google Imagen 4.0 Ultra
  • Google Imagen 4.0 Fast (New)
  • Google Imagen 4.0 Preview (New)

Video

  • Google Veo 3.0
  • Google Veo 3.0 Fast (New)
  • Google Veo 2.0 (New)
  • Google Veo 3.0 + Audio (New)
  • Google Veo 3.0 Fast + Audio (New)

Chat

  • Gemma 3 27B
  • Gemma 3n E4B Instruct
  • Gemma 3 12B
  • Gemma 3 1B
  • Gemma 3 4B
  • Gemma Instruct (2B)

Deployment options

Run models using the deployment option that best fits your latency needs, traffic patterns, and desired degree of infrastructure control.

Serverless Inference

Real-time

A fully managed inference API that automatically scales with request volume (see the request sketch after this list).

Best for

Variable or unpredictable traffic

Rapid prototyping and iteration

Cost-sensitive or early-stage production workloads
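
A minimal sketch of a serverless request using the Together Python SDK. The model ID google/gemma-3-27b-it is an assumption here; confirm the exact string in the model catalog.

```python
from together import Together

# Reads TOGETHER_API_KEY from the environment.
client = Together()

# Model ID is an assumption; check the catalog for the exact identifier.
response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[{"role": "user", "content": "Summarize Gemma 3 in one sentence."}],
)
print(response.choices[0].message.content)
```

Because the endpoint is fully managed and billed per token, the same call works for one request a day or a sudden burst, with no reserved capacity to provision.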

Batch

Process workloads of up to 30 billion tokens asynchronously, at up to 50% lower cost (see the batch-file sketch after this list).

Best for

Classifying large datasets

Offline summarization

Synthetic data generation
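
A sketch of preparing batch input as JSONL, one request object per line. The custom_id and body field names follow the common OpenAI-style batch schema and are assumptions here; consult the Batch API reference for the exact schema and the upload and job-creation calls.

```python
import json

# Build a JSONL batch file: one request object per line.
# Field names are assumptions modeled on OpenAI-style batch schemas.
documents = ["first document", "second document"]
requests = [
    {
        "custom_id": f"doc-{i}",  # lets you match results back to inputs
        "body": {
            "model": "google/gemma-3-27b-it",  # assumed model ID
            "messages": [{"role": "user", "content": f"Classify: {text}"}],
        },
    }
    for i, text in enumerate(documents)
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
```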

Dedicated Inference

Dedicated Model Inference

An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine (see the sketch after this list).

Best for

Predictable or steady traffic

Latency-sensitive applications

High-throughput production workloads
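
Dedicated endpoints are called through the same completions interface as serverless; only the model string changes. The endpoint name below is a hypothetical placeholder for the identifier your deployment is assigned.

```python
from together import Together

client = Together()

# "my-org/gemma-3-27b-dedicated" is a hypothetical placeholder;
# use the identifier shown for your deployed endpoint.
response = client.chat.completions.create(
    model="my-org/gemma-3-27b-dedicated",
    messages=[{"role": "user", "content": "Health check: reply with OK."}],
)
print(response.choices[0].message.content)
```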

Dedicated Container Inference

Run inference with your own engine and model on fully managed, scalable infrastructure.

Best for

Generative media models

Non-standard runtimes

Custom inference pipelines