Together GPU Clusters

Reliable, self-serve, AI-ready
GPU clusters at scale

Go from zero to production in minutes. Bare-metal performance, InfiniBand networking, and managed orchestration — with flexible pricing for both on-demand and reserved capacity.

Why Together GPU Clusters

Infrastructure that keeps long-running jobs on track — with automated recovery, elastic scale, and zero DevOps overhead.

90% faster training on Blackwell

NVIDIA HGX B200 with Together Kernel Collection delivers 90% faster BF16 training than optimized Hopper.

Automatic recovery

When hardware fails, automated remediation restores capacity with no support tickets and no manual intervention — recovery behavior stays predictable, and long-running jobs stay on track.

Scale from 8 to 4,000+ GPUs

Managed orchestration and elasticity — start small for experimentation and programmatically scale your AI applications, GPUs and storage as you move to production.

Everything you need to train at scale

Managed infrastructure with built-in observability, orchestration flexibility, and research-grade performance.

    • Sustained performance

      Multi-week stability
      Reduced stragglers
      Predictable latency

      Maintain high utilization across multi-week training runs and model serving. Kernel, hardware, and storage acceleration reduce stragglers and keep latencies predictable.

    • Resilient infrastructure

      Acceptance testing
      Automated remediation
      Node repair

      Scale compute and storage seamlessly from experiments to production over high-speed InfiniBand. Keep capacity online automatically using continuous health checks, automated remediation, and self-serve node repair.

    • Observability & monitoring

      Pre-built dashboards
      Full-stack metrics
      Real-time alerts

      Monitor workloads instantly using pre-built Grafana dashboards and full-stack metrics across GPUs, storage, networking, and Kubernetes. Gain complete system visibility without writing custom instrumentation.

    • Self-serve developer experience

      CUDA version choice
      Project-level RBAC
      Multi-tool access

      Provision clusters instantly with pre-configured tooling alongside selectable drivers and CUDA versions. Manage cross-team access via project-level RBAC using the CLI, SDK, API, Terraform, or the web console.

Frontier research-powered training performance

The Together Kernel Collection, built by our Chief Scientist Tri Dao (creator of FlashAttention), delivers improved training and inference performance.

  • Together Kernel Collection
  • ThunderKittens
  • AI Training Performance: NVIDIA Hopper to Blackwell, with TKC

    TKC vs SOTA Approaches

    90% faster training

    Training a 70B parameter Llama-architecture model (BF16) with an optimized TorchTitan + Together Kernel Collection (TKC) reached 15,264 tokens/second/GPU on NVIDIA HGX B200, up from 8,080 tokens/second/GPU on NVIDIA HGX H100 — a 90% jump in training speed.

    learn more
  • FP8 GEMM Performance (M x N x K)

    • ThunderKittens B200
    • cuBLAS H100
    • cuBLAS B200

    ThunderKittens vs cuBLAS

    ~2× faster

    ThunderKittens’ FP8 kernel for NVIDIA HGX B200 matches NVIDIA cuBLAS GEMM performance while delivering ~2× speedup over H100 FP8 GEMMs, leveraging Blackwell’s Tensor Core–accelerated matrix operations.

    learn more

Fully managed, high-performance shared filesystems for faster training and innovation cycles

Provision and attach shared storage volumes for your GPU clusters to store and persist your training data and model weights — ensuring your GPUs never starve for data.

  • Weka excels at high-IOPS workloads

    Strong metadata performance: Weka scored 826.86 on the IO500 benchmark and delivers sub-200-microsecond latency, making it well suited to heavy small-file operations and metadata-intensive tasks like checkpoint discovery across hundreds of training ranks.

  • VAST simplifies operations

    VAST's disaggregated architecture separates compute from storage for straightforward capacity expansion and a unified namespace. Built for enterprise environments where operational simplicity and broad feature coverage matter most.

Flexible pricing models

Both options are fully self-serve. Choose based on your capacity requirements and commitment level.

  • On-Demand
    Standard hourly rate
    Commitment
    None—pay hourly, terminate anytime
    Best for
    Experimentation and short-term flexibility
    Capacity
    Based on real-time availability
    Scale
    Up to 256 GPUs
    Create now
  • Reserved
    Lower hourly rate
    Commitment
    Up to 6 months, pay upfront
    Best for
    Guaranteed access with better economics
    Capacity
    Locked in for your duration
    Scale
    Up to 4,000+ GPUs
    Reserve capacity

Choose your cluster configuration

Self-serve GPUs with transparent per-GPU billing.

H100 SXM
Hardware
NVIDIA HGX H100 SXM (80GB)
On-demand
$2.99/hr per GPU
Reserved
Starting at $1.75/hr per GPU
Scale
8 to 256 GPUs
I am interested
H100 Inference
Hardware
NVIDIA HGX H100 SXM - Inference
On-demand
$2.39/hr per GPU
Reserved
Starting at $1.76/hr per GPU
Note
Lower InfiniBand bandwidth, suitable for single-node inference
I am interested
H200
Hardware
NVIDIA HGX H200 (141GB)
On-demand
$3.79/hr per GPU
Reserved
Starting at $2.09/hr per GPU
Scale
256 to 1,000 GPUs
I am interested
B200
Hardware
NVIDIA HGX B200
On-demand
$5.50/hr per GPU
Reserved
Starting at $4.00/hr per GPU
Scale
256 to 1,000+ GPUs
I am interested
GB200
Hardware
NVIDIA GB200 NVL72
On-demand
Contact us for pricing
Reserved
Contact us for pricing
Scale
512 to 1,000+ GPUs
I am interested
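
As a back-of-envelope check of the H100 SXM rates listed above, the sketch below compares on-demand and reserved cost for a single run. The GPU count and run length are illustrative assumptions, not quotes — only the two hourly rates come from the table.

```python
# Cost comparison using the H100 SXM rates listed above.
# GPU count and run length below are illustrative assumptions.

ON_DEMAND_RATE = 2.99   # $/hr per GPU (H100 SXM, on-demand)
RESERVED_RATE = 1.75    # $/hr per GPU (H100 SXM, reserved starting rate)

def cluster_cost(rate_per_gpu_hr: float, gpus: int, hours: float) -> float:
    """Total cost for a cluster at a flat per-GPU hourly rate."""
    return rate_per_gpu_hr * gpus * hours

gpus, hours = 64, 14 * 24           # e.g. a two-week run on 64 GPUs
on_demand = cluster_cost(ON_DEMAND_RATE, gpus, hours)
reserved = cluster_cost(RESERVED_RATE, gpus, hours)

print(f"on-demand: ${on_demand:,.2f}")   # $64,296.96
print(f"reserved:  ${reserved:,.2f}")    # $37,632.00
print(f"savings:   {1 - RESERVED_RATE / ON_DEMAND_RATE:.1%}")  # 41.5%
```

At these rates the reserved starting price works out to roughly 41% below on-demand, which is the "better economics" trade-off the Reserved plan describes.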

Orchestration flexibility for your AI workloads

Pick the scheduler that fits your workloads: managed Kubernetes or Slurm on Kubernetes.

  • Managed Kubernetes
    For training and inference
    Kubeadm-based upstream-compliant K8s
    Node autoscaling for elastic compute
    Managed Grafana for observability
    Flexible ingress configuration for inference
    HA control plane with managed upgrades
    Cert Manager for HTTPS endpoints
    Get started
  • Slurm on Kubernetes
    For training workloads
    Precise hardware control and gang scheduling
    Submit jobs via srun, sbatch
    Direct SSH access with Slurm simplicity and K8s-backed resilience
    Essential for distributed training synchronization
    Get started
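
To illustrate the srun/sbatch workflow above, here is a minimal multi-node batch script. The node counts, time limit, and training script are placeholders, not a Together-specific template — the directives themselves are standard Slurm.

```shell
#!/bin/bash
# Sketch of a multi-node training job for a Slurm-on-Kubernetes cluster.
# Node count, walltime, and the training entrypoint are placeholders.
#SBATCH --job-name=llm-train
#SBATCH --nodes=4                 # 4 nodes x 8 GPUs = 32 GPUs
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=8       # one task per GPU
#SBATCH --time=72:00:00

# srun gang-schedules all ranks together, which keeps distributed
# training launches synchronized across nodes.
srun python train.py --config configs/train.yaml
```

Submitting with `sbatch job.sh` queues the job; `srun` can also be used interactively for quick experiments.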

Regions and availability zones

Launch close to your users and data across 25+ cities.

  • USA
    2GW+ in the portfolio with 600MW of near-term capacity in the US.
  • Europe
    150 MW+ available across Europe, including the UK, Spain, France, Portugal, and Iceland.
  • Asia & Middle East
    Options available in Asia and the Middle East based on project scale.

Choose from global regions to meet data residency and compliance requirements—HIPAA for healthcare, GDPR for Europe, or banking regulations.

Infrastructure you can trust at scale.
Production-grade security.

We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.

Learn More

As an NVIDIA Cloud Partner, Together builds and operates clusters on NVIDIA NCP reference architectures for predictable performance and faster time to production, backed by SOC 2–compliant security practices.

  • NVIDIA preferred partner
  • AICPA SOC 2 Type II

Customers running AI in production

    "Together GPU Clusters provided a combination of amazing training performance, expert support, and the ability to scale to meet our rapid growth to help us serve our growing community of AI creators."

    Demi Guo

    CEO, Pika

    “Together AI provides the performance and reliability we need for real-time, high-quality image and video generation at scale. We value that Together AI is much more than an infrastructure provider — they're a true innovation partner, enabling us to push creative boundaries without compromise.”

    Victor Perez

    Co-Founder, Krea