Announcing General Availability of Together Instant Clusters, offering ready-to-use, self-service NVIDIA GPUs
Standing up multi-node GPU clusters has historically been manual and brittle: tickets, contract steps, and complex setup that take valuable time away from AI engineers and researchers. Today, we are excited to announce the General Availability of Together Instant Clusters, offering an API-first developer experience. Instant Clusters deliver self-service automation for AI infrastructure, from a single node (8 GPUs) to large multi-node clusters with hundreds of interconnected GPUs, with support for NVIDIA Hopper and NVIDIA Blackwell GPUs.
AI Native companies can now manage sudden demand, whether it's a training run or a surge in inference traffic, by adding capacity fast and bringing a cluster online automatically with the right orchestration (Kubernetes or Slurm) and networking. Instant Clusters can be provisioned in minutes, without long procurement cycles or manual approvals, and come preconfigured for low-latency inference and high-goodput distributed training.
“We train models to reason like clinicians over troves of multi-modal data. That means capturing subtle preferences, like how to resolve multiple diagnoses or align with payer-specific logic. With Together Instant Clusters, we can run large-scale reinforcement learning on clinical question sets, experiment rapidly, and distill that learning into smaller, more efficient models that often outperform much larger foundation models.” - Allan Bishop, Founding Engineer, Latent Health
Cloud Ergonomics for GPU Clusters
Developers expect the cloud to be API‑first, self‑service, and predictable. Historically, tightly networked GPU clusters haven’t felt that way — teams pieced together drivers, schedulers, and fabric by hand. Together Instant Clusters make GPU infrastructure feel like the rest of the cloud: automated from request to run, consistent across environments, and designed to scale from a single node to large multi‑node clusters — without changing how you work.
Self‑service, ready in minutes
Provision through the console, CLI, or API, and integrate with Terraform or SkyPilot for infrastructure-as-code (IaC) and multi-cloud workflows. Choose and pin NVIDIA driver and CUDA versions, bring your own container images, attach shared storage, and be ready to run in minutes.
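For illustration, here is a minimal Python sketch of what API-first provisioning can look like. The endpoint path, payload fields, and GPU type strings below are assumptions made for the sake of the example, not the documented Instant Clusters schema; consult the API reference for the real request format.

```python
# Minimal sketch of API-first cluster provisioning.
# The route and payload fields below are illustrative assumptions,
# not the documented Instant Clusters API.
import os
import requests

API_BASE = "https://api.together.xyz"          # assumed base URL
API_KEY = os.environ["TOGETHER_API_KEY"]

payload = {
    "name": "train-cluster-01",
    "gpu_type": "h100",             # NVIDIA Hopper; Blackwell types also offered (values assumed)
    "num_nodes": 4,                 # 4 nodes x 8 GPUs = 32 interconnected GPUs
    "orchestrator": "kubernetes",   # or "slurm"
    "driver_version": "550",        # pin NVIDIA driver/CUDA for reproducibility
    "shared_storage_gib": 2048,
}

resp = requests.post(
    f"{API_BASE}/v1/instant-clusters",          # hypothetical route
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())   # cluster ID, status, and connection details
```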

Batteries included
Clusters come pre-loaded with the components teams usually spend days wiring up themselves:
- GPU Operator to manage drivers and runtime software.
- Ingress controller to handle traffic into your cluster.
- NVIDIA Network Operator for high-performance NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet with RoCE networking.
- Cert Manager for secure certificates and HTTPS endpoints.
These and other essentials are already in place, so your cluster is production-ready out of the box.
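As a quick illustration, a sketch like the following (using the official Kubernetes Python client) can confirm those components are up on a freshly provisioned cluster. The namespace names are the usual chart defaults and may differ on your cluster.

```python
# Sanity-check the preinstalled components with the Kubernetes Python client.
# Namespace names below are common chart defaults and are assumptions here.
from kubernetes import client, config

config.load_kube_config()          # uses the kubeconfig for your Instant Cluster
v1 = client.CoreV1Api()

namespaces = {
    "gpu-operator": "NVIDIA GPU Operator",
    "nvidia-network-operator": "NVIDIA Network Operator",
    "ingress-nginx": "Ingress controller",
    "cert-manager": "Cert Manager",
}

for ns, label in namespaces.items():
    pods = v1.list_namespaced_pod(ns).items
    not_ready = [
        p.metadata.name
        for p in pods
        if p.status.phase not in ("Running", "Succeeded")
    ]
    status = "OK" if pods and not not_ready else f"check: {not_ready or 'no pods found'}"
    print(f"{label:28s} ({ns}): {status}")
```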
Optimized for Distributed Training
Training at scale demands the right interconnect and orchestration. Clusters are wired with non‑blocking NVIDIA Quantum‑2 InfiniBand scale-out compute fabric across nodes and NVIDIA NVLink and NVLink Switch inside the node, delivering ultra‑low‑latency, high‑throughput communication for multi‑node training.
Run with Kubernetes or Slurm (SSH when you need it), keep environments reproducible with version‑pinned drivers/CUDA, and checkpoint to shared storage—high‑bandwidth, parallel storage colocated with your compute; durable, resizable, and billed on demand. Ideal for pre‑training, reinforcement learning, and multi‑phase training schedules.
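To make that concrete, here is a minimal PyTorch DDP sketch of the pattern: NCCL carries gradient all-reduces over the cluster fabric, and rank 0 writes periodic checkpoints to shared storage. The /shared mount path, the model, and the torchrun launch line are assumptions for the example, not a prescribed setup.

```python
# Launch example (2 nodes x 8 GPUs), assuming HEAD_NODE is reachable from all nodes:
#   torchrun --nnodes 2 --nproc-per-node 8 \
#            --rdzv-backend c10d --rdzv-endpoint $HEAD_NODE:29500 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")   # NCCL rides on the InfiniBand/RoCE fabric
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model stands in for your real network.
    model = DDP(torch.nn.Linear(4096, 4096).cuda(local_rank),
                device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(1000):
        x = torch.randn(32, 4096, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()                        # gradients all-reduced via NCCL
        opt.step()
        opt.zero_grad()

        # Periodic checkpoint to the shared volume, written by rank 0 only.
        if step % 100 == 0 and dist.get_rank() == 0:
            torch.save(model.module.state_dict(), f"/shared/ckpt_{step}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```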
Recommendation: for tightly coupled distributed training, choose NVIDIA HGX H100 for maximum bandwidth within and between nodes; NVIDIA Blackwell is also available for next‑gen training as capacity expands.

Scalable Burst Capacity for Production Inference
When usage surges, services need to burst — not re‑architect. Use Together Instant Clusters to add inference capacity quickly and keep latency SLAs intact. Deploy your serving stack on clusters sized for the moment, resize clusters as user traffic spikes or subsides, and keep one operational model from test to production.
Recommendation: for elastic serving, our NVIDIA HGX H100 Inference plan offers strong cost‑performance and broad framework support. For inference workflows tied to training data/weights, use shared storage and the same environment image for consistency. SkyPilot makes it easy to burst across clouds with one job spec when needed.
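As a sketch of that SkyPilot pattern, the snippet below defines one task spec that can be launched wherever capacity is available. The serving command and resource shape are placeholders, and exact SkyPilot options vary by version, so treat this as an illustration rather than a drop-in config.

```python
# Illustrative SkyPilot burst: one task spec, launched on demand.
import sky

task = sky.Task(
    name="inference-burst",
    setup="pip install vllm",                     # example serving stack (placeholder)
    run="python -m vllm.entrypoints.openai.api_server "
        "--model meta-llama/Llama-3.1-8B-Instruct --port 8000",
)
task.set_resources(sky.Resources(accelerators="H100:8"))

# Launch a replica; scale out by launching more under new cluster names.
sky.launch(task, cluster_name="burst-serve-1")
```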
Reliable at Scale
Training on large GPU clusters leaves no room for weak links—a bad NIC, miswired cable, or overheating GPU can stall jobs or quietly degrade results. With General Availability, we’ve put in place a full reliability regimen so clusters are solid before a job starts and remain stable throughout. Every node undergoes burn-in and NVLink/NVSwitch checks; inter-node connections are validated with NCCL all-reduces; and reference training runs confirm tokens/sec and Model FLOPs Utilization (MFU) targets. Once deployed, clusters are continuously monitored: idle nodes re-run tests, 24/7 observability flags anomalies in real time, and SLAs with fast communication and fair compensation ensure issues are addressed transparently.
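For readers who want to run a similar check themselves, here is an illustrative all-reduce bandwidth harness in PyTorch. It is a sketch of the general technique, not Together's internal burn-in suite; the launch line and payload size are assumptions.

```python
# Measure effective all-reduce bus bandwidth across all GPUs in the cluster.
# Launch with torchrun across all nodes, e.g.:
#   torchrun --nnodes $NUM_NODES --nproc-per-node 8 nccl_check.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

size_bytes = 1 << 30                                  # 1 GiB payload
x = torch.ones(size_bytes // 4, dtype=torch.float32, device="cuda")

for _ in range(5):                                    # warm-up iterations
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - start) / iters

# Bus bandwidth estimate for ring all-reduce: 2*(N-1)/N * payload / time.
n = dist.get_world_size()
busbw_gbps = 2 * (n - 1) / n * size_bytes / elapsed / 1e9
if rank == 0:
    print(f"all-reduce bus bandwidth: {busbw_gbps:.1f} GB/s across {n} GPUs")

dist.destroy_process_group()
```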
Built for AI natives (including Together's own researchers)
During preview, AI Native companies used Instant Clusters to accelerate their work:
“As an AI Lab, we regularly train a range of models — from large language models to multimodal systems — and our workloads are highly bursty. Together Instant Clusters let us spin up large GPU clusters on demand for 24–48 hours, run intensive training jobs, and then scale back down just as quickly. The ability to get high-performance, interconnected NVIDIA GPUs without the delays of procurement or setup has been a game-changer for our team’s productivity and research velocity.” - Kunal Singh, Lead Data Scientist, Fractal AI Research Lab
Together AI is unique among AI cloud providers: a significant portion of our team are AI researchers. Our researchers both use and contribute to the platform itself; Instant Clusters are a direct result of that feedback, built for teams pushing frontier AI and the pioneers training frontier‑scale models.
Our own Tri Dao, creator of FlashAttention, had this to say:
“The limiter isn’t just GPU peak FLOPs; it’s how fast we can get a GPU cluster to start. If we can spin up a clean NVIDIA Hopper GPU or NVIDIA Blackwell GPU cluster with good networking in minutes, our researchers can spend more cycles on data, model architecture, system design, and kernels. That’s how we optimize research velocity.” — Tri Dao, Together AI Chief Scientist
Pricing
Simple, straightforward pricing — with no commitments or surprise fees. Choose the term that fits your run: Hourly, 1–6 Days, or 1 Week–3 Months. Prices below are shown in $/GPU-hour.
Storage & data
- Shared storage: $0.16 per GiB‑month (high‑performance, long‑lived, resizable).
- Data transfer: free egress and ingress.
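As a back-of-the-envelope example of how these rates combine, the sketch below estimates the cost of a short training burst. The GPU-hour rate is a placeholder to be replaced with the published rate for your term and GPU type; the storage rate is the $0.16 per GiB-month quoted above.

```python
# Rough cost estimate for a short burst; GPU_HOUR_RATE is a placeholder.
GPU_HOUR_RATE = 2.50        # $/GPU-hour, placeholder: substitute the published rate
STORAGE_RATE = 0.16         # $/GiB-month for shared storage (from above)

gpus = 8 * 4                # 4 nodes x 8 GPUs
hours = 48                  # a 2-day burst
storage_gib = 2048          # 2 TiB shared volume
storage_months = 2 / 30     # kept for the same 2 days

compute_cost = gpus * hours * GPU_HOUR_RATE
storage_cost = storage_gib * storage_months * STORAGE_RATE

print(f"compute: ${compute_cost:,.2f}  storage: ${storage_cost:,.2f}  "
      f"total: ${compute_cost + storage_cost:,.2f}")
```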
Get Started Today
Together Instant Clusters with NVIDIA Blackwell and Hopper GPUs are available now.
Get started today:
- Create a cluster through the Together AI console
- Read the documentation