Together GPU Clusters
Frontier GPU clusters with 16 to 1,000+ interconnected NVIDIA H100 and H200 GPUs, now featuring the Together Kernel Collection.
Software stack ready for distributed training
Train with the Together Training stack, delivering nine times faster training speed with FlashAttention-3.
Slurm comes configured out of the box for distributed training, with the option to use your own scheduler.
SSH directly into the cluster, download your dataset, and you're ready to go.
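As a rough sketch of what that workflow can look like (this script is illustrative, not part of the Together stack; the script name, environment handling, and rendezvous setup are assumptions), a PyTorch job launched with srun can pick up its process layout from Slurm's environment variables:

    # train_stub.py: minimal sketch of initializing PyTorch distributed
    # training under Slurm. Assumes PyTorch with NCCL is installed and
    # that MASTER_ADDR / MASTER_PORT are exported in the sbatch script.
    import os
    import torch
    import torch.distributed as dist

    def init_from_slurm():
        # Slurm exposes the job layout through these variables when the
        # script is launched with `srun`.
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
        local_rank = int(os.environ["SLURM_LOCALID"])
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(local_rank)
        return rank, world_size, local_rank

    if __name__ == "__main__":
        rank, world_size, local_rank = init_from_slurm()
        # Sanity check: all-reduce a tensor across every GPU in the job.
        x = torch.ones(1, device=f"cuda:{local_rank}")
        dist.all_reduce(x)
        if rank == 0:
            print(f"all_reduce result: {x.item()} (expected {world_size})")
        dist.destroy_process_group()

Launched as, for example, "srun python train_stub.py" inside an sbatch allocation, each task binds to one GPU and the all-reduce confirms every rank can reach the others before real training starts.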
Performance metrics
- Training horsepower relative to AWS
- Training speed
Benefits
Scale infra – at your pace
Start with a term as short as 30 days and expand at your own pace. Scale up or down as your needs change, from 16 to 10,000 GPUs.
SNAPPY SETUP. BLAZING FAST TRAINING.
We value your time. Your cluster comes optimized for distributed training, with the high-performance Together Kernel Collection and a Slurm cluster set up out of the box. You focus on your model; we'll make sure everything runs smoothly. SSH in, download your data, and start training.
EXPERT SUPPORT
Our team is dedicated to your success and will help unblock you, whether you hit AI or systems issues. A guaranteed uptime SLA and support are included with every cluster, and additional engineering services are available when needed.
Hardware specs
01. A100 PCIe Cluster Node Specs
- 8x NVIDIA A100 / 80 GB / PCIe
- 200 Gbps non-blocking Ethernet
- 120 vCPU Intel Xeon (Ice Lake)
- 960 GB RAM
- 7.68 TB NVMe storage

02. A100 SXM Cluster Node Specs
- 8x NVIDIA A100 / 80 GB / SXM4
- 200 Gbps Ethernet or 1.6 Tbps InfiniBand configurations available
- 120 vCPU Intel Xeon (Sapphire Rapids)
- 960 GB RAM
- 8x 960 GB NVMe storage

03. H100 Cluster Node Specs
- 8x NVIDIA H100 / 80 GB / SXM5
- 3.2 Tbps InfiniBand network
- 2x AMD EPYC 9474F CPUs (48 cores / 96 threads each, 3.6 GHz)
- 1.5 TB ECC DDR5 memory
- 8x 3.84 TB NVMe SSDs
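As an illustrative sanity check against the H100 node spec above (not a Together-provided tool; the thresholds and function name are assumptions), a short PyTorch snippet can confirm that a node exposes eight roughly 80 GB GPUs before kicking off a large run:

    # check_node.py: verify visible GPU count and per-GPU memory on one node.
    # Assumes PyTorch with CUDA support is installed on the node.
    import torch

    def check_node(expected_gpus=8, min_mem_gib=79):
        found = torch.cuda.device_count()
        print(f"GPUs visible: {found} (expected {expected_gpus})")
        for i in range(found):
            props = torch.cuda.get_device_properties(i)
            mem_gib = props.total_memory / 1024**3
            print(f"  cuda:{i} {props.name}, {mem_gib:.0f} GiB")
            assert mem_gib >= min_mem_gib, f"cuda:{i} has less memory than expected"
        assert found == expected_gpus, "unexpected GPU count"

    if __name__ == "__main__":
        check_node()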