A practitioner's guide to testing and running large GPU clusters for training generative AI models

August 13, 2024

By 

Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams

Training generative AI models requires clusters of expensive cutting-edge hardware: H100 GPUs and fast storage wired together in multi-network topologies involving Infiniband links, switches, transceivers, and Ethernet connections. While an increasing number of HPC and AI cloud services now offer these specialized clusters, they demand substantial capital commitments. However, not all clusters are created equal.

Due to the complexity of this novel hardware, clusters often contain misassembled, misconfigured, or dead-on-arrival (DoA) components that may inadvertently be passed on to customers. Operating under high thermal loads, these clusters are prone to frequent component failures.

To mitigate the risk of low-performance clusters, we employ a process called 'acceptance testing.' For companies training generative AI models, this is not merely a procedural step. As we push the boundaries of AI capabilities, ensuring that our hardware infrastructure—particularly GPU clusters—meets the highest standards of reliability and performance becomes increasingly critical.

This article outlines the acceptance testing process we've developed at Together AI, which we've successfully implemented across clusters containing thousands of GPUs.

Introduction to GPU Cluster Testing

The reliability of GPU clusters varies dramatically, ranging from minor issues to critical failures. Even industry giants like Meta have reported significant hardware challenges. During a 54-day training run of their Llama 3.1 model, “GPU issues were the largest category, accounting for 58.7% of all unexpected issues”.

At Together AI, we serve many AI startups and Fortune 500 companies with mission-critical AI infrastructure, and similar hardware issues are often what block our customers. These challenges prompted us to develop a robust validation framework for assessing and ensuring hardware quality before deployment to our cloud service.

As a result, we've created a systematic approach to acceptance testing, designed to guarantee reliability for our end customers as we expand our globally distributed cloud service. We share this framework with our providers before cluster delivery and employ this repeatable process to verify quality and performance prior to cluster acceptance.

The Process of Testing Clusters at Together AI

The overarching goal of acceptance testing is to guarantee that the hardware infrastructure not only meets the specified requirements but also delivers the reliability and performance necessary for demanding AI/ML workloads. This process aids in optimizing operational efficiency and plays a crucial role in maintaining the trust and satisfaction of customers who rely on these computational resources.

A key concept in Together AI's acceptance testing is the hierarchical structure of the tests. We test hierarchically so that issues can be pinpointed accurately, starting from basic functionality and gradually moving to more complex integrations and performance evaluations.

1. Preparation and Configuration

The initial phase involves configuring the new hardware within the GPU cluster environment. This setup mimics the end-use scenario, allowing for a comprehensive evaluation of the hardware's performance in an operational context. At a high level, we prepare a cluster with the following steps (a quick sanity-check sketch follows the list):

  • Installing NVIDIA drivers
  • Installing OFED drivers (for Infiniband)
  • Installing CUDA
  • Installing NCCL
  • Installing HPCX
  • Configuring the Slurm cluster
  • Configuring PCI settings for performance
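
Once these steps are done, a quick sanity check helps confirm the installs actually took effect before any stress testing begins. The script below is only a rough sketch; the expected versions and the Mellanox device filter are assumptions to adapt per cluster:

#!/bin/bash
# Hypothetical post-install sanity check -- adjust expected versions to your build sheet
set -euo pipefail

echo "Driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
echo "CUDA:   $(nvcc --version | grep release)"
echo "OFED:   $(ofed_info -s)"
echo "NCCL:   $(ldconfig -p | grep -m1 libnccl.so.2 || echo 'not found')"

# Slurm should see every node in an idle, healthy state
sinfo --Node --long

# PCIe settings on the ConnectX HCAs (vendor ID 15b3); MaxReadReq is commonly raised to 4096 bytes
sudo lspci -d 15b3: -vv | grep -E "MaxReadReq"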

Once we have all the dependencies installed, we begin validating the cluster by stress testing and benchmarking every subsystem and component individually. Each phase of testing builds on the previous one, culminating in a reference task tailored to our customer's use case (e.g., a model build), so we know that the cluster is ready for training.

2. GPU Validation

One of the first subsystems to validate is the GPUs. We start by checking that the GPU type and count match what's expected; this can catch simple problems like NVML driver mismatch errors, or “GPU fell off the bus” errors that some have experienced. We can quickly check the number and type of GPUs; for example, a machine with 8x H100s should look like this:


$ nvidia-smi | grep H100 | wc -l

8
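
To check every node at once, the same count can be run across the whole cluster. Here is a one-line sketch, assuming a single Slurm partition with all nodes available:

$ srun --nodes="$(sinfo -h -o '%D' | head -n1)" --ntasks-per-node=1 \
      bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs"'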

The heart of GPU validation lies in stress testing. For this, we use DCGM Diagnostics from NVIDIA as well as gpu-burn. DCGM performs a number of tests, including measuring power consumption and temperature while the GPU is under load. If any of DCGM's test cases fail, we know we've likely got a problem. We generally run DCGM with Apptainer. You can build the SIF file like this:


$ apptainer pull docker://nvidia/dcgm:3.3.6-1-ubuntu22.04

And then run DCGM diagnostics:


$ apptainer exec --nv dcgm_3.3.6-1-ubuntu22.04.sif /usr/bin/dcgmi diag --run 3 --fail-early

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.6                                          |
| Driver Version Detected   | 535.161.07                                     |
| GPU Device IDs Detected   | 2330,2330,2330,2330,2330,2330,2330,2330        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
| Diagnostic                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
| Targeted Stress           | Pass - All                                     |
| Targeted Power            | Pass - All                                     |
| Memory Bandwidth          | Pass - All                                     |
+---------------------------+------------------------------------------------+

Another great tool for stress testing GPUs is gpu-burn. With gpu-burn we can do a long-running stress test, which ensures that even under consistent heavy load we don't start to see memory errors or other failures. We expect the GPUs to handle sustained stress similar to the load they will be under when training. We can also use Apptainer to run gpu-burn:


$ apptainer pull docker://oguzpastirmaci/gpu-burn:latest
$ apptainer exec --nv gpu-burn_latest.sif /app/gpu_burn 60

Burning for 60 seconds.
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-c9c8226e-1831-dec2-1175-1ee6313b649b)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-834d0678-0c4e-8844-e804-5aafb4af5fc2)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-1cec0b40-baac-aa2a-ed33-cdcda6313883)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-b9b3e4ff-ffa9-b553-cb46-5d3e55099cbd)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-11b4c171-bc74-7457-4c1f-baab3c274113)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-cf1b0dcb-02e6-9c33-4aba-dc7aa2b7c082)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-0493e11e-2cf8-13a4-0625-5a96a2fa0251)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-cfa4b4db-39ee-c050-f742-bd653156c8dc)
Killing processes.. done
Tested 8 GPUs:
        GPU 0: OK
        GPU 1: OK
        GPU 2: OK
        GPU 3: OK
        GPU 4: OK
        GPU 5: OK
        GPU 6: OK
        GPU 7: OK

3. NVLink and NVSwitch Validation

If each GPU passes individual validation, we then need to make sure the GPUs can work together on a single machine. There are two main tools for this: NCCL tests and nvbandwidth.

NCCL tests can exercise GPU-to-GPU communication over NVLink when run on a single machine, and for large message sizes we should see bandwidth approach the unidirectional performance of NVLink. If the performance is lower than expected, or we hit errors, we can quickly diagnose problems like a bad NVSwitch or down NVLinks. Similarly, the nvbandwidth tool measures copy performance from GPU to GPU.
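
A minimal single-node sketch, assuming nccl-tests has already been compiled (the build and a full multi-node Slurm script are shown in the network validation section below); all 8 local GPUs participate in one all_reduce:

$ ./nccl-tests/build/all_reduce_perf -b 128M -e 8G -f 2 -g 8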

The nvbandwidth tool can be built from source from the NVIDIA repository https://github.com/NVIDIA/nvbandwidth and then run with default arguments. It runs a large number of tests; for many of them the results look like the matrix below, showing the speed of GPU-to-GPU memcpy:
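
A rough sketch of the build and run, assuming CMake, a C++ toolchain, and the CUDA toolkit are available (exact dependencies may differ):

$ git clone https://github.com/NVIDIA/nvbandwidth.git
$ cd nvbandwidth
$ cmake . && make
$ ./nvbandwidth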


Running device_to_device_memcpy_write_ce.
memcpy CE GPU(row) <- GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0       N/A    389.04    388.19    389.04    388.90    388.33    388.90    388.61
1    389.04       N/A    388.90    388.33    388.76    389.04    388.19    388.90
2    389.18    388.05       N/A    388.90    388.90    388.33    388.90    388.76
3    389.74    388.61    389.04       N/A    388.19    388.76    388.90    388.19
4    389.46    388.19    388.90    389.04       N/A    388.33    389.04    389.04
5    389.32    389.18    388.33    388.90    389.04       N/A    388.47    389.18
6    389.74    388.90    389.04    388.47    389.04    389.04       N/A    388.33
7    389.74    388.90    389.04    388.33    388.90    389.18    388.33       N/A

4. Network Validation

If the GPUs within a machine are able to communicate at full NVLink bandwidth, we proceed to validating the network configuration that enables full-speed distributed training. Most training clusters are built with Infiniband or RoCE networking fabrics to enable extremely fast communication between GPUs on different machines.

To test an Infiniband fabric, we use standard tools like ibping, ib_read_bw, and ib_write_bw to verify that latency and throughput are as expected.
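
For example, a rough point-to-point check with ib_write_bw between two hosts might look like this; the HCA name and hostname are placeholders:

# On the first host, start the server side:
$ ib_write_bw -d mlx5_0 --report_gbits

# On the second host, run the client against it:
$ ib_write_bw -d mlx5_0 --report_gbits node-001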

For Machine Learning workloads, we are very interested in making sure that GPUDirect RDMA is working optimally, and for this we again use NCCL tests, similar to validating NVLink. This time we include multiple nodes in the NCCL test, starting with 2 nodes and scaling all the way up to the entire cluster. Generally we are looking for the all_reduce_perf test to show bandwidth around 92% of the theoretical maximum of the fabric: around 370 GB/s on a 400 GB/s fabric.

NCCL tests help identify numerous issues: by validating the entire cluster, we can find individual nodes, leaf switches, or spine switches that may be having an issue. We algorithmically test smaller groups, from individual nodes, to pairs of nodes, to groups of nodes, to the entire cluster, in order to quickly determine whether there are any faults with Infiniband or GPUDirect RDMA. The most common failure mode for NCCL tests is running slower than expected, for example when something is not right on the Infiniband fabric. A good result on a whole-cluster NCCL test is a strong sign that the cluster will perform well on distributed training workloads.

nccl-tests should be compiled from source from the NVIDIA repo: https://github.com/NVIDIA/nccl-tests. We generally run it with Slurm, so we can easily control which hosts the job is scheduled on. Here is a simple example script for running NCCL tests via Slurm; some of the NCCL environment variables will need to be adjusted to match the configuration of the machines:


#!/bin/bash
#SBATCH --job-name=nccl-tests
#SBATCH --output=nccl-tests-%j.out
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=0:05:00
export LD_LIBRARY_PATH=/usr/lib:/usr/lib64
export NCCL_TESTS_HOME=nccl-tests
export NCCL_DEBUG=INFO
export NCCL_ALGO=RING
# export NCCL_DEBUG_SUBSYS=NET
export NCCL_IB_AR_THRESHOLD=0 
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_IB_SPLIT_DATA_ON_QPS=0 
export NCCL_IB_QPS_PER_CONNECTION=2 
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_SOCKET_IFNAME=enp27s0np0
export NCCL_IGNORE_CPU_AFFINITY=1
. /opt/hpcx/hpcx-init.sh
hpcx_load
mpirun \
        --bind-to none \
        -mca btl tcp,self \
        -mca coll_hcoll_enable 0 \
        -mca btl_tcp_if_include enp27s0np0 \
        -x PATH \
        -x LD_LIBRARY_PATH \
        ${NCCL_TESTS_HOME}/build/all_reduce_perf -b 3G -e 24G -f 2 -g 8
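
Assuming the script above is saved as nccl-tests.sbatch (a name used here only for illustration), it can be submitted against an explicit pair of nodes, and the same script can be reused when bisecting the fabric, for example by sweeping adjacent pairs of nodes:

$ sbatch --nodes=2 --nodelist=node-[001-002] nccl-tests.sbatch

# Rough sketch: sweep every adjacent pair of nodes to isolate a slow link or switch
$ sinfo -h -N -o '%n' | sort -u | paste - - | while read a b; do
      sbatch --nodes=2 --nodelist="$a,$b" nccl-tests.sbatch
  done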

It is also important to validate Ethernet networks, for which we use iperf3.
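
A quick sketch of how the Ethernet path between two hosts might be exercised; the address and stream count are placeholders:

# On one host, start the server:
$ iperf3 -s

# On another host, run a 30-second test with 8 parallel streams:
$ iperf3 -c 10.0.0.2 -P 8 -t 30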

5. Storage Validation

Storage performance is also very important for Machine Learning workloads, so another crucial test is to measure it. There are many different storage configurations, each with its own performance characteristics.

To measure storage performance we use fio, a very flexible I/O benchmarking tool. With fio, we build jobs that measure different scenarios like random reads, random writes, sustained reads, or sustained writes at various block sizes. Here is an example of running a fio job that tests read bandwidth (a sketch of the job file follows, then the run):
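
A minimal sketch of what a sequential-read bandwidth job file might contain; the directory, block size, and runtime are illustrative placeholders rather than our exact configuration:

[global]
ioengine=libaio
direct=1
time_based=1
runtime=30
group_reporting=1
; placeholder mount point -- adjust to the filesystem under test
directory=/mnt/shared-storage

[fio-bandwidthSR]
description=Sequential Read bandwidth workload
rw=read
bs=1m
iodepth=1
numjobs=32
size=4g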


$ fio fio-read-bw.job

Jobs: 32 (f=32): [_(32),R(32)][100.0%][r=21.8GiB/s][r=22.3k IOPS][eta 00m:00s]                                                        
fio-bandwidthSR-00: (groupid=1, jobs=32): err= 0: pid=3833124: Thu May 23 06:29:40 2024
  Description  : ['Sequential Read bandwidth workload']
  read: IOPS=22.3k, BW=21.8GiB/s (23.4GB/s)(654GiB/30002msec)
    slat (usec): min=17, max=1343, avg=27.84, stdev= 9.14
    clat (usec): min=303, max=11199, avg=1404.29, stdev=394.95
     lat (usec): min=373, max=11224, avg=1432.32, stdev=394.33
    clat percentiles (usec):
     |  1.00th=[  627],  5.00th=[  832], 10.00th=[  963], 20.00th=[ 1106],
     | 30.00th=[ 1205], 40.00th=[ 1287], 50.00th=[ 1369], 60.00th=[ 1450],
     | 70.00th=[ 1549], 80.00th=[ 1680], 90.00th=[ 1876], 95.00th=[ 2073],
     | 99.00th=[ 2573], 99.50th=[ 2802], 99.90th=[ 3785], 99.95th=[ 4293],
     | 99.99th=[ 6718]
   bw (  MiB/s): min=21284, max=23354, per=100.00%, avg=22336.90, stdev=13.99, samples=1888
   iops        : min=21284, max=23354, avg=22336.71, stdev=13.99, samples=1888
  lat (usec)   : 500=0.06%, 750=2.96%, 1000=8.75%
  lat (msec)   : 2=81.67%, 4=6.49%, 10=0.07%, 20=0.01%
  cpu          : usr=0.23%, sys=2.30%, ctx=669797, majf=11, minf=8508
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=669631,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
Run status group 1 (all jobs):
   READ: bw=21.8GiB/s (23.4GB/s), 21.8GiB/s-21.8GiB/s (23.4GB/s-23.4GB/s), io=654GiB (702GB), run=30002-30002msec

6. Model Build

The last phase of our acceptance testing is to run a collection of reference tasks, tailored to the use case of our customers, to ensure that they can achieve expected end-to-end performance. This phase is crucial for validating the operational integrity and performance efficiency of the GPU clusters under real-world conditions.

One popular reference task is to build a model with off-the-shelf frameworks such as PyTorch’s Fully Sharded Data Parallel (FSDP). For customers who are interested in training models at the 1-10B scale, we often train a Llama-3 8B architecture and scale its pretraining up to 16 nodes, using FSDP as our distributed training backend and standard publicly available pre-training datasets for our train and validation splits.

During the training process, we monitor training throughput (tokens per second), model FLOPs utilization (MFU), GPU utilization, and network communication latencies for standard collectives like all-reduce, in addition to a myriad of other profiling metrics available in PyTorch’s profiler.

Through this exercise we verify our cluster is able to achieve reasonable MFU performance and communication efficiency for models in the 1 to 10 billion parameter range. 
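
As a rough illustration of how MFU is estimated (all numbers below are hypothetical, not measurements from a specific cluster): training FLOPs per token are approximately 6 times the parameter count, so MFU is the achieved FLOP rate divided by the aggregate peak of the GPUs:

$ python3 - <<'EOF'
params = 8e9        # Llama-3 8B parameter count
tok_s  = 1.2e6      # hypothetical cluster-wide training throughput, tokens/sec
gpus   = 128        # 16 nodes x 8 GPUs
peak   = 989e12     # H100 peak dense BF16 FLOPS per GPU
mfu = 6 * params * tok_s / (gpus * peak)
print(f"MFU = {mfu:.1%}")
EOF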

7. Observability

After testing, it’s important to ensure that we are continuously monitoring for hardware failures. At Together AI, our brand is trusted by enterprises because we monitor continuously, 24x7, and react quickly when there is a hardware failure. Acceptance testing (phases 1-6) is just the first step: with large clusters, it’s inevitable that we will need to deal with hardware failures.

To monitor our hardware, we use Telegraf, an open-source, lightweight server agent for collecting system metrics. Telegraf is highly customizable and extensible, allowing us to monitor a wide variety of system metrics to ensure maximum uptime and reliability of the hardware. We collect two types of observability metrics with Telegraf: cluster-level and host-level. The host-level metrics include the number of CPUs/GPUs on a node, CPU/GPU usage %, available memory, available disk space, network connectivity, etc.
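
Many of the host-level GPU metrics come from nvidia-smi; Telegraf's nvidia_smi input plugin can collect them with a minimal config like this (the binary path may differ per host image):

[[inputs.nvidia_smi]]
 # adjust to wherever nvidia-smi lives on the host image
 bin_path = "/usr/bin/nvidia-smi"
 timeout = "5s"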

The cluster-level metrics and dashboards are used to quickly verify cluster-wide health and to help diagnose problems. In the example below, it is immediately noticeable there is an issue with the average number of GPUs in the cluster.

Cluster-level metrics showing an issue with the average number of GPUs in the cluster.

We can then use host-level metrics to pinpoint the server that has a bad GPU, indicating that a GPU “fell off the bus”.

Host-level metrics showing a missing GPU.

We gather GPU temperature and power draw metrics as we’ve seen that a single GPU can get too hot and become a straggler, which will slow down an entire training run.
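
As a quick manual spot check on a suspected straggler, the relevant fields can also be queried directly on the host (the thresholds that matter depend on the GPU SKU and cooling):

$ nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks.sm --format=csv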

Host-level metrics showing GPU temperature and power draw.

Another interesting metric is DNS lookup errors. We noticed that DNS lookup errors affected training runs, so we quickly added this monitor to all of our infrastructure. We now monitor the ability of our servers to perform DNS lookups to interesting domains like R2 and S3 (for dataset download) and wandb.ai (for training with Weights & Biases).

Host-level metrics showing DNS lookup errors.

The DNS lookup error metric is a great example of Telegraf’s customizability. It is easy to use the DNS Query plugin, with this simple config, to gather these metrics:


[[inputs.dns_query]]
 servers = ["127.0.0.53"]
 domains = ["wandb.com", "r2.dev"]
 record_type = "A"
 timeout = "2s"

Conclusion

Acceptance testing is an indispensable practice for AI/ML startups striving to deliver top-tier computational resources. By adopting a comprehensive and structured approach to testing, companies can navigate the complexities of the hardware lottery, ensuring that their infrastructure is stable, reliable, and able to support the workloads they intend to run on the GPUs. We encourage our customers to run acceptance testing on the GPU clusters we deliver, and to flag any issues they encounter so we can help troubleshoot.
