IBM’s Vela AI supercomputer, described here, was not powerful enough for IBM Research’s AI training needs. In 2023, IBM Research began the Blue Vela build-out to supply the major expansion of GPU compute capacity required for its AI model training. To date, Blue Vela is actively being used to run Granite model training jobs.
Update, 12 Aug 2024: Performance comparison between Vela and Blue Vela added.
Blue Vela is based around Nvidia’s SuperPod concept and uses IBM Storage Scale appliances, as we shall see.
Vela is hosted on the IBM Cloud but the Blue Vela cluster is hosted in an IBM Research on-premises datacenter. This means that IBM Research has ownership and responsibility for all system components, from the infrastructure layer to the software stack.
As the number of GPUs needed to train larger and more connected models increases, communication latency becomes a critical bottleneck. The design of Blue Vela therefore started with the network, and the system is built around four distinct purpose-built networks:
- A compute InfiniBand fabric, which facilitates GPU-to-GPU communication, as shown below
- A storage InfiniBand fabric, which provides access to the storage subsystem, as shown below
- An in-band Ethernet host network, which is used for inter-node communication outside the compute fabric
- An out-of-band network, also called the management network, which provides access to the management interfaces on the servers and switches
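The paper does not spell out how individual jobs are steered onto each fabric, but the usual approach with NCCL-based training is to pin collective traffic to the compute-fabric HCAs via environment variables. Below is a minimal sketch of that idea, using placeholder device names (the actual HCA and NIC identifiers are not given in the paper) and the eight-plus-two HCA split listed in the compute-node spec further down.

```python
import os

# Placeholder device names: the paper does not list the HCA or NIC identifiers,
# so mlx5_0..mlx5_9 and eno1 are illustrative assumptions.

# Pin NCCL's GPU-to-GPU collective traffic to the eight compute-fabric HCAs.
os.environ["NCCL_IB_HCA"] = ",".join(f"mlx5_{i}" for i in range(8))

# Keep NCCL's bootstrap traffic on the in-band Ethernet host network.
os.environ["NCCL_SOCKET_IFNAME"] = "eno1"

# The remaining two HCAs (e.g. mlx5_8, mlx5_9) are left to the storage fabric,
# while the 1G management port stays on the out-of-band network.
```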
Blue Vela follows Nvidia’s SuperPod reference architecture and is built from 128-node Compute Pods, each comprising four Scalable Units of 32 nodes, with every node containing Nvidia H100 GPUs. Nvidia’s Unified Fabric Manager (UFM) is used to manage the InfiniBand networks that make up the compute and storage fabrics. UFM can help recognize and resolve single-GPU throttling or non-availability, but it is not available for Ethernet networks.
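As a quick sanity check on the scale of a pod, here is the arithmetic implied by those figures, using the eight GPUs per node from the compute-node specification that follows:

```python
# Pod-level arithmetic implied by the SuperPod layout described above.
NODES_PER_SCALABLE_UNIT = 32
SCALABLE_UNITS_PER_POD = 4
GPUS_PER_NODE = 8  # eight H100s per XE9680 node (see the spec below)

nodes_per_pod = SCALABLE_UNITS_PER_POD * NODES_PER_SCALABLE_UNIT  # 128 nodes
gpus_per_unit = NODES_PER_SCALABLE_UNIT * GPUS_PER_NODE           # 256 GPUs
gpus_per_pod = nodes_per_pod * GPUS_PER_NODE                      # 1,024 GPUs

print(f"{nodes_per_pod} nodes per pod, {gpus_per_unit} GPUs per Scalable Unit, "
      f"{gpus_per_pod} GPUs per pod")
```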
A compute node is based on Dell’s PowerEdge XE9680 server and consists of:
- Dual 48-core 4th Gen Intel Xeon Scalable Processors
- 8 Nvidia H100 GPUs, each with 80GB of High Bandwidth Memory (HBM)
- 2TB of RAM
- 10 Nvidia ConnectX-7 NDR 400 Gbps InfiniBand Host Channel Adapters (HCA)
– 8 dedicated to compute fabric
– 2 dedicated to storage fabric
- 8 x 3.4TB Enterprise NVMe U.2 Gen4 SSDs
- Dual 25G Ethernet Host links
- 1G Management Ethernet Port
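Taken together, those components give each node the following aggregate resources; a quick tally, using only the figures listed above:

```python
# Per-node totals implied by the specification above.
GPUS = 8
HBM_PER_GPU_GB = 80
COMPUTE_HCAS, STORAGE_HCAS = 8, 2
HCA_SPEED_GBPS = 400            # NDR InfiniBand per ConnectX-7 HCA
NVME_DRIVES, NVME_TB = 8, 3.4

hbm_total_gb = GPUS * HBM_PER_GPU_GB                   # 640 GB of HBM per node
compute_bw_gbps = COMPUTE_HCAS * HCA_SPEED_GBPS        # 3,200 Gbps into the compute fabric
storage_bw_gbps = STORAGE_HCAS * HCA_SPEED_GBPS        # 800 Gbps into the storage fabric
local_nvme_tb = NVME_DRIVES * NVME_TB                  # 27.2 TB of local NVMe

print(f"{hbm_total_gb} GB HBM, {compute_bw_gbps} Gbps compute-fabric bandwidth, "
      f"{storage_bw_gbps} Gbps storage-fabric bandwidth, {local_nvme_tb:.1f} TB local NVMe")
```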
IBM “modified the standard storage fabric configuration to integrate IBM’s new Storage Scale System (SSS) 6000, which we were the first to deploy.”
These SSS appliances are integrated storage systems running Storage Scale that can scale up and scale out to as many as 1,000 appliances. They support automatic, transparent data caching to accelerate queries.
Each SSS 6000 appliance can deliver upwards of 310 GBps read throughput and 155 GBps write throughput across its InfiniBand and PCIe Gen 5 interconnects. Blue Vela started with two fully populated SSS 6000 chassis, each with 48 x 30 TB U.2 G4 NVMe drives, providing almost 3 PB of raw storage. Each SSS appliance can accommodate up to seven additional external JBOD enclosures, each with up to 22 TB, to expand capacity, and Blue Vela’s fabric allows for up to 32 x SSS 6000 appliances in total.
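The raw capacity figures follow directly from those numbers. A quick sketch, ignoring the external JBOD expansion option and assuming each additional appliance would be fully populated like the first two:

```python
# Raw capacity of the initial storage deployment described above.
CHASSIS = 2
DRIVES_PER_CHASSIS = 48
DRIVE_TB = 30

initial_raw_tb = CHASSIS * DRIVES_PER_CHASSIS * DRIVE_TB          # 2,880 TB
print(f"Initial raw capacity: {initial_raw_tb} TB (~{initial_raw_tb / 1000:.1f} PB)")

# The fabric supports up to 32 SSS 6000 appliances; if each were fully
# populated with the same drives, the ceiling would be roughly:
MAX_APPLIANCES = 32
max_raw_tb = MAX_APPLIANCES * DRIVES_PER_CHASSIS * DRIVE_TB       # 46,080 TB
print(f"Fabric ceiling (no JBODs): ~{max_raw_tb / 1000:.1f} PB raw")
```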
Blue Vela has separate management nodes, based on Dell PowerEdge R760XS servers, which are used to run services such as authentication and authorization, workload scheduling, observability, and security.
On the performance front, the paper authors say: “From the onset, the infrastructure has demonstrated good potential in throughput and has already shown a 5 percent higher performance out-of-the-box compared to other environments of the same configuration.”
“The current performance of the cluster shows high throughputs (90-321B per day depending on the training setting and the model being trained).”
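That range appears to refer to billions of tokens processed per day; converting it into per-second rates gives a rough feel for the numbers (a simple unit conversion, nothing more):

```python
# Convert the quoted 90-321B per day range into per-second rates,
# assuming "B" denotes billions of tokens.
SECONDS_PER_DAY = 24 * 60 * 60

for tokens_per_day in (90e9, 321e9):
    rate = tokens_per_day / SECONDS_PER_DAY
    print(f"{tokens_per_day / 1e9:.0f}B tokens/day ≈ {rate / 1e6:.2f}M tokens/s")
```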
Vela vs Blue Vela
A comparison between the Vela and Blue Vela systems can use this formula: # Training Days = 8 * #tokens * #parameters / (#GPUs * FLOPS per GPU). On this basis:
- IBM Vela – 1,100 x A100 GPUs for training, with a theoretical performance of 300 teraFLOPS per GPU (bf16)
- IBM Blue Vela – 5,000 x H100 GPUs for training, with a theoretical performance of 1,000 teraFLOPS per GPU (bf16)
On a per-GPU basis, this makes Blue Vela more than three times faster than Vela (1,000 vs 300 teraFLOPS); with roughly 4.5 times as many GPUs as well, the cluster’s aggregate theoretical compute is around 15 times higher.
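To make that arithmetic concrete, here is a minimal sketch of the formula. The 20-billion-parameter model and 2 trillion tokens are illustrative placeholders rather than IBM figures; the formula as written yields seconds once peak FLOPS is expressed per second, so a seconds-per-day conversion is added, and utilization is assumed to be 100 percent of theoretical peak.

```python
# Worked example of the training-days formula above, at theoretical peak.
SECONDS_PER_DAY = 24 * 60 * 60

def training_days(tokens, parameters, gpus, flops_per_gpu):
    # 8 * tokens * parameters approximates total training FLOPs (per the formula above).
    total_flops = 8 * tokens * parameters
    return total_flops / (gpus * flops_per_gpu * SECONDS_PER_DAY)

# Hypothetical workload: a 20B-parameter model trained on 2T tokens.
tokens, params = 2e12, 20e9

vela      = training_days(tokens, params, gpus=1100, flops_per_gpu=300e12)   # A100, bf16 peak
blue_vela = training_days(tokens, params, gpus=5000, flops_per_gpu=1000e12)  # H100, bf16 peak

print(f"Vela: {vela:.1f} days, Blue Vela: {blue_vela:.1f} days")
print(f"Per-GPU speedup: {1000 / 300:.1f}x; whole-cluster speedup: {vela / blue_vela:.1f}x")
```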
There is much more detailed information about the Blue Vela datacenter design, management features and software stack in the IBM Research paper.