Big Blue developed its own Vela cluster, using Storage Scale, to train its AI models.
The foundation models behind IBM’s next-generation AI studio, watsonx.ai, which became generally available in July 2023, were trained on the Vela cluster. Storage Scale is a parallel file system, and Vela uses it as a quasi-cache in front of object storage, speeding up data I/O to keep the GPUs busy.
The Vela infrastructure powering IBM’s Gen AI model development is described in a freely available research paper.
It describes Vela as a cluster of CPU/GPU servers hosting virtual machines in the IBM Cloud. The server nodes are twin-socket systems originally fitted with Cascade Lake (second-generation) Xeon Scalable processors, 1.5TB of DRAM, four 3.2TB NVMe SSDs, and eight 80GB Nvidia A100 GPUs linked by NVLink and NVSwitch. The Xeons were later upgraded to Ice Lake.
A two-level spine-leaf Clos structure (a nonblocking, multistage switching network) based on 100Gbps network interfaces links the nodes together. The storage drives are accessed via Remote Direct Memory Access over Converged Ethernet (RoCE) and GPU-direct RDMA (GDR). GDR with RoCE allows GPUs on one system to access the memory of GPUs in another system using standard Ethernet network cards. Congestion management is built into the networking subsystem.
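To make that concrete, here is a minimal sketch of how a distributed training job typically engages this kind of fabric through PyTorch's NCCL backend, which can use GPU-direct RDMA over RoCE when the hardware supports it. The environment variables and rendezvous setup shown are illustrative assumptions, not Vela's actual configuration:

```python
import os
import torch
import torch.distributed as dist

# Illustrative only: NCCL normally detects RDMA-capable NICs by itself;
# these variables merely hint at GPU-direct RDMA (GDR) over RoCE.
os.environ.setdefault("NCCL_IB_DISABLE", "0")        # allow the IB/RoCE transport
os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")   # permit GDR system-wide

def main() -> None:
    # Rendezvous details (RANK, WORLD_SIZE, MASTER_ADDR, LOCAL_RANK)
    # are assumed to be supplied by a launcher such as torchrun.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # An all-reduce like this is the collective that GDR accelerates:
    # GPU memory moves between nodes via the Ethernet NICs, without
    # staging through host DRAM.
    t = torch.ones(1024, device="cuda")
    dist.all_reduce(t)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```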
Vela is operated by IBM Cloud as IaaS (Infrastructure as a Service). Red Hat OpenShift clusters are used for tasks that span the entire AI lifecycle, from data preparation to model training, adaptation, and ultimately model serving.
The data needed for AI training is held in object storage, but this is too slow for both reading (needed for job loading) and writing (needed for job checkpointing). The IBMers decided to use Storage Scale: “a high-performance file system … inserted between the object storage and the GPUs to act as an intermediating caching mechanism. In doing so, the data can be loaded into the GPUs much faster to start (or re-start) a training job, and model weights can be checkpointed to the file system at a much faster rate than when checkpointing directly to object storage. Thanks to unique technology in the file system we use, the checkpointed data can then be asynchronously sent to object storage but in a way that does not gate progress of the training job.”
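From the training job's point of view, that caching layer is invisible: the job just reads and writes a mounted file system while AFM drains data to object storage in the background. A minimal sketch, assuming a hypothetical mount point and plain PyTorch checkpointing (the paper does not describe IBM's actual checkpoint code):

```python
import torch

# Hypothetical mount point for the Scale file system on a GPU node.
SCALE_CKPT_DIR = "/gpfs/checkpoints"

def save_checkpoint(model: torch.nn.Module, step: int) -> None:
    # Writing to the Scale fileset completes at file-system speed;
    # the file is later copied to object storage asynchronously, so
    # the training loop is not gated on object storage bandwidth.
    torch.save(model.state_dict(), f"{SCALE_CKPT_DIR}/model_step{step}.pt")

def load_checkpoint(model: torch.nn.Module, step: int) -> None:
    # On restart, reads hit the Scale cache; anything not yet cached
    # is fetched from the object storage bucket on demand.
    state = torch.load(f"{SCALE_CKPT_DIR}/model_step{step}.pt")
    model.load_state_dict(state)
```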
A Scale client cluster runs across Vela’s GPU nodes in container-native mode, leveraging the Container Native Storage Access (CNSA) edition of Scale. The paper states that Vela “uses Kubernetes operators to deploy and manage Scale in a cloud-native fashion as well as a CSI Plugin for provisioning and attaching persistent volumes based on Scale. The client cluster does not contain any locally attached storage devices and instead performs remote mount of the file system in the storage cluster. Such an architecture allows compute and storage clusters to grow and shrink independently as workload requirements change.”
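As a rough illustration of what provisioning such a volume looks like from the Kubernetes side, here is a sketch using the official Python client. The namespace, claim name, and storage class name are placeholders, since CNSA deployments define their own:

```python
from kubernetes import client, config

def create_scale_pvc(namespace: str = "ai-training") -> None:
    config.load_kube_config()  # or load_incluster_config() inside a pod
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="training-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],           # shared across GPU pods
            storage_class_name="ibm-spectrum-scale",  # placeholder name
            resources=client.V1ResourceRequirements(
                requests={"storage": "10Ti"}
            ),
        ),
    )
    # The Scale CSI driver behind the storage class satisfies the claim
    # by remote-mounting the file system from the storage cluster.
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace=namespace, body=pvc
    )
```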
It says: “We configure Active File Management (AFM) technology to transparently connect filesets to object storage buckets. File system namespaces represent objects in buckets as files and brings data from the object storage into the file system on demand. When a file is written to the file system, AFM eventually moves it to object storage.”
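The practical effect is that the first read of a not-yet-cached file is gated by object storage, while repeat reads are served from Scale itself. A small sketch, with a hypothetical fileset path standing in for an object in a bucket:

```python
import time

# Hypothetical path inside an AFM-managed fileset; an object in the
# backing bucket would appear here as an ordinary file.
DATA = "/gpfs/datasets/shard-0001.bin"

def timed_read(path: str) -> float:
    start = time.monotonic()
    with open(path, "rb") as f:
        while f.read(64 * 1024 * 1024):  # stream in 64MB chunks
            pass
    return time.monotonic() - start

# First read may be slow: AFM pulls the object from COS on demand.
# The repeat read comes from the Scale file system.
print(f"cold read: {timed_read(DATA):.1f}s")
print(f"warm read: {timed_read(DATA):.1f}s")
```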
The total capacity of this Scale parallel file system, across all attached devices, runs to hundreds of terabytes.
The research paper says: “Scale is deployed in Vela using a disaggregated storage model. The dedicated Scale storage cluster consists of tens of IBM Cloud Virtual Server Instances (VSIs) with two 1TB virtual block volumes attached to each instance. The virtual block volumes are hosted on a next-generation cloud-native and highly performant block storage service in IBM Cloud that can meet the high throughput requirements of model training workloads.”
We’re told by a source close to IBM that, before it deployed this Scale-based storage solution, “AI researchers using Vela could either use IBM COS directly or an NFS file system that was deployed for Vela.
“Compared to NFS performance, our Scale file system achieves a near 40x read bandwidth speedup (1 GBps vs 40 GBps with Scale), which directly helps with input data read operations. Also compared to IBM COS bandwidth, the Scale file system achieves a near 3x write bandwidth speedup (5 GBps vs 15 GBps with Scale), which accelerates the checkpoint and other data write operations.”
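As a back-of-envelope illustration of what those figures mean for a running job, assume a weights-only fp16 checkpoint for a 13B-parameter model, about 26GB (the paper does not give actual checkpoint sizes):

```python
# Back-of-envelope arithmetic using the bandwidth figures quoted above
# and an assumed weights-only fp16 checkpoint for a 13B-parameter model.
PARAMS = 13e9
CKPT_GB = PARAMS * 2 / 1e9          # ~26 GB at 2 bytes per parameter

for label, gbps in [("COS direct", 5), ("Scale", 15)]:
    print(f"{label}: {CKPT_GB / gbps:.1f}s per checkpoint write")

for label, gbps in [("NFS", 1), ("Scale", 40)]:
    print(f"{label}: {CKPT_GB / gbps:.1f}s to read the same state back")
```

Over a month-long run with frequent checkpoints and restarts, differences of that size compound into hours of GPU time saved.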
This was based on iteration-time data from a Granite-13B training job using NFS and a Granite-8B job using the Scale file system.
Training jobs can take a month or more to run, according to a table in the paper.
Vela was overhauled in 2023 with the larger and more powerful Blue Vela cluster, which was built with Dell and Nvidia and came online in April 2024. We’ll describe it in a second article.