Nvidia says its InfiniBand-technology-infused Spectrum-X Ethernet networking can increase storage fabric read bandwidth by almost 50 percent.
Spectrum-X combines the Spectrum-4 ASIC-based Ethernet switch line, which accompanies Nvidia’s InfiniBand products and supports RoCE v2 (remote direct memory access, or RDMA, over Converged Ethernet), with the BlueField-3 SuperNIC. Nvidia’s InfiniBand products feature adaptive routing, which sends data packets across the least congested network routes when the initially selected routes are busy or a link fails. The Spectrum-4 SN5000 switch provides up to 51.2 Tbps of bandwidth across 64 x 800 Gbps Ethernet ports, and RoCE extensions for adaptive routing and congestion control work in concert with the BlueField-3 product.
Adaptively routed packets can arrive at the destination out of sequence, and Nvidia’s BlueField-3 product can reassemble them properly, “placing them in order in the host memory and keeping the adaptive routing transparent to the application.”
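BlueField-3 does this reordering in hardware, but the idea is easy to sketch in software: if every packet carries a sequence number, its payload can be written straight to the offset that number implies, so arrival order stops mattering. The following Python sketch is purely conceptual, with an invented packet size and function name, and is not Nvidia’s implementation:

```python
PACKET_SIZE = 4096  # hypothetical fixed payload size per packet

def place_packets(packets, message_size):
    """Place each payload at the offset implied by its sequence number,
    regardless of the order in which the packets arrive."""
    buf = bytearray(message_size)
    received = set()
    for seq, payload in packets:  # packets may arrive in any order
        offset = seq * PACKET_SIZE
        buf[offset:offset + len(payload)] = payload
        received.add(seq)
    total = (message_size + PACKET_SIZE - 1) // PACKET_SIZE
    assert received == set(range(total)), "packets are missing"
    return bytes(buf)

# A two-packet message that arrives in reverse order still assembles cleanly.
msg = bytes(range(256)) * 32  # 8,192 bytes
out_of_order = [(1, msg[4096:]), (0, msg[:4096])]
assert place_packets(out_of_order, len(msg)) == msg
```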
An Nvidia blog explains that, because Spectrum-X adaptive routing mitigates flow collisions and increases effective bandwidth, storage performance is much higher than with standard RoCE v2, “the Ethernet networking protocol used by a majority of datacenters for AI compute and storage fabrics.”
The blog discusses checkpointing during large language model (LLM) training, which can take days, weeks, or even months. Job state is saved periodically so that if the training run fails for any reason, it can be restarted from a saved checkpoint rather than from the beginning. It says: “With billion and trillion-parameter models, these checkpoint states become large enough – up to several terabytes of data for today’s largest LLMs – that saving or restoring them generates ‘elephant flows’ … that can overwhelm switch buffers and links.”
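The mechanics behind those elephant flows are simple to sketch. In this minimal PyTorch-style example, the shared-storage mount point is a hypothetical placeholder; the point is that every save pushes the full model and optimizer state across the storage fabric in one burst:

```python
import torch

# Hypothetical mount point for a networked array or parallel file system.
CHECKPOINT_PATH = "/mnt/shared_storage/ckpt.pt"

def save_checkpoint(model, optimizer, step):
    # For billion- and trillion-parameter models this state can run to
    # terabytes, which is what produces the bursty "elephant flows".
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        CHECKPOINT_PATH,
    )

def restore_checkpoint(model, optimizer):
    # On failure, training resumes from the saved step instead of step 0.
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```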
This assumes the checkpoint data is being sent across a network to shared storage, such as an external array, rather than to local storage in the GPU servers, a technique used in Microsoft’s LLM training.
Nvidia also says that such network traffic spikes can occur in LLM inferencing operations when RAG (retrieval-augmented generation) data is sent to the LLM from a networked storage source holding the RAG data in a vector database. It explains that “vector databases are many-dimensional and can be quite large, especially in the case of knowledge bases consisting of images and videos.”
The RAG data needs to reach the LLM with minimal latency, and this becomes even more important when the LLM is running in “multitenant generative AI factories, where the number of queries per second is massive.”
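To make the traffic pattern concrete, here is a toy sketch of the retrieval step. A brute-force NumPy similarity search stands in for a real vector database, and the corpus size and embedding dimension are invented; in production, the document vectors live in a networked vector database, and fetching the matches is the latency-sensitive traffic described above:

```python
import numpy as np

# Toy stand-in for a vector database: 100,000 documents embedded as
# unit-length 768-dimensional vectors (both numbers are arbitrary).
doc_vectors = np.random.rand(100_000, 768).astype(np.float32)
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

def retrieve(query_vector, k=5):
    """Return the indices of the k nearest documents by cosine similarity.
    Real vector databases replace this brute-force scan with an
    approximate nearest-neighbor index."""
    q = query_vector / np.linalg.norm(query_vector)
    scores = doc_vectors @ q              # one dot product per document
    return np.argsort(scores)[-k:][::-1]  # top-k document IDs, best first

top_docs = retrieve(np.random.rand(768).astype(np.float32))
# The matching documents are then fetched over the network and prepended
# to the LLM prompt as context.
```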
Nvidia says it has tested these Spectrum-X features on its Israel-1 AI supercomputer. The tests measured the read and write bandwidth generated by Nvidia HGX H100 GPU server clients accessing storage, first with the network configured as a standard RoCE v2 fabric, and then with Spectrum-X’s adaptive routing and congestion control turned on.
Tests were run with different numbers of GPU servers as clients, spanning 40 to 800 GPUs. In every case, Spectrum-X performed better, with read bandwidth improving by 20 to 48 percent and write bandwidth by 9 to 41 percent.
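Nvidia has not published the benchmark harness, but at bottom read bandwidth is just bytes moved divided by elapsed time. This single-client sketch shows the basic measurement; it deliberately ignores caching, parallel clients, and RDMA, all of which the Israel-1 tests would involve:

```python
import time

def measure_read_bandwidth(path, block_size=4 << 20):
    """Sequentially read a file in 4 MiB blocks and report GB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed / 1e9  # gigabytes per second
```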
Nvidia says Spectrum-X works well with its other offerings to accelerate the storage-to-GPU data path:
- AIR cloud-based network simulation tool for modeling switches, SuperNICs, and storage.
- Cumulus Linux network operating system built around automation and APIs, “ensuring smooth operations and management at scale.”
- DOCA SDK for SuperNICs and DPUs, providing programmability and performance for storage, security, and more.
- NetQ network validation toolset that integrates with switch telemetry.
- GPUDirect Storage for a direct data path between storage and GPU memory, making data transfer more efficient (see the sketch after this list).
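As a rough illustration of that last item, Nvidia’s kvikio library exposes the GPUDirect Storage (cuFile) API from Python. This sketch assumes a hypothetical file path; where GPUDirect Storage is unavailable, kvikio falls back to a compatibility mode that bounces data through host memory:

```python
import cupy
import kvikio  # Nvidia's Python bindings for the cuFile/GPUDirect Storage API

# Destination buffer allocated in GPU memory (100 MB; the size is arbitrary).
buf = cupy.empty(100_000_000, dtype=cupy.uint8)

# Read the file straight into GPU memory, skipping a CPU bounce buffer
# when GPUDirect Storage is available. The path is a hypothetical example.
with kvikio.CuFile("/mnt/storage/shard.bin", "r") as f:
    n = f.read(buf)

print(f"read {n} bytes directly into GPU memory")
```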
We can expect Nvidia partners such as DDN, Dell, HPE, Lenovo, VAST Data, and WEKA to support these Spectrum-X features.