Nutanix getting GPUDirect ducks lined up

A Nutanix technical white paper confirms that the company will support AI foundation model training, with distributed dataset access and fast file data delivery to Nvidia GPU servers.

Foundation models, also called large language models (LLMs), are a feature of generative AI in which models are trained on servers with multiple GPUs working in parallel, using large, petabyte-scale datasets whose data elements are accessed repeatedly. Nvidia devised its GPUDirect protocol to give a GPU server’s memory direct access to NVMe SSDs, cutting out the latency added when a storage controller’s CPU copies data from a drive into its own memory before sending it out across a network link.
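For context, GPUDirect Storage is exposed to applications through Nvidia’s cuFile API. The sketch below (not from the white paper) shows the shape of a direct read into GPU memory; the file path, buffer size, and build line are illustrative assumptions, and most error handling is elided:

#define _GNU_SOURCE                 /* for O_DIRECT on Linux */
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of a GPUDirect Storage read: data moves from storage into GPU
   memory by DMA, with no bounce through a CPU-side buffer.
   Assumed build line: nvcc gds_read.c -o gds_read -lcufile */
int main(void) {
    const size_t len = 1 << 20;                              /* 1 MiB, illustrative */
    int fd = open("/data/shard0.bin", O_RDONLY | O_DIRECT);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    cuFileDriverOpen();                               /* initialize the GDS driver */

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);                /* register the file with GDS */

    void *gpu_buf;
    cudaMalloc(&gpu_buf, len);                        /* destination is GPU memory */
    cuFileBufRegister(gpu_buf, len, 0);               /* pin the buffer for DMA */

    ssize_t n = cuFileRead(fh, gpu_buf, len, 0, 0);   /* storage -> GPU, CPU bypassed */
    printf("read %zd bytes directly into GPU memory\n", n);

    cuFileBufDeregister(gpu_buf);
    cuFileHandleDeregister(fh);
    cudaFree(gpu_buf);
    cuFileDriverClose();
    close(fd);
    return 0;
}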

Until now this has worked with external storage arrays, each with its own controllers, but not with hyper-converged infrastructure (HCI) systems, where the storage consists of drives locally attached to the scale-out HCI server nodes and virtualized across them to form a SAN. Nutanix also supports file-level access on top of this.

As generative AI model training becomes a more general requirement, Nutanix has decided it must support it by feeding data as fast as it can to Nvidia GPU servers. CEO Rajiv Ramaswami talked about the data needed for AI model training in July, saying: “GPU Direct is on our roadmap. We will have GPU direct, especially for files – this is really where you need GPU Direct. … And the other thing that’s also needed is high bandwidth I/O. We are now supporting 100 gig NICs. … and then a machine with large memory also. All of these things play a role.”


Now a Nutanix technical white paper, “Scaling Storage Solutions for Foundation Model Training,” written by SVP and GM (Engineering, Product & GTM) Vishal Sinha, explains the rationale for this and says: “Nutanix plans to integrate NVIDIA GPUDirect Storage and NFSoRDMA, allowing direct data transfers between storage and GPU memory, bypassing the CPU.”

NFSoRDMA is NFS running over Remote Direct Memory Access (RDMA), a transport that moves data directly between the memories of two systems without staging it through their CPUs.
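In practice the RDMA transport is invisible to applications: on Linux the export is mounted with the rdma transport option (e.g. mount -t nfs -o rdma,port=20049 server:/export /mnt), after which files are read with ordinary I/O calls, or handed to cuFile for GPUDirect transfers as sketched above. A minimal illustration, with a hypothetical mount point and file:

#define _GNU_SOURCE                 /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Ordinary POSIX read from a hypothetical NFSoRDMA mount; the kernel NFS
   client carries the data over RDMA, so no application changes are needed. */
int main(void) {
    const size_t len = 1 << 20;
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, len) != 0) return 1;         /* O_DIRECT wants alignment */
    int fd = open("/mnt/nus/shard0.bin", O_RDONLY | O_DIRECT);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n = pread(fd, buf, len, 0);                         /* bytes arrive via RDMA */
    printf("read %zd bytes over NFSoRDMA\n", n);
    close(fd);
    free(buf);
    return 0;
}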

The paper says: “NUS will soon support a hybrid multi-cloud federated storage namespace, providing a unified view of data across on-premises and cloud environments. This is invaluable for AI/ML training when data sources are distributed across multiple locations, including edge, data centers and cloud, as it simplifies and accelerates data access and preparation.”

NUS is Nutanix Unified Storage, which the company says “consolidates file, object, and block storage into a single, high-performance, and cost-optimized solution.”

It says NUS integrates with AWS, leveraging Elastic Block Store (EBS) and Amazon S3 to deliver high-performance, hybrid multi-cloud file storage.

The paper references a Nutanix blog that presents MLPerf Storage v1.0 benchmark results for the ResNet50 image classification workload. Sinha has separately presented a graph showing Nutanix outperforming other vendors in the number of H100 GPUs supported on this workload.

We envisage Nutanix announcing NUS support for GPUDirect and NFSoRDMA in the first half of 2025, along with a hybrid multi-cloud federated storage namespace, 200 and 400 GigE support, and servers with larger memories than at present.