Nvidia is running its DGX Cloud AI supercomputer service on Oracle’s cloud infrastructure, with its Lustre file system built on block-access NVMe SSDs.
An Nvidia blog details how its DGX Cloud uses Oracle Cloud Infrastructure (OCI) to provide compute, networking, and storage infrastructure to OCI users. DGX Cloud is a multi-node AI-training-as-a-service offering for enterprises, delivered through cloud service providers such as OCI.
The team says: “DGX Cloud eliminates the need to procure and install a supercomputer for enterprises that are already operating in the cloud – just open a browser to get started.”
With Nvidia DGX Cloud on OCI, “Nvidia pairs Oracle’s bare-metal infrastructure with the Nvidia NVMesh software. This enables file storage that is scalable on demand for use on DGX Cloud.” Nvidia acquired NVMesh technology by buying Excelero in 2022. The software takes block data from SSDs and presents it to remote systems as a pool of block storage, like a SAN (we’ll get to Lustre in a moment).
OCI bare metal E4 DenseIO compute instances, also known as shapes, are the building blocks for this high-performance storage. They consist of:
- 128 AMD EPYC Milan processor cores
- 2 TB of system memory
- 54.4 TB of NVMe storage across a total of 8 NVMe devices (SSDs)
- 2 x 50 Gbps NICs for Ethernet networking
The two 50 Gbps physical NICs on the E4 DenseIO shapes enable high availability. Because the instances are bare metal, no resources are lost to virtualization.
NVMesh takes the raw NVMe storage in the E4 shapes and uses it to build a high-performance data volume. The shapes are combined into pairs, with the NVMesh software providing high availability across each pair. In other words, data protection is built in. Encryption is also included.
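As a back-of-the-envelope check on the figures above, the short Python sketch below derives the per-drive capacity and the aggregate NIC line rate for one shape. The class and method names are illustrative and not part of any OCI or Nvidia API. The aggregate bandwidth figure assumes both NICs carry traffic at once, which the source does not confirm (it says only that the pair enables high availability), so treat it as a theoretical ceiling rather than a measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DenseIOShape:
    """Published figures for one OCI bare metal E4 DenseIO shape.

    Field names are illustrative; the values come from the spec list above.
    """
    cores: int = 128
    memory_tb: float = 2.0
    nvme_tb: float = 54.4
    nvme_devices: int = 8
    nic_count: int = 2
    nic_gbps: float = 50.0

    def tb_per_device(self) -> float:
        # Raw NVMe capacity divided evenly across the devices.
        return self.nvme_tb / self.nvme_devices

    def aggregate_nic_gbps(self) -> float:
        # Assumption: both NICs are active simultaneously.
        return self.nic_count * self.nic_gbps

    def aggregate_nic_gbytes_per_s(self) -> float:
        # 8 bits per byte; protocol overhead is ignored.
        return self.aggregate_nic_gbps() / 8


shape = DenseIOShape()
print(f"{shape.tb_per_device():.1f} TB per NVMe device")
print(f"{shape.aggregate_nic_gbps():.0f} Gbps aggregate NIC line rate")
print(f"{shape.aggregate_nic_gbytes_per_s():.1f} GB/s aggregate, before overhead")
```

If the two NICs were instead used in an active/standby arrangement, the usable line rate would be that of a single 50 Gbps link.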
These shape pairs are then used as the storage underpinnings for a Lustre file system, for both data and metadata storage.
Lustre capacity scales out dynamically and on demand by adding more shape pairs, which adds metadata capacity at the same time. This ensures metadata capacity limitations don’t cause a processing bottleneck.
Users see Lustre as an Nvidia Base Command Platform (BCP) data set and workspace storage facility. BCP provides a management and control interface to DGX Cloud, acting as its operating system and providing AI training software as a service. It works with both DGX Cloud and a DGX SuperPOD deployed on-premises or in a colocation facility. You can access a datasheet to find out more.
Nvidia says testing showed that its DGX Cloud on OCI had storage performance matching on-premises Nvidia Base Command Platform environments. DGX Cloud is also available on Azure, with AWS and GCP availability coming.
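The scale-out behavior described above can be sketched as simple arithmetic. The Python snippet below is a toy model, not Nvidia's sizing method. It assumes each shape contributes its full 54.4 TB of raw NVMe, and it treats both the usable fraction of a pair and the share set aside for Lustre metadata as hypothetical parameters, since the source states neither.

```python
import math

RAW_TB_PER_SHAPE = 54.4  # raw NVMe per E4 DenseIO shape, from the spec list
SHAPES_PER_PAIR = 2      # NVMesh combines shapes into pairs


def raw_capacity_tb(pairs: int) -> float:
    """Raw NVMe capacity contributed by a number of shape pairs."""
    if pairs < 0:
        raise ValueError("pairs must be non-negative")
    return pairs * SHAPES_PER_PAIR * RAW_TB_PER_SHAPE


def usable_capacity_tb(pairs: int, usable_fraction: float) -> float:
    """Usable capacity under a hypothetical protection overhead.

    usable_fraction is an assumption: 0.5 would correspond to full
    mirroring across each pair, which the source does not confirm.
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return raw_capacity_tb(pairs) * usable_fraction


def split_capacity(pairs: int, usable_fraction: float,
                   metadata_fraction: float) -> tuple[float, float]:
    """Return (data_tb, metadata_tb); metadata_fraction is hypothetical."""
    if not 0 <= metadata_fraction < 1:
        raise ValueError("metadata_fraction must be in [0, 1)")
    usable = usable_capacity_tb(pairs, usable_fraction)
    metadata = usable * metadata_fraction
    return usable - metadata, metadata


def pairs_needed(target_usable_tb: float, usable_fraction: float) -> int:
    """Smallest number of pairs that reaches the target usable capacity."""
    per_pair = usable_capacity_tb(1, usable_fraction)
    return math.ceil(target_usable_tb / per_pair)


# Illustrative values only: 0.5 and 0.1 are placeholders, not published figures.
for n in (1, 2, 4, 8):
    data_tb, meta_tb = split_capacity(n, usable_fraction=0.5, metadata_fraction=0.1)
    print(f"{n} pair(s): data {data_tb:.1f} TB, metadata {meta_tb:.1f} TB")
```

Whatever fractions apply in practice, the point the model illustrates is that metadata capacity grows linearly with the number of pairs, in step with data capacity.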