Nvidia’s GPUDirect Storage vanquishes AI bounce buffers

The datasets used in big data analytics and AI model training can run to hundreds of terabytes, involving millions of files and file accesses. Conventional x86 processors are poorly suited to this task, so GPUs are typically used to crunch the data: their highly parallel design can process millions of repetitive operations many times faster than CPUs.

However, there is a performance bottleneck to overcome when data is transferred from storage to GPUs via the host server.

Typically, data transfers are controlled by the server's CPU. Data flows from storage attached to the host server into the server's DRAM and then out via the PCIe bus to the GPU. Nvidia says this process becomes IO bound as data transfers increase in number and size, and GPU utilisation falls as the GPU waits for data to crunch.

For example, an IO-bound GPU system used in fraud detection might not respond in real time to a suspect transaction, resulting in lost money, whereas one with faster access to the data could detect and block the suspect transaction and alert the account holder.

Normally, data is bounced into the host server's memory and back out of it on its way to the GPU. This bounce buffer is required because that is how server CPUs run IO processes, but it is a performance bottleneck.
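
For a concrete picture of that bounce, here is a minimal CUDA C sketch of the conventional path. The file name and transfer size are placeholders and error handling is omitted; the point is that the CPU first reads the data into a pinned host buffer, then copies it across PCIe into GPU memory.

```c
// Conventional path: storage -> host DRAM (bounce buffer) -> GPU memory.
// File name and size are illustrative; error handling trimmed for brevity.
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    const size_t size = 1 << 20;              // 1 MiB chunk (illustrative)
    void *host_buf, *dev_buf;

    cudaMallocHost(&host_buf, size);          // pinned host "bounce buffer"
    cudaMalloc(&dev_buf, size);

    int fd = open("/data/sample.bin", O_RDONLY);
    pread(fd, host_buf, size, 0);             // 1) storage -> host DRAM, driven by the CPU
    cudaMemcpy(dev_buf, host_buf, size,       // 2) host DRAM -> GPU memory over PCIe
               cudaMemcpyHostToDevice);

    // ... launch kernels on dev_buf ...

    close(fd);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```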

If the IO path can deliver higher throughput and lower latency, application run times shorten and GPU utilisation increases.

Modern architecture

Nvidia, the dominant GPU supplier, has worked away at this problem in stages. In 2017, it introduced GPUDirect RDMA (remote direct memory access), which enabled network interface cards to bypass host CPU memory and access GPU memory directly.

The company’s GPUDirect Storage (GDS) software, currently in beta, goes beyond the NICs to get drives talking directly to the GPUs. API hooks will enable storage array vendors to feed more data, faster, to Nvidia GPU systems such as the DGX-2.

GDS enables DMA (direct memory access) between GPU memory and NVMe storage drives. The drives may be direct-attached, or external and accessed via NVMe-over-Fabrics. With this architecture, the host server CPU and DRAM are no longer in the data path, and the IO path between storage and the GPU is shorter and faster.

Blocks & Files GPUDirect diagram. Red arrows show the normal data flow. Green arrows show the shorter, more direct GPUDirect Storage data path.

GDS extends the Linux virtual file system to accomplish this; according to Nvidia, standard Linux cannot currently perform DMA into GPU memory.

The GDS control path still uses the file system on the CPU, but the data path no longer needs host CPU memory. Applications access GDS via new CUDA cuFile APIs on the CPU.
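
As a rough sketch of what this looks like from application code, a GDS read registers the file and the GPU buffer with cuFile and then reads straight into device memory. The file path and transfer size below are placeholders, and the beta API may differ in detail.

```c
// GPUDirect Storage path: NVMe -> GPU memory via cuFile, with no host bounce buffer.
// Sketch only; file path and size are placeholders, error handling trimmed.
#define _GNU_SOURCE                                   // for O_DIRECT on Linux
#include <cuda_runtime.h>
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void) {
    const size_t size = 1 << 20;
    void *dev_buf;
    cudaMalloc(&dev_buf, size);

    cuFileDriverOpen();                               // initialise the GDS driver

    // GDS bypasses the page cache, so the file is opened with O_DIRECT.
    int fd = open("/data/sample.bin", O_RDONLY | O_DIRECT);
    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);            // register the file with cuFile
    cuFileBufRegister(dev_buf, size, 0);              // register the GPU buffer for DMA

    // Control path runs on the CPU; the data DMAs directly into GPU memory.
    cuFileRead(handle, dev_buf, size, 0 /*file offset*/, 0 /*buffer offset*/);

    cuFileBufDeregister(dev_buf);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    cudaFree(dev_buf);
    return 0;
}
```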

Performance gains

Bandwidth from the CPU and system memory to the GPUs in a DGX-2 is limited to 50GB/sec, Nvidia says, and this can rise to 100GB/sec or more with GDS. The software aggregates multiple data sources, such as internal and external NVMe drives, adding their bandwidth together.

Nvidia cites a TPC-H decision support benchmark at scale factors (database sizes) of 1K and 10K. Using a DGX-2 with eight drives, latency in the 1K scale factor test was 20 per cent of the non-GDS run, a fivefold speedup. At the 10K scale factor, latency with GDS was 3.33 to five per cent of the non-GDS case, a 20x to 30x speedup.

Nvidia GDS TPC-H benchmark slide.

Four flying GDS partners

DDN, Excelero, VAST Data and WekaIO are working with Nvidia to ensure their storage supports GPUDirect Storage.

DDN is supplying full GDS integration with its A3I systems: the AI200, AI400X and AI7990X.

Excelero will make GDS support generally available in the fourth quarter for disaggregated, converged and hybrid environments. It has a roadmap to develop a GDS-optimised stack for shared file systems.

An Nvidia GDS slide deck provides detailed charts showing VAST Data and WekaIO performance with GDS storage access.

VAST Data achieved 92.6GB/sec from its Universal Storage array with GDS, while WekaIO recorded 82GB/sec read bandwidth between its file system and a single DGX-2 across eight EDR InfiniBand links using two Supermicro BigTwin servers.

Blocks & Files envisages that other mainstream storage array suppliers will soon announce support for GDS. Also, a GPUDirect Storage compatibility mode allows the same APIs to be used when the GDS-enabling software components are not in place (see the sketch below). At the time of writing, Nvidia has not declared Amazon S3 support for GDS.
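
One way an application might exploit that compatibility behaviour is sketched below. This is an assumption about application-level fallback rather than Nvidia's documented mechanism; the helper `read_to_gpu` is hypothetical, and it simply probes for the GDS driver and drops back to the bounce-buffer path when GDS is unavailable.

```c
// Hypothetical application-level fallback: use cuFile when the GDS driver is
// available, otherwise fall back to the conventional bounce-buffer path.
#include <cuda_runtime.h>
#include <cufile.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

// Reads 'size' bytes at 'offset' from an already-open fd into GPU memory.
// For the GDS path, the fd should have been opened with O_DIRECT.
ssize_t read_to_gpu(int fd, void *dev_buf, size_t size, off_t offset) {
    CUfileError_t err = cuFileDriverOpen();
    if (err.err == CU_FILE_SUCCESS) {
        CUfileDescr_t descr;
        memset(&descr, 0, sizeof(descr));
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);
        ssize_t n = cuFileRead(handle, dev_buf, size, offset, 0);
        cuFileHandleDeregister(handle);
        cuFileDriverClose();
        return n;                                  // direct NVMe -> GPU path
    }

    // Fallback: bounce through pinned host memory, as in the conventional path.
    void *host_buf;
    cudaMallocHost(&host_buf, size);
    ssize_t n = pread(fd, host_buf, size, offset);
    if (n > 0)
        cudaMemcpy(dev_buf, host_buf, (size_t)n, cudaMemcpyHostToDevice);
    cudaFreeHost(host_buf);
    return n;
}
```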

Nvidia GDS is due for release in the fourth quarter. You can watch a GDS webinar for more detailed information.