VAST Data makes AI workloads ‘accessible to NFS users’

VAST Data “democratises HPC performance” and makes AI workloads accessible to NFS users – or so the company claims.

Using its LightSpeed technology, the high-end storage startup runs vanilla NFS storage much faster than most parallel file systems, reaching a data delivery speed of 92.6GB/sec. This is a speed-up of some 50x compared with usual NFS rates, enabling VAST Data storage arrays to feed Nvidia GPUs at line rate.

VAST wanted its array to feed data to an Nvidia GPU server but didn’t have a parallel file system, Howard Marks, VAST Data’s technology evangelist, told us. However, it needed parallel performance to feed data simultaneously through the DGX-2’s 8 ports leading to its 16 Tesla GPUs.

The company appears to be in no hurry to buy or build a parallel file system of its own. Any software upgrades to a parallel file system have to be synchronised with Linux upgrades, and this is a complex process, according to Marks. “One VAST customer likened this to a suicide mission.”

So what does VAST do, instead?

It starts with the VAST Data Universal Storage array and NFS, and adds RDMA, nconnect (NFS multi-pathing) and GPUDirect on top to make the combination run faster.

The VAST system is a scale-out cluster in which multiple compute node (Cnode) front-end controllers access back-end data box stores across an internal NVMe fabric. The data boxes, or Dnodes, have a single QLC flash storage tier, with Optane SSDs used to store file metadata and for write-staging before writes are committed to the QLC flash. Data is striped across the Dnodes and their drives for faster access and resilience.

VAST Data storage diagram.

According to Marks, an NFS source system can deliver around 2GB/sec from one mountpoint across a single TCP/IP network link without nconnect.

Linux’s NFS client has nconnect multi-pathing code to open multiple connections per mount. VAST’s system uses this and supports port failover to aid reliability.

NFS traditionally operates over TCP. This involves the operating system processing IO requests through its software stack and copying the data from storage into memory buffers before sending it to the target system and its memory. RDMA (Remote Direct Memory Access) speeds data access because it bypasses the storage IO stack and the copying of data into memory buffers. Linux supports RDMA, so VAST uses NFS-over-RDMA instead of TCP to speed data transfer across the link.
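
As a rough illustration of the client-side knobs involved – not VAST’s actual configuration – the standard Linux NFS client exposes both of these as mount options: nconnect=N opens N connections per mount, while proto=rdma switches the transport from TCP to RDMA. A minimal Python sketch, with hypothetical server and export names:

    # Hedged sketch: standard Linux NFS mount options for nconnect and RDMA.
    # Server address, export path and option values are hypothetical;
    # nconnect needs Linux 5.3+, proto=rdma needs RDMA-capable NICs, and mount needs root.
    import subprocess

    def mount_nfs(server, export, mountpoint, options):
        # Shell out to mount(8) with the given NFS options.
        subprocess.run(
            ["mount", "-t", "nfs", "-o", options, f"{server}:{export}", mountpoint],
            check=True,
        )

    # NFSv3 over TCP with eight client connections per mount:
    mount_nfs("cnode.example", "/export", "/mnt/vast", "vers=3,nconnect=8")

    # NFS over RDMA (conventionally on port 20049) instead of TCP:
    mount_nfs("cnode.example", "/export", "/mnt/vast", "vers=3,proto=rdma,port=20049")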

The company supports Mellanox ConnectX network interface cards (NICs) because, Marks says, Mellanox’s RDMA implementation is the most mature and it dominates the market. These NICs support Ethernet and InfiniBand.

Marks said: “Our secret sauce is the NFS multi-path design with multiple TCP/IP sessions on separate NICs.” A VAST graphic shows the concept:

VAST Data’s GPU data feeding scheme

The green boxes are NICs. There are eight interface ports on a DGX GPU server and VAST can link each of them to its compute nodes.

NFS over RDMA with nconnect multi-pathing pumps data transfer speed up to around 32GB/sec. There is a bottleneck because DGX-2 server memory is involved in the transfer, but VAST says there is still more network capacity available.

Nvidia’s GPUDirect technology bypasses the memory bottleneck by enabling DMA (direct memory access) between GPU memory and NVMe storage drives. It lets the storage system’s NIC talk directly to the GPU, avoiding the DGX-2’s CPU and memory subsystem.

Blue arrows show the normal data transfer steps; the orange line shows the CPU/memory bypass effect of GPUDirect.
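
On the application side, GPUDirect Storage is reached through Nvidia’s cuFile library. The sketch below uses KvikIO, the RAPIDS Python bindings for cuFile, to read a file straight into GPU memory; the file path is hypothetical and GPUDirect-capable drivers and NICs are assumed, so treat it as an illustration of the DMA path rather than VAST’s own software.

    # Hedged sketch: reading a file directly into GPU memory via KvikIO,
    # the RAPIDS Python wrapper around Nvidia's cuFile (GPUDirect Storage) API.
    # The path is hypothetical; GPUDirect-capable drivers and hardware are assumed.
    import cupy
    import kvikio

    buf = cupy.empty(1 << 20, dtype=cupy.uint8)   # destination buffer in GPU memory
    f = kvikio.CuFile("/mnt/vast/dataset.bin", "r")
    nbytes = f.read(buf)                          # DMA from storage to the GPU, skipping host bounce buffers
    f.close()
    print(f"read {nbytes} bytes straight into GPU memory")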

Data transfer speed step summary

  • NFS over TCP reaches around 2GB/sec across a single connection
  • NFS over RDMA is capped at 10GB/sec across a single connection with a 100Gbit/s NIC
  • NFS over RDMA with nconnect multi-pathing achieves 33.3GB/sec to a DGX-2
  • NFS over RDMA with GPUDirect pumps this up to 92.6GB/sec

A VAST test chart compares basic NFS over TCP, NFS with RDMA and NFSoRDMA with GPUDirect, and indicates the amount of DGX-2 CPU utilisation in the data transfer. The NFSoRDMA throughput of 33.3GB/sec comes with 99 per cent DGX-2 CPU utilisation. Moving to the GPUDirect scheme drastically lowers CPU utilisation to 16 per cent, showing the memory bypass effect, and boosts the data rate to 92.6GB/sec.

Marks says this is near the line-rate maximum of 8 x 100Gbit/s NICs, and no host server CPU is involved in the data transfer. “We’re saturating the network; you can’t go any faster.” To go faster you would need a faster network.
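
The back-of-the-envelope arithmetic behind that claim (our own working, not VAST’s) is straightforward:

    # Rough line-rate arithmetic for the DGX-2 setup described above.
    nics = 8
    nic_gbit = 100                          # per-NIC speed in Gbit/s
    line_rate_gbytes = nics * nic_gbit / 8  # 800 Gbit/s = 100 GB/sec raw aggregate bandwidth
    print(33.3 / line_rate_gbytes)          # ~0.33: NFSoRDMA with multi-pathing uses about a third of it
    print(92.6 / line_rate_gbytes)          # ~0.93: adding GPUDirect reaches ~93% of raw line rate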

An Nvidia DGX A100 has 8 x 200Gbit/s NICs, so VAST should be able to feed data to it even faster than to the DGX-2.