Dell PowerScale storage upgraded in grab for AI model training

Dell has announced a PowerScale F910 system with a parallel file system.

PowerScale is Dell’s name for the acquired EMC Isilon scale-out filers. Up until now there were five all-flash PowerScale models: F200, F210, F600, F710 and F900. The F210 and F710 systems, announced in February, use the PCIe gen 5 bus and Sapphire Rapids Intel CPUs. These are all PowerEdge servers with directly attached storage, running the OneFS operating system. They can be clustered in groups of 3 to 252 nodes.

The F910, like the F900, comes in a 2RU chassis with 24 NVMe drives. It can hold up to 1.87 PB of capacity per node, meaning it uses 61 TB SSDs, QLC ones from Solidigm we think. An F910 blog by Dell’s Tom Wilson, a senior product manager in the Unstructured Data Solutions (UDS) group, says the F910 is “offering 20 percent more density per RU compared to the earlier released F710.”
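The 20 percent density claim can be sanity-checked with simple arithmetic. The sketch below takes the F910 figures from the paragraph above and assumes the F710 is a 1RU node with 10 NVMe drive bays; that F710 figure does not appear in this article and is used only for illustration.

```python
# Drive-bay density per rack unit (RU).
# F910 figures come from the article; the F710 figures are an assumption
# (1RU chassis, 10 NVMe drive bays) and are not stated in the article.
F910_DRIVES, F910_RU = 24, 2
F710_DRIVES, F710_RU = 10, 1  # assumed for illustration


def drives_per_ru(drives: int, rack_units: int) -> float:
    """Return the number of drive bays per rack unit."""
    return drives / rack_units


def density_gain_percent(new: float, old: float) -> float:
    """Return the percentage increase of new over old."""
    return (new - old) / old * 100


f910_density = drives_per_ru(F910_DRIVES, F910_RU)
f710_density = drives_per_ru(F710_DRIVES, F710_RU)
gain = density_gain_percent(f910_density, f710_density)
print(f"F910: {f910_density:.0f}/RU, F710: {f710_density:.0f}/RU, gain: {gain:.0f}%")
```

Under that assumption, 12 bays per RU against 10 bays per RU gives the quoted 20 percent.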

The F910 is essentially the F900 upgraded from Cascade Lake to Sapphire Rapids CPUs and from Gen 3 PCIe to the Gen 5 bus. It also requires OneFS v9.8 compared to the F210 and F710’s OneFS v9.7.

Dell PowerScale F910 slide

The F910 is available on-premises, while its OneFS v9.8 OS is also available in the public cloud as APEX File Storage (AWS and Azure). Dell says the F910 has 127 percent more streaming performance than the F900 and is up to six times faster than the Azure NetApp Files offering. It is, Dell says, the first Ethernet storage system for Nvidia’s DGX SuperPOD.

Wilson blogs: “It accelerates the model checkpointing and training phases of the AI pipeline and keeps GPUs utilized with up to 300 PBs of storage per cluster.” He adds that it “controls storage costs and optimizes storage utilization by delivering up to 2x performance per watt over the previous generation,” meaning the F900 running OneFS 9.5.

OneFS 9.8 provides RDMA for NFS v4.1, APEX File Storage for Azure, and source-based routing for IPv6 networks. The PowerScale OS is claimed to protect AI data against poisoning and against model inversion, in which an attacker trains their own machine learning model on the output from the target model and so can predict the target model’s input data from its outputs. This is akin to reverse engineering using a kind of AI model digital twin. A Defense.AI blog can tell you more. How OneFS provides a defense against model inversion is not disclosed.
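To make the model inversion idea concrete, here is a minimal toy sketch, not a description of any real attack tool or of how OneFS defends against one. It assumes a hypothetical one-input target model with hidden parameters; the attacker can only query it, records the outputs for probe inputs of their own choosing, and fits a small inverse model by least squares. All names and parameters are invented for illustration, and the attacker's guess that the target is sigmoid-shaped is a simplifying assumption.

```python
import math
import random

# Hypothetical target model: maps a private numeric input to a confidence
# score. The parameters are hidden; the attacker can only call the function.
_W, _B = 0.8, -1.5  # invented secret parameters


def target_model(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-(_W * x + _B)))


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def fit_inverse_model(probes):
    """Fit x ~ a * logit(y) + c by ordinary least squares on query results."""
    feats = [logit(target_model(x)) for x in probes]
    n = len(probes)
    mean_f = sum(feats) / n
    mean_x = sum(probes) / n
    cov = sum((f - mean_f) * (x - mean_x) for f, x in zip(feats, probes))
    var = sum((f - mean_f) ** 2 for f in feats)
    a = cov / var
    c = mean_x - a * mean_f
    return lambda y: a * logit(y) + c


random.seed(0)
probes = [random.uniform(-3.0, 5.0) for _ in range(200)]  # attacker's queries
invert = fit_inverse_model(probes)

secret_input = 2.25                            # never shown to the attacker
observed_output = target_model(secret_input)   # what the attacker sees
recovered = invert(observed_output)
print(f"recovered input: {recovered:.3f}")
```

In this toy case the inverse model recovers the hidden input almost exactly, because the target is a simple invertible function. Real models are far higher-dimensional, and real attacks recover only approximate or statistical information about inputs.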

Varun Chhabra, Dell’s SVP for ISG Marketing, said in a briefing: “We’re excited to announce Project Lightning which will deliver a parallel file system for unstructured data in PowerScale. Project Lightning will bring extreme performance and unparalleled efficiency with near line rate efficiency – 97 percent network utilisation and the ability to saturate 1000s of data hungry GPUs.”

“Lightning will deliver up to 20x greater performance than traditional all-flash, scale-out NAS vendors, making PowerScale the perfect platform for the most advanced AI workloads as well.”

Dell’s Project Lightning has a history. Back in 2010, this project was about PCIe/flash-based server cache technology. It has progressed to enable the PowerScale cluster nodes to perform I/O in parallel. Dell has not revealed any details of how the F910’s software has changed to add parallel file system access. The OneFS 9.8 release notes, for example, do not mention parallel access.
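Since Dell has not disclosed how parallel access will work in PowerScale, the following is only a generic sketch of the idea behind parallel file system reads: a file is striped across several nodes, and a client fetches stripes from all nodes concurrently instead of pulling the whole file through one node. The in-memory "nodes", the stripe size, and every function name here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 4  # bytes per stripe; tiny on purpose, for illustration only


def stripe_across_nodes(data: bytes, node_count: int):
    """Distribute fixed-size stripes round-robin over in-memory 'nodes'."""
    nodes = [dict() for _ in range(node_count)]
    for offset in range(0, len(data), STRIPE_SIZE):
        index = offset // STRIPE_SIZE
        nodes[index % node_count][index] = data[offset:offset + STRIPE_SIZE]
    return nodes


def read_node(node: dict):
    """Stand-in for a network read that returns one node's stripes."""
    return list(node.items())


def parallel_read(nodes) -> bytes:
    """Fetch from every node concurrently, then reassemble in stripe order."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = list(pool.map(read_node, nodes))
    stripes = sorted(pair for node_result in results for pair in node_result)
    return b"".join(chunk for _, chunk in stripes)


payload = b"checkpoint-shard-0001"
nodes = stripe_across_nodes(payload, node_count=3)
restored = parallel_read(nodes)
print(restored == payload)
```

The point of the design is that aggregate read bandwidth can grow with the number of nodes serving stripes, rather than being capped by the single node a client happens to be connected to.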

PowerScale model characteristics.

We are not told whether the parallel file system support extends to the other all-flash PowerScale products. Dell has been asked about these points.

Chhabra added some networking points: “GPUs are getting larger and more demanding. So networking also has to keep up with the amount of data flowing from GPU to GPU. And from server to storage. Networking is massive. We’ve partnered therefore with Broadcom to have some really big announcements to help customers with their AI network fabric to make sure that they’re getting the maximum performance out of their infrastructure. We have a comprehensive portfolio of Ethernet-based NICs, switches, and networking fabric, all of which we’re making advancements on. Starting with a brand new PowerSwitch that’s based on Broadcom Tomahawk 5, which will support 400 G and 800 G switching.”

Wilson said: “We will be announcing further enhancements coming up in the second half of this year.” These are:

  • 61TB QLC drives that will double storage capacity and data center density to accommodate large data sets required for training complex AI models.
  • Options for 200GbE and HDR 200G InfiniBand connectivity, for faster data access and more seamless cluster scaling, with NVIDIA Spectrum-4 and Quantum QM8790 switches.

The PowerScale F910 will be available globally starting May 21, 2024. You can find more information on Dell’s AI-optimized PowerScale nodes in the spec sheet and on its PowerScale website.

A Dell spokesperson told us: “The new parallel file system will be available at a later date, we’re not disclosing availability today.”

PowerScale market position

Dell’s parallel filesystem IO feat positions PowerScale as a competitor to Lustre, IBM’s Spectrum Scale, VAST Data, WEKA, and other parallel access file system storage players. It instantly upgrades PowerScale to be a serious contender as storage for AI model training, as all the fastest Nvidia GPUDirect-qualified file systems are parallel, not sequential, in nature.

On February 22, Michael Dell tweeted: “A GPU from @nvidia will often sit idle if the storage can’t feed it the data fast enough. This is why we created PowerScale, the fastest AI storage in the world.” That comment did not stack up against GPUDirect supplier stats, which showed the then sequential-IO PowerScale as the laggard compared to parallel systems from DDN, Huawei, IBM, NetApp with BeeGFS, VAST, and WEKA.

Nvidia GPUDirect bandwidth

Now it should be a different story, and we look forward to seeing newer PowerScale GPUDirect performance numbers.

By adopting parallel access, PowerScale is now differentiated from NetApp, whose ONTAP file system offering is scale-out and not parallel in nature, and also from Qumulo for the same reason.