DDN generates QLC SSD array for generative AI training

DDN has unveiled an upgraded A1400X2 ExaScaler array for AI and machine learning storage workloads that uses QLC SSDs and adds compression to raise effective capacity.

ExaScaler is DDN’s Lustre-based scale-out parallel file system software. QLC SSDs store 4 bits per cell, so a die holds more data than with a TLC (3 bits per cell) arrangement, at the cost of slower IO speed and lower endurance. DDN says its new compression feature has been optimised for HPC and AI workloads.

Kurt Kuckein, DDN’s marketing VP, told us: “Over the last four or five years, we’ve seen this uptake in enterprise customer interest around DDN system, specifically driven by these AI algorithms. And that has really taken off this year with the broad interest in generative AI, ChatGPT and others have really driven interest in our AI solutions, especially in conjunction with the Nvidia SuperPOD systems.”

Senior Veep of Products James Coomer told us that DDN has about 48 A1400X2 arrays supporting Nvidia’s largest SuperPODs: “All the other SuperPODs in the world, the vast majority of them, are running just multiples of the same unit.”

Adding QLC flash and compression to the A1400X2 array “delivers both the best performance, as well as really good flash capacity for customers.” This system can provide 10x more data than competing systems while using a fraction of their electrical energy, he claimed. It uses 60TB QLC drives, enabling 1.45PB capacity in a 2RU x 24-slot chassis, doubling capacity per watt compared to the 30TB SSDs available from other suppliers.

The A1400X2 QLC uses a standard A1400X2 controller (storage compute node) in a 2RU chassis. It has 732TB of TLC SSD storage and a multi-core, real-time RAID engine and controller combo that can pump out 3.5 million IOPS and 95GBps. Two, four or five SP2420 QLC SSD expansion trays can be added to it, linked across NVMe/oF and Ethernet. A two-tray configuration holds up to 2.9PB of raw QLC capacity. That’s 2.3PB usable which, after compression, becomes 4.7PB effective.

The maximum effective QLC capacity is 11.7PB in a fully configured, five-tray system. DDN claims the new array provides a cost saving of up to 80 percent versus a comparable-capacity TLC array, and enables apps to run faster than on an NFS array. The added client-side compression uses client CPU cycles but, because the resulting dataset is smaller, overall read and write performance is about the same.
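The capacity figures can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is ours, not DDN’s: it assumes the 2.9PB raw, 2.3PB usable and 4.7PB effective numbers describe a two-tray configuration, that capacity scales linearly with tray count, and that units are decimal (1PB = 1,000TB). All names in it are made up for illustration.

```python
# Figures quoted in the article (decimal units: 1PB = 1000TB).
DRIVE_TB = 60
SLOTS_PER_TRAY = 24

# Assumption: these three figures describe a two-tray configuration.
TWO_TRAY_RAW_PB = 2.9
TWO_TRAY_USABLE_PB = 2.3
TWO_TRAY_EFFECTIVE_PB = 4.7

# Raw capacity of one 24-slot tray filled with 60TB drives.
raw_per_tray_pb = DRIVE_TB * SLOTS_PER_TRAY / 1000

# Ratios implied by the quoted figures.
usable_fraction = TWO_TRAY_USABLE_PB / TWO_TRAY_RAW_PB           # share left after RAID/overhead
compression_ratio = TWO_TRAY_EFFECTIVE_PB / TWO_TRAY_USABLE_PB   # implied data reduction


def effective_pb(trays: int) -> float:
    """Estimate effective QLC capacity, assuming linear scaling with tray count."""
    raw_pb = trays * raw_per_tray_pb
    return raw_pb * usable_fraction * compression_ratio


if __name__ == "__main__":
    print(f"usable fraction ~{usable_fraction:.2f}, compression ~{compression_ratio:.2f}:1")
    for trays in (2, 4, 5):
        print(f"{trays} trays: ~{effective_pb(trays):.1f}PB effective")
```

On these assumptions the implied compression ratio is roughly 2:1, and five trays work out to about 11.7PB effective, which lines up with the quoted maximum.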

DDN says its QLC version of the A1400X2 has a better price per flash TB than its existing TLC version, although the TLC version delivers better IOPS (up to 70 million) and throughput per rack. A hybrid TLC flash/disk system offers an even lower price per TB. DDN says it can meet datacenter AI storage budgets at three levels: optimized for sheer performance, for price/performance, or for lower costs.

It is generally thought that the larger the dataset used for training a machine learning model, the better the result. That would encourage use of DDN’s A1400X2 QLC array. DDN also sees possibilities for it in other application areas, such as realistic 3D and immersive universes in gaming, protein and molecule creation for drug discovery, and autonomous driving.

DDN says its A1400X2 QLC system design does not need the internal switches and networks used by scale-out NAS systems, which are based around a controller chassis talking to flash JBOFs through a switched network. That helps lower its rackspace occupancy, cost and management complexity.

Coomer said: “Today’s QLC scale-out NAS systems offer low cost and high capacity, but they are extremely inefficient with IOPS, throughput and latencies, making them unusable for high-performance environments such as AI, machine learning, and real-time applications.”

Given that VAST Data has just announced SuperPOD certification for its scale-out NAS system, and says that parallel file systems are complex compared to NAS, we are going to see the two competing for the same customers. Customers new to AI model training who currently use NAS rather than a parallel file system could go with VAST in preference to DDN. Existing parallel file system users could possibly find that the A1400X2 QLC slides more easily into their workflows than a NAS-based system.

NetApp has recently announced a QLC flash array (AFA C-Series), joining Dell (PowerScale), Pure Storage (FlashBlade//E) and VAST Data. QLC flash is now a mainstream technology.

The 60TB drives make a huge capacity increase possible over 30TB drives, and we know Solidigm has 60TB QLC SSDs coming. Kioxia and other NAND fabricators/SSD suppliers are bound to follow suit, but maybe not Micron, which is plugging away at higher capacity TLC drives built with 232-layer technology.

DDN will ship its A1400X2 QLC systems in the June-August period.