DDN speeds up AI storage system in a flash

DDN has upgraded its A³I artificial intelligence-dedicated storage system, using flash to speed data access and improving its system monitoring tools.

The systems now support all Nvidia DGX servers, including the DGX A100 and SuperPod, with reference architectures for both. DDN launched the A³I in October 2018 as a turnkey, AI-dedicated storage and Nvidia DGX-1 GPU server product set, featuring all-NVMe flash and hybrid flash/disk Exascaler arrays and Lustre parallel-access file system software.

Dr. James Coomer, DDN’s VP products, said in a statement: “With huge performance boosts and innovation in operational efficiency, this new, differentiating feature set eliminates complexities and strengthens our AI supercomputing infrastructure, making it easier to manage for data-intensive scalers and data centric enterprises, alike.” 

DDN A³I system

The monitoring improvements include:

  • End-to-end performance monitoring for DGX and DGX pods with compute, storage and network status visibility
  • New GUI
  • Storage quota monitoring
  • Workload and system status
  • Call home feature

Performance boosting

The “huge” performance boosts aren’t quantified but are based on ‘Hot Nodes’ caches and ‘Hot Pools’ tiered flash and disk.

The Exascaler systems come with ‘Stratagem’, an integrated policy engine that moves data automatically between disk and flash storage tiers. This means active data is stored in flash, which enables faster access. With this week’s upgrade, Stratagem’s data placement algorithms have been improved to better determine how hot files are, based on recent access. Overall, this lowers data access latency and response times, DDN said. There are also new API
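
To make the idea of recency-based tiering concrete, here is a minimal Python sketch. It is not DDN's Stratagem algorithm, which the company has not published; the `FileStat` record, the `plan_moves` function and the one-hour `hot_window_s` threshold are all hypothetical names and values chosen for illustration. The sketch only shows the general technique: treat a file as hot if it was accessed recently, and plan a move whenever it sits on the wrong tier.

```python
from dataclasses import dataclass


@dataclass
class FileStat:
    """Hypothetical per-file record; not a DDN or Lustre data structure."""
    path: str
    last_access: float  # seconds since the epoch
    tier: str           # "flash" or "disk"


def plan_moves(files, now, hot_window_s=3600.0):
    """Return (path, target_tier) pairs for files that are on the wrong tier.

    A file counts as hot if it was accessed within the last hot_window_s
    seconds. Hot files belong on flash, all other files belong on disk.
    """
    moves = []
    for f in files:
        is_hot = (now - f.last_access) <= hot_window_s
        target = "flash" if is_hot else "disk"
        if target != f.tier:
            moves.append((f.path, target))
    return moves
```

A real policy engine would also weigh factors such as access frequency, file size and free flash capacity; this sketch uses last-access time alone to keep the principle visible.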

Hot Noding

We asked DDN questions about the Hot Nodes.

Blocks & Files: What is a Hot Node?

DDN: A Hot Node is a filesystem client attached to the DDN shared parallel storage appliances, with local NVMe devices that can be used as an actively managed buffer between shared and local data. DDN’s A³I filesystem manages the population of the Hot Node local storage to reduce unnecessary access to shared storage, and to reduce latencies for access to the hot data.

Blocks & Files: How much flash storage is used?

DDN: All that’s available. With a single NVIDIA DGX we have around 15TB of local NVMe [flash] in each system.

Blocks & Files: How does giving a hot node some flash storage remove the risk of manual data movement?

DDN: Intelligence in software works to automatically ensure that data is positioned at the appropriate time, when needed. For the most common current read-only-from-cache AI use cases, entire data sets are brought from shared to local space as they are being read, with no detriment to performance. Tested with up to 560 nodes simultaneously and working fine.

Blocks & Files: How do the Hot Nodes get “Excellent IO performance”?

DDN: In early (pre-beta) testing, we saw 10X+ acceleration with Hot Nodes when running IO at scale.

Blocks & Files: How are they different from non-Hot Nodes?

DDN: Hot Nodes are those compute clients which are designated to have a managed local cache. Typically, Hot Nodes have an accelerator (GPU, CPU or other), a high-performance network connection to storage, and local NVMe devices that act as a managed data cache. Non-Hot Nodes don’t have a local managed data cache. It’s a switch in DDN’s A³I filesystem that determines whether a node is hot or not.
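
The behaviour DDN describes resembles a read-through cache: the first read of a file pulls it from shared storage and keeps a copy on local NVMe, and later reads are served locally. The Python sketch below illustrates that general pattern only. It is not DDN's implementation; the `HotNodeCache` class, its `enabled` flag (standing in for the hot or non-hot switch) and the directory arguments are hypothetical.

```python
from pathlib import Path


class HotNodeCache:
    """Illustrative read-through cache; all names here are hypothetical."""

    def __init__(self, shared_root, local_root, enabled=True):
        self.shared_root = Path(shared_root)  # stands in for shared storage
        self.local_root = Path(local_root)    # stands in for local NVMe
        self.enabled = enabled                # False models a non-hot node
        self.hits = 0
        self.misses = 0

    def read(self, rel_path):
        """Return the file's bytes, populating the local cache on a miss."""
        shared = self.shared_root / rel_path
        if not self.enabled:
            # Non-hot node: every read goes to shared storage.
            return shared.read_bytes()

        local = self.local_root / rel_path
        if local.exists():
            self.hits += 1
            return local.read_bytes()

        # Miss: read from shared storage and keep a local copy as we go.
        self.misses += 1
        data = shared.read_bytes()
        local.parent.mkdir(parents=True, exist_ok=True)
        local.write_bytes(data)
        return data
```

This fits the read-only-from-cache case DDN mentions, where a training job reads the same data set repeatedly. The sketch ignores eviction and invalidation, which a production cache would need once local capacity fills up or shared data changes.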

Note: DDN did not provide actual performance numbers.