VAST Data has secured a contract from the Texas Advanced Computing Center (TACC) for its single-tier, NFS-based flash storage system, edging out traditional disk-based HPC parallel filesystems such as Lustre.
TACC, located at the University of Texas at Austin, is building Stampede3, an open-access supercomputer powered by Intel Max Series CPU and GPU nodes that can deliver 10 petaFLOPS of performance. The hardware was funded by a $10 million grant from the National Science Foundation (NSF) and will be used by TACC's scientific supercomputing research community.
Jeff Denworth, co-founder and CMO of VAST Data, said in a blog post: “TACC has selected VAST as the data platform for Stampede3, their next generation large-scale research computer that will power applications for 10,000+ of the United States’ scientists and engineers. The Stampede purchase precedes selection for their next really big system, which will likely be announced later this year.”
With 1,858 compute nodes, more than 140,000 cores, over 330 terabytes of RAM, and 13 petabytes of VAST Data storage, Stampede3 is poised to give the scientific community a significant boost in computing power. The VAST flash system provides both scratch and nearline storage, replacing the DDN Lustre disk-based system in the prior Stampede2.
The VAST storage will offer 450GBps of read bandwidth, serving as a combined scratch and nearline storage tier. After considering various storage options, including Lustre, DAOS, BeeGFS, and Weka, TACC opted for VAST Data for its ability to handle anticipated AI/ML workloads that require fast random reads. Thanks to a 2:1 data reduction ratio and QLC NAND, TACC found VAST's flash cost affordable compared to traditional disk storage.
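As a back-of-envelope illustration of why the 2:1 reduction ratio matters, the sketch below shows how data reduction halves the effective cost per logical terabyte. The prices are illustrative assumptions, not figures from TACC or VAST:

```python
# Back-of-envelope sketch: effective $/TB of flash after data reduction.
# All prices below are hypothetical assumptions for illustration only.

def effective_cost_per_tb(raw_cost_per_tb: float, reduction_ratio: float) -> float:
    """Cost per logical TB stored, given a data reduction ratio (e.g. 2.0 for 2:1)."""
    return raw_cost_per_tb / reduction_ratio

# Hypothetical raw media prices in $/TB (assumptions, not quoted figures):
qlc_flash = 80.0      # assumed QLC NAND price
nearline_disk = 45.0  # assumed nearline disk price

# With 2:1 reduction, each raw flash TB holds two logical TB,
# so the effective flash price lands close to the disk price:
flash_effective = effective_cost_per_tb(qlc_flash, 2.0)
print(f"Effective flash cost: ${flash_effective:.0f}/TB vs disk at ${nearline_disk:.0f}/TB")
```

Under these assumed prices, 2:1 reduction brings QLC flash within striking distance of disk on a per-logical-TB basis, which is the economic argument the article describes.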
Stampede3 is a hybrid or heterogeneous setup with several subsystems:
- High-end simulation from 560 Xeon Max Series CPU nodes (c. 63,000 cores); these are 4th Gen Xeon Scalable (Sapphire Rapids) processors with high-bandwidth memory (HBM),
- AI/ML and graphics subsystem using 40 Max Series GPUs (formerly codenamed Ponte Vecchio) in 10 Dell PowerEdge XE9640 servers, each GPU with 128GB of HBM2e memory,
- High-memory computation from 224 Gen 3 Xeon SP nodes incorporated from the earlier Stampede2 system,
- Legacy throughput and interactive computing from more than 1,000 Stampede2 Gen 2 Xeon SP nodes.
TACC executive director Dan Stanzione commented: “We believe the high bandwidth memory of the Xeon Max CPU nodes will deliver better performance than any CPU that our users have ever seen. They offer more than double the memory bandwidth performance per core over the current 2nd and 3rd Gen Intel Xeon Scalable processor nodes on Stampede2.”
These processing systems and the storage will be interconnected with a 400Gbps Omni-Path Fabric network, offering a smooth transition for existing users as Stampede2 transforms into Stampede3. The upgraded system is projected to operate from this year through 2029.
Stampede2 used a Lustre parallel file system running on 35 Seagate disk-based ClusterStor 300 arrays (Scalable Storage Units, or SSUs): a 33 x SSU scratch system and a 2 x SSU home system. These were backed by a 25PB DDN Lustre storage system called Stockyard, which operates across the TACC site.
Lustre is used by more than half of the TOP500 supercomputing systems and is a near-standard supercomputing file system because of its ability to deliver read/write IO to thousands of compute nodes simultaneously. TACC chose VAST's NFS over Lustre because VAST Data's architecture delivers parallel file system performance without the inherent complexities of a specialized parallel file system like Lustre. It also out-scales Lustre, we're told, having proven capable of handling an extremely IO-intensive nuclear physics workload faster and more efficiently.
One TACC nuclear physics workload is extremely IO-intensive, and Lustre can only cope with 350 nodes running it before the Lustre metadata server runs out of steam. TACC tested a VAST Data system and found it also supported 350 client nodes on this workload, running 30 percent faster than the Lustre storage. TACC then connected the VAST storage to Frontera and scaled the client count through 500, 1,000, and 2,000 nodes up to 4,000 clients, and the VAST storage kept pace.
Denworth noted that “20U of hardware running VAST software could stand up to 50 racks of Dell servers.” During a VAST software upgrade, one of the VAST storage servers suffered a hardware failure, leaving the overall system running two versions of the VAST operating system. There was also a software installer bug. Even so, the storage kept functioning, minus the one failed server and with two VAST OS versions in play, and the upgrade completed once the installer bug and the failed hardware were fixed.
Denworth said: “HPC file system updates are largely done offline (which causes downtime) and it’d be crazy to think about running in production with multiple versions of software on a common data platform.” There was no VAST system downtime during the failures, TACC said.
The VAST storage should be installed in September and Stampede3 should be operating in production mode in March next year.
TACC has also been evaluating a flash file system software upgrade for Frontera, looking at Weka and VAST. Frontera is set to be superseded by the next flagship TACC supercomputer, the exascale-class Horizon, and VAST is now in with a chance of being selected as a Horizon storage supplier.
Denworth commented: “Stampede3 will kick off a bright partnership that’s forming between VAST and TACC, and we want to thank them for their support and guidance as we chart a new path to exascale AI and HPC.”
TACC has several supercomputer systems:
- $60 million Dell Frontera; TACC’s flagship system, performing at 23.5 Linpack petaFLOPS from 8,008 Intel Xeon Cascade Lake-based nodes plus specialized subsystems. It is 21st in the TOP500 supercomputer list. It has 56PB of capacity with 4 x DDN 18K Exascaler disk-based storage arrays providing 300GBps bandwidth. There are also 72 x DDN IME flash servers with 3PB capacity and 1.5TBps bandwidth.
- $30 million Dell-based Stampede2 provides 10.7 petaFLOPS using 4,200 Intel Knights Landing-based nodes and 1,736 Intel Xeon Skylake-based nodes. It is a capacity-class system ranked number 56 in the TOP500 list. It is the second-generation Stampede system, being superseded by Stampede3. Stampede2 is a 2017 system, and followed on from the initial 2012 Stampede system.
- Lonestar5 for HPC and remote visualization jobs running at 301.8 teraFLOPS with >1,800 nodes and >22,000 cores with a 12PB Dell BeeGFS file storage system. Now superseded by Lonestar6.
- Wrangler is a smaller supercomputer of 62 teraFLOPS for data-intensive work, such as Hadoop, with 96 Intel Haswell nodes (24 cores and a minimum of 128GB DRAM per node), a 500TB high-speed flash-based object storage system, and a 10PB disk-based mass storage subsystem with a replicated site in Indiana.
- Stockyard2 is a global file system providing a shared 10PB DDN Lustre project workspace with 1TB/user and 80GBps bandwidth.
- Ranch is a 100PB tape archive using a Quantum tape library.
Stampede2 has been successful as an open science system, with more than 11,000 users working on more than 3,000 funded projects, running more than 11 million simulation and data analysis jobs since it entered service in 2017. This replicated the success of the original Stampede, which ran more than 8 million simulations with more than 3 billion compute hours delivered to 13,000-plus users on over 3,500 projects.
At one point in 2018, Stampede2 was fitted with 3D XPoint NVDIMMs as an experimental component in a small subset of the system.