Meta (Facebook as was) has announced its AI Research SuperCluster (RSC) which it claims is among the fastest AI supercomputers running today and will be the fastest in the world once fully built out in mid-2022.
Pure Storage is providing both FlashArray and FlashBlade storage for the RSC and says “RSC will have unparalleled performance to rapidly analyse both structured and unstructured data.” Pure says it helped to design the first generation of Meta’s AI research infrastructure in 2017. That version had 22,000 Nvidia V100 Tensor Core GPUs in a single cluster and performed 35,000 training jobs a day.
Pure Storage CTO Rob Lee issued a statement: “The technologies powering the metaverse will require massively powerful computing solutions capable of instantly analysing ever-increasing amounts of data. Meta’s RSC is a breakthrough in supercomputing that will lead to new technologies and customer experiences enabled by AI. We are thrilled to be a part of this project and look forward to seeing the progress Meta’s AI researchers will make.”
RSC is powered by 760 Nvidia DGX A100 systems linked with Nvidia Quantum 200Gbit/sec InfiniBand fabric, delivering 1,896 petaflops of TF32 performance. The system is expected to be the largest customer installation of Nvidia DGX A100 systems to date once fully deployed later this year. An Nvidia blog has much more detail.
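The 1,896-petaflop figure is easy to sanity-check. A sketch of the arithmetic, assuming each DGX A100 holds eight A100 GPUs and using Nvidia's published 312-teraflop TF32 rating per A100 (with structured sparsity):

```python
# Back-of-the-envelope check of the quoted 1,896 petaflops of TF32.
# Assumptions: 8 A100 GPUs per DGX A100 system; 312 TFLOPS TF32
# per A100, which is Nvidia's spec with structured sparsity enabled.
DGX_SYSTEMS = 760
GPUS_PER_DGX = 8
TF32_TFLOPS_PER_GPU = 312

total_gpus = DGX_SYSTEMS * GPUS_PER_DGX                 # 6,080 GPUs
total_petaflops = total_gpus * TF32_TFLOPS_PER_GPU // 1_000

print(f"{total_gpus:,} GPUs -> ~{total_petaflops:,} PF TF32")
```

The GPU count (6,080) also matches the starting figure Meta quotes for its planned expansion.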
RSC will help Meta’s AI researchers build better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyse text, images, and video together; and develop new augmented reality tools. Build for the metaverse, in other words.
Its storage tier has 175 petabytes of Pure Storage FlashArray//C, 46 petabytes of cache storage in Penguin Computing Altus systems, and 10 petabytes of Pure Storage FlashBlade.
The Meta blog reads “Through 2022, we’ll work to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5x. The InfiniBand fabric will expand to support 16,000 ports in a two-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16TB/sec and exabyte-scale capacity to meet increased demand.” Good news for Pure.
This system is expected to deliver, according to Wells Fargo analyst Aaron Rakers, “~5 exaflops of mixed-precision AI performance (~220 Linpack petaflops).”
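Both build-out claims line up with simple arithmetic. A hedged check, assuming Nvidia's 312-teraflop dense FP16/BF16 tensor rating per A100 as the "mixed precision" figure:

```python
# Sanity-check the expansion claims: 6,080 -> 16,000 GPUs, and
# ~5 exaflops of mixed-precision AI performance at full build-out.
# Assumption: 312 TFLOPS per A100 at mixed (FP16/BF16 tensor) precision,
# Nvidia's dense rating.
CURRENT_GPUS = 6_080
TARGET_GPUS = 16_000
MIXED_TFLOPS_PER_GPU = 312

scale_up = TARGET_GPUS / CURRENT_GPUS                      # ~2.63x, i.e. "more than 2.5x"
exaflops = TARGET_GPUS * MIXED_TFLOPS_PER_GPU / 1_000_000  # ~4.99, i.e. "~5 exaflops"

print(f"scale-up ~{scale_up:.2f}x, ~{exaflops:.2f} exaflops mixed precision")
```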
Rakers told subscribers “Today’s announcement serves as a significant additional validation of Pure’s positioning as an all-flash array platform provider into a large hyperscale cloud customer (Meta was the cloud customer contributing to Pure F3Q22). … Pure had announced that it was deploying its AFAs at a hyperscale cloud customer impacting their F3Q22 (Oct ’21) quarter – now confirmed to be Meta; we estimate ~$30 million revenue contribution.”
Meta says its researchers have already started using RSC to train large models in natural language processing (NLP) and computer vision for research, with the aim of one day training models with trillions of parameters. Its blog goes into more detail. “We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people, each speaking a different language, so they can seamlessly collaborate on a research project or play an AR game together.”
Meta says it wanted to design a new-generation RSC because “We wanted this infrastructure to be able to train models with more than a trillion parameters on data sets as large as an exabyte – which, to provide a sense of scale, is the equivalent of 36,000 years of high-quality video.”
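The video comparison can be turned into an implied bitrate. A rough sketch, assuming a decimal exabyte (10^18 bytes) and a 365.25-day year:

```python
# What does "an exabyte = 36,000 years of video" imply about bitrate?
# Assumptions: decimal exabyte (10**18 bytes); 365.25-day year.
EXABYTE = 10**18                      # bytes
YEARS = 36_000
SECONDS_PER_YEAR = 365.25 * 86_400

seconds = YEARS * SECONDS_PER_YEAR
bits_per_second = EXABYTE * 8 / seconds
megabits = bits_per_second / 1e6      # roughly 7 Mbit/s

print(f"implied bitrate ~{megabits:.1f} Mbit/s")
```

Around 7 Mbit/s is a typical HD streaming rate, so "high-quality video" is a fair description of the comparison.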
Meta expects RSC to provide “a step function change in compute capability to enable us not only to create more accurate AI models for our existing services, but also to enable completely new user experiences, especially in the metaverse.”