Australia launches Virga cluster with Dell AI rackservers for health research

Dell XE9640 AI rackservers are being used in Australia’s Virga cluster, with a BeeGFS flash backing store, to improve cystic fibrosis diagnosis and treatment among other research areas.

Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO) is building a high-performance computing cluster for AI workloads. The Dell servers use direct liquid cooling, which is designed to be more power-efficient than air cooling. The Virga cluster occupies 14 racks at the CDC Hume Data Center in Canberra and is the first deployment of its kind in Australia. The name “Virga” refers to rain that evaporates before it reaches the ground, sometimes described as a dry storm, reflecting CSIRO’s decades of research into cloud and rain physics.

CSIRO’s Professor Elanor Huntington said in a statement: “AI is used in practically all fields of research at CSIRO, such as developing world-leading flexible printed solar panels, predicting fires, measuring wheat crops, and developing vaccines, just to name a few. High-performance computing systems like Virga also play an important role in CSIRO’s robotics and sensing work and are crucial to the recently launched National Robotics Strategy to drive competitiveness, and productivity of Australian industry.”

The Virga HPC system went out to a $14.5 million tender in November 2022 and was positioned as a replacement for CSIRO’s installed Bracewell cluster, also built with Dell server nodes. Dell won the tender in July last year with a $16.3 million bid.

Angela Fox, SVP and MD for Dell Technologies Australia and New Zealand, said: “With Dell PowerEdge servers as the foundation, Virga will help create new Australian scientific breakthroughs using its AI capabilities, all the while being both more sustainable and more energy efficient than previous generation clusters.”

The PowerEdge XE9640 servers come in 2RU chassis containing two 4th Gen Xeon Platinum 8452Y processors, each with 36 cores, and either four Intel Data Center GPU Max accelerators or four Nvidia H100 GPUs. CSIRO selected the H100 option, with 448 GPUs in total. Nvidia’s InfiniBand NDR is used as the high-speed interconnect.

Dell PowerEdge XE9640 server

There is up to 500 GB of DRAM and four 61.44 TB NVMe SSDs per node, plus 96 GB of high-bandwidth memory per GPU. The four SSDs add up to around 246 TB of flash storage per node. The SSD source has not been revealed, but Solidigm was the only public supplier of 61.44 TB SSDs at the time, with the D5-P5336, which has a QLC (4 bits/cell) format.

A Dell spokesperson told us: “The Virga Supercomputer will be connected to the existing Dell HPC BeeGFS storage, which Dell commissioned in 2019. When Dell and CSIRO collaborated to architect this storage solution in 2018, we held a shared understanding of the volume, velocity, and variety of data that scientists would leverage to push the boundaries of scientific discovery in the future, and the longevity of this Dell solution continues to be realized.”

CSIRO use of BeeGFS – 2019 ThinkParQ slide

The Virga cluster uses Nvidia’s Transformer Engine library to speed up AI performance and capabilities and help train large models within days or even hours, CSIRO says.

Virga has 60,000 cores in total, according to the Top500 list, where it holds the number 72 slot. It is rated at 14.94 PFLOPS maximal achieved performance (Rmax), against a theoretical peak performance (Rpeak) of 18.46 PFLOPS.
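To put the two Top500 figures in context, the ratio of Rmax to Rpeak gives a rough measure of how much of the theoretical peak the benchmark run achieved. The short Python sketch below works that ratio out from the numbers quoted above; the function name `hpl_efficiency` is our own label for illustration, not a Top500 or CSIRO term.

```python
# Figures quoted from the Top500 entry for Virga, in PFLOPS.
RMAX_PFLOPS = 14.94   # maximal achieved performance
RPEAK_PFLOPS = 18.46  # theoretical peak performance


def hpl_efficiency(rmax: float, rpeak: float) -> float:
    """Return achieved performance as a fraction of theoretical peak."""
    if rpeak <= 0:
        raise ValueError("rpeak must be positive")
    return rmax / rpeak


efficiency = hpl_efficiency(RMAX_PFLOPS, RPEAK_PFLOPS)
print(f"Rmax/Rpeak: {efficiency:.1%}")
```

On these figures the ratio comes out at roughly 81 percent.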

CSIRO Virga HPC cluster

The Virga node count has not been revealed, but a 14-rack structure can house 280 x 2RU slots, some of which will be needed for ancillary equipment. There are 448 H100 GPUs in total in Virga, which implies 112 nodes at four H100s per node. An H100 has 14,592 FP32 CUDA Cores and 576 Tensor Cores.
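The arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch of our own inference from the quoted figures, not a configuration confirmed by CSIRO or Dell, and the even spread of nodes across racks in the last step is purely an assumption for illustration.

```python
# Figures quoted in the article.
TOTAL_H100_GPUS = 448
GPUS_PER_NODE = 4
RACKS = 14
TOTAL_2RU_SLOTS = 280

# Inferred node count: every XE9640 node is assumed to carry four H100s.
nodes, leftover_gpus = divmod(TOTAL_H100_GPUS, GPUS_PER_NODE)

# 2RU slots per rack implied by the 280-slot figure.
slots_per_rack = TOTAL_2RU_SLOTS // RACKS

# Slots not taken by GPU nodes (switches, storage, or simply empty space).
spare_slots = TOTAL_2RU_SLOTS - nodes

# Hypothetical even distribution of nodes across racks (an assumption).
nodes_per_rack = nodes / RACKS

print(f"Inferred nodes: {nodes} (leftover GPUs: {leftover_gpus})")
print(f"2RU slots per rack: {slots_per_rack}")
print(f"Slots not used by GPU nodes: {spare_slots}")
print(f"Nodes per rack if spread evenly: {nodes_per_rack:.0f}")
```

If the inference holds, the GPU nodes would fill well under half of the available slots, which would be consistent with power and liquid-cooling limits per rack rather than physical space being the constraint, though the article's sources do not say so.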

Dr Jason Dowling of CSIRO’s Australian e-Health Research Centre said: “The new HPC facilities will allow researchers in our Australian e-Health Research Centre to train and validate new computational models, which will help us develop translational software in medical image analysis for image classification, segmentation, reconstruction, registration, synthesis, and automated radiology reporting.

“One collaborative project with the Queensland Children’s Hospital that will benefit from the new cluster is the training of AI models to diagnose pathology from MRI scans of the lungs in children with cystic fibrosis.”