DDN storage is being used in an expansion phase of Elon Musk’s xAI Colossus supercomputer.
Grok is xAI’s name for its large language model, while Colossus is the GPU server-based supercomputer used to train it and run inferencing tasks. The Colossus system is based in a set of datacenter halls in Memphis, Tennessee. The Grok LLM is available for use by X/Twitter subscribers and competes with OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and Meta’s LLaMA.
DDN is supplying its EXAScaler and Infinia systems. EXAScaler is a Lustre parallel-access file system layered on scale-out all-flash and hybrid hardware, while Infinia is DDN's petabyte-scale object storage system, typically deployed on all-flash nodes.
Alex Bouzari, CEO and co-founder of DDN, stated: “Our solutions are specifically engineered to drive efficiency at massive scale, and this deployment at xAI perfectly demonstrates the capabilities of our high-performance, AI-optimized technology.”
Elon Musk, xAI CEO, not specifically referencing DDN, said on X: “Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months. Excellent work by the team, Nvidia and our many partners/suppliers.”
Dion Harris, Nvidia’s director of accelerated datacenter product solutions, did name DDN, however, saying: “Complementing the power of 100,000 Nvidia Hopper GPUs connected via the Nvidia Spectrum-X Ethernet platform, DDN’s cutting-edge data solutions provide xAI with the tools and infrastructure needed to drive AI development at exceptional scale and efficiency, helping push the limits of what’s possible in AI.”
There have been several iterations of xAI’s Grok:
- Grok 1 is a 314-billion-parameter Mixture-of-Experts model, announced in March 2024 and trained from scratch by xAI.
- Grok 1.5 was announced later in March 2024 as an updated version with improved reasoning capabilities and an extended context length of 128,000 tokens.
- Grok 2 was announced in August 2024 and used 20,000 Nvidia GPUs for training. There were two models: Grok-2 and Grok-2 mini.
- Grok 3 phase 1 has 100,000 Hopper H100 GPUs and Nvidia Spectrum-X Ethernet. Its development was announced in July 2024 with availability slated for December.
- Grok 3 phase 2 will scale to 200,000 GPUs by adding 100,000 more Hopper GPUs (including 50,000 H200s).
- Grok 3 phase 3 could scale to 300,000 GPUs with 100,000 more Blackwell B200 GPUs according to a Musk prediction.
The entire array of systems is connected over a single Ethernet-based RDMA fabric and is possibly the largest GenAI cluster in the world to date. Grok 3 phase 1 was built in 122 days and took 19 days to go from first deployment to training.
VAST Data announced it was the storage behind Grok 3 phase 1, saying it was "honored to be the data platform behind xAI's Colossus cluster in Memphis, fueling the data processing and model training that powers this groundbreaking initiative," a cluster featuring over 100,000 Nvidia GPUs. Jeff Denworth, VAST Data co-founder, said: "I'm proud to say that VAST is the tech that is used primarily by this amazing customer." Note that "primarily" implies VAST is not the only supplier. VAST Data storage nodes are visible in a video about the Colossus datacenter's scale.
DDN president Paul Bloch said in a LinkedIn post in late October that DDN was involved in Grok 3 phase 1 as the "primary and main data intelligence and storage platforms provider."
This raised questions about the relative storage roles DDN and VAST Data played in the Grok 3 phase 1 Colossus system, as both used the word "primary." Asked about Bloch's statement, Denworth would not comment on the role of DDN's storage, telling us only this about xAI and phase 1 of Grok 3: "They're training and checkpointing and storing their data on VAST."
Now DDN is involved as a storage supplier for Grok 3 phase 2. Asked about how it related to VAST Data in Colossus, a DDN spokesperson told us: “As for VAST Data, we can’t comment on their current role, but what we do know is that DDN’s cutting-edge technology and close collaboration with Nvidia have been key to moving Colossus forward. Our solutions are designed to meet the toughest data challenges, helping organizations like xAI stay ahead in the AI race.”
Perhaps xAI is using Grok 3 phase 1 and Grok 3 phase 2 to support different workloads that need different storage data supply characteristics. In terms of public flagship AI and hyperscaler customer wins, both VAST and DDN can cite xAI's Colossus, while Hammerspace is used by Meta. WEKA's customers include Midjourney and Stability AI. All four suppliers have convinced their customers that they can provide the performance, scale, reliability, power efficiency, and cost needed.