AI processing speeds continue to improve in MLPerf Training

MLCommons has revealed new results for MLPerf Training v4.1, which the organization says highlight “the strong alignment between the benchmark suite and the current direction of the AI industry.”

MLCommons is an open engineering consortium and a leader in building benchmarks for AI data processing across different segments, including AI training. It says the MLPerf Training benchmark suite measures how fast systems can train models to a target quality metric.
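In concrete terms, the headline metric is time-to-train: the wall-clock time from the start of training until the model first reaches the target quality. A minimal sketch of that measurement loop, with hypothetical `train_one_epoch` and `evaluate` stand-ins in place of a real workload, might look like this:

```python
import time

def train_one_epoch(state: dict) -> None:
    """Hypothetical stand-in for one training pass of a real workload."""
    state["quality"] += 0.1   # pretend model quality improves each epoch
    time.sleep(0.01)          # pretend compute

def evaluate(state: dict) -> float:
    """Hypothetical stand-in for a validation pass."""
    return state["quality"]

def time_to_train(target: float = 0.75) -> float:
    """Wall-clock minutes until the model first reaches the target quality."""
    state = {"quality": 0.0}
    start = time.monotonic()
    while evaluate(state) < target:
        train_one_epoch(state)
    return (time.monotonic() - start) / 60

print(f"Time to train: {time_to_train():.4f} minutes")
```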

MLPerf Training v4.1 includes preview category submissions using two new and upcoming hardware accelerators: Google's Trillium TPUv6 and Nvidia's Blackwell B200.


“As AI-targeted hardware rapidly advances, the value of the MLPerf Training benchmark becomes more important as an open, transparent forum for apples-to-apples comparisons,” said Hiwot Kassa, MLPerf Training working group co-chair. “Everyone benefits: vendors know where they stand versus their competitors, and their customers have better information as they procure AI training systems.”

The benchmark suite comprises full system tests that stress models, software, and hardware for a range of machine learning (ML) applications. The open source and peer-reviewed benchmark suite promises a level playing field for competition that “drives innovation, performance, and energy efficiency for the entire industry,” says MLCommons.

The latest Training v4.1 results show a substantial shift in submissions for the three benchmarks that represent generative AI training workloads: GPT-3, Stable Diffusion, and Llama 2 70B LoRA. Submissions across these three increased by 46 percent in total.

The two newest benchmarks in the MLPerf Training suite, Llama 2 70B LoRA and Graph Neural Network (GNN), both saw notably higher submission rates: a 16 percent increase for Llama 2 70B LoRA and a 55 percent increase for GNN. Both also posted significant performance improvements in the v4.1 round compared to v4.0, when they were first introduced: a 1.26x speedup in the best training time for Llama 2 70B LoRA and a 1.23x speedup for GNN.

The Training v4.1 round includes 155 results from 17 submitting organizations, including ASUSTeK, Azure, Cisco, Clemson University Research Computing and Data, Dell, FlexAI, Fujitsu, GigaComputing, Google, Krai, Lambda, Lenovo, NVIDIA, Oracle, Quanta Cloud Technology, Supermicro, and Tiny Corp.


“We would especially like to welcome first-time MLPerf Training submitters FlexAI and Lambda,” said David Kanter, head of MLPerf at MLCommons. “We are also very excited to see Dell’s first MLPerf Training results that include power measurement. As AI adoption skyrockets, it is critical to measure both the performance and the energy efficiency of AI training.”

On the new accelerator competition between Google and Nvidia, Karl Freund, founder and principal analyst at Cambrian-AI Research, wrote: “The Google Trillium TPU delivered four results for clusters ranging from 512 to 3,072 Trillium accelerators. If I contrast their performance of 102 minutes for the 512-node cluster, it looks pretty good until you realize that the Nvidia Blackwell completes the task in just over 193 minutes using only 64 accelerators. When normalized, always a dangerous and inaccurate math exercise, that makes Blackwell over 4x faster on a per-accelerator comparison.”
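Freund's normalization is simple arithmetic on the two quoted figures: compute total accelerator-minutes (training time multiplied by accelerator count) for each run, then take the ratio. A quick sketch, using only the numbers quoted above:

```python
# Figures quoted above: (training time in minutes, accelerator count)
trillium = (102, 512)    # Google Trillium run: 102 minutes on 512 accelerators
blackwell = (193, 64)    # Nvidia Blackwell run: 193 minutes on 64 accelerators

def accelerator_minutes(minutes: int, count: int) -> int:
    """Total compute spent on the run: wall-clock time x accelerator count."""
    return minutes * count

# Ratio of total compute spent -> per-accelerator speed advantage
ratio = accelerator_minutes(*trillium) / accelerator_minutes(*blackwell)
print(f"Blackwell per-accelerator advantage: ~{ratio:.1f}x")  # prints ~4.2x
```

The caveat Freund flags is real: training throughput does not scale linearly with cluster size, so comparing a 512-accelerator run against a 64-accelerator run this way is only a rough guide.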

If that comparison holds up, it suggests Nvidia is well placed to continue leading the AI processing market.

MLCommons was founded in 2018 and has over 125 industry members.