The latest version of the MLPerf Storage benchmark requires careful study to compare vendors' single-node and multi-node systems across its three workloads, differing system characteristics, and two benchmark run types. MLCommons, the organization behind the benchmark, contacted us with information about the benchmark's background and use. We're reproducing it here to help with the assessment of its results.

Kelly Berschauer, marketing director for MLCommons, told us the benchmark was originally tightly focused on the AI practitioner, who thinks in terms of training data samples and samples/second, not in terms of files, MBps, or IOPS like storage people do. For v1.0, the MLPerf Storage working group decided to align the benchmark's reporting with the more traditional metrics that storage purchasers look for (MB/s, IOPS, etc.). Each workload in the benchmark defines a "sample" as a certain amount of data, with some random fluctuation in the size of each individual sample around that number to simulate the natural variation seen in the real world. In the end, MLPerf treats the samples as a constant size: the average of the actual sizes. The average sizes are Cosmoflow (2,828,486 bytes), ResNet50 (114,660 bytes), and Unet3D (146,600,628 bytes). Since the samples are of a "fixed" size, multiplying by samples/sec gives MiB/sec, and the AI practitioner can reverse the calculation to get the samples/sec number they are interested in.
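For readers who want to move between the two metrics themselves, here is a minimal sketch of that conversion in Python, using the average sample sizes quoted above; the function names and the example throughput figure are ours, not the benchmark's.

```python
# Conversion between the AI practitioner's metric (samples/sec) and the
# storage metric (MiB/s), using the average sample sizes quoted above.
# The example throughput figure is invented.

AVG_SAMPLE_BYTES = {
    "cosmoflow": 2_828_486,
    "resnet50": 114_660,
    "unet3d": 146_600_628,
}

MIB = 1024 * 1024

def samples_per_sec_to_mib_per_sec(workload: str, samples_per_sec: float) -> float:
    """Convert samples/sec to MiB/s for the given workload."""
    return samples_per_sec * AVG_SAMPLE_BYTES[workload] / MIB

def mib_per_sec_to_samples_per_sec(workload: str, mib_per_sec: float) -> float:
    """Reverse conversion: recover samples/sec from a reported MiB/s figure."""
    return mib_per_sec * MIB / AVG_SAMPLE_BYTES[workload]

# Example: storage delivering 50 Unet3D samples/sec works out to about 6,990 MiB/s
print(samples_per_sec_to_mib_per_sec("unet3d", 50))
```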
The normalization of MiB/s per GPU is not terribly valuable because the benchmark considers a "passing result" to be an Accelerator Utilization (AU) of 90 percent or higher for Unet3D and ResNet50, and 70 percent or higher for Cosmoflow. The graph included in our article only shows that the MiB/s per GPU results vary by up to 10 percent. The benchmark uses AU as a threshold because GPUs are a significant investment, and the user wants to ensure that no GPU is starved for data because the storage system cannot keep up. The benchmark places no additional value on keeping a GPU more than 90 percent busy (or more than 70 percent busy for Cosmoflow).
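As a rough sketch of that pass/fail rule: the thresholds below are from the benchmark, while the function and its inputs are our own illustration, not the benchmark's API.

```python
# Pass/fail rule as described above: a run "passes" if measured AU meets the
# workload's threshold; exceeding the threshold earns no extra credit.

AU_THRESHOLDS = {
    "unet3d": 0.90,
    "resnet50": 0.90,
    "cosmoflow": 0.70,
}

def is_passing_result(workload: str, accelerator_utilization: float) -> bool:
    """Return True if the measured AU meets the workload's threshold."""
    return accelerator_utilization >= AU_THRESHOLDS[workload]

print(is_passing_result("unet3d", 0.93))     # True
print(is_passing_result("cosmoflow", 0.68))  # False
```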
The normalization of MiB/s per client system (what the benchmark calls "host nodes") is also not terribly valuable because, while there are physical limits on how many GPUs can be installed in a given computer system, there is no limit on the number of simulated GPUs that can be run on a "host node". As a result, no conclusions can be drawn from the relationship between the number of "host nodes" reported by the benchmark and the number of real computer systems that would be required to host the same number of physical GPUs. The benchmark uses simulated GPUs so that storage vendors can run it without the significant investment of obtaining such a (typically large) number of host nodes and GPUs.
The core insight provided by the benchmark is the "scale" of the simulated GPU Training cluster that a given set of storage gear can support; in other words, how many simulated GPUs it can keep fed. With modern scale-out architectures (whether compute clusters or storage clusters), there is no "architectural maximum" performance; if you need more performance, you can generally just add more gear. This varies by vendor, of course, but it is a good rule of thumb. The submitters in the v1.0 round used widely varying amounts of gear for their submissions, so we would expect to see a wide variety in the reported topline numbers.
Different "scales" of GPU cluster will certainly apply different amounts of load to the storage, but the bandwidth delivered per GPU must remain at least 90 percent of what is required (70 percent for Cosmoflow) or the result will not "pass" the test.
Distributed Neural Network (NN) training requires a periodic exchange of the current values of the NN's weights across the population of GPUs. Without that weight exchange, the network would not learn at all. The periodicity of the weight exchange has been thoroughly researched by the AI community, and the benchmark uses the accepted norm for the interval between weight exchanges for each of the three workloads. As the number of GPUs in the Training cluster grows, the time required to complete the weight exchange grows, but because the weight exchange is required, the AI community has treated this as an unavoidable cost. The benchmark uses an "MPI barrier" to simulate the periodic weight exchanges. The barrier forces all the GPUs to come to a common stopping point, just as the weight exchanges do in real-world training. The AU calculated by the benchmark does not include the time a GPU spends waiting for the simulated weight exchange to complete.
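Below is a minimal, illustrative sketch of how an MPI barrier can stand in for a periodic weight exchange, and how the barrier wait can be tracked separately from useful work so it can be excluded from an AU-style calculation. The loop structure, timings, and helper functions are our own assumptions, not the benchmark's actual code.

```python
# Illustrative sketch: an MPI barrier standing in for a periodic weight exchange.
# The step count, sleep times, and exchange interval are invented for illustration.
import time
from mpi4py import MPI  # run with e.g. `mpirun -n 8 python this_script.py`

comm = MPI.COMM_WORLD

STEPS = 100              # simulated training steps (invented)
EXCHANGE_INTERVAL = 10   # steps between simulated weight exchanges (invented)

def read_and_preprocess_batch():
    time.sleep(0.01)     # placeholder for reading samples from storage

def simulate_gpu_compute():
    time.sleep(0.02)     # placeholder for the emulated GPU's per-step compute time

useful_time = 0.0        # I/O plus simulated compute
barrier_wait_time = 0.0  # "dead time" spent at the simulated weight exchange

for step in range(STEPS):
    t0 = time.perf_counter()
    read_and_preprocess_batch()
    simulate_gpu_compute()
    useful_time += time.perf_counter() - t0

    if (step + 1) % EXCHANGE_INTERVAL == 0:
        t0 = time.perf_counter()
        comm.Barrier()   # every rank stops here, like a real weight exchange
        barrier_wait_time += time.perf_counter() - t0

# Tracking the barrier wait separately is what allows an AU-style calculation
# to exclude the weight-exchange dead time, as described above.
print(f"rank {comm.Get_rank()}: useful {useful_time:.2f}s, barrier {barrier_wait_time:.2f}s")
```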
The bandwidth per GPU will be the same value no matter the scale of the GPU cluster, except that there will periodically be times when no data is requested at all, during the simulated weight exchanges. It only appears that the required bandwidth per GPU drops as scale increases if one divides the total data moved by the total runtime, because that calculation does not correctly account for the "dead time" of the weight exchanges.
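A small worked example with invented numbers shows the effect: the per-GPU requirement while data is flowing is constant, but averaging in the weight-exchange dead time makes the apparent per-GPU figure shrink as the cluster grows.

```python
# Worked example with invented numbers: dividing total data moved by total runtime
# makes the per-GPU bandwidth appear to fall as the cluster scales, even though the
# per-GPU requirement while actually reading is unchanged.

required_mib_per_gpu = 1000   # bandwidth each GPU needs while reading (invented)
read_time = 100.0             # seconds spent reading between exchanges (invented)

# Assume the weight-exchange dead time grows with cluster size (invented figures)
for n_gpus, exchange_dead_time in [(8, 2.0), (64, 6.0), (512, 20.0)]:
    total_data = required_mib_per_gpu * n_gpus * read_time
    total_runtime = read_time + exchange_dead_time
    apparent_per_gpu = total_data / total_runtime / n_gpus
    print(f"{n_gpus:4d} GPUs: apparent MiB/s per GPU = {apparent_per_gpu:.0f}")

# Prints roughly 980, 943 and 833 MiB/s per GPU: the apparent per-GPU number
# drops purely because the weight-exchange dead time is being averaged in.
```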
MLPerf plans to include power measurement as an optional aspect of the benchmark in the v2.0 round, and will very likely tighten the requirements for reporting the rack units consumed by the gear used in each submission. It is also considering several additional features for v2.0. As with all its benchmarks, the working group will continue to refine it over time.