The MLPerf Storage v1.0 benchmark has generated a lot of interest concerning how vendor scores could and should be compared. Huawei, for its part, argues the samples/sec rating should be used – normalized by storage nodes or their rack units – not the MiB/sec throughput rating.
The benchmark results measure storage system throughput in MiB/sec on three AI-relevant workloads, providing a way to compare the ability of different vendors’ systems to feed machine learning data to GPUs and keep them over 90 percent busy.
We thought the differences between vendors were so extreme that the results should be normalized in some way to make comparisons between them more valid. When we asked MLCommons whether we should normalize by host nodes to compare vendors such as Huawei, Juicedata, HPE, Hammerspace, and others, a spokesperson told us: “The scale of a given submission is indicated by the number and type of emulated accelerators – i.e. ten emulated H100s is 10x the work of one emulated H100 from a storage standpoint. While MLCommons does not endorse a particular normalization scheme, normalizing by accelerators may be useful to the broader community.”
We did that, dividing the overall MiB/sec number by the number of GPU accelerators, and produced this chart:
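The arithmetic behind that chart is plain division: each submission's aggregate throughput over its emulated accelerator count. A minimal sketch in Python, with illustrative submitter names and figures rather than actual results data:

```python
# A minimal sketch of the normalization applied for the chart.
# The submitters and figures here are illustrative, not actual results.
results = [
    # (submitter, aggregate MiB/sec, emulated accelerators)
    ("vendor_a", 100_000, 30),
    ("vendor_b", 25_000, 10),
]

for submitter, mib_per_sec, accelerators in results:
    print(f"{submitter}: {mib_per_sec / accelerators:,.0f} MiB/sec per accelerator")
```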
Huawei thinks this normalization approach is inappropriate. Jiani Liang, responsible for Huawei’s Branding and Marketing Execution, told us: “You divided the overall MiB/sec number by the number of GPU accelerators for comparison. I don’t think that works under the current benchmark rule definition.
“Per-GPU bandwidth is a good metric for AI people to understand how fast the storage can support GPU training, but only if the same GPU cluster scale is specified, since different GPU counts mean different I/O pressure on the storage, which affects the bandwidth delivered to each GPU. Small GPU clusters put less I/O pressure on storage, which leads to slightly higher per-GPU bandwidth. This trend can also be seen in the chart in your article.
“For example, with the same F8000X from YanRong, the average bandwidth per GPU is 2,783 MiB/sec in a 12-GPU cluster, but 2,711 MiB/sec in a 36-GPU cluster. In addition, the greater the number of GPUs, the greater the overhead of synchronizing between them.
“We also used the benchmark to test sync time across different host counts with the same number of GPUs per host. As you can see from the following chart, as the number of hosts – and with it the number of GPUs – increases, synchronization overhead takes up a growing proportion of the overall time, resulting in lower bandwidth per GPU. These two factors affect per-GPU bandwidth even on the same storage system, so comparability is lost.
“Since the benchmark currently specifies neither the total GPU count nor the number of GPUs per server, this metric was normalized incorrectly, without a common GPU cluster scale.”
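Liang's first point is easy to verify from the quoted YanRong figures: the per-GPU number dips only slightly at the larger scale, while the aggregate throughput the storage actually sustains nearly triples. A quick reconstruction in Python:

```python
# Reconstructing the quoted YanRong F8000X figures: aggregate throughput
# nearly triples from 12 to 36 GPUs while the per-GPU figure dips ~2.6%,
# so a per-GPU metric flatters the smaller cluster.
per_gpu_mib = {12: 2_783, 36: 2_711}  # GPU count -> MiB/sec per GPU

for gpus, per_gpu in per_gpu_mib.items():
    print(f"{gpus} GPUs: {gpus * per_gpu:,} MiB/sec aggregate, "
          f"{per_gpu:,} MiB/sec per GPU")
# 12 GPUs: 33,396 MiB/sec aggregate, 2,783 MiB/sec per GPU
# 36 GPUs: 97,596 MiB/sec aggregate, 2,711 MiB/sec per GPU
```

The second factor Liang cites, synchronization overhead that grows with host and GPU count, pushes in the same direction: more of each training interval is spent syncing rather than reading, further depressing the per-GPU figure.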
The MLPerf Storage v1.0 benchmark rules state: “the benchmark performance metric is samples per second, subject to a minimum accelerator utilization (AU) defined for that workload. Higher samples per second is better.”
Jiani Liang said: “So the challenge is what is the highest throughput one storage system can provide. I agree with you that we need to normalize the result in some way, since the scales of the submitted storage systems differ. Normalizing by the number of storage nodes, or by storage rack units, may be better for comparison.”
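In code, Huawei's suggested normalization looks like this; the submissions, node counts, and rack-unit figures below are hypothetical:

```python
# A sketch of the normalizations Huawei suggests: samples/sec divided by
# storage node count or by rack units (RU). All figures are hypothetical.
submissions = [
    # (submitter, samples/sec, storage nodes, rack units)
    ("vendor_a", 50_000, 4, 8),
    ("vendor_b", 30_000, 2, 2),
]

for name, samples_per_sec, nodes, rack_units in submissions:
    print(f"{name}: {samples_per_sec / nodes:,.0f} samples/sec per node, "
          f"{samples_per_sec / rack_units:,.0f} samples/sec per RU")
```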
Comment
A sample, in MLPerf Storage terms, is the unit of data on which training is run – for example, an image or a sentence. A benchmark storage scaling unit is defined as the minimum unit by which the performance and scale of a storage system can be increased. Examples of storage scaling units are nodes, controllers, virtual machines, or shelves. Benchmark runs with different numbers of storage scaling units allow a reviewer to evaluate how well a given storage solution is able to scale as more scaling units are added.
We note that the MLPerf Storage benchmark results table presents vendor system scores in MiB/sec per workload type, not samples/sec. We have asked the organization how samples/sec translates into MiB/sec and will add that information when we hear back.
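Pending that answer, the obvious candidate conversion is bytes moved per second: samples/sec multiplied by the workload's mean sample size. A hedged sketch, with an illustrative sample size rather than a figure from the benchmark spec:

```python
MIB = 1024 * 1024  # bytes per MiB

def mib_per_sec(samples_per_sec: float, sample_size_bytes: float) -> float:
    """Our working assumption, not confirmed by MLCommons: MiB/sec is
    samples/sec multiplied by the mean sample size in bytes."""
    return samples_per_sec * sample_size_bytes / MIB

# Illustrative: 1,000 samples/sec of ~2.3 MB samples -> ~2,193 MiB/sec.
print(f"{mib_per_sec(1_000, 2_300_000):,.0f} MiB/sec")
```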