IBM enlists Nvidia for full-stack AI build-out

The 9:3 configuration refers to 9 x DGX-1 servers and 3 x Spectrum Scale NVMe all-flash appliances.

IBM has introduced SpectrumAI, a souped-up integrated set of server, GPU and storage components. The reference architecture includes:

  • IBM Elastic Storage Server or Spectrum Scale NVMe all-flash appliance, which runs Linux and is due in 2019
  • Three to nine Nvidia DGX-1 GPU servers (8 x Tesla V100 Tensor Core GPUs) functioning as Spectrum Scale client nodes
  • Spectrum Scale RAID v5
  • Spectrum Discover metadata management software
  • Mellanox 100Gbit/s EDR with SB7700 or SB7800 series fabric switch
  • 10GbitE management network
  • InfiniBand networking
  • Optional IBM Cloud Object Storage

The Spectrum Scale NVMe appliance is a 2U box storing up to 300TB. Three appliances working together can deliver up to 120GB/sec of throughput.
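As a back-of-the-envelope check, the sketch below works out what the quoted figures imply per appliance and per GPU in the full 9:3 configuration. It assumes throughput and capacity aggregate evenly across appliances and that every GPU draws on storage equally, which quoted peak numbers do not guarantee. The function and constant names are ours, not IBM's.

```python
# Figures quoted in the article
APPLIANCES = 3               # Spectrum Scale NVMe appliances in the 9:3 configuration
DGX1_SERVERS = 9             # Nvidia DGX-1 servers in the 9:3 configuration
GPUS_PER_DGX1 = 8            # Tesla V100 GPUs per DGX-1
AGGREGATE_GB_PER_SEC = 120   # peak throughput from three appliances
MAX_TB_PER_APPLIANCE = 300   # peak capacity of one 2U appliance


def sizing_summary():
    """Derive per-unit figures, assuming an even (idealised) distribution."""
    total_gpus = DGX1_SERVERS * GPUS_PER_DGX1
    return {
        "total_gpus": total_gpus,
        "gb_per_sec_per_appliance": AGGREGATE_GB_PER_SEC / APPLIANCES,
        "gb_per_sec_per_gpu": AGGREGATE_GB_PER_SEC / total_gpus,
        "max_capacity_tb": MAX_TB_PER_APPLIANCE * APPLIANCES,
    }


if __name__ == "__main__":
    for name, value in sizing_summary().items():
        print(f"{name}: {value:.2f}")
```

On these assumptions the full configuration has 72 GPUs, each appliance supplies 40GB/sec, and three appliances hold up to 900TB.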

IBM has published a SpectrumAI infrastructure brief and a more detailed description with training results.

These results cover Alexnet, Resnet-152, Resnet-50, Inception-3 and Lenet model training runs, which IBM presents as charts.

Training run

The Resnet-50 and Resnet-152 image recognition training results enable us to compare SpectrumAI with other AI reference architectures. Examples include AIRI from Pure Storage, DDN’s A3I, NetApp’s A700 and A800-based systems, Dell EMC AI-Ready Solutions, Cisco C480 M5, and IBM’s AC922 Power server.

A word of caution, though. IBM supplies charts showing relative performance at the 1, 4 and 8 GPU levels for the two Resnet tests, not actual numbers.

We have estimated the numbers from the charts and acknowledge that this introduces a degree of error. But it gives us the opportunity to make the cross-supplier comparison.

IBM’s SpectrumAI is the leading Resnet-152 system at the 4 and 8 GPU levels. We don’t have 2-GPU numbers from IBM, and missing bars on the chart signify that other vendors haven’t supplied numbers at particular GPU levels.

IBM’s SpectrumAI leads the Resnet-50 results with 1, 4 and 8 GPUs. As above, missing bars on the chart signify that vendors haven’t supplied numbers at particular GPU levels.

IBM has published training results for SpectrumAI systems with multiple DGX-1s, and these show a broadly linear increase in performance as DGX-1s are added. We are unable to compare these with other vendors’ systems as we don’t have their test run results for multiple DGX-1 servers.
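One way to put a number on "broadly linear" is scaling efficiency: measured throughput divided by what perfectly linear scaling from the single-server result would give. The sketch below illustrates the calculation. The images-per-second values are made-up placeholders, not IBM's results, and the function name is our own.

```python
def scaling_efficiency(throughput_by_servers):
    """Return {server_count: efficiency}, where 1.0 means perfectly linear scaling.

    throughput_by_servers maps the number of DGX-1 servers to measured
    images/sec and must include an entry for one server as the baseline.
    """
    baseline = throughput_by_servers[1]
    return {
        n: measured / (n * baseline)
        for n, measured in sorted(throughput_by_servers.items())
    }


if __name__ == "__main__":
    # Hypothetical placeholder values, NOT taken from IBM's charts
    example = {1: 1000.0, 3: 2900.0, 9: 8400.0}
    for servers, eff in scaling_efficiency(example).items():
        print(f"{servers} x DGX-1: {eff:.0%} of linear")
```

An efficiency close to 1.0 at every server count is what a broadly linear result would look like; a value that falls away as servers are added would point to a bottleneck somewhere in the stack.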

But with these results, IBM’s SpectrumAI reference architecture comes out tops in terms of Resnet-152 and Resnet-50 model training performance.