
IBM enlists Nvidia for full-stack AI build-out


IBM has introduced SpectrumAI, a souped-up integrated set of server, GPU and storage components. The reference architecture includes:

  • IBM Elastic Storage Server or Spectrum Scale NVMe all-flash appliance, which runs Linux and is due in 2019
  • Three to nine Nvidia DGX-1 GPU servers (each with 8 x Tesla V100 Tensor Core GPUs) functioning as Spectrum Scale client nodes
  • Spectrum Scale RAID v5
  • Spectrum Discover metadata management software
  • Mellanox 100Gbit/s EDR InfiniBand networking with SB7700 or SB7800 series fabric switch
  • 10GbitE management network
  • Optional IBM Cloud Object Storage

The Spectrum Scale NVMe appliance is a 2U box storing up to 300TB. Three appliances working together can deliver up to 120GB/sec of throughput.
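To put the throughput figure in context, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (three appliances, 120GB/sec aggregate, eight GPUs per DGX-1). The even split across appliances and GPUs, and the pairing of a three-appliance setup with three or nine DGX-1s, are our assumptions for illustration and not something IBM states.

```python
# Figures quoted in the article.
APPLIANCES = 3
AGGREGATE_GB_PER_SEC = 120
GPUS_PER_DGX1 = 8


def per_appliance_throughput(aggregate_gb_s, appliances):
    """Average throughput per appliance, assuming an even split (our assumption)."""
    return aggregate_gb_s / appliances


def per_gpu_throughput(aggregate_gb_s, dgx1_servers, gpus_per_server=GPUS_PER_DGX1):
    """Storage bandwidth available per GPU if it is shared evenly (our assumption)."""
    return aggregate_gb_s / (dgx1_servers * gpus_per_server)


if __name__ == "__main__":
    share = per_appliance_throughput(AGGREGATE_GB_PER_SEC, APPLIANCES)
    print(f"Per appliance: {share:.1f} GB/sec")
    # Smallest and largest DGX-1 counts in the reference architecture.
    for servers in (3, 9):
        per_gpu = per_gpu_throughput(AGGREGATE_GB_PER_SEC, servers)
        print(f"{servers} x DGX-1 ({servers * GPUS_PER_DGX1} GPUs): {per_gpu:.2f} GB/sec per GPU")
```

On these assumptions each appliance contributes about 40GB/sec, and the per-GPU share falls from 5GB/sec with three DGX-1s to roughly 1.67GB/sec with nine.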

IBM has published a SpectrumAI infrastructure brief and a more detailed description that includes training results.

These results cover Alexnet, Resnet-152, Resnet-50, Inception-3 and Lenet model training runs, which IBM presents as charts.

Training run

The Resnet-50 and Resnet-152 image recognition training results enable us to compare SpectrumAI with other AI reference architectures. Examples include Pure Storage’s AIRI, DDN’s A3I, NetApp’s A700 and A800-based architectures, Dell EMC’s AI-Ready Solutions, Cisco’s C480 M5 and IBM’s AC922 Power server.

A word of caution, though. IBM supplies charts showing relative performance at the 1, 4 and 8 GPU levels for the two Resnet tests, not actual numbers.

We have therefore estimated the numbers from the charts, and acknowledge that this introduces a degree of error. But it does give us the opportunity to make the cross-supplier comparison.

IBM has published training results for SpectrumAI systems with multiple DGX-1s, and these show a broadly linear increase in performance as DGX-1s are added. We are unable to compare these with other vendors’ systems as we don’t have their test run results for multiple DGX-1 servers.
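For readers who want to check what "broadly linear" means against their own measurements, a minimal Python sketch follows. The function name and the sample figures are hypothetical and made up purely for illustration. They are not IBM's results, which were published only as charts.

```python
from typing import Dict


def scaling_efficiency(measured: Dict[int, float]) -> Dict[int, float]:
    """Return measured speed-up divided by ideal linear speed-up.

    `measured` maps a server count to training throughput (for example
    images/sec). The smallest server count is taken as the baseline.
    A value of 1.0 means perfectly linear scaling.
    """
    base_servers = min(measured)
    base_rate = measured[base_servers]
    return {
        servers: (rate / base_rate) / (servers / base_servers)
        for servers, rate in sorted(measured.items())
    }


if __name__ == "__main__":
    # Hypothetical placeholder figures, NOT taken from IBM's charts.
    sample = {1: 100.0, 2: 190.0, 4: 360.0}
    for servers, eff in scaling_efficiency(sample).items():
        print(f"{servers} server(s): {eff:.0%} of linear")
```

Efficiency values that stay close to 100 per cent as servers are added would correspond to the broadly linear behaviour described above.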

But on these estimated results, IBM’s SpectrumAI reference architecture comes out tops in terms of Resnet-152 and Resnet-50 model training performance.