IBM has introduced SpectrumAI, a souped-up integrated set of server, GPU and storage components. The reference architecture includes:
- IBM Elastic Storage Server or Spectrum Scale NVMe all-flash appliance, which runs Linux and is due in 2019
- Three to nine Nvidia DGX-1 GPU servers (8 x Tesla V100 Tensor Core GPUs) functioning as Spectrum Scale client nodes
- Spectrum Scale RAID v5
- Spectrum Discover metadata management software
- Mellanox 100Gbit/s EDR with SB7700 or SB7800 series fabric switch
- 10GbitE management network
- InfiniBand networking
- Optional IBM Cloud Object Storage
The Spectrum Scale NVMe appliance is a 2U box storing up to 300TB. Three appliances working together can deliver up to 120GB/sec of throughput.
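To put those figures in context, here is a minimal back-of-the-envelope sketch in Python. It assumes the 120GB/sec aggregate divides evenly across the three appliances and, hypothetically, across however many DGX-1 clients are attached; IBM does not state either of these, so treat the outputs as rough upper bounds rather than measured values. The function and variable names are ours, for illustration only.

```python
# Figures quoted in the article.
APPLIANCES = 3
AGGREGATE_GB_PER_SEC = 120.0      # up to 120GB/sec from three appliances
MAX_CAPACITY_TB = 300.0           # up to 300TB per 2U appliance
MIN_DGX1, MAX_DGX1 = 3, 9         # three to nine DGX-1 servers


def even_share(total: float, units: int) -> float:
    """Split a total evenly across units (an assumption, not an IBM figure)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


# Assumed even split of throughput per appliance.
per_appliance = even_share(AGGREGATE_GB_PER_SEC, APPLIANCES)

# Hypothetical per-client share at the smallest and largest configurations.
per_dgx1_min_config = even_share(AGGREGATE_GB_PER_SEC, MIN_DGX1)
per_dgx1_max_config = even_share(AGGREGATE_GB_PER_SEC, MAX_DGX1)

# Maximum raw capacity across three appliances.
total_capacity_tb = APPLIANCES * MAX_CAPACITY_TB

if __name__ == "__main__":
    print(f"Per appliance: {per_appliance:.1f} GB/sec")
    print(f"Per DGX-1 (3 servers): {per_dgx1_min_config:.1f} GB/sec")
    print(f"Per DGX-1 (9 servers): {per_dgx1_max_config:.1f} GB/sec")
    print(f"Capacity across 3 appliances: {total_capacity_tb:.0f} TB")
```

On these assumptions each appliance contributes about 40GB/sec, and a nine-server configuration would leave roughly 13GB/sec of storage bandwidth per DGX-1.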
IBM has published a SpectrumAI infrastructure brief and a more detailed description with training results.
These results cover Alexnet, Resnet-152, Resnet-50, Inception-3 and Lenet model training runs, which IBM presents as charts.
Training run
The Resnet-50 and Resnet-152 image recognition training results enable us to compare SpectrumAI with other AI reference architectures. Examples include AIRI from Pure Storage, DDN’s A3I, NetApp’s A700 and A800-based systems, Dell EMC AI-Ready Solutions, Cisco C480 M5, and IBM’s AC922 Power server.
A word of caution, though. IBM supplies charts showing the relative performance at 1, 4 and 8 GPU levels for the two Resnet tests, and not actual numbers.
We have therefore estimated the numbers from the charts and acknowledge that this introduces a degree of error. But it gives us the opportunity to make the cross-supplier comparison.
IBM has published training results for SpectrumAI systems with multiple DGX-1s, and these show a broadly linear increase in performance as DGX-1s are added. We are unable to compare these with other vendors as we don’t have their test run results for multiple DGX-1 servers.
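One way to put a number on "broadly linear" is scaling efficiency: measured throughput divided by what perfect linear scaling from the smallest configuration would give. The Python sketch below is illustrative only. The function name is ours, and the sample throughput values are made-up placeholders, not IBM's results or our chart estimates.

```python
def scaling_efficiency(throughput_by_units: dict[int, float]) -> dict[int, float]:
    """Return measured throughput as a fraction of ideal linear scaling.

    Keys are unit counts (GPUs or DGX-1 servers); values are throughput,
    for example images per second. The smallest unit count is the baseline.
    """
    if not throughput_by_units:
        raise ValueError("need at least one measurement")
    base_units = min(throughput_by_units)
    per_unit_baseline = throughput_by_units[base_units] / base_units
    return {
        units: throughput / (units * per_unit_baseline)
        for units, throughput in sorted(throughput_by_units.items())
    }


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT taken from IBM's charts.
    sample = {1: 100.0, 4: 380.0, 8: 720.0}
    for units, eff in scaling_efficiency(sample).items():
        print(f"{units} units: {eff:.0%} of linear")
```

An efficiency close to 1.0 at every step is what a broadly linear increase would look like; a steady decline as units are added would suggest a bottleneck in storage, network or the training framework.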
On these results, IBM’s SpectrumAI reference architecture comes out top for Resnet-152 and Resnet-50 training performance.