CXL a no-go for AI training

Analysis. Compute Express Link (CXL) technology has been pushed into the backseat by the Nvidia GTC AI circus, yet Nvidia’s GPUs are costly and limited in supply. Increasing their memory capacity to enable them to do more work would seem a good idea, so why isn’t CXL – and its memory pooling – front and center in the Nvidia GPU scramble?

CXL connects pools of DRAM across the PCIe bus. There are three main variants:

  • CXL 1 provides memory expansion, letting x86 servers access memory on PCIe-linked accelerator devices such as smartNICs and DPUs;
  • CXL 2 provides memory pooling between several server hosts and a CXL-attached device with memory;
  • CXL 3 provides memory sharing between servers and CXL devices using CXL switches.

All three provide cache coherency, meaning that the local CPU caches, which hold a subset of what is in memory, stay consistent with the CXL-attached memory. CXL 1 and 2 are based on the PCIe 5 bus, with CXL 3 using the PCIe 6 bus. Access to external memory via CXL adds latency.
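On a Linux host, a CXL memory expander generally appears as a CPU-less NUMA node, so both the extra capacity and the extra latency are visible through ordinary NUMA tooling. The sketch below is a minimal illustration, assuming libnuma and a system where the expander has already been onlined as a node; the node numbering and distances are illustrative, not fixed by the CXL spec.

```c
/* Minimal sketch: enumerate NUMA nodes on a host where a CXL memory
 * expander has been onlined as a (typically CPU-less) NUMA node.
 * Assumes Linux with libnuma; build with: gcc cxl_nodes.c -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int max_node = numa_max_node();
    for (int node = 0; node <= max_node; node++) {
        long long free_bytes = 0;
        long long size = numa_node_size64(node, &free_bytes);
        if (size < 0)
            continue;  /* node not present */

        /* numa_distance() reports the ACPI SLIT value; a CXL-attached
         * node will normally show a larger distance than local DRAM. */
        printf("node %d: %lld MiB total, %lld MiB free, distance from node 0 = %d\n",
               node, size >> 20, free_bytes >> 20, numa_distance(0, node));
    }
    return 0;
}
```

The larger distance reported for the CXL node is the operating system’s way of expressing that added latency.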

All the memory that is accessed, shared or pooled in a CXL system needs a CXL access method, meaning PCIe 5 or 6 bus access and CXL protocol support. The DRAM in x86 servers and the GDDR memory in GPUs are suitable. However, the high-bandwidth memory (HBM) integrated with GPUs via an interposer in Nvidia’s universe is not, as it has no PCIe interface.

AMD’s Instinct MI300A accelerated processing unit (APU), with combined CPU and GPU cores and a shared memory space, has a CXL 2 interface. Nvidia’s Grace Hopper superchip, with an Armv9 Grace CPU and a Hopper GPU, has a split memory space.

SemiAnalysis analyst Dylan Patel writes about CXL and GPUs in his subscription newsletter. He observes that Nvidia’s H100 GPU chip supports NVLink, C2C (to link to the Grace CPU) and PCIe interconnect formats. But the PCIe interconnect scope is limited: there are just 16 PCIe 5 lanes, which run at 64GB/sec overall, whereas NVLink and C2C both run at 450GB/sec – seven times faster. Patel notes that the I/O part of Nvidia’s GPUs is space-limited, and Nvidia prefers bandwidth over standard interconnects such as PCIe.

Therefore the PCIe area on the chip will not grow in future, and may shrink.
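The gap Patel describes is easy to sanity-check. The sketch below uses the standard PCIe Gen 5 per-lane figures (32 GT/s with 128b/130b line encoding) and the 450GB/sec figure quoted above; protocol overheads are ignored, so the result is approximate.

```c
/* Back-of-the-envelope check of the bandwidth figures quoted above.
 * Per-lane PCIe Gen 5 throughput is approximate (32 GT/s, 128b/130b
 * encoding, before protocol overhead). */
#include <stdio.h>

int main(void)
{
    const double gts_per_lane = 32.0;          /* PCIe Gen 5 transfer rate, GT/s */
    const double encoding     = 128.0 / 130.0; /* 128b/130b line encoding        */
    const double lanes        = 16.0;          /* H100's PCIe link width         */
    const double nvlink_gbps  = 450.0;         /* NVLink / C2C figure from Patel */

    /* One transfer carries one bit, so GT/s * encoding / 8 = GB/s per lane per direction. */
    double pcie_gbps = gts_per_lane * encoding / 8.0 * lanes;

    printf("PCIe Gen 5 x16 : ~%.0f GB/s per direction\n", pcie_gbps);
    printf("NVLink / C2C   : %.0f GB/s\n", nvlink_gbps);
    printf("Ratio          : ~%.1fx\n", nvlink_gbps / pcie_gbps);
    return 0;
}
```

That lands at roughly 63GB/sec per direction for an x16 link and a ratio of about 7x – consistent with the figures above.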

There’s much more detail in Patel’s newsletter but it’s behind a subscription paywall.

The takeaway is that there will be no CXL access to an Nvidia GPU’s high-bandwidth memory. However, x86 CPUs don’t use NVLink, and extra memory in x86 servers means memory-bound jobs can run faster – even with the added latency of external memory access.
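At the application level, exploiting that extra capacity can be as simple as keeping the hot working set in local DRAM and pushing large, colder buffers out to the expander. A minimal sketch follows, again assuming libnuma and assuming the expander is NUMA node 1 (the node id, buffer names and sizes are purely illustrative).

```c
/* Sketch of hot/cold data placement on a server with a CXL memory
 * expander. Assumes Linux with libnuma and that the expander is NUMA
 * node 1 (illustrative only). Build with: gcc tiering.c -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>

#define CXL_NODE       1                /* assumed node id of the expander */
#define HOT_BUF_SIZE   (64UL << 20)     /* 64 MiB hot working set          */
#define COLD_BUF_SIZE  (1UL << 30)      /* 1 GiB colder bulk data          */

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < CXL_NODE) {
        fprintf(stderr, "expected a second NUMA node (CXL expander) - not found\n");
        return 1;
    }

    /* Hot data: plain malloc; the default first-touch policy places the
     * pages in local DRAM when this thread touches them. */
    char *hot = malloc(HOT_BUF_SIZE);

    /* Cold bulk data: explicitly bound to the CXL-attached node. */
    char *cold = numa_alloc_onnode(COLD_BUF_SIZE, CXL_NODE);

    if (hot == NULL || cold == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    memset(hot, 0, HOT_BUF_SIZE);        /* touch pages -> local DRAM */
    memset(cold, 0, COLD_BUF_SIZE);      /* touch pages -> CXL node   */

    printf("hot set (%lu MiB) in local DRAM, cold buffer (%lu MiB) on node %d\n",
           HOT_BUF_SIZE >> 20, COLD_BUF_SIZE >> 20, CXL_NODE);

    numa_free(cold, COLD_BUF_SIZE);
    free(hot);
    return 0;
}
```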

It follows that CXL will not feature in AI training workloads running on GPU systems fitted with HBM – but it may have a role in datacenter x86/GDDR-GPU servers running AI tuning and inference workloads. CXL may also not have a role in edge systems, as these will be simpler in design than datacenter systems and need less memory overall.