UnifabriX on how tech can tear down AI memory wall

UnifabriX claims its CXL-connected external MAX memory device can deliver substantial AI processing performance improvements.

The company’s MAX memory technology was described in an earlier article. UnifabriX CEO Ronen Hyatt cites the “AI and Memory Wall” research paper by Amir Gholami et al. to frame the problem. The researchers say: “The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x/two years, outpacing the growth of DRAM and interconnect bandwidth, which have only scaled at 1.6 and 1.4 times every two years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving.”

A chart in the paper shows the effects of this:

The scaling of bandwidth across generations of interconnect and memory technology, compared with peak hardware FLOPS. Bandwidth is increasing far more slowly. Peak FLOPS is normalized to the R10000 system, which was used to measure the cost of training LeNet-5

The memory wall is the gap between memory bandwidth and peak hardware FLOPS.

The paper’s authors conclude: “To put these numbers into perspective, peak hardware FLOPS has increased by 60,000x over the past 20 years, while DRAM/interconnect bandwidth has only scaled by a factor of 100x/30x over the same time period, respectively. With these trends, memory – in particular, intra/inter-chip memory transfer – will soon become the main limiting factor in serving large AI models. As such, we need to rethink the training, deployment, and design of AI models as well as how we design AI hardware to deal with this increasingly challenging memory wall.”
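The headline figures in the quote follow directly from compounding the per-two-year growth rates over 20 years. A quick sketch to verify the arithmetic:

```python
# Back-of-the-envelope check of the paper's scaling figures:
# each rate is growth per two-year period, compounded over 20 years.
periods = 20 / 2  # ten two-year periods

flops_growth = 3.0 ** periods         # peak hardware FLOPS
dram_growth = 1.6 ** periods          # DRAM bandwidth
interconnect_growth = 1.4 ** periods  # interconnect bandwidth

print(f"FLOPS:        ~{flops_growth:,.0f}x")        # ~59,049x, i.e. the "60,000x"
print(f"DRAM BW:      ~{dram_growth:,.0f}x")         # ~110x, i.e. the "100x"
print(f"Interconnect: ~{interconnect_growth:,.0f}x") # ~29x, i.e. the "30x"
```

The compounded numbers (roughly 59,000x, 110x, and 29x) match the rounded 60,000x/100x/30x figures the authors cite.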

Hyatt modifies the chart to add a scaling line for the PCIe bus generations plus CXL and NVLink, showing that IO fabric speeds have not increased in line with peak hardware FLOPS either:
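Hyatt's point about IO fabrics can be illustrated with publicly quoted PCIe figures (approximate per-lane, per-direction bandwidth and release years; these numbers are our illustration, not from the article):

```python
# Rough sketch: PCIe per-lane bandwidth growth versus the paper's 3.0x/two-year
# FLOPS scaling rate. Figures are approximate public spec values, not exact.
pcie = {  # generation: (release year, ~GB/s per lane per direction)
    3: (2010, 1.0),
    4: (2017, 2.0),
    5: (2019, 4.0),
    6: (2022, 8.0),
}

year0, bw0 = pcie[3]
year1, bw1 = pcie[6]
total_growth = bw1 / bw0                       # 8x over 12 years
per_two_years = total_growth ** (2 / (year1 - year0))  # ~1.41x per two years

print(f"PCIe: ~{per_two_years:.2f}x per two years vs 3.0x for peak FLOPS")
```

At roughly 1.4x per two years, PCIe tracks the interconnect-bandwidth curve in the paper's chart, far below the FLOPS curve, which is the gap Hyatt's modified chart highlights.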

There is a performance gap in AI infrastructure between a GPU server’s memory and flash storage, even if InfiniBand is used to connect the NAND drives. By hooking up external memory via CXL (and UALink in the future), the performance gap can be mitigated.

Hyatt says memory fabrics are better than InfiniBand networks, enabling higher performance, and that CXL and UALink are open memory fabric standards comparable to Nvidia’s proprietary NVLink.

In addition to delivering performance improvements, UnifabriX’s MAX memory can save money.

UnifabriX’s example scenario has 16 servers, four of them GPU servers, each configured with 6 TB of DRAM for a total capacity of 96 TB. The total memory cost is $1.6 million, and UnifabriX says memory utilization is below 30 percent.

By adding a 30 TB MAX memory unit to the configuration, each of the 16 servers can instead be configured with 2.25 TB of memory, for a total of 66 TB at a cost of $670,000 and a much higher utilization rate. The servers draw memory capacity and bandwidth on demand and run their applications faster.
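The capacity and savings figures can be recomputed from the article's numbers (the per-TB price is implied, not quoted):

```python
# Recomputing UnifabriX's example scenario from the article's figures.
baseline_tb = 16 * 6.0        # 16 servers x 6 TB = 96 TB
baseline_cost = 1_600_000     # $1.6M total memory cost

server_tb = 16 * 2.25         # 36 TB left in the servers
max_unit_tb = 30.0            # MAX memory unit capacity
new_total_tb = server_tb + max_unit_tb  # 66 TB
new_cost = 670_000            # $670K

capex_saving = baseline_cost - new_cost  # $930K, which the article rounds to ~$1M
print(f"New capacity: {new_total_tb:.0f} TB, capex saving: ${capex_saving:,}")
```

The exact difference is $930,000, consistent with the roughly $1 million capex saving UnifabriX claims.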

There is a $1 million capex saving as well as a $1.5 million TCO gain in UnifabriX’s example.