SanDisk, HBF, and AI inferencing at the smartphone edge

SanDisk reckons we can have a smartphone running a mixture-of-experts (MoE) AI model, with each expert operating on data stored in its own dedicated HBF mini-array.

SVP of Technology and Strategy Alper Ilkbahar presented SanDisk’s NAND technology strategy at the recent investor day, where he introduced the high-bandwidth flash (HBF) concept as a NAND equivalent of high-bandwidth memory. He started by describing a current NAND array, optimized for capacity and served by a single I/O channel. In a bandwidth-optimized design, he said, this could in theory be subdivided into mini-arrays, each with its own I/O channel. “When you do that, you get massive amounts of bandwidth and bandwidth optimization.”

Having set the scene, he introduced the HBF concept: a 16-layer NAND stack connected via a logic die to an interposer that provides a multi-lane, high-bandwidth channel to a GPU or other processor, echoing the high-bandwidth memory (HBM) arrangement.

SanDisk’s CBA (CMOS directly Bonded to Array) technology has a logic die bonded to a 3D NAND stack. That logic die becomes the NAND-to-interposer logic die, and the 3D NAND stack is replaced by a 16-layer stack of core dies, interconnected by HBM-like through-silicon via (TSV) connections. This will have much more capacity than an equivalent HBM package: “We are going to match the bandwidth of HBM memory while delivering eight to 16x capacity at a similar cost point.”
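As a back-of-the-envelope check on that capacity claim (our arithmetic, not SanDisk’s): a current 12-high HBM3E stack tops out at around 36 GB, while the figures shown later in the presentation imply roughly 512 GB per HBF stack.

```python
# Rough capacity comparison. The 512 GB figure is implied by the 8 x 512 GB = 4,096 GB
# HBF slide shown later; the ~36 GB HBM3E stack is a current shipping part.
# The framing is ours, not numbers Ilkbahar quoted directly.
hbm3e_stack_gb = 36
hbf_stack_gb = 512

ratio = hbf_stack_gb / hbm3e_stack_gb
print(f"HBF vs HBM3E capacity per stack: ~{ratio:.1f}x")  # ~14x, within the claimed 8-16x range
```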

In effect, although Ilkbahar didn’t explicitly say it, each stack layer is a miniature NAND array with its own I/O channel. The logic die, aka NAND controller, now has to do much more work handling these multiple, parallel I/O channels.

Initially, SanDisk engineers thought the HBF package would need a mixture of NAND and DRAM, with the DRAM used for caching critical data: “The reason we thought we still need the HBM is because, although inference workloads are heavily read-intensive, you have data structures such as KV caches that need frequent updates, and we thought, for those, you might require HBM.”

This turned out not to be the case, thanks to the multi-head latent attention (MLA) technique developed by DeepSeek AI researchers. MLA stores compressed latent KV vectors in place of full KV caches, reducing the memory needed. Ilkbahar said: “As a matter of fact, they are reporting that they’re able to reduce KV caches by 93 percent. So any innovation like that actually now would enable us to pack four terabytes of memory on a single GPU.”
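To put the 93 percent figure in context, here is a hedged, illustrative sketch of why KV caches dominate inference memory and what a reduction of that size buys. The model dimensions and serving load below are hypothetical, not DeepSeek’s or SanDisk’s numbers.

```python
# Hypothetical long-context serving scenario, for illustration only.
layers, kv_heads, head_dim = 80, 8, 128    # made-up model dimensions
bytes_per_value = 2                        # 16-bit cache entries
context_tokens = 128_000                   # long-context request
concurrent_requests = 32

# Standard attention caches keys and values for every layer, KV head, and token.
per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
baseline_gb = per_token_bytes * context_tokens * concurrent_requests / 1e9

# Multi-head latent attention stores a compressed latent instead; the reported
# ~93 percent reduction is applied here at face value.
mla_gb = baseline_gb * (1 - 0.93)

print(f"baseline KV cache: ~{baseline_gb:,.0f} GB, with MLA: ~{mla_gb:,.0f} GB")
# With these made-up numbers: ~1,342 GB without MLA, ~94 GB with it.
```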

The 4,096 GB of HBF memory is formed from 8 x 512 GB HBF dies

That means, he said: “GPT-4 has 1.8 trillion parameters with 16-bit weights, and that model alone would take 3.6 terabytes or 3,600 gigabytes of memory space. I can put that entire model on a single GPU now; I don’t need to shuffle around data any longer.”
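The arithmetic behind the 3.6 TB figure is simple enough to check, assuming two bytes per 16-bit weight.

```python
# 1.8 trillion parameters at 16-bit (2-byte) weights, as cited in the talk.
params = 1.8e12
bytes_per_weight = 2

model_tb = params * bytes_per_weight / 1e12
print(f"weights alone: {model_tb:.1f} TB")  # 3.6 TB, within the 4 TB (8 x 512 GB) HBF budget
```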

The next generation of AI models is going to be larger still. Ilkbahar said mixture-of-experts models “require a significant number of experts for different skills. Or think about the next generation of multimodal models that are bringing together the world of text, audio, and video. All of them are going to be memory-bound applications and we have a solution for it.” That solution, he said, is HBF.

Alper Ilkbahar

He then switched away from GPU servers (or AI PCs with GPUs inside) to smartphones: “How is your experience with your AI, with your cell phones or mobile phones? Mine is pretty non-existent because, actually, it takes incredible resources to run AI or reasonable AI in a mobile phone. Now, just throwing money [at it] doesn’t quite solve the problem. You just don’t have the real estate, you don’t have the power envelope, you don’t have the compute power, you don’t have the memory.”

HBF changes the equation, he said. “People have been trying to solve this problem by reducing the model significantly, to models of a few billion parameters, but the performance of those models has been really frustrating for the users. They just don’t deliver the experience. You really want to use a significantly larger model size, and I’m going to propose here a hypothetical model of 64 billion parameters. Although this is hypothetical, Llama 3.1 or the DeepSeek 70B are very similar models to this in size and capability and they have been very satisfactory for people who are using them. But still, you would need 64 gigabytes of memory just to put this in.”
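The quoted 64 GB for a 64-billion-parameter model implies one byte per parameter, i.e. 8-bit weights; that is our reading of the slide rather than something Ilkbahar spelled out. A quick check:

```python
# 64 billion parameters at different weight precisions. The 8-bit case matches
# the 64 GB figure quoted in the talk; the quantization assumption is ours.
params = 64e9

for bits in (8, 16):
    gb = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: {gb:.0f} GB")  # 64 GB at 8-bit, 128 GB at 16-bit
```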

He reckons “I could fit an entire model” on a single HBF die, not a stack. At which point he returned to the mixture-of-experts topic: “Mixture-of-experts models divide up a very large model into multiple smaller models, and for each token generation, you actually activate only a few of those, by which you are actually reducing the need for large compute. The problem becomes more memory capacity-bound, but no longer compute-bound.”

An HBF 512 GB die split into 8 mini-arrays of 64 GB, each providing data for one of the experts in an MoE model

The next step in his argument is this: “If you were to put this mixture-of-experts model into an HBF and ask a question … it’ll very smartly give the right answer, and that we are truly excited about.”

An animated slide showed one HBF die split into eight units (aka mini-arrays) running a 64-billion-parameter MoE model. Each expert model has its own mini-array and executes using data from it. Ilkbahar said: “We are truly excited about the potential of making the impossible possible and truly enable amazing AI experiences at the edge.”
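For readers who want the mechanics, here is a minimal sketch of the routing idea, assuming a hypothetical runtime in which each expert’s weights live in their own 64 GB mini-array and only the experts picked by a router are read for a given token. None of the class or function names below come from SanDisk; they are illustrative only.

```python
import numpy as np

# Hypothetical MoE-on-HBF sketch: one 512 GB die exposes eight 64 GB mini-arrays,
# each holding one expert's weights. Names and numbers are illustrative, not SanDisk's.
NUM_EXPERTS = 8   # one expert per mini-array
TOP_K = 2         # experts activated per token

class MiniArray:
    """Stands in for one HBF mini-array with its own dedicated I/O channel."""
    def __init__(self, expert_id: int):
        self.expert_id = expert_id

    def load_expert_weights(self) -> np.ndarray:
        # A real system would stream the expert's weights over this mini-array's
        # channel; here we just fabricate a tiny weight matrix.
        rng = np.random.default_rng(self.expert_id)
        return rng.standard_normal((16, 16)).astype(np.float32)

def route(gate_scores: np.ndarray, top_k: int = TOP_K) -> np.ndarray:
    """Pick the indices of the top-k experts for this token (toy gating)."""
    return np.argsort(gate_scores)[-top_k:]

mini_arrays = [MiniArray(i) for i in range(NUM_EXPERTS)]
token = np.ones(16, dtype=np.float32)
gate_scores = np.random.default_rng(0).standard_normal(NUM_EXPERTS)  # toy router output

# Only the selected experts' mini-arrays are read; the other six stay idle,
# which is why the workload becomes capacity-bound rather than compute-bound.
selected = route(gate_scores)
outputs = [mini_arrays[i].load_expert_weights() @ token for i in selected]
combined = np.mean(outputs, axis=0)
print("activated experts:", sorted(int(i) for i in selected))
```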

He concluded: “We are bringing back SanDisk. We are incredibly excited about that and we can’t wait to disrupt the memory industry.” 

If this HBF concept flies, it could do wonderful things. We think it will have a greater chance of success if there is general NAND industry buy-in. We note that Ilkbahar was once the VP and GM of Intel’s Optane group. Optane technology did not succeed, and one of its characteristics was that it was proprietary to Intel and its fabrication partner, Micron.