Microsoft proposes Managed Retention Memory to tackle AI workloads

Microsoft researchers have proposed Managed Retention Memory (MRM) – storage-class memory (SCM) with short-term persistence and IO optimized for AI foundation model workloads.

Sergey Legtchenko, Microsoft

MRM is described in an arXiv paper written by Microsoft Principal Research Software Engineer Sergey Legtchenko and other researchers looking to sidestep high-bandwidth memory (HBM) limitations in AI clusters. They say HBM is “suboptimal for AI workloads for several reasons,” being “over-provisioned on write performance, but under-provisioned on density and read bandwidth, and also has significant energy per bit overheads. It is also expensive, with lower yield than DRAM due to manufacturing complexity.”

The researchers say SCM approaches – such as Intel’s discontinued Optane and potential alternatives using MRAM, ReRAM, or PCM (phase-change memory) – all assume a sharp divide between memory (volatile DRAM, which needs constant refreshing to retain data) and storage, which persists data for the long term, meaning years.

They say: “These technologies traditionally offered long-term persistence (10+ years) but provided poor IO performance and/or endurance.” For example: “Flash cells have a retention time of 10+ years, but this comes at the cost of lower read and write throughput per memory cell than DRAM. These properties mean that DRAM is used as memory for processors, and Flash is used for secondary storage.”

But the divide need not actually be sharp in retention terms. There is a retention spectrum, from zero to decades and beyond. DRAM does persist data for a brief period before it has to be refreshed. The researchers write: “Non-volatility is a key storage device property, but at a memory cell level it is quite misleading. For all technologies, memory cells offer simply a retention time, which is a continuum from microseconds for DRAM to many years.”

By tacitly supporting the sharp memory-storage divide concept, “the technologies that underpin SCM have been forced to be non-volatile, requiring their retention time to be a decade or more. Unfortunately, achieving these high retention times requires trading off other metrics such as write and read latency, energy efficiency, and endurance.”

General-purpose SCM, with its non-volatility, is unnecessary for AI workloads such as inference, which demand high-performance sequential reads of model weights and KV cache data but need far less write performance. At the same time, the tremendous scale of these workloads calls for a new memory class: HBM’s energy per bit read is too high, and HBM is “expensive and has significant yield challenges” anyway.
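To get a feel for why energy per bit read matters at inference scale, here is a rough, illustrative back-of-envelope calculation. The model size, weight precision, and HBM energy figure below are assumptions chosen for the example, not numbers taken from the paper:

```python
# Illustrative estimate of memory-read energy per generated token.
# All figures below are assumptions for illustration, not values from the paper.

params = 70e9          # assumed model size: 70 billion parameters
bytes_per_param = 2    # assumed FP16/BF16 weights
hbm_pj_per_bit = 5.0   # assumed HBM read energy, picojoules per bit

bits_read_per_token = params * bytes_per_param * 8   # every weight read once per token
energy_joules = bits_read_per_token * hbm_pj_per_bit * 1e-12

print(f"Bits read per token: {bits_read_per_token:.2e}")
print(f"Memory-read energy per token: {energy_joules:.2f} J")
# Roughly 5-6 J per token for weight reads alone under these assumptions;
# multiplied across a large inference cluster, read energy per bit dominates
# the memory power budget, which is the overhead the researchers highlight.
```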

The Microsoft researchers say their theorized MRM “is different from volatile DRAM as it can retain data without power and does not waste energy in frequent cell refreshes, but unlike SCM, is not aimed at long term retention times. As most of the inference data does not need to be persisted, retention can be relaxed to days or hours. In return, MRM has better endurance and aims to outperform DRAM (and HBM) on the key metrics such as read throughput, energy efficiency, and capacity.”

They note: “Byte addressability is not required, because IO is large and sequential,” suggesting that a block-addressed structure would suffice.
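As a rough sketch of what large, sequential, block-granular IO looks like from software, the following Python fragment streams model weights from a hypothetical block device in fixed-size chunks. The device path and block size are illustrative assumptions; the paper does not define an MRM programming interface:

```python
# Minimal sketch of large, sequential, block-granular reads of model weights.
# The device path and block size are hypothetical; MRM hardware and its
# software interface are not specified in the paper.

BLOCK_SIZE = 4 * 1024 * 1024  # assumed 4 MiB block, matching large sequential IO

def stream_weights(device_path: str):
    """Yield consecutive blocks of weight data from a block-addressed device."""
    with open(device_path, "rb") as dev:
        while True:
            block = dev.read(BLOCK_SIZE)   # whole-block reads, no byte-level access
            if not block:
                break
            yield block

# Example usage with a hypothetical device node and copy step:
# for block in stream_weights("/dev/mrm0"):
#     copy_to_accelerator(block)
```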

The researchers are defining in theory a new class of memory, saying there is an AI foundation model-specific gap in the memory-storage hierarchy that could be filled with an appropriate semiconductor technology. This “opens a field of computer architecture research in better memory for this application.”

Endurance requirements for KV cache and model weights vs endurance of memory technologies

A chart (above) in the paper “shows a comparison between endurance of existing memory/storage technologies and the workload endurance requirements. When applicable, we differentiate endurance observed in existing devices from the potential demonstrated by the technology.” Endurance is the number of write cycles a memory cell can sustain before it wears out. “HBM is vastly over-provisioned on endurance, and existing SCM devices do not meet the endurance requirements but the underlying technologies have the potential to do so.”
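For a sense of how a workload endurance requirement of this kind can be estimated, here is an illustrative calculation. The write rate, device capacity, and service life below are assumed example values, not figures read off the chart:

```python
# Illustrative estimate of the write endurance a KV cache held in MRM would need.
# The write rate, capacity, and lifetime below are assumptions for the example.

capacity_bytes = 256e9        # assumed per-device capacity: 256 GB
kv_write_rate = 10e9          # assumed sustained KV-cache writes: 10 GB/s per device
lifetime_years = 5            # assumed device service life

seconds = lifetime_years * 365 * 24 * 3600
total_bytes_written = kv_write_rate * seconds
required_cycles = total_bytes_written / capacity_bytes   # full-device write cycles

print(f"Required endurance: ~{required_cycles:.1e} write cycles")
# About 6e6 cycles under these assumptions: far above typical flash endurance
# (roughly 1e3-1e5 program/erase cycles) but far below what DRAM/HBM can sustain,
# which illustrates the gap between workload needs and existing technologies.
```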

The Microsoft researchers say: “We are explicitly not settling on a specific technology, instead highlighting an opportunity space. This is a call for action for those working on low-level memory cell technologies, through those thinking of memory controllers, to those designing the software systems that access the memory. Hail to a cross-layer collaboration for better memory in the AI era.”

They conclude: “We propose a new class of memory that can co-exist with HBM, Managed-Retention Memory (MRM), which enables the use of memory technologies originally proposed for SCM, but trades retention and other metrics like write throughput for improved performance metrics crucial for these AI workloads. By relaxing retention time requirements, MRM can potentially enable existing proposed SCM technologies to offer better read throughput, energy efficiency, and density. We hope this paper really opens new thinking about innovation in memory cell technologies and memory chip design, tailored specifically to the needs of AI inference clusters.”