Date: 2025-07-07

Nvidia GPUs store vectors as key-value pairs in a large language model (LLM) memory cache – KV cache – which is tiered out in a multi-level structure ending with network-attached SSDs.

Vectors are encoded values of multi-dimensional aspects of an item – word, image, video frame, sound – that an LLM deals with in its semantic searching for responses to input requests. Such requests are themselves vectorized and the LLM processes them and looks for elements in a vector store to build its response. These elements are key-value pairs held in a GPU’s high-bandwidth memory (HBM) as a KV cache. Problems occur when the vectors needed in a particular response session need more space than the available GPU memory. Existing vectors are then evicted and, if needed again, recomputed, which takes time. It’s better to move them down the memory-storage hierarchy so that they can be read back into GPU memory when needed rather than being recomputed. That’s what tiered KV caching accomplishes, and Nvidia’s Dynamo software implements it.
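
To get a feel for why the cache can outgrow GPU memory, the sketch below estimates KV cache size using the usual transformer accounting: one key and one value vector per layer, per KV head, per token. The model shape used here (80 layers, 8 KV heads, a head dimension of 128, 16-bit values) is an illustrative assumption and not a figure from Nvidia.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold the keys and values for num_tokens tokens."""
    # Factor of 2: each layer stores one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
context = kv_cache_bytes(128_000, LAYERS, KV_HEADS, HEAD_DIM)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{context / 2**30:.1f} GiB for a 128,000-token context")
```

Under these assumptions a single long session needs about 39 GiB, a large share of one GPU's HBM before model weights and other concurrent sessions are counted.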

An LLM has two phases when it processes a response: prefill and decode. During the prefill phase, the input request is broken down into tokens – basic words or sections of words – and these are vectorized and represented in memory as KV pairs. This process is computationally intensive and can be parallelized. The decode phase is where the LLM builds its output, a token at a time, in a sequential operation. Each new token is predicted based on previously generated tokens and the result is stored in the KV cache. The first output token depends on all prompt tokens. The second output token depends on all prompt tokens plus the first output token. The third output token again depends on all the prompt tokens plus the first and second output tokens, and so on.
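
The dependency pattern can be shown with a toy loop. This is a minimal sketch, not a real model: `compute_kv` is a hypothetical stand-in for the expensive per-token key/value computation, and the "predicted" tokens are placeholders. It counts how many KV computations are needed with and without a cache.

```python
def compute_kv(token):
    """Hypothetical stand-in for the per-token key/value computation."""
    return (token, len(token))


def generate(prompt_tokens, num_new_tokens, use_cache=True):
    tokens = list(prompt_tokens)
    kv_cache = []
    kv_calls = 0
    context_sizes = []  # how many KV entries each output token depends on

    if use_cache:
        # Prefill: every prompt token is processed once, up front.
        kv_cache = [compute_kv(t) for t in tokens]
        kv_calls += len(tokens)

    for step in range(num_new_tokens):
        if not use_cache:
            # No cache: rebuild KV pairs for the whole sequence on every step.
            kv_cache = [compute_kv(t) for t in tokens]
            kv_calls += len(tokens)

        context_sizes.append(len(kv_cache))
        new_token = f"<out{step}>"  # placeholder for the model's prediction
        tokens.append(new_token)

        if use_cache:
            # Decode: only the new token's KV pair is computed and appended.
            kv_cache.append(compute_kv(new_token))
            kv_calls += 1

    return tokens, context_sizes, kv_calls


prompt = ["The", "cat", "sat", "down"]
_, sizes, cached_calls = generate(prompt, 3, use_cache=True)
_, _, uncached_calls = generate(prompt, 3, use_cache=False)
print("context size per output token:", sizes)
print("KV computations with cache:", cached_calls, "without:", uncached_calls)
```

With a four-token prompt and three output tokens, the cached run performs 7 KV computations and the uncached run 15, and the gap widens as the sequence grows.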

When the output is complete, the KV cache contents are still in GPU memory and may need to be retained for follow-up questions from the user or for use by an iterative reasoning LLM. But then a new request comes in and the KV cache contents are evicted. Unless they are held somewhere else, they have to be recomputed if needed again. Software such as vLLM and LMCache can offload the GPU’s KV cache to the GPU server’s CPU DRAM, a second memory tier that can be larger than the available GPU memory.

Nvidia Dynamo diagram

Dynamo is a low-latency KV cache offload engine that works in multi-node systems. It supports vLLM and other inference engines such as TRT-LLM and SGLang, as well as large-scale, distributed inferencing. Dynamo works across a memory and storage hierarchy, from HBM, through a CPU’s DRAM, to direct-attached SSDs and networked external storage.
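
The idea of demoting cache entries down such a hierarchy, rather than discarding them, can be sketched as a chain of least-recently-used (LRU) tiers. This is an illustration only: the class, its methods, and the tiny capacities are hypothetical and do not represent Dynamo's API or its actual eviction policy.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache: entries evicted from one tier drop to the next."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity) pairs, fastest tier first.
        self.names = [name for name, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        self.levels = [OrderedDict() for _ in tiers]

    def put(self, key, value, level=0):
        tier = self.levels[level]
        tier[key] = value
        tier.move_to_end(key)  # mark as most recently used
        while len(tier) > self.capacity[level]:
            old_key, old_value = tier.popitem(last=False)  # evict the LRU entry
            if level + 1 < len(self.levels):
                self.put(old_key, old_value, level + 1)  # demote, don't discard
            # Past the last tier the entry is lost and must be recomputed.

    def get(self, key):
        """Return (value, tier it was found in), promoting it to the top tier."""
        for level, tier in enumerate(self.levels):
            if key in tier:
                value = tier.pop(key)
                self.put(key, value, 0)
                return value, self.names[level]
        return None, None  # miss: the caller has to recompute

    def where(self, key):
        for name, tier in zip(self.names, self.levels):
            if key in tier:
                return name
        return None


cache = TieredKVCache([("HBM", 2), ("DRAM", 2), ("SSD", 2)])
for i in range(1, 6):
    cache.put(f"session{i}", f"kv{i}")

print(cache.where("session1"))  # oldest session has been pushed down the tiers
print(cache.get("session1"))    # read back and promoted to the top tier
print(cache.where("session1"))
```

A real system would also weigh transfer cost against recompute cost when deciding what to demote; the LRU policy here is simply the easiest one to show.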

It has four features: Disaggregated Serving, Smart Router, Distributed KV Cache Manager, and Nvidia Inference Transfer Library (NIXL). Nvidia says: “Disaggregating prefill and decode significantly boosts performance, gaining efficiency the more GPUs that are involved in inference.”
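
As a rough picture of disaggregated serving, the sketch below runs prefill on one pool of workers, hands the resulting KV cache to a separate decode pool, and generates output there. Every name in it is hypothetical: `transfer` merely stands in for the data movement that NIXL performs, and the round-robin placement is an assumption, not Dynamo's routing logic.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Worker:
    name: str
    kv_store: dict = field(default_factory=dict)


def prefill(worker, request_id, prompt_tokens):
    # Compute-heavy phase: build KV entries for the whole prompt in one pass.
    worker.kv_store[request_id] = [("kv", t) for t in prompt_tokens]


def transfer(src, dst, request_id):
    # Hypothetical stand-in for moving the KV cache between GPU pools.
    dst.kv_store[request_id] = src.kv_store.pop(request_id)


def decode(worker, request_id, num_new_tokens):
    # Sequential phase: extend the transferred cache one token at a time.
    kv = worker.kv_store[request_id]
    output = []
    for step in range(num_new_tokens):
        token = f"<out{step}>"  # placeholder for the model's prediction
        kv.append(("kv", token))
        output.append(token)
    return output


def serve(requests, prefill_pool, decode_pool, num_new_tokens=2):
    # Assumed round-robin placement across each pool.
    prefill_iter, decode_iter = cycle(prefill_pool), cycle(decode_pool)
    placements = {}
    for request_id, prompt in requests.items():
        p, d = next(prefill_iter), next(decode_iter)
        prefill(p, request_id, prompt)
        transfer(p, d, request_id)
        decode(d, request_id, num_new_tokens)
        placements[request_id] = (p.name, d.name)
    return placements


prefill_pool = [Worker("prefill-0")]
decode_pool = [Worker("decode-0"), Worker("decode-1")]
requests = {"r1": ["a", "b", "c"], "r2": ["d", "e"], "r3": ["f"]}
print(serve(requests, prefill_pool, decode_pool))
```

Separating the pools lets each be sized for its own bottleneck: prefill is compute-bound and parallel, while decode is sequential and leans on the cache.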

Version 1.0 of Dynamo enabled KV cache offloading to system CPU memory, and support for SSDs and networked object storage is being added in subsequent releases. It is open source software.

Many storage suppliers support Nvidia’s AI Data Platform and its included Nvidia AI Enterprise software with NIM microservices, of which Dynamo is part. We understand that Cloudian, DDN, Dell, Hitachi Vantara, HPE, IBM, NetApp, PEAK:AIO, Pure Storage, VAST Data, and WEKA will all be supporting Dynamo, as will Cohesity. Hammerspace and Pliops also support KV cache tiering.

As examples of this:

Cloudian will be supporting KV cache tiering.

DDN says its Infinia object storage system “is engineered to serve KV cache at sub-millisecond latency.”

VAST Data has a blog about its Dynamo support. This says: “The distributed architecture behind Dynamo naturally supports implementing disaggregated prefill and decode. This serves as another strategy for enhancing scheduling across accelerated computings to boost inference throughput and minimize latency. It works by assigning a set of GPUs to run prefill and having NIXL move the data using RDMA to a different set of GPUs that will perform the decode process,” as seen in the diagram below.

A WEKA blog discusses its approach to tiered KV caching with the Augmented Memory Grid concept, noting that “when storing the cache outside of HBM, WEKA Augmented Memory Grid stores KV cache rapidly and asynchronously to maximize efficiency.” As a performance example, it says: “Based on testing within our Lab with a eight-host WEKApod with 72 NVMe drives a single eight-way H100 (with tensor parallelism of eight) demonstrated a retrieval rate of 938,000 tokens per second.”
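
To put WEKA's quoted figure in context, the short calculation below converts 938,000 tokens per second into the time needed to read a cached context back. It assumes the rate holds steady across context sizes, which the quote does not state, and the context lengths are arbitrary examples.

```python
RETRIEVAL_TOKENS_PER_SEC = 938_000  # rate quoted by WEKA for its lab test


def reload_seconds(cached_tokens, tokens_per_sec=RETRIEVAL_TOKENS_PER_SEC):
    """Time to read a cached context back, assuming a constant retrieval rate."""
    return cached_tokens / tokens_per_sec


# Arbitrary example context lengths, not taken from the WEKA blog.
for n in (8_000, 32_000, 128_000):
    print(f"{n:>7,} tokens: {reload_seconds(n) * 1000:.1f} ms")
```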