VAST’s VUA flash caching virtually expands GPU server memory for AI token generation

VAST Data is open sourcing its VUA (VAST Undivided Attention) KVCache software technology, which stores tokens generated during AI model training and inferencing on NVMe-connected SSDs for fast transfer back to GPU memory, avoiding the need to recompute them.

A KVCache is an in-memory store of the key and value vectors an LLM generates for its tokens during the attention phase of inference processing. The tokens are generated sequentially and provide the context for the model. Because the model produces one token at a time, each new step would otherwise require recomputing the keys and values for every token in the sequence so far. Holding them in a server’s GPU memory, and then CPU memory, avoids this recomputation and speeds multi-step token generation. But as LLMs handle more parameters and longer contexts, the available GPU memory fills up and overflows, limiting the KVCache token count and slowing model processing. VUA stores cache entries evicted from the in-memory cache on NVMe-connected SSDs, a third caching tier, so they can be reused instead of recomputed. VUA is the software that provides this SSD tier for the KVCache.
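To make the recomputation-avoidance point concrete, here is a minimal single-head attention sketch in Python/NumPy (purely illustrative, not VAST or vLLM code): the two cache lists hold previously computed keys and values, so each decode step only computes key and value vectors for the newest token.

```python
# Minimal single-head attention decode loop (NumPy) showing why a KVCache helps:
# without k_cache/v_cache, every new token would force recomputation of the
# keys and values for every earlier token in the sequence.
import numpy as np

d = 64                                  # head dimension (illustrative)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []               # the "KVCache" for one sequence

def attend(x_new):
    """Attention output for the newest token only."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)          # compute this token's K/V once...
    v_cache.append(x_new @ Wv)          # ...and reuse them on every later step
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for step in range(5):                   # decode loop: one token at a time
    out = attend(np.random.randn(d))    # stand-in for the new token's embedding
```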

Jeff Denworth

Such evicted context could be stored back in the source data repository, a cloud object store for example. But, VAST co-founder Jeff Denworth blogs: “Yes, caches can be rehydrated from remote disk, but this today is a clumsy and disconnected operation that often depends upon (and suffers from) slow cloud object storage. The time to rehydrate the context and session is so long that several leading AI-as-a-service shops choose to simply recalculate an entire prompt history rather than grab all of the context and attention data from object storage.”

A second VAST blog notes “AI models are increasingly evolving to store larger contexts, or knowledge, into a model. To put this into perspective, LLaMA 1 was released in 2023 with support for a context window of 2,048 tokens. Fast forward to LLaMA 4, just announced last week by Meta, which can support new AI models with up to 10 million tokens. …Ten million tokens consume far more memory than can be accommodated in GPU memory, so larger storage and caching methods are required.”
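A rough back-of-envelope calculation shows why. Assuming a Llama-3-8B-style configuration with 32 layers, 8 KV heads, a head dimension of 128 and fp16 values (our illustrative assumptions, not Meta’s published figures for LLaMA 4), the KV cache alone needs roughly 128 KiB per token, and well over a terabyte for a 10-million-token context:

```python
# Back-of-envelope KV cache sizing for an assumed Llama-3-8B-style config
# (32 layers, 8 KV heads, head dim 128, fp16); numbers are illustrative only.
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # keys + values
print(per_token / 1024, "KiB per token")                       # ~128 KiB
print(per_token * 10_000_000 / 2**40, "TiB for 10M tokens")    # ~1.2 TiB
```

That comfortably exceeds the 80–192 GB of HBM on current GPUs, which is the gap an SSD caching tier is meant to fill.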

Denworth says the vLLM GPU and CPU memory paging scheme “does not integrate with distributed NVMe based systems to provide another tier in the memory hierarchy, nor is it global…so GPU environments are divided into small and divided caches.”

What VAST has “built is a Linux-based agent that runs in your GPU servers and provides a new data presentation layer to AI frameworks.” It is “a hierarchical system to manage data across GPU memory, CPU memory and shared, RDMA-attached NVMe storage subsystems,” such as VAST’s storage, which supports Nvidia’s GPUDirect protocol, using RDMA to bypass the storage controller CPU.
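Conceptually, the agent behaves like a three-tier lookaside cache. The sketch below is our own illustration of that hierarchy; the class and method names are invented for the example and are not VUA’s API:

```python
# Conceptual sketch (not VAST's agent) of a three-tier KV cache lookup:
# GPU memory first, then CPU memory, then an NVMe-backed store, with
# promotion of hits back toward the faster tiers.
class TieredKVCache:
    def __init__(self, gpu, cpu, nvme):
        self.tiers = [gpu, cpu, nvme]            # fastest to slowest

    def get(self, prefix_key):
        for i, tier in enumerate(self.tiers):
            kv = tier.get(prefix_key)
            if kv is not None:
                for faster in self.tiers[:i]:    # promote hit toward GPU memory
                    faster[prefix_key] = kv
                return kv
        return None                              # full miss: prefill/recompute

    def put(self, prefix_key, kv):
        self.tiers[0][prefix_key] = kv           # evictions would cascade
                                                 # downward in a real system

# Plain dicts stand in for GPU HBM, CPU DRAM and an RDMA-attached NVMe store.
cache = TieredKVCache(gpu={}, cpu={}, nvme={})
```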

Denworth says: “VUA layers in the ability to intelligently store and serve prefixes” so they “can be served according to priority and policy. For example, the longest prefixes associated with a sequence can be served first to a GPU machine so that the full self-attention of a session can be most quickly understood.” VUA can search through billions to trillions of prefixes in on-SSD Element Store data structures, using wide-fanout V-Trees that can be searched in millisecond time across massive metadata spaces.

Another way to describe this is to say it has intelligent prefix caching: “VUA surpasses basic caching by breaking down attention keys into chunks, which are stored in a nested structure. This enables sophisticated partial context matching using longest prefix identification, significantly improving cache hit rates in workloads such as Retrieval-Augmented Generation (RAG), where the same base documents appear across multiple distinct prompts.”
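A simple way to picture this is a trie keyed on fixed-size token chunks, where a lookup walks down the tree and returns the cached KV blocks for the longest stored prefix. The sketch below is illustrative only; the chunk size and data structure are our assumptions, not VUA internals:

```python
# Illustrative longest-prefix lookup over chunked token sequences, in the
# spirit of the nested structure described above (chunk size is assumed).
CHUNK = 16                                   # tokens per chunk (assumed)

def chunks(tokens):
    return [tuple(tokens[i:i + CHUNK]) for i in range(0, len(tokens), CHUNK)]

class PrefixStore:
    def __init__(self):
        self.root = {}                       # nested dict acts as a trie of chunks

    def insert(self, tokens, kv_blocks):
        """Store one KV block per chunk along the token sequence."""
        node = self.root
        for chunk, kv in zip(chunks(tokens), kv_blocks):
            node = node.setdefault(chunk, {"kv": kv, "children": {}})["children"]

    def longest_prefix(self, tokens):
        """Return cached KV blocks for the longest stored prefix of `tokens`."""
        node, hits = self.root, []
        for chunk in chunks(tokens):
            if chunk not in node:
                break
            hits.append(node[chunk]["kv"])
            node = node[chunk]["children"]
        return hits                          # reuse these; prefill only the rest
```

In a RAG workload, prompts built on the same base documents share long prefixes, so such a walk returns most of the needed KV blocks and only the tail of each prompt has to be prefilled.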

The VUA system “is global. Each GPU server now has shared access to the same extended context cache space, the same rapidly-searchable metadata space and the same global context and attention data and data index.”

Denworth says this VUA “accelerator, in terms of data sharing, only works north-south today (each machine sees a global hierarchical data space, but machines don’t see what’s in each other’s cache… so a CPU/GPU memory cache miss always goes to NVMe).” VAST is “considering a globally distributed cache where machines will also be able to see their peers within or across data centers and do low-latency retrieval of relevant keys and values based upon the above prefix filtering.”

VUA is now available as open source software, providing a prefix-search-based, global, exabyte-scale KV cache on NVMe SSDs, accessible throughout a GPU cluster. It integrates with popular AI inference workloads, “providing infinite context scalability” and reducing “time to first token (TTFT) while also saving significantly on GPU and CPU memory.”

VUA shortens TTFT and also the average time to generate each subsequent token, or TPOT (Time Per Output Token). It “enables persistent conversational state across turns or sessions. KV caches representing prior dialogue can be stored in off-GPU memory between queries, freeing GPU resources while retaining the ability to resume context quickly.”
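The session-persistence idea can be sketched with plain PyTorch: park a conversation’s per-layer key/value tensors off the GPU between turns and reload them when the user returns. The session store below is a local directory and the function names are ours; VUA would place this data on shared, RDMA-attached NVMe:

```python
# Sketch of "persistent conversational state": park a session's KV tensors
# off the GPU between turns and restore them when the session resumes.
import os
import torch

SESSION_DIR = "/tmp/kv_sessions"             # assumed path, not a VUA setting
os.makedirs(SESSION_DIR, exist_ok=True)

def park_session(session_id, past_key_values):
    """Move per-layer (key, value) tensors to CPU and persist them to disk."""
    cpu_kv = [(k.cpu(), v.cpu()) for k, v in past_key_values]
    torch.save(cpu_kv, f"{SESSION_DIR}/{session_id}.pt")   # frees GPU memory

def resume_session(session_id, device="cuda"):
    """Reload a parked session's KV tensors onto the GPU to continue decoding."""
    cpu_kv = torch.load(f"{SESSION_DIR}/{session_id}.pt")
    return [(k.to(device), v.to(device)) for k, v in cpu_kv]
```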

VAST tested TTFT on a vLLM system using the Qwen2.5-1.5B-Instruct model, with and without VUA, and found that adding VUA made the test system 292 percent faster at the 30,000-token level.
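For readers who want a feel for that kind of test, a rough TTFT harness for vLLM might look like the following. It approximates TTFT by timing the prefill of a long prompt while requesting a single output token, relies on vLLM’s own prefix caching for the warm run, and does not involve VUA itself, so the numbers will differ from VAST’s:

```python
# Rough TTFT harness: time a long-prompt prefill by asking for one token.
# This uses vLLM's built-in prefix caching, not VUA, and is illustrative only.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=1, temperature=0.0)

long_prompt = "the quick brown fox " * 5000   # stand-in for a tens-of-thousands-of-tokens context

start = time.perf_counter()
llm.generate([long_prompt], params)           # cold: full prefill
cold = time.perf_counter() - start

start = time.perf_counter()
llm.generate([long_prompt], params)           # warm: cached prefix can be reused
warm = time.perf_counter() - start

print(f"cold TTFT ~{cold:.2f}s, warm TTFT ~{warm:.2f}s")
```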

It says VUA is particularly valuable for applications involving common question prompts, multi-round dialogues (faster context switching), long-document Q&A (improved throughput), and high-concurrency scenarios (fewer preemptions).

WEKA and Hammerspace

B&F wrote in March that parallel-access filesystem supplier WEKA announced a new Augmented Memory Grid feature that “enables AI models to extend memory for large model inferencing to the WEKA Data Platform. It’s a software-defined extension, which provides exascale cache at microsecond latencies with multi-terabyte-per-second bandwidth, delivering near-memory speed performance.” This provides additional petabytes of capacity, said to be 1,000x more “than today’s fixed DRAM increments of single terabytes.”

This is similar to VAST’s VUA.

Data orchestrator Hammerspace’s Tier Zero feature adds “a GPU server’s local NVMe flash drives as a front end to external GPUDirect-accessed datasets, providing microsecond-level storage read and checkpoint write access to accelerate AI training workloads.” 

And: “By incorporating these drives into its Global Data Environment as a Tier 0 in front of Tier 1 external storage, they can be used to send data to GPUs faster than from the external storage and also to write checkpoint data in less time than it takes to send that data to external storage.”

Hammerspace does not provide a KVCache facility on such Tier 0 SSDs, but it could, and doing so would further accelerate AI inferencing workloads.

VAST said it invites “the AI community to explore, use, and contribute to the VAST Undivided Attention project. Source code, documentation, and initial usage examples are available at https://github.com/vast-data/vua.” We understand that using VUA with non-VAST storage would likely introduce latency or compatibility issues, as VUA’s performance depends on VAST’s ability to search and serve data in constant time via its V-Tree technique.