DDN has released performance benchmarks showing it can speed up AI processing time by 27x through the way it handles intermediate KV caching.
An AI LLM or agent, when being trained on GPUs or doing inference work on GPUs and possibly CPUs, stores existing and freshly computed vectors as key-value items in a memory cache, the KV cache. In a GPU server this cache can span two memory tiers: the GPUs’ HBM and the CPUs’ DRAM. As more data enters the KV cache, existing data is evicted. If that data is needed later, it must either be recomputed or, if it was moved out to external storage such as locally attached SSDs or network-attached storage, retrieved, which can be faster than recomputing the vectors. Avoiding KV cache eviction and vector recomputation is becoming table stakes for AI training storage vendors, with DDN, Hammerspace, VAST Data, and WEKA as examples.
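For readers unfamiliar with the mechanics, the toy Python sketch below illustrates the general idea of a two-tier KV cache that evicts entries to slower external storage rather than discarding them, so they can be fetched back instead of recomputed. The class and method names are hypothetical illustrations, not DDN’s, Nvidia’s, or any vendor’s actual implementation.

```python
# Illustrative sketch only: a toy two-tier KV cache with eviction to slower
# external storage. All names here are hypothetical, not any vendor's API.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity):
        self.hbm = OrderedDict()   # fast tier (stands in for GPU HBM)
        self.external = {}         # slow tier (stands in for SSD or NAS)
        self.hbm_capacity = hbm_capacity

    def put(self, token_id, kv_vector):
        # When the fast tier is full, move the least-recently-used entry
        # out to external storage instead of throwing it away.
        if len(self.hbm) >= self.hbm_capacity:
            evicted_id, evicted_kv = self.hbm.popitem(last=False)
            self.external[evicted_id] = evicted_kv
        self.hbm[token_id] = kv_vector

    def get(self, token_id, recompute_fn):
        # Fast path: hit in HBM.
        if token_id in self.hbm:
            self.hbm.move_to_end(token_id)
            return self.hbm[token_id]
        # Slower path: fetch from external storage, typically still cheaper
        # than burning GPU cycles on recomputation.
        if token_id in self.external:
            kv = self.external.pop(token_id)
            self.put(token_id, kv)
            return kv
        # Worst case: the key/value vectors must be recomputed on the GPU.
        kv = recompute_fn(token_id)
        self.put(token_id, kv)
        return kv
```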

Sven Oehme, CTO at DDN, states: “Every time your AI system recomputes context instead of caching it, you’re paying a GPU tax – wasting cycles that could be accelerating outcomes or serving more users. With DDN Infinia, we’re turning that cost center into a performance advantage.”
Infinia is DDN’s object storage, redesigned from the ground up over several years. It provides sub-millisecond latency, supports more than 100,000 AI calls per second, and is purpose-built for Nvidia’s H100s, GB200s, and BlueField DPUs. DDN reminds us that Nvidia has said agentic AI workloads require 100x more compute than traditional models. As context windows expand from 128,000 tokens to over 1 million, the burden on GPU infrastructure skyrockets unless KV cache strategies are deployed effectively.
The company says the traditional recompute approach on a 112,000-token task takes 57 seconds of processing time. Tokens are the precursors of vectors, and their count indicates the scope of an AI processing job. When the same job was run with DDN’s Infinia storage, the processing time dropped to 2.1 seconds, a 27-fold speedup. It says Infinia can cut “input token costs by up to 75 percent. For enterprises running 1,000 concurrent AI inference pipelines, this translates to as much as $80,000 in daily GPU savings – a staggering amount when multiplied across thousands of interactions and 24/7 operations.”
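As a back-of-envelope check on those figures, the short Python snippet below reproduces the quoted 27x speedup and spreads the claimed $80,000 daily saving across the 1,000 pipelines; the per-pipeline figure is our own derived illustration, not a number DDN has published.

```python
# Back-of-envelope check of the quoted figures; the per-pipeline cost line
# is a derived illustration, not a DDN-published number.
recompute_seconds = 57.0   # traditional recompute path, 112,000-token task
cached_seconds = 2.1       # same task with the KV cache served from storage
speedup = recompute_seconds / cached_seconds
print(f"Speedup: {speedup:.1f}x")                    # ~27.1x

pipelines = 1_000
daily_savings_total = 80_000                         # quoted upper bound, USD/day
per_pipeline = daily_savings_total / pipelines
print(f"Implied saving per pipeline: ${per_pipeline:.0f}/day")  # $80/day
```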
Alex Bouzari, CEO and co-founder of DDN, says: “In AI, speed isn’t just about performance – it’s about economics. DDN enables organizations to operate faster, smarter, and more cost-effectively at every step of the AI pipeline.”
It is unclear how DDN’s implementation compares to those from Hammerspace, VAST Data, and WEKA, as comparative benchmarks have not been made public. We would suppose that, as KV caching is becoming table stakes, suppliers such as Cloudian, Dell, IBM, HPE, Hitachi Vantara, NetApp, PEAK:AIO, and Pure Storage will add KV cache support using Nvidia’s Dynamo offload engine.
Bootnote
The open source LMCache software also provides KV cache functionality, as does the InfiniGen framework.