By integrating KVCache into its filesystem, YanRong says it has dramatically improved KV cache hit rates and long-context processing, making AI inferencing cheaper.
Chinese storage software supplier YanRong provides the YRCloudFile distributed shared file system for HPC and AI workloads. It supports all-flash drives and Nvidia’s GPUDirect protocol. A KV cache stores intermediate results produced during an AI model’s inferencing stage so that they don’t have to be recomputed at every step, which would lengthen response time.
We understand that the KVCache in the YRCloudFile system likely serves as a distributed in-memory layer across a cluster of GPU servers, storing frequently accessed data: the key-value pairs computed during inference.
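To make the general idea concrete, here is a minimal, hypothetical sketch of prefix-keyed KV caching as described above. The class and function names are illustrative only and do not come from YanRong’s product or documentation; a real system would hold per-layer attention tensors in GPU memory and tiered storage rather than a Python dict.

```python
import hashlib
from typing import Optional


class PrefixKVCache:
    """Toy key-value cache keyed by a hash of the prompt prefix.

    A dict of bytes stands in for the per-layer attention K/V tensors that a
    real inference stack would keep in HBM or an external tier such as a
    distributed filesystem cache.
    """

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    @staticmethod
    def _key(prefix_tokens: list[int]) -> str:
        # Hash the token prefix so identical contexts map to the same entry.
        return hashlib.sha256(str(prefix_tokens).encode()).hexdigest()

    def get(self, prefix_tokens: list[int]) -> Optional[bytes]:
        return self._store.get(self._key(prefix_tokens))

    def put(self, prefix_tokens: list[int], kv_blob: bytes) -> None:
        self._store[self._key(prefix_tokens)] = kv_blob


def compute_attention_kv(prefix_tokens: list[int]) -> bytes:
    # Stand-in for the expensive prefill step that computes K/V on the GPU.
    return bytes(len(prefix_tokens))


def prefill(prefix_tokens: list[int], cache: PrefixKVCache) -> bytes:
    """Return the KV state for a prefix, recomputing only on a cache miss."""
    cached = cache.get(prefix_tokens)
    if cached is not None:
        return cached                              # hit: skip the prefill work
    kv_blob = compute_attention_kv(prefix_tokens)  # miss: pay the compute cost once
    cache.put(prefix_tokens, kv_blob)
    return kv_blob
```

The larger the reusable prefix (system prompts, shared documents, earlier turns of a conversation), the more prefill compute a hit avoids, which is why the gains below grow with context length.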
To see how its YRCloudFile KVCache performs, YanRong simulated realistic workloads, using publicly available datasets, industry-standard benchmarking tools, and NVIDIA GPU hardware. It found that YRCloudFile KVCache supports significantly higher concurrent query throughput, and offers concrete, quantifiable value for inference workloads.
YanRong conducted multi-phase tests comparing native vLLM performance against vLLM plus YRCloudFile KVCache across varying token counts and configurations.
One test evaluated total response time for a single query with context inputs ranging from 8,000 to roughly 30,000 tokens. YRCloudFile with KVCache provided a 3x to more than 13x improvement in TTFT (Time to First Token) as the context length increased:

A second test measured how many concurrent queries were supported with a TTFT value of 2 seconds or less:

It found YRCloudFile KVCache enabled 8x more concurrent requests compared to native vLLM.
A third test result showed that, under high concurrency, YRCloudFile KVCache achieved over 4x lower TTFT across different context lengths.
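YanRong has not published its benchmark harness alongside these figures, but TTFT is typically measured as the time from submitting a request to receiving the first streamed token. The sketch below shows one way to do that against a vLLM server’s OpenAI-compatible streaming endpoint; the endpoint URL and model name are placeholders, not details from YanRong’s tests.

```python
import time
from openai import OpenAI  # vLLM exposes an OpenAI-compatible HTTP API

# Placeholder endpoint and model name for illustration only.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


def measure_ttft(prompt: str, model: str = "my-llm") -> float:
    """Return seconds from request submission to the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:
        # The first chunk carrying content marks the time to first token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")
```

Running the same long-context prompt twice against a server backed by an external KV cache would show the cold-versus-warm difference that the single-query test above quantifies.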
YanRong says that these results show “how extending GPU memory via distributed storage can break traditional compute bottlenecks – unlocking exponential improvements in resource utilization.” All in all, “YRCloudFile KVCache redefines the economics of AI inference by transforming storage resources into computational gains through PB-scale cache expansion.”
You can find more details here.
Comment
We think YRCloudFile with KVCache shares some similarities with WEKA’s Augmented Memory Grid (AMG). This is a software-defined filesystem extension, which provides exascale cache capacity at microsecond latencies with multi-terabyte-per-second bandwidth, delivering near-memory speed performance.
A WEKA blog says it “extends GPU memory to a token warehouse in the WEKA Data Platform to provide petabytes of persistent storage at near-memory speed. … The token warehouse provides a persistent, NVMe-backed store for tokenized data, allowing AI systems to store tokens and retrieve them at near-memory speed.”
This “enables you to cache tokens and deliver them to your GPUs at microsecond latencies, driving the massive-scale, low-latency inference and efficient reuse of compute necessary for the next generation of AI factories.” AMG works by “persistently storing tokenized data in NVMe,” so that “tokens are stored, and pulled ‘off the shelf’ at inference time, instead of continuously being re-manufactured on-demand for every single request.”
AMG “extends GPU memory into a distributed, high-performance memory fabric that delivers microsecond latency and massive parallel I/O – critical for storing and retrieving tokens at scale in real-time.”
A YanRong spokesperson told us: “As WEKA has not disclosed further details about their Augmented Memory Grid, we have no way to make a direct comparison between the two systems’ implementations. However, when it comes to the general purpose and the impact on LLM inferencing, both YRCloudFile KVCache and WEKA’s Augmented Memory Grid share a similar goal, which is to extend the expensive HBM to a persistent, high-bandwidth, low-latency, and scalable parallel file system, so that a large number of KVs needed during the inferencing phase can be cached in the storage, avoiding repeated calculations and improving overall performance. To achieve this goal, we need to implement a mechanism in our product so that vLLM or other inference framework can read and write KV data.”
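The spokesperson’s last point, a mechanism letting vLLM or other frameworks read and write KV data, typically takes the shape of a connector the inference engine calls around prefill. The interface below is a hypothetical sketch of that pattern, assuming the external store is reachable as a shared filesystem mount; the class names, method signatures, and mount path are our own illustration, not YanRong’s or vLLM’s actual API.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ExternalKVConnector(ABC):
    """Hypothetical connector: an engine would call save_kv() after computing
    attention K/V for a prefix, and load_kv() before prefill to skip
    recomputation. Neither name comes from vLLM or YanRong documentation."""

    @abstractmethod
    def load_kv(self, prefix_hash: str) -> Optional[bytes]:
        """Fetch serialized K/V tensors for a prompt prefix, or None on a miss."""

    @abstractmethod
    def save_kv(self, prefix_hash: str, kv_blob: bytes) -> None:
        """Persist serialized K/V tensors so later requests can reuse them."""


class SharedFileKVConnector(ExternalKVConnector):
    """Toy backend writing KV blobs to a shared filesystem mount, standing in
    for a distributed parallel file system such as YRCloudFile."""

    def __init__(self, mount_point: str = "/mnt/kvcache") -> None:
        self.mount_point = mount_point

    def _path(self, prefix_hash: str) -> str:
        return f"{self.mount_point}/{prefix_hash}.kv"

    def load_kv(self, prefix_hash: str) -> Optional[bytes]:
        try:
            with open(self._path(prefix_hash), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def save_kv(self, prefix_hash: str, kv_blob: bytes) -> None:
        with open(self._path(prefix_hash), "wb") as f:
            f.write(kv_blob)
```

The design point both vendors are making is the same: as long as the store can serve KV blobs at high bandwidth and low latency, the expensive HBM footprint of long contexts can be offloaded and reused rather than recomputed.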