File system provider WEKA wants to make its customers’ GenAI models respond faster by using a combination of vLLM and Mooncake technology.
This was revealed in a blog post by new hire Val Bercovici, titled “Three Ways Token Economics Are Redefining Generative AI.” A token is a word, part of a word, or a letter that is turned into a vector embedding and used in the semantic search part of large language model (LLM) analysis and response generation.
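For a concrete sense of what a token is, here is a minimal sketch using the Hugging Face transformers tokenizer. GPT-2 is our choice purely for illustration; it is not mentioned in Bercovici’s post, and the vendors discussed here use their own tokenizers.

```python
# Minimal tokenization sketch (GPT-2 chosen purely for illustration).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Token economics are redefining generative AI."
token_ids = tokenizer.encode(text)                    # integer IDs the model consumes
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # human-readable pieces

print(tokens)      # whole words and sub-word fragments
print(token_ids)   # the IDs that get mapped to vector embeddings inside the model
```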
Token economics, he argues, helps with cost reduction, latency reduction, and escaping memory capacity limitations.
Bercovici, who was CTO for NetApp back in 2016 and has long been involved in AI, writes: “DeepSeek’s recent breakthrough in token generation efficiency has reshaped the economics of Generative AI. By implementing context caching on disk, DeepSeek R1 significantly reduces the cost of token generation, slashing API expenses by up to 90 percent.” The word “disk” is a misnomer as he actually means SSD.
DeepSeek’s advance means “AI inference, typically constrained by expensive memory requirements, can now achieve memory-like performance at SSD pricing – potentially cutting costs by 30x.”
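Bercovici does not spell out how DeepSeek’s context caching works, but the basic idea is to store the expensive precomputed attention state (the KV cache) for context the model has already seen, so repeated prompt prefixes are processed only once. The sketch below is our own illustration of that idea; the names and the simple in-memory store are assumptions, not DeepSeek’s or WEKA’s actual design.

```python
import hashlib

# Hypothetical illustration of context (prefix) caching: the KV cache computed
# for a prompt prefix is stored once and reused, so repeated context does not
# have to be re-processed on the GPU. Names and storage layout are assumptions.
kv_store = {}  # prefix hash -> previously computed KV cache (kept simple here)

def prefix_key(prompt_prefix: str) -> str:
    return hashlib.sha256(prompt_prefix.encode("utf-8")).hexdigest()

def get_or_compute_kv(prompt_prefix: str, compute_kv):
    """Return a cached KV state for this prefix, computing it only on a miss."""
    key = prefix_key(prompt_prefix)
    if key in kv_store:
        return kv_store[key]           # cache hit: no prefill compute, lower cost
    kv = compute_kv(prompt_prefix)     # cache miss: run the expensive prefill once
    kv_store[key] = kv
    return kv
```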
He claims WEKA’s software architecture, with NVMe SSD acceleration and fast networking, “enables token processing at microsecond latencies.” He adds: “The ability to process high token volumes at ultra-low latency is becoming a key differentiator in the AI economy.”
As LLMs process larger datasets, a GPU high-bandwidth memory (HBM) capacity limitation emerges and fresh token data has to be fetched from storage, which takes time and delays the model’s response. Bercovici asserts: “WEKA enables LLMs and large reasoning models (LRMs) to treat high-speed storage as an adjacent tier of memory, achieving DRAM performance with petabyte-scale capacity” by “optimizing the handling of both input and output tokens.”
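WEKA has not published the integration details, but “storage as an adjacent tier of memory” can be pictured as a two-level KV cache: hot entries stay in DRAM, colder entries spill to a path on NVMe-backed storage and are reloaded on demand. The class below is an illustrative sketch under those assumptions; the path, size limit, and eviction policy are invented for the example and are not WEKA’s implementation.

```python
import os
import pickle

# Illustrative two-tier KV cache: a small in-memory (DRAM) tier backed by a
# larger file-system (NVMe SSD) tier. A sketch of the general idea only.
class TieredKVCache:
    def __init__(self, ssd_dir="/mnt/fast-nvme/kvcache", dram_limit=1024):
        self.dram = {}                # hot tier: Python dict standing in for DRAM
        self.dram_limit = dram_limit  # max entries kept in memory
        self.ssd_dir = ssd_dir        # cold tier: directory on fast storage
        os.makedirs(ssd_dir, exist_ok=True)

    def put(self, key, kv_blob):
        if len(self.dram) >= self.dram_limit:
            # Evict an arbitrary entry to the SSD tier (a real system would
            # evict by recency or expected reuse).
            old_key, old_blob = self.dram.popitem()
            with open(os.path.join(self.ssd_dir, old_key), "wb") as f:
                pickle.dump(old_blob, f)
        self.dram[key] = kv_blob

    def get(self, key):
        if key in self.dram:
            return self.dram[key]      # DRAM hit: fastest path
        path = os.path.join(self.ssd_dir, key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)  # SSD hit: slower, but no recompute
        return None                    # miss: prefill must run again
```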
This tiering, he claims, is going to be achieved by “WEKA’s upcoming integration with Mooncake [which] further enhances token caching, surpassing traditional solutions like Redis and Memcached in capacity, speed, and efficiency.” WEKA has a vLLM Mooncake project that optimizes token caching for inference serving. The project has two source technology components: vLLM (virtual large language model) and Mooncake. The organizations and technology relationships involved are diagrammed below:
Mooncake is a disaggregated architecture for serving LLMs developed by Chinese supplier Moonshot AI. We understand it was founded in Beijing in March 2023 by Yang Zhilin, Zhou Xinyu, and Wu Yuxin. Zhilin has an AI research background and a computer science PhD from Carnegie Mellon University. The company’s Chinese name is 月之暗面 or YueZhiAnMian, which translates to “Dark Side of the Moon,” a reference to the Pink Floyd album. It has raised more than $1.3 billion in funding and is valued at more than $3 billion by VCs.
Moonshot is an AI model developer that launched its Kimi chatbot last October. It focuses on long-context AI processing. It introduced Kimi 1.5 for text and vision data processing via GitHub and arXiv this month. It claims Kimi 1.5 “achieves state-of-the-art reasoning performance across multiple benchmarks and modalities – e.g. 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista – matching OpenAI’s o1” and outperforms “existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550 percent).” CoT stands for Chain of Thought.
Kimi 1.5 uses Mooncake, which is described in a separate arXiv paper. It is “the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache.”
LLM inference has prefill and decode phases. A DigitalOcean document says: “The prefill phase can be likened to reading an entire document at once and processing all the words simultaneously to write the first word whereas the decode phase can be compared to continuing to write this response word by word, where the choice of each word depends on what was written before.”
“LLM inference can be divided into two phases: prefill and decode. These stages are separated due to the different computational requirements of each stage. While prefill, a highly-parallelized matrix-matrix operation that saturates GPU utilization, is compute-bound, decode, a matrix-vector operation that underutilizes the GPU compute capability, is memory-bound.”
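The split is easy to see in code. The sketch below uses Hugging Face transformers with GPT-2 (our choice, purely for illustration): the prefill pass processes the whole prompt in one parallel, compute-heavy step and returns the KV cache; each decode step then feeds in a single new token and reuses that cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Prefill vs. decode, sketched with GPT-2 (model choice is ours for illustration).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt_ids = tokenizer("The dark side of the moon is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: the whole prompt is processed in one parallel pass, producing
    # logits for the first new token plus the KV cache for every prompt token.
    out = model(prompt_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    for _ in range(10):
        # Decode: one token at a time, reusing the KV cache instead of
        # re-reading the prompt -- a memory-bound, matrix-vector style step.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```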
Mooncake also uses vLLM technology. vLLM was developed at UC Berkeley as “an open source library for fast LLM inference and serving” and is now a community-driven open source project. According to Red Hat, it “is an inference server that speeds up the output of generative AI applications by making better use of the GPU memory.”
Red Hat says: “Essentially, vLLM works as a set of instructions that encourage the KV cache to create shortcuts by continuously ‘batching’ user responses.” The KV cache is a “short-term memory of an LLM [which] shrinks and grows during throughput.”
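For orientation, this is roughly what offline use of vLLM’s Python API looks like; the model name is an arbitrary example, and production serving would more typically go through vLLM’s OpenAI-compatible server. vLLM batches requests continuously, letting new prompts join in-flight work between decode steps, and manages the KV cache in paged blocks so GPU memory is used more efficiently.

```python
# Minimal offline vLLM example (model name chosen for illustration).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Explain what a KV cache is in one sentence.",
    "Why is LLM decoding memory-bound?",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```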
We understand that WEKA is going to integrate Mooncake and vLLM technology into its file system platform so that customers running LLMs referencing WEKA-stored data get responses faster at lower cost.
Bercovici says: “By leveraging breakthroughs like DeepSeek’s context caching and WEKA’s high-speed AI infrastructure, organizations can redefine their AI economics – making generative AI more powerful, accessible, and financially sustainable for the future.”