The Pliops LightningAI product functions as a memory tier for GPU servers and can deliver a more than 2x speedup for large language model (LLM) responses.
Pliops is an Israeli server CPU offload startup that has developed XDP (Extreme Data Processor) key-value store technology, with its AccelKV software running in an FPGA to accelerate low-level storage stack processing for software such as RocksDB. It has now developed a LightningAI product, using ASIC hardware inside a 1 or 2RU server, applicable to both training and inference LLM workloads.
CEO Ido Bukspan said: “We saw how we can leverage our technology to something even more that changed the needle significantly in the world. And the potential is huge.”
He said that Pliops developed the core product then “took it all the way and developed further than our product in order to show the end-to-end, amazing value of XDP to performance … It’s not just a specific area. It can be expanded. We did it all the way, developed all the stack and pieces of software needed in order to prove the value that our new AI tool can help AI developers get out much more from their existing GPUs.”
XDP LightningAI’s best fit is with inference workloads, where it enables an LLM running a multi-tier inference process to “remember” cached data, so the intermediate responses and data – the attention state – needed for a subsequent response can be retrieved rather than regenerated, speeding up end-to-end LLM processing time.
The LLM, running in a GPU server with high-bandwidth memory (HBM) and accessing NoSQL and vector databases, runs out of memory capacity during a multi-tier response. That forces old data, previously evicted from the HBM prefill cache, to be reloaded. LightningAI serves as a persistent memory tier for such data, enabling the GPU to avoid the HBM reload time penalty.
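As a rough illustration of that idea, the Python sketch below shows the decision an inference server faces on a follow-up request: fetch a previously stored attention state from an external tier, or pay the expensive prefill cost again. The names (ExternalKVTier, prefill, context_key) are hypothetical stand-ins, not Pliops software.

```python
import hashlib


class ExternalKVTier:
    """Stands in for a persistent key-value memory tier sitting outside HBM."""

    def __init__(self):
        self._store = {}  # key -> serialized attention state

    def put(self, key: str, kv_cache: bytes) -> None:
        self._store[key] = kv_cache

    def get(self, key: str) -> bytes | None:
        return self._store.get(key)


def context_key(token_ids: list[int]) -> str:
    # Derive a stable key from the conversation context tokens.
    return hashlib.sha256(str(token_ids).encode()).hexdigest()


def prefill(token_ids: list[int]) -> bytes:
    # Placeholder for the expensive GPU prefill that builds the attention state.
    return ("kv-state-for-%d-tokens" % len(token_ids)).encode()


def attention_state_for(token_ids: list[int], tier: ExternalKVTier) -> bytes:
    key = context_key(token_ids)
    cached = tier.get(key)
    if cached is not None:
        return cached              # hit: no prefill recomputation needed
    state = prefill(token_ids)     # miss: pay the prefill cost once
    tier.put(key, state)           # persist the attention state for later turns
    return state


tier = ExternalKVTier()
first = attention_state_for([1, 2, 3], tier)    # computes and stores
second = attention_state_for([1, 2, 3], tier)   # served from the tier
assert first == second
```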
It runs in an x86 server connected by NVMe-oF to a GPU server, and enables the GPU to sidestep a memory wall, more than doubling its speed while making it around 50 percent more power-efficient. Pliops sees it as a great benefit to inference workloads using retrieval-augmented generation (RAG) and vectors, where the GPU servers have limited memory capacity and operate in power-constrained environments.
A GPU runs the Pliops LLM KV-Cache Inference Plug-in software, using a Pliops API to issue standard GPU-initiated IO that requests Pliops CUDA key-value operations. The GPU servers’ BlueField DPUs send the request across a 400 GbE RDMA Ethernet fabric to ConnectX-7 NICs in the nearby (in-rack) XDP LightningAI server, where the XDP-PRO ASIC handles the data operations using direct-attached SSDs.
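To make that data path concrete, here is a minimal Python sketch of the flow. Every name in it (KVCommand, LightningTarget, GpuKVPlugin) is a hypothetical stand-in, and an in-process call takes the place of the DPU-to-NIC RDMA hop; this is not the Pliops API.

```python
from dataclasses import dataclass


@dataclass
class KVCommand:
    op: str                 # "put" or "get"
    key: bytes              # e.g. a hash of the conversation context
    value: bytes | None = None


class LightningTarget:
    """Stands in for the XDP-PRO ASIC with its direct-attached SSDs."""

    def __init__(self):
        self._ssd_backed_store: dict[bytes, bytes] = {}

    def execute(self, cmd: KVCommand) -> bytes | None:
        if cmd.op == "put":
            self._ssd_backed_store[cmd.key] = cmd.value
            return None
        return self._ssd_backed_store.get(cmd.key)


class GpuKVPlugin:
    """Stands in for the GPU-side plug-in issuing GPU-initiated key-value IO."""

    def __init__(self, target: LightningTarget):
        # In the real system this hop is DPU -> 400 GbE RDMA fabric -> NIC.
        self._fabric = target

    def put(self, key: bytes, value: bytes) -> None:
        self._fabric.execute(KVCommand("put", key, value))

    def get(self, key: bytes) -> bytes | None:
        return self._fabric.execute(KVCommand("get", key))


plugin = GpuKVPlugin(LightningTarget())
plugin.put(b"ctx-123", b"serialized attention state")
assert plugin.get(b"ctx-123") == b"serialized attention state"
```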
The Pliops stack includes application (vLLM) modifications, a GPU CUDA library for NVMe key-value commands, and an NVMe-oF initiator and target for the GPU and Lightning servers. The system can be deployed on standard 1 or 2RU Arm or x86-based servers and is fully compatible with the vLLM framework. A single unit can serve multiple GPUs.
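The sketch below imagines how the kind of application-level modification and key-value client described above might fit together. The hook names and classes (KVCacheAdapter, KVTierClient, on_prefill_complete, on_request_resume) are illustrative assumptions, not vLLM’s or Pliops’ actual interfaces.

```python
class KVTierClient:
    """Stands in for the CUDA/NVMe-oF key-value path to a Lightning server."""

    def __init__(self):
        self._remote: dict[str, bytes] = {}

    def save(self, request_id: str, kv_blocks: bytes) -> None:
        self._remote[request_id] = kv_blocks

    def load(self, request_id: str) -> bytes | None:
        return self._remote.get(request_id)


class KVCacheAdapter:
    """Glue between a serving engine and the external key-value tier."""

    def __init__(self, client: KVTierClient):
        self._client = client

    def on_prefill_complete(self, request_id: str, kv_blocks: bytes) -> None:
        # Push freshly computed attention state out to the tier once prefill ends.
        self._client.save(request_id, kv_blocks)

    def on_request_resume(self, request_id: str) -> bytes | None:
        # On a follow-up turn, pull the state back instead of re-running prefill.
        return self._client.load(request_id)


# One shared tier client can back adapters for several GPU workers,
# mirroring the claim that a single Lightning unit can serve multiple GPUs.
shared_client = KVTierClient()
workers = [KVCacheAdapter(shared_client) for _ in range(4)]
workers[0].on_prefill_complete("req-1", b"kv blocks")
assert workers[3].on_request_resume("req-1") == b"kv blocks"
```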
Pliops is working with potential customers, OEMs, and ODMs. They can inspect demonstration and proof-of-concept XDP LightningAI units now, and the company will be at SC24 in Atlanta, November 17-22. We can expect additional GenAI applications beyond LLMs to be supported in the future, as well as even greater LLM acceleration, in the 2.5x to 3x range.