SCADA

SCADA – Supervisory Control and Data Acquisition – traditionally names a control-system architecture comprising computers, networked data communications, and graphical user interfaces for high-level supervision of machines and processes. In the Nvidia Blackwell GPU environment, however, it is a client-server runtime that runs on GPUs, functioning as a multilevel cache between the PCIe bus, CPU, and storage on one side and the 100,000+ threads issuing random I/Os inside a GPU kernel on the other:

  • It coalesces I/O requests within the GPU and maintains a read-through cache, converting random I/Os into either local cache hits within the GPU or batches of I/Os that are packed together before being passed over PCIe to either local NVMe or a remote SCADA server (a sketch of this coalescing behavior follows this list).
  • It takes full ownership of NVMe block devices and implements an NVMe driver inside the GPU. This keeps random I/Os from having to be processed on the host CPU.
  • It enables peer-to-peer PCIe in a way analogous to GPUDirect. This avoids sending I/Os all the way to host memory, and keeps traffic between GPUs and storage local to the PCIe switch they share.
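The coalescing behavior in the first bullet can be sketched in plain Python. This is a toy model, not SCADA’s actual API; `ReadThroughCache`, `BLOCK_SIZE`, and the backend callback are all illustrative names. The point is the shape of the technique: check a local cache first, deduplicate the misses, and issue them to storage as one sorted batch rather than thousands of random requests.

```python
import random
from collections import OrderedDict

BLOCK_SIZE = 4096        # illustrative block granularity
CACHE_CAPACITY = 1024    # blocks the local cache can hold

class ReadThroughCache:
    """Toy model of a read-through cache that coalesces misses.

    Many threads each want one block; instead of issuing one
    request per thread, misses are deduplicated and sent to the
    backend as a single sorted batch.
    """
    def __init__(self, backend_read_batch):
        self.backend_read_batch = backend_read_batch  # fn: [block_id] -> {block_id: bytes}
        self.cache = OrderedDict()                    # block_id -> bytes, in LRU order

    def read_blocks(self, block_ids):
        results, misses = {}, set()
        for bid in block_ids:
            if bid in self.cache:
                self.cache.move_to_end(bid)           # cache hit: refresh LRU position
                results[bid] = self.cache[bid]
            else:
                misses.add(bid)                       # the set dedupes repeated misses
        if misses:
            # One batched, sorted request instead of len(misses) random ones.
            for bid, data in self.backend_read_batch(sorted(misses)).items():
                self.cache[bid] = data
                if len(self.cache) > CACHE_CAPACITY:
                    self.cache.popitem(last=False)    # evict least-recently-used block
                results[bid] = data
        return results

# Usage: 10,000 random reads collapse into a single backend batch.
io_batches = 0
def fake_nvme(batch):
    global io_batches
    io_batches += 1
    return {bid: bytes(BLOCK_SIZE) for bid in batch}

cache = ReadThroughCache(fake_nvme)
cache.read_blocks([random.randrange(2048) for _ in range(10_000)])
print(f"served 10,000 reads with {io_batches} backend batch(es)")
```

A second call with overlapping block IDs would be served partly from the cache and would issue a smaller batch, or none at all.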

There are a couple of places during LLM inference where small, random reads happen repeatedly:

  1. KV cache lookups. As an LLM like ChatGPT builds its response to a question word by word, the model needs to reference all the previous words in the conversation to decide what comes next. It doesn’t recompute everything from scratch; instead, it looks up cached intermediate results (the key and value vectors) from earlier in the conversation. These lookups involve many small reads from random places each time a new word is generated (see the first sketch after this list).
  2. Vector similarity search. When you upload a document to the LLM, the document gets broken into chunks, and each chunk is turned into a vector and stored in a vector index. When you then ask a question, it’s also turned into a vector, and the vector database searches the index to find the most similar chunks, a process that requires comparing the query vector against many small vectors stored at unpredictable places (see the second sketch after this list).
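To make item 1 concrete, here is a minimal numpy sketch of decode-time attention. The shapes, the names, and the use of a single attention head are illustrative assumptions, not any particular model’s code. Each generated token appends one key/value pair to the cache and then attends over every pair cached so far, so every step re-reads the whole cache:

```python
import numpy as np

d = 64                           # head dimension (illustrative)
kv_cache_k = np.empty((0, d))    # keys for all previous tokens
kv_cache_v = np.empty((0, d))    # values for all previous tokens

def decode_step(q, new_k, new_v):
    """One token of decoding: append to the cache, then attend over ALL of it."""
    global kv_cache_k, kv_cache_v
    kv_cache_k = np.vstack([kv_cache_k, new_k])  # cache grows one row per token
    kv_cache_v = np.vstack([kv_cache_v, new_v])
    scores = kv_cache_k @ q / np.sqrt(d)         # reads every cached key
    weights = np.exp(scores - scores.max())      # softmax over the scores
    weights /= weights.sum()
    return weights @ kv_cache_v                  # reads every cached value

rng = np.random.default_rng(0)
for _ in range(100):                             # generate 100 tokens
    out = decode_step(rng.standard_normal(d),
                      rng.standard_normal(d),
                      rng.standard_normal(d))
# After 100 steps the cache holds 100 (key, value) pairs;
# generating token 101 will read all of them again.
```

When such a cache grows too large for GPU memory and is kept on NVMe, each step’s lookups become exactly the small random reads described above.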
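Item 2 can be sketched the same way. Again a toy: a real vector database would use an approximate index such as HNSW or IVF rather than brute force, but the access pattern is the point — every query scores against many small stored vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_chunks = 384, 10_000
index = rng.standard_normal((n_chunks, dim))           # one embedding per chunk
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalize for cosine similarity

def top_k_chunks(query, k=5):
    """Brute-force similarity search: score the query against every stored vector."""
    q = query / np.linalg.norm(query)
    scores = index @ q                     # touches many small stored vectors
    return np.argsort(scores)[::-1][:k]    # indices of the k most similar chunks

print(top_k_chunks(rng.standard_normal(dim)))
```

An approximate index avoids scoring every vector by hopping between a subset of them, but those hops land at unpredictable offsets, which is why the reads stay small and random.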

Just as GPUDirect Storage has become essential for efficient bulk data loading during training, SCADA is likely to become essential for efficient inference in the presence of a lot of context, as is the case when using both RAG and reasoning tokens.

[Thanks to a Glenn Lockwood blog post.]