Cache-Augmented Generation – Retrieval-augmented generation (RAG) connects external knowledge bases to a Large Language Model (LLM) and retrieves context each time a user asks a question, which adds retrieval latency to every response. Cache-augmented generation (CAG) counters this by preloading the relevant documents into the model’s context and storing the resulting inference state as a Key-Value (KV) cache. Researchers have demonstrated that CAG, by leveraging the extended context windows of modern LLMs, can eliminate the need for real-time retrieval altogether: when the documents or knowledge to be retrieved are of a limited, manageable size, all relevant resources are preloaded into the LLM’s extended context and its runtime state (the KV cache) is stored for reuse across queries.
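
A minimal sketch of this pattern, assuming Hugging Face Transformers; the model name, file path, and prompt format are illustrative assumptions, not the authors' reference implementation. The knowledge base is encoded once to build a KV cache, and each subsequent question reuses that cached state instead of triggering a retrieval step.

```python
# Minimal CAG sketch (assumed model name and file path; not a reference implementation).
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumption: any long-context causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# 1. Preload the (small, manageable) knowledge base into the context once.
knowledge = open("docs/knowledge_base.txt").read()  # assumed path
preamble = f"Answer questions using only the reference material below.\n\n{knowledge}\n\n"
preamble_ids = tokenizer(preamble, return_tensors="pt").input_ids.to(model.device)

# 2. One forward pass builds the KV cache for the preamble; this is the stored inference state.
with torch.no_grad():
    kv_cache = model(preamble_ids, use_cache=True).past_key_values

# 3. Each question reuses the cached state, so no retrieval runs at query time.
def answer(question: str, max_new_tokens: int = 128) -> str:
    q_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids.to(model.device)
    full_ids = torch.cat([preamble_ids, q_ids], dim=-1)
    with torch.no_grad():
        out = model.generate(
            full_ids,
            past_key_values=copy.deepcopy(kv_cache),  # copy so the stored cache stays pristine
            max_new_tokens=max_new_tokens,
        )
    # Return only the newly generated tokens.
    return tokenizer.decode(out[0, full_ids.shape[-1]:], skip_special_tokens=True)

print(answer("What does the knowledge base say about warranty terms?"))
```

Deep-copying the cache per question is the simplest way to keep the preloaded state reusable in this sketch; in the CAG literature the equivalent step is usually described as resetting or truncating the cache back to the preloaded prefix after each query.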