LLM – Large Language Model. This is the class of model at the core of generative AI. An LLM is trained on vast amounts of text data to understand, generate, and manipulate human language. Large language models use deep learning techniques, such as neural networks, to predict and generate text. They are designed to handle a wide range of language-related tasks, such as translation, summarization, question answering, and even creative writing.
LLMs are typically massive, with billions of parameters. For example, GPT-4, developed by OpenAI, is a type of LLM. They are trained on diverse datasets, including books, articles, websites, and other text sources, to learn patterns in language. LLMs are used in chatbots, virtual assistants, automated content generation, language translation, and more.
LLMs are used in both AI training and AI inference.
LLM inference has different computational and memory requirements in its prefill and decode phases (a code sketch of the two phases follows the list below):
- Prefill takes place each time a new query is sent to an LLM and requires running the entire input query and context through the full model. It is compute-bound and benefits from being distributed across many GPUs. When prefill completes, the LLM emits the first output token, and the KV cache is filled with the key and value vectors for each transformer layer's attention heads.
- Decode occurs repeatedly as the query response is generated, starting from the first output token. Each step loads the cached key and value vectors from HBM again and again and performs a comparatively small amount of computation based on the previously emitted output token, until the full query response has been generated. It is memory bandwidth-bound.
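To make the two phases concrete, here is a minimal sketch of a generation loop using the Hugging Face transformers API. The model name, output length, and greedy decoding are illustrative assumptions, not part of the original description; any causal LLM follows the same prefill-then-decode pattern.

```python
# Sketch of the prefill / decode split (model name and length are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Explain the difference between prefill and decode:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: the whole prompt runs through the model once (compute-bound).
    # This fills the KV cache and produces logits for the first output token.
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values          # KV cache: keys/values per layer and head
    next_token = out.logits[:, -1].argmax(-1, keepdim=True)

    generated = [next_token]
    for _ in range(32):                            # Decode: one token per step (bandwidth-bound)
        # Each step feeds only the last token plus the cached keys/values,
        # so the dominant cost is streaming the KV cache from memory.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1].argmax(-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```

The single forward pass over the prompt corresponds to prefill; the per-token loop that reuses `past_key_values` corresponds to decode, which is why its cost is dominated by reading the cache from HBM rather than by arithmetic.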