Tokens

Tokens – In the context of artificial intelligence (AI), particularly natural language processing (NLP), tokens are an intermediate representation of words, images, audio, and video, sitting between the original item and its vector form. A token is typically a unit of text, such as a word, subword, or character, that an AI Large Language Model (LLM) processes. For example, in the sentence “I love AI,” the tokens might be “I,” “love,” and “AI.” These tokens start as raw text or symbols.
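As a rough illustration, the minimal Python sketch below splits the example sentence on whitespace; real LLM tokenizers use subword schemes (such as byte-pair encoding), so actual token boundaries will differ.

```python
# Naive whitespace tokenization of the example sentence.
# Real LLM tokenizers use subword schemes, so boundaries will differ;
# this is only meant to show tokens as discrete units of text.
sentence = "I love AI"
tokens = sentence.split()
print(tokens)  # ['I', 'love', 'AI']
```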

To work with tokens mathematically, AI models (like those based on transformers) convert them into vectors—numerical representations in a high-dimensional space. This conversion happens through a process called embedding. Each token is mapped to a vector using an embedding layer, which is trained to capture semantic meaning, context, or relationships between tokens. For instance, “love” might become something like [0.23, -1.54, 0.89, …], a list of numbers that encodes its meaning relative to other words.
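To make the embedding step concrete, here is a hedged toy sketch: each token is mapped to a row of a small random matrix. The vocabulary, dimensionality, and values are made up for illustration; in a trained model the embedding layer learns these vectors.

```python
import numpy as np

# Toy embedding lookup: vocabulary, table size, and values are illustrative
# only; a trained embedding layer would learn these vectors from data.
vocab = {"I": 0, "love": 1, "AI": 2}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # 4-dimensional vectors

def embed(token: str) -> np.ndarray:
    """Return the vector for a token by indexing its row in the table."""
    return embedding_table[vocab[token]]

print(embed("love"))  # a list of numbers standing in for the token's meaning
```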

Therefore:

  • Tokens are the discrete units (e.g., words or subwords).
  • Vectors are the numerical representations of those tokens after embedding.

In practice, when people talk about “tokens” in AI models, they often implicitly mean these vector representations, especially when discussing how the model processes input.

Note that a token is the smallest unit of text that the LLM processes at a time. It’s typically a word, part of a word, or a punctuation mark, depending on the tokenization method used by the model. Tokenization is the process of breaking down input text into these smaller units so the model can understand and generate text.
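For a concrete look at subword tokenization, the sketch below assumes the third-party tiktoken package and its cl100k_base encoding; other models use different tokenizers, so the IDs and boundaries will vary.

```python
import tiktoken  # assumed installed: pip install tiktoken

# Encode a sentence into token IDs, then decode each ID individually to see
# the text piece it represents; uncommon words split into several subwords.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenization splits uncommon words into subwords.")
print(ids)                              # integer token IDs
print([enc.decode([i]) for i in ids])   # the subword behind each ID
```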

By contrast, a chunk is a larger segment of text, typically consisting of multiple tokens, that is grouped together for processing or analysis. Chunking often happens when dealing with long documents or inputs that exceed the model’s token limit, requiring the text to be split into manageable pieces.
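A minimal chunking sketch, again assuming tiktoken for token counting: the text is encoded to token IDs, sliced into windows of at most max_tokens, and decoded back to strings. Production pipelines often overlap adjacent chunks to preserve context across boundaries.

```python
import tiktoken  # assumed installed

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into chunks of at most max_tokens tokens each."""
    ids = enc.encode(text)
    return [enc.decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), max_tokens)]

chunks = chunk_text("A very long document. " * 500, max_tokens=100)
print(len(chunks), "chunks, each within the 100-token limit")
```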