Very large AI model training uses object storage

HPC expert Glenn Lockwood says that the largest AI language models are being trained with object storage, not file storage.

The conventional reasoning runs like this: AI model training needs unstructured data; most unstructured data is stored in files; so large language model (LLM) training needs file data, delivered by the parallel file systems used in high-performance computing (HPC) because they feed processors faster than serial file systems; therefore parallel file systems are needed for LLM training. Actually, no, says Lockwood, because the characteristics of the LLM training phases favor object storage over parallel file systems.

Glenn Lockwood, Microsoft

Lockwood is an AI Infrastructure Architect at Microsoft who has worked on one of the world’s largest supercomputers. He writes: “I guess supercomputers and parallel file systems are like peas and carrots in so many people’s minds that the idea of being able to run a massive parallel compute job without a massive parallel file system is so unintuitive that it is unbelievable.”

He mentions four phases of LLM production:

  • Data ingestion, “where crawlers scrape the Internet and pull down raw HTML, images, videos, and other media. These raw data are indexed and shoved into a data warehouse. At scale, this can be hundreds or thousands of petabytes of data for frontier models.”
  • Data preparation, “where the raw data is converted into tokenized data. It amounts to a huge data analytics problem that uses … text and image processing pipelines that filter, deduplicate, and otherwise clean the raw garbage on the Internet using frameworks like Apache Spark. The hundreds of petabytes of input get reduced down by 10x-1000x.”
  • Model training, “where the tokenized data is shoveled through the LLM on giant GPU clusters in little batches. As the data is processed, the model weights are updated, and those weights are checkpointed to storage.”
  • Model deployment and inferencing, “where the final model is copied across giant fields of inferencing servers, and a web service sits in front of it all to transform REST API requests into actual inferencing queries that run on the GPUs.”

The I/O patterns in each phase do not require a parallel file system in his view. Data ingestion “just pulls HTML, images, or video streams from the Internet and packs them into data containers. As it is packing webpages into these files, it is building a separate index that stores metadata about the webpage … and its location … Thousands of VMs might be performing these tasks completely independently.” This is write-once data, well suited to object storage’s immutability.

The point here is that “while one could store each scraped HTML page in a file that’s organized in a parallel file system, accessing those files would be very slow – a full crawl of all the data would require scanning hundreds of billions of little files.” Lockwood reckons “it’s better to implement data containers on top of object stores and use a distributed key-value store for the index.”
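To make that pattern concrete, here is a minimal sketch (our illustration, not code from Lockwood’s post) of packing scraped pages into one large container object with a separate byte-range index, so any page can later be read back with a single ranged GET instead of opening billions of tiny files. The bucket name, keys, and JSON index are placeholder assumptions; a production crawler would use a distributed key-value store for the index.

```python
# Illustrative sketch: pack scraped pages into one immutable container object
# and keep a separate index of byte ranges for random access later.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "crawl-raw"  # hypothetical bucket


def pack_and_upload(pages, container_key):
    """pages: iterable of (url, html_bytes) tuples."""
    blob = bytearray()
    index = {}
    for url, html in pages:
        index[url] = {"offset": len(blob), "length": len(html)}
        blob.extend(html)

    # One write-once object holding thousands of pages.
    s3.put_object(Bucket=BUCKET, Key=container_key, Body=bytes(blob))
    # Separate index mapping each URL to its byte range in the container.
    s3.put_object(Bucket=BUCKET, Key=container_key + ".index.json",
                  Body=json.dumps(index).encode())


def fetch_page(container_key, url, index):
    """Read back a single page with one ranged GET against the container."""
    meta = index[url]
    rng = f"bytes={meta['offset']}-{meta['offset'] + meta['length'] - 1}"
    obj = s3.get_object(Bucket=BUCKET, Key=container_key, Range=rng)
    return obj["Body"].read()
```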

Data preparation involves “running Apache Spark-like pipelines that chew through all the raw data in a trivially parallel way … Each processing task might read a couple hundred megabytes of data from an object all at once, process it in-memory, then dump it back out to objects all at once. File systems offer no benefit here, because each task reads once and writes once rather than skipping around inside individual objects.”
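A minimal PySpark-style sketch of that read-once, process-in-memory, write-once pattern (the paths and the cleaning rule are hypothetical; real pipelines chain many filtering, deduplication, and tokenization stages):

```python
# Sketch of a "read once, process in memory, write once" preparation task.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-sketch").getOrCreate()

# Each input object is read in full by exactly one task ...
raw = spark.read.json("s3a://crawl-raw/containers/*.jsonl")

# ... processed entirely in memory (here: drop very short documents) ...
cleaned = raw.filter(F.length("text") > 200)

# ... and written back out as new whole objects, with no in-place updates.
cleaned.write.mode("overwrite").parquet("s3a://crawl-clean/batch-001/")
```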

The input data is deduplicated, which “requires comparing every piece of data to every other piece of data.” This I/O-heavy step “is often done in a centralized location that is adjacent [to] the [ingested data] object store using CPU-based, completely separate supercomputers before training on GPUs ever begins.”
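Exact deduplication can be sketched in the same hypothetical Spark job by hashing every document and keeping one row per hash; real pipelines add much more expensive near-duplicate detection on top, which is what makes this step so I/O- and compute-heavy:

```python
# Exact-duplicate removal by content hash (a simplification of the real
# all-pairs comparison; near-duplicate detection is omitted).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()
docs = spark.read.parquet("s3a://crawl-clean/batch-001/")

deduped = (
    docs.withColumn("doc_hash", F.sha2(F.col("text"), 256))
        .dropDuplicates(["doc_hash"])
        .drop("doc_hash")
)
deduped.write.mode("overwrite").parquet("s3a://crawl-clean/deduped/")
```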

Lockwood asserts that “buying cheap object storage and a cheap CPU cluster is more cost-effective than buying an expensive file system and wasting your GPU nodes on trivially parallel text processing tasks.”

He says that, with HPC systems, “the need for fast checkpointing and restart [was] the primary driver behind the creation of parallel file systems.”

The AI model training phase also needs fast checkpointing and restart. However, although “parallel file systems certainly can be used for training, they are not the most cost-effective or scalable way to train across tens of thousands of GPUs.”

LLM training involves much repetition: “Training a model on GPUs, whether it be on one or a thousand nodes, follows a simple cycle (this is a ‘step’ in LLM training parlance) that’s repeated over and over.”

  1. A batch of tokenized data is loaded into GPU memory.
  2. That data is then processed through the neural network and the model weights are adjusted.
  3. All GPUs synchronize their updated weights.
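A minimal sketch of that three-part cycle using PyTorch DistributedDataParallel, with a toy linear layer and random tensors standing in for an LLM and its tokenized batches (our illustration, launched with torchrun; none of it is from Lockwood’s post):

```python
# Minimal data-parallel training "step" loop; run with:
#   torchrun --nproc_per_node=<gpus> step_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group("nccl")              # one process per GPU
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Toy stand-in for an LLM: a single linear layer wrapped in DDP.
    model = DDP(torch.nn.Linear(1024, 1024).cuda())
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # 1. Load a batch of (already tokenized) data into GPU memory.
        batch = torch.randn(32, 1024, device="cuda")  # stand-in for real tokens

        # 2. Process it through the network; gradients are computed locally.
        loss = model(batch).pow(2).mean()
        loss.backward()                               # DDP all-reduces gradients here

        # 3. Every rank applies the same averaged update, so all copies of the
        #    model weights are identical at the end of each step.
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```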

The I/O load of step 1 is not the same as that of a traditional HPC job because, firstly, the “millions of little text or image files … are packaged into large objects before the GPUs ever see them.” Secondly, the amount of tokenized data is actually quite small. The 405-billion-parameter Llama-3 model, “trained using 16,000 GPUs, only had to load 60 TB of tokens from storage. That divides out to 3.75 GB of tokens processed by each GPU over the entire course of a 54-day run.”
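A quick back-of-the-envelope check of those figures (the 60 TB, 16,000 GPUs, and 54 days come from the quote; the per-second rate is derived here):

```python
# Average input bandwidth per GPU implied by the quoted Llama-3 405B numbers.
total_token_bytes = 60e12        # 60 TB of tokenized training data
gpus = 16_000
run_seconds = 54 * 24 * 3600     # 54-day run

per_gpu_bytes = total_token_bytes / gpus
print(per_gpu_bytes / 1e9)               # ~3.75 GB of tokens per GPU
print(per_gpu_bytes / run_seconds)       # ~800 bytes/s per GPU on average
```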

That’s not much of an I/O challenge. Moreover, the tokenized data can fit into a GPU server’s local flash storage, which Hammerspace calls Tier 0 storage. Lockwood points out that “using node-local NVMe allows storage capacity and storage performance to scale linearly with GPU performance.”

He says: “Because no two GPUs will ever need to read the same input token, there’s never a need to copy input tokens between nodes inside the training loop.”

“Super high-bandwidth or super high-capacity parallel file systems are not necessary for loading input tokens during training.”

What about checkpoint writing? Here the actual I/O burden is less than we might suppose. “Unlike with scientific HPC jobs, the checkpoint size does not scale as a function of the job size; the checkpoint for a 405 billion-parameter model trained on 16,000 nodes is the same size as the checkpoint for that model trained on three nodes. This is a result of the fact that every training step is followed by a global synchronization which makes each data-parallel copy of the model identical. Only one copy of those model weights, which amounts to under a hundred terabytes for state-of-the-art LLMs, needs to be saved.”
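For a sense of scale, a weights-only checkpoint of a 405-billion-parameter model stored at two bytes per parameter (a bf16 assumption on our part; saved optimizer state would add several times more) works out to well under a terabyte per copy:

```python
params = 405e9
bytes_per_param = 2              # assumption: bf16 weights, no optimizer state
print(params * bytes_per_param / 1e12)   # ~0.81 TB per checkpoint copy,
                                         # regardless of how many nodes trained it
```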

He cites a VAST Data source and says: “Even a trillion-parameter model can achieve 99.7 percent forward progress (only 0.3 percent time spent checkpointing) when training across 3,072 GPUs with a modest 273 GB/s file system. A parallel file system is not required to get that level of performance; for example, HDD-based Azure Blob achieved over 1 TB/s when benchmarked with IOR for writes at scale.”

The approach is to checkpoint to GPU-node-local storage, so the checkpoint data is persisted quickly, then asynchronously copy it to a neighboring node’s local storage for fast restore, and finally migrate it to shared storage for longer-term retention. This involves large block writes, and Lockwood concludes: “This combination of modest write bandwidth and simple, sequential, large-block writes is ideally suited for object stores. This isn’t to say a parallel file system cannot work here, but this checkpointing scheme does not benefit from directory structure, fine-grained consistency semantics, or any of the other complexities that drive up the cost of parallel file systems.”
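A minimal sketch of that hierarchy, assuming a PyTorch model, a node-local NVMe mount at /local/ckpt, and an S3-compatible object store (the bucket and paths are hypothetical, and the copy to a neighboring node is omitted for brevity):

```python
# Hierarchical checkpointing sketch: fast local write, then async object upload.
import threading
import torch
import boto3

s3 = boto3.client("s3")
BUCKET = "training-checkpoints"          # hypothetical bucket


def save_checkpoint(model, step):
    local_path = f"/local/ckpt/step_{step}.pt"

    # 1. Large-block, sequential write to node-local NVMe: the job can
    #    restart quickly from this copy after a failure.
    torch.save(model.state_dict(), local_path)

    # 2. Asynchronous whole-object copy to shared object storage for
    #    longer-term retention; training continues while the upload runs.
    def upload():
        s3.upload_file(local_path, BUCKET, f"run-001/step_{step}.pt")

    threading.Thread(target=upload, daemon=True).start()
```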

The I/O requirements of the model deployment and inferencing stage are also quite straightforward:

  1. When provisioning a GPU node for inferencing, model weights must be loaded from shared storage as fast as possible.
  2. When using an LLM to search documents, a vector database is required to perform the similarity search that augments the LLM query with the relevant documents. This is the basis for RAG.
  3. Key-value caches are often used to reduce the latency for different parts of the inferencing pipeline by storing context, including the conversation or frequently accessed contextual documents.
  4. As the inferencing demand evolves, different models and weights may be swapped in and out of individual GPU servers.

“A parallel file system is not particularly useful for any of these; the only place in which their high bandwidth would be a benefit is in loading and re-loading model weights (#1 and #4). But as with hierarchical checkpointing, those I/O operations are whole-object, read-only copies that are a natural fit for object APIs. Complex directory structures and strong consistency simply aren’t necessary here.”
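As an illustration of that whole-object, read-only pattern for item 1 (and item 4), a freshly provisioned inference node can simply copy every weight object under a model prefix to local disk before the serving process starts; the bucket, prefix, and destination path below are hypothetical:

```python
# Pull model weights from an object store onto a new inference node.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "model-registry"                # hypothetical bucket
PREFIX = "llama-3-405b/v7/"              # hypothetical model version prefix
DEST = "/local/models/llama-3-405b"

os.makedirs(DEST, exist_ok=True)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        name = os.path.basename(obj["Key"])
        # Each transfer is a whole-object, read-only GET; no directory
        # hierarchy or strong consistency semantics are needed.
        s3.download_file(BUCKET, obj["Key"], os.path.join(DEST, name))
```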

“The I/O patterns of each of these [four] steps map nicely to object storage since they are predominantly write-once and whole-file transactions. Parallel file systems certainly can be used, and workloads will benefit from the high bandwidth they offer. However, they come with the cost of features that aren’t necessary–either literal costs (in the case of appliances or proprietary software) or figurative costs (allocating people to manage the complexities of debugging a parallel file system).”

“Parallel file systems aren’t bad, and they aren’t going anywhere. But they aren’t required to train frontier models either, and … some of the largest supercomputers on the planet are designed not to require them.”

Read Lockwood’s blog for a deeper dive into this topic.