VDURA: AI training and inference need an optimized file and object balance

Assertions that object storage, rather than file storage, is best for AI training and inference have sparked much interest across the storage world. VAST Data co-founder Jeff Denworth and Microsoft AI Infrastructure Architect Glenn Lockwood have both put forward this point of view. Hammerspace Marketing SVP Molly Presley disagreed, and so too does VDURA CEO Ken Claffey.

VDURA provides a parallel file system for supercomputing and for institutional and enterprise HPC. Claffey thinks the file-versus-object data access debate in the AI training and inference markets is misplaced. He believes both have their roles, and he discussed that with us in an interview.

Blocks & Files: What started you thinking about this issue?

Ken Claffey, VDURA

Ken Claffey: VAST Data’s Jeff Denworth recently made a bold claim that “no one needs a file system for AI training” and that S3-based object storage is the future. While it’s true that AI workloads are evolving, the assertion that file systems are obsolete is misleading at best. 

Blocks & Files: What do you think are the realities of AI storage needs and the role of parallel file systems in high-performance AI training at scale?

Ken Claffey: At VDURA, we don’t see AI storage as a binary choice between file and object. Our architecture is built on a high-performance object store at its core, with a fully parallel file system front end. This means users get the best of both worlds: the scalability and durability of object storage with the high-performance access required for AI training.

With our latest v11 release, we have further enhanced our platform by integrating a high-performance distributed key-value store. This addition optimizes metadata operations and enables ultra-fast indexing for AI and HPC workloads. Additionally, VDURA provides a high-performance S3 interface that allows seamless access to the same files and data across both file and object protocols, ensuring maximum flexibility and investment protection for enterprises scaling AI infrastructure.
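As an illustration of what that multi-protocol access looks like in practice, here is a minimal sketch using the standard boto3 client. The mount path, S3 endpoint, and bucket name are all hypothetical placeholders, not VDURA defaults: data written through a POSIX mount is read back over S3.

```python
import boto3

# The same namespace exposed two ways: a POSIX mount and an S3 endpoint.
# Paths, endpoint URL, and bucket below are illustrative, not VDURA defaults.
POSIX_MOUNT = "/mnt/vdura/training-data"
S3_ENDPOINT = "https://s3.example.internal"   # hypothetical S3 endpoint
BUCKET = "training-data"                      # hypothetical bucket mapped to the mount

# 1. Write a file through the parallel file system mount (POSIX path).
with open(f"{POSIX_MOUNT}/tokens/shard-0000.bin", "wb") as f:
    f.write(b"\x00" * 1024)  # stand-in for tokenized training data

# 2. Read the same data back through the S3 protocol.
s3 = boto3.client("s3", endpoint_url=S3_ENDPOINT)
obj = s3.get_object(Bucket=BUCKET, Key="tokens/shard-0000.bin")
assert obj["Body"].read() == b"\x00" * 1024
```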

Blocks & Files: Does object storage have a role here?

Ken Claffey: Glenn Lockwood from Microsoft Azure recently argued that large-scale AI language models are increasingly trained with object storage, rather than file storage. His perspective aligns with a growing shift toward object-based architectures, but it’s important to examine the nuances of AI training workflows before jumping to conclusions.

Lockwood outlines the four major phases of AI model training:

  1. Data ingestion: Collecting vast amounts of unstructured data, best suited for object storage due to its immutability and scalability.
  2. Data preparation: Transforming and cleaning the data, which is largely an in-memory and analytics-driven task.
  3. Model training: Running tokenized data through GPUs and checkpointing model weights, requiring fast storage access.
  4. Model deployment and inferencing: Distributing trained models and handling real-time queries, often optimized through key-value stores.

While Lockwood asserts that parallel file systems are not required for these workloads, his argument centers on cost-effectiveness rather than raw performance. Object storage is well suited to data ingestion and preparation because of its scale and cost efficiency. However, for model training and real-time inferencing, a hybrid approach – like VDURA’s – delivers the best of both worlds.
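To make the storage demands of the model training phase concrete, here is a minimal PyTorch-style sketch. The model, loop, and checkpoint path are placeholders rather than any vendor’s recipe, but it shows why checkpoint writes sit on the training critical path.

```python
import torch
import torch.nn as nn

# Placeholder model and data; in real training these are large and sharded across GPUs.
model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters())
CHECKPOINT_DIR = "/mnt/pfs/checkpoints"  # hypothetical parallel file system mount

for step in range(1, 1001):
    batch = torch.randn(32, 1024)        # stands in for a batch of tokenized data
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodic synchronous checkpoint: training pauses while the weights
    # stream to storage, so write bandwidth directly bounds the time lost.
    if step % 100 == 0:
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            f"{CHECKPOINT_DIR}/ckpt-{step:06d}.pt",
        )
```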

Blocks & Files: What is Nvidia’s perspective on this as you see it?

Ken Claffey: As Nvidia releases next-generation GPUs and DGX platforms, it continues to emphasize high-performance storage requirements. According to Nvidia’s own guidance for DGX, the leading AI platform, the recommended storage configuration is:

  • “High-performance, resilient, POSIX-style file system optimized for multi-threaded read and write operations across multiple nodes.”

Did we miss the S3 requirement? Nowhere does Nvidia state that AI training should rely solely on object storage. In fact, their own high-performance AI architectures are designed around file systems built for multi-threaded, high-throughput access across distributed nodes.

Blocks & Files: Is checkpointing encouraging object storage use?

Ken Claffey: Denworth referenced Nvidia’s “S3 Checkpointer” as evidence of a shift toward object storage for AI training. However, he conveniently left out a critical detail. The very next part of Nvidia’s own documentation states: “The async feature currently does not check if the previous async save is completed, so it is possible that an old checkpoint is removed even when the current save fails.”

What does this mean in practice? Using async checkpointing may result in a recovery point further back in time: if a save fails after the previous checkpoint has already been removed, recovery must fall back to an even older one. This significantly reduces the reliability of checkpoints and increases the risk of lost training progress. The value of synchronous, consistent checkpointing cannot be overstated – something parallel file systems have been optimized for over decades.
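Here is a minimal sketch of what synchronous, consistent checkpointing means at the file system level. It uses the classic write-fsync-rename pattern; the paths are illustrative, and this is not any vendor’s implementation.

```python
import os

def save_checkpoint_durably(data: bytes, path: str) -> None:
    """Write a checkpoint so that `path` always holds a complete, valid copy."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # block until the bytes are durable on media
    os.replace(tmp_path, path)    # atomic on POSIX: old or new, never partial

    # Also fsync the directory so the rename itself survives a crash.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A crash at any point leaves either the old checkpoint or the new one on disk, never a partial file – exactly the guarantee the async S3 checkpointer quoted above does not make.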

Blocks & Files: How are you optimizing VDURA storage?

Ken Claffey: Rather than framing the debate as “file vs object,” VDURA has built a solution that integrates:

  • A high-performance object store to handle large-scale data ingestion and archival efficiently.
  • A fully parallel file system front end to optimize AI model training with low-latency, high-bandwidth access.
  • A distributed key-value store to accelerate metadata lookups, vector indexing, and inferencing.
  • A high-performance S3 interface ensuring multi-protocol access across AI workflows.

This architecture addresses Lockwood’s concerns while also meeting the needs of enterprises that demand the highest levels of performance and scalability. While object storage plays a key role, dismissing parallel file systems entirely ignores the practical realities of AI training at scale.

Blocks & Files: How do you see the future for AI storage?

Ken Claffey: Denworth and Lockwood both make strong cases for object storage, but they downplay the performance-critical aspects of AI training. The future of AI storage is hybrid:

  • Parallel file systems provide the speed and efficiency necessary for training.
  • Object storage is useful for archival, sharing, and retrieval workloads.
  • Multi-protocol solutions bridge the gap, but that doesn’t mean file systems are obsolete – far from it.
  • High-performance distributed key-value stores enhance metadata management and indexing, further optimizing AI workflows.

VDURA’s approach acknowledges this reality: a high-performance object store at its core, a fully parallel file system front end, an integrated key-value store, and a high-performance S3 interface – all working together to deliver unmatched efficiency for AI and HPC workloads. Unlike VAST, which claims object storage alone is the future, we recognize that AI training at scale requires the best of all storage paradigms.

Enterprises deploying AI at scale need a storage infrastructure that actually meets performance requirements, not just theoretical flexibility. While object storage plays a role, parallel file systems remain the backbone of high-performance AI infrastructure, delivering the speed, consistency, and scale that today’s AI workloads demand.

The industry isn’t moving away from file systems – it’s evolving to embrace the best combination of technologies. The question isn’t “file or object,” but rather, “how do we best optimize?” At VDURA, we’re building the future of AI storage with this balance in mind.