VAST Data: ‘No one needs a file system for AI training’

VAST Data co-founder Jeff Denworth posted on X that: “No one needs a file system for AI training… More specifically, no one needs a system that is *only* a file system. While the HPC storage community is out telling the world parallel file systems are essential to AI, customers have begun to deploy S3 checkpointers and S3 data loaders (which can also work async) for their training environments.”
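For readers unfamiliar with the pattern, here is a minimal sketch of an S3 checkpointer: model state is serialized to a buffer and PUT straight to object storage, with no file system in the path. The bucket and key layout are hypothetical, and a production checkpointer would typically issue the PUT on a background thread (the async mode Denworth mentions) so training doesn't stall:

```python
# Minimal sketch of an S3 checkpointer. Model state is serialized to an
# in-memory buffer and PUT straight to object storage; no file system is
# in the path. Bucket and key layout here are hypothetical.
import io

import boto3
import torch

s3 = boto3.client("s3")
BUCKET = "training-ckpts"  # hypothetical bucket name

def save_checkpoint(model, optimizer, step):
    buf = io.BytesIO()
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step},
        buf,
    )
    # A production checkpointer would usually hand this PUT to a background
    # thread so the training loop doesn't stall on the upload.
    s3.put_object(Bucket=BUCKET, Key=f"run-01/step-{step:08d}.pt",
                  Body=buf.getvalue())

def load_checkpoint(step):
    obj = s3.get_object(Bucket=BUCKET, Key=f"run-01/step-{step:08d}.pt")
    return torch.load(io.BytesIO(obj["Body"].read()))
```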

This got us thinking about the need for files in AI training and VAST’s direction, so we asked Denworth some questions.

Blocks & Files: Could you say why file system-based suppliers are so widely chosen for AI? DDN, for example, is used by Nvidia for its own storage, has SuperPOD certification, and supplies X's Colossus AI cluster as well as many other customers' AI applications. The same general point applies to NetApp, Pure, and WEKA. It's clear that many customers, including Nvidia, are using file systems, parallel or not, for AI training. Why is this?

Jeff Denworth, VAST Data

Jeff Denworth: It’s not binary, it’s evolutionary. Historically, all of the AI training frameworks required a POSIX/file interface. Only companies developing their own frameworks would consider using object storage, and this is limited to the best of the best.

Glenn Lockwood articulated an example of this here.

Many customers are still using file systems … my point was not that they're not being used, but rather that you need multi-protocol in today's age; otherwise, file system-only solutions result in very poor investment protection. The frameworks are evolving faster than customer investment decisions. Customers are now starting to make the transition, and we routinely hear from them that they love the ability to work in both modes on the same data simultaneously.
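As a concrete illustration of working in both modes on the same data: on a platform that exposes one namespace over both NFS and S3, a file written through the POSIX mount can be read straight back as an object. The mount point, endpoint, and bucket below are assumptions for the sketch, not any particular vendor's layout:

```python
# Hypothetical illustration of multi-protocol access: the same namespace is
# exposed as an NFS mount ("/mnt/data") and as an S3 bucket ("data").
# Paths and endpoint are assumptions, not a specific vendor's layout.
import boto3

# Write through the POSIX/file interface...
with open("/mnt/data/corpus/shard-0001.jsonl", "w") as f:
    f.write('{"text": "example training record"}\n')

# ...and read the same bytes back through the S3 interface.
s3 = boto3.client("s3", endpoint_url="https://storage.example.internal")
obj = s3.get_object(Bucket="data", Key="corpus/shard-0001.jsonl")
print(obj["Body"].read().decode())
```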

Lest you forget, Nvidia also bought an object storage company (SwiftStack). This says a lot.

Blocks & Files: Have any LLMs been trained purely using data provided directly from object storage systems? Surely this capability has only arrived recently with advances by Cloudian, MinIO, Nvidia, and Scality with GPUDirect-like access facilities for object data storage?

Jeff Denworth: Yes. Of the top-tier (top ten worldwide) models I know of:

  • A very prominent model is being trained exclusively on VAST S3 at CoreWeave. We have a few other top-tier names starting to experiment.
  • Azure Blob is being used for a very prominent model.
  • Nvidia is training a very prominent model on S3-compatible storage.

That’s just what I know of.

Blocks & Files: VAST has built an AI-focused software stack, the VAST Data Platform, comprising the base data store plus DataCatalog, DataBase, DataSpace, and DataEngine, as it fulfills its Thinking Machine vision with what we thought were the necessary software layers. But OpenAI with ChatGPT and the other GenAI model developers have shown that you can have smart chatbots without any of this software. Give them a vector database and a file system and they can do their thing. Witness DDN, IBM, NetApp, Pure, and WEKA with Nvidia SuperPOD credentials.

Jeff Denworth: It’s always possible to integrate a solution; that never means it’s practical or efficient.

VAST… breaks trade-offs of scale, transactionality, security, etc. to provide [in my opinion] the best possible approach to AI retrieval. Most organizations kick around GB-scale datasets and think they have a good solution. We’re envisioning a world where AI embedding models can understand recency and relevance of all data as it’s being chunked and vectorized … where all data will be vectorized [with] trillions of vectors that need to be searchable in constant time regardless of vector space size … this is only possible with our architecture.

A system that can manage ingestion of hundreds of thousands to millions of files per second, process them and index them in real time … as well as instantaneously propagate all data updates to the index so enterprises never see stale data. A system that doesn't need the expensive memory-based indices that legacy partitioning approaches depend on, because those approaches are not efficient. You need DASE (Disaggregated, Shared-Everything architecture) for all of this.
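As a rough sketch of that ingest-and-index loop: a newly written object fires an event, gets chunked and embedded, and the index is updated synchronously so a query issued immediately afterwards already sees the new data. Every name here (the embed function, the index's upsert method) is a hypothetical stand-in, not a real API:

```python
# Hypothetical sketch of event-driven ingest: each newly written object is
# chunked, embedded, and upserted into a vector index as it arrives, so a
# query issued right after the write already sees the new data. embed and
# index.upsert are stand-ins, not a real API.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Chunk:
    doc_id: str
    offset: int
    text: str

def chunks(doc_id: str, text: str, size: int = 512) -> Iterator[Chunk]:
    """Split a document into fixed-size chunks for embedding."""
    for i in range(0, len(text), size):
        yield Chunk(doc_id, i, text[i:i + size])

def on_object_created(event: dict, embed, index) -> None:
    """Handle one bucket notification: embed and index synchronously so the
    index is never stale relative to the data store."""
    doc_id, text = event["key"], event["body"]
    for c in chunks(doc_id, text):
        index.upsert(key=(c.doc_id, c.offset), vector=embed(c.text))
```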

Finally … the underlying data sources need to be scalable AND enterprise grade … not sure where else you get this other than VAST.

Blocks & Files: Has ChatGPT-style technology negated the need for VAST’s software stack?

Jeff Denworth: Quite the opposite. The rise of agentic applications where organizations compute in GPU time increases the need for our technology. I’ll ask that as you consider this, you stop thinking about AI and RAG as just chatbots … the future speed of business will not be defined by how fast a human processes data. Nvidia is working to deploy 100 million agents into its enterprise (to augment 50,000 employees) over the next few years – all working together for complex business tasks. You don’t think this will push boundaries of legacy storage and database systems?

I think I see a future very different from the one you see. Everything will be about scale, GPU time, and the ability to process unprecedented amounts of data to think about hard problems. Did you see my blog?

The Stargate announcement will be the first of many. Dario [Amodei] at Anthropic also declared a need for 100x scale-up in computation. This is not exclusively for training. System Two/Long-Thinking is going to change the world's relationship with data and compel the need for even larger volumes of data.

Blocks & Files: VAST has been in a great creative period, developing its original storage technology from ground zero, and then the Thinking Machines-type software stack. Is this period of technology creativity now over with nothing but incremental tech advances and business process developments from now on? What is the vision for the future?

Jeff Denworth: I can confidently say that we have the most inventive and most ambitious team in the business. Each customer interaction gives us more inspiration for the next ten years … and we are fortunate to work with the smartest customers in the world. To assume we’ve become complacent, fat, and happy would be a dangerous assumption to make.

I’m not going to lay out our vision over email as I don’t think that does either of us any service, but we can talk more about the future maybe the next time we meet.

Blocks & Files: Your arrays can run application software in the C-nodes, providing computational storage. Isn’t this akin to turning the array into server direct-attached storage (DAS) for that application, negating the basic purpose of having a shared storage resource?

Jeff Denworth: Shared data access across machines is central to what we do. Modern machinery needs real-time access to petabytes to exabytes of data to get a global data understanding. You can't pin that data to any one host. Where and how those functions run is just a packaging exercise … we like efficiency so the more we can collapse, the better … but DAS is the opposite of how we think. Disaggregation is not just possible; we've shown the world that it's a very practical route to radical levels of data access and data processing parallelism.

Blocks & Files: How do you size the compute resource in a computational storage array?

Jeff Denworth: We’re learning more about sizing every day.

  • I/O load
  • Query load
  • Function velocity
  • Event notification activity
  • QoS management
  • RAS

I’m not sure we’ve got it all figured out since each new release is adding substantially new capability. This keeps the performance team on its toes … but we’re trying.