VAST Data: From storage solutions to unified AI platform

Analysis. VAST Data is climbing up the AI stack from its commodity storage base toward a single converged AI stack system. That’s the sense we got from a recent briefing with VAST field CTO Andy Persteiner.

VAST provides a disaggregated, parallel, scale-out file storage system built on a single QLC flash tier, and has layered software on top of it: a data catalog, global namespace, database, and a forthcoming data engine. It had an AI focus, with a Thinking Machines slant, well before the generative AI flash flood hit. Renen Hallak, CEO and co-founder, has said: “The VAST Data Platform delivers AI infrastructure that opens the door to automated discovery that can solve some of humanity’s most complex challenges.”

Andy Persteiner, VAST Data

Persteiner explained that a lot of VAST’s customer projects were “focused more on … boring old data processing and prep. And a lot of this is CPU-based ETL workloads … Typically, they have a pipeline, which has a number of different technologies embedded, and then they’re stitching them together. And the ultimate goal is to get their data into a format that they can do either analytics, or they can start to do training or inference.”

An issue is that “oftentimes the data is not ready to be, the data is not fit for AI … work. And so they need to get it there.” People spend a lot of time doing cleansing, prep, and ETL, typically on CPU-based farms.

“Our ultimate goal is for people to take all of the messy unstructured, ugly data, bring it onto our platform, and then for the bits and pieces that need to have structure to them, to allow them to take it into our database, without having to go in and move it somewhere else to do transformation first.”

Persteiner told us: “Some of the projects I’m personally working on right now are directly involved in people taking existing data pipelines that take data from various sources, move them into a data warehouse for analytics, and BI and taking that and moving on to our platform. … We start by allowing them to run their own compute separately to do the ETL and then bring it into our database.”

Data engine

“The first portion … of the data engine that’s happening in the near to medium term is that we’re embedding an execution engine or execution framework directly on the platform. That execution framework is largely based on Spark. But there’s a lot of other bits and pieces that we put in there to make it much more optimized for doing processing of ETL.” 

This is due in a few months or so, and “will give customers the ability to bring data to process in place, and get it into a database where they can start doing analytics against it.”

Looking further out, “our goal is, generally speaking, just to take all the bits and pieces that customers are doing today, where they might be deploying them in Amazon or Google, they might be deploying them on-prem, they might be stitching together lots of different systems or different technologies. And we try to consolidate as many of them as possible.”

This may not even involve storage. “Some of the larger deployments that we have also include customers who have complex ETL pipelines, who have data prep and cleansing mechanisms who have execution engines and databases, and they’re just starting to realize that what they already have is a platform that can do many of those things.” Meaning VAST.

“And so we’re working on projects with them [and] in some cases, we’re working on projects with customers who they don’t need storage at all. What they need is a way of processing and getting their data into a tabular format and running analytics on it.”

Persteiner commented: “We’re actually pretty excited to find new customers, using our database and our data engine as the lever to bring us into these opportunities.” In his view, “you’re going to start to see customers who exclusively use us for those database and data engine offerings.”

He added this thought: “To deploy a VAST Data platform, of course, you need hardware. And right now, we haven’t released support for GPUs in that side of things … We haven’t yet implemented the scheduling side of things for GPUs, which means that customers are going to be deploying CPU-based workloads on our Data Engine at first, which lends itself very naturally to processing to ETL, to doing transformations and basic functions … A lot of those things are not what people think of classically as AI. But they’re a very integral component to AI.”

As customers adopt this, “they’re going to then have a place where they can land all the data and then start to do more advanced analytics and AI on it. And then you’re going to start to see us allow for more flexibility. I can’t give you time frames and when sort of management of GPUs or orchestration layers is going to happen, but our plan is ultimately to be able to manage a customer’s … processing fabric, regardless of what kind of hardware they have sitting behind it.”

Many customers have AI workload arrangements in place. “I would say that the vast majority of large scale customers that we talk with, they already have a scheduler, they already have orchestration layers, they already have an ML ops platform, they already have all of these things. And so it’s not as if we’re gonna go in there and tell them to rip all of those things out. It’s not realistic. And they can manage their compute just fine. They don’t need us for that.”

“We’re deploying into some of the largest supercomputers in the world. They had tens of thousands of servers, they had thousands of GPUs. They don’t need us to manage those things. It’s not the way it’s going to work. And so I think that this is partially why we’re transitioning bit by bit into things.”

VAST will target specific workloads first. For example, “imagine the pipeline where you need to take a lot of the different sources and ingest them into a platform, have them be processed to get into a tabular format, that can be put into a discrete set of jobs. This isn’t … a general framework that you need. It’s a relatively specific framework.”

“Another thing that we’re going to be exposing is a way for customer message buses to dump data directly to us without having to go through another layer … Customers have event pipelines that typically revolve around a message bus like Kafka or something like that. And the platform itself will allow for things like Kafka to integrate directly with the database tables that we have. In such a way that they may not need to manage a Kafka system, or if they do, that we can integrate directly with it.”

Pipeline convergence

We suggested that what VAST is doing is like water rising: absorbing the bottom stages of the AI pipeline one by one so it can converge the different pieces of software, lifting the customer comfort level stage by stage. Eventually VAST could run AI processing directly within its own environment, with GPUs involved as well. People aren’t going to throw out their existing thousand-GPU Nvidia systems, but perhaps in the future VAST will be able to co-opt existing GPU servers into the VAST system.

Persteiner replied: “I think you’re on the right track [but] co-opt may not be exactly how we would approach it at first.” He sees scheduling as an entry point. Customers use schedulers to keep GPUs “fed and happy in terms of having data coming in.”

That data could be located in multiple places in a customer’s distributed environment, as could the GPUs, with data needing to move to them. Persteiner said VAST will “at first allow for customers to run processing with their GPU farms against data that might have been ingested elsewhere. It’s just a matter of us moving the bytes across the wire in as intelligent a fashion as possible.” VAST will use its global namespace capabilities for this.

There are other angles here. “But more interesting might be the scenario where a scheduler is scheduling a job, and we’re made aware of the scheduling such that we can intelligently move those bytes ahead of time. The reverse can also be true, where the GPU scheduler can inquire about the locality of references for data, and choose to run a job in a location that has the best economics or the best sort of performance in terms of moving bytes back and forth.”
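The locality inquiry Persteiner sketches reduces to a cost comparison: ask where a job’s input bytes live, then run the job where the fewest bytes have to move. A toy version of that decision, with made-up site names, dataset names, and sizes:

```python
def best_site(placement: dict, job_inputs: list) -> str:
    """Pick the site that minimizes bytes moved for this job.

    placement maps dataset name -> {site: bytes of that dataset stored there}.
    The cost of running at a site is the total size of input data NOT local to it.
    """
    sites = set()
    for locations in placement.values():
        sites.update(locations)

    def cost(site: str) -> int:
        return sum(size
                   for name in job_inputs
                   for loc, size in placement[name].items()
                   if loc != site)

    return min(sorted(sites), key=cost)  # sorted() makes ties deterministic

# Hypothetical layout: most of the bytes live in us-east
placement = {
    "clicks":  {"us-east": 500, "eu-west": 0},
    "catalog": {"us-east": 0,   "eu-west": 40},
}
print(best_site(placement, ["clicks", "catalog"]))  # "us-east": only 40 bytes move
```

The same cost function works in the other direction Persteiner mentions: a scheduler could call it before placing a job, or the platform could use it to prefetch bytes to wherever the job has already been placed.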

He said: “We don’t want to toss people’s schedulers to the curb. We want to integrate with them.” 

VAST will also integrate with Nvidia’s ML ops tools. Its ultimate aims go further. “The idea is that customers don’t need to know all of these bits and pieces. They don’t need to know that there’s a scheduler or they don’t even need to know what type of hardware they have sitting there. They know that they have data and they know they need answers from it. So our goal is to allow customers to have that black box experience where they don’t need to know about the bits and pieces. But there’s a long way between here and there.”

Hyperconvergence

“Early on, we were very tactically focused on making sure that we were applicable to the customers we’re talking to, which was largely HPC, in the research side of things. But now, as we graduated through the layers of HPC research to enterprise, to a large variety of AI-centric customers, and then into AI-focused cloud service providers, our focus is really to build a platform that can allow for everything to happen in one place, and not focus on a specific area or not.”

We suggested: “You’re a new form of hyperconverged company?”

In a way, Persteiner replied. “I think that, if you say the word hyperconverged, a lot of times people have this vision in their head of what that means.” Nutanix-style computing, basically.

Yes, Persteiner agreed. “I think for us, if we use the word hyperconverged, it would be in the context of a data platform. It wouldn’t be in the context of infrastructure. Because we want to bring together all the disciplines of data processing and data analytics, not necessarily just bring together how data is stored.”

Storage is a commodity

“As probably most people would tell you, storage is commodity and they don’t think there’s a future in it … I think there’s going to be a need for large scale data storage for some time to come … We still feel like there’s a decent revenue base on that side of things. But it is a race to zero in terms of how much people are willing to spend on it. And so all of our value is going to be adding layers on top.”

Building a data platform, in other words.

“We don’t think that customers are going to choose a data platform based on the lowest price per gigabyte, for example, even though that’s what they have done for … general-purpose storage in the past.”

It’s a cloudifying world

VAST will be offering its platform more and more in the public cloud. “The majority of small and medium and even large enterprises are migrating large amounts of workloads to the cloud. Our strategy has always been to follow where the data is, to some extent. And so we have customers who have large scale cloud deployments in terms of a lot of storage in the cloud, or a lot of compute in the cloud. And so we’ve been finding ways to integrate with the cloud service providers to offer our experience there as well.”

VAST has initially ported its software to AWS for customers bursting workloads to it. “We’re scaling out to allow for more of the cloud service providers to take advantage of that as well … You’ll see announcements at some point for some of the others. And the emphasis, again, will be focused initially on bursting. But then you’ll start to see us transition in to scaling that out. Some of that will be scaling out using cloud native marketplace offerings, some of it will be offerings that are more embedded within the cloud service providers in the sense that … They have customers who are asking for VAST in their cloud.” 

He doesn’t just mean the big three cloud providers. VAST will be working with smaller CSPs as well. 

Persteiner reckons: “As you start to see the data platform become more of a reality in the sense that we’re going to have the data engine componentry in the products, you’ll start to see that customers will start to blur the lines between processing and storage, both on-prem and in the cloud.”

Global-scale databases will also change perceptions. “If you can run a query that federates across a table that effectively spans the world, then people don’t have to start bifurcating where they store everything. If you have a database that was worldwide, that you didn’t have to worry about where the ingestion point was, and you can run a query that was optimized for figuring out the right place to run the compute.”

“Imagine you have a big fat table. And it’s spread out through a geography and you run a query, and you don’t even know where that data might be. But when you execute the query, the platform itself is able to dispatch the right level of compute in the different locations. In such a way that, when you get your response back, it’s a concatenation or an assembly of all the responses from the different … edge locations, and maybe datacenter base locations, then you can start to transition how you think about a database.” It will change “how you think about a data pipeline, if you don’t have to go and make sure that you make a copy of data to move over there.” 
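What he describes is classic scatter-gather: dispatch the same query to every location holding a shard of the logical table, aggregate locally so only small partial results cross the wire, then assemble the answer. A stdlib sketch with invented region shards:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards of one logical table, keyed by region
SHARDS = {
    "us":   [("widget", 10), ("gadget", 3)],
    "eu":   [("widget", 7)],
    "apac": [("gadget", 5), ("widget", 1)],
}

def local_query(region: str) -> dict:
    """Aggregate where the shard lives; only the partial totals leave the region."""
    totals = {}
    for item, qty in SHARDS[region]:
        totals[item] = totals.get(item, 0) + qty
    return totals

def global_query() -> dict:
    """Scatter the query to every region in parallel, then merge the partials."""
    merged = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(local_query, SHARDS):
            for item, qty in partial.items():
                merged[item] = merged.get(item, 0) + qty
    return merged

print(global_query())  # {'widget': 18, 'gadget': 8}
```

The caller never learns, or needs to learn, which region held which rows: that is the “big fat table” experience Persteiner is describing, minus the hard parts of consistency and placement.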

VAST is morphing into an AI storage, pipeline, and execution software platform in front of our eyes. No other storage supplier has such a vision of the future – at least publicly. The VAST AI data platform is being built as you read this. VAST seems confident it knows what it’s doing, where it’s going, and how to lead its customers there.