Pure Storage is becoming a dataset management company

Interview: An interview with Pure Storage CEO Charles Giancarlo started by discussing AIOps and its Enterprise Data Cloud’s control plane and finished with the concept of Pure as a dataset management company. It is developing a full stack of dataset management features and functions layered on top of its storage arrays and their software.

Charles Giancarlo.

Part one of this interview takes us from AIOps to the dataset management topic. Part two will look at what dataset management means.

B&F: Do you remember two or three years ago, maybe four or five years ago even, AIOps was a thing and now it’s not and yet …

Charles Giancarlo: Well, it is and it isn’t.

B&F: That’s where it gets interesting, isn’t it? Because your Enterprise Data Cloud control plane is an example, to me, of AIOps.

Charles Giancarlo: Organizations have to start thinking about things differently. What I’m about to talk about is, let’s say, the next iteration of it, which I think is another mind-blowing thing overall.

If I could use an analogy, ETL (Extract, Transform, Load) is effectively a batch job. … I’m going to talk about batch, interpretive and compiled as an analogy, okay? We haven’t talked about interpretive software in decades, but you remember. Let’s think about AIOps, to use the old phrase today: ETL chains and so forth. You have to find your data. You then have to determine what the end structure of the data needs to be for your analytics app or your AI app, whatever it is. Then you have to copy your data, replicate it, transform it, and set it up in the new architecture, even for Snowflake or whatever. Doing all of that can take weeks or even months. And then it’s ready for processing. Well, that’s a batch job. Anything that takes a lot of time to prepare before you can run it is batch.

B&F: This is not going to happen in real time. This is going to be with background processes going on.

Charles Giancarlo: What can happen in real time are general prompts into a big AI. That’s about language. What can’t happen in real time is enterprises taking their data because it’s not been processed by AI. Think about the training that OpenAI took. It took what? Months? Years. Okay. Well, that’s what has to happen before enterprises can get meaning out of their own data. It may not take years, but it takes weeks or months to prepare it even for inference because they have to get it ready for inference. So think of that as a batch job. 

Once we have the AI data cloud, or sorry, once we have the Enterprise Data Cloud in place, first of all, the data is accessible. Secondly, we believe we can put in something I’m going to call interpretive, meaning that instead of it taking weeks, we believe we can [do it faster] with an AI engine. And because AI, if nothing else, is really good at translation, we can translate it from raw data to the data that you need, to the data in the form that you need to process. So it replaces the T in the ETL chain.

The Data Cloud replaces the E in the ETL chain, and then you still have to process it. So I think of it as interpretive, meaning that, yes, it’s not entirely real time, but now you’re talking about minutes rather than days or weeks.

I think the next step, which is going to be years in the future, is: why not tag data as it’s being written, meaning production data as it’s being written? Now, I understand you can’t do that today and you won’t be able to do it tomorrow. And why not? Because companies don’t know what they need. In other words, you can’t tag it with something if you don’t know what the use of the data is. And why are you going to invest in a whole bunch of GPUs when you don’t know what you need? So I think it’s going to start with interpretive and eventually move to that last step, which I would then call compiled because it’s all ready to go.

B&F: I’m going to use a stack analogy here. So this is Pure Storage and this is Dell, and this is VAST, and sitting above them, according to David Flynn [Hammerspace] and Jason Lohrey [Arcitecta], are data managers. These people are saying, we’ll stick a vector database or other databases in our data management systems, and then we can feed data up the AI data pipelines better. And this is something, as far as I understand, you don’t do. Should you?

Charles Giancarlo: That’s the question. Data managers have existed for a very long time, and, … at a broad level, what’s the best way to put it? It ignores the fact that VAST specifically is designed as a data store to feed AI, so they’re not a production data store at all, right?

Think of the way that data storage today is structured as being application-specific. Their application-specific design ties directly to the data management, directly to the AI. But then you’ve got to get the data there from the production environment.

What we are saying, and by the way, okay, so Dell has lots of different production environments. We have a lot of different production environments. The thing we’re doing that’s different is that it’s a single operating system. You’ve got a half dozen there [with Dell]. And then, secondly, we’re tying it together, so all of our systems now operate as a cloud. Whether you want to go to a data lake or data warehouse or lakehouse, there are all kinds of names around this, whether you want to go to that or not. The point is, you can use your production-level data as the data feed, the real-time or almost real-time data feed, rather than having to copy all of your data to yet another data store that’s specifically designed for another specific purpose.

B&F: If a customer comes along to you and says, your stuff is great, but I need a vector database. I’m going to go to Pinecone, or I’m going to go to Weaviate, and that’s the data I’m going to shovel up the AI data pipeline. So your job is to feed Pinecone in this particular regard. Why not bring the vector database down into Pure’s software environment?

Charles Giancarlo: Well, the real question, I think, which is true for any system vendor, is what are the benefits and what are the limitations of putting it, embedding it, within your own environment. So think about Hadoop, right? In some ways, Hadoop was designed so that you would not only have storage, but you would have whatever database you chose embedded within the same environment.

The challenge, of course, is that it didn’t have flexibility in terms of compute versus capacity. You basically had to add compute and capacity in equal amounts, so you were limiting it. The other question is, well, are you cost-burdening your storage with compute that you don’t need, or with not enough compute that you do need? So is there some inherent technical advantage to combining the two on one processor, or is having it operate outside the storage system just as effective? You don’t lose any benefit, but on the other hand it gives you much greater flexibility.

So I guess our view, and I’ll come back to your question because I’m going to answer it in a different way, our view is: let the database be the database. The customer can buy as much or as little power as they need. We just want to be able to feed it. Now, there is an area, though, where we can add a lot of value, and that is on the metadata side of things, because why can’t we store the additional metadata with the storage metadata?

We have a giant metadata engine. We can add additional metadata to it. Does it really matter? And now the metadata is stored with the data. Does that make sense? 

B&F: Indeed.

Charles Giancarlo: Okay. So that is a benefit of what we can do that’s unique, right? That, in my view, doesn’t add unnecessary overhead to what we do, and it keeps the flexibility of how much database performance you need, how much AI performance you need, versus how much storage capacity you need.
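As a rough illustration of the metadata point, here is a minimal hypothetical sketch of what storing application metadata alongside the storage metadata could look like; the record shape and field names are assumptions made for illustration, not Pure’s actual metadata engine.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectRecord:
    # Storage-level metadata the array already tracks for each object.
    path: str
    size_bytes: int
    last_modified: str
    # Additional application metadata stored alongside it: classification
    # tags, lineage notes, or a pointer to the object's vector embedding.
    app_metadata: dict[str, str] = field(default_factory=dict)

# Example record: the extra metadata travels with the data rather than
# living in a separate catalogue.
record = ObjectRecord(
    path="/prod/finance/q3-report.parquet",
    size_bytes=48_213_977,
    last_modified="2025-06-30T08:15:00Z",
    app_metadata={"classification": "internal", "embedding_id": "vec-0013"},
)
print(record.app_metadata["classification"])
```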

B&F: Let me try to attack my point in a slightly different way to see if I can bend you around. Okay. You have workflow operations under your control. So you can have templates for workflows, and you can set applications, system applications, running inside those workflows?

Charles Giancarlo: Yes, we can. 

B&F: So you could orchestrate an AI data pipeline. 

Charles Giancarlo: We certainly could. Quite happily, yes. And now with MCP it makes it so much easier.

B&F: So I think we’re in the same place at this point. You won’t necessarily stick a vector database inside Purity, but you will be able to schedule a Pinecone ingest and vectorization operation.

Charles Giancarlo: Yes. Today we demonstrated how we can now schedule a full database app with disaster recovery, all based on a set of precepts, and our NCPs (Nvidia Cloud Partners) that are operating directly with, for example, any one of the virtualization engines as well as any one of the databases.
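To make the point about scheduling a Pinecone ingest and vectorization concrete, here is a minimal hypothetical sketch of what one such workflow step could look like, assuming the current Pinecone Python client; the share path, index name, and placeholder embed() function are illustrative assumptions rather than anything Pure ships.

```python
"""Hypothetical sketch: a scheduled ingest-and-vectorization step that feeds
a Pinecone index from files on a production share. The share path, index
name, and embed() placeholder are illustrative assumptions, not Pure APIs."""
import hashlib
from pathlib import Path

from pinecone import Pinecone  # assumes the current Pinecone Python client


def embed(text: str, dim: int = 8) -> list[float]:
    # Placeholder embedding: hash-derived numbers standing in for whatever
    # embedding model the real pipeline would call.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dim]]


def ingest(source_dir: str, index_name: str = "docs") -> None:
    pc = Pinecone(api_key="YOUR_API_KEY")  # hypothetical credentials
    index = pc.Index(index_name)           # index assumed to exist already
    batch = []
    for path in Path(source_dir).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        batch.append({
            "id": str(path),                    # document identity
            "values": embed(text),              # vectorization step
            "metadata": {"source": str(path)},  # provenance for retrieval
        })
        if len(batch) >= 100:                   # upsert in modest batches
            index.upsert(vectors=batch)
            batch = []
    if batch:
        index.upsert(vectors=batch)


if __name__ == "__main__":
    ingest("/mnt/production-share/reports")
```

In the architecture Giancarlo describes, a step like this would sit inside a workflow template and be triggered by the control plane rather than run by hand.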

B&F:  So you’re using workflows at the moment, which are in the storage environment?

Charles Giancarlo: Correct? 

B&F: The data protection environment.

Charles Giancarlo: That’s right. 

B&F: Which is still storage.

Charles Giancarlo: Yes, it is. 

B&F:  But you’ve got a general workflow operator, correct? 

Charles Giancarlo: Yes. 

B&F: You could do whatever you want.

Charles Giancarlo: We can do whatever we want. We’re doing that. 

B&F: If you want to schedule compute, you can schedule compute.

Charles Giancarlo: Correct. 

B&F: So Pure is no longer necessarily a storage company.

Charles Giancarlo: That’s correct. We’re a workflow company and we will become more of what … right now, we’re going to be a dataset management company.

****

Part two of this interview will dive into the dataset management concept.