We quiz Pure’s VP for R&D Shawn Rosemarin on AI training, inference and Retrieval-Augmented Generation. The first part of this interview is here.
B&F: Do storage considerations for running AI training in the public cloud differ from the on-premises ones?
Shawn Rosemarin: I think they do, but the considerations themselves are nuanced. In the cloud and on-prem, we still have a choice of spinning disk (hard drives), SSDs, and persistent memory. We’re seeing that roughly 60 to 80 percent of the spinning disk still in service sits with the public clouds and hyperscalers. They are the last holdout in the move to flash-based storage.
That being said, the price of SSDs continues to come down, and the requirement for more workloads to have access to energy-efficient, high-performance storage is increasing. Pure has a flash capacity advantage over commodity SSDs with our Direct Flash Modules, which are at 75TB now and road-mapped out to 150TB, 300TB and beyond. We believe that, both in the cloud and on-prem, this density advantage will be a key pillar for Pure in terms of performance, density and efficiency.
Also, the ten-year lifespan of Pure’s Direct Flash Modules (DFMs), compared with our competitors’ five-year SSD lifespan, has a dramatic impact on long-term total cost of ownership.
If you think about public clouds, and where their investments are going, they architect and invest in 5-, 7-, 10-year rack-scale designs. They’re very, very attracted to these DFMs because they allow a much lower operating cost over time.
Imagine if I’m a hyperscaler, and my current rack density only allows me to fill my rack at 50 percent using spinning disk or commodity SSDs. Pure’s Purity OS and DFMs allow me to scale that rack to 80 or 90 percent full. I essentially get a significant benefit – not just in density and power, but also because I can sell two or three times more capacity on the same energy footprint.
Also, in the cloud we have scale-up and scale-out for storage and we have to balance GPUs with switch fabric and storage. What’s the speed of the network that’s connecting between the two? Is the traffic going north-south or is it going east-west? Performance considerations for AI are complex and will challenge traditional network topologies.
The key point here is that flexibility is essential. How easy is it going to be for me to add more storage, add more nodes, or connect to additional facilities where there may be other datasets? This is a real opportunity for Pure. Over the last 15 years, with well over 30,000 non-disruptive storage upgrades delivered, we have made it a core competency to ensure our customers can upgrade their environments without any disruption.
We see this as a major opportunity because of the newness of this market, how much change is likely to occur, and our proven experience in non-disruptive upgrades and migrations.
B&F: Does RAG (Retrieval-Augmented Generation) affect the thinking here?
Shawn Rosemarin: As we look at RAG, I think this is huge, because I’m going to be able to take a proprietary data set, vectorize it, and enable it to enhance the LLM, which could be in the cloud. Having a consistent data plane between on-prem and the cloud will make this architecture much simpler.
If I’ve got particular edge sites where I want to keep data at the edge for a whole bunch of reasons – maybe physics, maybe cost, maybe compliance – I can do that. But if I want to move those datasets into the cloud, to increase performance, having a consistent data plane will make it simpler.
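To make the vectorize-and-retrieve step concrete, here is a minimal RAG retrieval sketch: a proprietary document set is embedded with an off-the-shelf model and the closest matches are pulled in to augment an LLM prompt. The documents, embedding model and prompt wording are illustrative assumptions, not a Pure reference design.

```python
# Minimal RAG retrieval sketch (illustrative only, not a Pure reference design).
# Requires: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Vectorize a proprietary document set (placeholder snippets).
documents = [
    "FlashBlade firmware 4.1 supports NFS over RDMA.",
    "Replication between the Austin and Dublin sites runs nightly at 02:00 UTC.",
    "Purity upgrades are performed non-disruptively by the support team.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")  # small open embedding model
doc_vectors = model.encode(documents, normalize_embeddings=True)

# 2. Vectorize the user's question and retrieve the closest documents.
question = "When does replication to Dublin run?"
q_vector = model.encode([question], normalize_embeddings=True)[0]
scores = doc_vectors @ q_vector                  # cosine similarity (vectors are normalized)
top = np.argsort(scores)[::-1][:2]               # indices of the top-2 matches

# 3. Augment the prompt that would be sent to the LLM (cloud-hosted or local).
context = "\n".join(documents[i] for i in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```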
When you look at what we’re doing with Cloud Block Store, and what we’ve recently launched with Azure Cloud Block Store on cloud-primitive infrastructure, we’re taking this simple, easy-to-operate data plane, Purity – the same one at the heart of FlashArray and FlashBlade – and allowing customers to take those volumes and put them wherever they need to be: with an MSP, in a cloud, or on-prem.
B&F: What’s the difference between the processing and storage needs for AI inference compared to AI training?
Shawn Rosemarin: Oh, that’s huge. There’s training, there’s inference, and there’s archive. When you look at training, it’s computationally intensive. It’s what GPUs and TPUs were built for. It requires access to large volumes of data. This is the initial training of the models. You’re looking at high capacity, fast I/O, where data access speeds are critical.
When we look at inference, it’s taking a trained model, and seeing what kind of predictions and decisions it makes. Whether it’s an application that’s going to the model and asking questions, or whether it’s you and I going into ChatGPT and asking a question, we need a decent response time. It’s less about storage capacity and bandwidth compared to training – it’s more about latency and response times.
When you look at these two different models, the industry is very focused on training. At Pure, while we see a huge race to solve for training, we’re very bullish that, actually, in the long term, the majority of the market will be leveraging AI for inference.
The scale of that inference over time will be significantly larger than the scale of any training environment early on. We’re very focused on both the training and inference markets.
B&F: With AI training, the compute need is so great that the scalability of public cloud becomes attractive. But with inference, that relationship doesn’t apply. Should inferencing be carried out more on-premises than in the public cloud?
Shawn Rosemarin: I think inference will absolutely be carried out on-prem. With inference, there’s also the question of capturing the results of that inference. The enterprise wants to capture what questions were asked of my model, what responses were given, and what the customer did after they got that answer. Did they actually buy something? Did they abandon their cart?
I want to continue to refine my model, to recognize that even though it’s technically working the way it was designed, it may not be giving me the outcome I want – which is increased revenue, lower cost, lower risk. So I think data organizations will be very, very interested in the inference environment: what was asked and what came out. I’m going to take what I see happening in my on-prem inference model, and I’m going to use those findings to retrain it.
B&F: Do you think that where inferencing takes place in an enterprise will be partially dictated by the amount of storage and compute available? At edge locations outside the datacenter for example?
Shawn Rosemarin: I think we’ll actually see some training at the edge. I think we’ll actually see the development and the vectorization of datasets potentially start to happen at the edge.
Think about where compute is sitting idle. If we only have so much electricity, and we have much more training to do than that electricity can accommodate, I have to look for idle sources of compute and processing power.
I think you’ll start to see some elements of the training process broken out into a lifecycle where some of it is done at the edge, because if I can train the model there, and shrink it, then I can actually save on data ingress costs.
We really have to start thinking about what the training process is, how we build it out, and how we get each particular training element running on the most efficient platform.
Inference will not only be used by humans; it will also be used by machines. Machines could query the inference model to get direction on what they should do next, whether on a factory floor, at some remote location, et cetera.
When you think about an inference model, the key thing will be capturing the inputs and outputs of that model and being able to bring them back to a central repository where they can be connected with all the other inputs and outputs so that the next iteration of the training model can be determined.
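As a minimal sketch of that capture loop, the following assumes a simple record schema and a local JSONL file standing in for the central repository; both are illustrative placeholders rather than anything Pure prescribes.

```python
# Sketch of capturing inference inputs/outputs for later retraining.
# The schema and the local JSONL "repository" are illustrative placeholders.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class InferenceRecord:
    timestamp: float
    prompt: str          # what was asked of the model (by a human or a machine)
    response: str        # what the model returned
    outcome: str         # what happened next, e.g. "purchase" or "cart_abandoned"

def log_inference(record: InferenceRecord, path: str = "inference_log.jsonl") -> None:
    """Append one record; in practice this would stream back to a central repository."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example: a query hits the inference model and the result is captured for retraining.
log_inference(InferenceRecord(
    timestamp=time.time(),
    prompt="Which spare part fits pump model X200?",
    response="Part number 83-114B.",
    outcome="purchase",
))
```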
B&F: Is GPUDirect support now table stakes for generative AI training workloads?
Shawn Rosemarin: Yes. There is absolutely no doubt about that. This is about enhancing data transfer efficiency between GPUs, network interfaces and storage. I think that most vendors, including Pure, now have GPUDirect compatibility and certification. We’ve actually launched our BasePod and, just recently, our OVX-certified solution. So yes, getting the most efficient path between the GPUs and the storage is table stakes.
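For a sense of what that direct path looks like in code, here is a small sketch using the RAPIDS KvikIO bindings to Nvidia’s cuFile (GPUDirect Storage) API; the file name and buffer size are placeholders, and this is a generic illustration rather than anything specific to Pure’s certified solutions.

```python
# Sketch: moving data between storage and GPU memory with GPUDirect Storage,
# via the RAPIDS KvikIO bindings to Nvidia's cuFile API. The file name and size
# are placeholders. On systems without GDS hardware, KvikIO falls back to a
# host bounce buffer, so the code still runs (just without the direct DMA path).
import cupy
import kvikio

src = cupy.arange(1_000_000, dtype=cupy.float32)   # data already in GPU memory

# Write the GPU buffer straight to a file (placeholder name).
f = kvikio.CuFile("training_shard_000.bin", "w")
f.write(src)
f.close()

# Read it back directly into a GPU buffer, bypassing host RAM when GDS is active.
dst = cupy.empty_like(src)
f = kvikio.CuFile("training_shard_000.bin", "r")
nbytes = f.read(dst)
f.close()

print(f"Read {nbytes} bytes directly into GPU memory")
```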
But that’s not necessarily where we’re going to be ten years from now. Today, CUDA – the programming platform and compiler stack that lets code on the CPU drive the GPU – is Nvidia-only. However, there have been whispers online of GitHub projects that allow CUDA code and CUDA compilers to target other processors. I’m not sure of the validity of those projects, but I do think they are worth keeping an eye on.
I am interested to see whether this becomes a universal standard – like Docker eventually became in the container world – or stays Nvidia-only. Does CUDA become a more open model? Does the concept of an AI App Store extend across GPUs? Does performance become the deciding vector, or does the platform lock you in? I think these are all questions that need to be answered over the next three to five years.
Pure ultimately wants to make our storage operationally efficient and energy-efficient for whatever GPU market our customers want access to. AMD and Intel have GPUs, Arm is a player, and AWS, Azure, and Google haven’t kept it a secret that they’re building their own GPU silicon as well.
We are very much proceeding down a path with Nvidia making sure that we’ve got OVX, BasePod and all the architectural certifications and reference architectures to satisfy our customers’ needs. Should our customers decide that there is a particular solution they want us to build into, then our job will be to make sure that we can be as performant, operationally efficient, and energy efficient as we are on any other platform.
B&F: What I’m picking up from this is that Pure wants to remain a data storage supplier, and it wants to deliver data storage services wherever its customers want them delivered. For AI purposes that’s largely Nvidia today, but in two, three, four years’ time it may include the public clouds with their own specific GPU processors. It could include AMD, Intel or Arm. And when there’s a substantial market need for Pure to support such a destination, you’ll support it.
Shawn Rosemarin: I would agree with everything you said, but think a little bit larger about Pure. You’ll hear us talk a lot about data platform. And I know that everybody says platform these days. But I would think a little bit bigger.
So when you think about what’s happening with DFMs, remember the efficiency piece. Then think about a ten-year lifecycle of NAND flash, coupled with the Purity operating system, which allows me to drive significantly better efficiency in how flash is managed. Then I have a portfolio that allows me to deploy with FlashArray and FlashBlade. Then I have Portworx, which allows me to bring this efficiency to essentially any external storage capability for containers.
And now Pure is delivering Fusion to help customers automate orchestration and workload placement. The storage is a piece of it – but essentially, Pure is the data platform on which enterprises and hyperscalers deliver efficient access to flash, on-prem or in the clouds. Enterprises also get flexible consumption with our Evergreen//One as-a-service model, governed by SLAs. So I’d encourage you to view Pure as a data platform as opposed to just another storage appliance.