
Solidigm/VAST study says SSD system costs much less than disk drive one

Solidigm has released a 10-year Total Cost of Ownership (TCO) model showing a VAST Data/Solidigm SSD system supporting 1 EB of data costs 58.9 percent less than an equivalent disk drive-based Ceph/Seagate system.

Dave Sierra.

It is detailed in a white paper, The Economics of Exabyte Storage, written by Dave Sierra, who holds a Product Messaging and Positioning (Sustainability) position at Solidigm.

Sierra says: “This white paper demonstrates through rigorous, verifiable analysis that … The Solidigm + VAST solution delivers an estimated 10-year hardware and OpEx TCO of approximately $35.19 million USD, representing a substantial 58.9 percent saving compared to the HDD/CEPH solution’s calculated TCO of $85.62 million USD.”

The calculation is split into hardware, power, and space components.

He only looks at hardware and allied operating expenses and not the software and networking side of things. Because it’s a VAST Data system, the Ceres SSD enclosures, fitted with 122 TB QLC Solidigm SSDs, have a 10-year operating life, as that’s what VAST Data warrants. The Ceph system storage enclosures are filled with Seagate Exos Mozaic 3+ 30TB disk drives, and these are replaced twice during the model’s 10-year span. That’s because “empirical data from large-scale deployments, such as those reported by Backblaze, consistently shows that HDD Annualized Failure Rates (AFR) tend to increase significantly after the third to fifth year of operation, entering the “wear-out” phase of the bathtub curve. … Therefore, aligning with common enterprise practices for mission-critical systems and mitigating the impact of rising failure rates, this TCO model assumes a 4-year refresh cycle for the HDDs.”

That means the model assumes that the HDDs are refreshed two times, costing more, while the SSDs are not refreshed at all. Sierra says: “The significant difference in assumed operational lifespan—requiring two full replacements for the HDD fleet versus none for the SSDs over 10 years—is a fundamental factor driving the TCO divergence.”

The 1 EB of data carries filesystem metadata and data protection overhead. This means there is actually 1.25 EB of raw HDD capacity whereas, because of VAST’s 2.5:1 deduplication/compression ratio and generally lower erasure coding overheads, only 0.456 EB of raw SSD capacity is needed, meaning 2.74x more disk capacity is required. The net result is that 41,667 disk drives, in 1,852 RU (52 racks), are needed versus 3,738 Solidigm SSDs in 170 RU (5 racks). Some people might think this is stacking the deck.
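Taking the white paper’s raw capacity figures as given, the drive counts and the 2.74x multiple can be checked directly; a minimal sketch of the arithmetic, using only the numbers quoted above:

```python
# Sanity-checking the white paper's drive-count arithmetic,
# using only the raw capacity figures quoted in the article.
TB_PER_EB = 1_000_000

hdd_raw_eb = 1.25    # 1 EB usable plus filesystem metadata and erasure-coding overhead
ssd_raw_eb = 0.456   # after VAST's assumed 2.5:1 data reduction and lower EC overhead

hdd_drives = hdd_raw_eb * TB_PER_EB / 30    # 30 TB Seagate Exos Mozaic 3+ drives
ssd_drives = ssd_raw_eb * TB_PER_EB / 122   # 122 TB Solidigm QLC SSDs

print(round(hdd_drives))                    # ~41,667 disk drives
print(round(ssd_drives))                    # ~3,738 SSDs
print(round(hdd_raw_eb / ssd_raw_eb, 2))    # ~2.74x more raw disk capacity
```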

Sierra says: “This fundamental difference in required physical hardware drives downstream reduction on component counts, capital costs, power consumption, and space utilization.”

The disk drive HW refreshes add a lot of extra cost, as a table shows.

We can respectfully take issue with this, as Backblaze has recently extended the operational life of its disk drives. CFO Marc Suidan told analysts: “We evaluated the estimated useful life of our fixed assets based on more recent operational data and found that our hardware lasts an estimated six years instead of the three to five years estimate that we previously were using.”

Let’s modify Sierra’s table and add a single HDD refresh at the 6-year point, costing the average of his year 4 and year 8 costs: $10.91 million. That changes the total 10-year HDD hardware cost to $29.86 million, only $800,000 more than the SSDs.

That would, in turn, change the total 10-year TCO figure for the HDDs from $85.62 million to $74.71 million, a 52.9 percent difference from the Solidigm SSD TCO total, not 58.9 percent. It’s still substantial.

At this point, we can ask a question. If Sierra’s estimates are correct, then why is Backblaze not replacing its 321,201 disks under management with 122 TB Solidigm QLC SSDs, using VAST-like SW to give them the same effective capacity with far fewer drives?

The VAST software costs a lot of money and you can’t simply use the data reduction and metadata management part of it alone, junking the rest. Absent that, you would have to have 1.25 EB of raw SSD capacity, not Sierra’s 0.456 EB, putting the SSD acquisition cost up to the $60 million mark and raising the 10-year SSD TCO from $35.19 million to $66.13 million. This would be 11.5 percent less than the $74.71 million HDD cost with a single refresh cycle.
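For reference, here is the arithmetic behind the two adjustments discussed above, as a short sketch using only the dollar figures already quoted (all values in $ millions):

```python
# Re-running the TCO comparison with the article's two adjustments,
# using only the dollar figures quoted above (all values in $ millions).
hdd_tco_sierra = 85.62   # Sierra's 10-year HDD/Ceph TCO, assuming two drive refreshes
ssd_tco_sierra = 35.19   # Sierra's 10-year SSD/VAST TCO

# Adjustment 1: one refresh at year 6 instead of refreshes at years 4 and 8.
# Since the year-6 refresh is costed at the average of the two it replaces,
# the net saving is one average refresh cost.
avg_refresh_cost = 10.91
hdd_tco_one_refresh = hdd_tco_sierra - avg_refresh_cost                        # 74.71
print(f"{(hdd_tco_one_refresh - ssd_tco_sierra) / hdd_tco_one_refresh:.1%}")   # ~52.9%

# Adjustment 2: no VAST data reduction, so 1.25 EB of raw SSD capacity is needed.
ssd_tco_no_vast = 66.13
print(f"{(hdd_tco_one_refresh - ssd_tco_no_vast) / hdd_tco_one_refresh:.1%}")  # ~11.5%
```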

Sierra’s SSD vs HDD cost calculation for Solidigm is dependent on the use of VAST Data software and, it appears, an overly low view of HDD working life. Take these two things away and the Solidigm/VAST 10-year TCO model is not generally applicable outside the VAST customer base.

Bootnote

Doug Gourlay.

Qumulo President and CEO Doug Gourlay sent us his thoughts about this study.

He said: “I know some of the newer entrants into the storage/data landscape don’t have enough fleetwide data to use so they may have to look for external rubrics to identify refresh and replacement cycles of components such as HDDs – thus the convenient ‘you have to refresh HDDs every four years’ and ‘you can use QLC for 10 years’.

“So I did some digging across our fleetwide statistics that now span 13 years and 1000s of systems – a far larger and longer-lived dataset than available to smaller companies, but found the following HDD failure rates in 2023/2024:

  • 2023 HDD failure rate, fleet-wide, was 0.56% and the SSD failure rate was 0.47%
  • 2024 HDD failure rate, fleet-wide, was 0.65% and the SSD failure rate was 0.37%

“These are statistically similar, although certainly favoring SSD and encompasses SSD and HDD that are 10+ years in deployed life.

“Similarly though, Solidigm, the manufacturer of the HDDs [sic], who makes them in the Dalian fab in the People’s Republic of China, only warrants their QLC for 5 years, yet is trying to claim 10 years of useful life in the article you published.

“There are a few other issues, but the above are the main ones that caught my eye and may be worth a clarification from the editor.”

Glad to oblige.

Backblaze cloud revenues grow and accelerate with AI customers

Backup and cloud storage provider Backblaze reported revenues up 16 percent year-on-year, with its major contributing segment, B2 cloud storage, growing 29 percent.

It earned $36.3 million in revenues for the second calendar 2025 quarter, with a GAAP loss of $7.1 million, improved from the year-ago $10.3 million loss. Its Computer Backup segment revenues were up 4 percent annually to $16.5 million, although down 0.6 percent Q/Q, a second consecutive quarter of sequential decline. The growth was attributed to a price rise. B2 Cloud Storage revenues were $19.8 million compared to $15.4 million a year ago and $18 million in the previous quarter.

Gleb Budman.

CEO Gleb Budman stated: “We’re pleased with our continued strong quarterly performance, with B2 revenue growth accelerating from 23% to 29% sequentially and solidifying our journey to be Adjusted Free Cash Flow positive in Q4.” The “adjusted” term means excluding one-time costs, non-cash expenses, restructuring costs or litigation expenses, from the free cash flow calculation. Free cash flow is the result after subtracting capital expenditures from operating cash flow.

CFO Marc Suidan said: “Our subsequent aim will be to achieve GAAP net income positivity as operating leverage kicks in.”

There were 150 customers with $50K-plus ARR (Annual Recurring Revenue), up 30 percent year-on-year. A second Backblaze customer, an AI company, upgraded its contract to a seven-figure level in the quarter. Budman added: “We also signed our first six-figure B2 Overdrive customer in early Q3, just two months after product launch.” B2 Overdrive provides higher data access speed for AI and high-performance computing (HPC) workloads. This customer, a generative AI video company, transferred its data from a hyperscale public cloud provider because “they were getting killed by egress fees” and needed a fast data feed to GPUs, which B2 Overdrive provides.

He said in the earnings call: “Our focus has been to accelerate B2 growth and become adjusted free cash flow positive. We’ve accelerated year over year B2 growth from 22 percent at the end of last year to 29 percent this past quarter and we remain on course to be adjusted free cash flow positive in Q4. This marks an important step as we continue our progress towards becoming a profitable Rule of 40 company.”

The Rule of 40 is that a SaaS business’ revenue growth rate plus its profit margin should be equal to or greater than 40 percent. The revenue growth rate could refer to the ARR increase. Profit margin is typically calculated using the EBITDA (Earnings Before Interest, Taxes, Depreciation and Amortization) margin. It’s not a GAAP measure. A rule of 40 business is generally viewed as a sustainable business, with balanced profitability and growth.
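As a worked illustration with hypothetical inputs (not Backblaze’s reported figures), the test is simply the sum of the two percentages:

```python
# Rule of 40 check: revenue growth rate plus profit margin should reach 40 or more.
# The inputs below are hypothetical, not Backblaze's reported figures.
def meets_rule_of_40(revenue_growth_pct: float, profit_margin_pct: float) -> bool:
    return revenue_growth_pct + profit_margin_pct >= 40

print(meets_rule_of_40(29, 15))   # True: 29 + 15 = 44
print(meets_rule_of_40(16, 10))   # False: 16 + 10 = 26
```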

Backblaze has been focussing on its go-to-market activities to get access to more potential customers, with more direct sales focus for large customers. Budman said: “Some of the changes we made in our go to market include rebuilding account-based marketing, which required significant systems and process work, but now enables us to more precisely target organizations for whom we should be an ideal fit when they’re in market and to the broader buying community. We also revamped our approach to customer success to better identify and expand use cases with existing customers.” 

He added: “We added a field enablement function to further up level the sales team. We deployed AI in various activities across the go to market motion.”

As a result: “Our pipeline has grown significantly year over year and we’ve had a steadily growing increase in 6 figure opportunities enter the pipeline quarter over quarter, a sign that the machine is gaining momentum.”

The B2 Cloud Storage growth was driven by AI, with Budman saying: “This quarter, the number of AI customers grew 70 percent and data stored by these customers grew 40 times year over year. AI use cases drove the outperformance on our financial metrics this quarter. And this is by far the most data-hungry industry we have witnessed and we’re confident that AI will significantly drive our market opportunity.”

In contrast the computer backup business segment is fading. Suidan said: “In the [remainder of ] 2025, we see computer backup declining in the low to mid single digits on a quarterly basis. We continue to see areas for growth and are exploring options to improve the business, but at this point, we feel it is prudent to provide this outlook.”

Asked about the backup business, Budman said: “I think in the consumer side of the business, it’s a secular decline business. … And that’s in part because people aren’t doing backups of their laptops and desktops as much as they used to because there’s a lot more streaming in mobile devices and the like. On the business side of that, I think we actually have more opportunity.”

Budman said Backblaze is applying more of its focus to business backup: “We’ve put a tiger team, internally on computer backup.”

Backblaze had 321,201 operating disk drives at the end of the quarter. They have to be replaced, which costs cash, when their operational life comes to a close. It has improved its gross margin by revising its disk drive operating life upwards, with Suidan saying: “We evaluated the estimated useful life of our fixed assets based on more recent operational data and found that our hardware lasts an estimated six years instead of the three to five years estimate that we previously were using. This change better reflects our business operations and accounts for most of the gross margin improvements this quarter.”

The guidance for the next quarter is $36.9 million +/- $200,000, implying a 13.2 percent growth rate at the mid-point. It expects B2 Cloud Storage revenues to grow between 28 and 30 percent. The full year forecast has been lifted from $145 million +/- $1 million to $146 million +/-$1 million.

The Backblaze board has approved a stock repurchase program. We can hope to see a profitable Backblaze, in the GAAP sense, in 2026.

Scality RING is ready for RAG

Analysis: Scality says its RING object storage can be combined with a vector database and LangChain framework to fuel RAG workflows for AI models like GPT.

LangChain provides a framework for integrating external data, memory, and tooling into AI workflows. The external data in this case is Scality’s RING object storage. A vector database stores the mathematically calculated vectors that represent the multiple aspects, aka dimensions, of tokenized chunks of unstructured data. A GenAI model such as GPT or Llama can only use a customer’s proprietary unstructured data when generating its responses if that data has been converted to vectors and made available to it.

Without such vectorization, RING-stored data is invisible to the model. The AI models are trained on static, generic data. RAG (retrieval-augmented generation) gives them access to a customer’s proprietary data that can be either static (old) or current (real time) or both. This means the model can respond to requests with a better contextual understanding of the request’s background and data environment.

Scality can use LangChain tools to vectorize selected RING content, store it in a Milvus vector database, and make it available to AI models, such as GPT-3.5, GPT-4 models, Llama, and others.

A Scality article explains how the vector embeddings encode unstructured source data in a way that captures its meaning. An input request is itself vectorized, and its set of vectors provides an abstraction of the request’s meaning. The large language model (LLM) or AI agent then searches for similar vectors in the Milvus database, which stores vectors generated from the Scality RING system.

Scality says: “This enables the system to retrieve content that matches the user query semantically.” An input query asking “How to manage type 2 diabetes?” could enable the model to  detect and use documents on “insulin sensitivity” or “low-GI diets,” because, in semantic search terms, they are close to the set of vectors generated from the input request.

The vector database provides search functions such as approximate nearest neighbor (ANN) search, which is based on indexing techniques. ANN looks for the approximate closest match to an input pattern in search spaces so large that an exhaustive check of every data point to find the absolute closest match would be impractical, taking far too long. Find out more about ANN here.

Scality prefers Milvus over other vector databases because of its query speed, scalability, direct LangChain integration, and precision across different types of datasets.

Scality’s article provides a code example of the RING-LangChain-Milvus-GPT 3.5 interaction.
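That example is not reproduced here, but a minimal sketch of such a pipeline, assuming an S3-compatible RING endpoint, a reachable Milvus instance, and an OpenAI API key, might look like the following; the endpoint, bucket name, and chunking parameters are illustrative placeholders rather than Scality’s actual settings.

```python
# A minimal, illustrative RAG pipeline: RING (via its S3 API) -> LangChain -> Milvus -> GPT-3.5.
# Endpoint, bucket, and tuning values below are placeholders, not Scality's published code.
import boto3
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Milvus
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Pull source documents from a RING bucket over the S3-compatible API (hypothetical endpoint).
s3 = boto3.client("s3", endpoint_url="https://ring.example.com")
objects = s3.list_objects_v2(Bucket="corpus").get("Contents", [])
texts = [
    s3.get_object(Bucket="corpus", Key=obj["Key"])["Body"].read().decode("utf-8")
    for obj in objects
]

# 2. Chunk the text, embed it, and store the vectors in Milvus.
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).create_documents(texts)
store = Milvus.from_documents(
    chunks,
    OpenAIEmbeddings(),
    connection_args={"host": "milvus.example.com", "port": "19530"},
)

# 3. Answer a query with GPT-3.5, retrieving semantically similar chunks first.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
print(qa.invoke({"query": "How to manage type 2 diabetes?"})["result"])
```

Swapping GPT-3.5 for Llama or another model would then be a matter of changing the llm argument and, if needed, the embedding model.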

AI inferencing and training increasingly focus on object storage in addition to, or instead of, file storage, and suppliers such as Cloudian, DDN, MinIO, and VAST Data are emphasizing their strengths as RAG source data stores and AI pipeline support functions. We might expect Scality to develop this further, with a deeper and extended integration of the functional components.

More winners than categories in FMS 2025 Best of Show awards

Every business seems to love an award and the Future of Memory and Storage (FMS) 2025 event has Best of Show awards aplenty. There were a record-breaking number of nominations, with 24 Most Innovative something-or-other categories, 16 of them in the Most Innovative Technology area, and 26 winners. The excess of winners over categories comes from there being three winners in the Most Innovative Technology – Industry Standards category.

Having three winners in the same category can be seen as three equal first places but, surely, one of them must be the actual best of show? Some people might say that, if three of them are best of show, then none of them are best of show. But it’s a good idea to recognize the work of the standards bodies.

The show attracted 45 exhibitors and a total of 43 sponsors in 8 categories.

The Best of Show Award Winners are:

1. Most Innovative Customer Implementation – HPC Solution

  • DDN – Cineca Supercomputing Center with DDN Data Intelligence Platform.

2. Most Innovative Enterprise Business Application – HPC Solution

  • Graid Technology –  Graid SupremeRAID.

3. Most Innovative Consumer Application

  • Sandisk – WD BLACK SN8100 NVMe SSD.

4. Most Innovative Startup Company – Virtualization IT Infrastructure

  • Verge.IO – VergeIQ.

5. Most Innovative Hyperscaler Implementation

  • Pure Storage –  DirectFlash QLC.

6. Most Innovative Artificial Intelligence (AI) Application – Media and Entertainment Solution

  • VAST Data – VAST Data Platform and the NHL.

7. Most Innovative Artificial Intelligence (AI) Application – Retrieval-Augmented Generation (RAG) Workflow

  • Infinidat – InfiniBox Platforms.

8. Most Innovative Sustainability Technology

  • Western Digital – Advanced Rare Earth Material Capture Program.

9. Most Innovative Technology – Industry Standards

  • SNIA Technical Work Groups – CXL Memory Hackathon & EDSFF E2 Form Factor. 
  • UALink Consortium – UALink 200G 1.0 Specification.
  • CXL Consortium – CXL 3.X Specifications.

10. Most Innovative Technology – 3D NAND Flash Memory

  • YMTC – Xtacking 4.0.

11. Most Innovative Technology – NAND Flash Solution

12. Most Innovative Technology – High Bandwidth Memory Technology

  • NEO Semiconductor – X-HBM Memory Solution.

13. Most Innovative Technology – DRAM Solution

  • Micron  – 1-Gamma Node LPDDR5X LPDRAM.

14. Most Innovative Technology – AI Accelerator

  • Pliops – LightningAI FusIOnX.

15. Most Innovative Technology – Computational Memory

  • XCENA – MX1 CXL3.X.

16. Most Innovative Technology – Networks

  • UnifabriX  – Memory-over-Fabrics.

17. Most Innovative Technology – Memory Controller

  • Innodisk – CXL Memory Module

18. Most Innovative Technology – Memory Controller

  • Silicon Motion – SM2514-AT Automotive SSD Controller.

19. Most Innovative Technology – 3D Memory Technology

  • Adeia – Hybrid Bonding for Memory Device.

20. Most Innovative Technology – SSD Technology

  • Kioxia – LC9 Series 245.76 TB SSD with BiCS FLASH generation 8 Memory.

21. Most Innovative Technology – Enterprise SSD

  • Samsung – PM1763 16-Channel PCIe Gen6 SSD.

22. Most Innovative Technology – SSD Software Technology

  • FADU –  FADU Flexible Data Placement (FDP).

23. Most Innovative Technology – SSD Technology

  • Solidigm – Liquid Cooled Hot Swappable NVMe SSD.

24. Most Innovative Technology – All-Flash Storage Server Solution

  • Supermicro – ARS-121L-NE316R 1U Storage System.

The AI area dominated much of the show with its seemingly insatiable demand for more memory and more, faster-accessed storage.

Exploring Liqid’s composability

Interview. Liqid has announced composability products with its v3.6 product release, enabling host server apps to access dynamically orchestrated GPU systems built from pools of GPU, memory, and storage resources, with a focus on AI inferencing and agents. The release supports both PCIe-linked system components and CXL-accessed memory.

We talked with CEO Edgar Masri; co-founder, President, Chief Strategy Officer, and CMO Sumit Puri; and Dave Larson, GM and CTO, IO Acceleration Business, to explore the company’s technology capabilities and understand how they would be used. Their words have been edited for brevity and continuity. Let’s set the scene by refreshing our view of the product elements in Liqid’s recent announcement.

B&F: Who is your competition?

Edgar Masri.

Edgar Masri: Our biggest competition is still the status quo and people buying servers for four or five, six GPUs and then being stuck growing beyond that. 

Sumit Puri: There is a massive swing in the industry right now towards high powered GPUs. The entire Nvidia portfolio is on the PCIe side going a minimum of 600 watts per device and everything is shifting towards high-power stuff. So our first product is our PCIe 10-slot expansion chassis [EX-5510P] that we use in our composable architecture. One of the key benefits is the 600-watt support.

And that is a critical thing. If we take a look at the server vendors, the Ciscos, the Dells, the guys that are out there, and their entire portfolios: Dell, as an example, has one server today that supports 600 watts. That’s their new 4RU box, the XE7745, which can support 600-watt devices. The balance of their portfolio, their 1RU, their 2RU boxes: none of them can support these new devices. And so we think that’s a big gap in the industry that we can help fulfil with this new piece of hardware.

It is better than anything else available in the market. It has things like redundant power supplies and serviceable fans; it has multiple management ports. All the enterprise bells and whistles that you could want in a product like this have been built into it.

This box is built by Accton. Accton is a multi-billion dollar, publicly-traded technology company in Taiwan where Edgar was the CEO prior to coming here. And so we are extremely lucky to have not only Edgar but his relationship with Accton.

Sumit Puri.

We have an early OEM partner who’s been working with us on this product for a couple of years and that is Fujitsu. Fujitsu takes our product and white-labels or private-labels it in Japan; on the hardware and the software, they put the Fujitsu logo on it. Edgar at Accton had a good relationship with the Fujitsu team, so it just all continues to move forward together.

[The EX-5510C CXL 2.0 chassis] holds 20 terabytes of DRAM per box and, just like our composable GPU solution, we will have trays or resource pools of DRAM. We’ll be able to connect those to multiple servers. We will use our Matrix software and we will allocate resources to servers on the fly.

The days of talking about CXL are over and what we like to say is: Liqid is going to be the company to deliver the full stack. And what we mean by full stack is, not only the expansion chassis, but the switching layer to go with it so that we can build big clusters. And also the fabric manager, which is our secret sauce, Liqid Matrix, that’ll be part of the solution, in addition to all the ancillary things like optical interconnect. 

We are leading the market in this segment and we think there’s going to be some really interesting use cases around database, around in-memory database, around RAG acceleration, where now large pools of memory are going to make an impact. 

Liqid Matrix is our fabric manager and the things that we’re touting on this release is number one, our ability to do multi fabric. So speak to a pool of GPUs, speak to a pool of memory, manage all of that with a single pane of glass. And so all of the million lines of code that we have written over the years to make our PCIe magic work, we get to bring that over to CXL. And now with a single piece of software, we can orchestrate the data center as [the slide below] shows. Our vision is, at the end of the day, the data center will be disaggregated. We’ll pull GPU resources, PCIe resources, DPU resources over our PCIe fabric.

We will pull DRAM resources over CXL today. There will be DRAM-enabled SSDs, DRAM-enabled GPUs at some point in the future. Our vision of the software has always been multi-fabric and we scale up with these current technologies like CXL. In the future, there may be UALink and those other things, but then we also have a very clean easy path to be able to integrate things like Ethernet technology because, eventually, when we grow beyond the box, we want to be able to scale out over those legacy Ethernet [and] Infiniband architectures. Our solution, because we’re in a very standard space, fits very well into a paradigm where people are already running Ethernet, InfiniBand in the data center. We can plug our pods in. And that leads to the next big announcement, which is around the software piece, which is around Kubernetes. In addition to promoting our multi-fabric orchestration capability in this 3.6 release, we have a huge focus on Kubernetes orchestration.

If you recall a few years ago we introduced our Slurm integration and we had big universities … where they were using Slurm as the orchestration engine. As part of orchestrating with Slurm, they would put a config file out there that said, hey, with this job I need four servers with two GPUs apiece and this amount of storage. And we’d go on the backend and we would create bare metal infrastructure based upon the Slurm orchestration engine. Now our next move is into Kubernetes. We think Kubernetes is the killer app for composability. The entire enterprise is using containerized methodologies for deploying their workloads. 

97 percent of enterprise customers are using containers. Nvidia has made a statement that said, if you want to go do inference, you must use containers. Containers are built for ease of deployment, ease of serviceability, ease of scaling. They’re moving in that direction. We, in this software release, have introduced our Kubernetes plug-in. So the simple way of describing this is, when you are deploying a Kubernetes pod in your environment, you don’t need to know anything about composability, you don’t need to know anything about orchestration, you don’t need to know anything about the way disaggregation works. All you need to do is with that Kubernetes pod provide a very simple spec file and the spec file has just a couple of inputs. This is the model that I want to run. These are the GPUs I want to use, this is how many of them I want, and this is the server that I want you to go deploy it on.

We automate the entirety of deploying AI in the backend. We refer to it  as one-click deployment of inference. You come in and say, give me Llama-7B, go, and we will create the entire backend model with the appropriate GPUs and the appropriate server and we will give back to the user a running instance of any model that they want and we will have automated the entire backend process. So you don’t need to know anything about orchestration composability. The only thing you need to know is how to run your Kubernetes environment. But now we will make it dynamic and, at the end of the day, very efficient.

B&F: Do you support x86 CPUs in this?

Sumit Puri: Yes. So we support the full gamut; Intel, AMD and Arm. 

B&F: And have you linked to external storage?

Sumit Puri: Yes. Two ways that we do storage. And so one way we do storage is, as you know, we have our IOA product which Dave champions for us, that is our Honey Badger product. Very, very, very fast IO; the fastest storage on the planet. It’s going up to 128 TB per card now. And that is local storage we deploy in the kit. And so I would say 70 plus percent of our deployments include that card, for very fast cache in and out of the GPUs as an example. And then the other way we connect to storage and the architecture; the server will have a legacy Ethernet, InfiniBand, IP network to the data centre and you can go grab your VAST Data, your WEKA IO, your legacy storage over your Ethernet connection.

B&F: And if there’s going to be some kind of GPU Direct connection from the GPUs to the external storage that’s supported as well?

Sumit Puri: That is supported as well. We would argue that running GPU Direct storage to that Honey Badger card in the box is the best way to do it. There’s no better way than putting the GPU four inches physically away and turning the RDMA capability on. So yes, we support both ways.

B&F: I’m thinking that what you’ve got here is hot data storage in the LQT 5500. That needs to be loaded from someplace and that’s obviously, to my mind, from external storage.

Dave Larson.

Dave Larson: Yes. And this device, the expansion enclosure, is PCIe Gen 5. You can populate an RDMA NIC, a high-performance NIC, for RDMA connectivity to a much larger NVMe [store] or, frankly, spinning [disk] if you want it. If you want spinning disk, it’s fine. And then you populate the hot store locally. And then of course in a combined environment, which we’re not talking a lot about, but you then have even hotter store but also ephemeral store in CXL memory.

The main thing and the main benefit is the difference in bandwidth and, more importantly, latency we’re talking about: it’s one to two orders of magnitude better latency on CXL DDR4 than NVMe [PCIe] Gen 5 x16. Right? And then you have another two orders of magnitude latency to the far memory in a dual socket, and then you have another order of magnitude to the near-socket memory. So there are four tiers of memory speed that people can leverage appropriately for how you deploy the information and the weights and everything else into the GPUs for an AI infrastructure.

B&F: Can you mix different types of GPU inside one of these 5510P chassis?

Sumit Puri: Yes, absolutely. So not only can we mix different type of Nvidia GPUs, but we can also mix Intel and AMD, all the different varieties, into the box.

And not only do we support GPUs in the box, we are seeing customers begin to use FPGAs and other DPU type devices in the box. [There’s] a really important FinTech customer who finds benefit in deploying FPGAs in our box, not only for the latency benefit, not only for the fact that they can deploy 10 or 30 at a time, however many that they need through our software, they’re actually able to reprogram the device dynamically. 

So depending on what they are trying to do, they can change the personality of the FPGA, reset the FPGA, because one of the new capabilities of this new chassis is that we can power cycle that slot remotely. And so as you’re doing things like FPGA reprogramming, you don’t have to walk over and enable the physical reboot of the server. All of that mess goes away. And so there’s new devices that are being put into our system and FPGA is actually a hot one now.

B&F: What areas would be a good fit for using your kit?

Sumit Puri: One of the areas where we are finding some success is edge environments. Edge is really interesting for us. Think about data, think about GPUs. All the data is created at the edge and so customers have to do one of two things: move that data into the core data centre or move the GPUs to the edge.

What is the problem with edge? Edge is extremely limited on power. We don’t have hundred-kilowatt racks that we can run at the edge. We’ll be lucky if we find a 17 kilowatt rack at the edge. And so when you look at a 17 kilowatt rack, you say, how am I going to deploy AI into that rack? We talk up this concept of tokens per watt, tokens per dollar. I’ve got a limited amount of power; how many AI tokens can I squeeze out of that limited amount of power? This is an example of the kinds of things our customers are looking at. I can go and buy a single DGX solution and it’s the eight-way NVLink Ferrari kind of thing, a big monolithic block of infrastructure, and that block of infrastructure will consume 12 or 13 kilowatts of power.

Or we can deploy a Liqid-type architecture where we can choose lower power GPUs. Remember H100 comes in two flavours. It comes in a high power NVLink flavour, which is the top tier, which is what you need for things like training or it comes in a lower power PCIe variant, a much lower power PCIe variant, which is great for things like inference. 

So in these kinds of use cases where power is very limited, we can do architectures [see slide above] on the right where in that same power budget we can squeeze 30 GPUs instead of eight GPUs. We can put two compute nodes instead of one compute node. And I’ll argue that the amount of tokens, the amount of capability is far greater for things like inference on the right hand side [of the slide above] than putting one large big monolithic block out there. And so think of telcos, think of CDN, think of edge providers who are looking to deploy GPUs into these power-sensitive environments. That’s an area, as an example, where we’re finding success.

B&F: The edge, to my mind, spans a very large range of applications, from small shops’ point-of-sale systems right up to small data centres. I think you’re talking more about the small data centre area.

Sumit Puri: Correct. We are not talking the deep, deep edge where we’re talking about the refrigerator. No one cares about that, right? We absolutely care about the close edge, which is yes, I’m a small data centre servicing a bunch of regional edge points. We have had discussions with retail customers around do we put some of this infrastructure, for example, inside of something like a Walmart where there’s tons of camera feeds and the number of cameras varies per store. So having some kind of flexible architecture here could be a benefit. It is probably not going to go in that direction. I think it’s the small regional data centre where you’re limited to 10 – 20 kilowatt racks as the common thing servicing a bunch of edge end points. That’s where we see ourselves landing.

B&F: You are going to have AI applications running at the edge that previously or currently are simply not possible?

Sumit Puri: The market itself is bifurcating here. There is going to be two sets of customers. There is going to be customers who do training, and if you are OpenAI and you are Meta and you are a handful of government agencies, yeah, go buy the SuperPod architecture, spend $50, $60 million, build dedicated data centres. That’s not the problem we’re trying to solve. We’re solving the problem for the rest of the world; the commercials, the enterprises. Higher education is probably our single most successful vertical. We find a tremendous amount of opportunity in higher ed. Think about higher ed and research for a second. Different students bringing different workloads to the supercomputer. Different workloads need different T-shirt sizes. We integrate with things like Slurm where we can match the T-shirt size to the workload. And so we are well on our way to having dozens of universities deploy our solution.

B&F: It is clear to me that you’ve got a stack of funding and you’ve been working for a long time because developing hardware and software like this is not cheap and it’s not quick either. So I should start thinking of you as a mature startup with relatively mature technology and serious customers. We’re not talking about some back of a garage operation here hanging on by its shirt sleeves.

Dave Larson: You are correct. You should think of us this way; we’ve been around for a little while. We are expert. We are the most expert place for PCIe composability architecture, and we have more than a hundred patents that have been granted to us in this area throughout our work, and they’re foundational to PCIe, composability and CXL.

B&F: I’m assuming that when CXL 3 comes along and PCIe 6 comes along, you’ll be leading the way there too.

Sumit Puri: That is absolutely the plan. A hundred percent.

B&F: It’s my impression that HPE was early into the composability area, but to be blunt, I think it’s failed.

Dave Larson: I feel like you were looking at my LinkedIn. I was at HPE for seven years. Most recently, four or five years ago, I was in the CTO office reporting to Mark Potter and I was the chief technologist for data centre networking, data security and cloud architecture, and I worked very closely with the composability team. A lot of my friends are still there, with Synergy and Virtual Connect and all the things that they built about composability. They had a very nice ring to it. Unfortunately, what you found is their largest customers that deployed Synergy deployed it statically in monolithic architectures. They never took advantage of the composability. They just paid the premium for all the flexibility and didn’t use it. So there’s a bit of maybe an internal backlash over there because they were the loudest voice for composability.

What we have here is actual composability, right? And I say that and I tell my friends who are in HPE, the same thing, and they struggle to believe the story because they’ve lived it unsuccessfully. But as we continue to show our capabilities, particularly on CXL, we’re quite certain that we’re going to get their attention back, because CXL composability is something that is central to what they do. Their Superdome servers have always had that kind of capability at a very, very, very premium price. We deliver that for industry standard servers at a much, much lower price point for people to get into that business.

Bootnote

The interview mentioned Fujitsu as a Liqid partner; Liqid is also partnering with Dell, Cisco, Samsung, and Supermicro.

Sandisk and SK hynix working to standardize High Bandwidth Flash

Sandisk and High Bandwidth Memory (HBM) market leader SK hynix are working together to standardize High Bandwidth Flash (HBF).

HBF technology is aimed at giving GPUs fast access to large amounts of NAND capacity to augment the relatively limited capacity of their HBM, accelerating AI training and inference workloads by avoiding time-consuming data accesses to PCIe-linked SSDs. Like HBM, the HBF chip is formed from stacked layers, of NAND in this case, with TSV connectors linking each layer to a base die that connects across an interposer to the GPU. HBF access could be orders of magnitude faster than access to an SSD.

B&F diagram showing HBM, HBF, and GPU interposer-based connectivity

The involvement of SK hynix is significant, as it developed and supplies HBM and has deep expertise in the HBM-interposer-GPU area, and thus how HBF could be implemented alongside it. The two companies have signed a Memorandum of Understanding (MoU) and aim to standardize the specification, define technology requirements, and explore the creation of a technology ecosystem for HBF.

Alper Ilkbahar

Alper Ilkbahar, Sandisk EVP and CTO, and HBF Technical Advisory Board member, stated: “By collaborating with SK hynix to define the High Bandwidth Flash specification, we are addressing the critical need for scalable memory in the AI industry. This collaboration accelerates innovation and will offer the industry new tools to handle the exponential data demands of tomorrow’s applications. Our work will help provide an effective solution to meet the world’s technology needs and exceeds expectations of our respective customers.” 

Dr Hyun Ahn, president and chief development officer (CDO) at SK hynix, added: “Through our work with Sandisk to standardize the High Bandwidth Flash specification, we are actively contributing to the commercialization of this innovative technology, which we believe is key to unlocking the full potential of AI and next-generation data workloads.” 

The HBM3E generation provides up to 1.2 TBps bandwidth and 48 GB of capacity. An SK hynix PCIe Gen 5 PCB01 SSD can provide up to 14 GBps sequential read bandwidth, 86 times slower than HBM3E’s data rate. 

Dr Hyun Ahn

HBF is targeted to offer comparable bandwidth to HBM while delivering up to 8-16x the capacity of HBM at a similar cost. That would be up to 768 GB.
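Put side by side, the quoted figures work out as follows; a simple check of the numbers above:

```python
# Comparing the quoted HBM3E, SSD, and targeted HBF figures.
hbm3e_bandwidth_gbps = 1_200   # 1.2 TBps per HBM3E stack, as quoted
hbm3e_capacity_gb = 48

pcb01_read_gbps = 14           # SK hynix PCB01 PCIe Gen 5 sequential read bandwidth
print(round(hbm3e_bandwidth_gbps / pcb01_read_gbps))   # ~86x slower than HBM3E

hbf_capacity_gb = hbm3e_capacity_gb * 16               # top end of the 8-16x capacity target
print(hbf_capacity_gb)                                 # 768 GB
```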

The MoU implies that SK hynix could produce and supply its own HBF. This suggests that Sandisk appreciates the need to have a multi-supplier HBF market, to reassure customers that: first, they won’t be locked into a single supplier; second, that competition will accelerate HBF development; and third, that supply will be maintained even if one vendor fails.

Get Nvidia on board with HBF technology acceptance and standardization, and it’s a done deal.

Sandisk’s HBF technology was awarded “Best of Show, Most Innovative Technology” at FMS: the Future of Memory and Storage 2025. 

Sandisk intends to deliver first samples of its HBF memory in the second half of calendar 2026 and expects samples of the first AI inference devices with HBF to be available in early 2027. These could even be handheld devices as well as notebooks, desktops, and servers.

Veritas remnant Arctera bought by Cloud Software Group

The Carlyle Group-owned Arctera is being acquired by the Cloud Software Group.

Arctera is the collection of Veritas assets left behind by Cohesity when it acquired Veritas at the end of last year. These assets were put into new business Arctera, run by CEO Lawrence Wong, with three business units centered on the existing Backup Exec (protection), Data Compliance, and InfoScale (resilience) products. 

Tom Krause

The Cloud Software Group (CSG) is a collection of  enterprise software businesses including Citrix, TIBCO, IBI, Spotfire and Jaspersoft. CSG is led by CEO Tom Krause, an ex-Broadcom Software Group President and, before that, CFO. He left Broadcom in September 2022 when Vista Equity Partners and Evergreen Coast Capital Corp., an affiliate of activist investor Elliott Investment Management, acquired Citrix Systems to combine it with TIBCO Software, and formed CSG as the umbrella holding company. The whole transaction was valued at $16.5 billion. 

Four months later, in January 2023, CSG laid off 15 percent of the CSG workforce. It also arranged that Citrix and TIBCO, as well as many of the brands that existed within them – such as NetScaler, ShareFile, XenServer, Jaspersoft and others – would operate as separate business units, retaining their own brands and delivering their core products.

Krause then said: “We have established a foundation from which we can build a tremendous amount of value for our customers, giving us the cash flow to invest in our products, technology, and the future acquisitions that will put Cloud Software Group on a path to becoming one of the largest software companies in the world.” 

And now Arctera gets added to the CSG portfolio, with the transaction terms undisclosed. It will operate as another standalone business unit within the Cloud Software Group portfolio. Arctera has around 30,000 global customers, more than 1,500 employees, and generates $400 million-plus of annual revenue. The potential acquisition price could vary between 3x and 10x that revenue, depending upon its growth rate and prospects. A 5:1 price/sales ratio between Arctera’s valuation and its annual revenue would suggest it’s worth $2 billion. Arctera’s reported >40 percent adjusted EBITDA margins and its strategic fit within Cloud Software Group’s portfolio could justify a higher multiple, perhaps 8x.

Krause stated: “This acquisition marks an exciting milestone for Cloud Software Group and an important step forward in our long-term growth strategy. We have a strong track record of continuing to invest in our portfolio of leading software brands, as evidenced by our work to improve profitability and long-term sustainability of the Citrix and TIBCO businesses. Looking ahead, we aim to acquire at-scale enterprise-focused software businesses that provide proven mission-critical capabilities and high value to public and private sector customers while also ensuring they’re a good fit to operate within our proven Cloud Software Group model.”

He added: “We are confident that Arctera’s portfolio of Data Resilience, Data Compliance and Data Protection products will be a natural complementary addition to Cloud Software Group. We look forward to expanding the services and products we provide for both sets of customers through this acquisition.”

Lawrence Wong.

Wong said: “Our mission has been to build on our pedigree of innovation across our three distinct product lines, each tailored to solve the most demanding data challenges for their customers. With Cloud Software Group’s global resources, industry leading portfolio and seasoned leadership team, Arctera will be even better positioned to continue helping customers manage one of their most valuable assets: data.”

The transaction, funded by cash on CSG’s balance sheet, is expected to close in the fourth quarter of 2025, subject to certain regulatory approvals and customary closing conditions.

Comment

Arctera has been actively developing its products and go-to-market channels with announcements including:

  • August 2025: Updated Arctera Insight Platform to reduce the compliance risks associated with AI deployments, enabling customers to capture the input and output of Large Language Models (LLMs), chronicle AI data, integrate it with wider employee data, unlock corporate insight, and prevent the upload of sensitive data to public AI tools.
  • June 2025: Arctera, Wasabi Technologies, and distie TD SYNNEX launched a joint, channel-exclusive data protection offering, combining Arctera Backup Exec with Wasabi Hot Cloud Storage, available through a single SKU via TD SYNNEX, providing turnkey integration with end-to-end protection across physical, virtual, cloud, and SaaS environments.
  • May 2025: Arctera collaborated with Red Hat and its InfoScale cyber-resilience product was certified on Red Hat OpenShift Virtualization.
  • May 2025: Arctera updated its InfoScale cyber resilience product, adding real-time, application-aware resilience, cyber-ready operational defense, and proactive recovery
  • February 2025: Arctera unveiled a new brand for its Data Compliance platform – “Arctera Insight Platform” – and announced capabilities to improve the way organizations manage compliance, reduce risks, and prepare for audits.
  • January 2025: Arctera positioned as a Leader in the inaugural Gartner Magic Quadrant for Digital Communications Governance and Archiving Solutions (DCGA).

We expect to see substantial AI-related developments by Arctera, both to improve its products’ operations and to feed data to AI models and agents, as well as CSG encouraging cross-selling between its businesses.

Sandisk unveils 256 TB SSD for AI workloads, shipping in 2026

Sandisk has announced a 256 TB SSD at FMS 2025 that will ship in the first half of 2026.

The Sandisk UltraQLC SN670 drive comes in 256 TB and 128 TB capacities, and uses BiCS 218-layer 3D NAND with its CBA (CMOS directly Bonded to Array) 2 Tb die design, and a PCIe gen5 NVMe interface. It is intended for use in AI-driven, data-intensive workloads like data ingest, preparation, and AI data lakes. It uses the same NAND as fab partner Kioxia and, we’re told, beats its 245.76 TB LC9 product‘s capacity by more than 10 TB.

Khurram Ismail

Khurram Ismail, Sandisk chief product officer, stated: “As we move into the next phase of the AI era, flash storage is becoming a critical enabler of intelligent, high-performance workloads.” 

Sandisk says the UltraQLC drive has a customized controller and advanced optimizations, but not a fast single-level cell (SLC) cache. There is a Direct Write QLC feature with power-loss safe writes on the first pass. This typically means slower performance than equivalent drives using an SLC cache, unless the controller does something clever to speed things up.

It has Dynamic Frequency Scaling with up to 10 percent better projected performance for a given power level. A Data Retention (DR) profile forecast reduces DR recycles by up to 33 percent, “improving drive reliability, resilience and continuous access to data while decreasing power consumption.”

Mike James, senior director of Enterprise SSD Systems Architecture at Sandisk, said in a blog: “When you’re getting into higher and higher capacities of SSDs, interesting challenges emerge. One example is the challenge of data recycling on NAND at this scale. You can’t overwrite 128 terabytes every few days – it’s not effective or efficient. So, we’re constantly finding new ways to truncate how much needs to be recycled and inventing novel methods to reduce the impact of recycling in the background.”

Sandisk says the SN670 has “lower latency, higher bandwidth, and greater reliability, and delivers the performance needed for today’s most demanding AI workloads.” But there are no performance numbers available, such as random read/write IOPS and sequential read/write bandwidth, nor endurance stats, to back this up.

The SN670 will be available in a U.2 (2.5-inch) form factor in the first half of 2026 with additional form factors available later in the year. Sandisk is preparing to put the first 128TB drives into customers’ hands for testing in a few weeks. The blog referenced above says the UltraQLC roadmap extends out to a 1 PB SSD.

Comment

If the SN670 is deliverable by April 2026, that means it has eight months of development ahead of it, and that could be why there is no performance and endurance data – the drive and its controller are simply not yet ready for final testing.

For clues about performance, we can look at earlier Sandisk information. Western Digital, in pre-Sandisk spinoff days, mentioned a possible 128 TB version of its Ultrastar DC SN655 SSD at FMS 2024. This would have been a QLC 218-layer PCIe Gen 4 drive, unlike the 3.84 TB to 61.44 TB SN655s, which used 112-layer TLC flash on the same PCIe Gen 4 interface.

A few months later, Sandisk presented the UltraQLC SN670 at an Investor Day in February, saying it was a PCIe Gen 5 drive. The SN670 was compared to a “leading Gen 5 128 TB QLC SSD,” and had over 68 percent faster random read speed, more than 55 percent faster random write speed, over 7 percent better sequential read speed, and more than 27 percent better sequential write speed. It was then scheduled to ship in the third quarter of 2025 with 128 TB and 64 TB capacities – 122.88 TB and 61.44 TB usable. That date has slipped by two or more quarters and the maximum capacity has since doubled, which would mean more testing. (Detailed performance data is not present in Sandisk’s SN670 announcement, though we did ask for it.)

Three months later, in May, Sandisk CEO David Goeckeler talked about new 2 Tb QLC NAND chips to be used in coming 128 TB and 256 TB drives with PCIe Gen 5 and 6 interconnectivity and a new Stargate SSD controller.

On this basis, the PCIe Gen5 SN670 will have much higher performance than the PCIe Gen 4-using SN655, and also the Solidigm and Phison high-capacity PCIe Gen 5 SSDs.

Backblaze drive stats show 20 TB-class drives doing very well

Cloud storage provider Backblaze found that its disk drive failure rate was lower in the second 2025 quarter as previously high-failure-rate drives went kaput less often.

Its Drive Stats Team (Stephanie Doyle and Pat Patterson) said that the 12TB Seagate model ST12000NM0007 “was at a whopping 9.47% failure rate last quarter—down this quarter to only 3.58%.” It was failing at a 9.72 percent rate in the last 2024 quarter. “Other significant drops include the 12TB HGST model HUH721212ALN604 (Q1: 4.97%; Q2: 3.39%) and the 14TB Seagate model ST14000NM0138 (Q1: 6.82%, Q2: 4.37%).”

Stephanie Doyle (left) and Pat Patterson (right).

Two Seagate drives, the ST8000NM000A (8TB) and the ST16000NM002J (16TB), had zero failures in the quarter; it was the third consecutive zero-failure quarter for the 8TB drive.

The lifetime hard drive failure rates are little changed from last quarter, when the overall rate was 1.31 percent.
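For context, Backblaze computes these failure rates from drive-days of service rather than from simple drive counts; a minimal sketch of that calculation (the example inputs are illustrative, not figures from the Q2 report):

```python
# Backblaze's published AFR formula: failures are annualized against
# drive-days of service, not a simple count of drives.
def annualized_failure_rate(drive_days: int, failures: int) -> float:
    """Return the annualized failure rate as a percentage."""
    drive_years = drive_days / 365
    return failures / drive_years * 100

# Illustrative inputs only (not taken from the Q2 2025 report):
print(round(annualized_failure_rate(drive_days=1_000_000, failures=36), 2))  # ~1.31%
```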

The smaller drives in Backblaze’s disk drive estate are getting older as newer and larger capacity drives are acquired. It has 10 HDD products with 12 TB or lower capacity, and their average failure rate is 1.42 percent. It notes “that AFR is due to some well-performing low-failure outliers, including both of the 4TB Seagate models (0.57 percent and 0.40 percent), the 12TB HGST model HUH721212ALE600 (0.56 percent), and the 12TB Seagate model ST12000NM001G (0.99 percent).”

At the other end of the capacity scale, it has three drives of 20TB or greater capacity: the 20TB Toshiba model MG10ACA20TE, the 22TB WDC model WUH722222ALE6L4, and the 24TB Seagate model ST24000NM002H. Their failure rates look like this:

The company points out that, “in both drive count and length of time in the pool, it’s a little early to create definitive trends for the Seagate and the Toshiba. … the Seagate in particular is potentially at the beginning of the ‘bathtub’ curve, we may see it change over time.

“On the other hand, the 22TB WDC model has shown up quite a bit below our current average AFR for the drive pool of all drive sizes and ages, and it’s the model with the most data.”

It then compared the overall 20 TB+ drive failure rates over time with those of its 14-16TB drives over the same elapsed time:

There was a substantial difference, with Backblaze saying: “Comparing age of drive to age of drive, it would seem that our 20TB are right on target, and perhaps doing a bit better than expected. But, that definitely isn’t a perfect comparison given that the 14–16TB drives have a steadier and larger drive count.”

The conclusion: “It’s clear from the data that we need to give the 20TB+ drives time to mature, and that as we (depending on our buying behavior, of course) add more drives, we might see some interesting changes in the data.” 

You can point your browser at Backblaze’s Drive Stats page to see the full failure rate numbers.

Comment

Considering the complexity, mechanical finesse, tiny, tiny tolerances and 24×7 activity rates seen in hard disk drives, it is amazing that the failure rates are so low. They are, perhaps, one of the finest, high-precision, electro-mechanical devices ever mass-produced.

HPE claims ‘world’s fastest’ backup storage with Alletra X10000 upgrade

HPE has announced high-speed backup ingest for its Alletra Storage MP X10000 array along with Zerto disaster recovery integration features at Black Hat USA 2025.

The Alletra Storage MP X10000 is an all-flash object storage system with a VAST Data-like disaggregated shared everything (DASE) architecture formed from ProLiant server controller nodes and all-flash data nodes. According to HPE, it is now the world’s fastest enterprise backup storage. Zerto is a disaster recovery solution with an immutable Cyber Resilience Vault integrated with the Alletra Storage MP array.

David Hughes, SVP and GM, SASE and Security, at HPE Networking, stated: “HPE’s security solutions are designed to enable a multi-layered, zero-trust approach to protection, including AI-driven capabilities for continuous validation, anomaly detection, and automated remediation.” 

HPE has extended its Alletra Storage MP X10000 array into the data protection area, with “ultra-low recovery point objectives (RPOs) and recovery time objectives (RTOs)” for customer activities in healthcare, finance, media, and IT services, where “recovering data in minutes, not days, is critical to compliance, privacy, and business continuity.” It can ingest data at up to 1.2 PB an hour, “nearly twice as fast as its closest competitors.”

Back in 2023, an Infinidat InfiniGuard array could ingest backup data at up to 180 TB per hour. Pure’s FlashBlade has been shown to deliver up to 1.5 TBps of throughput with 75 blades, theoretically implying up to 5.4 PB an hour, although that might not be achievable in real-world conditions.
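Taking both vendors’ headline numbers at face value, the conversion is straightforward:

```python
# Converting quoted throughput figures into hourly ingest rates (vendor claims taken at face value).
SECONDS_PER_HOUR = 3_600

x10000_pb_per_hour = 1.2                   # HPE's claim for the Alletra Storage MP X10000
flashblade_tb_per_sec = 1.5                # Pure FlashBlade with 75 blades
flashblade_pb_per_hour = flashblade_tb_per_sec * SECONDS_PER_HOUR / 1_000   # 5.4 PB/hour, theoretical

print(round(flashblade_pb_per_hour / x10000_pb_per_hour, 1))   # ~4.5x HPE's figure, on paper
```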

The Alletra system works in conjunction with HPE’s StoreOnce Catalyst Software, offering up to 60:1 data reduction, which integrates with backup software from Commvault Cloud, Cohesity NetBackup, and, in an upcoming release, Veeam Data Platform.

HPE has developed a cybersecurity Center of Excellence focused on detecting and validating storage systems against ransomware and other cyber threats. It has validated Alletra Storage MP B10000’s built-in ransomware and threat detection against the top 25 ransomware strains. 

HPE’s Zerto software will offer a new integration hub to enable third-party apps to access Zerto data, with CrowdStrike’s Falcon platform being the first. This will provide threat detection in real time plus cyber recovery, helping customers to reduce the impact of ransomware and other malware. When a threat is identified, the Falcon software enables Zerto to restore applications and virtual machines to clean recovery points from just seconds before the attack.

The new integration also accelerates recovery within the HPE Cyber Resilience Vault, which combines Zerto, Alletra Storage MP, as described above, ProLiant servers, and Aruba networking to deliver an air-gapped, zero-trust environment for ransomware protection and recovery.

Daniel Bernard, chief business officer at CrowdStrike, said: “By combining the power of the AI-native Falcon platform with HPE Zerto’s recovery capabilities, we’re enabling organizations to stop breaches, minimize downtime, and maintain business continuity – even in the face of ransomware and other disruptive threats.”

HPE is validating and certifying Zerto for the latest guidelines and requirements from United States governmental agencies, specifically CISA (Cybersecurity and Infrastructure Security Agency), STIG (Security Technical Implementation Guides), and FIPS (Federal Information Processing Standards).

The Alletra Storage MP X10000 data protection system is available today and you can read more in a blog. The Zerto software integration hub will be available in the fourth quarter of 2025. More information is here. A blog for the HPE Cybersecurity Center of Excellence is here. See the HPE cybersecurity developments at booth #2261, Black Hat USA, August 6-7.

Why HCI shouldn’t replace AFAs

Don’t replicate the inefficiencies of three-tier storage architectures in software when you can build a simpler, unified system

There’s a reason hyperconverged infrastructure (HCI) is failing to displace all-flash arrays (AFAs) in the enterprise storage market. AFAs excel because they dedicate processors exclusively to storage tasks.

Traditional three-tier architectures allocate compute by function, using storage processors for data operations, host processors for virtual machines, and network appliances for routing and firewalls. This separation delivers predictable performance. Each tier runs independently without competing for resources.

The architecture works, but at a price. HCI was supposed to address the cost issue while offering hardware flexibility. This article examines the architectural reasons why HCI fails to deliver on that promise and what a viable alternative might look like.

The VMware catalyst

The search for VMware alternatives is forcing organizations to reconsider their entire infrastructure stack. Broadcom’s acquisition altered licensing models and support structures, and many of the VMware alternatives on offer are hyperconverged systems positioning themselves as complete replacements. Gartner’s Market Guide for Full-Stack Hyperconverged Infrastructure Software predicts that the non-VMware full-stack HCI installed base will grow from 30% to 60% by 2029, following Broadcom’s acquisition of VMware. That prediction hinges on HCI offering similar performance and data resilience at a lower cost.

As IT teams reassess HCI as a VMware alternative, they face a fundamental question: can an HCI architecture replace the performance and reliability of a dedicated AFA while reducing the cost of infrastructure as originally promised? The extent of Broadcom’s post-acquisition changes makes this an urgent question. Organizations need answers now, not after lengthy proof-of-concept deployments.

The promise of hyperconvergence

Conceptually, HCI appears sound: virtualize storage, compute, and networking on standard x86 servers. Most AFA vendors are, after all, software companies whose products run on pre-built, industry-standard hardware, so HCI should enable organizations to deploy faster and scale seamlessly while reducing vendor dependencies. For organizations seeking to escape VMware licensing costs, HCI platforms offer an attractive bundle: hypervisor, storage, and management all in one package.

The question remains: do these platforms deliver on their promise of hardware flexibility, and can they replace AFAs in terms of performance and data resilience?

The implementation problem

HCI vendors prioritize speed to market over optimizing customer infrastructure. They build development teams around separate components: storage developers, networking developers, hypervisor developers, and management developers. Each team focuses on its domain. Integration happens through management interfaces.

This modular approach creates communication overhead. Storage modules talk to compute modules. Networking modules coordinate with both. Management layers translate between components. Data moves through multiple software layers within each node, then crosses the network to reach other nodes.

HCI simplifies deployments and operations, but the simplicity users see hides a tremendous amount of complexity. It shifts the complexity burden from hardware integration to software coordination, which increases overhead, demands more powerful hardware, and forces performance compromises.

For organizations replacing VMware environments, this hidden complexity becomes apparent when storage-intensive workloads that perform well on AFAs struggle on modular HCI platforms.

Throwing hardware at software problems

HCI vendors use the traditional response to performance bottlenecks: more processors, more memory, and more nodes. If storage virtualization consumes excessive CPU cycles, add more or faster CPUs. If inter-node communication creates latency, add more network bandwidth.

This approach consolidates three tiers into one, but replaces hardware silos with software silos. Multiple infrastructure software layers compete for the same processing resources. Storage deduplication runs alongside hypervisor operations. Network virtualization processes packets while storage handles I/O requests.

The result? HCI solutions require far more processing power than dedicated systems. Hardware compatibility lists shrink as vendors test specific configurations that can handle the processing load. Organizations discover that HCI infrastructure costs match those of three-tier systems while delivering inferior performance.

This reality challenges the VMware replacement narrative. Organizations may escape VMware licensing only to find that their storage performance is degraded, the infrastructure is less resilient, and their hardware costs remain unchanged.

The single codebase solution

The answer lies in integration, not consolidation. Instead of loosely coupling separate software packages, unified architectures merge all functions into a single codebase. Storage, compute, networking, and management operate as a single system rather than as coordinated components. Rather than hyperconverged infrastructure, the result is ultra-converged infrastructure (UCI).

This architectural change delivers measurable improvements. Unified platforms, such as VergeOS, provide five times better performance and three times better deduplication ratios with no performance impact compared to modular HCI implementations, making them ideal alternatives to AFAs.

Single-codebase architectures eliminate inter-module communication overhead. Data flows through a single piece of software, with no layers or stacks. Processing resources focus on workloads rather than software coordination.

For organizations evaluating VMware alternatives, UCI platforms can surpass AFA performance while offering the operational simplicity that three-tier architectures lack. To learn more about replacing your AFA with UCI, VergeIO is sponsoring a webinar: How to Replace Your AFA.

Hardware freedom realized

Hardware requirements drop when software operates efficiently. Organizations can deploy almost any server hardware instead of consulting restrictive compatibility lists. Mixed hardware deployments become feasible when the software adapts to available resources rather than demanding specific configurations.

This flexibility matters for VMware migrations. Organizations can reuse existing server hardware instead of replacing entire infrastructure stacks. The economics improve when software efficiency reduces hardware requirements rather than increasing them.

Operational benefits

Unified codebases reduce learning curves for IT teams. Administrators master one system instead of multiple integrated components. Support becomes simpler when fewer lines of code handle more functions. Fewer software layers mean fewer potential failure points and easier troubleshooting.

HCI’s advantages over traditional infrastructure have reportedly benefited 94% of organizations globally. However, organizations need to distinguish between modular HCI implementations and unified architectures.

For teams migrating from VMware, the difference becomes critical. Modular HCI platforms may replace VMware’s hypervisor, but they introduce new complexity in the storage and networking layers. Unified platforms can simplify the entire stack.

The real question

The question isn’t whether HCI can replace AFAs. The question is whether organizations will choose modular implementations that recreate three-tier complexity in software, or unified architectures that deliver the simplicity and performance that hyperconvergence originally promised.

VMware’s disruption accelerates this decision. Organizations that recognize the difference between software consolidation and software integration will build more flexible, cost-effective infrastructures that perform better than what they replace.

The market is moving toward true unification. The VMware transition presents an ideal opportunity to implement architectures that address infrastructure challenges rather than merely shifting them from hardware to software layers.

Contributed by VergeIO.

Arrays speed up for MLPerf Storage Benchmark v2.0

The second edition of the MLPerf Storage benchmark shows tested systems serving roughly twice as many accelerators as in the earlier v1.0 benchmark round.

MLCommons said this round of the benchmark saw dramatically increased participation, more geographic representation from submitting organizations, and greater diversity of the systems submitted for testing. The benchmark tests how storage systems perform on the Unet3D, Cosmoflow, and Resnet50 AI training tasks along with checkpoint performance on Llama 3 training runs.

Curtis Anderson.

Curtis Anderson, MLPerf Storage working group co-chair and Hammerspace field CTO, stated: “At the scale of computation being implemented for training large AI models, regular component failures are simply a fact of life. Checkpointing is now a standard practice in these systems to mitigate failures, and we are proud to be providing critical benchmark data on storage systems to allow stakeholders to optimize their training performance.”

As AI training clusters grow to include more GPUs, the chance of a GPU failure increases, forcing a training run to be halted and restarted from the last checkpoint, the point at which intermediate results were last written to storage. MLCommons says that if the mean time to failure for an accelerator is 50,000 hours, then a 100,000-accelerator cluster running for extended periods at full utilization will likely experience a failure every half-hour. A cluster with one million accelerators would expect to see a failure every three minutes. The faster checkpoints can be written and read, the quicker a failed job can be restarted and the shorter the overall job run time.
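
The failure-interval arithmetic MLCommons describes is straightforward; the sketch below reproduces it, assuming failures are independent so the cluster-wide interval is simply the per-accelerator MTTF divided by the cluster size:

```python
# Expected time between accelerator failures across a cluster, assuming
# independent failures: cluster interval ~= per-accelerator MTTF / cluster size.

ACCELERATOR_MTTF_HOURS = 50_000

def failure_interval_minutes(num_accelerators: int) -> float:
    """Expected minutes between failures somewhere in the cluster."""
    return ACCELERATOR_MTTF_HOURS / num_accelerators * 60

print(failure_interval_minutes(100_000))    # 30.0 minutes -> one failure every half-hour
print(failure_interval_minutes(1_000_000))  # 3.0 minutes  -> one failure every three minutes
```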

MLPerf Storage v2.0 includes more than 200 performance results from 26 submitting organizations: Alluxio, Argonne National Lab, DDN, ExponTech, FarmGPU, H3C, Hammerspace, HPE, JNIST/Huawei, Juicedata, Kingston, Kioxia, Lightbits Labs, MangoBoost, Micron, Nutanix, Oracle, Quanta Computer, Samsung, Sandisk, Simplyblock, TTA, UBIX, IBM, Western Digital, and YanRong.

David Kanter.

Western Digital, which makes disk drives, also supplies its OpenFlex Data24 2RU EBOF (Ethernet Box of Flash) and tested 24-drive and 48-drive versions of that product, fitted with KIOXIA CM7-V Series NVMe SSDs, in collaboration with high-performance storage software provider PEAK:AIO.

David Kanter, head of MLPerf at MLCommons, said: “This level of participation is a game-changer for benchmarking. It enables us to openly publish more accurate and more representative data on real-world systems. That, in turn, gives the stakeholders on the front lines the information and tools they need to succeed at their jobs. The checkpoint benchmark results are an excellent case in point: now that we can measure checkpoint performance, we can think about optimizing it.”

The v2.0 submissions included a more diverse set of technical approaches to delivering high-performance storage for AI training than v1.0, including:

  • 6 local storage solutions
  • 2 solutions using in-storage accelerators
  • 13 software-defined solutions
  • 12 block systems
  • 16 on-prem shared storage solutions
  • 2 object stores

Oana Balmau.

Oana Balmau, MLPerf Storage working group co-chair and an Assistant Professor at McGill University, said: “Everything is scaling up: models, parameters, training datasets, clusters, and accelerators. It’s no surprise to see that storage system providers are innovating to support ever larger-scale systems.”

DDN issued a statement about its results, saying it “has set a new industry benchmark with its AI400X3 storage appliance” that “delivered record-breaking throughput and unmatched performance density, saturating hundreds of Nvidia H100 GPUs from a compact, energy-efficient 2RU system.” That’s performance density, not absolute numbers such as total accelerators (GPUs) and throughput in GiB/sec.

DDN says that in single-node benchmarking, the DDN AI400X3 achieved:

  • The highest performance density on Cosmoflow and Resnet50 training, serving 52 and 208 simulated H100 GPUs with only a 2RU 2400 W appliance
  • IO performance of 30.6 GBps reads and 15.3 GBps writes, resulting in Llama3-8b checkpoint load and save times of only 3.4 and 5.7 seconds respectively (see the quick sketch below)
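
Taking the quoted throughputs and times at face value, and assuming each time is simply data volume divided by throughput, the implied checkpoint data volumes are roughly:

```python
# Implied data volumes behind DDN's quoted Llama3-8b checkpoint figures,
# assuming time = data volume / quoted throughput (a simplification).

read_gbps, write_gbps = 30.6, 15.3   # GB/s, as quoted by DDN
load_s, save_s = 3.4, 5.7            # seconds, as quoted by DDN

print(f"~{read_gbps * load_s:.0f} GB read on checkpoint load")     # ~104 GB
print(f"~{write_gbps * save_s:.0f} GB written on checkpoint save") # ~87 GB
```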

In multi-node benchmarking, it achieved:

  • 120.68 GBps sustained read throughput and 45 simulated accelerators for Unet3D H100 training, up from 99.02 GBps and 36 accelerators in the v1.0 benchmark
  • Support for up to 640 simulated H100 GPUs on ResNet50 
  • Up to 135 simulated H100 GPUs on Cosmoflow with the new AI400X3, a 2x improvement over last year’s results.

DDN claimed that since 2016, Nvidia has relied exclusively on DDN to power its internal AI clusters.

Western Digital said that in the Unet3D workload, its OpenFlex Data24 achieved sustained read throughput of 106.5 GBps (99.2 GiB/s), saturating 36 simulated H100 GPUs across three physical client nodes. With the PEAK:AIO AI Data Server, OpenFlex Data24 was able to deliver 64.9 GBps (59.6 GiB/s), saturating 22 simulated H100 GPUs from a single head server and single client node. Kurt Chan, VP and GM, Western Digital Platforms Business, said: “The OpenFlex Data24 4000 Series NVMe-oF Storage Platform delivers near-saturation performance across demanding AI benchmarks, both standalone and with a single PEAK:AIO AI Data Server appliance, translating to faster time-to-results and reduced infrastructure sprawl.”
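
Western Digital quotes the same throughput in both decimal (GB/s) and binary (GiB/s) units; a quick conversion, sketched below, shows how the two figures relate:

```python
# Decimal-to-binary unit check for the quoted throughput figure.

def gb_to_gib(gb_per_s: float) -> float:
    """Convert decimal gigabytes/s (10**9 bytes) to gibibytes/s (2**30 bytes)."""
    return gb_per_s * 1e9 / 2**30

print(f"{gb_to_gib(106.5):.1f} GiB/s")   # ~99.2 GiB/s, matching the quoted figure
```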

Full MLPerf Storage v2.0 benchmark results are available here.

MLCommons invites stakeholders to join the MLPerf Storage working group and help it continue to evolve the benchmark suite. A deeper understanding of the issues around storage systems and checkpointing, and the design of the checkpointing benchmarks, can be found in a post from Wes Vaske, an MLPerf Storage working group member.