Ephemeral cloud, decentralized solutions, and the future of block storage

Chris Evans is the consultant and analyst behind Architecting IT. He has a reputation for shrewd and insightful analysis of storage technologies and suppliers. We thought we’d take advantage of that and ask him for his views on some mainstream storage suppliers and new technologies such as ephemeral cloud and decentralized storage.

Blocks & Files: How can existing block storage array vendors who have ported their storage OS to the public cloud – for example, NetApp with its ONTAP-based AWS, Azure and Google offerings and Pure with Cloud Block Store – compete with the ephemeral cloud storage quartet: Dell (PowerFlex), Lightbits, Silk, and Volumez?

Storage analyst Chris Evans

Chris Evans: Vendors that have ported natively to the cloud are integrated into the ecosystem of the cloud itself. This means the cloud vendor manages scaling, resiliency, availability, and upgrades. It’s a true service. Those platforms are also integrated using knowledge of how the cloud platform itself operates behind the scenes. Currently this means only NetApp and Microsoft (Windows File Server). Every other vendor is at arm’s length from the internal operation of the platform, so must design and build accordingly. Their solutions are cloud-aware, so they can scale, manage redundancy, and so on, but crucially those vendors don’t have visibility of the internal workings of the platform, so they must make assumptions about how the cloud operates, based on observation. The key next step for vendors like Pure, Volumez, and Lightbits is to build relationships with the platform vendors that result in native integration. This blog post covers the definition of cloud native in this context.

So, from a technology perspective, NetApp has an advantage. However, remember that NetApp is an OEM service, so the customer is owned by the cloud platform. That makes it harder for NetApp to upsell to those cloud customers, as it doesn’t own the customer/vendor relationship.

Blocks & Files: How can the existing on-premises unstructured data storage suppliers compete with decentralized storage startups like Storj, Cubbit, and Impossible Cloud?

Chris Evans: The decentralized storage model is an interesting one that we couldn’t have developed 20 years ago. Fast networking, cheap compute (for erasure code calculation), and the global distribution of datacenters provide the perfect storm for the technology.

The greatest challenge for the decentralized model is regulation. If data needs to be physically stored in-country, then the benefits of the decentralized model are lost. The other question to ask here is about scale. Decentralized storage is not about a few terabytes of filer data, but about long-term storage of petabytes to exabytes of capacity, where the usage profile changes over time.

On-premises vendors need to decide how much of the future storage pie they want. Small-scale (< 1PB) deployments will continue to be practical on-premises (especially for data under regulatory rules), but many enterprise businesses will consider pushing data to the decentralized storage cloud as the best financial option, especially where that data’s value is unclear (risk of loss vs cost of retaining). On-premises vendors need to partner with or acquire the likes of Backblaze or Wasabi, so they can offer both an on-prem and a cloud (and integrated) solution.

Blocks & Files: How would you compare Storj, Cubbit, and Impossible Cloud with cryptocurrency, anti-fiat currency, web3-style decentralized storage suppliers like FileCoin in terms of likely business success?

Chris Evans: I expect the decentralized storage vendors are desperately trying to distance themselves from crypto comparisons, given the recent issues in the cryptocurrency market. I would look at decentralization as the natural evolution of RAID and erasure coding. RAID protected data within a single system; erasure coding extended this across multiple hardware platforms and added geodispersal; now decentralization provides the capability to buy erasure-coded storage “as a service”. The underlying protection mechanisms (especially FileCoin’s) are based on web3 and blockchain technology concepts, but that’s where the comparison should end.
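To make that lineage concrete, here is a minimal sketch of the single-parity idea that RAID introduced and that erasure coding generalizes; decentralized platforms spread the resulting shards across geographically dispersed nodes. Real services use Reed-Solomon-style schemes that tolerate many lost shards; the shard count and helper names below are illustrative only.

```python
# Minimal single-parity sketch: split an object into data shards, add one
# XOR parity shard, and rebuild any single lost shard from the survivors.
# Decentralized platforms generalize this with wider Reed-Solomon codes
# spread across many independent nodes.

def split_into_shards(data: bytes, k: int) -> list:
    """Split data into k equal-length data shards (zero-padded)."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\x00")
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]

def xor_parity(shards: list) -> bytes:
    """Compute a single parity shard as the byte-wise XOR of all data shards."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def recover(shards: list, parity: bytes) -> list:
    """Rebuild the one missing shard by XOR-ing the parity with the survivors."""
    missing = shards.index(None)
    rebuilt = bytearray(parity)
    for idx, shard in enumerate(shards):
        if idx != missing:
            for i, b in enumerate(shard):
                rebuilt[i] ^= b
    shards[missing] = bytes(rebuilt)
    return shards

data = b"object stored across dispersed nodes"
shards = split_into_shards(data, k=4)
parity = xor_parity(shards)
shards[2] = None  # simulate losing one drive, node, or site
restored = b"".join(recover(shards, parity)).rstrip(b"\x00")
print(restored == data)  # True
```

Losing any one shard, whether a drive, a node, or a whole site, is recoverable from the survivors; wider codes simply add more parity shards to tolerate more simultaneous failures.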

There’s a two-tier model in play here. First, the storage layer: storing data reliably, at scale, and at low cost. Then there are the services that sit on top. CDN is a good example; another might be AI-as-a-service, where a vendor pulls a copy of the AI data from the decentralized cloud into a local cache. I see the real value being the diversity of services that can be offered. It was the same for object storage when it first hit the market: the concept is useful, but the higher-level application use cases add the value.

Blocks & Files: What do you think of the idea that Pure Storage and IBM have a sustainable and significant technology advantage over all-flash array suppliers that use COTS SSDs?

Chris Evans: We repeatedly see issues with storage media where scaling capacity or performance (or both) represents a technical challenge for the industry. As NAND flash has scaled up, the increased bit count per cell has introduced endurance issues and performance challenges (latency, not throughput). I believe current SSD capacity has been limited by three challenges:

  1. The per-unit cost, which could be $10,000 for a 32TB drive. Populating enough drives into a system to provide resiliency is expensive.  Additionally, a customer will track any failure much more closely, to ensure the vendor replaces any failed devices within warranty. No customer will simply discard a failed $10,000 drive. The unit cost also causes issues for the vendor and media supplier for the same reasons; the vendor will want the supplier to replace failed media under warranty. So, drives need to be more repairable, or at least capable of some degree of reuse. 
  2. DRAM overhead. Increasing the capacity of drives has been achieved through bit density and layer counts. DRAM is used to store metadata, keeping track of what data is stored where on the SSD. Typically, 1GB of DRAM is needed per 1TB of capacity, so each 64TB drive would need 64GB of DRAM. This is unsustainable in high-capacity systems from a cost and power/cooling perspective. The current vendor solution is to use larger indirection units (IUs), or bigger blocks in which to write data. This means less metadata, but it increases write amplification, making these drives suitable only for read-focused activities (the sketch after this list walks through the arithmetic). The industry answer appears to be tiering, as seen from Solidigm (see this post).
  3. Failure domains. Here’s a post I wrote six years ago discussing this issue, back when we were already discussing 32TB ruler drives. The industry has taken an extraordinary amount of time to reach the 64TB level. Part of the problem is the impact of a device failure. In any system, there needs to be at least one unit of free capacity: if you build from 32TB drives, at least one 32TB drive (or the equivalent capacity) must be kept free. With COTS SSDs, all drives act independently, so failure can’t be predicted. Therefore, systems get designed with sufficient excess capacity to cater for the MTTR (mean time to repair). With 64TB drives, that’s a lot of wasted capacity and cost. There’s also a rebuild factor: a failed drive creates a significant amount of additional backend I/O traffic to re-protect data.
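The DRAM and spare-capacity points above lend themselves to back-of-the-envelope arithmetic. The sketch below uses the 1GB-per-1TB rule of thumb and the drive capacities from the answer; the indirection unit sizes and the 24-drive pool are illustrative assumptions, not vendor figures.

```python
# Back-of-the-envelope arithmetic for the DRAM-overhead and failure-domain
# points above. The 1GB-per-1TB rule of thumb (for a 4KB indirection unit)
# and the drive capacities come from the text; IU sizes and pool size are
# illustrative assumptions.

def ftl_dram_gb(capacity_tb: float, iu_bytes: int, base_iu_bytes: int = 4096) -> float:
    """Approximate FTL mapping-table DRAM, scaling the ~1GB-per-1TB figure
    (which assumes a 4KB IU) by the chosen indirection unit size."""
    return capacity_tb * (base_iu_bytes / iu_bytes)

for capacity in (32, 64):
    for iu in (4096, 16384, 65536):  # 4KB, 16KB, 64KB indirection units
        print(f"{capacity}TB drive, {iu // 1024}KB IU: "
              f"~{ftl_dram_gb(capacity, iu):.0f}GB DRAM")

def spare_capacity(drive_tb: float, drives_per_pool: int, spares: int = 1):
    """Raw capacity that must stay free to absorb one drive failure,
    and the fraction of the pool that represents."""
    spare_tb = spares * drive_tb
    return spare_tb, spare_tb / (drives_per_pool * drive_tb)

tb, frac = spare_capacity(drive_tb=64, drives_per_pool=24)
print(f"24 x 64TB pool, one drive's worth held free: {tb}TB ({frac:.1%})")
```

A 64TB drive with a 4KB IU needs roughly 64GB of mapping-table DRAM; quadrupling the IU to 16KB cuts that to around 16GB, which is the metadata saving that comes at the cost of higher write amplification.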

Pure Storage and IBM have control over the entire FTL (flash translation layer), so they can mitigate the cost and impact of DRAM scaling, the failure domain challenge, and the media cost (by mixing SLC, TLC, and QLC on the same storage blade). Pure Storage is much more advanced than IBM in this area; IBM currently just has “better” SSDs. That position may change in the near future, as indicated by IBM Fellow and Storage CTO Andy Walls in this podcast.

Blocks & Files: Does VAST Data’s storage technology (DASE architecture + QLC flash + SCM for metadata and writes + NVMe internal fabric) amount to a disruptive innovation for, firstly, the existing scale-up/scale-out file storage vendors and, secondly, the parallel file system software vendors?

Chris Evans: VAST has built a very interesting architecture that is specifically designed to overcome the challenges of large-capacity media. Data is written in stripes that suit the sequential write nature of QLC SSDs. SCM handles the short-term I/O and the active metadata. The VAST system is essentially a massive key-value store (even the metadata is KV data), so the platform is applicable to any data storage solution. However, while the VAST Data Platform could be used for block-based storage, it is not ideal for it, as the I/O profile of block data would put greater load onto the SCM layer.
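As a rough illustration of that write path, and emphatically not VAST’s implementation, the toy sketch below shows the buffer-then-stripe pattern the answer describes: small writes land in a fast persistent buffer (SCM in VAST’s case) and the flash layer only ever sees large, sequential, full stripes. The class name, stripe size, and data structures are assumptions for the example.

```python
# Toy buffer-then-stripe writer: small incoming writes accumulate in a fast
# buffer (standing in for SCM) and are flushed to QLC flash only as large,
# sequential, full stripes. Illustrative only; not VAST's implementation.

STRIPE_BYTES = 4 * 1024 * 1024  # assumed stripe size for the sketch

class BufferedStripeWriter:
    def __init__(self):
        self.scm_buffer = []     # stand-in for the SCM write buffer
        self.buffered = 0
        self.qlc_stripes = []    # stand-in for QLC-backed stripes

    def write(self, payload: bytes) -> None:
        """Accept a small random write into the fast buffer."""
        self.scm_buffer.append(payload)
        self.buffered += len(payload)
        if self.buffered >= STRIPE_BYTES:
            self._flush_stripe()

    def _flush_stripe(self) -> None:
        """Coalesce buffered writes into one large sequential stripe,
        matching QLC's preference for big, infrequent writes."""
        self.qlc_stripes.append(b"".join(self.scm_buffer))
        self.scm_buffer.clear()
        self.buffered = 0

writer = BufferedStripeWriter()
for _ in range(1100):
    writer.write(b"x" * 4096)  # 1,100 x 4KB small writes
print(len(writer.qlc_stripes), "stripe(s) flushed to QLC")  # 1
```

The same pattern shows why block workloads, with their steady stream of small overwrites, keep the fast buffer layer far busier than large file or object writes do, which is the extra SCM load mentioned above.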

VAST is disruptive to the unstructured market because it offers “unlimited” scale in an architecture that will continue to perform well with larger-capacity media. The C-node and D-node design creates a “physical layer” on which storage applications can run: specifically NFS and S3, but now also database and other structured architectures. The legacy vendors have no answer to the VAST architecture (hence the HPE OEM agreement). They will see their business slowly chipped away from the top down (highest capacity downwards). The only saving grace is that the VAST solution is (currently) designed for petabyte scale and up.

Blocks & Files: I think you believe that HCI vendor Nutanix may eventually be acquired. What’s your reasoning for this view?

Chris Evans: I believe we’re going to see a convergence of architectures in the future. The public cloud has given IT organizations the capability to build and deploy applications without having to be too concerned about the infrastructure. The cloud vendor provides the physical plumbing, the virtual instances, the application instances (built on virtual instances), and so on. On-premises, internal customers of the IT department will increasingly demand a cloud-like experience. IT teams don’t want the effort of building out their own cloud ecosystem; they want to take one from a vendor. This is what VMware, Nutanix, OpenShift, OpenStack, and SoftIron all provide.

So, if you’re Dell, HPE, Cisco, maybe even Oracle, how will you differentiate your hardware solutions? I can envisage one of the major infrastructure vendors acquiring Nutanix, as it offers an “oven-ready” cloud solution to sell to customers, either as a service or as a platform. The Cisco tie-in was interesting, because Nutanix’s biggest overhead is sales and marketing. If the Cisco partnership can be shown to reduce expenses enough to hit profitability, then that validates that Nutanix has a viable solution, and someone will acquire it. The acquisition route is preferable because it stymies competition between the infrastructure vendors. Simply reselling Nutanix (or any of the private cloud solutions) just continues to make those infrastructure vendors appear as box shifters, and there is ever less margin in that area.

Blocks & Files: Pure Storage is pushing the idea that no new HDDs will be sold after 2028. Do you think this idea holds water and what’s the reasoning behind your view?

Chris Evans: As I discuss in this blog post, I think this message is a mix of semantics and marketing. It’s semantics around the word “new”. We can take that to mean the HDD vendors will continue to sell existing HDD models, but won’t create new ones, because the cost of development will be outweighed by the revenue return. That’s the tipping point for the industry: when the R&D costs won’t be recouped through new product sales. At that point, the vendors simply push out the same products into a dwindling market that might last another 10-20 years. The $64,000 question is whether that tipping point is 2028. Then there’s the marketing angle: create a story today that few, if any, people will check out in five years’ time.

Blocks & Files: With the rise of edge as distinct from datacenter IT, and application repatriation, would you agree that the public cloud takeover of on-premises IT has come to a halt, and why or why not?

Chris Evans: I think the term “edge” is an interesting one. We’ve had edge computing forever, just in different forms. IBM used to sell the System/38 and AS/400 to sit on the edge of the mainframe. The PC and local file servers provided edge capabilities to the core datacenter. CDNs and other delivery networks have had edge capability for at least 20 years. Over the last 70 years, IT has concertinaed from centralized to distributed and back again, so “edge” is part of the normal cycle.

The modern interpretation of edge computing started with the view that edge devices would collect, store, pre-process, and forward data into a core datacenter. The perception was that these devices would sit in hostile locations where building primary datacenters would be too expensive. Today, we see edge as an extension of the datacenter, more in a distributed computing model than a store-and-forward one.

Examining AWS’s and Azure’s offerings in this space, it’s clear both platforms were designed to extend the public cloud outwards and subsume on-premises systems. In that respect, the only problem they solved was latency. It’s the reason I described Outposts as a cuckoo in the nest.

The goals of edge computing need to be reviewed and applied to products and solutions from either on-premises or cloud vendors. Edge solutions need to cope with being deployed in a physical environment that may be suboptimal. Edge systems need to operate autonomously if network connectivity fails. Edge solutions need improved resiliency capabilities, as maintenance could be intermittent. Most of these challenges have already been met. The next wave of edge designs needs to focus on security and operations, including ensuring that data collected from the edge can be trusted, systems can be updated remotely, applications can be deployed remotely, and so on. The cloud providers aren’t offering solutions to meet this need, so their strategy has stalled.

Blocks & Files: Finally, would you argue that the generative AI hype is justified or unjustified?

Chris Evans: This is a difficult question to answer. There are clearly some significant benefits to be gained from generative AI. A human-like interface is one (with recent news that AI can hear, watch, and speak). The ability to pose questions in natural language is another.  However, in the IT world we tend to get carried away with the hype of new technology. Vendors see an opportunity to sell new products, while end users like to play with new and shiny things (even if there isn’t an obvious business case). 

With all new technology, I like to exhibit a degree of caution. In the first instance, I like to investigate and understand, but also to identify shortcomings. To quote Ronald Reagan, I believe in “trust but verify”. I trust my GPS, for example, but I also verify the directions it gives me, because it is occasionally wrong. I think we’re in a hype cycle of inflated expectations, where AI is seen as the answer to everything and will be integrated into every platform. In reality, we’ll settle down into a middle ground, where generative AI speeds up certain tasks. I don’t think it’s about to become Skynet and take over the world!