Olympus Mons image courtesy of ESA/DLR/FU Berlin/Andrea Luck
Computer science Professor Onur Mutlu has won a Huawei OlympusMons award for far-reaching research on data placement tech and NAND chips.
Professor Onur Mutlu
The OlympusMons awards started in 2019 to recognize and encourage foundational data storage research and nurture collaboration between industry, academia, and research. A 2023 accolade presented to the team led by Professor Onur Mutlu of ETH Zurich was for their study into network-storage-computing converged systems, which included an adaptive optimization algorithm. Mutlu’s team was awarded ¥1 million ($137,883). A 2023 document, “Machine Learning Driven Memory and Storage Systems,” describes three elements:
Pythia – self-optimizing memory controller using reinforcement learning
Hermes – perceptron-based off-chip load prediction that accelerates long-latency load requests
Sibyl – self-optimizing hybrid storage system (HSS) controllers using reinforcement learning
The reinforcement learning idea behind Sibyl is that a data placement controller, operating at the storage array controller level or in an OS, could receive positive and negative feedback on its data placement decisions across two or three tiers of storage, based on minimizing access latency. It could then optimize its placement decisions to reduce latency overall, learning to improve its performance over time.
Sibyl is explained in a May 2022 scientific paper, “Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning.” The abstract says: “Sibyl observes different features of the running workload as well as the storage devices to make system-aware data placement decisions. For every decision it makes, Sibyl receives a reward from the system that it uses to evaluate the long-term performance impact of its decision and continuously optimizes its data placement policy online.”
The decisions take account of the system costs involved in promoting/evicting data from any one storage tier to another.
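As a rough illustration of that feedback loop – a minimal sketch only, not Sibyl’s actual implementation – the agent below picks a tier for each request, receives a latency-based reward that can also penalize promotion/eviction cost, and updates its policy over time. The tier names, state features, and hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

# Minimal sketch of a latency-driven reinforcement learning placement policy.
# Illustrative only: tier names, state handling, and reward shaping are
# assumptions, not Sibyl's actual design.
TIERS = ["fast_ssd", "sata_ssd", "hdd"]

class PlacementAgent:
    def __init__(self, epsilon=0.1, alpha=0.2, gamma=0.9):
        self.q = defaultdict(float)   # Q[(state, tier)] -> expected long-term reward
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma

    def choose_tier(self, state):
        # Explore occasionally; otherwise exploit the best-known tier for this state.
        if random.random() < self.epsilon:
            return random.choice(TIERS)
        return max(TIERS, key=lambda tier: self.q[(state, tier)])

    def update(self, state, tier, reward, next_state):
        # One-step Q-learning update driven by the system's feedback signal.
        best_next = max(self.q[(next_state, t)] for t in TIERS)
        self.q[(state, tier)] += self.alpha * (
            reward + self.gamma * best_next - self.q[(state, tier)]
        )

def reward(access_latency_us, migration_cost_us=0.0):
    # Lower latency (including any promotion/eviction cost) earns a higher reward.
    return -(access_latency_us + migration_cost_us)
```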
Mutlu’s team evaluated Sibyl against CDE, HPS, Archivist, and RNN-based data placement techniques. It improves performance by 21.6 percent compared to the best previous data placement technique in a dual-HSS configuration, the paper states. Sibyl outperforms the state-of-the-art policy by 48.2 percent in a tri-HSS configuration: an Optane P4800X SSD, an Intel D3-S4510 SATA SSD, and a Seagate Barracuda 1 TB SATA disk drive.
It achieves 80 percent of the performance of an idealized data placement policy with complete knowledge of future access patterns, while incurring a modest storage overhead of only 124.4 KiB.
The paper’s conclusion reads: “Our extensive real-system evaluation demonstrates that Sibyl provides adaptivity and extensibility by continuously learning from and autonomously adapting to the workload characteristics, storage configuration and device characteristics, and system-level feedback to maximize the overall long-term performance of a hybrid storage system.”
Sibyl can be extended to new device tiers and could even, in theory, be applied across storage arrays in a tiered hierarchical lifecycle management system, although this has not been coded and tested.
Open source Sibyl code is available on GitHub, as is open source code for Hermes and Pythia.
VAST Data is hooking up with AI stack orchestrator Run:ai to provide storage and data infrastructure services for Run:ai’s AI tool chain and framework management for GPUs in a full stack deal.
Processing AI models for training and inference involves much, much more than sending a large data set into a job that the GPUs can process. William Blair analyst Jason Ader has diagrammed some of this complexity to show job stack components from a base infrastructure layer, including storage, up through a data layer, foundation model, middleware, and application layers:
Generative AI complexity – a tool chain diagram from William Blair analyst Jason Ader
There are myriad elements at each layer and Run:ai orchestrates them. VAST CEO and co-founder Renen Hallak said in a statement: “We’ve recognized that customers need a more holistic approach to AI operations. Our partnership with Run:ai transcends traditional, disparate AI solutions, integrating all of the components necessary for an efficient AI pipeline. Today’s announcement offers data-intensive organizations across the globe the blueprint to deliver more efficient, effective, and innovative AI operations at scale.”
Run:ai was founded in Tel Aviv, Israel, in 2018 by CEO Omri Geller and CTO Ronen Dar, and has raised $118 million thus far. Its containerized software provides AI optimization and orchestration services, built to let customers train and deploy AI models and get scalable, optimized access to AI compute resources. For example, it provides so-called fractional GPU services, in which a GPU is virtualized so that separate AI jobs can share a single physical GPU, in the same way that server virtualization enables separate virtual machines (VMs) to share the same physical CPU.
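The fractional GPU idea can be pictured with a toy allocator – a conceptual sketch only, not Run:ai’s API or implementation – that admits jobs onto a GPU only while their requested fractions fit:

```python
# Toy accounting model of fraction-based GPU sharing. Purely illustrative: it
# shows the admission logic, not how Run:ai virtualizes GPU compute and memory.
class FractionalGpu:
    def __init__(self, gpu_id: str):
        self.gpu_id = gpu_id
        self.allocations: dict[str, float] = {}   # job name -> fraction of the GPU

    @property
    def free_fraction(self) -> float:
        return 1.0 - sum(self.allocations.values())

    def allocate(self, job: str, fraction: float) -> bool:
        # Admit the job only if enough of the GPU remains unallocated.
        if 0 < fraction <= self.free_fraction + 1e-9:
            self.allocations[job] = fraction
            return True
        return False

    def release(self, job: str) -> None:
        self.allocations.pop(job, None)

gpu = FractionalGpu("gpu-0")
assert gpu.allocate("inference-a", 0.25)
assert gpu.allocate("inference-b", 0.25)
assert gpu.allocate("training-c", 0.5)
assert not gpu.allocate("overflow-d", 0.1)   # the GPU is already fully allocated
```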
Geller said: “A key challenge in the market is providing equitable access to compute resources for diverse data science teams. Our collaboration with VAST emphasizes unlocking the maximum performance potential within complex AI infrastructures and greatly extends visibility and data management across the entire AI pipeline.”
VAST and Run:ai say the joint offer will include:
Full-Stack Visibility for Resource and Data Management encompassing compute, networking, storage, and workload management across AI operations.
Cloud Service Provider-Ready Infrastructure with a CSP blueprint to deploy and manage AI cloud environments efficiently across a single shared infrastructure, with a Zero Trust approach to compute and data isolation.
Optimized End-to-End AI Pipelines from multi-protocol ingest to data processing to model training and inferencing, using Nvidia RAPIDS Accelerator for Apache Spark as well as other AI frameworks and libraries available with the Nvidia AI Enterprise software platform for developing and deploying production-grade AI applications, with the VAST DataBase enabling data pre-processing.
Simple AI Deployment and Infrastructure Management with fair-share scheduling to allow users to share clusters of GPUs without memory overflows or processing clashes, paired with simplified multi-GPU distributed training. The VAST DataSpace enables data access across geographies and multi-cloud environments, providing encryption, access-based controls, and data security.
This partnership will tend to encourage VAST customers to use Run:ai, and Run:ai users to consider storing their data on VAST. It’s a deal that should be attractive to CSPs looking to provide AI training and inferencing services as well.
VAST plus Run:ai blueprints, solution briefs, and demos will be available at Nvidia GTC 2024, booth #1424.
HPE has expanded its GreenLake for Block Storage offering with an Alletra MP-based high-end system.
It first announced subscription-based GreenLake for Block Storage – with general-purpose Alletra 5000, business-critical Alletra 6000, and mission-critical Alletra 9000 tiers, and a 100 percent data availability guarantee – in March 2022. A release 2 added Alletra MP support in April last year, positioned as mission-critical storage with midrange economics; HPE said it was the storage industry’s first disaggregated scale-out block storage, featuring 8-core and 16-core controller chassis. Since then, HPE has added non-switched 2-node systems supporting 8 to 32 cores with up to two JBOF expansion shelves, targeted at small and mid-market customers. Now it has made faster switched systems generally available.
Simon Watkins, senior manager, worldwide product marketing for HPE Storage & Cloud Data Services, blogged: “Our announcement represents the next step in a multi-step journey to consolidate, standardize and simplify our HPE block storage portfolio on a common software and hardware stack – all with a unified cloud management experience provided by the HPE GreenLake platform.”
HPE has upped the performance and capacity limits substantially, claiming it’s reimagining storage with an architecture that delivers scale, performance, and efficiency in a way that legacy storage competitors cannot achieve.
There are 16-core/32-core 2-node and 4-node switched models, which deliver up to a 2x performance boost compared to previous models. They support up to 2.5x more capacity through up to eight JBOF expansion shelves and a raw capacity range of 15.36TB to 2.8PB. You can upgrade drives in 2-drive increments and JBOFs in increments of one, and storage capacity and performance can be scaled independently. A new HPE StoreMore Guarantee promises 4:1 data compaction.
On the network side, NVMe/TCP support has been added, complementing the existing Fibre Channel, NVMe-oF/FC, and iSCSI connectivity. The InfoSight AIOps system management tool now lets users see in real time the impact workloads are placing on storage, and adapt their infrastructure to anticipate and avoid failures and optimize performance.
There is extended AI-based performance reporting and analytics for improved troubleshooting and insights. Analysis topics include headroom utilization trend analysis, top volume hotspots by latency, performance (latency and throughput), workload drift detection, resource contention detection and sustainability metrics, including power consumption trends.
The 2x performance increase is compared to existing 2-node (16-core/32-core) switchless configurations, and the 2.5x capacity increase is compared to the prior release 2 of GreenLake for Block Storage built on Alletra Storage MP.
HPE tech specs table.
This new Alletra MP system is for mission-critical apps – from large databases to modern cloud native apps to consolidated mixed workloads – at a midrange price point. Example workloads include PostgreSQL and SAP HANA, for which the new system is certified for up to 80 production nodes.
There is unified cloud management through the GreenLake hybrid cloud platform to enable users to manage and monitor their fleet of block storage globally from a single SaaS-based cloud console accessible from any location, on any device. HPE says the system supports online, non-disruptive, data-in-place upgrades.
It claims legacy storage competitors, many with dual-controller designs, are unable to match the combination of disaggregated scale-out block storage of the Alletra MP system, 100 percent data availability guarantee, and the intuitive cloud experience of the GreenLake hybrid cloud platform. Watkins blogs: “It’s a storage architecture in which storage SSDs are shared by all controllers across a high-speed network, so you have no silos and the ability to tolerate multiple node failures.”
Competition
This Alletra MP block storage system gives HPE something to better answer all-flash array product pitches from Dell (PowerStore), IBM (FlashSystem), NetApp (ASA array), and Pure Storage (FlashArray//). However, some of their maximum capacities go beyond that of HPE’s system, which scales out to 2.9PB raw and 11.6PB effective capacity.
Dell’s PowerStore 500T provides up to 4.71PB per appliance and 18.83PB per cluster. This is effective capacity with 4:1 data reduction.
IBM’s FlashSystem 9500 offers 4.5PB effective capacity with 3:1 data reduction in a single enclosure and up to 18.4PB effective with two control enclosures and 5:1 reduction.
NetApp’s ASA A900 scales out to 351PB effective capacity – way beyond HPE’s system. Its entry-level ASA A150 can provide 13PB effective capacity from up to 12 nodes.
Pure’s FlashArray//XL170 scales out to 1.48PB raw and 5.5PB effective capacity.
We think Alletra MP scale out limit improvements could be on HPE’s roadmap.
Watch a video of the performance reporting features here. A December 2023 technical deep dive blog by HPE distinguished technologist Dimitris Krekoukias provides more information. He explains: “The initial release of the new expanded HPE GreenLake for Block Storage will be switchless arrays with disks inside the main chassis, and a near future update will bring the larger, switched, fully disaggregated, shared-everything architecture for much larger and more flexible implementations.” That near future update is what has just become generally available.
Bootnote
The data reduction and data compaction technologies to reduce storage capacity utilization include dedupe with express indexing, an “extremely efficient compression algorithm,” data packing, hardware accelerated and reservation-less Thin Provisioning, and Virtual Copy.
HPE’s 4:1 data compaction ratio guarantee is calculated as a ratio of effective to usable capacity, in aggregate, across all reducible data. If the ratio is not achieved, HPE will provide expertise related to data efficiency, additional storage capacity needed to reach the targeted data efficiency, or other compensation, at its own discretion. The guarantee is valid for the term of the support contract.
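As a worked example of what the guarantee implies (our illustration, assuming 4:1 means four units of effective capacity for every unit of usable capacity holding reducible data):

$$100\ \text{TiB usable} \times 4 = 400\ \text{TiB effective capacity}$$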
HPE also offers customized guarantees, valid for a year, for customers requesting a guarantee for a set amount of effective capacity.
Interview: Dan Waldschmidt, known for his fondness of ultra-running, has become the new CEO of Panzura.
He conducted the Zoom interview standing up, not sitting, which we suppose is indicative of always being on the go. Waldschmidt has been involved with Panzura since 2020, being prior CEO Jill Stelfox’s partner in buying the then floundering company with private equity backing.
As a reminder, Panzura is a cloud file services business, competing with the likes of CTERA, Egnyte, and Nasuni, ensuring that customer staff in distributed organizations have access to the latest file-based information which is kept synchronized in a global namespace. The files are held centrally in AWS using underlying object storage. It most recently announced an Edge Gateway providing smartphone and tablet access to the Cloud FS file vault.
Waldschmidt replaced Jill Stelfox as CEO in January after a time spent as Panzura CRO then chief transformation officer. We have edited the interview for brevity and clarity.
Blocks & Files: Tell us about your plans for the company
Dan Waldschmidt: Of course, everyone wants to make money. You know, to deliver a product to market usually means more satisfied customers, usually means easier access to that technology, just removing some of the friction.
So the previous leadership team said come in, do that magic for us. I was there from October-ish to January, February, until I realized they wanted to raise money.
I was leading, as CRO, our growth and global expansion. I took a sabbatical in early ’23. A lot of people thought I went away, but it took us about two months. I probably worked harder in my job than at any other job. When you talk about people … working 100-hour weeks? Yeah, yeah, that was that. I was probably at the office at 4, 5 o’clock every morning, leaving at midnight almost every day. And I’m an athlete, too. So I would go running. And you know, I’m sort of pushing the boundaries.
By the time I took my sabbatical last year, I had sort of burned the candle on every end possible. So when I came back, I realized what’s stopping us from being great. Is it more revenue? It’s not. It’s a better user experience. It is listening to our enterprise customers.
I realized we needed to overhaul. It started with an alignment around a new company mission, which I presented, and got the board to sign off on a new set of values. Not that our old values weren’t right, but being warm to an enterprise doesn’t solve their problems. Right? They need real work.
I feel lucky that I had a chance to be forming and shaping and building narrative and relationships with our leaders. So that when … someone says hit the ground running, what they usually mean is hit the ground scrambling, trying to figure it out. And these are people that I have trusted relationships with and I’m able to elevate and accelerate our vision for enterprise, which really comes down to one thing – stability and performance and putting both of those words into one bucket, and it’s our mission for 2024, 2025. It’s not chasing shiny things. It is doing what we do really best.
Do you remember the show Dirty Jobs? Mike Rowe … had this show called Dirty Jobs, and he basically said everyone wants to go create software and become a Wall Street billionaire. But the real heart of America are people who are digging ditches and running electrical lines and cleaning chimneys. And he did all these dirty jobs to illustrate the point.
And that’s the passion that we have caught ourselves up with. There’s some dirty jobs in hybrid multi-cloud data management that just need to be done. And they’re not glamorous. They’re features that are buried in an Admin API that controls speed between NFS and SMB, right? Connections that allow seamless experience.
That’s the business we’re in right now. And so far, I have to tell you, the results we’re seeing in the last five months are outpacing the results we’ve seen in the last 15 to 18 months.
Blocks & Files: Do you see the main options for growth as expanding current accounts or landing new accounts?
Dan Waldschmidt: We have a segment of our 100-plus customers who rely on us for a significant portion of their workloads. What we have learned, though, is in the enterprise space, that’s not the case. They might be using us for 5 or 10 percent of their workloads.
And this is not unusual because they’ve got significant investment with Nutanix or Qumulo, or NetApp. And the sheer process of moving off, it’s audacious.
So it’s as they come up for their three-year cycle or five-year cycles. OK, let’s take that over. So now you go from 5 percent to 7-and-a-quarter percent. And then the next year, you go to 12 percent. And as these workloads come up, it’s new workloads. Brand new workloads could be around HPC, AI, things like that.
We’re seeing growth in those areas. We’re also seeing just the sheer elegance they’re looking for and having their data in a more protective environment, like Panzura, where you can’t hack us, crack us, delete us, or even move the data without permissions. They’re looking for that upgrade in their security profile.
Blocks & Files: Is there scope for Panzura to partner with a filer company to say, in effect, we own the edge. And we’re really, really good at that. But you’re good in the datacenter. So maybe the two of us together can make three rather than two?
Dan Waldschmidt: I feel like you’re reading somewhat from my playbook here. I had to make sure my legal pad was sitting next to me here. You’re going to see some things coming out [that] I’m not ready yet to talk about fully.
Blocks & Files: Would it be fair to say that, in contrast to CTERA and Nasuni, Panzura’s starting stance is not replacing the filer? I get the impression that you’re more concerned with partnering where it’s possible, rather than actually confronting a filer and trying to boot them out of an account.
Dan Waldschmidt: We are intently focused on listening to what enterprise leaders are asking for … Let me just paint an imaginary world where interoperability was a core piece of Cloud FS, where you could engage with the tools, platforms, and technology you’re already using. And then move your most precious workloads into a Cloud FS Panzura environment. That is what CIOs are asking for. That is the path we are on. Further advancements will be coming soon, where you will have technology to do that.
Blocks & Files: I’m getting the impression that Panzura is concerned to move toward customers, find out their pain points that you can solve, and then … get close to them and work them through. It’s not about … toss that piece of rubbish gear out and use us instead, everything will be fine. It’s more that it’s if you can use this here, and perhaps if you can use that there, then altogether, things become much better.
Dan Waldschmidt: That’s exactly our mission. It is that we are no longer forcing this binary choice of it’s either Nasuni or Panzura, it’s either NetApp or Nutanix, or Panzura on top of Nutanix, and this whole stack or nothing. That is no longer the case.
It is now saying, well, hold on, let’s be practical about that. You might have a $25 million investment in NetApp. It might need to run in place for three and a half more years or two and a half more years – because it needs to wind down and we need to get the economic benefit out of that. However, we cannot delay the benefits that having Panzura would have on the enterprise. So we’re going to take specific workloads, specific regions, business units, mission-critical assignments, and place them on Panzura, because we know it’s the fastest way in and out of multi-cloud hybrid cloud environments. That’s a fact. It’s the fastest and most secure way in and out of hybrid multi-cloud environments.
That’s what we’re going to use. And as Panzura’s enterprise-ation matures against stability and performance … we are looking forward to bringing our entire business on it. We’re already seeing this at some of our bigger customers where they might be spending with us on a … few different workloads. And now they’re in a place to say “build me a three-year roadmap where all of our business comes to you.” But let’s be smart about both the economics and the migratory strategy for that data.
Comment
Panzura is focused on methodically expanding its current accounts by adding additional workloads to its overall Cloud FS system. We believe there will be more interoperability with incumbent file system and services suppliers such as NetApp, Dell, and Qumulo.
Waldschmidt wants Panzura’s software to work like a well-oiled machine and do the necessary “dirty work,” as he calls it, the nuts and bolts of making file-based data available, securely and safely, to Panzura’s users, both at the remote office and remote user device edge and in datacenters. We could envisage, perhaps fancifully, an outer ring of Panzura-enabled access to file data for remote users with workstations, PCs, tablets, or phones. They access Panzura’s Cloud FS in the public cloud and interoperate with on-premises filers in a hybrid, multi-cloud IT environment. More and more workloads are migrated over time to this Panzura infrastructure.
This is a marathon involving step-by-step execution and improvement. Waldschmidt has written a book, EDGY Conversations: How Ordinary People Can Achieve Outrageous Success, and says he finds that while “luck may be more appealing than effort, you don’t get to choose luck. Effort, on the other hand, is totally available, all the time.”
Panzura employees appear to be entering an era where sustained hard work will be the expected norm.
Profile. StorageX is a computational storage company with a difference, supplying AI processing and models to cut data-heavy workloads down to size.
Stephen Yuan
We met StorageX founder and CEO Stephen Yuan on an IT Press Tour to find out more about its Lake Ti processing card and its functionality. StorageX believes that for data-heavy workloads, compute should move closer to the data. Moving mass data to distant compute – be it x86 or GPUs – takes a long time and needs a lot of electricity.
But placing compute near the storage invites the question: what kind of compute? Is it x86, GPU, Arm, RISC-V, or something else? Is StorageX a ScaleFlux-type computational storage company?
StorageX says its Lake Ti P100/P150 series add-in card features a dedicated AI engine that has higher performance per watt and per dollar than mainstream GPUs, plus 32GB of DRAM. It also brings high-performance data acceleration and low latency IO acceleration for storage and the network – meaning high-performance AI, network, and storage acceleration in one package.
It supports PCIe 4 and is RDMA-ready. Yuan said: “We can support RDMA. We have the protocols for it right now.” The current product’s 2x 100GbE ports are there for that and also for memory expansion. There is a direct SCM/SSD bridge for memory/storage extension. The Lake Ti card can pre-process data before it goes to GPUs and process incoming data before it is stored on a drive.
The card is ready for real-time analytics work. Yuan asserted that Lake Ti can carry out vector processing tasks more efficiently than CPUs and GPUs. There are predefined and optimized AI models for user applications, and the Lake Ti system has data-awareness applications for a smart data lake that can be used for recommendation processing.
The diagram above compares a legacy server recommendation cluster on the left with a StorageX-enabled recommendation system on the right, which is more efficient. It shows a Lake Ti card connected to a TB-level set of SSDs communicating at the GB level via PCIe switch to a CPU and a GPU. They are still needed – the card does not replace them.
Yuan explained: “If we can process closer to the storage nodes, that will make it a lot easier … you can reduce a lot of the data size, making it more efficient when doing the compute.” In other words, StorageX processes the data and reduces its size, reducing network transmission time. He added: “We can process the data close to the storage nodes, and pre-process it, and we really streamline the throughput.”
“We’re optimizing … data intensive workloads. We’re not competing with CPUs for those very complex computer tasks or GPUs for the graphical stuff or things like that. But we’re good at … data heavy workloads.”
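The “reduce before you ship” idea can be sketched generically – illustrative Python only, not StorageX’s software, with invented record fields – by filtering records where they are stored so that only the reduced result crosses the network:

```python
# Generic near-storage pre-processing sketch: filter records on the storage
# side, then ship only the reduced result to the CPU/GPU cluster. Record layout
# and function names are illustrative, not StorageX's API.
from typing import Iterable, Iterator

def read_records_from_local_drives() -> Iterator[dict]:
    # Stand-in for reading raw records from drives attached to the storage node.
    for i in range(1_000_000):
        yield {"user_id": i, "watch_seconds": i % 300, "category": "video"}

def near_storage_reduce(records: Iterable[dict]) -> list:
    # Runs next to the data: keep only what the recommendation job needs.
    return [r for r in records if r["watch_seconds"] > 240]

def send_to_compute_cluster(reduced: list) -> None:
    # Only the filtered subset crosses the network to the compute servers.
    print(f"shipping {len(reduced):,} records instead of 1,000,000")

send_to_compute_cluster(near_storage_reduce(read_records_from_local_drives()))
```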
Lake Ti card
In summary, this Lake Ti card, which is FPGA-based, has three compute functions: an AI engine and AI models, a storage data processor, and a network I/O processor. StorageX is bringing some dedicated compute functions to the storage along with storage and network I/O acceleration so that customers can carry out AI analytics work without needing access to massive x86 or GPU server farms for all of it.
Compared to Nvidia’s BlueField SmartNIC/DPU product, the Lake Ti card has more capabilities such as AI and pseudo-TPU features. But Lake Ti does not support virtualization, which BlueField does. Nvidia’s card shares network acceleration with Lake Ti, but does not have data acceleration. Lake Ti supports memory expansion via its 2x 100GbE ports; BlueField does not.
The Lake Ti card is not a DPU in the Fungible sense – a routing or gateway chip able to connect to many hosts, each of which has a corresponding front-end card.
StorageX claims it has users in the digital or smart factory market, and is working with SK hynix’s Solidigm unit in that area. It is partnering with AMD and working with some of the memory vendors – SSD vendors as well – to create a smart data lake system. StorageX is working in autonomous driving with an on-vehicle storage and compute platform, also in streaming media. With Tencent, for example, there are more than 100 million videos uploaded each day. Yuan said that means around 5PB of video data needs to be mined, analyzed, and recommended, which involves 360 billion frames per day. TikTok is showing up to 167 million videos per minute and needs fast recommendations to its users.
Yuan says StorageX Lake Ti computational storage processors in a smart data lake can provide video understanding, natural language processing, extraction, user profiling, and recommendations.
There is a next-generation product in development that may use an ASIC instead of an FPGA, and could support PCIe 5. It should arrive later this year. Direct CXL support is planned, as is card clustering. The clustering would not necessarily use a PCIe connection.
Some kind of fast data transport to GPUs is being developed, with Yuan hinting: “We’re planning to connect to the GPUs using certain … protocols, and pre-process the data before feeding it to the GPU itself.”
This should be independent of the specific GPU manufacturer, as “the only thing that matters is how we connect to them.”
The Lake Ti product is not sold as a card on its own – StorageX provides a rounded package. “We provide hardware, software and services for … customers,” Yuan explained.
From a competitive view, StorageX considers itself to be an AI plus data company. “Databricks or … SaaS software service companies really are software-based companies. We are providing our dedicated hardware to accelerate these workloads, making it more efficient to process very data-heavy workloads.”
If potential customers are spending a lot of money on GPU or x86 server clusters and suffering long latency data movement between them and the storage, Yuan suggests: “Our hardware helps us to reduce the TCO and make the [CPUs and GPUs] more efficient. So we’re a kind of hybrid hardware and algorithms and data company.”
It is not a drive-level computational storage company like ScaleFlux. It is a system-level near-storage compute company with AI capabilities running on its own specialized FPGA hardware, and we’re going to hear a lot more about it in the future.
Backblaze has a Powered by Backblaze program for ISVs and other partners to integrate Backblaze B2 cloud object storage into their products and services. A Custom Domains feature lets businesses serve content to end users from the web domain or URL of their choosing, with Backblaze managing the cloud storage on the backend. A Backblaze Partner API enables organizations to provision Backblaze B2 Cloud Storage accounts from within their own platform. The API also enables these companies to manage accounts, run reports, and create a bundled solution or managed service for a unified user experience. There’s more info in a Backblaze blog.
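As a simple illustration of integrating B2 into an application, B2 also exposes an S3-compatible API that standard tooling such as boto3 can target; the endpoint, bucket, and credentials below are placeholders:

```python
# Minimal sketch of writing and reading Backblaze B2 objects via its
# S3-compatible API with boto3. Endpoint, bucket, and credentials are placeholders.
import boto3

b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # example B2 endpoint
    aws_access_key_id="<application-key-id>",
    aws_secret_access_key="<application-key>",
)

b2.put_object(Bucket="example-bucket", Key="backups/report.pdf", Body=b"hello")
obj = b2.get_object(Bucket="example-bucket", Key="backups/report.pdf")
print(obj["Body"].read())
```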
…
Nikkei Asia reports that chipmaker CXMT, based in Hefei, eastern China, is working to produce China’s first domestic high bandwidth memory (HBM) chips, and has received suitable manufacturing and test equipment from US and Japanese suppliers. American equipment suppliers Applied Materials and Lam Research have had licenses from Washington to ship chip production tools to the Chinese memory chipmaker since mid-2023. CXMT is managing to do this despite broad US technology export sanctions. The chips are likely to be less advanced than Samsung, Micron, and SK hynix HBM chips, but it is a starting point.
…
Analyst DCIG recognized the top five rising vendors offering multi-site file collaboration products. They include Hammerspace, LucidLink (Filespaces), Quobyte, Resilio Connect, and Tiger Technology (Tiger Bridge). It evaluated 18 suppliers and products, but did not include CTERA, Egnyte, Nasuni, or Panzura [see inclusion criteria below]. Get a copy of the report here. [DCIG has a separate global file management system report – see Panzura note below.]
…
Lakehouse supplier Dremio tells us February 12-16 is Love Data Week, an international celebration of data in all its forms to promote good research data management strategies. Yes, really. Its State of the Lakehouse 2024 report says:
Data lakehouse adoption is on the rise and cost savings are key.
70 percent of respondents say more than half of all analytics will be on the data lakehouse within three years, and 86 percent reckon their organization plans to unify analytics data.
Technical professionals overwhelmingly agree AI is a national security priority (84 percent), noteworthy in light of the recent US executive order on AI.
42 percent moved from a cloud data warehouse to the data lakehouse. Top reasons for the shift were cost efficiency and ease of use.
81 percent of respondents are using a data lakehouse to support data scientists building and improving AI models and applications.
…
Privately owned data orchestrator Hammerspace said it had great momentum entering 2024. In 2023 it experienced:
A single customer deployment that exceeds 100 petabytes of capacity.
300-plus percent growth in sales pipeline.
650 percent year-over-year increase in the capacity under management in the Hammerspace Global Data Environment software.
65 percent year-over-year growth in signed channel ecosystem partners.
Nearly doubled sales and marketing teams to meet increased market demand.
Customers include Blue Origin, the National Science Foundation, Royal Caribbean Group, and hyperscale large language model environments.
…
Cybersecurity supplier Index Engines reports a sharp increase in adoption of its AI-powered CyberSense software in several sectors targeted by ransomware. These include telecom (70 percent gain), insurance and finance (56 percent), IT services (50 percent), federal and state government (41 percent), and energy (38 percent).
…
The Japanese government is giving a ¥150 billion ($1 billion) subsidy to the Kioxia-Western Digital joint venture NAND manufacturing plants at Yokkaichi and Kitakami aimed at securing stable production of semiconductors in Japan. Previously, the joint venture manufacturing facility at Yokkaichi was approved to receive up to ¥92.9 billion in subsidies from the Japanese government in 2022.
…
Abel Gordon
NVMe/TCP storage startup Lightbits Labs has promoted Chief System Architect Abel Gordon to CTO. He has been instrumental, we’re told, in designing its low-latency clustered block storage software, leveraging the NVMe over TCP protocol. Lightbits aims to consolidate its market presence and meet the burgeoning demand for advanced cloud storage solutions. Gordon was also said to be instrumental in forging a technology alliance with Microsoft, which resulted in the introduction of Lightbits on the Azure Managed Application marketplace, including solutions for Azure VMware Solution (AVS), Azure Kubernetes Service (AKS), and Oracle on Azure.
…
LucidLink, which supplies filespace technology designed to give instant access to data and real-time collaboration for the world’s creative companies, has completed its System and Organization Controls (SOC 2) Type II audit and has gained TPN Blue Shield certification from the Trusted Partner Network (TPN). Developed by the American Institute of Certified Public Accountants (AICPA), SOC 2 is a rigorous compliance framework designed to ensure the effectiveness of a company’s security policies and procedures.
…
San Francisco startup Mechanical Orchard has raised $24 million in a Series A funding round, led by Emergence Capital. It uses iterative, AI-enhanced, and reverse engineering approaches to move large enterprises off mainframe systems and into the cloud. It gives CIOs and CTOs better visibility and control over the process, we’re told, while ensuring critical systems stay online. Founder and CEO Rob Mee also started software development consulting firm Pivotal Labs.
Peer Software announced Chen Moore and Associates (CMA) is using Peer Software’s Global File Service (PeerGFS) to simplify file management and increase productivity for distributed teams. PeerGFS provides Chen Moore’s team, including many CAD users, with fast, local access to shared project files.
…
SSD controller supplier Phison is making specialized SSDs for video surveillance storage, saying HDD performance has gradually become insufficient to meet the demands of high-resolution video recording, and that HDDs have high failure rates, power consumption, and noise issues. The mid-to-high-range S12DI and cost-effective S17T SSDs have:
Sustained Write Performance of 500 MBps to prevent dropped frames.
Power Loss Protection (PLP) to ensure data is effectively preserved in the event of a power outage.
Reliability and durability to support continuous intensive data writing for extended periods, such as 24 hours.
Quantum has sponsored an IDC brief, “Data Deluge: Why Every Enterprise Needs a Cold Storage Strategy,” for insight and guidance to develop cost-effective cold storage strategies in the areas of long-term data retention, cloud data security and compliance, and data migration and management. Get access to it here.
…
Rakuten Symphony announced the full-scale commercial launch of Rakuten Drive, a file storage cloud service, in Japan. It’s already available in Korea. Rakuten Drive provides file storage cloud services for both individual and enterprise customers that the company says allow users to send large files quickly and offer secure, intuitive file storage and sharing in the cloud. It is introducing the ability to view and edit documents, decks, and spreadsheets in real time using the web versions of Microsoft 365 productivity apps such as Word, Excel, and PowerPoint. The consumer-facing version of the service has been integrated with Rakuten ID – a shared ID that allows Rakuten members to access multiple services – making it possible for the 100 million Rakuten members in Japan to use the service with their existing Rakuten ID.
…
SK hynix plans to commence mass production of its HBM3E chips within the first half of 2024, with 8-stack parts sampling and 6-stack parts delivering 1.2 TBps of bandwidth per stack. GenAI training is driving a need for more memory for GPU processors. Its HBM4 technology, with peak stack bandwidth of more than 1.5 TBps, is set for chip sampling in 2025 and mass production in 2026.
…
SK hynix has established a roadmap to actively use recycled and renewable materials in production. It aims to raise the proportion of recycled materials used in the products currently manufactured by the company to 25 percent by 2025 and more than 30 percent (based on weight) by 2030. It will start with essential metals for semiconductor production such as copper, tin, and gold, and replace them with recycled materials.
…
Business Korea reports that SK hynix is developing 2.5D fan-out packaging for its next generation of DRAM. Two DRAM chips are arranged horizontally, with no substrate underneath them, and combined into a single, thinner package. Fan-out wafer-level packaging has been used by TSMC and Samsung for smartphone application processors. SK hynix reckons fan-out packaging can reduce costs by avoiding the Through-Silicon Via (TSV) process and increasing the number of input/output (I/O) interfaces. The company supplied special DRAM for the “R1” computation unit installed in Apple’s new Vision Pro AR headset.
…
Scalable GPU data analytics platform SQream announced an integration with Dataiku, which supplies a scalable data science and machine learning (ML) product. Users can now use SQream’s analytics capabilities within the flexible and governed Dataiku environment, enabling users to explore, prepare, and transform large datasets.
…
Synology has started offering its cloud file transfer service, C2 Transfer, for free. C2 Transfer is designed for individuals and businesses to streamline the sharing of large files while ensuring data security, we’re told. C2 Transfer simplifies and secures the sharing of sensitive information within teams. Following the free plan’s introduction in December 2023, Synology says C2 Transfer experienced a fourfold increase in total file transfer and request tasks, alongside a 105 percent growth in user numbers from December 2022 to December 2023.
…
Veeam Software has been named Alliance Partner of the Year by S3 cloud storage provider Wasabi Technologies.
…
Western Digital notified the SEC that it’s unable to file the calendar Q4 2023 quarterly 10-Q report because of recording errors relating to its equity interest in the Flash Ventures joint ventures in Japan. This is because of differences between Japanese and US GAAP accounting principles for lease-related transactions, and it needs additional time to correct the errors. Due to this, WD expects to make corrections to the preliminary financial results for the three and six months ended December 29, 2023, and to the financial statements for the three-month and six-month periods ended December 30, 2022, when the company files its 10-Q form. Wells Fargo analyst Aaron Rakers said: “The differences show up in the equity in earnings for these entities recognized by the company in Other Income & Expenses (OI&E) and a cumulative understatement of the carrying value of the company’s equity investment in Flash Ventures.”
…
PNY is supplying hardware for VAST Data.
…
Edge cloud provider Zadara has completed Australia’s Information Security Registered Assessors Program (IRAP) process. The assessment accompanies the expansion of Zadara’s Edge Cloud, with a new availability zone now online in Melbourne and more coming. Zadara has fully managed compute (EC2 compatible), storage (block, file, object), and network systems hosted in a global point-of-presence (POP) network, called an edge cloud. It can also be run on AWS, Azure, GCP, and Oracle public clouds, or in a customer’s own datacenters. Zadara adds this IRAP assessment to its list of existing certifications including SOC 1, SOC 2, ISO27001, ISO27017, ISO27018, ISO27701, HIPAA, and EU GDPR.
…
Software RAID supplier Xinnor said the Karlsruhe Institute of Technology (KIT), one of the leading research institutions in Germany, has tested Xinnor’s xiRAID as a high-performance cache for its archival storage platform based on HPSS (High Performance Storage System). After testing several other HDD and SSD setups, KIT turned to a combination of xiRAID and NVMe drives as the most efficient solution, one that hugely outperforms the other options.
AirMettle has reinvented object storage for the analytics and generative AI era with a parallel software storage controller architecture delivering much faster access to data.
Founded by CEO Donpaul Stephens in 2017 and incorporated in 2018, the startup has thus far eschewed venture capital, raising around $4 million from grants and angels. Stephens was one of the founders of all-flash array startup Violin Memory back in 2005. What led him to AirMettle was the realization that large objects are sent to and from storage systems – such as Cloudian, MinIO, and Scality – as single items, when only small parts or sections are needed at a time by a server, whether local or remote. If you could divide a large object into shards, delineated according to the structural concepts of the accessing applications – video frames, for example – and spread them across the multiple nodes in an object system, then those nodes could serve the shards in parallel. This can speed access overall by a factor of up to 100x.
Stephens told an IT Press Tour briefing: “We bring software compute into storage but we do it correctly. Computational storage has been around for 20 years and is a failure in the market … We are computational storage, but at the system level. Not at the drive level.”
Founder and CEO Donpaul Stephens (left) and chief product officer Troy Trenchard (right)
AirMettle says its “core innovation is a software-defined storage platform with integrated distributed parallel processing enabling direct queries of semi-structured content in storage. The value gained is reducing network traffic to the query result set instead of the whole data set.”
Standard object storage stores and provides access to objects. AirMettle’s software does the same but also provides access to parts of objects. It understands the data formats used by accessing applications, partitions and structures the data according to these formats, and spreads it out in erasure-coded, single-digit-MB shards across the object storage nodes. That means the nodes can operate in parallel. These nodes are commercial-off-the-shelf (COTS) storage servers – special hardware is not needed.
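A greatly simplified sketch of that idea (generic Python, not AirMettle’s code, and with erasure coding omitted): split an object into single-digit-MB shards, scatter them across nodes, then read them back in parallel:

```python
# Toy model of sharded object storage with parallel retrieval. Shard size,
# placement, and the in-memory "nodes" are illustrative; erasure coding is omitted.
from concurrent.futures import ThreadPoolExecutor

SHARD_SIZE = 4 * 1024 * 1024            # single-digit-MB shards
NODES = [dict() for _ in range(16)]     # stand-ins for COTS storage servers

def put_object(key: str, data: bytes) -> int:
    """Split an object into shards and scatter them round-robin across nodes."""
    shards = [data[i:i + SHARD_SIZE] for i in range(0, len(data), SHARD_SIZE)]
    for idx, shard in enumerate(shards):
        NODES[idx % len(NODES)][(key, idx)] = shard
    return len(shards)

def get_object(key: str, shard_count: int) -> bytes:
    """Fetch every shard in parallel and reassemble the object."""
    def fetch(idx: int) -> bytes:
        return NODES[idx % len(NODES)][(key, idx)]
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        return b"".join(pool.map(fetch, range(shard_count)))

count = put_object("climate.nc", b"x" * (20 * 1024 * 1024))
assert get_object("climate.nc", count) == b"x" * (20 * 1024 * 1024)
```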
Analysis and AI training data sets are steadily increasing in size, making GPU servers work harder. They need more memory, and fast networking is required to get massive data sets to the GPUs quickly enough.
AirMettle says this situation is getting worse. Analytics memory is expensive and hitting size limits as data sets grow (eg AI). Analytics compute is an expensive way to filter out irrelevant raw data, and faster networking is expensive, constrained, and adds latency. It’s better to rearchitect the data store so that smaller amounts of data are sent to the analytics processors.
AirMettle supports classic tabular data, video data, and other formats – about ten in total – covering more than 90 percent of the object data stored. Stephens explained: “We handle multi-dimensional data, ie climate and weather. We divide the data into sub-planes which facilitates parallel processing.”
AirMettle has a database to hold its own internal metadata, which adds about 0.1 percent or so to the content. Stephens said: “This metadata is not indexing. It’s record:offset – where the bytes are.” This internal metadata enables parallel in-storage analytics. The right data is fed to analytics processors without having to go through a data lake extract, transform, and load (ETL) process to populate a data warehouse, which is then used as the analytics data source. In effect the AirMettle storage system, a smart data lake, inherently carries out ETL internally and no data warehouse is needed.
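What the record:offset metadata buys can also be sketched simply (again illustrative Python, not AirMettle’s format): with a per-record byte offset, the storage layer can return just the records a query needs rather than the whole object:

```python
# Toy record:offset index: map each record number to its byte range so a query
# can be served from only the relevant slices of the object.
records = [b'{"sensor": %d, "temp": %d}\n' % (i, 20 + i % 5) for i in range(1000)]
blob = b"".join(records)

index, pos = {}, 0
for i, rec in enumerate(records):
    index[i] = (pos, len(rec))          # record number -> (byte offset, length)
    pos += len(rec)

def read_records(wanted):
    # Serve only the requested byte ranges instead of shipping the whole blob.
    return [blob[off:off + length] for off, length in (index[i] for i in wanted)]

print(read_records([0, 500, 999]))
```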
Envisage a large-scale object storage system with a thousand to ten thousand access nodes based on standard x86 servers and local drives. AirMettle’s software runs in these access nodes’ controllers, presenting an S3 Infrequent Access-type tier to accessing applications via an API. The system’s speed increases with scale, and it outperformed AWS’s S3 on the Star Schema benchmark by 100x.
The Star Schema benchmark used 223 Select queries to object storage. AirMettle took less than a minute, while AWS took 1 hour 45 minutes, for 700GB of data queried.
Stephens said AI training uses files sent via GPUDirect to GPUs. In effect, AirMettle could provide the equivalent of GPUDirect for objects.
There are about 20 employees in the biz, with nobody yet full time on sales or marketing. Chief product officer Troy Trenchard was hired in July last year to lead product strategy, marketing, and business development. He discussed three public cloud product ideas:
AirMettle OS as a layer on top of public cloud object stores, with a possible 5x acceleration of S3 get/put operations. Object storage would compete with Cloudian and others at this level. It would need mid-market channel players.
AirMettle Analytics on top of that – S3 Select and Arrow interfaces with 100x-accelerated enhanced SQL (S3+, Arrow). This implies a more enterprise-focused option combining AirMettle OS and Analytics, growing over time.
AirMettle Sci-AI – scientific analysis plus AI, with plug-in support for add-ins, modules, and GPUs, aimed at high-end customers in the near term.
AirMettle has been awarded a patent and there are more in the pipeline.
The startup has not publicly launched its product technology, but has paying customers – such as Los Alamos Labs, its first such customer, with Gary Grider, HPC division leader, lauding the software. Stephens told us: “You get more value out of supercomputers if our software runs on the COTS storage servers feeding them data.” It’s also working with the National Oceanic and Atmospheric Administration in the US.
Stephens said: “We are talking to the CSPs on infrequent access tiers – all of the big three.” Potential use cases include accelerated in-storage data mining of video light sources, processing of video data, accelerated video analytics, and the selection and rescaling of weather data.
This is a young business that has bypassed the traditional startup funding development process. It has unique and promising technology, and we can expect to hear more from it in 2024 as its product strategy solidifies and its route to market becomes defined.
The Israeli Defense Force is using Combatica’s VR services to train its soldiers, and ionir Kubernetes data storage is part of the product offer.
Combatica is an Israeli startup that supplies XR (extended reality) training through Oculus-style virtual reality headsets allowing soldiers to simulate combat missions in virtual reality while actually moving around inside a big room. The headsets are fitted with sensors and not tethered to host systems, so trainees can move freely with the Combatica software synchronizing their positions relative to their colleagues and terrorist avatars.
Combatica VR headset training session with three trainees and the view seen through one of the users’ headsets
Erel Herzog, co-founder and chief strategy officer at Combatica, explained in a statement: “The Combatica-ionir partnership is a powerful combination. ionir’s specialization in transferring massive amounts of data will be crucial to efficiently manage our technology’s seamless and rapid deployment.”
Because trainees wear sensor-fitted headsets, there is no need for the training room to have sensors installed on its walls – enabling ordinary buildings to be used.
The ionir Kubernetes data services software, from the company formerly known as Reduxio, provides storage for stateful container apps, along with data protection, mobility, and deduplication services. It claims data can be transferred between Kubernetes clusters in the same or remote locations in 40 seconds or so, regardless of capacity or distance. The destination system user applications have virtually instant access to a fully functional read/write capable copy of the source volume.
The cloud and data services provided by ionir will enable Combatica to simulate any urban environment worldwide – accurately – and will help scale up its systems.
Jacob Cherian
Jacob Cherian, CEO of ionir, gushed: “We’re excited to showcase ionir’s Kubernetes solution with Combatica, which will allow Combatica to parallelize data management, to scale deployments seamlessly, and support multiple customer environments across public and private clouds with one integrated solution stack.”
Private equity firm C5 Capital is an investor in ionir, along with Jerusalem Venture Partners and others. C5 founder and CEO Andre Pienaar noted: “C5 Capital supports and invests in Israel’s technology sector for the long term. We are excited to help form new partnerships that will save the lives of soldiers and law enforcement officers combatting the threat of extremist terrorists and ensuring the safe return of the victims of terrorism worldwide.”
Bootnote
Combatica documentation says trainees can freely move around the virtual space – walking, running, or performing actions like they do in real life. The system allows for large teams to train together in the same scenario. The firm has developed its own system that allows users to operate their own service weapons in a safe manner, including recoil.
The instructor can easily control the scenario – changing the synthetic environment, the time of day or the weather, enemy behavior and more. A variety of scenarios and intelligent AI enemies are available. The system tracks the user’s actions, and ionir’s analytics tools deliver meaningful insights on the trainee’s or squad’s performance.
VAST Data has a 6 percent share of the datacenter all-flash array storage market, according to a chart based on IDC numbers. But IDC disputes the chart’s validity.
Update: IDC rebuts Zeoli’s and Flaningam’s use of its supposed numbers. 10 February 2024.
Microsoft Customer Success Account Manager and Generative Value newsletter publisher Eric Flaningam, a self-described value investor and tech analyst, provided the numbers and a chart in an X post. The chart shows how datacenter all-flash array (AFA) storage supplier market shares changed from 2015 to the first half of 2023. The market at that time was shared between Dell (28 percent), Pure Storage (13 percent), NetApp (12 percent), HPE (8 percent), Huawei (8 percent), IBM (8 percent), VAST Data (6 percent), and Others (17 percent). It used IDC numbers from a Data Gravity newsletter source.
Flaningam’s chart has a couple of errors: (1) It shows all-flash array market share, not “datacenter flash storage” share. (2) Its final column should be the first half of 2023, not the second half. See the bootnote below.
As VAST is a private business and doesn’t reveal its revenues, this public depiction of its AFA market share total and growth from 2020 is something of a revelation. It has told us before that it is growing fast, and this chart seems to support that claim.
Most of the other suppliers are US-based and public – save for Huawei – so we get to see their revenue numbers in quarterly reports. We can see that NetApp’s all-flash share rose quickly from 2015 to a 2018 peak and has trended down since, HPE’s and IBM’s shares have trended down, while Pure’s has trended up. Huawei showed a substantial rise from 2016 to 2021 and then flattened out at the 8 percent level.
Flaningam’s chart provides an opportunity to assess VAST’s revenues in the first half of 2023 by comparing its market share with other AFA suppliers whose storage revenues we do know. In the first half of 2023 Pure’s storage revenues were $1.46 billion and that equates to its 13 percent market share. This indicates that VAST’s 6 percent share equals just under half of Pure’s revenues in 2023’s first half, meaning $674 million?
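Spelling out that back-of-the-envelope arithmetic (using the article’s own assumptions, which IDC later disputed – see below):

$$\text{VAST H1 2023 revenue} \approx \$1.46\ \text{billion} \times \frac{6}{13} \approx \$0.674\ \text{billion}$$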
VAST CMO Jeff Denworth told B&F that a direct comparison between Pure and VAST is invalid because: “We don’t take all of the revenue since we don’t sell hardware (versus Pure, which is a hardware company) even though our software drives different hardware platforms. Revenue is also not something we account like hardware companies do. So, no, the metrics are not congruent. In general, though, there’s a significant amount of annualized software subscription revenue (we take multi-year contracts) and other third-party hardware partners who are benefiting from our ascent.”
Bootnote 1
Data Gravity is a subscription-based newsletter produced by Chris Zeoli, a partner at Wing Venture Capital. He reproduced an IDC chart in a December 2023 edition which provided all-flash TAM (Total Addressable Market) percentage numbers for Dell, HPE, Huawei, IBM, NetApp, Pure, VAST, and others from 2015 to the first half of 2023.
This is equivalent to each supplier’s all-flash array market share and was the basis for Flaningam’s storage market share chart. We turned Zeoli’s stacked bars into a line chart to show this.
Zeoli says that VAST has a business profile that “excels in cash flow and Annual Recurring Revenue (ARR), though may lag in live revenue.”
IDC Rebuttal
An IDC spokesperson said: “Both the stacked bar chart that Zeoli created and the line graph that Flaningam created (both of which appear in your story) are inaccurate. Our analysts could not figure out how Zeoli arrived at the figures presented in his original graphic (and the analysts assured me that these figures do not align to our All Flash Array market shares for 1H23). Our thinking is that the errors are due to the complete lack of a definition of what Zeoli means by ‘Data Center Flash Storage’. For the record, IDC does not currently track such a market, so those figures were created solely by Mr. Zeoli.”
A second point: “Another thing that raised our eyebrows is why anyone would include a software platform provider like VAST in a comparison of storage infrastructure (hardware) providers.”
And thirdly: “So, as you might guess, IDC cannot verify any of the data that appears in these graphics – the figures are all based on Mr. Zeoli’s “analysis”. But I think your article makes that fairly clear. And you did state that the charts were “based on” or “used” IDC data, which should imply to the reader that these are not a straight reading of IDC’s data. Unfortunately, not everyone caught that subtlety, but we thank you for that.”
As IDC was apparently the source for Zeoli’s analysis, which itself was the basis for Flaningam’s subsequent article, and IDC disputes the figures, both are wrong. That means our reasoning that VAST’s ‘market share’ of 6 percent equates to slightly less than half of Pure’s 13 percent, and hence that its H1 2023 revenues were around $674 million, is also wrong. We cannot conclude that VAST has a 6 percent share of the datacenter all-flash array market. Zeoli’s and Flaningam’s views are based on an inaccurate reading of IDC numbers, and our original version of this article was wrong as well.
Pure Storage has chopped up to 275 employees globally, marking another round of layoffs.
Update: Zeoli’s attribution of VAST having a 6 percent share of the datacenter all-flash array market removed. 10 February 2024.
We’re told the all-flash array supplier’s data protection, AI & analytics, database, alliances, and unstructured data areas are affected, and the reductions amount to some 4 percent of Pure’s workforce.
A spokesperson at Pure told Blocks and Files: “As Pure continues to scale and maintain a record of high growth and innovation, we recently completed a workforce rebalancing initiative to align our employees with company priorities and areas that are strategic to the business. The employees affected by this initiative were eligible to apply to open roles and also offered outplacement services to assist with their job search outside of Pure.”
This latest round of layoffs follows job cuts made by Pure in January and April last year.
Nasdaq-listed Pure announced a healthy set of results for its Q3 of fiscal 2024, ended November 5, with revenues up 13 percent year-on-year to $762.8 million, and a profit of $70.4 million recorded compared to a minor loss in the same period of the prior year. The Q4 outlook was more downbeat, with revenues of $782 million forecast, equating to a decline of 3.5 percent year-on-year.
A move toward more subscription-based business is affecting income on the one hand, and a delayed $41 million shipment to a 5G telco customer is another headwind.
We have heard from industry sources that Hammerspace is making inroads at Meta, a big FlashBlade customer, in the large language model training arena, using all-flash storage servers orchestrated by Hammerspace’s software. Meta, like other large web-scale customers, needs a POSIX-compliant file system and not an HTTP/REST object system, according to a source familiar with the situation.
AI training cluster diagram
Meta is building 1,000-node clusters with 100 Tbps bandwidth and doesn’t have an RDMA-capable network, we’re told. It uses layer 3 switching and cannot use NVMe-oF. Meta started with Pure Storage’s FlashBlade as the array, we’re told, but came to view it as limited; FlashBlade is also not a commercial off-the-shelf (COTS) system.
At this scale you need to save pennies on the dollar, and it was felt that Hammerspace’s relatively cheap Linux storage boxes could fit the bill. We’re also told VAST Data, which uses COTS kit with an internal NVMe-oF fabric, was expensive and wouldn’t scale past 30 nodes without performance falling off. VAST disputes this.
Veritas, which private equity house Carlyle bought from Symantec for $7.4 billion in 2015, supplies data protection and management products. It is regarded as a legacy data protection vendor, which has moved into cloud with Alta. The CEO is Greg Hughes. Cohesity, led by CEO Sanjay Poonen, was founded by Mohit Aron in 2013 and has raised $660 million in total funding. The company supplies data protection and cyber resilience software, provided as cloud services. It also has a CS5000 HCI data protection appliance product line.
Sanjay Poonen
“This deal will combine Cohesity’s speed and innovation with Veritas’ global presence and installed base,” said Poonen in a statement. “This combination will be a win-win for our collective 10,000 customers and 3,000 partners, and I can’t wait to work with the Veritas team to bring our vision to life.”
“We will lead the next era of AI-powered data security and management by bringing together the best of both product portfolios – Cohesity’s scale-out architecture ideally suited for modern workloads and strong Generative AI and security capabilities and Veritas’ broad workload support and significant global footprint, particularly in the Global 500 and large public sector agencies.”
Cohesity’s Turing initiative involves a generative AI partnership with Google Cloud and Vertex AI platform for building and scaling machine learning models. Cohesity is extending this initiative by working with AWS Bedrock to have business context-aware answers generated in response to user requests and questions.
Veritas’ data protection operation is understood to have a value of more than $3 billion, including debt. To pay for this, Cohesity has raised some $1 billion in equity and $2 billion in debt from an investment group including Haveli Investments, Premji Invest, and Madrona. The combined Cohesity-Veritas data protection business has a reported $7 billion valuation.
The new entity will have a $1.6 billion annual revenue run rate, the companies said.
The deal should close by the end of 2024, and Carlyle will roll over or retain its holding in the Veritas data protection business and join Cohesity’s board, along with Veritas CEO Hughes.
The remainder of Veritas’ business, said to include data compliance and backup services, will be placed in a new company by Carlyle. It will be called DataCo and will include Veritas’ InfoScale, Data Compliance, and Backup Exec businesses. It is interesting that Backup Exec is not included in the Cohesity-Veritas transaction, perhaps because it is regarded as an SME data protection product. A Cohesity spokesperson said: “I’ll just say that Cohesity is focused on the enterprise, and the businesses from Veritas that we are combining with keep us focused on the enterprise.” Lawrence Wong, presently SVP of Strategy and Products at Veritas, will head DataCo.
Commvault’s chief customer officer, Sarv Saravanan, said: “This deal between Cohesity and Veritas could create complete chaos for customers. Platform integration challenges and redundant product portfolios could take years to address. With cyberattacks increasing in severity and frequency, there’s no time for that. Customers need to know if they’re hit, they can recover fast. In today’s world, cyber resilience equals business resilience.”
The combined Cohesity-Veritas business has c10,000 customers and c3,000 partners with hundreds of exabytes of data under management.
DCIG proprietor Jerome Wendt opines: “This acquisition gives both companies what they want and enterprises what they need. Veritas NetBackup gets the next-generation technologies in the form of AI, hyperconvergence, and a cloud platform that it needs to stay relevant. Cohesity gains immediate access to enterprise accounts and the ability to set NetBackup’s future technology direction by owning NetBackup. Finally, enterprises obtain a technical path forward that does not require them to abandon their investment in NetBackup.”
Comment
This deal is a major reshaping of the data protection and cyber resiliency landscape. Poonen became Cohesity CEO in August 2022 and is putting his own stamp on the company with this merger; Cohesity is still technically a startup.
The data protection and cyber resiliency market is fragmented with roughly three groups of suppliers:
Fast-growing new arrivals: Cohesity, Druva, HYCU, Rubrik, and more
Mature players: Arcserve, Asigra, NAKIVO, HPE’s Zerto, and many others
Legacy incumbents: Veritas and other long-established vendors
Rubrik, a strong rival to Cohesity, is reportedly gunning for an IPO, and Veeam surely has an IPO in its future. Cohesity filed for an IPO in December 2021 but it did not come to fruition. Now it could position itself for one in the 2025/2026 time frame once the Veritas acquisition has been digested and the business has a healthy bottom line. It is probable that a combined Cohesity-Veritas data protection and cyber resilience business will become one of the largest players in the data protection marketplace. William Blair’s Jason Ader thinks it will be the leading company, larger than Veeam. It’s good news for IBM as its Storage Defender offering includes Cohesity’s data protection as an integral part of the product.
There will be a lot of business and technology integration and consolidation inside the new business. We have heard, for example, that Cohesity’s hyperconverged offering – the Cohesity Data Cloud and the Helios interface – will get Veritas software IP and be offered as an upgrade to the Veritas customer base. Obviously there will need to be a single management facility and perhaps a combined data protection map and weakness identification facility.
This is the first major consolidation move in the fragmented backup industry. It may prompt more jostling for position as other legacy incumbents look to join with aggressive and faster growing new players to secure their position in an evolving market.
Interview: Compared to file, object has simply become a protocol absorbed into an analysis solution, and proprietary GPUDirect will be followed by more generic multi-GPU-vendor technology. These are points made by Qumulo CTO Kiran Bhageshpur when asked about the relative status of object and file storage on the one hand, and the role of GPUDirect and object storage on the other. Qumulo supplies scalable, parallel-access file system software and services for datacenters, edge sites, and the cloud with a scale-anywhere philosophy. It supports the S3 protocol with Qumulo Shift, but not GPUDirect.
Kiran Bhageshpur
The surge of interest in AI means Qumulo is placed at the intersection of large unstructured data sets, GPU processing, and LLM training and inference. Nvidia’s GPUs dominate the LLM training market and there are two distinct populations of unstructured data – files and objects – on which these GPUs can feed. How does Bhageshpur view their relative status?
He said: “My firmly held opinion is that object in the cloud is an absolute first-class citizen … It is a fundamental primitive; S3 and EC2 and EBS were the first three services as a part of AWS back in 2006. And it is insanely durable. It is ridiculously available. It’s got fantastic throughput characteristics, not latency, but throughput. So therefore, it is impossible to overwhelm object in the cloud. It’s got a huge scale. They talk about zone-redundant storage, they talk about locally redundant storage. You could do replication as a part of the setup to a different region. It’s fantastic.”
But “object on-premises is a cruel joke.”
Qumulo supports the S3 protocol “simply because we support a protocol.”
On-premises object storage is a pale imitation of in-cloud object storage, he says. “Nobody has the capability to actually deliver that level of redundancy that exists in the cloud on-premises. And even if they do, it is tiny in terms of the number of people who can operationalize it. Think about zone-redundant storage, which is a basic S3 characteristic: they have three datacenters, all of them with separate power, cooling, and networking, and yet only one millisecond to one and a half milliseconds of latency between these datacenters. And they are essentially doing erasure coding of the object across these things. How many people on earth do you think outside the magnificent seven, and maybe another 10 to 12, can actually do this?”
The magnificent seven refers to Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), Alibaba Cloud, IBM Cloud, Oracle Cloud, and Salesforce.
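For readers less familiar with the mechanics he is describing, here is a minimal, illustrative Python sketch (not Qumulo or AWS code) of the zone-redundant idea: erasure-code an object into two data shards plus an XOR parity shard, place one shard in each of three zones, and rebuild the object even if any single zone is lost. Production services use wider Reed-Solomon codes, but the principle is the same.

```python
# Toy erasure coding across three zones: two data shards plus one XOR parity
# shard, so the object survives the loss of any single zone/datacenter.
def split_into_shards(obj: bytes):
    half = (len(obj) + 1) // 2
    a = obj[:half]
    b = obj[half:].ljust(half, b"\x00")      # pad second shard to equal length
    parity = bytes(x ^ y for x, y in zip(a, b))
    return {"zone-1": a, "zone-2": b, "zone-3": parity}, len(obj)

def rebuild(shards: dict, original_len: int) -> bytes:
    a, b, p = shards.get("zone-1"), shards.get("zone-2"), shards.get("zone-3")
    if a is None:                            # zone-1 lost: a = b XOR parity
        a = bytes(x ^ y for x, y in zip(b, p))
    if b is None:                            # zone-2 lost: b = a XOR parity
        b = bytes(x ^ y for x, y in zip(a, p))
    return (a + b)[:original_len]            # drop the padding

payload = b"object payload written once, readable after a zone failure"
shards, n = split_into_shards(payload)
del shards["zone-2"]                         # simulate losing an entire datacenter
assert rebuild(shards, n) == payload
print("rebuilt OK after losing one of three zones")
```

The hard part, as Bhageshpur argues, is not the coding scheme but operating three datacenters close enough together to keep inter-zone latency at a millisecond or so.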
General enterprise datacenters may have “three datacenters in three different states. Now your latencies are 30, 40, 50, 60, 70 milliseconds; the experience is crappy. And now you get the sort of opinion that it is slow.”
But “then you will have people like Pure and VAST come along and say, hey, it’s fast object. But, if you look at the use cases, it’s not quite there. Now, let me flip that around to something else and talk about file.”
“Enterprise file, whether it is Dell EMC Isilon, or Qumulo or NetApp, is world class. It is fantastic. It is scalable, it’s reliable, it is affordable. It’s great. I’ll even throw in VAST and give them credit; it’s really good service on premises.”
However, he says, “file in the cloud … is a bit of a cruel joke, until we came along with Azure Native Qumulo.”
“Think about it. EFS is just NFS highly scalable, good shared storage. But it’s got none of the features, it’s got no … quotas and replication and snapshots … Azure NetApp Files is actually hardware in the cloud datacenter … If you look at all of these, they are 10 to 20 times more expensive than what is available on-premises.”
“Look at the price point on, I just pick on Azure NetApp. The premium tier, the ultra premium fast thing on Azure NetApp Files, is over $300 per terabyte per month. This is publicly available.”
Azure NetApp Files pricing for West US 2 region. It’s $392.74/TiB/month for single encryption, which equates to $357.19/TB/month.
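For anyone checking that conversion: a tebibyte (2^40 bytes) is roughly 10 percent larger than a terabyte (10^12 bytes), so the per-TB price is the per-TiB price divided by that ratio. A quick sketch using the figures quoted above:

```python
# Sanity check of the per-TiB to per-TB price conversion quoted in the caption.
price_per_tib = 392.74                  # $/TiB/month, as listed for the Ultra tier
tb_per_tib = 2**40 / 10**12             # 1 TiB = 1.0995... TB
price_per_tb = price_per_tib / tb_per_tib
print(f"${price_per_tb:.2f}/TB/month")  # prints $357.19/TB/month
```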
Bhageshpur asserts: “It still is a robust business … And that’s because it is speaking to a palpable need for file. So the whole thought process that file is going to die because of object was nonsensical, and everybody is recognizing that including the cloud guys.”
“That’s why people like Microsoft have partnered very closely with us with Azure Native Qumulo as a service out there. And if you think about it, it’s going to have all the performance of file on premises, it’s got all the elasticity of the cloud in every possible way, when you’re paying for only consumed services, whether it is performance or capacity. So you’ve got the best of both worlds.”
“One of my colleagues, who spent a lot of time at Dell and AWS, and is now with us running our file business, Brandon Whitelaw, loves to say that if you take any of the file services in the cloud, wrap, you know, sheet metal around that and try to sell it to an on-premises buyer, they will laugh at that, because it’s simply not better from a features, functionality, and cost point of view.”
In his view: “When we bring in a unified service, which has got the best of both worlds, all the power and POSIX compliance of file with the cost elasticity of object, that’s where it changes the game. That’s my take on that one.”
“There are niche use cases for object on-premises, such as being an S3 backup target for Veeam or Rubrik, running a Kafka workflow or a … Spark workflow on-premises in conjunction with the cloud offering.”
But object on-premises has not had a great run. “Nobody has had a big outcome … There’s no big standalone object company. Is it Scality? Is it Cloudian? MinIO has come along with an open source play, which is kind of clever … In fact, object has simply become a protocol absorbed into an analysis solution … We have object, Pure supports object, VAST supports object. It is just a protocol that we think about. It’s a standard version of an HTTP protocol. That’s all, OK?”
In summary, object on-premises is a cruel joke and basically just protocol support. File in the cloud is a cruel joke, apart from Azure Native Qumulo.
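Bhageshpur’s “object is just a protocol” point is easy to see in practice: the same handful of S3 client calls work against AWS or against any S3-compatible on-premises store simply by changing the endpoint URL and credentials. A minimal, hypothetical sketch using boto3 (the endpoint, bucket, and credentials below are made up):

```python
# The S3 "protocol" in a nutshell: HTTP PUT and GET against whichever
# S3-compatible endpoint you point the client at, cloud or on-premises.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.onprem.example.com",  # hypothetical on-prem S3 gateway
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"just HTTP underneath")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```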
GPUs and GPUDirect
We discussed sending unstructured data to GPU servers, specifically Nvidia GPUs. Unstructured data basically lives in file or object repositories. Nvidia has its proprietary GPUDirect protocol for getting file data held on NVMe SSDs directly from storage into its GPU servers, with no time-consuming host server CPU intervention. Can we envisage a GPUDirect-type arrangement for objects? Bhageshpur argues that we should not consider developing such a protocol.
“Let me articulate what I think is happening, right? Nvidia is the sexiest company on the face of this planet. Today, it’s joined Amazon and Apple and Microsoft, in being one.”
“GPUDirect is not a technically complicated thing. Especially for a product like Qumulo, which is a user space protocol running on the richness of file. Supporting RDMA is trivial to go do, right? But also, the thing that you gotta go ask yourself, and we ask ourselves this very, very seriously all the time, is: what is the problem? And for which customers?”
“GPUDirect is an Nvidia thing. Does that work with Intel’s Gaudi 2 architecture? No. Does that work with the Microsoft Maia chipset coming out? What happens when there is the next-generation silicon in AWS?”
In other words, he appears to be saying, GPUDirect and Mellanox Ethernet NICs are proprietary lock-in technologies.
Bhageshpur asks: “What’s the generic solution to get data out there, independent of file, object, or whatever else? That is the generic problem to solve.”
He admits: “There are a modest number of players on-premises inside the Nvidia GPU infrastructure, which have just built out massive amounts of Nvidia GPU farms and they probably have GPUDirect storage, whether it is NetApp, or Pure, or Dell or VAST behind that … And there are small numbers of very large customers.”
But this is not general across the market, although big numbers are involved. “The reality for AI in general is I’ve talked to a lot of folks [and], by and large, broadly, it’s very much in the experimental stage. I mean, ChatGPT gets our attention, Google Bard gets our attention … ServiceNow announced that it’s had a billion dollars in additional revenue in the last year because of their AI integration.”
The use case is almost trivial in one way. “If you look at what they have done, it’s pretty straightforward. It is a modern version of a better chatbot in front of it, right? … You’re going to do the things in your workflow engine that you would otherwise have very complex commands and connect things up [to do]. But that’s powerful. That’s really useful.”
However: “It’s in a highly experimental stage. We find a lot of the experimentation is going on either at small scale at a desktop level … or it’s happening in the cloud, not counting the small number of people who do have large farms out there.”
Bhageshpur notes: “It’s not very obvious what the thing is, and whether there is actually going to be everybody doing the training. Is training going to be something that, say, a generic Fortune 1000 company does, or do they consume the output of that as a packaged service? I think the jury’s still out.”
While the jury is still out, “our focus is really about data and scaling anywhere. Do we have the solution for our customers on premises on any of their infrastructure? Do they get the exact same file and data services in the cloud, any of the clouds that they may choose to go? What about out at the edge [where] they have completely different requirements? Can they exist and operate a single data plane across this, both from a control as well as data visibility and data access point of view?”
Focusing on this, he says: “From a data point of view, getting that to be a really robust solution across, is going to allow us as a company to intersect what the AI trend lines do, no matter which way it goes.”
This is because “inference at the HQ, you have a solution out there. Training in the datacenter, you got a solution out there. You’re gonna go training and inference in the cloud, you got a solution for data out there. How do you connect the GPUs to the data no matter where it is, is the problem we want to solve rather than a point solution.”
He wants us to realize, though, that GPUDirect (NFS over RDMA) has validity. “Don’t get me wrong. That’s not hard to do. And it’s not unreasonable for us to do it. But it’s also, I know, there’s so much talk about it that everybody thinks that’s the cat’s meow and that’s the thing which is most magically required. And it’s an important, but not indispensable, part of what strategy you should be [looking at] going forward.”
Our sense is that Qumulo would like there to be fast file data access technology covering GPUs in general – Nvidia, Intel, AMD, Microsoft, Amazon, and Google ones. And it thinks supplying file data fast for LLM training and inference in the public cloud, at edge sites and in datacenters is also going to be needed.
We think that, until the GPU hardware players come up with a standard interface, a kind of multi-vendor GPUDirect that can be used to put data in any GPU’s memory, then supporting very large GPU players with supplier technology-specific software will be largely inevitable. It is not in Nvidia’s interest to enable GPUDirect-type support for Intel’s Gaudi 2, AMD Radeon or Microsoft’s Maia chips.
GPUDirect and S3
But could we have a GPUDirect for S3? Bhageshpur thinks not. “There is no such thing as S3 over RDMA, because, essentially, GPUDirect is NFS over RDMA.”
Also, the big CSPs face another problem here. He says: “Let me talk about one more reason why RDMA is overbought. You need network card support to make that happen, right? It’s not generic on anything. You need the network interface to be able to support RDMA to be able to go make it work.” He rhetorically asks: “Guess which is the most prominent networking card which supports RDMA? Do you want to take a guess on which one it is?”
I said Ethernet ones and he almost pounced on this answer: “Who makes them? Mellanox. And who owns Mellanox? Nvidia.”
“If you look at the clouds, various clouds in various regions have pockets where they have Mellanox interfaces on their CPU side that you can go leverage. But it’s various pockets, and it’s kind of scattered; it’s not standard. You’re not going to see Mellanox type RDMA-enabled Ethernet cards ubiquitously available in the cloud for a long time … We know that getting into every other place means that these guys have to go through a sort of hardware refresh cycle, which takes years, you know, to have ubiquitous availability.”
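His point about RDMA-capable NICs not being ubiquitous is easy to check on a Linux host: RDMA devices, whether InfiniBand or RoCE Ethernet (typically Mellanox/Nvidia ConnectX cards), register under /sys/class/infiniband, and on most general-purpose cloud instances that directory is empty or absent. A small illustrative sketch:

```python
# Illustrative check (Linux only): list RDMA-capable devices registered with
# the kernel. No entries means no RDMA transport is available for
# NFS-over-RDMA-style data paths on this host.
import os

RDMA_SYSFS = "/sys/class/infiniband"
devices = sorted(os.listdir(RDMA_SYSFS)) if os.path.isdir(RDMA_SYSFS) else []
print("RDMA devices:", ", ".join(devices) if devices else "none found")
```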
“There are going to emerge different technologies other than GPUDirect, especially in the cloud, to bring data faster from where it lives closer to the GPU.” He wouldn’t be drawn on any details about this.
Bhageshpur thinks that the compound effect of individual productivity enhancements from AI could be huge.
“The real thing to think about is how many people are going to train foundational models? I would argue it is extremely small. How many people are going to update foundational models with specific data? I would think that is also small. It’s not going to be hundreds of thousands. And then how many people are going to consume that? It’s a lot, right?”
“One analogy I’ll make is, how many people build their own distributed file system? If you’re a company, not too many. You look to somebody else to solve that infrastructure problem, because it’s hard. It’s kind of the same thing with foundational models. Training foundational models is difficult … How is the generic enterprise going to go update that?”
He doesn’t think it will, and the main use for AI LLMs is “all about individual productivity, which is why, if you look at the focus, the big part of Azure is a Copilot for everything, which is: let us use our technology to make an individual inside a corporation more productive.”
He says: “That’s exactly what ServiceNow has also done. And by the way, to some extent, maybe it’s correlation and not causation, but it shows up in terms of the US, which is the dominant adopter of all of these things, having had this huge productivity spike in the last 18 to 24 months.”
The answer to the GenAI and ChatGPT question “isn’t going to be as simple and as reductionist as, oh my God, AGI and Terminator, and Skynet … That’s a joke, it doesn’t work. It’s going to be really compelling in mundane ways. It is going to make everybody a little bit more productive.”
“If you’re seeing 5, 10, 20 percent improvement in productivity among your workforce, that’s huge. And people are willing to pay for that.”
And Qumulo’s function? “All the focus in our tiny, teeny little role here is: how do we help our customers who have all of that data stored on us to go take advantage of the [AI] services, build the appropriate connection such that the jobs are done much easier.”