
Backtracking Amazon DynamoDB: Clumio pushes recovery play

Clumio Backtrack enables partition-level Amazon DynamoDB recovery back to a point in time with no database reconfiguration.

DynamoDB is AWS' widely used, fully managed NoSQL database service. AWS says it serves hundreds of customers with table sizes exceeding 200 TB and processes over one billion requests per hour. Clumio is the AWS data protection business acquired by Commvault, built around an incremental-forever backup model. The company previously announced Clumio Backtrack for Amazon Simple Storage Service (S3), which enables customer admins to quickly revert billions of S3 objects – or pieces of data – to a specific version at a specific point in time. Now Backtrack has been extended to cover DynamoDB and do the same thing with table partitions.

Woon Jung, Commvault's CTO for cloud native, stated: “Clumio Backtrack removes the friction and risk from database recovery. Now, teams can roll back or restore their data in minutes, not days, and without needing to perform complex, multi-step manual recoveries.”

The company says DynamoDB deployments contain complex tables, often with multiple partitions per table, that are constantly updated by applications and microservices. Such tables can contain billions of records and terabytes of data. If there is an outage or other disruption, admins would like to recover just the affected partitions but, Clumio says, they generally have to restore all of the data to a new table, copy the impacted items back, and then delete the new table. This can be a slow and tedious process entailing database downtime.
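To make that concrete, here is a rough boto3 sketch of the multi-step manual path Clumio describes, using DynamoDB's point-in-time recovery API. The table names, key, and restore timestamp are hypothetical, and a real script would paginate the query, batch the writes, and handle errors.

from datetime import datetime, timezone

import boto3

ddb = boto3.client("dynamodb")

# 1. Restore the whole table, as of the chosen point in time, to a temporary table.
ddb.restore_table_to_point_in_time(
    SourceTableName="orders",
    TargetTableName="orders-restore-tmp",
    RestoreDateTime=datetime(2025, 7, 20, 3, 0, tzinfo=timezone.utc),
)
ddb.get_waiter("table_exists").wait(TableName="orders-restore-tmp")

# 2. Copy just the impacted items (here, a single partition key) back to the live table.
restored = ddb.query(
    TableName="orders-restore-tmp",
    KeyConditionExpression="pk = :pk",
    ExpressionAttributeValues={":pk": {"S": "CUSTOMER#1234"}},
)
for item in restored["Items"]:
    ddb.put_item(TableName="orders", Item=item)

# 3. Throw away the temporary restored table.
ddb.delete_table(TableName="orders-restore-tmp")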

Clumio Backtrack for DynamoDB screen grab.

Clumio Backtrack fixes this by enabling DynamoDB admins to near-instantly revert existing DynamoDB tables to a prior point in time with no reconfiguration needed. They can also recover individual partitions instead of entire tables, cutting both recovery times and recovery costs.

Perhaps we can look forward to Backtrack being extended again and covering AWS’ Aurora relational database.

Clumio Backtrack for DynamoDB is now available globally via the AWS Marketplace and pricing is consumption-based. Find out more about Clumio and its offerings here.

Comprehensive solutions from HPE for safeguarding your data

HPE data protection portfolio looks after your digital assets no matter what threats may come

In today’s digital landscape, organizations face relentless threats and disruptions that can compromise their data integrity and operational continuity. As hybrid IT environments grow more complex and cyberattacks rise, organizations need a robust, agile, and intelligent approach to data protection.

HPE’s data protection solutions are a good basis for a data protection strategy that safeguards critical workloads, ensures cyber resilience, and accelerates recovery across all your environments from on-premises to cloud.

Navigating data protection challenges

Ransomware hit 59 percent of organizations last year, with 94 percent of those attacks targeting backup environments. These attacks disrupt operations and finances, costing $4.9 million per attack on average.

Companies facing down these threats must also rethink how they protect data as data storage and workloads evolve. Data now stretches across edge devices, datacenters, and the cloud. Workloads such as SaaS, cloud-native apps, AI stacks, and big data analytics have changed the game. More fragmented environments have created serious challenges for data protection, and more stringent regulations have increased the risk of missteps.

Reimagining data protection

HPE’s holistic approach to data protection secures every application while meeting business cost and performance needs. It includes mitigating infrastructure failures, automating protection to reduce human error, and planning for both bad actors and disasters simultaneously.

HPE data protection solutions cover any workload, any location, and any SLA requirement, ensuring recoverability across hybrid and multi-cloud environments.

Strengthen your data protection strategy with HPE

Cyber resilience

This is critical to maintaining business continuity and minimizing data loss when cyberattacks happen. HPE’s cyber resilience capabilities are integral to its holistic approach, providing strong access controls, comprehensive encryption, policy-driven protection, and air-gapped architectures. Real-time anomaly detection and analytics guide protection and recovery decisions, providing swift protective actions and rapid restoration of trusted data.

High performance

Speed is crucial in modern data protection because every minute of downtime can result in lost revenue, damaged reputation, and operational setbacks. With HPE’s high-throughput backup and restore speed, you can quickly recover critical data and resume operations after an incident.

Fast failover and failback ensure minimal disruption, while efficient data reduction optimizes storage and reduces costs. Additionally, HPE’s cyberforensics capabilities enable quick identification of safe data for restoration, further reducing recovery time and ensuring your business stays up and running.

An integrated ecosystem

Managing data security across diverse environments can be a daunting task. Many organizations struggle with fragmented data protection solutions that lead to inefficiencies and vulnerabilities. HPE’s integrated ecosystem addresses these pain points by providing seamless data security from the point of creation (source) to its final destination for backup or long-term retention (target).

With intelligent data movers facilitating secure and efficient transfer in between, you can ensure that your data is protected at every stage. Additionally, HPE integrates best-in-class technologies from strategic software partners such as Commvault, Veeam, and Cohesity to achieve comprehensive data protection and eliminate the complexities of managing multiple solutions.

HPE solutions to secure your business data


HPE offers solutions to address all data protection needs across the data lifecycle, starting with creation of data at the source. This includes HPE Alletra Storage MP B10000 for structured data storage with high availability, multi-factor authentication, encryption, anomaly detection, and immutable snapshots. For unstructured data, HPE Alletra Storage MP X10000 provides highly available, fast unstructured data object storage with data protection accelerator nodes, encryption, immutability, and versioning.

After data is written to storage, continuous data protection with HPE Zerto Software provides ultra-low recovery point objectives (RPO) and recovery time objectives (RTO). HPE Zerto unlocks granular, journal-based recovery of virtualized and cloud applications with features such as real-time anomaly detection, orchestrated failover and testing, and application-centric recovery. Pair HPE Zerto with backup software and infrastructure that extends protection to all applications across your entire data estate. HPE StoreOnce is a high-performance purpose-built backup appliance featuring multi-factor authentication, encryption, immutable backups, and fast, cost-efficient backups and restores. Additionally, HPE offers integrated solutions with Commvault, Veeam, and Cohesity for backup and cyber resilience no matter where or how your workloads are deployed.

For the ultimate in cyber-protection, the HPE Cyber Resilience Vault enables rapid air-gapped recovery after even the worst cyber-attack. A full-stack solution, it offers isolated and immutable copies of your data in an offline clean room for recovery. It is built atop well-established offerings for storage (HPE Alletra Storage MP), compute (HPE ProLiant), networking (HPE Aruba), and cyber recovery (HPE Zerto).

Lastly, as data ages and the need for long-term retention stretches into years, HPE Cloud Bank Storage and HPE Storage Tape provide cost-effective storage options. HPE Cloud Bank Storage enables long-term cloud storage with multi-factor authentication, encryption, and immutability. HPE Storage Tape serves as a low-cost, air-gapped, offline backup and archive repository with immutability and encryption.

HPE’s data protection portfolio offers a multi-layered, zero-trust approach addressing the evolving challenges faced by modern enterprises.

HPE ensures robust data protection and business resilience by integrating best-in-class technologies to secure every application, meeting business cost and performance needs.

With solutions designed to protect against cyberthreats, disruptions, and regulatory complexities, HPE empowers organizations to safeguard their data and maintain operational continuity in an increasingly complex digital landscape.

Learn more about HPE’s data protection products.

Contributed by HPE.

ExaGrid reports growth in deduped backup biz

ExaGrid says it has had another storming quarter, with the privately held company reporting its 18th consecutive growth quarter selling tiered deduplicating backup target appliances that compete with systems from Dell and HPE.

The company’s disk-based appliances have a non-dedupe backup ingest landing zone, for fast restores, and a Repository Tier zone with cluster-wide deduplication. System features include a Retention Time-Lock that includes a non-network-facing tier (creating a tiered air gap), delayed deletes and immutability for ransomware recovery. The business, which did not reveal any figures, claimed it made record revenue in the second quarter, adding over 140 new customers with the average deal size increasing.

Bill Andrews

President and CEO Bill Andrews stated: “Revenue and EBITDA continue to grow year over year. We are the largest independent backup storage vendor, and we continue to add support and integration with more backup applications which offers investment protection to our existing customers and provides even more potential for top line growth.”

“We’ve hit well over 4,700 active customer installations worldwide. ExaGrid continues to have an over 70 percent competitive win rate (74.1 percent in Q2) replacing primary storage behind the backup application, as well as inline deduplication appliances such as Dell Data Domain and HPE StoreOnce.”

The company has been FCF positive, P&L positive, and EBITDA positive for the past 18 quarters and has zero debt. We estimate it had 112 six- and seven-figure deals in the quarter. Half of its business this quarter came from new customers, and over 50 percent of its overall business came from outside the USA. ExaGrid said it is adding over 14 more sales regions worldwide.

Its appliances support more than 25 backup applications and utilities, including Veeam, Commvault, NetBackup, HYCU, Oracle RMAN direct, Rubrik, and SQL Dumps direct, with support for Cohesity DataProtect due in the first half of 2026.

Recently, other backup target appliance companies, such as Dell and Quantum, have been adding all-flash configurations. Such all-flash appliances can provide up to 130 TB/hour throughput in Dell's case, which puts ExaGrid in the slow lane with its 20 TB/hour (EX189 product). We would expect ExaGrid to have hybrid flash-disk and/or all-flash appliance developments ongoing, with an announcement in the next 12 months and possibly sooner.

Backblaze adds one-click legal hold to enterprise backup service

Cloud storage provider Backblaze is providing a simple on/off switch to set up a legal hold on files stored in its vaults.

It has introduced Legal Hold for Backblaze Computer Backup with Enterprise Control to preserve a user's entire backup, including every historical version captured, with a single click. Legal holds may be necessary when an organization becomes involved in litigation, requiring potentially relevant information to be preserved unchanged for legal proceedings. A hold can be legally required even when an organization reasonably anticipates litigation. Implementing such legal freezes on stored data, and finding out which files are relevant and need to be placed in an immutable state, can be time-consuming. It's extra work and cost, with potential financial penalties for non-compliance.

Gleb Budman, Backblaze

Backblaze CEO Gleb Budman stated: “With Legal Hold, information stays secure and immediately accessible when the stakes are highest. Reliable preservation of data reduces exposure to fines and sanctions, giving organizations fast, predictable compliance support without extra software, hardware, or surprise fees.” 

The update features are:

  • Administrators can instantly activate Legal Hold in the Enterprise Control console without additional hardware or software.
  • It runs silently in the background without downtime, throttling, or notifications.
  • It offers unlimited version retention without additional fees.
  • File retrieval is instant and free of charge, with optional encrypted drive delivery via courier.
  • Data is secure by default with encryption at rest and in transit, with optional private-key encryption available.

Data management suppliers such as Datadobi and Komprise offer legal hold functionality for data under their management. Cloud file services supplier CTERA does the same, as do several others.

Legal Hold is available now for all Backblaze Computer Backup with Enterprise Control customers at no additional cost. New users can try it out with a free 15-day trial, accessible here.

SK hynix revenues go sky high with HBM

High Bandwidth Memory has become a high revenue earner for Korea's SK hynix as second quarter 2025 revenues jumped 35.4 percent year-on-year to ₩22.23 trillion ($16.23 billion).

There was a ₩7 trillion ($5.1 billion) profit, up 69.8 percent annually but down 14 percent sequentially, and the company is amassing cash, with its cash and cash equivalents increasing to ₩17 trillion ($12.4 billion) at the end of June, up by ₩2.7 trillion ($1.97 billion) from the prior quarter. Its debt ratio and net debt ratio stood at 25 percent and 6 percent, respectively, as net debt fell by ₩4.1 trillion ($3 billion) compared with the previous quarter. HBM chips provided 77 percent of its second quarter revenue.

Song Hyun Jong, SK hynix President and Head of Corporate Center, said in the earnings call: “Demand for AI memory continued to grow, driven by aggressive AI investments from big tech companies.” 

SK hynix said its customers plan to launch new products in the second half and it expects to double 2025 HBM revenues compared to 2024. Increasing competition among hi-tech companies to enhance AI model inferencing will lead to higher demand for high-performance and high-capacity memory products. Ongoing investments by nations to build sovereign AI will also help, with Song Hyun Jong saying: “Additionally, ongoing investments by governments and corporations for sovereign AI are likely to become a new long-term driver of AI memory demand.”

AI looks to be the gift that just keeps on giving, as AI training needs GPUs with HBM and DRAM, AI inferencing needs servers with GPUs plus HBM and DRAM, and AI PCs, notebooks, tablets, and smartphones will need DRAM and NAND.

The company will start providing an LPDDR-based DRAM module for servers within this year, and prepare GDDR7 products for AI GPUs with an expanded capacity of 24 Gb, up from 16 Gb. HBM4 will come later this year. Song Hyun Jong said: “We are on track to meet our goal as a Full Stack AI Memory Provider satisfying customers and leading market expansion through timely launch of products with best-in-class quality and performance required by the AI ecosystem.”

SK hynix Q2 CY2025 revenue by product and application.

NAND looks a tad less exciting, as SK hynix will “maintain a prudent stance for investments considering demand conditions and profitability-first discipline, while continuing with product developments in preparation for improvements in market conditions.” It will expand sales of QLC-based high-capacity eSSDs and build a 321-layer NAND product portfolio.

Kioxia/Sandisk's BiCS10 NAND has 332 layers, with Samsung's V10 NAND promising 400+, up from its current (V9) 286 layers. Micron is at 276 with its G9 NAND and YMTC at 232 with its Gen5 technology. SK hynix's Solidigm subsidiary has a 192-layer product and looks in need of a substantial layer count jump if it is to stay relevant. It builds its NAND with a different fab and design from SK hynix and, unless it adopts SK hynix's 321-layer tech, will have to devise its own at considerable expense. Whichever route it takes, adopt SK hynix tech or go it alone, it will have to equip fabs with new tools. Using SK hynix tech would potentially save a lot of development expense, and a 321-layer product would give it an up-front 67 percent increase in die capacity over its 192-layer product.

Overall SK hynix has grown its share of the DRAM and NAND market, with Counterpoint Research estimating it is now neck and neck with Samsung for combined DRAM and NAND revenues in the second quarter at the $15.5 billion mark.

Counterpoint Senior Analyst Jeongku Choi said, “SK hynix saw its largest-ever quarterly loss (₩3.4 trillion or $2.7 billion) in Q1 2023, prompting painful decisions such as production cuts. However, backed by world-class technology, the company began a remarkable turnaround, kick-started by the world’s first mass production of HBM3E in Q1 2024. By Q1 2025, SK hynix had claimed the top spot in global DRAM revenue, and just one quarter later, it is now competing head-to-head with Samsung for leadership in the overall memory market.”

MinIO gives object storage a GenAI upgrade with Iceberg

Interview: MinIO is building a linking layer between its object store and GenAI, and that involves embracing structure in the form of Iceberg tables. It also entails building linkage between those tables and the hitherto vector-focused GenAI models and agents.

We talked to MinIO co-founder and co-CEO AB Periasamy and Erik Frieberg, his CMO, to explore this topic. It’s fairly complex and we’ll cover it in two parts, with this being part one. The interview has been edited for brevity.

Blocks & Files: What’s the main thing affecting MinIO presently?

AB Periasamy

AB Periasamy: The best thing that happened to us is GenAI and in every enterprise the budget is now towards GenAI and the race is starting to shift towards data. Who has more data and how to put data to use? That is directly contributing to our company’s growth. Because once you start scaling your data and AI data infrastructure, it points to object store and customers are also looking beyond cloud. Cloud is already behind GenAI and the scale of data is pushing the customers to good private cloud and the private cloud object store, everything is in our backyard. We created that market outside of AWS and that’s contributing to our growth. And you can see the company has been aggressively growing hiring in on all sites, engineering, like marketing, sales, even. We launched a government BU and are launching a partner program. Overall, the company is going through an upgrade and aggressive growth.

Blocks & Files: How many people work at MinIO now?

AB Periasamy: I stop counting, Erik. How much? It is like 150-plus.

Erik Frieberg: I would say 160 something now. It’ll be 170-ish next week.

Blocks & Files: What’s happening with MinIO marketing?

Erik Frieberg

Erik Frieberg: I joined a little while ago to run marketing. The go-to-market organization a month before I joined in November was dramatically different than it is today. I think, that’s no joke, eight or nine times of what it was a year ago.

Blocks & Files: Why the strategy to have that built up so quickly?

AB Periasamy: I made the decisions. I’ll just give the high level view that, even though we are hiring aggressively relative to others, we are very prudent about hiring because, in this AI world, we have to think about hiring very differently. All the junior-level jobs today, machines can do so much better and most other companies hire aggressively and then fire aggressively. 

For us, we treat our team like family and it’s so important to take your time, be prudent about every one of those who you bring on board. Two years from now, five years from now, are they going to be with the family and are they going to be more productive because they can combine themselves with AI? That’s the only hope. Human plus machine is going to be better than human and machine in isolation. 

So be prudent about bringing the right people in who will fill the holes we have inside the organization. That’s how we are looking at it. So, in many ways, I actually look at this as this is not fast enough, but I would rather hire less and not hire anybody than hire aggressively.

Erik Frieberg: I’m going to say three things. One is I would actually say AB probably started hiring too late, so this should have been on the rise already. Second is the difference of having AIStor. I come from an open source background. I was at MongoDB where there was no commercial differentiation and it makes a big difference. There’s always only so much support and services you can sell to customers. 

You’ve got to have differentiated production software that has features that are a catalyst for growth. And that [AIStor] came out in the October-ish time frame and I think people are now really understanding it, seeing the difference. You’re going to see some announcements coming in the future on this, so that’s creating more opportunity. And then like AB said, it’s not a target. It’s not that he’s saying: “Go hire five people in marketing.” He’s saying: “Hey, we need more product marketing now because we have more to talk about.”

Blocks & Files: AB, MinIO is a storage company and for a customer of yours who is using generative AI, large language models and agents, there’s a software stack between the storage layer and their agents and large language models. So where in that stack do you see MinIO’s activities ending? Where’s the border and is it moving up the stack?

AB Periasamy: Yes. In fact, the story we have been telling all along resonates more with the customers now than the past. We always said MinIO object store is a key value store. What is a key value store? It’s a data store. We are closer to Oracle than EMC in the past. The data store is the dumbest of all databases. It is a key value store with get and put objects, but it needs to process data at massive scale without losing a single object. It has to be transaction ACID-compliant, all of that and scale, while it has the simplicity at its roots. So we always thought of ourselves as a database company, except that we are talking about unstructured data. Things are starting to evolve there because of GenAI.

We brought Erik in because he came from MongoDB. Our culture is closer to MongoDB than Pure or NetApp or MCA software company and in modern times [with] GenAI, the scale is much larger. The only way you can go is scale-out and then go software-defined. There are some interesting things that are happening [with] the object store and the layer above Gen AI, where the integration is going on.

The AI is directly talking to the object store through the MCP server. It’s the agents that are interacting with the object store, both in terms of administration as well as the data discovery and dealing with the data itself. There are some interesting challenges that are emerging here. If it’s a spreadsheet, if it’s a CSV file, single object? Say I have a database file, can I analyze this? We are talking about enterprises having 10 petabyte tables; data sitting in Iceberg tables. In the past it was proprietary data sitting in proprietary databases. They are increasingly now moving towards Iceberg open table format.

How can you have large language models understand that scale of data? That’s a bridging factor that we need to work on and those are the areas that we are actually actively investing in as well. The model may still be running on OpenAI or Anthropic, but the enterprise data never leaves their facility. How do they discover the data? Only the metadata or an anonymized form of the intelligence is shared. We are actively working on that part of the stack.

There's also another interesting twist here that, if it's unstructured data like blobs, photos, videos, documents, then everybody knows how LLMs take advantage of them. The new emerging area is structured data, structured data like Iceberg tables, and at large scale. How can LLMs understand structured data?

Comment 

The point AB is making is that Iceberg tables are structured and it makes no sense to vectorize them as, unlike words or images, they don’t stand alone with dimensional aspects that you can vectorize. There needs to be some intervening logic/abstraction between the tables and the GenAI LLMs and agents that bridges their unstructured data vector focus and the searching of structured data. We’ll explore this more in part two.

The Iceberg age: StarTree latest to adopt popular table format

Iceberg is becoming the lingua franca of datalake table formats with StarTree the latest supplier to embrace it as a real-time backend.

Open source Iceberg is an open table format for large-scale analytics which doesn't store data or execute queries itself, as a traditional database would. It operates as a software layer atop file formats like Parquet, ORC, and Avro, and cloud object stores such as AWS S3, Azure Blob, and Google Cloud Storage, and handles metadata, partitioning, and schema evolution. Iceberg provides ACID transactions, schema versioning, and time travel, with data querying and processing handled by separate software such as Apache Flink, Presto, Spark, Trino, and other analytics engines.
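To show where a separate engine fits, here is a minimal PySpark sketch that reads an Iceberg table through Spark's Iceberg integration, including a time-travel query. It assumes Spark 3.x with the Iceberg runtime package available; the catalog name, warehouse path, table, and version numbers are illustrative, not anything StarTree-specific.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Pull in the Iceberg Spark runtime (version is illustrative)
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # Register an Iceberg catalog named "demo" backed by an object store path
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3a://example-bucket/warehouse")
    .getOrCreate()
)

# Query the table's current state
spark.sql("SELECT count(*) FROM demo.sales.orders").show()

# Time travel: read the same table as it was at an earlier point in time
spark.sql(
    "SELECT count(*) FROM demo.sales.orders TIMESTAMP AS OF '2025-07-01 00:00:00'"
).show()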

The StarTree Cloud is a fully managed, cloud-native platform built on Apache Pinot, a real-time distributed OLAP (Online Analytical Processing) datastore. StarTree Cloud is designed for OLAP and enables low-latency querying (milliseconds) and high-throughput processing (10,000+ queries/second) of large-scale data from streaming sources (e.g., Apache Kafka, Amazon Kinesis) and batch sources (e.g., AWS S3, Snowflake). Now it can be both the analytic and serving layer on top of Iceberg.

StarTree claims Iceberg support can transform it from a passive storage format into a real-time backend capable of powering customer-facing applications and AI agents with high concurrency, serving thousands of simultaneous users with consistent speed and reliability.

Kishore Gopalakrishna on the summit of Half Dome, Yosemite.

Kishore Gopalakrishna, StarTree co-founder and CEO, stated: “We’re seeing explosive growth in customer-facing, and increasingly agent-facing, data products that demand sub-second responsiveness and fresh insights. At the same time, Iceberg is emerging as the industry standard for managing historical data at scale.”

“As these two trends converge, StarTree is delivering unique value by acting as a real-time serving layer for Iceberg, empowering companies to serve millions of external users and AI agents securely, scalably, and without moving data.”

Recent Iceberg adoptees include Snowflake, Confluent, AWS with S3, SingleStore, and Databricks.

Paul Nashawaty, principal analyst at theCUBE Research, said: “Apache Iceberg is rapidly becoming the de facto standard for managing large-scale analytical data in the open data lakehouse—adoption has surged by over 60 percent year-over-year, according to theCUBE Research.”

StarTree asserts that most existing query engines built around Iceberg and Parquet struggle to meet the performance SLAs required for external-facing, high-concurrency analytical applications, and that companies have historically avoided serving data directly from their lakehouse. It claims that by combining the Iceberg table format and Parquet files with Pinot's indexing and high-performance serving capabilities, StarTree offers real-time query acceleration directly on native Iceberg tables.

StarTree Cloud graphic.

Unlike Presto, Trino and similar engines, StarTree says it’s built for low-latency, high-concurrency access, integrating directly with Iceberg, boosting performance with features such as:

  • Native support for Apache Iceberg and Parquet in StarTree Cloud
  • Real-time indexing and aggregations, including support for numerical, text, JSON, and geo indexes
  • Intelligent materialized views via the StarTree Index
  • Local caching and pruning for low-latency, high-concurrency queries
  • No data movement required—serve directly from Iceberg
  • Intelligent prefetching from Iceberg, minimizing irrelevant data scans 

Nashawaty reckons: “StarTree’s ability to serve Iceberg data with sub-second latency and without data duplication is a unique and timely advancement. It addresses a critical performance need for accessing historical data in modern data products.”

Support for Apache Iceberg in StarTree Cloud is available today in private preview. For more information, visit www.startree.ai.

Kioxia unveils highest capacity SSD at 245.76 TB

Online storage capacity constraints are melting away with Kioxia announcing the largest capacity SSD at 245.76 TB, opening the door to a 200-plus petabyte rack and a five-rack exabyte configuration.

It complements the already announced 122.88 TB model and is intended for use in data lakes – where massive data ingestion and rapid processing are required – and generative AI, which requires storing large datasets for training large language models (LLMs), as well as the embeddings and vector databases that support inference via retrieval-augmented generation (RAG).

These drives are aimed at very large organizations and hyperscalers that need this level of capacity muscle. 

Neville Ichhaporia, Kioxia

Neville Ichhaporia, SVP and GM of the SSD business unit at Kioxia America, said the LC9 series sets “a new benchmark for capacity and innovation in the SSD space. As generative AI reshapes datacenter architecture, we’re delivering the kind of breakthrough technology that enables our customers to optimize their IT infrastructure investments – pushing boundaries in performance, efficiency, and scale.”

Kioxia now has twice as much drive capacity as competitors Phison (122.88 TB) and Solidigm (122 TB), which announced their drives in November last year. Since Sandisk is a NAND fab joint-venture partner of Kioxia, it could also produce a similar-sized SSD to the LC9.

The LC9 drives use Kioxia's BiCS8 218-layer 3D NAND organized in QLC (4 bits/cell) format with 2 Tb dies. Thirty-two of these dies are stacked to produce 8 TB of capacity in a compact 154-ball BGA (Ball Grid Array) package. Kioxia says this is an industry first.
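A quick back-of-the-envelope check of those figures (our arithmetic, not Kioxia's) in Python:

# 2 Tb per QLC die, 32 dies per BGA package, 245.76 TB per drive
die_capacity_tb = 2 / 8              # 2 Tb = 0.25 TB (256 GB) per die
package_tb = 32 * die_capacity_tb    # 8 TB per 154-ball BGA package
packages = 245.76 / package_tb       # 30.72, so just over 30 packages per drive
print(package_tb, packages)          # 8.0 30.72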

We imagine that Phison might find scope for parallelism with around 30 of these BGAs needed for a 245.76 TB drive, were it to be minded and able to buy the component chips and produce one.

A Kioxia slide provides LC9 speeds-and-feeds information:

The drives are available in U.2 (2.5-inch), E1.S, and E1.L formats, though maximum capacity is limited to 122.88 TB in the 2.5-inch and E1.S variants.

These SSDs have a PCIe Gen 5 x4 interface, or Gen 5 x2 in a dual-port configuration, along with NVMe. Performance is heavily skewed to reads, with up to 12 GBps read bandwidth versus just 3 GBps for writes. This is not that fast for PCIe Gen 5 SSDs. Kioxia's own CD9P drive offers up to 14.8 GBps reads and 7 GBps writes, but it does use faster TLC NAND.

Phison’s 128 TB Pascari D205V uses QLC NAND, like the LC9, and is faster, delivering up to 14.6 GBps read bandwidth and 3.2 GBps write speeds. The SK hynix PS1012 drive is also a QLC SSD and is slightly faster as well. It pumps out 13 GBps read bandwidth at the 61 TB capacity level. A 122 TB version is due later this year.

Kioxia’s LC9 SSDs

The LC9 has Flexible Data Placement support to improve endurance (rated at 0.3 drive writes per day) and performance. It also has various security features, such as self-encryption, and meets the Open Compute Project (OCP) Datacenter NVMe SSD v2.5 specification, though not all of its requirements.

We envisage other SSD suppliers announcing their own similarly capacious SSDs and pitching them at the same market soon enough. It’s noteworthy that Pure Storage is introducing 300 TB versions of its proprietary NAND drive, the Direct Flash Module (DFM), and Meta has licensed DFM IP. Both Micron and Kioxia are NAND chip suppliers to Pure.

The pace of SSD capacity expansion is rapid, far outpacing disk drive capacity growth, which is currently in the 32-36 TB area with 40 TB drives forecast. There is no pricing information available for the LC9, so we can't make a $/TB comparison with HDDs. A rack full of LC9s could potentially outperform disk drives significantly in TCO terms because of the reduced floor and rack space, cooling, and, most likely, power they need.

Kioxia’s coming BiCS 10 generation of NAND will feature 332 layers, 52 percent more than BiCS 9. That would enable, in theory, a 3 Tb QLC die and, by extrapolation, a 368 TB SSD. This would have seemed crazy three or four years ago. Such is the way our expectations of what’s normal in the NAND world have changed.

Kioxia LC9 Series SSDs are now sampling with select customers and will be featured at the Future of Memory and Storage 2025 conference, taking place August 5-7 in Santa Clara.

LTO tape market rebounds in 2024 with 176.5 EB shipped

LTO tape

There was 15.4 percent more LTO tape capacity shipped in 2024 compared to 2023 as growth resumed following three years of more or less level pegging after a pandemic dip.

The LTO Program Technology Provider Companies (TPCs) – HPE, IBM, and Quantum – released their annual tape media shipment report, saying a record 176.5 exabytes (EB) of total compressed tape capacity shipped in 2024, assuming a 2.5:1 compression ratio. Similar numbers are available for 2019 to 2023, but not for earlier years. However, the LTO org has previously produced a bar chart showing the 2008 to 2019 history. We have combined the old and new charts to provide a full 2008 to 2024 tape capacity shipped history: 

There was a dip in 2018 due to a patent dispute between the two tape ribbon manufacturers, Fujifilm and Sony, which restricted the global supply of LTO-8 tape media until a cross-licensing deal was agreed between the two.

Capacity shipped then grew in 2019 to 114 EB, but was hit hard by the COVID-19 pandemic in 2020 with a fall to 105 EB. There was a ramp back up in 2021 to 148 EB, but dismal 0.5 percent growth in 2022 to 148.3 EB, and a laggardly 3.14 percent rise to 152.9 EB in 2023. Happily, the need to archive unstructured data has now resumed growing and we have a healthier 15.4 percent growth from 2023 to 2024.

Bruno Hald

Looking ahead to the hoped-for AI archiving demand, Bruno Hald, General Manager, Secondary Storage, at Quantum, said: “Setting a new growth record for the fourth year in a row, LTO tape technology continues to prove its longevity as a leading enterprise storage solution. Organizations navigating their way through the AI/ML era need to reconfigure their storage architectures to keep up, and LTO tape technology is an essential piece of the puzzle for those seeking a cost-friendly, sustainable, and secure solution to support modern technology implementation and the resulting data growth. We look forward to introducing the next iteration of LTO tape technology this year to bring enhanced storage capabilities to the enterprise.” Hald is referring to the LTO-10 format.

The LTO org says in its announcement that tape complements SSD and disk as an archival storage medium “to provide a low-cost, last line of defense against ransomware and other security threats.” And it mentions air-gapping, of course: “Further, offline copies of LTO tapes separate data archives from the connected environment, providing greater assurances for data recovery.”

It does not mention that LTO-10, unlike previous generational LTO changes, does not provide a speed increase over LTO-9, only saying: “LTO-10 … will provide the same blazing speed as LTO-9.” Nor does it mention LTO-10’s loss of backwards-compatibility or initial 30 TB raw capacity rather than the expected 36 TB raw capacity. Setting these relatively minor points aside, the future for LTO tape shipments looks good for the rest of 2025 and 2026. But if Microsoft’s Project Silica or Cerabyte or HoloMem get their various glass tablet technologies productized by 2027, 2028, or 2029, then we might see some real competition for tape developing.

MASV Express aims to automate complex file workflows without code

Massive file transfer supplier MASV has an upcoming MASV Express product that can automate complex media ingest and delivery processes without coding or any complicated setup.

It can move files to the cloud or connected on-prem storage, MAMs (Media Asset Managers), and individual recipients with no coding, no desktop client, drag-and-drop workflows, two-way encryption, secure authentication, and privacy controls. The files can be up to 20 GB or more in size.

Majed Alhajry, MASV

Majed Alhajry, CTO and interim CEO of MASV, stated: “File transfer is no longer just a background process; it’s a critical function in the global media supply chain. When files move faster, more reliably, and with less friction, content gets to market sooner and that means more opportunities to monetize across broadcast, digital, and OTT (Over the Top) platforms.”

OTT refers to the delivery of audio, video, and other media content directly to viewers over the internet, bypassing traditional distribution channels like cable or satellite TV. MASV’s file transfer moves files at up to 10 Gbps across an AWS server and content delivery network. It uses TCP-based file delivery, with chunking – splitting files into smaller parts for parallel transfer – and AWS’s private network infrastructure to accelerate data movement.
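The chunking idea is straightforward. Here is a minimal Python sketch of it, splitting a large file into fixed-size parts and pushing them in parallel over HTTP, with a hypothetical endpoint standing in for whatever service reassembles the parts. It is our illustration of the technique, not MASV's actual API.

import os
from concurrent.futures import ThreadPoolExecutor

import requests

UPLOAD_URL = "https://example.com/upload"   # hypothetical ingest endpoint
CHUNK_SIZE = 64 * 1024 * 1024               # 64 MB parts

def upload_part(path, index):
    # Read one fixed-size slice of the file and send it as a numbered part
    with open(path, "rb") as f:
        f.seek(index * CHUNK_SIZE)
        data = f.read(CHUNK_SIZE)
    requests.put(f"{UPLOAD_URL}?part={index}", data=data, timeout=300)

def upload_file(path, workers=8):
    parts = (os.path.getsize(path) + CHUNK_SIZE - 1) // CHUNK_SIZE
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Parts travel concurrently; the receiving service reassembles them by index
        list(pool.map(lambda i: upload_part(path, i), range(parts)))

upload_file("camera_rushes.mov")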

A customer such as Troveo, which helps creators license unused video footage for AI training, uses MASV for fast, no-code ingestion of over 6,000 TB of video each month. Sarah Barrick, Head of Growth at Troveo, said: “If something like MASV did not exist, it likely would have precluded a lot of creators from monetizing their video assets with Troveo. The ultimate factor was the usability. It was just so easy for users of all kinds, from small creators to larger enterprises – they didn’t even have to download a desktop client. They didn’t need any technical expertise, and could use a private, secure portal to begin uploading right away.”

MASV partners with Qumulo, and has integrations with AWS, Backblaze, Frame.io, Iconik, Mystika Workflows, Wasabi, and other suppliers. Think of MASV as a dedicated large file transfer service, overlapping in that regard with Arcitecta, LucidLink, and PeerGFS, which all offer more functionality alongside basic file moving. The MASV Express offering will be showcased at IBC2025 in The Netherlands, Stand 1.B01, RAI Amsterdam, September 12-15, 2025.

Bootnote

MASV was a passion project within a small Canadian networking company before spinning off in late 2019. That company was LiveQoS where Majed Alhajry was a software engineer between 2014 and 2017. LiveQoS acquired him and his SuperBeam peer-to-peer file sharing app in June 2014. We understand that MASV started life as a LiveQoS VPN project to improve the reliability and performance of Office 365. MASV then evolved to move massive media files, such as rushes, across the internet, and became MASV Rush. It was in a preliminary development phase from late 2017 to late 2019 with Alhajry the technical team lead. Effectively, he was a MASV co-founder.

MASV was officially founded in December 2019 as a pay-as-you-go large file transfer service, with co-founder Greg Wood as CEO. He is an ex-Corel Senior Director of Product Management who left in January 2017 and became a C-Level advisor at the Positive Venture Group in Ottawa. Early funding for MASV included investment from Peter Lalonde, a Positive Venture Group partner. Wood stepped down from the C-Level Advisor role in September 2018, initially becoming MASV's VP Product Management and then, when it officially launched, its CEO.

MASV CFO Jeff Stoss is another co-founder, having been in at the start and CFO at LiveQoS from 2016. Stoss was also the co-founder, CFO, and C-Level Advisor at the Positive Venture Group. This was sold to Consero Global in July 2021. LiveQoS, with its residual SD-WAN IP, was sold to IP licensing business Adaptiv Networks in 2019.

We asked MASV some questions about its recent history:

Blocks & Files: Is Majed Alhajry a co-founder of MASV? Same for CFO Jeff Stoss?

MASV: Yes, Majed and Jeff are co-founders along with Greg Wood who remains active in MASV via his vice-chairman role. 

Blocks & Files: Why did CEO Greg Wood become vice-chairman at MASV?

MASV: Co-founder Greg has made the transition from CEO to Vice Chairman after his successful 5 year stint in leading MASV from its founding. This is a strategic succession to cofounding CTO Majed Alhajry, reflecting the company’s evolution into a deeply technical platform for secure enterprise data exchange. Greg did a fantastic job building the business and is now working on new initiatives to take MASV into new markets. Greg remains actively engaged supporting Majed and Team MASV.

Blocks & Files: Why is Majed Alhajry an interim CEO? I understand he’s been in that post since January when Greg became vice-chairman.

MASV: Majed brings a distinguished track record of technical and business leadership as the founder of SuperBeam and co-founder of MASV. His recent appointment as interim CEO, in addition to CTO, reflects a natural progression in his journey with MASV, leveraging his deep expertise to drive the company’s strategic vision. In this expanded role, Majed is spearheading MASV’s push into the enterprise market, bolstered by a growing team, strategic partnerships, and purpose-built solutions like MASV Express. These initiatives—combining cutting-edge features with enterprise-grade capabilities—position MASV to solidify its foothold and accelerate growth in the corporate sector.

Blocks & Files: What is cut-through delivery technology?

MASV: Cut-through delivery is no longer a term MASV will use going forward to refer to Express. Instead, they have explained how Express works:

MASV Express moves files from origin to the destination without first needing the complete file in MASV’s secure container. This way, Express dramatically accelerates turnaround time, on top of MASV’s already super-fast file transfer speed, preserving MASV’s operational agility. MASV Express keeps content flowing smoothly through any network environment, ensuring the fastest, most reliable, and most secure delivery experience available. Express builds on MASV’s current capability to overcome any point-to-point delivery issues—like propagating outages and network congestion—by preserving reliability of the file transfer while decreasing turnaround times.
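In other words, Express relays parts of a file onward as they arrive rather than staging the whole file first. Below is a minimal Python sketch of that streaming relay idea, with hypothetical URLs and no claim to reflect MASV's actual implementation.

import requests

def relay(source_url, dest_url, chunk_size=8 * 1024 * 1024):
    # Stream the source; iter_content yields chunks as they arrive
    with requests.get(source_url, stream=True, timeout=300) as src:
        src.raise_for_status()
        # Feeding the generator straight into the PUT body means delivery to the
        # destination starts before the download from the origin has finished
        requests.put(dest_url, data=src.iter_content(chunk_size), timeout=None)

relay("https://origin.example.com/rushes.mov", "https://destination.example.com/rushes.mov")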

Storage news ticker – July 22

Adeptia Connect is business data exchange software that moves data between applications, partners, and customers, supporting data integration, application integration, and B2B/EDI processes. It can be single or multi-tenant and deployed on-premises, in the cloud, or both. It has AI-driven data mapping, pre-built connectors, and reusable templates, allowing non-technical users to create and manage data workflows. Adeptia has just announced self-managed and hybrid deployment options for Connect. “Hosted on Microsoft Azure and AWS and managed by Adeptia's specialized Cloud Infrastructure Engineering team, Adeptia's self-managed option combines the simplicity of SaaS with the enterprise security standards that customers demand. Adeptia handles complex operational tasks – such as deployment optimization, monitoring, backups, disaster recovery, and version upgrades – while maintaining rigorous security compliance, including SOC 2 certification.” More info here.

Data source connector supplier Airbyte has added new features to its Enterprise product: the ability to facilitate data sovereignty and compliance across different global regions; synchronize data with its metadata to assist AI models’ ability to improve reasoning and accuracy; and directly load data volumes to BigQuery and Snowflake, which we’re told increases data sync speeds and reduces compute costs. Direct loading can reduce compute costs by 50 to 70 percent, and increases speed by up to 33 percent depending on the use case. 

Airbyte now enables synchronized unstructured files and structured records to be transferred in the same data pipeline, which is the most efficient way to preserve metadata and data relationships – enabling richer data context and significantly improving the performance of AI Large Language Models (LLMs) using that data. There are more details in the blog post and more about Airbyte Enterprise here.

Assured Data Protection (ADP) is partnering with technology services distributor Avant in North America to offer Rubrik and Nutanix-based managed backup, disaster recovery (DR), and cyber resiliency services. It will mean that Avant can offer Rubrik’s backup and DR technology to customers for the first time with Nutanix technology also being available, depending on needs.

Ataccama ONE v16.2 introduces AI-driven data lineage features designed to help business users audit, trust, and act on their data, without writing a single line of SQL. It says: “Business users can now trace a data point’s origin and understand how it was profiled or flagged without relying on IT. Ataccama shows how data flows through systems and provides plain-language descriptions of the steps behind every number. For example, in a financial services setting, a data steward can immediately see how a risk score was derived or how a flagged transaction passed through a series of enrichment and quality checks. That kind of visibility shortens reviews, streamlines audits, and gives business teams the confidence to act on the data in front of them.”

Broadcom’s Tomahawk Ultra is an Ethernet switch for HPC and AI workloads with 250 ns switch latency at full 51.2 Tbps throughput. It delivers line-rate switching performance even at minimum packet sizes of 64 bytes, supporting up to 77 billion packets per second. It reduces header overhead from 44 bytes down to as low as 6 bytes, while maintaining full Ethernet compliance – boosting network efficiency and enabling flexible, application-specific optimizations. The switch incorporates lossless fabric technology that eliminates packet drops during high-volume data transfer, and implements Link Layer Retry (LLR) and Credit-Based Flow Control (CBFC) to eliminate packet loss and ensure reliability. It’s designed with topology-aware routing to support advanced HPC topologies including Dragonfly, Mesh, and Torus.

Cadence announced the tapeout of the industry’s first LPDDR6/5X memory IP system solution optimized to operate at 14.4 Gbps, up to 50 percent faster than the previous generation of LPDDR DRAM. The Cadence IP for the JEDEC LPDDR6/5X standard consists of an advanced PHY architecture and a high-performance controller designed to maximize power, performance and area (PPA) while supporting both LPDDR6 and LPDDR5X DRAM protocols for optimal flexibility. It says its new LPDDR6/5X memory IP system offering enables scaling up AI infrastructure to accommodate the memory bandwidth and capacity demands of next-generation AI LLMs, agentic AI and other compute-heavy workloads for various verticals. Multiple engagements are currently underway with leading AI, high-performance computing (HPC) and datacenter customers.

Research from data streamer Confluent says more than two-thirds (67 percent) of UK IT leaders see disconnected, siloed data as preventing them from fully adopting AI and machine learning (ML) – making fragmented data systems the largest obstacle to AI success. There was:

  • Ambiguity surrounding data lineage, timeliness, and quality assurance – 63 percent
  • Insufficient skills and expertise in managing AI projects and workflows – 61 percent
  • Limited ability to seamlessly integrate new data sources – 57 percent
  • Insufficient infrastructure for real-time data processing – 50 percent

Download Confluent’s 2025 Data Streaming Report here.

Confluent announced that Confluent Cloud is now available in the new AI Agents and Tools category on AWS Marketplace. It says this enables AWS customers to easily discover, buy, and deploy AI agent solutions, including its fully managed data streaming platform, Confluent Cloud, using their AWS accounts, for accelerating AI agent and agentic workflow development. Confluent Cloud streams, connects, processes, and governs data as it flows throughout the business in real time, enabling customers to build and deploy intelligent, adaptable, and scalable AI applications faster. By replacing fragmented, batch-based systems with a continuously updating stream of business context, Confluent creates a shared, real-time data backbone that AI agents can subscribe to, interact with, and reason over. Confluent Cloud empowers customers to confidently scale their businesses for the future with AI at the forefront. 


China's CXMT, the DRAM fabber ChangXin Memory Technologies, is preparing for an IPO in China. This is another sign, together with YMTC's NAND plans below, of China's growing self-sufficiency in memory technology and production. China could emerge as a DRAM and NAND supplier on the world stage.

Data management supplier Denodo’s cloud-based Agora offering is now available in the AWS Marketplace’s new AI Agents and Tools category. Denodo is offering both a SaaS version (Agora) and an Amazon Machine Image (Denodo Enterprise Plus), giving AWS customers a flexible path to build Agentic AI ecosystems using open standards. With support for the Model Context Protocol (MCP) built into the latest Denodo AI SDK, organizations can now unify, govern, and deliver real-time data to GenAI apps without having to replicate or re-engineer their entire data architecture.

Denodo has launched DeepQuery, a new capability that is bringing reasoning-based GenAI to enterprise systems. While most GenAI tools stop at retrieving facts, DeepQuery connects to live, governed enterprise data and delivers answers that explain the why behind the what. These answers are fully reasoned, cited, and ready for decision-making. Rather than pulling data into a silo, DeepQuery applies structured reasoning directly on real-time systems to support sales teams, finance, HR, and more. It’s the first enterprise-grade GenAI to offer this depth of synthesis with full explainability and traceability marking an evolution toward cognitive GenAI, where models don’t just answer – they investigate. Denodo is offering early access to select partners through its AI Accelerator Program. Denodo also announced the availability of Model Context Protocol (MCP) support as part of the Denodo AI SDK.

China’s Exascend offers products like PCIe NVMe SSDs, SATA SSDs, CFast, and CFexpress cards, tailored for demanding applications in aerospace, AI, IoT, and video surveillance. It has two new PE4 SSDs, 7 mm U.2 models: 23.04 TB and 30.72 TB. They “enable system integrators to maximize capacity in edge servers, rugged industrial computers, on-premises AI nodes, and space-constrained datacenters where every watt and every millimeter count.” The PE4 30.72 TB doubles the density of competing 7 mm U.2 SSDs and enables up to 737 TB of flash in a single 1U server, cutting rack space requirements by 50 percent compared to 15 mm drives and delivers a two-fold jump in TB-per-U efficiency.

The PE4 series has a PCIe Gen 4 x4 interface and uses 3D TLC NAND. The drives are rated for up to 1 drive write per day (DWPD) and offer a 2-million-hour mean time between failures (MTBF). Engineering samples of the PE4 Series U.2 SSD in 23.04 TB and 30.72 TB models are shipping now to select customers, with volume production scheduled for Q3 2025. The PE4 7 mm family is already available in 7.68 TB and 15.36 TB capacities.

Frontgrade Technologies, which provides high-reliability microelectronics for space and national security, has unveiled its embedded MultiMediaCard (eMMC) – a space-grade NAND memory product offering 32 GB density, the highest available in today's market. Designed to deliver maximum capacity and resilience in space environments, this radiation-tolerant, non-volatile flash data storage device is optimized for low Earth orbit (LEO) missions and beyond. It has an integrated flash controller with built-in FTL for wear leveling, bad block handling, and ECC, and has radiation-tolerant construction suitable for many mission orbit requirements. There is a standard eMMC 5.1 interface for streamlined system integration. Frontgrade's eMMC is currently available for sampling. Flight units are expected to be available in Q4 2025. Download a datasheet from this webpage.

IBM says integrating Storage Ceph with its Storage Deep Archive “presents a compelling solution for organizations looking to optimize their data archival strategies. The combination delivers substantial cost efficiency, operational simplicity, air gap resiliency, and policy-driven flexibility.” Ceph now supports policy-based transitions of data to external S3-compatible storage tiers, including those that present Glacier Retrieval classes. IBM’s Storage Deep Archive has “ultra-low-cost archival capabilities” and “IBM Storage Deep Archive uniquely complements Ceph’s capabilities by offering an S3 Glacier-compatible interface without the traditional overhead associated with tape systems,” although it is a tape library system. An IBM blog has more information.


Note: IBM Storage Deep Archive is an on-premises archival data storage system integrated within an IBM Diamondback server and a 5U top rack, utilizing tape storage with an S3 Glacier command interface. It offers up to 27 petabytes of capacity in a 19-inch rack.
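The policy-based tiering described for Ceph above is driven through the standard S3 lifecycle API. As a minimal sketch, assuming a Ceph RGW endpoint and a Glacier-style storage class already defined by the cluster admin (the endpoint, bucket, and class names here are hypothetical), a boto3 lifecycle rule to push older objects to that tier might look like this:

import boto3

# Hypothetical Ceph RADOS Gateway endpoint; credentials come from the usual AWS config
s3 = boto3.client("s3", endpoint_url="https://rgw.example.com")

s3.put_bucket_lifecycle_configuration(
    Bucket="backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-backups",
                "Filter": {"Prefix": "monthly/"},
                "Status": "Enabled",
                # The storage class name must match one configured on the Ceph zone,
                # e.g. a cloud-transition tier fronting IBM Storage Deep Archive
                "Transitions": [{"Days": 90, "StorageClass": "DEEP-ARCHIVE"}],
            }
        ]
    },
)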

William Blair analyst Jason Ader says of Nutanix: “With respect to the competitive fallout from Broadcom’s acquisition of VMware, Nutanix sees this as a multi-year displacement opportunity, less characterized by a specific tipping point and more by a continuum that will build on itself over time (putting more pieces in place right now to execute on the land grab, including channel enablement, OEM and external storage partnerships, and incremental sales reps). While Nutanix has already seen some benefit from VMware displacements (as evidenced by the recent uptick in new logos and the strong and growing seven-figure pipeline), inertia with VMware customers is real (especially with larger, complex deployments) and software/hardware refresh cycles often need to align for customers to shift off VMware.”

VergeIO is offering replacement storage with the launch of The AFA Replacement Kit. VergeIO says it’s “an offering designed to replace traditional all-flash arrays with a simpler, more cost-effective infrastructure solution. VergeIO customers have reported reducing storage costs by a factor of ten, in addition to the added savings from eliminating expensive VMware licensing and support agreements. This VergeIO kit brings together three Solidigm 4 TB enterprise SSDs and a VergeOS server license combined into one streamlined platform. Along with your servers, it’s a complete, ready-to-run infrastructure solution. All IT needs to do is insert the included flash drives into empty drive bays in existing servers, and they’re ready to deploy VergeOS.” VergeIO marketing head George Crump said: “It is 12 TB per node, typical customer has at least four nodes. If a customer needs more, we have flexibility.” 

We don't see customers choosing this as a general external AFA replacement, e.g. rejecting a NetApp AFF A150, which is both a scale-up (2RU x 24 slots) and scale-out external storage system (to 12 x HA pairs), whereas the VergeIO AFA Replacement Kit is a scale-out system with three-drive (slot) nodes. We see it specifically as a VMware vSAN replacement.

VergeIO was asked about this point and a spokesperson said: “The VergeIO AFA Replacement Kit is not limited to three drives or three nodes—those were illustrative starting points for the press release. The solution can scale to any number of drives per server and any number of servers. It’s designed to give IT teams flexibility to size configurations based on workload requirements, not fixed appliance specs.”

Also, the product is both a vSAN and general external AFA replacement, with VergeIO saying: “The AFA Replacement Kit serves both purposes. It functions as a general AFA replacement, suitable for customers considering alternatives to traditional SANs and filers from vendors like Dell, NetApp, Pure, HPE, and others. [And] It is a strong alternative for VMware customers who are moving off vSAN and want server-side flash with integrated compute and storage.”

VergeIO wants us to know “that research from ESG indicates that over 53% of SANs are used exclusively to support virtualized workloads, and over 90% support at least some virtualized workloads. VergeIO is designed to meet that need directly. While VergeOS is not designed to run non-virtualized workloads, many customers find that they can migrate bare-metal applications they previously believed too resource-intensive for virtualization. This is common in VergeOS environments thanks to its efficiency, where storage, compute, AI, and networking are integrated into a single software codebase.”

DigiTimes reports that China’s NAND fabber YMTC aims to pilot production of NAND chips using China-only tech this year. YMTC was estimated to hold 8 percent of the global NAND market (130,000 wafer starts per month of capacity) by the end of 2024, and wants to reach a 15 percent share by the end of 2026. This is based on ramping up to 150,000 wafer starts per month and on its latest gen 5 NAND, with 294 layers built from bonded 150-layer and 144-layer components (Xtacking 4).

YMTC is on the US tech export sanctions list and cannot buy NAND fab tools using the latest technology from suppliers such as ASML. Tom’s Hardware has YMTC NAND die info:

  • X4-9070 – 1 Tb TLC, 3600 MT/s – in production
  • X4-6080 – ? Tb QLC – H2 2025
  • X5-9080 – 2 Tb TLC, 4800 MT/s – 2026
  • X5-6080 – ? Tb QLC, 4800 MT/s – 2026

YMTC’s gen 6 die will likely feature 300+ layers built from 3 component stacks (Xtacking 5).

Zadara software is being used by East Africa services provider TouchNet at an iXAfrica co-location datacenter in Nairobi, Kenya, to provide a sovereign cloud offering compute, networking, storage, and GPUaaS. The Zadara AI Cloud enables businesses, managed service providers (MSPs), and cloud service providers (CSPs) to optimize operations without the need to rely on offshore services. Zadara provides its own take on the public cloud with fully managed compute (EC2-compatible), storage (block, file, object), and network systems hosted in a global point-of-presence (POP) network, called edge clouds. It can also be run on AWS, Azure, GCP, and Oracle public clouds, or in a customer’s own datacenters, as in this iXAfrica case. This deployment enables sovereign cloud capabilities hosted locally in Kenya, and a public cloud-like experience (compute, storage, networking) using Zadara’s EC2- and S3-compatible services. There is full regulatory and data residency compliance for sectors requiring localized data control (e.g. government, finance, healthcare).
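Because the storage side is S3-compatible, existing AWS-style tooling can generally be pointed at it simply by overriding the endpoint. Here is a minimal sketch with boto3, in which the endpoint, region, credentials, and bucket are hypothetical placeholders rather than real TouchNet or Zadara values:

```python
import boto3

# Minimal sketch: use an S3-compatible Zadara object store by overriding the
# endpoint URL. Endpoint, region, credentials, and bucket are hypothetical
# placeholders, not real TouchNet or Zadara values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object.nairobi.example-cloud.ke",
    region_name="ke-nairobi-1",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# The same S3 API calls work as against AWS: create a bucket, upload, list.
s3.create_bucket(Bucket="sovereign-data")
s3.put_object(Bucket="sovereign-data", Key="reports/q1.pdf", Body=b"...")

for obj in s3.list_objects_v2(Bucket="sovereign-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```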

Where to point your RAG: Sourcing proprietary data for LLMs and AI agents

Digital brain

Analysis: There are many places within IT infrastructure that organizations can use as a single proprietary data source for their large language models and AI agents doing RAG (retrieval-augmented generation) – backup stores, data lakes, data management catalogs, and storage arrays. Let’s sketch the background, then look at how they compare.

LLMs and agents can scan all kinds of unstructured information in text, numbers, audio, image, and video formats in response to natural language inputs, and generate natural language outputs. However, they are initially trained on general data sources and will give better responses to an organization’s users if they have access to the organization’s own data. But that data is spread across many disparate systems and formats.

It lives, for example, in datacenters (mainframes and x86 servers), edge sites, public cloud instances, SaaS app stores, databases, data warehouses, data lakes, backups on disk and tape on-premises and in the public cloud, and archives. The data can be accessed via structured (block) or unstructured (file and object) protocols, with many subsidiary formats: Word, Excel, PDF, MPEG, and so on.

There are four theoretical approaches to making all these different data sources available for RAG. First, you could go to the individual specific sources each time, which would require you to know what they are, where they are, and what they contain. This approach operates at the level of individual storage arrays or cloud storage instances.

Second, you could go to existing agglomerated data sources, meaning databases, data warehouses, and data lakes. Third, you could go to the backup stores, and fourth, you could use a data management application that knows about your data.

The second, third, and fourth options already have metadata catalogs describing their content types and locations, which makes life a lot easier for the RAG data source hunter. All else being equal, accessing one location to find, filter, select, extract, and move RAG data is better than going to myriad places. It makes AI data pipeline construction much simpler.

A couple of other observations. First, the more of your proprietary data the single place contains, the better, as you have to make fewer exception arrangements. Second, LLMs and agents need vectorized unstructured data for their semantic searches, and those vectors have to be produced and stored somewhere. Any central RAG data source facility therefore needs to support vectorization and vector storage.
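To make that concrete, here is a minimal, vendor-neutral sketch of the vectorization step: chunk text pulled from a single source, embed each chunk, and keep the vector alongside metadata pointing back at the source. The embed() function is only a stand-in for whatever embedding model a real pipeline would use.

```python
import hashlib
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    doc_id: str            # where the text came from (array path, backup item, catalog ID)
    text: str              # the chunk itself
    vector: List[float]    # embedding used for semantic search

def chunk_text(text: str, size: int = 800) -> List[str]:
    """Naive fixed-size chunking; real pipelines split on document structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> List[float]:
    """Stand-in embedding: hashes the text into 32 floats. Replace with a real
    embedding model (hosted API or local model) in any actual pipeline."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest]

def vectorize(doc_id: str, text: str) -> List[Chunk]:
    """Produce chunks plus vectors ready to load into whichever vector store
    the chosen RAG data source supports."""
    return [Chunk(doc_id, c, embed(c)) for c in chunk_text(text)]

chunks = vectorize("s3://finance/q2-report.txt", "Quarterly revenue was ..." * 100)
print(len(chunks), len(chunks[0].vector))
```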

That’s the background. Let’s take a gander at the four main RAG data sources.

Storage arrays

Virtually every storage array vendor we know, whether hardware-based or software-defined, file or object, is building in some kind of GenAI support. Cloudian, DDN, Dell, HPE, NetApp, Scality, StorONE, VAST Data, WEKA, and others are all piling in.

Arrays (or software-defined storage) that have a fabric connecting on-premises arrays and public cloud instances under a global namespace will have an advantage; their reach is obviously greater than that of non-fabric arrays. This is true for cloud file services suppliers such as Box, CTERA, Egnyte, Nasuni, and Panzura as well.

However, such vendors can only supply RAG data that is stored on their own systems, unless they have connectors giving them wider data access. An agent granted access to a Dell, NetApp, DDN, or HPE array won’t be able to see data on another supplier’s array in that organization through the first array’s RAG lens.

Database/warehouse/lake/lakehouses

Databases, data warehouses, data lakes, and lakehouses store progressively more information, and generally in more formats, as we move from databases to data warehouses and on to data lakes and lakehouses. At the database end of this spectrum, specialized vector databases are appearing from suppliers such as Pinecone and Zilliz (Milvus). They say they offer best-of-breed vector storage, filtering, and extraction, plus support for AI pipelines.
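As an illustration of the ingest-and-query pattern these suppliers offer, here is a minimal sketch assuming a recent Pinecone Python client and an existing index; the API key, index name, vectors, and metadata fields are placeholders, not a recommended configuration.

```python
from pinecone import Pinecone

# Minimal sketch assuming a recent Pinecone Python client and an existing
# index named "proprietary-docs" with 1536-dimensional vectors. Key, index
# name, vectors, and metadata are illustrative placeholders.
pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("proprietary-docs")

chunk_vector = [0.01] * 1536   # embedding produced at ingest time (placeholder)
query_vector = [0.02] * 1536   # embedding of the user's question (placeholder)

# Ingest: store an embedding with metadata pointing back at the source system.
index.upsert(vectors=[{
    "id": "contract-42-chunk-0",
    "values": chunk_vector,
    "metadata": {"source": "s3://legal/contract-42.pdf", "page": 1},
}])

# Retrieval: semantic search for the closest chunks to the query embedding.
results = index.query(vector=query_vector, top_k=5, include_metadata=True)
for match in results.matches:
    print(match.id, match.score, match.metadata.get("source"))
```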

Other databases aim to be multi-model, like SingleStore. Its SingleStoreDB is a performant, distributed relational SQL database with operational, analytical, and vector data support, integration with Apache Iceberg, Snowflake, BigQuery, Databricks, and Redshift, and a Flow ingest feature.

Data warehouses are rapidly becoming AI data feed warehouses; witness Snowflake and its AI Data Cloud. Data lakes and lakehouses are also very much GenAI-aware and are rapidly developing features to support it. As an example, consider Databricks, which recently added a Lakebase Postgres database layer to its lakehouse, enabling AI apps and agents to run analytics on operational data within the Databricks environment. It also introduced Agent Bricks, a tool for automated AI agent development.

These are all excellent products, and connectors exist to bring them data from multiple different sources or to let them access data held in external tables.

However, their operational and admin environment becomes more and more complicated as they extend their reach to external tables or use connectors to bring in data from distant sources.

Backup stores

A backup vault stores data copied from multiple places within your organization. It can function as a single source for LLMs and agents needing access to your proprietary information. It’s not real-time information, but it can be very close, and it obviously holds troves of historical information.

The backup vault storage media is an important consideration. Tape is obviously a no-no due to lengthy data access times. Disk is better, while an all-flash vault is probably the fastest.

Cohesity, Commvault, Rubrik, and Veeam are well aware of the strength they possess as a vast potential RAG data store, and are building out features to capitalize on it. Cohesity has its Gaia initiative. Rubrik recently acquired Predibase for agentic AI development functionality.

An obvious caveat is that the backup vendors can only serve data to LLMs and agents that they have backed up. Anything else is invisible to them. This could encourage their customers to standardize on one backup supplier across their organization.

Data managers

Data managers such as Arcitecta, Datadobi, Data Dynamics, Hammerspace, and Komprise almost never act as data sources. We say almost never because Hammerspace is edging towards that functionality with the Open Flash Platform initiative. The data managers manage data stored on multiple storage arrays and in public cloud storage instances. This partially reflects a hierarchical life cycle management background common to several of them.

The data managers don’t store data, but they do index and catalog it; they know what data their customers have, and they often have data-moving capability, in some cases coming from a migration background. That means they can construct interface processes to AI data pipelines and feed data up them, as sketched below. Komprise is highly active here. Datadobi, with its StorageMap, policy-driven workflows, and data-moving engine, has a great base on which to build AI model-specific functionality.
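There is no common API across these vendors, so the following is only a generic sketch of the pattern: query a data manager’s catalog for files matching a policy, then hand the selection to a mover that stages it for the AI pipeline. Both catalog_search() and move_to_staging() are hypothetical stand-ins for vendor-specific calls.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class CatalogEntry:
    path: str          # file location on an array or in cloud storage
    file_type: str     # e.g. "pdf", "docx"
    size_bytes: int
    owner: str

def catalog_search(query: dict) -> List[CatalogEntry]:
    """Hypothetical stand-in for a data manager's catalog/index API; Komprise,
    Datadobi StorageMap, and Mediaflux each expose their own interfaces."""
    raise NotImplementedError("replace with the vendor's catalog query call")

def move_to_staging(entries: Iterable[CatalogEntry], target: str) -> None:
    """Hypothetical stand-in for the data manager's mover / workflow engine."""
    raise NotImplementedError("replace with the vendor's data movement call")

def feed_rag_pipeline() -> None:
    # Policy: recent PDF and Word documents from the legal share only.
    entries = catalog_search({
        "file_type": ["pdf", "docx"],
        "path_prefix": "/shares/legal",
        "modified_within_days": 365,
    })
    # Stage the selection where the vectorization step (see the earlier
    # sketch) can read and embed it.
    move_to_staging(entries, target="s3://rag-staging/legal/")
```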

Arcitecta’s Mediaflux product can be used to curate high-quality datasets from vast volumes of unstructured data, making it easier to feed data into large language models (LLMs) and other AI systems. We expect more functionality specific to AI to emerge here.

To expand their reach and give themselves more relevance here, the data managers could usefully partner with the database, warehouse, lake, and lakehouse suppliers. Hammerspace and Snowflake provide an example of this.

Other data silos

SaaS apps such as Salesforce are another data source for RAG. They cannot see outside their own environment and, absent explicit connectors, other data sources cannot see into theirs, with one exception. Their own environments – Microsoft’s, for example – can of course be large.

The exception is backup. If a SaaS app has its customer data backed up by a Cohesity, Commvault, Druva, HYCU, etc., then that data is inside the backup vendor’s purview.

Summing up

There is no silver bullet here, no instant fix, no single all-embracing GenAI RAG data source. You can only search out the best available GenAI data source or sources for your organization, considering your on-premises and public cloud IT data estate. All the suppliers mentioned are developing features, building partnerships, and even acquiring companies that own relevant technology. They all have their merits and will have more next month, next year, and so on.

Wherever you are starting from, you can perhaps place candidate suppliers into one of the four categories: storage, database/warehouse/lake/lakehouse, backup store, and data management. That will give you a quick overall view of their scope. You can then, hopefully, decide which route or routes to take to give the LLMs and agents you use the most efficient access to the proprietary data they need.