
VAST Data: ‘No one needs a file system for AI training’

VAST Data co-founder Jeff Denworth posted on X: “No one needs a file system for AI training… More specifically, no one needs a system that is *only* a file system. While the HPC storage community is out telling the world parallel file systems are essential to AI, customers have begun to deploy S3 checkpointers and S3 data loaders (which can also work async) for their training environments.”
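For context, an "S3 checkpointer" can be as simple as serializing model state to an in-memory buffer and putting it to a bucket, with the data loader streaming tokenized shards back the same way. A minimal sketch, assuming PyTorch, boto3, and a hypothetical bucket name:

```python
import io

import boto3
import torch

s3 = boto3.client("s3")
BUCKET = "training-artifacts"  # hypothetical bucket name


def checkpoint_to_s3(model, step):
    """Serialize model state in memory and PUT it as an object - no file system involved."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    buf.seek(0)
    s3.upload_fileobj(buf, BUCKET, f"checkpoints/step-{step:08d}.pt")


def load_token_shard(key):
    """Data loader side: stream a tokenized shard object straight into memory."""
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
```

Handing the upload off to a background thread is what makes such a checkpointer asynchronous, so the GPUs keep training while the object write completes.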

This got us thinking about the need for files in AI training and VAST’s direction, so we asked Denworth some questions.

Blocks & Files: Could you say why file system-based suppliers such as DDN, which has SuperPOD certification, are being used by Nvidia for its own storage, by X for its Colossus AI cluster, and by many other customers for AI applications? The same general point applies to NetApp, Pure, and WEKA. It’s clear that many customers, including Nvidia, are using file systems, parallel or not, for AI training. Why is this?

Jeff Denworth, VAST Data

Jeff Denworth: It’s not binary, it’s evolutionary. Historically, all of the AI training frameworks required a POSIX/file interface. Only companies developing their own frameworks would consider using object storage, and this is limited to the best of the best.

Glenn Lockwood articulated an example of this here.

Many customers are still using file systems … my point was not that they’re not being used, but rather that you need multi-protocol in today’s age otherwise file system-only solutions result in very poor investment protection. The frameworks are evolving faster than customer investment decisions. Customers are now starting to make the transition, and we routinely hear from them that they love the ability to work in both modes on the same data simultaneously.

Lest you forget, Nvidia also bought an object storage company (SwiftStack). This says a lot.

Blocks & Files: Have any LLMs been trained purely using data provided directly from object storage systems? Surely this capability has only arrived recently, with advances by Cloudian, MinIO, Nvidia, and Scality providing GPUDirect-like access facilities for object data storage?

Jeff Denworth: Yes. Of the top-tier (top ten worldwide) models I know of:

  • VAST is being used for a very prominent model exclusively on VAST S3 at CoreWeave. We have a few other top-tier names starting to experiment.
  • Azure Blob is being used for a very prominent model.
  • Nvidia is training a very prominent model on S3-compatible storage.

That’s just what I know of.

Blocks & Files: VAST has built an AI-focused software stack, the VAST Data Platform, comprising the base data store, its DataCatalog, its DataBase, DataSpace, and DataEngine, as it fulfills its Thinking Machine vision with what we thought were the necessary software layers. But OpenAI with ChatGPT and the other GenAI model developers have shown that you can have smart chatbots without any of this software. Give them a vector database and file system and they can do their thing. Witness DDN, IBM, NetApp, Pure, and WEKA with Nvidia SuperPOD credentials.

Jeff Denworth: It’s always possible to integrate a solution; that never means it’s practical or efficient.

VAST… breaks trade-offs of scale, transactionality, security, etc. to provide [in my opinion] the best possible approach to AI retrieval. Most organizations kick around GB-scale datasets and think they have a good solution. We’re envisioning a world where AI embedding models can understand recency and relevance of all data as it’s being chunked and vectorized … where all data will be vectorized [with] trillions of vectors that need to be searchable in constant time regardless of vector space size … this is only possible with our architecture.

A system that can manage ingestion of hundreds of thousands to millions of files per second, process them and index them in real time … as well as instantaneously propagate all data updates to the index so enterprises never see stale data. A system that doesn’t need expensive memory-based indices because legacy partitioning approaches are not efficient. You need DASE (Disaggregated Shared-Everything architecture) for all of this.

Finally … the underlying data sources need to be scalable AND enterprise grade … not sure where else you get this other than VAST.

Blocks & Files: Has ChatGPT-style technology negated the need for VAST’s software stack?

Jeff Denworth: Quite the opposite. The rise of agentic applications where organizations compute in GPU time increases the need for our technology. I’ll ask that as you consider this, you stop thinking about AI and RAG as just chatbots … the future speed of business will not be defined by how fast a human processes data. Nvidia is working to deploy 100 million agents into its enterprise (to augment 50,000 employees) over the next few years – all working together for complex business tasks. You don’t think this will push boundaries of legacy storage and database systems?

I think I see a future very different from the one you see. Everything will be about scale, GPU time, and the ability to process unprecedented amounts of data to think about hard problems. Did you see my blog?

The Stargate announcement will be the first of many. Dario [Amodei] at Anthropic also declared a need for 100x scale up in computation. This is not exclusively for training. System Two/Long-Thinking is going to change the world’s relationship with data and compel the need for even larger volumes of data.

Blocks & Files: VAST has been in a great creative period, developing its original storage technology from ground zero, and then the Thinking Machines-type software stack. Is this period of technology creativity now over with nothing but incremental tech advances and business process developments from now on? What is the vision for the future?

Jeff Denworth: I can confidently say that we have the most inventive and most ambitious team in the business. Each customer interaction gives us more inspiration for the next ten years … and we are fortunate to work with the smartest customers in the world. To assume we’ve become complacent, fat, and happy would be a dangerous assumption to make.

I’m not going to lay out our vision over email as I don’t think that does either of us any service, but we can talk more about the future maybe the next time we meet.

Blocks & Files: Your arrays can run application software in the C-nodes, providing computational storage. Isn’t this akin to turning the array into server direct-attached storage (DAS) for that application, negating the basic purpose of having a shared storage resource?

Jeff Denworth: Shared data access across machines is tantamount to what we do. Modern machinery needs real-time access to petabytes to exabytes of data to get a global data understanding. You can’t pin that data to any one host. Where and how those functions run is just a packaging exercise … we like efficiency so the more we can collapse, the better … but DAS is the opposite of how we think. Disaggregation is not just possible, we’ve shown the world that it’s very practical to getting to radical levels of data access and data processing parallelism.

Blocks & Files: How do you size the compute resource in a computational storage array?

Jeff Denworth: We’re learning more about sizing every day.

  • I/O load
  • Query load
  • Function velocity
  • Event notification activity
  • QOS management
  • RAS

I’m not sure we’ve got it all figured out since each new release is adding substantially new capability. This keeps the performance team on its toes … but we’re trying.

Nvidia says Spectrum-X adaptive routing can boost storage fabric bandwidth

Nvidia says its InfiniBand technology-infused Spectrum-X Ethernet networking can increase storage fabric network read bandwidth by almost 50 percent.

Spectrum-X combines the Spectrum-4 ASIC-based Ethernet switch, which sits alongside Nvidia’s InfiniBand products, with the BlueField-3 SuperNIC, and supports RoCE v2 for remote direct memory access (RDMA) over Converged Ethernet. Nvidia’s InfiniBand products feature adaptive routing to send data packets across the least congested network routes when the initially selected routes are busy or a link outage occurs. The Spectrum-4 SN5000 switch provides up to 51.2 Tbps of bandwidth across 64 x 800 Gbps Ethernet ports. RoCE extensions for adaptive routing and congestion control work in concert with the BlueField-3 product.

Nvidia Spectrum-4 switch (top) and BlueField-3 SuperNIC (below)

Adaptively routed packets can arrive at the destination out of sequence, and Nvidia’s BlueField-3 product can reassemble them properly, “placing them in order in the host memory and keeping the adaptive routing transparent to the application.”

An Nvidia blog explains that, because Spectrum-X adaptive routing is able to mitigate flow collisions and increase effective bandwidth, the effective storage performance is much higher than with RoCE v2, “the Ethernet networking protocol used by a majority of datacenters for AI compute and storage fabrics.”

The blog discusses checkpointing during a large language model (LLM) training run, which can take days, weeks, or even months. Job state is saved periodically so that if the training run fails for any reason, it can be restarted from a saved checkpoint state instead of starting again from the beginning. It says: “With billion and trillion-parameter models, these checkpoint states become large enough – up to several terabytes of data for today’s largest LLMs – that saving or restoring them generates ‘elephant flows’ … that can overwhelm switch buffers and links.”
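As a rough illustration of why such flows are disruptive, here is the transfer time for a hypothetical multi-terabyte checkpoint over a single high-speed port (illustrative numbers only, not Nvidia's test figures):

```python
checkpoint_bytes = 4e12              # a hypothetical 4 TB checkpoint state
link_gbps = 400                      # one 400 Gbps Ethernet port
link_bytes_per_sec = link_gbps / 8 * 1e9

seconds = checkpoint_bytes / link_bytes_per_sec
print(f"~{seconds:.0f} s of sustained line-rate traffic")   # roughly 80 seconds
```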

This assumes the checkpoint data is being sent to shared storage, an array, for example, across a network and not to local storage in the GPU servers, a technique used in Microsoft’s LLM training.

Nvidia also says that such network traffic spikes can occur in LLM inferencing operations when RAG (retrieval-augmented generation) data is sent to the LLM from a networked storage source holding the RAG data in a vector database. It explains that “vector databases are many-dimensional and can be quite large, especially in the case of knowledge bases consisting of images and videos.”

The RAG data needs to be sent with minimal latency to the LLM and this becomes even more important when the LLM is executing in “multitenant generative AI factories, where the number of queries per second is massive.”

Nvidia says it has tested out these Spectrum-4 features with its Israel-1 AI supercomputer. The testing process measured the read and write bandwidth generated by Nvidia HGX H100 GPU server clients accessing the storage, once with the network configured as a standard RoCE v2 fabric, and then with the adaptive routing and congestion control from Spectrum-X turned on.

Nvidia Israel-1 AI supercomputer graphic

Tests were run using different numbers of GPU servers as clients, ranging from 40 to 800 GPUs. In every case, Spectrum-X performed better, with read bandwidth improving from 20 to 48 percent and write bandwidth increasing from 9 to 41 percent.

Nvidia says Spectrum-X works well with its other offerings to accelerate the storage to GPU data path:

  • AIR cloud-based network simulation tool for modeling switches, SuperNICs, and storage.
  • Cumulus Linux network operating system built around automation and APIs, “ensuring smooth operations and management at scale.”
  • DOCA SDK for SuperNICs and DPUs, providing programmability and performance for storage, security, and more.
  • NetQ network validation toolset that integrates with switch telemetry.
  • GPUDirect Storage for direct data path between storage and GPU memory, making data transfer more efficient.

We can expect Nvidia partners such as DDN, Dell, HPE, Lenovo, VAST Data, and WEKA to support these Spectrum-X features. Indeed DDN, VAST Data, and WEKA have already done so.

Very large AI model training uses object storage

HPC expert Glenn Lockwood says that the largest AI language models are being trained with object storage, not file storage.

AI model training needs unstructured data. Most unstructured data is stored in files. Ergo, large language model (LLM) training needs access to file data and parallel file systems, as used in high-performance computing (HPC), to deliver file data faster to processors than serial file systems. Therefore, parallel file systems are needed for LLM training. Actually, no, says Lockwood, because the characteristics of LLM model training phases actually favor object storage more than parallel file systems.

Glenn Lockwood, Microsoft

Lockwood is an AI Infrastructure Architect at Microsoft who has worked on one of the world’s largest supercomputers. He writes: “I guess supercomputers and parallel file systems are like peas and carrots in so many people’s minds that the idea of being able to run a massive parallel compute job without a massive parallel file system is so unintuitive that it is unbelievable.”

He mentions four phases of LLM production:

  • Data ingestion, “where crawlers scrape the Internet and pull down raw HTML, images, videos, and other media. These raw data are indexed and shoved into a data warehouse. At scale, this can be hundreds or thousands of petabytes of data for frontier models.”
  • Data preparation, “where the raw data is converted into tokenized data. It amounts to a huge data analytics problem that uses … text and image processing pipelines that filter, deduplicate, and otherwise clean the raw garbage on the Internet using frameworks like Apache Spark. The hundreds of petabytes of input get reduced down by 10x-1000x.”
  • Model training, “where the tokenized data is shoveled through the LLM on giant GPU clusters in little batches. As the data is processed, the model weights are updated, and those weights are checkpointed to storage.”
  • Model deployment and inferencing, “where the final model is copied across giant fields of inferencing servers, and a web service sits in front of it all to transform REST API requests into actual inferencing queries that run on the GPUs.”

The I/O patterns in each phase do not require a parallel file system in his view. Data ingestion “just pulls HTML, images, or video streams from the Internet and packs them into data containers. As it is packing webpages into these files, it is building a separate index that stores metadata about the webpage … and its location … Thousands of VMs might be performing these tasks completely independently.” It is write-once data and suited to object storage immutability.

The point here is that “while one could store each scraped HTML page in a file that’s organized in a parallel file system, accessing those files would be very slow – a full crawl of all the data would require scanning hundreds of billions of little files.” Lockwood reckons “it’s better to implement data containers on top of object stores and use a distributed key-value store for the index.”
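A minimal sketch of that pattern, assuming boto3 and hypothetical bucket and key names, with a plain Python dict standing in for the distributed key-value index:

```python
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "crawl-containers"   # hypothetical bucket name
index = {}                    # stand-in for a distributed key-value store


def pack_pages(pages, container_id):
    """Pack many small scraped pages into one large immutable object and index them."""
    key = f"containers/{container_id:010d}.bin"
    blob, offset = bytearray(), 0
    for url, html in pages:
        data = html.encode()
        index[hashlib.sha256(url.encode()).hexdigest()] = (key, offset, len(data))
        blob += data
        offset += len(data)
    s3.put_object(Bucket=BUCKET, Key=key, Body=bytes(blob))


def fetch_page(url):
    """Random access later: range-read only the bytes belonging to one page."""
    key, off, length = index[hashlib.sha256(url.encode()).hexdigest()]
    obj = s3.get_object(Bucket=BUCKET, Key=key, Range=f"bytes={off}-{off + length - 1}")
    return obj["Body"].read().decode()
```

Because each container is written once and never modified, it maps directly onto object immutability, while range reads give random access to individual pages without scanning billions of small files.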

Data preparation involves “running Apache Spark-like pipelines that chew through all the raw data in a trivially parallel way … Each processing task might read a couple hundred megabytes of data from an object all at once, process it in-memory, then dump it back out to objects all at once. File systems offer no benefit here, because each task reads once and writes once rather than skipping around inside individual objects.”

The input data is deduplicated, which “requires comparing every piece of data to every other piece of data.” This I/O-heavy step “is often done in a centralized location that is adjacent [to] the [ingested data] object store using CPU-based, completely separate supercomputers before training on GPUs ever begins.”

Lockwood asserts that “buying cheap object storage and a cheap CPU cluster is more cost-effective than buying an expensive file system and wasting your GPU nodes on trivially parallel text processing tasks.”

He says that, with HPC systems, “the need for fast checkpointing and restart were the primary driver behind the creation of parallel file systems.”

The AI model training phase also needs fast checkpointing and restart. However, although “parallel file systems certainly can be used for training, they are not the most cost-effective or scalable way to train across tens of thousands of GPUs.”

LLM training involves much repetition: “Training a model on GPUs, whether it be on one or a thousand nodes, follows a simple cycle (this is a ‘step’ in LLM training parlance) that’s repeated over and over.”

  1. A batch of tokenized data is loaded into GPU memory.
  2. That data is then processed through the neural network and the model weights are adjusted.
  3. All GPUs synchronize their updated weights.

The I/O load of step 1 is not the same as in a traditional HPC job because, firstly, the “millions of little text or image files … are packaged into large objects before the GPUs ever see them.” Secondly, the amount of tokenized data is actually quite small. The 405-billion-parameter Llama-3 model, “trained using 16,000 GPUs, only had to load 60 TB of tokens from storage. That divides out to 3.75 GB of tokens processed by each GPU over the entire course of a 54-day run.”

That’s not much of an I/O challenge. Also, the tokenized data can fit into a GPU’s local flash storage, what Hammerspace calls Tier 0 storage. Lockwood points out that “using node-local NVMe allows storage capacity and storage performance to scale linearly with GPU performance.”

He says: “Because no two GPUs will ever need to read the same input token, there’s never a need to copy input tokens between nodes inside the training loop.”

“Super high-bandwidth or super high-capacity parallel file systems are not necessary for loading input tokens during training.”

What about checkpoint writing? Here the actual I/O burden is less than we might suppose. “Unlike with scientific HPC jobs, the checkpoint size does not scale as a function of the job size; the checkpoint for a 405 billion-parameter model trained on 16,000 nodes is the same size as the checkpoint for that model trained on three nodes. This is a result of the fact that every training step is followed by a global synchronization which makes each data-parallel copy of the model identical. Only one copy of those model weights, which amounts to under a hundred terabytes for state-of-the-art LLMs, needs to be saved.”
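A back-of-envelope check on those numbers, assuming a rule-of-thumb 16 bytes per parameter for weights plus optimizer state (the real figure depends on precision and how the optimizer state is sharded):

```python
BYTES_PER_PARAM = 16   # e.g. bf16 weights plus fp32 Adam moments (assumption)

for params in (405e9, 1e12):           # Llama-3 405B and a trillion-parameter model
    tb = params * BYTES_PER_PARAM / 1e12
    print(f"{params / 1e9:.0f}B parameters -> ~{tb:.0f} TB per checkpoint")
# ~6 TB and ~16 TB respectively - comfortably under a hundred terabytes
```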

He cites a VAST Data source and says that: “Even a trillion-parameter model can achieve 99.7 percent forward progress (only 0.3 percent time spent checkpointing) when training across 3,072 GPUs with a modest 273 GB/s file system. A parallel file system is not required to get that level of performance; for example, HDD-based Azure Blob achieved over 1 TB/s when benchmarked with IOR for writes at scale.”

What you do is checkpoint to GPU-local storage, so the checkpoint data is persisted, and then asynchronously copy it to a neighboring GPU’s local storage for fast restore, and migrate to shared storage for longer-term retention. This involves large block writes and Lockwood concludes: “This combination of modest write bandwidth and simple, sequential, large-block writes is ideally suited for object stores. This isn’t to say a parallel file system cannot work here, but this checkpointing scheme does not benefit from directory structure, fine-grained consistency semantics, or any of the other complexities that drive up the cost of parallel file systems.”
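A minimal sketch of that hierarchy, assuming PyTorch, boto3, and hypothetical paths and bucket names; the local NVMe write is synchronous and quick, while the copy out to object storage happens in the background:

```python
import threading

import boto3
import torch

s3 = boto3.client("s3")


def checkpoint(model, step, local_dir="/local_nvme/ckpt", bucket="ckpt-archive"):
    """Persist to GPU-local NVMe first, then drain to object storage in the background."""
    path = f"{local_dir}/step-{step:08d}.pt"
    torch.save(model.state_dict(), path)     # fast, local, synchronous write

    def drain():
        # asynchronous migration to shared storage for longer-term retention;
        # a copy to a neighboring node's NVMe for fast restore would sit here too
        s3.upload_file(path, bucket, f"step-{step:08d}.pt")

    threading.Thread(target=drain, daemon=True).start()   # training resumes immediately
```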

The I/O requirements of the model deployment and inferencing stage are also quite straightforward:

  1. When provisioning a GPU node for inferencing, model weights must be loaded from shared storage as fast as possible.
  2. When using an LLM to search documents, a vector database is required to perform the similarity search that augments the LLM query with the relevant documents. This is the basis for RAG.
  3. Key-value caches are often used to reduce the latency for different parts of the inferencing pipeline by storing context, including the conversation or frequently accessed contextual documents.
  4. As the inferencing demand evolves, different models and weights may be swapped in and out of individual GPU servers.

“A parallel file system is not particularly useful for any of these; the only place in which their high bandwidth would be a benefit is in loading and re-loading model weights (#1 and #4). But as with hierarchical checkpointing, those I/O operations are whole-object, read-only copies that are a natural fit for object APIs. Complex directory structures and strong consistency simply aren’t necessary here.”

“The I/O patterns of each of these [four] steps map nicely to object storage since they are predominantly write-once and whole-file transactions. Parallel file systems certainly can be used, and workloads will benefit from the high bandwidth they offer. However, they come with the cost of features that aren’t necessary–either literal costs (in the case of appliances or proprietary software) or figurative costs (allocating people to manage the complexities of debugging a parallel file system).”

“Parallel file systems aren’t bad, and they aren’t going anywhere. But they aren’t required to train frontier models either, and … some of the largest supercomputers on the planet are designed not to require them.”

Read Lockwood’s blog for a deeper dive into this topic.

Volumez expands DIaaS to optimize cloud block storage for AI

Volumez has extended its cloud-delivered block storage provisioning service, originally aimed at containerized applications, to GenAI workloads with its Data Infrastructure-as-a-Service (DIaaS) product.

The company says its technology can maximize GPU utilization and automate AI and machine learning (ML) pipelines. Existing pipelines for delivering data and setting up AI/ML infrastructure are undermined by storage inefficiencies, underutilized GPUs, over-provisioned resources, unbalanced system performance, increased cost, complex management, and poorly integrated tooling, which drain the AI pipeline team’s bandwidth and delay projects.

Volumez DIaaS

John Blumenthal, Volumez’s Chief Product and Business Officer, noted at a January IT Press Tour session in Silicon Valley that AI/GenAI workloads demand concentrated compute and storage power in a dense infrastructure. That way they can achieve sustainability, cost efficiency, and energy optimization while reducing hardware sprawl.

Volumez DIaaS architecture

Volumez DIaaS is based on an understanding of cloud providers’ Infrastructure-as-a-Service (IaaS) products and is aimed at creating a balanced infrastructure for AI workloads. This includes a storage infrastructure based on declarative and composed NVMe/TCP instances. It uses intrinsic Linux services and there is no storage controller in the data path.


The context here is that public clouds such as AWS and Azure rent out compute and storage resources and are not incentivized to optimize how customers efficiently and evenly consume these resources. Volumez makes its money by operating in the zone between the public clouds’ optimization of their own resources for their own benefit, and customers’ need to optimize the efficient use of these resources in terms of performance, cost, and simple operations for their own benefit.

Volumez offers better price performance for its block storage than AWS instances, as a chart indicates:

It can set up the AWS or Azure data infrastructure from a data scientist’s notebook, with Volumez imported as a PyTorch library and auto-calculating the storage infrastructure requirements.

Volumez says the MLPerf Storage benchmark shows its infrastructure performing well.

Blumenthal said that AI workloads require a balance of high capacity, high bandwidth, and high performance at an affordable cost, and claimed Volumez can provide this. He said it increases GPU utilization – the GPUs don’t wait as long for data – and so improves training and inference yields from the infrastructure.

Check out various Volumez white papers and blogs for more information.

Storage news ticker – February 3


The UK-based part of Rubrik Backup-as-a-Service supplier ADP (Assured Data Protection) is partnering with cyber-security, comms, and tech MSP Wavenet – the UK’s largest independent IT MSP – to provide enterprise-grade backup and disaster recovery services to customers across the UK. Wavenet will offer Rubrik’s data protection platform as a managed service through Assured’s second-site data replication infrastructure to bolster customers’ cyber resiliency. Wavenet anticipates increased demand for these services in vertical markets in which it already has a strong presence, such as education, retail, and oil & gas, as well as other markets.

Veritas spin-out data manager Arctera unveiled a new brand for its Data Compliance platform – “Arctera Insight Platform” – and announced capabilities that improve the way organizations manage compliance, reduce risks, and prepare for audits. New features include a GenAI-powered assistant to streamline and simplify compliance tasks. The Arctera Insight Platform now:

  • Ensures insight is granular: The User Audit View now offers detailed user-level insight, while the Migration Management Dashboard allows seamless tracking across data centers.
  • Streamlines governance tasks: The GenAI assistant simplifies document review with instant summaries, key topic extraction, and sentiment analysis across various content types.
  • Captures more content: Arctera Insight Platform integrates with over 120 content sources including messaging platforms such as WhatsApp, WeChat, Microsoft Teams, Slack and, now, iMessage.
  • Delivers information where it’s needed: Seamless integration with business applications, such as Microsoft Outlook, delivers compliance data into existing workflows.

Arweave says it’s a decentralized and permissionless permanent storage network, designed to preserve knowledge for centuries. It uses a proof-of-access protocol, and is aiming the product at those who require long-term availability of their data, for example researchers, historians, and individuals seeking to safeguard their digital assets. It claims its newly developed algorithm introduced in the 2.9 update reduces the computational costs for honest participants in Arweave mining while preserving the network’s existing security guarantees.

“By dramatically reducing computational costs for miners, Arweave 2.9 empowers a wider community to participate in securing the future of digital information,” said Sam Williams, founder of Arweave and CEO of Forward Research.

Backblaze announced the Winter Release of its B2 Cloud Storage IaaS platform. It has Scalable Application Keys so enterprises can generate and refresh up to 10,000 unique data access keys per minute – a 150x increase – so they can securely move more bits & bytes between vast endpoint networks and Backblaze B2 storage buckets. Scoped Application Keys give admins the ability to generate application keys for specific prefixes within all of their storage buckets. 

A strengthened B2 CLI enhances credential management by moving from disk-stored credentials to secure local caching while also adding productivity features like autocomplete suggestions and shortcuts to save users time.  Event Notifications help businesses streamline workflows and make informed decisions faster by automatically triggering actions and updates based on data changes in Backblaze. Custom Upload Timestamps provide the ability to retain original file timestamp data from other storage providers when migrating or copying datasets into Backblaze B2–supporting data portability for companies that rely upon timestamps in their compliance-focused data lifecycle management. It also has IPV6 support.

Live data replicator Cirata announced that Cirata Data Migrator has successfully passed the Amazon Web Services (AWS) Foundational Technical Review (FTR). “Cirata Data Migrator is designed to handle the complexities of large-scale data migration, providing unmatched scalability and real-time data consistency,” said Paul Scott-Murphy, Chief Technology Officer, Cirata. “Our solution extends capabilities with AWS-specific enhancements, making it the ideal choice for enterprises looking to future-proof their data management strategies without the usual costs or complexities. Passing the AWS FTR is a testament to our dedication to quality, security and customer satisfaction.” Cirata Data Migrator is now also offered on the AWS Marketplace, along with Cirata’s Data Migration as a Service (DMaaS) offering.

Research house Coldago has published a January 2025 GEMS list naming five players to watch in the coming months on the strength of their vision, potential for disruption, product development, recent announcements, and execution, with a particular focus on technology and product innovation. They are ExaGrid, Leil Storage, QStar Technologies, Volumez, and Wasabi.

The full document is only available for purchasers.

Lakehouse supplier Databricks has announced the final closing of its Series J funding round, which values the company at $62 billion. Databricks said it plans to invest this capital toward new AI products, acquisitions, and expansion of its international go-to-market operations. This capital is also expected to be used toward providing liquidity for current and former employees and paying related taxes. In addition to raising the $10 billion equity financing from some of the most well-known investors, Databricks closed a $5.25 billion credit facility led by JPMorgan Chase alongside Barclays, Citi, Goldman Sachs, and Morgan Stanley, with participation from other leading financial institutions and alternative asset managers. The credit facility includes a $2.5 billion unfunded revolver and a $2.75 billion term loan.

A BestBrokers “Unicorns in 2024: AI is Taking Over” document includes two storage-related companies in its list:

  • Databricks at $62 billion
  • WEKA at $1.6 billion

Automated endpoint management platform supplier NinjaOne, based in Austin, TX, and Dropsuite, based in Melbourne, Australia, a cloud data backup, archiving, and recovery supplier, announced a definitive agreement for NinjaOne to acquire Dropsuite for a total transaction value of approximately $252 million. The acquisition is expected to close in the first half of 2025, subject to approval by Dropsuite’s shareholders, court approval, Australia’s Foreign Investment Review Board (FIRB) approval, and other customary closing conditions.

“What impressed us most after our due diligence across the SaaS backup and data protection market is that Dropsuite is obsessed with customer success as much as NinjaOne, and their product is widely loved across its customer base and the nearly 1.5 million people who rely on it,” said Chris Matarese, President, CFO, and co-founder at NinjaOne. “We have long said we are a build before buy company, and we want to assure our customers and partners that we have passed on a number of potential acquisitions and spent a lot of time to ensure that Dropsuite is as easy to use and deeply integrated into the NinjaOne automated endpoint management platform as the rest of our product suite.”

A UK fashion designer who has previously designed for Lady Gaga has launched a dress, codenamed Project Max, ahead of London Fashion Week, made entirely from discarded datacenter material. Maximilian Raynor and datacenter company Equinix – whose facilities support the likes of TikTok, Netflix, and the London Stock Exchange – have taken wearable tech in a new direction with a one-off garment that brings the internet to life by using recycled materials in novel ways. Project Max is made from “the internet,” using repurposed materials (3,600 meters of internet cables plus metal nuts and bolts) to showcase the “personification of the internet herself.” It took 640 hours to create.

DDN has hired Izhar Sharon as SVP of Infinia AI Customer Advocacy, reporting to CEO Alex Bouzari. Sharon has Datafy (business advisor, 7 months), OpenDrives (CEO, 20 months), Volumez (Chief Business Officer, 11 months), Infinidat (President Infinidat USA, 7 years), XtremIO (Consultant, nine months), IBM (43 months), XIV (four years), and EMC (nine years) experience. 

Izhar Sharon, DDN

IBM Storage Ceph has added a bucket logging feature. An IBM blog says: “The introduction of the S3 bucket logging feature in IBM Storage Ceph 8.0 is a game-changer for storage access management. This feature provides transparency and control by empowering end users to configure and manage their application access logs. With the ability to log operations at the bucket level, users can monitor activity, troubleshoot issues, and enhance their security posture—all tailored to their specific requirements without admin intervention.”

“Tools like Trino and Superset can analyze and visualize the log data generated by S3 Bucket Logging.” IBM’s “engineering team is working hard to improve bucket logging for our next IBM Storage Ceph release, including bug fixes, enhancements, and better AWS S3 compatibility.”

In its Q4 2024 results, IBM said it had double-digit storage growth in the hybrid infrastructure part of its overall infrastructure (z mainframe, Power, and storage hardware systems) business, which brought in $4.3 billion overall, down 7.6 percent year-on-year. There was no detail about which storage products were involved, the revenue they generated, or the actual growth rate.

Moonbeam, a smart contract platform for building cross-chain-connected applications, has unveiled DataHaven, a decentralized storage system purpose-built for next-gen applications, offering Web3 developers a secure, censorship-resistant, and verifiable platform with enhanced Ethereum alignment. DataHaven seamlessly integrates with EigenLayer’s Ethereum Layer 2 networks, addressing critical challenges in data storage for emerging use cases across AI, DePIN, and RWAs. DataHaven aims to provide a sanctuary for data, offering decentralized storage designed for AI and human data to coexist while remaining free from centralized influence.

Data protector NAKIVO announced strong Q4 2024 results, with a 25 percent increase in revenue and 10 percent growth in its global customer base compared to Q4 2023. It didn’t provide base or final figures. In Q4 2024, 62 percent of total revenue came from the EMEA region, 25 percent from the Americas, and 13 percent from the Asia-Pacific region. The countries delivering the highest revenue growth in Q4 2024 were Colombia, Ukraine, and Georgia. NAKIVO said its revenue grew by more than 100 percent QoQ in Morocco, Andorra, Colombia, Ukraine, Georgia, Denmark, India, South Africa, Sweden, Thailand, Taiwan, French Polynesia, United Arab Emirates, Romania, Hungary, and Canada. Currently, NAKIVO has over 16,000 active customers across 183 countries, achieving a 10 percent increase in its customer base in Q4 2024 vs. Q4 2023. This growth includes a 14 percent increase in the Asia-Pacific region, 10 percent in EMEA, and 8 percent in the Americas, we are told.

Cloud file services supplier Nasuni announced a 317 percent growth in data under management in the energy sector in the past two years. It says it won EthosEnergy, a global independent service provider focusing on rotating equipment for customers in the power generation, energy, industrial, aerospace, and defense markets, as a customer.

Wissam Jabre, NetApp

NetApp has hired ex-WD CFO Wissam Jabre as an EVP and its CFO, replacing Mike Berry who is retiring. CEO George Kurian said: “His decades of financial experience will be invaluable as we continue to execute on our visionary approach for a data-driven future. As a seasoned CFO and proven leader in building high-performance finance teams, Wissam will play a pivotal role working across the business to drive sustained growth and long-term value for our stakeholders.”

NetApp has approved new near-term science-based emissions reduction targets verified by the Science Based Targets initiative (SBTi). SBTi develops standards, tools, and guidance that allow companies to set greenhouse gas (GHG) emissions reduction targets in line with what is needed to keep global heating below catastrophic levels. To date, more than 4,000 companies have validated science-based goals to reduce GHG emissions, with commitments from nearly 3,000 more to set targets.

NetApp commits to reduce absolute scope 1 and 2 GHG emissions by 50.8 percent by FY2030 from a FY2020 base year. NetApp also commits to reduce scope 3 GHG emissions from the use of sold products by 51.6 percent per effective petabyte shipped by FY2030 from a FY2023 base year. SBTi has classified NetApp’s scope 1 and 2 target ambition as in line with a 1.5°C trajectory.

These targets expand and replace the decarbonization goals NetApp set in 2022. NetApp has already made progress toward its sustainability targets, reducing scope 1 and scope 2 emissions by 37 percent compared to its 2020 baseline, as reported in NetApp’s FY24 Environmental, Social, and Governance (ESG) Impact Report.

Quantum discussed data reduction rates in the latest Myriad release, saying: “Depending on the data stored in Myriad, customers will see different ratios of data reduction, and we continue to benchmark our algorithms on various data workloads for better guidance. For example, video is essentially already compressed so customer data comprised of more video will see less data reduction overall, some customers may have many copies of data, and corporate data tends to be more compressible, etc., resulting in much higher compaction ratios.

“While we offered 3-4x compaction example ratios in our data sheets, in testing of general data collections we have seen far higher compaction ratios of 20x or higher, which is again all dependent on the type of data stored, number of copies, and other factors. Overall Quantum Myriad customers will see variable compaction rates ranging from 2-20x or higher depending on the data type.”

QuestDB has joined the STAC Benchmark Council. QuestDB is an open-source time-series database designed for streaming ingestion, supporting open formats like Parquet and Arrow, object storage, and time-series SQL extensions for market data and capital markets use cases.

Digitimes reports Samsung is on high alert as CXMT reportedly adopts an advanced 1z process for its 10nm-class DDR5 mass production. It suggests: “That would leapfrog CXMT to the forefront of DRAM technology which, if true, would spell trouble for Samsung, Micron and others (and for DRAM pricing in general).”

Germany’s Heise media outlet reports that old Seagate drives are fraudulently entering the supply chain with retailers, etailers and system houses delivering supposedly as-new drives that have actually had thousands of hours of use. Affected suppliers include BAB Distribution, ctn-systeme, some Amazon and eBay retailers, Digitec-Galaxus from Switzerland, DiTech from Austria and the Luxembourg branch of Alternate, plus system houses such as Bechtle and Wortmann.

Customers can use SMART diagnostics’ FARM (Field Accessible Reliability Metrics) to find out a drive’s real runtime. Heise says the FARM values can “be determined with the smartmontools under Linux, Windows and also macOS with the command smartctl -l farm /dev/sd[X]; Seagate’s tool collection Seatools can also read out the FARM values.”
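Scripting that check across a fleet of drives might look like the sketch below, assuming Python, a smartmontools build recent enough to support the FARM log, and that the field name in the regular expression matches what your smartctl version actually prints (adjust it to your own output):

```python
import re
import subprocess


def farm_power_on_hours(device):
    """Read the Seagate FARM log via smartctl and extract the power-on hours."""
    out = subprocess.run(["smartctl", "-l", "farm", device],
                         capture_output=True, text=True).stdout
    match = re.search(r"Power on Hours:\s*(\d+)", out, re.IGNORECASE)
    return int(match.group(1)) if match else None


for dev in ("/dev/sda", "/dev/sdb"):                 # hypothetical device names
    print(dev, farm_power_on_hours(dev))             # compare against the standard SMART hours
```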

A Seagate statement said: “Seagate has not sold or distributed these mislabeled hard drives to dealers. We recommend that retailers purchase hard drives only from certified Seagate distribution partners to ensure that they purchase and sell only new or Seagate factory recertified drives. Seagate refurbished, factory-certified drives resold under the Seagate Drive Circularity Program can be identified by the green-bordered white hard drive label and the designation ‘Factory Recertified’. To report a suspected mislabeled Seagate drive, you can contact the Seagate Ethics Helpline.”

Cloud data warehouser Snowflake is reportedly in talks to acquire streaming data platform supplier Redpanda, with a suggested asking price of $1.5 billion. William Blair analyst Jason Ader views that with skepticism. The deal could negatively affect the existing Snowflake-Confluent partnership, but be a net positive for Confluent: “Put simply, this is a strong validation of the strategic importance of data streaming and its large total addressable market.” Ader says that “in its most recent third quarter, Confluent management noted that it won 15 out of 16 times against Redpanda in customer bake-offs.” He suggests Confluent could buy Starburst or sell itself to Databricks.

VAST Data yesterday announced a new partnership with another AI cloud service provider, Canada’s Hypertec Cloud, and reckons it is dominating the AI CSP market.

Cloud-managed composable fast block storage supplier Volumez is evolving its mission to be a GenAI-focused data infrastructure provider, saying it can lift GPU utilization far higher than the 30 percent or so often reported. It does this by dynamically composing storage for a data scientist’s GPU-using AI job, using existing Linux features to do so.

XConn Technologies (XConn) announced that its PCIe 5.0-capable switch, the XC51256 “Apollo,” has successfully passed testing at PCI-SIG Compliance Workshop #132. The XC51256 offers a dense 256-lane configuration for applications heavily reliant on GPUs and accelerators, such as JBOG and JBOA processing setups. The Apollo switch family delivers hybrid CXL 2.0 and PCIe 5.0 technology support in a single switch to further future-proof system designs.

Tom’s Hardware reports, quoting TechInsights, that China’s YMTC has a Gen 5 3D NAND technology with 294 overall layers but 232 active layers, the same as its current gen 4 technology.

Quantum seeks to strengthen balance sheet with SEPA share sale deal

Nasdaq-listed storage player Quantum has struck a standby equity purchase agreement (SEPA) with Yorkville Advisors Global.

The deal will give Quantum access to additional equity capital and liquidity. Limited initially to 1.157 million shares, any further issuance requires shareholder approval.

A SEPA is a financial arrangement used by publicly listed companies to raise capital by selling shares to private investors over time. A company can “call” on the investor to purchase equity on an “as-needed basis,” up to a certain limit.

Unlike traditional capital-raising methods that may involve an immediate large issuance of shares, a SEPA provides more flexibility to the entity seeking funding, allowing companies to sell shares when needed and at times that are advantageous, such as when the market is more favorable.

Under the Quantum/Yorkville Advisors Global arrangement, Quantum has three years to access the full SEPA amount.

Jamie Lerner, Quantum

“This strategic financial partnership provides Quantum with the flexibility to support our ongoing operations and accelerate our growth initiatives,” said CEO Jamie Lerner. “We have solidified access to significant capital, which, over time, will be used to strengthen our balance sheet and enable us to focus on executing our vision of revolutionizing data management solutions for the AI industry.”

Lerner added: “Quantum has made substantial efforts over the last year to improve our operational and financial health through a combination of revenue and margin improvement plans, financial and organizational restructuring, and cost reduction initiatives. In addition, we have been exploring several strategic alternatives to pay down our currently outstanding debt, which would also help to lower our cost structure, including lowering the interest expense and other fees the company has incurred.”

The agreement with Yorkville is initially limited to 19.99 percent of outstanding shares, or no more than 1,157,139 shares of common stock under the purchase agreement. Shareholder approval is required to exceed this share limit. The company’s share price at close on January 30, 2025, was $22.45.

The price per share Yorkville pays will be set at a discount of either 3 or 4 percent, depending on the one- or three-day pricing period elected by the company. Quantum says any proceeds from these sales will be used for “working capital and general corporate purposes,” including the repayment of debt.
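As a purely illustrative calculation (the real price is set against the market price during the elected pricing period, not the January 30 close):

```python
shares = 1_157_139        # initial share limit under the agreement
reference_price = 22.45   # January 30, 2025 close, used purely as a stand-in price

for discount in (0.03, 0.04):
    proceeds = shares * reference_price * (1 - discount)
    print(f"{discount:.0%} discount -> ~${proceeds / 1e6:.1f}M gross proceeds")
# roughly $25M either way, before fees, if the full initial tranche were drawn at that price
```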

Quantum is expected to release its third quarter earnings after the market closes on Sunday, February 2, 2025. Analysts expect Quantum to post earnings of $0.84 per share, and revenue of $72 million for the quarter. In the second quarter, ended September 30, 2024, revenue was $70.5 million.

Earlier this week, Quantum announced scalability enhancements to its Myriad all-flash file system, offering incremental, in-place system scaling with dynamic, automatic data leveling.

Sony quits recordable Blu-ray disc market


February will see Sony end the production of recordable Blu-ray discs, with no replacement for business users.

In a recent statement, Sony Storage Media Solutions said production of recordable Blu-ray disc media, recordable MiniDiscs, MD Data media, and MiniDV cassettes is ending, and that there will be “no successor models.”

The production end doesn’t affect Blu-rays bought by consumers containing films and TV shows, just the blank ones that can be used to record data. Sony offered a range of recordable options, from 25 to 128 GB storage capacity, in both R (write once) and RE (write multiple times).

The production of recordable Blu-ray discs for the consumer market came to an end in July last year. At that time, Sony stated it would continue producing optical discs for business and corporate use – specifically for storing data for secure, long-term usage – until it became “unprofitable.” That position has obviously been reached.

Other manufacturers, such as Panasonic and Verbatim, are still manufacturing recordable optical discs. 4K Blu-ray discs are still being produced by various manufacturers.

Development of Blu-ray was started by Philips and Sony in 1995, shortly after Toshiba’s DVD format was seen as the ideal replacement for VHS tape.

According to IEEE Spectrum, which covered Sony’s decision: “Despite using the compact wavelengths of blue light, Blu-ray soon hit a limit of how much data could be stored.”

AI storage storm brews: Incumbent suppliers face pressure to acquire new tech

Analysis: Massive scale-out, file, and object storage winds are blowing through the incumbent storage supplier halls and threaten to become a cat 4 hurricane that will upend them.

Three front-runners are riding this rising tempest. DDN, VAST Data, and WEKA, with Hammerspace accompanying them, are rushing towards the established incumbents – Dell, HPE, Hitachi Vantara, IBM, NetApp, and Pure Storage – all of which can see it coming and are assessing how fierce the storm is going to become.

Several of them have already made moves, and another is signaling it might.

HPE appears to be leading here. It has adopted a version of VAST Data’s disaggregated, shared-everything (DASE) architecture with its Alletra MP X10000 system, poised to match VAST from inside HPE’s GreenLake and partner ecosystem. NetApp has instituted a project to produce an AI-focused ONTAP. Dell has a project to parallelize PowerScale and it has its disaggregated PowerFlex storage software.

Pure Storage has indicated it’s making moves into the AI training storage arena via exec comments in its recent earnings call. IBM is buying upper-stack AI software supplier DataStax, but not getting directly involved in AI mass scale-out, file, and object storage.

Even Quantum has its Myriad OS development focused on this mass scale-out file and object storage arena.

Looking at the onrushing challengers, VAST is the front-runner here, with its shipping DASE product now supporting block, file, and object storage inside a globally distributed namespace and data space, plus Kafka event broking, and with near-total dominance of the GPU-as-a-Service AI server farm market.

DDN is buoyed by years of HPC success, a longstanding Nvidia partnership, and its new Infinia OS supporting multiple data access protocols. It has recently taken in $300 million in private equity funding.

WEKA has surged through the legacy HPC parallel file systems market as HPC-style storage is adopted by enterprises for AI workloads. And yet WEKA appears to be facing a reset, having experienced a raft of executive departures. The company raised $140 million in an E-round of VC funding in May last year, at a $1.6 billion valuation.

When a technology tsunami has hit storage incumbents in the past, they have embraced it in one of two ways: building their own tech or buying it in. HPE has been a big acquirer in the past. So too have Dell, Hitachi Vantara, NetApp, and even Pure.

Thus far with AI storage, there have been no acquisitions by the incumbents, only internal developments by Dell, HPE, NetApp, and Pure. Is the acceleration of the AI storage market demand build-up now reaching a point where it’s necessary for them to think about buying in the new tech they need?

Could Dell, no stranger to substantial storage acquisitions (remember EMC), be thinking of making a move here? Could IBM move downstack and mimic its prior Storwize buy? Ditto NetApp realizing it needs to move faster and doing a follow-on to its SolidFire and StorageGRID buys? Ditto also for Hitachi Vantara.

B&F is starting to think that a major acquisition by an incumbent storage supplier is overdue. They need to hedge their go-it-alone bets. The AI-focused, mass scale-out, multi-protocol, high-speed storage barbarians are at the gates. Which one of the incumbents will be the first to splash the cash and buy in?

WEKA boosts GenAI speed with Mooncake vLLM tech

File system provider WEKA wants to make its customers’ GenAI models respond faster by using a combination of vLLM and Mooncake technology.

Val Bercovici, WEKA

This was revealed in a blog post by new hire Val Bercovici, titled “Three Ways Token Economics Are Redefining Generative AI.” A token is a word, part of a word, or a letter that is turned into a vector embedding and used in the semantic search part of large language model (LLM) analysis and response generation.

Token economics, he argues, helps with cost reduction, latency reduction, and escaping memory capacity limitations.

Bercovici, who was CTO for NetApp back in 2016 and has long been involved in AI, writes: “DeepSeek’s recent breakthrough in token generation efficiency has reshaped the economics of Generative AI. By implementing context caching on disk, DeepSeek R1 significantly reduces the cost of token generation, slashing API expenses by up to 90 percent.” The word “disk” is a misnomer as he actually means SSD.

DeepSeek’s advance means “AI inference, typically constrained by expensive memory requirements, can now achieve memory-like performance at SSD pricing – potentially cutting costs by 30x.”

He claims WEKA’s software architecture, with NVMe SSD acceleration and fast networking “enables token processing at microsecond latencies.” He adds: “The ability to process high token volumes at ultra-low latency is becoming a key differentiator in the AI economy.”

As LLMs process larger datasets, a GPU high-bandwidth memory (HBM) capacity limitation emerges and fresh token data has to be fetched from storage, which takes time and delays the model’s response. Bercovici asserts: “WEKA enables LLMs and large reasoning models (LRMs) to treat high-speed storage as an adjacent tier of memory, achieving DRAM performance with petabyte-scale capacity” by “optimizing the handling of both input and output tokens.”

This, he claims, is going to be achieved by “WEKA’s upcoming integration with Mooncake [which] further enhances token caching, surpassing traditional solutions like Redis and Memcached in capacity, speed, and efficiency.” WEKA has a vLLM Mooncake project that optimizes token caching for inference serving. This project has two source tech components: vLLM (virtual large language model) and Mooncake. The organizations and technology relationships involved are diagrammed below:

WEKA diagram

Mooncake is a disaggregated architecture for serving LLMs developed by Chinese supplier Moonshot AI. We understand it was founded in Beijing in March 2023 by Yang Zhilin, Zhou Xinyu, and Wu Yuxin. Zhilin has an AI research background and a computer science PhD from Carnegie Mellon University. The company’s Chinese name is 月之暗面 or YueZhiAnMian, which translates to “Dark Side of the Moon,” a reference to the Pink Floyd album. It has raised more than $1.3 billion in funding and is valued at more than $3 billion by VCs.

Moonshot is an AI-driven online commerce optimization business that launched its Kimi chatbot last October. It focuses on long-context AI processing. It introduced Kimi 1.5 for text and vision data processing via GitHub and arXiv this month. It claims Kimi 1.5 “achieves state-of-the-art reasoning performance across multiple benchmarks and modalities – e.g. 77.5 on AIME, 96.2 on MATH 500, 94th percentile on Codeforces, 74.9 on MathVista – matching OpenAI’s o1” and outperforms “existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550 percent).” CoT stands for Chain of Thought.

Kimi 1.5 uses Mooncake, which is described in a separate arxiv paper. It is “the serving platform for Kimi, a leading LLM service provided by Moonshot AI. It features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters. It also leverages the underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated cache of KVCache.”

An inference LLM has prefill and decoding phases. A DigitalOcean document says: “The prefill phase can be likened to reading an entire document at once and processing all the words simultaneously to write the first word whereas the decode phase can be compared to continuing to write this response word by word, where the choice of each word depends on what was written before.”

“LLM inference can be divided into two phases: prefill and decode. These stages are separated due to the different computational requirements of each stage. While prefill, a highly-parallelized matrix-matrix operation that saturates GPU utilization, is compute-bound, decode, a matrix-vector operation that underutilizes the GPU compute capability, is memory-bound.”
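A toy sketch of why that split exists, using random NumPy tensors in place of real model projections: prefill populates the KV cache in one large matrix operation, while decode appends to and re-reads that cache one token at a time, which is the state Mooncake disaggregates and WEKA wants to hold on fast storage:

```python
import numpy as np

rng = np.random.default_rng(0)
d, prompt_len = 64, 128


def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V


# Prefill: the whole prompt is processed at once and the KV cache is populated
K_cache = rng.standard_normal((prompt_len, d))
V_cache = rng.standard_normal((prompt_len, d))

# Decode: one token per step; each step appends to the cache and re-reads all of it
for step in range(4):
    q = rng.standard_normal(d)                                # query for the newest token
    _ = attend(q, K_cache, V_cache)
    K_cache = np.vstack([K_cache, rng.standard_normal(d)])    # cache grows with every token,
    V_cache = np.vstack([V_cache, rng.standard_normal(d)])    # which is why decode is memory-bound
```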

Mooncake also uses vLLM technology. This was developed at UC Berkeley as “an open source library for fast LLM inference and serving” and is now an open source project. According to Red Hat, it “is an inference server that speeds up the output of generative AI applications by making better use of the GPU memory.”

Red Hat says: “Essentially, vLLM works as a set of instructions that encourage the KV cache to create shortcuts by continuously ‘batching’ user responses.” The KV cache is a “short-term memory of an LLM [which] shrinks and grows during throughput.”
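For readers who haven't used it, serving with the open source vLLM library looks roughly like this minimal sketch, assuming a GPU host and a small stand-in model:

```python
from vllm import LLM, SamplingParams   # pip install vllm; needs a GPU host

llm = LLM(model="facebook/opt-125m")   # small model as a stand-in for a production LLM
params = SamplingParams(temperature=0.8, max_tokens=64)

# vLLM continuously batches these requests and pages the KV cache in GPU memory
outputs = llm.generate(["What is object storage?", "Explain KV caching."], params)
for out in outputs:
    print(out.outputs[0].text)
```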

We understand that WEKA is going to integrate Mooncake and vLLM technology into its file system platform so that customers running LLMs referencing WEKA-stored data get responses faster at lower cost.

Bercovici says: “By leveraging breakthroughs like DeepSeek’s context caching and WEKA’s high-speed AI infrastructure, organizations can redefine their AI economics – making generative AI more powerful, accessible, and financially sustainable for the future.”

AWS: S3 Metadata goes GA stateside to speed analytics

Amazon Web Services has made Amazon S3 Metadata generally available following its introduction in preview at the company’s re:Invent 2024 conference in Las Vegas.

Amazon describes S3 Metadata as the “easiest and fastest way” to discover and understand S3 data, with automated and easily queried metadata that’s updated in “near real time”. It promises to “simplify” business analytics, real-time inference applications, and other areas.

The technology supports object metadata, which includes system-defined details like size and source of the object, and custom metadata, which allows users to apply tags to annotate their objects with information like product SKU, transaction ID, or content rating.

S3 Metadata automatically captures metadata from objects as they are uploaded into a bucket and makes that metadata queryable in a read-only table.

As data in the bucket changes, S3 Metadata updates the table “within minutes” to reflect the latest changes. These metadata tables are stored in S3 Tables, storage optimized for tabular data. The S3 Tables integration with AWS Glue Data Catalog is in preview, allowing companies to stream, query, and visualize data – including S3 Metadata tables – using AWS analytics services such as Amazon Athena, Amazon Data Firehose, Amazon EMR, Amazon QuickSight, and Amazon Redshift.
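Querying such a metadata table through Athena with boto3 might look like the sketch below; the database, table, and column names are assumptions and will differ in a real deployment:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Database, table, and column names below are assumptions - adjust to your own catalog
query = """
SELECT key, size, last_modified_date
FROM "s3_metadata_db"."my_bucket_metadata"
WHERE size > 100 * 1024 * 1024
ORDER BY last_modified_date DESC
LIMIT 20
"""

resp = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "s3_metadata_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])   # poll get_query_execution / get_query_results for output
```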

Additionally, S3 Metadata integrates with Amazon Bedrock, allowing AI-generated videos to be annotated with metadata that specifies their AI origin, creation timestamp, and the specific model used to generate them.

AWS says S3 Metadata is currently available in a number of AWS regions, including US East (N. Virginia), US East (Ohio), and US West (Oregon).

As for S3 pricing, you pay for storing objects in your S3 buckets. The rate you are charged depends on your objects’ size, how long the objects are stored during the month, and the storage class, such as S3 Standard, S3 Intelligent-Tiering, S3 Standard-Infrequent Access, S3 One Zone-Infrequent Access, or S3 Express One Zone.
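As a back-of-the-envelope illustration only, the storage portion of a monthly bill is roughly object volume multiplied by the per-GB rate for the chosen class; the rate below is a placeholder, not a current AWS list price.

```python
# Rough monthly storage cost estimate (illustrative numbers only).
gb_stored = 1024                  # ~1 TB kept in the bucket for the whole month
rate_per_gb_month = 0.023         # placeholder S3 Standard-style rate in USD
print(f"~${gb_stored * rate_per_gb_month:.2f} per month")   # ~$23.55
```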

Kioxia open sources AiSAQ tech to cut DRAM use in RAG

Kioxia AiSAQ technology, designed to reduce DRAM requirements in generative AI systems, was this week released as open source software.

AiSAQ, short for “all-in-storage ANNS with product quantization,” provides an “approximate nearest neighbor search” (ANNS) algorithm optimized for SSDs. The software delivers scalable performance for retrieval-augmented generation (RAG) without placing index data in DRAM, instead searching directly on SSDs.

Generative AI systems demand significant compute, memory, and storage resources. “While they have the potential to drive transformative breakthroughs across various industries, their deployment often comes with high costs,” said Kioxia. RAG is a critical AI technique that supplements large language models (LLMs) with data specific to the company or application.

A central component of RAG is a vector database that accumulates and converts specific data into feature vectors for retrieval. RAG also utilizes an ANNS algorithm, which identifies the stored vectors most relevant to a query based on the similarity between the accumulated and target vectors. “For RAG to be effective, it must rapidly retrieve the information most relevant to a query,” said Kioxia.

Traditionally, ANNS algorithms are deployed in DRAM to achieve the high-speed performance required for these searches. But Kioxia AiSAQ technology provides a “scalable and efficient” ANNS solution for billion-scale datasets with “negligible” memory usage and “fast” index switching capabilities, Kioxia said.
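For orientation, the sketch below shows the conventional in-memory combination of product quantization and ANNS using the faiss library; AiSAQ’s contribution is keeping the compressed index on SSD rather than in DRAM, which this sketch does not attempt to reproduce.

```python
import faiss
import numpy as np

d = 128                                              # vector dimensionality (illustrative)
xb = np.random.rand(100_000, d).astype("float32")    # accumulated corpus vectors
xq = np.random.rand(5, d).astype("float32")          # query vectors

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer for the inverted lists
index = faiss.IndexIVFPQ(quantizer, d, 1024, 16, 8)  # 1024 lists, 16 sub-vectors, 8 bits each

index.train(xb)                                      # learn the PQ codebooks in memory
index.add(xb)                                        # store compressed codes, not full vectors

distances, ids = index.search(xq, 5)                 # approximate 5 nearest neighbors per query
print(ids)
```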

Kioxia AiSAQ slide

The key benefits of AiSAQ include allowing large-scale databases to operate without relying on limited DRAM resources, enhancing the performance of RAG systems. It also eliminates the need to load index data into DRAM, enabling the vector database to launch instantly. This supports switching between user-specific or application-specific databases on the same server for efficient RAG service delivery.

Axel Stoermann, Kioxia

It is optimized for cloud systems by storing indexes in disaggregated storage so they can be shared across multiple servers. This approach allows vector database search performance to be tuned for specific users or applications, and eases the migration of search instances between physical servers.

“Our AiSAQ solution paves the way for almost infinite scaling of RAG applications in generative AI systems based on flash-based SSDs at the core,” said Axel Stoermann, chief technology officer and VP at Kioxia Europe. “By utilizing SSD-based ANNS, we are reducing the reliance on costly DRAM, while matching the performance needs of leading in-memory solutions, enhancing the performance range of large-scale RAG applications significantly.”

Data resilience and data portability

SPONSORED FEATURE: Considering it has such a large share of the data protection market, Veeam doesn’t talk much about backups in meetings with enterprise customers these days.

Instead, the company focuses on two other terms: data resilience, and data portability. This is a response to new threats that are emerging for enterprise customers – including some from vendors themselves.

You’d be forgiven for thinking that data resilience was just another term for what we used to call business continuity. However, it goes further than that, explains Brad Linch, director of technical strategy at Veeam. It’s a response to a shift in the data protection landscape over the last few years.

“Business continuity plans were designed for fires and power outages and natural disasters. But in this next evolution of the data protection industry focused on data resilience, you’re dealing with a subcategory of disaster, which is cyberattacks,” he says. “Dealing with cyber recoveries is a whole different ball of wax. It’s much more common than a natural disaster. For one thing, it’s going to happen more often than an accidental fire, or a power outage, or even a hurricane.”

Disasters of this type carry different challenges. For example, when hit by ransomware, which often focuses on encrypting backups, it’s more difficult to identify your last clean backup copy. That makes it inadvisable to automatically restore a backup directly to production without doing some further analysis. The whole backup and restoration process changes.

From VMware… to everywhere

Natural disasters and cyberattacks aren’t the only modern threats to data resilience. Another threat is commercial. Many enterprises’ natural data habitat has become VMware, which was Veeam’s original focal point for backup solutions. They happily processed their data in VMware virtual machines, but now, commercial changes from VMware’s new owner are threatening that situation.

Broadcom’s licensing changes eliminated stand-alone support services for perpetual license holders and shifted them to per-core subscription-based support packages. In October, the company backtracked, reintroducing its vSphere Enterprise Bundle so that enterprise customers could use the software without upgrading to the full VMware Cloud Foundation suite.

Experts fret that Broadcom’s capitulation wasn’t enough. In November Gartner VP analyst Michael Warrilow said that VMware enterprise customers were “all miserable” because of the Broadcom changes, adding that licensing costs have risen by 200-500 percent on average. A migration away from the platform hasn’t happened yet, he said – but it’s likely coming, given that IT procurement often moves in multi-year cycles.

This tense situation highlights the need for something that Veeam has been investing in heavily for years: broad cross-platform support that goes well beyond VMware. Enterprise customers need a data resilience solution that can protect them across all of the platforms they use, enabling them to make technical and commercial choices that support their business decisions. That includes expanding out from – or away from – VMware, or any hypervisor or cloud platform, should they need to.

This concept of cross-platform support for backups underpins a central tenet of Veeam’s: ‘data portability’. The company is an advocate for platform-agnostic backup, enabling people to store or recover copies of their data wherever they wish. It has built data portability into its products for long-term strategic reasons, such as a Broadcom-driven migration, and for shorter-term technical ones, such as system outages.

“It’s about making sure you can run your data not so much where you want, but sometimes where you need to,” explains Matt Crape, senior technical product marketer at Veeam. “Say you get ransomware, and you lose access to all your hardware as the insurance company investigates. If you do it right, you can restore to the cloud and get things back up and running very quickly.”

To do this, Veeam embarked on a multi-year effort to provide extensive cross-platform options. This began in 2011, when it rolled out support for Microsoft’s Hyper-V. In 2016, it launched a direct restore to Azure service.

Since then, it has gone much further. Today, it supports Nutanix’s KVM-based AHV enterprise hypervisor platform, along with both Red Hat Virtualization (which Red Hat will sunset in 2026) and the newer Red Hat OpenShift Virtualization. It has also delivered support for Oracle Linux Virtualization Manager, and for the Proxmox Virtual Environment (VE) virtualization management platform. On the cloud side, it now supports AWS and Google Cloud alongside Azure.

Support for these platforms enables it to offer an instant cross-platform virtual machine recovery feature that mounts your best available backup to whichever target – or targets – you choose, including Kubernetes container environments.

Building deep data resilience

Veeam has a dual-pronged strategy to support its data resilience mission. Alongside its focus on breadth of platform support, it has also spent the last few years building out a deep set of features to protect enterprise customer data and assist with migration.

The company provides support for single-cloud, multi-cloud, and hybrid environments (including its own Veeam Data Cloud Vault storage service). Scalability is a critical factor here: the larger your data backup infrastructure becomes, the more complex it is to manage. The company’s Scale-out Backup Repository feature provides a layer of abstraction across an enterprise customer’s entire data landscape, virtualizing it into a single pool for ease of management. This, combined with support for backup policies, enables administrators to keep sprawling data collections in line.

Alongside support for multiple cloud and virtualization platforms, Veeam has built out a variety of tool integrations. These are especially strong in the cybersecurity and incident response realm that forms such a big part of the data resilience proposition. Alongside SIEM integrations with the likes of Splunk and Sophos, it also offers SIEM and SOAR integrations with Palo Alto. Its SOAR partnerships extend to other companies including ServiceNow and Cisco, with more reportedly on the way.

Veeam has built an expansive set of threat detection capabilities covering the periods before, during, and after attacks, helping users achieve even greater protection across their environments. The firm has also partnered with various companies across other functions, including cybersecurity readiness, incident response, ransomware detection and reconnaissance, and encryption.

These partnerships have helped the data resilience company to enhance its Data Platform tool with various security protections that spot and squash cyberattacks, including ransomware. They enable it to take a multi-faceted approach that includes scanning during the backup process. It will watch for file extensions and for tools such as Nmap or FileZilla that are known to be used by bad actors, enabling it to detect attackers that ‘live off the land’ by using generic tools to fly under the radar.

Immutability is another key part of ransomware protection, as this type of malware frequently attempts to encrypt backups. Veeam offers immutability protections for its backups, enabling administrators to set time periods during which data cannot be altered. Administrators can use its immutability workload monitoring feature to spot any changes made to that time window. These immutable backups work across a wide range of enterprise technology platforms, including all of the usual cloud suspects, alongside those from a tranche of hardware and software partners.
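To ground the idea, time-boxed immutability on object storage is commonly implemented with S3 Object Lock retention. The boto3 sketch below illustrates that generic mechanism with placeholder names; it is not Veeam’s own tooling, and the bucket must have been created with Object Lock enabled (newer boto3 releases attach the checksum that Object Lock puts require).

```python
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
retain_until = datetime.now(timezone.utc) + timedelta(days=30)

s3.put_object(
    Bucket="example-backup-bucket",            # placeholder; bucket created with Object Lock enabled
    Key="backups/vm-042/2025-02-14.vbk",       # placeholder backup object
    Body=b"<backup image bytes>",              # placeholder payload
    ObjectLockMode="COMPLIANCE",               # retention cannot be shortened or removed
    ObjectLockRetainUntilDate=retain_until,    # data is immutable until this timestamp
)
```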

Confidence in reporting and restoration

Immutable data is only useful if you’re confident that you can restore it. The US Department of Homeland Security’s CISA advocates for testing of backup restoration in its Cross-Sector Cybersecurity Performance Goals (CPG) guidance. Veeam achieves this via its SureBackup feature, which stands up virtual machines from backup data and then tests them to ensure they’re healthy.

Data restoration must be more than quick, explains Linch – it must also be flexible so that victims of a cyberattack can work around whatever damage the attacker has done. This is a critical part of the data resilience proposition.

“You have to be able to recover from multiple layers, such as backups, snapshots and replicas,” he says. “When you look at Veeam, we’re the only one in the industry that allows you to do those three things in any license of Veeam you bought.” The company also offers restoration of full VMs, individual files, and application-specific data such as database content.

In many industries, it isn’t enough to back up and test your data; you have to prove that you’re doing it well. That’s where the final element of data resilience – compliance – comes into play.

Veeam offers various compliance tools, including policy-driven automation for tasks like retention management and the storage of tagged data in specific areas, along with constant compliance monitoring through its Security and Compliance Analyzer. Its reporting capabilities give you something to satisfy the auditors, streamlining the compliance process so that you – and the regulators – can be sure you’re protecting data properly.

Tools like these are just one part of data resilience. Enterprises that don’t prepare a strategy to orchestrate all of this properly run the risk of ‘all the gear and no idea’ syndrome. “Organizations fall to the level of their preparation,” says Rick Vanover, who is on the product strategy team in Veeam’s Office of the CTO.

At a strategic level, data resilience requires good risk management, which means sitting down with the rest of the business to understand what kinds of data it wants to protect and why.

There are also considerations at the technical level. “Performance management is a big one,” warns Vanover. And enterprises must also consider the full spectrum of technical and commercial challenges. “If people move to the cloud, they must assess networking, security and billing. All three of those bite different, and hard.”

Data resilience is a far-reaching discipline that includes data portability. The option to migrate your backup data quickly between platforms is critical in an industry where an attack can knock infrastructure out of action – or where your costs can balloon on a vendor’s whim. Veeam is building a solid tool set to support a strategic move toward data resilience, one that looks set to become ever more important for enterprise users of VMware and other platforms alike.

Sponsored by Veeam.