Commvault expands cloud and cyber resilience solutions, including Cloud Rewind

Commvault is announcing new cloud support and anti-malware updates at its London Shift customer event: full Commvault Cloud availability on AWS, protection for the Google Workspace suite, point-in-time recovery and rebuild of SaaS suite backups, and a partnership with Pure Storage to support financial customers affected by the EU’s Digital Operational Resilience Act (DORA).

Commvault Cloud is the new branding for its Metallic SaaS data protection service. DORA is a set of EU regulations to enhance the cyber resilience of financial institutions, aiming to ensure they can continue to function during cyberattacks or other potentially disastrous IT incidents. It is scheduled to come into force from January 2025.

Pranay Ahlawat, Commvault

Pranay Ahlawat, Commvault Chief Technology and AI Officer, stated: “We are proud to extend the full power of the Commvault Cloud platform and cloud-native solutions to AWS. We believe our game-changing technology will empower joint customers to recover faster, mitigate threats more effectively, and enhance their cyber resilience strategies.”

Offerings that will be available to AWS customers include: 

  • Cloud Rewind: Based on Appranix technology, Cloud Rewind acts as an AWS time machine, enabling customers to “rewind” to the last clean copy of their data, recover that data, and automate the cloud application rebuild process so that businesses can get back to normal in minutes.
  • Cyber Resilience for Amazon S3: Through its recent acquisition of Clumio, Commvault will be introducing new technology in the coming months that will also bring time machine capabilities to Amazon Simple Storage Service (S3) customers. In the event of an attack, this technology will allow customers to revert quickly to a clean copy of data that has not been infected with malware.
  • Air Gap Protect: This will provide AWS customers with immutable, isolated copies of data in a Commvault tenant as a service, giving AWS customers another way to keep their data safe and resilient. 
  • Cleanroom Recovery: Already available in Azure and now being extended to AWS, this allows organizations to automatically provision recovery infrastructure, recover to an isolated location in AWS, and restore infrastructure workloads like Active Directory alongside production workloads. They can also conduct forensics in this clean, safe location, and IT and security teams can test their cyber recovery plans in advance so they know they can recover quickly when hit.

Cloud Rewind is unique to Commvault. “When organizations are attacked, restoring the data is only half the battle,” the company says. “The truly laborious task is actually restoring the distributed cloud applications, which are used to run and power that data.”

A typical enterprise can run more than 350 SaaS apps, and recovering from an attack can require these apps to be restored in a systematic way that takes a great deal of time when done manually. Appranix technology automatically identifies and catalogs all cloud components in use, offering full visibility into which assets need protection and recovery, and it analyzes and defines the relationships and dependencies between those components.

It builds an operational blueprint, capturing the data and the full map of applications, infrastructure, and networking configurations so that “systems can be restored with their complete operational blueprint intact, reducing guesswork for a thorough recovery … When a system is restored, all connected resources and services are aligned, with little to no human involvement.”

Cloud Rewind supports all major public and private cloud platforms, including AWS, Google Cloud, and Microsoft Azure. 

Brian Brockway, Commvault

Brian Brockway, Commvault CTO, stated: “What we are doing with Cloud Rewind is unlike anything offered on the market today. In the ransomware era, recovering data is important, but it’s table stakes. We’re ushering in an entirely new chapter in cyber resilience that not only expedites data recovery, but recovery of cloud applications. This is the gold standard in recovery for a cloud-first world.”

A new Cyber Resilience Dashboard provides continuous ransomware readiness assessments and indicates gaps in resilience plans. It provides a view across the entire data estate, assessing components such as testing frequency and success, and availability of immutable air-gapped copies of critical data.

We understand that a coming Clumio capability will be the instant restore of massive S3 buckets containing tens of billions of objects. Restoring single S3 objects is easy, but restoring buckets at massive scale is a different and vastly more difficult proposition. Clumio’s tech uses snapshotting and versioning with the ability to go back to a point in time and restore a bucket’s state.

Google Cloud

Commvault has launched SaaS-based Cloud Backup & Recovery for Google Workspace. This protects Gmail, Google Drive, and Shared Drives, with built-in Google Cloud Storage for Google Workspace data. Google Workspace customers will be able to effortlessly discover active data, rapidly recover from inadvertent or malicious data deletion, and maintain a copy of valuable data in the Commvault Cloud for compliance mandates.

It is expanding its Google Cloud capabilities with Cloud Rewind, which integrates Appranix’s application rebuild capabilities into the Commvault Cloud platform, providing Google Cloud customers with an automated, cloud-native rebuild solution to recover from cyber incidents more rapidly.

Earlier this year, Commvault announced support for object retention lock for Google Cloud Storage, providing customers with immutable cloud storage on Google Cloud infrastructure.

Pure Storage

Commvault announced a joint cyber-readiness offering with Pure Storage that delivers the ability to continuously test recovery in secure, isolated environments: on demand in cloud-isolated tenants via Commvault’s Cleanroom Recovery solution, or within isolated recovery environments. The two say that customers “can easily deliver rapid, frictionless recovery of clean data to isolated environments with the flexibility needed to meet operational and data sovereignty requirements.”

This is not only applicable to DORA, but also supports compliance with other cybersecurity and privacy regulations such as the EU’s NIS2 Directive and e-mandates from the Reserve Bank of India (RBI) for recurring transactions.

Patrick Smith, Field CTO, EMEA, Pure Storage, said: “Through our partnership with Commvault, we are giving financial institutions critical tools that not only help comply with regulations like DORA but advance their cyber resilience to help ensure enterprise data remains secure, protected, and if necessary, recoverable.”

Availability

Commvault’s suite of offerings for AWS will be generally available in the coming months. Following that, joint customers will be able to access them in the AWS Marketplace. Commvault Cloud Backup & Recovery for Google Workspace is targeted for availability by the end of the calendar year. Cloud Rewind and the Cyber Resilience Dashboard will be generally available in the coming months.

AI vectorization signals end of the unstructured data era

Analysis: A VAST Data blog says real-time GenAI-powered user query response systems need vector embeddings and knowledge graphs created as data is ingested, not in batch mode after the data arrives.

VAST co-founder Jeff Denworth writes in a post titled “The End of the Unstructured Data Era” about retrieval-augmented generation (RAG): “AI agents are now being combined to execute complex AI workflows and enhance each other. If a business is beginning to run at the speed of AI, it’s unthinkable that AI engines should wait minutes or days for up-to-date business data.”

Jeff Denworth

He says early “solutions for chatbots … front-ended small datasets such as PDF document stores,” but now we “need systems that can store and index trillions of vector embeddings and be able to search on them in real time (in order to preserve a quality user experience).”

A vector embedding is a calculated array of numbers (a vector) that represents an unstructured data item’s position along dimensions such as color, position, shape components, aural frequency, and more. A document, file, data object, image, video, or sound recording can be analyzed by a vectorization engine, which generates hundreds, thousands, or more embedding values that characterize the item. These vector embeddings are indexed and stored in a vector database. When a user makes a request to a GenAI chatbot, that request is turned into vectors and the vector database is searched for similar vectors inside a so-called semantic space.

The user could supply an image and ask: “What is the name of this image?” The image and question are vectorized, the chatbot semantically searches its general training dataset for image vectors that are closest to the user-supplied image, and it responds with the painting’s name: “It is the Mona Lisa painted by Leonardo da Vinci,” or, more usefully for an enterprise: “It is a Cisco Nexus 9200 switch.”

To improve response accuracy, the chatbot can be given access to a customer organization’s own data, with its response augmented by information retrieved from that data, hence retrieval-augmented generation.
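
As a rough illustration of the retrieval step described above, the sketch below builds a tiny in-memory vector store and finds the stored items closest to a query using cosine similarity. It is a minimal example: the embed() function is a hypothetical stand-in for a real embedding model, and production systems use a vector database with approximate nearest-neighbor indexes rather than a brute-force scan.

import numpy as np

def embed(text: str, dims: int = 8) -> np.ndarray:
    # Stand-in for a real embedding model: hash words into a fixed-size, normalized vector.
    vec = np.zeros(dims)
    for word in text.lower().split():
        vec[hash(word) % dims] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

# A tiny "vector database": each entry is the item text plus its embedding.
corpus = ["Cisco Nexus 9200 switch datasheet",
          "Mona Lisa painting by Leonardo da Vinci",
          "Quarterly sales report for the EMEA region"]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query: str, top_k: int = 2):
    q = embed(query)
    # Cosine similarity reduces to a dot product because the vectors are normalized.
    scored = sorted(index, key=lambda item: float(np.dot(q, item[1])), reverse=True)
    return [doc for doc, _ in scored[:top_k]]

# The retrieved documents would then be appended to the chatbot prompt - the "augmentation" in RAG.
print(retrieve("what switch is this"))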

A knowledge graph is generated from structured or block data and stores and models relationships between entities (events, objects, concepts, or situations), known as head and tail entities, with a “triple” referring to a head entity + relationship + tail entity. Such triples can be linked, and the relationships are the semantics. Crudely speaking, they describe how pairs of data items are linked in a hierarchy. Chatbots at the moment do not generally use knowledge graphs, but suppliers like Illumex are working on what we could call knowledge graph-augmented retrieval.
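
A minimal sketch of that triple structure: each fact is stored as (head entity, relationship, tail entity), and related facts can be chained by matching one triple’s tail to another’s head. The entity and relationship names below are invented for illustration.

# Knowledge graph represented as a list of (head, relationship, tail) triples.
triples = [
    ("Mona Lisa", "painted_by", "Leonardo da Vinci"),
    ("Leonardo da Vinci", "born_in", "Vinci, Italy"),
    ("Nexus 9200", "manufactured_by", "Cisco"),
]

def neighbors(entity: str):
    """Return every (relationship, tail) pair linked to a given head entity."""
    return [(rel, tail) for head, rel, tail in triples if head == entity]

# Chaining triples: follow "painted_by" and then look up the artist's own facts
# to answer a two-hop question such as "where was the Mona Lisa's painter born?".
for rel, artist in neighbors("Mona Lisa"):
    if rel == "painted_by":
        print(artist, neighbors(artist))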

Denworth’s company has announced InsightEngine, an AI-focused system for real-time data ingestion, vectorization, storage, and response, and his blog extols its virtues and explains the need for its development.

He writes: “We’re watching the exponential improvements in embedding models and seeing a time in the not-too-distant future where these tools will be of a caliber that they can be used to index the entirety of an enterprise knowledge base and even help to curate enterprise data. At that point, hyper-scale vector stores and search tools are table stakes. The trillion-way vector problem is already a reality for large AI builders like OpenAI and Perplexity that have gone and indexed the internet.”

As well as vectorizing existing data sets, companies will “need to be able to create, store and index embeddings in real time.”

“I think of vectors and knowledge graphs as just higher forms of file system metadata,” Denworth writes. “Why wouldn’t we want this to run natively from within the file system if it was possible?”

Existing file and object systems “force IT teams to build cumbersome retrieval pipelines and wrestle with the complexity of stale data, stale permissions and a lot of integration and glue code headaches … The idea of a standalone file system is fading as new business priorities need more from data infrastructure.”

Let’s think about this from a business aspect. A business has data arriving or being generated in multiple places inside its IT estate: mainframe app environment, distributed ROBO systems, datacenter x86 server systems, top tier public cloud apps, SaaS apps, security systems, data protection systems, employee workstations, and more.

Following Denworth’s logic, all of this data will need vectorizing in real time, at the ingest/generation location point and time, and then stored in a central or linked (single namespace) database so that semantic searches can be run against it. That means that all the applications and storage systems will need to support local and instant vectorization – and knowledge graph generation as well.

There will need to be some form of vectorization standard developed, and storage capacity would need to be set aside for the stored vectors. How much? Take a PDF image. Assuming 512 vector dimensions and a 32-bit floating point number per dimension, we’d need around 2 KB of capacity per embedding. Increase the dimension count and up goes the capacity; halve the floating point precision and down it goes.
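
The arithmetic behind that estimate, and how it scales with dimension count and precision, can be checked in a few lines. The dimension counts and the billion-embedding figure below are arbitrary examples for illustration, not VAST numbers.

# Bytes per embedding = dimensions x bytes per element.
def embedding_bytes(dims: int, bits_per_element: int) -> int:
    return dims * bits_per_element // 8

print(embedding_bytes(512, 32))    # 2048 bytes, roughly 2 KB per embedding
print(embedding_bytes(1536, 32))   # 6144 bytes: more dimensions, more capacity
print(embedding_bytes(512, 16))    # 1024 bytes: half-precision halves the footprint

# At scale: one billion 512-dimension float32 embeddings, expressed in terabytes.
print(embedding_bytes(512, 32) * 1_000_000_000 / 1e12)  # about 2 TB before any index overhead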

This means that file-handling and object systems from Dell, DDN, HPE, Hitachi Vantara, IBM, NetApp, Pure Storage, Qumulo etc. would need to have vectorization, embedding storage, and metadata added to them – if Denworth is right. Ditto all the data lake and lakehouse systems. 

Ed Zitron

AI bubble or reality

Of course, this will only be necessary if the generative AI frenzy is not a bubble and develops into a long-lived phenomenon with real and substantial use cases emerging. Commentators such as Ed Zitron have decided that it won’t. OpenAI and its like are doomed, according to critics, with Zitron writing: “it feels like the tides are rapidly turning, and multiple pale horses of the AI apocalypse have emerged: ‘a big, stupid magic trick’ in the form of OpenAI’s (rushed) launch of its o1 (codenamed: strawberry) model, rumored price increases for future OpenAI models (and elsewhere), layoffs at Scale AI, and leaders fleeing OpenAI. These are all signs that things are beginning to collapse.”

But consultancies like Accenture are going all-in on chatbot consultancy services. An Accenture Nvidia Business Group has been launched with 30,000 professionals receiving training globally to help clients reinvent processes and scale enterprise AI adoption with AI agents. 

Daniel Ives

Financial analysts like Wedbush also think the AI hype is real, with Daniel Ives, managing director, Equity Research, telling subscribers: “The supply chain is seeing unparalleled demand for AI chips led by the Godfather of AI Jensen [Huang] and Nvidia and ultimately leading to this tidal wave of enterprise spending as AI use cases explode across the enterprise. We believe the overall AI infrastructure market opportunity could grow 10x from today through 2027 as this next generation AI foundation gets built, with our estimate that $1 trillion of AI capex spending is on the horizon over the next three years.

“The cloud numbers and AI data points we are hearing from our field checks around Redmond, Amazon, and Google indicates massive enterprise AI demand is hitting its next gear as use cases explode across the enterprise landscape.”

Ben Thompson

Stratechery writer Ben Thompson is pro AI, but thinks it will take years, writing: “Executives, however, want the benefit of AI now, and I think that benefit will, like the first wave of computing, come from replacing humans, not making them more efficient. And that, by extension, will mean top-down years-long initiatives that are justified by the massive business results that will follow.”

Who do we believe? Zitron or the likes of VAST Data, Accenture, Wedbush, and Thompson? Show me enterprises saving or making millions of dollars from GenAI use cases with cross-industry applicability and the bubble theory will start receding. Until that happens, doubters like Zitron will have an audience.

Catalogic adds clean-room recovery, immutability to DPX

Privately held Catalogic has added a Deletion Lock feature and a Cyber Resilient Recovery feature with a clean-room environment for verifying restores are free from malware.

Catalogic Software, founded in 2003, had three product lines by 2020: ECX for copy data management, DPX for endpoint and server data protection, and its CloudCasa SaaS service for Kubernetes app protection. In May 2021, IBM, which accounted for approximately 80 percent of ECX sales, bought the ECX product line. A CloudCasa spinoff was envisaged last year but has been placed on the back burner. CloudCasa partnered with IONOS Cloud in July and received a Persistent Volume upgrade last month. Now DPX v4.11 has been upgraded with immutability and clean room recovery.

Pawel Staniec, Catalogic

Catalogic CTO Pawel Staniec stated: “DPX features a software-defined storage layer with built-in ransomware protection, offering a unique ‘out-of-the-box’ solution. Unlike other vendors that require assembling various components to achieve the same functionality, DPX delivers it all at a fraction of the cost.”

The Deletion Lock feature prevents backed-up data from being deleted or altered. The clean-room facility creates a quarantined space isolated from the customer’s network, within which admins can verify that a backup is free from malware before restoring its data to production.

Catalogic has a backup repository and storage appliance called vStor. vStor Snapshot Explorer is an anti-malware file-scanning facility, distributed as an agent installable on Windows and Linux hosts. It exposes a REST API and has a built-in plugin architecture that can be used for integration with existing security or data protection infrastructure.

A vStor GuardMode feature provides immutable backups with retention policies, access controls, logging, and auditing.

The latest version of DPX integrates both vStor Snapshot Explorer and vStor GuardMode. Snapshot Explorer now scans snapshot backups to verify they are free from ransomware and enables granular file recovery even if catalog information is lost.

Data lakehouse Starburst makes senior hires for key markets

Open data lakehouse Starburst has hired fresh leadership for its key growth markets.

Starburst is positioning itself to solve the problem of an ever-increasing volume of enterprise data siloed across different platforms, including machine learning and AI data.

The company says its Trino platform can help such businesses with data management. Customers on Starburst’s books include the likes of Halliburton, Comcast, Doordash, the US government, and some of the largest US banks.

Deron Miller and Steve Williamson

As part of its scale-up, the provider has now appointed Deron Miller as senior vice president and general manager for the Americas and Asia-Pacific regions, and has also brought in Steve Williamson as SVP and general manager for the EMEA region.

Prior to Starburst, Miller served as chief revenue officer for Delphix, an enterprise data company acquired by Perforce Software. He has also held revenue leadership roles at GE, Adobe, MicroStrategy, and Nuance Communications.

“Starburst is one of the most exciting technologies that I have seen in over 20 years,” beamed Miller. “By enabling our customers to access data directly across different platforms, we can help accelerate innovation, while reducing cost.”

Before joining Starburst, Williamson served as general manager of EMEA at Apptio. He has also held executive roles at Acquia, Demandware/Salesforce, Dell/EMC, and Oracle, as part of a 20-plus-year career.

Williamson said: “European enterprises have to navigate challenges in data privacy, security, and compliance, whilst keeping up with the dynamic needs of the market. Starburst offers Trino and Apache Iceberg to simplify access to enterprise data to drive faster revenue growth, while mitigating data risk.”

Earlier this year, Starburst appointed Steven Chung as president, Tobias Ternstrom as chief product officer, and Adam Ferrari as senior vice president of engineering.

Numemory reinvents Optane storage-class memory

Chinese fabless semiconductor startup Numemory is developing storage-class memory (SCM) chips using Optane-like phase change memory and crossbar technology.

Numemory, a business unit of Xincun Technology (Wuhan), has launched the NM101, a storage-class memory chip with 10x faster read and write speeds and 5x longer endurance than NAND chips. Xincun Technology was established in July 2022 and has developed a 3D selector-only memory (SOM) technology using phase-change memory and a crossbar structure, which is remarkably similar to Intel and Micron’s failed Optane memory technology.

Numemory’s website states: “The new memory is a type of SCM (storage-class memory) that can increase storage capacity, reduce costs, and thus greatly improve the overall performance of the storage system. The new memory is considered to be a next-generation non-volatile storage technology that is very promising for large-scale mass production, and has become a hot topic of international concern in recent years.”

Storage-class memory combines near-DRAM speeds with faster-than-NAND performance while retaining NAND persistence so that it occupies a place in the memory hierarchy above SSDs but below DRAM.

The NM101 chip’s tech specs are listed in a table on the Numemory website, which we translated.

The 3200 MT/sec rating is the same as an Intel Optane PMem 200 device. Numemory says the NM101 has low latency but the table does not provide a latency number. It doesn’t provide any IOPS or bandwidth numbers either.

In 2023, Numemory stated it had “a total of 273 invention patents, 60 international patents, and 213 Chinese patents; 51 Chinese patents have been authorized; and 11 integrated circuit layout designs have been authorized.”

A patent, CN118215387A, relating to PCM and a SOM SCM device between DRAM and NAND, states: “SOM differs from 3D XPoint memory (and similar 3D cross-point memory) in that the memory cells that it uses a cross-point structure (crossbar) consist of only one selector. The SOM memory has the advantages of simple structure, high scalability, high SET/RESET speed, good cycle life, vertical stacking, no write crosstalk caused by thermal effect and the like, is a very promising nonvolatile memory, and has obvious advantages in the fields of high-speed computing interconnection (Compute Express Link, CXL), high-bandwidth memory (High Bandwidth Memory, HBM) and the like.”

There is also patent CN118301944A, which relates to a hard mask layer in SOM fabrication.

We understand that Chinese tech business Guao Technology has invested in Xincun, which has been developing its SOM chip since 2019. Xincun aims to start mass production by the end of this year and produce 10,000 chips a month by the end of 2025. A production line will be established, with Guao investing ¥10 billion along with help from the government of Anji county, Huzhou city, Zhejiang province. Guao Technology and Xincun Technology will jointly set up a chip module testing and packaging company.

We understand Xincun Technology has around 180 employees, with 80 percent in research and development and 60 percent holding a master’s degree or higher qualification.

Comment

If the NM101 is viable and can be manufactured and sold successfully – unlike Optane – China will have a real SCM memory device that should be faster than accelerated SSDs such as Solidigm’s D7-P5810, Micron’s XTR, and Kioxia’s FL6 drives. That should speed applications running in SCM-assisted servers and storage systems.

We don’t know anything about Numemory’s go-to-market strategy, but we can assume the domestic Chinese market will be its initial focus with sales outside China a possibility once manufacturing volume ramps up high enough. 

We can also assume that Chinese storage companies like Huawei and domestic server suppliers will be customers for the NM101, which should give them a performance advantage against storage arrays and servers that don’t use the chip.

We can also look forward to Taiwanese and Korean semiconductor companies, as well as engineers from all the DRAM and SSD vendors, poring over Xincun Technology’s patents and trying to find out if Numemory’s tech is real, can be replicated somehow, and if it can possibly succeed where Intel and Micron failed.

Ocient hyperscale data warehouse supports machine learning but not GenAI

Hyperscale data warehouse supplier Ocient has surveyed enterprise execs and found that AI spending is being prioritized against a background of concern about cloud costs and energy consumption.

Ocient surveyed more than 500 IT and data leaders managing data workloads of 150 terabytes or more in its third annual Beyond Big Data report, titled “Reaching New Altitudes.”

Chris Gladwin, Ocient

CEO and co-founder Chris Gladwin stated: “As data volumes continue to explode, enterprises are facing a dual challenge: rising costs and growing energy consumption. To harness the full potential of data while mitigating these risks, leaders must fundamentally rethink their data analytics strategies. The latest Beyond Big Data report confirms this shift and highlights the critical significance of sustainable, energy-efficient data analytics and management solutions.”

The headline findings include:

  • 100 percent of IT and data leaders agree that increasing the volume of data analyzed by their organization near-term is important.
  • Only 25 percent of respondents report prioritizing cloud-only data analytics solutions in the next 12-18 months, a 10 percent drop from 2023.
  • While data speed and security remain crucial, sustainable energy consumption and cost have emerged as top priorities.
  • 64 percent of enterprise leaders cite “surprising” cloud costs as the leading factor impacting their ability to predict spend accurately.
  • More than half (53 percent) of respondents say energy consumption is a top concern.
  • Nearly a third (31 percent) cite reducing energy consumption as a motivator to switch or upgrade data warehouse or database solutions.
  • 93 percent of IT and data leaders plan to make AI investments in the next 12-18 months.

Ocient compared its 2023 and 2024 survey results and found four key changes in responses:

Ocient survey findings

It acknowledges that AI, specifically machine learning (ML), has become an essential way to analyze vast data stores and is offering ML built directly into its hyperscale data warehouse with its OcientML offering.

Its machine learning support is, we’re told, extensive, interoperable, and built for data scientists. They can use SQL commands to work directly in the database and also integrate OcientML with third-party tools like Jupyter notebooks to train models and run predictions on datasets in the Ocient system.

Ocient Data Platform diagram

Ocient maintains a repository of common ML models that can be used as written or adapted to meet specific needs, the company says. It’s focused on algorithms commonly seen in structured and semi-structured data that run on features that have terabytes or more of data. It includes many regression and classification models, as well as feed-forward neural network models, principal component analysis models, and more.

Find a complete list of OcientML models and their definitions here.

No data movement out of Ocient’s repository is needed to run ML models, and users can use full resolution data to build their models, iterating rapidly without worrying about impacting other workloads, Ocient says.

The company adds that customers can generate powerful machine learning models with the simplicity of SQL, with in-database training and prediction built directly into the database engine. This, Ocient says, streamlines ML and saves developers time. They can create and train ML models within the database using a CREATE MLMODEL statement with a SELECT to specify independent and dependent variables.

Then they can query using the MLMODEL name as a function in a SELECT statement, specifying one or more columns as independent variables for the MLMODEL.

Models within the Ocient Hyperscale Data Warehouse are first-class database objects, created with data definition language (DDL) and accessible with SQL; training a simple linear regression model, for example, requires a single SQL statement.

Once the model has been created, you run a prediction by putting the model name in a SELECT statement and using it as you would any function:

SELECT my_model(col1) FROM my_table;

This can be done when creating new tables, when inserting into new tables, or when running a query.

Ocient does not support generative AI with large language models and a focus on ordinary users; its ML tools are instead targeted at data scientists. A search on Ocient’s website for “generative AI” returned “no results found.”

Ocient would like Apache Druid, Snowflake, AWS Redshift Aqua, and other data warehouse users to switch to its data platform because it claims it is better at scaling, faster, more energy-efficient, and offers better value for money.

SingleStore acquires BryteFlow to better digest ERP and CRM data

High-capacity database provider SingleStore has bought data integration platform BryteFlow to help it ingest key data workflows more easily from CRM and ERP platforms, including SAP, Oracle, Salesforce, and many other sources.

SingleStore will integrate BryteFlow’s capabilities into its core product to complement the existing functionality, creating what it terms a “SingleConnect” experience.

The acquisition follows a partnership with Snowflake and recent support for real-time processing on Apache Iceberg data for connecting data lakehouses.

With the incorporation of BryteFlow’s technology, SingleStore’s customers, “already used to sub-second query times, can now operationalize data from their CRM and ERP systems at scale,” said the provider.

“They can do this in real-time, enabling more insights for immediate action … supporting additional use cases for generative AI,” it claimed.

BryteFlow, headquartered in Sydney, Australia, counts Siemens, Synchrony, and Warner Bros. among its existing customers. It offers scalable change data capture (CDC) capability from many sources, we’re told, with the built-in ability to ensure data integrity between source and target.

Raj Verma, SingleStore

It also works with the major cloud platforms, including AWS, Azure, and Google, enabling “seamless integration” with cloud-based data warehouses and data lakes, using a no-code interface to make the platform accessible to users without deep technical expertise.

“This acquisition completes the next step in our pursuit to operate with ever-increasing speed, scale, and simplicity. Customer needs are changing [quickly] due to disruption in big data storage formats and quantum leaps in generative AI applications,” said Raj Verma, SingleStore CEO.

Pradnya Bhandary, BryteFlow CEO, added: “Together, through SingleConnect, we will make it faster and easier for developers to tap into an array of enterprise data sources, tackle diverse and challenging workloads, and deliver top-notch experiences for their own customers.”

Earlier this year, SingleStore added features to its Pro Max database to make AI app development faster.

DataPelago unveils software accelerator for GenAI and lakehouse analytics

Startup DataPelago has revealed its Universal Data Processing Engine (UDPE), software that accelerates compute for data analytics and GenAI models.

DataPelago – a portmanteau of “data” and “archipelago” – says it’s creating a new data processing standard for the accelerated computing era to overcome performance, cost, and scalability limitations of existing software architectures and x86 CPUs. The UDPE uses open source Gluten, Velox, and Substrait to turbocharge Spark and Trino, providing customers with “disruptive price/performance advantages.” It integrates into existing data stores and lakehouse platforms, SQL, Python, Airflow workflow automation, Tableau, Power BI, and more with no need for data migration and no lock-in.

DataPelago was founded in 2021 by CEO Rajan Goyal and chief product officer Anand Iyer. Goyal was CTO at DPU startup Fungible, acquired by Microsoft for around $190 million in 2022. His backstory includes hardware/software co-design at Cavium. Think of the UDPE as a quasi-software DPU.

Rajan Goyal, DataPelago

DataPelago has raised more than $75 million in VC funding through a 2021 seed round for $8 million, a 2022 $20 million A-round, and a $47 million venture round just this month. The latest round involved Eclipse, Taiwania Capital, Qualcomm Ventures, Alter Venture Partners, Nautilus Venture Partners, and Silicon Valley Bank, a division of First Citizens Bank.

Goyal stated: “Today, organizations are faced with an insurmountable barrier to unlocking breakthrough intelligence and innovation: processing an endless sea of data … By applying nonlinear thinking to overcome data processing’s current limits, we’ve built an engine capable of processing exponentially increasing volumes of complex data across varied formats.”

DataPelago’s position in the GenAI/analytics stack

He told an investor that he would proceed by “first building a software-based query plan that leverages inherent data knowledge to compute adjacency and storage to compute proximity. On top of this, he described his vision of creating networked query engines that could be used to achieve massively parallel query execution.” These engines would support CPUs, GPUs, FPGAs, and other accelerating hardware.

UDPE refactors data processing to exploit accelerated computing, leveraging higher degrees of parallelism and a tightly coupled memory model to deliver orders of magnitude higher performance.

It has three component layers:

  • DataVM – a virtual machine with a domain-specific Instruction Set Architecture (ISA) for data operators providing a common abstraction for execution on CPU, GPU, FPGA, and custom silicon.
  • DataOS – operating system layer mapping data operations to heterogeneous accelerated computing elements and managing them dynamically to optimize performance at scale.
  • DataApp – pluggable layer that enables integration with platforms including Spark and Trino to deliver acceleration capabilities to these engines.
Anand Iyer, DataPelago co-founder and CPO

The company claims its UDPE suits resource-intensive use cases such as analyzing billions of transactions while ensuring data freshness, supporting AI-driven models that detect threats at wire-line speeds across millions of consumer and datacenter endpoints, and providing a scalable platform to facilitate the rapid deployment of training, fine-tuning, and RAG inference pipelines.

UDPE is not a storage engine. A spokesperson told B&F: “A storage engine (like Speedb) is used to write data to and read data from storage drives and is written in low-level code. A storage engine cares about data placement on the storage rather than the semantics of query or data processing requests. DataPelago is a data processing engine for GenAI and analytics workloads. DataPelago sits higher up in the technology stack. It focuses on processing data processing queries/requests and leaves the placement of the actual data to the underlying storage layer, which would include technologies such as the Speedb engine.” 

It “introduces enhancements that automatically map operations to the most suitable computing hardware – be it CPU, GPU, FPGA, or others – and dynamically reconfigures these elements to maximize performance for the target hardware … [It] requires no custom hardware and works off standard accelerated compute instances available in the cloud from hyperscalers such as AWS, Azure, and GCP as well as the new GPU cloud providers such as CoreWeave, Crusoe, Lambda, etc. All of this happens seamlessly for users, requiring no changes to queries, code, applications, workflows, tools, or processes.”

Goyal writes about his experience founding DataPelago here.

MongoDB says it’s winning with targeted commercial AI projects

Database provider MongoDB has launched the latest version of its document-based system, claiming it can be at the center of moves toward commercial AI deployments.

Competing against traditional relational database heavyweights like Oracle and IBM, MongoDB sells both an on-premises version of its database and a cloud version, called Atlas, which is sold through AWS, Azure, and Google Cloud.

MongoDB 8.0 is now generally available and comes with various data management and throughput improvements.

Architectural optimizations in 8.0 have significantly reduced memory usage and query times, and it has more efficient batch processing capabilities than previous versions. Specifically, 8.0 promises 32 percent better throughput, 56 percent faster bulk writes, and 20 percent faster concurrent writes during data replication.

In addition, 8.0 can handle higher volumes of time series data, and can perform complex aggregations more than 200 percent faster, with lower resource usage and costs, according to MongoDB.

Horizontal scaling is also “faster and easier than ever.” It allows applications to scale beyond the limits of traditional databases by splitting data across multiple servers (sharding), without pre-provisioning additional compute resources for a single server. Sharding improvements in MongoDB 8.0 distribute data across shards “up to 50 times faster,” without the need for additional configuration or setup.
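
For readers unfamiliar with how sharding is switched on, the sketch below shows the general shape of it using pymongo against a sharded cluster. The connection string, database, collection, and shard key are hypothetical, and this illustrates standard MongoDB sharding commands rather than anything specific to the 8.0 improvements.

from pymongo import MongoClient

# Connect to the cluster's mongos query router (placeholder URI).
client = MongoClient("mongodb://mongos.example.com:27017")

# Allow the "appdb" database to hold sharded collections.
client.admin.command("enableSharding", "appdb")

# Split the "events" collection across shards on a hashed device_id key,
# so documents are distributed evenly across servers.
client.admin.command("shardCollection", "appdb.events", key={"device_id": "hashed"})

# Reads and writes are unchanged: the router directs them to the right shard.
client.appdb.events.insert_one({"device_id": "sensor-42", "temp_c": 21.5})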

As part of the global launch of the improved database, a reveal in London took place at one of the company’s regular local events for customers, partners, and developers, which Blocks & Files attended. At the moment, MongoDB is believed to only hold about 2 percent of the total global database market by sales, although most analysts put it in the top five providers when it comes to developer use.

We wanted to know how the company intended to scale up through wider cloud use and, of course, as a result of wider AI-driven workloads. The business was notably upbeat. In fact, it has established a whole unit around commercial AI deployments in industry verticals, and claims it’s winning.

Greg Maxson, MongoDB

Greg Maxson, senior director of AI GTM (go-to-market), said businesses were being tested by the rapid marketing cycle of AI, uncertainty about which AI language models to use and which technology stacks to implement, and a lack of skills and resources to do it.

“Two months ago we established our MongoDB AI Application Program (MAAP), and have listed the seven general stack technologies that customers must start with, when it comes to AI projects. It’s foundational stuff, but we are already winning key customers around it on the services side.”

Maxson said a “large automotive company in France” wanted to better use its mechanical data, including video clips and manuals, to find out what the faults were in its cars when they were returned to dealers. “We came up with 20 AI models they could test to find the right solution, and now one of them is being used with MongoDB at dealerships across France,” said Maxson.

He claimed another firm – a “global household appliance manufacturer” – came to MongoDB because it wanted to integrate digital sound analysis from its products, including vacuum cleaners, into its manufacturing and quality control systems using AI. The chosen MongoDB system is now being used.

“We’ve brought all three main cloud providers into this process, and global system integrator Accenture is also involved, and we are aiming to set industry vertical standards to make AI projects work.”

The AI hype cycle is huge when it comes to data processing, data management, and data storage. But to make AI work across these areas, there has to be focus to enable delivery. Establishing an AI GTM unit at product and service providers is likely a solid first step in guiding potential AI customers through confusion.

Bootnote

MongoDB was initially called 10gen, but wanted to indicate that the product it created in 2007, now called MongoDB, could scale to handle massive (“humongous”) amounts of data – hence the name change.

Storage news ticker – October 3

Adaptive announced a control plane that brokers access across all surfaces and safeguards sensitive information, protecting data closer to its source, minimizing sensitive data exposure. It uses an agentless architecture that doesn’t require any change to existing workflows and tools. By understanding an organization’s data context, Adaptive simplifies protecting data at scale, allowing organizations to manage privileges and mask sensitive information efficiently. It says it’s at the forefront of an integrated, new approach that ends longstanding silos between data security and IAM (Identity and Access Management).

Adaptive argues that traditional data protection tools focus on protecting data at rest. However, in modern organizations, a large number of identities consume data for reporting, analysis, internal tools, ETL pipelines, and more. Protecting data at rest alone is ineffective, and the only way to safeguard data is to manage and control access across channels. Data protection in modern organizations is actually an access problem. Since data protection tools and access management systems have operated in silos, they leave blind spots in an organization’s security and don’t solve for escalating insider threats and cyber attacks.

Asianometry has produced a video on “The Wobbly Future of the Hard Disk Drive Industry.” It’s 18.5 minutes on the situation of the disk drive industry, threatened by SSDs, and facing production issues with new tech, such as HAMR and MAMR. It’s a tad simplistic, but it could fill up a coffee break.

Video screenshot on hard drive storage

Ceramic-based data storage startup Cerabyte announced that its president, Steffen Hellmold, will deliver a main stage presentation titled “The New Storage Tier to Enable the Yottabyte Era” at Yotta 2024, October 7–9, in Las Vegas, Nevada.

Data lakehouse supplier Databricks announced a new regional hub in London for the EMEA region by opening a new office in the Fitzrovia neighborhood. It says the Databricks Data Intelligence Platform provides a unified foundation for all data and governance, combined with AI models tuned to an organization’s unique characteristics. Databricks has over 400 employees in the UK and its local business has experienced over 60 percent annualized growth over the past three years. Customers include Gousto, Heathrow, Rolls-Royce, SEGA Europe, Shell, and Virgin Atlantic. Databricks has also formed partnerships with more than 35 universities across the country.

HPC parallel file system and enterprise storage supplier DDN is partnering with YTexas, a business network and community organization supporting businesses relocating to or expanding within Texas. DDN says it’s expanding its AI research footprint in Texas. Through its partnership with YTexas, DDN hopes to contribute to the state’s burgeoning business ecosystem, driving technological breakthroughs, job creation, and robust economic growth.

Enterprise cloud collaboration and file services supplier Egnyte has added five more patents to its existing portfolio of 45 or so:

  • System and Method for Enhancing Content Collaboration by Conflict Detection and Resolution in a Hybrid Cloud Cache – This solution addresses challenges in hybrid cloud environments by managing file versions and preventing data loss due to conflicting uploads. 
  • Event-Based User State Synchronization in a Local Cloud of a Cloud Storage System – This patent enables synchronization of user definitions between remote and local cloud systems, allowing centralized control over user access and near real-time updates. It maintains a global user directory on a remote cloud server, which can be synchronized with local user directories on multiple local cloud servers.
  • System and Method for Sensitive Content Analysis Prioritization Based on File Metadata – This technology estimates the likelihood of sensitivity for file objects. 
  • Storage Agnostic Large Scale Permissions and Access Analytics – This solution simplifies and consolidates permission sets from multiple heterogeneous file storage systems.
  • System and Method for Serving Subject Access Requests – This patent details a method for efficiently handling SARs in compliance with data privacy regulations, like GDPR.

Information management and governance supplier EncompaaS announced a strategic partnership with Dutch records retention pioneer Filerskeepers. EncompaaS enables organizations to discover, understand, govern, and use their data to promote automated governance at scale. Filerskeepers identifies country-specific data retention obligations and assists its clients with implementing those obligations to their data, no matter where in the world they operate. 

A mass storage roadmap is available here at the IEEE Xplore website. It covers NAND, SSDs, HDDs, tape, optical disks, and DNA storage, about which it says: “DNA data storage has been demonstrated in the lab, but the costs of reading and writing data on synthetic DNA are currently too expensive for practical applications.”

Mass storage roadmap

NetApp has expanded its Google Cloud partnership to integrate its unified data storage and intelligent services into the Google Distributed Cloud architecture. Google Distributed Cloud extends customers’ cloud infrastructure and services to on-premises sites, and it now supports NetApp ONTAP and StorageGRID to underpin its own services, including databases, AI, and analytics.

Fabless semiconductor company Primemas, which has a system-on-chip (SoC) hub chiplet platform, is partnering with Samsung to collaborate on the development of advanced Compute Express Link (CXL) memory products. Primemas will conduct joint R&D with Samsung using its CXL 3.0-enabled SoC hub chiplet (hublet) and FPGA chiplet to provide optimized products for next-generation datacenters and AI systems.

Primemas graphic

Quantum announced that its object storage software has extended its Veeam Ready qualifications to include the new ActiveScale 7.0 software, which supports all-flash systems for fast ransomware recovery, and ActiveScale Cold Storage support for the Veeam Archive Tier for low-cost, air-gapped retention of Veeam backups. With Veeam’s new Direct to Archive feature, backups can be sent directly from Performance Tier storage to the Archive Tier.

Red Hat OpenStack 2024.2 Dalmatian is the latest release of the open source and modular cloud infrastructure software stack with its Nova (compute), Swift (object storage), Cinder (block storage), Neutron (networking), and Horizon (dashboard) components. Skyline and its modernized web UI are now fully supported as part of the official OpenStack release. Blazar introduced support for reserving compute (including GPU) instances based on existing Nova flavors. There are security updates as well. In Nova, with the libvirt driver and libvirt version 7.3.0 or newer, mediated devices for vGPUs are now persisted across reboots of a compute host. Download OpenStack Dalmatian here.

The OpenStack market is estimated to be worth $22.81 billion in 2024 and is expected to reach $91.44 billion by 2029, growing at a CAGR of 32 percent. OpenStack is currently experiencing a significant surge in adoption, attributed largely to the software’s popularity as a VMware alternative, and also to its AI workload support. 

Rowan Trollope, CEO of database supplier Redis, argues that AI is only as good as the quality and availability of data fed into it, and the exponential growth of GenAI technologies could take a quick downward turn if not supported by the right databases – like, say, Redis, which powers the most innovative GenAI tools available today, including those developed by OpenAI. Redis 8, the latest release, has 16x more throughput and 4x faster response time. Trollope reckons it’s important not to let hyperscalers – such as Google and AWS – monopolize the game. Redis recently changed its licensing as the previous setup was allowing Google and Amazon to take something Redis put out as open source and sell it to their customers.

High-availability supplier SIOS announced LifeKeeper for Linux version 9.9.0 with advanced DR features, including synchronous and asynchronous data mirroring, intelligent failover mechanisms, enhanced security management, expanded language support, and additional Linux operating system compatibility. More information can be found here.

Stratio BD’s Generative AI Data Fabric product helps businesses manage and use vast quantities of data and enables all users (data managers, data scientists, as well as business users) to complete complex data queries in their natural language using its Talk to Your Data feature. The Generative AI Data Fabric 4.4 release allows users to complete complex data transformation and ML processes within seconds using natural language prompts. Users can have entire conversations with Stratio GenAI in almost any language, providing the world’s fastest and most comprehensive natural language data management product. Stratio BD’s “Productivity, Performance, and Potential: Generative AI In Action” white paper can be read here.

A recent study by TRG Datacenters aimed to identify how much data volume grows over time by collecting apps in four main categories: communication, navigation, work, and social media. The download sizes of these apps were traced over time using the Wayback Machine. The study also gathered data on the most sold phones from 2014 to 2024, focusing on their photo, video, and non-expandable memory characteristics. By calculating the average non-expandable memory and the sizes of photos and videos, the study compared these sizes over the years to determine the file size change both in megabytes (MB) or gigabytes (GB) and as a percentage. No surprise: file sizes have grown a lot, with video games growing the most.

Data storage growth

Following AT&T’s claim that Broadcom proposed a VMware price increase of 1,050 percent, Dave Russell, SVP and head of strategy at Veeam, stated: “We have seen a 300 percent price increase in the VMware products we’re using. This is in line with what I’ve heard from customers, some of whom have reported even higher price increases … Most large enterprises have renewed their VMware products, giving them time to decide whether they still plan to use them going forward. Meanwhile, smaller customers are more sensitive to pricing changes … The vast majority of large enterprises put themselves in a position to push any concerns out. Some customers are looking at whether an alternative is ‘feasible.'”

William Blair analysts write: “We have confidence that the AI investment cycle has several years to run. Qualitatively, the executives at Microsoft/OpenAI, Alphabet, Oracle, Amazon, and Meta have all indicated they see a multi-year investment runway related to building out AI infrastructure, training next-gen models, and building new AI-native applications. Alphabet’s CEO put it most succinctly, indicating that the real risk in this AI platform shift is under-investment rather than over-investment.

“While training-based demand for infrastructure has been highly concentrated in a handful of the largest technology companies (40 percent of Nvidia revenue comes from four companies), inference demand promises to be more diffuse. The general rule of thumb for inference costs are that they are the square root of training costs. Importantly, while inferencing ASPs are lower, the potential market opportunity is much broader, applicable eventually to the entire global population as AI becomes integrated into applications across industries and geographies.”

Western Digital opens separate websites for HDD and flash businesses

Western Digital has advanced the previously announced split between its HDD and flash businesses by launching separate customer websites. The separation aims to improve operational focus and market agility in pursuit of improved sales and margins.

“We are now operating as two specialized websites: WesternDigital.com for HDDs and platforms, and SanDisk.com for flash technology, including SSDs, memory cards, USB flash drives, and more,” a canned statement from the storage titan read.

As announced in October last year, Western Digital plans to separate its HDD and flash businesses, creating two independent, public companies with market-specific, strategic focus. The separation will “better position each franchise to execute innovative technology and product development, capitalize on unique growth opportunities, extend respective market leadership positions, and operate more efficiently with distinct capital structures,” Western Digital explained. “The creation of two specialized websites is a necessary step in the company separation process.”

Although the websites are separate, there’s still some crossover between brands. On WesternDigital.com, you can shop for all HDD and platform products from the following brands: Western Digital, WD, WD_BLACK, and SanDisk Professional. On SanDisk.com, you can shop for all flash products, such as SSDs, memory cards, and USB flash drives from the Western Digital, WD, WD_BLACK, SanDisk, and SanDisk Professional brands.

For support, customers go through the relevant website, with G-Technology customers going through WesternDigital.com.

All warranty claims for HDDs and platform products from Western Digital, WD, WD_BLACK, SanDisk Professional, and G-Technology should be submitted through the Western Digital Support account. After signing in, select your registered product and “Request Warranty Replacement.” If you have not registered your product yet, select “Register a New Product.”

All warranty claims for flash products such as SSDs, memory cards, and USB flash drives from Western Digital, WD, WD_BLACK, SanDisk, and SanDisk Professional should be submitted through the SanDisk Support account.

The formal business split is expected to be completed in the “second half of 2024” and since last year the firm has established legal entities across around 20 countries. Once complete, both divisions will operate as publicly traded companies.

David Goeckeler, Western Digital’s CEO, will lead the SanDisk business, and Irving Tan, currently executive vice president of global operations, will become the CEO of WD.

Discover what it takes to build your AI-ready infrastructure strategy

SPONSORED POST: Organisations across the UK are rushing to find new ways of using artificial intelligence (AI) to streamline their operations and build new products and services for their customers and stakeholders.

A report published by the UK Office of National Statistics (ONS) last year suggests that the majority of UK organisations are yet to implement any form of AI within their business, though there has been a considerable expansion in awareness of the technology and willingness to explore its capabilities.

Certainly, the potential to extract intelligence and insight from the vast amount of data now at their disposal is huge. But identifying and implementing machine learning, deep learning, generative AI (GenAI) and other forms of the technology to fulfil that ambition still poses a significant challenge for many.

It’s not just about the tech stack – there are hurdles with data quality, supply chains, legacy systems, costs and operational complexity to navigate too. Building a hybrid IT infrastructure capable of handling AI is an important step. The larger the workload and the more data it consumes, the more likely it is that a data centre will be needed to host and process all of that information – an environment that offers the scalability and agility to quickly expand further in support of additional and ever larger AI workloads and datasets as the business requires.

Few organisations will have that knowledge or infrastructure capability in-house, so choosing a partner with the expertise to guide them through the implementation journey will be critical.

Digital Realty has put together a white paper specifically designed to provide guidance on the importance of having a robust enterprise infrastructure to support an organisation’s AI needs. Titled “AI for IT Leaders: Deploying a Future-Proof IT Infrastructure”, it offers advice on the strategic goals of enterprise AI adoption together with the common AI workloads, challenges and solutions needed to deploy AI-ready infrastructure. Digital Realty also provides a list of evaluation criteria which will help you choose the right partners to build an AI infrastructure stack within your organisation that will yield the best performance.

You can register to download a copy of the Digital Realty whitepaper by clicking this link.

Sponsored by Digital Realty.