
Storage news ticker – July 7

Storage news ticker
Library of Congress image. Rights Advisory: No known restrictions on publication.

GRAID has incrementally improved its NVMe RAID card software, adding data corruption detection and correction checks. v1.2.2 of its SupremeRAID storage product enhances the RAID 6 erasure coding (EC) N+2 data consistency check with the ability to correct customer data when corruption is detected in a RAID 6 configuration. The new v1.2.2 feature set is available on both the SupremeRAID SR-1000 for PCIe 3 and SupremeRAID SR-1010 for PCIe 4 servers, and will be available worldwide on July 11, 2022.

HPE announced that Catharina Hospital, the largest hospital in the Netherlands’ Eindhoven region, providing care for over 150,000 patients, has selected the HPE Ezmeral software portfolio to build a cloud-native analytics and data lakehouse to provide more accurate diagnosis, early detection, and injury prevention for patients. Thomas Mast from the cardiology department at Catharina Hospital said: “The new data lakehouse that we’ve created using HPE Ezmeral will enable us to help accelerate model training and detect cardiogram anomalies among the 500,000 electrocardiograms (ECGs) already available for data analysis with higher precision and to identify the correct diagnosis and treatment.”

MemVerge Memory Machine software has achieved Red Hat OpenShift Operator and container certifications and is now listed in the Red Hat Ecosystem Catalog. This means cloud builders can deploy Memory Machine software with the confidence that it has completed rigorous certification testing with Red Hat Enterprise Linux and OpenShift. Memory Machine software lowers the cost of memory infrastructure by transparently delivering a pool of software-defined DRAM and lower-cost persistent memory. It provides higher availability to cloud-native OpenShift applications by providing ZeroIO In-Memory snapshots, replication, and instant recovery for terabytes of data.

Micron has announced commercial and industrial channel partner availability of its DDR5 server DRAM, with modules up to 64GB, in support of industry qualification of next-generation Intel and AMD DDR5 server and workstation platforms. The move to DDR5 memory enables up to an 85 percent increase in system performance over DDR4 DRAM. DDR5-enabled servers are being evaluated and tested in datacenter environments and are expected to be adopted at an increasing rate throughout the remainder of 2022. The introductory data rate for DDR5 is 4800MT/s but is anticipated to increase to meet future datacenter workload demands.

Peer Software has a strategic alliance with Pulsar Security and will use its team of cyber security experts to continuously monitor and analyze emerging and evolving ransomware and malware attack patterns on unstructured data. The PeerGFS distributed file system will use these attack patterns to enhance its Malicious Event Detection (MED) feature and enable an additional layer of cyber security detection and response. Pulsar Security will work with Peer Software to educate and inform enterprise customers on emerging trends in cyber security, and how to harden their systems against attacks through additional services like penetration testing, vulnerability assessments, dark web assessments, phishing simulations, red teaming, and wireless intrusion prevention.

NAND and DRAM chip maker Samsung has reported strong second quarter revenues and profits but smartphone memory demand was down, echoing part of Micron’s recent experience, which was also affected by a downturn in PC storage and memory demand. Reuters reported Samsung revenues rose 21 percent to ₩77 trillion, in line with estimates. It posted an operating profit of ₩14 trillion ($10.7 billion), up 11 percent from ₩12.57 trillion a year earlier. According to Counterpoint Research, estimated smartphone shipments by Samsung’s mobile business in the second quarter were about 62-64 million, about 5-8 percent lower than a March estimate.

Replicator WANdisco is constantly trumpeting its sales successes. Here’s another: a top five Canadian bank has signed a three-year, $1.1 million contract to migrate its Hadoop workloads from an on-premises Cloudera cluster to the Google Cloud Platform (GCP). The initial migration will be of 5 petabytes of data, with room for future expansion. This is WANdisco’s first Google Cloud win in North America and represents another contract win in the global financial services market.

StorCentric has filed for Chapter 11 bankruptcy and is looking for a sale

Mini-storage conglomerate StorCentric has filed for Chapter 11 bankruptcy protection and is looking for a buyer. StorCentric-owned Drobo has also made its own Chapter 11 filing.

According to bankruptcy news outlet Petition, StorCentric and six affiliated debtors, including Drobo, made their filing on June 20. They cited a devastating business impact from the COVID-19 pandemic.

The filing was triggered by the company making its June 10 payroll late and having insufficient cash to meet a June 22 payroll date. A petition.substack note says: “The company is facing the imminent June 30, 2022, maturity of its $25m 5.875% fixed senior notes Series 2020-3 (with UMB Bank NA as Indenture Trustee) and with all possible refinancing and funding avenues foreclosed to the company under current market conditions, the debtors were unable to obtain new sources of investment prior to filing.” It had zero cash in hand.

Because of this, StorCentric sought pre-bankruptcy petition financing of $5 million from Serene Investment Management LLC.

StorCentric’s filing says it has been operating at a loss for some time. It has between $10 million and $50 million in estimated assets and estimated liabilities in the same range. Its IP is worth, it says, between $50 million and $70 million.

We understand Drobo made its Chapter 11 filing on June 20 and a hearing will be held in the US Bankruptcy Court, California Northern District, San Jose, on July 19. All non-governmental creditors must file their claims with the Bankruptcy Court by October 17. The StorCentric-owned business hopes to restructure and continue in business.

How did we get here? 

Data Robotics was founded in 2007 by Geoff Barrall and developed a consumer/prosumer file storage product with hot-swap drives called Drobo. The firm changed its name to Drobo and was bought by/merged with Connected Data in 2013. Connected Data, founded in 2012, also by Geoff Barrall, produced the private cloud file sync’n’share Transporter device.

Drobo B12i device.

Drobo was bought by an investment group in May 2015, with ex-Brocade exec Mihir Shah becoming its CEO. 

Imation bought Connected Data for $7.5 million in October 2015. Imation had previously bought storage array supplier Nexsan in 2013.

Imation then imploded, with Nexsan and Connected Data bought in a non-cash deal by private equity house Spear Point Capital Management in Jan/February 2017.

Mihir Shah.

Private equity backed StorCentric was started up in 2018 as a new home for the Nexsan array and Drobo prosumer storage businesses, having bought them from their then owners. Its CEO was Mihir Shah and he ran Nexsan and Drobo as separate divisions. The Connected Data Transporter technology became dormant.

StorCentric bought struggling NVMe all-flash array startup Vexata in July 2019. It then bought the Retrospect backup business in late 2019, and the failed and somewhat recovering all-flash array vendor Violin Systems in October 2020.

At this point it had the Retrospect backup business, three storage array businesses – Nexsan, Vexata and Violin – and the Drobo prosumer external storage business.

Since then we have had the COVID pandemic, remote working, supply chain problems, and the Ukraine war. In February 2020 Drobo tweeted about supply chain issues.

Shah himself blogged about supply chain issues in March 2020, saying “We are in close contact with all of our suppliers and we are trying to mitigate the impact of any delay in the supply chain.”

A July 1, 2022, Reddit post said that Drobo was not closing down.

A consensus of posts on the Appleinsider forum is that StorCentric has failed to add new Drobo hardware for two years, and that its software is not up to date with current Apple Mac hosts.

We have reached out to Mihir Shah for a comment and the company issued a statement: “StorCentric has commenced a voluntary Chapter 11 reorganization in order to strengthen and position the company for future success. It will remain completely operational during this process and will continue to service its customers and make the necessary investments to achieve its strategic objectives.”

CXL memory pools: Just how big can they be?

CXL 2.0 will enable memory pooling, which sounds great, if a bit vague. How big can the memory pools be? There is no firm answer yet, but we can take a stab at it. How does a petabyte sound? Or more?

The potential memory pool capacity depends upon the servers providing it. And that depends upon the server CPUs and their PCIe lane support, as well as DRAM DIMM module capacity. We based our calculations on Intel’s Sapphire Rapids processors, helped by MemVerge’s Beau Beauchamp and by Jim Handy of Objective Analysis.

CXL 2.0 maintains cache coherency between a server CPU host and three device types. Type 1 devices are I/O accelerators with caches, such as smartNICs, and they use the CXL.io and CXL.cache protocols to communicate with the host processor’s DDR memory. Type 2 devices are accelerators fitted with their own DDR or HBM (High Bandwidth Memory) and they use the CXL.io, CXL.cache, and CXL.memory protocols to share host memory with the accelerator and accelerator memory with the host. Type 3 devices are memory expander buffers or pools and use the CXL.io and CXL.memory protocols to communicate with hosts.

CXL diagram from The Next Platform

A Type 3 device will link to a server host across a PCIe 5.0 link initially. Later PCIe generations – 6 and 7, for example – will be used by subsequent CXL versions. Intel’s coming gen-4 Xeon SP Sapphire Rapids CPU will support CXL 1.1 and then CXL 2.0.

It will have up to 56 cores, eight DDR5 DIMM sockets, up to 128 PCIe 5 lanes, and support for 64GB of HBM2e (each of its four component tiles supporting 16GB) as well as Optane PMem 300 devices.

Beauchamp’s view

Beauchamp told us: “Sapphire Rapids supports 8 DIMM sockets, so 4TB can be configured using 512GB DIMMs (which will exist), but the economics of doing so will make it a rare case.”

CXL memory pooling graphic

A CXL 2.0 link can use up to 16 PCIe lanes. Beauchamp added detail: “CXL supports from 1 to 16 lanes per link in powers of 2. Each PCIe 5 lane provides 4GB/sec of bandwidth, so 128GB/sec for a x16 link. A DDR5 channel has ~38GB/sec bandwidth, hence a x4 CXL link (32GB/sec) is a more comparable choice if direct-attaching CXL memory modules. The industry is generally centering on x4 links for CXL memory cards.”
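
To make the arithmetic concrete, here is a rough sketch of those link bandwidth sums. The per-lane and per-channel figures are the approximations quoted above, not exact specifications:

```python
# Rough CXL link bandwidth arithmetic using the figures quoted above:
# ~4GB/sec per PCIe 5 lane and ~38GB/sec per DDR5 channel.
PCIE5_LANE_GBPS = 4      # approximate GB/sec per PCIe 5 lane
DDR5_CHANNEL_GBPS = 38   # approximate GB/sec per DDR5 channel

for lanes in (1, 2, 4, 8, 16):   # CXL allows 1 to 16 lanes per link, in powers of 2
    link_bw = lanes * PCIE5_LANE_GBPS
    print(f"x{lanes:<2} link: {link_bw:>3}GB/sec (~{link_bw / DDR5_CHANNEL_GBPS:.2f} DDR5 channels)")

# A x4 link gives 32GB/sec, roughly one DDR5 channel; a x16 link gives 128GB/sec.
```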

He then worked out how many more DDR channels this meant. “Multiple CXL links are supported. I’m sure Sapphire Rapids has many PCIe lanes (close to 100), and I assume a third to a half of those can be CXL, so let’s say there are three x16 groups (which like PCIe can each be bifurcated into four x4 links) so that is equivalent to about 12 more DDR channels. I don’t know how many slots chassis designs will include.”

Samsung 512GB CXL memory module

Using the maximum DIMM size of 512GB, “12 more DDR channels” would mean a theoretical 6TB of CXL DRAM capacity per Sapphire Rapids CPU, which could be added to its local, direct-attached 4TB max of DRAM to make 10TB in total.

Beauchamp said “CXL 2.0 supports switching, so using a x16 link to a switch-fanout multi-slot (>10 slots) memory box is conceivable. Multiple servers sharing (partitioning) 50–60 slots is also conceivable.”

Building on this

Now let’s add in MemVerge’s Memory Machine technology. It is system software that clusters many servers’ memory into a single pool. In April 2019, we learned it could support a cluster of 128 server appliances, each with 512GB of DRAM and 6TB of Optane.

That means we can, in theory, take a single Sapphire Rapids server’s maximum 6TB of CXL-attached DRAM and cluster 128 of them together to produce a 768TB pool of memory, along with the 128 x 4TB = 512TB of clustered local memory in the servers – 1.28PB in total.
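
For the record, here is that capacity arithmetic in one place. Every input is an assumption taken from the discussion above (512GB DIMMs, roughly 12 DDR-equivalent CXL channels, a 128-server cluster), so the output is purely theoretical:

```python
# Theoretical CXL memory pool sizing, using the article's round-number assumptions.
LOCAL_DRAM_TB_PER_SERVER = 4      # 8 DIMM slots x 512GB DIMMs
CXL_DRAM_TB_PER_SERVER = 6        # ~12 DDR-equivalent CXL channels x 512GB
SERVERS = 128                     # MemVerge Memory Machine cluster size

cxl_pool_tb = SERVERS * CXL_DRAM_TB_PER_SERVER        # 768TB of pooled CXL DRAM
local_pool_tb = SERVERS * LOCAL_DRAM_TB_PER_SERVER    # 512TB of clustered local DRAM
total_pb = (cxl_pool_tb + local_pool_tb) / 1000       # 1.28PB in total
print(f"{cxl_pool_tb}TB CXL + {local_pool_tb}TB local = {total_pb:.2f}PB")
```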

Beauchamp said there are a few technical issues that may impede this pooling vision in the near term. First: “Previous Xeon processors have a max memory limit they can ‘own’ per socket (ie, be the ‘home’ coherence agent). Cascade Lake had multiple SKUs for this (1, 2, and 4TB).  Ice Lake had only one SKU, 4TB. I don’t know what Sapphire Rapids will do in this regard, but I expect it to increase, maybe double to 8TB? That limit may clip your 10TB per socket result.”

Second: “CXL 2.0 switch configuration is limited, so fully flexible sharing between 128 servers may not be practical. (CXL 3.0 is targeting more capable and simplified switching).”

Handy’s view

In answer to the question “how much CXL memory might there be?” Handy said “I think that the answer will come from a different direction.”

“The system’s memory space will be limited by a number of factors – not simply the ones you mention. Near Memory (the stuff that actually touches the CPU) is limited by capacitive loading and how many chips a CPU pin can drive, or even how many chips a DRAM pin can drive. The memory density of a particular chip is important here, and that will increase over time. 

“For Far Memory (memory that communicates with the CPU through a controller, ie CXL or OpenCAPI) the number of memory chips that can be attached is almost limitless, so other factors determine how much memory can be attached. (To my understanding, an overabundance of memory addresses is available in CXL, although I have been having some trouble finding an exact answer. I did some looking into the 628-page CXL 2.0 spec, and made a couple of calls, but don’t yet have a solid answer.)

“What ‘other factors’ would determine the maximum memory? The two that I can think of are the physical address length supported by the CPU and the cache controller. These are usually the same. CXL uses asymmetric coherency, which means that a single CPU manages the coherency of the whole system. Any limit on the CPU determines the maximum memory size that can be managed. If Sapphire Rapids is like its predecessors, it will be offered in a variety of flavors with an assortment of addressing capabilities.

“… so you don’t get a concrete answer!”

Handy then made a number of detailed points:

  1. The maximum DDR DRAM that can be attached to a Sapphire Rapids chip is determined by the CPU chip’s address pins, not by the DRAM chips it uses. Samsung capitalizes off of that by making mega-DIMMs (128, 256, & 512GB) with stacked DRAM chips, using TSV technology developed for HBM, and charging high prices for them, so your calculation (1.28PB) would necessarily be off. As denser DRAMs come along, that would throw it off as well. Samsung may choose to double the number of DRAM chips in its stacks, too, so that would also throw things off.
  2. HBM support would also be determined by addressing. I would expect for Intel to set the number of address bits a little high to allow future designers to use whatever DRAM makers can ship them five years from now.
  3. Each CXL channel is a single lane. That lane can either speak PCIe or CXL. This is determined at boot-up. I believe that the number of PCIe lanes on a CPU chip can go from zero to infinity, with 16 being a practical number today.
  4. The amount of DRAM on a single CXL lane should be many orders of magnitude larger than your 512GB estimate. This means that the amount of memory is limited by the CPU’s address range.

“This isn’t a tidy answer, but CXL has been designed to lead a long life, and that involves providing an over-abundant (for now, at least) address range.”

The net effect of this is that our 1.28PB calculated Near Memory plus Far Memory total could be too small – but no one knows. It depends upon CPU DRAM capacity, CPU PCIe lanes, CXL switch capacity and CXL memory module capacity – that’s four areas of uncertainty, and there may be more.

Eideticom CTO Stephen Bates suggested: “You might have issues powering and cooling all that memory, but that’s just a physics problem ;-).”

Hyperscaler CSPs and massive memory

Our calculated 1.28PB memory amount is a theoretical number and, as Beauchamp has intimated, the economics of this will make it unlikely. But imagine a hyperscale public cloud supplier doing this – providing a massive pool of memory to a set of Sapphire Rapids servers, and so enabling, through composability software, the provisioning of 100TB to 1,000TB memory instances for hyperscale in-memory database/data warehouse applications that would run at currently unimaginable speeds. A pool of Optane storage-class memory could be added in for persistent storage – checkpointing, for example.

Customers could use this facility for burst processing in the AI, machine learning, and analytics application space.

We’re dealing with possibilities here but, with Sapphire Rapids and CXL 2 building blocks, they are tantalizingly close. They represent a potentially huge increase in the deployment of in-memory data sets and the execution speed of applications using them.

Boot note

The Xeon Ice Lake processor has a 46-bit physical address space: 64TB.

Storage news ticker – July 6


Clever Cloud, which is currently setting up in a third Parisian data center, has chosen to partner with French company Kalray and its Flashbox NVMe storage array that can accommodate up to 24 PCIe (NVMe) SSDs. The Flashbox has no host CPU. Instead it has K200-LP accelerator cards featuring the Coolidge 80-core DPU (Data Processing Unit) that takes advantage of the MPPA (Massively Parallel Processor Array) architecture developed by Kalray. It can deliver up to 12 million IOPS through a 2x 100Gb/s network interface and is based on market standards such as the Storage Performance Development Kit (SPDK), NVMe/TCP, and RDMA over Converged Ethernet (RoCEv1/v2). It will eventually be offered by storage vendors using Kalray cards. The Flashbox will be showcased at the upcoming Flash Memory Summit from August 2 to 4 at the Santa Clara Convention Center, CA (booth #940).

Data protector Commvault has appointed Cesar Cid de Rivera as International VP of Sales Engineering. He previously worked at Commvault in the 2013-2017 period and will lead the pre-sales and technical team responsible for the delivery of Commvault’s Intelligent Data Management services to customers and partners. Cid de Rivera worked most recently at ServiceNow, Pegasystems, Symantec, and Veritas.

Kingston Digital has announced a portable XTS-AES encrypted USB drive with automatic – but optional – USBtoCloud backup. It provides consumer-grade security with AES hardware encryption in XTS mode, safeguards against BadUSB attacks with digitally signed firmware, and protects against brute force password attacks.

The drive offers multi-password Admin and User options with Complex or Passphrase modes. To reduce failed login attempts during password entry, the “eye” symbol can be enabled to reveal what has been typed. Should a User password be forgotten, the Admin password can be used to access the data on the drive or reset the User password. The drive’s USBtoCloud backup supports Google Drive, Microsoft OneDrive, Amazon Cloud Drive, Dropbox, or Box for a user’s personal cloud storage. It is available in capacities from 16GB to 128GB and is backed by a limited five-year warranty and free technical support.

Microsoft has announced a preview of Azure Virtual Machines Backup, which can take multiple snapshots a day. It supports a recovery point objective of four hours and offers instant restore retention, with a default of seven days and a maximum of 30 days.

Data protector N-able says that its Cove Data Protection for Microsoft 365, built for MSPs, has officially surpassed one million protected Microsoft 365 users. More than 4,600 N-able partners are utilizing the solution to help prevent data loss across more than 36,000 customer domains and over 1.2 million Exchange mailboxes.

N-able has bought Spinpanel, a multi-tenant Microsoft 365 management and automation technology built for Microsoft Cloud Solution Providers (CSPs). A significant majority of N-able partners manage and/or resell Microsoft Cloud services for their small and medium enterprise customers. The acquisition will help N-able partners alleviate two challenges – labor scarcity and moving to the cloud. It will also provide Spinpanel customers access to a wider range of IT management and security services.

PNY EliteX-PRO

PNY has announced its EliteX-Pro USB 3.2 Gen 2×2 portable SSD in an aluminum chassis. It flashes along with transfer speeds of up to 1,600MB/sec read and 1,500MB/sec write, and comes in 100GB, 1TB, 2TB, and 4TB capacities. Acronis True Image data protection is included, along with USB-C to USB-C and USB-C to USB-A cables. The PNY EliteX-PRO was available for pre-ordering from July 4 on Amazon.

ScaleOut Software has introduced its ScaleOut In-Memory Database with support for Redis clients as a full production release. Redis clients can now connect to a cluster of ScaleOut servers to run Redis commands and dramatically improve their ease of use, scaling, and automated data recovery. They can also simultaneously back up and restore all cluster nodes with a single command. ScaleOut In-Memory Database incorporates open-source Redis version 6.2.5 code to process Redis commands, and it offers the flexibility to run on either Linux or Windows servers. This full production release supports all popular Redis data structures (strings, sets, sorted sets, lists, hashes, geospatial, hyperloglog, and streams), as well as transactions, publish/subscribe commands, and blocking commands. Not supported in this release are Lua scripting, modules, and AOF persistence.

Transcend has launched its ESD380C USB 3.2 Gen 2×2 portable SSD with speeds of up to 2,000MB/sec, up to 2TB of capacity, and an anti-shock, compact aluminum design. It is covered by a silicone rubber case to make it more rugged and shock-absorbing, meeting US military drop-test standards. It comes with USB Type-C and Type-C to Type-A cables. The drive includes free Transcend Elite data management software, providing functions including backup, restore, encryption, decryption, and cloud backup. It is compatible with Windows, macOS, and Android. Visit the Transcend website for more details.

Huawei: Comparing OceanProtect to backup rivals

Chinese supplier Huawei has an all-flash deduplicating backup target line. How does this OceanProtect range of products compare to the IBM, Pure Storage, and VAST Data all-flash backup target systems?

The OceanProtect range won the Special Prize for Best of Show Award in the Server & Storage category at Interop Tokyo 2022. It has entry-level, mid-range, and high-end products: the X6000, X8000, and X9000. Each comes in an all-flash or disk drive version and the three types differ in capacity and performance as this table illustrates: 

Huawei OceanProtect product range. Numbers taken from Huawei OceanProtect datasheet

We can comment straight away that the disk drive capacities look to be lowish – the three disk drive vendors have been shipping 16 and 18TB drives for some months and 20TB drives are available. The same could be said about the SSDs as 15.6TB drives are available.

The performance numbers are helped by Huawei fitting the product with the CPU-offloading Direct TCP/IP Offload Engine (DTOE) intelligent network interface card. Huawei suggests this doubles array bandwidth compared to ordinary NICs. Huawei says the system puts out a maximum of 480,000 IOPS but doesn’t break that number down by individual product.

Huawei OceanProtect products

The usable capacity is boosted by a multi-layer, inline, variable-length deduplication engine, compression, and byte-level compaction, with Huawei claiming a data reduction ratio of up to 72:1.

The two multi-core controller nodes are in an active:active relationship, with failover taking seconds. System reliability of six nines (99.9999 percent) is also helped by the optional RAID-TP scheme, which can withstand three concurrent drive failures.

Pure’s latest FlashBlade//S has from seven to ten storage blades, each with one to four flash drives (DFMs, or Direct Flash Modules). Each DFM has either 24 or 48TB of raw QLC flash capacity, meaning a maximum of 192TB per blade and 1.92PB per system (with ten blades). Pure says this grows to an effective 4PB given 2:1 compression. That exceeds the OceanProtect X9000’s 3.6PB usable capacity.
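
As a sanity check on those numbers, here is the FlashBlade//S capacity arithmetic using only the figures above. Treating Pure's 4PB effective figure as a rounding of 3.84PB is our interpretation:

```python
# FlashBlade//S capacity arithmetic, using the figures quoted above.
DFM_TB = 48              # largest Direct Flash Module (a 24TB DFM is also available)
DFMS_PER_BLADE = 4
BLADES_PER_SYSTEM = 10
DATA_REDUCTION = 2       # Pure's assumed 2:1 compression

blade_tb = DFM_TB * DFMS_PER_BLADE                  # 192TB raw per blade
system_pb = blade_tb * BLADES_PER_SYSTEM / 1000     # 1.92PB raw per ten-blade system
effective_pb = system_pb * DATA_REDUCTION           # 3.84PB, close to Pure's quoted 4PB
print(f"{blade_tb}TB/blade, {system_pb}PB raw/system, ~{effective_pb}PB effective")
```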

The prior FlashBlade, which uses TLC flash, can have from seven to 15 blades with 8TB or 52TB capacity in its chassis – a max of 780TB raw, and 1.6PB effective at 2:1 compression. 

There can be up to ten FlashBlade//S systems in a cluster with a total capacity of ~20-30PB depending on data compression.

IBM’s FlashArray can be a Veeam backup target. The large 730 has up to 45.5PB of effective capacity, assuming 3:1 compression and a 162TB/hour ingest and restore rate. It has 116TB effective capacity after deduplication and compression. Both numbers surpass the OceanProtect X9000.

VAST Data says its all-QLC flash Universal Storage system can be a backup target and had a maximum capacity of 3.3PB in November 2021. Its latest Ceres chassis, which uses Nvidia BlueField smartNICs,  holds up to 1.35PB of raw capacity – 4.05PB at 3:1 data reduction – and 144TB/hour bandwidth; slower than Huawei’s X9000. However, this is a scale-out system so customers just buy more chassis to meet capacity and bandwidth needs.

Bottom line

A full comparison would factor in price against performance for these systems, but that’s where things get complex. At this end of the market, each customer’s requirements are likely to be unique and it’s hard to draw like-for-like comparisons against a pricing sheet. But on a sheer performance and/or capacity basis the IBM, Pure, and VAST systems can exceed the speed and volume of the Huawei OceanProtect systems.

Databricks updates Lakehouse Platform

Databricks has updated its Lakehouse Platform, promising more cost-efficient performance, better pipeline building, virtual cleanrooms, and a marketplace for data and analytics assets.

Databricks has been funded to the tune of $3.6 billion with its last raise taking $1.6 billion in September 2021. The now-public Snowflake only raised $1.4 billion in total. Databricks’s analytics software is said to work on raw data, stored in a data lake, without it having to be extracted, transformed and loaded (ETL) into a separate data warehouse – that’s the Snowflake way. 

Ali Ghodsi, co-founder and CEO of Databricks, issued a statement: “Our customers want to be able to do business intelligence, AI, and machine learning on one platform, where their data already resides … Today’s announcements are a significant step forward in advancing our Lakehouse vision, as we are making it faster and easier than ever to maximize the value of data, both within and across companies.”

The performance improvements consist of:

  • Databricks SQL Serverless provides instant, secure, and fully-managed elastic compute for improved performance at a lower cost.
  • The Photon native vectorized query engine on Azure Databricks, written to be directly compatible with Apache Spark APIs so it works with existing code.
  • Open source connectors for Go, Node.js, and Python to help operational applications access the Lakehouse.
  • Databricks SQL CLI enables queries to be run directly from a local computer.
  • Databricks SQL query federation adds the ability to query remote data sources, including PostgreSQL, MySQL, AWS Redshift, and others, without the need to ETL the data from the source systems.

Databricks does use ETL to process streaming and batch workloads for analytics, data science, and ML workloads. This is done to turn SQL queries into production ETL pipelines and uses Delta Live Tables as the ETL framework. A Databricks blog claims this makes it possible to declaratively express entire data flows in SQL and Python. Delta Live Tables has been given a new performance optimization layer to speed up execution and reduce ETL costs.
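
As an illustration of what declaratively expressing a data flow in Python looks like, here is a minimal sketch of a Delta Live Tables pipeline. The table names, columns, and source path are hypothetical, and the sketch is only meant to show the declarative pattern rather than a production pipeline:

```python
# A minimal sketch of a declarative Delta Live Tables pipeline in Python.
# Table names, the source path, and column names are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders loaded from cloud object storage")
def raw_orders():
    # 'spark' is provided by the Databricks runtime when the pipeline runs
    return spark.read.format("json").load("/mnt/landing/orders/")  # hypothetical path

@dlt.table(comment="Cleaned orders ready for analytics")
@dlt.expect_or_drop("valid_amount", "amount > 0")   # declarative data-quality rule
def clean_orders():
    return dlt.read("raw_orders").select("order_id", "customer_id", col("amount").cast("double"))
```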

A new Enhanced Autoscaling capability scales resources with the fluctuations of streaming workloads. Change Data Capture (CDC) for Slowly Changing Dimensions – Type 2 (see bootnote below) tracks every change in source data for both compliance and machine learning experimentation purposes.

MLflow Pipelines is a machine learning pipeline builder. It uses MLflow software and v2.0 of this enables users to define the elements of a pipeline in a configuration file, with MLflow Pipelines managing execution automatically. Databricks has added Serverless Model Endpoints to directly support production model hosting, and built-in Model Monitoring dashboards to help the analysis of real-world model performance.

Databricks’s Cleanrooms provides a way to share and join data across organizations in a secure, hosted environment. Customers can collaborate with their clients and partners on any cloud, running computations and workloads using both SQL and data science tools, including Python, R, and Scala, with data privacy controls.

A Databricks Marketplace provides an open environment within which to package and distribute data and analytics assets. Databricks says this will enable data providers to securely package and monetize a host of assets such as data tables, files, machine learning models, notebooks, and analytics dashboards. Data consumers will be able to subscribe to pre-existing dashboards that provide desired analytics for a dataset.

Competitor Snowflake already has cleanrooms and a marketplace, and supports machine learning pipelines.

Databricks SQL Serverless is now available in preview on AWS. The Photon query engine is in public preview and will be generally available on Databricks Workspaces in the coming weeks. Databricks Cleanrooms will be available in the coming months as will the Databricks Marketplace.

Bootnote

A Slowly Changing Dimension (SCD) is a construct that stores and manages current and historical data over time in a data warehouse. A Type 1 SCD has new data overwriting the existing data. A Type 2 SCD writes new records for new data, thus retaining the full history of values. A Type 3 SCD only stores the previous and current values for an attribute. When a new value comes in, it becomes the current value and the old current value moves into the previous-value column, overwriting whatever was there.
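
For illustration, here is a toy sketch of how the same attribute change might be recorded under each SCD type; the customer and address values are made up:

```python
# Toy illustration of SCD handling for a customer's address attribute.

# Type 1: overwrite in place - history is lost.
type1 = {"customer_id": 42, "address": "New Street"}

# Type 2: a new row per change, with validity markers - full history kept.
type2 = [
    {"customer_id": 42, "address": "Old Street",
     "valid_from": "2020-01-01", "valid_to": "2022-06-30", "is_current": False},
    {"customer_id": 42, "address": "New Street",
     "valid_from": "2022-07-01", "valid_to": None, "is_current": True},
]

# Type 3: one row with previous and current columns - only one step of history.
type3 = {"customer_id": 42, "current_address": "New Street", "previous_address": "Old Street"}
```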

HPE: Getting ahead with corporate TV

NetApp was the first company we cover to start its own in-house TV channel to market its products and strategy. Now HPE has gone one better, in a creative sense, and is producing its own FinTech TV sitcom.

NetApp TV started in 2021 as a streaming TV service in its own right. The company chose to do this rather than use YouTube as a distribution medium, which was the Intel route to getting potential customers to watch corporate content.

Intel’s YouTube presence is subdivided into seven channels and each has content uploaded to it. Here is the Newsroom channel home screen:

Intel Newsroom YouTube channel home screen

Check out the subscriber numbers; that’s one measure of TV channel success. A lot of the videos are short, talking head stuff, one to two minutes long and a bit cheesy. An example: Happy Customer Appreciation Day has CEO Pat Gelsinger thanking customers for their loyalty. 

NetApp TV

NetApp is altogether more professional about its TV output. It has its own video website with four channels: Insight, Cloud, Hybrid Cloud, and Data ONTAP.


The programs (videos) are typically two to five minutes long and may form a series, like the five-episode Data Protection Dojo. Think of them as mini-tutorials with a talking head, a NetApp person, introducing a topic and using a few PowerPoint slides as reference points: 

NetApp Data Protection Dojo video screengrab

They don’t use professional presenters, instead focusing on product managers and product line execs talking about their portfolio areas.

HPE fintech sitcom

But HPE, with Nvidia, has introduced its very own sitcom. “Get Ahead” is a streaming TV production that follows the lives of three business people involved in applying AI technology to a financial services business: a customer insights manager, a data scientist, and an IT manager.

HPE Get Ahead

They are played by professional actors and each of the four episodes in the series is professionally produced, directed, and scripted. Each episode is five to nine minutes long and each is linked to downloadable HPE documentation, such as “An IT Manager’s Guide to Getting Ahead” – PDF brochures to help viewers decide to follow up by contacting HPE.

This strikes us as a highly inventive way to use TV content to get a supplier’s message across to an audience. No talking heads; rather fictionalized real people in real roles meeting real problems in a humorous setting. It was pioneered, we think, by Microsoft in 1995 with a Windows 95 video presented as “the world’s first cyber sitcom,” starring Matthew Perry and Jennifer Aniston.

Why did HPE do this?

HPE told us: “The ambition was to break conformity from traditional AI marketing and challenge the market. The team aimed to convey familiar characteristics and traits of the AI team dynamic highlighting the realities of AI implementation.”

 How did the concept originate?

“The idea for the sitcom was based on a qualitative research into the challenges of AI projects within financial services organisations our team conducted. The goal was to create content that conveys the realities of AI implementation and the role-contextual challenges AI teams face.”

Why have three main characters?

“A common theme apparent in all of the interviews was the ‘trifecta’ of roles involved. Participants stated that product managers/line of business, data scientists and IT (IT/data engineers) play a key role in selecting technology and delivering AI projects. Research suggested that these roles often have differing priorities causing friction that can often delay or cause projects to fail. This friction is thought to be a result of different professional characteristics, working methodologies and knowledge of AI in the wider-business context.”

Pure’s video content

Pure Storage, in partnership with Cisco, has produced a cartoon about a character called Francois, who works with the FlashStack converged infrastructure product and overcomes legacy infrastructure.


It’s 1:20, available on YouTube, and best seen once only. A second episode covers Professional Services. There is a separate “FlashBlade//S Unboxing” video with (apparently) real people installing a FlashBlade//S system. The Pure Storage YouTube channel has 7,780 subscribers. It’s a start but there is quite a way to go to catch up with the Intel YouTube channel’s 534,000 subscribers.

Moving on from YouTube

Having a supplier-focused YouTube channel is not at all unusual. The Dell EMC channel has 47,900 subscribers. There is an HPE Technology channel with 53,600 subscribers. Infinidat also has a YouTube channel, with 585 subscribers. But these YouTube videos are standard fare.

The Wasabi Cloud YouTube channel has “Migrate with Nate” videos with fictional business characters singing their lines about issues driving them to use the Wasabi cloud. This use of “singing reads” is different from the standard talking head YouTube supplier output, but not that different.

We have come a long way from those standard videos, and corporate TV can be seen as another step on the road to suppliers controlling their own media content. Back in 2006 ex-FT journalist Tom Foremski wrote a famous blog entitled “Die! Press release! Die! Die! Die!” in which he said traditional press-release-based journalism was inefficient. He followed it up with a 2009 article saying “Every company has to become a media company,” using Cisco as an example of a corporation which was also a content producer about its own offerings.

Corporately produced static content about a corporation’s products and services has evolved again and become dynamic with streaming TV. Now many companies are their own YouTube video channel producers. Companies like Intel produce talking head videos for YouTube viewers. Pure puts out YouTube cartoon videos. NetApp advanced the art by producing videos with itself as the destination site for viewers, not YouTube. And HPE, with Nvidia, is now pumping out its very own sitcom. Whatever next – characters interacting in a metaverse with inline GreenLake subscription options?

Data fragmentation needs a data intelligence fix


Increasing data fragmentation is hindering analytics and compliance, and requires a data intelligence and governance remedy to provide data visibility and control.

According to IDC’s 2021 “Global Survey of the Office of the Chief Data Officer (CDO)”, some 79 percent of organizations are using more than 100 data sources, making it hard to manage the entirety of an organization’s data estate.

Data visibility means knowing what data you have and where it came from, and it enables compliance and security. You can’t comply with regulations if you don’t know what data you have, nor can you secure data you don’t know you possess.

Financial analyst Jason Ader and his team at William Blair examined suppliers offering remedial products and services, noting an emerging ecosystem that is providing help for the different kinds of data users and data policy managers, as detailed in a chart in their report.

William Blair refers to this area as data intelligence and governance (DI&G) in a report entitled “Data Intelligence and Governance Market – Managing Data Duality in a Digital-First World.”

“The marriage of visibility and control in many ways defines the DI&G space and is the main catalyst driving convergence of previously disparate market segments like data catalogs, data quality, data lineage, data stewardship, master data management, metadata management, data integration, data privacy, data access governance, and data security.”

Over the past decade three separate groups of vendors have emerged:

  1. Data intelligence suppliers – analytics-centric, for example, finding and curating high-value data for business intelligence and analytics purposes.
  2. Data privacy and compliance – privacy-centric vendors concerned with the EU’s General Data Protection Regulation (GDPR), HIPAA, and the California Consumer Privacy Act (CCPA).
  3. Data security – security-centric businesses trying to protect sensitive data from internal and external threats.

Ader says via the report that “these three worlds are naturally and inevitably converging into one consolidated market.” The DI&G market, he says, will be worth $26 billion in 2025. This means that centralized DI&G platform products will develop, with initial convergence and consolidation in each of the three categories above. The report lists suppliers in each of the three categories.

We have previously written about Delphix and its data ops characteristics, as well as Egnyte with its secure enterprise collaboration features, and Quest.

Ader suggests the convergence “puts point product vendors at risk of disintermediation as the features they offer become consolidated into broader suites and platforms.” He also says that best-of-breed vendors will find it “more likely that they will be ripe acquisition targets for consolidators.”

Customer organisations are themselves fragmented in how they deal with data, with William Blair detailing the different parts of an organisation dealing with different types of data:

  • Chief Data Officer – buys data catalogs
  • Chief Information Security Officer – buys security products
  • Chief Privacy Officer – buys compliance products

Customer organizations will need to centralize their data intelligence and control roles under a CDO. The report says “we are seeing CDOs emerge as the top executive in charge of utilization and governance of data across the organization (which often includes responsibility for data security and privacy).”

The emergence of Cloud 3.0 and independent vendor exclusion

Dominant private/public cloud abstraction layers could exclude independent vendors from suppliers’ customer bases.

Dell and HPE are erecting public cloud-like abstraction layers over their on-premises hardware and software systems, transforming them into subscription-based services with a public cloud provisioning and consumption model. But APEX (Dell) and GreenLake (HPE) are clearly extensible to cover the three main public clouds as well – AWS, Azure, Google – forming an all-in-one cloud management, subscription, and provisioning service, providing one throat to choke across numerous cloud environments.

Research consultancy Wikibon suggests that the idea is evolving into a “supercloud” concept. An online research note says: “Early IaaS really was about getting out of the datacenter infrastructure management business – call that cloud 1.0 – and then 2.0 was really about changing the operating model and now we’re seeing that operating model spill into on-prem workloads. We’re talking here about initiatives like HPE’s GreenLake and Dell’s APEX.” 

Blocks & Files Supercloud diagram

Analyst John Furrier suggests: “Basically this is HPE and Dell’s Outposts. In a way, what HPE and Dell are doing is what Outposts should be. We find that interesting because Outposts was a wake up call in 2018 and a shot across the bow at the legacy players. And they initially responded with flexible financial schemes but finally we’re seeing real platforms emerge.”

Wikibon says: “This is setting up Cloud 3.0 or Supercloud as we like to call it. An abstraction layer above the clouds that serves as a unifying experience across the continuum of on-prem, cross clouds and out to the near and far edge.”

We think that the impetus for this is coming from the two main on-premises system vendors: Dell and HPE. They have pretty complete hardware and software stacks on which they run legacy, virtual machine (VM), and containerized apps. Because of the efforts made by VMware and Nutanix, VMs can run on-premises (private cloud) and in the public cloud. With Kubernetes adopted as a standard, containerized apps can run in both the private and public cloud environments as well. Given the right public cloud API interfaces, GreenLake and APEX can transfer apps between these environments and provision the compute, storage, and networking resources needed. Transferring data is not so easy because of data gravity, which we’ll look at in a moment.

It is in Dell and HPE’s interests to support a hybrid and multi-cloud environment, as their enterprise customers are adopting this model. Both Dell and HPE can retain relevance by supporting that. They do that initially by replicating the public cloud operating model for their on-premises (now private cloud) products – Cloud 2.0 – and then by facilitating bursting or lift-and-shift to the public clouds. Their safeguard here is to sing loud about frustrating public cloud lock-in attempts by each of the big three cloud service providers (CSPs), pointing out the cost disadvantages of being locked in to a single CSP.

They can also help manage unstructured data better by providing global file and object namespaces across the four environments: private cloud, AWS, Azure, and Google. 

This is entering into Hammerspace territory and leads us straight into the independent storage and data services vendor problem. Does that vendor compete with Dell and HPE or partner with them?

A consideration for Dell and HPE is that they own the assets they supply to customers, and those assets come with service level agreements (SLAs). They will not want to supply partner products unless they absolutely fit in with the APEX and GreenLake SLAs and can be maintained and supported appropriately.

Considerations like this will encourage these two prime vendors to have restricted partnership deals. They also brand their services as APEX for this or GreenLake for that, and prior product brands – such as HPE Alletra 9000 – are de-emphasized. Thus we have GreenLake for Block Storage. If GreenLake supplied a WEKA-sourced high-speed file service, it would not necessarily be branded GreenLake for WEKA but GreenLake for High-Speed File.

If HPE did such a deal with WEKA then it would most likely preclude HPE doing similar deals with DDN, IBM (Spectrum Scale), Panasas, Qumulo, and ThinkParQ (BeeGFS). We are going to potentially see a much-reduced IT system software component choice by customers adopting the APEX and GreenLake-style private clouds.

In other words, independent storage and storage application suppliers could find themselves excluded from selling to Dell’s APEX and HPE’s GreenLake customers, unless the customers have a pressing need for products not included in the APEX or GreenLake catalogs.

This is going to present an interesting strategic and marketing challenge to independent vendors.

Data gravity

It’s a truism that large datasets cannot be moved to or from the public cloud quickly, so compute needs to come to the data and not the other way around. We can say that the only reason data gravity exists is because networking is too slow. A 10TB dataset had gravity when 10Gbit networking was the norm. Assume that, after overheads, 10Gbit/sec equals 1GB/sec; it would then take 10,000 seconds to move a 10TB dataset – 166.7 minutes, or just over 2 hours 46 minutes.

Now we have 100Gbit networking: 10GB/sec, and the transfer takes 1,000 seconds, or 16 minutes and 40 seconds. Now let’s amp up the speed to 400Gbit, 40GB/sec, and our 10TB data transfer takes 250 seconds, or 4.2 minutes. The effect of data gravity diminishes as networking speed increases. If we reach 800Gbit/sec networking speed then our 10TB transfer would take 125 seconds, or 2.1 minutes.
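
The sums above follow a simple pattern, sketched below using the rule of thumb from this article that usable throughput is roughly the line rate in Gbit/sec divided by ten:

```python
# Dataset transfer time at various network speeds, using the article's rule of
# thumb that usable throughput is roughly line rate (Gbit/sec) / 10 in GB/sec.
def transfer_seconds(dataset_tb: float, line_rate_gbit: float) -> float:
    usable_gb_per_sec = line_rate_gbit / 10       # e.g. 10Gbit/sec -> ~1GB/sec after overheads
    return dataset_tb * 1000 / usable_gb_per_sec  # 1TB treated as 1,000GB

for gbit in (10, 100, 400, 800):
    secs = transfer_seconds(10, gbit)             # the article's 10TB dataset
    print(f"{gbit:>3}Gbit/sec: {secs:>7,.0f}s (~{secs / 60:,.1f} min)")

# 10Gbit/sec -> 10,000s (~166.7 min); 800Gbit/sec -> 125s (~2.1 min)
```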

Data gravity will always exist as long as datasets carry on growing. So a data analytics app looking at a 1PB dataset or larger will surely be subject to data gravity for many years – but smaller datasets may well find that data gravity weakens substantially as networking speeds increase.

Qumulo CEO speaks out on layoffs, growth and more: ‘Supply chain challenges are real’

Executives and office life at Qumulo in Seattle, WA July 20, 2018. Photo by Andy Rogers Images

Scale-out and parallel file services supplier Qumulo laid off around 80 people last week, attributing this to growth not meeting expectations among other factors. We asked CEO Bill Richter some questions about the circumstances surrounding the decision.

Blocks & Files: Did Qumulo try to raise additional funding in 2021 and this year?

Bill Richter, Qumulo

Bill Richter: Qumulo raised $125 million in 2020 and has not needed additional capital since.  

Blocks & Files: Were the growth expectations that failed to materialize the result of market factors such as COVID and China-caused supply chain issues, the Ukraine War, and lower consumer spending due to inflation? Or could you take a look at products and go-to-market structure?

Bill Richter: To be clear, we achieved record growth in the last fiscal year, and have hit records since.  The supply chain challenges are real, however. Qumulo is software-defined and runs on standard storage servers supplied by a half-dozen major OEMs.  So if one supplier has longer delays than another, customers have choices.  That’s incredibly strategic compared to other vendors that operate on bespoke hardware from a single or narrow channel.   

Qumulo’s products remain the most innovative and trusted in the industry.  Ask any of our 700-plus customers or check out our 4.9 out of 5 star rating on Gartner Peer Insights from over 100 of them.  We make two fundamental promises to them every time they purchase – outstanding capability on day one and constant innovation and service they can rely on in the future.  It’s not enough to make half that promise, and that’s what drove our thinking about restructuring the business for long-term sustainability and profitability.  Our mission is to help customers store, manage, and curate their data anywhere and forever. 

Blocks & Files: If market factors were the cause, then will other storage suppliers have to hunker down, as it were, as well?

Bill Richter: I’ll let others speak for themselves. I have seen reckless spending from others in the industry and the troubling thing is the risk that creates for customers. Overly hyped, flash-in-the-pan vendors are the biggest offenders. If a storage vendor isn’t thinking about long-term sustainability, then customers should be worried.  

Blocks & Files: Did the expectations fail to materialize across the breadth of Qumulo’s market or were some sectors affected more than others?

Bill Richter: Actually our demand has been strong and consistent across the geos in which we operate.  We power core mission-critical applications at the heart of the enterprise, and those don’t turn off or slow down during economic cycles. I’d be more worried if my business were tied solely to GPUs powering the ups and downs of Bitcoin mining.  

Blocks & Files: Can Qumulo move towards and reach profitability without additional funding?

Bill Richter: Yes, we have a very strong balance sheet and no debt outstanding. The changes we made this week were simply designed to accelerate that path. We may choose to raise capital in the future, but only opportunistically.

Blocks & Files: How will Qumulo refine its product and go-to-market strategy to enhance its growth and profitability prospects?

Bill Richter: Customers love our products and believe in our vision to help them store, manage, and curate their data anywhere, forever.  The “anywhere” part is a key area of focus, which is why you’ve seen us focus so much on public cloud capabilities and HPE GreenLake, and increase the number of storage OEMs that power our software.  In 2022, if you’re not thinking that way, then you’re keeping customers in the past.  

We recently added some outstanding industry veteran leadership to go even faster on both go-to-market and product development.  We have built powerful leveraged channels to deliver our products to customers all over the world efficiently, which is where you should expect us to double down our focus.  Spending not directly tied to innovation or leveraged growth was the focus for our restructuring.  

Blocks & Files: Qumulo’s leadership website shows no Chief Revenue Officer, Chief Marketing Officer, Chief Operations Officer or head of product engineering. How are these responsibilities being fulfilled? Will executives be recruited for the positions?

Bill Richter: We have a really simple leadership structure designed for customer outcomes: 

  • Eric Brodersen (Chief Customer Officer) is responsible for end-to-end go-to-market.  From the first moment of marketing awareness, to the commercial sales process, to ongoing (and industry leading) customer success. Customers want a high-continuity experience and our GTM organizational structure is designed around that.    
  • Kiran Bhageshpur (Chief Technology Officer) is responsible for all of product development. Again, customers want continuity from our big vision to the features and capabilities they use today. Our leadership structure is designed around them.  
  • Matt Frey (Chief Financial Officer) is a seasoned industry leader who leads all of Qumulo’s business operations.

Underneath these executives, we have outstanding leaders that are aligned around our mission and more than capable of powering the business forward. As we grow, you should expect us to continue to add leadership within the organizational structure outlined above.  

Blocks & Files: Should Qumulo pursue the high-performance primary file serving market for GPUs (GPUDirect for example) and similar data-intensive, low-latency application areas as well as the bulk capacity secondary data storage market?

Bill Richter: Qumulo delivers a software-defined storage solution. Our performance is more than sufficient for a wide variety of workflows, including HPC workflows, for a diverse set of customers. Furthermore, we expect, over the next 12-18 months, to more than double performance across a variety of axes through software optimizations.

Our customers value our innovations that deliver high-performance storage for a wide variety of workloads without relying on specialized hardware or non-standard software.

FDP

FDP – Flexible Data Placement. FDP allows the placement of different data in separate NAND blocks within an SSD. In a conventional SSD, all data written to the drive at a similar time will be stored in the same NAND blocks. When any of that data is either invalidated by a new write or unmapped, the flash translation layer (FTL) needs to do garbage collection (GC) in order to recover physical space. In a data placement SSD, data can be separated during host writes, allowing whole NAND blocks to be invalidated or unmapped at once by the host.
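
To illustrate the principle, here is a toy model of block invalidation. It is only a sketch of why separating data by stream reduces garbage collection work, not a representation of the NVMe FDP command set:

```python
# Toy model: how host-directed data placement reduces garbage collection copies.
BLOCK_PAGES = 4

def gc_copies(blocks: list[list[str]], invalid_stream: str) -> int:
    """Pages that must be relocated before partially valid blocks can be erased,
    after all pages belonging to one stream are invalidated."""
    copies = 0
    for block in blocks:
        valid = [page for page in block if page != invalid_stream]
        if valid and len(valid) < len(block):   # partially valid block -> GC must move data
            copies += len(valid)
    return copies

# Conventional SSD: pages from streams A and B written at the same time share blocks.
mixed = [["A", "B", "A", "B"], ["B", "A", "B", "A"]]
# Placement-aware SSD: each stream gets its own blocks.
separated = [["A", "A", "A", "A"], ["B", "B", "B", "B"]]

print("mixed placement GC copies:", gc_copies(mixed, "A"))          # 4 pages relocated
print("separated placement GC copies:", gc_copies(separated, "A"))  # 0 - whole block invalid
```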

SiPH

SiPH – Silicon photonics is a platform from which photonic integrated circuits (PICs) can be made. Silicon-on-insulator (SOI) wafers are the semiconductor substrate material, and most standard CMOS manufacturing processes can be applied to them. Silicon photonics uses semiconductor manufacturing infrastructure to produce optical components and systems that can transmit, process, and manipulate light at speeds higher than electrical equivalents. The technology is relevant for applications requiring high bandwidth and low latency, such as datacenters, telecommunications, and advanced computing, including generative AI training.