Storage facing trillion-row database apocalypse


A new era of large-scale analytics and storage is opening up ahead of us. Ocient, Imply, VAST Data, and WEKA are four startups positioned to store hundreds of petabytes of data, or trillions of database rows, and access it in seconds. They all employ massively parallel access techniques in one way or another and fundamentally use software rather than hardware to achieve their performance levels.

Update. SingleStore’s database can chew through a trillion rows/second using 448 Xeon Platinum 8180 Skylake cores, 28 per processor, as its blog describes. The data was stored on SSDs, though that wasn’t really significant since the queries were run warm-start, so the data was already cached in memory. The network was 10-gigabit Ethernet.
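To put that figure in perspective, here is a quick back-of-the-envelope calculation of the per-core scan rate implied by the two numbers quoted above. It assumes the work was spread evenly across all 448 cores, which the text does not state.

```python
# Figures quoted in the text above.
ROWS_PER_SECOND = 1_000_000_000_000  # one trillion rows/second
TOTAL_CORES = 448


def rows_per_core_per_second(rows_per_second: int, cores: int) -> float:
    """Average scan rate per core, assuming an even spread of work (an assumption)."""
    return rows_per_second / cores


per_core = rows_per_core_per_second(ROWS_PER_SECOND, TOTAL_CORES)
print(f"{per_core / 1e9:.2f} billion rows/second per core")
```

That works out to roughly 2.2 billion rows per second per core, a rate that is only plausible when the data is already in memory, as it was in this warm-start test.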

That said, NVMe SSDs are a fundamental part of the VAST Data and Ocient products, and have played a role in WEKA deployments as well.

The need for such high-speed access to structured and unstructured data is not yet widespread. It’s concentrated in a few markets – such as financial trading (where VAST and WEKA are doing well), online media advertising display technology (an Ocient focus), high-performance computing (WEKA again), and AI/ML model training.

The AI/ML driver

VAST co-founder Jeff Denworth thinks the use of AI/ML technology is going to spread into the general business market. Most businesses will need to trawl through their internal production logs and external customer interaction data to find patterns, analyze causes, and make decisions to optimize internal and external operations. This may save or make only pennies per operation, but at scale those pennies add up to significant sums.

ML models are being used to aid diagnoses from medical device scans, investment trading decisions, factory production operations, logistics delivery paths, product recommendations, process improvements, and staff efficiency. The complexity of ML models is roughly doubling year-on-year, according to Denworth. The general rule is that the larger the model, the better the training and subsequent inferences.

Pure Storage is moving into the larger dataset market, and high-end array supplier Infinidat might say it is already there.

All these companies are intent on reacting to this move from petabytes to exabytes. They see it affecting on-premises environments as well as public cloud ones. VAST is an on-premises company but will be cloud-connected – if not cloud-present – in some way in the future. Ocient is both on-premises and in the cloud, as is WEKA. Imply is pure software so can run in the cloud, whereas Infinidat is an on-premises business.

Their acceptance and embrace of exabyte-level scale sets them apart from mainstream storage providers who will, Denworth says, have to overcome significant architectural disadvantages if they want to compete. 

Ocient Hyperscale Data Warehouse

Ocient has just launched its Hyperscale Data Warehouse product. It is a v19.0 product – earlier versions have been used, successfully it claims, for hyperscale deployments over the past year with a select group of enterprise customers. It says the product is engineered to deliver unmatched price-performance for rapid, complex, and continuous analysis of massive structured and semi-structured datasets. Customers can execute previously infeasible workloads in interactive time, returning results in seconds or minutes versus hours or days.

The software has a Compute Adjacent Storage Architecture (CASA), Ocient says, which places storage adjacent to compute on industry-standard NVMe solid-state drives. This delivers hundreds of millions of random read IOPS and enables massively parallelized processing across simultaneous loading, transformation, storage, and data analysis of complex data types. The whole data path has been optimized for such performance. 

For example, it has a high-throughput custom interface to NVMe SSDs that issues highly parallel reads at high queue depths to saturate the drive hardware. There is a lock-free, massively parallel SQL cost optimizer that ensures each query plan is executed to the best of its ability within its service class and without impacting the performance of other workloads or users.
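The general idea of keeping many reads in flight at once can be sketched in a few lines of Python. This is only an illustration of the technique, not Ocient's implementation: it uses ordinary operating-system reads and a thread pool rather than a custom NVMe interface, it needs a Unix-like system for `os.pread`, and the names `BLOCK_SIZE`, `NUM_BLOCKS`, `QUEUE_DEPTH`, and `parallel_random_reads` are hypothetical values and helpers chosen for the demo.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096   # bytes per read (assumed for the demo)
NUM_BLOCKS = 256    # size of the demo file in blocks (assumed)
QUEUE_DEPTH = 32    # number of reads kept in flight at once (assumed)


def make_demo_file() -> str:
    """Create a temporary file in which block i is filled with byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(NUM_BLOCKS):
            f.write(bytes([i % 256]) * BLOCK_SIZE)
    return path


def read_block(fd: int, block: int) -> bytes:
    # os.pread takes an explicit offset, so threads can share one descriptor.
    return os.pread(fd, BLOCK_SIZE, block * BLOCK_SIZE)


def parallel_random_reads(path: str, blocks: list, queue_depth: int = QUEUE_DEPTH) -> list:
    """Read the given blocks with up to queue_depth requests outstanding."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            # pool.map returns results in the same order as the input blocks.
            return list(pool.map(lambda b: read_block(fd, b), blocks))
    finally:
        os.close(fd)


path = make_demo_file()
blocks = random.sample(range(NUM_BLOCKS), 64)   # 64 distinct random block numbers
results = parallel_random_reads(path, blocks)
os.remove(path)
print(all(data == bytes([b % 256]) * BLOCK_SIZE for b, data in zip(blocks, results)))
```

With a single outstanding request, a drive spends most of its time waiting for the next command. Raising the number of concurrent requests gives the device more work to schedule internally, which is the effect the high queue depths described above are meant to achieve.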

The Ocient Hyperscale Data Warehouse is generally available as a fully managed service hosted in OcientCloud, as an on-premises deployment in the customer’s datacenter, and through the Google Cloud Marketplace.

Incumbent state

VAST Data has a significant software launch coming up. Denworth says that what VAST did for its hardware array, stateless controllers, and single-tier QLC flash storage, it will now do for software. 

Incumbents will need to respond to match what the newcomers have. Going all-flash and single-tier is not enough – they have to change their software. This could mean software technology that takes years to develop from the ground up. We might see incumbents buying this technology rather than developing it. We might also see processor chip developers, like Nvidia, buying their way in to keep their GPUs fed with the data they need to crunch AI/ML training models.

Unless Dell EMC, IBM, HPE, NetApp, Qumulo, and the object storage suppliers can demonstrate that they can operate at the same scale, performance, resilience, and cost as these up-and-comers, they may have to fight harder for the multi-hundred petabyte-level, trillion-row structured/unstructured dataset area – at least if what Imply, Ocient, VAST and WEKA see coming is correct.