IBM’s watsonx.data and Storage Scale accelerated AI platform

IBM has combined data lakehouse and parallel file system technologies to provide a scalable storage platform for AI processing and analytics, using its watsonx.data and Storage Scale products.

IBM claims this delivers extreme AI performance, using Nvidia GPUDirect Storage (GDS) to train generative AI models faster. There is multi-protocol support enabling simpler workflows, a unified data platform for analytics and AI, and support for retrieval-augmented generation (RAG) using a customer’s proprietary data.

Big Blue’s watsonx.data is a data lakehouse. It combines the features of a data lake – based on a scale-out architecture using commodity servers and capable of storing and processing large volumes of structured and unstructured data – with the performance of a data warehouse. It supports the Apache Iceberg open table format, enabling different processing engines to access the same data at the same time.
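To illustrate the shared-table idea, here is a minimal PySpark sketch that creates and queries an Iceberg table. The catalog name, warehouse path, and table names are hypothetical; a real deployment would point the warehouse at storage exposed through watsonx.data rather than a local directory.

```python
# Minimal PySpark + Apache Iceberg sketch (illustrative; catalog and
# warehouse names are hypothetical placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Pull the Iceberg Spark runtime onto the classpath (version assumed).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    # Register an Iceberg catalog named "demo" backed by a Hadoop warehouse.
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table and insert rows; any Iceberg-aware engine
# (Spark, Presto/Trino, and so on) could then read the same table files.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, msg STRING) USING iceberg")
spark.sql("INSERT INTO demo.db.events VALUES (1, 'hello'), (2, 'world')")
spark.sql("SELECT * FROM demo.db.events").show()
```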

Storage Scale is a parallel, scale-out file system originally called GPFS. It is used as a storage layer underneath watsonx.data, providing object storage facilities beneath a file access overlay. Storage Scale v5.2.1 provides this through a non-containerized High Performance S3 protocol service.

How it fits together

An IBM diagram, with added B&F yellow box elements, lays out the software componentry:

There are separate, disaggregated compute and storage layers. For compute, a Red Hat OpenShift container cluster base is used by watsonx.data applications, which include Presto and Spark. Presto provides data lake analytics through a distributed SQL query engine, while Spark is an in-memory big data processing and analytics resource.
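As a sketch of how an application might hit the Presto engine, the following uses the presto-python-client package; the host, catalog, and schema values are hypothetical stand-ins for a watsonx.data deployment.

```python
# Querying a Presto engine from Python (illustrative; connection details
# are hypothetical). Requires the presto-python-client package.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",  # hypothetical Presto endpoint
    port=8080,
    user="analyst",
    catalog="iceberg_data",          # hypothetical catalog of Iceberg tables
    schema="db",
)
cur = conn.cursor()
cur.execute("SELECT id, msg FROM events LIMIT 10")
for row in cur.fetchall():
    print(row)
```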

This layer also includes a Hive Metastore, providing a shared metadata service, and a Milvus vector database service. Milvus is used to enable RAG by accessing a customer’s potentially large datasets residing on Storage Scale.
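A minimal pymilvus sketch of the RAG-style lookup involved is below, assuming a reachable Milvus service and embeddings produced by an external model; the URI, collection name, and toy vectors are hypothetical.

```python
# Vector similarity lookup for RAG with Milvus (illustrative; the URI,
# collection name, and tiny 8-dim vectors stand in for real embeddings
# of documents held on Storage Scale).
from pymilvus import MilvusClient

client = MilvusClient(uri="http://milvus.example.internal:19530")
client.create_collection(collection_name="docs", dimension=8)

# Insert document chunks alongside precomputed embedding vectors.
client.insert(
    collection_name="docs",
    data=[
        {"id": 1, "vector": [0.1] * 8, "text": "chunk one"},
        {"id": 2, "vector": [0.9] * 8, "text": "chunk two"},
    ],
)

# Retrieve the chunks nearest a query embedding; these would be passed
# to the generative model as grounding context.
hits = client.search(collection_name="docs", data=[[0.1] * 8],
                     limit=1, output_fields=["text"])
print(hits)
```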

There are three main elements in the separate storage infrastructure: Storage Scale file system clusters holding the data; Active File Management (AFM) for storage abstraction and acceleration; and an S3 data access protocol service for high-performance object access.

The S3 service exposes object storage buckets to watsonx.data for attachment to a query engine such as Presto or Spark. S3 objects are mapped to files and buckets are mapped to directories within Storage Scale and vice versa.
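To make the mapping concrete, here is a hedged boto3 sketch against a Storage Scale S3 endpoint; the endpoint URL, credentials, and paths are hypothetical. An object written as bucket/key should then appear as a file under the corresponding directory in the Storage Scale file system.

```python
# Writing an object through the S3 protocol service (illustrative; the
# endpoint and credentials are hypothetical). In Storage Scale's mapping,
# the bucket corresponds to a directory and the object key to a file path.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://scale-s3.example.internal",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="training-data")  # surfaces as a directory
s3.put_object(Bucket="training-data", Key="docs/a.txt", Body=b"hello")  # surfaces as a file

# A query engine such as Presto or Spark would read the same bucket via S3,
# while file-based tools could reach the same bytes through the POSIX path.
print(s3.get_object(Bucket="training-data", Key="docs/a.txt")["Body"].read())
```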

The S3 buckets can be local to the storage layer or cached (and thus accelerated) by Storage Scale from external object stores which may be globally dispersed across various clouds, datacenters and locations. In either case, multiple Spark and Presto engine instances connect to the Storage Scale layer using the S3 protocol to access the buckets. 

AFM has local caching and enables sharing of data across clusters, virtualizing remote S3 buckets at the fileset level. It implements a global namespace across Storage Scale clusters and can include NFS data sources in this namespace as well. The remote S3 buckets appear as local buckets under the Storage Scale file system, within a common storage namespace, eliminating the need for data copies.

The virtualizing of remote S3 buckets relies on Storage Scale High Performance S3, which is based on the open source NooBaa software. NooBaa is object storage software running on x86 servers and storage, presented as an S3-like cloud service; it abstracts storage infrastructure across hybrid multi-cloud environments and provides data storage service management. Red Hat acquired NooBaa in 2018 and made it part of its OpenShift Data Foundation (ODF) product set. IBM bought Red Hat in 2019 and added ODF to its then Spectrum Fusion offering – now Storage Fusion – alongside the existing containerized version of Spectrum Scale and Spectrum Protect data protection.

NooBaa now acts as a customizable and dynamic data gateway for objects, providing data services such as caching, tiering, mirroring, deduplication, encryption, and compression over any storage resource, including S3, GCS, Azure Blob, and file systems.

Storage Scale’s High Performance Object S3 service is optimized for multi-protocol data access. It replaces the earlier Swift-based Object S3 and Containerized S3 service implementations in Storage Scale. A Cluster Export Services (CES) facility within Storage Scale manages high availability through CES nodes.

Multiple tiers

IBM says there can be multiple performance tiers for Storage Scale storage, to optimize costs and performance. There can be a high-performance tier for hot data, along with a cost-effective tier or even tape for long-term storage and archival, together with automated, policy-driven placement across tiers – making the tiering seamless and transparent to applications. A sketch of such a policy follows.
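As an illustration of the policy-driven placement idea, the snippet below composes rules in Storage Scale’s ILM policy language and writes them to a file; the pool names, thresholds, and age cutoff are hypothetical, and the administrator commands mentioned in the comments are indicative rather than a full procedure.

```python
# Sketch of Storage Scale ILM tiering rules (illustrative; pool names and
# thresholds are hypothetical). New files land on a fast pool; cold files
# migrate to a capacity pool once the fast pool passes a fill threshold.
policy = """
RULE 'place_hot' SET POOL 'nvme'
RULE 'migrate_cold'
  MIGRATE FROM POOL 'nvme' THRESHOLD(80,60)
  TO POOL 'capacity'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
"""

with open("tiering.pol", "w") as f:
    f.write(policy)

# An administrator would then install placement rules, e.g. with
#   mmchpolicy <filesystem> tiering.pol
# and run migrations, e.g. with
#   mmapplypolicy <filesystem> -P tiering.pol
```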

This combined watsonx.data and Storage Scale system provides a unified but disaggregated compute and storage platform on which to run AI applications for both training and inference. Customers may well value this as IBM is acting as a single source for the required software. We have covered other AI data platform approaches from Dell, HPE, Lenovo, NetApp, MinIO and Pure, with VAST Data preparing its own Data Engine offering.

The watsonx.data and Storage Scale AI bundle is described in an IBM Redbook which “showcases how IBM watsonx.data applications can benefit from the enterprise storage features and functions offered by IBM Storage Scale.”

You can find out more about the latest watsonx.data v2.0.2 release from IBM’s release notes.