Amazon has added an Iceberg table bucket type to S3 along with S3 Metadata to create an S3 analytics data lake capability competing with Databricks, Dremio, SingleStore and other Iceberg-using data lake suppliers, as well as Iceberg-supporting object storage suppliers.
The new S3 Tables bucket type and S3 Metadata were announced at the recent re:Invent conference. The announcement involves Apache Iceberg and Parquet and is a means of accelerating data lake activities with “up to 3x faster query performance, up to 10x higher transactions per second (TPS), and automated table maintenance and automation for analytics workloads.” This is in comparison to general-purpose S3 buckets.
Andy Warfield, VP for Storage, and distinguished AWS engineer, stated: “We have seen the rapid rise of tabular data and, increasingly, customers want to query across tables, improve query performance, and understand and organize troves of data so they can easily find exactly what they need. S3 Tables and S3 Metadata remove the overhead of organizing and operating table and metadata stores on top of objects, so customers can shift their focus back to building with their data.”
Let’s go down the multi-layer data lake analytics stack to some basics. Amazon says a data lake is a central data store for all of an organization’s structured and unstructured data at any scale. Unlike a data warehouse, which is optimized for schema-organized relational data from transaction systems, a data lake holds both relational data from sources like operational databases and non-relational data from sources like mobile apps, IoT devices, and social media. The data is stored as-is, without having to be structured first, but it still has to be organized somehow. As we understand it, a data lake typically stores its base data in tables held in the Apache Parquet file format. This is a columnar storage format optimized for analytical workloads.
Apache Iceberg is a management or metadata layer, an open table format (OTF) that sits on top of Parquet files and supports schema evolution, time travel, ACID (Atomicity, Consistency, Isolation, and Durability) transactions for maintaining data integrity, partitioning, and query optimization, without requiring changes to the underlying data files. Parquet itself does not manage table-level metadata or partitions.
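To make the snapshot idea concrete, here is a toy Python model of how a table format can offer time travel without touching data files: every commit records a new snapshot that lists which immutable files are visible. This is a conceptual sketch only, not Iceberg's actual metadata layout or API; the `ToyTable` and `Snapshot` names and the file names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # names of immutable data files visible in this snapshot


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table's metadata layer (not Iceberg itself)."""
    snapshots: list = field(default_factory=list)

    def commit(self, added=(), removed=()):
        # A commit never rewrites existing files; it only records a new file list.
        current = set(self.snapshots[-1].data_files) if self.snapshots else set()
        files = (current - set(removed)) | set(added)
        snap = Snapshot(len(self.snapshots) + 1, tuple(sorted(files)))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files_as_of(self, snapshot_id=None):
        # Time travel: read the file list of an older snapshot (None = latest).
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)


table = ToyTable()
first = table.commit(added=["part-0001.parquet"])
second = table.commit(added=["part-0002.parquet"])
third = table.commit(added=["part-0003.parquet"], removed=["part-0001.parquet"])
```

Reading `table.files_as_of(first)` still returns the original file even after the third commit has dropped it from the current view, which is the essence of querying an old snapshot.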
Amazon says organizations use Iceberg to query across billions of files containing petabytes or even exabytes of data. Datalake suppliers like Databricks, Dremio and SingleStore support and use Iceberg in their offerings, or they build their own systems. Amazon suggests “These external systems are costly and complex, and they require skilled teams to maintain, using up valuable resources.”
S3 Tables are purpose-built for managing Apache Iceberg tables for data lakes. Amazon says “customers can use S3 Tables by creating a table bucket that optimizes the storage and querying of tabular data in fully-managed Iceberg tables.” These S3 tables “automatically manage table maintenance tasks such as compaction for better query performance and snapshot management to continuously optimize query performance and storage costs, even as customers’ data lakes scale and evolve.”
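Compaction, in general terms, means rewriting many small data files into fewer, larger ones so that queries open fewer objects. The sketch below shows one simple way such a plan could be drawn up, by greedily grouping small files up to a target size. It is an illustration under our own assumptions, not a description of how S3 Tables does it; the `plan_compaction` function, the 512 MB target, and the file names are all hypothetical.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files so that each group totals at most target_mb."""
    groups, current, current_size = [], [], 0
    # Visit files from smallest to largest.
    for name, size in sorted(file_sizes_mb.items(), key=lambda kv: kv[1]):
        if size >= target_mb:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    # A group of one file would just be rewritten unchanged, so drop it.
    return [group for group in groups if len(group) > 1]


files = {"a": 8, "b": 16, "c": 32, "d": 600, "e": 500}
plan = plan_compaction(files)
```

In this example the three small files are merged into one group, while the 500 MB and 600 MB files are left untouched.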
They get Iceberg features such as row-level transactions, queryable snapshots of older versions of a dataset via time travel, schema evolution, in which a table’s structure is altered with no disruption to the data files, and table-level access controls.
S3 Metadata is stored in an S3 Tables bucket. It stores automatically generated metadata from objects loaded into the S3 Tables buckets, in near real-time. The metadata includes more than “20 elements including the bucket name, object key, creation/modification time, storage class, encryption status, tags, and user metadata. You can also store additional, application-specific descriptive information in a separate table and then join it with the metadata table as part of your query.” See a Jeff Barr blog for more information.
Regarding the separate table for app-specific metadata, Amazon says “Customers can add their own custom metadata using object tags to annotate objects with information specific to their business, such as product SKUs, transaction IDs, or content ratings, or with customer details.”
This metadata can be queried with SQL, enabling customers to “find and prepare data for use in business analytics and real-time inference applications, as well as fine-tune foundation models, perform retrieval augmented generation (RAG), integrate data warehouse and analytics workflows, perform targeted storage optimization tasks, and more.”
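As a rough picture of what querying object metadata with SQL and joining it to an application-specific table might look like, the sketch below uses Python's built-in sqlite3 module as a local stand-in for a real query engine. The table names, column names, and rows are assumptions made up for the example; they are not the actual S3 Metadata schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified metadata table (not the real S3 Metadata schema).
CREATE TABLE object_metadata (
    bucket TEXT, object_key TEXT, storage_class TEXT, size INTEGER
);
-- Hypothetical application-specific table kept alongside it.
CREATE TABLE product_info (
    object_key TEXT, sku TEXT, content_rating TEXT
);
INSERT INTO object_metadata VALUES
    ('media', 'video/a.mp4', 'STANDARD', 1200),
    ('media', 'video/b.mp4', 'GLACIER', 3400),
    ('media', 'img/c.png', 'STANDARD', 80);
INSERT INTO product_info VALUES
    ('video/a.mp4', 'SKU-1', 'PG'),
    ('video/b.mp4', 'SKU-2', 'R'),
    ('img/c.png', 'SKU-3', 'PG');
""")

# Find PG-rated objects in the STANDARD storage class, with their SKUs.
rows = conn.execute("""
    SELECT m.object_key, m.size, p.sku
    FROM object_metadata AS m
    JOIN product_info AS p ON p.object_key = m.object_key
    WHERE m.storage_class = 'STANDARD' AND p.content_rating = 'PG'
    ORDER BY m.object_key
""").fetchall()
```

The same join pattern would apply in an engine such as Athena, with the real metadata table in place of the toy one.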
Amazon, with S3 Tables and S3 Metadata, is providing a more performant data querying and visualization connectivity layer, via its Glue Data Catalog, between its S3 object storage and AWS Analytics services such as Athena, Redshift, EMR, and QuickSight. The performance comparison is with the S3 general bucket which has no in-built Iceberg table support.
A YouTube video of an Amazon S3 Tables presentation at re:Invent is available. There are Amazon news blog posts for S3 Tables and S3 Metadata which provide more information.
Competing object storage suppliers Cloudian, MinIO and Scality can also have their storage function as a data lake and be integrated with, for example, Apache Spark and Dremio, to query Iceberg tables and use Iceberg’s schema evolution, time travel, and partitioning features.
Availability
S3 Tables integration with AWS Glue Data Catalog is in preview, and available now in the US East (Ohio, N. Virginia) and US West (Oregon) AWS Regions. Customers pay for storage, requests, an object monitoring fee, and fees for compaction. See the S3 Pricing page for more info.
S3 Metadata is available in preview now, also in the US East (Ohio, N. Virginia) and US West (Oregon) AWS Regions. Glue Data Catalog integration is in preview as well.
Pricing is based on the number of updates (object creations, object deletions, and changes to object metadata) with an additional charge for storage of the metadata table. For more pricing information, visit the S3 Pricing page.