Apache Iceberg – an open source table format for large-scale datasets in data lakes. It sits above file formats such as Parquet, ORC, and Avro, which are in turn kept in cloud object stores such as Amazon S3, Azure Blob Storage, and Google Cloud Storage. It brings database-like features to data lakes, such as ACID transactions, partitioning, time travel, and schema evolution. Iceberg organizes data into three layers:
- Data Files: The actual data, stored in formats like Parquet or ORC
- Metadata Files: Track which data files belong to a table, their schema, and partition info
- Snapshot Metadata: Logs every change (commit) as a snapshot, enabling time travel and rollbacks
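To make the layering concrete, here is a minimal toy sketch in Python of how snapshots can enable time travel and rollbacks. It is not Iceberg's API or its on-disk layout: `ToyTable`, `Snapshot`, and the file paths are hypothetical names made up for illustration, and the model assumes append-only commits where each snapshot lists every data file visible at that point.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot record: an id plus the data files visible at that commit."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical table metadata: the schema, the snapshot log, and a current pointer."""
    schema: Dict[str, str]
    snapshots: List[Snapshot] = field(default_factory=list)
    current_id: Optional[int] = None

    def _get(self, snapshot_id: int) -> Snapshot:
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise KeyError(f"unknown snapshot {snapshot_id}")

    def files(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Data files visible now, or as of an older snapshot (time travel)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return () if sid is None else self._get(sid).data_files

    def append(self, *new_files: str) -> int:
        """Commit new data files as a new snapshot; older snapshots are left untouched."""
        snap = Snapshot(len(self.snapshots) + 1, self.files() + tuple(new_files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap.snapshot_id

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot without deleting any data files."""
        self._get(snapshot_id)  # raises KeyError if the snapshot does not exist
        self.current_id = snapshot_id


table = ToyTable(schema={"id": "long", "name": "string"})
first = table.append("data/a.parquet")   # commit 1
second = table.append("data/b.parquet")  # commit 2
as_of_first = table.files(first)         # time travel: read the table as of commit 1
table.rollback(first)                    # rollback: commit 1 becomes current again
```

In this sketch a rollback only moves the current pointer, so the later snapshot stays in the log and can still be read by id. That is the property the snapshot layer is described as providing.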
Iceberg tables are used in big data workloads and can be queried with SQL. Query engines such as Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, and others can work on the same tables concurrently. Iceberg manages each table by tracking its metadata and recording every change as a snapshot.