Two open source data lakehouse platforms will emerge, says db insight

Research house db insight says it’s likely Delta Lake and Iceberg will become the main data lakehouse platforms, with proprietary alternatives being a transitional stage.

This prediction is made by analyst Tony Baer in a research report entitled “Data Lakehouse open source market landscape.” This 13-page document looks at the data lakehouse technology area, defining the technology and comparing three open source projects – Delta Lake, Apache Hudi and Apache Iceberg – with proprietary alternatives from vendors such as AWS, Oracle and Teradata.

Baer writes that “Over the past five years, a new construct designed to combine the best of the Data Warehouse and Data Lake worlds has emerged: the Data Lakehouse.” This construct is based on a key enabling technology: “new table formats overlaying atop cloud object storage that deliver the performance, governance, granular security, and ACID transaction support of data warehouses, combined with the economics of scale and the analytical flexibility of data lakes.”

In his view, “Data lakehouses will enable data lakes to perform and be controlled, governed, and secured like data warehouses.”

Data warehouses support ACID (Atomicity, Consistency, Isolation, and Durability) transactions as a way of maintaining data integrity. ACID transaction support will become the linchpin of data lakehouses because it will give enterprises confidence in the consistency of their data. Baer says: “In the long run, open source will prevail because ACID support will be table stakes, not a competitive differentiator.”

He thinks there is 80 percent functional parity between Delta Lake, Hudi and Iceberg. It will be “the breadth and depth of the commercial ecosystem and depth of support that will determine the winners” and there are likely to be just two. In his words: “Today, Delta Lake and Iceberg have the clear momentum and are clearly the early favorites to be the lakehouses left standing.”

As for Hudi (which stands for Hadoop Upserts Deletes Incrementals), Baer writes: “The challenge is building a commercial ecosystem beyond the long tail, with pressing need to line up a major data platform heavyweight” such as IBM, Oracle or SAP.

Baer says Data Lakehouses will eventually co-opt the enterprise data warehouse because they provide many of the same capabilities for multi-function analytics, but they will not replace data lakes or purpose-built warehouses or data marts. Cloud data warehouses with support for polyglot data types, Python and AutoML capabilities will be, he thinks, the gateway drugs for data lakehouses.

This is a readable and informative report that is free to download.