Databricks buys Tabular to win the Iceberg war

Mature startup lakehouse supplier Databricks, with north of $4 billion in total funding, has reportedly spent a billion dollars or more on buying Tabular.

Databricks supplies an object-based lakehouse data repository combining a data warehouse and a data lake. It created the open source Delta Lake format, since donated to the Linux Foundation, to enable high-performance querying of lakehouse content through Apache Spark, with ACID (Atomicity, Consistency, Isolation, Durability) transactions and batch or streaming data ingest. It is a fast-growing business and faces competition from the rival, and also open source, Iceberg table format.

The Wall Street Journal reports that Databricks is spending between $1 billion and $2 billion to buy Tabular. This will be an amazing financial return for Tabular’s founders, VC investors and other stockholders.

Ali Ghodsi, Co-founder and CEO at Databricks, said in a statement: “Databricks pioneered the lakehouse and over the past four years, the world has embraced the lakehouse architecture, combining the best of data warehouses and data lakes to help customers decrease TCO, embrace openness, and deliver on AI projects faster. Unfortunately, the lakehouse paradigm has been split between the two most popular formats: Delta Lake and Iceberg.”

He added: “Last year, we announced Delta Lake UniForm to bring interoperability to these two formats, and we’re thrilled to bring together the foremost leaders in open data lakehouse formats to make UniForm the best way to unify your data for every workload.”

Tabular was founded in 2021 by ex-Netflix people Daniel Weeks, Jason Reid, and Ryan Blue. Weeks and Blue were the original co-creators of the Apache Iceberg project when working at Netflix. Blue also serves as the Iceberg PMC Chair and Weeks is an Iceberg PMC member.

Iceberg format tables enable SQL querying of data lake contents. Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, and other query engines can work on the tables simultaneously. Tabular has developed a data management software layer based on Iceberg tables. It has raised a total of $37 million in two rounds, the latest for $26 million in September last year.

Delta Lake has over 500 code contributors, and over 10,000 companies globally use Delta Lake to process 4+ exabytes of data on average each day. But Delta Lake has not eclipsed Iceberg, which has many users of its own. Iceberg and Delta Lake are now the two leading open-source lakehouse formats, and both are based on Apache Parquet, an open-source, column-oriented data storage format.

Ryan Blue, Co-Founder and CEO at Tabular, said: “It’s been amazing to see both Iceberg and Delta Lake grow massively in popularity, largely fueled by the open lakehouse becoming the industry standard.”

In 2023 Databricks introduced Delta Lake UniForm tables to provide interoperability across Delta Lake, Iceberg, and Hudi (Hadoop Upserts Deletes and Incrementals). Hudi is a third open-source framework for building transactional data lakes, with processes for ingesting, managing, and querying large volumes of data. Delta Lake UniForm supports the Iceberg REST catalog interface so customers can use the analytics engines and tools they are already familiar with, across all their data.
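One way to picture this kind of interoperability is as two metadata layers describing the same set of Parquet data files. The sketch below is a toy model only, not Databricks' or Iceberg's actual implementation: the class names, the `mirror` helper, and the file names are all hypothetical, and the real formats carry far more metadata (schemas, statistics, manifests). It assumes a log-style view in which each commit adds and removes files, and a snapshot-style view in which each snapshot lists the files that are live.

```python
from dataclasses import dataclass, field


@dataclass
class LogStyleTable:
    """Hypothetical log-based view: an ordered list of add/remove commits."""
    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        self.commits.append({"add": list(add), "remove": list(remove)})

    def live_files(self):
        # Replay every commit in order to find the files that are still live.
        files = set()
        for c in self.commits:
            files |= set(c["add"])
            files -= set(c["remove"])
        return files


@dataclass
class SnapshotStyleTable:
    """Hypothetical snapshot-based view: each snapshot lists its live files."""
    snapshots: list = field(default_factory=list)

    def append_snapshot(self, files):
        self.snapshots.append(frozenset(files))

    def live_files(self):
        # The newest snapshot is the current state of the table.
        return set(self.snapshots[-1]) if self.snapshots else set()


def mirror(log_table, snapshot_table):
    """Publish the log view's current file set as a new snapshot (assumed step)."""
    snapshot_table.append_snapshot(log_table.live_files())


# The data files are written once; only the metadata is kept in two forms.
log_view = LogStyleTable()
snapshot_view = SnapshotStyleTable()

log_view.commit(add=["part-000.parquet", "part-001.parquet"])
mirror(log_view, snapshot_view)

log_view.commit(add=["part-002.parquet"], remove=["part-000.parquet"])
mirror(log_view, snapshot_view)

# Both views now resolve to the same data files.
assert log_view.live_files() == snapshot_view.live_files()
```

In this model a reader that understands only one of the two metadata styles still sees the same files, which is the property an interoperability layer needs to preserve.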

Vinoth Chandar, creator and PMC chair of the Apache Hudi project, and Onehouse founder and CEO, told us: “Users need open data architectures that give them control of their data to power all their use cases from AI to real-time analytics. We’re excited to see the increased investment in open data lakehouse projects, and this Databricks announcement may bring increased compatibility between Delta Lake and Iceberg. However, users demand more – they need a completely open architecture across table formats, data services, and data catalogs, which requires more interoperability across the stack.”

Delta Lake UniForm is only a partway step to full Iceberg-Delta Lake unification. The Tabular acquisition opens the door to that goal. Databricks says that, by bringing together the original creators of Apache Iceberg and Delta Lake, it can provide data compatibility so that customers are no longer limited by having to choose one of the formats.

Databricks (with Tabular) will work closely with the Delta Lake and Iceberg communities to develop lakehouse format compatibility. In the short term this will be provided inside Delta Lake UniForm. In the long term, Databricks will help develop a single, open, and common standard of interoperability for an open lakehouse.

Blue said: “With Tabular joining Databricks, we intend to build the best data management platform based on open lakehouse formats so that companies don’t have to worry about picking the ‘right’ format or getting locked into proprietary data formats.”

The elephant in the room here is making unstructured data available to generative AI’s large language models, both for training, which needs massive data sets, and for inference. That has led Databricks to spend such a large amount of its VC-invested cash on buying Tabular at such a high price. The deal will enable it to compete more intensely against major player Snowflake.

Snowflake has just announced Polaris Catalog, a vendor-neutral, open catalog implementation for Apache Iceberg. This will be open-sourced and it will provide interoperability with AWS, Confluent, Dremio, Google Cloud, Microsoft Azure, Salesforce, and more.

The Tabular acquisition should close by the end of July, subject to closing conditions, and the bulk of Tabular’s 40 employees will join Databricks.