Snowflake competitor Databricks claimed a TPC-DS benchmark record for its data lakehouse technology and said a study showed it was 2.7x faster than Snowflake. Snowflake has come out fighting, saying that the study was flawed and that Databricks' comparison lacks integrity.
Databricks claims that source data in its so-called data lakehouse can be analysed faster than if it were filtered and processed through an Extract-Transform-Load (ETL) procedure and then loaded into a data warehouse, such as Snowflake’s, for analysis. TPC-DS is a decision support benchmark with audited results. Databricks achieved 32,941,245 QphDS @ 100TB, beating the previous world record held by Alibaba’s custom-built system, which achieved 14,861,137 QphDS @ 100TB.
Databricks announced that a research team at Barcelona Supercomputing Center (BSC) ran a different benchmark comparing Databricks SQL and Snowflake, and found that Databricks SQL was 2.7x faster than a similarly sized Snowflake setup. The team benchmarked Databricks in two modes: on-demand and spot (underlying machines backed by spot instances with lower reliability but also lower cost). Databricks was 7.4x cheaper than Snowflake in on-demand mode, and 12x cheaper in spot mode.
A Snowflake blog by founders Benoit Dageville and Thierry Cruanes said Snowflake had deliberately not engaged in “benchmarking wars and making competitive performance claims divorced from real-world experiences. This practice is simply inconsistent with our core value of putting customers first.”
Also: “Anyone who has been in the industry long enough can likely attest to the reality that the benchmark race became a distraction from building great products for customers.” However, in this Databricks instance, “Though Databricks’ results are under audit as part of the TPC submission process, it’s turned the communication of a technical accomplishment into a marketing stunt lacking integrity in its comparisons with Snowflake.”
The two founders say “The Snowflake results that it published were not transparent, audited, or reproducible. And, those results are wildly incongruent with our internal benchmarks and our customers’ experiences.”
The Databricks blog included this chart:
The TPC-DS power run consists of running 99 queries against the 100TB scale TPC-DS database.
Snowflake took issue with the Databricks-Barcelona result and ran the test itself:
It said: “Out of the box, all the queries execute on a 4XL warehouse in 3,760s, using the best elapsed time of two successive runs. This is more than two times faster than what Databricks has reported as the Snowflake result, while using a 4XL warehouse, which is only half the size of what Databricks indicated it used for its own power run.”
But Databricks was still faster, though not by as much. However, Snowflake is developing 5XL warehouse technology and claims “Our 5XL in its current form significantly beats Databricks in total elapsed time (2,597s versus 3,527s), and we expect material improvements when it reaches general availability.”
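To put the elapsed-time claims on a common footing, here is a minimal sketch of the ratios implied by the three figures quoted above. The variable and function names are illustrative only, and the calculation assumes the quoted times are directly comparable, which is precisely what the two companies dispute.

```python
# Elapsed times in seconds for the 100TB TPC-DS power run, as quoted above.
DATABRICKS_S = 3527      # Databricks run submitted to TPC
SNOWFLAKE_4XL_S = 3760   # Snowflake's own 4XL result
SNOWFLAKE_5XL_S = 2597   # Snowflake's pre-release 5XL result


def speedup(slower_s: float, faster_s: float) -> float:
    """Return how many times faster the shorter run is than the longer one."""
    return slower_s / faster_s


databricks_vs_4xl = speedup(SNOWFLAKE_4XL_S, DATABRICKS_S)
snowflake_5xl_vs_databricks = speedup(DATABRICKS_S, SNOWFLAKE_5XL_S)

print(f"Databricks vs Snowflake 4XL: {databricks_vs_4xl:.2f}x")
print(f"Snowflake 5XL vs Databricks: {snowflake_5xl_vs_databricks:.2f}x")
```

On these numbers, Databricks leads the 4XL warehouse by roughly 7 percent, while the unreleased 5XL would lead Databricks by roughly 36 percent.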
Databricks also said the Barcelona study showed it had vastly better price/performance than Snowflake:
The Snowflake founders dislike Databricks’ price/performance comparison too, saying it is misleading. “Our Standard Edition on-demand price for a 4XL warehouse run in the AWS-US-WEST cloud region is $256 for an hour. Since Snowflake has per-second billing, the price/performance for the entire power run is $267 for Snowflake, versus the $1,791 Databricks reported on our behalf.” Here is its chart showing this:
So, again, Databricks was better than Snowflake, although by much less of a margin. However, the Snowflake founders argue: “Using Standard Edition list price, Snowflake matches Databricks on price/performance: $267 versus $275 for the on-demand price of the Databricks configuration used for the 3,527s power run that was submitted to TPC.”
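Snowflake's $267 figure can be checked from the numbers it quotes. The sketch below assumes simple pro-rata per-second billing on the $256 hourly list price, with no minimum charge or rounding rules; the function name is hypothetical, not part of any Snowflake tooling.

```python
SECONDS_PER_HOUR = 3600


def per_second_cost(elapsed_s: float, hourly_rate_usd: float) -> float:
    """Cost of a run billed per second at a given hourly rate (pro-rata assumption)."""
    return elapsed_s * hourly_rate_usd / SECONDS_PER_HOUR


# Figures quoted by Snowflake: 3,760s power run on a 4XL warehouse at $256/hour.
snowflake_cost = per_second_cost(3760, 256)

# Figures Snowflake quotes for comparison.
databricks_on_demand_cost = 275      # Databricks on-demand, 3,527s run
reported_for_snowflake = 1791        # what Databricks reported for Snowflake

print(f"Snowflake 4XL power run: ${snowflake_cost:.2f}")
print(f"Databricks / Snowflake cost ratio: {databricks_on_demand_cost / snowflake_cost:.2f}")
print(f"Reported / recomputed Snowflake cost: {reported_for_snowflake / snowflake_cost:.1f}x")
```

The recomputed cost comes to about $267.38, which matches the founders' figure and sits within about 3 percent of the $275 they cite for Databricks.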
They say interested parties can run the Snowflake TPC-DS benchmark power run themselves; it takes only a few mouse clicks and about an hour of elapsed time. Snowflake itself “will not publish synthetic industry benchmarks as they typically do not translate to benefits for customers.”
Certainly not in this instance, as it would show that Databricks is slightly faster at roughly similar price/performance.