Unicorn cloud data analytics startup Dremio aims to supply analytic processes running directly on data lakes, and says its latest open source-based software release, adding speed and more powerful capabilities, is a step forward in obsoleting data warehouses and eliminating the data warehouse tax.
Its so-called Dart initiative makes Dremio’s in-memory software, powered by Apache Arrow, run faster and do more to save customers time and money. The pitch is that having SQL-based analytics routines run directly on data stored in Amazon S3 and Azure Data Lake means there is no need to pass through an Extract, Transform and Load (ETL) process to load a data warehouse before running analytics processes. Data warehouses, like Snowflake and Yellowbrick Data, have a great deal of functionality and built-in speed. Dremio has to provide both so customers see no need for ETL prepping of data warehouses as a necessary part of running their preferred analytics processes and getting fast query responses.
Tomer Shiran, founder and chief product officer at Dremio, said in a provided quote: “Enabling truly interactive query performance on cloud data lakes has been our mission from day one, but we’re always looking to push the boundaries and help our customers move faster … Not only are we dramatically increasing speed and creating efficiencies, we’re also reducing costs for companies by eliminating the data warehouse tax without trade-offs between cost and performance.”
Dremio says information held in S3 and Azure Data Lake can be stored and managed in open-source file and table formats such as Apache Parquet and Apache Iceberg, and accessed by decoupled and elastic compute engines such as Apache Spark (for batch processing), Dremio (SQL), and Apache Kafka (streaming).
Apache Iceberg provides data warehouse functionality such as transactional consistency, rollbacks, and time travel. It also enables multiple applications to work together on the same data in a transactionally consistent manner.
Dremio supports Project Nessie, which provides a Git-like experience for a data lake, and builds on table formats like Iceberg and Delta Lake to let users take advantage of branches to experiment or prepare data without impacting the live view of the data. Nessie enables a single transaction to span operations from multiple users and engines including Spark, Dremio, Kafka, and Hive. It makes it possible to query data from consistent points in time as well as across different points in time.
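To make the Git-like idea concrete, here is a minimal toy sketch in Python. It is not Nessie's API: `ToyCatalog`, `Commit`, and the file names are hypothetical, invented purely for illustration. It models a branch as a pointer to an immutable commit, so work on a branch leaves the live `main` view untouched until it is merged, and earlier commits remain readable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    """An immutable snapshot: table name -> tuple of data file names."""
    tables: dict
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Hypothetical stand-in for a versioned catalog; not Nessie's API."""

    def __init__(self):
        self.branches = {"main": Commit(tables={})}

    def create_branch(self, name, source="main"):
        # A branch is only a new pointer to an existing commit; no data is copied.
        self.branches[name] = self.branches[source]

    def commit(self, branch, table, files):
        # Each commit records the full table -> files mapping and links to its parent.
        head = self.branches[branch]
        tables = dict(head.tables)
        tables[table] = tuple(files)
        self.branches[branch] = Commit(tables=tables, parent=head)

    def read(self, branch, table):
        return self.branches[branch].tables.get(table, ())

    def merge(self, source, target):
        # Fast-forward only: allowed when target's head is an ancestor of source.
        node = self.branches[source]
        while node is not None:
            if node is self.branches[target]:
                self.branches[target] = self.branches[source]
                return
            node = node.parent
        raise ValueError("target has diverged; conflict handling is out of scope here")


catalog = ToyCatalog()
catalog.commit("main", "sales", ["sales-001.parquet"])
catalog.create_branch("etl")
catalog.commit("etl", "sales", ["sales-001.parquet", "sales-002.parquet"])

live_before = catalog.read("main", "sales")   # main is unaffected by the branch
catalog.merge("etl", "main")
live_after = catalog.read("main", "sales")    # main now sees the prepared data
earlier = catalog.branches["main"].parent.tables["sales"]  # read an older commit
```

The design point is that readers of `main` only ever see complete commits, which is the property that lets data be prepared on a branch without disturbing the live view.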
Thirumalesh Reddy, VP of Engineering and Security at Dremio, added his two cents: “There are two major dimensions you can optimise to maximise query performance: processing data faster, and processing less data.” Dremio’s latest software release does both. Its features are said to include:
- Better query planning: Dremio gathers deep statistics about the underlying data to help its query optimiser choose an optimal execution path for any given query.
- Query plan caching: Useful for when many users simultaneously fire similar queries against the SQL engine as they navigate through dashboards.
- Improved, higher-performance compiler that enables larger and more complex SQL statements with reduced resource requirements.
- Broader SQL coverage including additional window and aggregate functions, grouping sets, intersect, except/minus, and more.
- Faster Arrow-based query engine: Arrow component Gandiva is an LLVM-based toolkit that enables vectorized execution directly on in-memory Arrow buffers by generating code to evaluate SQL expressions that uses the pipelining and SIMD capabilities of modern CPUs. Gandiva has been extended to cover nearly all SQL functions, operators, and casts.
- Less data-read IO: Dremio reduces the amount of data read from cloud object storage through enhancements in scan filter pushdown (now supporting multi-column pushdown into source reads, the ability to push filters across joins, and more).
- Unlimited table sizes with an unlimited number of partitions and files, and near-instantaneous availability of new data and datasets as they persist on the lake.
- Automated management of transparent query acceleration data structures (known as Data Reflections).
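
The "processing less data" side of that list can be illustrated with a small sketch of filter pushdown. This is a simplified toy in Python, not Dremio's implementation: the `DataFile` class, the `scan` function, and the sample values are all hypothetical. It assumes each file carries min/max statistics for a column (Parquet stores statistics of this kind), so a scan can skip files whose value range cannot match the filter.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Hypothetical data file with min/max statistics on the 'year' column."""
    name: str
    min_year: int
    max_year: int
    rows: list  # list of (year, amount) tuples


def scan(files, year_from, year_to, pushdown=True):
    """Return (matching rows, number of files actually read)."""
    files_read = 0
    matched = []
    for f in files:
        # With pushdown, skip files whose year range cannot overlap the filter.
        if pushdown and (f.max_year < year_from or f.min_year > year_to):
            continue
        files_read += 1
        matched.extend(r for r in f.rows if year_from <= r[0] <= year_to)
    return matched, files_read


files = [
    DataFile("a.parquet", 2018, 2019, [(2018, 10), (2019, 20)]),
    DataFile("b.parquet", 2020, 2021, [(2020, 30), (2021, 40)]),
    DataFile("c.parquet", 2022, 2023, [(2022, 50), (2023, 60)]),
]

rows_pushdown, read_pushdown = scan(files, 2021, 2022)
rows_full, read_full = scan(files, 2021, 2022, pushdown=False)
```

Both scans return the same rows, but the pushdown version reads two files instead of three. On cloud object storage, where every file read is a network request, that difference is where the IO saving comes from.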
These features help Dremio’s software process less data and process it faster than before, it is said. Check out the Dremio Architecture Guide here.
Blocks & Files notes that Dremio says it wants to enable data democratisation without the vendor lock-in of cloud data warehouses. In other words, greenfield users who are not using a data warehouse, or are not locked in to one, can use Dremio’s software to get data warehouse functionality at lower cost. They can avoid what Dremio calls the data warehouse tax.
Whether Dremio can actually obsolete data warehouses is another matter, but it’s a nice and clean marketing pitch.