Databricks has updated its Lakehouse Platform, promising more cost-efficient performance, better pipeline building, virtual cleanrooms, and a marketplace for data and analytics assets.
Databricks has raised $3.6 billion in funding, with its most recent round bringing in $1.6 billion in September 2021. The now-public Snowflake raised only $1.4 billion in total. Databricks’s analytics software is said to work on raw data stored in a data lake, without that data first having to be extracted, transformed, and loaded (ETL) into a separate data warehouse, which is the Snowflake way.
Ali Ghodsi, co-founder and CEO of Databricks, issued a statement: “Our customers want to be able to do business intelligence, AI, and machine learning on one platform, where their data already resides … Today’s announcements are a significant step forward in advancing our Lakehouse vision, as we are making it faster and easier than ever to maximize the value of data, both within and across companies.”
The performance improvements consist of:
- Databricks SQL Serverless provides instant, secure, and fully managed elastic compute for improved performance at a lower cost.
- The Photon native vectorized query engine on Azure Databricks is written to be directly compatible with Apache Spark APIs, so it works with existing code.
- Open source connectors for Go, Node.js, and Python help operational applications access the Lakehouse.
- Databricks SQL CLI enables queries to be run directly from a local computer.
- Databricks SQL query federation adds the ability to query remote data sources, including PostgreSQL, MySQL, AWS Redshift, and others, without the need to ETL the data from the source systems.
Databricks does use ETL to process streaming and batch workloads for analytics, data science, and ML. Its ETL framework, Delta Live Tables, is used to turn SQL queries into production ETL pipelines. A Databricks blog claims this makes it possible to declaratively express entire data flows in SQL and Python. Delta Live Tables has been given a new performance optimization layer to speed up execution and reduce ETL costs.
A newly added Enhanced Autoscaling feature scales resources with the fluctuations of streaming workloads. Change Data Capture (CDC) for Slowly Changing Dimensions – Type 2 (see bootnote below) tracks every change in source data for both compliance and machine learning experimentation purposes.
MLflow Pipelines is a machine learning pipeline builder. It is built on MLflow, v2.0 of which enables users to define the elements of a pipeline in a configuration file, with MLflow Pipelines managing execution automatically. Databricks has added Serverless Model Endpoints to directly support production model hosting, and built-in Model Monitoring dashboards to help with the analysis of real-world model performance.
Databricks Cleanrooms provides a secure, hosted environment in which to share and join data across organizations. Customers can collaborate with their clients and partners on any cloud, running computations and workloads using both SQL and data science-based tools, including Python, R, and Scala, with data privacy controls.
A Databricks Marketplace provides an open environment within which to package and distribute data and analytics assets. Databricks says this will enable data providers to securely package and monetize a host of assets such as data tables, files, machine learning models, notebooks, and analytics dashboards. Data consumers will be able to subscribe to pre-existing dashboards that provide the desired analytics for a dataset.
Competitor Snowflake already has cleanrooms and a marketplace, and supports machine learning pipelines.
Databricks SQL Serverless is now available in preview on AWS. The Photon query engine is in public preview and will be generally available on Databricks Workspaces in the coming weeks. Databricks Cleanrooms will be available in the coming months as will the Databricks Marketplace.
A Slowly Changing Dimension (SCD) is a construct that stores and manages current and historical data over time in a data warehouse. A Type 1 SCD has new data overwriting the existing data. A Type 2 SCD writes new records for new data, thus retaining the full history of values. A Type 3 SCD stores only the previous and current values for an attribute. When a new value comes in, it becomes the current value, and the prior current value replaces the stored previous value.
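The three SCD types can be illustrated with a minimal Python sketch. It uses plain in-memory structures and does not use any Databricks or Delta Live Tables API. The names `Type2Row`, `apply_type1`, `apply_type2`, and `apply_type3`, along with the integer timestamps, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Type2Row:
    """One versioned record in a hypothetical Type 2 dimension table."""
    key: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None marks the current version


def apply_type1(table: dict, key: str, value: str) -> None:
    """Type 1: overwrite the existing value, keeping no history."""
    table[key] = value


def apply_type2(rows: list, key: str, value: str, ts: int) -> None:
    """Type 2: close the current record and append a new one."""
    for row in rows:
        if row.key == key and row.valid_to is None:
            if row.value == value:
                return  # nothing changed, so no new version is written
            row.valid_to = ts  # end-date the old version
    rows.append(Type2Row(key, value, ts))


def apply_type3(table: dict, key: str, value: str) -> None:
    """Type 3: keep only the current and the previous value."""
    old_current = table.get(key, {}).get("current")
    table[key] = {"current": value, "previous": old_current}


# A made-up change feed: (key, new value, timestamp)
changes = [("cust-1", "Berlin", 1), ("cust-1", "Paris", 2), ("cust-1", "Rome", 3)]

type1, type2, type3 = {}, [], {}
for key, value, ts in changes:
    apply_type1(type1, key, value)
    apply_type2(type2, key, value, ts)
    apply_type3(type3, key, value)

print(type1)       # only the latest value survives
print(len(type2))  # one row per version, so full history is kept
print(type3)       # current value plus the one before it
```

After the three changes, the Type 1 table holds only "Rome", the Type 2 table holds three rows with the first two end-dated, and the Type 3 table holds "Rome" as current and "Paris" as previous, with "Berlin" lost.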