Snowflake, Databricks set to collide over AI

AI everywhere

The two leading cloud data warehouse and lakehouse suppliers, Snowflake and Databricks, are racing to get the most comprehensive generative AI functions working on their customers’ data sets.

Both offer data platform solutions across the three main cloud providers: AWS, Microsoft Azure, and GCP. Both are well resourced: Databricks has raised $3.6 billion, and post-IPO Snowflake has a $2.5 billion annual revenue run rate. Snowflake made a series of AI-related announcements at its Snowflake Summit, the day after Databricks said it was buying large language model (LLM) startup MosaicML, whose technology is used in generative AI.

LLMs are used to make chatbots that can “understand” natural language queries and requests, analyze source data sets, and respond with query answers, software code, or generated images – like a superintelligent hotel concierge offering a white glove service to guests.

Snowflake was founded in 2012 by three data warehouse experts who saw limitations in existing on-premises data warehouse products and set out to build a proprietary cloud-native, cloud-resident, scalable and fully managed data warehouse. The technology uses a shared-nothing massively parallel processing (MPP) SQL-based query cluster that accesses shared-disk data storage.

It has added external table support to broaden its data sources outside its own data warehouse wall.

Databricks was started a year later by the original creators of the Apache Spark in-memory big data processing platform. It wanted to simplify large-scale data processing and help enterprises get insights from troves of structured, semi-structured and unstructured data. It has espoused the lakehouse idea, combining traditional data warehousing and data lake workloads in a single unified platform. This is offered via a platform-as-a-service (PaaS) model, with customers able to customize the compute side.

Databricks’ software works on raw data in the lakehouse, without that data first having to undergo an extract, transform and load (ETL) process, which is what Snowflake requires. Of course, Databricks’ processes have to select and subset the data first.

Snowflake has its own data storage format while Databricks has its open source Delta Lake storage facility.

As well as SQL queries, Databricks supports real-time stream processing, machine learning, and graph processing. Databricks has MLlib and TensorFlow as built-in libraries for machine learning and deep learning. It has also developed a facility for building and deploying LLMs and has its open source Dolly chatbot. It is now buying MosaicML for $1.3 billion to help customers build and deploy AI models on their own data.

Snowflake is running parallel to this and has just announced:

  • A partnership with Nvidia so that customers’ developers can use Nvidia’s NeMo framework to build, customize, and deploy generative AI models with billions of parameters using their organization’s data.
  • A private preview of Document AI, an LLM that uses Applica’s generative AI technology to let users analyze documents.
  • Extended Iceberg tables to push out the data set boundaries for Snowflake.
  • A private preview of Snowpark Container Services in which developers can run AI and ML models directly within Snowflake’s Data Cloud. Customers get access to third-party software including LLMs, ML APIs for model development, Notebooks, and MLOps tools.

Snowpark is Snowflake’s facility for enabling non-SQL access to its data.

The net effect of this generative AI race between Databricks and Snowflake is that every other data lake, lakehouse and data warehouse supplier may be under pressure to follow in their footsteps or risk being left behind in what has quickly become a table stakes game.

Lakehouse supplier Dremio has already added Text-to-SQL, Autonomous Semantic Layer, and Vector Lakehouse functionality to its product. Open source vector database supplier Zilliz may start to field partnership and potential acquisition queries from various analytics suppliers. That’s because its technology stores the vector embeddings data needed by LLMs. Transactional and real-time database supplier SingleStore has demoed ChatGPT working against data it stores. 
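
To make the vector embeddings point concrete, here is a minimal Python sketch of what a vector store does at its core: keep (text, embedding) pairs and return the stored items whose embeddings sit closest to a query embedding. It is not Zilliz's API or any vendor's implementation. The TinyVectorStore class, the three-number embeddings, and the sample document titles are all hypothetical, and a real system would get its embeddings from a model and use an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store; illustrative only."""

    def __init__(self):
        self._items = []  # list of (text, embedding) pairs

    def add(self, text, embedding):
        self._items.append((text, embedding))

    def search(self, query_embedding, top_k=1):
        # Brute-force scan: rank every stored item by similarity to the query.
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_embedding, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:top_k]]


# Made-up embeddings; a real pipeline would compute these with a model.
store = TinyVectorStore()
store.add("Q2 revenue by region", [0.9, 0.1, 0.0])
store.add("Employee onboarding guide", [0.0, 0.2, 0.9])
store.add("Q2 sales forecast", [0.8, 0.3, 0.1])

results = store.search([1.0, 0.2, 0.0], top_k=2)
print(results)
```

In an LLM application, the items returned by a search like this would typically be passed to the model as context for answering the user's question.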

We are witnessing a generative AI frenzy as vendors try to ensure they don’t get left behind – and both Databricks and Snowflake just ratcheted the knob to a higher setting.