Confluent Tableflow unlocks real-time AI with Delta Lake and Iceberg

Data streamer Confluent has updated its Tableflow software to give AI and analytics software access to real-time operational data in data warehouses and lakes.

The Tableflow updates build on the Confluent-Databricks partnership announced in February to add bi-directional integration between Tableflow, Delta Lake, and the Databricks Unity Catalog.

Specifically, support for Apache Iceberg is now generally available and an Early Access Program for Delta Lake support is open. Teams running production workloads “can now instantly represent Apache Kafka topics as Iceberg tables to feed any data warehouse, data lake, or analytics engine for real-time or batch processing use cases.”
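
In practice, a downstream engine sees an ordinary Iceberg table. As a rough sketch, assuming the table is reachable through an Iceberg REST catalog and that the open source PyIceberg client (with pyarrow/pandas) is installed; the endpoint, token, namespace, and table name below are placeholders, not Confluent-specific values:

```python
# A minimal sketch of reading a Tableflow-materialized Iceberg table from Python.
# The catalog URI, token, namespace, and table name below are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "tableflow",                         # local name for this catalog config
    **{
        "type": "rest",                  # assumes an Iceberg REST catalog endpoint
        "uri": "https://example.com/iceberg/catalog",
        "token": "REPLACE_WITH_TOKEN",
    },
)

# The table mirrors a Kafka topic; scan it into a pandas DataFrame for analysis.
orders = catalog.load_table("my_cluster.sales_orders")
df = orders.scan().to_pandas()
print(df.head())
```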

Tableflow also offers enhanced data storage flexibility and integrations with catalog providers, such as the AWS Glue Data Catalog and the Snowflake Open Catalog (a managed service for Apache Polaris). The Apache Iceberg and Delta Lake support enables real-time and batch processing and helps ensure real-time data consistency across applications.

Confluent’s Chief Product Officer, Shaun Clowes, stated: “With Tableflow, we’re bringing our expertise of connecting operational data to the analytical world. Now, data scientists and data engineers have access to a single, real-time source of truth across the enterprise, making it possible to build and scale the next generation of AI-driven applications.”

Confluent Tableflow screenshot graphic

Confluent claims AI projects are failing because old development methods cannot keep pace with new consumer expectations, citing the IDC FutureScape: Worldwide Digital Infrastructure 2025 Predictions report, which states: “By 2027, after suffering multiple AI project failures, 70 percent of IT teams will return to basics and focus on AI-ready data infrastructure platforms.”

The old development methods, or so says Confluent, are related to an IDC finding: “Many IT organizations rely on scores of data silos and a dozen or more different copies of data. These silos and redundant data stores can be a major impediment to effective AI model development.”

The lesson Confluent draws is that instead of many silos, you need a unified system that can know “the current status of a business and its customers and take action automatically.” Business operational data needs to reach the analytics and AI systems in real time. It says: “For example, an AI agent for inventory management should be able to identify if a particular item is trending, immediately notify manufacturers of the increased demand, and provide an accurate delivery estimate for customers.”

Tableflow, it declares, simplifies the integration between operational data and analytical systems, because it continuously updates tables used for analytics and AI with the exact same data from business applications connected to the Confluent Cloud. Confluent says this is important as AI’s power depends on the quality of the data feeding it.

The Delta Lake support is also AI-relevant, since the format is “used alongside many popular AI engines and tools.”

Of course, having both real-time and batch data available through Iceberg and Delta Lake tables in Databricks and other data warehouses and lakes is not enough for AI large language model processing; the data needs to be tokenized and vectorized first.

Confluent is potentially ready for this, with its Create Embeddings action, a no-code feature “to generate vector embeddings in real time, from any model, to any vector database, across any cloud platform.”
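
Confluent’s managed action handles this inside Confluent Cloud. Purely as an illustration of the pattern, the self-managed sketch below consumes records from a Kafka topic with the confluent-kafka client and turns them into vectors with an open source sentence-transformers model; the broker, topic, group ID, and model name are placeholder choices, not Confluent’s implementation.

```python
# Illustrative only: consume records from a topic and turn them into vectors.
# Broker, topic, group id, and model name are placeholders.
from confluent_kafka import Consumer
from sentence_transformers import SentenceTransformer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "embedding-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["product_reviews"])

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = []
for _ in range(100):                      # poll a small batch of messages
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    texts.append(msg.value().decode("utf-8"))
consumer.close()

# Each text becomes a fixed-length vector ready to load into a vector database.
vectors = model.encode(texts)             # shape: (num_messages, embedding_dim)
print(f"embedded {len(texts)} messages into vectors of shape {vectors.shape}")
```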

Users can bring their own storage to Tableflow, using any storage bucket. Tableflow’s Iceberg tables support access for analytical engines such as Amazon Athena, EMR, and Redshift, and other data lakes and warehouses, including Snowflake, Dremio, Imply, Onehouse, and Starburst.

Apply for the Tableflow Early Access Program here. Read a Tableflow GA product blog to find out more.

Bootnotes

The Confluent Cloud is a fully managed, cloud-native data streaming platform built around Apache Kafka. Confluent was founded by Kafka’s original creators, Jay Kreps, Jun Rao, and Neha Narkhede. They devised Kafka while working at LinkedIn, and it was open-sourced in early 2011.

Delta Lake is open source software built by Databricks that is layered above a data lake and enables batch and real-time streaming data processing. It adds a transaction log, enabling ACID (Atomicity, Consistency, Isolation, Durability) transaction support. Users can ingest real-time data, from Apache Kafka, for example, into a dataset, and run batch jobs on the same dataset without needing separate tools. Databricks coded Delta Lake on top of Apache Spark.
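
A rough sketch of that batch-plus-streaming pattern, assuming a Spark session with the delta-spark and spark-sql-kafka packages on the classpath; the broker address, topic, and paths are placeholders:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark and spark-sql-kafka-0-10 packages are on the classpath;
# the broker, topic, and paths below are placeholders.
spark = (
    SparkSession.builder.appName("kafka-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Stream events from a Kafka topic into a Delta table; each micro-batch is an ACID commit.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "sales_orders")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
)

stream = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/sales_orders")
    .start("/tmp/delta/sales_orders")
)
stream.awaitTermination(30)  # let a few micro-batches commit for the demo

# A batch job reads a consistent snapshot of the same table, no separate tool needed.
batch_df = spark.read.format("delta").load("/tmp/delta/sales_orders")
batch_df.groupBy("key").count().show()
```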

Apache Kafka is open source distributed event streaming software built for large-scale, real-time data feeds used in processing, messaging, and analytics. Data in Kafka is organized into topics. These are categories or channels, such as website clicks or sales orders. A source “producer” writes events to topics, and target “consumers” read from them.
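
A minimal producer/consumer round trip with the confluent-kafka Python client looks roughly like this; the broker address and topic name are placeholders:

```python
from confluent_kafka import Producer, Consumer

# Broker address and topic name are placeholders for illustration.
conf = {"bootstrap.servers": "localhost:9092"}

# A producer writes events to a topic.
producer = Producer(conf)
producer.produce("website_clicks", key="user-42", value='{"page": "/pricing"}')
producer.flush()

# A consumer in a consumer group reads events back from the same topic.
consumer = Consumer({**conf, "group.id": "clicks-reader", "auto.offset.reset": "earliest"})
consumer.subscribe(["website_clicks"])

msg = consumer.poll(timeout=5.0)
if msg is not None and msg.error() is None:
    print(msg.topic(), msg.key(), msg.value())
consumer.close()
```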

Apache Iceberg is an open source table format for large-scale datasets in data lakes, sitting above file formats such as Parquet, ORC, and Avro, and cloud object stores such as AWS S3, Azure Blob Storage, and Google Cloud Storage. It brings database-like features to data lakes, such as ACID support, partitioning, time travel, and schema evolution (a short example follows the list below). Iceberg organizes data into three layers:

  • Data Files: The actual data, stored in formats like Parquet or ORC
  • Metadata Files: Track which data files belong to a table, their schema, and partition info
  • Snapshot Metadata: Logs every change (commit) as a snapshot, enabling time travel and rollbacks
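
A minimal sketch of those mechanics in PySpark, assuming the iceberg-spark-runtime package is on the classpath; the catalog name, warehouse path, and table are placeholders:

```python
from pyspark.sql import SparkSession

# Local Hadoop-type Iceberg catalog; catalog name and warehouse path are placeholders.
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg_warehouse")
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, amount DOUBLE) USING iceberg")

# Each write is a commit that produces a new snapshot in the metadata layer.
spark.sql("INSERT INTO demo.db.orders VALUES (1, 9.99)")
spark.sql("INSERT INTO demo.db.orders VALUES (2, 24.50)")

# The snapshots metadata table lists every commit.
spark.sql("SELECT committed_at, snapshot_id, operation FROM demo.db.orders.snapshots").show()

# Time travel: read the table as of its first snapshot (only the first row is visible).
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM demo.db.orders.snapshots ORDER BY committed_at LIMIT 1"
).collect()[0][0]
spark.sql(f"SELECT * FROM demo.db.orders VERSION AS OF {first_snapshot}").show()
```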

Apache Spark is an open source, distributed computing framework intended for fast, large-scale data processing and analytics. It’s used for big data workloads involving batch processing, real-time streaming, machine learning, and SQL queries.

The Databricks Unity Catalog provides a centralized location to catalog, secure, and govern data assets like tables, files, machine learning models, and dashboards, across multiple workspaces, clouds, and regions in the Databricks Lakehouse environment. It acts as a single source of truth for metadata and permissions.
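
Tables live under a three-level namespace (catalog.schema.table), and permissions are granted against the same names. The sketch below is a minimal illustration run from a Databricks notebook where Unity Catalog is enabled; the catalog, schema, table, and group names are placeholders:

```python
# `spark` is the session provided in a Databricks notebook.
# "main", "sales", "orders", and "analysts" are placeholder names.
spark.sql("CREATE CATALOG IF NOT EXISTS main")
spark.sql("CREATE SCHEMA IF NOT EXISTS main.sales")
spark.sql("CREATE TABLE IF NOT EXISTS main.sales.orders (id BIGINT, amount DOUBLE)")

# Governance is expressed as grants against the same three-level names.
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Any workspace attached to the same metastore addresses the table by its full name.
spark.sql("SELECT count(*) FROM main.sales.orders").show()
```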