WANdisco sends Hadoop data and metadata to Databricks cloud data lakes

Replicator WANdisco has announced that its LiveData Migrator can now automate the migration of Apache Hive metadata directly into Databricks data lakes, helping to move Hadoop and Spark data while saving time and much manual effort.

LiveData Migrator is replication software that can migrate large, constantly changing datasets, such as Hadoop ones, from on-premises sites to public clouds and between clouds. Apache Hive is an open-source data warehouse built on top of Hadoop that uses SQL. It supports analysis of large datasets stored in Hadoop’s HDFS, Amazon S3, and Alluxio.

Databricks enables SQL querying and analysis of data lakes without having to first extract, transform, and load data into separate data warehouses.

WANdisco CTO Paul Scott-Murphy said: “Data and metadata are migrated automatically without any disruption or change to existing systems. Teams can implement their cloud modernisation strategies without risk, immediately employing workloads and data that were locked up on-premises, now in the cloud using the Lakehouse platform offered by Databricks.”

Source datasets do not need to be migrated in full before they are converted into the Databricks Delta format, as LiveData Migrator automates incremental transformation to Delta Lake. This eliminates the need for manual data mappings, as there is now direct, native access to structured data in Databricks from on-premises environments.
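To make the incremental idea concrete, here is a toy sketch in plain Python — emphatically not WANdisco's implementation, and the function and variable names are invented for illustration. Instead of copying and converting a dataset in full, the migrator tracks what has already been converted and, on each pass, processes only files that are new or have changed at the source.

```python
def incremental_migrate(source_files, migrated):
    """Toy model of incremental migration (hypothetical, for illustration).

    source_files: dict mapping file path -> current version at the source.
    migrated:     dict of the same shape recording what has been converted
                  so far; updated in place.
    Returns the list of paths that needed (re)conversion on this pass.
    """
    pending = [
        path for path, version in source_files.items()
        if migrated.get(path) != version
    ]
    for path in pending:
        # In the real product this step would transform the data into
        # Delta Lake format; here we only record that it was handled.
        migrated[path] = source_files[path]
    return pending


state = {}
# First pass converts everything.
incremental_migrate({"a.parquet": 1, "b.parquet": 1}, state)
# A later pass touches only the file that changed at the source.
changed = incremental_migrate({"a.parquet": 2, "b.parquet": 1}, state)
```

The point of the pattern is that ongoing source changes are picked up continuously rather than requiring a second full copy, which is what removes the need for a fully staged migration.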

Users can skip migration tasks such as constructing data pipelines to transform, filter, and adjust data, along with any up-front planning and staging. There is no need to set up auto-load pipelines that identify newly landed data and convert it to its final form.

Users choose to convert content to the Delta Lake format when they create the Databricks metadata target. The data to migrate is then set by defining a migration rule and selecting the Hive databases and tables that require migration.

WANdisco says a single management facility can handle both Hadoop data and Hive metadata migrations. Ongoing changes to source metadata are reflected immediately in Databricks’ Lakehouse platform, and on-premises data formats used in Hadoop and Hive are automatically made available in Delta Lake on Databricks. The whole process should be much easier than before, and less prone to error.