Data lake protection, resilience, tiering, and DR

If you have a data lake, how can you protect it against data loss or ransomware? And who should protect what: the data lake provider or you, the data lake itself or the underlying storage?

Users running applications in VMware or another hypervisor system want their virtual machines (VMs) protected, and backup software does that, protecting and restoring at the VM level. Databases are likewise protected at the database level; Oracle even provides its RMAN tool for the job. Should the data lake provider also offer data protection facilities? And how should functions like tiering, disaster recovery, and encryption be parceled out?

We asked Mr Backup, aka W Curtis Preston, about this. He said: “In my opinion, backups are never the responsibility of the provider of anything. Meaning RMAN doesn’t back up Oracle for you, you have to use RMAN.”

RMAN is just a tool. “Databricks has a similar tool. There is both a CLI and repo functionality you can use to back up the instance. The storage side of things would be backed up w/whatever you are using (Azure Blob, etc).”

That implies a distinction between the storage side of things (the data) and the metadata, with the data lake customer protecting the storage and the data lake supplier providing tools to protect the metadata. This turns out to be an important distinction.

Chris Evans

Storage consultant Chris Evans has a similar view of how data protection responsibilities are split between the data lake provider and the underlying storage.

He told us: “The ‘big boys’ like Databricks and of course Snowflake are 100 percent cloud-based, at least, I’m sure Snowflake is, Databricks looks to be [mostly] the same.

“The data stored in these systems will go through an ETL ingestion type process, where structured data will be converted to a format like XML so it can be stored long-term as documents on an object store.  Effectively, all the data records are exported to a flat-file format because there’s no need to retain the locking, updating and general relational capabilities of a database.

“The warehouse/lakehouse is read-only so no need to retain structural integrity. Unstructured data can be stored in an object store unfettered, but with all types of data, a degree of indexing and metadata will be created that’s used to track the content.

“If 99 percent of the source data is kept in an object store, then the storage platform can do all the work of protection, replication, locking etc. Snowflake/Databricks or those systems in the cloud just use the scalability of the cloud system, while storing their metadata in massively scalable relational databases also on the cloud. Ultimately, what you’re asking is relatively trivial for Databricks and the heavy storage lifting is done by the cloud provider.

“I imagine that Databricks and Snowflake have some basic rules about replicated object storage buckets and use either automated tiering or their own algorithms to optimise data placement. DR isn’t really necessary, [nor] are backups of the object storage data. As most of the content will be read-only, it would be possible to put long-term object locking in place until the data isn’t needed any longer.”
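Evans points to object locking as a substitute for backing up read-only lake data. As a hedged sketch of what that looks like on AWS S3 (the bucket name and retention period are hypothetical, and the bucket must have been created with Object Lock enabled), a default compliance-mode retention rule could be set with boto3 like this:

```python
# Minimal sketch: apply a default Object Lock retention rule to a bucket
# holding exported lake data. Bucket name and retention period are
# hypothetical; Object Lock must have been enabled when the bucket was created.
import boto3

s3 = boto3.client("s3")

s3.put_object_lock_configuration(
    Bucket="example-lakehouse-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            # COMPLIANCE mode means nobody, including the account root user,
            # can shorten or remove the retention period.
            "DefaultRetention": {"Mode": "COMPLIANCE", "Years": 1}
        },
    },
)
```

With such a rule in place, every new object version is write-once for the retention period, which is what blunts ransomware that tries to encrypt or delete the underlying files.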

He added: “One other aspect to consider is how Snowflake and others can integrate with on-premises data. In this instance, I expect the companies have built data collectors/scanners that simply go and build metadata from the content. When the data is accessed, there’s a ‘translation layer’ a bit like OLE from the Microsoft days that makes the view of the data in the Databricks platform accessible to the Databricks software as if it was in an internal format.”

Databricks

We asked Databricks VP of Field Engineering Toby Balfre some questions to find out more about general lakehouse data protection issues. The metadata (control plane) and data (data plane) distinction appeared in his answers.

Blocks & Files: How is the data in a data lake protected against accidental deletion?

Toby Balfre: Databricks uses Delta Lake, which is an open format storage layer that delivers reliability, security and performance for data lakes.  Protection against accidental deletion is achieved through both granular access controls and point-in-time recovery capabilities.
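As a hedged illustration of the access control half of that (the table and group names are hypothetical, and `spark` is the SparkSession a Databricks notebook provides), granting a group read-only access in SQL might look like this:

```python
# Minimal sketch of granular access control via a SQL grant.
# Table and group names are hypothetical; `spark` is the notebook's
# SparkSession. Granting only SELECT, and not MODIFY, lets the group
# query the table but not delete, update or overwrite its rows.
# Depending on the metastore, catalog/schema-level privileges may also be needed.
spark.sql("GRANT SELECT ON TABLE sales.orders TO `data_analysts`")
```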

Delta Lake has full support for time-travel, including point-in-time recovery. The granularity is at the level of every single operation. As customers write into a Delta table or directory, every operation is automatically versioned. Customers can use the version number to travel back in time as well.
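As a rough sketch of that time travel (the table path, version number and timestamp are hypothetical; `spark` is the notebook’s SparkSession):

```python
# Minimal sketch of Delta Lake time travel reads. The path, version number
# and timestamp are hypothetical; `spark` is the notebook's SparkSession.
path = "s3://example-bucket/delta/orders"

# Current state of the table.
current_df = spark.read.format("delta").load(path)

# The table as it existed at an earlier version number...
v5_df = spark.read.format("delta").option("versionAsOf", 5).load(path)

# ...or as it existed at a point in time within the table's retained history.
pit_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-01-01 00:00:00")
    .load(path)
)
```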

Blocks & Files: How are disaster recovery facilities provided?

Toby Balfre: The Databricks control plane is always on hot standby in DR environments at no additional charge. Databricks supports Active-Active (hot), Active-Passive (warm), and cold DR scenarios. The data is always in the customer’s data plane. The customer may choose to replicate their data and databases to new regions using Databricks’ Delta streaming capability for database replication. Another alternative is to clone the data to a secondary site. Delta clones simplify data replication, enabling customers to develop an effective recovery strategy for their data. Using Delta clones allows users to quickly and easily incrementally synchronize data in the correct order between the primary and secondary sites or regions. Delta uses its transaction log to perform this synchronization.
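As a hedged sketch of the clone option (table names are hypothetical; `spark` is the notebook’s SparkSession), a deep clone into a secondary region’s metastore is a single SQL statement, and re-running it incrementally synchronizes the copy from the source table’s transaction log:

```python
# Minimal sketch of a Delta deep clone to a secondary site. Table names are
# hypothetical; `spark` is the notebook's SparkSession. A deep clone copies
# both metadata and data files; re-running the statement copies only the
# changes made since the previous run.
spark.sql("""
    CREATE OR REPLACE TABLE dr_sales.orders_clone
    DEEP CLONE prod_sales.orders
""")
```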

Databricks customers often implement DR strategies to achieve an RPO of 0 and an RTO of under a minute. Recovery from user errors with time travel and restore is foolproof, built-in, and easy. Each change in Delta Lake is automatically versioned and customers can access historical versions of that data.

With the restore command, the full table can be restored to its original correct state. It is only a metadata operation, making it instantaneous and incurring no extra cost, as the data is not duplicated.
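A hedged sketch of that restore (the table name and version number are hypothetical; `spark` is the notebook’s SparkSession, and the `DeltaTable` API comes from the delta-spark package bundled with Databricks runtimes):

```python
# Minimal sketch of restoring a Delta table to an earlier version.
# Table name and version number are hypothetical. RESTORE only adds a new
# entry to the transaction log, so no data files are copied.
spark.sql("RESTORE TABLE sales.orders TO VERSION AS OF 5")

# Equivalent Python API from the delta-spark package:
from delta.tables import DeltaTable

DeltaTable.forName(spark, "sales.orders").restoreToVersion(5)
```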

Blocks & Files: What resilience features are provided by Databricks?

Toby Balfre: Databricks is architected to provide seamless failover, without the need to restart the database or the applications running on the database. Databricks’ compute cluster designs are inherently resilient; Databricks provides a distributed computing environment that stripes (distributes) workloads across many nodes, and if a node disappears it is replaced. If a job crashes, it restarts, restores state on the new node, and is automatically retried from the last checkpoint it recorded during its processing.

In addition, SQL endpoints come with a multi-cluster load balancer where multiple clusters read and write on the same database. If one cluster fails, queries are routed to other clusters automatically without any interruption.

To achieve full hot standby, clusters can be pre-launched and ready to go in the event of a region failure.

Blocks & Files: Is older, less-accessed data moved to cheaper storage?

Toby Balfre: Since Databricks builds on the file storage of the cloud providers, it taps into the power of the various storage classes associated with file stores like AWS S3 or Azure ADLS. Users can designate a certain storage class for infrequently accessed objects or for archiving purposes (cold storage). Intelligent tiering can be used to automate the movement between hot and cold storage.

Users define the data archiving policy directly through the cloud storage APIs. Files are then moved by the CSP storage service (S3, ADLS, GCS) from warm to colder tiers. 
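As a hedged sketch of such a policy on AWS S3 (bucket name, prefix and day counts are hypothetical; note that deep archive classes like Glacier require a restore before the files can be read again):

```python
# Minimal sketch of an archiving policy defined through the cloud storage API.
# Bucket name, prefix and day counts are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-lakehouse-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-colder-delta-files",
                "Filter": {"Prefix": "delta/archive/"},
                "Status": "Enabled",
                "Transitions": [
                    # Let S3 move objects between access tiers automatically.
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    # Push rarely touched data to an archive class later;
                    # these objects need a restore before they can be queried.
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```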

Databricks allows users to specify the path where data is stored, and data is automatically tiered by the underlying storage system. Each table is decomposed into multiple partitions; each partition into multiple files; and each file into multiple splits, each typically 128MB in size. Customers can, for example, enable intelligent tiering in AWS S3 and still query their Delta lake spanning warm and cold tiers seamlessly.

Databricks automatically manages data movement between the object storage layer (frequent or infrequent access tiers, e.g. S3, ADLS, GCS), the local SSDs (warm), and the memory of the cluster (hottest), based on the workload and the query patterns.

Blocks & Files: How is data in the data lake protected against ransomware?

Toby Balfre: Databricks provides enterprise-grade security and compliance capabilities, as well as recovery to a previous point in time (covered earlier), which can be used to protect against ransomware.

Blocks & Files: And compliance?

Toby Balfre: Databricks can be used to process data subject to various regulatory compliance regimes such as GDPR, HIPAA, HITRUST (Azure), PCI, SOC, ISO, FedRAMP (Moderate), and FedRAMP High (Azure only).

In the Databricks architecture, the control plane is managed by Databricks as a service and has all the necessary controls required for regulatory compliance. This is Databricks’ responsibility.

For the data plane, Databricks provides an account level configuration that customers can use to ensure all workloads in Databricks are deployed with controls required for regulatory compliance. This is a shared responsibility. 

Blocks & Files: Can you describe the encryption facilities?

Toby Balfre: Databricks supports encryption of both data at rest and in transit. Encryption in transit is implemented through TLS 1.2 and through host-level encryption of all network traffic where available (e.g. by using Nitro instances on AWS). Data at rest can be protected with either server-side encryption (cloud provider integration) or application-side encryption (encrypted before writing to cloud data stores). Databricks also encrypts temporary data stored on local disks during query processing.

Data can be encrypted using Databricks-managed or customer-managed keys. Individual fields/columns can also be encrypted using granular keys with native AES functions.
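As a hedged sketch of that column-level encryption (the table, column and key are hypothetical; in practice the key would come from a secrets store rather than a literal; `spark` is the notebook’s SparkSession):

```python
# Minimal sketch of field-level encryption with the built-in aes_encrypt /
# aes_decrypt SQL functions. Table, column and key are hypothetical; a real
# key would be fetched from a secrets store, not hard-coded.
from pyspark.sql import functions as F

key = "0123456789abcdef"  # 16-byte AES key, for illustration only

df = spark.table("sales.customers")

# Encrypt the sensitive column and drop the plaintext.
encrypted = df.withColumn(
    "email_enc", F.expr(f"aes_encrypt(email, '{key}', 'GCM')")
).drop("email")

# Authorised readers holding the key can decrypt it back to a string.
decrypted = encrypted.withColumn(
    "email", F.expr(f"cast(aes_decrypt(email_enc, '{key}', 'GCM') as string)")
)
```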

Databricks does not charge extra to enable these features.

Blocks & Files: How is high availability provided?

Toby Balfre: Databricks offers HA through a distributed lakehouse architecture with no single point of failure and at no additional cost. High availability does not require significant explicit preparation from the Databricks customer.

To provide HA, Databricks leverages the HA capabilities of CSP services such as Amazon S3, ADLS, Amazon EC2, and Azure VMs. Through its decoupled storage and compute architecture, data is stored in object stores that are automatically replicated across availability zones. Compute can be launched across different clusters in different data centers (regions), and all have access to the same underlying data. Databricks’ cluster manager transparently relaunches any worker instance that terminates or is revoked. By storing data in an object storage layer such as Amazon S3, ADLS, or GCS, Delta lake provides out-of-the-box HA with durability of 99.999999999 percent (11 nines) and availability of more than 99.99 percent.

There is no single point of failure in a Databricks deployment, so HA is delivered through a distributed system that is resilient at every point. HA can be achieved within a single region, but customers may want to set up HA across multiple regions and multiple clouds. To support that, we are in 33 Azure regions worldwide, 15 AWS regions, and 7 GCP regions.

Blocks & Files: What replication facilities exist in Databricks?

Toby Balfre: For data replication across regions within a cloud or multi-clouds, customers use Delta clone. 

Using Delta clones allows customers to quickly and easily incrementally synchronize data in the correct order between their primary and secondary sites or regions. Delta uses its transaction log to perform this synchronization, analogous to how RDBMS replication relies on its logs to restore or recover the database to a stable version. In addition, data can be replicated using a streaming source from a Delta table to tail the transaction log. This is near instantaneous [and] provides a transactionally complete replica.
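A hedged sketch of the streaming route (paths are hypothetical; `spark` is the notebook’s SparkSession): a streaming read of the primary Delta table tails its transaction log, a streaming write applies those changes to the secondary copy, and progress is tracked in a checkpoint so the query resumes where it left off after any failure.

```python
# Minimal sketch of continuous replication by tailing a Delta table's
# transaction log with Structured Streaming. Paths are hypothetical, and an
# append-only source table is assumed; updates and deletes on the source
# need additional stream options.
replication = (
    spark.readStream.format("delta")
    .load("s3://primary-bucket/delta/orders")            # primary table
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://dr-bucket/_checkpoints/orders")
    .start("s3://dr-bucket/delta/orders")                 # secondary replica
)
```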

Comment

It occurs to us that there is a data protection marketing opportunity here. A SaaS backup supplier such as Commvault, Clumio, Druva, HYCU or Veeam could get its product technology integrated with Databricks, Dremio, SingleStore, Snowflake or similar data warehouse/lake suppliers’ control and data planes, and offer an all-in-one data protection service for them: a Metallic Databricks service or a HYCU Snowflake offering, for example.