We had the opportunity to run an email interview with Mark Greenlaw, VP of market strategy at Cirrus Data, and asked him questions to find out more about where block-based migration fits in the market.
Blocks & Files: How would you advise customers to decide between block-based and file-based data migration? What are the considerations when choosing which to use?
Mark Greenlaw: Good question! We often get questions about the differences between block and file data and how to think about it. Ultimately, the migration methodology is determined by the type of data you’re migrating. Typically, you use block migration tools for block data and file migration tools for file data. That said, there are exceptions to this rule depending on the desired outcome.
One example is when you want to migrate your data from an Oracle environment to a Microsoft SQL environment. In this situation, you need to not only migrate the data but also transform it. You would use a file migration solution to extract, transform, and load the block data, getting a representation of the data in a file format that can now be read into a SQL environment.
There are also situations in file migrations where block migration solutions work better.
This scenario happens when you want to move a large amount of file data from cloud vendor A to cloud vendor B and you plan to keep it all on the same file server. The right block migration solution can make a bitmap copy of the drive, which allows you to copy only the changed data, working not file by file but across the whole volume. A block migration in this situation performs the file migration more quickly and without disruption.
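The bitmap idea can be sketched roughly as follows. This is a minimal illustration, not Cirrus Data's implementation: `TrackedVolume`, `sync_changes`, and the tiny `BLOCK_SIZE` are hypothetical names and values chosen for the example. The volume marks each written block in a bitmap, and a sync pass copies only the marked blocks.

```python
BLOCK_SIZE = 4  # bytes per block; tiny for illustration, real volumes use far larger blocks


class TrackedVolume:
    """A toy volume that records which blocks were written since the last sync."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        n_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.dirty = [False] * n_blocks  # the change bitmap, one flag per block

    def write(self, offset: int, payload: bytes) -> None:
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write falls outside the volume")
        self.data[offset:offset + len(payload)] = payload
        if payload:
            first = offset // BLOCK_SIZE
            last = (offset + len(payload) - 1) // BLOCK_SIZE
            for i in range(first, last + 1):
                self.dirty[i] = True  # mark every block the write touched


def sync_changes(source: TrackedVolume, dest: bytearray) -> int:
    """Copy only the dirty blocks to dest and clear the bitmap; return blocks copied."""
    copied = 0
    for i, is_dirty in enumerate(source.dirty):
        if is_dirty:
            start = i * BLOCK_SIZE
            dest[start:start + BLOCK_SIZE] = source.data[start:start + BLOCK_SIZE]
            source.dirty[i] = False
            copied += 1
    return copied


source = TrackedVolume(b"AAAABBBBCCCCDDDD")
replica = bytearray(source.data)  # initial full copy
source.write(5, b"xy")            # an application write lands in block 1
blocks_copied = sync_changes(source, replica)
```

After the initial full copy, each sync pass moves only the blocks touched since the previous pass, regardless of how many files those blocks belong to.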
Here are the important considerations for choosing the right migration solution. If you’re using a block tool and haven’t synchronized your applications, you can end up with a less-than-perfect copy. It is important when moving data from source to destination that the migration solution has automation that enables you to track changes and then synchronize the source and destination to make sure they are exact replicas.
In a file migration, you must keep file attributes like modification time, creation date, etc. If you don’t, you are likely to end up with file administrative nightmares. Imagine you had a shared drive with a rule that said, “Archive any files that are in the drive for more than two years.” If I’m not careful to keep the creation date and I change my modification time, I could end up with too much data in the shared drive. Then I must expand it, which can be expensive and lead to operational interrupts that are difficult to manage.
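A small sketch of why attributes matter, using Python's standard library. The archive rule `is_archivable` and the file names are hypothetical, and modification time stands in for the attributes mentioned above (creation time is not preserved by these calls on every platform). `shutil.copy` copies content and permission bits, while `shutil.copy2` also attempts to preserve timestamps.

```python
import os
import shutil
import tempfile
import time

TWO_YEARS = 2 * 365 * 24 * 3600  # seconds; ignores leap days


def is_archivable(path: str, now: float) -> bool:
    """Hypothetical archive rule: last modified more than two years ago."""
    return now - os.stat(path).st_mtime > TWO_YEARS


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "report.txt")
    with open(src, "w") as f:
        f.write("old data")

    now = time.time()
    three_years_ago = now - 3 * 365 * 24 * 3600
    os.utime(src, (three_years_ago, three_years_ago))  # set (atime, mtime)

    # Copy without timestamps, then with timestamps.
    plain = shutil.copy(src, os.path.join(tmp, "plain.txt"))
    kept = shutil.copy2(src, os.path.join(tmp, "kept.txt"))

    plain_archivable = is_archivable(plain, now)  # copy looks brand new
    kept_archivable = is_archivable(kept, now)    # copy keeps its age
```

The copy made without timestamps looks new to the archive rule, so it stays on the shared drive, which is the situation described above.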
Blocks & Files: What are the considerations involved in object data migration?
Mark Greenlaw: Objects, in general, are there because they’re hashed data. I create an address for the data that I’m trying to access, and I have a hash that says, “I know the address of this data,” and then I can store it in object. That process gives me object attributes instead of file attributes. As long as I have the hash, I can move where the data is and not worry about updating it.
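The idea of a hash acting as a stable address can be sketched with a toy content-addressed store. This is an illustration only: `ObjectStore`, `put`, `get`, and `migrate` are hypothetical names, and real services such as S3 address objects by a user-chosen key rather than by a content hash.

```python
import hashlib


class ObjectStore:
    """A toy store that addresses each object by the SHA-256 hash of its content."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Verify on read: a mismatch means the object or its hash was corrupted.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored object does not match its hash")
        return data


def migrate(key: str, src: ObjectStore, dst: ObjectStore) -> str:
    """Copy one object between stores; the returned key equals the original."""
    return dst.put(src.get(key))


cloud = ObjectStore()
on_prem = ObjectStore()
key = cloud.put(b"archived invoice 2019")
moved_key = migrate(key, cloud, on_prem)
```

Because the address is derived from the content, the same key resolves the object in either location, so anything that refers to the object does not need updating after the move.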
Let’s say I’m taking a portion of the object data. It’s possible to take archived data out to an object pool, such as an S3 bucket. Now I can archive to that object pool—but I need to ensure when somebody accesses that file, they can read it. The challenge is I have to encapsulate essentially a block of data or a group of data bytes that I want to archive and create that hash. I can move the object data to an S3 bucket, but if it moves back to an on-premises bucket, I need to copy that hash with all the data there. If your hash gets corrupted, you lose all the attributes of your data.
The big cloud vendors offer these object-based repositories that are not very performant, but they are scalable. The advantage of such an object store is that it’s essentially limitless. As long as I have a hash key, I can continue writing—enabling me to have a moderately slow but scalable archive. There I can archive chunks of data, which allows me to keep the file attributes onsite or keep the database instance onsite while a portion of my data is now on a lower cost, lower performance environment. This is fine if I’m not accessing the data very frequently, and I want to keep it out of my core data. If you have a high level of recall, object buckets tend not to be very performant, which slows things down.
Blocks & Files: Can Cirrus Data help with data repatriation from the cloud? Is this file or block-based and how do egress fees affect things?
Mark Greenlaw: Cirrus Data’s data migration solutions are source and destination agnostic. We regularly help organizations migrate data from their on-premises environment to the cloud. There are also situations when we need to migrate data from the cloud back to an on-premises environment.
Why would someone do that? Essentially, buyer’s remorse. Maybe you moved your data to the cloud, but it was more expensive than you expected. It could also be that it didn’t deliver the performance you wanted, or perhaps you changed your business priorities. Now you’ve decided that your data needs to be back on-premises.
There are a few considerations in bringing your data back from the cloud. There are lots of free vendor tools that get you up to the cloud, but cloud vendors don’t have any incentive to provide you assistance in leaving. They also charge you an egress fee. It is essentially The Hotel California: You can check in anytime you like, but you can never leave. Egress fees are usually a month’s rent. Your fee is typically determined by the number of terabytes per month, and you receive another bill for egress charges per terabyte.
When Cirrus is repatriating data for customers, our solution does a data reduction before it hits the meter. If you’re moving a terabyte and it’s only half full, a typical migration tool will copy the full terabyte, zeros and all. We run a process called de-zeroing, which removes all the empty space and compresses the rest. This can be a significant saving if you are, for example, only 50% full. We use industry-standard tools to examine the data in its environment and often get an overall reduction of 8 to 1, which means your egress fee is one-eighth of what it would have been without a solution like Cirrus Data.
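The general shape of de-zeroing plus compression can be sketched as below. This is an assumption-laden illustration, not Cirrus Data's actual algorithm: the block size, the use of `zlib`, and the names `reduce_volume` and `restore_volume` are all choices made for the example.

```python
import zlib

BLOCK = 4096  # bytes per block (an assumed size)


def reduce_volume(volume: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, compressed_block) pairs, skipping all-zero blocks."""
    zero_block = bytes(BLOCK)
    payload = []
    for offset in range(0, len(volume), BLOCK):
        block = volume[offset:offset + BLOCK]
        if block == zero_block[:len(block)]:
            continue  # de-zeroing: empty space is never sent
        payload.append((offset, zlib.compress(block)))
    return payload


def restore_volume(size: int, payload: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the volume at the destination; unsent blocks stay zero."""
    volume = bytearray(size)
    for offset, compressed in payload:
        block = zlib.decompress(compressed)
        volume[offset:offset + len(block)] = block
    return bytes(volume)


# A volume that is half data, half empty space.
data = b"customer-record;" * 256 * 8
volume = data + bytes(len(data))

payload = reduce_volume(volume)
bytes_sent = sum(len(compressed) for _, compressed in payload)
restored = restore_volume(len(volume), payload)
```

Since egress is billed per unit of data transferred, the fee scales with `bytes_sent` rather than `len(volume)`. The ratio achieved on real data depends on how full the volume is and how compressible its contents are.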
Blocks & Files: Can Cirrus move block data between public clouds? Has this ever happened?
Mark Greenlaw: Yes, we have absolutely moved block data between public clouds. As hybrid IT continues its hunger for optimization, we expect such cloud-to-cloud migrations to increase. Here’s how this usually happens:
A customer has an environment with some data on-premises, different data on AWS, and then another group of data with Azure. Imagine AWS announcing, “Storage now costs 25 percent less.” The customer will obviously want to move their data from Azure to AWS to take advantage of that lower-cost environment.
We are completely source and destination agnostic. In this scenario, because Cirrus Data is cloud agnostic, we have no problems helping the customer take advantage of the best storage environment for their data. The customer could run into the same issue with the egress fees we mentioned earlier. To minimize those egress charges, we would run our data reduction, de-zeroing, and compression software to make sure the volume of data being moved is only what is required. The customer then can take advantage of lower storage costs with the right SLA. Cirrus Data is renowned for the speed of its data migrations, so all this can happen at lightning speed.
Speed is the other aspect of data reduction that’s critically important. When I’m moving on-premises data to the cloud, or from the cloud back to an on-premises environment, that data reduction also results in work getting done significantly faster. One independent study reported a four to sixfold improvement in time over using a tool that doesn’t have that data reduction.
Blocks & Files: How can deduplication and/or compression work with block-based migration?
Mark Greenlaw: Block storage is usually mission-critical applications and databases. Those who manage databases and applications are typically not IT infrastructure people. More likely, they are focused on the applications and making sure that those applications stay up 24/7. Applications and databases generally require a full backup and are managed carefully, so that if something goes bump in the night they can recover quickly.
So if you’re an administrator of applications and databases, you want to ensure you can take on an influx of new data at any given time. Let’s say you’re running a website, and somebody puts a promotion out there. Suddenly, you get a flurry of new activity and, with it, lots of new information being written into your database. If you were running at 80% or 90% full, you could get dangerously close to a full database. A full database means it is unable to take any new information and your application fails. It is not an acceptable outcome.
It’s important to consider this scenario because it leads database administrators (DBAs) and application managers to say, “I’m not going to let my database be more than 40 to 60% full. When I reach that threshold, I’ll expand my storage bucket, so I don’t ever run out of space.” The problem is if your maximum threshold is only 60%, you are not very efficient. And keep in mind that application owners are not evaluated on IT infrastructure costs—they’re evaluated on application uptime because the cost of the application or database being down is catastrophic.
The flip side to this reality is that when you are migrating those applications or databases, you need to get rid of the empty space. We quickly remove all those zeros and move just the necessary data. It is a great result for the DBAs and application managers who want to reduce the time it takes to get to the next level. It takes less time to move one-eighth of the data than it takes to move the entire volume. IT is happy because we were able to optimize the costs and the application teams are happy because the block data moved quickly and without disruption.
Blocks & Files: Are there any benchmark tests or other performance measures customers could use to decide which data mover to select: Cirrus Data, DataDobi, WANdisco, etc.? E.g., $/TB moved, total migration time per 10PB?
Mark Greenlaw: We benchmark ourselves against cloud vendors’ free data migration tools like AWS CloudEndure and Microsoft Azure Migrate. The reason for that is our customers are tempted by free migration tools. Testing ourselves with encryption and data reduction in flight, we are about 4 to 6 times faster than those free tools.
More importantly, we can keep these speed advantages at scale for customers moving big data volumes. If you’re moving 100TB, free vendor tools could take nine months or longer, which is an excessively long period to manage risk and manage disruption. Cirrus Data has successfully moved 100TB in one month. This is a huge business advantage for the organization trying to bring its new storage environment online. Reduced risk, minimal downtime during cutover, and an accelerated digital transformation—in short, a game changer for large organizations.
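To put those durations in network terms, the sustained rate each one implies can be worked out with simple arithmetic. The sketch below assumes decimal terabytes, 30-day months, and a hypothetical helper name `required_mbit_per_s`. It ignores protocol overhead and any data reduction, so it describes the effective rate at which the source data is consumed, not the bytes on the wire.

```python
def required_mbit_per_s(terabytes: float, days: float) -> float:
    """Sustained rate in Mbit/s needed to move `terabytes` (decimal TB) in `days`."""
    bits = terabytes * 1e12 * 8
    seconds = days * 86400
    return bits / seconds / 1e6


one_month = required_mbit_per_s(100, 30)       # about 309 Mbit/s
nine_months = required_mbit_per_s(100, 270)    # about 34 Mbit/s
```

Moving 100TB in a month means consuming source data at roughly 309 Mbit/s around the clock, while the nine-month case corresponds to about 34 Mbit/s.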
Blocks & Files: Why are transportable disk-based migration methods, such as Seagate’s Lyve Mobile and Amazon Snowball, still in use? Don’t they point to the view that network-based data migration fails above a certain level of TB-to-be-moved-in-a-certain-time?
Mark Greenlaw: Let’s examine how these technologies work. They mail you a set of disk drives. You connect that to your infrastructure, copy the data, and mail it back to the cloud vendor. Then they put it into the cloud.
If that’s the fastest way you can migrate onto your new storage environment, you must be okay with significant disruptions. By the time you disconnect that container and ship it up to the cloud vendor and restore it in the cloud, there’s a long gap in connectivity. Customers end up with these options because they don’t have data reduction in flight.
If they accelerate the process four to sixfold, they might not need to take advantage of these disk-mailing tools. The other challenge is these cloud vendors (like all storage vendors) want to make it difficult to move. Hotel California is really the theme song. The organization has the ability to connect and move the repository to the cloud, but if you are dissatisfied with that move, these tools do not have a reverse button. You are stuck in this environment because you can’t get your data out.
That’s why it’s important you have a multidirectional solution with data reduction. We also have a capability called Intelligent Quality of Service (iQoS). This patented technology can look at the data and determine what is in use and what is inactive. We measure and automatically yield. If that data’s active, we just pause, and then we resume when the data becomes inactive. All data has a variable access environment, so we can ensure we are doing good, high-speed migrations without impacting production and do it in a way that doesn’t overwhelm a customer’s TCP/IP connections.
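The pause-and-resume behavior described here can be illustrated with a very simple simulation. This is not the patented iQoS technology, whose internals are not described in this interview. The function `migrate_with_yield`, the per-tick IOPS samples, and the threshold are all hypothetical.

```python
def migrate_with_yield(blocks_to_move: int,
                       production_iops: list[int],
                       busy_threshold: int) -> tuple[list[int], int]:
    """Move one block per idle tick and pause on busy ticks.

    Returns the ticks on which a block was moved and the blocks still pending.
    """
    moved_at = []
    remaining = blocks_to_move
    for tick, iops in enumerate(production_iops):
        if remaining == 0:
            break
        if iops >= busy_threshold:
            continue  # production is active: yield and wait
        moved_at.append(tick)  # production is quiet: migrate one block
        remaining -= 1
    return moved_at, remaining


# Observed production load per tick (made-up numbers).
load = [900, 100, 50, 950, 980, 120, 60]
moved_at, remaining = migrate_with_yield(3, load, busy_threshold=500)
# moved_at is [1, 2, 5] and remaining is 0
```

The migration uses only the quiet intervals, so it finishes later than an unthrottled copy but never competes with production I/O during the busy ticks.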
These physical disk services are a stopgap created because SMB organizations don’t have the networks of big enterprises. Since these SMBs are going to the cloud, they might try to delay a network upgrade. It wouldn’t make sense to complete an upgrade only to move to the cloud. Once the data is there they won’t need the increased bandwidth. We enable SMBs to get to the cloud they want in a reasonable time, without using these disk-based tools and without application impact.
Blocks & Files: Do fast file metadata accessors with caching, such as Hammerspace, make networked data migration redundant? What are the considerations involved in a decision between actual data migration and metadata-access-led remote data access such as Hammerspace’s Global Data Environment and technology from cloud file services suppliers such as Panzura and Nasuni?
Mark Greenlaw: Fast metadata accessors make recently and often-used file data feel like it’s being accessed locally when it’s stored in the cloud. These tools are ideal when groups of users need fast access to data from any location, such as a temporary office or small branch. They’re also useful for global collaboration.
Maybe my company has a global development program. Some people are based in low-cost geographies, and I have others working from the home office in a major metropolitan center. I want everybody to have the same performance with a shared file. This works by having metadata describe the file data—telling me where that data lives, then taking and caching that information in a local repository. Now I don’t have to worry about the distance between my users.
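The caching behavior can be sketched with a small least-recently-used cache sitting in front of remote storage. This is a generic illustration, not how Hammerspace or any other product is built. `CachingAccessor`, the dictionary standing in for cloud storage, and the capacity are hypothetical.

```python
from collections import OrderedDict


class CachingAccessor:
    """Serve reads from a small local cache, falling back to remote storage."""

    def __init__(self, remote: dict[str, bytes], capacity: int):
        self.remote = remote          # stands in for distant cloud storage
        self.capacity = capacity      # how many files fit in the local cache
        self.cache: OrderedDict[str, bytes] = OrderedDict()
        self.remote_reads = 0

    def read(self, path: str) -> bytes:
        if path in self.cache:
            self.cache.move_to_end(path)  # mark as most recently used
            return self.cache[path]
        self.remote_reads += 1            # slow path: fetch over the WAN
        data = self.remote[path]
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
        return data


remote = {"a.c": b"main", "b.c": b"util", "c.c": b"test"}
accessor = CachingAccessor(remote, capacity=2)
for path in ["a.c", "a.c", "b.c", "c.c", "a.c"]:
    accessor.read(path)
# accessor.remote_reads is 4: only the second read of a.c was served locally
```

Frequently used files are served locally, but once the working set exceeds the cache capacity, reads fall back to the remote copy.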
As useful as fast metadata accessors are, I don’t believe they make networked data migration redundant. While these tools are great for short-term offices that need high performance, their cost means they don’t scale well for larger offices or individual users. You wouldn’t put a Hammerspace application on everybody’s laptop. What’s more, because these tools can’t cache everything, they must focus on data in high demand—they won’t help if users need access to more obscure files.
Bootnote
Get a Cirrus Data white paper on migration here.