Replacing an old filer or object store entails migrating data to the new system – and that can lead to a world of hurt.
“The ROI of a new array only begins when migration ends,” says Michael Jack, head of sales at Datadobi, a company that has developed storage migration software called DobiMigrate.
Until migration ends, the customer has two arrays on their premises, taking up floor space and needing power, cooling and system management, according to Jack. An in-flight data migration often takes much longer than expected. It is generally a one-off exercise, conducted by businesses that are not data migration experts. Few lessons are learnt or carried over from previous migrations.
Data migration is a multi-phase process:
- Start. Scan the source system and build a what-to-migrate catalogue.
- Update this during the migration process
- Move the data
- Write to the target
- Verify it is correct
- Finish. Cut over from the source to the target
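The first two phases in the list above, building a catalogue and updating it while the migration runs, can be sketched in a few lines of Python. This is a minimal, single-threaded illustration and not how DobiMigrate works internally; the function names `build_catalogue` and `pending_changes` are hypothetical, and file size plus modification time is assumed to be a good enough change signal.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical catalogue entry: (size in bytes, modification time in ns).
Entry = Tuple[int, int]


def build_catalogue(root: str) -> Dict[str, Entry]:
    """Walk the source tree and record every file, keyed by relative path."""
    catalogue: Dict[str, Entry] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            rel_path = os.path.relpath(full_path, root)
            catalogue[rel_path] = (info.st_size, info.st_mtime_ns)
    return catalogue


def pending_changes(
    old: Dict[str, Entry], new: Dict[str, Entry]
) -> Tuple[List[str], List[str]]:
    """Compare two scans: what must be (re)copied, what must be deleted."""
    to_copy = sorted(p for p, entry in new.items() if old.get(p) != entry)
    to_delete = sorted(p for p in old if p not in new)
    return to_copy, to_delete
```

Each rescan of a live source produces a new catalogue, and only the difference from the previous one needs to be moved before cutover. With billions of files, the walk itself is the slow part, which is why it has to be parallelised in practice.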
File system scanning can take a long time when there are petabytes of data and billions of files. [Imagine one person has to list every book in the US Library of Congress. You would need to organise an army of people to work in parallel to complete the task in weeks rather than years.]
Also, data that is written to tape in a backup process is read back to verify that what was written is what should have been written. With data migration this happens only when specialist software is used. Only then do you have a data custody chain that can satisfy compliance regulations.
Robocopy and Rsync
Datadobi’s Jack told Blocks & Files that NAS and object storage system vendors minimise migration difficulties and suggest their customers do it themselves with scripting, using Windows Robocopy or Unix/Linux Rsync.
However, these old-school software utilities date from pre-petabyte times when file populations were much smaller. They can take a long time to finish, need scripts written, have limited protocol support, do not cover cases where there are multiple different access permission schemes and cannot guarantee that a migration has completed successfully.
For example,
- Rsync is single-threaded and only supports NFS. Multiple rsync instances can be run in parallel, by writing complicated shell scripts to parse the file system structure and assign each portion to a unique rsync instance. This approach does not scale well and limits performance.
- Robocopy is limited to scanning NTFS file systems. It supports multiple threads (eight by default), but only one scan thread is used to update file system maps.
- Permission data is stored differently by different suppliers. A Datadobi tech brief states: “NetApp, for example, stores either NTFS or UNIX permissions but not both. EMC’s VNX and Unity platforms running in ‘Native’ access mode will store both NTFS and UNIX permissions separately while EMC’s Isilon implements a ‘Unified’ permission model wherein both sets of permissions are combined into a single permission model.”
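The do-it-yourself parallelism described in the Rsync bullet, splitting the tree and giving each portion to its own copier, looks roughly like the sketch below. It is an illustration only: Python's `shutil.copy2` stands in for an rsync instance, and `copy_subtree` and `parallel_copy` are hypothetical names.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_subtree(src: str, dst: str) -> int:
    """Copy one portion of the tree; returns the number of files copied."""
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(src):
        rel_dir = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            shutil.copy2(os.path.join(dirpath, name),
                         os.path.join(target_dir, name))
            copied += 1
    return copied


def parallel_copy(src_root: str, dst_root: str, workers: int = 4) -> int:
    """Assign each top-level directory to its own worker."""
    os.makedirs(dst_root, exist_ok=True)
    copied = 0
    jobs = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for entry in os.scandir(src_root):
            if entry.is_dir(follow_symlinks=False):
                jobs.append(pool.submit(
                    copy_subtree, entry.path,
                    os.path.join(dst_root, entry.name)))
            elif entry.is_file(follow_symlinks=False):
                # Loose files in the root are copied inline.
                shutil.copy2(entry.path, os.path.join(dst_root, entry.name))
                copied += 1
        copied += sum(job.result() for job in jobs)
    return copied
```

The weakness is visible in the partitioning step: if one top-level directory holds most of the files, one worker does most of the work while the others sit idle, so the approach scales only as well as the tree happens to be balanced.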
DobiMigrate includes a DobiMiner component which supports multi-processing and uses multiple threads. It has SMB and NFS proxies, which means NTFS and Linux file systems can be scanned in parallel. DobiMiner can also scan the same source file data in SMB and NFS modes, scooping up the different metadata from each protocol. This metadata contains the permission data, which is migrated to the destination system automatically.
DobiMigrate can scan 10 billion files or more and supports NFS v3 and v4, SMB v1, 2 and 3, S3, ECS, and Isilon formats. Azure and Google Cloud Platform object formats are on the development roadmap.
Data moving
Datadobi’s software moves data across a network link between arrays in parallel streams, rather than in a single slow stream, to speed data movement. It monitors the host system’s workload and throttles its own activity if that workload is affected beyond set limits.
Making a hash of data writing
When file data is selected for migration, a hash of its contents is calculated before it is written to the target system. It is read back, a new hash calculated and the two hashes are compared to ensure an exact copy has been made. If there is a mismatch the file is copied again. DobiMigrate software does this automatically.
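The hash, write, read back, compare and retry loop described above can be sketched as follows. This is an illustration under stated assumptions, not Datadobi's code: the article does not say which hash algorithm DobiMigrate uses, so SHA-256 is assumed here, and `file_digest`, `copy_with_verification` and the retry limit are hypothetical.

```python
import hashlib
import shutil


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in fixed-size reads (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copy_with_verification(src: str, dst: str, max_attempts: int = 3) -> str:
    """Copy src to dst, read dst back, and re-copy on a hash mismatch."""
    expected = file_digest(src)          # hash calculated before writing
    for _attempt in range(max_attempts):
        shutil.copyfile(src, dst)        # write to the target
        if file_digest(dst) == expected: # read back and compare
            return expected
    raise OSError(f"{dst} still differs from {src} after {max_attempts} copies")
```

The returned digest can be logged per file, which is the raw material for the custody chain mentioned earlier. A real tool would also have to make sure the read-back comes from the target system itself rather than from a local cache.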
Rsync uses hashes too but in a different way. It breaks a file to be migrated into chunks and makes chunk-level hashes. These are compared to similar chunk-level hashes on the target system. If there is no match the new chunk is written to the target. But rsync does not then verify the copied data’s integrity by confirming that what was written matches what should have been written.
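The chunk-comparison idea can be illustrated with a deliberately simplified sketch. It is not rsync's actual algorithm, which pairs a rolling checksum with a stronger hash so that matching blocks can be found at any offset; here chunks are compared only at fixed positions, SHA-256 is an assumed hash, and `chunk_hashes` and `chunks_to_send` are hypothetical names.

```python
import hashlib
from typing import List


def chunk_hashes(data: bytes, chunk_size: int) -> List[str]:
    """Hash each fixed-size chunk of the data."""
    return [
        hashlib.sha256(data[start:start + chunk_size]).hexdigest()
        for start in range(0, len(data), chunk_size)
    ]


def chunks_to_send(source: bytes, target: bytes, chunk_size: int = 4) -> List[int]:
    """Return indexes of source chunks that are missing or differ on the target."""
    source_hashes = chunk_hashes(source, chunk_size)
    target_hashes = chunk_hashes(target, chunk_size)
    return [
        index
        for index, value in enumerate(source_hashes)
        if index >= len(target_hashes) or target_hashes[index] != value
    ]
```

For a 12-byte source `b"aaaabbbbcccc"` and a target `b"aaaaXXXXcccc"`, only chunk 1 would be sent. The hashes decide what to transfer; nothing in this step reads the target back afterwards to confirm that the transferred chunk landed intact.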
Robocopy does not natively check the integrity of files written to a destination. The unsupported Microsoft FCIV (File Checksum Integrity Verifier) utility can do this. It requires scripting for it to read both the source and target files, calculate the hashes, and compare them. If errors are detected the affected files must be re-copied and re-checked.
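The kind of after-the-fact check script described above might look like the following. It is a Python stand-in, not FCIV or Robocopy, and `sha256_of` and `find_mismatches` are hypothetical names; SHA-256 is again an assumption.

```python
import hashlib
import os
from typing import List


def sha256_of(path: str) -> str:
    """Hash one file in 1 MiB reads."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def find_mismatches(src_root: str, dst_root: str) -> List[str]:
    """List source files that are missing or different on the target."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or sha256_of(src) != sha256_of(dst):
                mismatched.append(rel)
    return sorted(mismatched)
```

Every path the function returns has to be re-copied and then checked again, and the whole pass reads both the source and the target in full, which is a second complete scan on top of the copy itself.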
Datadobi fact file
Datadobi was founded by four Dell EMC engineers following the closure of the Centera object storage centre in Belgium in 2009. Their first migrations were Centera to Centera. This moved on to Centera to Isilon, then NetApp to Isilon and from there to nearly any NAS to any NAS or object store.
Datadobi is entirely self-funded. There are about 60 employees and revenues in 2018 were €10m. Customers are typically one-off users and there is little recurring revenue.
To date the company has worked for 737 customers, with 30 per cent apiece in finance and healthcare. Eighty per cent of its work is in the USA where it has 30 staff. Customer migrations include data moving between on-premises and cloud destinations.
We note that Dell EMC uses Datadobi for NetApp-to-Isilon OneFS migrations.
Data migration niche
Blocks & Files considers that Datadobi is unlikely to attract much VC investment. Nor is it a business that would attract a vendor, as vendors’ main interest is to provide an on-ramp to their kit, not an off-ramp.
Even then the difficulties inherent in developing software to scan, copy, move and verify data from multiple sources would limit what they could do. Yet there is a need for data migration – every time a customer buys a new filer or object system. Hence the niche business that Datadobi is opening up.
Companies such as InfiniteIO (file access acceleration), Igneous (extremely large file system storage), Komprise (file system lifecycle management), Actifio and Cohesity (secondary data management) have expertise in scanning file system metadata but apply it for their own purposes, not data migration.