Komprise CTO on how to accelerate cloud migration

Interview: Data manager Komprise takes an analytics-first approach with its Smart Data Migration service, which involves scanning and indexing all unstructured data across environments, then categorizing it by access frequency (hot, warm, or cold), file type, data growth patterns, departmental ownership, and sensitivity (e.g. PII or regulated content).

Using this metadata, enterprises can:

  • Place data correctly: Ensure active data resides in high-performance cloud tiers, while infrequently accessed files move to lower-cost archival storage.
  • Reduce scope and risk: By offloading cold data first or excluding redundant and obsolete files, the total migration footprint is much smaller.
  • Avoid disruption: Non-disruptive migrations ensure that users and applications can still access data during the transfer process.
  • Optimize for compliance: Proper classification helps ensure sensitive files are placed in secure, policy-compliant storage.

We wondered about the cut-off point between digitally transferring files and physically transporting storage devices, and asked Komprise field CTO Ben Henry some questions about Smart Data Migration.

Blocks & Files: A look at the Smart Data Migration concept suggests that Komprise’s approach is to reduce the amount of data migrated to the cloud by filtering the overall dataset.

Ben Henry: Yes, we call this Smart Data Migration. Many of our customers have a digital landfill of rarely used data sitting on expensive storage. We recommend that they first tier off the cold data before the migration; that way they are only migrating the 20-30 percent of hot data along with the dynamic links to the cold files. In this way, they are using the new storage platform as it is meant to be used: for hot data that needs fast access.
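For illustration only (this is not Komprise's code), here is a minimal Python sketch of the idea behind that answer: scan a share and split it into hot and cold sets by last-access time, so only the hot slice is migrated and the cold slice is tiered off first. The 90-day threshold and the mount path are assumptions.

```python
import time
from pathlib import Path

HOT_WINDOW_DAYS = 90  # assumption: files accessed in the last 90 days count as "hot"

def classify_share(root: str):
    """Walk a file share and split files into hot and cold sets by last-access time."""
    cutoff = time.time() - HOT_WINDOW_DAYS * 86400
    hot, cold = [], []
    for path in Path(root).rglob("*"):
        if path.is_file():
            (hot if path.stat().st_atime >= cutoff else cold).append(path)
    return hot, cold

# Hypothetical mount point for the NAS share being assessed
hot_files, cold_files = classify_share("/mnt/nas_share")
print(f"Migrate now: {len(hot_files)} hot files; tier off first: {len(cold_files)} cold files")
```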

Ben Henry, Komprise
Blocks & Files: Suppose I have a 10 PB dataset and I use Komprise to shrink the amount actually sent to the cloud by 50 percent. How long will it take to move 5 PB of data to the cloud? 

Ben Henry: Komprise itself exploits the available parallelism at every level (volumes, shares, VMs, threads) and optimizes transfers to move data 27x faster than common migration tools. Having said this, the actual time taken to move data depends significantly on the topology of the customer environment. Network and security configurations can make a tremendous difference, as can where the data resides; if it is spread across different networks, that can affect transfer times. We can use all available bandwidth when we are executing the migration if the customer chooses to do so.
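To make the parallelism point concrete, here is a generic multithreaded copy sketch in Python, not Komprise's implementation: many files are in flight at once across a worker pool rather than one serial stream, which is what lets a migration fill the available bandwidth. The worker count and mount paths are assumptions.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_file(src: Path, src_root: Path, dst_root: Path) -> Path:
    """Copy one file, preserving its relative path under the destination root."""
    dst = dst_root / src.relative_to(src_root)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return src

def parallel_copy(src_root: str, dst_root: str, workers: int = 32) -> None:
    """Fan a file-tree copy out across worker threads so transfers overlap."""
    src, dst = Path(src_root), Path(dst_root)
    files = [p for p in src.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(lambda p: copy_file(p, src, dst), files):
            pass  # a real tool would log progress and retry failures here

parallel_copy("/mnt/source_share", "/mnt/destination_share")  # hypothetical mount points
```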

Blocks & Files: Is there compute time involved at either end to verify the data has been sent correctly? 

Ben Henry: Yes. We do a checksum on the source and then on the destination and compare them to ensure that the data was moved correctly. We also provide a consolidated chain-of-custody report so that our customer has a log of all the data that was transferred for compliance reasons. Unlike legacy approaches that delay all data validation to the cutover, Komprise validates incrementally through every iteration as data is copied, making cutovers seamless. We are able to provide a current estimate of the final iteration because Komprise does all the validation up front as data is copied, not at the end during the time-sensitive cutover events.
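As a rough sketch of the checksum step described above (again illustrative, not Komprise's code): hash the file on the source and on the destination, compare the digests, and record the result so there is an audit trail. The choice of SHA-256 and the paths are assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(src: str, dst: str) -> bool:
    """Compare source and destination checksums and log the result for audit."""
    ok = sha256_of(src) == sha256_of(dst)
    print(f"{src} -> {dst}: {'verified' if ok else 'MISMATCH'}")  # chain-of-custody entry
    return ok

verify_copy("/mnt/source_share/report.pdf", "/cloud_mount/report.pdf")  # hypothetical paths
```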

Blocks & Files: What does Komprise do to ensure that the data is moved as fast as possible? Is it compressed? Is it deduplicated? 

Ben Henry: Komprise has proprietary, optimized SMB and NFS clients that allow the solution to analyze and migrate data much faster. Komprise Hypertransfer optimizes cloud data migration performance by minimizing WAN roundtrips, using dedicated channels to send data and mitigating the chattiness of the SMB protocol over wide-area links.
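Hypertransfer's internals are not public, but the general effect of cutting WAN roundtrips can be shown with back-of-envelope arithmetic. The per-file operation counts below are assumptions, purely to illustrate why latency, not bandwidth, often dominates small-file SMB migrations.

```python
# Illustrative only: latency cost of chatty, sequential protocol exchanges over a WAN.
rtt = 0.05                 # 50 ms round trip
files = 1_000_000          # assumption: a million small files
ops_chatty = 6             # assumption: open/metadata/write/close round trips per file, serialized
ops_batched = 1            # assumption: batching/pipelining amortizes this to ~1 per file

chatty_hours = files * ops_chatty * rtt / 3600
batched_hours = files * ops_batched * rtt / 3600
print(f"Sequential roundtrips: ~{chatty_hours:.0f} hours of pure latency")
print(f"Batched roundtrips:    ~{batched_hours:.0f} hours of pure latency")
```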

Blocks & Files: At what point of capacity is it better to send physical disk drives or SSDs (124 TB ones are here now) to the cloud, as that would be quicker than transmitting the data across a network? Can Komprise help with this?

Ben Henry: The cost of high-capacity circuits is now a fraction of what it was a few years ago. Most customers have adequate networks set up for hybrid cloud environments to handle data transfer without needing to use physical drives or move data offline. It’s common for enterprises to have 1, 10, or even 100 Gbit circuits.

Physical media can get lost or corrupted, and offline transfers may not be any quicker. For instance, sending 5 PB via Amazon Snowball could easily take 25 shipments, since one Snowball only holds 210 TB. That’s painful to configure and track versus “set and forget” networking. Sneakernet in many scenarios is a thing of the past now. In fact, I was just talking with an offshore drilling customer who now uses satellite-based internet to transmit data from remote oil rigs that lack traditional network connectivity.

Blocks & Files: That’s a good example, but for land-based sites Seagate says its Lyve Mobile offering can be used cost-effectively to physically transport data.

Ben Henry: Yes, we are not suggesting that you will never need offline transfer. It is just that the situations where it is needed have reduced significantly with the greater availability of high-speed and satellite internet. The need for offline transfers is now more of a niche, largely limited to high-security installations.

Blocks & Files: You mention that sending 5 PB by Amazon Snowball needs 25 shipments. I think sending 5 PB across a 100Gbit link could take around five days, assuming full uninterrupted link speed and, say, 10 percent network overhead.

Ben Henry: With a steady, single stream of data sent over a 100Gbit link at 50 ms of average latency, which limits that stream to roughly 250 Mbps, the transfer could take years. Komprise isn’t single stream. In fact, it’s distributed and multithreaded. So, instead of just using 250 Mbps of a 100Gbit link, we can utilize the entire circuit, bringing the job down to days.
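The arithmetic behind both sets of figures can be checked with a few lines of Python. Assumptions: a 5 PB payload in decimal units, a ~1.5 MB effective TCP window per stream (roughly what caps a 50 ms stream near 250 Mbps), and 10 percent protocol overhead on the full link.

```python
PB = 10**15                      # bytes, decimal
payload_bits = 5 * PB * 8        # 5 PB expressed in bits

# Single TCP stream: throughput is roughly capped at window size / round-trip time.
window_bits = 1.5e6 * 8          # ~1.5 MB effective window (assumption)
rtt = 0.05                       # 50 ms
stream_bps = window_bits / rtt
years = payload_bits / stream_bps / 86400 / 365
print(f"Single stream: ~{stream_bps/1e6:.0f} Mbps -> ~{years:.1f} years for 5 PB")

# Many parallel streams filling a 100 Gbit circuit, minus ~10% overhead.
link_bps = 100e9 * 0.9
days = payload_bits / link_bps / 86400
print(f"Full 100 Gbit link: ~{days:.1f} days for 5 PB")
```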

Blocks & Files: With Snowball Edge Storage Optimized devices, you can create a single 16-node cluster with up to 2.6 PB of usable S3-compatible storage capacity. You would need just 2 x 16-node clusters for 5 PB.

Ben Henry: Yes, there are still scenarios that need edge computing or have high security constraints where network transfer is not preferred. For these scenarios, customers are willing to invest in additional infrastructure for edge storage and compute, such as the options you mention. Our point is simply that the market has shifted and we are not seeing much demand for offline transfers, largely because the bandwidth-driven need for them has greatly diminished with the availability of high-speed and satellite internet.