VAST Data restores backup data faster than disk-based appliances because of its data reduction technology and all-flash media advantages.
That’s the claim made in a white paper by VAST technology evangelist Howard Marks. Let’s see how it stacks up.
He uses Dell’s PowerProtect appliances, previously known as Data Domain, as a comparison: these are disk-based purpose-built backup appliances that use compression and inline deduplication to reduce the size of a backup. VAST uses compression and deduplication, too, but differently with its all-flash boxes.
The PowerProtect systems use LZ, the compression algorithm used by Data Domain, and can be fitted with a hardware compression card that accelerates gzfast or the optional gz algorithm. PowerProtect compression is applied to data after it arrives in an NVRAM buffer and, like VAST’s, it is lossless.
LZ is a dictionary compression method: a short symbol is used to represent a repeated pattern in a data set. The symbols are stored in a dictionary, and the appropriate symbol is substituted whenever its pattern of data is found.
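The idea can be sketched with a toy LZ78-style coder (a simplified illustration of dictionary coding in general, not Data Domain’s actual implementation): repeated phrases are replaced by an index into a dictionary that is built up as the data streams through.

```python
def lz78_compress(data: str):
    """Toy LZ78-style dictionary coder.

    Emits (dictionary_index, next_char) pairs; each pair also adds the
    newly seen phrase to the dictionary, so longer repeats get captured
    as the stream goes on.
    """
    dictionary = {"": 0}          # index 0 = the empty phrase
    output = []
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending a known phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush any trailing phrase
        output.append((dictionary[phrase], ""))
    return output

# Ten characters of repetitive input collapse to six (index, char) pairs:
pairs = lz78_compress("ababababab")
assert len(pairs) == 6
```

Real LZ implementations add many refinements (sliding windows, bounded dictionaries, entropy coding of the output), but the substitution of symbols for repeated patterns is the same core trick.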
VAST has data compression as well, using the Zstandard (ZSTD) compression method with byte granularity, plus data-aware compression. It runs ZSTD not inline as data arrives on the system, but when it migrates already-ingested data to its QLC flash storage from the storage-class memory tier that buffers incoming data. Because compression is not in the data ingest path, the system has more time to run the algorithms.
Marks says “Zstandard combines dictionary and Huffman coding techniques at multiple levels to optimize between CPU time/cycles and compression.” Huffman coding, he says, “eliminates entropy at the bit level by aligning the number of bits to store each value with the value’s frequency so common values are stored with fewer bits.”
For example, the binary code for the letter “a” is 01100001. If the letter “a” occurs frequently in a piece of text then a Huffman coding tree for that text could substitute a shorter bit value, such as 010, thus saving 5 bits for every occurrence of “a”.
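A minimal Huffman-code builder makes that frequency-to-length trade concrete. This sketch uses Python’s `heapq` to merge the two least frequent symbols repeatedly; it illustrates the technique only, not any vendor’s implementation:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build Huffman codes: the more frequent a symbol, the fewer bits it gets."""
    heap = [[freq, i, {sym: ""}]
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        merged = {s: "0" + code for s, code in lo[2].items()}   # left branch
        merged.update({s: "1" + code for s, code in hi[2].items()})  # right branch
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, merged])
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("aaaaaaabbc")
# 'a' (7 occurrences) gets a 1-bit code; 'b' and 'c' get 2 bits each,
# so the text needs 7*1 + 2*2 + 1*2 = 13 bits instead of 10 bytes * 8 = 80 bits.
assert len(codes["a"]) == 1
```

Zstandard uses Huffman coding (alongside FSE, a form of arithmetic-style entropy coding) as one stage of its pipeline, not as the whole algorithm.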
VAST’s data-aware compression is used with numeric data that has a known limited range, such as sensor temperatures or stock trade values. Because these values typically vary over a limited range, the most significant bits or whole bytes will often be repeated. For example, a stock price in dollars could bobble around 10.50, 10.60, 10.35, 10.60, etc. The “10” is repeated every time, while the trailing values (50, 60, 35, 60) repeat less often. VAST’s software can, for instance, specifically compress 32-bit floating-point values and integers by skipping these unchanging most-significant bits.
Marks says “the area containing the most significant bits will be very compressible, holding a small range of values that repeat often,” and therefore delivers good compression results. In other words, if your compression algorithm is aware of the kinds of data being stored, it can potentially work more effectively than a generic algorithm that tries to work well with whatever information you throw at it.
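That byte-level repetition is easy to demonstrate. The sketch below (an illustration of the general byte-splitting idea, not VAST’s code) packs the stock prices from above into IEEE-754 32-bit floats and regroups them into four byte planes; the most significant plane is identical for every sample:

```python
import struct

def byte_planes(values):
    """Pack 32-bit floats and regroup them into four planes, one per byte
    position. Values in a narrow range share sign/exponent bits, so the
    high-order plane is near-constant and compresses extremely well."""
    packed = b"".join(struct.pack(">f", v) for v in values)
    return [packed[i::4] for i in range(4)]

prices = [10.50, 10.60, 10.35, 10.60]
planes = byte_planes(prices)
assert len(set(planes[0])) == 1   # one distinct byte value in the MSB plane
```

A generic compressor run over the interleaved floats sees little repetition; run over the separated planes, it finds long runs of identical bytes.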
Compression algorithms have a scope, as Marks explains: “The most significant limitation of most compression is its limited scope. Most dictionary coding algorithms only build their dictionaries across 64KB or less, starting a new block and storing another dictionary every 64KB.”
VAST also uses delta encoding as another data-aware technique. Where repeated information is coming in, such as temperature readings, then instead of storing the complete value each time, the system stores just the change from the preceding value. Marks writes: “Combine this with the trick of separating the four bytes of the 32-bit values into different blocks, and the system only stores the changes in those high-order bytes. Because temperatures change relatively slowly, that difference will be 0 for many samples.”
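Delta encoding is simple to sketch (hypothetical helper names, illustrating the technique only):

```python
def delta_encode(samples):
    """Keep the first reading, then store only the change from the previous one."""
    return samples[:1] + [cur - prev for prev, cur in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Rebuild the original readings with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

temps = [21, 21, 21, 22, 22, 21]
assert delta_encode(temps) == [21, 0, 0, 1, 0, -1]   # mostly zeros: very compressible
assert delta_decode(delta_encode(temps)) == temps    # lossless round trip
```

The long runs of zeros in the delta stream are exactly what a dictionary or entropy coder squeezes best, which is why delta encoding is combined with ZSTD rather than replacing it.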
Marks says the VAST system experiments to find out the best compression method to use: “VAST systems automatically determine if data should be compressed with ZSTD or with some combination of floating-point optimization, delta-optimization, and ZSTD by taking a small sample of each element and compressing it with each of our data-aware compression methods. The system compresses each element with the method most effective for its contents.”
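The selection step Marks describes boils down to “compress a small sample with each candidate and keep the winner.” A sketch (zlib stands in here for ZSTD and the proprietary data-aware pipelines, which are not publicly available as libraries):

```python
import zlib

def pick_codec(sample: bytes, codecs: dict) -> str:
    """Compress a small sample with every candidate codec and return the
    name of the one producing the smallest output."""
    sizes = {name: len(compress(sample)) for name, compress in codecs.items()}
    return min(sizes, key=sizes.get)

candidates = {
    "store": lambda b: b,     # no compression at all
    "zlib":  zlib.compress,   # stand-in for ZSTD in this sketch
}
assert pick_codec(b"temperature=21C\n" * 500, candidates) == "zlib"
```

Sampling keeps the cost of this experiment small: the system only pays full compression cost once, with the method that won on the sample.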
Marks says smaller subsets of data, or chunks, are better for deduplication purposes than larger chunk sizes: “For example, when a user makes a 4KB change to a file, a system that deduplicates on 8KB chunks will store 8KB, but a system that deduplicates on 128KB chunks will store 128KB (16 times as much).”
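A toy fixed-chunk deduplicator shows that effect directly (a sketch only; real appliances keep fingerprint indexes on dedicated metadata structures, not an in-memory set):

```python
import hashlib
import random

def dedup_new_bytes(data: bytes, chunk_size: int, seen: set) -> int:
    """Fixed-size dedup: hash each chunk, store it only if the hash is new;
    return how many bytes actually had to be written."""
    new = 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            new += len(chunk)
    return new

random.seed(0)
original = bytes(random.randrange(256) for _ in range(256 * 1024))   # 256 KB backup
# Second backup: the first 4 KB changed, the rest is identical.
modified = bytes(random.randrange(256) for _ in range(4 * 1024)) + original[4 * 1024:]

for chunk_size in (8 * 1024, 128 * 1024):
    seen = set()
    dedup_new_bytes(original, chunk_size, seen)            # first backup: all new
    stored = dedup_new_bytes(modified, chunk_size, seen)   # only changed chunks stored
    # 8 KB chunks store 8,192 bytes for the 4 KB change; 128 KB chunks store 131,072.
```

The 4 KB change dirties exactly one chunk either way, but the dirty chunk is 16 times larger at 128 KB, matching Marks’ arithmetic.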
Deduplicating inline means chunk metadata has to be kept in DRAM, and smaller chunks carry a higher memory cost and a higher processing cost. There is also a rehydration time penalty when restoring – rehydrating – deduplicated data from disk drives, because the restored file has to be rebuilt from deduplication source chunks scattered across random locations on the disk. The larger the file and the smaller the chunks, the more random IOs are needed to rehydrate it, and this runs up against a 7,200rpm disk drive’s limit of roughly 100 random IOs per second.
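The arithmetic behind that penalty is simple. As a worst-case sketch (ignoring caching and any sequential runs of chunks), restoring a 1 GiB file deduplicated into 8 KiB chunks means 131,072 random reads, and at about 100 random IOs per second a single drive spends over 20 minutes on seeks alone:

```python
def worst_case_restore_seconds(file_bytes: int, chunk_bytes: int,
                               drive_iops: int = 100) -> float:
    """Assume every chunk is one random read on a 7,200rpm drive (~100 IOPS)."""
    random_reads = file_bytes // chunk_bytes
    return random_reads / drive_iops

seconds = worst_case_restore_seconds(1 << 30, 8 << 10)  # 1 GiB file, 8 KiB chunks
assert seconds == 1310.72                               # roughly 22 minutes
```

Disk-based appliances mitigate this with larger chunks, caching, and container layouts, which is exactly the trade-off against deduplication efficiency that Marks is pointing at.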
Marks writes: “That rehydration tax means a disk-based purpose-built backup appliance (PBBA) like PowerProtectDD can only feed restore jobs 1/3rd to 1/5th as fast as it can accept backups.”
All-flash arrays pay no such random IO rehydration penalty and so can use smaller chunks for greater deduplication efficiency “without slowing restore performance.”
VAST Data also runs a chunk similarity process “to determine if multiple chunks are similar, even when they are not identical. Chunks that are identified as ‘similar’ contain strings in common.” These can be compressed using a common compression dictionary.
Chunk size and deduplication scope
If chunk sizes are fixed, data repetitions that cross chunk boundaries won’t be detected. Adaptive chunking avoids this disadvantage by adapting “the size of each data chunk based on the data’s contents. The system takes moving samples across the data to be divided into chunks and uses some state of that moving sample to determine where chunks begin and end.”
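Content-defined chunking is usually built on a rolling hash. A toy version (illustrative only; production systems use algorithms such as Rabin fingerprinting or FastCDC):

```python
def content_chunks(data: bytes, window: int = 16,
                   mask_bits: int = 6, min_chunk: int = 32):
    """Declare a chunk boundary wherever a rolling hash of the last `window`
    bytes ends in `mask_bits` zero bits. Boundaries depend only on local
    content, so an insert early in the stream doesn't shift every later
    chunk the way fixed-size chunking does."""
    BASE, MOD = 257, (1 << 31) - 1
    pow_w = pow(BASE, window - 1, MOD)    # weight of the byte leaving the window
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD   # drop the oldest byte
        h = (h * BASE + b) % MOD                       # take in the new byte
        if i + 1 - start >= min_chunk and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # flush the final partial chunk
    return chunks

# Chunks always reassemble into the original stream:
blob = bytes(range(256)) * 16
assert b"".join(content_chunks(blob)) == blob
```

With `mask_bits = 6`, a boundary fires on average every 64 bytes of content in this toy; real systems tune the mask and minimum/maximum chunk sizes to target, say, 8 KB average chunks.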
When looking for duplicated data patterns in a data set, the larger the search area the better. If you look for a possible repeated data group in a single deduplication appliance, you may or may not find it; if you look across ten such appliances, you have a greater chance of finding it. The ten appliances have to be linked in a cluster or realm for this global deduplication to work.
PowerProtect/Data Domain appliances do not have global deduplication whereas the scale-out VAST system does.
VAST ran a comparison test between a Data Domain Virtual Edition (DDVE) system and a VAST cluster using three datasets. The results, funnily enough, showed VAST deduplicating the data sets between 24 and 40 percent better.
The potential savings in stored backup capacity can be, well, vast, and restore times will be much faster for VAST than for a disk-based PowerProtect/Data Domain system.
Read Marks’ paper for a deeper dive into these matters. We have asked Dell if it has any comments on the claimed figures.