Deduplication

Deduplication – A method for removing repeated items or groups of items in a file to make it take up less space. It is often used in backup software and also in purpose-built backup appliances, such as Dell's PowerProtect and earlier Data Domain products. Inline deduplication is carried out as backup data lands on the device, rather than afterwards in a post-process deduplication pass. Global deduplication is carried out across a group of appliances; otherwise the deduplication scope is limited to the appliance or system within which it is executing. Some storage arrays also use it, such as VAST Data’s file-based Universal Storage.

Deduplication is lossless. It is one form of data reduction; compression is another. Compression reduces the size of a file by finding redundant data within the file and replacing it with pointers to the original data string. For example, a text file might be compressed by replacing repeated sequences of space characters or text strings with shorter pointers.
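
As a rough illustration of lossless compression on repetitive text, the sketch below uses Python's standard zlib module, whose DEFLATE format replaces repeated byte sequences with back-references to earlier occurrences. The sample text is invented for the example, and the actual savings depend entirely on the data.

```python
import zlib

# Hypothetical sample: a short phrase followed by a long run of spaces,
# repeated many times, so there is plenty of redundancy to remove.
text = ("backup data" + " " * 40 + "\n") * 200
raw = text.encode("utf-8")

# Compress, then decompress to show that nothing is lost.
packed = zlib.compress(raw)
restored = zlib.decompress(packed)

print(len(raw), "bytes before,", len(packed), "bytes after")
print("lossless round trip:", restored == raw)
```

Highly repetitive input like this shrinks a great deal; data that is already compressed or encrypted typically does not.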

Deduplication works at the block level within files and gives each block a calculated hash value. If a subsequent block has the same hash value then it is replaced with a pointer to the original block, thus saving space. The deduplication algorithm can work with a fixed block size or with variable-sized blocks.
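
The block-and-hash idea can be sketched in a few lines of Python. This is a minimal fixed-block illustration, not how any particular product implements it: the 4,096-byte block size, the choice of SHA-256, and the names deduplicate, rebuild, store and recipe are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}   # hash value -> the single stored copy of that block
    recipe = []  # ordered hash values (the "pointers") needed to rebuild the data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block  # first time this block has been seen
        recipe.append(digest)      # repeats only cost a pointer
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Follow the pointers to reassemble the original data."""
    return b"".join(store[digest] for digest in recipe)


# Five blocks in total, but only two distinct ones.
data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
store, recipe = deduplicate(data)

print("blocks referenced:", len(recipe))
print("blocks stored:", len(store))
print("rebuilt correctly:", rebuild(store, recipe) == data)
```

A variable-sized-block scheme would differ only in how the block boundaries are chosen; the hash lookup and pointer replacement stay the same.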

As a quick differentiation, compression and deduplication are both forms of data reduction. Compression scans for repeated characters or character strings in short sections of a file, whereas deduplication calculates hash values for larger strings of characters (blocks) and then looks for repeated hash values to identify identical blocks. Deduplication is a hash-value-level check, whereas compression is a character-string-level check.