Compression

Compression – the shrinking of a file or object’s size by replacing repeated strings of bits with fewer bits and a pointer. This is done so that files take up less storage space or need less bandwidth when transmitted across a network link. When the file is received at the far end of the link it can be decompressed, just as a stored compressed file is decompressed when it is read. Compression and decompression can be carried out by a hardware codec device or in software.
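As a rough illustration of the compress-then-decompress cycle in software, the sketch below uses the zlib module from Python's standard library. The function name round_trip and the sample data are hypothetical, chosen only for this example.

```python
import zlib

def round_trip(data: bytes) -> tuple[int, int, bytes]:
    """Compress data, decompress it again, and report both sizes."""
    compressed = zlib.compress(data)        # lossless compression
    restored = zlib.decompress(compressed)  # exact reconstruction
    return len(data), len(compressed), restored

# Highly repetitive sample input, so the saving is easy to see.
original = b"the quick brown fox " * 200
raw_size, packed_size, restored = round_trip(original)
print(raw_size, packed_size, restored == original)
```

The saving depends heavily on how repetitive the input is; data with few repeated sequences may barely shrink at all.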

Compression techniques can be applied to text files, audio files, images and video files. Compression can be lossless, meaning that the original data can be rebuilt exactly, or lossy, meaning that some information is permanently lost. Audio files may contain frequencies that the human ear cannot detect, and removing them won’t affect the perceived quality of the audio but will reduce the file size. Applying lossy compression to images and videos can reduce their perceived quality.
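The permanent loss in lossy compression can be sketched with simple quantization: each sample is rounded to the nearest multiple of a step size, so smaller numbers need to be stored but the exact original values cannot be recovered. This is only an assumed toy model, not how any particular audio or image codec works, and the names quantize and dequantize are hypothetical.

```python
def quantize(samples: list[int], step: int) -> list[int]:
    """Lossy step: keep only the nearest multiple of step for each sample."""
    return [(s + step // 2) // step for s in samples]

def dequantize(codes: list[int], step: int) -> list[int]:
    """Rebuild approximate samples; the rounding error is gone for good."""
    return [c * step for c in codes]

samples = [1000, 1003, 998, 1001, -512, -509]
codes = quantize(samples, step=16)
approx = dequantize(codes, step=16)
print(codes)
print(approx)
```

The rebuilt values stay within half a step of the originals, which is the trade-off: a larger step gives smaller codes but a larger error.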

Applying lossy compression to text files is self-defeating as losing information obscures the meaning of the text.

Compression techniques

Algorithms scan files to find repeated bit sequences. Each repeat is replaced by a pointer that identifies the original bit sequence, so that the sequence can be written back at that location when the file is decompressed. The original bit sequences can be stored in a dictionary appended to the file.

JPEG – (Joint Photographic Experts Group) – A standard format for compressed image data. It is lossy, in that some image data is discarded, and it can be very effective in reducing image file sizes while still delivering acceptable image quality.

MPEG – (Moving Picture Experts Group) – A way of compressing video files that stores the changes between frames in a video rather than every entire frame. There are different types of MPEG compression, such as MPEG1, MPEG2 and MPEG4.
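The idea of storing changes between frames rather than whole frames can be sketched as below. This is a deliberately simplified illustration that treats a frame as a flat list of pixel values; real MPEG encoders do considerably more than this, and the function names frame_delta and apply_delta are hypothetical.

```python
def frame_delta(previous: list[int], current: list[int]) -> dict[int, int]:
    """Record only the pixels that differ from the previous frame."""
    return {i: cur for i, (prev, cur) in enumerate(zip(previous, current)) if prev != cur}

def apply_delta(previous: list[int], delta: dict[int, int]) -> list[int]:
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(previous)
    for index, value in delta.items():
        frame[index] = value
    return frame

frame1 = [10, 10, 10, 10, 20, 20]
frame2 = [10, 10, 10, 11, 20, 25]   # only two pixels changed
delta = frame_delta(frame1, frame2)
print(delta)
print(apply_delta(frame1, delta) == frame2)
```

When consecutive frames are mostly identical, the stored delta is much smaller than the full frame.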

Lempel-Ziv – Abraham Lempel and Jacob Ziv published the LZ77 and LZ78 lossless compression papers in 1977 and 1978 respectively. In LZ77, repeated sequences in a file are detected by searching a section of the file, called a window. When a repeated sequence is found, it is replaced in the file by a length-distance pair (a pointer). This says: the next N characters are the same as the sequence found at earlier location X in the file. The window is then moved along the file (hence the sliding window in LZ77) and the search for repetition is repeated.

The file is thus reduced in size and can be exactly rebuilt (rehydrated) when it is read.
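A minimal sketch of the sliding-window approach is shown below. It assumes one common textbook formulation in which each output token is a (distance, length, next character) triple; the window size of 32 and the function names are hypothetical choices for illustration, and the search is a slow brute-force scan rather than anything a production compressor would use.

```python
def lz77_compress(data: str, window: int = 32) -> list[tuple[int, int, str]]:
    """Encode data as (distance back, match length, next literal) triples."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search only the window of text just before position i.
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one character early so a literal always follows the match.
            while i + length < len(data) - 1 and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        tokens.append((best_dist, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens

def lz77_decompress(tokens: list[tuple[int, int, str]]) -> str:
    """Rebuild the text by copying from earlier output, then adding the literal."""
    out: list[str] = []
    for distance, length, literal in tokens:
        start = len(out) - distance
        for k in range(length):
            out.append(out[start + k])  # may overlap text being written
        out.append(literal)
    return "".join(out)

text = "abracadabra abracadabra"
tokens = lz77_compress(text)
print(len(text), len(tokens), lz77_decompress(tokens) == text)
```

The second "abracadabra" is covered by a single token pointing back at the first, which is where the saving comes from.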

The LZ78 algorithm stores repeated bit sequences in a dictionary and uses a pointer added to the file to identify which dictionary sequence to insert at a particular location.
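The dictionary approach can be sketched as follows, assuming the usual textbook form in which each output token is a (dictionary index, next character) pair and index 0 means "no earlier phrase". The function names are hypothetical.

```python
def lz78_compress(data: str) -> list[tuple[int, str]]:
    """Encode data as (index of longest known phrase, next character) pairs."""
    dictionary: dict[str, int] = {}   # phrase -> index, starting at 1
    tokens = []
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch              # keep extending a known phrase
        else:
            tokens.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                        # input ended inside a known phrase
        tokens.append((dictionary[phrase], ""))
    return tokens

def lz78_decompress(tokens: list[tuple[int, str]]) -> str:
    """Rebuild the same dictionary while decoding, so it need not be sent."""
    phrases = [""]                    # index 0 is the empty phrase
    out = []
    for index, ch in tokens:
        entry = phrases[index] + ch
        out.append(entry)
        phrases.append(entry)
    return "".join(out)

text = "abababababab"
tokens = lz78_compress(text)
print(tokens)
print(lz78_decompress(tokens) == text)
```

In this formulation the decoder reconstructs the dictionary from the tokens themselves as it goes.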

The LZ78 algorithm was improved by the later LZW (Lempel-Ziv-Welch) algorithm, which pre-initializes the dictionary with all possible single characters, or emulates such a pre-initialized dictionary.
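A minimal sketch of LZW over bytes is given below, assuming a dictionary pre-initialized with all 256 single-byte values and an unbounded code size (real implementations usually cap the dictionary and pack codes into bits). The function names are hypothetical.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode data as a list of dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}  # all single bytes
    phrase = b""
    codes = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate
        else:
            codes.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)   # next free code
            phrase = bytes([byte])
    if phrase:
        codes.append(dictionary[phrase])
    return codes

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the data, recreating the dictionary one step behind the encoder."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            # Code refers to the phrase currently being defined.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return b"".join(out)

data = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(data)
print(len(data), len(codes), lzw_decompress(codes) == data)
```

Because every single byte already has a code, the output contains only codes and never needs a separate literal character, unlike the LZ78 pairs.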

The Lempel-Ziv algorithms are widely used.

ZIP – ZIP files, or archive folders, are produced by losslessly compressing files and placing them inside a single archive. This saves bandwidth and time in a network transmission, or storage space on a portable device. When read, the files are “un-zipped.”
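Zipping and un-zipping can be sketched with the zipfile module from Python's standard library. The example builds the archive in memory; the helper names zip_files and unzip_files and the sample file names are hypothetical.

```python
import io
import zipfile

def zip_files(files: dict[str, bytes]) -> bytes:
    """Losslessly compress several named files into one ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()

def unzip_files(blob: bytes) -> dict[str, bytes]:
    """Read every file back out of a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}

files = {"notes.txt": b"hello " * 500, "log.txt": b"entry\n" * 500}
blob = zip_files(files)
total = sum(len(content) for content in files.values())
print(total, len(blob), unzip_files(blob) == files)
```

The archive carries some per-file overhead for names and headers, so very small or already-compressed files may not shrink.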

Note. See also deduplication.