Huffman coding is an entropy coding algorithm used for lossless data compression, developed by David Huffman in 1952. The idea is to assign variable-length codes to input characters, the most frequent character gets the smallest code and the least frequent character gets the largest code. The Huffman algorithm is a so-called “greedy” approach to solving this problem in the sense that at each step, the algorithm chooses the best available option. It turns out that this is sufficient for finding the best encoding.
Prefix Property: The variable-length codes assigned are Prefix codes, means the data sequence are assigned such that the code assigned to one character is not the prefix code of any other character. This helps to avoid the ambiguity when decoding the generated bitstream. Let us understand prefix codes with a counterexample. Let there be four characters a, b, c and d, and their corresponding variable length codes be 00, 01, 0 and 1. This coding leads to ambiguity because code assigned to c is the prefix of codes assigned to a and b. If the compressed bit stream is 0001, the decoded output may be “cccd” or “ccb” or “acd” or “ab”.
There are mainly two major parts in Huffman coding:
- Building the Huffman tree
- Traverse the Huffman tree and assign codes to the characters.
Procedure to be followed:
- The input characters are sorted in the descending or ascending order.
- Two characters of lowest probability are assigned a one and a zero.
- The sum of these two probabilities gives the new probability.
- This probability is placed in the sorted list and accordingly assigning ones and zeros.
- This procedure is repeated until all the probabilities get added up.
- The codeword of the input character is found by tracing back the ones and zeros assigned.
The featured image is from here.