Updated 2017-10-31
How much information is required to represent some outcome? If the outcome is either true or false, then only 1 bit is required to store it without loss.
A measure of information should take the probabilities of the outcomes into account, so that the “footprint” of the stored information can be minimized.
Suppose there are \(M\) symbols \(x_i\), \(i\in\{1,2,\dots,M\}\), where symbol \(x_i\) occurs with probability \(p_i\). Then the information in each symbol is measured as
\[\boxed{I(x_i)=\log_n\left(\frac{1}{p_i}\right)}\]If the information is stored as bits in a computer, then the base is \(n=2\) and \(I(x_i)\) is the number of bits.
This definition meets the requirement above: the less likely a symbol is, the more information it carries.
Example:
Suppose you got admitted to Stanford University (4.8% admission rate); the chance of that is about 1 in 21. It follows that the information conveyed is \(\log_2(21)\approx 4.4\) bits.
However, it was all a dream; you did not get into Stanford. You are part of the 95.2% that was rejected, a probability of about 20 in 21, or 0.952. Thus, the information conveyed is \(\log_2(1/0.952)\approx 0.071\) bits.
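A minimal Python sketch of this measure applied to the numbers above (the `self_information` name is my own choice):

```python
import math

def self_information(p, base=2):
    """Information content log_n(1/p) of an outcome with probability p."""
    return math.log(1.0 / p, base)

# Stanford example from above: admitted (p = 0.048) vs. rejected (p = 0.952)
print(self_information(0.048))  # ~4.4 bits: the rare outcome carries a lot of information
print(self_information(0.952))  # ~0.071 bits: the expected outcome carries almost none
```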
The total information in a message with \(N\) symbols and an alphabet size of \(M\) is given by
\[-N\sum_{i=1}^M p_i\log_2(p_i)\]The average information per symbol is the entropy, which is
\[\boxed{H(X)=-\sum_{i=1}^M p_i\log_2(p_i)}\]Notice that it can also be expressed as \(-\mathbb E\{\log_2(p_i)\}\).
The entropy tells us the minimum average number of bits required to represent each symbol; any fewer and the encoding / compression becomes lossy.
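A minimal Python sketch of the entropy formula, using a small 4-symbol alphabet of my own choosing as the example:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p_i * log(p_i); zero-probability symbols are skipped."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A 4-symbol alphabet with skewed probabilities (assumed example)
probs = [0.5, 0.25, 0.125, 0.125]
H = entropy(probs)
print(H)         # 1.75 bits per symbol on average (vs. 2 bits for a fixed-length code)
print(1000 * H)  # total information in a 1000-symbol message: N * H(X) = 1750 bits
```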
According to the law of large numbers, any symbol \(x_i\) will on average appear \(N\times p_i\) times, where \(N\) is the length of the message in symbols and \(p_i\) is the probability of occurrence. This approximation becomes exact only as \(N\rightarrow \infty\).
In a stream of \(N\) symbols, the number of typical sequences is approximately
\[2^{N\cdot H(X)}\]It follows that we need \(N\cdot H(X)\) bits to represent a typical sequence.
Assuming that each typical sequence is equally likely, the probability of any one typical sequence is
\[\frac{1}{2^{N\cdot H(X)}}\]
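A quick numeric sketch of these counts, assuming a biased binary source with \(p(1)=0.1\) (the numbers are my own illustration):

```python
import math

# Assumed biased binary source: p(1) = 0.1, p(0) = 0.9
H = -(0.1 * math.log2(0.1) + 0.9 * math.log2(0.9))  # ~0.469 bits per symbol
N = 100

print(N * H)         # ~46.9 bits are enough to index a typical sequence
print(2 ** (N * H))  # ~1.3e14 typical sequences ...
print(2 ** N)        # ... out of 2^100 ~ 1.27e30 possible sequences
```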
Huffman coding is a method of encoding symbols with codewords. The idea centers around a prefix-free code, meaning that no symbol’s codeword is ever a prefix of another symbol’s codeword. The symbols that appear most often are assigned the shortest codewords, so the number of bits that needs to be transmitted is minimized.
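A minimal Python sketch of Huffman coding using a binary heap (the `huffman_codes` name, the dict-merging representation, and the example string are my own, not from these notes):

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build a prefix-free Huffman code from a {symbol: frequency} mapping.

    Returns {symbol: bitstring}; more frequent symbols get shorter codewords.
    """
    # Heap entries are (frequency, tie-breaker, {symbol: codeword-so-far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more bit to every codeword in each.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes(Counter("abracadabra"))
print(codes)  # e.g. {'a': '0', 'c': '100', 'd': '101', 'b': '110', 'r': '111'}
```

For "abracadabra" this gives an average codeword length of \(23/11\approx 2.09\) bits per symbol, just above the entropy of the letter distribution (\(\approx 2.04\) bits), consistent with entropy being the lower bound for lossless encoding.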