by Neal R. Wagner
Copyright © 2001 by Neal R. Wagner. All rights reserved.
NOTE: This site is obsolete. See book draft (in PDF):
The term information theory refers to a remarkable field of study developed by Claude Shannon in 1948. Shannon's work was like Einstein's gravitation theory, in that he created the whole field all at once, answering the most important questions at the beginning. Shannon was concerned with ``messages'' and their transmission, even in the presence of ``noise''. For Shannon, the word ``information'' did not mean anything close to the ordinary English usage. Intuitively, the information in a message is the amount of ``surprise'' in the message. No surprise means zero information. (Hey, that's somewhat intuitive. However, it is not intuitive that a random message has the most information in this sense.)
Shannon defined a mathematical quantity called entropy which measures the amount of information in a message, if the message is one of a collection of possible messages, each with its own probability of occurrence. This entropy is the average number of bits needed to represent each possible message, using the best possible encoding. If there are n messages X = {X1, ... , Xn}, with probabilities of occurrence: p(X1), ... , p(Xn) (with sum equal 1), then the entropy H(X) of this set of messages is:
Intuitively, the entropy is just the weighted average of the number of bits required to represent each message, where the weights are the probabilities that each message might occur.
For example, if we have two messages X = {male, female}, each having probability 1/2, then the entropy is
Thus in this case, as we would intuitively expect, there is one bit of information in such a message.
Suppose p(X1) = 1 and the remaining probabilities are zero. In this case the entropy works out to be 0 as one would expect, since one is only going to receive the message X1, so there is no information and no surprise in receiving this message. The actual message X1 might be complex, with many bits representing it, but its probability is 1, so only this message can occur, with no information or ``surprise'' on its receipt, even if it is complex. (In the calculation of entropy, the term 0 * log2(1/0) comes up, which looks like 0 times log2(infinity). This term would be indeterminate, but the first part tends to 0 much faster than log2(infinity) tends to infinity, so that in practice such terms are regarded as 0.)
As another example, suppose n = 3 and p(X1) = 1/2, p(X2) = 1/4, and p(X3) = 1/4. Then the entropy works out to be 1.5. It is possible to encode these messages as follows: X1: 0, X2: 10, and X3: 11. In this case the average code length is the same as the entropy. One doesn't normally expect to be able to represent a collection of messages with a code whose average length is exactly equal to the entropy; it is never possible to get the average length less than the entropy.
Finally, suppose there are 1000 equally probably messages. Then the entropy is:
Thus the entropy value of these messages means that there are nearly 10 bits of information in each message. Similarly, if there are n equally likely messages, then the entropy of a message is log2n. The equal probable case gives the largest possible value for the entropy of a collection of messages.
The terms code and coding refer to ways of representing information. We will usually be using binary codes, that is, codes that use only the binary bits 0 and 1. There are three kinds of coding:
Shannon also introduced the concept of channel capacity, which is the maximum rate at which bits can be sent over an unreliable (noisy) information channel with arbitrarily good reliability. The channel capacity is represented as a fraction or percentage of the total rate at which bits can be sent physically over the channel. Shannon proved that there always exist codes that will signal arbitrarily close to the channel capacity with arbitrarily good reliability. Thus by choosing a larger and more complicated code, one can reduce the number of errors to as small a percentage as one would like, while continuing to signal as close as one wants to 100% of the channel capacity. In practice the theory does not provide these good codes, though they are known to exist. It is not possible to signal with arbitrarily good reliability at a rate greater than the channel capacity.
The simplest example of such a channel is the binary symmetric channel. Here every time a bit is transmitted, there is a fixed probability p, with 0 <= p <= 1 such that a transmitted 0 is received as a 0 with probability p and received as a 1 with probability 1 - p. The errors occur at random.
For example, if p = 1 there are no errors at all on the channel, and the channel capacity is 1 (meaning 100%). If p = 0, the capacity is still 1 as long as you realize that all bits are reversed. If p = 0.5, then on receipt of a bit, both 0 and 1 are equally likely as the bit that was sent, so one can never say anything about the original message. In this case the channel capacity is 0 and no information can be sent over the channel.
For binary symmetric channels there is a simple formula for the capacity C (a Java program that calculates channel capacity is here):
Alternatively, one can write this formula as:
where X consists of two messages with probabilities p and 1 - p. One can argue intuitively that this formula makes use of the amount of information lost during transmission on this noisy channel, or one can show this formula mathematically using concepts not introduced here.
The following short table was produced by the Java program here):
| Probability | Channel Capacity |
| 0.00 or 1.00 | 1.000000000000000 |
| 0.05 or 0.95 | 0.713603042884044 |
| 0.10 or 0.90 | 0.531004406410719 |
| 0.15 or 0.85 | 0.390159695283600 |
| 0.20 or 0.80 | 0.278071905112638 |
| 0.25 or 0.75 | 0.188721875540867 |
| 0.30 or 0.70 | 0.118709100769307 |
| 0.35 or 0.65 | 0.065931944624509 |
| 0.40 or 0.60 | 0.029049405545331 |
| 0.45 or 0.55 | 0.007225546012192 |
| 0.50 or 0.50 | 0.000000000000000 |
Exercise: Use a numerical approximation algorithm to find the inverse of the channel capacity formula. Then write a Java program implementing the algorithm and printing a table of channel capacities and corresponding probabilities (just the reverse of the above table).
| Channel Capacity |
Probability p |
Probability 1 - p |
| 0.00 | 0.50000000 | 0.50000000 |
| 0.10 | 0.31601935 | 0.68398065 |
| 0.20 | 0.24300385 | 0.75699615 |
| 0.30 | 0.18929771 | 0.81070229 |
| 0.40 | 0.14610240 | 0.85389760 |
| 0.50 | 0.11002786 | 0.88997214 |
| 0.60 | 0.07938260 | 0.92061740 |
| 0.70 | 0.05323904 | 0.94676096 |
| 0.80 | 0.03112446 | 0.96887554 |
| 0.90 | 0.01298686 | 0.98701314 |
| 1.00 | 0.00000000 | 1.00000000 |
[Ans: Here is a Java program that uses the simple bisection method to print the table: inverse capacity table.]
Let us work out a simple example of the binary symmetric channel with p = 0.75. Keep in mind that this means the error rate is 25%: an extremely high figure. In such a case every fourth bit on the average will be reversed, with the reversals occurring completely at random. The capacity of the binary symmetric channel with p = 0.75 is:
A channel capacity of 0.18872 means that one can at best signal at a rate of slightly less than 18% of the true channel bandwidth, because there are so many errors.
Here is a simple code to improve the error rate: send each bit 3 times and let the majority decide at the other end. In this case a single error can be corrected, since if 0 is transmitted and something like 010 is received, the majority rule would interpret the received 3 bits as a 0.
With a success rate of 3/4, there will be 0 errors in the 3 bits with probability 27/64, and there will be 1 error also with probability 27/64. Then 2 errors will occur 9/64 of the time and 3 errors will occur 1/64 of the time. Only the last two cases represent an actual error with this triple redundant system, so the new error rate is 10/64 or about 16%. In summary, by reducing the transmission rate from 100% to 33%, one can reduce the error rate from 25% to 16%. (This is a very simple-minded and inefficient code.)
One could continue in this way, using more and more duplicates and letting the majority rule. (An even number of duplicates is not a good choice because the result is indeterminate in case equal numbers of 0s and 1s are received.) Here is a simple table that gives the results (a Java program that prints the table is given as the answer to the next exercise).
| Repetition Codes, Error Probability = 0.25 | |||
|---|---|---|---|
| Number of
Duplicates |
Transmission
Rate |
Error
Rate |
Success
Rate |
| 1 | 100% | 25% | 75% |
| 3 | 33% | 16% | 84% |
| 5 | 20% | 10% | 90% |
| 7 | 14% | 7% | 93% |
| 9 | 11% | 5% | 95% |
| 11 | 9.1% | 3.4% | 96.6% |
| 25 | 4.0% | 0.337% | 99.663% |
| 49 | 2.04% | 0.008% | 99.992% |
| 99 | 1.01% | 0.0000044% | 99.9999956% |
| 199 | 0.5025% | 1.805E-12% | 99.9999999999982% |
It should be clear that you can get as low an error rate as you like by using more and more duplicates, reducing the transmission rate to near zero. At a little less than the channel capacity (7 duplicates and a transmission rate of 14%), you can get the error rate down to 7%. Shannon's theory says that there are other more complicated codes that will also take the error rate arbitrarily close to zero, while maintaining a transmission rate close to 18%. (These other codes can get very large and complicated indeed. No general and practical codes have ever been discovered.)
Exercise: Find the channel capacity for p = 2/3 [Ans: 0.08170.] Do up a table for p = 2/3 similar to the one for p = 3/4. [Ans: here.]
Instead of having code words just for the two bits 0 and 1, it is possible to find codes for longer blocks of bits: for example, to take two bits at a time and use some distinct code word for each of the four possibilities: 00, 01, 10, and 11. Here is a scheme that encodes blocks of three bits at a time:
| Information Word |
Code Word |
| 000 | 000 000 |
| 100 | 100 011 |
| 010 | 010 101 |
| 001 | 001 110 |
| 011 | 011 011 |
| 101 | 101 101 |
| 110 | 110 110 |
| 111 | 111 000 |
If the information bits are designated a1, a2, and a3, then the code word bits are: a1, a2, a3, a4, a5, a6, where
a4
= a2 xor a3
a5
= a1 xor a3
a6
= a1 xor a2
These code words have the property that any two of them differ from one another in at least three positions. (One says that the Hamming distance between any two of them is at least 3.) If there is a single error in the transmission of a code word, there is a unique code word that differs from the transmitted word in only one position, whereas all others differ in at least two positions. For this reason, this code is single error correcting, just like the previous triple code, where each bit is transmitted three times.
Notice the big difference between this code and the triple transmission code: this code has a transmission rate of 50%, while the triple code has a rate of only 33.3%, even though both do single-error correction.
In the previous code, after transmitting each bit 3 times, one got a better error rate by transmitting each bit 5 times, or 7 times, or 9 times, etc. The transmission rate went to zero, while the error rate also went to zero. In the same way, one could repeat the codes in this section, and the transmission rates would be slightly better than the earlier ones, though they would still go to zero. What is wanted is a way to keep a high transmission rate while lowering the error rate.
Claude Shannon proved the existence of arbitrarily good codes in the presence of noise. The codes he constructed are completely impractical, but they show the limits of what is possible.
The actual situation is just a little more complicated: a code table chosen at random might be anything at all, and so might not perform well. Shannon's proof of his result takes the average over all possible codes. For large enough n, the average satisfies his theorem, so there must exist an individual specific code table that satisfies his theorem. In practice, for large n a random code is very likely to work.
These codes are impractical because the system requires keeping a table of all the random code words. Assuming the table would fit into memory, encoding could be done efficiently. However, for decoding one must find the code word in the table that most closely matches the received code word (that has errors from the noise on the channel).
Exercise: Create a simulation of Shannon's result for specific parameter values and for a specific random code table (with codewords chosen at random). Use p = 3/4 for the error probability in the binary symmetric channel. Thus the channel capacity C will be 0.18872. Now try a specific value for the blocksize n, say n = 15, so that there will be 32768 information words and 32768 entries in the code/decode table. Shannon's theory says that these random code words must be at least n/C = 15/0.18872 = 79.483 bits long. Try a longer length for the code words, say 100 bits, and find the error rate of the simulation in this case. (We know that for n large enough this error rate will be as small as we like in this case, but without some estimates, we don't know how it will work out for n = 15.) Try a shorter length, say 65 for the code word lengths. (We know that if the code word lengths are chosen less than n/C the error rate can never tend to zero, no matter how big n is chosen.) Also try the value 80 (or 79) for comparison. [Ans: See the next section for an answer.]