A Short Introduction to Information

by Christian Zuniga, PhD

In 1948 Claude Shannon published his legendary paper “A Mathematical Theory of Communication”, founding the field of information theory [1]. His two main insights are:

  1. There is a way of quantifying the amount of information in data.
  2. There is a capacity, or limit, on the rate at which information can be transmitted.

These insights enabled better digital storage, satellite, and wireless communication systems. His key concept of entropy extended beyond communication to other areas such as artificial intelligence and even biology. This article focuses on the first insight, called source coding, whose main result is that raw data typically contains redundancy and can be compressed, up to a limit, with negligible loss of information.

Shannon quantified information by how surprising an occurrence is. In a communication system, a received message conveys more information if it occurs less frequently, that is, if it has low probability among all possible messages. On the other hand, a message with high probability gives less information because it is less uncertain. As a common example, if the messages are weather forecasts for an area that has sunny skies most of the year, a message indicating ‘sunny skies’ gives little information. In the extreme case where there are sunny skies year-round, the message would not have to be communicated at all. A message of ‘rain’ would contain more information since it is less probable. The amount of information in a message is inversely related to its probability.

To find the information content of a message, we need the probability of observing it, p(x). Any message x is a sequence of N symbols “x1, x2, x3, …, xi, …, xN”. These could be letters from the English alphabet, as in the word x = ”Hello”. Each symbol xi takes one of K possible values with probability p(xi=k). For example, the probability of the letter ‘e’ has been estimated from multiple texts to be around 0.13 [2]. With no loss of generality, the possible values are labeled k = 0, …, K-1. The information content of a symbol is given by -log p(xi=k). Figure 1 shows that this function has the desired inverse relationship between information content and the probability of a symbol occurring. The logarithm also has other properties that make it suitable as a measure of information.

Figure 1: Uncertainty of a message increases as its probability decreases. The uncertainty is measured by -log p(xi), where xi is a symbol. The x axis is the probability, which ranges only from 0 to 1. (Image by author)
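As a quick illustration, the short sketch below evaluates -log2 p for a few letter probabilities. The 0.13 value for ‘e’ is the estimate quoted above; the other values are only assumed for comparison.

```python
import math

# Assumed symbol probabilities: 'e' ~ 0.13 as cited in the text,
# the others are illustrative values, not measured frequencies.
probabilities = {"e": 0.13, "t": 0.09, "z": 0.001}

for symbol, p in probabilities.items():
    info_bits = -math.log2(p)  # information content (surprisal) in bits
    print(f"p({symbol}) = {p:<6} -> {info_bits:.2f} bits")
```

The rarer the symbol, the more bits of information its occurrence carries.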

The average amount of information in each symbol can be found from the usual definition of an average: the information content of each symbol, -log p(xi=k), is weighted by its probability and summed over all K possible values,

H = - Σ_k p(xi=k) log p(xi=k)

This average is called the entropy H. With the logarithm taken base 2, each symbol conveys H bits of information on average.
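A minimal sketch of this weighted average, using an assumed three-symbol distribution:

```python
import math

def entropy(probs):
    """Average information per symbol, H = -sum(p * log2 p), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed distribution over K = 3 symbol values (must sum to 1).
p = [0.5, 0.3, 0.2]
print(f"H = {entropy(p):.3f} bits per symbol")               # ~1.485 bits

# A uniform distribution maximizes the entropy: log2(3) ~ 1.585 bits.
print(f"H_uniform = {entropy([1/3, 1/3, 1/3]):.3f} bits per symbol")
```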

This is the information content of each symbol. How about the information content of a message made of more than one symbol?

A message x is a sequence of N symbols “x1, x2, x3, …, xi, …, xN”, and the amount of information in the message is the combined information of all its symbols. Specifically, when the symbols are independent, p(x) = p(x1) p(x2) … p(xN), so the information in the message is given by -log p(x).

We will write p(xi) instead of p(xi=k) in the rest of the article to avoid clutter. The total amount of information in a message is then the sum of the information of its symbols:

-log p(x) = - Σ_i log p(xi)

Shannon discovered that something interesting happens if a message is very long, that is, as N increases: the amount of information in the message is approximately NH.
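As a small concrete example of summing the per-symbol terms (the letter probabilities below are made-up round numbers used only for illustration):

```python
import math

# Assumed, made-up letter probabilities for illustration only.
p = {"H": 0.06, "E": 0.13, "L": 0.04, "O": 0.08}

message = "HELLO"
for ch in message:
    print(f"-log2 p({ch}) = {-math.log2(p[ch]):.2f} bits")

total = sum(-math.log2(p[ch]) for ch in message)
print(f"Total information in '{message}': {total:.2f} bits")
```

Each rare letter contributes more bits to the total than a common one.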

Let’s look at how this happens, following the argument in reference [3]. Given a long enough sequence, each of the K possible symbol values will occur some number of times. If value 1 occurred n1 times, value 2 occurred n2 times, and so on, and pk denotes the probability of value k, we have

-log p(x) = - Σ_k nk log pk

Notice the summation is now over the K possible symbol values, not over the N symbols in the message. By the law of large numbers, as N increases the fraction nk/N approaches pk, so nk ≈ N pk.

Therefore

-log p(x) ≈ -N Σ_k pk log pk = NH

or, equivalently, p(x) ≈ 2^(-NH).

What does this mean? It means that for long enough messages, the probability of a message no longer depends on which particular symbols it contains: it is approximately the same constant, 2^(-NH), for every typical message.
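A rough numerical check of this claim, using an assumed binary source with p(x=1) = 0.1: messages drawn at random end up with nearly the same information per symbol, and the spread shrinks as N grows.

```python
import math
import random

random.seed(1)

p1 = 0.1                                   # assumed P(symbol == 1)
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

def bits_per_symbol(n):
    """-log2 p(x) / N for one randomly drawn message of length n."""
    msg = random.choices([0, 1], weights=[1 - p1, p1], k=n)
    info = sum(-math.log2(p1 if s == 1 else 1 - p1) for s in msg)
    return info / n

for n in (100, 10_000):
    samples = [bits_per_symbol(n) for _ in range(5)]
    print(f"N={n:>6}: " + ", ".join(f"{s:.3f}" for s in samples) + f"  (H = {H:.3f})")
```

As N grows, the sampled values cluster ever more tightly around H, so every typical message has probability close to 2^(-NH).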

Furthermore, since this probability is approximately uniform, the number of messages carrying it is approximately 2^(NH).

This set of messages is called the typical set [4]. It takes up most of the probability. The total number of possible messages can be much greater: K^N, or 2^(N log K). This difference is what allows for compression. If K = 2, so the symbols are 0 or 1, the total number of messages of length N is 2^N, but practically only about 2^(NH) occur. The entropy H for the binary case is less than or equal to 1, as shown in Figure 2, so there are effectively fewer messages whenever H < 1, that is, whenever p(x=1) differs from 0.5. Another way to look at the situation is that one binary digit can convey at most 1 bit of information.

Figure 2: For the binary case, where there are two symbols with probabilities p and 1-p, the entropy is less than or equal to 1 and has a maximum at p = 0.5. (Image by author)
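To put rough numbers on the compression gap described above, the sketch below compares the total number of binary messages, 2^N, with the approximate size of the typical set, 2^(NH), for an assumed p(x=1) = 0.1 and N = 100.

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

N = 100
p = 0.1                      # assumed probability of a '1'
H = binary_entropy(p)        # ~0.469 bits per binary digit

print(f"H            = {H:.3f} bits per symbol")
print(f"All messages : 2^{N} ~ 10^{N * math.log10(2):.0f}")
print(f"Typical set  : 2^{N * H:.0f} ~ 10^{N * H * math.log10(2):.0f}")
```

In this example only about 2^47 of the 2^100 possible messages are typical, so a code that indexes just the typical set needs roughly NH bits instead of N.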

References

[1] Shannon, “The Mathematical Theory of Communication,” University of Illinois Press, 1963.

[2] Stone, “Information Theory: A Tutorial Introduction,” Sebtel Press, 2015.

[3] Gershenfeld, “The Physics of Information Technology,” Cambridge University Press, 2000.

[4] Cover and Thomas, “Elements of Information Theory,” 2nd ed., Wiley, 2006.


Christian Zuniga

Christian Zuniga has worked as a lecturer at San Jose State University, teaching data mining. He has worked in OPC modeling and is interested in machine learning.