## Information Entropy — a Study of Unpredictability

### Introduction

When Mom says your room is messy, have you ever wondered, feeling unfairly judged, how she could reach such a conclusion? What is her judgment based on? Perhaps you have many trinkets in your room, but does that make your room “messy”? Doctors have hundreds of gadgets in their operating rooms, yet those rooms are among the tidiest places on earth. What’s the difference between complicated and complex? How exactly do we measure the level of messiness? Well, that’s what we are going to talk about in this article — a concept in information theory that measures the level of randomness — entropy.

### 1. Random Variables

Just like graph algorithms are built on graphs, entropy can only be brought down to earth by the notion of random variables. It’s hard to internalize the idea of entropy without a basic understanding of random variables. If you are already familiar with them, please skip to section 2. If you are not, don’t worry. I will walk you through the concept quickly. Some very basic background knowledge is enough for our discussion of entropy.

#### 1.1 Basic Ideas of Random Variables

You must have seen things like this in high school.

A = “It will rain tomorrow”, P(A) = 30%

The A here is a naive form of a random variable. We now need to define the concept more concretely. At a high level, random variables are just like graphs: they are abstractions of things. Graphs, for instance, are abstractions of network-like structures such as road maps or social networks. Similarly, random variables are abstractions of uncertain events. For instance, we can represent “the result you get after tossing a fair coin” by a random variable X. We set X = 0 if the coin shows tails and X = 1 if it shows heads. By common sense, we know that P(X = 0) = 0.5 (the probability of getting tails on a toss is one half) and that P(X = 1) = 0.5 (the probability of getting heads on a toss is also one half).

Note that we are, in most cases, not very interested in the numeric value of a random variable, because it is just an indicator, a representation of an event. We are instead interested in the probability that the random variable takes on a particular value. This is what distinguishes random variables from ordinary variables. If an ordinary variable y = 2, then y IS the integer 2. They are equivalent; there’s nothing more to be said. But for random variables, numeric values are just tokens that indicate a certain event in the sample space. For our random variable X, we don’t care about the numbers 0 and 1. We could even use different numbers. For example, we could instead set X = 111111 if the coin shows tails and X = 99999 if it shows heads. That doesn’t change the meaningful properties of X. What we care about is that P(X = 111111) = 0.5 and P(X = 99999) = 0.5, the probabilities of the events in the sample space.
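As a quick sketch (the dictionary representation here is just an illustration, not a standard library API), a discrete random variable can be modeled as a plain mapping from outcome labels to probabilities, and relabeling the outcomes changes nothing that matters:

```python
# Model a discrete random variable as a mapping from values to probabilities.
coin = {0: 0.5, 1: 0.5}                # X = 0 for tails, X = 1 for heads
relabeled = {111111: 0.5, 99999: 0.5}  # same coin, different labels

# The meaningful property, the collection of probabilities, is unchanged.
print(sorted(coin.values()) == sorted(relabeled.values()))  # True
```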

The coin tossing case is just a very simple example. Random variables can represent more complicated uncertain events such as “the number you get after tossing a fair dice,” or “the number you get by asking a person to choose an integer between 1 and 10000.” Rather than asking, “what’s the probability of getting a six after tossing a fair dice?” we can simply evaluate P(X = 6).

#### 1.2 Continuous Random Variables (Optional)

The random variable X we talked about is discrete, which means it only takes on separate, countable values. For example, it is impossible to get a 5.4542 on a fair dice toss. However, random variables are not limited to the discrete world. They are also good models for random events with continuous outcomes. For instance, we can use a random variable Y to represent the real number we get by tossing a “baseball” onto the x-axis.

In short, Y is just X’s counterpart in the continuous world. The probability formulas related to them are very similar, except that one involves summation and the other integration. I will not go more in-depth on those here because they don’t contribute to an introductory understanding of entropy.

#### 1.3 Probability Distribution

Random variables are good models for uncertain events, and their most useful property is the probability distribution, which shows how probability is spread across the entire sample space. For example, the diagram below shows the distribution of random variable X (the outcome of a fair dice toss), which is uniform across 1–6.

Notice that the sum of all the probabilities must be one.
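A small sketch of this distribution in code, using exact fractions so the sum-to-one property is easy to verify:

```python
from fractions import Fraction

# Distribution of X, the outcome of a fair six-sided dice toss:
# every value from 1 to 6 is equally likely.
distribution = {k: Fraction(1, 6) for k in range(1, 7)}

# The probabilities over the whole sample space must sum to exactly 1.
print(sum(distribution.values()))  # 1
```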

### 2. Entropy — a Measurement of Unpredictability

OK, now let’s talk about entropy. In a word, entropy measures the unpredictability of a random variable. It is a ruler of randomness. I don’t want to throw its mathematical form at you right away. I think that leads to passive memorization instead of active exploration. Now that we understand what random variables are, let’s see if we can devise a function to measure their randomness by ourselves. Don’t worry: as long as you have taken high school algebra, there is nothing complicated algebraically in the following content.

#### 2.1 Two Dimensions of Randomness

We have many basic functions in our mathematical toolbox, and we want to piece some of them together in a way that properly measures the level of randomness. But before that, we need to figure out what factors influence randomness. Otherwise, we won’t even know where to begin.

For starters, consider two discrete random variables G and R describing the respective outcomes of tossing two different dice, a green one and a red one, and think about which is “less predictable.” This is not hard at all. Please do make sure that you figure this out before you continue reading!

Obviously, the green dice is fair and therefore less predictable, while the red dice is loaded, maybe one of those from the movie Inception. We can predict, with considerable confidence, that a toss of the red dice will come up 4. The moral here is that clustering of probability in a random variable’s distribution (for R, probability clusters around the value 4) causes the level of randomness to decline. In other words, randomness peaks when probability is spread out evenly across the range of the random variable.

Now that we have obtained our first revelation, let’s move a step further. Consider the random variable G, which describes the outcome of tossing a fair six-sided dice, and M, which represents the outcome of tossing a fair eighteen-sided dice. Which random variable is less predictable, that is, more random?

The answer is intuitive: although both random variables have evenly spread probability distributions, M can take on 18 values instead of 6 and is therefore less predictable. We can thus obtain our second revelation: the wider the range of a random variable, the higher its level of randomness.
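One way to make “predictability” tangible before we build any formula is to measure the accuracy of always guessing the likeliest outcome. The sketch below compares the three dice; the exact probabilities assigned to the loaded red dice are an illustrative assumption:

```python
# Hypothetical distributions for the dice discussed above.
loaded = {k: 0.04 for k in range(1, 7)}     # red dice R: clusters on 4
loaded[4] = 0.8                             # 0.8 + 5 * 0.04 = 1.0
fair6 = {k: 1 / 6 for k in range(1, 7)}     # green dice G
fair18 = {k: 1 / 18 for k in range(1, 19)}  # eighteen-sided dice M

def best_guess_accuracy(dist):
    """Chance of being right when we always guess the likeliest value."""
    return max(dist.values())

print(best_guess_accuracy(loaded))  # 0.8: very predictable
print(best_guess_accuracy(fair6))   # about 0.167
print(best_guess_accuracy(fair18))  # about 0.056: the least predictable
```

Both revelations show up here: clustering raises the best-guess accuracy, and a wider range lowers it.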

#### 2.2 Devising a Measurement of Randomness

What we just derived are the two dimensions of randomness: uniform distribution versus clustered distribution, and wide range versus narrow range. We have also figured out how each correlates with randomness. Now let’s devise a function that takes these two dimensions into account and outputs a universal measurement of unpredictability. Let’s call this candidate function H(X), where X is our random variable. Since we know the range of X positively influences its level of randomness, it seems we should walk through all possible values k that X can take on and sum up something. That way, the wider the range, the more terms we sum, and the bigger H(X) will be. Since we are not sure what that “something” is, let’s just call it f( ). Thus, we know that our function will look something like this:

$H(X) = \sum_{k} f(P(X=k))$

By summing, we hooked our measurement H(X) up with the range of X, implementing revelation 2. Good! Now let’s figure out what f( ) should be. How should we treat an individual probability? We know the sum of f( ) terms is supposed to take X’s probability distribution into account: the more evenly spread the distribution is, the bigger the sum should be. But it is a sum over all k that we are talking about, which is tricky. Let’s focus on a single k for now, say k = 4 in our loaded dice example.

We see that probability clusters around k = 4, so f(P(X=4)) should contribute very little to our total sum, because our first revelation told us that clustering of probability decreases randomness. It seems that the closer a value’s probability is to 0, the bigger f( ) should be, because if we have tiny probabilities everywhere across a wide range of values, our random variable indeed seems pretty “unpredictable.” What function in our mathematical toolbox has this property? Well, there is one that fits perfectly — the logarithmic function. If we negate the logarithm, which means adding a minus sign in front of it, we get a function where y increases as x approaches 0, and y = 0 where x = 1. This makes perfect sense: if the probability of, say, k = 4 equals 1, meaning the dice will land on 4 with certainty, there is no randomness at all, and our measurement should indeed output 0 in that case.
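A quick numerical check of this property (written as log2(1/p), which equals -log2(p)):

```python
import math

# -log2(p) grows without bound as p approaches 0 and is exactly 0 at p = 1.
for p in (1.0, 0.5, 0.1, 0.01):
    print(p, math.log2(1 / p))  # log2(1/p) is the same as -log2(p)
# 1.0  -> 0.0
# 0.5  -> 1.0
# 0.1  -> about 3.32
# 0.01 -> about 6.64
```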

Well, are we done? Can we just use -log x as f( ) and call it a day? Our current candidate is:

$H(X) = \sum_{k} -\log_{2} P(X=k)$

Is this going to work? Consider the counter-example below.

In the counter-example above, G is highly predictable, yet our “ruler of randomness” outputs a big value because the f( ) term for the value with tiny probability (here, k = 2) is enormous. Unfortunately, our model fails. -log x alone doesn’t work, because it only considers the probability of a single value of k instead of the entire distribution. But we are very close to the final answer. There is only one missing piece of the puzzle: if we weight -log x by the corresponding probability, the problem is fixed.
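To make the failure concrete, here is a minimal sketch; the probabilities chosen for G are an illustrative assumption:

```python
import math

# G is highly predictable: it lands on 1 almost surely.
dist = {1: 0.999, 2: 0.001}

# The unweighted candidate sums -log2 of every probability in the support.
unweighted = sum(-math.log2(p) for p in dist.values())
print(unweighted)  # about 9.97: huge, despite G being almost deterministic
```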

$H(X) = \sum_{k} -P(X=k)\log_{2} P(X=k)$

In the counter-example above, if we weight -log(P(X=k)) by P(X=k), the problem of a minuscule probability causing a major increase in measured randomness is resolved: when the probability is very small, the weight P(X=k) is also very small and stops -log(P(X=k)) from getting too big.

And this last formula is the mathematical definition of entropy in information theory. If we simplify the notation a little and use p to stand for each P(X=k), the formula can also be written as:

$H(X) = \sum_{p} -p\log_{2} p$
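The finished formula is easy to turn into code. A minimal sketch (the loaded-dice probabilities are an illustrative assumption) confirms both revelations: clustering lowers entropy, and the uniform distribution maximizes it:

```python
import math

def entropy(dist):
    """Shannon entropy in bits: H(X) = sum over k of -p_k * log2(p_k)."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

fair6 = {k: 1 / 6 for k in range(1, 7)}                         # uniform
loaded = {1: 0.04, 2: 0.04, 3: 0.04, 4: 0.8, 5: 0.04, 6: 0.04}  # clusters on 4
almost_sure = {1: 0.999, 2: 0.001}                              # near-certain

print(round(entropy(fair6), 3))        # 2.585, i.e. log2(6): the maximum
print(round(entropy(loaded), 3))       # 1.186: clustering lowers entropy
print(round(entropy(almost_sure), 3))  # 0.011: almost no randomness left
```

The `if p > 0` guard skips zero-probability values, whose contribution to entropy is conventionally taken to be zero.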

### 3. Another Intuition behind Entropy — a Measurement of Informativity

The unit of entropy is the bit. Sounds familiar? Well, this is not a coincidence. In fact, information theory is where the word “bit” and its related units started to be widely used to measure data size. Not only have we just devised a measurement of randomness, but we have also found a good model for the level of informativity. For example, a fair eight-sided dice has an entropy of log2(8) = 3 bits, which means that if we use the dice as a “telegraph signal source,” it is capable of sending out 3 bits of information per toss.

You can think of the clustering of probability as a force that stops our dice signal source from sending out whatever number it wants. Clustering of probability therefore leads to a decreased level of informativity.

If, for example, our dice can only land on the number 3, then we can only ever send out that one number when we use the dice as a signal source. The outcome is certain, so its entropy is 0 bits: the signal carries no information at all.
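Both endpoints are easy to verify numerically with the entropy formula; a minimal sketch:

```python
import math

def entropy(dist):
    """Shannon entropy in bits: H(X) = sum over k of -p_k * log2(p_k)."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

# A fair eight-sided dice: 8 equally likely outcomes carry log2(8) = 3 bits.
fair8 = {k: 1 / 8 for k in range(1, 9)}
print(entropy(fair8))  # 3.0

# A dice that always lands on 3: the outcome is certain, so 0 bits.
certain = {3: 1.0}
print(entropy(certain))  # 0.0
```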

### Summary

The idea of entropy is one of the fundamental concepts of information theory, developed by the American mathematician Claude Shannon. Its applications include data compression and machine learning algorithms such as decision trees. Entropy is a delicate and informative tool once we can see the logic beneath the formula. If you find anything above hard to understand, please leave a comment.