# Cross Entropy for Dummies in Machine Learning Explained

Contributed by: Rakesh Ambudkar

## Introduction to Cross Entropy

The moment we hear the word Entropy, it reminds us of Thermodynamics. In thermodynamics, as momentum is transferred from one molecule to another and energy changes from one form to another, entropy increases. Well, what does that mean? There is disorder in the system. Disorder here does not mean things fall into a chaotic state; it simply means that a change is taking place. As the temperature of the system changes, so do its density and heat. Entropy is also defined as the randomness in a system: when water is heated, its temperature rises, and so does its entropy. Well, enough of Physics. The questions that pop up are-

1. How is this concept applicable to the world of statistics?
2. Are entropy and cross-entropy the same?
3. Where can cross-entropy be applied?
4. How to calculate the cross-entropy?
5. What does the value so obtained by cross-entropy signify?

These are just some of the questions, but the list is quite long. In this article, an effort is made to provide clarity on some concepts of Entropy and cross-entropy. Let’s start with the basic understanding.

As mentioned above, entropy can be defined as randomness, or, in the world of probability, as uncertainty or unpredictability. Before we deep dive into the concept of entropy, it is important to understand information theory, which Claude Shannon presented in his 1948 paper “A Mathematical Theory of Communication”.

Information theory is concerned with how effectively a message can be sent from a sender to a receiver. As we know, low-level machine language encodes and decodes information in the form of bits, namely 0’s and 1’s. Information can be passed in the form of bits, each of which is either 0 or 1. But the questions are: how much of the information passed is useful, and how much of it actually reaches the receiver? Is there any loss of information?

Let’s say a bank needs to provide a loan to a customer. The bank first needs to protect itself from the risk that the loan will not be repaid. If there is a system that tells the bank whether the potential customer will default (1) or not default (0), then an entropy (degree of uncertainty) can be calculated. This will help the bank make a better decision.

So, suppose I have a system that takes in a potential customer’s loan application and outputs whether the customer will default or not default. This information can be encoded in a single bit: the system only has to say Yes or No for the bank to decide whether to grant the loan. But is one bit of information sufficient for the bank? Relying on it alone can be risky, and the bank may lose some of its potential customers. How do we reduce that uncertainty? By providing a little more information.

Let’s say we extend this to four possibilities: most likely to default, less likely to default, default, and no default.

In the first case, there are 2 possibilities, represented by the bits 0 and 1. This can be expressed as 2¹ = 2, which means 1 bit is required to send the information.

In the second case, there are 4 possibilities, which can be represented as 00, 01, 10, 11. This can be expressed as 2² = 4, which means 2 bits are required to send the information.

Now let’s calculate the binary logarithm of the number of possibilities, which is log2 4 = 2. This was arrived at by assuming that all 4 events are equally likely to occur. It can also be expressed as

– log2 (1/4) = 2, where 1/4 is now the probability of occurrence of the event, as there are 4 events that are equally likely to happen. (Probability is defined as the number of favourable outcomes / total number of outcomes.)

Inf(x) = – log2(p(x)), where p(x) is the probability of the event x.

This was defined by Shannon and hence has been called Shannon Information.
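As a quick sketch of this definition in Python (using only the standard `math` module; the function name `shannon_information` is ours, purely for illustration):

```python
import math

def shannon_information(p):
    """Shannon information, in bits, of an event with probability p."""
    return -math.log2(p)

# An event with probability 1/4 carries -log2(1/4) = 2 bits
print(shannon_information(0.25))  # → 2.0
```

A rarer event (smaller p) carries more information, which is exactly the "surprise" intuition behind the formula.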

If we have n such events occurring with corresponding probabilities, the average information can be expressed as – Σx p(x) log2(p(x))

This provides the average amount of information from the sample in question for the given probability. This also tells about the degree of unpredictability.
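A minimal sketch of this average in plain Python (standard library only; `entropy_bits` is an illustrative name, not a standard function):

```python
import math

def entropy_bits(probs):
    """Entropy in bits: -sum over x of p(x) * log2(p(x))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely events give 2 bits on average, as computed above
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # → 2.0
```

A fair coin (two equally likely outcomes) gives 1 bit, while a certain outcome gives 0 bits, i.e., no unpredictability at all.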

How is this related to our discussion of entropy as a concept in thermodynamics? The larger the randomness, the larger the entropy. Here, we can relate randomness to uncertainty: the larger the uncertainty, the larger the entropy.

## How to calculate Entropy

For the current dataset of loan default, we know the probability of each event. Entropy can be calculated using various tools like R and Python. For simplicity, Python is used for the purposes of this article, as given below.

```python
# import the entropy function
from scipy.stats import entropy

# predicted_values holds the probability of each event in the dataset
# calculate the entropy with base 2 (bits)
etp = entropy(predicted_values, base=2)
print('Entropy : %f' % etp)
```

For the current dataset of Loan default, the Entropy is 6.377 bits.

Let’s say we have two distributions to be compared with each other. Cross entropy builds on the idea of entropy discussed above: it measures the entropy between two probability distributions. Let’s say the first probability distribution is represented as A, and the second as B.

Cross entropy is the average number of bits required to encode an event drawn from distribution A when using a code optimized for distribution B. As a concept, cross entropy is applied in machine learning when algorithms are built to make predictions from a model. Model building is based on a comparison of actual results with predicted results. This will be explained further by working through Logistic regression, where cross-entropy is referred to as Log Loss.

Let’s look at the loan default example that was discussed between the bank and the system. Let’s say distribution A is our actual distribution and distribution B is the predicted distribution.

We can represent Cross Entropy as-

CE(A, B) = – Σx p(x) * log2(q(x))

where the sum runs over all values x, p(x) is the probability of x under the actual distribution A, and q(x) is the probability of x under the predicted distribution B. So how do we relate cross entropy to entropy when working with two distributions? If the predicted distribution is the same as the actual distribution, then cross entropy is equal to entropy. However, in the real world, predicted values differ from the actual ones; this difference is measured as divergence, called the KL (Kullback-Leibler) divergence. Hence, cross entropy can also be represented as the sum of entropy and KL divergence.
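This decomposition (cross entropy = entropy + KL divergence) can be checked numerically with `scipy.stats.entropy`, the same function used earlier; the two distributions below are made-up values purely for illustration:

```python
import math
from scipy.stats import entropy

p = [0.6, 0.4]  # hypothetical actual distribution A
q = [0.8, 0.2]  # hypothetical predicted distribution B

# Cross entropy: -sum over x of p(x) * log2(q(x))
ce = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

h_p = entropy(p, base=2)     # entropy of A
kl = entropy(p, q, base=2)   # KL divergence of A from B

print(round(ce, 6), round(h_p + kl, 6))  # the two values match
```

Note that `entropy(p, q, base=2)` with two arguments returns the KL divergence, not an entropy, which is what makes the identity easy to verify.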

Let’s explore and calculate cross entropy for loan default.

The figure below shows a snapshot of the Sigmoid curve, or S curve, arrived at by building a sample dataset with the columns Annual Income and Default status. Wherever the annual income is less than 5 lacs, the default status was set to 1 (loan default), and where it is greater than 5 lacs, to 0 (no default). This was done just for representational purposes.

The difference between the actual and predicted values for the loan default distribution is shown as the error between them. It is this randomness or uncertainty that defines the entropy. The visuals above give a clear view that there seem to be large differences from the actual values at both ends. This difference is a loss of information for predicting the outcome. The difference or error in prediction is also called the Log Loss, or loss function, for classification problems in machine learning. In our loan default case, we have two classes (default or no default), i.e., two labels. Cross entropy is used to determine how the loss can be minimized to get a better prediction: the lower the loss, the better the model. This is used while building models in Logistic regression and Neural network classification algorithms.

The loss or difference for each data point is calculated and summed up to arrive at average Log loss.

Here natural logarithm is used rather than binary logarithm. Cross entropy loss can be defined as-

CE(A, B) = – Σx p(x) * log(q(x))

When the predicted distribution and the training (actual) distribution are identical, the KL divergence is zero and the cross entropy equals the entropy of the actual distribution; with hard 0/1 training labels, that entropy, and hence the loss, is zero.

As mentioned above, cross entropy is the sum of KL divergence and entropy, so if we are able to minimize the KL divergence, the loss function gets optimized. However, we also need to consider that if the cross-entropy loss, or log loss, on the training data is exactly zero, the model is likely overfitting.
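A sketch of the average log loss described above, using the natural logarithm (the function name and the sample data are illustrative, not taken from the article's dataset):

```python
import math

def log_loss(y_true, y_pred):
    """Average binary cross entropy: -mean of y*log(q) + (1-y)*log(1-q)."""
    losses = [-(y * math.log(q) + (1 - y) * math.log(1 - q))
              for y, q in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

# Made-up labels (1 = default) and predicted default probabilities
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.1, 0.8, 0.3]
print(round(log_loss(y_true, y_pred), 4))
```

A completely uncertain prediction of 0.5 gives a loss of ln 2 ≈ 0.693 per point; better-calibrated predictions push the loss toward zero.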

## Cross Entropy as a Loss Function

Cross entropy as a loss function can be used for Logistic Regression and Neural networks. For model building, when we define the accuracy measures for the model, we look at optimizing the loss function. Let’s explore this further with the loan default example.

Let’s work this out for Logistic regression with binary classification, where the problem statement is to predict one of two classes. Cross entropy can be applied here to minimize the loss; in other words, we need to minimize the misclassification of the predicted output. The logistic model will calculate the probability of each class (here, two, as this is binary classification).

The interest is in finding the difference between actual and predicted probabilities. For example, let’s say we need to find the probability of default, which is what we are building the model for. Cross entropy will find the difference between the actual probability of default, which is available to us through the training dataset, and the predicted probability of default, which is calculated by the model.

The model will predict p(default) for the loan default class. For the other class, no default:

p(no default) = 1 – p(default)

Suppose the actual class is default (label 1), and the model predicts p(default) = 0.35, so p(no default) = 0.65. The cross entropy as a log loss will be as given below:

CE(p, q) = – [0 * log2(0.65) + 1 * log2(0.35)] = – (–1.514573) = 1.514573

The cross entropy is high in this case because the predicted probability assigned to the actual class (default) is low.

If the predicted probability for the default class is improved to, say, 55%, then the cross entropy will be-

CE(p, q) = – [0 * log2(0.45) + 1 * log2(0.55)] = – (–0.86249) = 0.86249

The cross entropy is now lower than before, when the prediction for the default class was 35%. Improving the prediction to 55% reduces, or optimizes, the loss.
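The two worked values above can be reproduced in a couple of lines of Python:

```python
import math

# Actual class is default (1); first prediction: p(default) = 0.35
ce_35 = -(0 * math.log2(0.65) + 1 * math.log2(0.35))
# Improved prediction: p(default) = 0.55
ce_55 = -(0 * math.log2(0.45) + 1 * math.log2(0.55))

print(round(ce_35, 4), round(ce_55, 4))  # → 1.5146 0.8625
```

As expected, raising the predicted probability of the true class lowers the cross entropy.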