# Gini Impurity (With Examples)

TIL about Gini Impurity: another metric that is used when training decision trees.

Last week I learned about Entropy and Information Gain, which are also used when training decision trees. Feel free to check out that post first before continuing.

I will be referencing the following data set throughout this post:

```
"Will I Go Running" Data Set
Day | Weather | Just Ate | Late at Work | Will I go Running?
--- | --- | --- | --- | ---
1 | 'Sunny' | 'yes' | 'no' | 'yes'
2 | 'Rainy' | 'yes' | 'yes' | 'no'
3 | 'Sunny' | 'no' | 'yes' | 'yes'
4 | 'Rainy' | 'no' | 'no' | 'no'
5 | 'Rainy' | 'no' | 'no' | 'yes'
6 | 'Sunny' | 'yes' | 'no' | 'yes'
7 | 'Rainy' | 'no' | 'yes' | 'no'
```

## Gini Impurity

`Gini Impurity` is a measurement of the likelihood of an **incorrect classification** of a new instance of a random variable, if that new instance were **randomly classified according to the distribution of class labels** from the data set.

Gini impurity is **lower bounded by 0**, with 0 occurring if the data set contains only one class. With two classes, the maximum is 0.5, which occurs when both classes are equally likely.

The formula for calculating the gini impurity of a data set or feature is as follows:

```
G(k) = Σ_{i=1}^{J} P(i) * (1 - P(i))
```

Where `P(i)` is the probability of a certain classification `i`, per the training data set, and `J` is the number of classes.
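Here is a minimal Python sketch of that formula. The function name `gini_impurity` and the choice of a plain list of labels as input are my own for illustration and do not come from any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: sum of P(i) * (1 - P(i))."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty set as pure
    counts = Counter(labels)
    # P(i) is estimated as the fraction of labels that belong to class i
    return sum((n / total) * (1 - n / total) for n in counts.values())

print(gini_impurity(["a", "a", "a"]))       # one class only: 0.0
print(gini_impurity(["a", "b", "a", "b"]))  # evenly split: 0.5
```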

If it seems complicated, it really isn’t! I’ll explain with an example.

## Example: “Will I Go Running?”

In the data set above, there are two classes in which data can be classified: “yes” (I will go running) and “no” (I will not go running).

If we were using the entire data set above as training data for a new decision tree (not enough data to train an accurate tree… but let’s roll with it), the `gini impurity` for the set would be calculated as follows:

```
G(will I go running) = P("yes") * (1 - P("yes")) + P("no") * (1 - P("no"))
G(will I go running) = 4/7 * (1 - 4/7) + 3/7 * (1 - 3/7)
G(will I go running) = 24/49 ≈ 0.489796
```

This means there is a **48.98%** chance of a new data point being incorrectly classified, based on the observed training data we have at our disposal. This number makes sense: there are more `yes` class instances than `no`, so the probability of misclassifying something is slightly less than a coin flip (which is what we would get if both classes had the same number of instances).
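To sanity-check that interpretation, here is a rough simulation sketch. It draws an "actual" label and a "guessed" label independently from the observed class distribution and counts how often they disagree. The function name, trial count, and seed are arbitrary choices of mine.

```python
import random

# The "Will I go Running?" column for days 1 through 7
labels = ["yes", "no", "yes", "no", "yes", "yes", "no"]

def simulate_misclassification(labels, trials=200_000, seed=0):
    """Estimate how often a random guess disagrees with a randomly drawn label."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        actual = rng.choice(labels)  # a new point drawn from the class distribution
        guess = rng.choice(labels)   # a guess drawn from the same distribution
        wrong += actual != guess
    return wrong / trials

estimate = simulate_misclassification(labels)
exact = 24 / 49  # the value from the formula
print(f"simulated: {estimate:.3f}, exact: {exact:.3f}")
```

The simulated rate should land close to the exact value, with small differences due to sampling noise.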

So how do we use this when building a decision tree?

## Gini Gain

Similar to entropy, which had the concept of `information gain`, `gini gain` is calculated when building a decision tree to help determine which attribute gives us the most information about which class a new data point belongs to.

This is done in a similar way to how information gain was calculated for entropy, except instead of taking a weighted sum of the entropies of each branch of a decision, we **take a weighted sum of the gini impurity.**

```
Gini_Gain(attribute) = total_impurity - remainder(attribute)

remainder(attribute) = Σ_{branch} P(branch) * G(branch)
```
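As a sketch of how this plays out on the "Will I Go Running" data set, the snippet below computes the gini gain for each attribute. The names `ROWS`, `ATTRIBUTES`, `gini`, and `gini_gain` are hypothetical helpers I made up for this example.

```python
from collections import Counter

# Columns: (Weather, Just Ate, Late at Work, Will I go Running?)
ROWS = [
    ("Sunny", "yes", "no", "yes"),
    ("Rainy", "yes", "yes", "no"),
    ("Sunny", "no", "yes", "yes"),
    ("Rainy", "no", "no", "no"),
    ("Rainy", "no", "no", "yes"),
    ("Sunny", "yes", "no", "yes"),
    ("Rainy", "no", "yes", "no"),
]
ATTRIBUTES = {"Weather": 0, "Just Ate": 1, "Late at Work": 2}

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return sum((n / total) * (1 - n / total) for n in Counter(labels).values())

def gini_gain(rows, column):
    """Total impurity minus the weighted impurity of each branch of `column`."""
    labels = [row[-1] for row in rows]
    branches = {}
    for row in rows:
        # Group the class labels by the value this row has for the attribute
        branches.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(b) / len(rows) * gini(b) for b in branches.values())
    return gini(labels) - remainder

gains = {name: gini_gain(ROWS, col) for name, col in ATTRIBUTES.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")
best = max(gains, key=gains.get)
print("best split:", best)
```

On this data set the sketch picks Weather (gain of about 0.276), ahead of Late at Work (about 0.085) and Just Ate (about 0.014). That matches intuition, since every Sunny day in the data is a running day, so that branch has zero impurity.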

## So which should I use? Gini Impurity or Entropy?

It seems that `gini impurity` and `entropy` are often interchanged in the construction of decision trees. **In practice, neither metric consistently results in a more accurate tree than the other.**

All things considered, a slight preference might go to `gini`, since it doesn’t involve the more computationally intensive `log` calculation.
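For a side-by-side feel, here is a small sketch that computes both metrics for the class distribution of the running data set (4 `yes`, 3 `no`). The variable names are my own. Note that only the entropy line needs a `log` call.

```python
import math

p_yes, p_no = 4 / 7, 3 / 7  # class probabilities from the data set above

# Gini impurity: only multiplication and subtraction
gini = p_yes * (1 - p_yes) + p_no * (1 - p_no)

# Entropy (in bits): needs one logarithm per class
entropy = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))

print(f"gini impurity: {gini:.4f}")
print(f"entropy:       {entropy:.4f}")
```

The two numbers are on different scales (for two classes, gini tops out at 0.5 and entropy at 1.0), so they should be compared across candidate splits, not against each other.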
