Gini Impurity (With Examples)
TIL about Gini Impurity: another metric that is used when training decision trees.
Last week I learned about Entropy and Information Gain, which are also used when training decision trees. Feel free to check out that post first before continuing.
I will be referencing the following data set throughout this post.
"Will I Go Running" Data Set
Day | Weather | Just Ate | Late at Work | Will I go Running?
--- | --- | --- | --- | ---
1 | 'Sunny' | 'yes' | 'no' | 'yes'
2 | 'Rainy' | 'yes' | 'yes' | 'no'
3 | 'Sunny' | 'no' | 'yes' | 'yes'
4 | 'Rainy' | 'no' | 'no' | 'no'
5 | 'Rainy' | 'no' | 'no' | 'yes'
6 | 'Sunny' | 'yes' | 'no' | 'yes'
7 | 'Rainy' | 'no' | 'yes' | 'no'
Gini Impurity
Gini Impurity is a measurement of the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set.
Gini impurity is lower bounded by 0, with 0 occurring if the data set contains only one class.
The formula for calculating the gini impurity of a data set or feature is as follows:
G(k) = Σ_{i=1}^{J} P(i) * (1 - P(i))
Where P(i) is the probability of a certain classification i, per the training data set, and J is the number of classes.
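To make the formula concrete, here is a minimal Python sketch (the function name gini_impurity is my own, not from any library) that computes it straight from a list of class labels:

```python
from collections import Counter

def gini_impurity(labels):
    """Sum of P(i) * (1 - P(i)) over every class i that appears in labels."""
    total = len(labels)
    return sum((count / total) * (1 - count / total)
               for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes"]))  # 0.0 -- only one class, so impurity is 0
```

The single-class call at the end illustrates the lower bound of 0 mentioned above.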
If it seems complicated, it really isn’t! I’ll explain with an example.
Example: “Will I Go Running?”
In the data set above, there are two classes in which data can be classified: “yes” (I will go running) and “no” (I will not go running).
If we were using the entire data set above as training data for a new decision tree (not enough data to train an accurate tree… but let’s roll with it), the gini impurity for the set would be calculated as follows:
G(will I go running) = P("yes") * (1 - P("yes")) + P("no") * (1 - P("no"))
G(will I go running) = 4/7 * (1 - 4/7) + 3/7 * (1 - 3/7)
G(will I go running) = 0.489796
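As a quick sanity check, the same number falls out of a couple of lines of Python (this is just the arithmetic above written out, nothing library-specific):

```python
p_yes = 4 / 7  # 4 "yes" days out of 7
p_no = 3 / 7   # 3 "no" days out of 7

gini = p_yes * (1 - p_yes) + p_no * (1 - p_no)
print(gini)  # 0.4897959183673469
```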
This means there is a 48.98% chance of a new data point being incorrectly classified, based on the observed training data we have at our disposal. This number makes sense: since there are more yes class instances than no, the probability of misclassifying something is a bit better than a coin flip (which is what we would get if both classes were equally represented).
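If the "likelihood of an incorrect classification" framing feels abstract, a small simulation (purely illustrative; the setup is mine) shows the same rate: draw the true label and a random guess independently from the observed 4-to-3 label distribution and count how often they disagree.

```python
import random

random.seed(0)
labels = ["yes"] * 4 + ["no"] * 3  # the class distribution from the data set
trials = 100_000

# The first draw plays the role of the true label, the second the random classification
mismatches = sum(random.choice(labels) != random.choice(labels)
                 for _ in range(trials))
print(mismatches / trials)  # ~0.49, matching the gini impurity
```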
So how do we use this when building a decision tree?
Gini Gain
Similar to entropy, which has the concept of information gain, gini gain is calculated when building a decision tree to help determine which attribute gives us the most information about which class a new data point belongs to.
This is done in a similar way to how information gain was calculated for entropy, except instead of taking a weighted sum of the entropies of each branch of a decision, we take a weighted sum of the gini impurities.
Gini_Gain(attribute) = total_impurity - impurity_remainder(attribute)
impurity_remainder(attribute) = Σ_{branch} P(branch) * G(branch)

Where the sum runs over every branch (value) of the attribute, P(branch) is the fraction of instances that fall into that branch, and G(branch) is the gini impurity of the instances in that branch.
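Putting the two formulas together, here is a rough sketch (the helper names are my own) that computes the gini gain of the Weather attribute for the data set above:

```python
from collections import Counter

# The "Will I Go Running" data set: (weather, just_ate, late_at_work, will_run)
data = [
    ("Sunny", "yes", "no", "yes"),
    ("Rainy", "yes", "yes", "no"),
    ("Sunny", "no", "yes", "yes"),
    ("Rainy", "no", "no", "no"),
    ("Rainy", "no", "no", "yes"),
    ("Sunny", "yes", "no", "yes"),
    ("Rainy", "no", "yes", "no"),
]

def gini_impurity(labels):
    total = len(labels)
    return sum((c / total) * (1 - c / total) for c in Counter(labels).values())

def gini_gain(rows, attribute_index, label_index=-1):
    labels = [row[label_index] for row in rows]
    total_impurity = gini_impurity(labels)

    # impurity_remainder: weighted sum of the gini impurity of each branch
    remainder = 0.0
    for value in {row[attribute_index] for row in rows}:
        branch = [row[label_index] for row in rows if row[attribute_index] == value]
        remainder += (len(branch) / len(rows)) * gini_impurity(branch)

    return total_impurity - remainder

print(gini_gain(data, 0))  # Weather: ~0.276 (every Sunny day is a "yes")
```

A higher gini gain means the attribute does a better job of separating the classes, so when building the tree we would split on the attribute with the highest gain.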
So which should I use? Gini Impurity or Entropy?
It seems that gini impurity and entropy are often used interchangeably in the construction of decision trees, and in practice neither metric reliably results in a more accurate tree than the other.
All things considered, a slight preference might go to gini since it doesn’t require computing a logarithm, which is a bit more computationally expensive.