When to Use One Hot Encoding
TIL about One Hot Encoding, and when it is necessary to use it as a preprocessing step for machine learning models.
What is One Hot Encoding?
One Hot Encoding is a pre-processing step that is applied to categorical data, to convert it into a non-ordinal numerical representation for use in machine learning algorithms.
Phew, there’s a lot to unpack there! Let’s go through an example.
Step 1: Convert Categorical Data to Numerical Labels
Let’s say we have 3 data instances with the attributes Preferred Programming Language and OS of Choice.
Preferred Programming Language | OS of Choice |
---|---|
Javascript | OSX |
Python | Linux |
Scala | OSX |
Machine learning algorithms provided by libraries like sklearn have trouble working with attributes with string values (like the attributes above). To fix this problem, we perform a pre-processing step that converts string-valued attributes into numerical representations.
sklearn has a LabelEncoder that does just that: it turns string values in an attribute into numerical values.
After running the data above through the sklearn LabelEncoder, our table will look something like this:
Preferred Programming Language | OS of Choice |
---|---|
0 | 0 |
1 | 1 |
2 | 0 |
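To make the mapping concrete, here is a minimal plain-Python sketch of label encoding that reproduces the table above. It is an illustration only, not sklearn's implementation: the `label_encode` helper is a hypothetical name, and it numbers values in order of first appearance. sklearn's LabelEncoder numbers classes in sorted order instead, so the exact integers it produces can differ (for example, Linux would get 0 and OSX would get 1).

```python
# Plain-Python sketch of label encoding (hypothetical helper, not sklearn's code).
rows = [
    ("Javascript", "OSX"),
    ("Python", "Linux"),
    ("Scala", "OSX"),
]


def label_encode(values):
    """Map each distinct value to an integer, in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return [codes[value] for value in values], codes


# Split the rows into one sequence per attribute, then encode each attribute.
languages, oses = zip(*rows)
language_codes, language_map = label_encode(languages)
os_codes, os_map = label_encode(oses)

encoded_rows = list(zip(language_codes, os_codes))
print(encoded_rows)  # [(0, 0), (1, 1), (2, 0)]
```

Each attribute gets its own mapping, which is why the 0 in the first column and the 0 in the second column mean different things.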
The Problem Of Ordinality
Stopping here, though, causes a problem… many machine learning algorithms will treat the ordinality of numbers in an attribute as significant: a higher number “must be better” than a lower number in some way.
For some categories, this can make sense: if “cold” is better than “warm” is better than “hot”, for example, maybe stopping at the LabelEncoded representation makes sense. In the temperature case, it can actually convey more information than the un-encoded version.
For most categories, though, there is no sense of superiority between category values, and the ordinality injected by the LabelEncoder just results in noise.
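A small sketch of what that noise looks like, assuming the integer codes from the label-encoded table above. The `code_distance` helper is a hypothetical name used only for illustration. Any model that relies on numeric magnitude or distance would see these gaps as meaningful, even though they come purely from the order in which codes were assigned.

```python
# Integer codes as in the label-encoded table above.
language_codes = {"Javascript": 0, "Python": 1, "Scala": 2}


def code_distance(a, b):
    """Absolute difference between the label codes of two category values."""
    return abs(language_codes[a] - language_codes[b])


js_to_python = code_distance("Javascript", "Python")
js_to_scala = code_distance("Javascript", "Scala")

# Scala looks "twice as far" from Javascript as Python does,
# which says nothing true about the languages themselves.
print(js_to_python, js_to_scala)  # 1 2
```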
Fortunately, there is a way to combat this: One Hot Encoding.
One Hot Encoding
One Hot Encoding takes an attribute with numerical values, and encodes the values as binary arrays. The length of these arrays is the number of distinct values in the category (the max label value plus one, when labels start at 0).
The result of One Hot Encoding an attribute is n binary attributes, one for each of the n distinct values in the original attribute. This allows a machine learning algorithm to leverage the information contained in a category value without the confusion caused by ordinality.
sklearn offers a OneHotEncoder class in its preprocessing package.
A One Hot Encoded version of the Label Encoded table above would look something like this:
Javascript | Python | Scala | OSX | Linux |
---|---|---|---|---|
1 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 1 |
0 | 0 | 1 | 1 | 0 |
Notice that we now have 5 binary attributes, one for each of the values from the original 2 attributes. 3 were created from the original attribute “Preferred Programming Language” and 2 from the original attribute “OS of Choice”. A 1 represents that the instance had that category value.
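The table above can be reproduced with a short plain-Python sketch. As before, this is an illustration under stated assumptions rather than sklearn's implementation: `one_hot_encode` is a hypothetical helper, and it orders columns by first appearance so the output matches the table. sklearn's OneHotEncoder orders each attribute's columns by sorted category value, so its column order can differ (Linux would come before OSX).

```python
# Plain-Python sketch of one hot encoding (hypothetical helper, not sklearn's code).
rows = [
    ("Javascript", "OSX"),
    ("Python", "Linux"),
    ("Scala", "OSX"),
]


def one_hot_encode(values):
    """Return (categories, vectors) with one binary column per distinct value."""
    categories = []
    for value in values:
        if value not in categories:
            categories.append(value)
    vectors = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, vectors


# Encode each attribute separately, then place the columns side by side.
languages, oses = zip(*rows)
language_columns, language_vectors = one_hot_encode(languages)
os_columns, os_vectors = one_hot_encode(oses)

header = language_columns + os_columns
table = [lang_vec + os_vec for lang_vec, os_vec in zip(language_vectors, os_vectors)]

print(header)  # ['Javascript', 'Python', 'Scala', 'OSX', 'Linux']
for row in table:
    print(row)
```

Every row contains exactly one 1 per original attribute, so no category value is numerically "larger" than another.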
Also recognize that while one hot encoding reduces noise in your data that would otherwise have been caused by incorrect ordinal relationships, it also greatly increases the dimensionality of your data. High dimensionality has its own set of problems in machine learning, such as the curse of dimensionality (which I plan on writing about in the next few days).
In a project I was working on recently, some category attributes had hundreds of choices. While there were only 20 attributes to start, there were close to 8000 attributes after one hot encoding. That’s a huge but necessary increase in dimensionality to support categorical data in your models.
The new binary attributes created from One Hot Encoding will also be quite sparse (mostly zeros) unless your attribute allows an instance to take on multiple values.
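Both effects can be measured on the small example from this post. The sketch below is illustrative only and the variable names are my own. It counts the columns before and after encoding and computes the fraction of cells that are zero. Because each instance has exactly one 1 per original attribute, the zero fraction grows as the attributes gain more distinct values.

```python
# One hot encoded table from the example above.
one_hot_table = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
original_attributes = 2  # Preferred Programming Language, OS of Choice

n_rows = len(one_hot_table)
n_columns = len(one_hot_table[0])

# Sparsity: the share of all cells that hold a zero.
total_cells = n_rows * n_columns
zero_cells = sum(row.count(0) for row in one_hot_table)
sparsity = zero_cells / total_cells

print(original_attributes, "->", n_columns)  # 2 -> 5
print(sparsity)  # 0.6
```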