Categorical Encoding — The What, When & Why?
Understanding why machines require categorical data encoding
If you are reading this, I am assuming you already know what encoding means. Nevertheless, I’ll give a brief intro for those who are new to data science.
Note — Throughout this article, the terms features, columns and variables are used interchangeably.
Data is broadly classified into numerical and categorical types. Categorical data is further divided into nominal data (classes with no inherent order, such as names) and ordinal data (classes with a natural order, such as t-shirt sizes).
The What and Why
1. What is Categorical Encoding and why do you need it?
Some machine learning algorithms cannot work with categorical data directly, so whenever you have categorical data, you must convert it to a numerical type. Converting categorical columns to numerical ones allows the algorithm to understand and process them. This process is called categorical encoding.
There are two common approaches to categorical encoding:
I. Label Encoding, and
II. One-Hot Encoding.
I. Label Encoding
In the table above, you can see that ‘Names’ is a categorical feature (column). More specifically it is a Nominal feature since the names do not have an order or rank to them.
Now, when you perform label encoding using scikit-learn in Python, the names are encoded in alphabetical order: Akon is encoded as 0, Bkon as 1, Ckon as 2 and Dkon as 3.
The algorithm may then infer an ordinal relationship between the names, such as Akon &lt; Bkon &lt; Ckon &lt; Dkon, which does not actually exist. To avoid this, you can use One-Hot Encoding.
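To make the alphabetical assignment concrete, here is a minimal sketch using scikit-learn's `LabelEncoder` (the `names` data is a hypothetical stand-in for the article's example table):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical 'Names' column, deliberately out of alphabetical order
names = ["Dkon", "Akon", "Ckon", "Bkon"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(names)

# Integers are assigned by alphabetical order of the classes,
# regardless of the order the values appear in
print(encoded.tolist())           # [3, 0, 2, 1]
print(encoder.classes_.tolist())  # ['Akon', 'Bkon', 'Ckon', 'Dkon']
```

Note that `encoder.classes_` records the learned mapping, so the same encoder can be reused to transform new data consistently.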
II. One-Hot Encoding
One-Hot Encoding is the process of creating dummy variables: each class (here, each name) is represented as its own binary feature.
i.e. Akon is represented by column 0, Bkon by column 1, Ckon by column 2, and Dkon by column 3.
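A quick sketch of this with pandas' `get_dummies` (using the same hypothetical 'Names' column as above):

```python
import pandas as pd

# Hypothetical 'Names' column from the example table
df = pd.DataFrame({"Names": ["Akon", "Bkon", "Ckon", "Dkon"]})

# One binary column per class; a 1 marks the row's class
one_hot = pd.get_dummies(df["Names"])

print(one_hot.columns.tolist())               # ['Akon', 'Bkon', 'Ckon', 'Dkon']
print(one_hot.astype(int).values.tolist()[0])  # first row: [1, 0, 0, 0]
```

scikit-learn's `OneHotEncoder` achieves the same result and integrates with its pipelines; `get_dummies` is simply more convenient for quick DataFrame work.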
Note — One-Hot Encoding can lead to the Dummy Variable Trap: a scenario in which the dummy features are highly correlated with each other, because any one of them can be predicted from the rest. This causes multicollinearity, i.e. a dependency between features, which is a serious issue for machine learning algorithms like Linear Regression and Logistic Regression.
Therefore, to overcome the problem of multicollinearity, one of the dummy variables (features) needs to be dropped, since its value can easily be predicted from the remaining features. In this case, if the name is not Bkon, Ckon or Dkon, it is definitely Akon (assuming the column is non-nullable).
So, you will be left with only 3 columns, and the machine learning algorithm will represent Akon as 0–0–0. Just modify the code as below:
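The original code listing is not shown in this version of the article, so here is a sketch of the modification using pandas' `drop_first` option (scikit-learn's `OneHotEncoder` offers the equivalent `drop='first'`):

```python
import pandas as pd

df = pd.DataFrame({"Names": ["Akon", "Bkon", "Ckon", "Dkon"]})

# drop_first=True removes the first (alphabetical) dummy column,
# so Akon is encoded implicitly as all zeros
dummies = pd.get_dummies(df["Names"], drop_first=True)

print(dummies.columns.tolist())               # ['Bkon', 'Ckon', 'Dkon']
print(dummies.astype(int).values.tolist()[0])  # Akon's row: [0, 0, 0]
```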
Now comes the very important question;
2. When to use Label Encoding vs One-Hot Encoding?
Apply Label Encoding when:
- The categorical feature is ordinal (like t-shirt sizes: Small, Medium, Large). You can encode Small as 1, Medium as 2 and Large as 3, or vice-versa.
- The number of unique classes in the categorical feature is quite large, as one-hot encoding would create too many columns. e.g. if you have 1000 different names and apply one-hot encoding, you will end up with 999 new columns (after dropping one).
Quick tip: When you have a large number of unique classes in a single categorical feature, you can aggregate the counts to find the top (most frequent) 20–30 classes and label encode those, while the remaining sparse classes can simply be labelled as ‘others’. This again depends on your data and other project requirements. There is never a one-size-fits-all solution in data science.
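The tip above can be sketched in pandas as follows (the data and the choice of top 2 classes are hypothetical; in practice you would keep the top 20–30):

```python
import pandas as pd

# Hypothetical high-cardinality feature with a few frequent classes
names = pd.Series(["Akon"] * 5 + ["Bkon"] * 3 + ["Ckon"] * 1 + ["Dkon"] * 1)

# Keep the top 2 most frequent classes; bucket everything else as 'others'
top_classes = names.value_counts().nlargest(2).index
grouped = names.where(names.isin(top_classes), other="others")

print(grouped.unique().tolist())  # ['Akon', 'Bkon', 'others']
```

The grouped column can then be label encoded as usual, with far fewer distinct values.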
Apply One-Hot Encoding when:
- The number of unique classes in the categorical feature is small.
- The categorical feature is not ordinal.
That’s all folks! Thank you for reading. Any feedback will be highly appreciated.
You can get in touch with me via LinkedIn or my website.