Converting categorical features into a numerical format.
Most machine learning algorithms are designed to work with numerical data. However, real-world datasets often contain categorical features: variables that represent labels or categories rather than numerical quantities (e.g., 'color': ['red', 'blue', 'green']). Categorical encoding is the process of converting these categories into numbers so that models can process them, and there are several techniques for doing so.

One of the simplest is Label Encoding, where each unique category is assigned an integer value (e.g., 'red' -> 0, 'blue' -> 1, 'green' -> 2). While simple, this can be problematic: the model might incorrectly assume an ordinal relationship between the categories (e.g., green > blue > red).

A more robust and widely used technique is One-Hot Encoding. This method creates a new binary (0 or 1) column for each unique category. For each row, a '1' is placed in the column corresponding to its category, and '0's are placed in all other new columns. For example, with a `color` feature whose possible values are 'red', 'blue', and 'green', 'red' would be represented as [1, 0, 0]. This avoids the issue of implied order and is generally the preferred method for nominal categorical data (where no order exists).

Choosing the right encoding strategy matters, as it can significantly impact the performance of the machine learning model.
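As a minimal sketch of Label Encoding, assuming Python with scikit-learn is available (the library choice is an assumption, not something prescribed above), the mapping can be produced with `LabelEncoder`. Note that it assigns integers to categories sorted alphabetically, so the actual mapping here is 'blue' -> 0, 'green' -> 1, 'red' -> 2 rather than the illustrative order used earlier.

```python
from sklearn.preprocessing import LabelEncoder  # assumes scikit-learn is installed

colors = ['red', 'blue', 'green', 'blue', 'red']

# LabelEncoder maps each unique category to an integer; categories are
# sorted alphabetically, so 'blue' -> 0, 'green' -> 1, 'red' -> 2.
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(encoder.classes_)  # ['blue' 'green' 'red']
print(encoded)           # [2 0 1 0 2]
```

One caveat: in scikit-learn, `LabelEncoder` is intended for encoding target labels; for integer-encoding input features, `OrdinalEncoder` plays the same role on 2-D feature arrays.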
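A minimal One-Hot Encoding sketch, here using pandas' `get_dummies` (again an assumed library choice; scikit-learn's `OneHotEncoder` would work as well):

```python
import pandas as pd  # assumes pandas is installed

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# get_dummies creates one binary column per unique category;
# dtype=int yields 0/1 values rather than booleans.
one_hot = pd.get_dummies(df, columns=['color'], dtype=int)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           1            0          0
```

For linear models, passing `drop_first=True` drops one of the new columns per feature, which avoids the perfectly collinear (redundant) column that full one-hot encoding otherwise introduces.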