Handling Unknown Categories in Train and Test Sets During One-Hot Encoding

Introduction

We often encounter categories in the train dataset that are not present in the test dataset, or vice versa. This becomes a problem during one-hot encoding, because each category becomes its own column: after encoding, the train set ends up with more features than the test set, or the other way around. A model trained on x features expects exactly the same x features at prediction time, so the test set must be passed with those same columns.

In this tutorial, we will discuss two different ways of dealing with it.
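To make the mismatch concrete, here is a minimal sketch on made-up data (the `color` column and its values are hypothetical and not part of the travel insurance dataset). It shows that encoding train and test separately can produce different column sets, even when the number of columns happens to be the same.

```python
import pandas as pd

# Toy data (hypothetical): 'blue' appears only in train, 'yellow' only in test
toy_train = pd.DataFrame({"color": ["red", "green", "blue"]})
toy_test = pd.DataFrame({"color": ["red", "green", "yellow"]})

# Encode each set independently, as a naive workflow would
toy_train_encoded = pd.get_dummies(toy_train)
toy_test_encoded = pd.get_dummies(toy_test)

# Columns that exist on one side only
only_in_train = set(toy_train_encoded.columns) - set(toy_test_encoded.columns)
only_in_test = set(toy_test_encoded.columns) - set(toy_train_encoded.columns)

print("Only in train:", only_in_train)
print("Only in test:", only_in_test)
```

Both encoded frames have three columns here, so comparing column counts alone would not reveal the problem; comparing column names does.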

About the Data

The dataset we will use here is related to travel insurance. The objective is to predict whether the insurance buyer will claim the insurance in the near future.

There are 11 variables in the dataset including the target variable.

Data Description

  • Duration: Travel duration

  • Destination: Travel destination

  • Agency: Agency Name

  • Commission: Commission on the insurance

  • Age: Age of the insurance buyer

  • Gender: Gender of the insurance buyer

  • Agency Type: What is the agency type?

  • Distribution Channel: offline/online

  • Product Name: Name of the insurance plan

  • Net Sales: Net sales

  • Claim: If the insurance is claimed or not (the target variable), 0 = not claimed, 1 = claimed

This dataset is available at the official GitHub page of DPhi: https://github.com/dphi-official/Datasets/tree/master/travel_insurance

To load the train dataset, run the command below in your notebook:


import pandas as pd

train_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/travel_insurance/Training_set_label.csv")

To load the test dataset, run the command below in your notebook:


test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/travel_insurance/Testing_set_label.csv')

# Train data looks like:

train_data.head()
   Agency  Agency Type    Distribution Channel  Product Name                     Duration  Destination     Net Sales  Commision (in value)  Gender  Age  Claim
0  CWT     Travel Agency  Online                Rental Vehicle Excess Insurance  61        UNITED KINGDOM  19.8       11.88                 NaN     29   0
1  EPX     Travel Agency  Online                Cancellation Plan                93        NEW ZEALAND     63.0       0.00                  NaN     36   0
2  EPX     Travel Agency  Online                2 way Comprehensive Plan         22        UNITED STATES   22.0       0.00                  NaN     25   0
3  C2B     Airlines       Online                Silver Plan                      14        SINGAPORE       54.5       13.63                 M       24   0
4  EPX     Travel Agency  Online                Cancellation Plan                90        VIET NAM        10.0       0.00                  NaN     23   0
# Test data looks like:

test_data.head()
   Agency  Agency Type    Distribution Channel  Product Name       Duration  Destination  Net Sales  Commision (in value)  Gender  Age
0  EPX     Travel Agency  Online                Cancellation Plan  24        HONG KONG    27.0       0.0                   NaN     36
1  EPX     Travel Agency  Online                Cancellation Plan  51        JAPAN        45.0       0.0                   NaN     36
2  EPX     Travel Agency  Online                Cancellation Plan  52        JAPAN        21.0       0.0                   NaN     21
3  EPX     Travel Agency  Online                Cancellation Plan  89        SINGAPORE    11.0       0.0                   NaN     30
4  EPX     Travel Agency  Online                Cancellation Plan  5         MALAYSIA     10.0       0.0                   NaN     33
# Drop the column 'Destination' from both train and test set, and drop the target variable 'Claim' from the train dataset

train_data.drop(['Destination', 'Claim'], axis=1, inplace=True)

test_data.drop('Destination', axis=1, inplace=True)

Checking the number of unique values in all the categorical columns in both train and test set

# Select the categorical columns

cat_train = train_data.select_dtypes('object')

cat_test = test_data.select_dtypes('object')
cat_train.nunique()
Agency                  16
Agency Type              2
Distribution Channel     2
Product Name            26
Gender                   2
dtype: int64
cat_test.nunique()
Agency                  16
Agency Type              2
Distribution Channel     2
Product Name            25
Gender                   2
dtype: int64

We can see above that the ‘Product Name’ column has 26 categories in the train data but only 25 in the test data.
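Counting unique values shows that a mismatch exists, but not which category is responsible. The sketch below compares the actual category sets column by column. The helper name `category_diff` and the toy frames are assumptions made for illustration; in the notebook you would pass cat_train and cat_test instead.

```python
import pandas as pd

def category_diff(train_df, test_df):
    """Return, for each shared column, the categories seen in only one frame."""
    diff = {}
    for col in train_df.columns.intersection(test_df.columns):
        # Ignore missing values so NaN is not reported as a category
        train_cats = set(train_df[col].dropna().unique())
        test_cats = set(test_df[col].dropna().unique())
        if train_cats != test_cats:
            diff[col] = {
                "only_in_train": train_cats - test_cats,
                "only_in_test": test_cats - train_cats,
            }
    return diff

# Toy frames (hypothetical values): 'Gold' is missing from the test data
toy_train = pd.DataFrame({
    "plan": ["Basic", "Silver", "Gold"],
    "channel": ["Online", "Offline", "Online"],
})
toy_test = pd.DataFrame({
    "plan": ["Basic", "Silver", "Basic"],
    "channel": ["Online", "Offline", "Online"],
})

result = category_diff(toy_train, toy_test)
print(result)
```

Columns whose category sets match (here `channel`) are left out of the result, so an empty dictionary means train and test agree on every column.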

Method 1: One hot Encoding using pd.get_dummies()

train = pd.get_dummies(cat_train)

test = pd.get_dummies(cat_test)
# checking the number of features in train and test set

print("There are {} features in train set".format(len(train.columns)))

print("There are {} features in test set".format(len(test.columns)))
There are 48 features in train set
There are 47 features in test set

Now there are 48 features in the train set and 47 in the test set. A model trained on the train set will expect the same 48 features at test time. In this case, we can find the one feature that is present in train but not in test, and add it to the test set as a column with all values set to 0.

# Getting the missing feature

missing_feature = list(set(train.columns) - set(test.columns))[0]

print(missing_feature)
Product Name_Travel Cruise Protect Family
# Adding the missing feature to the test data

test[missing_feature] = 0
# Check the number of feature in test set

len(test.columns)
48

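Now the test set also has 48 features. Because the new column is appended at the end, the column order in test no longer matches train; reorder the test columns (for example, test = test[train.columns]) before passing them to a model.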

The problem with this approach

  1. If many features are missing from the test set, adding all of those columns one by one becomes tedious and error-prone.

  2. What if some categories (that is, features after one-hot encoding) are present in test but not in train? In that case you have to handle them manually as well, either by removing those columns from the test set or by adding them to the train set, just as we did above.
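Both problems can be reduced, while still using pd.get_dummies(), by reindexing the encoded test frame against the train columns. This is a sketch on made-up data rather than part of the original walkthrough; the `plan` column and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): 'Silver' and 'Gold' are missing from test,
# while 'Platinum' appears only in test
toy_train = pd.DataFrame({"plan": ["Basic", "Silver", "Gold"]})
toy_test = pd.DataFrame({"plan": ["Basic", "Platinum", "Basic"]})

train_enc = pd.get_dummies(toy_train, dtype=int)
test_enc = pd.get_dummies(toy_test, dtype=int)

# Align test to the train columns in one step:
# - columns missing from test are added and filled with 0
# - columns that exist only in test are dropped
# - the column order follows train
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_aligned)
```

The row with the unseen category ('Platinum') ends up with zeros in every `plan` column, which is the same behaviour that handle_unknown='ignore' gives in Method 2.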

Method 2: Using OneHotEncoder() from sklearn.preprocessing

The OneHotEncoder() class from sklearn has a parameter named ‘handle_unknown’. By default its value is ‘error’, which raises an error whenever the encoder sees an unknown category during transform. To handle unknown categories, we can set this parameter to ‘ignore’.

From sklearn documentation:

handle_unknown{‘error’, ‘ignore’}, default=’error’

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
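The behaviour described in the documentation can be checked on a small example. This is a minimal sketch with made-up plan names (not taken from the dataset): the encoder is fitted on three categories and then asked to transform one known and one unknown category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): 'Platinum' is never seen during fit
toy_train = np.array([["Basic"], ["Silver"], ["Gold"]])
toy_test = np.array([["Silver"], ["Platinum"]])

toy_ohe = OneHotEncoder(handle_unknown="ignore")
toy_ohe.fit(toy_train)

# transform() returns a sparse matrix; convert it to a dense array to inspect it
encoded = toy_ohe.transform(toy_test).toarray()

print(toy_ohe.categories_)  # categories learned from the train data, sorted
print(encoded)
```

The first row has a single 1 in the 'Silver' position, while the row for the unseen 'Platinum' is all zeros. With the default handle_unknown='error', the same transform call would raise an error instead.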

from sklearn.preprocessing import OneHotEncoder

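Before using OneHotEncoder, make sure the data doesn’t contain any null values; older versions of scikit-learn raise an error otherwise (newer versions can treat a missing value as a category of its own). Here we simply drop the rows that contain nulls. Note that this removes entire rows, and with them some categories, which is why the feature counts below are lower than in Method 1.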

cat_train.dropna(inplace=True)

cat_test.dropna(inplace=True)
ohe = OneHotEncoder(handle_unknown = 'ignore')

encoded_train = ohe.fit_transform(cat_train).toarray()

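# get_feature_names_out() requires scikit-learn 1.0+; on older versions use get_feature_names()
train = pd.DataFrame(encoded_train, columns=ohe.get_feature_names_out(cat_train.columns))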

Now one-hot encode the test data. On the test data we call only the ‘transform’ method instead of fit_transform, so that the encoder reuses the categories it learned from the train data.

encoded_test = ohe.transform(cat_test).toarray()

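# get_feature_names_out() requires scikit-learn 1.0+; on older versions use get_feature_names()
test = pd.DataFrame(encoded_test, columns=ohe.get_feature_names_out(cat_test.columns))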
# checking the number of features in train and test set

print("There are {} features in train set".format(len(train.columns)))

print("There are {} features in test set".format(len(test.columns)))
There are 42 features in train set
There are 42 features in test set

Now both the train and test sets have the same number of features, in the same order.
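Dropping rows with null values discards data and, for the test set, changes which rows get a prediction. One possible alternative, sketched below on made-up data, is to replace missing values with a placeholder category before encoding. The placeholder string "Missing" and the toy columns are assumptions chosen for illustration, not something the dataset or sklearn prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical) with missing values in 'gender'
toy_train = pd.DataFrame({
    "gender": ["M", None, "F", None],
    "plan": ["Basic", "Gold", "Silver", "Basic"],
})
toy_test = pd.DataFrame({
    "gender": [None, "F"],
    "plan": ["Gold", "Platinum"],  # 'Platinum' is unseen in train
})

# Keep every row: missing values become their own category
toy_train_filled = toy_train.fillna("Missing")
toy_test_filled = toy_test.fillna("Missing")

toy_ohe = OneHotEncoder(handle_unknown="ignore")
train_arr = toy_ohe.fit_transform(toy_train_filled).toarray()
test_arr = toy_ohe.transform(toy_test_filled).toarray()

print(train_arr.shape, test_arr.shape)
```

All rows are kept in both sets, and the encoded train and test arrays have the same number of columns even though the test data contains a category that was never seen during fit.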

Conclusion

With the second method we don’t need to add or remove any features manually: the encoder learns the categories from the train set and ignores unknown ones in the test set. With the first method, we have to add or remove the mismatched categories in the train or test set ourselves.
