Introduction
Often some categories are present in the train dataset but not in the test dataset, or vice versa, which becomes problematic when doing one-hot encoding: after encoding, the train set ends up with more features than the test set (or the other way around). And a model trained on x features expects every dataset passed to it for prediction to have the same x features.
In this tutorial, we will discuss two different ways of dealing with it.
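To see the problem concretely, here is a minimal sketch with hypothetical toy data, where the category 'C' appears only in the train column:

```python
import pandas as pd

# Toy example (hypothetical data): 'C' appears only in the train column
train_col = pd.Series(['A', 'B', 'C'], name='cat')
test_col = pd.Series(['A', 'B'], name='cat')

train_dummies = pd.get_dummies(train_col)
test_dummies = pd.get_dummies(test_col)

print(train_dummies.columns.tolist())  # ['A', 'B', 'C']
print(test_dummies.columns.tolist())   # ['A', 'B'] -- one feature short
```

A model fit on `train_dummies` would expect three columns, but `test_dummies` only has two.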
About the Data
The dataset we will use here is related to travel insurance. The objective is to predict whether an insurance buyer will claim the insurance in the near future.
There are 11 variables in the dataset including the target variable.
Data Description
- Duration: Travel duration
- Destination: Travel destination
- Agency: Agency name
- Commission: Commission on the insurance
- Age: Age of the insurance buyer
- Gender: Gender of the insurance buyer
- Agency Type: Type of agency
- Distribution Channel: Offline/online
- Product Name: Name of the insurance plan
- Net Sales: Net sales
- Claim: Whether the insurance is claimed or not (the target variable); 0 = not claimed, 1 = claimed
This dataset is available at the official GitHub page of DPhi: https://github.com/dphi-official/Datasets/tree/master/travel_insurance
To load the train dataset run the below command in your notebook:
import pandas as pd
train_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/travel_insurance/Training_set_label.csv")
To load the test_dataset run the below command in your notebook:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/travel_insurance/Testing_set_label.csv')
# Train data looks like:
train_data.head()
| | Agency | Agency Type | Distribution Channel | Product Name | Duration | Destination | Net Sales | Commision (in value) | Gender | Age | Claim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CWT | Travel Agency | Online | Rental Vehicle Excess Insurance | 61 | UNITED KINGDOM | 19.8 | 11.88 | NaN | 29 | 0 |
| 1 | EPX | Travel Agency | Online | Cancellation Plan | 93 | NEW ZEALAND | 63.0 | 0.00 | NaN | 36 | 0 |
| 2 | EPX | Travel Agency | Online | 2 way Comprehensive Plan | 22 | UNITED STATES | 22.0 | 0.00 | NaN | 25 | 0 |
| 3 | C2B | Airlines | Online | Silver Plan | 14 | SINGAPORE | 54.5 | 13.63 | M | 24 | 0 |
| 4 | EPX | Travel Agency | Online | Cancellation Plan | 90 | VIET NAM | 10.0 | 0.00 | NaN | 23 | 0 |
# Test data looks like:
test_data.head()
| | Agency | Agency Type | Distribution Channel | Product Name | Duration | Destination | Net Sales | Commision (in value) | Gender | Age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EPX | Travel Agency | Online | Cancellation Plan | 24 | HONG KONG | 27.0 | 0.0 | NaN | 36 |
| 1 | EPX | Travel Agency | Online | Cancellation Plan | 51 | JAPAN | 45.0 | 0.0 | NaN | 36 |
| 2 | EPX | Travel Agency | Online | Cancellation Plan | 52 | JAPAN | 21.0 | 0.0 | NaN | 21 |
| 3 | EPX | Travel Agency | Online | Cancellation Plan | 89 | SINGAPORE | 11.0 | 0.0 | NaN | 30 |
| 4 | EPX | Travel Agency | Online | Cancellation Plan | 5 | MALAYSIA | 10.0 | 0.0 | NaN | 33 |
# Drop the column 'Destination' from both train and test set, and drop the target variable 'Claim' from the train dataset
train_data.drop(['Destination', 'Claim'], axis=1, inplace=True)
test_data.drop('Destination', axis=1, inplace=True)
Check the number of unique values in the categorical columns of both the train and test sets:
# Select the categorical columns
cat_train = train_data.select_dtypes('object')
cat_test = test_data.select_dtypes('object')
cat_train.nunique()
Agency 16
Agency Type 2
Distribution Channel 2
Product Name 26
Gender 2
dtype: int64
cat_test.nunique()
Agency 16
Agency Type 2
Distribution Channel 2
Product Name 25
Gender 2
dtype: int64
We can see above that the 'Product Name' column has 26 categories in the train data but only 25 in the test data.
Method 1: One-hot encoding using pd.get_dummies()
train = pd.get_dummies(cat_train)
test = pd.get_dummies(cat_test)
# checking the number of features in train and test set
print("There are {} features in train set".format(len(train.columns)))
print("There are {} features in test set".format(len(test.columns)))
There are 48 features in train set
There are 47 features in test set
Now there are 48 features in the train set but only 47 in the test set. A model trained on the 48-feature train set will also expect 48 features at prediction time. In this case, we can find the one feature that is present in train but not in test and add that column to the test set with all values set to 0.
# Getting the missing feature
missing_feature = list(set(train.columns) - set(test.columns))[0]
print(missing_feature)
Product Name_Travel Cruise Protect Family
# Adding the missing feature to the test data
test[missing_feature] = 0
# Check the number of feature in test set
len(test.columns)
48
Now, there are 48 features in test set also.
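One detail worth checking: the missing column was appended at the end of the test set, so the column order may no longer match the train set, and many estimators assume the same column order at prediction time. A minimal sketch with hypothetical mini frames shows the realignment:

```python
import pandas as pd

# Hypothetical mini versions of the encoded frames
train = pd.DataFrame({'a': [1], 'b': [0], 'c': [1]})
test = pd.DataFrame({'a': [0], 'c': [1]})
test['b'] = 0               # the added missing feature lands at the end: a, c, b

test = test[train.columns]  # realign to the train set's column order
print(test.columns.tolist())  # ['a', 'b', 'c']
```

The same one-liner, `test = test[train.columns]`, works on the real encoded frames above.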
The problem with this approach
- If many features are missing from the test set, adding all of those columns manually becomes tedious.
- What if some categories (i.e., features after one-hot encoding) are present in test but not in train? In that case you would also need to manually add or remove those columns, just as we did above.
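Both issues can be handled in one step with pandas' `DataFrame.reindex`: missing columns are added and filled with 0, and extra columns are dropped. A minimal sketch with hypothetical encoded frames:

```python
import pandas as pd

# Hypothetical encoded frames: 'x' exists only in train, 'z' only in test
train = pd.DataFrame({'w': [1], 'x': [0]})
test = pd.DataFrame({'w': [1], 'z': [1]})

# Align test to train's columns: adds 'x' filled with 0, drops 'z'
test = test.reindex(columns=train.columns, fill_value=0)
print(test.columns.tolist())  # ['w', 'x']
```

This also guarantees the column order matches the train set. Still, it silently discards test-only categories, which the next method handles more explicitly.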
Method 2: Using OneHotEncoder() from sklearn.preprocessing
The OneHotEncoder class from sklearn has a parameter named 'handle_unknown'. Its default value is 'error', which raises an error whenever an unknown category is seen. To handle unknown categories gracefully, we can set this parameter to 'ignore'.
From sklearn documentation:
handle_unknown{‘error’, ‘ignore’}, default=’error’
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
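The "all zeros" behavior is easy to verify on a toy example (hypothetical data), where the category 'C' is first seen at transform time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'C' never appears during fitting
fit_df = pd.DataFrame({'cat': ['A', 'B']})
new_df = pd.DataFrame({'cat': ['C']})

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(fit_df)
print(ohe.transform(new_df).toarray())  # [[0. 0.]] -- all zeros for unseen 'C'
```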
from sklearn.preprocessing import OneHotEncoder
Before using OneHotEncoder, make sure the data doesn't contain any null values; otherwise it will throw an error.
cat_train.dropna(inplace=True)
cat_test.dropna(inplace=True)
ohe = OneHotEncoder(handle_unknown = 'ignore')
encoded_train = ohe.fit_transform(cat_train).toarray()
# get_feature_names_out replaces the deprecated get_feature_names (scikit-learn >= 1.0)
train = pd.DataFrame(encoded_train, columns=ohe.get_feature_names_out(cat_train.columns))
Now one-hot encode the test data, but use only the 'transform' method instead of fit_transform, so the encoder applies the categories learned from the train set.
encoded_test = ohe.transform(cat_test).toarray()
test = pd.DataFrame(encoded_test, columns=ohe.get_feature_names_out(cat_test.columns))
# checking the number of features in train and test set
print("There are {} features in train set".format(len(train.columns)))
print("There are {} features in test set".format(len(test.columns)))
There are 42 features in train set
There are 42 features in test set
Now both the train and test sets have the same number of features.
Conclusion
With the second method we don't need to add or remove any features manually, whereas with the first method we have to manually add or remove the mismatched categories in the train or test set.