Assignment - 3: Solution
(Absolute Beginners and Beginners)
This is the third assignment of the DPhi 5 Week Data Science Bootcamp and revolves around Data Science Problem solving.
Agenda
- Problem Statement
- Objective
- Dataset & Data Description
- Solution Steps:
- Load data
- Understand your data: Data Analysis and Visualizations (EDA)
- Pre-process the data
- Prepare train and test datasets
- Choose a model
- Train your model
- Evaluate the model (F1-score calculation)
- Optimize: repeat steps 5 - 7
- Conclusion
- Prediction on New Test data
- Load the new test data
- Fill missing values if any
- Preprocessing and cleaning the data
- Predict the target values
Problem Statement
Objective
A hospital in the province of Greenland has been trying to improve its care conditions by looking at the historical survival of its patients. They tried analysing their data but could not identify the main factors leading to higher survival rates.
You are the best data scientist in Greenland and they've hired you to solve this problem. You are responsible for developing a model that predicts the chances of survival of a patient after 1 year of treatment (Survived_1_year).
Dataset & Data Description
The dataset contains the patient records collected from a hospital in Greenland. The 'Survived_1_year' column is the target variable and has binary entries (0 or 1).
- Survived_1_year == 0, implies that the patient did not survive after 1 year of treatment
- Survived_1_year == 1, implies that the patient survived after 1 year of treatment
To load the dataset in your Jupyter notebook, use the command below:
import pandas as pd
pharma_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')
Data Description:
- ID_Patient_Care_Situation: Care situation of a patient during treatment
- Diagnosed_Condition: The diagnosed condition of the patient
- Patient_ID: Patient identifier number
- Treated_with_drugs: Class of drugs used during treatment
- Survived_1_year: If the patient survived after one year (0 means did not survive; 1 means survived)
- Patient_Age: Age of the patient
- Patient_Body_Mass_Index: A calculated value based on the patient's weight, height, etc.
- Patient_Smoker: If the patient was a smoker or not
- Patient_Rural_Urban: If the patient stayed in Rural or Urban part of the country
- Previous_Condition: Condition of the patient before the start of the treatment. (This variable is split into 8 columns - A, B, C, D, E, F, Z and Number_of_prev_cond. A, B, C, D, E, F and Z are the previous conditions of the patient. For a given patient, if the entry in column A is 1, it means the patient previously had condition A; if not, the entry is 0, and the same holds for the other conditions. If a patient had previous conditions A and C, columns A and C will have entries 1 and 1, while columns B, D, E, F and Z will all have entry 0. The column Number_of_prev_cond will then be 2, i.e. 1 + 0 + 1 + 0 + 0 + 0 + 0 = 2 in this case. A quick consistency check of this relationship is sketched below.)
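The relationship above can be checked directly in pandas. A minimal sketch, assuming the columns are named exactly as in the description; rows where the previous-condition columns are missing are skipped:
import pandas as pd
pharma_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')
# Sum the seven one-hot previous-condition columns for each patient
prev_cond_sum = pharma_data[['A', 'B', 'C', 'D', 'E', 'F', 'Z']].sum(axis=1)
# Fraction of complete rows where the recorded Number_of_prev_cond matches the row sum
complete = pharma_data['Number_of_prev_cond'].notna()
print((prev_cond_sum[complete] == pharma_data.loc[complete, 'Number_of_prev_cond']).mean())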
Solution steps:
- Load data
- Understand your data: EDA
- Pre-process the data
- Prepare train and test datasets
- Choose a model
- Train your model
- Evaluate the model (F1-score calculation)
- Optimize: repeat steps 5 - 7
Load Libraries
import pandas as pd # package for data analysis
import numpy as np # package for numerical computations
# libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to ignore warnings
import warnings
warnings.filterwarnings('ignore')
# For Preprocessing, ML models and Evaluation
from sklearn.model_selection import train_test_split # To split the dataset into train and test set
from sklearn.linear_model import LogisticRegression # Logistic regression model
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder # for converting categorical to numerical
from sklearn.metrics import f1_score # for model evaluation
Load Data
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')
EDA
Primary screenings:
- Get a look at the data, its columns and kind of values contained in these columns: df.head()
- Stepping back a bit, get a look at the column overview: number, types, NULL counts: df.info()
# Take a look at the first five observations
data.head()
ID_Patient_Care_Situation | Diagnosed_Condition | Patient_ID | Treated_with_drugs | Patient_Age | Patient_Body_Mass_Index | Patient_Smoker | Patient_Rural_Urban | Patient_mental_condition | A | B | C | D | E | F | Z | Number_of_prev_cond | Survived_1_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22374 | 8 | 3333 | DX6 | 56 | 18.479385 | YES | URBAN | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0 |
1 | 18164 | 5 | 5740 | DX2 | 36 | 22.945566 | YES | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
2 | 6283 | 23 | 10446 | DX6 | 48 | 27.510027 | YES | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
3 | 5339 | 51 | 12011 | DX1 | 5 | 19.130976 | NO | URBAN | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
4 | 33012 | 0 | 12513 | NaN | 128 | 1.348400 | Cannot say | RURAL | Stable | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1 |
# A concise summary of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID_Patient_Care_Situation 23097 non-null int64
1 Diagnosed_Condition 23097 non-null int64
2 Patient_ID 23097 non-null int64
3 Treated_with_drugs 23084 non-null object
4 Patient_Age 23097 non-null int64
5 Patient_Body_Mass_Index 23097 non-null float64
6 Patient_Smoker 23097 non-null object
7 Patient_Rural_Urban 23097 non-null object
8 Patient_mental_condition 23097 non-null object
9 A 21862 non-null float64
10 B 21862 non-null float64
11 C 21862 non-null float64
12 D 21862 non-null float64
13 E 21862 non-null float64
14 F 21862 non-null float64
15 Z 21862 non-null float64
16 Number_of_prev_cond 21862 non-null float64
17 Survived_1_year 23097 non-null int64
dtypes: float64(9), int64(5), object(4)
memory usage: 3.2+ MB
Observations:
- There are 23097 observations and 18 columns (including the target).
- There are some missing values in the dataset.
Let's take a look at the distribution of our target variable to determine whether we have a balanced dataset.
sns.countplot(x='Survived_1_year', data=data)
plt.show()
There are 8000+ patients who did not survive after 1 year of treatment and 14000+ patients who survived. The ratio is roughly 1:1.7, so there is no severe class imbalance.
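To put exact numbers on the plot above, the class counts and proportions can be printed directly (a small addition to the original workflow):
# Exact counts and proportions of the target classes
print(data['Survived_1_year'].value_counts())
print(data['Survived_1_year'].value_counts(normalize=True).round(3))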
Next, we will perform EDA on our continuous variables
# getting only the numerical features
numeric_features = data.select_dtypes(include=[np.number]) # select_dtypes helps you to select data of particular types
numeric_features.columns
Index(['ID_Patient_Care_Situation', 'Diagnosed_Condition', 'Patient_ID',
'Patient_Age', 'Patient_Body_Mass_Index', 'A', 'B', 'C', 'D', 'E', 'F',
'Z', 'Number_of_prev_cond', 'Survived_1_year'],
dtype='object')
numeric_data = data[['Diagnosed_Condition', 'Patient_Age', 'Patient_Body_Mass_Index', 'Number_of_prev_cond', 'Survived_1_year']] # keeping the target variable for analysis purposes
numeric_data.head()
# ID_Patient_Care_Situation and Patient_ID are just identifiers, so we can ignore them for data analysis.
# Number_of_prev_cond is dependent on 7 columns - A, B, C, D, E, F, Z
Diagnosed_Condition | Patient_Age | Patient_Body_Mass_Index | Number_of_prev_cond | Survived_1_year | |
---|---|---|---|---|---|
0 | 8 | 56 | 18.479385 | 2.0 | 0 |
1 | 5 | 36 | 22.945566 | 1.0 | 1 |
2 | 23 | 48 | 27.510027 | 1.0 | 0 |
3 | 51 | 5 | 19.130976 | 1.0 | 1 |
4 | 0 | 128 | 1.348400 | 1.0 | 1 |
# Checking the null values in numerical columns
numeric_data.isnull().sum()
Diagnosed_Condition 0
Patient_Age 0
Patient_Body_Mass_Index 0
Number_of_prev_cond 1235
Survived_1_year 0
dtype: int64
We can see that 1235 values are missing from the 'Number_of_prev_cond' column. We will fill these with the mode.
Why the mode? As per the data description, this column's value depends on the seven columns 'A', 'B', 'C', 'D', 'E', 'F' and 'Z'.
These columns take values 0 or 1, so they are categorical columns.
The column 'Number_of_prev_cond' therefore takes discrete integer values between 0 and 7 and can be treated as a categorical column, so we can fill its missing values with the mode.
data['Number_of_prev_cond'] = data['Number_of_prev_cond'].fillna(data['Number_of_prev_cond'].mode()[0]) # filling the missing value of 'Number_of_prev_cond'
numeric_data['Number_of_prev_cond']=data['Number_of_prev_cond']
numeric_data.isnull().sum()
# mode() returns a Series, so we fill the null values with the value at index 0, which is the mode of the data
Diagnosed_Condition 0
Patient_Age 0
Patient_Body_Mass_Index 0
Number_of_prev_cond 0
Survived_1_year 0
dtype: int64
# Taking a look at the basic statistical description of the numerical columns
numeric_data.describe()
Diagnosed_Condition | Patient_Age | Patient_Body_Mass_Index | Number_of_prev_cond | Survived_1_year | |
---|---|---|---|---|---|
count | 23097.000000 | 23097.000000 | 23097.000000 | 23097.000000 | 23097.000000 |
mean | 26.413127 | 33.209768 | 23.454820 | 1.710352 | 0.632247 |
std | 15.030865 | 19.549882 | 3.807661 | 0.768216 | 0.482204 |
min | 0.000000 | 0.000000 | 1.089300 | 1.000000 | 0.000000 |
25% | 13.000000 | 16.000000 | 20.205550 | 1.000000 | 0.000000 |
50% | 26.000000 | 33.000000 | 23.386199 | 2.000000 | 1.000000 |
75% | 39.000000 | 50.000000 | 26.788154 | 2.000000 | 1.000000 |
max | 52.000000 | 149.000000 | 29.999579 | 5.000000 | 1.000000 |
Observations:
- describe() gives the minimum and maximum values for all the numerical columns; note that Patient_Age reaches 149, which is unusually high for an age.
- The mean and median (i.e. 50%) values of the numerical columns are close, so the distributions are not strongly skewed; this alone does not rule out outliers.
A good way to visualize the above information would be boxplots.
Box Plot
A box plot is a great way to get a visual sense of an entire range of data. It can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
Box plots divide data into quartiles. The 'box' shows the portion of the data set between the first and third quartiles.
The median is drawn somewhere inside the box, and lines extend to the most extreme non-outliers to finish the plot. Those lines are known as the 'whiskers'. Any outliers can be plotted beyond them.
With box plots you can answer how diverse or uniform your data might be. You can identify what is normal and what is extreme. Box plots help give a shape to your data that is broad without sacrificing the ability to look at any piece and ask more questions.
It displays the five-number summary of a set of data. The five-number summary is:
- minimum
- first quartile (Q1)
- median
- third quartile (Q3)
- maximum
Boxplot also helps you to check if there are any outliers in your data or not.
Read more about box plots and outliers here: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
for feature in numeric_data.drop('Survived_1_year', axis=1).columns:
    sns.boxplot(x='Survived_1_year', y=feature, data=numeric_data)
    plt.show()
We can also see there are some outliers in the columns 'Patient_Age', 'Patient_Body_Mass_Index' and 'Number_of_prev_cond'. There are various ways to treat outliers, as described in the article linked above. Here we have not treated any outliers.
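If you do want to quantify them, a simple IQR rule is one common approach. A minimal sketch (detection only; nothing is removed here):
# Count IQR-based outliers per numeric column (values beyond 1.5 * IQR from the quartiles)
for col in ['Patient_Age', 'Patient_Body_Mass_Index', 'Number_of_prev_cond']:
    q1, q3 = numeric_data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (numeric_data[col] < q1 - 1.5 * iqr) | (numeric_data[col] > q3 + 1.5 * iqr)
    print(col, mask.sum(), 'potential outliers')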
Following is a correlation analysis between the continuous variables, visualized using a heatmap.
numeric_data=numeric_data.drop(['Survived_1_year'], axis=1)
colormap = sns.diverging_palette(10, 220, as_cmap = True)
sns.heatmap(numeric_data.corr(),
cmap = colormap,
square = True,
annot = True)
plt.show()
Finally, let's look at the EDA for our categorical variables. Before analyzing the categorical columns further, we will treat the missing values.
data.isnull().sum()
ID_Patient_Care_Situation 0
Diagnosed_Condition 0
Patient_ID 0
Treated_with_drugs 13
Patient_Age 0
Patient_Body_Mass_Index 0
Patient_Smoker 0
Patient_Rural_Urban 0
Patient_mental_condition 0
A 1235
B 1235
C 1235
D 1235
E 1235
F 1235
Z 1235
Number_of_prev_cond 0
Survived_1_year 0
dtype: int64
Filling Missing values
data['Treated_with_drugs']=data['Treated_with_drugs'].fillna(data['Treated_with_drugs'].mode()[0])
data['A'].fillna(data['A'].mode()[0], inplace = True)
data['B'].fillna(data['B'].mode()[0], inplace = True)
data['C'].fillna(data['C'].mode()[0], inplace = True)
data['D'].fillna(data['D'].mode()[0], inplace = True)
data['E'].fillna(data['E'].mode()[0], inplace = True)
data['F'].fillna(data['F'].mode()[0], inplace = True)
data['Z'].fillna(data['Z'].mode()[0], inplace = True)
data.isnull().sum()
ID_Patient_Care_Situation 0
Diagnosed_Condition 0
Patient_ID 0
Treated_with_drugs 0
Patient_Age 0
Patient_Body_Mass_Index 0
Patient_Smoker 0
Patient_Rural_Urban 0
Patient_mental_condition 0
A 0
B 0
C 0
D 0
E 0
F 0
Z 0
Number_of_prev_cond 0
Survived_1_year 0
dtype: int64
EDA on Categorical Data
Let's perform Exploratory Data Analysis on the categorical data.
In the categorical_data variable we'll keep all the categorical features and remove the others.
Note that the features are not being removed from the main dataset - data. We'll select features with a feature selection technique later.
categorical_data = data.drop(numeric_data.columns, axis=1) # dropping the numerical columns from the dataframe 'data'
categorical_data.drop(['Patient_ID', 'ID_Patient_Care_Situation'], axis=1, inplace = True) # dropping the id columns from the dataframe 'categorical_data'
categorical_data.head() # Now we are left with categorical columns only. Take a look at the first five observations
Treated_with_drugs | Patient_Smoker | Patient_Rural_Urban | Patient_mental_condition | A | B | C | D | E | F | Z | Survived_1_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | DX6 | YES | URBAN | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
1 | DX2 | YES | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
2 | DX6 | YES | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
3 | DX1 | NO | URBAN | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
4 | DX6 | Cannot say | RURAL | Stable | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
Now we can look at the distributions of our categorical variables
categorical_data.nunique() # nunique() return you the number of unique values in each column/feature
Treated_with_drugs 32
Patient_Smoker 3
Patient_Rural_Urban 2
Patient_mental_condition 1
A 2
B 2
C 2
D 2
E 2
F 2
Z 2
Survived_1_year 2
dtype: int64
So the 'Treated_with_drugs' column has 32 unique values, while 'Patient_Smoker' has only 3 categories and 'Patient_mental_condition' has only 1.
# Visualization of categorical columns
for feature in ['Patient_Smoker', 'Patient_Rural_Urban', 'Patient_mental_condition']:
    sns.countplot(x=feature, hue='Survived_1_year', data=categorical_data)
    plt.show()
plt.figure(figsize=(15,5))
sns.countplot(x='Treated_with_drugs', hue='Survived_1_year', data=categorical_data)
plt.xticks(rotation=90)
plt.show()
Pre-Processing and Data Cleaning of Categorical Variables
We have discussed in our sessions that machine learning models accept only numerical data. 'Treated_with_drugs' is a categorical column whose values are combinations of one or more drugs. Let's split those combined drugs into individual drugs and create dummies for them.
drugs = data['Treated_with_drugs'].str.get_dummies(sep=' ') # split all the entries separated by space and create dummy variable
drugs.head()
DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 1 |
data = pd.concat([data, drugs], axis=1) # concat the two dataframes 'drugs' and 'data'
data = data.drop('Treated_with_drugs', axis=1) # dropping the column 'Treated_with_drugs' as its values are now split into different columns
data.head()
ID_Patient_Care_Situation | Diagnosed_Condition | Patient_ID | Patient_Age | Patient_Body_Mass_Index | Patient_Smoker | Patient_Rural_Urban | Patient_mental_condition | A | B | C | D | E | F | Z | Number_of_prev_cond | Survived_1_year | DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22374 | 8 | 3333 | 56 | 18.479385 | YES | URBAN | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 18164 | 5 | 5740 | 36 | 22.945566 | YES | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 6283 | 23 | 10446 | 48 | 27.510027 | YES | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 5339 | 51 | 12011 | 5 | 19.130976 | NO | URBAN | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 33012 | 0 | 12513 | 128 | 1.348400 | Cannot say | RURAL | Stable | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
'Patient_Smoker' is also a categorical column and we need to create dummies for it too. If you observe the data, the column 'Patient_Smoker' has a category 'Cannot say'.
data.Patient_Smoker.value_counts()
NO 13246
YES 9838
Cannot say 13
Name: Patient_Smoker, dtype: int64
There can be different ways to deal with the category 'Cannot say'. Here we will treat it as a missing value and fill those entries with the mode of the column.
data.loc[data['Patient_Smoker'] == 'Cannot say', 'Patient_Smoker'] = 'NO' # we already know 'NO' is the mode, so we directly replace 'Cannot say' with 'NO' (using .loc avoids chained-assignment issues)
The column 'Patient_mental_condition' has only one category, 'stable', so we can drop it: every observation has the same entry. This feature provides no useful information for predicting the target variable, so it is better to remove such features.
data.drop('Patient_mental_condition', axis = 1, inplace=True)
Now let's convert the remaining categorical columns to numerical using the get_dummies() function of pandas (i.e. one hot encoding).
data = pd.get_dummies(data, columns=['Patient_Smoker', 'Patient_Rural_Urban'])
data.head()
ID_Patient_Care_Situation | Diagnosed_Condition | Patient_ID | Patient_Age | Patient_Body_Mass_Index | A | B | C | D | E | F | Z | Number_of_prev_cond | Survived_1_year | DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | Patient_Smoker_NO | Patient_Smoker_YES | Patient_Rural_Urban_RURAL | Patient_Rural_Urban_URBAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22374 | 8 | 3333 | 56 | 18.479385 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
1 | 18164 | 5 | 5740 | 36 | 22.945566 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
2 | 6283 | 23 | 10446 | 48 | 27.510027 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
3 | 5339 | 51 | 12011 | 5 | 19.130976 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 33012 | 0 | 12513 | 128 | 1.348400 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID_Patient_Care_Situation 23097 non-null int64
1 Diagnosed_Condition 23097 non-null int64
2 Patient_ID 23097 non-null int64
3 Patient_Age 23097 non-null int64
4 Patient_Body_Mass_Index 23097 non-null float64
5 A 23097 non-null float64
6 B 23097 non-null float64
7 C 23097 non-null float64
8 D 23097 non-null float64
9 E 23097 non-null float64
10 F 23097 non-null float64
11 Z 23097 non-null float64
12 Number_of_prev_cond 23097 non-null float64
13 Survived_1_year 23097 non-null int64
14 DX1 23097 non-null int64
15 DX2 23097 non-null int64
16 DX3 23097 non-null int64
17 DX4 23097 non-null int64
18 DX5 23097 non-null int64
19 DX6 23097 non-null int64
20 Patient_Smoker_NO 23097 non-null uint8
21 Patient_Smoker_YES 23097 non-null uint8
22 Patient_Rural_Urban_RURAL 23097 non-null uint8
23 Patient_Rural_Urban_URBAN 23097 non-null uint8
dtypes: float64(9), int64(11), uint8(4)
memory usage: 3.6 MB
As you can see there are no missing data now and all the data are of numerical type.
There are two ID columns - 'ID_Patient_Care_Situation' and 'Patient_ID'. We can consider removing these columns if they are randomly generated values with no repeats, as we did for 'PassengerId' in the Titanic dataset: 'PassengerId' was randomly generated for each passenger and none of the ids were repeated. So let's check these two id columns.
print(data.ID_Patient_Care_Situation.nunique()) # nunique() gives you the count of unique values in the column
print(data.Patient_ID.nunique())
23097
10570
You can see there are 23097 unique 'ID_Patient_Care_Situation' values and 23097 observations in the dataset, so this column can be dropped.
There are only 10570 unique values in the column 'Patient_ID'. This means some patients visited the hospital two or more times, presumably because the same person fell ill more than once (with different illnesses) and came in for treatment, and the same patient can have a different care situation for each illness.
So the column 'Patient_ID' carries some useful information, and we will not drop it.
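As a quick sanity check of the repeat-visit claim, we can look at how often individual Patient_ID values recur (a small addition, not part of the original workflow):
# How many patients appear more than once, and the most frequent IDs
print((data['Patient_ID'].value_counts() > 1).sum(), 'patients appear more than once')
print(data['Patient_ID'].value_counts().head())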
data.drop(['ID_Patient_Care_Situation'], axis =1, inplace=True)
Prepare Train/Test Data
- Separating the input and output variables
Before building any machine learning model, we always separate the input variables and output variables. Input variables are those quantities whose values are varied naturally in an experiment, whereas the output variable is the one whose values depend on the input variables. Input variables are therefore also known as independent variables, as their values do not depend on any other quantity, and output variables are known as dependent variables, as their values depend on the input variables. In this data, for example, whether a person survives after one year depends on variables such as age, diagnosis, body mass index, drugs used, etc.
By convention, input variables are represented with 'X' and output variables with 'y'.
X = data.drop('Survived_1_year',axis = 1)
y = data['Survived_1_year']
- Train/test split
We want to check the performance of the model we build. For this purpose, we always split the given data (both input and output) into a training set, used to train the model, and a test set, used to check how accurately the model predicts outcomes.
For this we use the 'train_test_split' function from the 'sklearn.model_selection' module.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
Model Building
We have seen from Exploratory Data Analysis that this is a classification problem: the target column 'Survived_1_year' has two values, 0 - the patient did not survive after one year of treatment, and 1 - the patient survived. So we can use classification models for this problem. Some classification models are Logistic Regression, Random Forest Classifier, Decision Tree Classifier, etc. Here we use two of them: Logistic Regression and Random Forest Classifier.
1. Logistic Regression Model
model = LogisticRegression(max_iter = 1000) # allow up to 1000 iterations; this helps avoid a convergence warning
model.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=1000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
pred = model.predict(X_test)
Evaluation:
print(f1_score(y_test,pred))
0.7866774979691308
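For a fuller picture than the single F1 number, one can also print the confusion matrix and per-class metrics. This is a complementary view, not part of the original workflow:
from sklearn.metrics import classification_report, confusion_matrix
# Raw confusion matrix and per-class precision/recall/F1 for the logistic regression predictions
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))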
The f1 score of the Logistic Regression model is about 78.7%. Let's try a Random Forest Classifier and see if we get a better result.
2. Random Forest
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
forest.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=5, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
n_jobs=None, oob_score=False, random_state=1, verbose=0,
warm_start=False)
Evaluating on X_test
y_pred = forest.predict(X_test)
fscore = f1_score(y_test ,y_pred)
fscore
0.8220447284345048
The f1 score by Random Forest classifier is 82.2% which is better than logistic regression.
There are quite a lot of features used to train the model. We can try some feature selection techniques and check whether the performance of the Random Forest is affected, i.e. whether the reduction in model complexity comes with only a minimal effect on performance.
We have used the Boruta feature selector. You can use other techniques too and see whether they give a better result than Boruta, for example the SelectFromModel utility sketched below.
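As one such alternative, sklearn's SelectFromModel (already imported above) can keep only the features whose importance in the fitted forest exceeds a threshold. A minimal sketch, assuming the forest fitted above; the selected set will generally differ from Boruta's:
# Keep features whose importance exceeds the mean importance of the fitted forest
sfm = SelectFromModel(forest, threshold='mean', prefit=True)
X_train_sfm = sfm.transform(X_train)
X_test_sfm = sfm.transform(X_test)
print(X_train_sfm.shape[1], 'features kept out of', X_train.shape[1])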
3. Random Forest and Boruta
Note: Before proceeding, please revise the feature selection notebook or session if you don't remember what 'Boruta' is.
Boruta is an all-relevant feature selection method. Unlike techniques that select a small set of features to minimize the error, Boruta tries to capture all the features in your dataset that are important and interesting with respect to the target variable.
Boruta uses random forest by default, although it also works with other algorithms such as LightGBM, XGBoost, etc.
You can install Boruta with the command
pip install Boruta
!pip install Boruta
Collecting Boruta
  Downloading https://files.pythonhosted.org/packages/b2/11/583f4eac99d802c79af9217e1eff56027742a69e6c866b295cce6a5a8fc2/Boruta-0.3-py3-none-any.whl (56kB)
Requirement already satisfied: scikit-learn>=0.17.1 in /usr/local/lib/python3.6/dist-packages (from Boruta) (0.22.2.post1)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from Boruta) (1.18.5)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.6/dist-packages (from Boruta) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.17.1->Boruta) (0.16.0)
Installing collected packages: Boruta
Successfully installed Boruta-0.3
from boruta import BorutaPy
boruta_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=1) # initialize the boruta selector
boruta_selector.fit(np.array(X_train), np.array(y_train)) # fitting the boruta selector to get all relevant features. NOTE: BorutaPy accepts numpy arrays only.
Iteration: 1 / 100
Confirmed: 0
Tentative: 22
Rejected: 0
... (iterations 2-7 identical: 0 confirmed, 22 tentative, 0 rejected) ...
Iteration: 8 / 100
Confirmed: 16
Tentative: 3
Rejected: 3
... (iterations 9-31 identical: 16 confirmed, 3 tentative, 3 rejected) ...
Iteration: 32 / 100
Confirmed: 17
Tentative: 2
Rejected: 3
... (iterations 33-99 identical: 17 confirmed, 2 tentative, 3 rejected) ...
BorutaPy finished running.
Iteration: 100 / 100
Confirmed: 17
Tentative: 1
Rejected: 3
BorutaPy(alpha=0.05,
estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None, criterion='gini',
max_depth=5, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=123, n_jobs=None,
oob_score=False,
random_state=RandomState(MT19937) at 0x7F0EBCEFA990,
verbose=0, warm_start=False),
max_iter=100, n_estimators='auto', perc=100,
random_state=RandomState(MT19937) at 0x7F0EBCEFA990, two_step=True,
verbose=2)
print("Selected Features: ", boruta_selector.support_) # check selected features
print("Ranking: ",boruta_selector.ranking_) # check ranking of features
print("No. of significant features: ", boruta_selector.n_features_)
Selected Features: [ True True True True True False True False False False True True
True True True True True True True True True]
Ranking: [1 1 1 1 1 2 1 3 4 5 1 1 1 1 1 1 1 1 1 1 1]
No. of significant features: 17
So Boruta has selected 17 relevant features. Let's visualise them better in the form of a table.
Displaying features rank wise
selected_rfe_features = pd.DataFrame({'Feature':list(X_train.columns),
'Ranking':boruta_selector.ranking_})
selected_rfe_features.sort_values(by='Ranking')
Feature | Ranking | |
---|---|---|
0 | Diagnosed_Condition | 1 |
19 | Patient_Smoker_YES | 1 |
18 | Patient_Smoker_NO | 1 |
17 | DX6 | 1 |
16 | DX5 | 1 |
15 | DX4 | 1 |
14 | DX3 | 1 |
13 | DX2 | 1 |
12 | DX1 | 1 |
11 | Number_of_prev_cond | 1 |
21 | Patient_Rural_Urban_URBAN | 1 |
7 | D | 1 |
5 | B | 1 |
4 | A | 1 |
3 | Patient_Body_Mass_Index | 1 |
2 | Patient_Age | 1 |
20 | Patient_Rural_Urban_RURAL | 1 |
1 | Patient_ID | 2 |
6 | C | 3 |
8 | E | 4 |
9 | F | 5 |
10 | Z | 6 |
Create a new subset of the data with only the selected features
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))
Build the model with selected features
# Create a new random forest classifier for the most important features
rf_important = RandomForestClassifier(random_state=1, n_estimators=1000, n_jobs = -1)
# Train the new classifier on the new dataset containing the most important features
rf_important.fit(X_important_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
n_jobs=-1, oob_score=False, random_state=1, verbose=0,
warm_start=False)
Evaluation
y_important_pred = rf_important.predict(X_important_test)
rf_imp_fscore = f1_score(y_test, y_important_pred)
print(rf_imp_fscore)
0.8578215134034612
Recall that the Random Forest Classifier with all the features gave an f1 score of 82.2%, while after selecting the relevant features it gives an f1 score of about 85.8%. That is a good improvement in the model's performance, and the model complexity is reduced as well.
So far we have chosen some of the parameters, such as max_depth and n_estimators, more or less arbitrarily, and there are many other parameters of the Random Forest model. If you remember, we discussed hyperparameter tuning in the 'Performance Evaluation' session. Hyperparameter tuning helps you choose an optimal set of parameters for a model, so let's see whether it further improves the performance.
Grid Search helps you find the optimal parameters for a model.
Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Create a small parameter grid to search over
param_grid = {
'bootstrap': [True, False],
'max_depth': [5, 10, 15],
'n_estimators': [500, 1000]}
rf = RandomForestClassifier(random_state = 1)
# Grid search cv
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
cv = 2, n_jobs = -1, verbose = 2)
grid_search.fit(X_important_train, y_train)
Fitting 2 folds for each of 12 candidates, totalling 24 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 24 out of 24 | elapsed: 2.2min finished
GridSearchCV(cv=2, error_score=nan,
estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False, random_state=1,
verbose=0, warm_start=False),
iid='deprecated', n_jobs=-1,
param_grid={'bootstrap': [True, False], 'max_depth': [5, 10, 15],
'n_estimators': [500, 1000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)
grid_search.best_params_
{'bootstrap': True, 'max_depth': 15, 'n_estimators': 500}
pred = grid_search.predict(X_important_test)
f1_score(y_test, pred)
0.8657267539442766
As you can see, the f1 score improved from about 85.8% to 86.6% by selecting good parameters with the help of hyperparameter tuning (GridSearchCV).
Conclusion
- It is clearly observable how the f1 score increased from logistic regression to random forest, and from random forest on all features to random forest on the features selected with Boruta.
- The f1 score increased again with hyperparameter tuning.
- This is just one approach to solving this problem; there can be many others.
- We could also try standardizing or normalizing the data, or other algorithms, and so on.
- You should try standardizing or normalizing the data and observe the difference in f1 score.
- Also try a Decision Tree (a sketch of both experiments follows this list).
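As a starting point for those experiments, here is a minimal sketch (the max_depth value is an arbitrary choice for illustration): it standardizes the inputs and re-fits logistic regression, and separately fits a plain decision tree, which does not need scaling.
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
# Standardize the inputs (fit the scaler on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr_scaled = LogisticRegression(max_iter=1000)
lr_scaled.fit(X_train_scaled, y_train)
print('Logistic regression on scaled data:', f1_score(y_test, lr_scaled.predict(X_test_scaled)))
# A plain decision tree as another baseline (trees are insensitive to feature scaling)
tree = DecisionTreeClassifier(max_depth=5, random_state=1)
tree.fit(X_train, y_train)
print('Decision tree:', f1_score(y_test, tree.predict(X_test)))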
Now let's predict the output for the new test data
We will predict the output for the new test data using the Random Forest model with the features selected by Boruta and the best parameters found during hyperparameter tuning, because that model gave the highest f1 score on X_test (also called the validation data). We can use the grid_search variable directly for prediction, as it references the trained model.
New Test Data
Tasks to be performed:
- Load the new test data
- If missing values are there then fill the missing values with the same techniques that were used for training dataset
- Convert categorical column to numerical
- Predict the output
- Download the predicted values in csv
Why do we need to apply the same missing-value filling, data cleaning and preprocessing to the new test data as we did to the training and evaluation data?
Ans: Because our model has been trained on data in a certain format, and if we don't provide the test data in the same format, the model will give erroneous predictions and its f1 score will decrease. Also, if the model was built on n features, you must supply the same n features when predicting on new test data; otherwise the ML model will throw a ValueError saying something like 'number of features given x; expecting n'. Not confident about these statements? As a data scientist you should always run the experiment and observe the results.
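One hypothetical safety net, to be run only after the preprocessing steps below have produced test_new_data in its final form, is to align the new data's columns to the training feature matrix X: any dummy column missing from the new data is added as zeros and any extra column is dropped, so the model always sees the same features in the same order. In this notebook the preprocessing is mirrored manually, so this line is illustrative only.
# Hypothetical: align new-data columns to the training features (run after the preprocessing below)
test_aligned = test_new_data.reindex(columns=X.columns, fill_value=0)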
# Load the data
test_new_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Testing_set_begs.csv")
# take a look how the new test data look like
test_new_data.head()
ID_Patient_Care_Situation | Diagnosed_Condition | Patient_ID | Treated_with_drugs | Patient_Age | Patient_Body_Mass_Index | Patient_Smoker | Patient_Rural_Urban | Patient_mental_condition | A | B | C | D | E | F | Z | Number_of_prev_cond | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19150 | 40 | 3709 | DX3 | 16 | 29.443894 | NO | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 |
1 | 23216 | 52 | 986 | DX6 | 24 | 26.836321 | NO | URBAN | Stable | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
2 | 11890 | 50 | 11821 | DX4 DX5 | 63 | 25.523280 | NO | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 |
3 | 7149 | 32 | 3292 | DX6 | 42 | 27.171155 | NO | URBAN | Stable | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 |
4 | 22845 | 20 | 9959 | DX3 | 50 | 25.556192 | NO | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Can you observe the new test data here? It is in the same format as our training data before performing any cleaning and preprocessing.
Checking missing values
test_new_data.isnull().sum()
ID_Patient_Care_Situation 0
Diagnosed_Condition 0
Patient_ID 0
Treated_with_drugs 0
Patient_Age 0
Patient_Body_Mass_Index 0
Patient_Smoker 0
Patient_Rural_Urban 0
Patient_mental_condition 0
A 0
B 0
C 0
D 0
E 0
F 0
Z 0
Number_of_prev_cond 0
dtype: int64
The new test data has no missing values, so no missing-value treatment is required.
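Had there been gaps, we would fill them with the same statistics learned from the training data, mirroring the earlier preprocessing. A hedged sketch for the previous-condition columns (fill values for columns already dropped from data, such as 'Treated_with_drugs', would need to be saved at training time):
# Hypothetical fallback: fill gaps in the test data using the training-data modes
for col in ['A', 'B', 'C', 'D', 'E', 'F', 'Z', 'Number_of_prev_cond']:
    if test_new_data[col].isnull().any():
        test_new_data[col] = test_new_data[col].fillna(data[col].mode()[0])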
Preprocessing and data cleaning: same as we did on training data
drugs = test_new_data['Treated_with_drugs'].str.get_dummies(sep=' ') # split all the entries
drugs.head()
DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 1 | 1 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 1 | 0 | 0 | 0 |
test_new_data = pd.concat([test_new_data, drugs], axis=1) # concat the two dataframes 'drugs' and 'data'
test_new_data = test_new_data.drop('Treated_with_drugs', axis=1) # dropping the column 'Treated_with_drugs' as its values are now split into different columns
test_new_data.head()
ID_Patient_Care_Situation | Diagnosed_Condition | Patient_ID | Patient_Age | Patient_Body_Mass_Index | Patient_Smoker | Patient_Rural_Urban | Patient_mental_condition | A | B | C | D | E | F | Z | Number_of_prev_cond | DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19150 | 40 | 3709 | 16 | 29.443894 | NO | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 23216 | 52 | 986 | 24 | 26.836321 | NO | URBAN | Stable | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 11890 | 50 | 11821 | 63 | 25.523280 | NO | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 0 | 1 | 1 | 0 |
3 | 7149 | 32 | 3292 | 42 | 27.171155 | NO | URBAN | Stable | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 22845 | 20 | 9959 | 50 | 25.556192 | NO | RURAL | Stable | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 |
test_new_data.Patient_Smoker.value_counts()
NO 5333
YES 3970
Name: Patient_Smoker, dtype: int64
This data does not contain the value 'Cannot say' in the 'Patient_Smoker' column.
The column 'Patient_mental_condition' has only one category, 'stable', so we can drop it, just as we did for the training data.
test_new_data.drop('Patient_mental_condition', axis = 1, inplace=True)
Now let's convert the categorical columns to numerical using the get_dummies() function of pandas (i.e. one hot encoding).
test_new_data = pd.get_dummies(test_new_data, columns=['Patient_Smoker', 'Patient_Rural_Urban'])
test_new_data.head()
ID_Patient_Care_Situation | Diagnosed_Condition | Patient_ID | Patient_Age | Patient_Body_Mass_Index | A | B | C | D | E | F | Z | Number_of_prev_cond | DX1 | DX2 | DX3 | DX4 | DX5 | DX6 | Patient_Smoker_NO | Patient_Smoker_YES | Patient_Rural_Urban_RURAL | Patient_Rural_Urban_URBAN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19150 | 40 | 3709 | 16 | 29.443894 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 23216 | 52 | 986 | 24 | 26.836321 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
2 | 11890 | 50 | 11821 | 63 | 25.523280 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
3 | 7149 | 32 | 3292 | 42 | 27.171155 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
4 | 22845 | 20 | 9959 | 50 | 25.556192 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
test_new_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9303 entries, 0 to 9302
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID_Patient_Care_Situation 9303 non-null int64
1 Diagnosed_Condition 9303 non-null int64
2 Patient_ID 9303 non-null int64
3 Patient_Age 9303 non-null int64
4 Patient_Body_Mass_Index 9303 non-null float64
5 A 9303 non-null float64
6 B 9303 non-null float64
7 C 9303 non-null float64
8 D 9303 non-null float64
9 E 9303 non-null float64
10 F 9303 non-null float64
11 Z 9303 non-null float64
12 Number_of_prev_cond 9303 non-null float64
13 DX1 9303 non-null int64
14 DX2 9303 non-null int64
15 DX3 9303 non-null int64
16 DX4 9303 non-null int64
17 DX5 9303 non-null int64
18 DX6 9303 non-null int64
19 Patient_Smoker_NO 9303 non-null uint8
20 Patient_Smoker_YES 9303 non-null uint8
21 Patient_Rural_Urban_RURAL 9303 non-null uint8
22 Patient_Rural_Urban_URBAN 9303 non-null uint8
dtypes: float64(9), int64(10), uint8(4)
memory usage: 1.4 MB
As you can see there are no missing data now and all the data are of numerical type.
The column 'ID_Patient_Care_Situation' is an ID. We can remove this column here too, as we did for the training dataset.
test_new_data.drop(['ID_Patient_Care_Situation'], axis =1, inplace=True)
test_new_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9303 entries, 0 to 9302
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Diagnosed_Condition 9303 non-null int64
1 Patient_Age 9303 non-null int64
2 Patient_Body_Mass_Index 9303 non-null float64
3 A 9303 non-null float64
4 B 9303 non-null float64
5 C 9303 non-null float64
6 D 9303 non-null float64
7 E 9303 non-null float64
8 F 9303 non-null float64
9 Z 9303 non-null float64
10 Number_of_prev_cond 9303 non-null float64
11 DX1 9303 non-null int64
12 DX2 9303 non-null int64
13 DX3 9303 non-null int64
14 DX4 9303 non-null int64
15 DX5 9303 non-null int64
16 DX6 9303 non-null int64
17 Patient_Smoker_NO 9303 non-null uint8
18 Patient_Smoker_YES 9303 non-null uint8
19 Patient_Rural_Urban_RURAL 9303 non-null uint8
20 Patient_Rural_Urban_URBAN 9303 non-null uint8
dtypes: float64(9), int64(8), uint8(4)
memory usage: 1.2 MB
Prediction
imp_test_features = boruta_selector.transform(np.array(test_new_data))
prediction = grid_search.predict(imp_test_features)
We have the predicted output stored in the variable 'prediction'. Let's download this prediction as a csv file as shown below.
Download the prediction file in csv
res = pd.DataFrame(prediction)
res.index = test_new_data.index
res.columns = ["prediction"]
from google.colab import files
res.to_csv('prediction_results_HP.csv')
files.download('prediction_results_HP.csv')