# Predict if a Patient will Survive after 1 year of Treatment - Beginners

Contribute: Found a typo? Or any other change that could improve the notebook tutorial? Please consider sending us a pull request in the public repo of the notebook here.

# Assignment - 3: Solution

#### (Absolute Beginners and Beginners)

This is the third assignment of the DPhi 5 Week Data Science Bootcamp and revolves around Data Science Problem solving.

# Agenda

• Problem Statement
  • Objective
  • Dataset & Data Description
• Solution Steps:
  • Understand your data: Data Analysis and Visualizations (EDA)
  • Pre-process the data
  • Prepare train and test datasets
  • Choose a model
  • Train your model
  • Evaluate the model (F1-score calculation)
  • Optimize: repeat steps 4 - 6
• Conclusion
• Prediction on New Test Data:
  • Load the new test data
  • Fill missing values, if any
  • Preprocess and clean the data
  • Predict the target values

## Problem Statement

### Objective

A hospital in the province of Greenland has been trying to improve its care conditions by looking at the historic survival of its patients. They tried analysing their data but could not identify the main factors leading to high survival rates.

You are the best data scientist in Greenland, and they have hired you to solve this problem. You are now responsible for developing a model that will predict the chances of survival of a patient after 1 year of treatment (Survived_1_year).

### Dataset & Data Description

The dataset contains the patient records collected from a hospital in Greenland. The 'Survived_1_year' column is the target variable, which has binary entries (0 or 1).

• Survived_1_year == 0, implies that the patient did not survive after 1 year of treatment
• Survived_1_year == 1, implies that the patient survived after 1 year of treatment

To load the dataset in your Jupyter notebook, use the commands below:

``````
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')
``````

#### Data Description:

• ID_Patient_Care_Situation: Care situation of a patient during treatment
• Diagnosed_Condition: The diagnosed condition of the patient
• ID_Patient: Patient identifier number
• Treatment_with_drugs: Class of drugs used during treatment
• Survived_1_year: If the patient survived after one year (0 means did not survive; 1 means survived)
• Patient_Age: Age of the patient
• Patient_Body_Mass_Index: A calculated value based on the patient's weight, height, etc.
• Patient_Smoker: If the patient was a smoker or not
• Patient_Rural_Urban: If the patient stayed in Rural or Urban part of the country
• Previous_Condition: Condition of the patient before the start of the treatment. (This variable is split into 8 columns - A, B, C, D, E, F, Z and Number_of_prev_cond. A, B, C, D, E, F and Z are the previous conditions of the patient. For a given patient, if the entry in column A is 1, it means that one of the patient's previous conditions was A; if the patient didn't have that condition, the entry is 0, and the same holds for the other condition columns. If a patient had previous conditions A and C, columns A and C will have entries 1 and 1 respectively, while columns B, D, E, F and Z will all have entries 0. The column Number_of_prev_cond will then have the entry 2, i.e. 1 + 0 + 1 + 0 + 0 + 0 + 0 = 2 in this case.)
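Since Number_of_prev_cond is just the row-wise sum of the seven condition flags, you can verify (or recompute) it directly once the data is loaded. A minimal sketch, assuming the dataframe is called `data` and the column names match the description above:

``````
# Recompute the number of previous conditions from the one-hot columns
# and compare it with the Number_of_prev_cond column provided in the data.
cond_cols = ['A', 'B', 'C', 'D', 'E', 'F', 'Z']
recomputed = data[cond_cols].sum(axis=1)

# Rows where the provided count disagrees with the recomputed one
# (rows with missing values in the condition columns will also show up here).
mismatches = data[recomputed != data['Number_of_prev_cond']]
print(mismatches.shape[0], "rows differ from the recomputed count")
``````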

## Solution steps:

1. Understand your data: EDA
2. Pre-process the data
3. Prepare train and test datasets
4. Choose a model
5. Train your model
6. Evaluate the model (F1-score calculation)
7. Optimize: repeat steps 4 - 6

``````
import pandas as pd   # package for data analysis
import numpy as np    # package for numerical computations

# libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to ignore warnings
import warnings
warnings.filterwarnings('ignore')

# For Preprocessing, ML models and Evaluation
from sklearn.model_selection import train_test_split   # To split the dataset into train and test set

from sklearn.linear_model import LogisticRegression     # Logistic regression model

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder    # for converting categorical to numerical

from sklearn.metrics import f1_score    # for model evaluation
``````

``````
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')
``````

### EDA

Primary screenings:

1. Take a look at the data, its columns and the kind of values they contain: data.head()
2. Step back a bit and look at a column-level overview: counts, types, NULL counts: data.info()
``````
# Take a look at the first five observations
data.head()
``````
``````
ID_Patient_Care_Situation Diagnosed_Condition Patient_ID Treated_with_drugs Patient_Age Patient_Body_Mass_Index Patient_Smoker Patient_Rural_Urban Patient_mental_condition A B C D E F Z Number_of_prev_cond Survived_1_year
0 22374 8 3333 DX6 56 18.479385 YES URBAN Stable 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0
1 18164 5 5740 DX2 36 22.945566 YES RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1
2 6283 23 10446 DX6 48 27.510027 YES RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0
3 5339 51 12011 DX1 5 19.130976 NO URBAN Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1
4 33012 0 12513 NaN 128 1.348400 Cannot say RURAL Stable 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1
``````
``````
# A concise summary of the data
data.info()
``````
``````
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 18 columns):
#   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
0   ID_Patient_Care_Situation  23097 non-null  int64
1   Diagnosed_Condition        23097 non-null  int64
2   Patient_ID                 23097 non-null  int64
3   Treated_with_drugs         23084 non-null  object
4   Patient_Age                23097 non-null  int64
5   Patient_Body_Mass_Index    23097 non-null  float64
6   Patient_Smoker             23097 non-null  object
7   Patient_Rural_Urban        23097 non-null  object
8   Patient_mental_condition   23097 non-null  object
9   A                          21862 non-null  float64
10  B                          21862 non-null  float64
11  C                          21862 non-null  float64
12  D                          21862 non-null  float64
13  E                          21862 non-null  float64
14  F                          21862 non-null  float64
15  Z                          21862 non-null  float64
16  Number_of_prev_cond        21862 non-null  float64
17  Survived_1_year            23097 non-null  int64
dtypes: float64(9), int64(5), object(4)
memory usage: 3.2+ MB
``````

Observations:

1. There are 23097 observations across 18 columns.
2. There are some missing values in the dataset.

Let's take a look at the distribution of our target variable to determine whether we have a balanced dataset.

``````
sns.countplot(x='Survived_1_year', data=data)
plt.show()
``````

There are 8,000+ patients who did not survive after 1 year of treatment and 14,000+ patients who did. The ratio is roughly 1:2, so there is no severe class imbalance.
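To put exact numbers on this, you can count the classes directly; a quick check along these lines (assuming the dataframe is still called `data`):

``````
# Absolute and relative frequency of each class in the target column
print(data['Survived_1_year'].value_counts())
print(data['Survived_1_year'].value_counts(normalize=True).round(3))
``````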

Next, we will perform EDA on our continuous variables

``````
# getting only the numerical features
numeric_features = data.select_dtypes(include=[np.number])    # select_dtypes helps you to select data of particular types
numeric_features.columns
``````
``````
Index(['ID_Patient_Care_Situation', 'Diagnosed_Condition', 'Patient_ID',
'Patient_Age', 'Patient_Body_Mass_Index', 'A', 'B', 'C', 'D', 'E', 'F',
'Z', 'Number_of_prev_cond', 'Survived_1_year'],
dtype='object')
``````
``````
# ID_Patient_Care_Situation and Patient_ID are just IDs, so we can ignore them for data analysis.
# Number_of_prev_cond is dependent on the 7 columns - A, B, C, D, E, F, Z.
numeric_data = data[['Diagnosed_Condition', 'Patient_Age', 'Patient_Body_Mass_Index', 'Number_of_prev_cond', 'Survived_1_year']]  # keeping the target variable for analysis purposes
numeric_data.head()
``````
``````
Diagnosed_Condition Patient_Age Patient_Body_Mass_Index Number_of_prev_cond Survived_1_year
0 8 56 18.479385 2.0 0
1 5 36 22.945566 1.0 1
2 23 48 27.510027 1.0 0
3 51 5 19.130976 1.0 1
4 0 128 1.348400 1.0 1
``````
``````
# Checking the null values in numerical columns
numeric_data.isnull().sum()
``````
``````
Diagnosed_Condition           0
Patient_Age                   0
Patient_Body_Mass_Index       0
Number_of_prev_cond        1235
Survived_1_year               0
dtype: int64
``````

We can see that 1235 values are missing from the 'Number_of_prev_cond' column. We will fill these with the mode.

Why mode? As per the data description, this column's value is dependent on the seven columns - 'A', 'B', 'C', 'D', 'E', 'F', 'Z'.

These columns have values of either 0 or 1. Hence these seven columns are categorical columns.

So the column 'Number_of_prev_cond' holds small discrete integer counts and can be treated as a categorical column, since it takes only a handful of distinct values. Hence we can fill the missing values with the mode.

``````
data['Number_of_prev_cond'] = data['Number_of_prev_cond'].fillna(data['Number_of_prev_cond'].mode()[0])  # filling the missing value of 'Number_of_prev_cond'

numeric_data['Number_of_prev_cond']=data['Number_of_prev_cond']
numeric_data.isnull().sum()

# The returned object by using mode() is a series so we are filling the null value with the value at 0th index ( which gives us the mode of the data)
``````
``````
Diagnosed_Condition        0
Patient_Age                0
Patient_Body_Mass_Index    0
Number_of_prev_cond        0
Survived_1_year            0
dtype: int64
``````
``````
# Taking a look at the basic statistical description of the numerical columns
numeric_data.describe()
``````
``````
Diagnosed_Condition Patient_Age Patient_Body_Mass_Index Number_of_prev_cond Survived_1_year
count 23097.000000 23097.000000 23097.000000 23097.000000 23097.000000
mean 26.413127 33.209768 23.454820 1.710352 0.632247
std 15.030865 19.549882 3.807661 0.768216 0.482204
min 0.000000 0.000000 1.089300 1.000000 0.000000
25% 13.000000 16.000000 20.205550 1.000000 0.000000
50% 26.000000 33.000000 23.386199 2.000000 1.000000
75% 39.000000 50.000000 26.788154 2.000000 1.000000
max 52.000000 149.000000 29.999579 5.000000 1.000000
``````

Observations:

1. describe() shows the minimum and maximum values of every numerical column (note, for example, the maximum Patient_Age of 149).
2. The mean and median (i.e. the 50% value) of each numerical column are close to each other, so the distributions are not heavily skewed.

A good way to visualize the above information would be boxplots.

#### Box Plot

A box plot is a great way to get a visual sense of an entire range of data. It can tell you about your outliers and what their values are. It can also tell you whether your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

Box plots divide data into quartiles. The "box" shows the part of the data set between the first and third quartiles.

The median is drawn somewhere inside the box, and the most extreme non-outliers finish the plot. Those lines are known as the "whiskers". If there are any outliers, they can be plotted as well.

With box plots you can answer how diverse or uniform your data might be. You can identify what is normal and what is extreme. Box plots give a shape to your data that is broad, without sacrificing the ability to look at any piece and ask more questions.

It displays the five-number summary of a set of data. The five-number summary is:

• minimum
• first quartile (Q1)
• median
• third quartile (Q3)
• maximum

Boxplot also helps you to check if there are any outliers in your data or not.

For reading about boxplot and outliers: https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
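Before plotting, you can also read the five-number summary directly off the data; a small sketch using the numeric columns prepared above:

``````
# Five-number summary (min, Q1, median, Q3, max) of each numeric feature
five_num = numeric_data.drop('Survived_1_year', axis=1).quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_num)
``````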

``````
# Boxplot of each numeric feature, split by the target variable
for feature in numeric_data.drop('Survived_1_year', axis = 1).columns:
    sns.boxplot(x='Survived_1_year', y=feature, data=numeric_data)
    plt.show()
``````

We can also see there are some outliers in the columns 'Patient_Age', 'Patient_Body_Mass_Index', and 'Number_of_prev_cond'. There are various ways to treat outliers, as described in the article https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba. Here we have not treated the outliers.

The following is a correlation analysis between the continuous variables, visualized using a heatmap.

``````
numeric_data = numeric_data.drop(['Survived_1_year'], axis=1)
colormap = sns.diverging_palette(10, 220, as_cmap = True)
sns.heatmap(numeric_data.corr(),
            cmap = colormap,
            square = True,
            annot = True)
plt.show()
``````

Finally, we must look at the EDA for our categorical variables. However, before analyzing the categorical columns further, we will treat the missing values

``````
data.isnull().sum()
``````
``````
ID_Patient_Care_Situation       0
Diagnosed_Condition             0
Patient_ID                      0
Treated_with_drugs             13
Patient_Age                     0
Patient_Body_Mass_Index         0
Patient_Smoker                  0
Patient_Rural_Urban             0
Patient_mental_condition        0
A                            1235
B                            1235
C                            1235
D                            1235
E                            1235
F                            1235
Z                            1235
Number_of_prev_cond             0
Survived_1_year                 0
dtype: int64
``````

#### Filling Missing values

``````
data['Treated_with_drugs']=data['Treated_with_drugs'].fillna(data['Treated_with_drugs'].mode()[0])
``````
``````
data['A'].fillna(data['A'].mode()[0], inplace = True)
data['B'].fillna(data['B'].mode()[0], inplace = True)
data['C'].fillna(data['C'].mode()[0], inplace = True)
data['D'].fillna(data['D'].mode()[0], inplace = True)
data['E'].fillna(data['E'].mode()[0], inplace = True)
data['F'].fillna(data['F'].mode()[0], inplace = True)
data['Z'].fillna(data['Z'].mode()[0], inplace = True)
``````
``````
data.isnull().sum()
``````
``````
ID_Patient_Care_Situation    0
Diagnosed_Condition          0
Patient_ID                   0
Treated_with_drugs           0
Patient_Age                  0
Patient_Body_Mass_Index      0
Patient_Smoker               0
Patient_Rural_Urban          0
Patient_mental_condition     0
A                            0
B                            0
C                            0
D                            0
E                            0
F                            0
Z                            0
Number_of_prev_cond          0
Survived_1_year              0
dtype: int64
``````

### EDA on Categorical Data

Let's perform Exploratory Data Analysis on the categorical data.
In the categorical_data variable we'll keep all the categorical features and remove the others.

Note that the features are not being removed from the main dataset - data. We'll select features with a feature selection technique later.

``````
categorical_data = data.drop(numeric_data.columns, axis=1)    # dropping the numerical columns from the dataframe 'data'
categorical_data.drop(['Patient_ID', 'ID_Patient_Care_Situation'], axis=1, inplace = True)    # dropping the id columns from the dataframe 'categorical_data'
categorical_data.head()    # Now we are left with categorical columns only. Take a look at the first five observations
``````
``````
Treated_with_drugs Patient_Smoker Patient_Rural_Urban Patient_mental_condition A B C D E F Z Survived_1_year
0 DX6 YES URBAN Stable 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0
1 DX2 YES RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
2 DX6 YES RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0
3 DX1 NO URBAN Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1
4 DX6 Cannot say RURAL Stable 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1
``````

Now we can look at the distributions of our categorical variables

``````
categorical_data.nunique()   # nunique() returns the number of unique values in each column/feature
``````
``````
Treated_with_drugs          32
Patient_Smoker               3
Patient_Rural_Urban          2
Patient_mental_condition     1
A                            2
B                            2
C                            2
D                            2
E                            2
F                            2
Z                            2
Survived_1_year              2
dtype: int64
``````

So the 'Treated_with_drugs' column has 32 unique values, while 'Patient_Smoker' has only 3 categories and 'Patient_mental_condition' has only 1 category.

``````
# Visualization of categorical columns
for feature in ['Patient_Smoker', 'Patient_Rural_Urban', 'Patient_mental_condition']:
    sns.countplot(x=feature, hue='Survived_1_year', data=categorical_data)
    plt.show()

plt.figure(figsize=(15,5))
sns.countplot(x='Treated_with_drugs', hue='Survived_1_year', data=categorical_data)
plt.xticks(rotation=90)
plt.show()
``````

#### Pre-Processing and Data Cleaning of Categorical Variables

We have discussed in our sessions that machine learning models accept only numerical data. The 'Treated_with_drugs' column is a categorical column whose values are combinations of one or more drugs. Let's split those combined drugs into individual drugs and create dummies for them.

``````
drugs = data['Treated_with_drugs'].str.get_dummies(sep=' ') # split all the entries separated by space and create dummy variables
drugs.head()
``````
``````
DX1 DX2 DX3 DX4 DX5 DX6
0 0 0 0 0 0 1
1 0 1 0 0 0 0
2 0 0 0 0 0 1
3 1 0 0 0 0 0
4 0 0 0 0 0 1
``````
``````
data = pd.concat([data, drugs], axis=1)     # concat the two dataframes 'drugs' and 'data'
data = data.drop('Treated_with_drugs', axis=1)    # dropping the column 'Treated_with_drugs' as its values are now split into different columns
data.head()
``````
``````
ID_Patient_Care_Situation Diagnosed_Condition Patient_ID Patient_Age Patient_Body_Mass_Index Patient_Smoker Patient_Rural_Urban Patient_mental_condition A B C D E F Z Number_of_prev_cond Survived_1_year DX1 DX2 DX3 DX4 DX5 DX6
0 22374 8 3333 56 18.479385 YES URBAN Stable 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 0 0 0 0 0 1
1 18164 5 5740 36 22.945566 YES RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 0 1 0 0 0 0
2 6283 23 10446 48 27.510027 YES RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0 0 0 0 0 1
3 5339 51 12011 5 19.130976 NO URBAN Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 1 0 0 0 0 0
4 33012 0 12513 128 1.348400 Cannot say RURAL Stable 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1 0 0 0 0 0 1
``````

âPatient_Smokerâ is also a categorical column and we need to create dummies for this too. If you observe the data, the column âPatient_Smokerâ has a category âCannot sayâ.

``````
data.Patient_Smoker.value_counts()
``````
``````
NO            13246
YES            9838
Cannot say       13
Name: Patient_Smoker, dtype: int64
``````

There can be different ways to deal with the category 'Cannot say'. Here we will treat it as a missing value and fill those entries with the mode of the column.

``````
# we already know 'NO' is the mode, so we directly change the 'Cannot say' entries to 'NO'
data.loc[data['Patient_Smoker'] == 'Cannot say', 'Patient_Smoker'] = 'NO'
``````

The column 'Patient_mental_condition' has only one category, 'stable'. So we can drop this column, as for every observation the entry here is 'stable'. This feature won't be useful for predicting the target variable because it doesn't provide any distinguishing information. Hence, it is better to remove this kind of feature.

``````
data.drop('Patient_mental_condition', axis = 1, inplace=True)
``````

Now let's convert the remaining categorical columns to numerical using the get_dummies() function of pandas (i.e. one-hot encoding).

``````
data = pd.get_dummies(data, columns=['Patient_Smoker', 'Patient_Rural_Urban'])
``````
``````
data.head()
``````
``````
ID_Patient_Care_Situation Diagnosed_Condition Patient_ID Patient_Age Patient_Body_Mass_Index A B C D E F Z Number_of_prev_cond Survived_1_year DX1 DX2 DX3 DX4 DX5 DX6 Patient_Smoker_NO Patient_Smoker_YES Patient_Rural_Urban_RURAL Patient_Rural_Urban_URBAN
0 22374 8 3333 56 18.479385 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 0 0 0 0 0 1 0 1 0 1
1 18164 5 5740 36 22.945566 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 0 1 0 0 0 0 0 1 1 0
2 6283 23 10446 48 27.510027 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0 0 0 0 0 1 0 1 1 0
3 5339 51 12011 5 19.130976 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1 1 0 0 0 0 0 1 0 0 1
4 33012 0 12513 128 1.348400 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1 0 0 0 0 0 1 1 0 1 0
``````
``````
data.info()
``````
``````
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 24 columns):
#   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
0   ID_Patient_Care_Situation  23097 non-null  int64
1   Diagnosed_Condition        23097 non-null  int64
2   Patient_ID                 23097 non-null  int64
3   Patient_Age                23097 non-null  int64
4   Patient_Body_Mass_Index    23097 non-null  float64
5   A                          23097 non-null  float64
6   B                          23097 non-null  float64
7   C                          23097 non-null  float64
8   D                          23097 non-null  float64
9   E                          23097 non-null  float64
10  F                          23097 non-null  float64
11  Z                          23097 non-null  float64
12  Number_of_prev_cond        23097 non-null  float64
13  Survived_1_year            23097 non-null  int64
14  DX1                        23097 non-null  int64
15  DX2                        23097 non-null  int64
16  DX3                        23097 non-null  int64
17  DX4                        23097 non-null  int64
18  DX5                        23097 non-null  int64
19  DX6                        23097 non-null  int64
20  Patient_Smoker_NO          23097 non-null  uint8
21  Patient_Smoker_YES         23097 non-null  uint8
22  Patient_Rural_Urban_RURAL  23097 non-null  uint8
23  Patient_Rural_Urban_URBAN  23097 non-null  uint8
dtypes: float64(9), int64(11), uint8(4)
memory usage: 3.6 MB
``````

As you can see there are no missing data now and all the data are of numerical type.

There are two ID columns - 'ID_Patient_Care_Situation' and 'Patient_ID'. We can consider removing these columns if they are randomly generated values with no repeats, as we did for 'PassengerId' in the Titanic dataset: 'PassengerId' was randomly generated for each passenger and no id was repeated. So let's check these two id columns.

``````
print(data.ID_Patient_Care_Situation.nunique())     # nunique() gives you the count of unique values in the column
print(data.Patient_ID.nunique())
``````
``````
23097
10570
``````

You can see there are 23097 unique 'ID_Patient_Care_Situation' values and 23097 total observations in the dataset, so every value is unique and this column can be dropped.

Now, there are only 10570 unique values in the 'Patient_ID' column. This means some patients visited the hospital two or more times; the same person may have been ill on separate occasions (with different illnesses) and visited the hospital for treatment each time, and the same patient can have a different care situation for each disease.

So the 'Patient_ID' column carries some useful information, and we will not drop it.
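A quick way to confirm this is to look at how often individual Patient_ID values repeat; a small check along these lines:

``````
# How many care situations are recorded per patient?
print(data['Patient_ID'].value_counts().head())                                  # most frequent patients
print((data['Patient_ID'].value_counts() > 1).sum(), "patients appear more than once")
``````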

``````
data.drop(['ID_Patient_Care_Situation'], axis=1, inplace=True)
``````

### Prepare Train/Test Data

1. Separating the input and output variables

Before building any machine learning model, we always separate the input variables and the output variables. Input variables are the quantities whose values vary naturally in an experiment, whereas the output variable is the one whose values depend on the input variables. Input variables are therefore also known as independent variables, and output variables as dependent variables. In this data, for example, whether a person survives after one year depends on other variables such as age, diagnosis, body mass index, and the drugs used.

By convention, input variables are represented with 'X' and output variables with 'y'.

``````
X = data.drop('Survived_1_year', axis=1)
y = data['Survived_1_year']
``````
2. Train/test split

We want to check the performance of the model that we build. For this purpose, we always split the given data (both input and output) into a training set, which is used to train the model, and a test set, which is used to check how accurately the model predicts outcomes.

For this we have the function 'train_test_split' in the 'sklearn.model_selection' module.

``````
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
``````

### Model Building

We have seen from Exploratory Data Analysis that this is a classification problem, as the target column 'Survived_1_year' has two values: 0 means the patient did not survive after one year of treatment, and 1 means the patient survived. So we can use classification models for this problem. Some classification models are Logistic Regression, Random Forest Classifier, Decision Tree Classifier, etc. Here we use two of them: Logistic Regression and Random Forest Classifier.

### 1. Logistic Regression Model

``````
model = LogisticRegression(max_iter = 1000)     # The maximum number of iterations will be 1000. This helps prevent a convergence warning.
model.fit(X_train,y_train)
``````
``````
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=1000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
``````
``````
pred = model.predict(X_test)
``````

Evaluation:

``````
print(f1_score(y_test, pred))
``````
``````
0.7866774979691308
``````

The F1 score of the Logistic Regression model is about 78.7%. Let's try a Random Forest Classifier and see if we get a better result with it.
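As a reminder, the F1 score is the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall). A quick way to see the components behind the number above (a small check using the same predictions):

``````
from sklearn.metrics import precision_score, recall_score

p = precision_score(y_test, pred)
r = recall_score(y_test, pred)
print("precision:", p)
print("recall   :", r)
print("F1 (recomputed):", 2 * p * r / (p + r))   # should match f1_score(y_test, pred)
``````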

### 2. Random Forest

``````
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
``````
``````
forest = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)

forest.fit(X_train, y_train)
``````
``````
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=5, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
n_jobs=None, oob_score=False, random_state=1, verbose=0,
warm_start=False)
``````
##### Evaluating on X_test
``````
y_pred = forest.predict(X_test)

fscore = f1_score(y_test ,y_pred)
fscore
``````
``````
0.8220447284345048
``````

The f1 score by Random Forest classifier is 82.2% which is better than logistic regression.

There are quite a few features used to train the model. We can try some feature selection techniques and check whether the performance of the Random Forest is affected, and see whether reducing the complexity of the model has only a minimal effect on its performance.

We have used the Boruta feature selector. You can use other techniques too and see if they give a better result than Boruta.

### 3. Random Forest and Boruta

Note: Before proceeding, please revisit the feature selection notebook or session if you don't remember what 'Boruta' is.

Boruta is an all-relevant feature selection method. Unlike other techniques that select a small set of features to minimize the error, Boruta tries to capture all the important and interesting features you might have in your dataset with respect to the target variable.

Boruta by default uses random forest although it works with other algorithms like LightGBM, XGBoost etc.

You can install Boruta with the command

pip install Boruta

``````
!pip install Boruta
``````
``````
Collecting Boruta
Requirement already satisfied: scikit-learn>=0.17.1 in /usr/local/lib/python3.6/dist-packages (from Boruta) (0.22.2.post1)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from Boruta) (1.18.5)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.6/dist-packages (from Boruta) (1.4.1)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn>=0.17.1->Boruta) (0.16.0)
Installing collected packages: Boruta
Successfully installed Boruta-0.3
``````
``````
from boruta import BorutaPy
``````
``````
boruta_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=1)   # initialize the Boruta selector
boruta_selector.fit(np.array(X_train), np.array(y_train))       # fitting the Boruta selector to get all relevant features. NOTE: BorutaPy accepts numpy arrays only.
``````
``````
Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	0
...
Iteration: 	8 / 100
Confirmed: 	16
Tentative: 	3
Rejected: 	3
...
Iteration: 	32 / 100
Confirmed: 	17
Tentative: 	2
Rejected: 	3
...
Iteration: 	99 / 100
Confirmed: 	17
Tentative: 	2
Rejected: 	3

BorutaPy finished running.

Iteration: 	100 / 100
Confirmed: 	17
Tentative: 	1
Rejected: 	3

BorutaPy(alpha=0.05,
estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None, criterion='gini',
max_depth=5, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=123, n_jobs=None,
oob_score=False,
random_state=RandomState(MT19937) at 0x7F0EBCEFA990,
verbose=0, warm_start=False),
max_iter=100, n_estimators='auto', perc=100,
random_state=RandomState(MT19937) at 0x7F0EBCEFA990, two_step=True,
verbose=2)
``````
``````
print("Selected Features: ", boruta_selector.support_)    # check selected features

print("Ranking: ",boruta_selector.ranking_)               # check ranking of features

print("No. of significant features: ", boruta_selector.n_features_)
``````
``````
Selected Features:  [ True  True  True  True  True False  True False False False  True  True
True  True  True  True  True  True  True  True  True]
Ranking:  [1 1 1 1 1 2 1 3 4 5 1 1 1 1 1 1 1 1 1 1 1]
No. of significant features:  17
``````

So Boruta has selected 17 relevant features. Let's visualise this better in the form of a table.

#### Displaying features rank wise

``````
selected_rfe_features = pd.DataFrame({'Feature': list(X_train.columns),
                                      'Ranking': boruta_selector.ranking_})
selected_rfe_features.sort_values(by='Ranking')
``````
``````
Feature Ranking
0 Diagnosed_Condition 1
19 Patient_Smoker_YES 1
18 Patient_Smoker_NO 1
17 DX6 1
16 DX5 1
15 DX4 1
14 DX3 1
13 DX2 1
12 DX1 1
11 Number_of_prev_cond 1
21 Patient_Rural_Urban_URBAN 1
7 D 1
5 B 1
4 A 1
3 Patient_Body_Mass_Index 1
2 Patient_Age 1
20 Patient_Rural_Urban_RURAL 1
1 Patient_ID 2
6 C 3
8 E 4
9 F 5
10 Z 6
``````

#### Create a new subset of the data with only the selected features

``````
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))
``````

#### Build the model with selected features

``````
# Create a new random forest classifier for the most important features
rf_important = RandomForestClassifier(random_state=1, n_estimators=1000, n_jobs = -1)

# Train the new classifier on the new dataset containing the most important features
rf_important.fit(X_important_train, y_train)
``````
``````
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=1000,
n_jobs=-1, oob_score=False, random_state=1, verbose=0,
warm_start=False)
``````

#### Evaluation

``````
y_important_pred = rf_important.predict(X_important_test)
rf_imp_fscore = f1_score(y_test, y_important_pred)
``````
``````
print(rf_imp_fscore)
``````
``````
0.8578215134034612
``````

If you recall, the Random Forest Classifier with all the features gave an F1 score of 82.2%, while after selecting the relevant features it gives an F1 score of 85.7%. That is a good improvement in terms of both the performance of the model (i.e. the result) and the reduced complexity.

We chose some of the parameters, like max_depth and n_estimators, somewhat arbitrarily, and there are many other parameters related to the Random Forest model. If you remember, we discussed hyperparameter tuning in our session 'Performance Evaluation'. Hyperparameter tuning helps you choose a set of optimal parameters for a model, so let's see if it helps us further improve the performance of the model.

Grid Search helps you to find the optimal parameter for a model.

### Hyperparameter Tuning

``````
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [5, 10, 15],
    'n_estimators': [500, 1000]}
``````
``````
rf = RandomForestClassifier(random_state = 1)

# Grid search cv
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,
                           cv = 2, n_jobs = -1, verbose = 2)
``````
``````
grid_search.fit(X_important_train, y_train)
``````
``````
Fitting 2 folds for each of 12 candidates, totalling 24 fits

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  2.2min finished

GridSearchCV(cv=2, error_score=nan,
estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False, random_state=1,
verbose=0, warm_start=False),
iid='deprecated', n_jobs=-1,
param_grid={'bootstrap': [True, False], 'max_depth': [5, 10, 15],
'n_estimators': [500, 1000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)
``````
``````
grid_search.best_params_
``````
``````
{'bootstrap': True, 'max_depth': 15, 'n_estimators': 500}
``````
``````
pred = grid_search.predict(X_important_test)
``````
``````
f1_score(y_test, pred)
``````
``````
0.8657267539442766
``````

As you can see, the F1 score improved from 85.7% to 86.6% by selecting good parameters with the help of hyperparameter tuning (GridSearchCV).

# Conclusion

• It is clearly observable how the F1 score increased from logistic regression to random forest, and from random forest with all features to random forest on the features selected using Boruta.
• The F1 score then increased again with hyperparameter tuning.
• This is just one approach to solving the problem; there can be many other approaches.
• We could also try standardizing or normalizing the data, other algorithms, and so on.
• You should try standardizing or normalizing the data and then observe the difference in the F1 score (a rough starting point is sketched below).
• Also try using a Decision Tree.
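As a starting point for those experiments, here is a minimal sketch (not part of the original solution) that standardizes the features and fits a Decision Tree on the same train/test split; the parameter values are illustrative assumptions, not tuned choices.

``````
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Standardize the features (fit the scaler on the training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A simple Decision Tree on the scaled data; max_depth=5 is an arbitrary example
tree = DecisionTreeClassifier(max_depth=5, random_state=1)
tree.fit(X_train_scaled, y_train)
print(f1_score(y_test, tree.predict(X_test_scaled)))
``````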

Now let's predict the output for the new test data.

We will predict the output for the new test data using the Random Forest model with the features selected by Boruta and the best parameters found during hyperparameter tuning, because that model gave the highest F1 score on the X_test data (also called the validation data). We can use the grid_search variable directly for prediction, as it references the trained model.

## New Test Data

Tasks to be performed:

• Load the new test data
• If there are missing values, fill them with the same techniques used for the training dataset
• Convert categorical columns to numerical
• Predict the output
• Download the predicted values as csv

Why do we need to apply the same missing-value treatment, data cleaning and preprocessing to the new test data as we did for the training and evaluation data?

Ans: Because our model has been trained on data in a certain format, and if we don't provide the test data in the same format, the model will give erroneous predictions and the accuracy/F1 score of the model will decrease. Also, if the model was built on 'n' features, you should always give the same number of features when predicting on new data. If you provide a different number of features, your ML model will throw a ValueError saying something like "number of features given x; expecting n". Not confident about these statements? As a data scientist you should always perform experiments and observe the results.
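One practical way to guard against this is to compare the columns of the preprocessed test data against the training features before predicting. A small sketch, assuming the new data has been preprocessed into a dataframe called test_new_data as in the cells below:

``````
# Sanity check: the preprocessed test data should expose the same feature columns
# as the training features (ignoring the target), before any feature selection.
train_cols = set(X.columns)
test_cols = set(test_new_data.columns)
print("Missing in test data :", train_cols - test_cols)
print("Extra in test data   :", test_cols - train_cols)
``````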

``````
# Load the data
# NOTE: the file path below is a placeholder; point it at wherever the new test set is stored
test_new_data = pd.read_csv('path/to/new_test_data.csv')
``````
``````
# take a look at how the new test data looks
test_new_data.head()
``````
``````
ID_Patient_Care_Situation Diagnosed_Condition Patient_ID Treated_with_drugs Patient_Age Patient_Body_Mass_Index Patient_Smoker Patient_Rural_Urban Patient_mental_condition A B C D E F Z Number_of_prev_cond
0 19150 40 3709 DX3 16 29.443894 NO RURAL Stable 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0
1 23216 52 986 DX6 24 26.836321 NO URBAN Stable 1.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0
2 11890 50 11821 DX4 DX5 63 25.523280 NO RURAL Stable 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0
3 7149 32 3292 DX6 42 27.171155 NO URBAN Stable 1.0 0.0 1.0 0.0 1.0 0.0 0.0 3.0
4 22845 20 9959 DX3 50 25.556192 NO RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
``````

Can you observe the new test data here? It is in the same format as our training data before performing any cleaning and preprocessing.

### Checking missing values

``````
test_new_data.isnull().sum()
``````
``````
ID_Patient_Care_Situation    0
Diagnosed_Condition          0
Patient_ID                   0
Treated_with_drugs           0
Patient_Age                  0
Patient_Body_Mass_Index      0
Patient_Smoker               0
Patient_Rural_Urban          0
Patient_mental_condition     0
A                            0
B                            0
C                            0
D                            0
E                            0
F                            0
Z                            0
Number_of_prev_cond          0
dtype: int64
``````

New test data has no missing values. So treating missing value is not required.

#### Preprocessing and data cleaning: same as we did on training data

``````
drugs = test_new_data['Treated_with_drugs'].str.get_dummies(sep=' ') # split all the entries
drugs.head()
``````
``````
DX1 DX2 DX3 DX4 DX5 DX6
0 0 0 1 0 0 0
1 0 0 0 0 0 1
2 0 0 0 1 1 0
3 0 0 0 0 0 1
4 0 0 1 0 0 0
``````
``````
test_new_data = pd.concat([test_new_data, drugs], axis=1)     # concat the two dataframes 'drugs' and 'test_new_data'
test_new_data = test_new_data.drop('Treated_with_drugs', axis=1)    # dropping the column 'Treated_with_drugs' as its values are now split into different columns
test_new_data.head()
``````
``````
ID_Patient_Care_Situation Diagnosed_Condition Patient_ID Patient_Age Patient_Body_Mass_Index Patient_Smoker Patient_Rural_Urban Patient_mental_condition A B C D E F Z Number_of_prev_cond DX1 DX2 DX3 DX4 DX5 DX6
0 19150 40 3709 16 29.443894 NO RURAL Stable 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 0 1 0 0 0
1 23216 52 986 24 26.836321 NO URBAN Stable 1.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0 0 0 0 0 0 1
2 11890 50 11821 63 25.523280 NO RURAL Stable 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 0 0 1 1 0
3 7149 32 3292 42 27.171155 NO URBAN Stable 1.0 0.0 1.0 0.0 1.0 0.0 0.0 3.0 0 0 0 0 0 1
4 22845 20 9959 50 25.556192 NO RURAL Stable 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0 1 0 0 0
``````
``````
test_new_data.Patient_Smoker.value_counts()
``````
``````
NO     5333
YES    3970
Name: Patient_Smoker, dtype: int64
``````

This data does not have the value 'Cannot say' in the 'Patient_Smoker' column.

The column 'Patient_mental_condition' has only one category, 'stable'. So we can drop this column, as for every observation the entry here is 'stable'.

``````
test_new_data.drop('Patient_mental_condition', axis = 1, inplace=True)
``````

Now let's convert the categorical columns to numerical using the get_dummies() function of pandas (i.e. one-hot encoding).

``````
test_new_data = pd.get_dummies(test_new_data, columns=['Patient_Smoker', 'Patient_Rural_Urban'])
``````
``````
test_new_data.head()
``````
``````
ID_Patient_Care_Situation Diagnosed_Condition Patient_ID Patient_Age Patient_Body_Mass_Index A B C D E F Z Number_of_prev_cond DX1 DX2 DX3 DX4 DX5 DX6 Patient_Smoker_NO Patient_Smoker_YES Patient_Rural_Urban_RURAL Patient_Rural_Urban_URBAN
0 19150 40 3709 16 29.443894 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 0 1 0 0 0 1 0 1 0
1 23216 52 986 24 26.836321 1.0 1.0 0.0 0.0 0.0 0.0 0.0 2.0 0 0 0 0 0 1 1 0 0 1
2 11890 50 11821 63 25.523280 1.0 0.0 0.0 0.0 1.0 0.0 0.0 2.0 0 0 0 1 1 0 1 0 1 0
3 7149 32 3292 42 27.171155 1.0 0.0 1.0 0.0 1.0 0.0 0.0 3.0 0 0 0 0 0 1 1 0 0 1
4 22845 20 9959 50 25.556192 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0 0 1 0 0 0 1 0 1 0
``````
``````
test_new_data.info()
``````
``````
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9303 entries, 0 to 9302
Data columns (total 23 columns):
#   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
0   ID_Patient_Care_Situation  9303 non-null   int64
1   Diagnosed_Condition        9303 non-null   int64
2   Patient_ID                 9303 non-null   int64
3   Patient_Age                9303 non-null   int64
4   Patient_Body_Mass_Index    9303 non-null   float64
5   A                          9303 non-null   float64
6   B                          9303 non-null   float64
7   C                          9303 non-null   float64
8   D                          9303 non-null   float64
9   E                          9303 non-null   float64
10  F                          9303 non-null   float64
11  Z                          9303 non-null   float64
12  Number_of_prev_cond        9303 non-null   float64
13  DX1                        9303 non-null   int64
14  DX2                        9303 non-null   int64
15  DX3                        9303 non-null   int64
16  DX4                        9303 non-null   int64
17  DX5                        9303 non-null   int64
18  DX6                        9303 non-null   int64
19  Patient_Smoker_NO          9303 non-null   uint8
20  Patient_Smoker_YES         9303 non-null   uint8
21  Patient_Rural_Urban_RURAL  9303 non-null   uint8
22  Patient_Rural_Urban_URBAN  9303 non-null   uint8
dtypes: float64(9), int64(10), uint8(4)
memory usage: 1.4 MB
``````

As you can see there are no missing data now and all the data are of numerical type.

The column 'ID_Patient_Care_Situation' is an ID. We can remove this column too, as we did for the training dataset.

``````
test_new_data.drop(['ID_Patient_Care_Situation'], axis=1, inplace=True)
``````
``````
test_new_data.info()
``````
``````
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9303 entries, 0 to 9302
Data columns (total 21 columns):
#   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
0   Diagnosed_Condition        9303 non-null   int64
1   Patient_Age                9303 non-null   int64
2   Patient_Body_Mass_Index    9303 non-null   float64
3   A                          9303 non-null   float64
4   B                          9303 non-null   float64
5   C                          9303 non-null   float64
6   D                          9303 non-null   float64
7   E                          9303 non-null   float64
8   F                          9303 non-null   float64
9   Z                          9303 non-null   float64
10  Number_of_prev_cond        9303 non-null   float64
11  DX1                        9303 non-null   int64
12  DX2                        9303 non-null   int64
13  DX3                        9303 non-null   int64
14  DX4                        9303 non-null   int64
15  DX5                        9303 non-null   int64
16  DX6                        9303 non-null   int64
17  Patient_Smoker_NO          9303 non-null   uint8
18  Patient_Smoker_YES         9303 non-null   uint8
19  Patient_Rural_Urban_RURAL  9303 non-null   uint8
20  Patient_Rural_Urban_URBAN  9303 non-null   uint8
dtypes: float64(9), int64(8), uint8(4)
memory usage: 1.2 MB
``````

### Prediction

``````
imp_test_features = boruta_selector.transform(np.array(test_new_data))
``````
``````
prediction = grid_search.predict(imp_test_features)
``````

We have the predicted output stored in the variable 'prediction'. Let's download this prediction as a csv file as shown below.

## Download the prediction file in csv

``````
res = pd.DataFrame(prediction)
res.index = test_new_data.index
res.columns = ["prediction"]

from google.colab import files
res.to_csv('prediction_results_HP.csv')
files.download('prediction_results_HP.csv')    # trigger the file download when running in Google Colab
``````