Solve Datathon 2: Beginners: Cancer Death Rate Challenge | DPhi

As per WHO,

Many aspects of the behaviour of cancer are highly unpredictable. Even with the huge number of studies that have been done on the DNA mutations responsible for the disease, we are still unable to use this information at the clinical level. However, it is important that we understand the effects and impact of this disease from past information as much as we possibly can.


This is a companion discussion topic for the original entry at https://dphi.tech/practice/challenge/52

How should data be scaled? After splitting the data into train and test sets, or before the split? Before separating the input and target features, or after?
Should data be scaled with StandardScaler or MinMaxScaler? When, or how, do we choose between the two for a given dataset?

I’m confused about when to apply it. For example, in this Datathon 2, should I apply StandardScaler to X_train and X_test separately, or apply StandardScaler to the whole X and then split it into train and test?

Another question: is fit_transform for X_train, and transform (without fit) for X_test? Or should X_test use fit_transform as well?
When I apply only transform to X_test, it throws an error, while fit_transform goes through. That confuses me, because when I google around, people recommend using only transform on the test set; using fit_transform on test data is peeking/cheating.
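For reference, the standard scikit-learn pattern is: split first, fit the scaler on the training set only, then reuse that same fitted scaler on the test set. A minimal sketch with toy data (the arrays here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy data: 100 rows, 3 numeric features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [1, 10, 100]
y = rng.normal(size=100)

# 1. split first, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. fit the scaler on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. reuse the SAME fitted scaler on the test data (transform, not fit_transform)
X_test_scaled = scaler.transform(X_test)
```

If `transform` on X_test raises an error, it is usually a `NotFittedError`: the scaler instance being used was never fitted (for example, a fresh scaler was created for the test set). `transform` must be called on the instance that was already fitted on X_train.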

Hi @rainbowmoonlight

The binnedInc column is of object type but holds numeric data, so what do we have to do?
And test_data has missing values just like the train data, so do we handle them the same way as the train data?

Hi @jivan_s

  • Convert the column data type to numeric
  • Yes, handle the missing values in test_data the same way as in train_data
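A sketch of both points (the values and the median-fill strategy below are illustrative assumptions, not the challenge’s official preprocessing):

```python
import pandas as pd

# toy frames standing in for the datathon's train/test data
train_data = pd.DataFrame({"binnedInc": ["48021", "51046", None]})
test_data = pd.DataFrame({"binnedInc": ["42000", None]})

for df in (train_data, test_data):
    # convert the object column to a numeric dtype; unparseable strings become NaN
    df["binnedInc"] = pd.to_numeric(df["binnedInc"], errors="coerce")

# fill missing values in BOTH sets the same way, using a statistic
# computed from the train set only (no peeking at the test set)
fill_value = train_data["binnedInc"].median()
train_data["binnedInc"] = train_data["binnedInc"].fillna(fill_value)
test_data["binnedInc"] = test_data["binnedInc"].fillna(fill_value)
```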

Hi,

Since the predictions will be evaluated with MSE and the target variable is continuous, only linear regression can be used.
Am I right?
What other methods could be chosen to get a numeric prediction?

There are other algorithms, like Random Forest Regressor, Decision Tree Regressor, etc.
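All of these share scikit-learn’s fit/predict interface, so swapping one in for LinearRegression is a one-line change. A quick comparison sketch on synthetic data (made up here for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data with a non-linear target, where linear regression struggles
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# all three regressors expose the same fit/predict API
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))
print(mse)
```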

Hi,
I’m getting KeyError: 0 when I try to fit the Boruta selector to my baseline model. Could anyone help me fix this error? I know we get a KeyError when a particular key isn’t present, but I don’t know why I’m getting it here.
I’ve tried everything, but the error still isn’t fixed.
Thank you :slight_smile:

Hi @monty29
Please share your notebook here: manish@dphi.tech
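For anyone hitting the same thing: a common cause of `KeyError: 0` with BorutaPy is passing pandas DataFrames to `fit`, since BorutaPy indexes its inputs positionally like NumPy arrays. Without seeing the notebook this is only an assumption about the cause, but converting to plain arrays is worth trying:

```python
import pandas as pd
from sklearn.datasets import make_regression

# toy data standing in for the datathon's features and target
X_arr, y_arr = make_regression(n_samples=100, n_features=5, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)

# BorutaPy indexes rows/columns positionally, so DataFrames can raise
# KeyError: 0 inside fit. Pass plain NumPy arrays instead:
X_np, y_np = X.values, y.values

# from boruta import BorutaPy
# from sklearn.ensemble import RandomForestRegressor
# selector = BorutaPy(RandomForestRegressor(n_jobs=-1), n_estimators="auto",
#                     random_state=42)
# selector.fit(X_np, y_np)   # selector.fit(X, y) may raise KeyError: 0
```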

Hi,

1.) Can we drop the binnedInc and Geography columns from the dataset, since both columns are of object type and have a high number of unique values?
2.) Please suggest a method to deal with these 2 columns: binnedInc and Geography.
3.) Where can I get the solutions to the previous 2 quizzes?

Thank you, I just fixed my code.
But my problem now is with the Geography column in the dataframe, because the test set DPhi has provided ends up with a different number of columns after encoding, so I’m trying to figure out ways to optimise my model. binnedInc isn’t adding much value to the model, so how should I go about it? Should I keep it or just drop it?

Hi @monty29
Glad to know the earlier issue is fixed.
Please go through this tutorial for handling categorical values: Handling Unknown Categories in both train and test set during One Hot Encoding
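One standard way to handle mismatched columns after encoding, in line with that tutorial’s topic, is scikit-learn’s `OneHotEncoder` with `handle_unknown="ignore"`: fit it on the train set only, and the test set is guaranteed to come out with the same columns, with unseen categories encoded as all zeros. A sketch (the place names are made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Geography": ["Ohio", "Texas", "Ohio"]})
test = pd.DataFrame({"Geography": ["Texas", "Maine"]})  # "Maine" unseen in train

# fit on the train set only; unknown test categories become all-zero rows
enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train[["Geography"]]).toarray()
test_enc = enc.transform(test[["Geography"]]).toarray()

# both encoded sets now have the same number of columns
print(train_enc.shape, test_enc.shape)
```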

Hi @rizwan
For binnedInc:
If you see this column is a collection of two values. You can write a logic that would split up this one column to form two different columns.
Another idea would be to take the average (mid) income instead of min and max incomes.

For Geography:
Handle categorical data. You can either drop it or do some one hot encoding or label encoding.
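Both ideas for binnedInc can be sketched as follows. The interval strings below only mimic the format of that column (e.g. `(48021.6, 51046.4]`), so this is an illustration; adjust the parsing to the actual values:

```python
import pandas as pd

df = pd.DataFrame({"binnedInc": ["(48021.6, 51046.4]", "[22640, 34218.1]"]})

# idea 1: split the "(min, max]" string into two numeric columns
bounds = (
    df["binnedInc"]
    .str.strip("()[]")           # drop the surrounding brackets
    .str.split(",", expand=True) # one column per bound
    .astype(float)
)
df["inc_min"], df["inc_max"] = bounds[0], bounds[1]

# idea 2: a single mid-income feature instead of min and max
df["inc_mid"] = (df["inc_min"] + df["inc_max"]) / 2

# drop the original object column once the numeric features exist
df = df.drop(columns=["binnedInc"])
```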

Hi,

So you are saying that, since there are 10 unique values for binnedInc, I should take the mean of each range and replace the initial values with it?

How do I split it, if I want to split? And if I split, do I have to drop the initial column?

Do I have to drop the original column in both cases?

Thank you :slight_smile: But the MSE increases when I use scaling, so I’m confused; we do scaling and normalization to improve accuracy, right? And should we perform all the steps mentioned in the template notebook for the notebook submission?
I will try using all the columns and check the MSE, but this is a bit challenging.
I will try my best, and thank you @manish_kc_06 for your help.

Hi,
Replace each entry in that column with (min_income + max_income) / 2.

After feature selection using RFE on the train dataset, how can I select the same features from test_data, since it has no target variable? @manish_kc_06

@tanvirmoni Drop the unimportant columns: test_significant_features = test_data.drop([list of unimportant columns], axis=1)
Or select only the significant columns: test_significant_features = test_data[list of significant columns]
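A runnable version of those two options, with made-up column names standing in for the RFE results:

```python
import pandas as pd

# toy test set; pretend RFE kept "a" and "b" and rejected "c"
test_data = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
unimportant_columns = ["c"]
significant_columns = ["a", "b"]

# option 1: drop the columns RFE found unimportant
test_significant_features = test_data.drop(unimportant_columns, axis=1)

# option 2: keep only the columns RFE found significant
test_significant_features = test_data[significant_columns]
```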


hi,

I am not able to deal with the binnedInc column. Kindly help me with this; I am a beginner.

Hi @rizwan, I had the same issue, so I dropped that column, did the modelling, and got a good MSE. If you want to do something with binnedInc, do just as mentioned in the previous comments; otherwise just drop it and proceed.