Many aspects of the behaviour of cancer are highly unpredictable. Even with the huge number of studies that have been done on the DNA mutations responsible for the disease, we are still unable to use this information at the clinical level. However, it is important that we understand the effects and impacts of this disease from past information as much as we possibly can.
How should we scale data? Do we scale after splitting the data into train and test sets, or before the split? Before separating the input and target features, or after?
Should we scale data using StandardScaler or MinMaxScaler? When, or how, do we choose between StandardScaler and MinMaxScaler for a given dataset?
I’m confused about when to apply it. For example, in this Datathon 2, should I apply StandardScaler to X_train and X_test separately, or apply StandardScaler to the whole X and then split it into train and test?
Another question: is fit_transform for X_train, and transform (without fit) for X_test? Or does X_test use fit_transform as well?
But when I apply only transform to X_test, it throws an error, while fit_transform goes through. This confuses me, because when I Google around, people recommend using only transform on the test set; using fit_transform on test data is peeking/cheating.
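The usual pattern is: split first, fit the scaler on the training set only, then reuse those learned statistics on the test set. The error with transform alone typically means the scaler was never fitted (e.g. a fresh scaler object was used for the test set). A minimal sketch with made-up data (the array X and target y here are hypothetical stand-ins, not the datathon data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 3 features
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.random(100)

# Split BEFORE scaling so test statistics never leak into training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train mean/std; no fit
```

Calling `scaler.transform(...)` on a scaler that was never fitted raises a NotFittedError, which is likely the error you hit; the fix is to fit once on X_train and reuse that same fitted object for X_test.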
Since the predictions will be evaluated with MSE and the target variable is a continuous column, only linear regression can be used.
Am I right?
What other methods could we choose for getting a numeric prediction?
I’m getting KeyError: 0 when I try to fit the Boruta selector to my baseline model. Could anyone help me fix this error? I know we get a KeyError when a particular key isn’t present, but I don’t know why I’m getting it here.
I’ve tried everything, but the error still isn’t fixed.
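One common cause of this exact error, assuming you are passing a pandas DataFrame to BorutaPy: the selector indexes its input positionally (integer keys), and on a DataFrame with string column names an integer lookup like `X[0]` is a column lookup that raises KeyError: 0. A small sketch reproducing the cause (the DataFrame and column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the datathon features
X = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

# On a DataFrame with string column names, an integer key is a
# *column* lookup, so X[0] raises KeyError: 0.
try:
    X[0]
except KeyError:
    print("KeyError: 0 reproduced")

# A plain numpy array is indexed positionally, so the usual fix is
# to pass arrays to the selector, e.g.:
#     boruta_selector.fit(X.values, y.values)
first_row = X.values[0]
```

If you are already passing arrays, check that y is also an array (`y.values` or `np.asarray(y)`), since a Series can trigger the same kind of lookup.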
1.) Can we drop the binnedInc and Geography columns from the dataset, since both columns are of object type and also have a high number of unique values?
2.) Can you suggest a method to deal with these 2 columns: binnedInc and Geography?
3.) Where can we find the solutions to the previous 2 quizzes?
Thank you, I just fixed my code.
But my problem now is with the Geography column in the dataframe, because the test set that DPhi has provided ends up with a different number of columns after encoding, so I’m trying to figure out ways to optimise my model. binnedInc isn’t adding much value to the model either, so how should I go about it? Should I keep it or just drop it?
If you look at this column, it is a collection of two values. You can write logic that splits this one column into two different columns.
Another idea would be to take the average (mid) income instead of the min and max incomes.
Handle the categorical data: you can either drop it or apply one-hot encoding or label encoding.
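The split-and-average idea above can be sketched with pandas string methods. The sample values below are made up to resemble the binnedInc format (an interval string like "(48021.6, 51046.4]"); the column names incMin/incMax/incMid are my own choices:

```python
import pandas as pd

# Hypothetical rows mimicking the binnedInc interval strings
df = pd.DataFrame({"binnedInc": ["(34218.1, 37413.8]", "(48021.6, 51046.4]"]})

# Strip the bracket characters, split on the comma, cast to float
bounds = (df["binnedInc"]
          .str.strip("([)]")
          .str.split(",", expand=True)
          .astype(float))
df["incMin"] = bounds[0]
df["incMax"] = bounds[1]

# Single numeric feature: the midpoint of the income bin
df["incMid"] = (df["incMin"] + df["incMax"]) / 2
```

After this you can drop the original binnedInc column and keep only the numeric feature(s) that help your model.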
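For the one-hot encoding route, the mismatch in column counts between train and test (mentioned earlier in the thread) usually comes from categories that appear in one set but not the other. One way to handle it is to encode both and then align the test columns to the train columns. The Geography values below are hypothetical examples:

```python
import pandas as pd

# Hypothetical frames: the test set contains a category unseen in train
train = pd.DataFrame({"Geography": ["Kentucky", "Ohio", "Kentucky"]})
test = pd.DataFrame({"Geography": ["Ohio", "Washington"]})

train_enc = pd.get_dummies(train, columns=["Geography"])
test_enc = pd.get_dummies(test, columns=["Geography"])

# Align test to the train columns: unseen train categories become
# all-zero columns in test, and test-only categories are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

This guarantees the model sees the same feature columns, in the same order, at train and prediction time.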
Thank you, but the MSE increases when I use scaling, so I’m confused: we do scaling and normalization to improve accuracy, right? And should we perform all the steps mentioned in the template notebook for the notebook submission?
I will try using all the columns and check the MSE, but this is a bit challenging.
I will try my best. Thank you @manish_kc_06 for your help.
@tanvirmoni Drop the unimportant columns: test_significant_features = test_data.drop([list of unimportant columns], axis=1)
Or select only the significant columns: test_significant_features = test_data[list of significant columns]
Hi @rizwan, I had the same issue, so I dropped that column, did the modelling, and got a good MSE. If you want to do something with binnedInc, do it as mentioned in the previous comments; otherwise just drop it and proceed.