When should we do the train/test split on a dataset?
- After cleaning the complete dataset
- Split the dataset first, and then clean only the training dataset
When attempting the final submission of the bootcamp, I adopted the first approach. But when I imported the test dataset (provided separately for the assignment), I faced the following error:
"could not convert string to float: 'blue-collar'"
The column containing this value had been converted into numeric values using the get_dummies() function.
The same error occurred with the second approach (where the split was done at the beginning), when I ran the code to predict the values for the test dataset.
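For reference, here is a minimal sketch that reproduces the same kind of error. It is not my actual code: the toy data and the column names "age" and "job" are made up, and I am assuming that the model converts its input to floats in roughly the way np.asarray(..., dtype=float) does. In the sketch, only the training frame goes through get_dummies(), while the test frame still holds the raw strings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names "age" and "job" are made up.
train = pd.DataFrame({"age": [30, 45, 52],
                      "job": ["blue-collar", "admin", "blue-collar"]})
test = pd.DataFrame({"age": [28, 61],
                     "job": ["blue-collar", "admin"]})

# The training frame is one-hot encoded, so it converts to floats cleanly.
train_encoded = pd.get_dummies(train, columns=["job"])
np.asarray(train_encoded, dtype=float)

# The test frame was never encoded, so the same conversion fails on the
# raw string values in the "job" column.
try:
    np.asarray(test, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this is what is happening in my case, the error would come from the test data not going through the same encoding step as the training data, but I have not confirmed that.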
To summarize, I have two questions:
- When should the dataset be split into train and test sets?
- How do I fix the error "could not convert string to float: 'blue-collar'"?
Thanks