AI & Data Science Challenges | DPhi


This is a companion discussion topic for the original entry at https://dphi.tech/challenges/160

Hello. Here is my impression of round 47.

After the competition, I noticed some significant differences between the train set and the test set.
For example, the “Host Since” column in the first row of the train data is “2016-01-20”, while in the test data it is “15-06-2014”. Some other columns also have mismatched data formats.
Does this happen often?

Despite above problem, I don’t understand why applying inaccurate label encoding and regression works.

@uwi Hello, thank you for raising this. We acknowledge that the date variables are formatted in a different order in the train and test data. This was not deliberate, and we apologize for it. We later went with what we considered a fair assumption: that data scientists would take care of it during the cleaning process by studying the patterns in the train and test datasets.
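For anyone reading later, the cleaning step mentioned above could look roughly like the sketch below. It assumes the two formats are exactly the ones seen in the sample values quoted in this thread (year-first in train, day-first in test); `parse_host_since` is a hypothetical helper name, not something from the challenge materials.

```python
from datetime import datetime

# Assumed formats, inferred only from the two sample values quoted in this thread.
TRAIN_FORMAT = "%Y-%m-%d"  # e.g. "2016-01-20"
TEST_FORMAT = "%d-%m-%Y"   # e.g. "15-06-2014"


def parse_host_since(value, formats=(TRAIN_FORMAT, TEST_FORMAT)):
    """Try each assumed format in turn; return a date, or None if none match."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date()
        except (ValueError, TypeError):
            # ValueError: the string does not match this format.
            # TypeError: the value is not a string (e.g. a missing value).
            continue
    return None


train_date = parse_host_since("2016-01-20")
test_date = parse_host_since("15-06-2014")
print(train_date.isoformat(), test_date.isoformat())
```

Once both columns are parsed into real dates, the same derived features (for example, year or days since a reference date) can be computed for train and test alike.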

Addressing your question on “Despite above problem, I don’t understand why applying inaccurate label encoding and regression works.” One of the main reasons could be that the date-type variables have low variable importance, so they may not contribute significantly to the predictive ability of the ML model you built. In that case, feeding in a date variable with an incorrect format would not affect overall model performance much. I will be happy to look at your code and share further inputs if required. Let me know if you have any questions. Thanks!
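As a rough illustration of what "inaccurate" label encoding means for date strings, here is a minimal sketch. The dates are made up, and `label_encode` is a hypothetical stand-in that assigns integer codes by sorted string order, which is an assumption about how the encoding was done. It only shows that the codes for day-first strings need not follow chronological order; it does not measure variable importance.

```python
from datetime import datetime


def label_encode(values):
    """Hypothetical helper: map each distinct string to its rank in sorted string order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[value] for value in values]


# Made-up dates in the day-first style (DD-MM-YYYY), for illustration only.
raw = ["15-06-2014", "01-03-2016", "20-01-2015"]

# Codes from encoding the raw strings directly.
codes = label_encode(raw)

# Chronological rank of each date after parsing it properly.
parsed = [datetime.strptime(value, "%d-%m-%Y") for value in raw]
chronological_order = sorted(parsed)
chrono = [chronological_order.index(p) for p in parsed]

print(codes)   # string-order codes
print(chrono)  # chronological ranks
```

If the two printed lists differ, the encoded feature does not preserve the time ordering, so a model would get little usable signal from it.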
