Solve Data Sprint #16 Challenge | DPhi


Information will be put up on 4 December 202 at 21:00 IST | 16:00 CET

Base Model with GML - Auto EDA, Auto Feature Engineering, Auto Machine Learning

Hey! I have used GML to make a complete base model pipeline for you guys!

Check here:

In the test set, all values of the weight column are NaN. Is this by choice?

Hi @aayushpatni
No, it’s not by choice. It’s a mistake. However, we have updated the test_data. Sorry for the inconvenience caused.

Hello Everyone!
There was an issue with the ‘weight’ column in test_data. All the values in it were null, which should not have been the case. We have updated the test data. Kindly check it here:

Our apologies for this.

Hello everyone! I recently ran into the issue shown in the attached image when uploading my submission file. Does anyone know what causes this and how to solve it? Thank you very much.

Dear dimasganteng,
I have the same problem.

Hi there! I have tried to submit multiple times but I still receive the same evaluation error, as shown in the attached screenshot. Did I miss any step? I would appreciate your advice on this. Thanks.

When saving the CSV, make sure to pass index=False:

submission.to_csv('submission.csv', index=False)

Without it, pandas writes the index as an extra first column, and that extra column is causing this error.
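To see the extra column for yourself, here is a minimal sketch that writes the same frame with and without the index and compares the header lines. The column name `prediction` and the values are made up for illustration; use whatever the challenge's sample submission specifies.

```python
import io

import pandas as pd

# Hypothetical predictions; the real column name comes from the sample submission.
submission = pd.DataFrame({"prediction": [10.5, 20.0, 30.25]})

# Default behaviour (index=True): the row index is written as an unnamed first column.
with_index = io.StringIO()
submission.to_csv(with_index)

# index=False: only the actual data columns are written.
without_index = io.StringIO()
submission.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]

print(repr(header_with_index))     # header starts with an empty column name
print(repr(header_without_index))  # header contains only the data column
```

The leading comma in the first header is the unnamed index column that the evaluator does not expect.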

Hey Ahmed, I followed your suggestion, but now I get the error message “Prediction file has fewer number of records than expected. Ensure the no. of records in your prediction file match the no. of records in the test dataset.” I checked the CSV and the numbers appear in a weird format. Is there anything I can do to fix this? Thank you.
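The thread does not say what caused the record-count error, so the following is only a sketch of a sanity check you could run before uploading. It compares the number of rows in the submission with the test set and flags missing predictions. The helper name `check_submission` and the column names are hypothetical.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, test_data: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means no problem found."""
    problems = []
    if len(submission) != len(test_data):
        problems.append(f"expected {len(test_data)} rows, got {len(submission)}")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems


# Toy data standing in for the real test set and predictions.
test_data = pd.DataFrame({"weight": [1.0, 2.0, 3.0]})
good = pd.DataFrame({"prediction": [10.0, 20.0, 30.0]})
short = pd.DataFrame({"prediction": [10.0, 20.0]})

print(check_submission(good, test_data))
print(check_submission(short, test_data))
```

If the counts match in pandas but the uploaded file still fails, it may be worth re-reading the saved CSV with `pd.read_csv` rather than inspecting it in a spreadsheet program, since spreadsheet tools can change how numbers are displayed.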

Categorical Variable Imputation
Will imputing an important categorical variable with the mode work, or will it misrepresent our data?
The Manufacturer variable is the reference here.

Hi @nebkartik
If the amount of missing data is not large, the mode will work. If more than, say, 30% or 40% of the values are missing, I would check what percentage of the products share the same manufacturer and act accordingly. If 80% of the products have the same manufacturer, I would go with the mode, because there is a very high probability that a product with a missing value belongs to that manufacturer. Again, this will vary from person to person.
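The rule of thumb above can be sketched roughly as follows. The thresholds (30% missing, 80% dominance) come from the reply; the function name `impute_manufacturer` and the fallback to an "Unknown" category are assumptions added for illustration, not something the reply prescribes.

```python
import pandas as pd


def impute_manufacturer(series: pd.Series,
                        missing_threshold: float = 0.3,
                        dominance_threshold: float = 0.8) -> pd.Series:
    """Fill missing values with the mode when that looks safe, else with 'Unknown'."""
    missing_fraction = series.isna().mean()
    # Share of each category among the non-missing values, most frequent first.
    shares = series.value_counts(normalize=True)
    if shares.empty:
        return series  # nothing to learn the mode from
    most_common = shares.index[0]
    if missing_fraction <= missing_threshold or shares.iloc[0] >= dominance_threshold:
        return series.fillna(most_common)
    # Assumption: keep the missingness visible as its own category.
    return series.fillna("Unknown")


few_missing = pd.Series(["A", "A", "B", None, "A"])
many_missing = pd.Series(["A", None, None, "B", None])

filled_few = impute_manufacturer(few_missing)
filled_many = impute_manufacturer(many_missing)
print(filled_few.tolist())
print(filled_many.tolist())
```

In the first series only 20% of the values are missing, so the mode is used. In the second, 60% are missing and no manufacturer dominates, so the sketch falls back to a separate category.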

Categorical Variable Dummy Variables
If a variable has more than 50 categories and the categories have similar frequencies, how should it be handled, given that creating a lot of dummy variables will affect the model’s performance?

Hi @nebkartik
In this case, you can use feature selection techniques to remove less important categories/features.
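The reply does not name a specific technique, so here is one possible sketch: one-hot encode the variable, score each dummy column by its absolute correlation with a numeric target, and keep the top k. This simple filter method is swapped in for illustration only; the function name `top_dummy_columns` and the toy data are hypothetical.

```python
import pandas as pd


def top_dummy_columns(categories: pd.Series, target: pd.Series, k: int) -> pd.DataFrame:
    """One-hot encode `categories` and keep the k dummies most correlated with `target`."""
    dummies = pd.get_dummies(categories, dtype=float)
    # Absolute Pearson correlation of each dummy column with the target.
    scores = dummies.corrwith(target).abs().sort_values(ascending=False)
    return dummies[scores.index[:k]]


# Toy data: category "a" is clearly associated with a higher target value.
categories = pd.Series(["a", "a", "b", "b", "c", "c", "d", "d"])
target = pd.Series([10.0, 11.0, 1.0, 2.0, 1.5, 2.5, 1.0, 2.0])

selected = top_dummy_columns(categories, target, k=2)
print(list(selected.columns))
```

Model-based selection (for example, feature importances from a tree ensemble) is another option; which works better depends on the data.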

Hi @muhammad4hmed, very interesting notebook that you shared. I am running it in Colab and got two errors.
First, when importing GML, I got:

Since the GPL-licensed package unidecode is not installed, using Python’s unicodedata package which yields worse results.
GML is up to date
I don't know if this is really a problem, but when I try to run the auto feature engineering lines
fe = FeatureEngineering …
I got this error: "module 'sympy' has no attribute 'add'".
Below are the few lines printed before the error and the complete traceback:
Creating New Features with Features Selection
[GML] The 2 step feature engineering process could generate up to 4186 features.
[GML] With 4710 data points this new feature matrix would use about 0.08 gb of space.
[FEATURE_ENGINEERING] Step 1: transformation of original features
[FEATURE_ENGINEERING] Generated 51 transformed features from 13 original features - done.
[FEATURE_ENGINEERING] Step 2: first combination of features
[FEATURE_ENGINEERING] Generated 1955 feature combinations from 2016 original feature tuples - done.
[FEATURE_ENGINEERING] Generated altogether 2043 new features in 2 steps
[FEATURE_ENGINEERING] Removing correlated features, as well as additions at the highest level

AttributeError Traceback (most recent call last)
in ()
19 test_data=test_data,
20 verbose=1,
---> 21 feateng_steps=2)

3 frames
/usr/local/lib/python3.7/dist-packages/GML/AUTO_FEATURE_ENGINEERING/ in (.0)
339 print("[FEATURE_ENGINEERING] Generated altogether %i new features in %i steps" % (len(feature_pool) - len(start_features), max_steps))
340 print("[FEATURE_ENGINEERING] Removing correlated features, as well as additions at the highest level")
--> 341 feature_pool = {c: feature_pool[c] for c in feature_pool if c in uncorr_features and not feature_pool[c].func == sympy.add.Add}
342 cols = [c for c in list(df.columns) if c in feature_pool and c not in df_org.columns] # categorical cols not in feature_pool
343 if cols:

AttributeError: module 'sympy' has no attribute 'add'
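The failing line compares an expression's `.func` with `sympy.add.Add`. The `Add` class itself lives in `sympy.core.add` and is also exported as `sympy.Add`, so the same check can be written without relying on `sympy.add` being reachable from the top-level package. The sketch below only demonstrates that comparison on toy expressions; whether to patch the installed GML file this way or to install a SymPy version that GML supports is not settled in this thread.

```python
import sympy
from sympy.core.add import Add  # explicit import path for the Add class

x, y = sympy.symbols("x y")

sum_expr = x + y
product_expr = x * y

# Same test as in the traceback, but without going through `sympy.add`.
sum_is_add = sum_expr.func == Add
product_is_add = product_expr.func == Add

print(sum_is_add, product_is_add)
print(Add is sympy.Add)  # both names refer to the same class
```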