Contribute: Found a typo? Or any other change that could improve the notebook tutorial? Please consider sending us a pull request in the public repo of the notebook here
Getting Started Code For Data Sprint #6 on DPhi
Author: Manish KC
Download the images
A Google Drive link is given in the ‘Data’ section of the problem page. It contains all the train images required to build the model and the test images whose labels we will predict and submit on the DPhi platform.
We can use GoogleDriveDownloader from the google_drive_downloader library in Python to download the shared files from the shared Google Drive link: https://drive.google.com/file/d/1_W2gFFZmy6ZyC8TPlxB49eDFswdBsQqo/view?usp=sharing
The file id in the above link is: 1_W2gFFZmy6ZyC8TPlxB49eDFswdBsQqo
from google_drive_downloader import GoogleDriveDownloader as gdd
gdd.download_file_from_google_drive(file_id='1_W2gFFZmy6ZyC8TPlxB49eDFswdBsQqo',
                                    dest_path='content/face_mask_detection.zip',
                                    unzip=True)
Downloading 1_W2gFFZmy6ZyC8TPlxB49eDFswdBsQqo into content/face_mask_detection.zip... Done.
Unzipping...Done.
All the files from the shared Google Drive link are now downloaded into the Colab environment.
Loading Libraries
Not all Python capabilities are loaded into our working environment by default (even if they are already installed on your system). So we import each and every library that we want to use.
We choose alias names for our libraries for convenience (numpy --> np, pandas --> pd, tensorflow --> tf).
Note: You can import all the libraries that you think will be required or can import it as you go along.
import pandas as pd # Data analysis and manipulation tool
import numpy as np # Fundamental package for linear algebra and multidimensional arrays
import tensorflow as tf # Deep Learning Tool
import os # OS module in Python provides a way of using operating system dependent functionality
import cv2 # Library for image processing
from sklearn.model_selection import train_test_split # For splitting the data into train and validation set
Loading and preparing training data
The train and test images are given in two different folders - ‘train’ and ‘test’. The labels of the train images are given in a CSV file ‘Training_set_face_mask.csv’ along with the respective image id (i.e. the image file name).
Getting the labels of the images
labels = pd.read_csv("/content/content/face_mask_detection/Training_set_face_mask.csv") # loading the labels
labels.head() # will display the first five rows in labels dataframe
| | filename | label |
| --- | --- | --- |
| 0 | Image_1.jpg | without_mask |
| 1 | Image_2.jpg | without_mask |
| 2 | Image_3.jpg | without_mask |
| 3 | Image_4.jpg | without_mask |
| 4 | Image_5.jpg | without_mask |
labels.tail() # will display the last five rows in labels dataframe
| | filename | label |
| --- | --- | --- |
| 11259 | Image_11260.jpg | with_mask |
| 11260 | Image_11261.jpg | with_mask |
| 11261 | Image_11262.jpg | with_mask |
| 11262 | Image_11263.jpg | with_mask |
| 11263 | Image_11264.jpg | with_mask |
Getting the image file paths
file_paths = [[fname, '/content/content/face_mask_detection/train/' + fname] for fname in labels['filename']]
Confirming if no. of labels is equal to no. of images
# Confirm if number of images is same as number of labels given
if len(labels) == len(file_paths):
    print('Number of labels i.e. ', len(labels), 'matches the number of filenames i.e. ', len(file_paths))
else:
    print('Number of labels does not match the number of filenames')
Number of labels i.e. 11264 matches the number of filenames i.e. 11264
Converting the file_paths to dataframe
images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
images.head()
| | filename | filepaths |
| --- | --- | --- |
| 0 | Image_1.jpg | /content/content/face_mask_detection/train/Ima... |
| 1 | Image_2.jpg | /content/content/face_mask_detection/train/Ima... |
| 2 | Image_3.jpg | /content/content/face_mask_detection/train/Ima... |
| 3 | Image_4.jpg | /content/content/face_mask_detection/train/Ima... |
| 4 | Image_5.jpg | /content/content/face_mask_detection/train/Ima... |
Combining the labels with the images
train_data = pd.merge(images, labels, how = 'inner', on = 'filename')
train_data.head()
| | filename | filepaths | label |
| --- | --- | --- | --- |
| 0 | Image_1.jpg | /content/content/face_mask_detection/train/Ima... | without_mask |
| 1 | Image_2.jpg | /content/content/face_mask_detection/train/Ima... | without_mask |
| 2 | Image_3.jpg | /content/content/face_mask_detection/train/Ima... | without_mask |
| 3 | Image_4.jpg | /content/content/face_mask_detection/train/Ima... | without_mask |
| 4 | Image_5.jpg | /content/content/face_mask_detection/train/Ima... | without_mask |
The ‘train_data’ dataframe contains all the image ids, their locations and their respective labels. Now the training data is ready.
Data Pre-processing
It is necessary to bring all the images to the same shape and size and to convert them to their pixel values, because machine learning and deep learning models accept only numerical data. We also need to convert the labels from categorical to numerical values.
# Your Code goes here
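The cell above is left for your own approach. For reference, here is a minimal sketch of one possible preprocessing pipeline; the 100x100 target size, the 0/1 label mapping and the 80/20 split are assumptions, not requirements.

img_size = 100  # assumed target size; any consistent size works

data = []
for path in train_data['filepaths']:
    img = cv2.imread(path)                       # read the image as a BGR pixel array
    img = cv2.resize(img, (img_size, img_size))  # bring every image to the same shape
    data.append(img)

X = np.array(data) / 255.0  # scale pixel values from [0, 255] to [0, 1]

# Encode the categorical labels as numbers: without_mask -> 0, with_mask -> 1
y = train_data['label'].map({'without_mask': 0, 'with_mask': 1}).values

# Hold out a validation set for evaluating the model later
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)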
Building Model
Now we are finally ready, and we can train the model.
# Your code goes here
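For reference, here is a minimal CNN sketch that fits the preprocessing sketch above; the architecture and hyperparameters are illustrative assumptions, not the one required solution. Note that it uses a single sigmoid output neuron with binary_crossentropy, which matters later when interpreting the predictions.

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(img_size, img_size, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # one neuron -> one probability per image
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',  # matches the single sigmoid output
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val))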
Validate the model
Wonder🤔 how well your model learned! Let's check its performance on the X_val data.
# Your code goes here
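A minimal check, assuming the model and the X_val/y_val split from the sketches above:

val_loss, val_acc = model.evaluate(X_val, y_val)  # loss and accuracy on the validation set
print('Validation accuracy:', val_acc)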
Predict The Output For Testing Dataset
We have trained our model and evaluated it; now we will finally predict the output/target for the testing data (i.e. ‘Testing_set_face_mask.csv’) given in the ‘How To Submit’ section of the problem page.
Load Test Set
Load the test data on which final submission is to be made.
# Loading the order of the image names that has been provided
test_image_order = pd.read_csv("/content/content/face_mask_detection/Testing_set_face_mask.csv")
test_image_order.head()
| | filename | label |
| --- | --- | --- |
| 0 | Image_1.jpg | NaN |
| 1 | Image_2.jpg | NaN |
| 2 | Image_3.jpg | NaN |
| 3 | Image_4.jpg | NaN |
| 4 | Image_5.jpg | NaN |
Getting the image file paths
file_paths = [[fname, '/content/content/face_mask_detection/test/' + fname] for fname in test_image_order['filename']]
Confirming that the number of images in the test folder matches the number of image names in ‘Testing_set_face_mask.csv’
# Confirm if number of images is same as number of image names given
if len(test_image_order) == len(file_paths):
    print('Number of image names i.e. ', len(test_image_order), 'matches the number of file paths i.e. ', len(file_paths))
else:
    print('Number of image names does not match the number of filepaths')
Number of image names i.e. 1536 matches the number of file paths i.e. 1536
Converting the file_paths to dataframe
test_images = pd.DataFrame(file_paths, columns=['filename', 'filepaths'])
test_images.head()
| | filename | filepaths |
| --- | --- | --- |
| 0 | Image_1.jpg | /content/content/face_mask_detection/test/Imag... |
| 1 | Image_2.jpg | /content/content/face_mask_detection/test/Imag... |
| 2 | Image_3.jpg | /content/content/face_mask_detection/test/Imag... |
| 3 | Image_4.jpg | /content/content/face_mask_detection/test/Imag... |
| 4 | Image_5.jpg | /content/content/face_mask_detection/test/Imag... |
Combining the test_image_order dataframe and the test_images dataframe
test_data = pd.merge(test_images, test_image_order, how = 'inner', on = 'filename')
test_data.head()
| | filename | filepaths | label |
| --- | --- | --- | --- |
| 0 | Image_1.jpg | /content/content/face_mask_detection/test/Imag... | NaN |
| 1 | Image_2.jpg | /content/content/face_mask_detection/test/Imag... | NaN |
| 2 | Image_3.jpg | /content/content/face_mask_detection/test/Imag... | NaN |
| 3 | Image_4.jpg | /content/content/face_mask_detection/test/Imag... | NaN |
| 4 | Image_5.jpg | /content/content/face_mask_detection/test/Imag... | NaN |
Drop the ‘label’ column from test_data
test_data.drop('label', axis = 1, inplace=True)
The ‘test_data’ dataframe contains all the image ids and their locations. Now the test data is ready for pre-processing.
Note:
- Use the same techniques to deal with missing values as were used for the training dataset.
- Don’t remove any observation/record from the test dataset, otherwise you will get a wrong answer. The number of items in your prediction must be the same as the number of records in the test dataset.
- Use the same techniques to preprocess the data as were used for the training dataset.
Why do we need to repeat the same procedure of filling missing values, data cleaning and data preprocessing on the new test data as was done for the training and validation data?
Ans: Because our model has been trained on a certain format of data, and if we don’t provide the test data in a similar format, the model will give erroneous predictions and its error will increase. Also, if the model was built on ‘n’ features, you should always give it the same number of features when predicting on new test data. If you provide a different number of features, your ML model will throw a ValueError saying something like ‘number of features given x; expecting n’. Not confident about these statements? Well, as a data scientist you should always perform experiments and observe the results.
Data Pre-processing on test_data
# Your code goes here
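As with training, a minimal sketch is shown below (reusing the assumed img_size from the training sketch; apply whatever preprocessing you actually did to the training images):

test_pixels = []
for path in test_data['filepaths']:
    img = cv2.imread(path)
    img = cv2.resize(img, (img_size, img_size))  # same target size as the training images
    test_pixels.append(img)

X_test = np.array(test_pixels) / 255.0  # same scaling as the training images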
Make Prediction on Test Dataset
Time to make a submission!!!
pred = model.predict(X_test)  # X_test: the preprocessed test images from the step above
# The predicted values are probabilities
pred[0]
array([0.28916276], dtype=float32)
Note: If you use one output neuron while defining the model and binary_crossentropy as the loss while compiling it, you will get a single probability value.
But if you use more than one neuron in the output layer and sparse_categorical_crossentropy as the loss function, you will get multiple probability values (one per class).
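To make the distinction concrete, here is an illustrative sketch of the two output-layer configurations (not code from this notebook):

from tensorflow.keras import layers

# (a) one output neuron + loss='binary_crossentropy'
#     -> model.predict() returns a single probability per image
binary_output = layers.Dense(1, activation='sigmoid')

# (b) one neuron per class + loss='sparse_categorical_crossentropy'
#     -> model.predict() returns one probability per class; pick the label with np.argmax
multiclass_output = layers.Dense(2, activation='softmax')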
Here our predictions are single probability values. There are two commonly used methods to get the actual labels in this case.
- Using a threshold value
- Using np.round() function
These work only for binary classification problems.
1. Take a threshold of 0.5
If the probability value is less than 0.5, the prediction is 0 (i.e. without_mask); otherwise the prediction is 1 (i.e. with_mask).
Since the submission format is given as
predictions = [‘with_mask’, ‘without_mask’, ‘without_mask’, ‘with_mask’, …]
here we will convert the probability values into the same format, i.e. if the probability value is less than 0.5 the prediction will be ‘without_mask’, otherwise it will be ‘with_mask’
prediction = []
for value in pred:
    if value < 0.5:
        prediction.append('without_mask')  # below the 0.5 threshold
    else:
        prediction.append('with_mask')     # at or above the threshold
2. Using np.round() function
In this case np.round() will return the nearest integer value, i.e. either 0 or 1, which you can then map back to ‘without_mask’/‘with_mask’.
np.round(pred[0]) # since the probability value shown above is 0.28 whose closest integer is 0
array([0.], dtype=float32)
If you had used more than one neuron in the output layer and sparse_categorical_crossentropy, you would get multiple probability values as the prediction, one for each class. In this case you can use np.argmax() to get the required label: it returns the index of the maximum probability value.
np.argmax(pred[0])
Note: Follow the submission guidelines given in ‘How To Submit’ Section.
How to save prediction results locally via jupyter notebook?
If you are working on a Jupyter notebook, execute the code block below. A file named ‘submission.csv’ will be created in your current working directory.
res = pd.DataFrame({'filename': test_data['filename'], 'label': prediction})  # 'prediction' holds your model's final predictions on the unseen test data
res.to_csv("submission.csv", index=False)  # saved next to this notebook; index=False keeps the row index out of the file
OR,
If you are working on Google Colab then use the below set of code to save prediction results locally
How to save prediction results locally via colab notebook?
If you are working on a Google Colab notebook, execute the code block below. A file named ‘submission.csv’ will be downloaded to your system.
res = pd.DataFrame({'filename': test_data['filename'], 'label': prediction})  # 'prediction' holds your model's final predictions on the unseen test data
res.to_csv("submission.csv", index=False)  # index=False keeps the row index out of the file
# To download the csv file locally
from google.colab import files
files.download('submission.csv')
Well Done! 
You are all set to make a submission. Let’s head to the challenge page to make the submission.