# Exploratory Data Analysis - Titanic Dataset


# Assignment - 2 : Solution

#### (Absolute Beginners and Beginners)

This is the second assignment of the DPhi Data Science Bootcamp. It revolves around exploratory data analysis of the Titanic dataset.

``````# Loading Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
``````
``````# Loading Data
titanic_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/First_ML_Model/master/titanic.csv", index_col = 'PassengerId')
``````

Question 1:

Select the correct statement about the titanic dataset.

1. The 'Fare' feature has 0 missing values
2. There are more male passengers than female passengers

Solution Code:

``````# For option 1: Checking missing values
titanic_data.isnull().sum()    # returns you the number of missing values in each column
``````
``````Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
``````
``````# For option 2: Count the male and female passengers on titanic
titanic_data.Sex.value_counts()    # value counts helps you to count the frequency of each category in a column.
``````
``````male      577
female    314
Name: Sex, dtype: int64
``````
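Both checks can be tried on a tiny hand-made frame before running them on the full dataset. The sketch below assumes a hypothetical four-row DataFrame named `toy` (it is not part of the Titanic data). It also shows `normalize=True`, a `value_counts` option that returns shares instead of counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [22.0, np.nan, 35.0, np.nan],
})

missing_per_column = toy.isnull().sum()                # missing values in each column
sex_counts = toy["Sex"].value_counts()                 # frequency of each category
sex_shares = toy["Sex"].value_counts(normalize=True)   # same frequencies as proportions

print(missing_per_column)
print(sex_counts)
print(sex_shares)
```

On the real data, replacing `toy` with `titanic_data` should reproduce the two outputs above.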

Question 2:

What is the proportion of passengers who survived?
Note: in this question, we are asking for an answer as a proportion. Therefore, your answer should take a value between 0 and 1, rounded to 2 decimal places.

1. 0.38
2. 0.39
3. 0.39
4. 0.40

Solution Code:

``````# First select all the passengers who survived, using the boolean mask titanic_data.Survived == 1
survived_passengers = titanic_data[titanic_data.Survived == 1]

# To get the proportion, divide the number of survivors by the total number of passengers on the Titanic
surv_pass_prop = len(survived_passengers) / len(titanic_data)

# Round the value to two decimal places
round(surv_pass_prop, 2)
``````
``````0.38
``````
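Because `Survived` only holds 0 and 1, its mean is the same proportion. A minimal sketch on a hypothetical five-row column (the values are made up) compares the two routes:

```python
import pandas as pd

# Hypothetical toy column: 1 = survived, 0 = did not survive
toy = pd.DataFrame({"Survived": [1, 0, 0, 1, 0]})

# Route used in the solution: count the survivors, divide by the total
prop_by_count = len(toy[toy.Survived == 1]) / len(toy)

# Shortcut: the mean of a 0/1 column is the share of ones
prop_by_mean = toy.Survived.mean()

print(round(prop_by_count, 2), round(prop_by_mean, 2))
```

The shortcut relies on the column being coded strictly as 0/1 with no missing values, which matches the missing-value check in Question 1.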

Question 3:

What is the median Fare of the passengers?

1. 14.4542
2. 13.4542
3. 32.2042
4. None

``````# Method 1: Sort the values in the column 'Fare' and take the value at the middle position
# (this equals the median only because the number of fares, 891, is odd)
fares = titanic_data.Fare.sort_values().values
n = len(fares)
fares[n//2]
``````
``````14.4542
``````
``````# Method 2: Get the second quartile value (the 50th percentile)
titanic_data.Fare.quantile(0.5)
``````
``````14.4542
``````
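Picking the element at the middle position matches the median here because the dataset has an odd number of rows (891). With an even number of values the two approaches can differ, as this sketch on four made-up fares suggests:

```python
import pandas as pd

# Hypothetical fares, made up for illustration (even number of values)
fares = pd.Series([7.25, 8.05, 13.0, 26.0])

sorted_fares = fares.sort_values().values
middle_pick = sorted_fares[len(sorted_fares) // 2]   # upper of the two middle values
true_median = fares.median()                          # average of the two middle values
second_quartile = fares.quantile(0.5)                 # same as the median

print(middle_pick, true_median, second_quartile)
```

For general use, `median()` or `quantile(0.5)` is the safer choice, since both handle odd and even lengths.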

Question 4:

Select the true statements: (one or more correct)

1. The percentage of women who survived was higher than the percentage of men who survived
2. It looks like first class passengers were given priority to survive
3. It looks like children were given priority to survive

Correct answer: all of the above

``````# For option 1: Check the percentage of male who survived and percentage of female who survived
print('% of men who survived', 100*np.mean(titanic_data['Survived'][titanic_data['Sex'] == 'male']))       # % of male survived
print('% of women who survived', 100*np.mean(titanic_data['Survived'][titanic_data['Sex'] == 'female']))   # % of female survived
print('*'*10)

# For option 2: Check the percentage of people survived in each class
print('% of passengers who survived in first class', 100*np.mean(titanic_data['Survived'][titanic_data['Pclass'] == 1]))    # % of first class passengers survived
print('% of passengers who survived in second class', 100*np.mean(titanic_data['Survived'][titanic_data['Pclass'] == 2]))  # % of second class passengers survived
print('% of passengers who survived in third class', 100*np.mean(titanic_data['Survived'][titanic_data['Pclass'] == 3]))  # % of third class passengers survived
print('*'*10)

# For option 3: Check whether a higher percentage of passengers below age 18 survived than of passengers aged 18 and above
print('% of children who survived', 100*np.mean(titanic_data['Survived'][titanic_data['Age'] < 18])) # % of passengers below age 18 survived
print('% of adults who survived', 100*np.mean(titanic_data['Survived'][titanic_data['Age'] >= 18]))  # % of passengers greater than or equal to age 18 survived
``````
``````% of men who survived 18.890814558058924
% of women who survived 74.20382165605095
**********
% of passengers who survived in first class 62.96296296296296
% of passengers who survived in second class 47.28260869565217
% of passengers who survived in third class 24.236252545824847
**********
% of children who survived 53.98230088495575
% of adults who survived 38.10316139767055
``````
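1. The percentage of women who survived is higher than the percentage of men who survived.
2. The percentage of first class passengers who survived is higher than in the other two classes. Hence it looks like first class passengers were given more priority.
3. The percentage of passengers below age 18 who survived is higher than the percentage of passengers aged 18 and above.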
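The per-group percentages can also be computed in one step with `groupby`. This is only a sketch of an alternative, not the approach used in the solution, and it runs on a hypothetical six-row frame with made-up values:

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 1, 3, 3],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column within a group is the share of survivors in that group
pct_by_sex = 100 * toy.groupby("Sex")["Survived"].mean()
pct_by_class = 100 * toy.groupby("Pclass")["Survived"].mean()

print(pct_by_sex)
print(pct_by_class)
```

Grouping avoids repeating one `print` line per category, which makes it harder to mislabel a row.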

Question 5:

Create a subset of the data, taking only the observations for which the passenger survived. Call this newly created dataset survived_passengers. How many of the surviving passengers embarked from 'Southampton', i.e. 'S'?

1. 217
2. 644
3. 168
4. 77

``````# First get the passengers who survived
survived_passengers = titanic_data[titanic_data.Survived == 1]

# Count the survivors who embarked from each port using value_counts()
survived_passengers.Embarked.value_counts()
``````
``````S    217
C     93
Q     30
Name: Embarked, dtype: int64
``````
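If only the single count for 'S' is needed, the two conditions can be combined in one boolean mask. The following sketch assumes a hypothetical five-row frame with made-up values:

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# Combine both conditions with & (each condition needs its own parentheses);
# summing a boolean Series counts the True values
survived_from_s = ((toy.Survived == 1) & (toy.Embarked == "S")).sum()

# Same number via the subset-then-count route used in the solution
via_value_counts = toy[toy.Survived == 1].Embarked.value_counts()["S"]

print(survived_from_s, via_value_counts)
```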

Question 6:

Which of the following features plays an important role in the survival of the passengers?

1. Name
2. Ticket
3. Age

A passenger cannot be given priority to survive on the basis of name alone, so option 1 is not correct.

Ticket is a randomly generated value assigned at booking, so it cannot be a reason for a passenger to be given priority to survive.

We have already seen above that a higher percentage of passengers below age 18 survived than of passengers aged 18 and above. So age plays an important role in the survival of the passengers.
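One way to look at age and survival together is to bin the ages and compare survival rates per bin. The sketch below uses `pd.cut` on a hypothetical six-row frame; the values, the bin edges, and the labels `child` and `adult` are all assumptions made for illustration. Rows with a missing age fall into no bin and are left out of the comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Age": [4.0, 16.0, 30.0, 45.0, np.nan, 62.0],
    "Survived": [1, 1, 0, 1, 1, 0],
})

# Bin ages into [0, 18) and [18, 120); a missing age is assigned to no bin
age_group = pd.cut(toy["Age"], bins=[0, 18, 120], right=False,
                   labels=["child", "adult"])

# Survival rate per age group; the row with a missing age is excluded
rate_by_age_group = toy.groupby(age_group, observed=True)["Survived"].mean()

print(rate_by_age_group)
```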

Question 7:

Five highest fares of the passengers:

1. [512.3292, 512.3292, 512.3292, 263.0, 263.0]
2. [510.3292, 512.3292, 512.3292, 263.0, 263.0]
3. [512.3292, 512.3292, 512.3292, 263.0, 256.0]
4. [512.3292, 520.3292, 512.3292, 263.0, 263.0]

``````# sort the values in fare in descending order and get the first five values
list(titanic_data.Fare.sort_values(ascending = False))[:5]
``````
``````[512.3292, 512.3292, 512.3292, 263.0, 263.0]
``````
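pandas also offers `nlargest`, which returns the top values without sorting the whole column by hand. A minimal sketch on made-up fares (not taken from the dataset):

```python
import pandas as pd

# Hypothetical fares, made up for illustration only
fares = pd.Series([7.25, 80.0, 512.0, 8.05, 80.0, 512.0, 30.0])

# Route used in the solution: sort descending, keep the first five
top_sorted = list(fares.sort_values(ascending=False))[:5]

# Alternative: ask directly for the five largest values
top_nlargest = fares.nlargest(5).tolist()

print(top_sorted)
print(top_nlargest)
```

Both routes keep duplicate values, which matters here because the highest fares repeat.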

Question 8:

The median age of the passengers is:

1. 28.0
2. 29.0
3. 27.0
4. 30.0

``````# Method 1
titanic_data.Age.median()
``````
``````28.0
``````
``````# Method 2: If a column has missing values, np.median() returns nan.
# Use np.nanmedian() instead; it ignores the missing values and gives the correct median
np.nanmedian(titanic_data.Age)
``````
``````28.0
``````
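The difference between the two NumPy functions is easy to see on a short series with missing values. The ages below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values, made up for illustration
ages = pd.Series([22.0, np.nan, 35.0, 28.0, np.nan])

plain_median = np.median(ages.to_numpy())   # nan, because NaN propagates
nan_aware_median = np.nanmedian(ages)       # ignores the NaN entries
pandas_median = ages.median()               # pandas skips NaN by default

print(plain_median, nan_aware_median, pandas_median)
```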

Question 9:

Select the correct statement:

1. There are 891 unique values in the Name column
2. There are 714 unique values in the Name column

``````# Get all the unique values in the Name column using the unique() method, then take the length of the result
len(titanic_data.Name.unique())
``````
``````891
``````
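pandas also provides `nunique()` for this count. One difference worth knowing: `unique()` keeps a missing value as an entry, while `nunique()` excludes it by default. The Name column has no missing values, so both agree there; the sketch below uses made-up names to show where they differ:

```python
import numpy as np
import pandas as pd

# Hypothetical names with one duplicate and one missing value
names = pd.Series(["Ann", "Bob", "Ann", np.nan])

count_via_unique = len(names.unique())              # NaN counts as an entry
count_via_nunique = names.nunique()                 # NaN excluded by default
count_with_nan = names.nunique(dropna=False)        # NaN included on request

print(count_via_unique, count_via_nunique, count_with_nan)
```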

Question 10:

Most of the passengers have -------- siblings/spouses.

1. 5
2. 1
3. 0
4. 209
5. 8

``````# Use the value_counts() method on the column 'SibSp' to get the number of occurrences of each value in the column
titanic_data.SibSp.value_counts()
``````
``````0    608