Assignment-1: Solutions
(Absolute Beginners and Beginners)
This is the first assignment of the DPhi Data Science Bootcamp and revolves around Data Analysis and Data Visualisation on the Standard Metropolitan Area Dataset.
About the Dataset
It contains data of 99 standard metropolitan areas in the US. The data set provides information on 10 variables for each area for the period 1976-1977. The areas have been divided into 4 geographic regions: 1=North-East, 2=North-Central, 3=South, 4=West.
Link to the Dataset: https://bit.ly/SMA_Dataset
We first import the packages under their standard aliases: pd for pandas, np for numpy, plt for matplotlib.pyplot and sns for seaborn.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We use the pandas read_csv method to read the dataset’s CSV file. Inside the brackets, we can specify either the path of the file or a downloadable link.
Here, the file is hosted on GitHub, so we can use the link directly to load and access it. If you are loading your own file, make sure to specify its full path.
Please go through the following module to learn about working with CSV files: https://bit.ly/DPhi_Day9
The dataset is stored in the variable ‘data’.
data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Assignment_Solutions/master/Standard%20Metropolitan%20Areas%20Data%20-%20train_data%20-%20data.csv")
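As a small aside, read_csv also accepts a file-like object, which is handy for trying things out without a network connection. The sketch below is only an illustration: csv_text is a hypothetical in-memory excerpt (three columns, two rows of values from this dataset), not the real file.

```python
import io
import pandas as pd

# Hypothetical excerpt held in memory (an assumption for illustration, not the real file)
csv_text = "land_area,region,crime_rate\n1384,1,75.55\n3719,2,56.03\n"

# read_csv accepts a local path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
```

The same call works unchanged if the argument is replaced by a path string or a link.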
The pandas head() method returns the top n rows (5 by default) of a DataFrame. It gives us a quick impression of what the data looks like.
data.head()
| | land_area | percent_city | percent_senior | physicians | hospital_beds | graduates | work_force | income | region | crime_rate |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1384 | 78.1 | 12.3 | 25627 | 69678 | 50.1 | 4083.9 | 72100 | 1 | 75.55 |
1 | 3719 | 43.9 | 9.4 | 13326 | 43292 | 53.9 | 3305.9 | 54542 | 2 | 56.03 |
2 | 3553 | 37.4 | 10.7 | 9724 | 33731 | 50.6 | 2066.3 | 33216 | 1 | 41.32 |
3 | 3916 | 29.9 | 8.8 | 6402 | 24167 | 52.2 | 1966.7 | 32906 | 2 | 67.38 |
4 | 2480 | 31.5 | 10.5 | 8502 | 16751 | 66.1 | 1514.5 | 26573 | 4 | 80.19 |
You can change the number of rows being displayed by specifying a number inside the head function.
Let’s look at 10 rows now
data.head(10)
| | land_area | percent_city | percent_senior | physicians | hospital_beds | graduates | work_force | income | region | crime_rate |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1384 | 78.1 | 12.3 | 25627 | 69678 | 50.1 | 4083.9 | 72100 | 1 | 75.55 |
1 | 3719 | 43.9 | 9.4 | 13326 | 43292 | 53.9 | 3305.9 | 54542 | 2 | 56.03 |
2 | 3553 | 37.4 | 10.7 | 9724 | 33731 | 50.6 | 2066.3 | 33216 | 1 | 41.32 |
3 | 3916 | 29.9 | 8.8 | 6402 | 24167 | 52.2 | 1966.7 | 32906 | 2 | 67.38 |
4 | 2480 | 31.5 | 10.5 | 8502 | 16751 | 66.1 | 1514.5 | 26573 | 4 | 80.19 |
5 | 2815 | 23.1 | 6.7 | 7340 | 16941 | 68.3 | 1541.9 | 25663 | 3 | 58.48 |
6 | 8360 | 46.3 | 8.2 | 4047 | 14347 | 53.6 | 1321.2 | 18350 | 3 | 72.25 |
7 | 6794 | 60.1 | 6.3 | 4562 | 14333 | 51.7 | 1272.7 | 18221 | 3 | 64.88 |
8 | 3049 | 19.5 | 12.1 | 4005 | 21149 | 53.4 | 967.5 | 15826 | 1 | 30.51 |
9 | 4647 | 31.5 | 9.2 | 3916 | 12815 | 65.1 | 1032.2 | 14542 | 2 | 55.30 |
What we can gather from the displayed data is that we have 10 columns, i.e. 10 features.
We have the following description about each of these columns already:
- land_area : size in square miles
- percent_city : percent of population in central city/cities
- percent_senior : percent of population ≥ 65 years
- physicians : number of professionally active physicians
- hospital_beds : total number of hospital beds
- graduates : percent of adults that finished high school
- work_force : number of persons in work force in thousands
- income : total income in 1976 in millions of dollars
- crime_rate : Ratio of number of serious crimes by total population
- region : geographic region according to US Census
We can see that the regions have 4 values, where:
1 = North-East
2 = North-Central
3 = South
4 = West
The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 10 columns):
land_area 99 non-null int64
percent_city 99 non-null float64
percent_senior 99 non-null float64
physicians 99 non-null int64
hospital_beds 99 non-null int64
graduates 99 non-null float64
work_force 99 non-null float64
income 99 non-null int64
region 99 non-null int64
crime_rate 99 non-null float64
dtypes: float64(5), int64(5)
memory usage: 7.8 KB
Using the info function, we learn that the data has 99 entries (rows) and that every column contains 99 non-null entries. This indicates that we don’t have any null values.
Also, our data consists of float and integer values.
Solutions
Now that we’ve had a good look at our dataset, we can start solving the questions.
Question 1:
Which of the following information is correct about the data?
Select one:
a. There are 4 variables of ‘float’ dtype, 5 variables of ‘int’ dtype.
b. There are 6 variables of ‘float’ dtype, 4 variables of ‘int’ dtype
c. There are 3 variables of ‘float’ dtype, 7 variables of ‘int’ dtype
d. There are 5 variables of ‘float’ dtype, 5 variables of ‘int’ dtype.
Answer 1:
This question demands the knowledge of data types of various features in our dataset.
We can easily find that out using the info function. Using the resulting table of the info function used above, we can count the number of float and int variables.
We can see that there are 5 float variables (percent_city, percent_senior, graduates, work_force, crime_rate).
There are also 5 int variables (land_area, physicians, hospital_beds, income, region).
The correct option for this question is therefore d - There are 5 variables of ‘float’ dtype, 5 variables of ‘int’ dtype.
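Instead of counting rows in the info() output by eye, the dtypes can also be counted in code. This is a minimal sketch on a hypothetical four-column sample built from the first rows shown above; on the full DataFrame the same calls would be applied to data.

```python
import pandas as pd

# Hypothetical sample: four columns and three rows from the head() output above
sample = pd.DataFrame({
    "land_area": [1384, 3719, 3553],
    "percent_city": [78.1, 43.9, 37.4],
    "region": [1, 2, 1],
    "crime_rate": [75.55, 56.03, 41.32],
})

# How many columns there are of each dtype
print(sample.dtypes.value_counts())

# The same counts, computed explicitly per kind of dtype
n_float = sum(pd.api.types.is_float_dtype(t) for t in sample.dtypes)
n_int = sum(pd.api.types.is_integer_dtype(t) for t in sample.dtypes)
print(n_float, n_int)
```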
Question 2: Which of the following is true about bar chart and histogram?
Select one or more:
a. Histogram presents quantitative data while bar chart presents categorical data / discrete data.
b. In histogram bars cannot be re-ordered while in bar charts bars can be re-ordered.
c. Histogram indicates distribution of non - discrete variables.
d. Bar chart indicates comparison of discrete variables / categorical variables.
Answer 2:
a.
- A histogram displays a single quantitative variable along the x-axis and the frequency of that variable on the y-axis.
- A bar chart plots a quantitative value (the bar height) for each category or discrete value.
Both statements are true, and thus option a is correct.
b. Bars can be reordered in bar charts but not in histograms. Option b is correct.
c. A histogram represents a continuous (non-discrete) variable, so there are no gaps between its bars. Option c is correct.
d. It is true that a bar chart compares discrete / categorical variables. Option d is correct.
Thus, all the options are correct.
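The difference can also be seen in the numbers behind each chart. The sketch below is only an illustration: it uses the income and region values of the first ten rows shown above, and the choice of 3 bins is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# income and region of the first ten rows, as displayed by data.head(10)
income = pd.Series([72100, 54542, 33216, 32906, 26573, 25663, 18350, 18221, 15826, 14542])
region = pd.Series([1, 2, 1, 2, 4, 3, 3, 3, 1, 2])

# Histogram: a quantitative variable is cut into adjacent bins (3 bins chosen arbitrarily)
counts, edges = np.histogram(income, bins=3)
print(counts, edges)

# Bar chart: one bar per category, and the bars could be shown in any order
bar_heights = region.value_counts().sort_index()
print(bar_heights)
```

The histogram's bins follow the number line, so they cannot be reordered, while the region bars are independent categories.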
Question 3:
Which of the following information is correct about the data?
Select one or more:
a.
- mean land area of these Metropolitan areas = 2615.73
- maximum crime rate among all the Metropolitan areas = 85.62
b.
- minimum income in million dollars of all the metropolitan areas = 769.00
c.
- total count of non-null entries in ‘hospital_beds’ = 99.0
- Most of the Metropolitan areas lie in region 3
d.
- only 17 Metropolitan areas lie in region 4
- average crime rate among the metropolitan areas = 55.64
For this type of question, we need to check every option to find out which ones are true.
Let’s write code to check them one by one:
Answer 3:
a - 1) mean land area of these Metropolitan areas = 2615.73
We can find the mean of the land area with NumPy’s mean function.
We need to pass the column inside the brackets of the mean function.
Accessing DataFrame’s columns
Now, there are actually two methods of accessing a DataFrame’s column:
- Square brackets ([])
- Dot operator (.)
- They are the same as long as you’re accessing a single column with a simple name, but you can do more with the bracket notation.
- You can only use df.col if the column name is a valid Python identifier (e.g., it does not contain spaces or other special characters). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum).
- With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = …), which can’t be done with dot access.
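The points above can be tried out on a tiny, made-up DataFrame. Everything in this sketch is hypothetical: the column names "land area" (with a space), "sum" and "double_area" are chosen only to trigger the cases described in the list.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, another clashes with a method name
df = pd.DataFrame({"land area": [1384, 3719], "sum": [1, 2]})

# Bracket notation works for any column name
areas = df["land area"]

# Dot notation finds the DataFrame.sum method, not the column called "sum"
dot_result_is_method = callable(df.sum)
sum_column = df["sum"]

# New columns have to be created with brackets
df["double_area"] = df["land area"] * 2

print(dot_result_is_method, list(df.columns))
```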
# Dot notation
np.mean(data.land_area)
2615.7272727272725
# Bracket notation
np.mean(data['land_area'])
2615.7272727272725
Both of these work and give us the same result. Notice that the mean is printed with 13 decimal places.
We can use Python’s built-in round function to show the mean with the same number of decimal places as in the options. No import is needed for it.
round(np.mean(data.land_area), 2) # The second argument, 2, is the number of decimal places to round to
2615.73
Our answer matches the option. Thus, a - 1) is correct.
a - 2) maximum crime rate among all the Metropolitan areas = 85.62
This requires us to use the max function on the crime_rate column.
max(data.crime_rate)
85.62
We obtained the same result as specified in the option. Thus, a- 2) is correct
Since both the parts of option a are correct, we can say that a is correct
Let’s find out if any of the other options is correct now.
b - 1) minimum income in million dollars of all the metropolitan areas = 769.00
Similar to the previous option, we can use the min function on the income column
min(data.income)
769
769.00 and 769 are the same value. Thus, option b is correct.
c - 1) total count of non null entries in ‘hospital_beds’ = 99.0
We already know from the info function that we have 99 non null entries in each column.
We can still confirm it with the notnull function of Pandas applied on the hospital_beds column
pd.notnull(data['hospital_beds'])
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 True
22 True
23 True
24 True
25 True
26 True
27 True
28 True
29 True
...
69 True
70 True
71 True
72 True
73 True
74 True
75 True
76 True
77 True
78 True
79 True
80 True
81 True
82 True
83 True
84 True
85 True
86 True
87 True
88 True
89 True
90 True
91 True
92 True
93 True
94 True
95 True
96 True
97 True
98 True
Name: hospital_beds, Length: 99, dtype: bool
Every entry in the result is True and the Series has length 99, so all 99 entries/rows in the column are non-null.
Thus, c-1) is correct.
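Reading a long column of True values is error-prone, so the non-null entries can also be counted directly. In this sketch the Series is hypothetical: it holds three hospital_beds values from the table above plus one missing value injected on purpose, so that the length and the non-null count differ.

```python
import numpy as np
import pandas as pd

# Hypothetical column: three real values and one injected missing value
beds = pd.Series([69678, np.nan, 33731, 24167], name="hospital_beds")

mask = beds.notnull()           # True where a value is present
total_rows = len(mask)          # counts every row, missing or not
non_null = int(mask.sum())      # True counts as 1, so the sum is the non-null count
non_null_alt = int(beds.count())  # count() skips missing values as well

print(total_rows, non_null, non_null_alt)
```

On the real column, data['hospital_beds'].count() would be expected to return 99, in line with the info() output.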
c - 2) Most of the Metropolitan areas lie in region 3
The value_counts() method returns a Series containing the counts of unique values. This means that, for any column in a DataFrame, this method returns how many times each unique entry occurs in that column.
data.region.value_counts()
3 36
2 25
1 21
4 17
Name: region, dtype: int64
The above code gives us the count of each region in the data[‘region’] column.
We can see that region 3 has the maximum count - 36.
Thus, c-2) is correct.
Since both the parts of the option c are correct, we can say that the option c is correct.
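The most frequent region can also be picked out programmatically. This sketch rebuilds a region column from the counts printed above (21, 25, 36 and 17); the order of the rows is an assumption and does not affect the counts.

```python
import pandas as pd

# Region column rebuilt from the value_counts() output above (row order is assumed)
region = pd.Series([1] * 21 + [2] * 25 + [3] * 36 + [4] * 17, name="region")

counts = region.value_counts()
most_common = counts.idxmax()                    # label with the highest count
shares = region.value_counts(normalize=True)     # fractions instead of counts

print(most_common, counts[4], round(shares[3], 2))
```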
d - 1) only 17 Metropolitan areas lie in region 4
The answer for this can be obtained from the same result of the above code block.
The region 4 has a count of 17 and thus d - 1) is correct.
d - 2) average crime rate among the metropolitan areas = 55.64
To find the average crime rate, we can apply the mean function to the crime_rate column.
np.mean(data.crime_rate)
55.6430303030303
Again, we can round the above result to two decimal places to match the option.
round(np.mean(data.crime_rate),2)
55.64
The average crime rate obtained is the same as that in the option. Thus, d-2) is correct.
Since both the parts of option d are correct, we can say that option d is correct.
We have now shown that all 4 options are correct.
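Most of the checks above (count, mean, min, max) can also be read from a single describe() call. The sketch below uses a hypothetical five-row sample (the first five rows shown earlier), so its numbers differ from those of the full dataset.

```python
import pandas as pd

# Hypothetical sample: two columns of the first five rows shown by data.head()
sample = pd.DataFrame({
    "land_area": [1384, 3719, 3553, 3916, 2480],
    "crime_rate": [75.55, 56.03, 41.32, 67.38, 80.19],
})

# One table with count, mean, std, min, quartiles and max for every numeric column
summary = sample.describe().round(2)
print(summary.loc[["count", "mean", "min", "max"]])
```

Running data.describe() on the full DataFrame should give the values needed for options a, b and d in one go.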
Question 4:
Which of the following is the correct correlation matrix among the features of the dataset?
Select one:
Option a:
Option b:
Answer 4:
We can obtain a correlation matrix using the corr method. The correlation matrix obtained below is the same as given in option a. Thus, option a is correct.
data.corr()
| | land_area | percent_city | percent_senior | physicians | hospital_beds | graduates | work_force | income | region | crime_rate |
---|---|---|---|---|---|---|---|---|---|---|
land_area | 1.000000 | -0.077320 | 0.092226 | 0.085054 | 0.081034 | 0.088728 | 0.135792 | 0.111404 | 0.292392 | 0.293907 |
percent_city | -0.077320 | 1.000000 | -0.250995 | 0.067391 | 0.052898 | 0.076720 | 0.016011 | 0.019235 | 0.235880 | 0.159596 |
percent_senior | 0.092226 | -0.250995 | 1.000000 | 0.056454 | 0.083775 | -0.155695 | 0.035240 | 0.046073 | -0.242811 | -0.177992 |
physicians | 0.085054 | 0.067391 | 0.056454 | 1.000000 | 0.974241 | 0.049500 | 0.965597 | 0.976209 | -0.140961 | 0.187763 |
hospital_beds | 0.081034 | 0.052898 | 0.083775 | 0.974241 | 1.000000 | -0.003892 | 0.967913 | 0.974416 | -0.220305 | 0.109799 |
graduates | 0.088728 | 0.076720 | -0.155695 | 0.049500 | -0.003892 | 1.000000 | 0.044054 | 0.045578 | 0.246226 | 0.290880 |
work_force | 0.135792 | 0.016011 | 0.035240 | 0.965597 | 0.967913 | 0.044054 | 1.000000 | 0.996735 | -0.144022 | 0.175945 |
income | 0.111404 | 0.019235 | 0.046073 | 0.976209 | 0.974416 | 0.045578 | 0.996735 | 1.000000 | -0.152016 | 0.175797 |
region | 0.292392 | 0.235880 | -0.242811 | -0.140961 | -0.220305 | 0.246226 | -0.144022 | -0.152016 | 1.000000 | 0.636192 |
crime_rate | 0.293907 | 0.159596 | -0.177992 | 0.187763 | 0.109799 | 0.290880 | 0.175945 | 0.175797 | 0.636192 | 1.000000 |
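When comparing a matrix against the options, it can help to look at one column at a time, sorted by strength. This is a minimal sketch on a hypothetical sample (three columns of the first ten rows shown earlier), so the coefficients differ from those in the full matrix above.

```python
import pandas as pd

# Three columns of the first ten rows, as displayed by data.head(10)
sample = pd.DataFrame({
    "physicians": [25627, 13326, 9724, 6402, 8502, 7340, 4047, 4562, 4005, 3916],
    "hospital_beds": [69678, 43292, 33731, 24167, 16751, 16941, 14347, 14333, 21149, 12815],
    "graduates": [50.1, 53.9, 50.6, 52.2, 66.1, 68.3, 53.6, 51.7, 53.4, 65.1],
})

corr = sample.corr()  # Pearson correlation by default

# Correlations of every other column with physicians, strongest first
with_physicians = corr["physicians"].drop("physicians")
ranked = with_physicians.abs().sort_values(ascending=False)
strongest = ranked.index[0]

print(ranked)
```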
Question 5:
Which of the following shows the first five land areas of the metropolitan areas located in region 4?
Select one:
Option a:
Option b:
Option c:
Option d:
Answer 5:
This question involves conditional selection of rows.
- The == operator is a comparison operator. If the values of the two operands on either side are equal, the condition is true.
- The condition data['region']==4 checks, row by row, whether the region of an entry is 4.
- data[data['region']==4] selects all the rows that satisfy the condition.
- data[data['region']==4]['land_area'] selects the land_area column from those rows.
- Finally, data[data['region']==4]['land_area'].head() picks the first 5 rows, as the question asks for the first five land areas.
data[data['region']==4]['land_area'].head()
4 2480
13 782
14 4226
18 27293
20 9155
Name: land_area, dtype: int64
Since the above output matches option d, option d is correct.
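The same selection can be written in one step with .loc, which takes the row condition and the column name together. The sketch uses a hypothetical five-row subset whose index labels and values appear in the outputs of this notebook.

```python
import pandas as pd

# Hypothetical subset: five rows with their original index labels
sample = pd.DataFrame(
    {"land_area": [1384, 3719, 2480, 782, 4226], "region": [1, 2, 4, 4, 4]},
    index=[0, 1, 4, 13, 14],
)

mask = sample["region"] == 4                        # boolean Series, one value per row
first_two = sample.loc[mask, "land_area"].head(2)   # rows by condition, column by name

print(first_two)
```

On the full data, data.loc[data['region'] == 4, 'land_area'].head() would be the equivalent call.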
Question 6:
Which of the following information is correct about Standard Metropolitan Data?
Select one or more:
a. There isn’t any Metropolitan area which is located in region 4 and crime rate is 55.64.
b. There are 4 Metropolitan area which is located in region 1 and their crime rate is greater than or equal to 54.16
c. There is one Metropolitan area whose crime rate is 85.62 and is located in region 4.
d. There are 40 Metropolitan area which is located in region 3 or have land area greater than or equal to 5000
Answer 6:
This question involves conditional selection combined with logical operators. On pandas Series these are written & (and), | (or) and ~ (not), and each condition must be wrapped in its own parentheses.
a. There isn’t any Metropolitan area which is located in region 4 and crime rate is 55.64.
data[(data['region']==4) & (round(data['crime_rate'],2)==55.64)]
land_area | percent_city | percent_senior | physicians | hospital_beds | graduates | work_force | income | region | crime_rate |
---|---|---|---|---|---|---|---|---|---|
We obtained an empty table. This means that there isn’t any Metropolitan area which is located in region 4 and crime rate is 55.64.
Thus, option a is correct.
b. There are 4 Metropolitan area which is located in region 1 and their crime rate is greater than or equal to 54.16
data[(data['region']==1) & (round(data['crime_rate'],2)>=54.16)]
| | land_area | percent_city | percent_senior | physicians | hospital_beds | graduates | work_force | income | region | crime_rate |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1384 | 78.1 | 12.3 | 25627 | 69678 | 50.1 | 4083.9 | 72100 | 1 | 75.55 |
10 | 1008 | 16.6 | 10.3 | 4006 | 16704 | 55.9 | 935.5 | 15953 | 1 | 54.16 |
24 | 2966 | 26.9 | 10.3 | 2053 | 6604 | 56.3 | 450.4 | 6966 | 1 | 56.55 |
55 | 192 | 60.5 | 10.8 | 617 | 1789 | 44.1 | 212.6 | 3158 | 1 | 58.79 |
Since we have 4 records in the above result, the option b is correct.
c. There is one Metropolitan area whose crime rate is 85.62 and is located in region 4.
data[(data['region']==4) & (round(data['crime_rate'],2)==85.62)]
| | land_area | percent_city | percent_senior | physicians | hospital_beds | graduates | work_force | income | region | crime_rate |
---|---|---|---|---|---|---|---|---|---|---|
20 | 9155 | 53.8 | 11.1 | 2280 | 6450 | 60.1 | 575.2 | 7766 | 4 | 85.62 |
We received 1 entry that satisfies the above conditions. Thus, option c is correct.
d. There are 40 Metropolitan area which are located in region 3 or have land area greater than or equal to 5000
len(data[(data['region']==3) | (data['land_area']>=5000)])
41
There are 41 Metropolitan areas which are located in region 3 or have a land area greater than or equal to 5000, not 40. Thus, option d is incorrect.
Options a, b and c are correct.
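The pattern used in all four checks can be summarised in a few lines. This sketch runs on a hypothetical ten-row sample (the first ten rows shown earlier), so the counts are not those of the full dataset; np.isclose is used here as one possible way of comparing floats with a tolerance instead of rounding first.

```python
import numpy as np
import pandas as pd

# Three columns of the first ten rows, as displayed by data.head(10)
sample = pd.DataFrame({
    "land_area": [1384, 3719, 3553, 3916, 2480, 2815, 8360, 6794, 3049, 4647],
    "region": [1, 2, 1, 2, 4, 3, 3, 3, 1, 2],
    "crime_rate": [75.55, 56.03, 41.32, 67.38, 80.19, 58.48, 72.25, 64.88, 30.51, 55.30],
})

# AND: both conditions must hold; each one needs its own parentheses
both = (sample["region"] == 1) & (sample["crime_rate"] >= 54.16)
# OR: at least one condition holds
either = (sample["region"] == 3) | (sample["land_area"] >= 5000)
# NOT: invert a condition
not_region_3 = ~(sample["region"] == 3)
# Float comparison with a tolerance instead of an exact ==
close = np.isclose(sample["crime_rate"], 80.19) & (sample["region"] == 4)

# Summing a boolean mask counts the matching rows
print(both.sum(), either.sum(), not_region_3.sum(), close.sum())  # prints: 1 3 7 1
```

Using Python's own and / or keywords on Series raises an error, which is why the symbol operators are needed.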
Question 7:
Which of the following is the correct line plot between land area and crime rate?
Select one:
Option a:
Option b:
Option c:
Option d:
Answer 7:
We’ll use the plot function of Matplotlib to draw a line plot between land area and crime rate
plt.plot(data.land_area, data.crime_rate)
plt.show()
Since the plot matches with option a, Option a is correct.
Question 8:
Which of the following statements about the correlation between two variables is correct?
Select one:
a. There is a positive correlation between ‘hospital_beds’ and ‘physicians’ and its scatter plot is:
b. There is a negative correlation between ‘hospital_beds’ and ‘physicians’ and its scatter plot is:
c. There is a negative correlation between ‘hospital_beds’ and ‘physicians’ and its scatter plot is:
Answer 8:
We’ll use the scatter method of Matplotlib to draw a scatter plot between ‘hospital_beds’ and ‘physicians’.
plt.scatter(data.hospital_beds, data.physicians)
plt.show()
The scatter plot matches those of options a and b. We can see that the points trend upwards, i.e. a line drawn through them would have a positive slope.
In other words, the value of physicians increases as the value of hospital_beds increases. This means there is a positive correlation between ‘hospital_beds’ and ‘physicians’.
Thus, option a is correct.
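The visual impression can be backed up with numbers. This sketch computes the Pearson correlation and the slope of a fitted straight line on a hypothetical sample (the first ten rows shown earlier); the values on the full dataset will differ, although the correlation matrix in Answer 4 reports 0.974241 for this pair.

```python
import numpy as np
import pandas as pd

# Two columns of the first ten rows, as displayed by data.head(10)
sample = pd.DataFrame({
    "physicians": [25627, 13326, 9724, 6402, 8502, 7340, 4047, 4562, 4005, 3916],
    "hospital_beds": [69678, 43292, 33731, 24167, 16751, 16941, 14347, 14333, 21149, 12815],
})

# Pearson correlation coefficient between the two columns
r = sample["hospital_beds"].corr(sample["physicians"])

# Slope of the least-squares line physicians = slope * hospital_beds + intercept
slope, intercept = np.polyfit(sample["hospital_beds"], sample["physicians"], deg=1)

print(round(r, 2), slope > 0)
```

A positive coefficient and a positive slope both point to a positive correlation.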
Question 9:
Which of the following is the correct bar plot of geographic regions?
Select one:
Option a:
Option b:
Option c:
Answer 9:
The question requires us to create a bar plot of geographic regions against the frequency of each region.
This means that we’ll have to count the occurrence of each region in our data.
Let’s have a look at 2 methods of making the same bar plot:
Method 1 - Using the Pandas plot function to draw a bar plot
- In this method, we use the value_counts() function to find the frequency of each region in our data.
- The plot method with the kind='bar' argument helps us plot bar charts. We also pass a color argument with a list of the colours we want our bars to be plotted in.
- Also, the options show the bars ordered by region rather than by decreasing frequency, which is the order value_counts() returns by default. We can use sort_index() to sort the bars according to their indices.
ax = data['region'].value_counts().sort_index().plot(kind = 'bar', color = ['blue','orange','green','red'])
ax.set_title("Geographic Region according to US census")
ax.set_xlabel("Geographic Region")
ax.set_ylabel("Frequency")
plt.show()
Method 2 - Using seaborn to draw a colourful bar plot
- In this approach, we first use the groupby method to group values according to the regions.
- Then, we apply count to count the number of entries in each group.
- On printing this df, you can observe that all the columns now contain the same data, i.e. the frequencies of the groups. To get any particular column, you can specify the column name inside the square brackets. Here, we’ve arbitrarily used the first column, land_area.
- Since the rows are grouped according to regions, our index will be region.
- Finally, Seaborn’s barplot method is used to plot a bar graph between the index and the land_area column.
df1 = data.groupby('region').count()
df1
region | land_area | percent_city | percent_senior | physicians | hospital_beds | graduates | work_force | income | crime_rate |
---|---|---|---|---|---|---|---|---|---|
1 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
2 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
3 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 | 36 |
4 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
df = df1[['land_area']]
df
region | land_area |
---|---|
1 | 21 |
2 | 25 |
3 | 36 |
4 | 17 |
ax = sns.barplot(x=df.index, y=df.land_area) # recent seaborn versions require x and y as keyword arguments
ax.set_title("Geographic Region according to US census")
ax.set_xlabel("Geographic Region")
ax.set_ylabel("Frequency")
plt.show()
We have obtained the exact same plot as option c. Thus, option c is correct.
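As a side note on Method 2, groupby(...).size() returns the group frequencies as a single Series, so no arbitrary column has to be picked afterwards. The sketch rebuilds the region column from the counts shown above; the row order is an assumption and has no effect on the result.

```python
import pandas as pd

# Region column rebuilt from the counts shown above (row order is assumed)
data_sample = pd.DataFrame({"region": [1] * 21 + [2] * 25 + [3] * 36 + [4] * 17})

# Frequencies per region, indexed by region in ascending order
by_group = data_sample.groupby("region").size()

# The same frequencies via Method 1's approach
by_counts = data_sample["region"].value_counts().sort_index()

print(by_group)
```

Either Series can be passed to a bar plot as the bar heights, with its index as the x positions.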
Question 10:
Which of the following is showing the correct distribution of income of the Metropolitan areas?
Select one:
Option a:
Option b:
Option c:
Answer 10:
From the options, we understand that the question requires us to plot a histogram for income. We can do that with Matplotlib’s hist function.
plt.hist(data.income)
plt.title('Total income in 1976 in millions of dollars')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
We have obtained the exact same plot as option a. Thus, option a is correct.
And that’s it for all the questions of the assignment! If you faced difficulties in solving them, going through this article might be helpful:
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/