Data Analysis and Data Visualizations - SMA Dataset

Contribute: Found a typo, or any other change that could improve this notebook tutorial? Please consider sending us a pull request in the public repo of the notebook.

Assignment-1: Solutions

(Absolute Beginners and Beginners)

This is the first assignment of the DPhi Data Science Bootcamp and revolves around Data Analysis and Data Visualisation on the Standard Metropolitan Area Dataset.

About the Dataset

It contains data of 99 standard metropolitan areas in the US. The data set provides information on 10 variables for each area for the period 1976-1977. The areas have been divided into 4 geographic regions: 1=North-East, 2=North-Central, 3=South, 4=West.

Link to the Dataset: https://bit.ly/SMA_Dataset

We first import the packages under their standard aliases: pd for pandas, np for numpy, plt for matplotlib.pyplot and sns for seaborn.

%matplotlib inline

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

We use the Pandas read_csv function to read the dataset's CSV file. Inside the parentheses, we can specify either the path of the file or a downloadable link.

Here, the file is uploaded on GitHub and we can directly use the link to load and access it. If you are uploading your own file, make sure to specify the full path of the file.

Please go through the following module to learn about working with CSV files: https://bit.ly/DPhi_Day9

I am storing the dataset in the variable ‘data’

data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Assignment_Solutions/master/Standard%20Metropolitan%20Areas%20Data%20-%20train_data%20-%20data.csv")
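As a small offline illustration of the same call, the sketch below feeds read_csv a file-like object instead of a path or URL. The three data rows are copied from the head() output shown in this notebook; the variable names csv_text and sample are only illustrative.

```python
import io
import pandas as pd

# Header plus the first three rows of the dataset, as displayed by data.head().
csv_text = """land_area,percent_city,percent_senior,physicians,hospital_beds,graduates,work_force,income,region,crime_rate
1384,78.1,12.3,25627,69678,50.1,4083.9,72100,1,75.55
3719,43.9,9.4,13326,43292,53.9,3305.9,54542,2,56.03
3553,37.4,10.7,9724,33731,50.6,2066.3,33216,1,41.32
"""

# read_csv accepts a local path, a URL, or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
```

For a file on disk, the same function would simply receive the path string instead of the StringIO object.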

Pandas head() method is used to return top n (5 by default) rows of a DataFrame. It allows us to get an estimate of what the data looks like

data.head()
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
0 1384 78.1 12.3 25627 69678 50.1 4083.9 72100 1 75.55
1 3719 43.9 9.4 13326 43292 53.9 3305.9 54542 2 56.03
2 3553 37.4 10.7 9724 33731 50.6 2066.3 33216 1 41.32
3 3916 29.9 8.8 6402 24167 52.2 1966.7 32906 2 67.38
4 2480 31.5 10.5 8502 16751 66.1 1514.5 26573 4 80.19

You can change the number of rows being displayed by specifying a number inside the head function.

Let’s look at 10 rows now

data.head(10)
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
0 1384 78.1 12.3 25627 69678 50.1 4083.9 72100 1 75.55
1 3719 43.9 9.4 13326 43292 53.9 3305.9 54542 2 56.03
2 3553 37.4 10.7 9724 33731 50.6 2066.3 33216 1 41.32
3 3916 29.9 8.8 6402 24167 52.2 1966.7 32906 2 67.38
4 2480 31.5 10.5 8502 16751 66.1 1514.5 26573 4 80.19
5 2815 23.1 6.7 7340 16941 68.3 1541.9 25663 3 58.48
6 8360 46.3 8.2 4047 14347 53.6 1321.2 18350 3 72.25
7 6794 60.1 6.3 4562 14333 51.7 1272.7 18221 3 64.88
8 3049 19.5 12.1 4005 21149 53.4 967.5 15826 1 30.51
9 4647 31.5 9.2 3916 12815 65.1 1032.2 14542 2 55.30

What we can gather from the displayed data is that we have 10 columns, i.e. 10 features.

We have the following description about each of these columns already:

  1. land_area : size in square miles
  2. percent_city : percent of population in central city/cities
  3. percent_senior : percent of population ≥ 65 years
  4. physicians : number of professionally active physicians
  5. hospital_beds : total number of hospital beds
  6. graduates : percent of adults that finished high school
  7. work_force : number of persons in work force in thousands
  8. income : total income in 1976 in millions of dollars
  9. crime_rate : Ratio of number of serious crimes by total population
  10. region : geographic region according to US Census

We can see that the regions have 4 values, where:

1 = North-East

2 = North-Central

3 = South

4 = West

The info() function is used to print a concise summary of a DataFrame. This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 10 columns):
land_area         99 non-null int64
percent_city      99 non-null float64
percent_senior    99 non-null float64
physicians        99 non-null int64
hospital_beds     99 non-null int64
graduates         99 non-null float64
work_force        99 non-null float64
income            99 non-null int64
region            99 non-null int64
crime_rate        99 non-null float64
dtypes: float64(5), int64(5)
memory usage: 7.8 KB

Using the info function, we learn that the data has 99 entries (rows) and that every column contains 99 non-null entries. This indicates that we don't have any null values.

Also, our data consists of float and integer values.

Solutions

Now that we've had a good look at our dataset, we can start solving the questions.

Question 1:
Which of the following information is correct about the data?

Select one:

a. There are 4 variables of ‘float’ dtype, 5 variables of ‘int’ dtype.

b. There are 6 variables of ‘float’ dtype, 4 variables of ‘int’ dtype

c. There are 3 variables of ‘float’ dtype, 7 variables of ‘int’ dtype

d. There are 5 variables of ‘float’ dtype, 5 variables of ‘int’ dtype.

Answer 1:

This question demands the knowledge of data types of various features in our dataset.

We can easily find that out using the info function. Using the resulting table of the info function used above, we can count the number of float and int variables.

We can see that there are 5 float variables (percent_city, percent_senior, graduates, work_force, crime_rate).

The number of int variables is also 5 (land_area, physicians, hospital_beds, income, region).

The correct option for this question is therefore d - There are 5 variables of ‘float’ dtype, 5 variables of ‘int’ dtype.
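Instead of counting the dtypes by eye, the count can also be done in code. The following is a minimal sketch on a two-row sample copied from the head() output; the names sample, n_float and n_int are illustrative, and on the full dataset you would use data in place of sample.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Two rows copied from the head() output; enough for pandas to infer each dtype.
sample = pd.DataFrame({
    "land_area": [1384, 3719],
    "percent_city": [78.1, 43.9],
    "percent_senior": [12.3, 9.4],
    "physicians": [25627, 13326],
    "hospital_beds": [69678, 43292],
    "graduates": [50.1, 53.9],
    "work_force": [4083.9, 3305.9],
    "income": [72100, 54542],
    "region": [1, 2],
    "crime_rate": [75.55, 56.03],
})

# Count how many columns are float and how many are integer.
n_float = sum(is_float_dtype(t) for t in sample.dtypes)
n_int = sum(is_integer_dtype(t) for t in sample.dtypes)
print(n_float, n_int)

# dtypes.value_counts() gives the same tally grouped by dtype.
print(sample.dtypes.value_counts())
```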

Question 2: Which of the following is true about bar chart and histogram?

Select one or more:

a. Histogram presents quantitative data while bar chart presents categorical data / discrete data.

b. In histogram bars cannot be re-ordered while in bar charts bars can be re-ordered.

c. Histogram indicates distribution of non - discrete variables.

d. Bar chart indicates comparison of discrete variables / categorical variables.

Answer 2:

a.

  • A histogram displays the single quantitative variable along the x axis and frequency of that variable on the y axis.

  • Bar graphs can be used to represent both quantitative and categorical data.

Both the statements are true and thus option a is correct.

b. Bars can be reordered in bar charts but not in histograms. Option b is correct.

c. This is because a histogram represents a continuous(non-discrete) data set, and as such, there are no gaps in the data. Option c is correct.

d. It is True that Bar chart indicates comparison of discrete variables / categorical variables. Option d is correct.

Thus, all the options are correct.

Question 3:

Which of the following information is correct about the data?

Select one or more:

a.

  1. mean land areas of these Metropolitan areas = 2615.73

  2. maximum crime rate among all the Metropolitan areas = 85.62

b.

  1. minimum income in million dollars of all the metropolitan areas = 769.00

c.

  1. total count of non null entries in ‘hospital_beds’ = 99.0

  2. Most of the Metropolitan areas lie in region 3

d.

  1. only 17 Metropolitan areas lie in region 4

  2. average crime rate among the metropolitan areas = 55.64

In this type of question, we need to check every option to find out which ones are true.

Let's write the code to check them one by one:

Answer 3:

a - 1) mean land areas of these Metropolitan areas = 2615.73

We can find the mean of the land area by Numpy’s mean function.

We need to provide the column name inside the brackets of the mean function

Accessing DataFrame’s columns

Now, there are actually two methods of accessing a DataFrame’s column:

  1. Square brackets ([])
  2. Dot operator (.)
  • They are the same as long as you're accessing a single column with a simple name, but you can do more with the bracket notation.

  • You can only use df.col if the column name is a valid Python identifier (e.g., it does not contain spaces or other special characters). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum).

  • With brackets you can select multiple columns (e.g., df[[‘col1’, ‘col2’]]) or add a new column (df[‘newcol’] = …), which can’t be done with dot access.
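The differences listed above can be seen on a tiny DataFrame. This is only a sketch: the column names "land area" (with a space) and "sum" are hypothetical and chosen to trigger the two pitfalls; they are not columns of the SMA dataset.

```python
import pandas as pd

# Hypothetical frame: one column name contains a space, one clashes with a method.
df = pd.DataFrame({
    "land area": [1384, 3719],
    "sum": [1, 2],
    "income": [72100, 54542],
})

# Dot access works for a simple name such as income.
dot_income = df.income

# A name with a space can only be reached with brackets.
land = df["land area"]

# df.sum is the DataFrame method, not the column; brackets return the column.
dot_is_method = callable(df.sum)
sum_column = df["sum"]

# Brackets can also create a new column or select several columns at once.
df["income_thousands"] = df["income"] / 1000
pair = df[["income", "sum"]]
print(dot_is_method, list(pair.columns))
```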

# Dot notation
np.mean(data.land_area)
2615.7272727272725
# Bracket notation
np.mean(data['land_area'])
2615.7272727272725

Both of these work and give us the same result. Notice that the mean is printed with 13 decimal places.

We can use the round function to keep the same number of decimal places in our mean as is displayed in the options.

# round is a Python built-in function, so no import is needed
round(np.mean(data.land_area), 2) # the second argument, 2, is the number of decimal places to round to
2615.73

Our answer is same as the option. Thus a - 1) is correct

a - 2) maximum crime rate among all the Metropolitan areas = 85.62

This requires us to use the max function on the crime_rate column.

max(data.crime_rate)
85.62

We obtained the same result as specified in the option. Thus, a- 2) is correct

Since both the parts of option a are correct, we can say that a is correct

Let’s find out if any of the other options is correct now.

b - 1) minimum income in million dollars of all the metropolitan areas = 769.00

Similar to the previous option, we can use the min function on the income column

min(data.income)
769

769.00 and 769 are the same. Thus, option b is correct.

c - 1) total count of non null entries in ‘hospital_beds’ = 99.0

We already know from the info function that we have 99 non null entries in each column.

We can still confirm it with the notnull function of Pandas applied on the hospital_beds column

pd.notnull(data['hospital_beds'])
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
19    True
20    True
21    True
22    True
23    True
24    True
25    True
26    True
27    True
28    True
29    True
      ... 
69    True
70    True
71    True
72    True
73    True
74    True
75    True
76    True
77    True
78    True
79    True
80    True
81    True
82    True
83    True
84    True
85    True
86    True
87    True
88    True
89    True
90    True
91    True
92    True
93    True
94    True
95    True
96    True
97    True
98    True
Name: hospital_beds, Length: 99, dtype: bool

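Every one of the 99 rows shows True, which means that all 99 entries in the column are non-null. Note that the length 99 on its own only tells us the number of rows; it is the True values that mark the non-null entries.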

Thus, c-1) is correct.
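Rather than reading through a long column of True values, the non-null entries can be counted directly. The sketch below uses the first five hospital_beds values from the head() output plus one missing value that is purely hypothetical, added only to show how the count differs from the length.

```python
import numpy as np
import pandas as pd

# Five real values from head(), plus one hypothetical NaN for illustration.
beds = pd.Series([69678, 43292, 33731, 24167, 16751, np.nan], name="hospital_beds")

total_rows = len(beds)           # counts every row, missing or not
non_null = beds.notnull().sum()  # True counts as 1, so this counts non-null rows
same_count = beds.count()        # count() skips missing values as well

print(total_rows, non_null, same_count)
```

On the full dataset, data['hospital_beds'].count() would be expected to return 99, in line with the info() output.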

c - 2) Most of the Metropolitan areas lie in region 3

The value_counts() method returns a Series containing the counts of unique values. This means that, for any column in a DataFrame, this method returns how many times each distinct value occurs in that column.

data.region.value_counts()
3    36
2    25
1    21
4    17
Name: region, dtype: int64

The above code gives us the count of each region in the data[‘region’] column.

We can see that region 3 has the maximum count - 36.

Thus, c-2) is correct.

Since both the parts of the option c are correct, we can say that the option c is correct.
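The most frequent region can also be picked out in code instead of by reading the table. In this sketch the region column is rebuilt from the counts printed above (21, 25, 36 and 17), so it is a reconstruction for illustration and not the original column in its original order.

```python
import pandas as pd

# Region labels rebuilt from the value_counts() output shown above.
regions = pd.Series([1] * 21 + [2] * 25 + [3] * 36 + [4] * 17, name="region")

counts = regions.value_counts()

# idxmax() returns the index label (here, the region) with the largest count.
most_common = counts.idxmax()

# normalize=True returns proportions instead of raw counts.
share = regions.value_counts(normalize=True)

print(most_common, counts[4], round(share[3], 3))
```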

d - 1) only 17 Metropolitan areas lie in region 4

The answer for this can be obtained from the same result of the above code block.

The region 4 has a count of 17 and thus d - 1) is correct.

d - 2) average crime rate among the metropolitan areas = 55.64

For finding the average crime rate, we can apply mean function on the crime_rate column

np.mean(data.crime_rate)
55.6430303030303

Again, we can round off the above result to get the exact mean.

round(np.mean(data.crime_rate),2)
55.64

The average crime rate obtained is same as that in the option. Thus, d-2) is correct

Since both the parts of option d are correct, we can say that option d is correct.

We have now proved that all 4 options are correct.
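Several of the statistics checked above (count, mean, min, max) can also be obtained in a single call with describe(). This is a sketch on the first five rows from the head() output, restricted to three columns, so the numbers differ from those of the full dataset.

```python
import pandas as pd

# First five rows of three columns, copied from the head() output.
sample = pd.DataFrame({
    "land_area": [1384, 3719, 3553, 3916, 2480],
    "income": [72100, 54542, 33216, 32906, 26573],
    "crime_rate": [75.55, 56.03, 41.32, 67.38, 80.19],
})

# describe() returns count, mean, std, min, quartiles and max for each column.
summary = sample.describe()

mean_land_area = summary.loc["mean", "land_area"]
max_crime_rate = summary.loc["max", "crime_rate"]
min_income = summary.loc["min", "income"]

print(summary.round(2))
```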

Question 4:

Which of the following is the correct correlation matrix among the features of the dataset?

Select one:

Option a:

Option b:

Answer 4:

We can obtain a correlation matrix using the corr method. The correlation matrix obtained below is the same as given in option a. Thus, option a is correct.

data.corr()
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
land_area 1.000000 -0.077320 0.092226 0.085054 0.081034 0.088728 0.135792 0.111404 0.292392 0.293907
percent_city -0.077320 1.000000 -0.250995 0.067391 0.052898 0.076720 0.016011 0.019235 0.235880 0.159596
percent_senior 0.092226 -0.250995 1.000000 0.056454 0.083775 -0.155695 0.035240 0.046073 -0.242811 -0.177992
physicians 0.085054 0.067391 0.056454 1.000000 0.974241 0.049500 0.965597 0.976209 -0.140961 0.187763
hospital_beds 0.081034 0.052898 0.083775 0.974241 1.000000 -0.003892 0.967913 0.974416 -0.220305 0.109799
graduates 0.088728 0.076720 -0.155695 0.049500 -0.003892 1.000000 0.044054 0.045578 0.246226 0.290880
work_force 0.135792 0.016011 0.035240 0.965597 0.967913 0.044054 1.000000 0.996735 -0.144022 0.175945
income 0.111404 0.019235 0.046073 0.976209 0.974416 0.045578 0.996735 1.000000 -0.152016 0.175797
region 0.292392 0.235880 -0.242811 -0.140961 -0.220305 0.246226 -0.144022 -0.152016 1.000000 0.636192
crime_rate 0.293907 0.159596 -0.177992 0.187763 0.109799 0.290880 0.175945 0.175797 0.636192 1.000000

Question 5:

Which of the following is the first five land area of the metropolitan area data which are located in region 4?

Select one:

Option a:

Option a

Option b:

Option b

Option c:

Option c

Option d:

Option d

Answer 5:

This question involves conditional selection of rows.

  • The == operator is a comparison operator. It evaluates to True where the values on both sides are equal.

  • The condition data['region']==4 checks, row by row, whether the region of an entry is 4.

  • data[data['region']==4] keeps all the rows that satisfy the condition.

  • data[data['region']==4]['land_area'] selects the land_area column from those rows.

  • Finally, data[data['region']==4]['land_area'].head() picks the first 5 rows, since the question asks for the first five land areas.

data[data['region']==4]['land_area'].head()
4      2480
13      782
14     4226
18    27293
20     9155
Name: land_area, dtype: int64

Since the above output matches the option d, option d is correct
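The same selection can be written in one step with .loc, which takes the row condition and the column name together. The sketch below runs on the first ten rows from the head(10) output, of which only one lies in region 4, so it returns a single value rather than five.

```python
import pandas as pd

# land_area and region for the first ten rows, copied from data.head(10).
sample = pd.DataFrame({
    "land_area": [1384, 3719, 3553, 3916, 2480, 2815, 8360, 6794, 3049, 4647],
    "region": [1, 2, 1, 2, 4, 3, 3, 3, 1, 2],
})

# .loc[row_condition, column] filters the rows and picks the column in one step.
region4_land = sample.loc[sample["region"] == 4, "land_area"].head()

print(region4_land)
```

Using .loc in this way avoids chaining two sets of brackets, which tends to matter mainly when values are assigned rather than just read.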

Question 6:

Which of the following information is correct about Standard Metropolitan Data?

Select one or more:

a. There isn’t any Metropolitan area which is located in region 4 and crime rate is 55.64.

b. There are 4 Metropolitan area which is located in region 1 and their crime rate is greater than or equal to 54.16

c. There is one Metropolitan area whose crime rate is 85.62 and is located in region 4.

d. There are 40 Metropolitan area which is located in region 3 or have land area greater than or equal to 5000

Answer 6:

This question involves conditional selection with two conditions combined by logical operators. On pandas columns these are written element-wise as & (and), | (or) and ~ (not), and each condition must be wrapped in parentheses.

a. There isn’t any Metropolitan area which is located in region 4 and crime rate is 55.64.

data[(data['region']==4) & (round(data['crime_rate'],2)==55.64)]
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate

We obtained an empty table. This means that there isn’t any Metropolitan area which is located in region 4 and crime rate is 55.64.

Thus, option a is correct.

b. There are 4 Metropolitan area which is located in region 1 and their crime rate is greater than or equal to 54.16

data[(data['region']==1) & (round(data['crime_rate'],2)>=54.16)]
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
0 1384 78.1 12.3 25627 69678 50.1 4083.9 72100 1 75.55
10 1008 16.6 10.3 4006 16704 55.9 935.5 15953 1 54.16
24 2966 26.9 10.3 2053 6604 56.3 450.4 6966 1 56.55
55 192 60.5 10.8 617 1789 44.1 212.6 3158 1 58.79

Since we have 4 records in the above result, the option b is correct.

c. There is one Metropolitan area whose crime rate is 85.62 and is located in region 4.

data[(data['region']==4) & (round(data['crime_rate'],2)==85.62)]
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
20 9155 53.8 11.1 2280 6450 60.1 575.2 7766 4 85.62

We received 1 entry that satisfies the above conditions. Thus, option c is correct.

d. There are 40 Metropolitan area which are located in region 3 or have land area greater than or equal to 5000

len(data[(data['region']==3) | (data['land_area']>=5000)])
41

The count is 41, not 40: there are 41 Metropolitan areas that are located in region 3 or have a land area greater than or equal to 5000. Thus, option d is incorrect.

Options a,b,c are correct.
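The way the conditions above are combined can be tried on a small sample. This sketch uses the first ten rows from the head(10) output, so the counts are much smaller than on the full dataset; np.isclose is used here as one possible alternative to rounding when comparing floats for equality.

```python
import numpy as np
import pandas as pd

# Three columns for the first ten rows, copied from data.head(10).
sample = pd.DataFrame({
    "land_area": [1384, 3719, 3553, 3916, 2480, 2815, 8360, 6794, 3049, 4647],
    "region": [1, 2, 1, 2, 4, 3, 3, 3, 1, 2],
    "crime_rate": [75.55, 56.03, 41.32, 67.38, 80.19, 58.48, 72.25, 64.88, 30.51, 55.30],
})

# & means "and": both conditions must hold for a row to be kept.
region1_high_crime = sample[(sample["region"] == 1) & (sample["crime_rate"] >= 54.16)]

# | means "or": a row is kept if at least one condition holds.
south_or_large = sample[(sample["region"] == 3) | (sample["land_area"] >= 5000)]

# ~ means "not": it inverts a condition.
not_south = sample[~(sample["region"] == 3)]

# np.isclose compares floats with a small tolerance instead of exact equality.
exact_rate = sample[(sample["region"] == 4) & np.isclose(sample["crime_rate"], 80.19)]

print(len(region1_high_crime), len(south_or_large), len(not_south), len(exact_rate))
```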

Question 7:

Which of the following is the correct line plot between land area and crime rate?

Select one:

Option a:

Option a

Option b:

Option b

Option c:

Option c

Option d:

Option d

Answer 7:

We’ll use the plot function of Matplotlib to draw a line plot between land area and crime rate

plt.plot(data.land_area, data.crime_rate)
plt.show()

Since the plot matches with option a, Option a is correct.

Question 8:

Which of the following correlation information is correct between two variables.

Select one:

a There is a positive correlation between ‘hospital_beds’ and ‘physicians’ and its scatter plot is:

Option a

b. There is a negative correlation between ‘hospital_beds’ and ‘physicians’ and its scatter plot is:

Option b

c. There is a negative correlation between ‘hospital_beds’ and ‘physicians’ and its scatter plot is:

Option c

Answer 8:

We’ll use the scatter method of Matplotlib to draw a scatter plot between ‘hospital_beds’ and ‘physicians’.

plt.scatter(data.hospital_beds, data.physicians)
plt.show()

The scatter plot matches the plots shown in options a and b. The points trend upwards, i.e. a line drawn through them would have a positive slope.

In other words, the value of physicians increases as the value of hospital_beds increases. This means that there is a positive correlation between 'hospital_beds' and 'physicians'.

Thus, option a is correct.
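The sign of the relationship can also be checked numerically with a Pearson correlation coefficient. The sketch below uses only the first ten rows from the head(10) output, so the value will not be identical to the 0.974241 reported in the correlation matrix for the full dataset.

```python
import numpy as np
import pandas as pd

# physicians and hospital_beds for the first ten rows, copied from data.head(10).
sample = pd.DataFrame({
    "physicians": [25627, 13326, 9724, 6402, 8502, 7340, 4047, 4562, 4005, 3916],
    "hospital_beds": [69678, 43292, 33731, 24167, 16751, 16941, 14347, 14333, 21149, 12815],
})

# Series.corr computes the Pearson correlation between two columns by default.
r = sample["hospital_beds"].corr(sample["physicians"])

# np.corrcoef returns the full 2x2 matrix; the off-diagonal entry is the same value.
r_numpy = np.corrcoef(sample["hospital_beds"], sample["physicians"])[0, 1]

print(r > 0)
```

A coefficient above zero corresponds to the upward trend seen in the scatter plot.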

Question 9:

Which of the following is the correct bar plot of geographic regions?

Select one:

Option a:

Option a

Option b:

Option b

Option c:

Option c

Answer 9:

The question requires us to create a bar plot of geographic regions against the frequency of each region.

This means that we’ll have to count the occurrence of each region in our data.

Let’s have a look at 2 methods of making the same bar plot:

Method 1 - Using the Pandas plot function to draw a bar plot

  • In this method, we use the value_counts() function to find out the frequency of each region in our data.

  • The plot method with the kind=‘bar’ argument helps us plot bar charts. We are also specifying an argument color and providing a list of colours we want our bars to be plotted in.

  • Also, it can be observed from the above options that the bars should not be sorted in the decreasing order of frequency which is usually the case while counting occurrence. We can use sort_index() to sort the bars according to their indices.

ax = data['region'].value_counts().sort_index().plot(kind = 'bar', color = ['blue','orange','green','red'])
ax.set_title("Geographic Region according to US census")
ax.set_xlabel("Geographic Region")
ax.set_ylabel("Frequency")
plt.show()

Method 2 - Using seaborn to draw a colourful bar plot

  • In this approach, we first use the groupby method to group the rows by region.

  • Then, we apply count to count the number of entries in each group.

  • On printing this df, you can observe that all the columns now contain the same data, i.e. the frequency of each group. To get any particular column, you can specify the column name inside square brackets. Here, we've arbitrarily picked the first column, land_area.

  • Since the rows are grouped by region, our index will be region.

  • Finally, Seaborn's barplot function is used to plot a bar graph of land_area (the frequency) against the index.

df1 = data.groupby('region').count()
df1
land_area percent_city percent_senior physicians hospital_beds graduates work_force income crime_rate
region
1 21 21 21 21 21 21 21 21 21
2 25 25 25 25 25 25 25 25 25
3 36 36 36 36 36 36 36 36 36
4 17 17 17 17 17 17 17 17 17
df = df1[['land_area']]
df
land_area
region
1 21
2 25
3 36
4 17
ax = sns.barplot(x=df.index, y=df.land_area) # x and y are passed as keyword arguments, which recent seaborn versions require
ax.set_title("Geographic Region according to US census")
ax.set_xlabel("Geographic Region")
ax.set_ylabel("Frequency")
plt.show()

We have obtained the exact same plot as option c. Thus, option c is correct.
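Both methods plot the same frequencies, which can be checked without drawing anything. In this sketch the region column is rebuilt from the counts shown above (21, 25, 36, 17), and groupby(...).size() is used as a shorter stand-in for count() followed by picking one column; that substitution is a choice made here, not part of the original solution.

```python
import pandas as pd

# Region column rebuilt from the per-region counts shown in the groupby output.
df_regions = pd.DataFrame({"region": [1] * 21 + [2] * 25 + [3] * 36 + [4] * 17})

# Method 1 data: value_counts() sorted by region label instead of by frequency.
by_value_counts = df_regions["region"].value_counts().sort_index()

# Method 2 data: size() counts the rows in each group and returns a single Series.
by_groupby = df_regions.groupby("region").size()

print(by_value_counts.tolist(), by_groupby.tolist())
```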

Question 10:

Which of the following is showing the correct distribution of income of the Metropolitan areas?

Select one:

Option a:

Option a

Option b:

Option b

Option c:

Option c

Answer 10:

From the options, we understand that the question requires us to plot a histogram for income. We can do that with Matplotlib’s hist function.

plt.hist(data.income)
plt.title('Total income in 1976 in millions of dollars')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

We have obtained the exact same plot as a. Thus, option a is correct.
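To see what the histogram actually computes, the bin counts can be calculated with NumPy. With default settings, both plt.hist and np.histogram split the data range into 10 equal-width bins. The sketch below uses the income values of the first ten rows from the head(10) output, so the counts will differ from the full plot.

```python
import numpy as np

# income for the first ten rows, copied from data.head(10).
income = np.array([72100, 54542, 33216, 32906, 26573, 25663, 18350, 18221, 15826, 14542])

# Default bins=10: returns the count per bin and the 11 bin edges.
counts, edges = np.histogram(income)

# All bins share the same width: (max - min) / 10.
bin_width = edges[1] - edges[0]

print(counts.sum(), len(edges), bin_width)
```

Each bar in the histogram corresponds to one entry of counts, drawn between two consecutive entries of edges.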

And that’s it with all the questions of the assignment! If you faced some difficulties in solving those, going through this article might be helpful.

https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/