Descriptive Statistics

Statistics is the science of working with data using mathematical models. It covers the following operations on data:

Collection
Analysis
Transformation
Interpretation
Presentation

There are two types of Statistics:

Descriptive
This branch of statistics is used to explore data. It uses simple functions such as mean and mode to summarize the data and make it easier to understand.
Inferential
This branch of statistics uses mathematical algorithms and models to infer hidden information from the data.

Descriptive Statistics

Descriptive statistics can be broadly divided into two categories:

Measures of Central Tendency - Describe the center of the data
Measures of Dispersion - Describe how the data is spread out

Measures of Central Tendency

Mean - Average of the data points (sum of data points / number of data points)
Median - Middle value of the data once it is sorted
- \(\frac{N + 1}{2}\) th value if N is odd
- Average of the \(\frac{N}{2}\)th value and the \(\left(\frac{N}{2} + 1\right)\)th value if N is even
Mode - Value which occurs the maximum number of times in data
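
The three definitions above can be sketched in plain Python. This is a minimal illustration, not part of the original lesson; the sample values are hypothetical and chosen only so the results are easy to check by hand.

```python
import statistics

# Hypothetical sample values, used only for illustration.
values = [4, 8, 6, 5, 3, 8, 9]

# Mean: sum of data points / number of data points.
mean = sum(values) / len(values)

# Median: sort first, then pick the middle value (odd N)
# or average the two middle values (even N).
ordered = sorted(values)
n = len(ordered)
if n % 2 == 1:
    median = ordered[(n + 1) // 2 - 1]  # (N + 1)/2-th value as a 0-based index
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Mode: the value that occurs the maximum number of times.
mode = statistics.mode(values)

print(mean, median, mode)
```

Here the sorted values are 3, 4, 5, 6, 8, 8, 9, so the median is the 4th value (6) and the mode is 8.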

Measures of Dispersion

Range - The difference between the maximum value and the minimum value in the data
- Range = Max - Min
Quartiles - Quartiles are the values that divide the data into quarters after arranging it in increasing order.
- 1st Quartile (Q1) = 25% cut
- 2nd Quartile (Q2) = 50% cut (also called the median)
- 3rd Quartile (Q3) = 75% cut
- Interquartile Range (IQR) = Q3 - Q1
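
As a rough sketch, range, quartiles and IQR can be computed with Python's standard `statistics` module. The sample values are hypothetical. Note that several conventions exist for locating quartiles; the `method="inclusive"` choice below is an assumption made for this example, and other tools may return slightly different cut points for the same data.

```python
import statistics

# Hypothetical sample values, used only for illustration.
values = [7, 1, 9, 3, 5, 2, 8, 4, 6]

# Range = Max - Min
value_range = max(values) - min(values)

# quantiles(n=4) returns the three cut points Q1, Q2, Q3.
# The function sorts the data internally.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Interquartile range = Q3 - Q1
iqr = q3 - q1

print(value_range, q1, q2, q3, iqr)
```

With this convention, Q2 matches the median of the data.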

Standard Deviation - Shows how the data is spread around the mean. It is calculated in 5 steps:

Calculate the mean
Calculate the difference between each value and the mean
Take the square of each difference and add them
Divide this sum of squared differences by the number of data points
Take the square root of the whole quantity

\(\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}\)

where:

\(\bar{x}\) : average of all data points
\(x_i\) : \(i\)th data point
N : number of data points
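
The five steps above can be sketched directly in Python. This is an illustrative example with hypothetical values, not code from the original lesson. It divides by N (the population standard deviation), as the steps describe; dividing by N - 1 instead gives the sample standard deviation.

```python
import math
import statistics

# Hypothetical sample values, used only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

# Step 1: calculate the mean.
mean = sum(values) / n

# Step 2: difference between each value and the mean.
differences = [x - mean for x in values]

# Step 3: square each difference and add them up.
sum_of_squares = sum(d ** 2 for d in differences)

# Step 4: divide by the number of data points.
variance = sum_of_squares / n

# Step 5: take the square root.
std_dev = math.sqrt(variance)

print(std_dev)
```

The result agrees with `statistics.pstdev(values)`, the standard library's population standard deviation.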

Skewness - A measure of the asymmetry of the data when a graph is drawn between the values (x-axis) and their frequencies (y-axis).
- Positive skewness - most of the data is towards the left and has a long tail in the right direction
- Normal curve - A normal curve is symmetric around the mean and has the same value for mean, median & mode.
- Negative skewness - most of the data is towards the right and has a long tail in the left direction
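
To make the sign of skewness concrete, here is a small sketch that assumes the moment-based definition (third central moment divided by the 1.5th power of the second). The lesson does not state a formula, so this definition and the helper name `skewness` are assumptions for illustration; libraries such as pandas apply an additional sample-size correction, so their values differ slightly while the sign stays the same.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data sets, used only for illustration.
right_tailed = [1, 1, 1, 2, 2, 3, 10]     # bulk on the left, long right tail
symmetric = [1, 2, 3, 4, 5]               # mirror image around the mean
left_tailed = [1, 8, 9, 9, 10, 10, 10]    # bulk on the right, long left tail

print(skewness(right_tailed))  # positive
print(skewness(symmetric))     # zero
print(skewness(left_tailed))   # negative
```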

Kurtosis - A measure of the sharpness of the peak when a graph is drawn between the values (x-axis) and their frequencies (y-axis). The values below refer to excess kurtosis, which is measured relative to a normal curve.
- Leptokurtic - Lepto means 'thin'. A leptokurtic curve has a positive kurtosis value.
- Mesokurtic/Normal - Meso means 'middle'. A normal curve has a kurtosis value of zero (0).
- Platykurtic - Platy means 'flat'. A platykurtic curve has a negative kurtosis value.
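
The sign convention above can be illustrated with a short sketch that assumes the moment-based definition of excess kurtosis (fourth central moment divided by the squared second moment, minus 3). The formula, the helper name `excess_kurtosis` and the data are assumptions for this example; libraries such as pandas apply a sample-size correction, so exact values may differ.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical data sets, used only for illustration.
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # sharp peak with two extreme values
flat = list(range(1, 11))                    # evenly spread values, no peak

print(excess_kurtosis(peaked))  # positive (leptokurtic)
print(excess_kurtosis(flat))    # negative (platykurtic)
```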

Let's work through an example to understand descriptive statistics better. The table below lists the heights (in cm) of 25 students in a class. Let's calculate various statistical measures for this data.

Mean

Sum of all 25 heights / 25 = 3439 / 25 = 137.56 cm

Median

To calculate the median, we need to arrange the heights in increasing order. So, let's sort the values:

135 135 135 135 136 136 136 136 136 136 137 137 137 137 138 138 138 139 139 139 140 140 141 141 142

The number of data points N is 25 (odd), so the median is the (25 + 1)/2 = 13th value

median = 137

Mode

136 appears the maximum number of times (6 times), so mode = 136

Range

142(max value) - 135(min value) = 7

Standard Deviation

[{4(135-137.56)^2 + 6(136-137.56)^2 + 4(137-137.56)^2 + 3(138-137.56)^2 + 3(139-137.56)^2 + 2(140-137.56)^2 + 2(141-137.56)^2 + (142-137.56)^2}/25]^(1/2) = (104.16/25)^(1/2) = 2.04 (approx.)

Each coefficient is the number of times that height occurs in the sorted list. Dividing by N - 1 = 24 instead of N = 25 gives the sample standard deviation, approximately 2.08.

Fortunately, pandas provides built-in functions to calculate descriptive statistics. Let's solve the same example using pandas functions.
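
The lesson's original code is not reproduced here, so the following is a minimal sketch of how the same measures could be computed with pandas. The heights are taken from the sorted list above; the series name `height_cm` and the variable names are assumptions for this example.

```python
import pandas as pd

# Heights (cm) of the 25 students, as listed in sorted order above.
# The name "height_cm" is a hypothetical label for this sketch.
heights = pd.Series(
    [135, 135, 135, 135, 136, 136, 136, 136, 136, 136,
     137, 137, 137, 137, 138, 138, 138, 139, 139, 139,
     140, 140, 141, 141, 142],
    name="height_cm",
)

mean = heights.mean()
median = heights.median()
mode = heights.mode().iloc[0]          # mode() returns a Series; take the first entry
value_range = heights.max() - heights.min()

# pandas divides by N - 1 by default (ddof=1, sample standard deviation).
std_sample = heights.std()
# ddof=0 divides by N, matching the five-step calculation in this lesson.
std_population = heights.std(ddof=0)

print(mean, median, mode, value_range)
print(round(std_population, 2), round(std_sample, 2))
```

Note the two standard deviation results: `std()` with its default `ddof=1` gives about 2.08, while `ddof=0` gives about 2.04, which is the value obtained by dividing by 25.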

Exercise

Create a dataframe olympicsMedalTally for the top 10 countries in the 2016 Olympics medal table, as below
Calculate all Measures of Central Tendency.
Calculate all Measures of Dispersion.

Exercise Solutions
