Descriptive Statistics
Statistics is the science of working with data using mathematical models. It covers the following operations on data:
- Collection
- Analysis
- Transformation
- Interpretation
- Presentation
There are two types of Statistics:
- Descriptive
This branch of Statistics is used to explore the data. It uses simple functions such as mean and mode to summarize the data and understand it better.
- Inferential
This branch of Statistics uses mathematical algorithms and models to infer information that is not directly visible in the data.
Descriptive Statistics
Descriptive statistics can be broadly divided into two categories:
- Measures of Central Tendency - To calculate the center point of the data
- Measures of Dispersion - To determine how spread out the data is
Measures of Central Tendency
- Mean - Average of the data points (sum of data points / number of data points)
- Median - Middle value of the data once it is sorted
- The \(\frac{N + 1}{2}\)th value if N is odd
- The average of the \(\frac{N}{2}\)th value and the \(\left(\frac{N}{2} + 1\right)\)th value if N is even
- Mode - Value which occurs the maximum number of times in data
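The three definitions above can be sketched in plain Python. This is only a minimal illustration: the function names and the `scores` list are made up for this example and are not part of the original material.

```python
from collections import Counter

def mean(values):
    """Sum of the data points divided by the number of data points."""
    return sum(values) / len(values)

def median(values):
    """Middle value of the sorted data, using the odd/even rule above."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # (N + 1)/2-th value; subtract 1 because Python lists are 0-indexed
        return ordered[(n + 1) // 2 - 1]
    # average of the N/2-th and (N/2 + 1)-th values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

def modes(values):
    """All values that occur the maximum number of times."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

scores = [4, 8, 6, 8, 3, 8, 5]  # hypothetical sample data
print(mean(scores), median(scores), modes(scores))  # 6.0 6 [8]
```

Returning a list from `modes` is a design choice for this sketch: a data set can have more than one value tied for the highest frequency.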
Measures of Dispersion
- Range - The difference between the maximum value and the minimum value in the data
- Range = Max - Min
- Quartiles - The values that divide the data into quarters after arranging it in increasing order.
- 1st Quartile (Q1) = 25% cut
- 2nd Quartile (Q2) = 50% cut (also called the median)
- 3rd Quartile (Q3) = 75% cut
- Interquartile Range (IQR) = Q3 - Q1
- Standard Deviation - This shows how the data is spread around the mean value. It's calculated in 5 steps:
- Calculate the mean
- Calculate the difference between each value and the mean
- Square each difference and add the squares
- Divide this sum of squared differences by the number of data points N (or by N - 1 for the sample standard deviation)
- Take the square root of the result
- Skewness - It measures how asymmetric the distribution is when a graph is drawn between the values (x-axis) and their frequencies (y-axis) in the data.
- Positive skewness - most of the data lies towards the left, with a long tail to the right
- Normal curve - A normal curve is symmetric around the mean (zero skewness) and has the same value for mean, median and mode.
- Negative skewness - most of the data lies towards the right, with a long tail to the left
- Kurtosis - A measure of the sharpness of the peak (and the heaviness of the tails) when a graph is drawn between the values (x-axis) and their frequencies (y-axis) in the data. The values below are excess kurtosis, that is, kurtosis measured relative to a normal curve.
- Leptokurtic - Lepto means 'thin'. A leptokurtic curve has a positive kurtosis value.
- Mesokurtic/Normal - Meso means 'middle'. A normal curve has a kurtosis value of zero (0).
- Platykurtic - Platy means 'flat' or 'broad'. A platykurtic curve has a negative kurtosis value.
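The measures of dispersion listed above can be sketched with the Python standard library. The `data` list is hypothetical. Two assumptions are worth flagging: quartiles have several competing conventions (this sketch picks the `"inclusive"` method of `statistics.quantiles`), and the skewness and kurtosis lines use the simple moment-based population formulas without any bias correction, which may differ slightly from what a given library reports.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample data
n = len(data)

# Range = Max - Min
data_range = max(data) - min(data)

# Quartiles: statistics.quantiles returns the three cut points Q1, Q2, Q3.
# Several quartile conventions exist; method="inclusive" is one of them.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Standard deviation in the five steps listed above
mean = sum(data) / n                  # step 1: mean
diffs = [x - mean for x in data]      # step 2: differences from the mean
sum_sq = sum(d ** 2 for d in diffs)   # step 3: square and add
variance = sum_sq / n                 # step 4: divide by N (population form)
std_dev = math.sqrt(variance)         # step 5: square root

# Moment-based skewness and excess kurtosis (population form, assumption)
skewness = sum(d ** 3 for d in diffs) / n / std_dev ** 3
excess_kurtosis = sum(d ** 4 for d in diffs) / n / std_dev ** 4 - 3

print(data_range, q1, q2, q3, iqr)
print(std_dev, skewness, excess_kurtosis)
```

For this sample the skewness comes out positive (longer tail to the right) and the excess kurtosis negative (flatter than a normal curve).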



Let's work through an example to understand Descriptive Statistics better. The table below lists the heights (in cm) of 25 students in a class. Let's calculate various statistical measures for this data.

Mean
Sum of all 25 heights / 25 = 3439 / 25 = 137.56 cm
Median
To calculate the median, we need to arrange the values in increasing order:
135 135 135 135 136 136 136 136 136 136 137 137 137 137 138 138 138 139 139 139 140 140 141 141 142
The number of data points N is 25 (odd) here, so the median is the 13th value.
median = 137
Mode
136 appears the maximum number of times (6), so mode = 136
Range
142 (max value) - 135 (min value) = 7
Standard Deviation
Grouping equal heights together, the sum of squared differences divided by N = 25 gives:
\(\sqrt{\frac{4(135-137.56)^2 + 6(136-137.56)^2 + 4(137-137.56)^2 + 3(138-137.56)^2 + 3(139-137.56)^2 + 2(140-137.56)^2 + 2(141-137.56)^2 + (142-137.56)^2}{25}} = \sqrt{\frac{104.16}{25}} \approx 2.04\)
Dividing by N - 1 = 24 instead of 25 gives the sample standard deviation, \(\sqrt{\frac{104.16}{24}} \approx 2.08\), which is the value pandas returns by default.
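The hand calculations above can be double-checked with Python's built-in `statistics` module. This is a small sketch: the `frequency` dictionary is just a compact way of writing the sorted list of heights and is not part of the original example.

```python
import statistics

# Heights (cm) and how many students have each height, from the sorted list above
frequency = {135: 4, 136: 6, 137: 4, 138: 3, 139: 3, 140: 2, 141: 2, 142: 1}
heights = [h for h, count in frequency.items() for _ in range(count)]

print(len(heights), sum(heights))            # 25 3439
print(statistics.mean(heights))              # 137.56
print(statistics.median(heights))            # 137
print(statistics.mode(heights))              # 136
print(max(heights) - min(heights))           # 7
print(round(statistics.pstdev(heights), 2))  # 2.04 (divides by N)
print(round(statistics.stdev(heights), 2))   # 2.08 (divides by N - 1)
```

The last two lines show that the standard deviation depends on whether the sum of squared differences is divided by N (`pstdev`) or by N - 1 (`stdev`).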
Fortunately, pandas provides built-in functions to calculate descriptive statistics. Let's solve the same example using pandas functions.
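One possible way to do this is sketched below, assuming pandas is installed. The heights are the 25 values from the worked example; the Series name `height_cm` is a hypothetical label chosen for this sketch.

```python
import pandas as pd

heights = pd.Series(
    [135] * 4 + [136] * 6 + [137] * 4 + [138] * 3
    + [139] * 3 + [140] * 2 + [141] * 2 + [142],
    name="height_cm",  # hypothetical name, not from the original table
)

# Measures of central tendency
print(heights.mean())           # 137.56
print(heights.median())         # 137.0
print(heights.mode().tolist())  # [136]

# Measures of dispersion
print(heights.max() - heights.min())  # 7
q1, q3 = heights.quantile(0.25), heights.quantile(0.75)
print(q1, q3, q3 - q1)                # 136.0 139.0 3.0
print(round(heights.std(), 2))        # 2.08, pandas divides by N - 1 by default
print(round(heights.std(ddof=0), 2))  # 2.04, dividing by N
print(heights.skew(), heights.kurt())  # sample skewness and excess kurtosis
```

`mode()` returns a Series because several values can tie for the highest frequency. `heights.describe()` prints the count, mean, standard deviation, minimum, quartiles and maximum in a single call.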
- Create a DataFrame olympicsMedalTally from the medal table of the top 10 countries at the 2016 Olympics, as below
- Calculate all Measures of Central Tendency.
- Calculate all Measures of Dispersion.