Descriptive Statistics

Statistics is the science of working with data using mathematical models. It covers the following operations on data:

  1. Collection
  2. Analysis
  3. Transformation
  4. Interpretation
  5. Presentation
 

There are two types of Statistics:

  1. Descriptive

    This branch of statistics is used to explore and summarize data. It uses simple functions such as mean and mode to give a better understanding of the data.

  2. Inferential

    This branch of statistics uses mathematical algorithms/models to infer hidden information in the data.

 

Descriptive Statistics

 

Descriptive statistics can be broadly divided into two categories:

  • Measures of Central Tendency - To calculate the center point of the data
  • Measures of Dispersion - To determine how spread out the data is
 

Measures of Central Tendency

 
  • Mean - Average of the data points (sum of data points / number of data points)
  • Median - Middle value of the data when it is sorted
    • \(\frac{N + 1}{2}\)th value if N is odd
    • Average of the \(\frac{N}{2}\)th value and the \(\left(\frac{N}{2} + 1\right)\)th value if N is even
  • Mode - Value that occurs most often in the data
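
The three measures above can be sketched with Python's standard `statistics` module. The sample values here are hypothetical and chosen only for illustration; the manual median follows the odd/even rule from the list, using 1-based positions.

```python
import statistics

# Hypothetical sample, chosen only for illustration
data = [4, 8, 6, 5, 3, 8, 9]

mean = statistics.mean(data)      # sum of data points / number of data points
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # value that occurs most often

# Median by hand, following the odd/even rule (positions are 1-based)
ordered = sorted(data)
n = len(ordered)
if n % 2 == 1:
    # (N + 1) / 2 th value
    manual_median = ordered[(n + 1) // 2 - 1]
else:
    # average of the N/2 th and (N/2 + 1) th values
    manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(mean, median, mode, manual_median)
```

If several values tie for the highest count, `statistics.multimode` returns all of them.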
 

Measures of Dispersion

 
  • Range - The difference between the maximum value and the minimum value in the data
    • Range = Max - Min
  • Quartiles - Quartiles are the values that divide the data into quarters after arranging it in increasing order.
    • 1st Quartile (Q1) = 25% cut
    • 2nd Quartile (Q2) = 50% cut (also called the median)
    • 3rd Quartile (Q3) = 75% cut
    • Interquartile Range (IQR) = Q3 - Q1
  • Standard Deviation - This shows how the data is dispersed/spread around the mean value. It's calculated in 5 steps:
    1. Calculate the mean
    2. Calculate the difference between each value and the mean
    3. Take the square of each difference and add them up
    4. Divide this sum of squared differences by the number of data points
    5. Take the square root of the whole quantity
    \[ \text{Standard Deviation} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}} \]

    where:

    \(\bar{x}\) : average of all data points
    \(x_i\) : \(i\)-th data point
    \(N\) : number of data points
     
  • Skewness - It measures the asymmetry of the distribution, as seen when a graph is drawn with the values on the x-axis and their frequencies on the y-axis.
       
    • Positive skewness - most of the data is towards the left and the curve has a long tail towards the right
    • Normal curve - A normal curve is symmetric around the mean (zero skewness) and has the same value for mean, median and mode.
    • Negative skewness - most of the data is towards the right and the curve has a long tail towards the left
  • Kurtosis - A measure of the sharpness of the peak (and the heaviness of the tails) when a graph is drawn with the values on the x-axis and their frequencies on the y-axis. The values below are measured relative to the normal curve (excess kurtosis).
       
    • Leptokurtic - Lepto means 'thin'. A leptokurtic curve has a positive kurtosis value.
    • Mesokurtic/Normal - Meso means 'middle'. A normal curve has a kurtosis value of zero (0).
    • Platykurtic - Platy means 'flat'. A platykurtic curve has a negative kurtosis value.
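
The dispersion measures listed above can be sketched in plain Python as follows. The sample is hypothetical and chosen only for illustration. Two assumptions to note: quartile conventions differ between tools (the `"inclusive"` method of `statistics.quantiles` is just one common choice), and skewness and kurtosis are computed here with the simple moment-based population formulas, which may differ slightly from the bias-corrected values some libraries report.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Range = Max - Min
data_range = max(data) - min(data)  # 7

# Quartiles and IQR (assumption: "inclusive" interpolation method)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Standard deviation, following the 5 steps (population form: divide by N)
mean = sum(data) / n                             # step 1
squared_diffs = [(x - mean) ** 2 for x in data]  # steps 2 and 3
variance = sum(squared_diffs) / n                # step 4
std_dev = math.sqrt(variance)                    # step 5

# Moment-based skewness and excess kurtosis (assumed definitions)
m3 = sum((x - mean) ** 3 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n
skewness = m3 / variance ** 1.5          # > 0: long tail to the right
excess_kurtosis = m4 / variance ** 2 - 3  # 0 for a normal curve

print(data_range, (q1, q2, q3), iqr, std_dev, skewness, excess_kurtosis)
```

The hand-computed `std_dev` should agree with `statistics.pstdev(data)`, which implements the same divide-by-N formula.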
   

Let's work through an example to understand descriptive statistics better. The table below has the heights (in cm) of 25 students in a class. Let's calculate various statistical measures for this data.

 

Mean

Sum of all 25 heights / 25 = 3439 / 25 = 137.56 cm

Median

To calculate the median, we need to arrange them in increasing order. So, let's sort the values in increasing order :

135 135 135 135 136 136 136 136 136 136 137 137 137 137 138 138 138 139 139 139 140 140 141 141 142

The number of data points N is 25 (odd) here, so the median is the \(\frac{25 + 1}{2}\) = 13th value

median = 137

Mode

136 appears the maximum number of times (6 times), so mode = 136

Range

142 (max value) - 135 (min value) = 7 cm

Standard Deviation

Grouping equal heights together, the sum of squared differences from the mean (137.56) is divided by N = 25:

\[ \sqrt{\frac{4(135-137.56)^2 + 6(136-137.56)^2 + 4(137-137.56)^2 + 3(138-137.56)^2 + 3(139-137.56)^2 + 2(140-137.56)^2 + 2(141-137.56)^2 + (142-137.56)^2}{25}} = \sqrt{\frac{104.16}{25}} \approx 2.04 \text{ cm} \]

Dividing by N - 1 = 24 instead (the sample standard deviation, which is what pandas' std() computes by default) gives approximately 2.08 cm.
 

Fortunately, pandas provides built-in functions to calculate descriptive statistics. Let's solve the same example using pandas functions.
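
A minimal sketch of the same calculations with pandas, assuming the 25 heights are entered by hand from the sorted list in the Median section. The variable names are illustrative. Note that `Series.std()` divides by N - 1 by default, so `ddof=0` is passed to match the divide-by-N formula used earlier, and that pandas uses linear interpolation for quantiles unless told otherwise.

```python
import pandas as pd

# Heights (cm) of the 25 students, copied from the sorted list above
heights = pd.Series([
    135, 135, 135, 135, 136, 136, 136, 136, 136, 136,
    137, 137, 137, 137, 138, 138, 138, 139, 139, 139,
    140, 140, 141, 141, 142,
])

# Measures of central tendency
mean = heights.mean()      # 137.56
median = heights.median()  # 137.0
mode = heights.mode()[0]   # mode() returns a Series because ties are possible

# Measures of dispersion
value_range = heights.max() - heights.min()  # 7
q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1
std_population = heights.std(ddof=0)  # divide by N, about 2.04
std_sample = heights.std()            # pandas default, divide by N - 1, about 2.08
skewness = heights.skew()             # bias-corrected sample skewness
kurtosis = heights.kurt()             # bias-corrected excess kurtosis

print(mean, median, mode, value_range, q1, q3, iqr)
print(std_population, std_sample, skewness, kurtosis)
```

`heights.describe()` gives a quick summary (count, mean, std, min, quartiles, max) in a single call.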

Exercise
  1. Create a dataframe olympicsMedalTally of the top 10 countries in the 2016 Olympics medal table, as below
  2. Calculate all Measures of Central Tendency.
  3. Calculate all Measures of Dispersion.