Inferential Statistics

In most cases, it is not possible to collect data on the entire group of interest, so we collect only part of it. In statistical terms, the entire data set is called the population and the part we collect is called a sample.

You may have seen how, before elections, analytics companies ask a small number of people which political party they intend to vote for and, on that basis, predict which party will win. In this case, those few people are a sample of the entire population of voters.

 

You may wonder how it is possible to predict the behavior of an entire population from just a few people, who sometimes make up less than 1% of it. This is where the concept of representativeness comes in.

 

Representativeness

 

A sample is called representative of a population when it has approximately the same characteristics as the population. For example, the mean, median, range, etc. of the sample are close to those of the population.

This raises the question of how to collect a representative sample from the population. There are many sampling techniques for doing so. Let’s discuss 3 main techniques here:

  1. Simple Random Sampling (SRS)
  2. Stratified Sampling
  3. Systematic Sampling
 

1. Simple Random Sampling (SRS)

 

Here, the sample is collected purely at random. You just pick some people from the voter list at random, without applying any rule or logic. The rationale for this sample being representative is that every person has the same chance of being selected, so the selection introduces no systematic bias; on average, the sample should therefore have the same characteristics as the population. A sample that is systematically biased was not collected at random, although any single random sample can still differ somewhat from the population by chance. In the accompanying picture, the red dots are selected at random and form a random sample.
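As a minimal sketch of simple random sampling, the snippet below draws 1,000 voters from a list of 50,000 using Python's standard library. The voter IDs, the sample size and the seed are assumptions made up for illustration, not data from a real voter list.

```python
import random

# Hypothetical voter list: IDs 1..50,000 (illustrative, not real data).
voters = list(range(1, 50_001))

# A fixed seed is used only so that the run is repeatable.
rng = random.Random(42)

# sample() draws without replacement; every voter is equally likely to be picked.
srs_sample = rng.sample(voters, 1000)

print(len(srs_sample), srs_sample[:5])
```

Because the draw is without replacement, no voter can appear twice in the sample.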

 

2. Stratified Sampling

 

Here, you apply a little logic before picking data randomly. You divide the population into groups (strata) on the basis of an important factor and then, inside each group (stratum), you collect a sample by random sampling. Because data is collected from every group, the sample is representative of the population at least with respect to that factor.

Example: If the voter list is 51% male and 49% female, you divide the population into these two groups, and if you have to pick 1000 data points, you randomly pick 510 males and 490 females.
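The 51% / 49% example can be sketched in Python as follows. The helper name stratified_sample, the population size of 50,000 and the integer IDs standing in for voters are all assumptions chosen for illustration.

```python
import random

def stratified_sample(strata, total_size, rng):
    """Draw a random sample from each stratum, proportional to its size."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; rounding may need adjusting for other splits.
        k = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical voter list of 50,000 people: 51% male, 49% female.
strata = {
    "male": list(range(0, 25_500)),
    "female": list(range(25_500, 50_000)),
}

rng = random.Random(42)  # fixed seed only for repeatability
sample = stratified_sample(strata, 1000, rng)

print({name: len(members) for name, members in sample.items()})
```

With these numbers, the sample contains 510 males and 490 females, matching the proportions of the population.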

 

3. Systematic Sampling

 

Here you follow a fixed pattern to pick sample data points. For example, from a voter list of 50,000 people, you pick every 50th person to get a sample of 1,000.
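A minimal sketch of the every-50th-person rule is shown below. The voter IDs are made up, and starting from a randomly chosen position within the first 50 entries is an assumption added here (a common variant), not something the example above requires.

```python
import random

# Hypothetical voter list: IDs 1..50,000 (illustrative, not real data).
voters = list(range(1, 50_001))

sample_size = 1000
step = len(voters) // sample_size  # 50: pick every 50th person

# Assumption: start at a random position within the first interval.
rng = random.Random(7)
start = rng.randrange(step)

# Slice with a stride: start, start + 50, start + 100, ...
systematic_sample = voters[start::step]

print(len(systematic_sample), systematic_sample[:3])
```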

 
 

Random Variable

 

All of statistics is built on one concept: the random variable. A random variable is a variable whose actual value can only be determined after the associated event happens. Before the event, you can only state its value with some probability, not with 100% certainty.

For example, when you flip a fair coin, you can't tell whether it will land Heads or Tails, but you can say that there is a 50% probability of Heads and a 50% probability of Tails.

Another example – Let's say that 50,000 people are going to cast their votes in one constituency for an upcoming election. Some will vote for Party A, and others will not. But, how many people will vote for party A is a random variable. You don’t know that till the election result is out. But, by taking a representative sample of people and asking them which party they are going to vote for, you can get some idea about the election result.
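To make the polling idea concrete, here is a small simulation sketch. The true support for Party A (55%) is a made-up figure used only to generate data; in a real election this number is exactly what is unknown.

```python
import random

rng = random.Random(0)  # fixed seed only for repeatability

# Hypothetical constituency: 27,500 of 50,000 voters (55%) support Party A.
# 1 = votes for Party A, 0 = does not.
voters = [1] * 27_500 + [0] * 22_500

# Poll a simple random sample of 1,000 voters.
poll = rng.sample(voters, 1000)
estimated_share = sum(poll) / len(poll)

print(f"Estimated share for Party A: {estimated_share:.3f}")
```

The estimate varies from sample to sample, but with 1,000 respondents it typically lands within a few percentage points of the true share.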

 
 

Probability

 

Probability is the chance that some event will take place; in other words, the likelihood of an event happening. When all outcomes are equally likely, it can be calculated as

Probability of an event = No. of outcomes in which the event occurs / Total no. of possible outcomes

Example: Let's say we have a basket containing 4 red balls and 6 blue balls. If a ball is picked at random from the basket, what is the chance that it is red?

Probability of a red ball being picked = total no. of red balls / total no. of balls = 4 / 10 = 0.4
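The same calculation can be sketched in Python, together with a quick simulation as a sanity check. The number of simulated draws and the seed are arbitrary choices made for illustration.

```python
import random
from fractions import Fraction

basket = ["red"] * 4 + ["blue"] * 6

# Exact probability: favourable outcomes / all equally likely outcomes.
p_red = Fraction(basket.count("red"), len(basket))
print(p_red, float(p_red))  # 2/5 0.4

# Sanity check by simulation: draw one ball (with replacement) many times.
rng = random.Random(1)
draws = 100_000
red_draws = sum(rng.choice(basket) == "red" for _ in range(draws))
estimate = red_draws / draws
print(estimate)
```

The simulated proportion should be close to 0.4, and gets closer as the number of draws grows.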

 
 

Expected Value

 

As the name suggests, the Expected Value of a random variable is the value you would expect on average; it is not necessarily the most probable value. If you have to state a single value for a random variable before the event happens, you would state its expected value. It is the average of all possible values weighted by their probabilities, that is, the sum of each possible value multiplied by its probability.

EV = x1*p1 + x2*p2 + x3*p3 + … + xn*pn

where \(x_1\), \(x_2\), ..., \(x_n\) are the possible values and \(p_1\), \(p_2\), ..., \(p_n\) are the corresponding probabilities of these values.

Example: Let's say John buys a lottery ticket for $10 and there are 1000 lottery tickets in all. If John wins, he would be awarded $5000. What would be the expected winning value for this lottery ticket for John?

Here winning value of the lottery ticket is a random variable.

Assuming all lottery tickets have an equal chance of winning and exactly one ticket wins, there are two possible values of this random variable:

  • -$10, if John does not win the lottery, as the ticket price is lost
  • $4990, if John wins the lottery: the $5000 prize minus the $10 ticket price
 
  • Probability of John losing the lottery = 999/1000 & Amount lost = $10
  • Probability of John winning the lottery = 1/1000 & Amount won = $5000-$10 = $4990

Expected value of John's winnings = -$10*(999/1000) + $4990*(1/1000) = -$9.99 + $4.99 = -$5
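The lottery calculation can be checked with a few lines of Python. The helper name expected_value is introduced here for illustration; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

def expected_value(values, probabilities):
    """Sum of each possible value multiplied by its probability."""
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probabilities))

# Net winnings in dollars: lose the $10 ticket, or win $5000 minus the $10 ticket.
net_winnings = [-10, 4990]
probabilities = [Fraction(999, 1000), Fraction(1, 1000)]

ev = expected_value(net_winnings, probabilities)
print(ev)  # -5
```

On average, John loses $5 per ticket, even though that exact outcome can never occur on any single ticket.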

 
 

Probability Distribution Graph

 

This brings us to another concept called the Probability Distribution Graph: a graph of all values of a random variable against their probabilities. Let’s understand this with an example of the heights of students in a class. If we put heights on the x-axis and the corresponding proportion of students (the probability: number of students of a particular height / total number of students) on the y-axis, the resulting graph is the probability distribution graph, also called the Probability Density Graph. Here, the height of a student is the random variable.

If you can find a function y = f(x) that represents this graph, then this function is called the Probability Distribution/Density Function. These functions can be of two types:

  • continuous – height of students in the class (height can be any real number within a range)
  • discrete – no. of students coming to class on time (it can only be a whole number, not any real number)

So, it all boils down to the fact that if you know the probability distribution function/graph of a random variable, you can easily calculate the probability of a certain outcome. For example, suppose you are given the above graph and asked: what is the probability that a randomly picked student has a height of 137 cm? You can look at the graph and read off that it is around 0.170.

 
 

Various Types of Probability Distributions

 

A random variable often follows one of a few common probability distributions. Let’s get into these distributions and look at an example of each. Broadly, there are two types of distributions:

  1. Discrete Distributions

    In Discrete Distributions, the random variable can take only certain separate numerical values (integers, for example). In this case, the function is called the Probability Mass Function instead of the Probability Distribution Function.

  2. Continuous Distributions

    In Continuous Distributions, the random variable can take any numerical value within a range. In this case, the function is called the Probability Density/Distribution Function.

Discrete Distribution

 

1. Binomial Discrete Distribution

 

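Binomial random variables follow this distribution. A binomial random variable counts the number of successes in n independent trials, where each trial has only two possible outcomes and the same probability of success. For example, each flip of a coin (Head or Tail) is one such trial, and the number of Heads in n flips is a binomial random variable.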

The probability of getting exactly x successes in n trials is given by

\[ \text{Pr(x)} = \frac{n!}{x!(n-x)!}p^x{(1-p)}^{n-x} \]

where

  • n : number of trials/experiments
  • x : number of successes
  • p : probability of success

Mean of a binomial distribution = np

Standard deviation of a binomial distribution = \(\sqrt{npq}\) where q=1-p
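As a quick check, the two formulas can be compared with the values reported by scipy.stats. The values n = 10 and p = 0.2 are chosen here purely for illustration.

```python
from math import sqrt

from scipy.stats import binom

# Illustrative parameters: 10 trials, success probability 0.2.
n, p = 10, 0.2
q = 1 - p

mean_formula = n * p           # np
std_formula = sqrt(n * p * q)  # sqrt(npq)

# scipy.stats.binom exposes the same quantities directly.
print(mean_formula, binom.mean(n, p))
print(std_formula, binom.std(n, p))
```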

Example: Whether a student comes to class on a particular day is a two-outcome trial, so the number of absent students is a binomial random variable. Suppose that, on average, 20% of students skip class on any given day, each independently of the others. The teacher does not teach the class if more than 20% of students are absent. There are 10 students in the class. Let’s calculate the chance that the teacher will teach the class on any given day.

This problem can be solved using the formula given above, but we will use the Python package scipy.stats to solve inferential statistics problems. In this problem, the teacher will not teach the class if more than 2 students are absent, so we need the probability that at most 2 students are absent.

There are two approaches to solve the problem:

  1. Calculate the probabilities of exactly 0, 1 and 2 students missing the class and add them up. For this approach, we use the pmf (probability mass function): binom.pmf(x, n, p)
  2. Calculate the cumulative probability of at most 2 students missing the class (cumulative = sum of the probabilities of all values less than or equal to a particular number). For this approach, we use the cdf (cumulative distribution function): binom.cdf(x, n, p)
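Both approaches can be sketched with scipy.stats as follows. The variable names are chosen here for illustration; n = 10 students and p = 0.2 come from the example above.

```python
from scipy.stats import binom

n = 10          # students in the class
p = 0.2         # probability that a given student is absent
max_absent = 2  # the teacher still teaches with at most 2 absences

# Approach 1: add up P(X = 0), P(X = 1) and P(X = 2) using the pmf.
prob_via_pmf = sum(binom.pmf(k, n, p) for k in range(max_absent + 1))

# Approach 2: use the cdf, which gives P(X <= 2) directly.
prob_via_cdf = binom.cdf(max_absent, n, p)

print(f"pmf sum: {prob_via_pmf:.4f}, cdf: {prob_via_cdf:.4f}")
```

Both approaches give the same answer, roughly 0.68, so the teacher teaches the class on about two days out of three.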
 
Exercise 1
  • 5 coins are tossed. What is the probability that at least 3 of them land Heads?
 

2. Poisson Discrete Distribution

 

A Poisson random variable counts the number of occurrences of an event in some time period, so it can take any non-negative whole-number value, not just two values.

e.g.: The number of accidents happening in a day is a random variable that can take many different values (not just two, as in a single trial of a binomial experiment).

\[ \text{Pr(x)} = \frac{\lambda^x e^{-\lambda}}{x!} \]

where

  • \(\lambda\) is the average number of occurrences of an event in some time period
  • x is the number of occurrences we are interested in

Example: On average, 12 accidents happen per day in a city. The available ambulances can handle up to 15 accident cases. What is the probability that the number of cases goes above 15 and extra ambulances need to be deployed?
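One way to work this out is with scipy.stats.poisson, as in the sketch below. The variable names are chosen here for illustration; the survival function sf(x) gives P(X > x), which equals 1 - cdf(x).

```python
from scipy.stats import poisson

lam = 12       # average number of accidents per day
capacity = 15  # accident cases the ambulances can handle

# P(X > 15): probability that accidents exceed the ambulance capacity.
prob_over_capacity = poisson.sf(capacity, lam)

# Equivalent calculation via the cumulative distribution function.
prob_check = 1 - poisson.cdf(capacity, lam)

print(f"P(more than {capacity} accidents) = {prob_over_capacity:.4f}")
```

Under these assumptions, extra ambulances are needed on roughly 16% of days.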

Exercise 2
  • In a call center, around 300 calls are handled daily on average. The employees can handle at most 350 calls per day. What is the probability that the number of calls goes beyond 350 and extra employees need to be hired?