Linear Regression

Regression is a widely used machine learning algorithm. At its core, it is simply a measure of the relationship between a target/dependent variable and one or more independent variables.

Prediction using regression works on a simple principle: once you are able to establish the relationship between the target variable and the independent variables, you can easily predict the target variable whenever the independent variables are provided. For example, if you could establish a relationship Y = mX + C between independent variable X and dependent variable Y (i.e., you could figure out the values of m and C), you can easily calculate the value of Y if you are provided with the value of X.

Linear regression primarily identifies a linear relationship between variables. A linear relationship means that when an independent variable's value increases, the dependent variable's value also increases or decreases linearly. The graphs below show increasing and decreasing linear relationships between two variables.

[Graphs: increasing and decreasing linear relationships between two variables]

These graphs can be written as Y = mX + C, which is the equation of a straight line in mathematics,

where

  • m is the slope of the line (how much it is tilted), and its value is equal to tan(θ), where θ is the angle the line makes with the X-axis.
  • C is the intercept, i.e. how far from the origin (0, 0) the line cuts the Y-axis.

These are called the parameters of the line.

If X's value increases by 1 unit, Y's value will increase by m units. Here, X is the independent variable and Y is the dependent variable.

A relationship between two variables, as shown in the above graph, is called a simple linear relationship. If there is more than one independent variable, it is called a multiple linear relationship.

A multiple linear regression equation looks something like this:

\[ Y = a_0 + a_1x_1 + a_2x_2 + ... + a_nx_n \]

Here \(a_{0}, a_{1}, \ldots, a_{n}\) are the parameters/coefficients that describe the relationship between the features \(x_1, x_2, \ldots, x_n\) and the target variable \(Y\).

We will understand this whole concept practically by working on cars data in the next chapter. We will also introduce you to super simple Python packages for applying machine learning algorithms.
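As a tiny preview, here is a minimal sketch, assuming scikit-learn and NumPy are available and using made-up numbers rather than the cars data, of how such a package estimates the slope m and intercept C for us:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that roughly follows Y = 2X + 1 plus a little noise
X = np.array([[1], [2], [3], [4], [5]])       # independent variable (one column)
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])      # dependent / target variable

model = LinearRegression()
model.fit(X, y)                               # estimates the parameters m and C

print("slope m (coef_):     ", model.coef_[0])     # close to 2
print("intercept C:         ", model.intercept_)   # close to 1
print("prediction for X = 6:", model.predict([[6]])[0])
```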

 

Residuals

 

A residual is the difference between the actual and the predicted value of the target variable. In linear regression, we try to find the line of best fit, which reduces this difference as much as possible.

These residuals can arise for two reasons:

  1. The model is not able to capture the data pattern well, i.e. the line provided by the model is not the best fit.
  2. Some random, unpredictable error inherent in the data

A model can capture only patterns; a random error has no pattern in it for the model to capture. The sum of these errors should be zero. Random error graphs are drawn with the residuals plotted against the predicted (fitted) values.

[Figures: residual vs. fitted value plots, one showing random scatter and one showing a clear pattern]

The second figure above does not show random error, because there are two clear patterns:

  1. All residuals are negative before 6 and positive after 6
  2. Residual values show an increasing trend

Residual graphs should not give you any information. If you are able to get any information or pattern out of the residual graph, it means your model is not good enough to capture the entire data pattern and is losing information to the residuals.
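To make this concrete, here is a minimal sketch (using scikit-learn, matplotlib, and made-up data, not the course data set) that fits a line, computes the residuals, checks that they sum to roughly zero, and draws the residual vs. fitted plot described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up data scattered randomly around a straight line
rng = np.random.default_rng(0)
X = np.arange(1, 21).reshape(-1, 1)
y = 2 * X.ravel() + 1 + rng.normal(0, 1.5, size=20)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)      # predicted (fitted) values
residuals = y - fitted         # actual minus predicted

print("sum of residuals:", residuals.sum())   # very close to 0 for a least-squares fit

# Residual vs. fitted plot: for a good fit this should look like pure random scatter
plt.scatter(fitted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted (predicted) values")
plt.ylabel("Residuals")
plt.show()
```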

There are two ways to bring that pattern back from the residuals into the model:

  1. Add new variables to the model
  2. Use some other non-parametric algorithm, such as decision trees (see the sketch below)
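As a rough illustration of the second option, here is a minimal sketch (scikit-learn, made-up data with a deliberately non-linear pattern) comparing a straight line with a decision tree regressor:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up data with a clearly non-linear (quadratic) pattern
X = np.arange(1, 21).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + 3

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# The tree captures the curvature that a single straight line misses
print("linear R-square:", linear.score(X, y))
print("tree R-square:  ", tree.score(X, y))
```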
 

Assumptions of Linear Regression

 

There are a few assumptions which need to be checked before applying linear regression to the data.

Linear Relationship

The primary and foremost condition for applying linear regression is that there must be a linear relationship between the dependent and independent variables. It can be tested by drawing a simple scatter plot with the dependent variable on the y-axis and the independent variable on the x-axis.
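A minimal sketch of that check, assuming pandas and matplotlib and using made-up column names such as engine_size and price, could look like this:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data frame; engine_size and price are made-up column names
df = pd.DataFrame({
    "engine_size": [1.0, 1.2, 1.6, 2.0, 2.4, 3.0],
    "price":       [9.5, 10.8, 13.1, 16.2, 18.9, 23.5],
})

# Independent variable on the x-axis, dependent variable on the y-axis
plt.scatter(df["engine_size"], df["price"])
plt.xlabel("engine_size (independent variable)")
plt.ylabel("price (dependent variable)")
plt.show()
```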

No Auto-correlation

There should be no correlation between residual values. That means if you know one residual value, you should not be able to predict any other residual value. This holds when the residual values are completely random.
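One common way to check this, sketched below under the assumption that you already have the residuals from a fitted model, is the Durbin-Watson statistic from statsmodels; values close to 2 suggest little autocorrelation:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical residuals taken from an already-fitted model
residuals = np.array([0.3, -0.5, 0.1, 0.4, -0.2, -0.1, 0.6, -0.6])

# Durbin-Watson is roughly 2 when residuals are uncorrelated;
# values near 0 suggest positive and values near 4 negative autocorrelation
print("Durbin-Watson statistic:", durbin_watson(residuals))
```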

No Multicollinearity

When some of the variables are correlated with each other, it can confuse the algorithm: it may not be able to assign the correct coefficients/parameters/weights to such variables. It is exactly like two people talking to you simultaneously and you not being able to understand either of them. To remove multicollinearity, create a correlation matrix and keep only one of any two highly correlated variables.
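A minimal sketch of that check with pandas, using made-up columns, could look like this:

```python
import pandas as pd

# Made-up data: engine_size and horsepower are strongly related to each other
df = pd.DataFrame({
    "engine_size": [1.0, 1.2, 1.6, 2.0, 2.4, 3.0],
    "horsepower":  [70, 85, 110, 140, 165, 210],
    "weight":      [950, 1020, 1100, 1250, 1300, 1500],
})

# Correlation matrix: look for pairs with very high absolute correlation
print(df.corr())

# If two variables are highly correlated (say, above 0.9), keep only one of them
```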

No Heteroskedasticity

If the variance of the residuals is not constant and shows a funnel-type shape, that circumstance is called heteroskedasticity. It means that as we move from left to right in the residual vs. fitted value graph, the range of the error term keeps increasing. That is itself a pattern, which goes against the basic assumption that there should not be any pattern in the residuals.

In such a graph, the residual values may still sum to 0, yet they clearly exhibit heteroskedasticity.
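Besides eyeballing the residual vs. fitted plot, one common statistical check is the Breusch-Pagan test; here is a minimal sketch, assuming statsmodels is available and using made-up data whose noise deliberately grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Made-up data where the spread of the errors grows with x (a funnel shape)
rng = np.random.default_rng(0)
x = np.arange(1, 101)
y = 3 * x + rng.normal(0, 0.5 * x)   # noise gets wider as x grows

X = sm.add_constant(x)               # adds the intercept column
model = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value suggests heteroskedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)
```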

Normality

Residuals should be distributed normally around 0. This boils down to the fact that if the errors are not normally distributed (normality being a manifestation of randomness) and show some skewness, then either some pattern was left uncaught by the model or the relationship between the independent variables and the target variable is not linear.
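A quick visual check, sketched here with scipy's probability plot on made-up residuals (a simple histogram of the residuals works too):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Hypothetical residuals taken from an already-fitted model
rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, size=200)

# Q-Q plot: points hugging the straight line indicate roughly normal residuals
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()
```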

 

Important variables

 

Important variables are the ones with high coefficients, but only if you have standardized the variables first. Variables measured on different scales (like kilograms, meters, or degrees of temperature) can't be compared as is; they need to be standardized to bring them to the same scale.
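A minimal sketch of that standardization step with scikit-learn, using made-up columns on very different scales, before comparing coefficients:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Made-up features measured on very different scales
X = pd.DataFrame({
    "weight_kg": [950, 1020, 1100, 1250, 1300, 1500],
    "length_m":  [3.6, 3.8, 4.0, 4.3, 4.4, 4.8],
})
y = np.array([9.5, 10.8, 13.1, 16.2, 18.9, 23.5])

# Bring every feature to the same scale (mean 0, standard deviation 1)
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_scaled, y)
# Now the coefficients are comparable: larger absolute value = more important
print(dict(zip(X.columns, model.coef_)))
```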

Another way to find important variables is to add a variable and calculate the R-square value, which shows how much of the variability in the data is explained by the model/variable (i.e., how well the model captures the data pattern). Variables that increase the model's R-square value the most should be considered important variables.
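A sketch of that idea, using scikit-learn's built-in R-square score on made-up data: fit the model with and without a candidate variable and see how much R-square improves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data where the target depends on both x1 and x2
rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, 100)
x2 = rng.uniform(0, 10, 100)
y = 3 * x1 + 5 * x2 + rng.normal(0, 1, 100)

# R-square with only x1 in the model
r2_one = LinearRegression().fit(x1.reshape(-1, 1), y).score(x1.reshape(-1, 1), y)

# R-square after adding x2
X_both = np.column_stack([x1, x2])
r2_both = LinearRegression().fit(X_both, y).score(X_both, y)

print("R-square with x1 only:   ", round(r2_one, 3))
print("R-square with x1 and x2: ", round(r2_both, 3))
# The large jump suggests x2 is an important variable
```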
