Underfitting-Overfitting

Fitting describes how well a model captures the underlying pattern in the data.

There are three possible cases:

  1. Overfitting
  2. Underfitting
  3. Balanced-fitting
 

Overfitting

A model is overfitted when it fits the training data too closely, capturing noise along with the underlying pattern. Such a model gives excellent accuracy on training data but poor accuracy on unseen data. These characteristics indicate a high-variance, low-bias model.

When does overfitting happen -

  • When the model is too complex because it uses high-degree input variables, complex variables, or too many variables.
  • It is more likely to occur in non-parametric and non-linear models (example: decision tree models).

How overfitting can be handled -

  • Feature selection: use fewer features and less complex variables.

For example, consider polynomial models that use different degrees of the input feature X. When we use higher-degree (more complex) variables, the model tends to overfit.
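The polynomial comparison can be sketched in a few lines of code. This is only an illustration under assumed conditions: the data is synthetic (a sine curve plus noise), and the degrees 1, 3 and 12, the sample sizes, and the helper names make_data and mse are all choices made for this sketch, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def make_data(n):
    # Assumed ground truth for this sketch: a sine curve plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)   # small training set
x_test, y_test = make_data(200)    # separate data the model never sees

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    # Least-squares polynomial fit of the chosen degree on the training data.
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```

Training error cannot increase as the degree rises, because every lower-degree polynomial is a special case of a higher-degree one. The thing to watch is the gap between training error and test error: a low training error combined with a much higher test error is the signature of overfitting.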

  • Increase the amount of regularization used - Regularization penalizes the model for overfitting. We will learn about it in the Regularization chapter; in short, it shrinks the coefficients of the variables to smaller values or almost zero, which reduces the impact of those variables.
  • Cross-validation - With a cross-validation technique like k-fold, we train the model k times, each time on a different split of the data. Every observation is used for both training and validation, so the model is not tuned to one specific train and test split, which makes overfitting easier to detect and less likely.
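Regularization and k-fold cross-validation can be combined in one small sketch. The code below assumes a ridge (L2) penalty as the form of regularization, which the text above does not specify. It also uses synthetic data, and the helper names poly_features, ridge_fit and kfold_mse, together with the candidate penalty strengths, are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 40)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, 40)  # assumed synthetic data

def poly_features(x, degree):
    # Columns x^1 ... x^degree (the intercept is handled in ridge_fit).
    return np.vander(x, degree + 1, increasing=True)[:, 1:]

def ridge_fit(X, y, alpha):
    # Closed-form ridge regression with an unpenalized intercept:
    # minimize ||y - b - X w||^2 + alpha * ||w||^2
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    b = y_mean - x_mean @ w
    return w, b

def kfold_mse(X, y, alpha, folds):
    # Train on k-1 folds, validate on the held-out fold, average over folds.
    errors = []
    for i, val in enumerate(folds):
        train = np.concatenate(folds[:i] + folds[i + 1:])
        w, b = ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((X[val] @ w + b - y[val]) ** 2))
    return float(np.mean(errors))

X = poly_features(x, 10)                             # deliberately flexible model
folds = np.array_split(rng.permutation(len(y)), 5)   # k = 5 folds, shared by all runs
alphas = (1e-6, 1e-3, 1e-1, 10.0)                    # candidate penalty strengths

cv_errors = {a: kfold_mse(X, y, a, folds) for a in alphas}
norms = {a: float(np.linalg.norm(ridge_fit(X, y, a)[0])) for a in alphas}
best_alpha = min(cv_errors, key=cv_errors.get)

for a in alphas:
    print(f"alpha {a:g}: coefficient norm {norms[a]:.3f}, CV MSE {cv_errors[a]:.3f}")
print("alpha with lowest cross-validated error:", best_alpha)
```

A larger alpha shrinks the coefficient norm, which is the "smaller values or almost zero" effect described above. The cross-validated error gives one way to choose how much shrinkage to apply without relying on a single train and test split.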

Underfitting

A model is underfitted when it is not able to fit the data well. Such a model does not give good accuracy on training data, and it typically performs poorly on unseen data as well. These characteristics indicate a low-variance, high-bias model.

When does it happen -

  • The model is overly simple, with too few variables or only very simple ones
  • More likely to occur in parametric and linear models

How it can be handled -

  • Add new variables (features) and more interaction variables using domain knowledge
  • Decrease the amount of regularization
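The first remedy, adding an interaction variable, can be illustrated with a minimal sketch. The setup is assumed for the example: the synthetic target depends on the product of two inputs, and the helper names make_data, design and fit_and_score are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # Assumed ground truth: the target depends on the product x1 * x2.
    x1 = rng.uniform(-1.0, 1.0, n)
    x2 = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x1 * x2 + rng.normal(0.0, 0.1, n)
    return x1, x2, y

def design(x1, x2, with_interaction):
    # Design matrix: intercept, x1, x2, and optionally the interaction x1 * x2.
    cols = [np.ones_like(x1), x1, x2]
    if with_interaction:
        cols.append(x1 * x2)
    return np.column_stack(cols)

train = make_data(200)
test = make_data(200)

def fit_and_score(with_interaction):
    X_train = design(train[0], train[1], with_interaction)
    X_test = design(test[0], test[1], with_interaction)
    # Ordinary least squares on the training data.
    w, *_ = np.linalg.lstsq(X_train, train[2], rcond=None)
    train_mse = float(np.mean((X_train @ w - train[2]) ** 2))
    test_mse = float(np.mean((X_test @ w - test[2]) ** 2))
    return train_mse, test_mse

plain = fit_and_score(with_interaction=False)  # too simple for this data
rich = fit_and_score(with_interaction=True)    # includes the missing variable

print(f"without interaction: train MSE {plain[0]:.3f}, test MSE {plain[1]:.3f}")
print(f"with interaction:    train MSE {rich[0]:.3f}, test MSE {rich[1]:.3f}")
```

Without the interaction term the error stays high on both training and test data, which is the underfitting pattern. Once the missing variable is added, both errors drop to roughly the noise level.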

The aim of modeling should always be to reach a balanced fit, where the accuracy of the model is good enough on both training data and test data.
