Training vs Testing

The most critical step of a data science project is to create a model that can predict an outcome on unseen data.

You start by collecting data from various sources, then explore, clean and prepare it. After that, the data is ready for modeling.

To create (train) a model, we don't use 100% of the available data. Generally, 60-80% of the data, called the training data, is used to build the model. The rest is used to check the performance of the model and is divided into two parts - validation data & test data.

 

1. Training dataset - To train the model

2. Validation dataset - To check whether hyperparameters like the 'amount of regularisation' chosen for the model are appropriate. These hyperparameters are set by the user before modeling and are changed if the accuracy on the validation dataset is not good. We will read about them in future chapters.

3. Test dataset - To test the accuracy of the final model

 

There is one more type of data, called unknown data, to which we apply our final model to solve the business problem. Generally, we don't have this data at the time of modeling; as and when it arrives, we apply the model and make predictions.

 
 

It's always good to check the performance of the model on the test data in hand before putting it to use for prediction on unseen data.

The process of checking the performance of the model is also known as model validation.

The split of data into training, validation and test sets should be completely random.
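As a concrete illustration, here is a minimal sketch of such a random split into training, validation and test sets, using sklearn's train_test_split function (introduced later in this chapter). The toy dataframe and the 60/20/20 proportions are assumptions made only for illustration.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Small made-up dataframe, used purely for illustration.
    df = pd.DataFrame({
        'feature_1': range(100),
        'feature_2': range(100, 200),
        'target':    [0, 1] * 50,
    })
    x = df.drop(columns=['target'])
    y = df['target']

    # First hold out 20% as the test set ...
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=43)
    # ... then take 25% of the remaining 80% as validation (0.25 * 0.8 = 20% of the original).
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=43)

    print(len(x_train), len(x_val), len(x_test))   # 60 20 20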

 
 

A model generally gives better accuracy on the training data than on the validation/test data, because it has been trained on the training data only. Its accuracy might drop even further on unseen data.

Sometimes, when data is scarce, people prefer to have only two datasets - train & test.

 
 

This method of holding back some data from model building and using it only to test the model is called the holdout method.

 

We use the train_test_split function of the sklearn.model_selection module to split data into train & test datasets. It takes the below parameters (a usage sketch follows the list):

  • x - the feature (independent variable) dataset
  • y - the target variable series
  • test_size - the proportion of test data
  • random_state - train_test_split always shuffles the data before splitting, so every time you call this function you get different train & test data. Passing an integer to the random_state parameter ensures that it returns the exact same train and test datasets. For example, random_state=43 will always give you the same train and test datasets whenever you pass it to the train_test_split function.
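Here is a minimal usage sketch of these parameters; the toy x and y below are made up purely for illustration and are not part of the chapter's dataset.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Toy feature dataframe and target series, made up for illustration.
    x = pd.DataFrame({'age': [30, 62, 41, 55, 28, 70, 33, 48, 59, 36]})
    y = pd.Series([1, 2, 1, 2, 1, 2, 1, 1, 2, 1], name='survival_status')

    # 70/30 split; random_state=43 makes the shuffle reproducible,
    # so repeated calls return the exact same train and test datasets.
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=43)

    print(x_train.shape, x_test.shape)   # (7, 1) (3, 1)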
Exercise 1

For this chapter, we will use Haberman's Survival Dataset from the UC Irvine Machine Learning Repository.

  • Load the dataset into a dataframe Haberman_data and divide it into train & test datasets in a 70:30 ratio.
 

Cross-Validation

 

Though the split of the train & test datasets must be completely random, and it's assumed that the train and test data are similar in characteristics, there might still be some hidden pattern in the training data that is not present in the test data, or vice versa. Generally, there are 3 issues with the holdout method:

    1. High Variance

    A model built on training data with hidden patterns that are not present in the test or unseen data will give very poor predictions on that test or unseen data. Moreover, its performance will depend heavily on which data was used for model building. Such models, which are highly dependent on the data used for modeling, are called high variance models.

    2. High Bias

    On the other hand, there might be cases where some complex pattern is present in the test dataset but not in the training data. A model built on such training data will be too simple and have poor accuracy on the test or unseen data. Such models are called high bias models.

    3. Less Data

    Another problem is that you lose a significant amount of data (the test data) for building the model. This becomes a big issue when you already have little data available for modeling.

 

To overcome these problems, we should use cross-validation.

 

What is cross-validation

 

Cross-validation is nothing but dividing the entire dataset into k subsets and applying the holdout method k times. Each time, one subset is held out for testing and the remaining k-1 subsets are used for model building. In this process, each data point is used k-1 times for model training and once for testing.

 

This tackles all the problems above, as the entire data is used for training as well as testing. Generally, k varies between 5 and 10.

The error is estimated as the average over all k trials.
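A minimal sketch of this idea using sklearn's KFold class; the built-in breast cancer dataset and the logistic regression model are stand-ins chosen only for illustration.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    x, y = load_breast_cancer(return_X_y=True)

    # k = 5 subsets; each one is held out for testing exactly once.
    kfold = KFold(n_splits=5, shuffle=True, random_state=43)
    scores = []
    for train_idx, test_idx in kfold.split(x):
        model = LogisticRegression(max_iter=5000)
        model.fit(x[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(x[test_idx])))

    # The cross-validated estimate is the average over the k trials.
    print(sum(scores) / len(scores))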

 

Advantages of Cross-validation over normal train-test splitting:

  1. You can create k sets of train-test datasets from one original dataset.
  2. The test data does not remain unused, as every row is used for training at some point. So, any pattern in the test data that is not present in the train data will not go unnoticed, because that train-test split will give very poor accuracy in k-fold cross-validation.
  3. If you have little data and can't afford to hold out some of it as test data, k-fold cross-validation is very useful.
 

Cross-validation helps in:

  1. Selection of the right algorithm
  2. Choosing the model hyperparameters
 

Selection of the right algorithm

 

Cross-validation will help you in choosing the right model.

Let's say you used 2 algorithms, Algorithm 1 & Algorithm 2, to create models. You divided the data into two parts - train & test.

You created 2 models using these 2 algorithms, which gave the accuracies shown below:

What happened here? Algorithm 1 actually overfits the data more than Algorithm 2, and the data patterns differed between the training and test data, so the models trained on the training data could not give as good accuracy on the test data as they gave on the train data.

Therefore, we cannot rely on a single train-test split.

Let's say we used cross-validation and found the accuracies on the 5 test folds to be as below:

This shows that Algorithm 2 is a better choice than Algorithm 1 in this case.
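As a sketch of this kind of comparison (the two algorithms and the dataset below are stand-ins, not the ones behind the tables above), the cross-validated accuracy of each candidate can be computed with sklearn's cross_val_score helper, which is introduced a little later in this chapter.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    x, y = load_breast_cancer(return_X_y=True)

    # Mean 5-fold cross-validated accuracy for each candidate algorithm.
    for name, model in [('Algorithm 1', DecisionTreeClassifier(random_state=43)),
                        ('Algorithm 2', LogisticRegression(max_iter=5000))]:
        scores = cross_val_score(model, x, y, cv=5, scoring='accuracy')
        print(name, round(scores.mean(), 3))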

 

Choosing the model hyperparameters

 

Sometimes, even when the algorithm is finalised, cross-validation helps in choosing the correct model (the correct hyperparameters).

For example, we might have two models from the SVM algorithm with different regularization (C) values. Cross-validation also helps us choose the correct hyperparameters - svm(C=1) or svm(C=100)?
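A minimal sketch of such a hyperparameter comparison, assuming scikit-learn's SVC and a stand-in dataset; the cross_val_score helper is introduced just below.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    x, y = load_breast_cancer(return_X_y=True)

    # Compare two candidate values of the regularisation parameter C.
    for c in (1, 100):
        model = make_pipeline(StandardScaler(), SVC(C=c))
        scores = cross_val_score(model, x, y, cv=5, scoring='accuracy')
        print('svm(C=%d): mean accuracy %.3f' % (c, scores.mean()))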

Exercise 2
  • Convert the dataframe Haberman_data into 5-fold train-test datasets.

In practice, we don't divide the data into k sets of train-test datasets manually as we did above. sklearn provides a function called cross_val_score which directly returns the performance scores. It takes the below parameters (a usage sketch follows the list):

  • estimator - the model to evaluate
  • X - the feature dataset
  • y - the target series
  • cv - the cross-validation strategy, e.g. a KFold object or simply the number of folds
  • scoring - the performance metric to report, e.g. 'accuracy'
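A minimal usage sketch, with a stand-in dataset and model chosen only for illustration:

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    x, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)
    kfold = KFold(n_splits=5, shuffle=True, random_state=43)

    # One score per fold; the mean is the cross-validated accuracy.
    scores = cross_val_score(model, x, y, cv=kfold, scoring='accuracy')
    print(scores, scores.mean())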

Imbalanced data

 

One more problem which occurs very frequently in data science is imbalanced data. Let's take an example to understand this better:

You have a dataset of transactions of which 98% are genuine and 2% are fraud. Even if you create a model that simply predicts every transaction as genuine, it's still correct 98% of the time, which looks like great accuracy. But what will be the impact on the business if you miss the 2% fraud transactions? HUGE.
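A tiny sketch of this accuracy trap, with made-up counts matching the 98:2 example:

    import numpy as np

    # 9800 genuine (0) and 200 fraud (1) transactions, made up to match the example.
    y_true = np.array([0] * 9800 + [1] * 200)

    # A "model" that blindly predicts every transaction as genuine.
    y_pred = np.zeros_like(y_true)

    print((y_true == y_pred).mean())   # 0.98 -- looks great, yet every fraud case is missed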

To handle such cases, we need to rebalance the data before creating our train and test datasets. There are 3 methodologies to tackle this problem:

  1. Undersampling
  2. Oversampling
  3. SMOTE
 

Undersampling

 
  • In this technique, you keep only a randomly chosen subset of the majority class data points.
  • For example, if you have 9800 genuine transactions and 200 fraud ones, all you have to do is randomly take around 400 observations out of the 9800. The ratio of genuine vs fraud then becomes 400 : 200 (roughly 67% : 33%), which is far better than 98% : 2%.
  • There is always a risk of losing relevant information that may be hidden in the 9800 - 400 = 9400 discarded majority class observations.
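A minimal sketch of random undersampling, using the imblearn package's RandomUnderSampler as one possible tool; the transaction counts are made up to match the example above.

    import numpy as np
    from imblearn.under_sampling import RandomUnderSampler

    # Made-up imbalanced data: 9800 genuine (0) vs 200 fraud (1) transactions.
    x = np.arange(10000).reshape(-1, 1)
    y = np.array([0] * 9800 + [1] * 200)

    # sampling_strategy=0.5 keeps the minority class as-is and randomly drops
    # majority rows until fraud : genuine is 200 : 400.
    rus = RandomUnderSampler(sampling_strategy=0.5, random_state=43)
    x_res, y_res = rus.fit_resample(x, y)

    print(np.bincount(y_res))   # [400 200]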

Oversampling

 
  • In oversampling, you replicate the minority class data points multiple times to make a balanced dataset.
  • For example, you can replicate these 200 fraud observations 30 times to get 6000 observations, which brings the ratio to roughly 60 : 40.
  • The problem with this technique is that the model tends to overfit such data, as the added observations are mere replications and give the algorithm no new information to generalize from.
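A minimal sketch of random oversampling, using imblearn's RandomOverSampler as one possible tool; the counts are again made up to match the example.

    import numpy as np
    from imblearn.over_sampling import RandomOverSampler

    # Made-up imbalanced data: 9800 genuine (0) vs 200 fraud (1) transactions.
    x = np.arange(10000).reshape(-1, 1)
    y = np.array([0] * 9800 + [1] * 200)

    # sampling_strategy=0.6 replicates minority rows at random until the
    # fraud : genuine ratio is 0.6, roughly the 60:40 mix described above.
    ros = RandomOverSampler(sampling_strategy=0.6, random_state=43)
    x_res, y_res = ros.fit_resample(x, y)

    print(np.bincount(y_res))   # [9800 5880]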
 

SMOTE(Synthetic Minority Oversampling Technique)

  • In this technique, samples are created synthetically in feature space using algorithms like k-nearest neighbors.
  • Just for understanding, you can think of it as tweaking the minority class observations' values a little here and there to create new observations that stay close to the existing minority class observations and far from the majority class observations.
  • The likely issue with this technique is that some synthetic observations may overlap both the minority and majority classes, making it harder for an algorithm to build a model that clearly separates the two.
  • The imblearn package provides a class SMOTE to perform this oversampling.
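A minimal sketch using imblearn's SMOTE class on a made-up two-feature dataset:

    import numpy as np
    from imblearn.over_sampling import SMOTE

    # Made-up two-feature data: 980 genuine (0) vs 20 fraud (1) transactions.
    rng = np.random.default_rng(43)
    x = np.vstack([rng.normal(0, 1, size=(980, 2)),
                   rng.normal(3, 1, size=(20, 2))])
    y = np.array([0] * 980 + [1] * 20)

    # SMOTE synthesises new minority points between existing minority neighbours.
    smote = SMOTE(random_state=43)
    x_res, y_res = smote.fit_resample(x, y)

    print(np.bincount(y_res))   # [980 980]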
 

Let's try out the 3 resampling techniques by working through an example.

Exercise 3
  • Apply all 3 techniques of resampling on Haberman_data to balance the classes equally.
    • Oversampling
    • Undersampling
    • SMOTE
 

Important Note: It's not necessary to balance the data to an exact 50:50 ratio. We only try to rebalance the data when it's highly imbalanced, like 90:10. There is no hard & fast rule on when to call data balanced or imbalanced; even 80:20 is sometimes fine. It's up to the user's discretion to decide when the data is too imbalanced for the problem at hand.
