Exploration & Preparation

Data Exploration

 

Exploratory data analysis is a critical step in any data science project. Data exploration is done after the problem statement has been finalized and the data has been collected from its sources.

 

Why data exploration is done:

  1. To assess the quality of the data: whether it contains anomalies, outliers, missing values, bad values, etc.
  2. To understand the data better and draw inferences from it
 

How data exploration is done:

  1. Descriptive Statistics - using pandas methods
  2. Graphical Analysis - using matplotlib and seaborn python packages
  3. Domain Knowledge
 

Data Preparation

 

Most of the time, data can't be used for modeling as is. Data needs to be cleaned and prepared before modeling.

 

Why data preparation is done:

  1. Data might have missing values, outliers, or bad values, which may lead to inaccurate results.
  2. Sometimes, we need to derive a few new variables from existing variables for better results.
 

How data preparation is done:

  1. Missing value/outliers treatment - using pandas methods
  2. Data Transformation - using pandas methods
  3. Deriving new variables - using pandas methods
 

Data Exploration and Data Preparation go hand in hand: Data Exploration reveals where the data needs to be prepared, and Data Preparation techniques are then used to prepare it.

 

Types of variables in data

Broadly, variables are either numerical (continuous or discrete) or categorical (nominal or ordinal).


Steps for Data Exploration & Data Preparation

 
  1. Missing values
  2. Outliers & bad values
  3. Convert categorical/non-numeric columns to numerical columns
  4. Derived Variable
  5. Data Transformations
  6. Graphical Analysis
 

Let's work through an example to understand data exploration and data preparation. We will work on the real-life Adult dataset from the UC Irvine Machine Learning Repository. You can read more about the dataset and its columns here.

 

Let's get into the mode of coding. Let's import the data and do basic analysis.

Exercise 1

For this chapter, we will use Forest Fires Dataset from UC Irvine Machine Learning Repository for our exercises.

  • Load the dataset into the dataframe forest_fires, and do a basic analysis by finding its shape, columns & their datatypes, etc.
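As a minimal sketch, the basic analysis might look like this. A tiny hand-made frame stands in for forest_fires here; for the real exercise you would load the UCI file with pd.read_csv (the file name is an assumption):

```python
import pandas as pd

# Stand-in sample with a few forest-fires-style columns; in practice you would
# load the real file, e.g. forest_fires = pd.read_csv("forestfires.csv")
forest_fires = pd.DataFrame({
    "month": ["mar", "oct", "aug"],
    "day":   ["fri", "tue", "sat"],
    "temp":  [8.2, 17.8, 22.1],
    "wind":  [6.7, 0.9, 4.0],
    "area":  [0.0, 0.0, 12.4],
})

print(forest_fires.shape)             # (number of rows, number of columns)
print(forest_fires.columns.tolist())  # column names
print(forest_fires.dtypes)            # datatype of each column
```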

1. Missing values

 

We will first do data exploration to find out if there are any missing values in the data and then do data preparation to deal with missing values.

 

Data Exploration : Once you have the data in a pandas dataframe, you can call the isnull().sum() method on it. This returns the number of null values in each column of the dataframe.

 

Data Preparation : If the data contains null values, there are three ways to handle them:

  • Drop rows which have null values
  • Imputation - Replace nulls with some values
  • Go back to the data collection team if you find that the null values follow some pattern; they may be able to give you extra information about these null values.
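A sketch of the exploration step and the first two preparation options on a toy dataframe (column names here are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [8.2, np.nan, 22.1],
                   "wind": [6.7, 0.9, np.nan]})

# Exploration: number of nulls in each column
print(df.isnull().sum())

# Preparation option 1: drop rows that contain any null
dropped = df.dropna()

# Preparation option 2: imputation -- replace nulls with the column mean
imputed = df.fillna(df.mean())
```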
 
 
Exercise 2
  • Identify if there are any missing values in dataframe forest_fires, and get rid of them using data preparation techniques.
 

2. Outliers & bad values

 

We will first do data exploration to find out if there are any outliers or bad values in the data and then do data preparation to deal with such values.

 

Data Exploration : There are two ways to find outliers:

  • Graphical Method - Box Plot
  • Derive Summary Statistics
    • For numerical columns, use describe() method
    • For categorical columns, use value_counts() method
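For instance, on a small numeric series the summary statistics already flag the extreme value. The 1.5×IQR rule used below is a common convention, not something prescribed by the text:

```python
import pandas as pd

s = pd.Series([8.2, 9.1, 10.3, 9.7, 11.0, 95.0])

# Summary statistics: the max is far above the 75th percentile,
# which hints at an outlier
print(s.describe())

# A common convention: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)
```

A box plot (s.plot(kind="box")) visualizes the same quartiles graphically.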
 

Data Preparation : If the data contains outliers or bad values, there are several ways to deal with them:

  • Domain Knowledge - Domain knowledge tells us whether extreme values are genuine outliers or merely behave differently from the rest of the data. On that basis, we can take one of three decisions :
    • Leave them as is in the data
    • Remove them from the data
    • Create an entirely new model for such extreme values
  • Go back to the data collection team to get more clarification about such values
  • Replace outliers with mean values of the respective columns
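A sketch of the last option, replacing flagged values with the mean of the remaining values (the 1.5×IQR flagging rule is an assumption, not part of the text):

```python
import pandas as pd

s = pd.Series([8.2, 9.1, 10.3, 9.7, 11.0, 95.0])

# Flag outliers with the 1.5 * IQR convention
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Replace flagged values with the mean of the non-outlier values
s_clean = s.mask(mask, s[~mask].mean())
print(s_clean.tolist())
```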
 
 
Exercise 3
  • Find out if there are any outliers or bad values in dataframe forest_fires, and deal with them using data preparation techniques.
 

3. Converting categorical/non-numeric columns to numerical columns

 

This is a pure data preparation step. As machine learning libraries only accept numerical columns, we will convert all non-numerical columns to numerical columns in this step.

 

Data Preparation : Use the pandas method get_dummies() to convert non-numeric columns to numeric columns, as shown in the example below.
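A minimal sketch with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"day": ["fri", "tue", "fri"],
                   "temp": [8.2, 17.8, 22.1]})

# One-hot encode the non-numeric column; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["day"])
print(encoded.columns.tolist())  # temp plus one indicator column per day value
```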

 
Exercise 4
  • Convert all non-numerical columns to numerical columns in forest_fires dataframe.

4. Derived Variable

 

This is a pure data preparation step. Sometimes, we need to derive new variables as per the problem statement. For example, we may need to derive a person's age from their date of birth.

 

Data Preparation : Deriving new variables is a two-step process :

  1. First, use your domain knowledge to identify if new variables need to be derived
  2. Then, use Python & pandas methods to create new variables
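For the date-of-birth example mentioned above, a sketch might look like this. The dates and the integer division by 365 are illustrative; the approximation ignores leap days and can be off by a day or two near a birthday:

```python
import pandas as pd

df = pd.DataFrame({"dob": pd.to_datetime(["1990-06-15", "2000-03-20"])})

# Fixed reference date so the result is reproducible
today = pd.Timestamp("2024-01-01")

# Rough age in whole years (integer division by 365 ignores leap days,
# so it can be off by a day or two near a birthday)
df["age"] = ((today - df["dob"]).dt.days // 365).astype(int)
print(df)
```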
 

There may be cases when even a continuous variable like age needs to be binned into a smaller number of categories, e.g., age: 0-25, 25-50, 50+
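Such binning can be sketched with pandas' cut; the bin edges and labels below are just the ones from the example, and the upper edge of 120 is an assumption:

```python
import pandas as pd

age = pd.Series([12, 30, 47, 65])

# Bin the continuous variable into the three categories from the text;
# each bin includes its right edge by default
age_group = pd.cut(age, bins=[0, 25, 50, 120], labels=["0-25", "25-50", "50+"])
print(age_group.tolist())
```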

 
Exercise 5
  • Derive a variable season from column month in forest_fires dataframe.
    • if the month is March, April or May; then the season is spring
    • if the month is June, July or August; then the season is summer
    • if the month is September, October or November; then the season is autumn
    • if the month is December, January or February; then the season is winter
 

5. Data Transformations

 

This is also a pure data preparation step. There are times when we need to transform our data (e.g., take the log or square root of a variable).

 

Data Preparation : Transforming old variables into new variables is a two-step process :

  1. First, use your domain knowledge to identify if variables need to be transformed
  2. Then, use Python & pandas methods to transform variables
 

Let's take an example to understand this. Say the data values span several powers of 10. In this case, 10 and 1000 should not be treated as simply 990 apart: on a log scale, log10(10) = 1 and log10(1000) = 3, so they are only two units apart. It's better to take the log in this case. Taking the log also sometimes helps with outlier treatment.
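Numerically, the log compresses those ratios into equal steps. np.log1p is included below because it handles zero values, which a plain log cannot:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 100.0, 1000.0])
logged = np.log10(s)
print(logged.tolist())  # equal one-unit steps on the log scale

# For variables containing zeros (like the exercise's area column),
# log(1 + x) avoids taking log(0):
area = pd.Series([0.0, 0.0, 12.4])
area_log = np.log1p(area)
print(area_log.tolist())
```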

 
Exercise 6
  • If you look at the target variable area, you will find that most of its values are concentrated near 0. It is better to take a logarithmic transformation of it to make it more evenly spread.
 

6. Graphical Analysis

 

This is a pure data exploration step to analyze data graphically.

 

Data Exploration : We can draw graphs to find out:

  • Univariate Analysis - to find out how the data is distributed and whether there is any skewness or outliers
    • Continuous variable - draw a line chart or distplot
    • Categorical variable - first take value_counts() and then draw a bar chart of the counts, as shown in the example below
  • Bivariate Analysis - to find out the relationship between a variable and the target variable
    • Create a contingency Table - This is a non-graphical method
    • Draw a bivariate graph between two variables
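A sketch of these analyses on a tiny stand-in frame. Matplotlib's Agg backend is used so the script runs without a display, saving the figures to files instead of showing them:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: save figures instead of displaying
import matplotlib.pyplot as plt
import pandas as pd

forest_fires = pd.DataFrame({  # tiny illustrative sample
    "day":   ["fri", "tue", "fri", "sat"],
    "month": ["mar", "oct", "aug", "mar"],
    "temp":  [8.2, 17.8, 22.1, 14.3],
})

# Univariate, continuous: distribution of temp
forest_fires["temp"].plot(kind="hist")
plt.savefig("temp_hist.png")
plt.clf()

# Univariate, categorical: value_counts() then a bar chart
forest_fires["day"].value_counts().plot(kind="bar")
plt.savefig("day_bar.png")
plt.clf()

# Bivariate, non-graphical: contingency table of day vs. month
table = pd.crosstab(forest_fires["day"], forest_fires["month"])
print(table)
```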
 
Exercise 7
  • Find out the spread of the continuous variables temp and wind in the forest_fires dataframe by drawing histograms.
  • Find out on which days fires were more frequent by drawing a bar chart of the day column of the forest_fires dataframe.
  • Find out for which day-month combinations fires were more frequent by drawing a contingency table between the day & month columns of the forest_fires dataframe.
  • Find out if there is any relationship between temperature and area by drawing a scatter plot between the temp and area columns of the forest_fires dataframe.

Note - We will work on this dataset again and apply the concepts learned here to each column of the dataset.
