Exploration & Preparation

Data Exploration

 

Exploratory data analysis is a critical step in any data science project. Data exploration is done after the problem statement has been finalized and the data has been collected from its sources.

 

Why data exploration is done:

  1. To assess the quality of the data: whether it contains anomalies, outliers, missing values, bad values, etc.
  2. To understand the data better and draw inferences from it
 

How data exploration is done:

  1. Descriptive Statistics - using pandas methods
  2. Graphical Analysis - using matplotlib and seaborn python packages
  3. Domain Knowledge
 

Data Preparation

 

Most of the time, data can't be used for modeling as is. Data needs to be cleaned and prepared before modeling.

 

Why data preparation is done:

  1. Data might have missing values, outliers, or bad values, which may lead to inaccurate results.
  2. Sometimes, we need to derive a few new variables from existing variables for better results.
 

How data preparation is done:

  1. Missing value/outliers treatment - using pandas methods
  2. Data Transformation - using pandas methods
  3. Deriving new variables - using pandas methods
 

Data Exploration and Data Preparation go hand in hand: Data Exploration reveals where the data needs to be prepared, and Data Preparation techniques are then used to prepare it.

 

Types of variables in data

Broadly, variables are either numerical (continuous or discrete) or categorical (nominal or ordinal).


Steps for Data Exploration & Data Preparation

 
  1. Missing values
  2. Outliers & bad values
  3. Convert categorical/non-numeric columns to numerical columns
  4. Derived Variable
  5. Data Transformations
  6. Graphical Analysis
 

Let's work through an example to understand data exploration and data preparation. We will work on the real-life Adult dataset from the UC Irvine Machine Learning Repository. You can read more about the dataset and its columns here.

 

Let's get into the mode of coding. Let's import the data and do basic analysis.

Exercise 1

For this chapter, we will use Forest Fires Dataset from UC Irvine Machine Learning Repository for our exercises.

  • Load the dataset into the dataframe forest_fires, and do a basic analysis by finding its shape, columns & their datatypes, etc.
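As a minimal sketch, the basic analysis might look like this. A tiny hand-made frame stands in for forest_fires here; for the real exercise you would load the UCI file with pd.read_csv (the file name is an assumption):

```python
import pandas as pd

# Stand-in sample with a few forest-fires-style columns; in practice you would
# load the real file, e.g. forest_fires = pd.read_csv("forestfires.csv")
forest_fires = pd.DataFrame({
    "month": ["mar", "oct", "aug"],
    "day":   ["fri", "tue", "sat"],
    "temp":  [8.2, 17.8, 22.1],
    "wind":  [6.7, 0.9, 4.0],
    "area":  [0.0, 0.0, 12.4],
})

print(forest_fires.shape)             # (number of rows, number of columns)
print(forest_fires.columns.tolist())  # column names
print(forest_fires.dtypes)            # datatype of each column
```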

1. Missing values

 

We will first do data exploration to find out if there are any missing values in the data and then do data preparation to deal with missing values.

 

Data Exploration : Once you have the data in a pandas dataframe, you can call the isnull().sum() method on it. This returns the number of null values in each column of the dataframe.

 

Data Preparation : If the data contains null values, there are three ways to handle them:

  • Drop rows which have null values
  • Imputation - Replace nulls with some values
  • Go back to the data collection team if you find that the null values follow some pattern; they may be able to give you extra information about these null values.
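A sketch of the exploration step and the first two preparation options on a toy dataframe (column names here are only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [8.2, np.nan, 22.1],
                   "wind": [6.7, 0.9, np.nan]})

# Exploration: number of nulls in each column
print(df.isnull().sum())

# Preparation option 1: drop rows that contain any null
dropped = df.dropna()

# Preparation option 2: imputation -- replace nulls with the column mean
imputed = df.fillna(df.mean())
```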
 
 
Exercise 2
  • Identify if there are any missing values in dataframe forest_fires, and get rid of them using data preparation techniques.
 

2. Outliers & bad values

 

We will first do data exploration to find out if there are any outliers or bad values in the data and then do data preparation to deal with such values.

 

Data Exploration : There are two ways to find outliers:

  • Graphical Method - Box Plot
  • Derive Summary Statistics
    • For numerical columns, use describe() method
    • For categorical columns, use value_counts() method
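For instance, on a small numeric series the summary statistics already flag the extreme value. The 1.5×IQR rule used below is a common convention, not something prescribed by the text:

```python
import pandas as pd

s = pd.Series([8.2, 9.1, 10.3, 9.7, 11.0, 95.0])

# Summary statistics: the max is far above the 75th percentile,
# which hints at an outlier
print(s.describe())

# A common convention: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)
```

A box plot (s.plot(kind="box")) visualizes the same quartiles graphically.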
 

Data Preparation : If the data contains outliers or bad values, there are several ways to deal with them:

  • Domain Knowledge - Domain knowledge tells us whether extreme values are genuine outliers or merely behave differently from the rest of the data. On that basis, we can take one of three decisions :
    • Leave them as is in the data
    • Remove them from the data
    • Create an entirely new model for such extreme values
  • Go back to the data collection team to get more clarification about such values
  • Replace outliers with mean values of the respective columns
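A sketch of the last option, replacing flagged values with the mean of the remaining values (the 1.5×IQR flagging rule is an assumption, not part of the text):

```python
import pandas as pd

s = pd.Series([8.2, 9.1, 10.3, 9.7, 11.0, 95.0])

# Flag outliers with the 1.5 * IQR convention
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Replace flagged values with the mean of the non-outlier values
s_clean = s.mask(mask, s[~mask].mean())
print(s_clean.tolist())
```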
 
 
Exercise 3
  • Find out if there are any outliers or bad values in dataframe forest_fires, and deal with them using data preparation techniques.
 

3. Converting categorical/non-numeric columns to numerical columns

 

This is a pure data preparation step. As machine learning libraries only accept numerical columns, we will convert all non-numerical columns to numerical columns in this step.

 

Data Preparation : Use the pandas method get_dummies() to convert non-numeric columns to numeric columns, as shown in the example below.
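A minimal sketch with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"day": ["fri", "tue", "fri"],
                   "temp": [8.2, 17.8, 22.1]})

# One-hot encode the non-numeric column; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["day"])
print(encoded.columns.tolist())  # temp plus one indicator column per day value
```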

 
Exercise 4
  • Convert all non-numerical columns to numerical columns in forest_fires dataframe.

4. Derived Variable

 

This is a pure data preparation step. Sometimes, we need to derive new variables as per the problem statement. For example, we may need to derive a person's age from their date of birth.

 

Data Preparation : Deriving new variables is a two-step process :

  1. First, use your domain knowledge to identify if new variables need to be derived
  2. Then, use Python & pandas methods to create new variables
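For the date-of-birth example mentioned above, a sketch might look like this. The dates and the integer division by 365 are illustrative; the approximation ignores leap days and can be off by a day or two near a birthday:

```python
import pandas as pd

df = pd.DataFrame({"dob": pd.to_datetime(["1990-06-15", "2000-03-20"])})

# Fixed reference date so the result is reproducible
today = pd.Timestamp("2024-01-01")

# Rough age in whole years (integer division by 365 ignores leap days,
# so it can be off by a day or two near a birthday)
df["age"] = ((today - df["dob"]).dt.days // 365).astype(int)
print(df)
```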
 

There may be cases when even a continuous variable like age needs to be binned into a smaller number of categories, e.g., age: 0-25, 25-50, 50+
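Such binning can be sketched with pandas' cut; the bin edges and labels below are just the ones from the example, and the upper edge of 120 is an assumption:

```python
import pandas as pd

age = pd.Series([12, 30, 47, 65])

# Bin the continuous variable into the three categories from the text;
# each bin includes its right edge by default
age_group = pd.cut(age, bins=[0, 25, 50, 120], labels=["0-25", "25-50", "50+"])
print(age_group.tolist())
```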

 
Exercise 5
  • Derive a variable season from column month in forest_fires dataframe.
    • if the month is March, April or May; then the season is spring
    • if the month is June, July or August; then the season is summer
    • if the month is September, October or November; then the season is autumn
    • if the month is December, January or February; then the season is winter
 

5. Data Transformations

 

This is also a pure data preparation step. There are times when we need to transform our data (e.g., take the log or square root of a variable).

 

Data Preparation : Transforming old variables into new variables is a two-step process :

  1. First, use your domain knowledge to identify if variables need to be transformed
  2. Then, use Python & pandas methods to transform variables
 

Let's take an example to understand this. Say the data values span several powers of 10. In this case, 10 and 1000 should not be treated as simply 990 apart: on a log scale, log10(10) = 1 and log10(1000) = 3, so they are only two units apart. It's better to take the log in this case. Taking the log also sometimes helps with outlier treatment.
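Numerically, the log compresses those ratios into equal steps. np.log1p is included below because it handles zero values, which a plain log cannot:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 100.0, 1000.0])
logged = np.log10(s)
print(logged.tolist())  # equal one-unit steps on the log scale

# For variables containing zeros (like the exercise's area column),
# log(1 + x) avoids taking log(0):
area = pd.Series([0.0, 0.0, 12.4])
area_log = np.log1p(area)
print(area_log.tolist())
```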

 
Exercise 6
  • If you look at the target variable area, you will find that most of its values are concentrated near 0. It is better to take a logarithmic transformation of it to make it more evenly spread.
 

6. Graphical Analysis

 

This is a pure data exploration step to analyze data graphically.

 

Data Exploration : We can draw graphs to find out:

  • Univariate Analysis - to find out how the data is distributed and whether there is any skewness or outliers
    • Continuous variable - draw a line chart or distplot
    • Categorical variable - first take value_counts() and then draw a bar chart of the counts, as shown in the example below
  • Bivariate Analysis - to find out the relationship between a variable and the target variable
    • Create a contingency Table - This is a non-graphical method
    • Draw a bivariate graph between two variables
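A sketch of these analyses on a tiny stand-in frame. Matplotlib's Agg backend is used so the script runs without a display, saving the figures to files instead of showing them:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: save figures instead of displaying
import matplotlib.pyplot as plt
import pandas as pd

forest_fires = pd.DataFrame({  # tiny illustrative sample
    "day":   ["fri", "tue", "fri", "sat"],
    "month": ["mar", "oct", "aug", "mar"],
    "temp":  [8.2, 17.8, 22.1, 14.3],
})

# Univariate, continuous: distribution of temp
forest_fires["temp"].plot(kind="hist")
plt.savefig("temp_hist.png")
plt.clf()

# Univariate, categorical: value_counts() then a bar chart
forest_fires["day"].value_counts().plot(kind="bar")
plt.savefig("day_bar.png")
plt.clf()

# Bivariate, non-graphical: contingency table of day vs. month
table = pd.crosstab(forest_fires["day"], forest_fires["month"])
print(table)
```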
 
Exercise 7
  • Find out the spread of the continuous variables temp and wind in the forest_fires dataframe by drawing histograms.
  • Find out on which days fires were more frequent by drawing a bar chart of the day column of the forest_fires dataframe.
  • Find out for which day-month combinations fires were more frequent by drawing a contingency table between the day & month columns of the forest_fires dataframe.
  • Find out if there is any relationship between temperature and area by drawing a scatter plot between the temp and area columns of the forest_fires dataframe.

Note - We will work on this dataset again and apply the concepts learned here to each column of the dataset.
