WeirdGeek

11/11/2018 By WeirdGeek

7 steps to prepare your data for machine learning models with Python

As a data scientist or data analyst, you have to prepare your data for machine learning models by getting it into shape. When you come across terms like data cleaning (or data cleansing), pre-processing, or wrangling in the data science community, they all refer to the same thing: pre-modelling activities such as removing outliers and filling NaN or missing values with educated guesses. If you are pursuing a career as a data scientist, you should know that this is how you will spend most of your time when dealing with different datasets.

Step-by-step process to prepare your data for machine learning models:
Installing required packages

Before you go any further, make sure you have Python and that the expected version is available from your command line. You can check this by running:

python --version

You should get output like Python 3.6.5. If you do not have Python, please install the latest 3.x version from python.org or refer to "Installing Python".

Check whether you have installed pip or not

Additionally, you’ll need to make sure you have pip available. You can check this by running:

pip --version

If you installed Python from source or with an installer from python.org, you should already have pip. If you are using Linux and installed Python with your OS package manager, you may have to install pip separately; check here.

If pip isn’t already installed, then first try to bootstrap it from the standard library:

python -m ensurepip --default-pip

If that still doesn’t allow you to run pip:

  • Download get-pip.py

  • Run the command below; it will install or upgrade pip.

python get-pip.py

For more information about how to install packages, you can check here.
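The same checks can also be done from inside Python. The snippet below is a minimal sketch: the package list (pandas and numpy) is an assumption about what the later steps need, not a list given in this post, and check_environment is a hypothetical helper name.

```python
import importlib.util
import sys

# Assumed package list for the later steps (not specified in the post).
REQUIRED = ["pandas", "numpy"]


def check_environment(packages=REQUIRED):
    """Return the running Python version and which packages can be imported."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    available = {
        name: importlib.util.find_spec(name) is not None for name in packages
    }
    return version, available


version, available = check_environment()
print("Python", version)
for name, ok in available.items():
    if ok:
        print(name, "is installed")
    else:
        print(name, "is missing; try: python -m pip install", name)
```

Running pip as python -m pip ties the install to the interpreter you are actually using, which helps when several Python versions are installed side by side.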

Importing data files

To import data in CSV or Excel format, we can use the pandas read_csv() and read_excel() functions, as shown in our previous post. You can check the post here. For a complete list of functions, see the official documentation here.
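As a quick illustration, here is a minimal sketch of loading CSV data with pandas. The column names and values are made up for the example, and the CSV text is read from memory so the snippet runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """fixed_acidity,alcohol,quality
7.4,9.4,5
7.8,9.8,5
11.2,,6
"""

# read_csv accepts a file path or any file-like object. With a real file
# the call would look like pd.read_csv("winequality.csv") (hypothetical name).
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # data type inferred for each column
print(df.head())  # first few rows; the empty field shows up as NaN
```

pd.read_excel() is called the same way for Excel workbooks, but it needs an Excel engine such as openpyxl installed alongside pandas.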

Exploratory Data Analysis

Exploratory Data Analysis, or EDA, refers to the important process of performing initial investigations on data to discover patterns, spot irregularities, and check assumptions with the help of summary statistics and graphical representations.

When you are looking at a single variable (univariate analysis), you can calculate summary statistics such as the mean and standard deviation for each field in the raw dataset. When you are looking at two variables at a time (bivariate analysis), you can also measure the relationship between each variable in the dataset and the target variable of interest, for example with a correlation coefficient.
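Both kinds of summary can be sketched with pandas as follows. The small DataFrame is invented for illustration (its column names only loosely mimic a wine dataset), and "quality" is assumed to be the target variable.

```python
import pandas as pd

# Made-up sample data; "quality" plays the role of the target variable.
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.8, 13.0],
    "volatile_acidity": [0.70, 0.88, 0.60, 0.45, 0.30, 0.28],
    "quality": [5, 5, 6, 6, 7, 8],
})

# Univariate: count, mean, std, min, quartiles and max for every numeric column.
summary = df.describe()
print(summary)

# Bivariate: Pearson correlation of each feature with the target.
target_corr = df.corr()["quality"].drop("quality")
print(target_corr.sort_values(ascending=False))
```

The full df.corr() matrix is also the table that a correlation heatmap visualises.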

Below is an example of visual EDA: a heatmap of the Wine Quality dataset showing the correlations between its variables.

Heatmap plot

Working with missing values

In our previous post, we showed a few ways to deal with missing values. When your dataset has missing values, you have a few options: you can drop the rows or columns that contain missing (NaN) values, you can fill them with an educated guess, or you can compute a statistic such as the mean or median of the column and fill with that. See here to know how you can do that.
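The three options can be sketched with pandas like this. The data, the column names, and the choice of median, mean, and "Unknown" as fill values are all assumptions made for the example; which option suits a real column depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Delhi", "Pune", None, "Pune"],
    "income": [50.0, 60.0, np.nan, 80.0],
})

# How many values are missing in each column?
print(df.isna().sum())

# Option 1: drop rows (or columns) that contain any missing value.
dropped_rows = df.dropna()
dropped_cols = df.dropna(axis=1)

# Options 2 and 3: fill with an educated guess or a column statistic.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["income"] = filled["income"].fillna(filled["income"].mean())
filled["city"] = filled["city"].fillna("Unknown")

print(filled)
```

Dropping is the simplest choice but throws away data; here it keeps only two of the four rows, while filling keeps all of them.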

Working with outliers

While there is no single definition of an outlier, in simple words we can define outliers as data points whose values are far greater or far smaller than most of the rest of the data.

When you are building a machine learning model, outlier detection is an important step towards an accurate model and a good score. To spot outliers, we can use box plots.

Box plot

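From the above box plot, a common criterion is to treat data points that lie more than 2 IQRs (interquartile ranges) away from the median as outliers. For roughly symmetric data this is close to the usual box plot whisker rule, which flags points more than 1.5 IQRs below the first quartile or above the third quartile.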
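Here is a minimal sketch of the IQR-based check in pandas, using made-up numbers with one obviously extreme value. It applies the 2-IQRs-from-the-median rule and, for comparison, the 1.5 IQR whisker rule that box plots conventionally use.

```python
import pandas as pd

# Made-up measurements; 95 is the suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 12, 13, 95], name="value")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
median = values.median()

# Rule 1: more than 2 IQRs away from the median.
far_from_median = (values - median).abs() > 2 * iqr

# Rule 2 (for comparison): beyond 1.5 IQRs from the quartiles.
beyond_whiskers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

outliers = values[far_from_median]
cleaned = values[~far_from_median]

print("IQR:", iqr, "median:", median)
print("Flagged as outliers:", outliers.tolist())
print("Rows kept:", len(cleaned))
```

On this sample both rules flag the same single point. Whether a flagged point should be removed, capped, or kept is a judgment call that depends on what the data represents.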

Final Touch

In the final phase, you can get rid of all the features that are unimportant and make sure that all of your important features are included in your dataset.
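What counts as unimportant depends on your problem, and the post does not prescribe a method. As one hedged illustration, the sketch below drops an identifier column and any column that holds a single constant value, then checks that the features you consider essential are still present. The data, the column names, and the required list are all hypothetical.

```python
import pandas as pd

# Made-up data: "row_id" is just an identifier and "country" never varies.
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "country": ["IN", "IN", "IN", "IN"],
    "alcohol": [9.4, 9.8, 10.5, 11.2],
    "quality": [5, 5, 6, 6],
})

# Features we assume must survive the clean-up (hypothetical list).
required = ["alcohol", "quality"]

# A column with a single distinct value carries no information for a model.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Drop the identifier plus the constant columns.
to_drop = ["row_id"] + constant_cols
features = df.drop(columns=to_drop)

# Make sure nothing important was lost along the way.
missing_required = [c for c in required if c not in features.columns]

print("Dropped:", to_drop)
print("Remaining columns:", list(features.columns))
print("Missing required features:", missing_required)
```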

Export or save file

You can export your cleaned or formatted dataset to the file format you want, such as CSV or Excel. To know how, see our post.
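A minimal sketch of the export step with pandas is shown below. The data and the file name cleaned_data.csv are placeholders, and the file is written to a temporary folder so the example does not touch your working directory.

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned data.
df = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})

# Write to a temporary folder; in practice you would pick your own path.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "cleaned_data.csv")

# index=False keeps the row index out of the saved file.
df.to_csv(csv_path, index=False)

# Read the file back to confirm it was written as expected.
reloaded = pd.read_csv(csv_path)
print(reloaded)
```

For Excel output, df.to_excel("cleaned_data.xlsx", index=False) works the same way, provided an Excel engine such as openpyxl is installed.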

I hope you like my post. Do share it with your friends.

Related posts:

  1. Setting up Python Development Environment for Data Analysis
  2. Handling missing values using Python in Data Science
  3. 35 Pandas codes every data scientist aspirant must know
  4. Difference between Data Science, Machine learning, Artificial Intelligence and Deep Learning
