Building a Simple Linear Regression Model Using Python's scikit-learn Library

In this post, we will build a simple linear regression model using Python’s scikit-learn (sklearn) library.

Machine learning can be described as the art and science of giving machines, especially computers, the ability to learn to make decisions from data without being explicitly programmed. An everyday example is email, where a computer predicts whether a message is spam or not.

In regression tasks, the target variable (also called the dependent or response variable) varies continuously, such as the price of a house in the Boston Housing dataset.

The dataset for Linear Regression:

The dataset I am going to use for building a simple linear regression model with scikit-learn is the Boston Housing dataset, which you can download from here. For now, let’s try to predict the price from a single feature of the dataset: RM, the average number of rooms.

Let’s see how to build a simple linear regression model using scikit-learn:

First, import all the libraries we need to build our linear regression model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model

In the next step, we use the pandas read_csv() function to load our CSV file and save it as data, as shown below:

data = pd.read_csv("C:\\Users\\Pankaj\\Desktop\\Dataset\\Boston_housing.csv", sep = ",")
data.head()

Below are the top 5 rows, which .head() returns by default on a DataFrame.

[Figure: first five rows of the Boston Housing dataset]
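
If you want to try read_csv() and .head() without the file on disk, here is a minimal sketch that reads a tiny in-memory CSV instead. The three rows and their values are made up purely for illustration and are not taken from the real dataset; only the column names match.

```python
import io

import pandas as pd

# Made-up three-row CSV that reuses a few of the dataset's column names.
# The numbers are placeholders, not real Boston Housing values.
csv_text = """crim,rm,medv
0.1,6.0,20.0
0.2,6.5,25.0
0.3,7.0,30.0
"""

# read_csv accepts a file path or any file-like object; sep="," is the default.
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 when n is omitted).
print(sample.head(2))
print(sample.shape)
```

On Windows, a raw string such as r"C:\folder\file.csv" is another way to write a path without doubling every backslash.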

To get basic details about our Boston Housing dataset, such as null values and data types, we can use .info() as shown below:

data.info()
Output -
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim 506 non-null float64
zn 506 non-null float64
indus 506 non-null float64
chas 506 non-null int64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null int64
ptratio 506 non-null float64
b 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB

To check the column names of the dataset, we can use the .columns attribute as shown below:

data.columns
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv'], dtype='object')

The characteristics of the Boston Housing dataset are described below:

[Figure: Boston House Price Dataset description]

You can also find basic statistics about the dataset, such as the mean, median, and standard deviation, using the .describe() method as shown below:

data.describe()

scikit-learn expects the “features” and “target” variables in X and y, respectively. Here medv is our target variable, so we can extract the feature and target arrays from our dataset as shown below. For X we drop the medv column, and for y we keep only the medv column:

X = data.drop("medv", axis=1).values
y = data["medv"].values

Here we are trying to predict the price from a single feature of the dataset, RM (the average number of rooms), which is our feature variable in this case. It is the column at index 5 of X, so we can extract it as shown below:

X_rooms = X[:, 5]
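
Hard-coding the index 5 works, but it breaks silently if the column order ever changes. As a rough alternative, the sketch below looks the column up by name. The small DataFrame is a made-up stand-in for data with toy values, and the variable names rm_index, X_rooms_by_position and X_rooms_by_name are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real DataFrame (same column names, toy values).
data = pd.DataFrame({
    "crim": [0.1, 0.2, 0.3, 0.4],
    "rm": [6.0, 6.5, 7.0, 5.5],
    "medv": [20.0, 25.0, 30.0, 15.0],
})

features = data.drop("medv", axis=1)
X = features.values

# Option 1: look up the column position by name instead of hard-coding it.
rm_index = features.columns.get_loc("rm")
X_rooms_by_position = X[:, rm_index]

# Option 2: select with a list of names, which keeps the result two-dimensional.
X_rooms_by_name = data[["rm"]].values

print(rm_index, X_rooms_by_position.shape, X_rooms_by_name.shape)
```

The second option already has one column per feature, so it would not need a separate reshape step before fitting.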

To check the data type of both our feature variable and target variable, we can use the type() function as shown below:

type(X_rooms), type(y)
Output -
(numpy.ndarray, numpy.ndarray)

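scikit-learn expects the features as a two-dimensional array, so we reshape X_rooms and y into column vectors using .reshape(). Note that .reshape() returns a new array, so the result has to be assigned back:

X_rooms = X_rooms.reshape(-1,1)
y = y.reshape(-1,1)

To find the shape of both the feature variable and the target variable, we can use the .shape attribute as shown below:

print(X_rooms.shape, y.shape)
Output -
(506, 1) (506, 1)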
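
One detail that is easy to miss with .reshape(): it returns a new array and leaves the original untouched. Here is a small sketch with toy values (the names rooms and rooms_2d are just for illustration):

```python
import numpy as np

rooms = np.array([6.0, 6.5, 7.0])   # toy values, shape (3,)

rooms.reshape(-1, 1)                # result is discarded; rooms is unchanged
print(rooms.shape)                  # (3,)

rooms_2d = rooms.reshape(-1, 1)     # keep the returned array
print(rooms_2d.shape)               # (3, 1)
```

The -1 tells NumPy to work out that dimension from the array's length, so the same call works for any number of rows.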

We can also plot a heatmap of the Boston Housing dataset using Seaborn’s heatmap function, as shown below, where data.corr() computes the pairwise correlation between columns:

sns.heatmap(data.corr(), square=True, cmap='RdYlGn')

Dark green cells indicate a strong positive correlation, while red cells indicate a negative correlation. In our case, the cell for rm and medv is green, which means the two are positively correlated.

[Figure: heatmap of the Boston Housing dataset]
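
Colours are quick to read but hard to compare precisely, so it can help to look at the numbers behind the heatmap as well. The sketch below does this on synthetic data: the rm and medv columns are generated so that one rises with the other, and the column called falling is a hypothetical feature invented for the example. None of these values come from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=200)

# Synthetic columns: "medv" rises with "rm", "falling" drops as "rm" grows.
toy = pd.DataFrame({
    "rm": rooms,
    "medv": 5 * rooms + rng.normal(0, 3, size=200),
    "falling": -5 * rooms + rng.normal(0, 3, size=200),
})

# corr() returns a square DataFrame of pairwise Pearson correlations.
corr = toy.corr()

# Correlation of every column with the target, strongest positive first.
print(corr["medv"].sort_values(ascending=False))
```

On the real data, the same last line would rank all 13 features by how strongly they move with medv.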

First, let’s create a scatter plot of X_rooms against y, as shown below:

plt.scatter(X_rooms, y, color='green', s=3)
plt.ylabel("Value of house/1000($)")
plt.xlabel("Number of rooms")
plt.show()

[Figure: scatter plot of number of rooms against house value]

From the above plot, we can see that houses with more rooms tend to have higher prices.

Since the target variable here is quantitative, this is a regression problem. We imported the linear_model module from sklearn at the start of the post. Now we first instantiate LinearRegression(), then use .fit() to fit the model, and finally predict the price with .predict(), as shown below:

regression = linear_model.LinearRegression()
regression.fit(X_rooms, y)
plt.scatter(X_rooms, y, color='green', s=3)
# Check the regressor's predictions over the full range of the data
Data_range = np.linspace(X_rooms.min(), X_rooms.max()).reshape(-1,1)
y_pred = regression.predict(Data_range)
plt.plot(Data_range, y_pred, color="black", linewidth=3)
plt.show()

[Figure: scatter plot with the fitted regression line]
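
Besides drawing the line, you can read the fitted slope and intercept straight from the model. The following is a minimal sketch on synthetic data, not on the Boston file: X_toy and y_toy are generated from a known line (slope 5, intercept -10) plus noise, so we can check that the fit recovers roughly those values. Here y_toy is kept one-dimensional, which makes coef_ a 1-D array and intercept_ a single number.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(42)

# Synthetic data from a known line plus noise (not the Boston dataset).
X_toy = rng.uniform(4, 9, size=(100, 1))
y_toy = 5.0 * X_toy[:, 0] - 10.0 + rng.normal(0, 1.0, size=100)

regression = linear_model.LinearRegression()
regression.fit(X_toy, y_toy)

print("slope:", regression.coef_[0])
print("intercept:", regression.intercept_)

# score() returns the coefficient of determination (R^2) for a regressor.
print("R^2:", regression.score(X_toy, y_toy))

# predict() expects a 2-D array, even for a single sample.
print("prediction for 6 rooms:", regression.predict(np.array([[6.0]]))[0])
```

The slope is the change in the predicted target for each additional unit of the feature, which is the number the black line in the plot represents.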

This is how we can build a simple linear regression model in Python with a single feature variable. In upcoming posts, we will build a linear regression model using all the feature variables, use train_test_split() to split our dataset into training and test sets, and measure how well the model performs.
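
As a rough preview of that splitting step, here is a minimal sketch on synthetic data. The arrays X_toy and y_toy, the 25% test size and the random_state value are all assumptions chosen for the example, not settings from this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data from a known line plus noise (not the Boston dataset).
X_toy = rng.uniform(4, 9, size=(200, 1))
y_toy = 5.0 * X_toy[:, 0] - 10.0 + rng.normal(0, 1.0, size=200)

# Hold out 25% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Fit on the training rows only, then score on rows the model has not seen.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```

Scoring on held-out rows gives a more honest picture than scoring on the same data the model was fitted to.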

We have also shared how to perform linear regression using least squares in one of our posts.

Hope you like our post. Please share and subscribe to our newsletter. 🙂
