
03/12/2018 By WeirdGeek

Applying Linear Regression to Boston Housing Dataset

In this post, we will apply linear regression to the Boston Housing Dataset using all available features. In our previous post, we applied linear regression to predict the price from a single feature of the dataset, RM: the average number of rooms.

We are going to use the Boston Housing dataset, which contains information about different houses in Boston. There are 506 samples and 13 feature variables in this dataset. Our aim is to predict the price of a house from the given features.

Let’s see how to apply linear regression to the Boston Housing Dataset in action.

First, import all the libraries we will need to build our linear regression model:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Now we will use pandas’ .read_csv() function to load our CSV file and save it as data, as shown below:

filepath = "C:\\Users\\Pankaj\\Desktop\\Dataset\\Boston_housing.csv"
data = pd.read_csv(filepath)

To check the column names of the dataset, we can use the .columns attribute as shown below:

data.columns
Output-
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat', 'medv'], dtype='object')

The Boston Housing Dataset’s characteristics are as follows:

  • crim: per capita crime rate by town
  • zn: proportion of residential land zoned for lots over 25,000 sq. ft.
  • indus: proportion of non-retail business acres per town
  • chas: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
  • nox: nitric oxides concentration (parts per 10 million)
  • rm: average number of rooms per dwelling
  • age: proportion of owner-occupied units built prior to 1940
  • dis: weighted distances to five Boston employment centres
  • rad: index of accessibility to radial highways
  • tax: full-value property-tax rate per $10,000
  • ptratio: pupil-teacher ratio by town
  • b: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • lstat: percentage of lower-status population
  • medv: median value of owner-occupied homes in $1000s (our target)

To get basic details about our Boston Housing dataset, such as missing values and data types, we can use .info() as shown below:

data.info()
Output-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim 506 non-null float64
zn 506 non-null float64
indus 506 non-null float64
chas 506 non-null int64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null int64
ptratio 506 non-null float64
b 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB

Below are the top 5 rows, returned by default when calling .head() on a DataFrame.

data.head()

[Output: the first five rows of the dataset]

Generally, missing values can appear in many forms, such as 0, “?”, or the literal string “missing”. In our case, as we can see above, there are a lot of 0’s, so we can replace them with NaN to calculate how much data we are missing.
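As a quick sanity check (a small addition, not in the original walkthrough), you can first count the zeros per column to see which columns are dominated by them:

# Count zeros per column; columns dominated by 0 values are the
# candidates for treating 0 as a missing-value marker.
print((data == 0).sum())

With that confirmed, we replace the zeros in zn and chas with NaN: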

data.zn.replace(0, np.nan, inplace=True)
data.chas.replace(0, np.nan, inplace=True)

After replacing them, let’s use the .info() method again to see the details about missing values in our dataset:

data.info()
Output-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim 506 non-null float64
zn 134 non-null float64
indus 506 non-null float64
chas 35 non-null float64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null int64
ptratio 506 non-null float64
b 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.4 KB

Let’s calculate the percentage of missing values in our dataset. Generally, if around 20-25% of a column’s values are missing, we can impute them in different ways, such as with the mean, the median, or an educated guess. If it’s more than that, it’s better to remove the feature, as it could otherwise distort our results. As we can see below, both “zn” and “chas” are missing more than 70% of their data, so we will remove both of these features.

data.isnull().sum()/len(data)*100
Output-
crim 0.000000
zn 73.517787
indus 0.000000
chas 93.083004
nox 0.000000
rm 0.000000
age 0.000000
dis 0.000000
rad 0.000000
tax 0.000000
ptratio 0.000000
b 0.000000
lstat 0.000000
medv 0.000000
dtype: float64
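For a column with a more modest share of missing values, imputation would be the usual choice instead of dropping. Here is a minimal sketch of median imputation; it is hypothetical in this post, since zn and chas are missing far too much data to be worth saving:

# Hypothetical sketch: median imputation for a column with few missing values.
# We do not do this here, because zn and chas are missing over 70% of their data.
median_zn = data["zn"].median()              # median of the observed (non-NaN) values
data_imputed = data.copy()
data_imputed["zn"] = data_imputed["zn"].fillna(median_zn)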

We can drop both columns as shown below:

data = data.drop(columns=["zn", "chas"])

Now let’s check the percentage of missing data again to make sure everything is fine, so that we can go ahead without any worries.

data.isnull().sum()/len(data)*100
Output-
crim 0.0
indus 0.0
nox 0.0
rm 0.0
age 0.0
dis 0.0
rad 0.0
tax 0.0
ptratio 0.0
b 0.0
lstat 0.0
medv 0.0
dtype: float64

Now, to get some basic statistics about our data, such as the mean, median, and count, we can use the .describe() method as shown below:

data.describe()

Now let’s plot the histogram of all the available features to see the distribution.

data.hist(bins=50, figsize=(15, 15))
[Figure: histograms of all features, showing the distribution of each]

Now let’s check the correlation between all the feature variables and the target variable by plotting a heatmap, as shown below:

# set the size of the figure
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.heatmap(data.corr().round(2), square=True, cmap='RdYlGn', annot=True)

[Figure: annotated correlation heatmap of all features and the target]

Observations from the above heatmap:

  • MEDV is strongly correlated with LSTAT and RM.
  • RAD and TAX are strongly correlated with each other, so we don’t include both of them in our feature set, to avoid multicollinearity. The same goes for DIS and AGE, which have a correlation of -0.75. So we will exclude these four features from our feature list; see the sketch after this list for one way to quantify the problem.
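If you want to quantify multicollinearity rather than eyeball the heatmap, variance inflation factors (VIFs) are one common diagnostic. Below is a minimal sketch using statsmodels, which is an addition here rather than part of the original workflow; values above roughly 5-10 are usually taken as a warning sign:

# A sketch of a multicollinearity check (assumes statsmodels is installed).
# VIF measures how well each feature is predicted by the other features;
# high values flag redundant, collinear features such as rad/tax and dis/age.
from statsmodels.stats.outliers_influence import variance_inflation_factor

feature_df = data.drop(columns=["medv"])
for i, col in enumerate(feature_df.columns):
    print(col, variance_inflation_factor(feature_df.values, i))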

Scikit-learn expects the features and the target in separate arrays, conventionally X and y. Here medv is our target variable, so we can extract the feature and target arrays from our dataset as shown below: for X we drop the medv column along with the four excluded features, and for y we keep only the medv column.

X = data.drop(columns=["medv", "rad", "tax", "dis", "age"]).values
y = data["medv"].values

To get a better idea of the effect of each feature on the value of medv, let’s visualize each independent variable against the target.

plt.figure(figsize=(20, 15))
features = ['crim', 'indus', 'nox', 'rm', 'ptratio', 'b', 'lstat']
target = data['medv']
for i, col in enumerate(features):
    # a 3x3 grid gives enough room for the 7 feature plots
    plt.subplot(3, 3, i + 1)
    plt.scatter(data[col], target, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')
plt.show()

 

[Figure: scatter plots of each feature against MEDV]

From the above visualization, we can see that medv and rm are linearly related: medv increases with rm, the average number of rooms. Conversely, as lstat increases, medv, i.e. the price, decreases.
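As a quick numeric confirmation of both observations (a small addition to the original walkthrough):

# Pearson correlations back up what the scatter plots show:
print(data['rm'].corr(data['medv']))     # strong positive correlation
print(data['lstat'].corr(data['medv']))  # strong negative correlation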

Let’s see the shape of our X (features) and y (target variable):

print(X.shape)
print(y.shape)
(506, 7)
(506,)

We can see that there are 7 features with 506 observations each, and 1 target variable with the same 506 observations. All is well, so we can now fit our linear regression model to the Boston Housing Dataset.

Applying scikit-learn’s linear regression to the Boston Housing dataset’s predictor (independent) variables to predict the value of the dependent variable medv:

Now let’s apply linear regression to the Boston Housing Dataset. First, we will split the data into training and testing sets: we train the model with 70% of the data and test with the remaining 30%.

We do this because computing a performance metric on the same data used to fit the model does not tell us how well the model generalizes to unseen data. For this reason, it is common to split the data into two sets: a training set on which we fit the model, and a held-out test set on which we make predictions and compare them against the known labels.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), "train set + ", len(X_test), "test set")
Output-
354 train set +  152 test set

Finally, to check that we have split our data correctly, we print the shapes of the training and test sets as shown below:

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(354, 7)
(354,)
(152, 7)
(152,)

Since the target variable here is quantitative (a continuously varying variable, medv), this is a regression problem. We imported LinearRegression from sklearn.linear_model at the start of the post. Now we first instantiate LinearRegression(), then fit a linear regression with .fit(), and then predict the prices with .predict(), as shown below:

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
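To see what the model actually learned, you can inspect the fitted coefficients and intercept (a small addition, not part of the original post):

# One coefficient per feature, in the same order as the columns of X,
# plus the intercept term.
feature_names = ['crim', 'indus', 'nox', 'rm', 'ptratio', 'b', 'lstat']
for name, coef in zip(feature_names, lr.coef_):
    print('{}: {:.3f}'.format(name, coef))
print('intercept: {:.3f}'.format(lr.intercept_))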

Now we will compute metrics to measure the model’s performance. Here we compute R squared, which quantifies the proportion of variance in the target variable that is explained by the features, along with the root mean squared error (RMSE).

r2 = lr.score(X_test, y_test)
rmse = (np.sqrt(mean_squared_error(y_test, y_pred)))
print("The model performance is")
print("--------------------------------------")
print('R2 score is {}'.format(r2))
print('RMSE is {}'.format(rmse))
print("\n")
Output- 
The model performance is
--------------------------------------
R2 score is 0.6387311528873696
RMSE is 5.188377408265187
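To make the R squared definition concrete, here is a small sketch (an addition to the original) that computes it by hand and should match lr.score() above:

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)           # unexplained variance
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # total variance around the mean
print(1 - ss_res / ss_tot)                        # same value as lr.score(X_test, y_test)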

Let’s also plot a scatter plot of the test labels against the predicted labels, which in an ideal case would follow a straight line.

plt.scatter(y_test, y_pred)
plt.xlabel('Actual MEDV')
plt.ylabel('Predicted MEDV')
plt.show()

[Figure: scatter plot of actual vs. predicted prices on the test set]

This is a decent result, but you would rarely use linear regression on the Boston Housing Dataset this way, i.e. out of the box. Usually we apply regularization, which puts further constraints on the model coefficients. In coming posts, we will look at a few ways to increase the model’s performance; a small preview follows below.
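As a preview of regularization (an addition here, not the author’s tuned solution), here is a minimal sketch of ridge regression, which adds an L2 penalty on the coefficient sizes:

# Ridge regression: the alpha parameter controls how strongly large
# coefficients are penalized; alpha=1.0 is just a starting point to tune.
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print('Ridge R2 on test set: {}'.format(ridge.score(X_test, y_test)))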

Hope you liked the post.

Related posts:

  1. Dimension Reduction with Principal Component Analysis (PCA)
  2. Performing Linear Regression using Least Squares
  3. Building simple Linear Regression model using Python’s Sci-kit library
  4. Anscombe Quartet and use of Exploratory Data Analysis

Filed Under: Machine Learning Tagged With: Linear Regression, Machine Learning
