05/12/2018 By WeirdGeek

Dimension Reduction with Principal Component Analysis (PCA)

In this post, we'll see how to perform dimension reduction with Principal Component Analysis (PCA) using the scikit-learn library, and we'll also cover the basic ideas behind dimension reduction and PCA.

Data is everywhere. Whatever you do in your day-to-day life generates a tremendous amount of data that businesses can use to improve their products and to offer you better, more relevant services. In fact, it is often said that 90% of the data in the world has been generated in the last four years. A few examples of the kind of data you generate:

  • Google collects data about your demographics and interests to show you relevant ads
  • Social networking websites like Facebook and Twitter collect data on what you like or love, share/retweet, post/tweet, and even your location
  • Smartwatches collect data such as the distance you cover, your heart rate, the number of calories you burn, etc.
  • Online marketplaces like Amazon and Flipkart collect data on what you buy, view, and click when you visit their websites, in order to recommend a product or a list of products

As data generation and collection keep increasing, getting insight from the data becomes more and more challenging. One of the easiest ways of getting insight from data is visualization in the form of charts/plots. For example, let's say you have a dataset with 1000 rows (observations) and 50 columns (features). For this dataset, more than 1000 pairwise plots are possible when analysing the relationships between features. (In general, the number of pairwise plots for a dataset with p features is p(p-1)/2, so here 50 × 49 / 2 = 1225.) You can imagine how time-consuming the exploratory analysis would be.
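
As a quick sanity check on that count, here is a tiny sketch that is not part of the original post (math.comb requires Python 3.8+):

import math

p = 50  # number of features in the example dataset
num_pairwise_plots = math.comb(p, 2)  # p * (p - 1) / 2
print(num_pairwise_plots)  # 1225 scatter plots to inspect by hand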

And it is not only the amount of data that is growing; the number of features in the data is increasing as well. Datasets with a large number of features are called high-dimensional datasets.

Having a lot of features in your dataset hurts in two ways: it increases the time taken to train a model, and it can hurt the accuracy of machine learning models. There are various ways to reduce the number of feature variables in our data. Here is a list:

  • Removing features with a large number of missing values
  • Dropping low-variance features
  • Keeping only the features that are highly correlated with the target (dependent) variable
  • Principal Component Analysis (PCA)

Apart from these, there are other ways as well, but in this post we will focus on how to perform dimension reduction with Principal Component Analysis (PCA).
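
As a brief aside, here is a minimal, hedged sketch of the second approach in the list above (dropping low-variance features) using scikit-learn's VarianceThreshold; the toy array and threshold are made up for illustration:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.0],
              [1.0, 2.0, 3.0],
              [0.0, 2.0, 5.0]])  # the middle column is constant (zero variance)

selector = VarianceThreshold(threshold=0.0)  # drop features whose variance is not above 0
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)  # (3, 2) -- the constant column has been dropped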

Dimension Reduction Technique

A dimension reduction technique can be defined as a technique that finds patterns in the data and uses those patterns to re-express the data in a compressed form. During compression, only the important and informative features are extracted and kept, while the less informative features, also called noise features, are removed.

Let's take an example: below is a chart showing temperature in Celsius and in Fahrenheit.

[Chart: the same temperature readings plotted in Celsius and in Fahrenheit]

Here, both variables convey the same information (Fahrenheit is just a linear function of Celsius), so it makes sense to use only one of them. Put simply, we can convert the data from 2D to 1D.

So, in a nutshell, with dimensionality reduction we can represent the same data using fewer features, i.e. reduce the p dimensions of the data to a subset of k dimensions (k << p).
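
To make the temperature example concrete, here is a minimal sketch (with made-up temperature values) using scikit-learn's PCA, which is introduced below; it shows that a single component carries essentially all of the variance in such data:

import numpy as np
from sklearn.decomposition import PCA

celsius = np.array([-5.0, 0.0, 10.0, 20.0, 25.0, 35.0])
fahrenheit = celsius * 9 / 5 + 32                 # completely determined by celsius
temps = np.column_stack([celsius, fahrenheit])    # shape (6, 2)

pca = PCA(n_components=2)
pca.fit(temps)
print(pca.explained_variance_ratio_)              # roughly [1.0, 0.0]: one component is enough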

Principal Component Analysis (PCA)

The most fundamental dimension reduction technique is Principal Component Analysis. It is used to transform a high-dimensional dataset into one with fewer features (i.e. to transform the dataset into a lower dimension), where the remaining features explain the maximum variance within the dataset. It is called Principal Component Analysis because it re-expresses the data in terms of its 'principal components'.

Principal Component Analysis aka PCA performs dimension reduction in two steps:

In the first step, it performs decorrelation. This does not reduce the dimension; instead, it rotates the data samples so that they are aligned with the coordinate axes and the resulting features are no longer linearly correlated. It also shifts the data samples so that they have zero mean.

In the second step, it reduces the dimension by keeping only the most informative components.
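
Here is a small illustrative sketch of those two steps on some synthetic correlated data (the data is assumed, not from the post): after the PCA transform the components have approximately zero mean and are uncorrelated, and keeping only the leading component is the actual dimension reduction.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x = rng.normal(size=200)
y = 0.8 * x + 0.2 * rng.normal(size=200)        # y is strongly correlated with x
samples = np.column_stack([x, y])

# Step 1: decorrelation -- same number of dimensions, but centred and rotated
pca = PCA()
transformed = pca.fit_transform(samples)
print(np.round(transformed.mean(axis=0), 6))    # approximately [0, 0]
print(np.round(np.corrcoef(transformed.T), 6))  # approximately the identity matrix

# Step 2: reduction -- keep only the leading component(s)
reduced = PCA(n_components=1).fit_transform(samples)
print(reduced.shape)                            # (200, 1)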

Advantages of Dimension Reduction with Principal Component Analysis (PCA)

Here are a few of the main reasons for, and advantages of, dimension reduction:

  • With fewer features, the space required to store the data is also smaller.
  • Fewer features in the dataset lead to less computation time and ultimately speed up machine learning algorithms.
  • Some algorithms perform better in lower dimensions.
  • It helps avoid multicollinearity issues by removing redundant features.
  • It helps in visualizing the data clearly, as discussed above.

Dimension Reduction with Principal Component Analysis (PCA)

Importing necessary modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Read the Dataset

data = pd.read_csv("C:\\Users\\Pankaj\\Desktop\\Dataset\\Boston_housing.csv")

Print first 5 rows

data.head()

[Output: the first five rows of the Boston Housing dataset]

Missing Values

Generally, NaN or missing values can appear in many forms, such as 0, '?', or the word "missing". In our case, as we can see above, there are a lot of 0's, so we can replace them with NaN to work out how much data we are missing.

# Replace the zeros in these two columns with NaN so they are counted as missing
data["zn"] = data["zn"].replace(0, np.nan)
data["chas"] = data["chas"].replace(0, np.nan)

After replacing them, let's use the .info() method to see the details about missing values in our dataset:

data.info()
Output-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim 506 non-null float64
zn 134 non-null float64
indus 506 non-null float64
chas 35 non-null float64
nox 506 non-null float64
rm 506 non-null float64
age 506 non-null float64
dis 506 non-null float64
rad 506 non-null int64
tax 506 non-null int64
ptratio 506 non-null float64
b 506 non-null float64
lstat 506 non-null float64
medv 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.4 KB

Let's calculate the percentage of missing values in our dataset. Generally, if a feature has around 20-25% missing values we can impute them in different ways, e.g. with the mean, the median, or an educated guess. But if it's much more than that, it's better to remove the feature, otherwise it can distort our results. As we can see below, both "zn" and "chas" are missing more than 70% of their data, so we will remove both of these features.

data.isnull().sum()/len(data)*100
Output-
crim 0.000000
zn 73.517787
indus 0.000000
chas 93.083004
nox 0.000000
rm 0.000000
age 0.000000
dis 0.000000
rad 0.000000
tax 0.000000
ptratio 0.000000
b 0.000000
lstat 0.000000
medv 0.000000
dtype: float64
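
As an aside, if a column had only a modest fraction of missing values (the 20-25% range mentioned above), we could impute instead of dropping it. Here is a minimal sketch using median imputation; the column name indus is only an arbitrary stand-in, since in this dataset it has no missing values:

# Hypothetical illustration only: fill missing values in a column with that column's median
data["indus"] = data["indus"].fillna(data["indus"].median())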

We can drop both columns as shown below:

data = data.drop("zn", 1)
data = data.drop("chas", 1)

Finding Correlation between features and the target variable

data.corr()

[Output: correlation matrix of the remaining features and the target variable]

Finding Correlation between Features and Target Variable in Boston Housing Dataset using Heatmap

correlation = data.corr()
plt.figure(figsize=(10,10))
sns.heatmap(correlation, vmax=1, square=True,annot=True,cmap='viridis')
plt.title('Correlation')

[Output: correlation heatmap of the features and the target variable]

Observations from the above Heatmap and data.corr():

  • From the correlation plot above, we can see that MEDV is strongly correlated with LSTAT and RM.
  • RAD and TAX are strongly correlated, so we should not include both of them in our feature set, to avoid multicollinearity. The same applies to DIS and AGE, which have a correlation of -0.75. Here we will simply exclude all four of these features from our feature list. You can find the reasoning behind this here.

Dropping Columns

data = data.drop(["rad","tax","dis","age"], 1)
data.shape
Output-
(506, 8)

Selecting Features and Target Variable

Scikit-learn expects the features and the target in separate arrays, conventionally called X and y. Here medv is our target variable, so we can extract the feature and target arrays from our dataset as shown below: for X we drop the medv column and keep everything else, and for y we keep only the medv column:

X = data.drop(["medv", 1).values 
y= data["medv"].values

Instantiate Principal Component Analysis

Instantiating Principal Component Analysis as shown below:

pca = PCA()

Instantiate StandardScaler and use .fit() and .transform() methods

PCA performs best with a standardized feature set, so we must make sure we standardize the data by transforming it to unit scale (mean = 0 and variance = 1).

scaler = StandardScaler()
X=scaler.fit_transform(X)
print(X)
Output-
array([[-0.41978194, -1.2879095 , -0.14421743, ..., -1.45900038,
         0.44105193, -1.0755623 ],
       [-0.41733926, -0.59338101, -0.74026221, ..., -0.30309415,
         0.44105193, -0.49243937],
       [-0.41734159, -0.59338101, -0.74026221, ..., -0.30309415,
         0.39642699, -1.2087274 ],
       ...,
       [-0.41344658,  0.11573841,  0.15812412, ...,  1.17646583,
         0.44105193, -0.98304761],
       [-0.40776407,  0.11573841,  0.15812412, ...,  1.17646583,
         0.4032249 , -0.86530163],
       [-0.41500016,  0.11573841,  0.15812412, ...,  1.17646583,
         0.44105193, -0.66905833]])

Now use the .fit_transform() method of pca on the features as shown below:

pca.fit_transform(X)
Output-
array([[-1.97211024,  0.59768267, -0.60108255, ...,  0.38664927,
         0.28142543, -0.62304466],
       [-1.23831918, -0.25830272,  0.06828658, ...,  0.05618521,
         0.02647495,  0.09980334],
       [-1.91503575,  0.52105926,  0.36884808, ..., -0.04006242,
         0.06420373,  0.10642031],
       ...,
       [-0.59389099, -0.08245307,  0.61169186, ..., -0.4804092 ,
         0.12947899, -0.32492734],
       [-0.4404716 , -0.23262198,  0.56730136, ..., -0.48373932,
         0.16834501, -0.33567707],
       [-0.00219328, -0.92450271,  0.34045304, ..., -0.43443481,
         0.54374601, -0.43375723]])

Now we want to know how many principal components to choose for our new feature subspace. A useful measure is the so-called "explained variance ratio", which tells us how much information (variance) can be attributed to each of the principal components. We can plot a bar graph with the principal component index on the X axis and the explained variance ratio on the Y axis:

features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_)
plt.xticks(features)
plt.xlabel("PCA features")
plt.ylabel("variance")
plt.show()

[Output: bar chart of the explained variance ratio of each principal component]
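
Reading the exact cumulative percentage off a bar chart can be awkward, so, as an optional addition that is not in the original post, we can also print the cumulative explained variance directly:

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cumulative, 3))  # running total of the variance explained by the first k components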

Now let's apply linear regression to the Boston Housing dataset. For that, we first split the data into training and testing sets: we train the model with 70% of the data and test with the remaining 30%.

We do this because if we compute the performance metric on the same data that we used to fit the model, it will not tell us how well the model generalizes to unseen data. For this reason, it's common to split the data into two sets: a training set, on which we train and fit the model, and a labelled test set, on which we make predictions and finally compare those predictions with the known labels.

X_train, X_test,y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=123)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Output-
(354, 7)
(152, 7)
(354,)
(152,)

Instantiate StandardScaler again, but this time fit it only on X_train and then transform both X_train and X_test, as shown below:

scaler = StandardScaler()
# Fit on training set only.
scaler.fit(X_train)
# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

From the bar plot above, we can see that the first four principal components describe most of the variance in the data (the explained variance ratios printed below add up to roughly 84% for the first four components), so we can choose n_components = 4 as shown below. Then we apply the .fit() and .transform() methods of PCA and finally look at the explained variance and the explained variance ratio.

pca = PCA(n_components=4)
pca.fit(X_train)
Output-
PCA(copy=True, iterated_power='auto', n_components=4, random_state=None, svd_solver='auto', tol=0.0, whiten=False)
pca.n_components_
Output-
4
pca.explained_variance_
Output-
array([3.47964585, 1.01111235, 0.77413271, 0.65443654])
pca.explained_variance_ratio_
Output-
array([0.49568805, 0.14403659, 0.11027798, 0.09322684])
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

Since the target variable here ("medv") is quantitative (a continuously varying variable), this is a regression problem. We imported LinearRegression from sklearn.linear_model at the start of the post. First instantiate LinearRegression(), then use .fit() to fit the linear regression, and then predict the price using .predict(), as shown below:

lr = LinearRegression()
lr.fit(X_train, y_train)
Output-
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
y_pred = lr.predict(X_test)

Now we will compute metrics to measure the model's performance. Here we compute R squared, the metric that quantifies the proportion of the variance in the target variable that is explained by the features, and RMSE (root mean squared error).

r2 = lr.score(X_test, y_test)
rmse = (np.sqrt(mean_squared_error(y_test, y_pred)))
print("The model performance is")
print("--------------------------------------")
print('R2 score is {}'.format(r2))
print('RMSE is {}'.format(rmse))
print("\n")
Output-
The model performance is

--------------------------------------
R2 score is 0.5471230138083647
RMSE is 6.050221694800512
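
As a quick sanity check on these numbers (an optional sketch, not part of the original post), R squared and RMSE can also be computed by hand from their definitions and should match the values printed above:

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print("R2  :", 1 - ss_res / ss_tot)                       # should match lr.score(X_test, y_test)
print("RMSE:", np.sqrt(np.mean((y_test - y_pred) ** 2)))  # should match the RMSE printed above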

 

To learn more about dimension reduction with Principal Component Analysis (PCA), you can check the official scikit-learn documentation here. Read more articles on Machine Learning here.

If you want to follow along, you can download the Boston Housing Dataset from here. You can also check out the Jupyter Notebook on GitHub.

 

