
Performing Linear Regression using Least Squares

Linear regression is a linear approach used to model the relationship between a dependent variable and one or more independent variables. When we model the relationship between a single feature variable and a single target variable, it is called simple linear regression; when there is more than one independent variable, the process is called multiple linear regression.

In this post, we will learn how to perform Linear Regression using Least Squares with the help of the NumPy package and its .polyfit() function.

Let’s start with the basics of linear regression and how it actually works.

Understanding Regression

Think of it as fitting a line to our data. We can define a line in two dimensions in the form
y = ax + b,

where y is the target,
x is the single feature,
and a and b are the slope and intercept parameters, respectively, that we need to calculate.
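As a quick illustration with made-up numbers (both the slope and the intercept below are hypothetical), the line maps a feature value to a prediction:

a, b = 0.5, 3.0   # hypothetical slope and intercept
x = 10            # a single feature value
y = a * x + b     # prediction: 0.5 * 10 + 3.0
print(y)          # 8.0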

[Figure: Selecting the slope and intercept values that describe our data in the best possible way]

The slope sets how steep the line is, and the intercept sets where the line crosses the y-axis. To select the best values for the slope and intercept, we have to make sure that all the data points collectively lie as close as possible to the line.

Residuals and Least Squares

[Figure: Residuals, the vertical distances between the data points and the regression line]

The vertical distance between a data point and the line is called the residual. By looking at the graph above, we can say that Residual_1 has a negative value because its data point lies below the line; similarly, Residual_2 has a positive value because its data point lies above the line.

So we have to define the line in such a way that all the data points lie as close as possible to it, and for which the sum of the squares of all the residuals is minimal. The process of finding the parameter values that minimize the sum of squared residuals is called least squares.
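A minimal sketch of the quantity being minimized, using made-up points and a hypothetical candidate line:

import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([3.6, 3.9, 4.6, 4.9])   # hypothetical observations
a, b = 0.5, 3.0                      # a candidate slope and intercept
residuals = y - (a * x + b)          # vertical distances from the line
print(np.sum(residuals ** 2))        # 0.04, the sum of squared residuals that least squares minimizes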

Calculating Least Squares with the np.polyfit() function

Here, we will use the .polyfit() function from the NumPy package, which performs a least-squares polynomial fit under the hood. The basic syntax for np.polyfit() is:

a, b = np.polyfit(X, Y, 1)   # a = slope, b = intercept

where
The first parameter (X) is the independent variable (the feature),
The second parameter (Y) is the dependent variable (the target),
The third parameter is the degree of the polynomial we wish to fit. For a linear function, we enter 1, as shown in the example below.
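For example, fitting a line to a few made-up points that lie roughly on y = 2x + 1 (the data below is hypothetical):

import numpy as np

x = np.array([0, 1, 2, 3])
y = np.array([1.1, 2.9, 5.2, 6.8])   # roughly follows y = 2x + 1
a, b = np.polyfit(x, y, 1)
print(a, b)                          # approximately 1.94 and 1.09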

For this post, we are going to use the Anscombe’s quartet data set, which is stored as an Excel file that we can read using pd.read_excel(). We also need to import the pandas and NumPy packages under their common alias names, as shown below:

import pandas as pd
import numpy as np
data = pd.read_excel("C:\\Users\\Pankaj\\Desktop\\Practice\\anscombes-quartet.xlsx")

Or we can write the above code as follows if the Excel file and the code file (.py) are in the same directory:

data = pd.read_excel("anscombes-quartet.xlsx")
print(data)

[Figure: The Anscombe’s quartet data set printed as a pandas DataFrame]
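The quartet is stored as four (x, y) column pairs. The exact column names depend on how the Excel file is laid out, so it is worth confirming the names assumed in the code below:

print(data.columns.tolist())   # should include 'X', 'Y', 'X.1', 'Y.1', ..., 'X.3', 'Y.3'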

Calculating Summary Statistics

Calculating Mean using np.mean()

To learn more about how to calculate the mean, variance, and other summary statistics for exploratory data analysis of a data set, you can check our earlier post.

Mean of X = Mean of X.1 = Mean of X.2 = Mean of X.3 = 9
Mean of Y = Mean of Y.1 = Mean of Y.2 = Mean of Y.3 = 7.5

Calculating Variance using np.var()

Variance of X = Variance of X.1 = Variance of X.2 = Variance of X.3 = 11
Variance of Y = Variance of Y.1 = Variance of Y.2 = Variance of Y.3 = 4.125 (approx)
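A minimal sketch of how these values can be reproduced; note that np.var() computes the population variance by default, so ddof=1 is needed to match the sample variance of 11 quoted above:

print(np.mean(data['X']))          # 9.0
print(np.var(data['X'], ddof=1))   # 11.0
print(np.mean(data['Y']))          # approximately 7.5
print(np.var(data['Y'], ddof=1))   # approximately 4.13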

Calculating a (slope) and b (intercept) for all the groups ((X, Y), (X.1, Y.1), (X.2, Y.2), (X.3, Y.3)) using np.polyfit(), as shown below:
a, b = np.polyfit(data['X'], data['Y'], 1)
print(a, b)   # 0.5000909090909095 3.000090909090909

a, b = np.polyfit(data['X.1'], data['Y.1'], 1)
print(a, b)   # 0.5000000000000004 3.0009090909090896

a, b = np.polyfit(data['X.2'], data['Y.2'], 1)
print(a, b)   # 0.4997272727272731 3.0024545454545453

a, b = np.polyfit(data['X.3'], data['Y.3'], 1)
print(a, b)   # 0.4999090909090908 3.0017272727272735

So we can say that the linear regression line is essentially the same for all four groups, with a (slope) = 0.50 and b (intercept) = 3.00 when rounded to two decimal places. This is exactly what makes Anscombe’s quartet interesting: the groups share nearly identical summary statistics and regression lines, yet they look completely different when plotted.

Plotting Linear Regression using Least Squares for X and Y (Fig1)
import matplotlib.pyplot as plt   # matplotlib is needed for the plotting calls below

a, b = np.polyfit(data['X'], data['Y'], 1)
plt.figure()   # start a new figure so each group gets its own plot
plt.scatter(data['X'], data['Y'], color='blue')   # creates a PathCollection; nothing is shown until we call plt.show()
X_th = np.array([3, 15])   # two x-values spanning the data
Y_th = a * X_th + b        # the fitted line evaluated at those x-values
plt.plot(X_th, Y_th, color='black', linewidth=3)  # creates a Line2D; also not shown until plt.show()
Plotting Linear Regression using Least Squares for X.1 and Y.1 (Fig2)
a, b = np.polyfit(data['X.1'], data['Y.1'], 1)
plt.figure()
plt.scatter(data['X.1'], data['Y.1'], color='blue')
X_th = np.array([3, 15])
Y_th = a * X_th + b
plt.plot(X_th, Y_th, color='black', linewidth=3)
Plotting Linear Regression using Least Squares for X.2 and Y.2 (Fig3)
a, b = np.polyfit(data['X.2'], data['Y.2'], 1)
plt.figure()
plt.scatter(data['X.2'], data['Y.2'], color='blue')
X_th = np.array([3, 15])
Y_th = a * X_th + b
plt.plot(X_th, Y_th, color='black', linewidth=3)
Plotting Linear Regression using Least Squares for X.3 and Y.3 (Fig4)
a, b = np.polyfit(data['X.3'], data['Y.3'], 1)
plt.figure()
plt.scatter(data['X.3'], data['Y.3'], color='blue')
X_th = np.array([3, 15])
Y_th = a * X_th + b
plt.plot(X_th, Y_th, color='black', linewidth=3)

Finally, to show the plots, use the command:

plt.show()
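Since the four blocks above are nearly identical, the same plots can also be produced with a loop over the column pairs. A minimal sketch, reusing the data frame loaded earlier and assuming a 2x2 subplot layout for the combined figure:

import matplotlib.pyplot as plt
import numpy as np

pairs = [('X', 'Y'), ('X.1', 'Y.1'), ('X.2', 'Y.2'), ('X.3', 'Y.3')]
X_th = np.array([3, 15])   # x-range over which to draw each fitted line

for i, (x_col, y_col) in enumerate(pairs, start=1):
    a, b = np.polyfit(data[x_col], data[y_col], 1)
    plt.subplot(2, 2, i)   # one panel per group
    plt.scatter(data[x_col], data[y_col], color='blue')
    plt.plot(X_th, a * X_th + b, color='black', linewidth=3)
    plt.title(x_col + ' vs ' + y_col)

plt.show()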

[Figure: Performing Linear Regression using Least Squares on the Anscombe’s quartet data set, one plot per group]

For more information about the pandas and NumPy packages, you can see their official documentation: Pandas Documentation, NumPy Documentation.

Related posts:

  1. Anscombe Quartet and use of Exploratory Data Analysis
  2. Applying Linear Regression to Boston Housing Dataset
  3. Building simple Linear Regression model using Python’s Sci-kit library
  4. How to Split Data for Machine Learning with scikit-learn

