Linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables. When we model the relationship between a single feature variable and a single target variable, it is called simple linear regression. When there is more than one independent variable, the process is called multiple linear regression.
In this post, we will learn how to perform linear regression using least squares, with the help of the NumPy package and its .polyfit() function.
Let’s start with the basics of linear regression and how it actually works.
Understanding Regression
Think of it as fitting a line to our data. As we know, a line in two dimensions can be defined in the form
y = ax + b,
where y is the target,
x is the single feature,
and a and b are the parameters, slope and intercept respectively, that we need to calculate.
Selecting the slope and intercept values that describe our data in the best possible way
The slope sets how steep the line is, and the intercept sets where the line crosses the y-axis. To select the best values for the slope and intercept, we have to make sure that all the data points collectively lie as close as possible to the line.
Residuals and Least Squares
The vertical distance between a data point and the line is called a residual. Looking at the graph above, Residual_1 has a negative value because the data point lies below the line, while Residual_2 has a positive value because the data point lies above the line.
So we have to define the line in such a way that all the data points lie as close as possible to it, i.e. the line for which the sum of the squares of all the residuals is minimal. The process of finding the parameter values that minimize this sum of squared residuals is called least squares.
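To make this idea concrete, below is a minimal sketch (with made-up example points, not the data set used later in this post) of the quantity least squares minimizes, the residual sum of squares:

import numpy as np

# Hypothetical example points, purely for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 4.8])

def rss(a, b, x, y):
    # Residual sum of squares for the candidate line y = a*x + b
    residuals = y - (a * x + b)  # vertical distances from the points to the line
    return np.sum(residuals ** 2)

# Least squares picks the (a, b) pair with the smallest possible RSS
print(rss(1.0, 1.0, x, y))  # RSS for one candidate line
print(rss(0.9, 1.3, x, y))  # RSS for a slightly different line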
Calculating Least Squares with np.polyfit() function
Here, we will use the .polyfit() function from the NumPy package, which performs a least-squares polynomial fit under the hood. The basic syntax for np.polyfit() is:
a, b = np.polyfit(X, Y, 1)  # a = slope, b = intercept
where
The first parameter (X) is the independent variable (the x-coordinates of the data),
The second parameter (Y) is the dependent variable (the y-coordinates),
The third parameter is the degree of the polynomial we wish to fit. For a linear function, we enter 1, as in the sketch below.
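For example, here is a quick sketch with made-up points that lie close to the line y = 2x + 1:

import numpy as np

# Hypothetical data scattered around y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.1, 2.9, 5.2, 6.9])

a, b = np.polyfit(X, Y, 1)  # degree 1, i.e. fit a straight line
print(a, b)                 # slope and intercept, close to 2 and 1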
For this post, we are going to use the Anscombe’s quartet data set, which is stored as an Excel file that we can read using pd.read_excel(). We also need to import the Pandas and NumPy packages (and Matplotlib, for the plots later on) with their common alias names, as shown below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for the plots later in this post
data = pd.read_excel("C:\\Users\\Pankaj\\Desktop\\Practice\\anscombes-quartet.xlsx")
Or, if the Excel file and our code file (.py) are in the same directory, we can write the above line as follows:
data = pd.read_excel("anscombes-quartet.xlsx")
print(data)
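As a quick sanity check (assuming the sheet uses the column names X, Y, X.1, Y.1, X.2, Y.2, X.3, Y.3, as in the rest of this post), we can also inspect just the first few rows and the column names:

print(data.head())   # first five rows of the data set
print(data.columns)  # should list X, Y, X.1, Y.1, X.2, Y.2, X.3, Y.3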
Calculating Summary Statistics
Calculating Mean using np.mean()
To learn more about how to calculate the mean, variance, and other summary statistics for exploratory data analysis of a data set, you can check our post.
Mean of X = Mean of X.1 = Mean of X.2 = Mean of X.3 = 9
Mean of Y = Mean of Y.1 = Mean of Y.2 = Mean of Y.3 = 7.5
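A minimal sketch of how these means can be computed with np.mean(), assuming the column names shown above:

for col in ['X', 'X.1', 'X.2', 'X.3']:
    print(col, np.mean(data[col]))  # each prints 9.0
for col in ['Y', 'Y.1', 'Y.2', 'Y.3']:
    print(col, np.mean(data[col]))  # each prints roughly 7.5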
Calculating Variance using np.var()
Variance of X = Variance of X.1 = Variance of X.2 = Variance of X.3 = 11
Variance of Y = Variance of Y.1 = Variance of Y.2 = Variance of Y.3 = 4.125 (approx.)
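A similar sketch for the variances; note that np.var() defaults to the population variance (ddof=0), so to reproduce the sample-variance values quoted above we pass ddof=1:

for col in ['X', 'X.1', 'X.2', 'X.3']:
    print(col, np.var(data[col], ddof=1))  # each prints 11.0
for col in ['Y', 'Y.1', 'Y.2', 'Y.3']:
    print(col, np.var(data[col], ddof=1))  # each prints roughly 4.125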
Now we calculate a (slope) and b (intercept) for all the groups ((X, Y), (X.1, Y.1), (X.2, Y.2), (X.3, Y.3)) using np.polyfit(), as shown below:
a, b = np.polyfit(data['X'], data['Y'], 1)
print(a, b)
# 0.5000909090909095 3.000090909090909

a, b = np.polyfit(data['X.1'], data['Y.1'], 1)
print(a, b)
# 0.5000000000000004 3.0009090909090896

a, b = np.polyfit(data['X.2'], data['Y.2'], 1)
print(a, b)
# 0.4997272727272731 3.0024545454545453

a, b = np.polyfit(data['X.3'], data['Y.3'], 1)
print(a, b)
# 0.4999090909090908 3.0017272727272735
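The same four fits can also be written more compactly as a loop (a sketch, again assuming the column names above):

pairs = [('X', 'Y'), ('X.1', 'Y.1'), ('X.2', 'Y.2'), ('X.3', 'Y.3')]
for xcol, ycol in pairs:
    a, b = np.polyfit(data[xcol], data[ycol], 1)
    print(xcol, ycol, round(a, 2), round(b, 2))  # roughly 0.5 and 3.0 for every group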
So the linear regression line is essentially the same for every group, with a (slope) = 0.50 and b (intercept) = 3.00 when rounded to two decimal places. This is the well-known property of Anscombe’s quartet: the four groups share nearly identical summary statistics and regression lines, yet look very different when plotted, which is exactly what the plots below illustrate.
Plotting Linear Regression using Least Squares for X and Y (Fig1)
plt.figure()  # start a new figure so this group gets its own plot (Fig1)
a, b = np.polyfit(data['X'], data['Y'], 1)
plt.scatter(data['X'], data['Y'], color='blue')  # creates the scatter plot; nothing is shown until we call plt.show()
X_th = np.array([3, 15])  # two x-values spanning the data, enough to draw a straight line
Y_th = a * X_th + b       # the corresponding y-values on the fitted line
plt.plot(X_th, Y_th, color='black', linewidth=3)  # draws the regression line; also shown only once plt.show() is called
Plotting Linear Regression using Least Squares for X.1 and Y.1 (Fig2)
plt.figure()  # new figure for Fig2
a, b = np.polyfit(data['X.1'], data['Y.1'], 1)
plt.scatter(data['X.1'], data['Y.1'], color='blue')
X_th = np.array([3, 15])
Y_th = a * X_th + b
plt.plot(X_th, Y_th, color='black', linewidth=3)
Plotting Linear Regression using Least Squares for X.2 and Y.2 (Fig3)
plt.figure()  # new figure for Fig3
a, b = np.polyfit(data['X.2'], data['Y.2'], 1)
plt.scatter(data['X.2'], data['Y.2'], color='blue')  # note: X.2 and Y.2 here, not X and Y
X_th = np.array([3, 15])
Y_th = a * X_th + b
plt.plot(X_th, Y_th, color='black', linewidth=3)
Plotting Linear Regression using Least Squares for X.3 and Y.3 (Fig4)
plt.figure()  # new figure for Fig4
a, b = np.polyfit(data['X.3'], data['Y.3'], 1)
plt.scatter(data['X.3'], data['Y.3'], color='blue')
X_th = np.array([3, 15])
Y_th = a * X_th + b
plt.plot(X_th, Y_th, color='black', linewidth=3)
Finally, to display all four figures, use the command
plt.show()
For more information about the Pandas and NumPy packages, you can see their official documentation: Pandas Documentation, NumPy Documentation.