Last Updated on March 7, 2022 by Jay
This tutorial shows how to perform a least squares linear regression in Python, using an example we discussed earlier. Check here to learn what a least squares regression is.
Sample Dataset
We’ll use the following 10 randomly generated data point pairs.
x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]
Least Squares Formula
For a least squares problem, our goal is to find a line y = b + wx that best represents/fits the given data points. In other words, we need to find the b and w values that minimize the sum of squared errors for the line.
As a reminder, the following equations give the best-fitting b (intercept) and w (slope):
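Written out to match the code used later in this tutorial (so treat the exact notation as a reconstruction rather than the original figure), with n the number of data points and every sum running over all n points:

```latex
w = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
         {n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
\qquad
b = \frac{\sum_{i=1}^{n} y_i - w\sum_{i=1}^{n} x_i}{n}
```

The slope w is computed first because the intercept b depends on it.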
Least Squares Linear Regression By Hand
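Let’s create two new lists: xy, which holds the product of each x and y pair, and x_sqrt, which (despite its name) holds the squares of the x values, not their square roots: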
xy = []
for x_val, y_val in zip(x, y):
    xy.append(x_val * y_val)
x_sqrt = [i**2 for i in x]
n = len(x)
xy
[672, 352, 2627, 7722, 3735, 1485, 5600, 5452, 48, 2000]
x_sqrt
[144, 256, 5041, 9801, 2025, 729, 6400, 3364, 16, 2500]
n
10
We can then calculate the w (slope) and b (intercept) terms using the above formula:
w = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x_sqrt) - sum(x)**2)
b = (sum(y) - w*sum(x))/n
w
0.4950512786062967
b
31.82863092838909
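As a sanity check on the claim that these values minimize the sum of squared errors, the sketch below recomputes w and b and compares the error of the fitted line against a few slightly nudged lines. The helper name sse and the nudge sizes are illustrative choices, not part of the original tutorial.

```python
x = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]
y = [56, 22, 37, 78, 83, 55, 70, 94, 12, 40]

def sse(b, w):
    """Sum of squared errors of the line y = b + w*x on the sample data."""
    return sum((yi - (b + w * xi)) ** 2 for xi, yi in zip(x, y))

# Closed-form least squares solution, same formula as in the tutorial
n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
w = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - w * sum_x) / n

best = sse(b, w)
# Nudging either parameter in either direction should only make the fit worse
nudged = [sse(b + db, w + dw) for db, dw in [(1, 0), (-1, 0), (0, 0.05), (0, -0.05)]]
print(best, all(err > best for err in nudged))
```

Every nudged line has a larger error than the fitted one, which is what "least squares" means in practice.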
Least Squares Linear Regression With Python Sklearn
Scikit-learn is a popular Python library for data science, and we’ll use it to help us with linear regression. We also need the numpy library to help with data transformation. Let’s install both using pip. Note that the package is installed under the name scikit-learn but imported in code as sklearn:
pip install scikit-learn numpy
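sklearn expects the feature input x to be a 2D array with one row per sample and one column per feature. The x and y lists are 1D, so we convert them into 2D arrays using numpy’s reshape() method; reshape(-1,1) produces a single column and lets numpy infer the number of rows. (The target y may also be passed as a 1D array; we reshape it here too.) Although the new x and y may still look like 1D arrays after the transformation, they are technically 2D because each one is now an array of single-element rows.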
from sklearn.linear_model import LinearRegression
import numpy as np
x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]
x = np.array(x).reshape(-1,1)
y = np.array(y).reshape(-1,1)
x
array([[12],
[16],
[71],
[99],
[45],
[27],
[80],
[58],
[ 4],
[50]])
y
array([[56],
[22],
[37],
[78],
[83],
[55],
[70],
[94],
[12],
[40]])
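A quick way to confirm the change in dimensionality is to inspect ndim and shape before and after reshaping. This is a small sketch; the name x_2d is used here only for illustration.

```python
import numpy as np

x = np.array([12, 16, 71, 99, 45, 27, 80, 58, 4, 50])
print(x.ndim, x.shape)        # 1 (10,)

x_2d = x.reshape(-1, 1)       # -1 lets NumPy infer the number of rows
print(x_2d.ndim, x_2d.shape)  # 2 (10, 1)
```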
Our data is now in the proper format, so we can create a linear regression model and “fit” (another term is “train”) it. Under the hood, sklearn will perform the w and b calculations.
We can then check the intercept (b) and slope (w) values. Note that by sklearn’s naming convention, attributes ending with an underscore “_” are estimated from the data.
linreg = LinearRegression().fit(x,y)
linreg.intercept_
array([31.82863093])
linreg.coef_
array([[0.49505128]])
As shown above, the values match our previously hand-calculated values.
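The comparison can also be done programmatically instead of by eye. The sketch below assumes the hand-calculated values from earlier and, as a variation, leaves y as a 1D array, which sklearn also accepts; in that case intercept_ is a scalar and coef_ has one entry per feature. The names model, w_hand and b_hand are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([12, 16, 71, 99, 45, 27, 80, 58, 4, 50]).reshape(-1, 1)  # features must be 2D
y = np.array([56, 22, 37, 78, 83, 55, 70, 94, 12, 40])                # target left as 1D

model = LinearRegression().fit(x, y)

# Values from the by-hand section of this tutorial
w_hand, b_hand = 0.4950512786062967, 31.82863092838909

# np.isclose compares floats within a small tolerance instead of exactly
print(np.isclose(model.intercept_, b_hand), np.isclose(model.coef_[0], w_hand))
```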
Plot Data And Regression Line In Python
We’ll use the matplotlib library for plotting. Install it with pip if you don’t have it yet:
pip install matplotlib
Matplotlib is probably the most well-known plotting library in Python. It provides great flexibility for customization if you know what you are doing 🙂
import matplotlib.pyplot as plt
# Jupyter magic for interactive plots; remove this line when running as a plain Python script
%matplotlib notebook
plt.scatter(x,y)
ax = plt.gca()
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.plot(x,linreg.intercept_+linreg.coef_*x, color='r')
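Outside a Jupyter notebook, the same plot can be produced from a plain script. The sketch below is one way to do it, assuming the intercept and slope found above (hard-coded here so the block runs on its own) and writing the figure to a hypothetical file name, regression.png.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

x = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]
y = [56, 22, 37, 78, 83, 55, 70, 94, 12, 40]
# Intercept and slope from the fit above, hard-coded to keep the sketch self-contained
b, w = 31.82863092838909, 0.4950512786062967

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")

# A straight line only needs its two endpoints
x_line = [min(x), max(x)]
y_line = [b + w * xi for xi in x_line]
ax.plot(x_line, y_line, color="r")

fig.savefig("regression.png")  # hypothetical output file name
```

To view the plot in a window instead of saving it, drop the matplotlib.use("Agg") line and call plt.show() at the end.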