Least Squares Linear Regression With Python Example

Last Updated on March 7, 2022 by Jay

This tutorial shows how to do a least squares linear regression in Python, using an example we discussed earlier. See the earlier article for an explanation of what a least squares regression is.

Sample Dataset

We’ll use the following 10 randomly generated (x, y) data points.

x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]

Least Squares Formula

For a least squares problem, our goal is to find a line y = b + wx that best represents/fits the given data points. In other words, we need to find the b and w values that minimize the sum of squared errors for the line.
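To make the "sum of squared errors" concrete, here is a minimal sketch that scores a candidate line on the sample data. The function name sum_squared_errors and the two candidate lines are illustrative assumptions, not part of the original example.

```python
x = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]
y = [56, 22, 37, 78, 83, 55, 70, 94, 12, 40]

def sum_squared_errors(b, w, xs, ys):
    """Sum of squared vertical distances between the points and the line y = b + w*x."""
    return sum((yi - (b + w * xi)) ** 2 for xi, yi in zip(xs, ys))

# Candidate 1: a flat line through the mean of y (w = 0).
flat = sum_squared_errors(sum(y) / len(y), 0.0, x, y)

# Candidate 2: a hypothetical sloped line, y = 30 + 0.5x.
sloped = sum_squared_errors(30.0, 0.5, x, y)

print(flat, sloped)
```

The sloped candidate gives the smaller error of the two. Least squares regression finds the b and w for which this quantity is as small as possible.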

Figure: A least squares linear regression example

As a reminder, the following equations give the best-fit b (intercept) and w (slope):
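Reconstructed from the calculation used in the next section, the equations can be written as follows (n is the number of data points, and each sum runs over all points i = 1, ..., n):

```latex
w = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \frac{\sum y_i - w \sum x_i}{n}
```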

Least Squares Linear Regression By Hand

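Let’s create two new lists: xy, which holds the products of each x and y pair, and x_sqrt, which (despite its name) holds the squares of the x values, not their square roots: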

# Products x_i * y_i for each data point
xy = [xi * yi for xi, yi in zip(x, y)]

# Squares of the x values (x_i ** 2)
x_sqrt = [xi**2 for xi in x]
n = len(x)  # number of data points

xy
[672, 352, 2627, 7722, 3735, 1485, 5600, 5452, 48, 2000]

x_sqrt
[144, 256, 5041, 9801, 2025, 729, 6400, 3364, 16, 2500]

n
10

We can then calculate the w (slope) and b (intercept) terms using the above formula:

w = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x_sqrt) - sum(x)**2)
b = (sum(y) - w*sum(x))/n

w
0.4950512786062967

b
31.82863092838909
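As a quick sanity check on the hand calculation, the same line can be fitted with numpy's polyfit, which performs a least squares polynomial fit. This is only a cross-check sketch and assumes numpy is installed.

```python
import numpy as np

x = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]
y = [56, 22, 37, 78, 83, 55, 70, 94, 12, 40]

# A degree-1 polynomial fit is a least squares line. Coefficients are
# returned highest degree first, i.e. slope, then intercept.
slope, intercept = np.polyfit(x, y, 1)

print(slope, intercept)
```

The two values should agree with w and b above, up to floating-point rounding.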

Least Squares Linear Regression With Python Sklearn

Scikit-learn is a popular Python library for machine learning, and we’ll use it to help us with linear regression. We also need the numpy library to help with data transformation. Let’s install both using pip. Note that the package is installed under the name scikit-learn but imported as sklearn:

pip install scikit-learn numpy

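sklearn expects the input x as a 2D array with one row per data point and one column per feature. Our x and y lists are 1D, so we convert them into 2D arrays using numpy’s reshape() method; reshape(-1,1) produces a single column and lets numpy work out the number of rows. Strictly speaking, only x has to be 2D (y may stay 1D), but we reshape both here. Although each new array still holds a single column of values, it is 2D because every value now sits in its own row, like a list of lists.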

from sklearn.linear_model import LinearRegression
import numpy as np

x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]
x = np.array(x).reshape(-1,1)
y = np.array(y).reshape(-1,1)

x
array([[12],
       [16],
       [71],
       [99],
       [45],
       [27],
       [80],
       [58],
       [ 4],
       [50]])

y
array([[56],
       [22],
       [37],
       [78],
       [83],
       [55],
       [70],
       [94],
       [12],
       [40]])
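To confirm the change in dimensions, the shape attribute of each array can be inspected. The names x_list, x_1d and x_2d below are used only for this illustration.

```python
import numpy as np

x_list = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]

x_1d = np.array(x_list)      # 1D array: 10 values in a single axis
x_2d = x_1d.reshape(-1, 1)   # 2D array: -1 lets numpy infer the row count

print(x_1d.shape)  # (10,)
print(x_2d.shape)  # (10, 1)
```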

Now that our data is in the proper format, we can create a linear regression model and “fit” (another term is “train”) it. Under the hood, sklearn will perform the w and b calculations.

We can then check the intercept (b) and slope (w) values. Note that by sklearn’s naming convention, attributes ending with an underscore “_” are estimated from the data.

linreg = LinearRegression().fit(x,y)

linreg.intercept_
array([31.82863093])

linreg.coef_
array([[0.49505128]])

As shown above, the values match our previously hand-calculated values.
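Once fitted, the model can also predict y for new x values. The sketch below is a small illustration; the input value 60 is a hypothetical example, not part of the dataset. Like the training data, the new input must be a 2D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([12, 16, 71, 99, 45, 27, 80, 58, 4, 50]).reshape(-1, 1)
y = np.array([56, 22, 37, 78, 83, 55, 70, 94, 12, 40]).reshape(-1, 1)

linreg = LinearRegression().fit(x, y)

# Predict y for a new, hypothetical x value (one row, one feature).
x_new = np.array([[60]])
y_pred = linreg.predict(x_new)

# The same prediction computed directly from y = b + w*x.
by_hand = linreg.intercept_[0] + linreg.coef_[0, 0] * 60

print(y_pred[0, 0], by_hand)
```

Both printed numbers should be the same, since predict() evaluates the fitted line b + wx.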

Plot Data And Regression Line In Python

We’ll use the matplotlib library for plotting. Get it with pip if you don’t have it yet:

pip install matplotlib

Matplotlib is probably the most well-known plotting library in Python. It provides great flexibility for customization if you know what you are doing 🙂

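import matplotlib.pyplot as plt

# Jupyter-only magic for interactive figures; remove it when running as a script
%matplotlib notebook

plt.scatter(x, y)  # the raw data points
ax = plt.gca()
ax.set_xlabel('x')
ax.set_ylabel('y')
# The regression line y = b + w*x
plt.plot(x, linreg.intercept_ + linreg.coef_*x, color='r')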
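Outside a Jupyter notebook, a similar plot can be written to an image file instead. The following is a minimal sketch under a few assumptions: it uses the non-interactive "Agg" backend, obtains the slope and intercept from numpy's polyfit rather than sklearn to stay self-contained, and saves to a hypothetical file name, regression.png.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assuming no display is available
import matplotlib.pyplot as plt
import numpy as np

x = np.array([12, 16, 71, 99, 45, 27, 80, 58, 4, 50])
y = np.array([56, 22, 37, 78, 83, 55, 70, 94, 12, 40])

# Least squares line via a degree-1 polynomial fit (slope first, then intercept).
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots()
ax.scatter(x, y)

# Two end points are enough to draw a straight line.
x_line = np.array([x.min(), x.max()])
ax.plot(x_line, intercept + slope * x_line, color="r")

ax.set_xlabel("x")
ax.set_ylabel("y")
fig.savefig("regression.png")  # hypothetical output file name
```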

Additional Resources

Least Squares Linear Regression with An Example

Least Squares Linear Regression With Excel