Last Updated on March 7, 2022 by Jay

This tutorial will show you how to do a least squares linear regression with Python using an example we discussed earlier. Check here to learn what a least squares regression is.

## Sample Dataset

We’ll use the following 10 randomly generated data point pairs.

```
x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]
```

## Least Squares Formula

For a least squares problem, our goal is to find a line **y = b + wx** that best represents/fits the given data points. In other words, we need to find the **b** and **w** values that **minimize the sum of squared errors** for the line.

As a reminder, the following equations will solve the best **b (intercept)** and **w (slope)** for us:

**w = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)**

**b = (Σy − w·Σx) / n**

## Least Squares Linear Regression By Hand

Let’s create two new lists, **xy** and **x_sqrt**:

```
xy = []
for i, val in enumerate(x):
    xy.append(x[i] * y[i])

x_sqrt = [i**2 for i in x]
n = len(x)

xy
[672, 352, 2627, 7722, 3735, 1485, 5600, 5452, 48, 2000]
x_sqrt
[144, 256, 5041, 9801, 2025, 729, 6400, 3364, 16, 2500]
n
10
```

We can then calculate the **w (slope)** and **b (intercept)** terms using the above formula:

```
w = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x_sqrt) - sum(x)**2)
b = (sum(y) - w*sum(x))/n
w
0.4950512786062967
b
31.82863092838909
```
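As a sanity check, we can plug **w** and **b** back into the line and compute the sum of squared errors that this fit minimizes. A quick sketch in plain Python:

```python
x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]

n = len(x)
xy = [xi * yi for xi, yi in zip(x, y)]
x_sqrt = [xi**2 for xi in x]

# Least squares slope and intercept from the formulas above
w = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x_sqrt) - sum(x)**2)
b = (sum(y) - w*sum(x)) / n

# Sum of squared errors (residuals) for the fitted line y = b + w*x
sse = sum((yi - (b + w*xi))**2 for xi, yi in zip(x, y))
print(round(w, 4), round(b, 4))  # 0.4951 31.8286
```

Any other choice of **w** and **b** would produce a larger `sse` for this dataset.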

## Least Squares Linear Regression With Python Sklearn

**Scikit-learn** is a great Python library for data science, and we’ll use it to help us with linear regression. We also need the **numpy** library to help with data transformation. Let’s install both using pip. Note that the pip package is named **scikit-learn**, while the import name is **sklearn**:

`pip install scikit-learn numpy`

In general, **sklearn** expects 2D array input rather than 1D. The **x** and **y** lists are 1D, so we have to convert them into 2D arrays using numpy’s **reshape()** method. After the transformation, each of the new **x** and **y** is a column vector: a 2D array with 10 rows and 1 column.

```
from sklearn.linear_model import LinearRegression
import numpy as np
x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]
x = np.array(x).reshape(-1,1)
y = np.array(y).reshape(-1,1)
x
array([[12],
[16],
[71],
[99],
[45],
[27],
[80],
[58],
[ 4],
[50]])
y
array([[56],
[22],
[37],
[78],
[83],
[55],
[70],
[94],
[12],
[40]])
```
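To make the reshape effect concrete, we can compare the array shapes before and after the conversion. A quick sketch:

```python
import numpy as np

x = np.array([12,16,71,99,45,27,80,58,4,50])
print(x.shape)          # (10,) - a 1D array

x2d = x.reshape(-1, 1)  # -1 lets numpy infer the number of rows
print(x2d.shape)        # (10, 1) - a 2D array: 10 rows, 1 column
```

The `-1` tells numpy to calculate that dimension automatically from the array's length.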

Our data is in the proper format now, so we can create a linear regression model and “**fit**” (another term is “train”) it. Under the hood, **sklearn** performs the **w** and **b** calculations.

We can check the **intercept (b)** and **slope (w)** values. Note that by **sklearn**‘s naming convention, attributes ending with an underscore “_” are estimated from the data.

```
linreg = LinearRegression().fit(x,y)
linreg.intercept_
array([31.82863093])
linreg.coef_
array([[0.49505128]])
```

As shown above, the values match our previously hand-calculated values.
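Once fitted, the model can also estimate **y** for new **x** values via its **predict()** method. For example (the input value 60 here is just an illustration; predict() also expects a 2D array):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.array([12,16,71,99,45,27,80,58,4,50]).reshape(-1,1)
y = np.array([56,22,37,78,83,55,70,94,12,40]).reshape(-1,1)
linreg = LinearRegression().fit(x, y)

# Predict y at x = 60, i.e. b + w*60
pred = linreg.predict(np.array([[60]]))
print(pred)  # roughly [[61.53]]
```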

## Plot Data And Regression Line In Python

We’ll use the **matplotlib** library for plotting, get it with pip if you don’t have it yet:

`pip install matplotlib`

**Matplotlib** is probably the most well-known plotting library in Python. It provides great flexibility for customization if you know what you are doing 🙂

```
import matplotlib.pyplot as plt
# Jupyter magic for interactive inline plots; omit this line outside notebooks
%matplotlib notebook

plt.scatter(x, y)
ax = plt.gca()
ax.set_xlabel('x')
ax.set_ylabel('y')

# Regression line: y = b + w*x
plt.plot(x, linreg.intercept_ + linreg.coef_*x, color='r')
```
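If you are running this as a plain script rather than in a Jupyter notebook, drop the `%matplotlib notebook` magic and finish with `plt.show()` or `plt.savefig()`. A self-contained sketch (the output filename here is just an illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

x = np.array([12,16,71,99,45,27,80,58,4,50]).reshape(-1,1)
y = np.array([56,22,37,78,83,55,70,94,12,40]).reshape(-1,1)
linreg = LinearRegression().fit(x, y)

plt.scatter(x, y)
plt.plot(x, linreg.intercept_ + linreg.coef_*x, color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('least_squares_fit.png')  # or plt.show() in an interactive session
```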