Last Updated on March 7, 2022 by Jay
This tutorial will explain what a least-squares linear regression is in simple terms, and it will be followed by examples with Excel and Python implementations in later tutorials. The name “linear regression” might sound foreign, but it’s really just finding a line that best represents/fits some given data points.
Sample Dataset
Below are 10 pairs of randomly generated integers. Feel free to copy them if you want to follow along and replicate the results we show here.
x = [12,16,71,99,45,27,80,58,4,50]
y = [56,22,37,78,83,55,70,94,12,40]
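If you want to see what these points look like before we go any further, here is a minimal Python sketch using matplotlib (this is just an optional visual aid; the later tutorials cover the actual calculations):

import matplotlib.pyplot as plt

# Sample data copied from above
x = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]
y = [56, 22, 37, 78, 83, 55, 70, 94, 12, 40]

# Scatter plot of the 10 sample data points
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Sample data points")
plt.show()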
What Is a Least Squares Linear Regression?
Essentially, we want to find a line that is as close as possible to all the data points. We can “eyeball” it and approximate a line, but that’s not going to be accurate. Therefore, we use a method called “ordinary least squares”.
Take a look at the picture above, and let’s say we have an imaginary line (in red). The vertical distance from each data point to the line is called an error or residual. We calculate the error/residual for each data point, square each of them, and then add them all up; this total is called the “sum of squared residuals”.
We could draw many different lines and calculate the sum of squared residuals for each one. Our ultimate goal is to find the line that gives the minimum sum of squared residuals out of all possible lines.
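To make this concrete, here is a minimal Python sketch that computes the sum of squared residuals for two arbitrary candidate lines using the sample data above (the b and w values are just guesses for illustration, not the fitted values):

x = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]
y = [56, 22, 37, 78, 83, 55, 70, 94, 12, 40]

def sum_squared_residuals(b, w, x, y):
    # For each point, take the vertical distance to the line y = b + w*x,
    # square it, and add all the squared distances up
    return sum((yi - (b + w * xi)) ** 2 for xi, yi in zip(x, y))

# Two arbitrary candidate lines: the one with the smaller result fits better
print(sum_squared_residuals(b=30, w=0.5, x=x, y=y))
print(sum_squared_residuals(b=10, w=1.0, x=x, y=y))

Whichever candidate line produces the smaller total is the better fit; ordinary least squares finds the b and w that make this total as small as possible.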
A Linear Equation
In this example, we are dealing with just one variable – x. Let’s use an equation to formulate the relationship between x and y:
y = b + wx
- x and y values are self-explanatory; they are just the x and y coordinates of the given data points
- b is a constant term, also known as the “intercept”: it is the value of y where the line crosses the y-axis (i.e. when x = 0)
- w is another constant term, known as the “slope”, which controls how steep the line is
Our ultimate goal is to find the b and w values such that the line (y = b + wx) will provide the smallest value of the sum of squared residuals.
I won’t bore you with the mathematical details here. Below are the formulas for the b and w values that minimize the sum of squared residuals for the line y = b + wx:

w = (N × Σ(xy) - Σx × Σy) / (N × Σ(x²) - (Σx)²)

b = (Σy - w × Σx) / N
- N means the number of data point pairs, which is 10 in our example
- Σ is a Greek symbol meaning “sum”. Σ(xy) means “the sum of x times y” across all data points.
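If you would like a quick preview of how these formulas play out in code, here is a minimal Python sketch that plugs the sample data into them (the dedicated Excel and Python tutorials below walk through the calculation in more detail):

x = [12, 16, 71, 99, 45, 27, 80, 58, 4, 50]
y = [56, 22, 37, 78, 83, 55, 70, 94, 12, 40]

N = len(x)                                     # number of data point pairs
sum_x = sum(x)                                 # Σx
sum_y = sum(y)                                 # Σy
sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # Σ(xy)
sum_x2 = sum(xi ** 2 for xi in x)              # Σ(x²)

# Slope and intercept that minimize the sum of squared residuals
w = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
b = (sum_y - w * sum_x) / N

print("w (slope):", w)
print("b (intercept):", b)

You can sanity-check the output against the results in the Excel and Python tutorials linked below.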
Don’t worry if this still looks confusing; we are going to do the calculation in both Excel and Python:
Least Squares Linear Regression In Excel
Least Squares Linear Regression In Python