Blog: Understanding the Simple Maths behind Simple Linear Regression
Not a lot of people like Maths, and for good reason. I’m not exactly fond of it, but I try to keep fresh on the basics: algebra, line graphs, trig, pre-calculus, etc. Thanks to platforms like Khan Academy, learning Maths can be fun.
This article is for anyone interested in Machine Learning, ideally beginners who are new to the Supervised Learning technique called Regression.
Some may argue that Data Science and ML can be done without the Maths. I’m not here to refute that premise, but one needs to find time to look beneath the hood of the tools and abstractions we use daily, to build a better intuition for how they work.
Linear Regression, as we already know, refers to the use of one or more independent variables to predict a dependent variable. The dependent variable must be continuous, such as Co2_Emissions, the age or salaries of workers, or tomorrow’s temperature, while the independent variables may be continuous or categorical.
We shall concentrate on Simple Linear Regression (SLR) in this article. SLR is arguably the most intuitive and ubiquitous Machine Learning algorithm out there.
In Machine Learning, a model can be thought of as a mathematical equation used to predict a value, given one or more other values.
Usually the more relevant data you have, the more accurate your model is.
The image above depicts a Simple Linear Regression (SLR) model. It is called Simple Linear Regression because only one feature or independent variable is used to predict a given label or target. In this case, only Engine_Size is used to predict Co2_Emissions. If we had more than one predictor, then we’d refer to it as Multiple Linear Regression (MLR).
The red line in the image above represents the model. It is a straight line that best fits the data. Thus the model is a mathematical equation that tries to predict the Co2_Emissions (dependent variable), given the Engine_size(independent variable).
The aim of this article is to build a better intuition for SLR, to make us more comfortable with the concept and its internal workings. It’s just simple Maths. Anyone can figure it out. One effective way to start is from the known to the unknown… So let’s go back to high school for a second.
y = mx + b
The Slope-Intercept form (y = mx + b) is a linear equation that applies directly, in form, to Simple Linear Regression.
y = The value on the y-axis
m = The slope or gradient of the line (change in y / change in x)
x = The value on the x-axis
b = The y-intercept or the value of y when x is 0
A linear equation is an equation wherein if we plot all the values for x and y, the plot will be a perfect straight line on the coordinate plane.
Therefore the Slope-Intercept form states that for any straight line on the coordinate plane, the value of y is the product of the slope of the line m, and the value of x plus the y-intercept of the line b.
y = mx + b
Okay, back to Simple Linear Regression… The SLR model is identical in form to the Slope-Intercept equation we saw above. The only difference is notation: we denote the label or dependent variable we want to predict as y, our weights or model parameters as w and b, and our independent variable or feature as x.
In Simple Linear Regression:
y = wx + b
Which is same as:
y = b + wx
Which is same as:
y = b0 + b1x1
y = The dependent or target variable (aka the prediction, or y_hat).
x = The independent or predictor variable (aka x1).
b0 = The y-intercept (aka the bias unit).
b1 = The slope or gradient of the regression line.
And just like the Slope-Intercept form (y = mx + b), as long as the independent variable (x) and the dependent variable (y) have a linear relationship, whether positive or negative, we can always predict y given the weights (b0 and b1).
The most fundamental questions are:
1. How can we tell if an independent variable has a linear relationship, whether positive or negative, with the dependent variable that we want to predict?
Because in Linear Regression(whether Simple or Multiple), there must be a Linear relationship between the independent or predictor variable(s) and the dependent or target variable.
2. How can we choose the best line for our SLR model? In other words, how can we find the ideal values for b0 and b1 such that they produce the best prediction line, given our independent and dependent variables?
How to verify if a Linear Relationship exists between two variables.
Ever heard the term ‘Correlation’? Correlation measures the strength and direction of the relationship between two variables. In other words, it tries to tell us whether a change in one variable affects, or could cause a change in, the other variable, and to what extent.
For example, if an increase in Engine_Size of a car likely leads to some increase in Co2_Emissions, then they are positively correlated. But if an increase in COMB_(mpg) likely leads to some decrease in Co2_Emissions, then COMB_(mpg) and Co2_Emissions are negatively correlated. If no likely relationship exists, then statisticians say there is a weak correlation between them.
Correlation produces a number between -1 and 1. If the number is close to -1 it denotes a strong negative relationship, if it’s close to 1, it denotes a strong positive relationship, and if it’s just around 0, it denotes a weak relationship between the variables.
As a rule of thumb, a correlation with an absolute value above 0.6 suggests a usable linear relationship exists between the variables.
The Linear relationship could be negative(if the correlation is negative) or positive.
Correlation, or more precisely Pearson’s correlation coefficient, is denoted by the symbol r.
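Pearson’s r can be computed by hand in a few lines of NumPy. The engine sizes and emissions below are made-up numbers purely for illustration, not values from the data set:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's correlation coefficient between two 1-D arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up engine sizes and emissions, roughly positively related
engine_size = [1.6, 2.0, 2.4, 3.0, 3.5, 4.2]
co2 = [180, 205, 230, 255, 280, 310]

r = pearson_r(engine_size, co2)
print(round(r, 3))  # close to 1: a strong positive linear relationship
```

The same number comes out of `np.corrcoef(engine_size, co2)[0, 1]`, so the helper is only there to make the formula visible.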
‘ Remember Correlation does not imply Causation… ’
Let’s Play With Some Real Data…
We shall use the Fuel consumption ratings data set for car sales in Canada. (Original Fuel Consumption Ratings 2000–2014). But frankly, any popular data set for Regression analysis will suffice.
The data set has been downloaded to Google Drive, so let’s import it into Colab.
A little EDA to get a feel of the Data set…
Let’s confirm the shape and column data types, and check if NaN values exist.
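A sketch of that EDA step. In Colab the real file would be read from a mounted Drive; the path in the comment is a placeholder, and the tiny frame below is a made-up stand-in for the actual data set:

```python
import pandas as pd
import numpy as np

# With the real data set, something like this would be used in Colab:
# df = pd.read_csv("/content/drive/MyDrive/fuel_consumption.csv")
# Here a tiny made-up frame stands in (one NaN planted on purpose):
df = pd.DataFrame({
    "ENGINE_SIZE(L)": [1.6, 2.0, 2.4, 3.0, np.nan],
    "CO2_EMISSIONS": [180, 205, 230, 255, 280],
})

print(df.shape)          # (rows, columns)
print(df.dtypes)         # column data types
print(df.isna().sum())   # count of NaN values per column
```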
For SLR, we want to predict Co2_Emissions(dependent variable) using only one feature or variable from the data set. Let’s view the correlation of variables in our Data set, so we can pick a strong independent variable.
Clearly, the variables with the strongest correlations (in absolute value) with Co2_Emissions are: Fuel_Cons_City (0.92), COMB_(mpg) (0.92, negative) and Engine_Size (0.83). Let’s visualize each relationship.
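The correlation check itself is a one-liner with pandas. The DataFrame below is a tiny made-up stand-in whose column names merely mimic the post, not the real 14343-row data set:

```python
import pandas as pd

# Made-up stand-in for the fuel consumption data
df = pd.DataFrame({
    "ENGINE_SIZE(L)": [1.6, 2.0, 2.4, 3.0, 3.5, 4.2],
    "COMB_(mpg)":     [34, 30, 27, 24, 22, 19],
    "CO2_EMISSIONS":  [180, 205, 230, 255, 280, 310],
})

# Correlation of every numeric column with the target, strongest first
corr_with_target = df.corr()["CO2_EMISSIONS"].sort_values(ascending=False)
print(corr_with_target)
```

Note that COMB_(mpg) comes out strongly negative here, matching the intuition that better fuel economy means lower emissions.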
Fuel_Cons_City and ENGINE_SIZE(L) have a strong positive linear relationship with CO2_Emissions, while COMB_(mpg) has a strong negative one. I choose ENGINE_SIZE(L) as my independent variable for this exercise. You’re free to choose any one of them.
Now it’s time to answer the second fundamental question.
The best values for the parameters b0 and b1 are those that minimize the Mean Squared Error (MSE).
What is The MSE?
The Mean Squared Error is simply the sum of the squared differences between each predicted value and each actual value, divided by the total number of observations.
Here, an observation is a specific row of data. An example is simply a pair of a given Engine_Size value and its corresponding CO2_Emissions value. Remember, the data set contains 14343 examples / observations.
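A minimal MSE helper, shown on made-up numbers rather than the actual data set:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return ((y_true - y_pred) ** 2).mean()

# One error of 1 (squared: 1), averaged over 3 observations
print(mse([1, 2, 3], [1, 2, 4]))
```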
So how can we find the ideal values for b0 and b1 that would produce the least MSE for our SLR Model?
Easy… First, we find the value of b1 (the slope) using a simple mathematical formula. Then we substitute b1 into the SLR equation (y = b0 + b1x1) to find the value of b0 (the intercept or bias unit)… And that’s it.
The slope formula for calculating b1 (the slope) in Simple Linear Regression is:
b1 = Σ (xi − x_bar)(yi − y_bar) / Σ (xi − x_bar)²
with both sums running from i = 1 to n, where:
n (is the total number of observations)
x (The independent variable, Engine Size, which is an n * 1 column vector)
y (The dependent variable, Co2_Emissions, which is an n * 1 column vector)
i = 1 (refers to first observation in the data set. Note: i goes from 1 up to n)
xi (The ith observation of x)
x_bar (The mean or average of x)
yi (The ith observation of y)
y_bar (The mean or average of y)
To find b1, we divide the Numerator of the slope formula by the Denominator.
The Numerator is simply the sum from i equals 1 through n for each value of xi minus x_bar, multiplied by the corresponding value of yi minus y_bar.
The Denominator is simply the sum from i equals 1 through n of the squared differences between each value of xi and x_bar.
Let’s solve for b1 or slope using The Slope Formula.
First, let’s define our variables: x and x_bar, as well as y and y_bar.
Next, let’s define a simple function that takes the variables and returns b1.
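Such a function might look like the sketch below; the toy arrays stand in for the real Engine_Size and Co2_Emissions columns:

```python
import numpy as np

def slope_b1(x, y):
    """b1 = sum((xi - x_bar)*(yi - y_bar)) / sum((xi - x_bar)**2)"""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_bar, y_bar = x.mean(), y.mean()
    return ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()

# Sanity check on a perfectly linear toy set (y = 2x), where the slope is 2
print(slope_b1([1, 2, 3], [2, 4, 6]))  # → 2.0
```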
Next, let’s substitute the value of b1 into the SLR equation.
Remember that the SLR equation (y = b0 + b1x1) is identical to the slope-intercept equation (y = b + mx).
Therefore we can use any given value of y and x that we know and substitute the slope (b1) into the SLR equation, to get the y_intercept or b0.
Let’s use the average values x_bar and y_bar.
y = b0 + b1x1
y_bar = b0 + b1(x_bar)
Let’s solve for b0
b0 + b1(x_bar) = y_bar
Finally solving for b0, it can be written as:
b0 = y_bar – b1(x_bar)
Let’s input the values of y_bar, b1 and x_bar to get the value of b0.
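Putting both steps together on a toy data set (y = 1 + 2x, so we expect b1 = 2 and b0 = 1; the real x and y would be the Engine_Size and Co2_Emissions columns):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # toy stand-in for Engine_Size
y = np.array([3.0, 5.0, 7.0])   # toy stand-in for Co2_Emissions (y = 1 + 2x)

x_bar, y_bar = x.mean(), y.mean()
b1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()

# The regression line always passes through (x_bar, y_bar),
# so substituting those averages gives the intercept:
b0 = y_bar - b1 * x_bar
print(b0, b1)  # → 1.0 2.0
```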
Solving mathematically, we can see that the ideal values for the model parameters are b0 = 119 and b1 = 37.28, giving us the best linear fit for our model.
Thus our mathematical SLR model is y_hat = 119 + 37.28(x1)
This means that if we want to predict the Co2 emissions of a car with an engine size of 13.5 litres, our unknown prediction is y_hat and our predictor x1 is 13.5. All we need to do is substitute 13.5 into the equation to find y_hat.
y_hat = 119 + 37.28 * 13.5
y_hat = 622.28, or 622 rounded to a whole number
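The same substitution in code, using the fitted parameter values from the post (`predict` is just an illustrative helper name):

```python
# Fitted parameters from the post: y_hat = 119 + 37.28 * x1
b0, b1 = 119, 37.28

def predict(engine_size_litres):
    """Predict Co2_Emissions from engine size using the SLR equation."""
    return b0 + b1 * engine_size_litres

y_hat = predict(13.5)
print(round(y_hat))  # → 622
```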
Let’s Compare our Maths Model to an Sklearn Model
First, import LinearRegression from sklearn and create a model.
Next, we shape the independent (X) and dependent (Y) variables as 2D arrays. Then we fit/train the model with the X and Y data… And finally we print out the slope (model.coef_) and the intercept (model.intercept_) values.
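A sketch of those steps on toy stand-in data (with the real data set, X would hold the ENGINE_SIZE(L) column and y the CO2_EMISSIONS column):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in data (y = 1 + 2x)
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # sklearn expects a 2-D X
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression()
model.fit(X, y)
print(model.coef_[0], model.intercept_)  # slope (b1) and intercept (b0)
```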
Both the Maths and Sklearn models have exactly the same parameters b0 and b1, giving credence to the fact that one can solve SLR by hand without the libraries, especially for a small or medium data set.
How are the models performing?
Let’s compare the RMSE for both The Maths and Sklearn Models.
The Root Mean Squared Error is the square root of the MSE and is a natural metric for measuring the performance of a Linear Regression model.
RMSE can be interpreted right on the same scale as the dependent or target variable and gives a good indicator of how well our model performs. Simply find the range of the target variable and compare the RMSE to it. The lower the RMSE as a percentage of the range, the better the model performance.
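An RMSE helper, shown on made-up numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(((y_true - y_pred) ** 2).mean())

# Two errors of 10 and one perfect prediction
print(rmse([100, 200, 300], [110, 190, 300]))
```

Because of the square root, the result lands back on the same scale as the target variable, which is what makes the range comparison below meaningful.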
Both Models have an MSE of 1110 and RMSE of 33.
But, what does an RMSE of 33 really mean… What can it tell us?
To get the meaning of the RMSE, let’s make it a percentage of the range of the dependent variable. The lower the percent, the better the model.
With an RMSE of 33, our model’s error is within about 7% of the range (487) of the dependent variable (CO2_Emissions)… which means our SLR model, as simple as it is, is doing well.
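The percentage calculation itself, using the figures from the post:

```python
# Numbers from the post: RMSE of 33 and a target range of 487
rmse_value = 33.0
y_range = 487.0  # max(CO2_EMISSIONS) - min(CO2_EMISSIONS)

pct_of_range = rmse_value / y_range * 100
print(round(pct_of_range, 1))  # → 6.8, i.e. roughly 7% of the range
```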
Let’s see a plot of both the Maths and Sklearn models.
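A sketch of such a plot with matplotlib, on toy stand-in data (the Agg backend is used so the script also runs outside a notebook; axis labels borrow the post’s column names):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt
import numpy as np

# Toy stand-in data, plus the line y = b0 + b1*x fitted to it by hand
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

fig, ax = plt.subplots()
ax.scatter(x, y, label="data")
ax.plot(x, b0 + b1 * x, color="red", label="regression line")
ax.set_xlabel("ENGINE_SIZE(L)")
ax.set_ylabel("CO2_EMISSIONS")
ax.legend()
fig.savefig("slr_fit.png")
```

An sklearn fit of the same data would lay its line exactly on top of the red one, since both minimize the same MSE.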
I hope I have been able to show you how Simple Linear Regression works… How Statistics and simple Maths drive this concept.
As you keep on learning, spend more time practicing and applying the concepts.