## Blog: Understanding the Simple Maths Behind Simple Linear Regression

Not a lot of people like Maths, and for good reason. I’m not exactly fond of it either, but I try to keep fresh with the basics: Algebra, Line Graphs, Trig, Pre-calculus, etc. Thanks to platforms like Khan Academy, learning Maths can be fun.

This article is for anyone interested in Machine Learning, ideally beginners new to the Supervised Learning technique *Regression*.

Some may argue that Data Science and ML can be done without the Maths. I’m not here to refute that premise, but one needs to find time to look beneath the hood of the tools and abstractions we use daily, to build better intuition for how they work.

Linear Regression, as we already know, refers to the use of one or more independent variables to predict a dependent variable. The dependent variable must be continuous, such as Co2_Emissions, the age or salary of a worker, tomorrow’s temperature, etc., while the independent variables may be continuous or categorical.

We shall concentrate on Simple Linear Regression **(SLR)** in this article. SLR is arguably the most intuitive and ubiquitous Machine Learning Algorithm out there.

In Machine Learning, a model can be thought of as a mathematical equation used to predict a value, given one or more other values.

Usually the more relevant data you have, the more accurate your model is.

The image above depicts a Simple Linear Regression **(SLR)** Model. It is called Simple Linear Regression because only one feature or independent variable is used to predict a given label or target. In this case, only Engine_Size is used to predict Co2_Emissions. If we had more than one predictor, we’d refer to it as *Multiple Linear Regression (MLR)*.

The **red line** in the image above represents the model. It is the straight line that best fits the data. Thus the model is a mathematical equation that tries to predict Co2_Emissions (the dependent variable), given the Engine_Size (the independent variable).

The aim of this article is to build a better intuition for SLR, to make us more comfortable with the concept and its internal workings. It’s just simple Maths; anyone can figure it out. One effective way to start is from the known to the unknown… So let’s go back to High school for a second.

*y = mx + b*

The Slope-Intercept form *(y = mx + b)* is a linear equation that applies directly, in form, to Simple Linear Regression.

*y = The value on the y-axis*

*m = The Slope or gradient of the line (change in y / change in x)*

*x = The value on the x-axis*

*b = The y-intercept or the value of y when x is 0*

A linear equation is an equation wherein, if we plot all the values for x and y, the plot will be a perfect straight line on the coordinate plane.

Therefore the Slope-Intercept form states that for any straight line on the coordinate plane, the value of *y* is the product of the slope *m* and the value of *x*, plus the y-intercept *b*:

*y = mx + b*

Okay, back to Simple Linear Regression… The **SLR** model is identical to the Slope-Intercept form equation we saw above; the only difference is notation. We denote the label or dependent variable that we want to predict as *y*, represent our weights or model parameters as *w* and *b* (in place of *m* and *b*), and our independent variable or feature as *x*.

*In Simple Linear Regression:*

*y = wx + b*

*Which is the same as:*

*y = b + wx*

Which is the same as:

*y = b0 + b1x1*

*Where:-*

*y = The dependent or target variable (aka the prediction, or y_hat)*

*x = The independent or predictor variable (aka x1)*

*b0 = The y-intercept (aka the bias unit)*

*b1 = The Slope or Gradient of the Regression Line*

And just like the Slope-Intercept form *(y = mx + b)*, as long as the independent variable *(x)* and the dependent variable *(y)* have a Linear relationship, whether the relationship is positive or negative, we can always predict *y* given the weights *(b0 and b1)*.

### The Most Fundamental Questions are:-

1. How can we tell if an independent variable has a Linear relationship, whether positive or negative, with the dependent variable that we want to predict?

Because in Linear Regression *(whether Simple or Multiple)*, there **must** be a Linear relationship between the independent or predictor variable(s) and the dependent or target variable.

2. How can we choose the best line for our SLR model? In other words, how can we find the ideal values for b0 and b1 such that they produce the best prediction line, given our independent and dependent variables?

*How to verify if a Linear Relationship exists between two variables.*

Ever heard the term **‘Correlation’**? Correlation measures the strength and direction of the relationship between two variables.

In other words, correlation tries to tell us if a change in one variable affects the other, or could cause a change in the other variable, and to what extent.

For example, if an increase in the Engine_Size of a car likely leads to some increase in Co2_Emissions, then they are positively correlated. But if an increase in COMB_(mpg) likely leads to some decrease in Co2_Emissions, then COMB_(mpg) and Co2_Emissions are negatively correlated. If no likely relationship exists, then Statisticians say there is a weak correlation between them.

Correlation produces a number between -1 and 1. If the number is close to -1, it denotes a strong negative relationship; if it’s close to 1, it denotes a strong positive relationship; and if it’s around 0, it denotes a weak relationship between the variables.

If the correlation has an absolute value above 0.6, it suggests a Linear relationship exists between the variables.

The Linear relationship could be negative (if the correlation is negative) or positive.

Correlation, or Pearson’s correlation, is denoted by the symbol *r*.
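As a quick sketch of what *r* looks like in practice, here is Pearson’s correlation computed with NumPy on a tiny made-up sample (the values below are illustrative, not from the article’s data set):

```python
import numpy as np

# Toy sample: engine sizes (litres) and Co2 emissions (g/km) that rise together
engine_size = np.array([1.6, 2.0, 2.4, 3.0, 3.5, 4.6])
co2 = np.array([186, 221, 255, 288, 300, 356])

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
r = np.corrcoef(engine_size, co2)[0, 1]
print(round(r, 2))
```

Since the two toy variables increase together almost linearly, *r* comes out close to 1.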

‘Remember, Correlation does not imply Causation…’

### Let’s Play With Some Real Data…

We shall use the *Fuel consumption ratings* data set for car sales in Canada **(Original Fuel Consumption Ratings 2000–2014)**. But frankly, any popular data set for Regression analysis will suffice.

The data set has been downloaded to Google Drive, so let’s import it into Colab.

#### A little EDA to get a feel of the Data set…

Let’s confirm the shape and column data types, and check whether NaN values exist.
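The loading and EDA steps could look roughly like this. The Drive path is a placeholder, and the column names are taken from the article; so the sketch stays runnable anywhere, it reads a tiny inline sample instead of the real file:

```python
import io
import pandas as pd

# In Colab the real file would be read from Google Drive, e.g.:
#   df = pd.read_csv('/content/drive/MyDrive/fuel_consumption.csv')
# (path and file name above are placeholders, not the article's actual path)

# Tiny inline stand-in with the columns used in this article:
sample = io.StringIO(
    "ENGINE_SIZE(L),FUEL_CONS_CITY,COMB_(mpg),CO2_EMISSIONS\n"
    "1.6,9.2,33,186\n"
    "2.4,11.5,27,255\n"
    "3.5,14.1,22,300\n"
)
df = pd.read_csv(sample)

print(df.shape)         # (rows, columns)
print(df.dtypes)        # column data types
print(df.isna().sum())  # count of NaN values per column
```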

### Correlation

For **SLR**, we want to predict Co2_Emissions *(the dependent variable)* using only one feature from the data set. Let’s view the correlations between the variables in our data set, so we can pick a strong independent variable.
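One way to sketch this step (on a small illustrative frame rather than the full data set; the column names are assumed from the article) is to pull out each column’s correlation with the target and sort by absolute strength:

```python
import pandas as pd

# Illustrative values only; the real data set has 14343 rows
df = pd.DataFrame({
    "ENGINE_SIZE(L)": [1.6, 2.0, 2.4, 3.0, 3.5, 4.6],
    "COMB_(mpg)":     [33, 30, 27, 24, 22, 18],
    "CO2_EMISSIONS":  [186, 221, 255, 288, 300, 356],
})

# Correlation of every numeric column with the target, strongest first
corr_with_target = df.corr()["CO2_EMISSIONS"].sort_values(key=abs, ascending=False)
print(corr_with_target)
```

Note that COMB_(mpg) shows up with a negative *r*, matching the negative relationship described earlier.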

Clearly, the variables with the strongest correlations (in absolute value) with Co2_Emissions are: *Fuel_Cons_City (0.92)*, *COMB_(mpg) (0.92)* and *Engine_Size (0.83)*. Let’s visualize each relationship.

All three variables have a strong Linear Relationship with CO2_Emissions (positive for Fuel_Cons_City and Engine_Size, negative for COMB_(mpg), as we noted earlier). I choose ENGINE_SIZE(L) as my independent variable for this exercise. You’re free to choose any one of them.

Now it’s time to answer the second fundamental question.

The best values for the parameters b0 and b1 are the values that minimize the Mean Squared Error (MSE).

#### What is The MSE?

The Mean Squared Error is simply the sum of the squared differences between each predicted value and the corresponding actual value, divided by the total number of observations.
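That definition translates directly into a few lines of Python (a minimal sketch, with made-up numbers in the example call):

```python
def mse(y_actual, y_pred):
    """Mean Squared Error: average of the squared prediction errors."""
    n = len(y_actual)
    return sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_pred)) / n

# e.g. actual [3, 5], predicted [2, 7] -> ((1)^2 + (-2)^2) / 2 = 2.5
print(mse([3, 5], [2, 7]))  # 2.5
```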

Here, an **observation** is a specific row of data. An *example* is simply a pair of a given Engine_Size value and its corresponding CO2_Emissions value. Remember, the data set contains 14343 examples / observations.

So how can we find the ideal values for b0 and b1 that would produce the least MSE for our SLR Model?

Cheesy… First, we find the value of *b1 (the slope)* using a simple Mathematical formula. Then we substitute *b1* into the SLR equation *(y = b0 + b1x1)* to find the value of *b0 (the intercept or bias unit)*… And that’s it.

The **Slope formula** for calculating *b1* in Simple Linear Regression is:

*b1 = Σ (xi − x_bar)(yi − y_bar) / Σ (xi − x_bar)² , summing from i = 1 to n*

where:

- *n* is the total number of observations
- *x* is the independent variable, Engine_Size, an *n × 1* column vector
- *y* is the dependent variable, Co2_Emissions, an *n × 1* column vector
- *i = 1* refers to the first observation in the data set (**Note**: *i* goes from 1 up to *n*)
- *xi* is the *i*th observation of *x*
- *x_bar* is the mean or average of *x*
- *yi* is the *i*th observation of *y*
- *y_bar* is the mean or average of *y*

#### In Summary

To find b1, we divide the Numerator of the slope formula by the Denominator.

The Numerator is simply the sum, from i equals 1 through n, of each value of xi minus x_bar, multiplied by the corresponding value of yi minus y_bar.

The Denominator is simply the sum, from i equals 1 through n, of the squared differences between each value of xi and x_bar.

### Let’s solve for b1 or slope using The Slope Formula.

First let’s define our variables: *x* and *x_bar*, as well as *y* and *y_bar*.

Next let’s define a simple function that takes the variables and returns *b1*.
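A minimal version of that function could look like this (checked here against a made-up, perfectly linear toy set rather than the real data):

```python
def slope_b1(x, y):
    """b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)"""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator

# Sanity check on a toy set where y = 2x + 1, so b1 should be exactly 2
b1_toy = slope_b1([1, 2, 3, 4], [3, 5, 7, 9])
print(b1_toy)  # 2.0
```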

### Next, Let’s Substitute the Value of b1 into the SLR Equation

Remember that the SLR equation *(y = b0 + b1x1)* is identical to the slope-intercept equation *(y = b + mx)*.

Therefore, once we know the slope *(b1)*, we can substitute a pair of *x* and *y* values that lie on the line into the SLR equation to get the *y_intercept* or *b0*. Let’s use the average values *x_bar* and *y_bar*, since the least-squares regression line always passes through the point of means.

*y = b0 + b1x1*

*Therefore: y_bar = b0 + b1(x_bar)*

**Let’s solve for b0**

*Rearranging: b0 + b1(x_bar) = y_bar*

*Finally, solving for b0, it can be written as: b0 = y_bar − b1(x_bar)*

Let’s input the values of *y_bar*, *b1* and *x_bar* to get the value of *b0*.
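In code, that rearranged formula is one line (again checked on a made-up toy set where y = 2x + 1, not the real data):

```python
def intercept_b0(x, y, b1):
    """b0 = y_bar - b1 * x_bar (the line passes through the point of means)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    return y_bar - b1 * x_bar

# Toy set y = 2x + 1: with b1 = 2.0, b0 should come out as exactly 1.0
b0_toy = intercept_b0([1, 2, 3, 4], [3, 5, 7, 9], 2.0)
print(b0_toy)  # 1.0
```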

#### In Summary

Solving mathematically, we can see that the ideal values for the model parameters b0 and b1, to give us the best linear fit for our Model, are **119 (b0)** and **37.28 (b1)**.

Thus our mathematical SLR model is: *y_hat = 119 + 37.28(x1)*

This means if we want to predict the Co2 emission for a car with an Engine size of 13.5 litres, our unknown prediction is *y_hat* and our predictor *x1* is 13.5. So all we need to do is substitute 13.5 into the equation to find *y_hat*.

*y_hat = 119 + 37.28 × 13.5*

**y_hat = 622** (rounded to a whole number)
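The same substitution as code, using the parameters derived above:

```python
b0, b1 = 119, 37.28  # parameters derived in the article

def predict_co2(engine_size):
    """SLR model: y_hat = b0 + b1 * x1"""
    return b0 + b1 * engine_size

y_hat = predict_co2(13.5)
print(round(y_hat))  # 622
```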

### Let’s Compare our Maths Model to an Sklearn Model

First, import LinearRegression from sklearn and create a model.

Next, we reshape the independent **(X)** and dependent **(Y)** variables into 2D arrays. Then we fit/train the model with the *X* and *Y* data… And finally we print out the slope (*model.coef_*) and the intercept (*model.intercept_*) values.
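Those steps could be sketched as follows. Since the article’s data set isn’t bundled here, the sketch fits on a made-up linear toy set; on the real data, X would be ENGINE_SIZE(L) and Y would be CO2_EMISSIONS:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (y = 2x + 1); reshape(-1, 1) makes each a 2D (n_samples, 1) array
X = np.array([1, 2, 3, 4]).reshape(-1, 1)
Y = np.array([3, 5, 7, 9]).reshape(-1, 1)

model = LinearRegression()
model.fit(X, Y)

print(model.coef_)       # slope b1
print(model.intercept_)  # intercept b0
```

On this toy set, the fitted slope and intercept recover 2 and 1, just as the hand-derived slope and intercept formulas do.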

Both the Maths and Sklearn Models have exactly the same parameters *b0* and *b1*, giving credence to the fact that one can solve SLR intuitively without the libraries, especially if it’s a small or medium data set.

### Finally,

#### Evaluation…

How are the Models performing?

Let’s compare the ** RMSE** for both The Maths and Sklearn Models.

The Root-Mean-Squared-Error is the square root of the **MSE** and is an ideal metric for measuring the performance of a Linear Regression model.

RMSE can be interpreted right on the same scale as the dependent or target variable and gives a good indicator of how well our model performs. Simply find the range of the target variable and compare the RMSE to it. The lower the RMSE as a percentage of the range, the better the model performance.

Both Models have an **MSE of 1110** and **RMSE of 33**.
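A small sketch of both calculations, RMSE and RMSE as a percentage of the target’s range, using made-up numbers rather than the article’s predictions:

```python
import math

def rmse(y_actual, y_pred):
    """Root Mean Squared Error: sqrt of the average squared error."""
    n = len(y_actual)
    mse = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_pred)) / n
    return math.sqrt(mse)

def rmse_pct_of_range(y_actual, y_pred):
    """RMSE expressed as a percentage of the target variable's range."""
    return rmse(y_actual, y_pred) / (max(y_actual) - min(y_actual)) * 100

# Toy values: every prediction is off by exactly 10
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 390]
print(rmse(y_true, y_pred))              # 10.0
print(rmse_pct_of_range(y_true, y_pred))
```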

But what does an **RMSE of 33** really mean… What can it tell us?

To get a sense of the **RMSE**, let’s express it as a percentage of the range of the dependent variable. The lower the percentage, the better the model.

With an **RMSE of 33**, our model error is within **7%** of the *range (487)* of the dependent variable (CO2_Emissions)… Which means our SLR Model, as simple as it is, is doing well.

#### Let’s See a Plot of both the Maths and Sklearn Models.
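Since both models ended up with identical parameters, their fitted lines coincide, so a single red line over the scatter represents them both. A matplotlib sketch (with toy stand-in data, and the article’s b0 and b1) might look like:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Toy stand-ins for the observations; b0/b1 are the article's fitted values
x = np.array([1.6, 2.0, 2.4, 3.0, 3.5, 4.6])
y = np.array([186, 221, 255, 288, 300, 356])
b0, b1 = 119, 37.28

fig, ax = plt.subplots()
ax.scatter(x, y, label="observations")
ax.plot(x, b0 + b1 * x, color="red", label="SLR model (Maths = Sklearn)")
ax.set_xlabel("ENGINE_SIZE(L)")
ax.set_ylabel("CO2_EMISSIONS")
ax.legend()
fig.savefig("slr_fit.png")
```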

### Conclusion

I hope I have been able to show you how Simple Linear Regression works, and how Statistics and simple Maths drive this concept.

As you keep on learning, spend more time practicing and applying the concepts.

Cheers.

*Source: Artificial Intelligence on Medium*