Regression is the process of finding the relationship between a dependent variable and one or more independent variables. It is used when the dependent variable is continuous.

Wikipedia: In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

Put more simply, Linear Regression is the process of finding the relationship between the Target Variable and one or more Predictor Variables using a linear approach. There are two types of Linear Regression: Simple Linear Regression and Multiple Linear Regression.

Simple Linear Regression

Simple Linear Regression is a method for finding the relationship between two continuous variables, where one is the dependent variable and the other is the independent variable. The dependent variable is the target variable and the independent variable is the predictor variable. The method relies on a statistical relationship between the two variables, so they should be correlated for the target to be predicted with the help of the predictor. Because the relationship is statistical rather than deterministic, the predictions are estimates and will not match the observed values exactly. For example, the relationship between rainfall and crop yield can be modeled using Simple Linear Regression.
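As a rough illustration of the rainfall and crop yield example, here is a minimal sketch that fits a straight line with NumPy. The numbers are made up for the example and are not real measurements, and the variable and function names are hypothetical.

```python
import numpy as np

# Hypothetical, made-up data for illustration only (not real measurements).
rainfall_mm = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
crop_yield_t = np.array([2.0, 2.9, 4.1, 5.0, 6.0])

# Fit a degree-1 polynomial (a straight line).
# np.polyfit returns the coefficients from highest degree down: [slope, intercept].
slope, intercept = np.polyfit(rainfall_mm, crop_yield_t, deg=1)


def predict_yield(rain_mm):
    """Predict crop yield from rainfall using the fitted line."""
    return intercept + slope * rain_mm


print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"predicted yield at 220 mm: {predict_yield(220.0):.2f}")
```

The prediction is only an estimate: the observed yields do not sit exactly on the fitted line, which is the point made above about statistical relationships.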

Multiple Linear Regression

Multiple Linear Regression is a method for finding the relationship between one continuous target variable and two or more predictor variables. The predictor variables can be continuous or discrete, but the target variable should be continuous. In the real world we use Multiple Linear Regression more often than Simple Linear Regression, because most targets depend on more than one predictor. For example, the selling price of a house depends on many factors: the selling price is the target variable, and the predictor variables are location, size, number of bedrooms, furnished or unfurnished, and so on. To predict the target, we need to find the relationship between each predictor and the target variable.
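The house price example can be sketched with NumPy's least-squares solver. Everything here is an assumption made for illustration: the houses are invented, and the prices are generated from a simple rule chosen for this example so that it is easy to see the fit recover the coefficients. Location is left out because a categorical predictor would first need to be encoded as numbers.

```python
import numpy as np

# Hypothetical houses: columns are size (m^2), bedrooms, furnished (1 = yes, 0 = no).
X = np.array([
    [50.0, 1, 0],
    [70.0, 2, 0],
    [80.0, 2, 1],
    [100.0, 3, 0],
    [120.0, 3, 1],
    [150.0, 4, 1],
])

# Made-up prices (in thousands), generated from a rule chosen for this example:
# price = 20 + 1.5 * size + 10 * bedrooms + 5 * furnished
price = 20.0 + 1.5 * X[:, 0] + 10.0 * X[:, 1] + 5.0 * X[:, 2]

# Prepend a column of ones so the first coefficient is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem for [intercept, size, bedrooms, furnished].
coef, _, rank, _ = np.linalg.lstsq(X_design, price, rcond=None)
print("coefficients:", np.round(coef, 3))

# Predict the price of a new, equally hypothetical house: 90 m^2, 2 bedrooms, furnished.
new_house = np.array([1.0, 90.0, 2.0, 1.0])
print("predicted price:", round(float(new_house @ coef), 2))
```

Each coefficient describes the relationship between one predictor and the target while the other predictors are held fixed.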

Linear Regression works on a linear approach: it models the relationship between the variables by fitting a line called the Regression Line or the Line of Best Fit. For Simple Linear Regression, the line is described by an equation with an intercept and a slope:

$$Y_i = B_0 + B_1 x_i + e_i$$

where Yi is the target variable, B0 is the population y-intercept, B1 is the population slope coefficient, xi is the independent variable, and ei is the random error component. Error plays a major role in Linear Regression. For a fitted line, the error of each observation is estimated by the difference between the observed value and the predicted value (error = observed value - predicted value). These differences are called residuals in regression analysis.

Linear Regression is the most basic type of regression and it works on the principle of least squares: it chooses the B0 and B1 values for which the error is smallest. The sum of squared errors (SSE) is the metric used to compare candidate lines, and the line with the minimum SSE is the Line of Best Fit.
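For a single predictor, the least-squares values of the intercept and slope do not have to be found by trial and error; they have a well-known closed form. The sketch below computes them and checks that nudging either value increases the sum of squared errors. The data and the function names are made up for illustration.

```python
import numpy as np


def fit_simple_ols(x, y):
    """Closed-form least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """Sum of squared errors of the line y = b0 + b1 * x on the data."""
    residuals = np.asarray(y, dtype=float) - (b0 + b1 * np.asarray(x, dtype=float))
    return float(np.sum(residuals ** 2))


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(x, y)
best_sse = sum_squared_errors(x, y, b0, b1)

# Shifting the intercept or the slope away from the fitted values raises the SSE.
for db0, db1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.2), (0.0, -0.2)]:
    assert sum_squared_errors(x, y, b0 + db0, b1 + db1) > best_sse

print(f"b0={b0:.3f}, b1={b1:.3f}, SSE={best_sse:.4f}")
```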

The sign of B1 gives the direction of the relationship. If B1 is greater than 0, the predictor and the target variable have a positive relationship; if B1 is less than 0, they have a negative relationship. If the range of the data includes x = 0, B0 is the average value of the target when x = 0. Setting all the predictor variables to 0 is often impossible, however, and in that case B0 has no direct interpretation. Either way, including B0 in the model guarantees that the residuals have zero mean. The SSE is calculated by squaring each error and summing the results. The errors are squared so that negative and positive errors do not cancel out, and for a given data set a lower SSE means the line fits the data better.
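The role of the intercept can be checked numerically. This sketch, on made-up data, fits one line with an intercept and one line forced through the origin, then compares the mean of the residuals in each case. The variable names are hypothetical.

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.2, 6.9, 9.1, 11.0, 12.8])

# Line with an intercept: y = b0 + b1 * x (polyfit returns [slope, intercept]).
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Line forced through the origin (no intercept): y = b * x.
b_origin = np.sum(x * y) / np.sum(x ** 2)
residuals_origin = y - b_origin * x

direction = "positive" if b1 > 0 else "negative" if b1 < 0 else "no linear"
print(f"b1={b1:.3f} ({direction} relationship), b0={b0:.3f}")
print(f"mean residual with intercept:    {residuals.mean():.6f}")
print(f"mean residual without intercept: {residuals_origin.mean():.6f}")
print(f"SSE with intercept: {np.sum(residuals ** 2):.4f}")
```

With the intercept, the residuals average out to zero up to floating-point rounding; without it, they do not.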

Assumptions of Linear Regression

Linear Regression relies on some assumptions. There are five key assumptions that should be satisfied when you are modeling using Linear Regression.

1. It is called Linear Regression because it uses a linear equation for modeling, so its first assumption is that the relationship between the independent variables and the dependent variable is linear.

2. The errors should be independent of one another. In other words, there should be no auto-correlation in the residuals. Auto-correlation typically appears in time-ordered data, where the error of one observation carries over to the next.

3. The third assumption is called multivariate normality. It is often stated as the data being normal across all the variables, but what the usual significance tests and confidence intervals rely on is that the residuals are approximately normally distributed.

4. Linear Regression assumes equal variance across the errors, which is called Homoscedasticity, so the fourth assumption is that the errors should not show any Heteroscedasticity (variance that changes with the values of the independent variables).

5. The last assumption is that the independent variables should not have any multicollinearity, which occurs when the independent variables are too highly correlated with each other.
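One common way to check the multicollinearity assumption is the variance inflation factor (VIF): each independent variable is regressed on the others, and the VIF is 1 / (1 - R-squared) of that regression. The sketch below is a minimal NumPy version on synthetic data. The function name is hypothetical, and the often-quoted warning threshold of roughly 5 to 10 is a rule of thumb rather than a fixed standard.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print("VIF for x1, x2, x3:", np.round(vifs, 2))
```

The two nearly identical predictors get very large VIFs, while the unrelated one stays close to 1.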

In Linear Regression modeling, the independent variables should influence the dependent variable. To find out how much they do, we use a measure called R-squared, a statistical measure that represents the proportion of the variance of the dependent variable that is explained by the independent variables. It is also called the coefficient of determination.

R-squared is calculated as

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where SSres is the residual sum of squares, which is the sum of the squared differences between the observed values and the predicted values.

Similarly, SStot is the total sum of squares, which is the sum of the squared differences between the actual values and the mean of the actual values. While correlation tells us the strength and direction of the relationship between the variables, R-squared tells us to what extent the variance of one variable is explained by the other.
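The calculation can be sketched in a few lines, assuming the usual definition R-squared = 1 - SSres / SStot. The data are made up and the function name is hypothetical. The sketch also shows the link to correlation: for a Simple Linear Regression fitted with an intercept, R-squared equals the square of the correlation between the two variables.

```python
import numpy as np


def r_squared(y_observed, y_predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    # Residual sum of squares: observed values versus predicted values.
    ss_res = np.sum((y_observed - y_predicted) ** 2)
    # Total sum of squares: observed values versus their own mean.
    ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line and compute its predictions.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

r2 = r_squared(y, y_hat)
pearson_r = np.corrcoef(x, y)[0, 1]

print(f"R-squared:           {r2:.6f}")
print(f"correlation squared: {pearson_r ** 2:.6f}")
```

A model that always predicts the mean of the actual values has SSres equal to SStot, so its R-squared is 0; a model that predicts every value exactly has R-squared equal to 1.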


Source: Artificial Intelligence on Medium