Blog: Linear Regression Basics and Regularization Methods
Complete guide to linear regression and its regularized forms: Ridge, Lasso and Elastic Net Regression
Linear regression is the simplest and most widely used statistical technique for predictive modeling. It gives us an equation in which the features are the independent variables and the target is the variable that depends on them:
Y = θ0 + θ1*X1 + θ2*X2 + ... + θn*Xn
Here Y is the dependent variable, the X's are the independent variables and the thetas are the coefficients. Coefficients are weights assigned to each of the variables: the more strongly the target depends on a variable, the larger the magnitude of its weight (assuming the variables are on comparable scales).
Consider just one independent variable. Then the equation will be
Y = θ0 + θ1*X
This equation is called a simple linear regression equation. It represents a straight line, where theta0 is the intercept and theta1 is the slope of the line. Example: shown below is the regression line (line of best fit) of the Sales vs MRP graph.
Line of Best Fit
A line of best fit is a line drawn through a scatter plot that shows the general direction a group of points seems to follow. Its main purpose is to make our predicted values as close as possible to the actual values. The best fit line minimizes the difference between the predicted and the actual values, i.e. the error (or residual).
The residuals are the vertical lines extending from the regression line to the data points.
Our main objective is to measure these errors and minimize them. We can measure the residuals in three ways.
- Sum of residuals, ∑(Y - h(X)): positive and negative errors can cancel each other out, hence we use the absolute value instead
- Sum of the absolute values of residuals, ∑|Y - h(X)|
- Sum of squares of residuals, ∑(Y - h(X))^2: the most commonly used
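Therefore the Sum of Squares (SS) error is given as:
SS = ∑ (h(x_i) - y_i)^2, summed over i = 1 to m
where h(x) = θ0 + θ1*x is the predicted value, y is the actual value and m is the number of rows (data points).
Cost/Loss Function: It is the measure of the error of the model:
J(θ0, θ1) = (1/m) * ∑ (h(x_i) - y_i)^2
This is the Sum of Squares error averaged over the m data points, hence it is also called the Mean Squared Error. (Some texts divide by 2m instead of m, which simplifies the derivative without changing where the minimum is.)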
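As a small illustration of the three residual measures and the cost function above, here is a minimal Python sketch. The data points and the candidate line (theta0 = 1, theta1 = 2) are made-up values chosen for illustration; they are not taken from the Sales vs MRP example.

```python
# Made-up data and a made-up candidate line h(x) = theta0 + theta1 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.5, 4.5, 7.5, 8.5]
theta0, theta1 = 1.0, 2.0

predictions = [theta0 + theta1 * x for x in xs]        # h(x) for every row
residuals = [y - h for y, h in zip(ys, predictions)]   # actual minus predicted

sum_of_residuals = sum(residuals)                  # positive and negative errors cancel
sum_of_absolute = sum(abs(r) for r in residuals)   # no cancellation
sum_of_squares = sum(r ** 2 for r in residuals)    # no cancellation, penalizes big errors more
mean_squared_error = sum_of_squares / len(xs)      # sum of squares averaged over the rows

print(sum_of_residuals, sum_of_absolute, sum_of_squares, mean_squared_error)
```

The plain sum comes out as zero even though every prediction is off by 0.5, which is why the absolute or squared versions are preferred.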
For every model, we need to minimize the cost function. A common way to do this is the gradient descent algorithm.
Gradient Descent Algorithm
Gradient Descent is used to minimize the error function by iteratively moving in the direction of steepest descent (i.e. the negative of the gradient). It is used to update the parameters of our model.
Suppose we want to find the best parameters θ0 and θ1 for our linear regression algorithm. GD works by iteratively updating θ0 and θ1 until it reaches the point where the cost function is at its minimum.
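To make the update loop concrete, here is a minimal sketch of gradient descent for a one-variable linear regression, using the mean squared error as the cost. The function name, the learning rate, the number of iterations and the data are all assumptions picked for this illustration; real problems usually need these to be tuned.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by minimizing the mean squared error."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(n_iterations):
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/m) * sum(error^2) with respect to each parameter
        grad0 = (2.0 / m) * sum(errors)
        grad1 = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, i.e. in the direction of steepest descent
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Made-up points that lie exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

With this data the parameters should settle very close to an intercept of 1 and a slope of 2. A learning rate that is too large would make the updates diverge instead.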
Evaluating your model — R-squared
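R-squared: It determines how much of the total variation in Y (dependent variable) is explained by the variation in X (independent variable).
R-squared = 1 - (sum of squared residuals) / (total sum of squares of Y around its mean)
For a least-squares fit with an intercept, evaluated on the training data, the value is always between 0 and 1, where 0 means the model does not explain any of the variability in Y and 1 means it explains all of the variability in the target variable.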
Drawback of R-squared
The drawback of R-squared is that if we add a new independent variable to the learner, the R-squared either increases or remains constant, whether or not the variable is useful. So it does not tell us whether we are just increasing complexity or actually making the model more accurate.
Adjusted R-squared is the modified form of R-squared that has been adjusted for the number of predictors in the model. It incorporates the model’s degrees of freedom:
Adjusted R2 = 1 - (1 - R2) * (N - 1) / (N - p - 1)
where
R2 = Sample R square
p = Number of predictors
N = total sample size
R square is a basic metric which tells you how much variance has been explained by the model. In a multivariate linear regression, if you keep on adding new variables, the R square value will never decrease, irrespective of the variables' significance. Adjusted R square corrects for this: it increases only when a new variable improves the model by more than would be expected by chance, and decreases otherwise. So while doing a multivariate linear regression we should look at adjusted R square instead of R square.
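The two metrics can be computed in a few lines. The sketch below uses the standard formulas R2 = 1 - SS_residual / SS_total and Adjusted R2 = 1 - (1 - R2)(N - 1)/(N - p - 1). The function names and the numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Share of the variation in y that is explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_residual = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_residual / ss_total


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors p, given sample size N."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Made-up actual values and predictions
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_pred = [3.5, 4.5, 7.5, 8.5, 11.5, 12.5]

r2 = r_squared(y_true, y_pred)
# Same R-squared, but pretend it was reached with 1 predictor and then with 3
print(r2, adjusted_r_squared(r2, 6, 1), adjusted_r_squared(r2, 6, 3))
```

For the same R-squared, the adjusted value drops as the number of predictors goes up, which is the penalty for extra complexity.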
Choosing the right parameters for your model
With high-dimensional data, it would be inefficient to use all the columns in the regression model, since some of them might be imparting redundant information.
There are two main ways of selecting the variables:
- Forward Selection: starts with the most significant predictor in the model and adds one variable at each step.
- Backward Elimination: starts with all predictors in the model and removes the least significant variable at each step.
The selection criterion can be set to any statistical measure like R-square, t-stat, etc.
Statistical Methods for Finding the Best Regression Model
- Adjusted R-squared and Predicted R-squared: Generally, you choose the models that have higher adjusted and predicted R-squared values.
- P-values for the predictors: In regression, low p-values indicate terms that are statistically significant. With backward elimination, you can systematically remove the feature with the highest p-value, one at a time, until you are left with only significant predictors.
Interpretation of Regression Plots
The regression plots discussed here are Residual vs Fitted plots.
Heteroskedasticity: non-constant variance in the error term results in heteroskedasticity. A funnel-like shape in the plot indicates heteroskedasticity.
Heteroskedasticity can be caused by the presence of outliers or extreme leverage values. When it occurs, the confidence interval for out-of-sample predictions tends to be unrealistically wide or narrow.
Polynomial Regression is another type of regression in which the maximum power of the independent variable is more than 1. Hence, the best fit line is not a straight line but a curve.
Quadratic regression, or regression with a second-order polynomial, is given by the following equation:
Y = θ0 + θ1*x + θ2*x^2
Shown below is polynomial regression with degree=3 and degree=20.
We see that for the higher-degree polynomial, the best fit line tends to pass through almost all the points. This means our model fits the training data well but tends to fit poorly on test data. This is called over-fitting. In that case, our model has high variance and low bias.
Similarly, we have another problem called underfitting. It occurs when our model neither fits the training data nor generalizes to new data. In this case, we have a model with high bias and low variance.
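The overfitting behaviour described above can be reproduced on a tiny example. The sketch below compares a straight-line fit with a degree-5 polynomial on six made-up training points. With six points, the least-squares polynomial of degree 5 is simply the polynomial that passes through every point, so it is evaluated here with Lagrange's interpolation formula to avoid any linear-algebra dependency. The data, the noise values and the helper names are all assumptions for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares straight line: returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def interpolating_polynomial(xs, ys, x):
    """Evaluate the degree len(xs)-1 polynomial through every point (Lagrange form)."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        basis = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                basis *= (x - xk) / (xj - xk)
        total += yj * basis
    return total


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Made-up training data: y = 1 + 2x plus hand-picked noise
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [0.5, -0.4, 0.3, -0.6, 0.4, -0.2]
y_train = [1 + 2 * x + e for x, e in zip(x_train, noise)]
# Noise-free test points that sit between the training points
x_test = [0.5, 1.5, 2.5, 3.5, 4.5]
y_test = [1 + 2 * x for x in x_test]

intercept, slope = fit_line(x_train, y_train)
line_train = mse(y_train, [intercept + slope * x for x in x_train])
line_test = mse(y_test, [intercept + slope * x for x in x_test])
poly_train = mse(y_train, [interpolating_polynomial(x_train, y_train, x) for x in x_train])
poly_test = mse(y_test, [interpolating_polynomial(x_train, y_train, x) for x in x_test])

print("degree 1 train/test MSE:", line_train, line_test)
print("degree 5 train/test MSE:", poly_train, poly_test)
```

The degree-5 curve has no training error at all because it chases the noise, yet it does worse than the straight line on the test points. That gap between training and test error is the signature of high variance.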
Bias and Variance in Regression models
Bias: the simplifying assumptions made by a model to make the target function easier to learn.
Variance: the amount by which the estimate of the target function will change if different training data is used.
There can be four combinations of bias and variance in a regression model:
- Very accurate model: the error of our model is low, meaning low bias and low variance, as shown in the first figure.
- As variance increases, the spread of our predictions increases, which results in less accurate predictions.
- As bias increases, the error between our predicted values and the observed values increases. High bias means the model makes strong assumptions or places strong restrictions on the target function.
Underfitting: an underfitting model performs poorly on the training data. This happens because the model is unable to capture the relationship between the input examples and the target variable.
To overcome underfitting or high bias, we can add new parameters to our model so that the model complexity increases, thus reducing the bias.
Overfitting: as we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias.
To overcome overfitting there are two ways:
- Reduce the model complexity
- Regularization
In regularization, we normally keep the same number of features but reduce the magnitude of the coefficients. To see which coefficients are large, one can plot the coefficient of each variable.
The graph above shows coefficient vs variable. We see that the coefficients of Outlet_Identifier_OUT027 and Outlet_Type_Supermarket_Type3 are much higher than the rest. Therefore our dependent variable will depend more on these variables.
We have different types of regression techniques that use regularization to overcome this problem. So let us discuss them.
Ridge regression uses the L2 penalty and lasso regression uses the L1 penalty. The key difference between these two is the penalty term.
Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function. The second term in the objective below is the L2 regularization element.
Objective = RSS + α * (sum of square of coefficients)
Here, α (alpha) is the penalty factor (also called lambda) that balances the emphasis given to minimizing the RSS vs minimizing the sum of squares of the coefficients. α can take various values:
- α = 0: The objective becomes the same as simple linear regression. We’ll get the same coefficients as simple linear regression.
- α = ∞: The coefficients will be zero. Why? Because of the infinite weight on the square of the coefficients, any coefficient other than zero will make the objective infinite.
- 0 < α < ∞: The magnitude of α decides the weight given to the different parts of the objective. The coefficients will be somewhere between 0 and the ones for simple linear regression.
In the plots above for the best fit line, we see that as the value of α increases, the model complexity reduces. The higher the α, the bigger the penalty. Though higher values of α reduce overfitting, significantly high values can cause underfitting as well (e.g. α = 5).
In ridge, the coefficients are shrunk to a small magnitude but they never become exactly zero. Because it shrinks the parameters, it is mostly used to reduce the effect of multicollinearity. It reduces the model complexity by coefficient shrinkage.
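The effect of α on the coefficients can be checked on a toy problem with a single feature and no intercept. In that special case, minimizing RSS + α * θ^2 has the closed-form solution θ = ∑xy / (∑x^2 + α). This is a simplification assumed for the sketch; the function name and data are made up.

```python
def ridge_coefficient(xs, ys, alpha):
    """Minimizer of sum((y - theta * x)^2) + alpha * theta^2 (one feature, no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


# Made-up data that follows y = 3x exactly
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

# alpha = 0 reproduces plain linear regression; a huge alpha pushes the coefficient toward zero
for alpha in [0.0, 1.0, 14.0, 1e9]:
    print(alpha, ridge_coefficient(xs, ys, alpha))
```

The coefficient shrinks steadily as α grows, but for any finite α it stays above zero.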
LASSO stands for Least Absolute Shrinkage and Selection Operator. It is quite similar to ridge regression, except that LASSO adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function:
Objective = RSS + α * (sum of absolute value of coefficients)
This way, some coefficients can be shrunk all the way to zero, so lasso also performs feature selection.
Traditional methods for handling overfitting and performing feature selection, like cross-validation and stepwise regression, work well with a small set of features, but regularization techniques like the ones explained above are a great alternative when we are dealing with a large set of features.
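To see how the absolute-value penalty of lasso behaves, here is a sketch for a single feature with no intercept. In that special case, minimizing RSS + α * |θ| is solved by "soft-thresholding": the unpenalized solution is pulled toward zero by a fixed amount and clipped at zero. The one-feature setup, the function name and the data are assumptions made for illustration.

```python
def lasso_coefficient(xs, ys, alpha):
    """Minimizer of sum((y - theta * x)^2) + alpha * abs(theta) (one feature, no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    threshold = alpha / 2.0
    # Soft-thresholding: shrink toward zero and stop exactly at zero
    if sum_xy > threshold:
        return (sum_xy - threshold) / sum_xx
    if sum_xy < -threshold:
        return (sum_xy + threshold) / sum_xx
    return 0.0


# Made-up data that follows y = 3x exactly
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

for alpha in [0.0, 28.0, 84.0, 200.0]:
    print(alpha, lasso_coefficient(xs, ys, alpha))
```

Once α is large enough, the coefficient is exactly zero rather than just small, which is what lets an L1 penalty drop features from the model.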
Elastic Net Regression
It combines the power of L1 and L2 regularisation. Elastic net regression generally works well when we have a big dataset. Suppose we have a bunch of correlated independent variables in a dataset. Elastic net will tend to treat these correlated variables as a group. If any one variable of this group is a strong predictor (meaning it has a strong relationship with the dependent variable), then the entire group is included in the model, because omitting the other variables (as lasso would do) might lose some information in terms of interpretability and lead to poorer model performance.
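The penalty added to the RSS looks like this:
a * (L1 term) + b * (L2 term)
Here a and b are the weights assigned to the L1 and L2 terms respectively, set in such a way that they control the trade-off between L1 and L2. They are usually expressed through two other parameters:
alpha = a + b
l1_ratio = a / (a + b)
where alpha is the overall strength of the penalty and l1_ratio is the mixing parameter between ridge (l1_ratio = 0) and lasso (l1_ratio = 1).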
Programmatically we use l1_ratio as a parameter of a function, and it alone decides the type of regression (lasso, ridge, Elastic Net). Let alpha (or a+b) = 1, and now consider the following cases:
- If l1_ratio = 1, then from l1_ratio = a / (a + b) we must have a = 1, which implies b = 0. Therefore, it will be a lasso penalty.
- Similarly, l1_ratio = 0 implies a = 0. Then the penalty will be a ridge penalty.
- For l1_ratio between 0 and 1, the penalty is a combination of ridge and lasso.
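The three cases above can be traced with a small helper that converts the weights a and b into alpha = a + b and l1_ratio = a / (a + b). The helper names are made up for this sketch. In scikit-learn, for example, the ElasticNet estimator takes parameters with these same names, alpha and l1_ratio.

```python
def to_alpha_and_l1_ratio(a, b):
    """Convert the L1 weight a and the L2 weight b into (alpha, l1_ratio)."""
    alpha = a + b
    if alpha == 0:
        raise ValueError("a + b must be positive, otherwise there is no penalty to mix")
    return alpha, a / alpha


def penalty_type(l1_ratio):
    """Name the penalty that a given l1_ratio corresponds to."""
    if l1_ratio == 1:
        return "lasso"
    if l1_ratio == 0:
        return "ridge"
    return "elastic net"


# Keep alpha = a + b = 1 and vary how the weight is split between L1 and L2
for a, b in [(1.0, 0.0), (0.0, 1.0), (0.25, 0.75)]:
    alpha, l1_ratio = to_alpha_and_l1_ratio(a, b)
    print(a, b, alpha, l1_ratio, penalty_type(l1_ratio))
```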
This ends the topic of linear regression and regularisation methods. If you like this article, give it a clap and follow me for more stuff like this.