This is for my understanding and not intended for public use.
Regression is the less sensational cousin of deep neural networks, but it works best in large enterprises, where decisions need to be interpretable.
- Regression analysis is a predictive modelling technique that investigates the relationship between a dependent variable and one or more independent variables.
- Used for forecasting, time series modelling, and finding causal relationships between variables.
- Fits a line or curve to the data so that the distances of the data points from it are as small as possible.
- Helps identify which relationships are significant and how strongly each independent variable affects the dependent variable.
Types of regression techniques are distinguished by:
- Number of independent variables
- Shape of regression lines
- Type of dependent variable
Regression coefficient : A regression coefficient in multiple regression is the slope of the linear relationship between the criterion variable and the part of a predictor variable that is independent of all other predictor variables.
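
The definition above can be checked numerically. The sketch below uses made-up data and hypothetical helper names (`simple_ols`, `residuals`, `two_predictor_ols`); it removes from x1 the part explained by x2, regresses y on what is left, and compares that slope with the coefficient of x1 from the full two-predictor regression. The two should agree, while the slope from regressing y on x1 alone should not.

```python
def mean(values):
    return sum(values) / len(values)

def simple_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def residuals(x, y):
    """Part of y that a straight-line fit on x cannot explain."""
    a, b = simple_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def two_predictor_ols(x1, x2, y):
    """Slopes (b1, b2) of y = a + b1*x1 + b2*x2 from the normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Made-up data: x2 is correlated with x1, and y = 1 + 2*x1 + 3*x2 + small noise
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
noise = [0.1, -0.2, 0.05, 0.15, -0.1, 0.0]
y = [1 + 2 * u + 3 * v + e for u, v, e in zip(x1, x2, noise)]

# Coefficient of x1 from the full multiple regression
b1_full, b2_full = two_predictor_ols(x1, x2, y)

# Slope of y on the part of x1 that is independent of x2
x1_part = residuals(x2, x1)
_, b1_partial = simple_ols(x1_part, y)

# Slope of y on x1 alone, which also absorbs some of x2's effect
_, b1_alone = simple_ols(x1, y)

print(f"multiple regression b1: {b1_full:.4f}")
print(f"slope on residualized x1: {b1_partial:.4f}")
print(f"slope on x1 alone: {b1_alone:.4f}")
```
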
- In linear regression, the dependent variable is continuous, the independent variables can be continuous or discrete, and the regression line is linear.
- Uses a best fit straight line.
Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term.
- Calculates the best fit straight line by minimizing the sum of the squares of the vertical deviations from each data point to the line.
- There must be a linear relationship between the independent and dependent variables.
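
As a minimal sketch of the least squares idea described above, the following fits Y = a + b*X to a few made-up points using the closed-form solution for a single predictor, then checks that nudging either parameter increases the sum of squared vertical deviations. The function names `fit_line` and `sse` are illustrative, not from any library.

```python
def fit_line(x, y):
    """Closed-form least-squares estimates (a, b) for y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    return a, b

def sse(a, b, x, y):
    """Sum of squared vertical deviations from the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up data that roughly follows y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"SSE at the fit:         {sse(a, b, x, y):.4f}")
print(f"SSE with a shifted:     {sse(a + 0.1, b, x, y):.4f}")
print(f"SSE with b shifted:     {sse(a, b + 0.1, x, y):.4f}")
```
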
- Multiple regression suffers from:
- Multicollinearity : occurs when independent variables in a model are correlated. If the correlation between variables is high enough, it can cause problems when you fit the model.
The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant.
However, when independent variables are correlated, a change in one is associated with a shift in another, so it becomes hard to estimate each variable's effect while holding the others constant.
With multicollinearity, the least squares estimates remain unbiased, but their variances are large, so an estimated coefficient can fall far from the true value.
- Structural multicollinearity : This type occurs when we create a model term using other terms. In other words, it’s a byproduct of the model that we specify rather than being present in the data itself. For example, if you square the term X to model curvature, there is clearly a correlation between X and X².
- Data multicollinearity : This type of multicollinearity is present in the data itself rather than being an artifact of our model. Observational studies are more likely to exhibit this kind of multicollinearity.
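
To illustrate the structural case, here is a small sketch (made-up values, hypothetical helper `pearson`) that measures the correlation between X and its square. It also shows that the correlation disappears when X is centered before squaring; that is a commonly used remedy rather than something these notes cover, and it is included only to show that the correlation comes from how the term is built, not from the data.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Made-up predictor values 1..10
x = [float(i) for i in range(1, 11)]
x_squared = [xi ** 2 for xi in x]

# X and X^2 are strongly correlated purely because of how X^2 is built
r_raw = pearson(x, x_squared)

# Assumption: centering X before squaring (not part of the original notes)
x_mean = sum(x) / len(x)
x_centered = [xi - x_mean for xi in x]
x_centered_squared = [xc ** 2 for xc in x_centered]
r_centered = pearson(x_centered, x_centered_squared)

print(f"corr(X, X^2) = {r_raw:.4f}")
print(f"corr(Xc, Xc^2) after centering = {r_centered:.4f}")
```
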
- Autocorrelation : Also known as serial correlation, this is the correlation of a signal with a delayed copy of itself as a function of the delay. Informally, it is the similarity between observations as a function of the time lag between them.
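
A minimal sketch of the lag-based definition above, assuming one common sample estimator (the lagged products divided by the overall sum of squares) and a made-up periodic signal. The function name `autocorr` is illustrative.

```python
def autocorr(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom

# Made-up signal that repeats every 4 steps
signal = [0.0, 1.0, 0.0, -1.0] * 6

for lag in (0, 1, 2, 4):
    print(f"lag {lag}: {autocorr(signal, lag):+.4f}")
```

With this signal, the lag equal to the period (4) gives a strongly positive value, and half the period (2) gives a strongly negative one, which matches the informal reading of autocorrelation as similarity at a given time lag.
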
- Heteroskedasticity : Put simply, heteroscedasticity (also spelled heteroskedasticity) refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it.
- A scatterplot of these variables will often create a cone-like shape, as the scatter (or variability) of the dependent variable (DV) widens or narrows as the value of the independent variable (IV) increases. The opposite of heteroscedasticity is homoscedasticity, which indicates that a DV’s variability is equal across values of an IV.
- Heteroscedasticity is most frequently discussed in terms of the assumptions of parametric analyses (e.g. linear regression). More specifically, it is assumed that the error (a.k.a. residual) of a regression model is homoscedastic across all values of the predicted value of the DV. Put more simply, a test of homoscedasticity of error terms determines whether a regression model’s ability to predict a DV is consistent across all values of that DV.
- The concern about heteroscedasticity, in the context of regression and other parametric analyses, relates specifically to the error terms and NOT to the relationship between two individual variables.
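
The cone shape described above can be reproduced with simulated data. The sketch below is an informal check, not a formal test: it assumes a made-up process in which the noise standard deviation grows with X, fits a straight line, and compares the variance of the residuals for the lower and upper halves of X. All names (`fit_line`, `variance`, the constants) are illustrative.

```python
import random

def fit_line(x, y):
    """Closed-form least-squares estimates (a, b) for y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b

def variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up data: the noise standard deviation is proportional to x,
# so the scatter of y widens as x increases (the cone shape)
x = [float(i) for i in range(1, 101)]
y = [2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]

a, b = fit_line(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# x is already sorted, so the first 50 residuals belong to the smaller x values
var_low = variance(resid[:50])
var_high = variance(resid[50:])

print(f"residual variance, lower half of x: {var_low:.2f}")
print(f"residual variance, upper half of x: {var_high:.2f}")
```

If the errors were homoscedastic, the two variances would be roughly equal; here the upper half should be clearly larger.
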