Tuesday, October 26, 2021

Multi-Collinearity in Machine Learning

# What is multicollinearity?

>Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. If the degree of correlation between variables is high enough, it can cause problems when you fit the model and interpret the results.


>Multicollinearity occurs when two or more independent variables in a regression model are highly correlated with one another. This means that one independent variable can be predicted, at least approximately, from the other independent variables in the model.


# Why is multicollinearity a problem?


When independent variables are highly correlated, they tend to change together, so the model cannot cleanly separate the effect of one from the effect of the other, and the estimated coefficients fluctuate significantly. The model results become unstable and can vary a lot after a small change in the data or the model. This creates the following problems:


1>It is hard to choose the list of significant variables for the model if the model gives you different results every time.


2>Coefficient estimates are not stable, which makes the model hard to interpret. In other words, you cannot reliably tell how much the output changes when one of your predictors changes by 1 unit.


3>The unstable nature of the model may cause overfitting. If you apply the model to another sample of data, the accuracy may drop significantly compared to the accuracy on your training dataset.
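The instability described above can be seen in a small simulation. This is only a sketch on synthetic data: the variable names, the true coefficients (3 and 2), the noise levels, and the helper functions `fit_ols` and `coef_spread` are all assumptions made for illustration. It refits an ordinary least squares model on bootstrap resamples and compares how much the coefficients move when the two predictors are independent versus when they are nearly identical.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def coef_spread(X, y, n_boot=200, seed=0):
    """Standard deviation of each coefficient across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        coefs.append(fit_ols(X[idx], y[idx]))
    return np.std(coefs, axis=0)


# Synthetic data (hypothetical): same true model, two different designs.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                       # unrelated to x1
x2_collinear = x1 + rng.normal(scale=0.05, size=n)  # almost a copy of x1
noise = rng.normal(scale=1.0, size=n)

y_indep = 3 * x1 + 2 * x2_indep + noise
y_coll = 3 * x1 + 2 * x2_collinear + noise

spread_indep = coef_spread(np.column_stack([x1, x2_indep]), y_indep)
spread_coll = coef_spread(np.column_stack([x1, x2_collinear]), y_coll)

print("coefficient spread, independent predictors:", spread_indep)
print("coefficient spread, collinear predictors:  ", spread_coll)
```

With the collinear design, the spread of the slope coefficients should come out many times larger, even though the underlying relationship and the noise are the same.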


# How do you identify multicollinearity?

1>The correlation between two independent variables is greater than 0.8 (in absolute value).


2>The variance inflation factor (VIF) is greater than 20. Values above 5 to 10 are already commonly treated as a warning sign.


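3>R-squared and adjusted R-squared should lie between 0 and 1 (the closer to 1, the better the fit). A high R-squared combined with few statistically significant coefficients is a typical sign of multicollinearity.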


4>Check the coefficient values: they should not be unusually large.


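5>Check the sign of each coefficient. A negative coefficient means that a 1-unit increase in that variable decreases the predicted output. For example, a coefficient of -0.0010 on the newspaper variable means the output decreases by 0.0010 for each 1-unit increase in newspaper. A sign that contradicts what you would expect can point to multicollinearity.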


6>The standard errors of the coefficients should not be high. High standard errors suggest that multicollinearity exists.


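7>Variables with high p-values are not statistically significant and are candidates to drop from the model. Keep in mind that multicollinearity itself can inflate p-values.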
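The first two checks in the list above (pairwise correlation and VIF) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the linked repository: the data is synthetic, the column names `tv`, `radio`, and `newspaper` are hypothetical, and the helper `vif` is written here from the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on all the other columns.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept).
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


# Synthetic predictors (hypothetical): radio is built to track tv closely.
rng = np.random.default_rng(0)
n = 500
tv = rng.normal(size=n)
radio = tv + rng.normal(scale=0.1, size=n)
newspaper = rng.normal(size=n)

names = ["tv", "radio", "newspaper"]
X = np.column_stack([tv, radio, newspaper])

# Check 1: flag pairs whose absolute correlation exceeds 0.8.
corr = np.corrcoef(X, rowvar=False)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.8:
            print(f"high correlation: {names[i]} vs {names[j]} = {corr[i, j]:.3f}")

# Check 2: report the VIF of every predictor.
vifs = vif(X)
for name, value in zip(names, vifs):
    print(f"VIF({name}) = {value:.1f}")
```

In this setup `tv` and `radio` should be flagged by both checks, while `newspaper` should have a VIF close to 1. On real data you would pass your own feature matrix in place of `X`.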


Python code for multicollinearity is available at the link below:

GITUBLINK
