The Collinearity of Features
It’s wise to understand ‘collinearity’ and ‘multicollinearity’ if you want to excel in the field of data science. Both concepts describe essentially the same thing, with a subtle difference. (Don’t worry, I will explain the difference in plain language.)
Let’s check the agenda for this blog post.
- Collinearity Vs Multicollinearity
- How does the weight vector of features get corrupted in the presence of collinearity?
- Why can multicollinearity be dangerous to your regression model?
- Ways to detect multicollinearity among the features
- How to fix multicollinearity?
Let’s start with understanding the difference between collinearity and multicollinearity.
Collinearity Vs Multicollinearity
These are two of the most overlapping concepts in statistics. Collinearity is a linear relationship or association between two variables or features. The term multicollinearity is used when the linear association involves more than two variables.
How does the weight vector of features get corrupted in the presence of collinearity?
Let’s say you have 3 features, F = {F1, F2, F3}, with corresponding weights W* = {1, 2, 3}.
Now say you have a new query point (row vector) Xq = {Xq1, Xq2, Xq3}; the prediction is then formed by weighting each feature accordingly.
Let’s rewrite the above sentence and label it as Equation (1).
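With the weights above, equation (1) reads (Ŷ being the model’s prediction for the query point Xq):

Ŷ = W1·Xq1 + W2·Xq2 + W3·Xq3 = 1·Xq1 + 2·Xq2 + 3·Xq3 … (1)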
Now, as I explained above, there must be some linear association between the features or variables before we call them multicollinear. To demonstrate this, let’s assume such a relationship holds among the features of our query point.
Now let’s plug this relationship into equation (1) and solve it.
With both equations ready, let’s compare the weight vectors corresponding to them.
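As a minimal sketch of this step, suppose the linear association is Xq3 = Xq1 + Xq2 (an illustrative assumption; any linear relation makes the same point). Substituting into equation (1) gives:

Ŷ = 1·Xq1 + 2·Xq2 + 3·(Xq1 + Xq2) = 4·Xq1 + 5·Xq2 + 0·Xq3 … (2)

Equation (1) assigns the weights {1, 2, 3} to {F1, F2, F3}, while equation (2) effectively assigns {4, 5, 0}: the weights attached to the very same features have changed purely because of the linear relationship among them.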
Note:
Due to the presence of collinearity, the weight vector, i.e. the weight associated with each feature, can change. Many algorithms use the weight vector to do feature selection internally; Logistic Regression, for example, uses |Wi| as a measure of feature importance. Therefore we must check whether multicollinearity exists among the features.
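As a minimal sketch of how |Wi| can be read off a fitted model (the data here is a synthetic toy set, not the housing data used later):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data: 5 features and a binary target (purely illustrative)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by the absolute value of their learned weights |Wi|
importance = np.abs(model.coef_[0])
for idx in np.argsort(importance)[::-1]:
    print(f"Feature {idx}: |W| = {importance[idx]:.3f}")
```

If two of these features were collinear, the ranking produced by |Wi| could change for the reasons shown in equations (1) and (2).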
Why can multicollinearity be dangerous to your regression model?
Until now, I have explained how problematic collinearity can be for model-based feature selection. Now let’s discuss the potential problems that arise when we build regression models while ignoring multicollinearity.
Having multicollinearity in the regression model can change the interpretation of the regression coefficients. A regression coefficient represents the change in the dependent variable for each one-unit change in an independent variable, keeping all other independent variables constant. That’s the problem: when the independent variables are correlated, it’s difficult to fulfil the condition of “keeping the other independent variables constant”. As a result, a one-unit change in one independent variable comes with some change in the other independent variables because of the collinearity between features. This makes it difficult for the model to capture the true relationship between the dependent and independent variables.
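A quick way to see this is to fit the same linear regression on two bootstrap samples of data in which two predictors are nearly identical: the individual coefficients swing around even though their combined effect is stable. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)              # x2 is almost a copy of x1
y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=200)

X = np.column_stack([x1, x2])
for seed in (1, 2):
    idx = np.random.default_rng(seed).integers(0, 200, size=200)  # bootstrap resample
    coefs = LinearRegression().fit(X[idx], y[idx]).coef_
    print(f"Sample {seed}: coefficients = {np.round(coefs, 2)}")
# The two coefficients jump around between samples; only their sum (about 5) stays stable.
```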
Ways to detect multicollinearity among the features
There are multiple methods to detect multicollinearity among the features. In this blog post, let’s cover these two:
- Correlation Matrix
- VIF
Let’s explore each of these above.
Correlation Matrix
You can play Sherlock Holmes and plot the correlation matrix, where you can spot multicollinearity very easily. This is the easiest way I can think of for now.
Let me take you through a practical example. Here is the data that I am using for the demonstration.
The objective of the problem is to predict the sales price based on the mentioned variables. I know, I know, you don’t require much information on the objective of the dataset as we aren’t going to accomplish that in this blog. :)
Once the dataset is ready, all you need to do is load it, exclude the target variable, and run a bunch of lines of Python code to print a correlation matrix like the one below. (Don’t worry, the code will be shared with you by the end of this blog post.)
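Since the full code comes at the end of the post, here is only a minimal sketch of what those lines could look like; the file name housing_sales.csv and the column name ‘Sale Price’ are assumptions, so adjust them to your copy of the data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name: replace with your own copy of the housing data
df = pd.read_csv("housing_sales.csv")

# Drop the target variable before looking at collinearity among predictors
features = df.drop(columns=["Sale Price"])

# Plot the correlation matrix as a heatmap
corr = features.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation matrix of the independent variables")
plt.show()
```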
A couple of observations that I would like to add:
- The variable ‘Interior’ is highly correlated with # of Beds, # of Rooms, and # of Bath.
- There is a high correlation between # of Rooms and # of Beds, which is obvious!
- There is also a high correlation between ‘Condo Fee’ and ‘Tax’.
As we have already seen, the problem with high correlation is how it changes the weight vector. Hence, building a model with these variables without treating the problem wouldn’t be advisable.
Variance Inflation Factor (VIF)
Rather than interpreting a correlation matrix, it’s better to have a single metric for understanding multicollinearity. One popular metric, known as the Variance Inflation Factor, or VIF for short, does exactly that. But what exactly is VIF?
It’s the ratio of the variance of a coefficient in the full model (with all the independent variables) to its variance in a model that includes only that one independent variable.
In other words, the VIF is a measure of the strength of the correlation between independent variables. It is computed by taking each independent variable in turn and regressing it against all the other independent variables.
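Concretely, if Ri² is the R-squared obtained when the i-th independent variable is regressed on all the other independent variables, then VIFi = 1 / (1 − Ri²). A variable that is completely independent of the rest has Ri² = 0 and hence a VIF of 1; the stronger the linear association, the closer Ri² gets to 1 and the larger the VIF.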
Let’s calculate the VIF for our dataset and showcase it here in the form of a table.
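A minimal sketch of a helper that can produce such a VIF table, assuming the statsmodels implementation and reusing the features DataFrame from the snippet above:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(features: pd.DataFrame) -> pd.DataFrame:
    """Return one VIF score per independent variable."""
    X = sm.add_constant(features)  # add an intercept so the VIFs aren't artificially inflated
    vif = pd.DataFrame({
        "Variable": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    })
    return vif[vif["Variable"] != "const"]  # the intercept's VIF is not meaningful

# 'features' is the predictor DataFrame from the correlation-matrix snippet above
print(compute_vif(features))
```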
Look at the table carefully: # of Beds and # of Rooms have high VIF scores, which indicates strong collinearity between these features. In real life this is quite obvious, so the table makes sense. But what exactly are those scores?
The score for each independent variable represents how well that variable is explained by the other independent variables. To compute it, each independent variable in turn is treated as the dependent variable and regressed on the rest, which produces an R² (R-squared) score. The higher the R², the more that variable is correlated with the others, and the larger the resulting VIF of 1 / (1 − R²) (see the table above).
The main problem I have found with VIF is that there is no upper limit: it starts at 1 and goes towards infinity. The commonly quoted cutoffs are mere rules of thumb, as there are no hard-and-fast rules for their interpretation. For example, some companies accept VIF values up to 7, while others want VIF to stay below 4.
How to Fix Multicollinearity?
Just as there are multiple ways to detect multicollinearity, there are multiple ways to fix it. Let me make life easier by jotting down some of these methods for fixing the multicollinearity issue.
- Dropping the Correlated Variables
- Combining the Variables
- Life savior → PCA
Let’s explore each of the above techniques to understand them better.
Dropping the Correlated Variables
This is the easiest method to cut down the multicollinearity in the data. High multicollinearity between two independent variables means that each variable is largely capable of explaining the other. Therefore we can keep one and remove the other from the dataset without losing much variance.
If you look at the above table, variables like ‘Interior (Sq Ft)’ and ‘# of Rooms’ are well explained by other variables like ‘# of Beds’ and ‘# of Bath’. Therefore we can remove those variables from the dataset and check whether that brings down the multicollinearity.
One thing to remember is that dropping variables and checking VIF is an iterative process. Even though we know which two variables are candidates for removal, we don’t remove them both at once. Here, we remove ‘Interior (Sq Ft)’ first, as it has the highest VIF value. As soon as we remove it and check the VIF values again, the VIFs of the other variables change. Therefore we have to remove variables one at a time, in an iterative fashion, as sketched below.
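A sketch of that iterative loop, reusing the compute_vif helper from above; the threshold of 5 is only an illustrative cutoff, not a hard rule:

```python
import pandas as pd

def drop_high_vif(features: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Iteratively drop the variable with the highest VIF until all VIFs fall below the threshold."""
    features = features.copy()
    while True:
        vif = compute_vif(features).set_index("Variable")["VIF"]
        worst = vif.idxmax()
        if vif[worst] < threshold:
            break
        print(f"Dropping '{worst}' (VIF = {vif[worst]:.2f})")
        features = features.drop(columns=[worst])
    return features

reduced = drop_high_vif(features)
print(compute_vif(reduced))
```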
Combining the Variables
Yes, we can combine multiple variables into a single variable that is capable of representing the others. In more technical terms, the representative variable should capture the variance of the variables it replaces.
In our case, we can combine ‘# of Beds’ and ‘# of Bath’ into a new variable, ‘Total Rooms (Space)’, which can serve as a representative of both.
Let’s validate our assumptions with the VIF table. (We will be calling our VIF function on the modified data)
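A sketch of that combination step, reusing the compute_vif helper; the simple sum used to build ‘Total Rooms (Space)’ is an assumption here, and the column names follow the discussion above:

```python
combined = features.copy()

# Fold the two correlated room counts into one representative variable
# (the combination rule is an assumption; a simple sum is used)
combined["Total Rooms (Space)"] = combined["# of Beds"] + combined["# of Bath"]
combined = combined.drop(columns=["# of Beds", "# of Bath"])

# Re-run the VIF check on the modified data
print(compute_vif(combined))
```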
These two methods are the ones used most in industry to remove the effect of multicollinearity. There is one other method we have yet to discuss: how PCA can be a life savior when you have a multicollinearity issue.
Life savior → PCA
Principal Component Analysis (PCA for short) is a dimensionality reduction technique used to reduce the dimensions of the data. In the reduction process, we shrink the number of dimensions guided by the percentage of variance we need to retain in order to build sensible ML models.
Many of us mistake PCA for a feature selection technique, but in reality it is not. It is a feature extraction technique, where new features are extracted in the form of components that capture most of the variance in the data.
The above table shows the principal components and the variance captured by each of them. You could neglect the last two components and be happy with 94% of the variance, or neglect only the last one and retain 98% of the captured variance. In both cases you are reducing the dimensions. But we are not here for ‘dimensionality reduction’; we are here for ‘multicollinearity reduction’. Thus, for now, let’s keep all the components as they are and run the VIF check on each of these components.
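A sketch of this step with scikit-learn, keeping every component and then reusing the compute_vif helper; standardizing before PCA is an assumption on my part:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so no single variable dominates the components
scaled = StandardScaler().fit_transform(features)

pca = PCA()  # keep all components for now
components = pca.fit_transform(scaled)
print("Variance captured per component:", pca.explained_variance_ratio_.round(3))

# The components are orthogonal by construction, so every VIF should come out close to 1
comp_df = pd.DataFrame(components, columns=[f"PC{i + 1}" for i in range(components.shape[1])])
print(compute_vif(comp_df))
```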
I don’t recommend using this method when you want to build an interpretable model. We lose the identity of the original variables as soon as we apply PCA, so you must forget about interpretability when you use it.
Final words
Having multicollinearity in the data can corrupt the model-building process; it can be problematic for your regression models as well as your classification models. A simple technique like a correlation matrix can be used to identify multicollinearity, and fixes as simple as dropping or combining variables can be employed to resolve it. Selecting the right (domain-specific) variables can make life easier, so special attention needs to be given to feature selection. We also talked about techniques like PCA to reduce multicollinearity, but the red flag there is interpretability: the moment we use PCA, we lose the identity of the variables. As I always say, there are no right or wrong techniques, only techniques that suit the situation.
Footnote
I have written numerous articles on Quora & Medium, and there are more lying in drafts waiting for me to finish and publish. Follow me on Medium as well as on LinkedIn to stay in touch. See you around. :)
Credits where it’s due -