How to Measure the Relationship Between Random Variables?

19 min read · Apr 6, 2020

Hello Reader,

I hope you enjoyed my previous article, Probability Distribution 101. In this blog post, I am going to demonstrate how we can measure the relationship between random variables. This topic carries a lot of weight, because data science is largely about finding relationships between variables and making predictions based on them. Before we start, let’s see what we are going to discuss in this blog post.

Random Variables
Covariance
Pearson correlation coefficient (PCC)
Monotonic Functions
Spearman Rank Correlation Coefficient (SRCC)
Significance Test
Correlation Vs Regression
Correlation Vs Causation

Let’s begin our discussion by understanding what a random variable is in the field of statistics.

Random Variables

Image Source: https://www.thoughtco.com/probabilities-of-rolling-two-dice-3126559

If we Google ‘Random Variable’ we will get almost the same definition everywhere, but my focus here is not just on stating the definition but on helping you understand what exactly it is with the help of relevant examples. To explain the concept I am going to use the example of “Fraud Detection”, a very useful case that most people can relate to real life.

Let’s say you work at a large bank or a payment service like PayPal, Google Pay etc. Your task is to identify fraudulent transactions. For this, you have identified some variables that will help to catch fraudulent transactions.

Amount Spend
IP Address
Number of Failed Attempts
Location
Time since the last transaction

There could be more variables in this list, but for us this is sufficient to understand the concept of random variables. Once a transaction completes, we will have a value for each of these variables (as shown below).

We consider these variables because they have an impact on the transaction status, that is, whether the transaction is fraudulent or genuine. The value of these variables cannot be determined before a transaction; however, the range or set of values each can take is predetermined.

Amount Spend:- [0, Infinity]
IP Address:- Sets of all IP Address in the world
Number of Failed Attempts:- [0,1,2,3]
Time since the last transaction:- [0, Infinity]
Location:- [Mumbai, Delhi, Bengaluru]

Note that the value of each variable would be different for each transaction, but what that value would be is subject to chance. In simpler terms, the values a variable is going to take are random and are known only when the transaction finishes. Such variables are termed ‘Random Variables’.

In a more formal way, we can define the Random Variable as follows:-

A random variable is any variable whose value cannot be determined beforehand, that is, before the event occurs.

Such variables are subject to chance, but their values can be restricted to a certain set of values. For example, three failed attempts will block your account from further transactions.

Random variables are ubiquitous, meaning they are present everywhere. A few examples are below:

The temperature in a day,
Length of the tweet
Profit per day
Sales per day etc.

Random variables are also known as stochastic variables in the field of statistics. There are 3 types of random variables:

Discrete:- A discrete random variable can take only distinct, countable values (typically integers). In the above example, ‘Number of Failed Attempts’ is a discrete random variable.
Continuous:- Continuous Random Variable can take any value from a range of values. In the above example, ‘Amount Spend’ is a continuous random variable.
Categorical:- Categorical Random Variable can take one of the limited fixed set of values. In the above example, ‘Location’ is a categorical random variable.

I hope the above explanation was enough to understand the concept of random variables. Now let's understand how to measure the relationship between random variables.

Let’s consider the following example. You have collected data on the height and weight of students as follows. (Heights and weights are not collected independently: in the table below, one row represents the height and weight of the same person.)

Here,

X: Height of the students
Y: Weight of the students

Is there any relationship between the height and weight of the students? If we investigate closely, we will see that one of the following relationships could exist:

When X increases, Y also increases.
When X increases, Y decreases.

Such relationships need to be quantified in order to use them in statistical analysis. So the question arises: how do we quantify such relationships? There are 3 ways to quantify such a relationship:

Covariance,
Pearson's Correlation Coefficient (PCC),
Spearman Rank Correlation Coefficient (SRCC).

We will be discussing the above concepts in greater detail in this post. Let's start with covariance.

Covariance

Covariance is quite similar to variance. Let’s shed some light on variance before we start learning about covariance.

Variance tells us how far the data is spread out from its mean. Since the mean is considered a representative number of a dataset, we generally like to know how far all the other points are spread out from it. So variance is basically the average of the squared distances from the mean. There are two types of variance: population variance and sample variance. The table below gives the formulation of both types.

Source: https://www.onlinemathlearning.com/variance.html
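For reference, the standard formulas for the two types of variance can be written as below. The symbols are assumptions of this sketch rather than taken from the original table: N is the population size with mean μ, and n is the sample size with sample mean x̄.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
\qquad\text{(population variance)}

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
\qquad\text{(sample variance)}
```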

I hope the concept of variance is clear here. It was necessary to add it as it serves the base for the covariance.

Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

Formulation of Covariance

As we have stated, covariance is very similar to variance, so the formulations of the two are close to each other.

If you look closely at the formulae for variance and covariance, you will see that they are very similar to each other.
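The similarity is easiest to see when the covariance formulas are written next to the variance ones. This is the standard textbook form, with assumed notation: μ_x and μ_y are the population means, x̄ and ȳ the sample means. Replacing Y with X in either line gives back the corresponding variance formula.

```latex
\operatorname{Cov}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)
\qquad\text{(population covariance)}

\operatorname{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
\qquad\text{(sample covariance)}
```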

Basically, covariance is a measure of the linear relationship between two random variables. Based on its direction, there are 3 types of covariance:

Positive Covariance
Negative Covariance
Zero Covariance

Positive Covariance

We say there is a positive relationship between two random variables X and Y when Cov(X, Y) is positive.

When X increases, Y also increases
The two random variables tend to move in the same direction

Consider the following example.

In the above diagram, when X increases Y also increases. As we said earlier, in this case Cov(X, Y) is +ve. Let’s consider the two points denoted above, i.e. (X1, Y1) and (X2, Y2). The means of the two random variables are given by μx and μy respectively.

(X1-μx) = This operation returns a positive value as X1 > μx
(Y1-μy) = This operation returns a positive value as Y1 > μy

Thus the product of the two positive numbers is positive.

(X2-μx) = This operation returns a negative value as X2 < μx
(Y2-μy) = This operation returns a negative value as Y2 < μy

Thus the product of the two negative numbers is also positive. This means that if two random variables have such a relationship, the covariance between them will be positive.

Negative Covariance

We say there is a negative relationship between two random variables X and Y when Cov(X, Y) is -ve.

When X increases, Y decreases.
The two random variables tend to move in opposite directions.

Consider the following example,

In the above diagram, we can clearly see that as X increases, Y decreases. This is the case where Cov(X, Y) is -ve. Let’s check two points, (X1, Y1) and (X2, Y2). The means of the two random variables are given by μx and μy respectively.

(X1-μx) = This operation returns a positive value as X1 > μx
(Y1-μy) = This operation returns a negative value as Y1 < μy

Thus the product of a positive and a negative number is negative.

(X2-μx) = This operation returns a negative value as X2 < μx
(Y2-μy) = This operation returns a positive value as Y2 > μy

Thus the product of a negative and a positive number is again negative. This means that if two random variables have such a relationship, the covariance between them will be negative.
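The sign arguments for positive and negative covariance can be checked numerically. Below is a minimal sketch, assuming the population form of covariance (the mean of the products of deviations); the data is made up purely for illustration.

```python
def covariance(xs, ys):
    """Population covariance: the mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5]
y_up = [10, 12, 15, 19, 24]    # rises as x rises
y_down = [24, 19, 15, 12, 10]  # falls as x rises

cov_up = covariance(x, y_up)
cov_down = covariance(x, y_down)

print(cov_up)    # positive: deviations share the same sign
print(cov_down)  # negative: deviations have opposite signs
```

In the first case every product of deviations is positive or zero, in the second every product is negative or zero, which is exactly the point-by-point reasoning used above.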

Zero Covariance

Zero covariance occurs when there is NO LINEAR RELATIONSHIP between two random variables. In this scenario, the data points scatter across the X and Y axes in such a way that no linear pattern or relationship can be drawn from them.

Image Source: https://www.slideshare.net/JonWatte/covariance

This can also happen when both the random variables are independent of each other. However, a covariance of ZERO between two random variables does not necessarily mean there is an absence of a relationship. A nonlinear relationship can exist between two random variables that would result in a covariance value of ZERO!
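A small sketch makes this caveat concrete. The data below is made up: Y is completely determined by X (Y = X²), yet because X is symmetric around zero the covariance comes out as zero.

```python
def covariance(xs, ys):
    """Population covariance: the mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical data: y depends entirely on x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

cov_xy = covariance(x, y)
print(cov_xy)  # zero covariance despite a perfect (nonlinear) relationship
```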

Properties of Covariance

The covariance of a variable with itself is nothing but the variance of that variable: Cov(X, X) = Var(X).

When random variables are multiplied by constants (let's say a & b), the covariance can be written as follows: Cov(aX, bY) = ab · Cov(X, Y).

The covariance between a random variable and a constant is always ZERO!

Cov(X, Y) is the same as Cov(Y, X).
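The properties above can be verified numerically. This is a minimal sketch on made-up data; the scaling property is taken in its standard form Cov(aX, bY) = ab · Cov(X, Y), and the constants a and b are arbitrary illustrative choices.

```python
import math


def covariance(xs, ys):
    """Population covariance: the mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def variance(xs):
    """Population variance: the mean of the squared deviations."""
    mean_x = sum(xs) / len(xs)
    return sum((x - mean_x) ** 2 for x in xs) / len(xs)


# Hypothetical data and constants, chosen only for illustration.
x = [2, 4, 6, 8]
y = [1, 3, 2, 5]
a, b = 3, -2

# Property 1: covariance of a variable with itself is its variance.
prop_self = math.isclose(covariance(x, x), variance(x))

# Property 2: constants factor out as a product.
scaled = covariance([a * v for v in x], [b * v for v in y])
prop_scale = math.isclose(scaled, a * b * covariance(x, y))

# Property 3: covariance with a constant is zero.
prop_const = math.isclose(covariance(x, [7] * len(x)), 0.0, abs_tol=1e-12)

# Property 4: covariance is symmetric.
prop_symmetric = math.isclose(covariance(x, y), covariance(y, x))

print(prop_self, prop_scale, prop_const, prop_symmetric)
```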

Drawbacks of using Covariance

Covariance tells us whether the relationship between two random variables is +ve or -ve, but it cannot answer how strongly positive or how strongly negative the relationship is.
Covariance is completely dependent on the scales/units of the numbers. Therefore it is difficult to compare covariances across datasets that have different scales. This drawback can be solved using Pearson's Correlation Coefficient (PCC).
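The scale dependence is easy to demonstrate. In the sketch below (made-up height and weight values), the same heights are expressed first in centimetres and then in metres; the relationship is identical, yet the covariance changes by a factor of 100.

```python
import math


def covariance(xs, ys):
    """Population covariance: the mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical measurements, made up for illustration.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 56, 65, 72, 80]
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)

print(cov_cm, cov_m)  # same data, different units, different covariance
```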

Pearson's Correlation Coefficient (PCC)

In statistics, a correlation coefficient is used to describe how strong the relationship between two random variables is. There are several types of correlation coefficients; the two covered here are Pearson’s Correlation Coefficient (PCC) and the Spearman Rank Correlation Coefficient (SRCC).

A few real-life cases you might want to look at:

The more time you spend running on a treadmill, the more calories you will burn.
The less time I spend marketing my business, the fewer new customers I will have.
As the temperature goes up, ice cream sales also go up.
As the weather gets colder, air conditioning costs decrease.
If a car decreases speed, travel time to a destination increases.
As the temperature decreases, more heaters are purchased.

Every correlation coefficient has a direction and a strength. The direction is given by the sign, which classifies correlation further:

Positive Correlation:

If two random variables move together, that is, one variable increases as the other increases, then we say a positive correlation exists between the two variables.
Ex: As the temperature goes up, ice cream sales also go up.

Negative Correlation:

If two random variables move in opposite directions, that is, one variable increases as the other decreases, then we say a negative correlation exists between the two variables.
Ex: As the temperature decreases, more heaters are purchased.

No / Zero Correlation:

If two random variables show no relationship to one another then we label it as Zero Correlation or No Correlation.
Ex: There is no relationship between the amount of tea drunk and level of intelligence.

Image Source: https://www.simplypsychology.org/correlation.html

The above diagram is a scatter plot. Drawing a scatter plot helps us understand whether a correlation exists between two random variables or not. The scatter plot only describes which type of correlation exists between two random variables (+ve, -ve or 0); it does not quantify the correlation. That's where the correlation coefficient comes into the picture. Let's dive deep into Pearson’s correlation coefficient (PCC) right now.

Pearson’s correlation coefficient is used to find how strong the linear relationship between two variables is. The formula returns a value between -1 and +1, where:

+1 indicates a perfect positive linear relationship.
-1 indicates a perfect negative linear relationship.
A result of zero indicates no linear relationship at all.

Case 1:

In the first diagram, we can see that X & Y are negatively correlated. PCC returns a value of -1 if and only if the values of X and Y fall exactly on a straight line with a negative slope. (This means there must be a strictly linear relationship between the two variables.)
In the second diagram, X & Y are also negatively correlated, yet PCC returns a value between -1 and 0 because the data points don’t fall exactly on a straight line. Some data points are scattered around the line. (There is no strictly linear relationship between the variables.)

Case 2:

In the first diagram, we can see a positive linear relationship between X and Y, although it isn’t perfectly linear. Therefore PCC returns a value between 0 and 1.
In the second diagram, X & Y are perfectly correlated: the data points fall on a single line. It’s a perfect positive linear relationship, hence PCC returns a value of +1.

Case 3:

In the above case, there is no linear relationship that can be seen between two random variables.
There is an absence of a linear relationship between two random variables but that doesn’t mean there is no relationship at all. There could be a possibility of a non-linear relationship but PCC doesn’t take that into account.
This is the perfect example of Zero Correlation. Thus PCC returns the value of 0.

Until now we have seen cases of PCC returning values ranging from -1 to +1. But have you ever wondered how we get these values?

In the above formula, PCC is calculated by dividing the covariance between two random variables by the product of their standard deviations. If we unfold the above formula further, we get the following:

As stated earlier, the above formula returns a value between -1 and +1. But these values need to be interpreted carefully. The table below will help us interpret PCC values:-
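For reference, the two forms described here can be written out as below. This is the standard textbook notation, assumed rather than copied from the original images: σ_X and σ_Y are the standard deviations, and x̄ and ȳ are the sample means.

```latex
\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```

Dividing by the standard deviations cancels the units, which is why the result always lies between -1 and +1 regardless of scale.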

Limitations:

The correlation coefficient always assumes a linear relationship between two random variables, regardless of whether that assumption holds true or not.
Computationally expensive. It takes more time to calculate the PCC value.
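The first limitation can be seen with a short sketch. The `pearson` helper below is a plain implementation of the standard PCC formula (covariance divided by the product of the standard deviations), and the data is made up. A perfect straight line gives exactly ±1, while a relationship that always increases but is curved (y = x³) gives a value below 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, made up for illustration.
x = [1, 2, 3, 4, 5, 6]
y_line_up = [2 * v + 1 for v in x]     # perfect increasing line
y_line_down = [10 - 3 * v for v in x]  # perfect decreasing line
y_cubic = [v ** 3 for v in x]          # always increasing, but curved

r_up = pearson(x, y_line_up)
r_down = pearson(x, y_line_down)
r_cubic = pearson(x, y_cubic)

print(round(r_up, 4), round(r_down, 4), round(r_cubic, 4))
```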

The first limitation can be addressed. Another correlation coefficient, the Spearman Rank Correlation Coefficient (SRCC), can take non-linear relationships into account as long as they are monotonic. Since SRCC is based on monotonic relationships, it is necessary to understand what monotonicity or a monotonic function means.

Monotonic Functions

A monotonic function preserves (or consistently reverses) the given order. The term monotonic means the function never changes direction: it either never decreases or never increases. There are four types of monotonic functions:

Monotonically Increasing Function
Strictly Monotonically Increasing Function
Monotonically Decreasing Function
Strictly Monotonically Decreasing Function

A function g(x) is said to be monotonically increasing if g(x) never decreases as x increases.

If x1 > x2 then g(x1) ≥ g(x2); Then g(x) is said to be monotonically increasing function.

If x1 > x2 then g(x1) > g(x2); Then g(x) is said to be strictly monotonically increasing function.

Strictly Monotonically Increasing Function

A function g(x) is said to be monotonically decreasing if g(x) never increases as x increases.

If x1 < x2 then g(x1) ≥ g(x2); Thus g(x) is said to be Monotonically Decreasing Function.
If x1 < x2 then g(x1) > g(x2); Thus g(x) is said to be Strictly Monotonically Decreasing Function

Monotonically Decreasing Function and Strictly Monotonically Decreasing Function
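These four definitions translate directly into a check on a sequence of function values. In the sketch below, the helper names are hypothetical, and `values` is assumed to hold g(x) evaluated at increasing x.

```python
def is_monotonic_increasing(values, strict=False):
    """True if each value is >= (or > when strict) the one before it."""
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(a < b for a, b in pairs)
    return all(a <= b for a, b in pairs)


def is_monotonic_decreasing(values, strict=False):
    """True if each value is <= (or < when strict) the one before it."""
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(a > b for a, b in pairs)
    return all(a >= b for a, b in pairs)


# g(x) evaluated at increasing x; made-up values for illustration.
flat_then_up = [1, 2, 2, 3]   # increasing, but not strictly
always_up = [1, 2, 4, 8]      # strictly increasing
up_then_down = [1, 3, 2, 0]   # not monotonic at all

print(is_monotonic_increasing(flat_then_up))               # True
print(is_monotonic_increasing(flat_then_up, strict=True))  # False
print(is_monotonic_increasing(always_up, strict=True))     # True
print(is_monotonic_decreasing(up_then_down))               # False
```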

Now that we have understood monotonic functions and the monotonic relationship between two random variables, it's time to study the concept called the Spearman Rank Correlation Coefficient (SRCC).

Spearman Rank Correlation Coefficient (SRCC)

The Spearman Rank Correlation Coefficient (SRCC) is the nonparametric version of Pearson’s Correlation Coefficient (PCC). Here nonparametric means a statistical method that does not require your data to follow a normal distribution. Such methods are also known as distribution-free methods and can provide benefits in certain situations.

The Spearman correlation evaluates the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate.

In SRCC we first find the rank of two variables and then we calculate the PCC of both the ranks. Thus we can define Spearman Rank Correlation Coefficient (SRCC) as below

The Spearman Rank Correlation Coefficient (SRCC) is a nonparametric measure obtained by computing the Pearson Correlation Coefficient (PCC) of the ranks of two random variables.

Since SRCC evaluates the monotonic relationship between two random variables, it is necessary to calculate the ranks of the variables of interest. How we calculate the ranks will be discussed later.

Spearman’s Rank Correlation Coefficient also returns the value from -1 to +1 where

+1 = a perfect positive correlation between ranks
-1 = a perfect negative correlation between ranks
0 = no correlation between ranks.

Steps for calculating Spearman’s Correlation Coefficient:

Step 1: Check for a monotonic relationship.
Step 2: Calculate the Rank of two variables
Step 3: Calculate the PCC of the ranked variables.

How do we rank the variables?

It is important to understand how to calculate the ranks of two random variables, since Spearman’s Rank Correlation Coefficient is based on those ranks. The example below will help us understand the process of calculation:-

The scores for nine students in physics and math are as follows:

Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28
Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31

Compute the students’ ranks in the two subjects and compute the Spearman rank correlation.

Step 1:-

Let's visualize the above data and see whether the relationship between the two random variables is linear or monotonic.

As we can see the relationship between two random variables is not linear but monotonic in nature. This fulfils our first step of the calculation.

Step 2:-

Process of calculating ranks:

The highest value will be ranked 1 (ranking the lowest value as 1 works just as well, as long as the same convention is used for both variables)
Subsequent values are ranked accordingly
When you have two identical values in the data (called a “tie”), you need to take the average of the ranks that they would otherwise have occupied. If two identical values fall, let's say, on the 6th and 7th positions, then the average (6+7)/2 gives 6.5. This rank is assigned to both values.
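The ranking process, including the averaging of tied ranks, can be sketched as follows. The function name `average_ranks` and its `highest_first` flag are hypothetical. The sketch ranks the highest value as 1 by default, which is the convention behind the ranks quoted for the worked example (physics rank 3 for the first student); passing `highest_first=False` ranks the lowest value as 1 instead.

```python
def average_ranks(values, highest_first=True):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    # Indices of the values, ordered by the ranking convention.
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=highest_first)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions are 0-based, ranks are 1-based; average over the tie.
        shared_rank = ((pos + 1) + (end + 1)) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
physics_ranks = average_ranks(physics)
print(physics_ranks)

# A made-up example with a tie: the two 20s share ranks 2 and 3.
tied_ranks = average_ranks([10, 20, 20, 30], highest_first=False)
print(tied_ranks)
```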

In the above table, we calculated the ranks of Physics and Mathematics variables. There is no tie situation here with scores of both the variables.

Step 3:- Calculate Standard Deviation & Covariance of Rank. (This step is necessary when there is a tie between the ranks. If not, please ignore this step)

Step 4:- Calculate SRCC

There are two methods to calculate SRCC, depending on whether there is a tie between ranks or not.

If there is no tie between ranks, use the following formula to calculate SRCC:

Here di is nothing but the difference between the two ranks of the i-th observation. For example, the first student’s physics rank is 3 and math rank is 5, so the difference is 2, and that number will be squared. It's good practice to add another column, d-squared, to hold all these values as shown below.

If there is a tie between ranks, use the following formula to calculate SRCC:

In our example stated above, there is no tie between the ranks hence we will be using the first formula mentioned above.
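For reference, the two formulas mentioned above can be written out in their standard textbook form. The notation is assumed: n is the number of paired observations, d_i is the difference between the two ranks of observation i, and R(X), R(Y) are the rank variables.

```latex
\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
\qquad\text{(no ties)}

\rho = \frac{\operatorname{Cov}\big(R(X), R(Y)\big)}{\sigma_{R(X)}\,\sigma_{R(Y)}}
\qquad\text{(with ties)}
```

The second form is simply the Pearson coefficient applied to the ranks, which is why the standard deviation and covariance of the ranks are needed when ties are present.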

ρ = 1 - (6 × 12) / (9 × (81 - 1)) = 1 - 72/720

ρ = 0.9

The Spearman Rank Correlation for this set of data is 0.9
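The whole calculation for the physics and mathematics scores can be reproduced with a short sketch. It assumes the no-tie formula ρ = 1 - 6Σd²/(n(n² - 1)) and ranks the highest score as 1; the helper name `ranks` is hypothetical and only valid when there are no ties.

```python
def ranks(values):
    """Rank 1 = highest value. Assumes there are no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths = [30, 33, 45, 23, 8, 49, 12, 4, 31]

physics_ranks = ranks(physics)
maths_ranks = ranks(maths)

# Squared rank differences, one per student.
d_squared = [(p - m) ** 2 for p, m in zip(physics_ranks, maths_ranks)]
sum_d_squared = sum(d_squared)

n = len(physics)
rho = 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

print(physics_ranks)
print(maths_ranks)
print(sum_d_squared, round(rho, 4))
```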

Advantages over PCC:

SRCC doesn’t require a linear relationship between two random variables. Whatever the exact shape of the relationship, SRCC works well as long as it is monotonic, that is, Y consistently moves in one direction as X increases.
SRCC is robust to outliers, whereas PCC is very sensitive to them.

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers that are in the tails of both samples. That is because Spearman’s rho limits the outlier to the value of its rank.
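A small sketch illustrates the outlier point. The data is made up: y follows x exactly except for one extreme final value. Pearson's coefficient drops sharply, while Spearman's stays at 1 because the outlier is still just the largest rank. The helpers are plain implementations of the standard formulas, and `ranks` assumes no ties.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank 1 = highest value. Assumes there are no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data with a single extreme outlier in y.
x = list(range(1, 11))
y = list(range(1, 10)) + [1000]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

print(round(r_pearson, 3), round(r_spearman, 3))
```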

Significance Test

When we quantify the relationship between two random variables using one of the techniques above, the result only gives a picture of the sample. (We make this assumption because most of the time we are dealing with samples only.)

Sometimes our objective is to draw a conclusion about the population parameters; to do so we have to conduct a significance test. A significance test tells us whether a relationship observed in the sample is likely to exist in the population as well, or whether it could plausibly have arisen by chance. We will be using hypothesis testing to make statistical inferences about the population based on the given sample.

Here I will be considering Pearson’s Correlation Coefficient to explain the procedure of statistical significance test. The objective of this test is to make an inference of population ρ based on sample r.

Let’s define our null and alternative hypotheses for this test. The hypothesis test will determine whether the value of the population correlation parameter ρ is significantly different from 0 or not. We will conclude this based upon the sample correlation coefficient r and the sample size n.

If the test does not show ρ to be significantly different from 0, we conclude that there is not enough evidence of a relationship between x and y. If there is a correlation between x and y in a sample but it does not occur in the population, then the correlation in the sample is due to random chance, a mere coincidence.

Let’s see the steps required to run a statistical significance test on random variables.

Step 1: Define your hypothesis

Defining the hypothesis means defining the null and alternative hypotheses. Remember, we are always trying to reject the null hypothesis, which means accepting the alternative hypothesis. In our case, accepting the alternative hypothesis means concluding that there is a significant relationship between x and y in the population.

Null hypothesis H0: ρ = 0
Alternative hypothesis H1: ρ ≠ 0

Step 2: Student’s t-Test

The student’s t-test is used to generalize about the population parameters using the sample.

Image Source: https://fabian-kostadinov.github.io

Here,

n is the sample size
r is the sample correlation coefficient value
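For reference, the standard t statistic for testing H0: ρ = 0 with these two quantities is written below. This is the usual textbook form, assumed here rather than copied from the original image; it follows a Student's t-distribution with n - 2 degrees of freedom.

```latex
t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^{2}}},
\qquad \text{df} = n - 2
```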

Once we get the t-value, depending on how big it is, we can decide whether the same correlation can be expected in the population or not. But the challenge is deciding how big is big enough. This is where the p-value comes into the picture. But what is the p-value?

P-Value

A p-value is used in hypothesis testing to decide whether to reject the null hypothesis. It measures the evidence against the null hypothesis: the smaller the p-value, the stronger the evidence that you should reject the null hypothesis. In other words, the p-value is the probability of observing a result at least as extreme as the one in our sample, assuming that the null hypothesis is true.

Let's say you get a p-value of 0.0354. This means that if the null hypothesis were true, there would be only about a 3.5% chance of seeing a result at least as extreme as yours through random chance alone.
If you get a p-value of 0.91, a result at least as extreme as yours would occur about 91% of the time under the null hypothesis. Such a result is entirely consistent with coincidence and gives no evidence of a real effect.
Therefore the smaller the p-value, the more statistically significant the result.

In statistics, we fix a threshold value, commonly 0.05 (this is also known as the level of significance α). If the p-value is ≤ α, a result this extreme would occur less than 5% of the time if the null hypothesis were true, so we reject the null hypothesis. If the p-value is > α, we fail to reject the null hypothesis.

The calculation of p-value can be done with various software. If we want to calculate manually we require two values i.e. t-value and degrees of freedom.
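As a rough illustration of the manual route, the sketch below computes the t-value and a two-tailed p-value using only the Python standard library. It assumes the standard statistic t = r√(n - 2)/√(1 - r²) with n - 2 degrees of freedom, and approximates the tail probability by numerically integrating the Student's t density with Simpson's rule (statistical software would use a dedicated routine instead). The inputs r = 0.9 and n = 9 are illustrative values only.

```python
import math


def t_statistic(r, n):
    """t-value for testing H0: rho = 0 from sample correlation r and size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p_value(t, df, steps=10_000):
    """P(|T| >= |t|), using Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1 - 2 * area)


# Illustrative inputs, not taken from a real study.
r, n = 0.9, 9
alpha = 0.05

t_value = t_statistic(r, n)
p_value = two_tailed_p_value(t_value, df=n - 2)

print(f"t = {t_value:.3f}, p = {p_value:.5f}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

With these inputs the p-value falls well below α = 0.05, so the null hypothesis ρ = 0 would be rejected.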

Correlation Vs Regression

Image Source: https://keydifferences.com

The difference between Correlation and Regression is one of the most discussed topics in data science. This question is also part of most data science interviews. Let’s understand it thoroughly so we can never get confused in this comparison.

Correlation is a statistical measure which determines the direction as well as the strength of the relationship between two numeric variables. On the other hand, regression is a statistical technique used to predict the value of a dependent variable with the help of an independent variable.

In correlation, we find the degree of relationship between two variables; regression instead models how a dependent variable changes in response to an independent variable. The value of the correlation coefficient varies between -1 and +1 whereas, in regression, a coefficient is an absolute figure expressed in the units of the variables.
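The difference between the two kinds of coefficient shows up when the data is rescaled. The sketch below assumes simple linear regression fitted by least squares (slope = Sxy / Sxx), which is not spelled out in this post, and uses made-up data. Multiplying the dependent variable by 10 multiplies the regression slope by 10, while the correlation coefficient stays the same.

```python
import math


def slope_and_correlation(xs, ys):
    """Least-squares slope of y on x, and the Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Hypothetical data, made up for illustration.
hours_studied = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
scores_rescaled = [10 * s for s in scores]  # same scores on a 10x scale

slope, r = slope_and_correlation(hours_studied, scores)
slope_rescaled, r_rescaled = slope_and_correlation(hours_studied, scores_rescaled)

print(slope, round(r, 4))
print(slope_rescaled, round(r_rescaled, 4))
```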

Correlation Vs Causation

Correlation and causation are among the most misunderstood terms in the field of statistics. I have seen many people use these terms interchangeably. It is very important to understand the nitty-gritty details of such confusing terms.

You might have heard about the popular term in statistics:-

“Correlation does not imply causation”

This phrase is used in statistics to emphasize that a correlation between two variables does not imply that one causes the other.

Let's take the above example. As per the study, there is a correlation between sunburn cases and ice cream sales. But that does not mean one causes the other. There could be a third factor that is causing or affecting both sunburn cases and ice cream sales. Yes, you guessed it right. It’s the summer weather that causes both, but remember that increasing or decreasing sunburn cases has no effect on ice cream sales.
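This situation can be imitated with a small simulation. Everything in the sketch below is invented for illustration: temperature is drawn at random, and both ice cream sales and sunburn cases are generated from temperature plus independent noise. Neither variable appears in the other's formula, yet the two end up strongly correlated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable

# Simulated daily temperatures: the common cause (all numbers are made up).
temperature = [random.uniform(15, 40) for _ in range(500)]

# Each outcome depends only on temperature plus its own independent noise.
ice_cream_sales = [10 * t + random.gauss(0, 20) for t in temperature]
sunburn_cases = [2 * t + random.gauss(0, 5) for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 2))  # strongly positive, with no causal link between the two
```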

Conclusion

So we have covered pretty much everything that is necessary to measure the relationship between random variables. I have also added some extra prerequisite sections for beginners, such as random variables and monotonic relationships.

Hope I have cleared some of your doubts today.

Thanks for reading. See you soon with another post!

References

Statistics How To: https://www.statisticshowto.com/
Some of my written notes
Google Images