Hypothesis Testing 101

Akash Dugam
15 min read · Sep 28, 2021


Hypothesis testing is an important concept in the area of statistics as well as in data science. In this post, we will be learning all the required information related to hypothesis testing. Stay tuned.

  1. What is Hypothesis Testing?
  2. Steps involved in Hypothesis Testing
  3. Defining Null & Alternate Hypothesis
  4. Choosing Test Statistics
  5. Decision Making Through P-Value
  6. Drawing Conclusion About The Population
  7. Examples

Let’s understand each of these concepts in a detailed manner.

What is Hypothesis Testing?

The main objective of statistics is to test a hypothesis. Let’s take an example: you are a medical researcher, and through an experiment you find that the drug ‘ABC’ is effective in treating fever. But for this result to be accepted, you have to perform the experiment multiple times (possibly with different samples); otherwise, no one will consider your results reliable.

What is a Hypothesis?

A hypothesis is a falsifiable claim that requires verification, typically from experimental or observational data, and that allows for predictions about future observations.

Why are Hypotheses important?

The reasons below shed some light on why hypotheses are important in the area of statistics.

  • Hypotheses improve experiment design, critical thinking, and data analysis.
  • It’s not possible to do clear and meaningful data analysis without a well-formed hypothesis.
  • Hypotheses transform loose ideas or speculation into concrete and specific claims.
  • Hypotheses are used to develop new and more accurate theories and dissolve bad theories.
  • Most progress in science, engineering, medicine, and technology is the result of hypothesis testing.

Types of Hypotheses

Hypotheses can be divided into 2 types. That is →

  • Strong Hypothesis
  • Weak Hypothesis

There are some characteristics of a Strong Hypothesis that we need to understand before we categorise statements or claims as Strong or Weak Hypotheses.

Characteristics of Strong Hypothesis →

  • It should be clear
  • It should be specific
  • It should be falsifiable, meaning it can be proved wrong
  • A good hypothesis is based on prior data and theory.
  • A good hypothesis leads to a statistical test. Meaning, there should be a way to test the hypothesis.
  • It has to be a statement, not a question.
  • A prediction about the direction of an effect.
  • A good hypothesis is relevant for unobserved data too.

Now based on the above information let’s classify the below statements into ‘Strong’, ‘Weak’ or ‘Not a hypothesis’ categories.

Statement 1 →

Medical research is important for curing diseases.

The above statement may be true, but it is not a hypothesis: it makes no specific, testable claim about data.

Statement 2 →

The medication has an effect.

The above statement falls into the ‘Weak Hypothesis’ category. The claim is falsifiable, but it is not clear and specific enough to be categorised as a Strong Hypothesis.

Statement 3 →

Will students pass this exam?

This is a question and not a statement. Thus it’s not a hypothesis.

Statement 4 →

Studying improves grades.

This is a hypothesis, but it’s not clear and specific. Thus it will be categorised as a weak hypothesis.

Statement 5 →

A combination of self-study and group study will improve final exam grades by at least 10%.

This is a clear and specific statement. It will be categorised into a strong hypothesis.

Statement 6 →

Washing hands for 20 seconds reduces disease spread.

This is a clear and specific claim. It can be tested. Thus we will label it as a strong hypothesis.

I hope you’ve understood hypotheses and the types of hypotheses. Now it’s time to understand what exactly is meant by hypothesis testing.

  • Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data.
  • A statistical hypothesis is an assumption that we make about a population parameter; this assumption may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses.
  • In other words, we can say, hypothesis testing is nothing but testing whether or not a claim is valid.

Examples: →

(A) Most people get their job from networking.

P > 0.50 …. The keyword ‘most’ indicates a proportion; most means more than half.

(B) The average payload of trucks on the highway is 18000 lbs.

μ = 18000 …. An average is given, hence the parameter is μ.

(C) The pharmaceutical company X recently launched a gender-selection drug, claiming that if it is consumed, there is an 80% chance that the woman will have a baby girl. To test this claim, 100 couples were recruited.

Let’s state the assumption here.

“This drug doesn’t work and the probability of having a boy or a girl is 50%.”

  • Here we are not assuming or accepting that the drug works, because in statistics you cannot directly prove that something is correct.
  • To support a claim, we first state its opposite and then try to show that this opposite is inconsistent with the data, which indirectly supports the original claim.

Let’s assume we tested this drug on 2 batches of 100 couples and the result is below,

  • 52/100 had a girl…. It’s a very usual scenario
  • 97/100 had a girl…. It’s a very unusual scenario

In the first case, we can call it a usual result because the proportion is close to 50%. But the second case is very unusual: the proportion of baby girls is 97%, which is far from what we assumed. The second result therefore contradicts our assumption, and we can say that the drug works.
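To make the ‘usual vs. unusual’ intuition concrete, here is a minimal sketch (an illustration of my own, not part of the original trial) that uses the binomial distribution to compute how likely each outcome would be if the drug truly had no effect:

# Chance of observing each result under H0: P(girl) = 0.5, 100 couples.
from scipy.stats import binom

p_usual = binom.sf(51, 100, 0.5)    # P(X >= 52 girls | H0) -- roughly 0.38, quite plausible
p_unusual = binom.sf(96, 100, 0.5)  # P(X >= 97 girls | H0) -- vanishingly small

print(p_usual, p_unusual)

The first result is entirely consistent with a fair 50/50 chance, while the second is so improbable under H0 that it gives strong evidence against it.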

Note →

The best way to determine whether a statistical hypothesis is true would be to examine the entire population. But examining the entire population is often impractical, so researchers typically examine a random sample drawn from the population. If the sample data are not consistent with the statistical hypothesis, the hypothesis is rejected.

Steps involved in Hypothesis Testing

Defining Null & Alternate Hypothesis

In this step, we need to state the two hypotheses from the statement. There are two types of statistical hypotheses.

  • Null Hypothesis / H0
  • Alternative Hypothesis / H1

Null Hypothesis (H0)

  • The “Null Hypothesis” is the hypothesis that nothing interesting is happening in the data.
  • The “Null Hypothesis” is an assumption made about the population which needs to be tested. While we test this assumption about the population parameter, the null hypothesis is considered true until evidence is found against it.
  • The null hypothesis states that the population parameter (μ, P) is equal to some value. H0: μ = 5 (or) H0: P = 0.5
  • It is denoted as H0.
  • We start by assuming H0 is true, then we use the evidence to draw a conclusion.

Reject H0 → When we’ve enough evidence to prove H0 is wrong.

Failed to Reject H0 → When we don’t have enough evidence to prove H0 is wrong.

Note → You cannot accept the null hypothesis. You either reject it or fail to reject it depending upon the evidence.

Alternative Hypothesis (H1)

  • It is opposite to the assumption made i.e. the null hypothesis. It is automatically accepted when the null hypothesis gets rejected.
  • In the research, you specify the “alternative hypothesis”
  • It is sometimes called an “Effect Hypothesis”
  • In statistical analysis, you never test the alternative hypothesis. You only test the null hypothesis.
  • It is denoted as H1.
  • The alternative hypothesis states that the population parameter (μ, P) has a different value than H0…. H1: μ ≠ 5 (or) H1: P > 0.6
  • If you ‘reject’ the null hypothesis, that means you indirectly accept the alternative hypothesis.
  • If you ‘fail to reject’ the null hypothesis, that means you have failed to accept the alternative hypothesis.

Let’s understand the above two concepts with the help of examples.

  • A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%.

H0: Medicine reduces cholesterol by 25%. | P = 0.25

H1: Medicine does not reduce cholesterol by 25%. | P ≠ 0.25

  • We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0).

H0: Average GPA of students in American colleges is 2.0 | μ = 2.0

H1: Average GPA of students in American colleges is not equal to 2.0 | μ ≠ 2.0

Choosing Test Statistics

  • Test statistics are used in hypothesis testing to make a decision about the ‘Null Hypothesis’ → whether to reject the null hypothesis or fail to reject it, based on the evidence.
  • A test statistic is a number that is calculated from a statistical test of the hypothesis. It shows how far your observed data are from what H0 predicts.
  • To decide on your hypothesis, we need to calculate a test statistic such as a z-value, t-value, F-value or χ²-value; from it we can derive a p-value or compare it against a critical value (a small sketch follows this list).
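As a concrete illustration, here is a small sketch of a one-sample z statistic for the truck-payload claim above (the sample mean, standard deviation and sample size are made-up numbers, purely for illustration):

# z statistic for H0: mu = 18000 lbs, assuming a known population sigma.
import math
from scipy.stats import norm

sample_mean = 18312   # hypothetical sample mean
mu_0 = 18000          # value claimed under H0
sigma = 1500          # assumed (known) population standard deviation
n = 100               # sample size

z = (sample_mean - mu_0) / (sigma / math.sqrt(n))  # distance from H0 in standard errors
p_value = 2 * norm.sf(abs(z))                      # two-tailed p-value derived from z
print(z, p_value)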

p-value Interpretation

It’s very important to understand the ‘p-value’ if you are studying statistics. The majority of decisions in hypothesis testing are based on the interpretation of the p-value, and, let me tell you, this is one of the most confusing topics in the whole of statistics. I have seen people misquote the p-value in many places. I will take a simple example to explain what exactly the p-value is.

A Coin Toss Example → Find out whether it is a trick coin or not.

H0: It’s Fair Coin.

H1: It’s Trick Coin.

Now we have defined our H0 and H1. The next thing would be to experiment.

  • Let’s toss a coin the first time and observe the result. Let’s say you got tails on the first attempt.

Based on this first result, should we call it a trick coin? Probably not. We assume the null hypothesis is true, which states that it’s a fair coin, and for a fair coin getting tails on a single toss is perfectly ordinary. With this limited information, we cannot simply reject H0.

Therefore, P(1T | H0) = 0.50

  • Let’s toss a coin a second time and observe the result. Let’s say you got tails on the second attempt.

Based on the first two results, should we call it a trick coin? By considering H0 is true, there is a 25% chance that it could happen due to randomness. Hence we cannot reject the null hypothesis.

Therefore, P(2T | H0) = 0.25

  • Let’s toss a coin for the third time and observe the result. Let’s say you got tails on the third attempt as well.

Based on the first three results, should we call it a trick coin? By considering H0 is true, there is a 12.5% chance that this could’ve happened due to randomness. We are not yet convinced to call it a trick coin.

Therefore, P(3T | H0) = 0.125

  • Let’s toss a coin for the fourth time and observe the result. Let’s say you got tails on the fourth attempt too.

Based on the first four results, should we call it a trick coin? Now things get a little suspicious. Assuming H0 is true, there is only about a 6% chance that this happened due to random chance. Therefore, P(4T | H0) ≈ 0.06. Let’s toss the coin one more time to confirm.

  • Let’s toss a coin for the fifth time and observe the result. Let’s say you got tails on the fifth attempt as well.

Therefore, P(5T | H0) ≈ 0.03

Now should we call this a trick coin, after this experiment? Assuming H0 is true, there is only about a 3% chance that this would have happened due to random chance. Since the probability of this happening randomly is so low, we reject the ‘Null Hypothesis’ and conclude that the coin is a trick coin, biased towards tails.

The resulting probability, 3%, is less than the threshold value of 5% (we will see how this threshold is decided later in this blog). This leads us to reject H0 and accept the alternative hypothesis.

This probability is nothing but the p-value in hypothesis testing. When the probability of the observation, computed assuming H0 is true, falls below the threshold value / alpha / level of significance (i.e. 0.05), we reject the null hypothesis and accept the alternative hypothesis.
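The whole coin-toss argument can be reproduced in a few lines; a minimal sketch of the probabilities used above:

# Probability of getting only tails in n tosses of a fair coin (H0),
# compared against the 0.05 threshold used in this example.
for n_tosses in range(1, 6):
    p = 0.5 ** n_tosses          # P(n tails in a row | H0: fair coin)
    decision = "reject H0" if p < 0.05 else "fail to reject H0"
    print(n_tosses, round(p, 4), decision)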

In this example, we’ve calculated the p-value easily with the help of various probability concepts. But with increasing sample size it won’t be easy to calculate the p-value with basic probability concepts. In such cases, we mostly use a technique called ‘Permutation Testing’ to calculate the p-value.

Calculating P-Value: Permutation Testing

Let’s understand Permutation Testing with the help of one example.

People who follow a strict diet along with the gym lose significantly more weight than people who only go to the gym.

Let’s consider two samples of size 50 each, gathered into groups A and B. The initial weight of all the participants was 80 kg.

A → Contains a sample of those who follow a gym routine with a strict diet

B → Contains a sample of those who follow only gym routines.

Let me define what the P-value is in the above case.

The P-value is p(mean difference in weight ≥ 6 kg | H0), i.e. the probability of finding a difference (let’s call it Δ ≥ 6 kg) in mean weight when the null hypothesis is considered true. Now let’s look at calculating this probability, i.e. the p-value, with the help of permutation testing.

In a permutation test we pool both samples, repeatedly shuffle the group labels, and recompute the mean difference in weight for each shuffle. The p-value is then literally the percentage of shuffled differences that are at least as large as Δ. In the above example that percentage is 3%, hence the p-value is 0.03. Now, what does this p-value tell us? Should we reject the null hypothesis right now? Not yet; we haven’t discussed how to make a decision on the basis of the p-value. We will learn that later in this blog post, but before that, let’s look at the procedure in code and then at the p-value alternatives: the z-value and t-value.
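Here is a minimal sketch of that procedure with hypothetical weight data (plain NumPy; the mlxtend helper used at the end of this post does essentially the same thing):

# Permutation test for the gym/diet example. The data are made up; only the
# procedure matters: pool the samples, shuffle the group labels many times,
# and count how often the shuffled mean difference is at least as large as
# the observed one.
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(74, 3, 50)   # hypothetical final weights: gym + strict diet
group_b = rng.normal(80, 3, 50)   # hypothetical final weights: gym only

observed_delta = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

count = 0
n_rounds = 10000
for _ in range(n_rounds):
    rng.shuffle(pooled)                               # randomly relabel the participants
    delta = pooled[50:].mean() - pooled[:50].mean()
    if delta >= observed_delta:                       # at least as extreme as observed
        count += 1

p_value = count / n_rounds    # fraction of shuffled differences >= observed delta
print(observed_delta, p_value)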

z-value and t-value

In hypothesis testing, we can also draw a conclusion about the hypothesis with the help of a z-value or a t-value; we do not always require a p-value.
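For example, a one-sample t test returns both a t-value (to compare with a critical value) and a p-value. A quick sketch with hypothetical GPA data for the ‘mean GPA = 2.0’ example from earlier:

# One-sample t test of H0: mu = 2.0 on made-up GPA values.
from scipy import stats

gpas = [2.3, 1.9, 2.7, 2.1, 2.5, 1.8, 2.9, 2.4, 2.2, 2.6]

t_stat, p_value = stats.ttest_1samp(gpas, popmean=2.0)
t_critical = stats.t.ppf(1 - 0.05 / 2, len(gpas) - 1)   # two-tailed critical value, alpha = 0.05

print(t_stat, t_critical, p_value)
# Reject H0 if |t_stat| > t_critical (equivalently, if p_value < 0.05).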

How to make a decision?

As we’ve stated already, we need to have test statistics (p-value or z-value or t-value) to make a decision about our hypothesis. We will learn about the decision-making process in this section.

In the decision-making process, we need to have the information about the below concepts →

Confidence Interval

  • It is a ‘range’ that is used to estimate a population parameter.
  • A confidence interval has something called a ‘Confidence Level’ associated with it.

Confidence Level

  • The confidence level tells you how confident you are that the actual value of the population parameter lies inside the range or interval.
  • The confidence level is expressed as ‘1 − α’, where α is the complement of the confidence level (the level of significance).
  • There are mostly three confidence levels that we deal with: 90%, 95% and 99%.
  • 0.95, or a 95% confidence level, is the one most commonly used in statistics (a quick sketch of computing a 95% interval follows below).
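A quick sketch of how such an interval is computed, using a z-based interval with an assumed known σ (all numbers are hypothetical):

# 95% confidence interval for a population mean with known sigma.
import math
from scipy.stats import norm

sample_mean, sigma, n = 18312, 1500, 100
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)            # 1.96 for a 95% confidence level

margin = z_crit * sigma / math.sqrt(n)
print(sample_mean - margin, sample_mean + margin)   # the 95% CI for mu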

Critical Value

  • In hypothesis testing, a critical value is a point on the scale of the test statistic that splits the graph into a ‘rejection region’ and a ‘non-rejection region’. If your test statistic falls into the rejection region, we reject the null hypothesis.
  • It is derived from the level of significance α; for a two-tailed test the critical values are the z-scores that cut off α/2 in each tail (for example ±1.96 when α = 0.05).
  • It is a z-score that separates the ‘likely region’ from the ‘unlikely region’.
  • When the population standard deviation is unknown, we do not calculate a z-score; instead we calculate something called a t-score.
(Figure: a critical value splitting the curve into rejection and non-rejection regions. Image source: http://www.mathnstuff.com/)

The critical value is the value that separates the rejection region from the region of acceptance.

The critical value serves as the boundary value (or condition) for the test statistic. If the test statistic falls in the region of rejection (α), i.e. test statistic > critical value, then we reject H0.

If the test statistic falls into the region of acceptance (1 − α), i.e. test statistic < critical value, then we fail to reject H0. The example above is right-tailed, hence the critical value is located on the right side; the critical value can also be on the left, or on both sides (a two-tailed test).

Note: The critical values depend heavily on whether the test is left-tailed, right-tailed or two-tailed; a quick way to look them up is sketched below.
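For reference, the z critical values for the three test types at α = 0.05 can be computed directly; a small sketch:

# z critical values for alpha = 0.05.
from scipy.stats import norm

alpha = 0.05
print(norm.ppf(alpha))          # left-tailed  : about -1.645
print(norm.ppf(1 - alpha))      # right-tailed : about  1.645
print(norm.ppf(1 - alpha / 2))  # two-tailed   : about  1.96 (use +/- 1.96)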

Making Decision based on p-value / z- value / t-value

Note: Let’s consider the most commonly used confidence level, 95% or 0.95.

Let’s consider the below values:

  • z-value = 3.81 | z-critical = 1.96
  • p-value = 0.03
  • t-value = 0.87 | t-critical = 1.96
  • α = 0.05
  • 1-α = 0.95

With the z-value: since z-value > z-critical, we reject the null hypothesis.

With the t-value: since t-value < t-critical, we fail to reject the null hypothesis.

With the p-value: since p-value < α, we reject the null hypothesis.
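Written as code, these three rules are just comparisons (using the same numbers as above):

# Decision rules for a two-tailed test at alpha = 0.05.
z_value, z_critical = 3.81, 1.96
t_value, t_critical = 0.87, 1.96
p_value, alpha = 0.03, 0.05

print("z:", "reject H0" if abs(z_value) > z_critical else "fail to reject H0")
print("t:", "reject H0" if abs(t_value) > t_critical else "fail to reject H0")
print("p:", "reject H0" if p_value < alpha else "fail to reject H0")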

In a modern computing environment it is easy to calculate the p-value, and the majority of tools and software packages return it for us. Therefore, the p-value is the test statistic most people use to make decisions in hypothesis testing.

Errors in Statistical Test

In statistical hypothesis testing no test is ever 100% certain. Because we rely on a p-value, which is based on probabilities, there is always a chance of drawing an incorrect conclusion about rejecting or failing to reject the null hypothesis.

Remember, when we make decisions based on statistics there are a total of 4 outcomes possible.

  • True Positive
  • True Negative
  • False Positive
  • False Negative

The chances of committing these two types of errors (Type I and Type II) trade off against each other: decreasing the Type I error rate increases the Type II error rate, and vice versa.

TYPE I ERROR: → FALSE POSITIVE

  • A Type 1 error is also known as a false positive. It occurs when the null hypothesis is rejected even though it is true and should not be rejected.
  • The probability of making a Type 1 error is given by your alpha level (α), the level of significance.
  • A significance level of 0.05 indicates that we are willing to accept a 5% chance of being wrong when we reject the null hypothesis (see the simulation sketch after this list).
  • We can reduce the risk of committing a Type 1 error by using a lower value for α. For example, α = 0.01 would mean there is a 1% chance of committing a Type I error.
  • But using a lower value for α means you are less likely to detect a true difference if one really exists, thus risking a Type II error.
  • In terms of the courtroom example, a Type I error corresponds to convicting an innocent defendant.
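To see why α is literally the Type I error rate, here is a small simulation sketch: every sample below is drawn from a population where H0 is true, yet roughly 5% of the tests still reject it.

# Simulating the Type I error rate: H0 (mu = 5.0) is true in every experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, rejections, n_experiments = 0.05, 0, 5000

for _ in range(n_experiments):
    sample = rng.normal(5.0, 1.0, 30)                 # data generated with mu = 5.0
    _, p = stats.ttest_1samp(sample, popmean=5.0)     # test H0: mu = 5.0
    if p < alpha:
        rejections += 1                               # a false positive

print(rejections / n_experiments)   # close to 0.05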

TYPE II ERROR: → FALSE NEGATIVE

  • A Type 2 error is also known as a false negative.
  • It occurs when the researcher fails to reject a null hypothesis that is actually false.
  • Here the researcher concludes there is no significant effect when there actually is one.
  • The probability of making a Type 2 error is called beta (β), and it is related to the power of the statistical test (1 − β).
  • We can decrease the risk of committing a Type 2 error by ensuring the test has enough power, i.e. by making the sample size large enough to detect a practical difference when one truly exists (a sketch follows this list).
  • In terms of the courtroom example, a Type II error corresponds to acquitting a criminal.
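As mentioned in the list above, the usual way to control β is to choose a large enough sample. A sketch using statsmodels’ power calculator (the effect size of 0.5 is purely an assumed number for illustration):

# Sample size per group needed for 80% power (beta = 0.2) at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(n_per_group)   # roughly 64 participants per group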

CROSSOVER ERROR RATE (CER)

  • The crossover error rate is the point at which the Type 1 and Type 2 error rates are equal.
  • It represents the best way of measuring biometrics effectiveness. A system with a lower CER value provides more accuracy than a system with a higher CER value.

Python Code for Calculating P-Value Using Permutation Testing

treatment = [28.44, 29.32, 31.22, 29.58, 30.34, 28.76, 29.21, 30.40,
             31.12, 31.78, 27.58, 31.57, 30.73, 30.43, 30.31, 30.32,
             29.18, 29.52, 29.22, 30.56]
control = [33.51, 30.63, 32.38, 32.52, 29.41, 30.93, 49.78, 28.96,
           35.77, 31.42, 30.76, 30.60, 23.64, 30.54, 47.78, 31.98,
           34.52, 32.42, 31.32, 40.72]

# Approximate (Monte Carlo) permutation test on the difference in means
from mlxtend.evaluate import permutation_test

p_value = permutation_test(treatment, control,
                           method='approximate',
                           num_rounds=10000,
                           seed=0)
print(p_value)

if p_value <= 0.05:
    print("Null hypothesis is rejected; accept the alternative hypothesis")
else:
    print("Failed to reject the null hypothesis")

Output →

0.0066993300669933005
Null hypothesis is rejected; accept the alternative hypothesis
