Demystifying Confidence Interval

Akash Dugam
11 min read · Apr 9, 2020

Be honest: have you ever heard of the confidence interval? It is hard to study statistics without running into it, as it is one of the most important concepts in the field. In this blog post, I will try to increase your confidence in this concept. I hope I succeed. :)

My aim is to explain the concept in a lucid manner. This post will not cover everything related to confidence intervals, but it should be sufficient to understand the underlying concepts required for a data science journey. We will touch on the following topics in this post.

  • Point Estimates and need for Confidence Interval (CI)
  • Computation of CI

Point Estimation:

It is necessary to understand this concept first, because its drawbacks are what urged me to write this post about confidence intervals. Point estimation is defined as follows:

In statistics, a point estimate is a single value (usually a sample statistic) calculated from sample data that serves as an estimate of an unknown population parameter.

Below are some examples of point estimates:

  • The sample mean x̄ is a point estimate of the population mean, μ
  • The sample variance s² is a point estimate of the population variance σ²
  • The sample standard deviation (s) is a point estimate of the population standard deviation (σ).

Let’s take the help of an example to explore the concept of point estimation in detail.

Imagine we have a random variable X which represents the heights shown above. The distribution of heights is unknown. There are 10 observations in this sample, i.e. X1, X2, X3, … , X10. Let’s assume this is a sample from one city and our aim is to estimate the mean height of the population of that city. Given this information, one question that comes to mind is:

How do we do that? How do we calculate the mean height of the population given a sample of 10 observations?

In the most general way, we can calculate the mean of the given sample and use it as an approximation of the population mean. In other words, we will use the value of x̄ (sample mean) to estimate μ (population mean):

x̄ = (X1 + X2 + … + X10) / 10

If we plug the values into the above formula, we get a sample average of 166.6.

As ’n’ increases x̄ moves towards μ

One thing to notice here is that as the sample size ’n’ increases, x̄ tends to move towards μ, i.e. the approximation becomes more accurate.

Here we have estimated μ with the help of the sample mean x̄. This process is called point estimation. In this case, x̄ is known as a point estimator of the population mean μ.
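To see this effect in action, here is a small simulation sketch. It assumes a hypothetical population of heights that is normal with μ = 168 and σ = 5 (these values, and the helper name `average_error`, are chosen only for illustration) and measures how far x̄ lands from μ on average for different sample sizes.

```python
import random
import statistics

def average_error(n, mu=168.0, sigma=5.0, repeats=200, seed=0):
    """Average |x_bar - mu| over many simulated samples of size n.

    mu and sigma describe a hypothetical normal population of heights.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)

# The typical error shrinks as the sample size grows.
for n in (10, 100, 1000):
    print(n, round(average_error(n), 3))
```

Any single sample can still be unlucky, which is why the sketch averages over many repeated samples instead of relying on one draw.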

This is not the best way to estimate the population parameter, because a single value says nothing about how far off it might be. Let’s say that instead of a point estimate we state the result like below:

The population mean μ lies between 162.7 and 168.9 cm with 95% confidence

This estimate is better than a point estimate, because we are not giving one single value; instead, we are giving an interval together with a confidence level. This is the idea behind the concept called Confidence Interval (CI). In statistics, a CI is more informative than a point estimate, as it conveys how much uncertainty is attached to the result.

Computation of Confidence Interval (CI)

There are multiple scenarios in which a Confidence Interval (CI) needs to be calculated. Take a look at the situations below:

  • A situation where the underlying distribution is known.
  • A situation where the population standard deviation is known.
  • A situation where the population standard deviation is unknown.
  • A situation where the confidence interval needs to be calculated for a population parameter other than the mean, like the median, standard deviation, 90th percentile etc. In such cases, a method called bootstrapping is used.

These situations will be explained in greater detail, but before that, we must understand the following important concepts:

  1. Confidence Interval
  2. Confidence Level
  3. Critical Value
  4. Margin of Error

Confidence Interval

  • It is a ‘range’ that is used to estimate a population parameter, like the range we have seen above, i.e. [162.7, 168.9]
  • Here 162.7 is known as the ‘Lower Bound’ and 168.9 is known as the ‘Upper Bound’.
  • A Confidence Interval has something called a ‘Confidence Level’ associated with it.

Confidence Level

  • The confidence level tells us how confident we are that the actual value of the population parameter lies inside the range or interval.
  • The confidence level is expressed as ‘1-α’, where α is the significance level, i.e. the complement of the confidence level.
  • There are mostly 3 confidence levels that we deal with: 0.90, 0.95 and 0.99
  • 0.95 or 95% CI is the one most used in statistics. (Marked in green above)

Critical Value

  • In hypothesis testing, a critical value is a point on the scale of the test statistic used to split the distribution into sections: the ‘rejection region’ and the ‘non-rejection region’. If your test statistic falls into the rejection region, we reject the null hypothesis.
  • It is derived from the level of significance α: for a two-sided interval, the critical value is the z-score that leaves an area of α/2 in each tail, written z(α/2)
  • It is the z-score that separates the ‘Likely Region’ from the ‘Unlikely Region’
  • We do not use the z-score when the population standard deviation is unknown. Instead, we use something called the t-score.
Image Source: http://www.mathnstuff.com/
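As a rough sketch of how the z critical value follows from α, the snippet below uses Python's standard-library `statistics.NormalDist`. The helper name `z_critical` is only an illustrative choice, and the sketch covers the two-sided case.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """Two-sided z critical value: leaves alpha/2 in each tail."""
    alpha = 1 - confidence_level
    # inv_cdf returns the z-score below which the given area lies
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))
```

For the three common confidence levels this gives approximately 1.645, 1.960 and 2.576, which are the values usually read from a z-table.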

Margin of Error

  • The maximum expected difference between the sample statistic and the population parameter, at a given confidence level, is known as the Margin of Error
  • The larger the margin, the less precise the estimate, and the less confidence one should have in the point estimate alone.
  • It is denoted by E
  • Statistics are not always exactly right. Taking this into consideration, the confidence interval and margin of error state that the estimated population value may differ from the true value by up to the margin of error.
  • We can calculate the Margin of Error (E) with the help of the following formula: E = z * σ/SQRT(n), where z is the critical value, σ is the population standard deviation and n is the sample size
Image Source: GetCalc
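The margin of error is simple enough to sketch as a one-line function. The name `margin_of_error` and the input values below are illustrative assumptions; the formula is the usual critical value times the standard error.

```python
import math

def margin_of_error(critical_value, std_dev, n):
    """E = critical value * standard deviation / sqrt(sample size)."""
    return critical_value * std_dev / math.sqrt(n)

# Quadrupling the sample size halves the margin of error.
for n in (10, 40, 160):
    print(n, round(margin_of_error(1.96, 5, n), 3))
```

Because n sits under a square root, shrinking the margin by half requires four times as many observations.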

We have seen the various situations in which a confidence interval can be calculated. There are certain steps to follow while calculating the confidence interval. These steps are as follows:

  • Step 1:- Find the critical value: a z-score or a t-score (depending upon the situation) for the given confidence level. For a t-score, we need to look up the critical value in a t-distribution table. For this, we need the confidence level and the degrees of freedom (DF), i.e. DF = n-1
  • Step 2:- Find the Margin of Error (E)
  • Step 3:- Construct the confidence interval, (x̄-E) < μ < (x̄+E)

Let’s start constructing the confidence interval without further ado!

Case 1

Estimating CI given the underlying Distribution

Imagine you know the distribution from which the random variable comes. Let’s say that distribution is a normal distribution.

X ~ N(μ, σ)

Let X be a normally distributed random variable with μ = 168 cm and σ = 5 cm. Using these location and scale parameters, the Gaussian distribution can be drawn like below

In the above diagram, roughly 95% of the area under the curve lies within two standard deviations of the mean (μ ± 2σ).

Since the random variable X is normally distributed, we can use this information to construct the confidence interval as follows:
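One way to read this case: when the distribution is fully known, the interval that contains 95% of the values of X can be read straight off the distribution. A minimal sketch with the standard library, using the μ = 168 and σ = 5 stated above (the variable names are illustrative):

```python
from statistics import NormalDist

# X ~ N(mu=168, sigma=5), as stated in the text
heights = NormalDist(mu=168, sigma=5)

# Cut off 2.5% of the area in each tail to keep the central 95%
lower = heights.inv_cdf(0.025)
upper = heights.inv_cdf(0.975)

print(round(lower, 1), round(upper, 1))
```

This gives roughly 158.2 to 177.8 cm, which is μ ± 1.96σ; the common rule of thumb μ ± 2σ gives the slightly wider [158, 178].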

Case 2

Estimating C.I about the population mean when Population Standard Deviation is known!

Since we know the population standard deviation in this case, we can treat the sampling distribution of the sample mean as normal with the help of the Central Limit Theorem (CLT):

  • The sample mean follows a Gaussian distribution whose mean is equal to the population mean and whose standard deviation is equal to the population standard deviation divided by SQRT(n)
  • For a sufficiently large sample, it doesn’t matter what the population distribution is (as long as its variance is finite): the sampling distribution of the mean will be approximately normal with mean μ and standard deviation σ/SQRT(n)

Therefore, x̄ ~ N(μ, σ/SQRT(n))
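The claim about the sampling distribution can be checked with a quick simulation. The sketch below assumes a deliberately non-normal population, an exponential distribution with mean 1 and standard deviation 1; the population, the sample size of 30 and the helper name are illustrative assumptions.

```python
import math
import random
import statistics

def simulate_sample_means(n, num_samples=5000, seed=1):
    """Draw many samples of size n from an exponential population
    (mean 1, standard deviation 1) and return their sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

n = 30
means = simulate_sample_means(n)
print("mean of sample means:", round(statistics.fmean(means), 3))
print("std of sample means: ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", round(1 / math.sqrt(n), 3))
```

The mean of the sample means should land close to μ = 1 and their standard deviation close to σ/SQRT(n) ≈ 0.183, even though the population itself is heavily skewed.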

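Since the sampling distribution follows a normal distribution, constructing the CI is straightforward:

x̄ - z * σ/SQRT(n) < μ < x̄ + z * σ/SQRT(n)

Using the information that we have: x̄ = 168.5, σ = 5, n = 10. If we construct the CI with 95% confidence, using the rule of thumb z ≈ 2, then E = 2 * 5/SQRT(10) ≈ 3.16 and

μ ~ [165.34, 171.66]

With the more precise critical value z = 1.96, the interval is slightly narrower: [165.40, 171.60].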
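The three steps (critical value, margin of error, interval) can be sketched for this case as follows. The function name `z_interval` is an illustrative choice, and the critical value is computed exactly instead of being rounded to 2.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence_level=0.95):
    """CI for the mean when the population standard deviation is known."""
    # Step 1: critical value for a two-sided interval
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Step 2: margin of error
    e = z * sigma / math.sqrt(n)
    # Step 3: interval around the sample mean
    return sample_mean - e, sample_mean + e

lower, upper = z_interval(sample_mean=168.5, sigma=5, n=10)
print(round(lower, 2), round(upper, 2))
```

With the exact 95% critical value (about 1.96) the interval comes out near [165.40, 171.60], slightly narrower than the one obtained with the rounded value z ≈ 2.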

Case 3

Estimating C.I about the population mean when Population Standard Deviation is Unknown!

We cannot use the z-based interval when we do not know the population standard deviation (σ), because we have to estimate it with the sample standard deviation (s), which adds extra uncertainty. In such cases (which are the most common in practice) we use something called Student’s t-distribution.

Student’s t-distribution is used when the population standard deviation is unknown, especially when the sample size is small (usually less than 30).

The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean.

Image Source: Wikipedia
  • As the degrees of freedom increase (denoted ν in the image), the peak of the distribution gets higher and the tails get thinner.
  • The t-distribution (also called Student’s t-distribution) is a family of distributions that look almost identical to the normal distribution curve, only a bit shorter and with fatter tails. The t-distribution is used instead of the normal distribution when you have small samples.
  • The larger the sample size, the more the t-distribution looks like the normal distribution. In fact, for sample sizes larger than 30 (i.e. more degrees of freedom), the distribution is almost exactly like the normal distribution.

Let’s take an example to demonstrate how a confidence interval is constructed when we do not know the standard deviation of the population. I have already defined the steps; we will use them here.

Example:

Construct the 95% C.I for the average age of the people denied a promotion, given a sample of 23 people. The average age was 47 with a standard deviation of 7.2. Assume that the sample comes from a population that is normally distributed.

Given,

  • n = 23
  • x̄ = 47
  • s = 7.2 (Sample Standard Deviation)
  • CI = 95%

Step 1:- Calculate the Critical Value

Here we are using the t-score instead of the z-score, as we do not have information about the population standard deviation.

  • Using the sample size, the DF is calculated as 22, since DF = n-1
  • α = 1-CI = 1-0.95 = 0.05
  • Using DF and α, we look into the t-table and get the value T = 2.074 (open the t-table, select DF = 22 and two-tailed α = 0.05, and take the value where they cross)
  • Refer to the t-table here: https://www.statisticshowto.com/tables/t-distribution-table/

Step 2:- Calculate Margin of Error (MOE)

Now we will calculate the MOE, which tells us how far the true population mean may be from the estimated mean.

E = T * s/SQRT(n) = 2.074 * (7.2/SQRT(23))

E ≈ 3.1

Step 3:- Construct the Confidence Interval (CI)

(x̄-E) < μ < (x̄+E) = (47-3.1) < μ < (47+3.1)

43.9 < μ < 50.1

μ ~ [43.9, 50.1] with 95% Confidence

Thus we can say that we are 95% confident that the population mean lies between 43.9 and 50.1.

Until now we have seen 3 techniques to find a CI, but they apply only to the population mean (μ). What if the population parameter of interest is the median, or you want to compute a CI for the 90th percentile or for the standard deviation? In such cases, bootstrapping is used.
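The worked example can be reproduced in a few lines. This sketch takes the critical value 2.074 from the t-table as an input instead of computing it, since the standard library has no t-distribution; if SciPy is available, `scipy.stats.t.ppf(0.975, df=22)` should return the same value. The function name `t_interval` is an illustrative choice.

```python
import math

def t_interval(sample_mean, sample_sd, n, t_critical):
    """CI for the mean when the population standard deviation is unknown."""
    # Margin of error uses the sample standard deviation s
    e = t_critical * sample_sd / math.sqrt(n)
    return sample_mean - e, sample_mean + e

# n = 23, x_bar = 47, s = 7.2, T = 2.074 (t-table, DF = 22, alpha = 0.05)
lower, upper = t_interval(sample_mean=47, sample_sd=7.2, n=23, t_critical=2.074)
print(round(lower, 1), round(upper, 1))
```

This reproduces the interval of roughly 43.9 to 50.1 obtained by hand above.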

Case 4

Confidence Interval (CI) Using Bootstrapping

In the previous sections we have seen the confidence interval for the population mean, but what if we want to compute a CI for the median, variance or 90th percentile? This becomes possible thanks to modern computing techniques. In this part, we will compute a CI for the median using the empirical bootstrap.

Bootstrapping means creating artificial random samples from the sample itself.

Let’s define our task here-

Task: Estimate a 95% CI for the median of X

Let’s assume the random variable X follows some distribution F with mean μ and standard deviation σ, i.e. X ~ F(μ, σ), and we have collected ‘n’ observations from it.

S = { X1, X2, X3,…, Xn}

There are certain steps to follow for constructing the CI for the median. These steps are given below:

Step 1- Generate Data by Resampling the Sample Data

In this first step, we generate ‘k’ samples of size ‘m’ by sampling with replacement from S, such that m ≤ n (the standard bootstrap uses m = n), as shown below.

Let’s call these regenerated samples s1, s2, s3, … , sk:

  • s1 = { X(11), X(12), X(13), …, X(1m)}
  • s2 = { X(21), X(22), X(23), …, X(2m)}
  • s3 = { X(31), X(32), X(33), …, X(3m)}
  • .
  • .
  • sk = { X(k1), X(k2), X(k3), …, X(km)}

This is how we have created k artificial samples of size m from a sample of size n. These resampled data sets are called bootstrap samples.

Step 2:- Computation of the Median of the Bootstrap Samples

Now, in the second step, we need to compute the median of each of the k samples. If we want to construct the confidence interval for the variance or standard deviation instead, we calculate the variance or standard deviation of each sample instead of the median.

  • s1 --> m1
  • s2 --> m2
  • s3 --> m3
  • .
  • .
  • .
  • sk --> mk

Step 3: Sort the Medians

In this step, we need to sort the medians in ascending order. Let's say we have k = 1000; then we have calculated 1000 medians, one for each bootstrap sample.

Medians = m1, m2, m3, …, m1000

Now sort the above medians. We will get m’

m’1 <= m’2 <= m’3 <= … <= m’1000 (increasing order)

Step 4: Compute CI from m’

Our task is to construct a 95% confidence interval for the median. Thus the interval runs from the 25th to the 975th sorted median, as shown below.

[m’25 < Median < m’975]

So we can say that about 950 of the 1000 values lie between m’25 and m’975, because we are dealing with a 95% CI and k = 1000.

Therefore 950/1000 = 0.95 = 95%

This whole process of calculating a CI is a non-parametric technique, as it does not depend on the underlying distribution. This is how we use bootstrapping to compute the confidence interval for the median, variance or standard deviation.
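The four steps can be put together in a short sketch. The height values below are hypothetical data made up for illustration, the function and parameter names are my own, and the resample size defaults to m = n; treat it as one possible implementation of the percentile approach described above, not the only one.

```python
import random
import statistics

def bootstrap_median_ci(sample, num_resamples=1000, confidence_level=0.95,
                        m=None, seed=42):
    """Percentile bootstrap CI for the median."""
    rng = random.Random(seed)
    m = len(sample) if m is None else m
    # Steps 1-3: resample with replacement, take each median, sort them
    medians = sorted(
        statistics.median(rng.choices(sample, k=m))
        for _ in range(num_resamples)
    )
    # Step 4: pick the sorted medians at positions 25 and 975 (for k = 1000)
    alpha = 1 - confidence_level
    lower_pos = max(round(num_resamples * alpha / 2), 1)
    upper_pos = round(num_resamples * (1 - alpha / 2))
    return medians[lower_pos - 1], medians[upper_pos - 1]

# Hypothetical sample of 10 heights in cm
heights = [158, 162, 165, 166, 167, 168, 170, 171, 174, 180]
lower, upper = bootstrap_median_ci(heights)
print(lower, upper)
```

To get a CI for another statistic, such as the variance or the 90th percentile, only the `statistics.median` call needs to be swapped for the corresponding function.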

Conclusion!

When calculating a confidence interval, we must identify the data distribution and its known parameters, as the computation of the CI changes accordingly. People also tend to get confused about whether to use the z-score or the t-score in a specific situation, so I have added guidance wherever necessary. Finally, whenever we want to compute a CI for parameters other than the mean, we can use bootstrapping.

Hope this post will help you.
