Probability Distributions 101
When I was learning about distributions, trust me, I used to mix different concepts with each other. So I thought I would simplify these confusing topics for learners, and this is what we are covering -
- Revisit to Probability
- Types of Data in Distribution
- Probability Distributions
In the Probability Distributions section, we are going to understand various distributions: Bernoulli, Binomial, Geometric, Pascal, Poisson, Uniform, Normal, etc.
Let's kickstart this series by revisiting probability.
Revisit to Probability
In many data science problems, we are asked to predict the values of some variable. But many events can't be predicted with total assurance or certainty. In such cases, we find their probability to understand how likely our predictions are to hold true.
We won't do justice to probability until we understand it with the help of the universal probability examples: tossing a coin and throwing a die.
When a coin is tossed, there are two possible outcomes: Heads (H) and Tails (T). We can say the probability of the coin landing on H is 1/2 (0.50) and the probability of it landing on T is 1/2 (0.50). When a single die is thrown, there are 6 possible outcomes: 1, 2, 3, 4, 5, 6. The probability of getting any one number on the die is 1/6 (≈ 0.167). You may wonder how we get these 0.50 or 0.167 values as probabilities.
The image above illustrates the formula for probability: the number of favourable outcomes divided by the total number of possible outcomes. In the 'throwing a die' example, if I ask, 'When a die is rolled, what is the probability of getting a 3?', note that a die carries the numbers 1 to 6, and 3 appears on exactly one face. Thus, as per the formula defined above, only 1 favourable outcome is possible out of 6 outcomes.
P(Getting 3 on a die) = 1/6 ≈ 0.167, or about 16.7%
Thus we can say —
When someone throws a die, there is about a 16.7% chance of getting a 3, and the same is true for any other number, as all the outcomes are equally likely.
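If you want to verify these numbers empirically, here is a minimal Python sketch (assuming numpy is available; the 100,000-roll count is an arbitrary choice) that estimates both probabilities by simulation:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded only for reproducibility

# Simulate 100,000 fair die rolls; the upper bound in integers() is exclusive
rolls = rng.integers(low=1, high=7, size=100_000)
print("Estimated P(3):", np.mean(rolls == 3))       # ~0.167

# Simulate 100,000 fair coin tosses (0 = Tails, 1 = Heads)
tosses = rng.integers(low=0, high=2, size=100_000)
print("Estimated P(Heads):", np.mean(tosses == 1))  # ~0.50
```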
Enough about probability for now. Let's understand what a statistical distribution is.
Did you know?
The average height of a male in India is 5 ft 5 inch.
How can someone come up with this number? I bet it is not possible without a survey. Whoever did this took a sufficiently large sample, measured the height of each individual, and arranged the results in buckets or a tabular format, something like below -
From the above we can conclude that -
- The average height is between 5 and 6 ft
- People shorter than 5 ft or taller than 6 ft are relatively rare.
Here we have divided the data into a total of 4 bins. What if we use a smaller bin size for the measurements? If we do, we will get more precise results, as follows -
If you observe the above figure,
- Most of the data lie between 5 and 6 ft
- But we can be more precise and say most of the data lie between 5.25 and 5.75 ft
By measuring more people using smaller bins, we get more accurate and precise estimates of how heights are distributed.
We can also draw a curve to approximate the histogram. The curve tells the same story that the histogram tells us.
- There is a low probability of a height less than 5 ft or greater than 6 ft, as most of the data lie between 5 ft and 6 ft.
Thus both the histogram and the curve are distributions, since they tell us how the probabilities of the measurements are distributed.
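To make the binning idea concrete, here is a small sketch (the height data is simulated, assuming a hypothetical mean of 5.5 ft and standard deviation of 0.25 ft, and numpy/matplotlib are available) that draws the same histogram with coarse and fine bins:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical heights: mean 5.5 ft, standard deviation 0.25 ft
heights = rng.normal(loc=5.5, scale=0.25, size=10_000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(heights, bins=4, edgecolor="black")   # coarse: 4 bins
axes[0].set_title("4 bins")
axes[1].hist(heights, bins=40, edgecolor="black")  # fine: 40 bins
axes[1].set_title("40 bins")
plt.show()
```

With more bins, the histogram's shape approaches the smooth curve described above.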
Types of Data in Distributions
There are mainly two types of data we generally encounter -
- Discrete: Discrete data can take only distinct, countable values (usually integers). Ex: 0, 5, 100
- Continuous: Continuous data can take any value from a range of values. Ex: 2.34, 56.6
Types of Probability Distributions
Up until now, we have covered the basic prerequisites for understanding the various types of probability distributions.
A probability distribution is a function (often presented as a table) that gives the probability of each value of a random variable.
There are mainly two types of probability distributions -
- Discrete Probability Distribution
- Continuous Probability Distribution
In a discrete distribution, the random variable X can take a discrete (countable), usually finite, number of values. In simple words, if a random variable is discrete (meaning it takes countably many values), then its probability distribution is called a discrete probability distribution.
Here is the list of frequently used discrete distributions in Data Science -
- Bernoulli
- Binomial
- Negative Binomial
- Geometric
- Poisson
The discrete probability function is also known as a probability mass function (PMF).
In a continuous distribution, the random variable X can take an infinite (usually uncountable) number of different values. In simple words, if a random variable is continuous, its probability distribution is called a continuous probability distribution.
A continuous probability distribution differs from a discrete probability distribution in the following ways -
- A discrete probability distribution's sample space is finite, so we can record the frequency of each distinct value, whereas a continuous probability distribution's sample space is infinite and we cannot record the frequency of each distinct value. Thus a continuous distribution cannot be expressed in tabular form; instead, it is expressed as a graph or an equation.
- The equation of a continuous probability distribution is known as a Probability Density Function (PDF).
Here is the list of frequently used Continuous distributions in Data Science -
- Uniform Continuous Probability Distribution
- Normal Distribution
- Standard Normal Distribution
- Student’s T Distribution
- Chi-Squared Distribution
- Exponential Distribution
- Logistic Distribution
Let's first understand the Discrete Probability Distribution, and afterwards we will have a look at the Continuous Probability Distribution.
Discrete Probability Distribution
The Uniform Discrete Probability Distribution
In this type of discrete distribution, the probability of each outcome is the same. That's why we call it 'Uniform', and since it is discrete, we have a finite (countable) set of values.
Thus we can define the Uniform Discrete Probability Distribution as -
A discrete random variable X is said to have a uniform distribution if the probability of each value is the same. Thus its PMF is 1/N.
Generally, we denote the Uniform Discrete Probability Distribution as U(a, b), where a and b are the bounds of the range of values (a < b).
Values vary incrementally from the lowest value, a to the highest value, b.
a, a+1, a+2, a+3, . . . . b-3, b-2, b-1, b
Thus we can say N (Number of Values) = b-a+1
The universal probability example of throwing a die fits here, as the probability of each face in one throw is the same.
We can denote this as X ~ U(1, 6), which means the variable X follows a Uniform Discrete Probability Distribution with values ranging from 1 to 6.
In the above diagram, ‘Height of Rectangle’ is given by PMF —
Height of Rectangle = PMF = 1/N = 1/(b − a + 1)
If you observe, as the number of outcomes increases the height of the rectangle reduces.
- General PMF of the Discrete Uniform Distribution = 1/(b − a + 1)
- Expected value E(X) of the Discrete Uniform Distribution = (a + b)/2
- Variance of the Discrete Uniform Distribution = ((b − a + 1)² − 1)/12
Example -
A telephone number is selected at random from a directory. Let X denote the last digit of the selected telephone number. Find the probability that the last digit of the selected number is
- 6
- Less than 3
- Greater than or equal to 8
The last digit could be anything from 0 to 9. Thus here a = 0 and b = 9.
- Last digit is 6: P(X = 6) = 1/(b − a + 1) = 1/10 = 0.1
- Less than 3 means P(X = 0, 1, or 2) = 1/10 + 1/10 + 1/10 = 0.3
- Greater than or equal to 8 means P(X = 8 or 9) = 1/10 + 1/10 = 0.2
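The same answers can be computed with scipy's discrete uniform distribution (a sketch, assuming scipy is installed; note that scipy's randint treats the upper bound as exclusive):

```python
from scipy.stats import randint

# Discrete uniform over the digits 0..9 (upper bound 10 is exclusive)
X = randint(0, 10)

print("P(X = 6)  =", X.pmf(6))      # 0.1
print("P(X < 3)  =", X.cdf(2))      # P(X <= 2) = 0.3
print("P(X >= 8) =", 1 - X.cdf(7))  # 0.2
print("E(X)      =", X.mean())      # (0 + 9) / 2 = 4.5
print("Var(X)    =", X.var())       # ((9 - 0 + 1)^2 - 1) / 12 = 8.25
```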
The Bernoulli Distribution
The Bernoulli Distribution is a special case of the Binomial Distribution, but there is a subtle difference between them. Don't worry: once we understand the Binomial Distribution, the difference between the two will be clear.
The Bernoulli Distribution is the discrete probability distribution of a single Bernoulli trial.
I am sure the above definition isn't fully clear yet, as it uses a technical term: Bernoulli trial. Let's first understand that.
A Bernoulli trial is a statistical experiment that has only two outcomes. For example -
- Tossing a coin, there are two outcomes: Head or Tail
- Exam Result, there are two outcomes: Pass or Fail
- A cricket or football match: either you win or you lose.
Remember that these two outcomes are mutually exclusive, meaning they can't happen together. You either win the match or lose it, but you can't do both.
Now let's revisit the definition and simplify it.
The Bernoulli Distribution is a discrete probability distribution that has only two binary outcomes, 0 and 1, in a single trial. In terms of probability, 0 is associated with 1 − p (failure) and 1 is associated with p (success).
- p = Probability of Success
- 1 − p (also written q) = Probability of failure
In the Bernoulli Distribution we often need to assign 0 and 1 to the outcomes. In general, we assign 1 to success (the outcome we are interested in) and 0 to failure (the outcome we are not interested in).
The expected value E(X) of the Bernoulli Distribution depends on how you label your outcomes:
- E(X) = p … when 1 = Success
- E(X) = 1-p … when 1 = Failure
We can also write E(X) as:
E(X) = 1 * p + 0 * (1 − p), which is simply p
Thus, by plugging p and 1 − p into the variance formula -
σ² = (X₀-μ)² * p(X₀) + (X₁-μ)² * p(X₁)
σ² = (0-p)² * (1-p) + (1-p)² * p
σ² = p * (1-p)
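You can check these Bernoulli formulas numerically with scipy (a sketch, assuming scipy is installed and taking p = 0.5 as an example value):

```python
from scipy.stats import bernoulli

p = 0.5  # assumed probability of success, e.g. a fair coin
X = bernoulli(p)

print("P(X = 1) =", X.pmf(1))  # p
print("P(X = 0) =", X.pmf(0))  # 1 - p
print("E(X)     =", X.mean())  # p
print("Var(X)   =", X.var())   # p * (1 - p)
```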
The Binomial Distribution
I hope you have understood the Bernoulli Distribution without any doubt. If yes, it will be a cakewalk for you to understand the Binomial Distribution.
In the Bernoulli Distribution, the experiment or trial is performed only once. Flipping a coin once and getting one of the two results is a case of the Bernoulli Distribution, whereas flipping a coin multiple times and counting the results is a case of the Binomial Distribution. That's it. That's the difference.
Thus we can define Binomial Distribution as:
The Binomial Distribution is a discrete probability distribution that describes a sequence of identical, independent Bernoulli trials.
For a distribution to be Binomial, certain rules must be followed. These rules are:
- You should have a fixed number of trials
- Trials must be independent (the outcome of one trial does not affect another). Take the example of flipping a coin twice: the result of the first flip does not influence the second flip.
- Each trial should have only two outcomes: Success and Failure.
- The probability of success remains the same in all trials.
The Binomial Distribution is used to answer questions like the one below:
“ Toss a fair coin 3 times what is the probability to get 2 heads?”
We will answer the above question, but let's first check whether it follows the Binomial rules or not.
- Fixed number of trials: Yes, 3 trials are mentioned in the example.
- Trials must be independent: Yes, the result of one trial does not affect the other trials.
- Each trial should have two outcomes: Here H and T are the two outcomes.
- The probability of success remains the same in all trials: Our objective is getting heads 2 times, and the probability of getting heads in each trial stays the same, 1/2.
Let's try to solve the above example from a probability perspective. First, we write out the sample space of all outcomes; call it S.
S = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
Now, if you look at this closely, there are only 3 outcomes that give us 2 heads (H): HHT, HTH, THH.
Thus the probability of getting 2 heads is 3/8. This is how you calculate it using simple probability concepts. But what if the question comes with bigger numbers, like 'Toss a fair coin 1000 times; what is the probability of getting 501 heads?' In such a scenario it is impractical to calculate the probability by enumerating the sample space.
Now you may ask: if not the sample space, how do we solve the problem? We can take the help of combinatorics. For instance, getting 501 heads out of 1000 trials amounts to choosing which 501 of the 1000 trials are heads.
Here n is the number of trials and x is the number of successes we are looking for. Basically, we are running n Bernoulli trials with parameter p and looking at x successes among those n trials. The number of ways to choose the x successes is the binomial coefficient nCx = n! / (x! * (n − x)!).
Thus the probability function of a Binomial distribution is given by
P(X = x) = [n! / (x! * (n − x)!)] * [p^x * (1 − p)^(n−x)]
The Binomial Distribution is expressed as X ~ Binomial(n, p), where n is the total number of fixed trials and p is the probability of success.
Now let's find the expected value E(X) and variance of the Binomial Distribution. The expected value is similar to what we saw for the Bernoulli Distribution, since Bernoulli is associated with a single trial and Binomial with n trials.
Therefore,
- E(x) = p * n
- σ² = n * p * (1-p)
An important note about the Binomial Distribution: as n (the fixed number of trials) increases, provided neither p nor (1 − p) is vanishingly small, it is well approximated by a Gaussian distribution. Take a look below.
Let’s look at some examples to get a deeper understanding of the binomial distribution -
- A fair coin is tossed 100 times. What is the probability that heads appears exactly 52 times?
Here success is getting heads and failure is getting tails. We are given n = 100, x = 52, p = 0.5, q = 0.5.
Let's plug this info into the Binomial function,
P(X=52) = [100! / ((100−52)! * 52!)] * [0.5⁵² * 0.5⁴⁸]
P(X=52) = 0.07352
- The probability of rolling a 4 on a die is 30%, and the die is rolled 10 times. Find the probability of rolling a 4 exactly 8 times, and at most 8 times.
Here success is rolling a 4 and failure is rolling a 1, 2, 3, 5 or 6. For the first case, we are given n = 10, x = 8, p = 0.30, q = 0.70.
P(X=8) = [10! / ((10−8)! * 8!)] * [0.30⁸ * 0.70²]
P(X=8) = 0.0014467
In the second case, we are asked to calculate the Binomial probability for 'at most' 8 successes. Thus x = 0, 1, 2, ..., 8. In this scenario we need to calculate the probability for each x and add them together using the addition rule of probability. Since this calculation is a bit long to do manually, I suggest an online calculator. Here is the one I am using:
https://stattrek.com/online-calculator/binomial.aspx
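If you prefer code to an online calculator, both examples can be computed with scipy (a sketch, assuming scipy is installed):

```python
from scipy.stats import binom

# Example 1: P(exactly 52 heads in 100 fair-coin tosses)
print(binom.pmf(k=52, n=100, p=0.5))  # ~0.0735

# Example 2: P(exactly 8 fours in 10 rolls with p = 0.30)
print(binom.pmf(k=8, n=10, p=0.30))   # ~0.0014

# "At most 8 times" is the cumulative probability P(X <= 8)
print(binom.cdf(k=8, n=10, p=0.30))   # ~0.9999
```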
The Geometric Distribution
Now we will learn about the Geometric Distribution, and immediately after that, the Negative Binomial Distribution. These two distributions are very close to the Binomial and Bernoulli Distributions; if you have understood those two, these will be very easy.
The Geometric Distribution is the discrete probability distribution of the number of trials required to get the first success.
It is useful for modelling situations in which we need to know how many attempts are likely necessary for success, and thus has applications in population modelling, econometrics, return on investment (ROI) of research, and so on.
The rules we mentioned for the Binomial Distribution apply here as well (except for the fixed number of trials), so I won't repeat them.
Now let's form the PMF (Probability Mass Function) for the Geometric Distribution.
In order for the first success to occur on the xᵗʰ trial -
- The first x − 1 trials must be failures. We represent a failure by 1 − p = q, and there are x − 1 such trials. Therefore this contributes (1 − p)^(x−1).
- The xᵗʰ trial must be a success. That contributes p.
Using the above information, the PMF of the Geometric Distribution is
P(X = x) = (1 − p)^(x−1) * p
The expected value E(X) (mean) and variance of the Geometric Distribution are given by,
- μ = 1/p
- σ² = (1 − p)/p²
Example:-
A die is rolled until a 6 occurs. What is the resulting geometric distribution?
Above is a tree diagram for our event. If we get a 6 on the die, we stop, as that fulfils our objective. But if we don't get it on the first attempt, we try a second time, a third time, and so on.
On the first trial, we may get a 6 or we may not. If we get a 6, we stop; if we don't, we move to the next trial. If we want to find the probability of success on a particular try, then -
P( Getting 6 on I trial) = 1/6
The above probability is that of getting a 6 on the first try. If we want the probability of getting a 6 on the second trial, then -
P( Getting 6 on II trial) = (5/6) * (1/6)
Here (5/6) denotes the failure to get a 6 on the first trial; the probability of not getting a 6 on a given trial is 5/6.
P( Getting 6 on III trial) = (5/6) * (5/6) * (1/6) = (5/6)² * (1/6)
P( Getting 6 on IV trial) = (5/6) * (5/6) * (5/6) * (1/6) = (5/6)³ * (1/6)
You can see the pattern forming. Therefore we can say,
P( Getting 6 on nth trial) = (5/6)^(n−1) * (1/6)
This can also be represented pictorially, as in the following picture
In mathematical notation, we can define the Geometric Distribution as
X ~ Geo(p)
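Here is a small scipy sketch for the die example (assuming scipy is installed; scipy's geom counts the trial on which the first success occurs, matching the definition above):

```python
from scipy.stats import geom

p = 1 / 6  # probability of rolling a 6

for k in [1, 2, 3]:
    print(f"P(first 6 on trial {k}) =", geom.pmf(k, p))  # (5/6)^(k-1) * (1/6)

print("E(X)   =", geom.mean(p))  # 1/p = 6
print("Var(X) =", geom.var(p))   # (1 - p)/p^2 = 30
```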
The Negative Binomial Distribution
The Negative Binomial Distribution is much like the Binomial Distribution, with just one difference -
- In the Binomial Distribution, the number of trials (the value of n) is fixed, whereas in the Negative Binomial Distribution we keep the experiment running until 'r' successes are observed.
- In the Binomial Distribution, we typically look for the probability of getting 8 heads when a coin is tossed 10 times, whereas in the Negative Binomial Distribution you flip a coin repeatedly and count the number of tosses; the experiment continues until you get 8 heads.
Let's use the above information and define the Negative Binomial Distribution -
A negative binomial random variable (X) is the number of repeated trials needed to produce 'r' successes in a negative binomial experiment. The probability distribution of this random variable is called the Negative Binomial Distribution.
Note:- The Negative Binomial Distribution is also known as the 'Pascal Distribution' in the world of statistics.
To calculate a Negative Binomial probability, we need the values of x, r and p. Once we have those values, we can use the following PMF -
P(X = x) = (x−1)C(r−1) * p^r * (1 − p)^(x−r)
If we define the mean of the Negative Binomial Distribution as the average number of trials required to produce 'r' successes, then the mean is μ = r/p.
The variance of the Negative Binomial Distribution is σ² = r * (1 − p)/p².
Example:-
A person conducting telephone surveys must get 3 more completed surveys before their job is finished. On each randomly dialed number, there is a 9% chance of reaching an adult who will complete the survey. What is the probability that the 3rd completed survey occurs on the 10th call?
Here we have x = 10, p = 0.09, r = 3. Let's plug these values into the PMF and we get -
P(x=10, r=3, p=0.09) = (10−1)C(3−1) * 0.09³ * 0.91⁷
P(x, r, p) ≈ 0.013
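The same answer can be obtained with scipy (a sketch, assuming scipy is installed; note that scipy's nbinom counts the number of failures before the r-th success, not the total number of trials):

```python
from scipy.stats import nbinom

r, p = 3, 0.09  # required successes and success probability
x = 10          # total calls
k = x - r       # scipy wants the number of failures: 10 - 3 = 7

print("P(3rd completed survey on 10th call) =", nbinom.pmf(k, r, p))  # ~0.013
```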
The Poisson Distribution
Let's say our task is to count the number of occurrences of an event in a given unit of time, distance, area or volume. For example -
- The number of car accidents in a day
- The number of telephone calls received by a call centre in an hour
- The number of faulty products in a manufacturing process
Taking the above into account, this is how we define the Poisson Distribution:
The Poisson Distribution is a discrete probability distribution that shows how many times an event is likely to occur within a specified interval of time, distance, area or volume.
Now let's have a look at the important characteristics of the Poisson Distribution -
- The number of occurrences on each interval or specified continuum can range from ZERO to INFINITY
- It describes the distribution of infrequent or rare events
- Each event is independent of other events
- It describes discrete events over an interval or specified continuum (time, distance, area or volume)
- The expected number of occurrences E(x) is assumed to be constant throughout the experiment.
Don't bang your head if you haven't understood the above; we are going to simplify things with the help of an application. But before that, let's have a look at the PMF of the Poisson Distribution -
P(X = x) = (λ^x * e^(−λ)) / x!
Here,
- λ = the number of occurrences expected in the specified interval or continuum
- x = the number of occurrences we are interested in
- e = the base of the natural log, valued 2.718282
The other distributions we have seen so far are easy to understand theoretically, but I believe that is not the case for the Poisson Distribution. It is better understood with the help of an application, so let's jump straight to it.
Application:-
Let's say you run a restaurant and you want to check how many customers visit around closing time, 10:30 pm to 11 pm. (Maybe you want to set the closing time of your restaurant with the help of data.) Your observations conclude that an average of 8 customers dine out during that specific period. Calculate the probability that exactly 10 customers dine out in that period.
Let's jot down what data we have from the above example -
- The observed average is 8 customers; this is the number of occurrences in the specified interval, i.e. λ = 8
- We are calculating the probability that exactly 10 customers dine out, so x = 10
- e = 2.718282
Plugging these values into the formula above gives P(X = 10) = (8¹⁰ * e⁻⁸) / 10! ≈ 0.099.
Now I will take the help of Excel to calculate the cumulative probabilities, to help you understand further.
The above table describes the probability distribution over a range of x values. As per the Poisson Distribution's characteristics, x can range from 0 to ∞ (theoretically).
The above histogram is plotted from the Poisson probability table, and we can conclude the following:
- P(x) for x = 7 and x = 8 have the same value (approx. 0.14, shown as green bars). This means the probability of 7 customers arriving is the same as that of 8; in fact, whenever λ is an integer, P(λ − 1) = P(λ).
- Beyond the average of 8, the probability values shrink.
- We were interested in the probability of exactly 10 customers arriving in the given time slot, approx. 0.1 (shown as an orange bar in the graph).
In our problem, we were specifically asked for the probability of exactly 10 customers arriving in the given time slot. What if we wish to calculate the probability of 10 or more customers arriving in that slot? It's very simple -
1-Green = Orange
We have to calculate the cumulative probability P(x ≤ 9), which is the total probability of all the green bars shown below. In the table above, look at the cumulative probability at x = 9; this gives us P(x ≤ 9).
Thus, P(x ≤ 9) = 0.7166. Now we plug this into our formula to calculate the probability of 10 or more customers arriving in that time slot.
1-Green = Orange,
1–0.7166 = Orange
Therefore Orange = 0.2834
This means that the probability of 10 or more customers arriving in the given time slot is 0.2834.
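Both answers for the restaurant example can be reproduced with scipy (a sketch, assuming scipy is installed):

```python
from scipy.stats import poisson

lam = 8  # average customers in the 10:30-11:00 pm slot

# P(exactly 10 customers)
print("P(X = 10)  =", poisson.pmf(10, lam))     # ~0.099

# P(10 or more customers) = 1 - P(X <= 9)
print("P(X >= 10) =", 1 - poisson.cdf(9, lam))  # ~0.2834
```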
Continuous Probability Distribution
The Uniform Continuous Probability Distribution
Surprised to see this distribution again? Believe me, it's not the same. We have seen uniformity in a discrete distribution, but it is also possible to have uniformity when the data is continuous.
First, let's understand the difference between the discrete and continuous uniform distributions with the help of the following picture:
- Unlike the discrete distribution, the continuous distribution can take any value within an interval.
- Since it takes continuous values, its graph forms a rectangle (as shown above). Thus this distribution is also known as the Rectangular Distribution.
- Both distributions have a constant probability.
- An example of the Continuous Uniform Distribution is pizza delivery: when you order a pizza, it gets delivered anywhere between 20 and 30 minutes from now, with every instant in that interval equally likely.
The continuous uniform distribution is defined by two parameters, a and b. It is written as X ~ U(a, b), where
- a = minimum
- b = maximum
In the case of a continuous random variable, the area under the graph must equal 1:
p * (b − a) = 1
Therefore p = 1/(b − a)
The expected value E(X) and variance of the continuous uniform distribution are given below:
- The expected value E(X) is simply the mid-point between a and b. Therefore E(X) = (a + b)/2
- Var(X) = (b-a)²/12
Applications:-
On average, a 30 min TV slot has 22 min of actual programming. Let's assume the number of minutes of actual programming is uniformly distributed between a minimum of 18 min and a maximum of 26 min.
- What is the probability that the show will have at least 25 min of programming?
- What is the probability the show will complete between 21 min and 25 min of telecast?
- What is the probability that the show will have between 22.32 and 24.77 min of the program?
Before we solve the three problems above, let's write down what information we have: a = 18, b = 26.
Now let's calculate E(X), Var(X) and, most importantly, the height of the rectangle.
- E(X) = (a + b)/2 = (18 + 26)/2 = 22
- Var(X) = (b − a)²/12 = (26 − 18)²/12 ≈ 5.33
- Height of the rectangle = 1/(b − a) = 1/(26 − 18) = 1/8 = 0.125
Once we have calculated the height of the rectangle, we can solve the given problems by drawing the graph. That's why we calculate it beforehand.
- What is the probability that the show will have at least 25 min of programming?
P(x) = (X₂ − X₁)/(b − a) = (26 − 25)/(26 − 18) = 1/8 = 0.125
The probability that the show will run between 25 min and 26 min is 12.5%.
- What is the probability the show will complete between 21 min and 25 min of telecast?
P(x) = (X₂ − X₁)/(b − a) = (25 − 21)/(26 − 18) = 4/8 = 0.5
The probability that the show will run between 21 min and 25 min is 50%.
- What is the probability that the show will have between 22.32 and 24.77 min of the program?
P(x) = (X₂ − X₁)/(b − a) = (24.77 − 22.32)/(26 − 18) = 2.45/8 ≈ 0.306
The probability that the show will run between 22.32 min and 24.77 min is approx. 30%.
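All three answers follow from the uniform CDF, as this scipy sketch shows (assuming scipy is installed; note that scipy parameterizes the uniform distribution with loc = a and scale = b − a):

```python
from scipy.stats import uniform

a, b = 18, 26
X = uniform(loc=a, scale=b - a)

print(X.cdf(26) - X.cdf(25))        # at least 25 min: 0.125
print(X.cdf(25) - X.cdf(21))        # between 21 and 25 min: 0.5
print(X.cdf(24.77) - X.cdf(22.32))  # between 22.32 and 24.77 min: ~0.306
print("E(X) =", X.mean(), "| Var(X) =", X.var())  # 22.0 and ~5.33
```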
The Normal Distribution
Data can be distributed in many ways: it can be left-skewed, right-skewed, or show no distinct pattern. But in real life, there are many scenarios where the data is centred around its mean, forming a 'bell-shaped' curve. This is where you see the Normal Distribution.
As we said, there are many real-life scenarios where the Normal Distribution is predominant. Let me highlight some of them below -
- Heights of people
- IQ Score of people
- Marks on test
- Sums of many dice rolls
But what exactly is the Normal Distribution?
It's a continuous probability distribution that is symmetric about the mean, describing data that occurs more frequently near the mean than far away from it. The Normal Distribution is also known as the Gaussian Distribution and the Bell Curve, since it is shaped like a bell.
There are several characteristics of the Normal Distribution that need to be understood.
- It's symmetric in nature.
- In the Normal Distribution, the mean, median and mode are equal.
- 50% of values are less than the mean and 50% are greater than the mean.
- The Empirical Rule allows you to determine the proportion of values that fall within certain distances of the mean. (We will explain this later in this part.)
Parameters of the Normal Distribution:
Two parameters associated with the Normal Distribution will be discussed here:
- Mean (μ)
- Standard Deviation (σ)
Note that the normal distribution doesn't have a fixed position or shape of its own: the mean of the distribution determines its position, and the standard deviation determines its shape. Let me show you what I mean by that.
Let's discuss these parameters in more detail:
Mean
As we stated earlier, in the normal distribution the mean describes its position. Look at the example below:
- There are 4 different normal distributions in the graph: Blue, Red, Orange and Green.
- The Blue, Red and Orange distributions have mean = 0 and standard deviations 0.2, 1 and 5 respectively. These distributions are placed at the same location (having the same mean) but have different standard deviations, so their shapes are not equal at all. We can conclude that the lower the standard deviation, the higher the peak of the curve. As the standard deviation increases, the shape of the distribution flattens (look at the Orange distribution, with standard deviation = 5).
- The mean of the Green distribution is −2, hence its position is to the left of 0.
- The mean of the distribution can be any numerical value, and depending on whether it is positive or negative, the curve is placed to the right or left of zero. This is very important to understand, as you will sometimes need to determine whether two samples or two populations have statistically different means. For example, you might run an experiment with two groups and want to check whether the means of the two groups are the same or not.
Standard Deviation
The standard deviation of a normal distribution represents its shape (more precisely, it defines its width). The standard deviation determines how far the values in the distribution spread from the mean. Let's understand this better with the help of the distributions above.
- On carefully analysing the Blue and Orange distributions, we can conclude that,
"The smaller the standard deviation (σ), the narrower and taller the distribution. As the standard deviation gets larger, the width of the distribution increases and its height decreases."
- In the above example, Blue and Green have the narrower distributions, as their σ values are 0.2 and 0.5 respectively.
- When the distribution has a taller peak (and hence a narrower width), values are unlikely to fall far from the mean. As you increase the spread of the distribution, the likelihood that observations fall further from the mean also increases.
We cannot conclude the discussion of the normal distribution without describing the Empirical Rule for the Normal Distribution.
The Empirical Rule for the Normal Distribution
In statistics, the Empirical Rule is also known as the 3-Sigma Rule, where each sigma represents one standard deviation away from the mean.
The following observations can be made:
- 68% of data fall within 1 standard deviation of the mean, i.e. μ ± σ
- 95% of data fall within 2 standard deviations of the mean, i.e. μ ± 2σ
- 99.7% of data fall within 3 standard deviations of the mean, i.e. μ ± 3σ
- Nearly all values lie within 3 standard deviations of the mean.
- Data that falls beyond 3 standard deviations is often considered an outlier.
Because of the percentage of data associated with each standard deviation, this rule is also known as the '68–95–99.7 Rule'.
Example: 95% of students at school are between 1.1m and 1.7m tall.
In this case, we first need to find the mean and standard deviation, assuming the data is normally distributed.
- Mean = (1.1m + 1.7m) / 2 = 1.4m
- 95% is 2 standard deviations either side of the mean (a total of 4 standard deviations) so:
1 standard deviation = (1.7m-1.1m) / 4 = 0.6m / 4 = 0.15m
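You can confirm the 68–95–99.7 proportions for this example with scipy (a sketch, assuming scipy is installed, using the mean and standard deviation just computed):

```python
from scipy.stats import norm

mu, sigma = 1.4, 0.15  # from the school-height example above
X = norm(loc=mu, scale=sigma)

# Proportion of heights within 1, 2 and 3 standard deviations of the mean
for k in [1, 2, 3]:
    prob = X.cdf(mu + k * sigma) - X.cdf(mu - k * sigma)
    print(f"Within {k} sigma: {prob:.4f}")
# -> 0.6827, 0.9545, 0.9973 (the 68-95-99.7 rule)
```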
A Special Case of Normal Distribution: Standard Normal Distribution
As we have seen, the position and shape of the normal distribution change with its parameter values, so the normal distribution can take many different positions and shapes. However, in one special case of the normal distribution, the parameter values are kept constant: mean = 0 and standard deviation = 1. This distribution is also known as the Z-distribution.
The random variable of the standard normal distribution is called the Standard Score, or Z-Score. A z-score represents the number of standard deviations an observation lies above or below the mean.
- A standard score of +1.5 means the observation is 1.5 standard deviations above the mean
- A standard score of −1.5 means the observation is 1.5 standard deviations below the mean
- Remember, the mean itself has z-score = 0
That is what the standard normal distribution is. But we haven't yet discussed how to convert a Normal Distribution into the Standard Normal Distribution. This process is known as standardization. How do we perform standardization? By calculating z-scores: to standardize your data, you convert the raw measurements into z-scores.
Therefore, in standardization, we subtract the mean from each value in the normal distribution and divide by the standard deviation. Writing the raw value as X, the mean as μ and the standard deviation as σ, the formula is
Z = (X − μ) / σ
where X = the raw value to be converted into a z-score, μ = the mean of the population, and σ = the standard deviation of the population.
Note:- The standardization process allows you to compare observations and calculate probabilities across different populations. In simple words, it allows us to compare apples with oranges.
Since we have been talking so much about comparing apples with oranges, let's do it practically. Let's compare their weights. Imagine that we have an apple that weighs 110 grams and an orange that weighs 100 grams.
If we compare the raw weights, it is easy to conclude that the apple weighs more than the orange. However, this is not the right way to compare, as the weight distributions of the two populations are different. Thus we need to standardize, by converting each raw value to a z-score. The following table gives the assumed population parameters.
Let's calculate the z-scores:
- Apple:- (110–100) / 15 = 0.66
- Orange:- (100–140) / 25 = -1.6
By observing the numbers, we can conclude the following:
- The z-score for the apple is positive (+0.66), which means our apple weighs more than the average apple.
- The z-score for the orange is negative (−1.6), which means the orange weighs well below the average orange.
- Using z-scores, we've learned how each fruit fits within its own distribution and how they compare to each other.
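The apple-versus-orange comparison is a one-line calculation; here is a minimal sketch of it (plain Python, using the assumed population parameters from the table above):

```python
def z_score(x, mu, sigma):
    # Number of standard deviations x lies above (+) or below (-) the mean
    return (x - mu) / sigma

print("Apple :", z_score(110, mu=100, sigma=15))  # ~+0.66
print("Orange:", z_score(100, mu=140, sigma=25))  # -1.6
```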
Standardizing is useful when we have a normal distribution. However, we cannot always assume the data is spread out that way. A crucial point is that establishing a normal distribution requires a lot of data, and that is not always available. When the sample size is less than about 30, we do not treat the data as normally distributed unless the population is explicitly stated to be normal; with a limited sample size there is also a chance that outliers will distort the analysis. So when our sample size is less than 30, we don't assume a normal distribution. In that scenario we can use Student's t-Distribution (a small-sample approximation of the normal distribution).
We will learn about Student's t-Distribution in a while, but before that, it is important to understand the 'Sampling Distribution' and the 'Central Limit Theorem' in statistics.
Sampling Distribution
The general idea of a sampling distribution is to use a sample to estimate the population. But what exactly is a sampling distribution?
A sampling distribution is the probability distribution of a sample statistic computed from samples drawn from the population.
Let me explain what I mean by the above definition. Say we draw m samples from the population. For each sample we calculate a statistic: the mean, standard deviation, range, proportion, etc. It could be anything, not just the mean, but for the sake of understanding let's calculate the mean of each of the m samples, denoted x̄ (x-bar).
Once we calculate the x̄ of each of the m samples, the distribution of these x̄ values is known as the sampling distribution of the sample mean. As I said initially, the aim of the sampling distribution is to use the sample to estimate the population; by that I mean using sample statistics to estimate population statistics (use x̄ to estimate μ).
Central Limit Theorem (CLT)
The Central Limit Theorem is a fundamental theorem in the field of statistics. The sampling distribution we studied earlier serves as the basis for this theorem.
This theorem tells us that no matter what shape the population distribution has, the distribution of sample means is approximately normal as long as the sample size (n) is reasonably large.
Let's assume that we are sampling from a population that has mean μ and variance σ².
Case 1:-
When the sample size (n) > 30
If n > 30, no matter what the population looks like (skewed right or left, peaked in the middle, U-shaped, etc.), if we take samples of size greater than 30, calculate the sample statistic (mean, variance, proportion) for each, and plot those statistics, the resulting distribution will be approximately Normal.
- As we saw for the sampling distribution, the mean of the sample means approximates the population mean.
- The standard deviation of the sample mean is not equal to the standard deviation of the population; it is σ/√n (the standard error), which shrinks as the sample size grows. Keep this in mind: the larger the sample, the smaller the standard deviation of the sample mean.
Case 2:-
When the sample size (n) ≤ 30 AND the population is normally distributed
If n ≤ 30 and we are told 'the population is normally distributed', then the sampling distribution of the sample mean is also normally distributed.
Case 3:-
When n ≤ 30 and we have no information about the population distribution
If the original population is not normally distributed, the distribution of sample means will not be normally distributed when the sample size is small.
All the above cases are summarized in the table below.
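The first case is easy to see in simulation. Below is a minimal sketch (assuming numpy is available) that draws samples of size n = 50 from a clearly skewed population and shows that the sample means are centred on the population mean with spread σ/√n:

```python
import numpy as np

rng = np.random.default_rng(1)

# A clearly non-normal, right-skewed population
population = rng.exponential(scale=2.0, size=100_000)

n, m = 50, 5_000  # sample size (> 30) and number of samples
sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(m)])

print("Population mean      :", population.mean())
print("Mean of sample means :", sample_means.mean())  # ~ population mean
print("SD of sample means   :", sample_means.std())   # ~ sigma / sqrt(n)
print("sigma / sqrt(n)      :", population.std() / np.sqrt(n))
```

A histogram of sample_means will look bell-shaped even though the population itself is skewed.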
Student’s t-Distribution
At the end of the normal distribution section, we discussed how we cannot assume a normal distribution when the sample size is limited (generally, sample size < 30). In the limited-sample scenario, if you remember, we use something called Student's t-distribution. Now let's understand Student's t-Distribution in more detail.
Student's t-distribution is a continuous probability distribution that looks almost identical to a normal distribution (only a bit shorter and fatter) and is used to estimate population parameters when the sample size is small or the population standard deviation is unknown.
When we have no information about the population standard deviation, that value needs to be estimated. This increases the uncertainty, which is why the tails are thicker in the t-distribution: they accommodate that extra uncertainty, i.e. occurrences of values far from the mean.
Why Use the t Distribution?
- According to the CLT, the sampling distribution of a statistic follows the normal distribution as long as the sample drawn from the population is large enough. When we also know the standard deviation of the population, we can compute the z-score and then compute probabilities with the help of the normal distribution.
- In real-life scenarios, most of the time we do not have information about the population standard deviation, or the sample size is small. In such a scenario we cannot use the z-score; we have to use the t-score (t-statistic), computed using the formula below:
t = (x̄ − μ) / (S / √n)
Here, x̄ = sample mean | μ = population mean | S = sample standard deviation | n = sample size
The distribution of the t-statistic is known as Student's t-distribution, or simply the t-distribution. In mathematical notation, the t-distribution is written as follows:
X ~ t(k): the random variable X follows a t-distribution with 'k' degrees of freedom.
The t-distribution has only one parameter: the 'degrees of freedom'. Let's understand that in more detail.
Degrees of Freedom
- In statistics, the degrees of freedom (DF) indicate the number of independent values that can vary in an analysis without breaking any constraints.
- Typically, the degrees of freedom (DF) is equal to your sample size minus the number of parameters that you want to estimate.
- In a t-test, when you have a sample and estimate the mean, you have n − 1 degrees of freedom (DF): we are estimating just one parameter here, which is why it's n − 1.
The expected value E(X) and the variance of Student's t-distribution are given below:
- E(X) = μ ... if k > 2
- Var(X) = s² * k/(k − 2) ... if k > 2
Use cases:
Student's t-Distribution is mostly used in hypothesis testing, to decide whether to reject or fail to reject the null hypothesis.
In the above two-tailed test,
- The region of rejection (the area in the two tails) can be described using a z-score or t-score
- The central region is known as the acceptance region: if your test statistic falls somewhere in that region, you fail to reject the null hypothesis.
- If your z-score or t-score is less than −1.96 or greater than +1.96, the null hypothesis is rejected (at the 5% significance level).
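To see how the t-distribution's fatter tails widen the rejection cutoffs for small samples, compare critical values with scipy (a sketch, assuming scipy is installed):

```python
from scipy.stats import norm, t

alpha = 0.05  # two-tailed test at the 5% significance level

# Standard normal critical value: ~1.96
print("z critical:", norm.ppf(1 - alpha / 2))

# t critical values approach 1.96 as the degrees of freedom grow
for df in [5, 15, 30, 100]:
    print(f"t critical (df={df}):", t.ppf(1 - alpha / 2, df))
```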
Gaussian distributions and Student’s distributions are some of the most important continuous probability distributions in statistics and machine learning.
The t-distribution may be used in place of the Gaussian when the population variance is not known, or when the sample size is small. The two are closely related in a strict and formal way.
Log-Normal Distribution
Many of us are aware of the Normal Distribution, but perhaps very few know about the Log-Normal Distribution. Here, I will attempt to simplify this concept in lucid language.
The log-normal distribution is the probability distribution of a random variable whose logarithm follows a normal distribution.
There are many real-life quantities that follow the Log-Normal Distribution, like the number of comments posted on forums, dwell time on internet pages, etc.
In general, most log-normal distributions are the result of taking the natural log, where the base is e = 2.718. However, the log-normal distribution can be scaled using a different base, which affects the shape of the distribution.
If the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Likewise, if Y has a normal distribution, then the exponential function of Y, i.e. X = exp(Y), has a log-normal distribution.
Let Z be a standard normal variable, which means the probability distribution of Z is normal, centred at 0 and with variance 1. Then a log-normal distribution is defined as the probability distribution of the random variable
X = e^(μ + σZ)
The term "log-normal" comes from the result of taking the logarithm of both sides:
ln(X) = μ + σZ
As Z is normal, μ + σZ is also normal (the transformations just shift and scale the distribution, and do not affect normality), meaning that the logarithm of X is normally distributed (hence the term "log-normal").
In the above graph, the red distribution with μ = 0 and σ = 0.25 is nearly normally distributed. As the variance increases, the distribution skews towards the right, as shown in the graph.
Note:- The Log-Normal Distribution is mostly right-skewed.
How do we test for log-normal distribution?
- Step 1: Take the natural log of the desired variable
- Step 2: Use statistical tests to check the normality of the logged values (QQ plot, etc.)
- Step 3: If step 2 succeeds, we can say the desired variable is log-normally distributed.
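Here is a minimal sketch of those three steps (assuming scipy/numpy are available; the data is simulated, and the Shapiro-Wilk test stands in for the QQ plot as the normality check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical log-normally distributed data, e.g. dwell times
data = rng.lognormal(mean=0.0, sigma=0.5, size=1_000)

log_data = np.log(data)                  # Step 1: take the natural log
stat, p_value = stats.shapiro(log_data)  # Step 2: test log(X) for normality

# Step 3: a large p-value means we cannot reject normality of log(X),
# which is consistent with X being log-normal
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```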
Power Law Distribution
In statistics, a power law is a functional relationship between two quantities where a relative change in one quantity results in a proportional relative change in the other. We can say that a relative change in X causes a proportional relative change in Y.
Say the above graph describes the sales of products and follows a power-law distribution, where the green region shows that 80% of total sales come from 20% of the products. It is clear that the few products in the left region dominate the entire sales.
Example: Area of Square
Area_of_Square = a²
- If the side length of the square is 2, then the area will be 2² = 4.
- If the side length of the square is 4, then the area will be 4² = 16.
A power law distribution has the form Y = k X^α, where:
- X and Y are variables of interest,
- α is the law’s exponent,
- k is a constant.
If you take the inverse of the power law, Y = X⁻¹, that is also considered a power law, as a relative change in one quantity results in a proportional negative relative change in the other.
A power law describes a heavily skewed distribution, where a few individual points account for the majority of the value in the population. Simply put, it's the Pareto principle (80:20) on steroids:
Normal distributions assume that the population is spread around the average, with a huge majority of people near it. For example, if you were to plot the distribution of the weights of a country's entire adult population, it would probably resemble a normal distribution: a large proportion of people near the average, and the number of people falling off as you move away from the average on either side.
On the other hand, populations that obey a power law are completely skewed to one side. Think about wealth distribution: you know how they bandy around the statistic that 1% of people account for 50% of the world's wealth? That's a power law.
Some examples of phenomena with this type of distribution:
- Distribution of income
- The magnitude of earthquakes
- Sizes of cities according to population
- Sales of companies' products
How do we test whether two quantities follow a Power Law Distribution or not?
Step 1:
Take a log of both the quantities.
Step 2:
Plot those log values with each other
Step 3:
If they show a linear relationship, this indicates that the two quantities follow a power law. (As shown below.)
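Here is a minimal sketch of that test (assuming numpy is available; the data is generated from a known power law, Y = 2 * X^1.5, so the fit should recover those values):

```python
import numpy as np

# Hypothetical power-law data: Y = k * X^alpha with k = 2, alpha = 1.5
X = np.linspace(1, 100, 50)
Y = 2 * X ** 1.5

# Steps 1-2: take logs of both quantities and fit a straight line
slope, intercept = np.polyfit(np.log(X), np.log(Y), deg=1)

# Step 3: a good linear fit on the log-log scale indicates a power law;
# the slope recovers alpha and exp(intercept) recovers k
print("alpha ~", slope)              # ~1.5
print("k     ~", np.exp(intercept))  # ~2.0
```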
Pareto Distribution
As we stated above for the Power Law Distribution, the Pareto Distribution is very much like it.
The Pareto distribution has colloquially become known as the Pareto principle, or the "80-20 rule", and is sometimes called the "Matthew principle". This rule states that, for example, 80% of the wealth of a society is held by 20% of its population.
Whenever a distribution follows a power law, it is essentially a Pareto Distribution. The Pareto distribution is a skewed, heavy-tailed distribution that is sometimes used to model the distribution of incomes. The basis of the distribution is that a high proportion of the population has low income while only a few people have very high incomes.
Mathematically, the Pareto Distribution is denoted X ~ Pareto(Xm, α), where Xm is the scale parameter (the minimum possible value of X) and α is the shape parameter.
The PDF of the Pareto distribution is given by
f(x) = α * Xm^α / x^(α+1), for x ≥ Xm
Observations:
- Xm is at 1 (refer to the diagram above), which means the distribution starts, and peaks, at 1.
- The green line has α = 1, the blue line has α = 2, the red line has α = 3, and the dark black line has α = ∞.
- As α reduces, the tail gets fatter. You can observe that the green line has a fatter tail than the other two.
- When α becomes/equals infinity, the PDF looks like a delta function (which means it has only one value and everything else is ZERO).
- You can observe that the black line is like a delta: everything is ZERO except one value (i.e. at 1), where the peak can be seen. Such a function is called the Dirac Delta Function.
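To reproduce curves like those in the diagram, you can evaluate the Pareto PDF with scipy (a sketch, assuming scipy is installed; scipy calls the shape parameter b and passes Xm as scale):

```python
import numpy as np
from scipy.stats import pareto

xm = 1.0                   # scale: the minimum possible value, as above
xs = np.linspace(1, 5, 5)  # a few points at which to evaluate the PDF

for alpha in [1, 2, 3]:
    # Smaller alpha -> fatter tail (slower decay of the PDF)
    print(f"alpha={alpha}:", pareto.pdf(xs, b=alpha, scale=xm).round(3))
```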
Readers, we have seen a lot of distributions, from discrete to continuous. I have tried my best to explain them in a lucid manner. If you still feel some concepts are not clear and need to be explained separately, I will make a separate post to explain them in more detail. Feel free to comment below.
I have referenced diagrams, some key concepts, and definitions from a variety of websites. I have listed them in the reference section; please do check them as well.
References:-
- Statistics How To https://www.statisticshowto.datasciencecentral.com
- Stat Trek https://stattrek.com/
- Math is Fun https://www.mathsisfun.com/
- Britannica https://www.britannica.com/science/statistics
- Statistics 101 https://www.youtube.com/user/BCFoltz
- Wikipedia
- My own notes.