Performance Measurement Models — (Part II)

Akash Dugam
15 min read · May 15, 2024

--

In Part-I of this series, we explored a variety of performance measurement models tailored for classification problems within the realm of Machine Learning. These metrics, including Accuracy, Confusion Matrix, Precision & Recall, and others, play a crucial role in evaluating the effectiveness of classification algorithms. But as we transition from the discrete outcomes of classification to the continuous nature inherent in regression problems, our approach to performance measurement must evolve accordingly.

In Part II, we shift our focus to regression-related performance metrics. Regression tasks, which predict continuous outcomes based on input features, require distinct metrics for assessing model performance. Here, we will delve into critical metrics such as →

  • SSE
  • MSE
  • RMSE
  • MAE
  • MAPE
  • MPE
  • R Squared & Adjusted R Squared
  • MAD
  • Distribution of Errors

These performance metrics provide insight into the accuracy and efficiency of regression models. Understanding these metrics is essential for developing, evaluating, and refining predictive models, which are crucial in countless applications from finance to healthcare.

Let’s get started, shall we?

SSE (Sum of Squared Error)

It is defined as →

SSE is the sum of the squared differences between the predicted values and the actual values.

A lower SSE value indicates a better fit between the predicted and actual values. SSE is sensitive to outliers because squaring large deviations increases their impact on the total error.

Example

Let’s create a simple example to calculate the Sum of Squared Error (SSE) for a series of basketball game score predictions. We’ll assume we have predicted scores for five games and then compare these predictions to the actual scores of the games.

  • The Error column shows the difference between the Predicted Score and the Actual Score. This represents how much our prediction deviated from the actual outcome.
  • The Squared Error column is the square of the Error. Squaring ensures that every error contributes a positive amount, so underestimations and overestimations of the same magnitude count equally toward the total.
Formula to Calculate SSE

SSE = Σ (Actual − Predicted)², summed over all observations.
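Since the score table from the original figure isn’t reproduced in text here, below is a minimal Python sketch with made-up scores, chosen so that the squared errors also sum to 95, matching the running example:

```python
# Hypothetical predicted vs. actual basketball scores (illustrative numbers only)
predicted = [103, 95, 110, 88, 97]
actual = [98, 99, 107, 94, 94]

# SSE: sum of the squared differences between predictions and actual values
sse = sum((p - a) ** 2 for p, a in zip(predicted, actual))
print(sse)  # 95 for these made-up scores
```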

What Does SSE Tell Us?

The SSE of 95 is a way of aggregating all the errors our predictions made into a single metric. It gives us an idea of the total error magnitude across all our predictions. Here’s why SSE is important:

  • Total Error Measurement: SSE provides a cumulative measure of error across all predictions. A lower SSE suggests better model accuracy.
  • Sensitivity to Large Errors: Because errors are squared, large deviations from actual values are given more weight, making SSE particularly sensitive to outliers. This can be both a strength and a weakness, depending on the context.
  • Model Comparison: Within the same dataset, SSE allows for the comparison of different models or prediction methods. The model with the lowest SSE is typically considered more accurate or better fitting to the data.

Limitations of SSE:

While SSE is useful for understanding and comparing model errors, it’s not without limitations. For instance, it doesn’t account for the number of data points or the variability of the data. Therefore, it’s often used in conjunction with other metrics, like Mean Squared Error (MSE), which normalizes SSE by the number of observations, making comparisons across datasets more meaningful.

Mean Squared Error (MSE)

One downside we observed with SSE is that it depends on the number of data points.

Therefore, it is not easy to compare SSE between N = 100 and N = 1000: with N = 1000, the total error will be larger simply because more predictions are being made.

MSE is the SSE divided by the number of observations; it scales the SSE to the sample size, providing a measure of the average squared deviation of the predictions.

It’s used to compare the accuracy of different models on the same dataset. MSE penalizes larger errors more than smaller ones due to the squaring process, making it useful when large errors are particularly undesirable.

Example

Let’s borrow the example from SSE → A Basketball score prediction!

To find the MSE, we’ll take the sum of all the squared errors (which we found to be 95 when discussing SSE) and then divide by the number of observations (games) to calculate the average.

This means that, on average, the squared error per game is 19.
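In code, this is a one-line step on top of the SSE sketch above:

```python
# MSE = SSE / n, using the SSE of 95 over 5 games from the running example
sse = 95
n = 5
mse = sse / n
print(mse)  # 19.0, the average squared error per game
```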

Understanding MSE:

  • This metric provides a standardized way of looking at the error magnitude, allowing for easier comparison across models or datasets with the same scale.
  • MSE’s sensitivity to large errors makes it a useful tool for identifying models that might be making significant mistakes in some predictions, even if they perform well on average.
  • Remember, MSE is expressed in the squared units of the outcome variable, so it is still not directly comparable to the original predictions or outcomes. Making that comparison requires taking the square root, which leads us to RMSE as a more directly interpretable metric.

Limitations of MSE:

  • Scale Dependency: Because it’s an average of squared errors, MSE’s value is directly influenced by the scale of the dataset. Larger values in the data will typically lead to larger MSE values, making it challenging to compare performance across datasets with different scales.
  • Interpretation Difficulty: The squared term in MSE makes it a bit more difficult to interpret in the context of the original data units. This is why the root mean squared error (RMSE) is often preferred for its interpretability.

Root Mean Squared Error (RMSE)

Up until now we have seen two performance metrics: SSE and MSE. The major disadvantage of these two metrics is that they don’t have intuitive units.

Imagine you are forecasting temperature in degrees Celsius; using the squared error, we end up with “Celsius squared”. There is no such thing as Celsius squared, and it carries no physical meaning. One way to express this error metric in a unit that makes sense is to take the square root of the Mean Squared Error, i.e., to calculate the RMSE.

Example

Let’s borrow the example from MSE → A Basketball score prediction!

This calculation tells us the average error (in the same units as the scores) that our model’s predictions have from the actual scores.

The square root of 19 is about 4.36, so the RMSE gives a direct sense of the model’s prediction accuracy on an average per-game basis, emphasizing the magnitude of errors and penalizing larger errors more heavily than smaller ones. This makes RMSE a highly insightful and widely used metric for evaluating the performance of prediction models.
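A minimal sketch of that final step:

```python
import math

# RMSE is the square root of the MSE (19 in the running example)
mse = 19
rmse = math.sqrt(mse)
print(round(rmse, 2))  # about 4.36, in the same units as the scores
```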

Understanding RMSE:

RMSE provides a measure of how well a model can predict the value of an outcome. Here’s why it’s valuable:

  • Scale Compatibility: RMSE is expressed in the same units as the predicted and actual values, making it easier to interpret than MSE.
  • Sensitivity to Large Errors: Like MSE, RMSE emphasizes larger errors more significantly than smaller ones, due to the squaring of errors before averaging. The square root helps to moderate this effect but still retains the property of penalizing large errors more heavily.
  • Model Comparison: RMSE is widely used for comparing forecasting accuracy of various models. A lower RMSE indicates a better fit to the data.

Limitations of RMSE:

  • Still Sensitive to Outliers: While the square root helps moderate the impact of large errors, RMSE remains sensitive to outliers. This can sometimes lead to misleading interpretations if the dataset contains many outliers.
  • Not Normalized: RMSE does not normalize for the number of observations, so while it’s excellent for comparing models on the same dataset, it’s not as useful for comparing across datasets of different sizes or scales.

Mean Absolute Error (MAE)

MAE measures the average magnitude of errors in a set of predictions, without considering their direction.

It is less sensitive to outliers than MSE and RMSE since it does not square the errors.

MAE provides a direct measure of average error magnitude and is interpreted as the average distance from the predicted values to the actual values.
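A minimal sketch, reusing the made-up scores from the SSE example (their absolute errors total 21, giving an MAE of 4.2):

```python
# Hypothetical predicted vs. actual scores (same illustrative numbers as the SSE sketch)
predicted = [103, 95, 110, 88, 97]
actual = [98, 99, 107, 94, 94]

# MAE: mean of the absolute differences, in the same units as the scores
mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(mae)  # 4.2
```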

Example

Let’s borrow the example from RMSE → A Basketball score prediction!

  • The MAE of 4.2 points tells us that, on average, our predictions were off by 4.2 points from the actual scores.
  • This metric is useful because it gives us a direct measure of prediction accuracy on the same scale as the original data, and it is more robust against outliers than squared-error metrics: MAE doesn’t penalize large errors more than small ones, as each error contributes linearly to the total error measure.

Mean Absolute Percentage Error (MAPE)

Imagine you’re guessing the price of several items. After all your guesses, you check the actual prices to see how off you were with each guess. MAPE helps you understand, on average, by what percentage you were wrong with your guesses. This makes MAPE very intuitive and easy to interpret, especially for non-technical stakeholders.

Let’s create a simple example to calculate the Mean Absolute Percentage Error (MAPE) for a series of basketball game score predictions. We’ll assume we have predicted scores for five games and then compare these predictions to the actual scores of the games.

The Mean Absolute Percentage Error (MAPE) across all games is approximately 3.99%.

This means that, on average, the predictions were off by about 3.99% from the actual scores, without considering whether they were overestimations or underestimations. This percentage gives us a sense of the average size of the prediction errors in terms of their proportion to the actual scores, offering a straightforward way to understand prediction accuracy across the set of games. ​

Here’s how it works:

  1. Subtract your predicted score from the actual score to find the difference for each item. This difference tells you by how much you overestimated or underestimated.
  2. Take the absolute value of each difference (the absolute errors).
  3. Turn these absolute errors into percentages by dividing them by the actual score. This tells you, for each item, by what percentage your prediction deviated from the actual value.
  4. Finally, calculate the average of these percentages. This average is your MAPE. It gives you a single percentage that represents how accurate your guesses were across all items, without getting skewed by whether you consistently overestimated or underestimated.

So, MAPE essentially tells you, “On average, your guesses were off by this percentage.” A lower MAPE means your guesses were closer to the actual values, while a higher MAPE means they were further off.
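Here is a small sketch of those four steps. It reuses the made-up scores from earlier, so the resulting percentage will not match the 3.99% from the original table (whose numbers aren’t shown), but the calculation is the same:

```python
# Hypothetical predicted vs. actual scores (illustrative numbers only)
predicted = [103, 95, 110, 88, 97]
actual = [98, 99, 107, 94, 94]

# MAPE: mean of the absolute errors expressed as a percentage of the actual values
ape = [abs(p - a) / a * 100 for p, a in zip(predicted, actual)]
mape = sum(ape) / len(ape)
print(round(mape, 2))  # about 4.3 (%) for these made-up scores
```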

Advantages:

  • Scale-independent, facilitating comparisons across different datasets.
  • Intuitive interpretation in terms of percentage errors.

Disadvantages:

  • Can be heavily skewed or undefined when actual values are zero or close to zero.

Mean Percentage Error (MPE)

Similar to MAPE, MPE calculates the average of percentage errors; however, unlike MAPE, MPE includes the sign, providing information on whether the model tends to under- or over-estimate.

MPE can be particularly informative when you’re interested in understanding the bias of the forecast errors.

Imagine you’re trying to predict the scores for a series of basketball games. After all the games, you compare your predicted scores to the actual scores to see how you did. MPE helps you figure out, on average, whether your predictions were too high or too low and by what percentage.

The Mean Percentage Error (MPE) across all games is approximately 0.32%.

What this tells us is that, on average, the predictions were very slightly overestimated by about 0.32%. The individual game calculations show us by what percentage each game’s score was over or under-predicted, while the MPE gives us an overall average of prediction accuracy across all the games.
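A minimal sketch, assuming the sign convention (predicted − actual) / actual so that a positive result means overestimation; with the made-up scores used earlier it will not reproduce the exact 0.32%:

```python
# Hypothetical predicted vs. actual scores (illustrative numbers only)
predicted = [103, 95, 110, 88, 97]
actual = [98, 99, 107, 94, 94]

# MPE keeps the sign: positive terms are overestimates, negative terms are underestimates
pe = [(p - a) / a * 100 for p, a in zip(predicted, actual)]
mpe = sum(pe) / len(pe)
print(round(mpe, 2))  # a small positive value, i.e., a slight overall overestimation
```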

Understanding MPE:

MPE is particularly useful for identifying bias in predictions. Here’s what it tells us:

  • Positive MPE: Indicates that, on average, the model overestimates the actual values. The model’s predictions are higher than they should be.
  • Negative MPE: Indicates that, on average, the model underestimates the actual values. The model’s predictions are lower than they should be.
  • MPE Close to 0%: Suggests that the model, on average, accurately predicts the actual values, although individual predictions may still be off.

In practice, MPE can offer valuable insights into a model’s predictive performance, especially regarding systematic biases. However, it’s often used alongside other metrics to provide a more comprehensive view of model accuracy and reliability.

R² (Coefficient of Determination)

In regression-related problems, the metric R² is used to answer the question “How good is our regression model?”

R-squared, also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model.

It provides an indication of the goodness of fit of a model and is often used to analyze how well the model predicts or fits the actual data.

Before explaining R², there are some prerequisite terms that we need to understand. Those are -

  1. Errors in the Regression
  2. SSTotal (Mean Regression Model)
  3. SSResiduals

Let’s understand these concepts.

Errors in the Regression:

An error in a regression model is defined as →

An error in regression is nothing but the difference between the Actual Value and the Predicted Value.

Error = Actual − Predicted

Error is the most important concept in regression problems, which is why we have revisited it here.

SSTotal (Mean Regression Model):

  • It is the sum of the squared differences between the observed values and the mean of the observed values: SSTotal = Σ (Actual − Mean)².
  • The mean itself is the most basic regression model, called the Mean Regression Model.
  • Given any query point, this model simply predicts the average of the observed values (ȳ).
  • Therefore, SSTotal is nothing but the sum of squared errors of the simple mean model.

SSResiduals:

  • It is the sum of squares of residuals (the differences between observed and predicted values).
  • It represents the variation in the dependent variable that cannot be explained by the model. A small sketch computing both quantities follows below.
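To make the two quantities concrete, here is a minimal sketch with made-up observations and predictions:

```python
# Made-up observed values and model predictions (illustrative numbers only)
y_actual = [10, 12, 15, 11, 17]
y_pred = [11, 12, 14, 12, 16]

y_mean = sum(y_actual) / len(y_actual)  # the prediction of the simple mean model

# SSTotal: squared deviations of the observations from their mean
ss_total = sum((y - y_mean) ** 2 for y in y_actual)

# SSResiduals: squared deviations of the observations from the model's predictions
ss_res = sum((y - yp) ** 2 for y, yp in zip(y_actual, y_pred))

print(ss_total, ss_res)  # 34.0 and 4 for these made-up numbers
```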

Now let’s proceed towards the calculation of R²

Calculation & Interpretation of R²

It is calculated as →

R² = 1 − (SSResiduals / SSTotal)

Let’s understand it case by case.

Case 1: When SSResiduals = 0

  • When SSResiduals is zero or close to zero, the predicted values are equal (or very close) to the actual values.
  • This is the best case for a regression model: R² is 1 or close to 1.

Case 2: When SSResiduals < SSTotal

  • If SSResiduals is less than SSTotal, R² lies between 0 and 1.
  • This is the typical case.

Case 3: When SSResiduals = SSTotal

  • It means the sum of squared residuals is the same as that of the simple mean model, i.e., our model does no better than always predicting the mean.
  • In this case, R² = 0.

Case 4: When SSResiduals > SSTotal

  • This is the worst case in regression: the model performs worse than the simple mean model.
  • In this case, R² is negative.

Case 5: When SSTotal = 0

  • It indicates a special and very specific scenario in your data: all of the observed values yi are identical.
  • In other words, there is no variance in your dependent variable.
  • Each yi is equal to the mean of the dependent variable, ȳ (y bar), so each term of the SSTotal sum is zero, and R² is undefined because computing it would require dividing by zero.
  • Since every data point yi equals ȳ, there is no need for regression; just return ȳ for any new query point.
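Tying the cases together, a minimal sketch using the made-up SSTotal and SSResiduals values from the earlier sketch:

```python
# From the earlier made-up example: SSTotal = 34, SSResiduals = 4
ss_total = 34
ss_res = 4

r_squared = 1 - ss_res / ss_total
print(round(r_squared, 3))  # about 0.882: SSResiduals < SSTotal, so 0 < R² < 1 (Case 2)
```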

Examples

Let’s consider a dataset where we’re trying to predict a person’s weight based on their height. In this scenario:

  • If R² = 0.8, it suggests that 80% of the variance in weight can be explained by height, indicating a strong relationship.
  • If R² = 0.2, it suggests that only 20% of the variance in weight can be explained by height, indicating a weak relationship.

Limitations and Considerations

  • Overfitting: A high R² value does not necessarily mean the model is good. A complex model might overfit the data, capturing noise rather than the underlying relationship.
  • Not a measure of accuracy: R² does not indicate whether the predictions are close to the observations; it only measures the proportion of variance explained.
  • Comparability: R² alone should not be used to compare models with different dependent variables or datasets.

Adjusted R²

Adjusted R² is a refinement of the R² statistic, designed to provide a more accurate measure of the goodness of fit for regression models, especially when comparing models with a different number of predictors.

Let’s dive into the details.

Definition and Formula

Adjusted R² adjusts the R² value to account for the number of predictors in the model. Unlike R², which always increases as you add more predictors, adjusted R² can decrease if the additional predictors don’t improve the model significantly.

This makes it a more reliable statistic for comparing models with different numbers of independent variables. The formula for adjusted R² is:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)

Where:

  • R² is the original coefficient of determination.
  • n is the sample size, i.e., the number of observations.
  • p is the number of predictors (independent variables) in the model.
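A minimal numeric sketch of this formula (the R², n, and p values are made up for illustration):

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R² of 0.88 from 30 observations, with different numbers of predictors
print(round(adjusted_r2(0.88, n=30, p=2), 3))  # about 0.871
print(round(adjusted_r2(0.88, n=30, p=5), 3))  # about 0.855: more predictors, lower adjusted R²
```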

Interpretation

The adjusted R² value provides a measure of how well the independent variables explain the variability of the dependent variable, adjusted for the number of predictors.

It penalizes the addition of irrelevant predictors, which is a significant advantage over R².

  • A higher adjusted R² indicates a better model fit, taking into account the number of predictors.
  • Unlike R², the adjusted R² can decrease if the contribution of the new predictors does not outweigh the penalty for increasing the number of predictors.

Why Use Adjusted R²?

1. Model Comparison: It is particularly useful when comparing models with a different number of independent variables. It helps in selecting the model that has the right balance between goodness of fit and model complexity.

2. Penalty for Extra Predictors: Adjusted R² incorporates a penalty for adding predictors that do not contribute to an increase in R² commensurate with the loss of degrees of freedom. This discourages overfitting by penalizing complex models that don’t necessarily provide better explanatory power.

3. More Accurate Measure of Fit: For models with many predictors, adjusted R² gives a more accurate measure of how well the model generalizes, potentially avoiding the misleading interpretation that adding more variables makes the model better.

Example

  • Consider two models predicting house prices: Model A uses 2 predictors (square footage and number of bedrooms)
  • Model B uses 5 predictors (adding in age of the house, distance to the city center, and local crime rate).
  • If both models have similar R² values, Model A might be preferred due to its simplicity unless Model B has a significantly higher adjusted R², indicating that the additional complexity of the model is justified by a proportional increase in explanatory power.

Median Absolute Deviation (MAD)

The main concern is that a single large error (residual) can significantly affect the sum of squared residuals (SSResiduals), which in turn affects the R² value. R² is not robust in the presence of outliers, because the squaring of the residuals lets outliers disproportionately influence the sum of squares. This is where MAD can be particularly useful.

By using the median of the absolute deviations rather than the mean (which would be more influenced by outliers), the MAD provides a measure of dispersion that is more resistant to the influence of outliers.

Let’s take a look at the errors (ei) described below →

  • If one of the ei is large (let’s say e4 is large), there is a high chance it will distort R². As we have studied, R² isn’t robust and is affected by outliers.
  • MAD can be used to tackle this problem, as it is not easily affected by outliers.
  • If we treat the error ei as a random variable, we can compute its median and its MAD.
  • If the median of the errors and the MAD of the errors are both small, we know our errors are small and tightly concentrated, as shown in the sketch below.
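A minimal sketch with made-up errors, including one large outlier, to show how little it moves the median and the MAD:

```python
from statistics import median

# Made-up errors e_i = actual - predicted, with one large outlier (25)
errors = [2, -1, 3, 25, -2]

med = median(errors)                        # median of the errors -> 2
mad = median(abs(e - med) for e in errors)  # median absolute deviation from that median -> 3
print(med, mad)  # both stay small despite the outlier
```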

Distribution of Errors

In the last section we saw how to get more information about the errors with the help of the mean, median, standard deviation, and MAD. But we can also use the probability distribution of the errors to get much more information (we will see how to utilize the PDF and CDF).
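As a small preview, here is a sketch of an empirical CDF of the errors (made-up numbers): for each sorted error value it prints the fraction of errors at or below it.

```python
# Made-up errors; the empirical CDF shows what fraction of errors
# fall at or below any given value
errors = [2, -1, 3, 25, -2]

sorted_errors = sorted(errors)
for i, e in enumerate(sorted_errors, start=1):
    print(f"P(error <= {e}) ~= {i / len(sorted_errors):.2f}")
```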

Final Note

That’s a wrap! I’ve thoroughly covered performance measurement techniques, with Part I focusing on classification performance measurements and Part II on regression performance techniques. I’ll keep updating this blog with new content as needed. If you enjoyed reading, please clap and consider following me.
