### Machine Learning Model Evaluation Methods: Which One to Use?

## Model evaluation methods help us select between different trained models or settings

When I started working in analytics 12 years ago, I did not do predictive modeling for the first 15 months of my career. When I did start, my first encounter was with scorecard validation — yes, risk scorecard validation. Scorecards are very popular in risk analytics; they are used to decide whether to give a loan product to someone, whether that person will default, and so on. Scorecards are used mostly in the banking domain and they are still popular — heard of FICO and Experian scores? Anyway, when I moved into the model validation team, my job was to validate the relevancy and accuracy of a long list of scorecards on a regular basis. The methods I used were PSI (population stability index), VDI (variable deviation index), KS and rank ordering. Also, the models I dealt with were all logistic regression, or binary outcome, models.

Until 2016, a predictive model in any industry meant, for me, a logistic regression model, because we are always trying to predict whether an event will happen or not. In fact, I used to convert a regression problem into a logistic regression model and do the predictions. Why? Because it was easy to interpret the output of a logistic regression model in SAS, which provides a fixed set of evaluation metrics to look at — concordance, the Hosmer-Lemeshow test, KS, rank ordering and a few more.

Life used to be easy when it came to predictive modeling — you built a logistic regression model and checked a short list of evaluation methods. But times have changed.

With the growth of data, tools and techniques for predictive modeling, the set of evaluation methods used to assess the performance of a model has evolved as well. Model development is an iterative process, and evaluation methods play an important role in it.

Now predictive model development has been rebranded as machine learning. The concept is the same — we just call it by a different name — we are still predicting a dependent variable using a set of independent variables.

Machine learning has two key parts — supervised and unsupervised machine learning. In this article, we will focus on evaluation methods used for supervised machine learning wherein we predict the target using labelled data. We will focus on classification and regression models in supervised machine learning category.

### A typical machine learning process can be seen as follows:

**Representation** is about the data structure, which includes the types of variables/features and what we are going to predict — whether we need a classifier or a regressor.

**Train model** — train the model using a range of techniques. Yes, those days are gone when we would build one logistic or linear regression model and be done with the task. Nowadays, it’s a good idea to build models using a range of techniques, because no single technique suits all types of data.

**Evaluate model** — the topic of this write-up. It’s good practice to decide in advance which evaluation metric will be used to critically evaluate the performance of a model, because this is the stage that helps decide whether the model is working well.

**Refine model** — based on the outcome of the previous stage, we can try different approaches, such as changing the settings/parameters of the model in stage 2, bringing in more data, or doing more feature engineering, and then repeat this iterative cycle.

In all of this, the evaluation stage is the most critical one, as it helps us decide whether we are on the right path and how the model is doing on a given dataset.

So how do we decide which evaluation metric to use for a model? Does it depend on the technique or the business problem? Let’s explore.

**Evaluation Methods for Classification Problems**

In classification models, the target/dependent variable has discrete classes — for example, a binary outcome such as whether a customer will buy the new product, whether a customer will click on a particular ad, whether an email is spam, and so on.

For classification problems, the first thing we will look at is the **2×2 confusion matrix**.

*TP: Actual true and model predicted it true*

*TN: Actual false and model predicted it false*

*FP: Actual false and model predicted it true*

*FN: Actual true and model predicted it false*

### A number of metrics can be calculated from the confusion matrix

· Accuracy: (TP+TN)/Total

· Misclassification: (FP+FN)/total or (1-Accuracy)

· Precision: TP/(TP+FP). Out of the total predicted true, what % are actually true — i.e., when the model predicts true, how often it is right

· Recall/Sensitivity/TPR: TP/(TP+FN). Out of the total actual trues, what % the model catches — i.e., how sensitive the model is to actual positives

· False Positive Rate (FPR): FP/(FP+TN). Note that specificity is the complement, TN/(TN+FP) = 1 − FPR

· F1: (2*Recall*Precision)/(Recall+Precision)

We can look at each of these metrics, or we can focus on individual components of the confusion matrix. For example, in medical research for cancer, we should focus on reducing the number of cases in FN, or Type 2 error. This is because if someone has cancer but the model incorrectly marks them as not having cancer, that model is never going to be reliable. On the other hand, in marketing scenarios where we are targeting customers for a new campaign, the focus should be on reducing FP, or Type 1 error. This is because if we target a lot of FPs — customers incorrectly identified as having a high likelihood to buy the product — then we unnecessarily increase campaign cost through poor targeting.

Let’s see these metrics for the below data:

|  | Predicted: True | Predicted: False |
| --- | --- | --- |
| **Actual: True** | TP = 100 | FN = 5 |
| **Actual: False** | FP = 10 | TN = 50 |

**Accuracy** for this scenario is 91% ((50(TN)+100(TP))/165(Total)). Accuracy is not always the right method to measure success. For example, in credit card transactions the vast majority of transactions (say 99 out of 100) will be non-fraudulent and only a small number (say 1 out of 100) might be fraudulent. In such a case, a model that simply predicts every transaction as non-fraudulent will have an accuracy close to 99% — clearly not a meaningful measure for this scenario. Imbalanced classes are very common in machine learning, and hence accuracy might not always be the right evaluation method to look at.

**Recall/Sensitivity/TPR** for this scenario is 95% (100(TP)/105(5(FN)+100(TP))). High recall means the model not only accurately predicted TPs but also avoided false negatives. In the cancer detection example, it would mean the model not only correctly identified a high number of people who have cancer but also rarely failed to detect a true cancerous tumor. So if the objective is to increase recall, we should either increase TP or reduce FN.

**Precision** for this scenario is 91% (100(TP)/110(100(TP)+10(FP))) — out of all predicted positives, the share that are actual positives.

**FPR** for this scenario is 17% (10(FP)/60(10(FP)+50(TN))). This gives the fraction of all negative instances that the classifier incorrectly identifies as positive — in other words, out of the total number of actual negatives, how many instances the model falsely classifies as positive.

**F1 Score or Precision-Recall Trade-off**
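The worked numbers above can be checked in a few lines of plain Python (libraries such as scikit-learn compute the same quantities, but the arithmetic is simple enough to do directly):

```python
# Confusion-matrix counts from the worked example: TP=100, TN=50, FP=10, FN=5.
tp, tn, fp, fn = 100, 50, 10, 5
total = tp + tn + fp + fn            # 165

accuracy = (tp + tn) / total         # 150/165 ≈ 0.91
recall = tp / (tp + fn)              # 100/105 ≈ 0.95  (sensitivity/TPR)
precision = tp / (tp + fp)           # 100/110 ≈ 0.91
fpr = fp / (fp + tn)                 # 10/60  ≈ 0.17
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} fpr={fpr:.2f} f1={f1:.2f}")
```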

Precision — TP/(TP+FP)

Recall — TP/(TP+FN).

**We can have 2 scenarios:**

1. Low Recall, High Precision

For the above scenario:

Recall=0.76

Precision=1

This is achieved when we have FP = 0.

2. High Recall, Low Precision

Recall=1

Precision=0.8

This is achieved when we have FN = 0.

When one increases, the other goes down. Hence the F1 score, which combines precision and recall into a single number.

F1:(2*Recall*Precision)/(Recall+Precision)

It’s the harmonic mean of Recall and Precision. We can rewrite the F1 score in terms of the quantities we saw in the confusion matrix: true positives, false negatives and false positives.

F1: 2TP/(2TP+FN+FP)

Again, if we want high precision then we need to reduce FP, and if we want high recall then we have to reduce FN.
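The two forms of the F1 formula are algebraically the same, which is easy to confirm with a small sketch (the counts below are arbitrary examples):

```python
def f1_from_pr(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of confusion-matrix counts."""
    return 2 * tp / (2 * tp + fn + fp)

# The two forms agree for any counts, e.g. TP=100, FP=10, FN=5:
print(f1_from_pr(100, 10, 5), f1_from_counts(100, 10, 5))
```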

### Graphical Evaluation Methods for Classification Problems

#### Precision-Recall curve

The Precision-Recall curve gives us a visual picture of precision and recall at various threshold levels. Let’s see this using an example.

In the table below, we have 12 records for which we have the actual/true labels, the probability predicted by the model, and the predicted label at thresholds 0.5 and 0.7.

You will notice that as we change the threshold, the predicted label changes, which in turn affects the values of Recall and Precision.

We use different cutoffs, or thresholds, to derive values for precision and recall, which are then plotted on a graph called the **Precision-Recall curve** — a widely used evaluation method in machine learning. In such graphs, Precision is plotted on the X-axis and Recall on the Y-axis.

The ideal point on a precision-recall curve is the top right corner, where both Precision and Recall equal 1, meaning there are no FPs and no FNs. In the real world this is highly unlikely, so we aim for a point as close to the top right corner as possible. We should set the cutoff/threshold at the point where we achieve maximum precision and maximum recall together — beyond which sacrificing more recall adds little precision.
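The threshold sweep described above can be sketched in plain Python. The 12-record table is not reproduced here, so the labels and probabilities below are hypothetical stand-ins:

```python
# Hypothetical stand-in for the 12-record table: true labels and
# model-predicted probabilities (not the original article's data).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
y_prob = [0.9, 0.4, 0.8, 0.7, 0.3, 0.6, 0.55, 0.2, 0.85, 0.65, 0.45, 0.75]

def precision_recall_at(threshold, y_true, y_prob):
    """Derive the predicted label at a cutoff, then compute precision/recall."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Each threshold yields one (precision, recall) point on the curve.
for thr in (0.5, 0.7):
    p, r = precision_recall_at(thr, y_true, y_prob)
    print(f"threshold={thr}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold here trades recall for precision, exactly the tension the curve visualizes. In practice, scikit-learn's `precision_recall_curve` performs this sweep over every distinct score.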

**ROC Curve and AUC (Area Under the Curve)**

Another evaluation method that uses various thresholds — this time to calculate the False Positive Rate (FPR) and True Positive Rate (TPR) — is the ROC curve. As we know, TPR is recall and is plotted on the Y-axis of this graph; FPR, defined as FP/(FP+TN), is plotted on the X-axis. The ideal point on this graph is the top left corner, where TPR is 1 and FPR is 0, meaning there are no FNs and no FPs. For this curve, we should look to maximize TPR while minimizing FPR.

A single number called AUC can be derived from the ROC curve: it measures the area underneath the curve as a way to summarize a classifier’s performance. The area under the dotted diagonal line is 0.5, which represents random guessing — just as flipping a coin gives either outcome with equal likelihood, a classifier on this line does no better than chance for a binary outcome. So an AUC of zero represents a very bad classifier, and an AUC of one represents an optimal classifier.
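AUC can be sketched in plain Python via its standard rank interpretation: it equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative (a minimal sketch; scikit-learn's `roc_auc_score` is the usual tool):

```python
def roc_auc(y_true, y_prob):
    """AUC via its rank interpretation: the probability that a randomly
    chosen positive gets a higher score than a randomly chosen negative
    (ties count as half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# A model that ranks every positive above every negative scores AUC = 1.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))   # 1.0
# Reversed rankings give AUC = 0 — the "very bad classifier" case.
print(roc_auc([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9]))   # 0.0
```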

**Evaluation Methods for Regression Problems**

For classification, there are scenarios — such as medical diagnostic predictions or customer-facing email interactions — where the consequences of a false positive are very different from those of a false negative, so it made sense to distinguish these types of errors and do a more detailed analysis. In evaluating classifiers, we looked at plots like precision-recall curves that show the trade-offs a classifier can achieve between making errors of those two types. In theory, we could apply the same kind of error analysis to regression: for example, we could categorize errors of one type, where the predicted value was much larger than the target value, against a second type, where the predicted value was much smaller than the target value. In practice, though, it turns out that for most applications of regression, distinguishing between these types of errors is not as important. This simplifies evaluation for regression quite a bit.

Typically an R-squared is enough, but we can also use other evaluation methods such as mean_absolute_error and mean_squared_error.

**R-squared measures how much prediction error is eliminated when we use least-squares regression — in other words, how much of the variation in the dependent variable is explained by the independent variables.**

**R-squared is calculated as 1- (RSS/TSS)**

Given two variables X and Y, and a regression equation such as Y = mX + B, which models the relationship between X and Y:

**TSS is the total sum of squares**, calculated by subtracting the average Y from the actual Y at each data point, squaring those differences, and summing them.

**RSS is the residual sum of squares**, calculated by subtracting the predicted Y from the actual Y at each data point, squaring those residuals, and summing them.

**A higher value of RSS** means there is a big difference between the actual values and the values predicted by the regression line. This results in a smaller value of R-squared, indicating the regression line is not a good fit for the data.

On the other hand, **a lower value of RSS** means the actual and predicted values are close, so the residuals are small. This results in a larger value of R-squared, indicating the regression line is a good fit for the data.

Rule of thumb — R-squared typically ranges from 0 to 1. A value closer to 1 indicates a good model and a value close to 0 indicates a bad one. (It can even go negative when a model fits worse than simply predicting the mean.)
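The R-squared definition above can be sketched directly from RSS and TSS (plain Python; scikit-learn exposes the same calculation as `r2_score`):

```python
def r_squared(y_actual, y_pred):
    """R^2 = 1 - RSS/TSS, following the definitions above."""
    mean_y = sum(y_actual) / len(y_actual)
    tss = sum((y - mean_y) ** 2 for y in y_actual)                   # total sum of squares
    rss = sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_pred))  # residual sum of squares
    return 1 - rss / tss

# Perfect predictions give R^2 = 1; always predicting the mean gives R^2 = 0.
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0
print(r_squared([1, 2, 3, 4], [2.5] * 4))      # 0.0
```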

Also, it’s good practice to check the **residuals plot**. A residual plot with no trend indicates a good model; a residual plot with a trend indicates the need for a non-linear model.

Along similar lines, we can calculate mean_absolute_error and mean_squared_error.

**Mean Absolute Error** is the average of the absolute differences between the actual and predicted values. A good model will have a lower value for this metric.

**Mean Squared Error** is the average of the squared differences between the actual and predicted values. Again, look for a lower value.
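Both metrics are simple averages over the residuals. A minimal sketch (the sample values below are made up for illustration; scikit-learn provides the same functions under these names):

```python
def mean_absolute_error(y_actual, y_pred):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(y - yhat) for y, yhat in zip(y_actual, y_pred)) / len(y_actual)

def mean_squared_error(y_actual, y_pred):
    """Average of the squared differences between actual and predicted."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_actual, y_pred)) / len(y_actual)

# Illustrative values only.
y_actual = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.0]
print(mean_absolute_error(y_actual, y_pred))   # (0.5 + 0 + 1 + 1) / 4 = 0.625
print(mean_squared_error(y_actual, y_pred))    # (0.25 + 0 + 1 + 1) / 4 = 0.5625
```

Note that MSE punishes large errors more heavily than MAE because the residuals are squared before averaging.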

In the case of regression, the objective is to minimize the error so that our predictions are as close as possible to the actual target values.

### Conclusion

We looked at many different evaluation methods for both classification and regression problems. These days, one model does not fit all data, so it’s important to explore various techniques when undertaking a machine learning task — and to set a single-number evaluation metric at the start, so that we can quickly evaluate performance and move to the option that works best. Therefore, it’s important that we understand the types of evaluation methods available and how to use them.

Hope you find this useful.
