How Are ML Models Measured?

When training and deploying machine learning (ML) models, it is difficult to track model performance reliably without making model monitoring part of your MLOps lifecycle. Understanding model evaluation metrics, performance metrics for classification and regression, and the F1 score in machine learning is critical for anyone deploying ML models.

How do you measure the performance of a model?

There are several metrics and model monitoring tools used to evaluate the performance of an ML model, and choosing the right ones is crucial. After all, evaluating your ML algorithm or model is an essential part of any successful project. There are at least seven categories of metrics used to measure machine learning performance, including:

1. Classification Metrics

Accuracy
Precision
Recall
Logarithmic Loss
F1-Score
Receiver Operating Characteristic
Area Under Curve

2. Regression Metrics

Mean Squared Error
Mean Absolute Error

3. Ranking Metrics

Mean Reciprocal Rank
Discounted Cumulative Gain
Normalized Discounted Cumulative Gain

4. Statistical Metrics

Correlation

5. Computer Vision Metrics

Peak Signal-to-Noise Ratio
Structural Similarity Index
Intersection over Union

6. Natural Language Processing Metrics

Perplexity
Bilingual Evaluation Understudy Score

7. Deep Learning Related Metrics

Inception Score
Frechet Inception Distance

Clearly, there are lots of different metrics and lots of variables that you could measure or assess regarding machine learning models. We could try to tell you which ones are better, or dive into all the technical jargon of each one. Instead, we are going to look at a couple of the more popular ones a little closer, starting with the F1-score.

What is the F1 score and why is it used?

In technical terms, the F1 score is defined as the harmonic mean of precision and recall. While certain applications might require more of an emphasis on either precision or recall, if you want a single number that reflects both metrics, then the F1 score is exactly what you are looking for.

What is the F1 score of a model written as a formula?

The F1 score formula is given below, where TP denotes true positives, FP denotes false positives, and FN denotes false negatives.

F1 is calculated as follows:

$$F_1=2*\frac{precision * recall}{precision + recall}$$

where:

$$precision=\frac{TP}{TP + FP}$$

$$recall=\frac{TP}{TP + FN}$$
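As a minimal sketch of the formulas above, the following Python snippet computes precision, recall, and F1 from raw counts. It uses the standard definition of recall, TP / (TP + FN). The function name `precision_recall_f1` and the example counts are illustrative assumptions, and returning 0.0 when a denominator is zero is one common convention rather than the only option.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    Hypothetical helper for illustration. Zero denominators are mapped
    to 0.0, which is an assumed convention.
    """
    # Precision: of everything predicted positive, how much was correct?
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 80 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```

Note that true negatives never appear in the calculation, so F1 says nothing about how well the model handles the negative class.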

It is worth noting that there is often a tradeoff between the precision and recall of an ML model: pushing one higher, for example by changing the classification threshold, typically pulls the other lower. With that in mind, let’s talk about what constitutes a good F1 score.

What is a good F1 score?

The range for F1 scores is between 0 and 1, with 1 being the best score possible. Naturally then, the higher the F1 score, the better. Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a poor score means that precision, recall, or both are low. As your recall and precision scores increase, your F1 score will also increase. That means if you find your F1 score is low, a good place to start looking for solutions is whichever of precision or recall is lagging.
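To make the relationship between F1 and its two components concrete, here is a small sketch that holds precision fixed and lowers recall. The helper name `f1_from` and the input values are assumptions chosen only for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (precision, recall) pairs; precision stays at 0.9.
pairs = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]
scores = [round(f1_from(p, r), 3) for p, r in pairs]

for (p, r), score in zip(pairs, scores):
    print(f"precision={p} recall={r} f1={score}")
# prints:
# precision=0.9 recall=0.9 f1=0.9
# precision=0.9 recall=0.5 f1=0.643
# precision=0.9 recall=0.1 f1=0.18
```

Even with precision held at 0.9, the score drops sharply as recall falls, which is why a single weak component is enough to produce a low F1.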

How to check the accuracy of a predictive model

Another simple yet important metric for ML models is classification accuracy, often referred to simply as accuracy. Accuracy is the ratio of correct predictions to the total number of predictions made. In other words, it looks something like:

$$Accuracy=\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
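The ratio above can be sketched in a few lines of Python. The function name `accuracy` and the label lists are hypothetical examples, not data from a real model.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    # Count positions where the prediction equals the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: two of the ten predictions are wrong.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
acc = accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}")
# prints: accuracy=0.80
```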

This might have you wondering, what is good accuracy for machine learning? While the goal is always to be as close to 1 (if you stick with the ratio) or 100%, the reality is that perfect accuracy is hard to achieve. As a common rule of thumb, 90% (or 0.9) and above is considered a good accuracy rate. However, that can change depending on the specific industry or model you are using. For example, if the model is being used to diagnose a deadly disease, a good accuracy rate might be closer to 95% or even 99%. Conversely, for something with lower stakes, like identifying whether pictures contain a dog, 90% might be a perfectly good accuracy score.

Fiddler…cracking the code on explainable ML

In many regards, artificial intelligence and machine learning embody the best of promise and progress for society. But, just like all humans have innate and implicit biases and blind spots, AI is also imperfect. The problem with machine learning is that most of what it decides, how it decides, and why it decides in a certain way, happens in a black box. That makes it difficult to detect model bias or flaws in the process. That can be a huge issue.

At Fiddler, we help MLOps and Data Science teams develop responsible AI by providing explainable AI. Once you understand why your ML models are making certain decisions, you can improve their overall performance.

Try Fiddler to get started on your path to building trust into AI.