Is accuracy a good measure of model performance?

Companies that rely on modern artificial intelligence (AI) and machine learning (ML) applications find themselves at the forefront of change. When these models are well-designed and properly implemented, they can be truly transformative with endless use cases.

That being said, AI and ML are not perfect. Because they are often used to replicate human judgment, they’re vulnerable to the same bias or error that we, as humans, are subject to. When you consider the potential for model bias, combined with the “black box” nature of these models, the importance of model monitoring becomes clear.

You might think accuracy is the most important aspect of an ML model, but accuracy doesn’t tell you the whole story. Let’s take a closer look at how model accuracy relates to model performance, so you can better understand the role of measuring accuracy within the larger context of AI and ML applications.

What is the definition of model accuracy?

Model accuracy is the percentage of a model’s predictions that are correct. By contrast, model performance takes a more comprehensive view of the model’s behavior. This involves assessing a model’s accuracy, its real-time performance, its adaptability to new information or conditions, and so on.

Next, let’s look at how to calculate accuracy in machine learning.

How do we measure model accuracy?

An ML model’s classification accuracy is the fraction of its predictions that are correct. In its simplest form, the formula looks like this:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

For a binary classification model, the same quantity can be written in terms of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN):

$$\text{Accuracy} = \frac{\text{TP + TN}}{\text{TP + TN + FP + FN}}$$

The result of either formula can be expressed as a decimal or as a percentage.
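
As a minimal sketch of the basic formula, the Python snippet below counts how many predicted labels match the true labels and divides by the total. The function name `accuracy` and the sample labels are illustrative assumptions, not part of any particular library.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    # Count positions where the predicted label equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(y_true, y_pred)
print(f"{score:.3f} ({score:.1%})")  # 6 of 8 correct: 0.750 (75.0%)
```

The same value can be shown as a decimal or a percentage simply by changing the format specifier.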

Example machine learning model accuracy calculations

If an ML model makes 400 predictions and 380 of them are correct, the model’s accuracy would be:

$$\frac{380}{400}=0.95 \text{, or } 95\%$$

For a binary classification example, let’s say the model makes 400 predictions, resulting in:

360 true positives
20 true negatives
7 false positives
13 false negatives

…then its accuracy would (again) be calculated as:

$$\frac{360+20}{360+20+7+13}=\frac{380}{400}=0.95 \text{, or } 95\%$$

What’s the point of the second formula? It always gives the same result as the first, but it breaks the predictions down into four outcome types. That breakdown matters for imbalanced datasets, in which there is a significant disparity in the volume of positive and negative results, because it shows which kinds of errors the model is making.
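
As a quick check on the example above, the sketch below rebuilds label lists that match the four counts, tallies TP, TN, FP, and FN, and evaluates both formulas. The helper `confusion_counts` and the convention that `1` marks the positive class are assumptions made for illustration.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Tally (TP, TN, FP, FN) for a binary classification task."""
    counts = Counter()
    for t, p in zip(y_true, y_pred):
        if p == positive:
            # Predicted positive: correct if the true label is also positive.
            counts["tp" if t == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the true label is positive.
            counts["fn" if t == positive else "tn"] += 1
    return counts["tp"], counts["tn"], counts["fp"], counts["fn"]


# Labels constructed to reproduce the example: 360 TP, 20 TN, 7 FP, 13 FN.
y_true = [1] * 360 + [0] * 20 + [0] * 7 + [1] * 13
y_pred = [1] * 360 + [0] * 20 + [1] * 7 + [0] * 13

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

# Formula 1: correct predictions over all predictions.
simple = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Formula 2: the same ratio written with the four outcome counts.
detailed = (tp + tn) / (tp + tn + fp + fn)

print(tp, tn, fp, fn)    # 360 20 7 13
print(simple, detailed)  # 0.95 0.95
```

Both formulas give the same number. What the second one adds is the four counts themselves, which show where the errors fall.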

Why is accuracy not always a good performance measure?

On its surface, model accuracy might seem like the perfect measure of a model’s reliability, but it doesn’t tell the whole story. This problem is often called the “accuracy paradox,” described as follows:

“In the framework of imbalanced data-sets, accuracy is no longer a proper measure, since it does not distinguish between the numbers of correctly classified examples of different classes. Hence, it may lead to erroneous conclusions.”

In short, accuracy is appealingly simple, but it pools correct predictions from both classes into a single number. On imbalanced data, the majority class dominates that number, so a model can score well while performing poorly on the minority class. Measured in a vacuum, accuracy provides only a limited view of the model’s true dependability. This is why complementary metrics, like precision and recall, are so beneficial in the context of model monitoring.
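
To make the paradox concrete, here is a minimal sketch using a made-up dataset of 1,000 samples in which only 50 are positive. The hypothetical "model" always predicts the negative class. Recall, the share of actual positives the model finds, is defined in the next section.

```python
# Hypothetical imbalanced dataset: 50 positives (1) and 950 negatives (0).
y_true = [1] * 50 + [0] * 950

# A trivial "model" that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Count positives that were found (TP) and positives that were missed (FN).
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2%}, recall={recall:.2%}")
# accuracy=95.00%, recall=0.00%
```

The accuracy looks strong even though the model never identifies a single positive case.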

How should you measure the performance of models?

A much better approach to ML monitoring considers additional metrics to complement and add context to basic accuracy calculations. Two of these additional model evaluation metrics for classification are precision and recall.

Precision calculates the percentage of positive predictions that are correct.

$$\text{Precision} = \frac{\text{TP}}{\text{TP + FP}}$$

Recall, by contrast, calculates the percentage of actual positives correctly identified by the model.

$$\text{Recall} = \frac{\text{TP}}{\text{TP + FN}}$$

If we plug in our previous example’s numbers, then the model’s precision and recall calculations would look like this:

$$\text{Precision} = \frac{360}{360 + 7}=0.9809 \text{, or } 98.09\%$$

$$\text{Recall} = \frac{360}{360 + 13}=0.9651 \text{, or } 96.51\%$$

To put these measures together, then, in our example scenario (using the binary classification formula):

The model’s accuracy (percentage of correct predictions) is 95%;
The model’s precision score (percentage of correct positive predictions) is 98.09%; and
The model’s recall score (percentage of actual positives correctly predicted) is 96.51%.
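
The three figures above can be reproduced with a short sketch like the one below. The function names are illustrative, and returning `0.0` when a denominator is zero is an assumed convention rather than a rule. Other conventions, such as raising an error, are equally defensible.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    denom = tp + fp
    # Assumed convention: report 0.0 if the model made no positive predictions.
    return tp / denom if denom else 0.0


def recall(tp, fn):
    """Share of actual positives the model identified: TP / (TP + FN)."""
    denom = tp + fn
    # Assumed convention: report 0.0 if the data contains no actual positives.
    return tp / denom if denom else 0.0


# Counts from the example: 360 TP, 20 TN, 7 FP, 13 FN.
tp, tn, fp, fn = 360, 20, 7, 13

acc = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy  = {acc:.4f}")                # accuracy  = 0.9500
print(f"precision = {precision(tp, fp):.4f}")  # precision = 0.9809
print(f"recall    = {recall(tp, fn):.4f}")     # recall    = 0.9651
```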

The importance of well-rounded model performance evaluation

A machine learning model’s accuracy, precision, and recall are useful in gaining quick insights into basic trends within a dataset. Monitoring these over time will help you better understand the model’s overall performance and the factors that impact its reliability and, by extension, usability.

Keeping a close eye on model performance also helps you identify bigger-picture factors like model drift, as well as a dataset’s basic integrity and any hints of bias within the model.

The truth is that accuracy, precision, and recall are just three pieces in the bigger puzzle of machine learning evaluation metrics and performance monitoring. Ultimately, the better you understand the nuances of model performance — and the more willing you are to dig deep into the data, even asking difficult questions when hints of inaccuracy creep into the picture — the better-equipped you’ll be to take control and optimize your model.