An introduction to model metrics

Model metrics help you evaluate the performance of a model and quantitatively compare two models. You can use model metrics to surface low-performing classes, find and fix labeling errors, and improve the model's overall performance before it hits production on real-world data.

Why does model accuracy not give a complete picture of the model's performance?

Accuracy summarizes the model's overall performance, but on its own this metric doesn't provide enough information to thoroughly assess a model. For a more holistic picture, we'll need to consider other metrics based on the specific context in which the model is used.

Generally, accuracy tends to be high in situations where a class has a very low probability of occurring, because a model can achieve high accuracy by simply predicting the most common class. For instance, the probability of finding cancer in computed tomography scans or of finding swimming pools in satellite images of homes is low, so the model's accuracy can be high even if its ability to detect true positives is very poor.
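This "accuracy paradox" is easy to demonstrate with a minimal sketch. The numbers below are hypothetical: a dataset with one positive case in 100, and a trivial model that always predicts the majority class:

```python
# Hypothetical imbalanced dataset: 1 positive case (e.g. cancer) in 100 scans.
y_true = [1] + [0] * 99

# A trivial model that always predicts the majority (negative) class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)        # 0.99 -- looks excellent
print(true_positives)  # 0   -- yet it never detects the positive class
```

Despite 99% accuracy, this model is useless for the task it was built for, which is why the metrics below matter.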


Precision is a valuable metric when the cost of a false positive is high. For example, in spam detection models, a false positive causes a vital email to be marked as spam and hidden when it is, in fact, legitimate. False positives in this case would negatively impact the user's ability to see essential and urgent emails on time.
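Precision is the fraction of positive predictions that were actually positive. A minimal sketch, using hypothetical counts for a spam filter:

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical spam filter: 90 emails correctly flagged as spam,
# 10 legitimate emails wrongly flagged (false positives).
print(precision(tp=90, fp=10))  # 0.9
```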


Recall is a helpful metric when the cost of a false negative is high and you want to minimize it. For example, in fraud detection models, a false negative would allow a fraudulent transaction to be processed when it should have been flagged. This would obviously have a negative impact on the user's finances. Recall is also important for most medical condition predictions, where missing a true case (a false negative) can be far more harmful than a false alarm.
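Recall is the fraction of actual positives the model caught. A minimal sketch, using hypothetical counts for a fraud model:

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical fraud model: 80 fraudulent transactions flagged,
# 20 fraudulent transactions missed (false negatives).
print(recall(tp=80, fn=20))  # 0.8
```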


In some imbalanced data problems, both precision and recall are important. In these cases, we can consider the F1 score as an evaluation metric. Because the F1 score is the harmonic mean of precision and recall, it stays low unless both are high, which makes it well suited to skewed datasets and rare classes. Generally, it is best to have high precision and recall so that your F1 score is high.
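The harmonic mean behavior can be sketched directly. The example scores below are hypothetical, chosen to show how the F1 score penalizes a model where either precision or recall is poor:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; low if either one is low."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High precision but poor recall is heavily penalized:
print(f1_score(0.95, 0.10))  # ~0.18
# Balanced, moderate scores fare better:
print(f1_score(0.70, 0.70))  # 0.7
```

Note that the balanced model scores higher despite its lower precision, which is exactly the behavior you want on skewed data.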

To demonstrate how accuracy only provides a partial assessment of a model's performance, we can compare the model metrics of two models below:

The accuracy of model A is 73.65%, and of model B is 83.69%. Based on accuracy alone, model B seems to perform better. However, if you compare their recall scores, model A has a better recall of 87.38% vs. model B's 82.97%. Taking this into account, model A performs better in a context where the cost of a false negative is high.
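The comparison above can be expressed as a small sketch: which model "wins" depends entirely on the metric you rank by, using the figures quoted above:

```python
# Metrics from the model A vs. model B comparison above.
models = {
    "A": {"accuracy": 0.7365, "recall": 0.8738},
    "B": {"accuracy": 0.8369, "recall": 0.8297},
}

best_by_accuracy = max(models, key=lambda m: models[m]["accuracy"])
best_by_recall = max(models, key=lambda m: models[m]["recall"])

print(best_by_accuracy)  # B
print(best_by_recall)    # A -- preferred when false negatives are costly
```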

What do model metrics look like in Labelbox?

Rather than having you manually compute and upload metrics, Labelbox Model auto-computes metrics such as precision, recall, F1 score, and the confusion matrix on individual predictions for you.

Labelbox Model will auto-generate metrics on individual predictions
  • You can simply upload your model predictions and ground truths to receive auto-generated metrics on model precision, recall, F1-score, TP/TN/FP/FN, and confusion matrix.
  • If the auto-generated metrics aren’t sufficient for your use case, you can upload your own custom metrics as well.
  • Visualize, filter, sort, and drill into your metrics, confidence scores, predictions, and annotations. This allows you to easily surface mispredictions and mislabeled data, and to quickly identify improvements to your training data.
  • You can interact and click into the NxN confusion matrix or click into the IOU / Precision / Recall histograms to surface and view specific data rows in “gallery view.” For instance, you can understand where your model is not performing well, where your labels are off, or where your model is the least confident.
  • Upload confidence scores alongside every prediction and tune the confidence and IOU thresholds in the Labelbox Model UI to see how model metrics change as the thresholds change.
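The effect of tuning a confidence threshold can be sketched in plain Python. This is an illustration of the underlying idea, not the Labelbox API, and the confidence scores and labels below are hypothetical:

```python
# Hypothetical predictions: (confidence score, ground-truth label).
preds = [(0.95, 1), (0.85, 1), (0.70, 0), (0.60, 1), (0.40, 0), (0.30, 1)]

for threshold in (0.25, 0.50, 0.75):
    # Predictions at or above the threshold count as positive.
    tp = sum(1 for conf, y in preds if conf >= threshold and y == 1)
    fp = sum(1 for conf, y in preds if conf >= threshold and y == 0)
    fn = sum(1 for conf, y in preds if conf < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold trades recall for precision (here, from precision 0.67 / recall 1.00 at 0.25 up to precision 1.00 / recall 0.50 at 0.75), which is the trade-off the threshold sliders let you explore interactively.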

Find the distribution of annotations and predictions in every model run via histograms

Use the prediction and annotation distribution histogram to surface important information about your model runs

In addition, you can easily understand the distribution of your annotations and predictions via histograms. This makes curating datasets for labeling and analyzing model performance easier than ever. You can now use distributions to find the most- or least-predicted class and surface classes that are represented in training data but rarely predicted by the model.
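The idea behind these distribution histograms can be sketched with standard-library counters. The class names and counts below are hypothetical:

```python
from collections import Counter

# Hypothetical class labels from one model run.
annotations = ["car", "car", "pedestrian", "car", "cyclist"]
predictions = ["car", "car", "car", "car", "pedestrian"]

ann_dist = Counter(annotations)
pred_dist = Counter(predictions)

# Classes annotated more often than they are predicted -- candidates
# for "represented in training data, but rarely predicted by the model."
under_predicted = [c for c in ann_dist if pred_dist[c] < ann_dist[c]]

print(ann_dist)          # Counter({'car': 3, 'pedestrian': 1, 'cyclist': 1})
print(under_predicted)   # ['cyclist']
```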

You can learn more about Labelbox auto-metrics in our documentation or by reviewing how to upload image predictions in Labelbox.