Model metrics help you evaluate the performance of a model and allows you to qualitatively compare two models. You can use model metrics to surface low-performing classes, find and fix labeling errors, and improve the overall performance of the model before it hits production on real-world data.
Accuracy tells us the model's overall performance, but this metric doesn't provide all the information needed to accurately assess a model's performance. For a more holistic picture, we'll need to consider other metrics, based on the specific context that the model is used in.
Generally, accuracy tends to be high in situations where a class has a very low probability of occurring, so a model can achieve high accuracy by simply predicting the most common class. For instance, the probability of finding cancer in computed topography scans or of finding swimming pools from satellite images of homes is low, so the model's accuracy can be high even if the model's ability to detect true positives is very poor.
Precision is a valuable metric when the negative cost of a false positive is high. For example, in spam detection models, a false positive would cause a vital email to be hidden and marked as spam when in fact, it is non-spam. A false positive in this case would negatively impact the user experience for seeing essential and urgent emails on time.
Recall is a helpful metric to use when the cost of false negative is high, and you want to minimize it. For example, in fraud detection models, a false negative would cause a fraudulent transaction to be successfully processed when it should have been flagged as fraudulent. This would obviously have a negative impact on the finances of the user. Recall is also helpful for most medical condition predictions, where you would minimize false negatives to increase recall.
In some cases with imbalanced data problems, both precision and recall are important – we can consider the F1 score as an evaluation metric. An F1 score helps in the detection of skewed datasets and rare classes. Generally, it is best to have high precision and recall so that your F1 score is high.
To demonstrate how accuracy only provides a partial assessment of a model's performance, we can compare the model metrics of two models below:
The accuracy in model A is 73.65%, and model B is 83.69%. Based on accuracy alone, model B seems to perform better. However, if you compare their recall scores, then model A has a better recall of 87.38% vs model B's 82.97% recall. Taking this into account, model A performs better since the cost of a false negative is high.
Rather than having you manually compute and upload metrics, Labelbox Model auto-computes metrics such as precision, recall, F-1, confusion matrix, etc. on individual predictions for you.
Find the distribution of annotations and predictions in every model run via histograms
In addition, you can easily understand the distribution of your annotations and predictions via histograms. This makes curating datasets for labeling and analyzing model performance easier than ever. You can now use distributions to find the most predicted or least-predicted class and surface classes represented in training data, but rarely predicted by the model.
You can learn more about Labelbox auto-metrics in our documentation or by reviewing how to upload image predictions in Labelbox.