Computer vision is the study and practice of teaching computers to "see" — that is, to understand and infer information from images or videos. When a human looks at an image, they can describe its content, identify the objects and people in it, and recognize them if they’ve seen them before, all without much effort. Teaching a computer to do this, however, requires more work.
Computer vision is a challenge in machine learning (ML) because the visual world is complex and vast, and computers are designed to work best with constrained problems. A computer can quickly be programmed to perform a task like identifying images in a database with “flower” in the file name, for example, but asking it to identify flowers within a set of images is a far more difficult task, because not all flowers look the same, and the images probably differ due to lighting, camera angle, camera type, image quality, and more.
In the past few years, computer vision algorithms have become more robust and accessible, as have public datasets such as COCO and PascalVOC, which can be used to train these algorithms. However, for enterprises embracing computer vision initiatives, publicly available algorithms and datasets don’t quite make the cut. Businesses need models trained on their own data for specific use cases, whether it's identifying cancer cells in microscope images of tissue samples, tracking the movement of vehicles, or finding weeds in photos of crops in a field. Many companies are also discovering the need for computer vision models trained on video data, or a combination of video and image data, to ensure that the training data is more representative of what the model will need to analyze once it’s in production.
An example of video data annotated to train a computer vision model.
Because of these challenges, machine learning models built for enterprise use cases are frequently more complex and more difficult to create.
Let’s explore a couple of basic computer vision applications, along with some examples of enterprise use cases.
An object detection model trains a computer to find all the objects in an image and outline them in bounding boxes.
This is a common computer vision application used across industries. Farm businesses, for example, can use it to find weeds, damaged plants, or other problems in photos of their field, saving them significant time and costs. With this ML model, they can spray for weeds only in affected areas rather than the entire field.
To train an algorithm to do this, the ML team will need to first gather their dataset — images of the field — and have an expert botanist or agronomist label them by hand. These annotated images can then be fed into the algorithm to train it. To achieve the level of accuracy required to allow farming businesses to rely on the model, the algorithm will likely need to be trained over several iteration cycles.
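The output of that hand-labeling step is typically a structured file pairing each image with its labeled objects. A simplified record, loosely in the style of the COCO format mentioned above (the field names and weed categories here are illustrative, not the full COCO schema), might look like this:

```python
import json

# One field photo with two hand-labeled weeds; a bounding box is [x, y, width, height]
annotation = {
    "image": {"id": 1, "file_name": "field_0001.jpg", "width": 1920, "height": 1080},
    "annotations": [
        {"id": 1, "category": "thistle", "bbox": [412, 230, 64, 80]},
        {"id": 2, "category": "crabgrass", "bbox": [900, 540, 120, 95]},
    ],
}

print(json.dumps(annotation, indent=2))
```

Thousands of records like this, reviewed by the botanist or agronomist, become the training dataset fed to the algorithm.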
Object detection is often used in tandem with many other computer vision applications, including:
Object classification: the model identifies the category for each object in the image (for example, the type of weed in an image of a field)
Object identification: the model finds each object in the image and identifies it
Object verification: the model decides whether a specific object is in an image
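All of these tasks rest on the same primitive: a rectangular box around an object. A standard way to score a model’s predicted box against a hand-labeled one is intersection over union (IoU); a minimal sketch in plain Python, with made-up coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Coordinates of the overlapping region, if the boxes overlap at all
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A labeled weed at (10, 10)-(50, 50) vs. a model's prediction at (20, 20)-(60, 60)
print(round(iou((10, 10, 50, 50), (20, 20, 60, 60)), 3))  # 0.391
```

A prediction is usually counted as correct when its IoU with a ground-truth box clears some threshold, commonly 0.5.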
An object segmentation model tells the computer to examine every pixel in an image and classify it based on whether it belongs to a specific object. This creates a mask that defines the exact shape of the object, rather than encompassing the object in a box. Image segmentation is useful when it’s important not only to identify an object but also to capture its shape or orientation.
This aerial image has buildings and a pool identified with segmentation masks.
One example of object segmentation is an insurance use case, where a computer vision model evaluates buildings and their surroundings in an aerial image to assess risk and damage. In this case, it’s important for the model to “see” the exact shape of an object — say, a collapsed wall — rather than identify the object with a bounding box.
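To see why a mask captures shape where a box cannot, consider a binary mask: each pixel is 1 if it belongs to the object and 0 otherwise. The sketch below, using a made-up 5x5 mask, compares the object’s true pixel area to the area of its tightest bounding box:

```python
# A toy 5x5 binary mask: 1 = object pixel (say, part of a collapsed wall), 0 = background
mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

object_pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
mask_area = len(object_pixels)

# The tightest bounding box around the same object
rows = [r for r, _ in object_pixels]
cols = [c for _, c in object_pixels]
box_area = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)

print(mask_area, box_area)  # 11 25: the box claims pixels the mask correctly excludes
```

Here the bounding box more than doubles the object’s apparent size, which is exactly the kind of error that matters when assessing damage from aerial imagery.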
Getting a computer vision algorithm to production can be boiled down to four basic steps:
Understand the business requirement
Create a training dataset
Train the model and assess its accuracy
Iterate with new training data until it reaches the desired accuracy
The first step in any computer vision initiative is to understand the problem you're trying to solve. For example, an e-commerce clothing company that wants to tailor store recommendations based on their customers' personal style would benefit from an algorithm that identifies and categorizes each piece of clothing in their catalog. A relatively simple model might be trained to analyze an image and note the following information about the piece of clothing within it:
Whether it is a top, bottom, or a dress
The color of the item
Any patterns or prints on the item
With a model that accurately identifies this data, the enterprise would be able to ensure that a person who has mentioned in their profile that they dislike floral prints, for example, would never see a floral print in their recommendations. A more complex version of this algorithm might be trained to categorize apparel further: the type of fabric, the style (such as a button-down vs. a t-shirt), what other pieces in the catalog it might be paired with, and more.
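The three attributes above amount to a simple label schema. One way a team might encode it (the category and pattern names here are illustrative, not from any particular platform) is as a small dataclass with validation, so malformed labels are caught before they reach the model:

```python
from dataclasses import dataclass

# Illustrative controlled vocabularies for the simple model described above
GARMENT_TYPES = {"top", "bottom", "dress"}
PATTERNS = {"solid", "striped", "floral", "plaid", "polka dot"}

@dataclass
class ClothingLabel:
    garment_type: str  # top, bottom, or dress
    color: str         # free-text color name
    pattern: str       # one of the allowed pattern values

    def __post_init__(self):
        if self.garment_type not in GARMENT_TYPES:
            raise ValueError(f"unknown garment type: {self.garment_type}")
        if self.pattern not in PATTERNS:
            raise ValueError(f"unknown pattern: {self.pattern}")

label = ClothingLabel(garment_type="dress", color="navy", pattern="floral")
```

A recommendation filter could then simply skip any catalog item whose pattern is “floral” for a customer who has said they dislike it.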
Once the machine learning team has determined exactly what they want their computer vision model to do and what role it will play in the business, they must then develop the training data set. Training data is arguably the most crucial element to the machine learning process. High quality training data will accelerate your model's path to production, and low quality data can cause expensive delays and an inaccurate model. The model learns to “see” based on what it is given, so ensuring that your training data is accurate and representative of what the model will need to evaluate in production is paramount to success.
To create a robust training dataset, your team will need to start by gathering relevant, diverse data. For the e-commerce example mentioned earlier, the dataset would need to include images of every type of item in the catalog; if the training dataset consisted only of images of tops, the model wouldn’t be able to recognize pants or dresses. Gathering a dataset for this use case is relatively simple, since the company would presumably already have images of every item it sells. An ML team tackling a medical imaging use case, on the other hand, might have a much more difficult time gathering the necessary data due to privacy restrictions, a lack of demographic variety in the available data, variations caused by different cameras or microscopes, and so on. Once a dataset has been gathered, the team will need to annotate each image, drawing bounding boxes or segmentation masks over each object and carefully labeling them according to the guidelines necessary to “teach” the computer vision model. These guidelines will be the basis for the dataset’s ontology: the organizational system that classifies each item identified in the dataset.
For example, a machine learning model being trained to identify and assess the ripeness of bananas might use an ontology like this:
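A sketch of what that ontology might look like, written as a nested Python dictionary (the class name, tool, and ripeness values are hypothetical, not a real platform’s schema):

```python
# Hypothetical ontology for a banana-ripeness model: one object class,
# with a required classification attribute and its allowed values
ontology = {
    "objects": [
        {
            "name": "banana",
            "tool": "bounding_box",
            "attributes": [
                {
                    "name": "ripeness",
                    "required": True,
                    "values": ["unripe", "ripe", "overripe", "rotten"],
                }
            ],
        }
    ]
}
```

Every labeler works from the same ontology, so two annotators looking at the same banana should produce the same label.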
Categorizing every item pictured in a dataset usually ends up being more complicated than it sounds. The team needs to consider:
The geographic location of those who are labeling images, because the names of items can differ widely in different places. For example, “pants” has a different meaning in the United Kingdom than it does in the United States.
How to nest categories. Will the model recognize a jumpsuit as a type of pants or a type of dress, or will it get its own category?
How they’ll address any ambiguities that come up during the process.
One of the highest-impact, lowest-cost actions an ML team can take to improve a model is to iterate on the annotations, ontology, and labeling guidelines, fine-tuning the resulting training data while they train and iterate on the model itself.
Once the ML team has generated the first training dataset, they’ll feed the dataset into the model and assess its accuracy. The ML engineers might determine where the model has the lowest accuracy — for example, it might have particular difficulty distinguishing long-sleeved shirts from sweaters — and use that information to create another dataset consisting mainly of long-sleeved shirts and sweaters, or update the existing labels on images of these items, so the model can better identify them.
Usually, the process of getting a computer vision model to production-level accuracy requires multiple training datasets, and the improvement in model performance will generally be smaller with each iteration.
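That iterate-until-accurate loop can be sketched in plain Python. The per-class accuracies below are a made-up dictionary standing in for a real validation run; the point is how the weakest class drives what goes into the next training dataset:

```python
TARGET_ACCURACY = 0.95

# Stand-in for a real validation run: per-class accuracy after one training round
per_class_accuracy = {
    "t-shirt": 0.97,
    "sweater": 0.81,
    "long-sleeved shirt": 0.78,
    "dress": 0.94,
}

overall = sum(per_class_accuracy.values()) / len(per_class_accuracy)
weakest = min(per_class_accuracy, key=per_class_accuracy.get)

if overall < TARGET_ACCURACY:
    # The next training dataset should over-sample the weakest class
    print(f"overall accuracy {overall:.2f}; collect more '{weakest}' examples")
```

In practice the team would also review the labels on the weakest class, since low accuracy can reflect inconsistent annotations as much as insufficient data.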
Training a computer vision model for an enterprise use case can be an extensive process, but there are a few actions that ML teams can take to ensure a smoother, faster journey to production.
Establish a robust labeling operation. Data annotation is a fundamental part of the training process, and it can make or break your entire ML initiative. Take time to find an experienced labeling team that understands your business requirements, implement quality management systems within your labeling workflow that incorporate domain expertise, and ensure that your entire labeling pipeline, from data lake to model input, is secure and seamless.
Find the right tools for the job. Fledgling ML teams often use free, open source labeling tools and pull together disparate systems and workforces to annotate their data, using USB drives and spreadsheets to transfer data and organize their operations. Enterprise teams that intend to grow, however, would do better to find tools, such as a training data platform (TDP), that will help them scale and securely connect all the people, processes, and data involved.
To learn more about how a TDP can benefit your enterprise computer vision initiatives, read our white paper, Training Data Platforms 101.