LabelboxJanuary 28, 2020

How can framing your problem help you build a good dataset for it?

Have you ever shopped on an absolutely awful e-commerce site? Do you remember why you hated it?


  • You couldn’t find the item you were searching for.
  • Browsing through the site’s inventory was slow and frustrating.
  • The checkout process took way too long.

Anyone who’s used an e-commerce site has dealt with these problems. When it comes to e-commerce problems, companies like Edgecase, are leveraging AI to create better product recommendations for customers and improve conversion rates.

Taking advantage of AI through machine learning (ML) can give you the right tools to solve many types of organizational problems. But, first you have to break down the problem into a form your ML model can understand and solve.

That involves framing your problem, or in other words, understanding the question you want to answer and choosing the right dataset to solve it. Let’s dive into why you need to frame problems correctly and how your dataset affects which questions you’ll ask ML model to answer.

Framing Your Problems Correctly Ensures Your ML Model Can Solve It

We can understand complex problems and datasets relatively quickly, a unique skill machine learning does not share. For example, we can drive in a new location based on what we can see, but an AI would require immense amounts of data and training to perform that task, which we consider relatively simple.

Because of machine learning’s limitations with complex problems, it’s important to frame your problem in a way that your ML model will understand. To do so, you have to understand what your ML is capable of. To help illustrate, let’s talk about bananas.

Let’s say you need to transport bananas, but because of the large quantities, you need help determining the amount you’ll be shipping. In setting up your machine learning algorithm, you create a training dataset (training datasets are used to teach ML algorithms how to do the job they were created for) that has images of bananas.

The pictures look like this.

A bunch of bananas.

Will your algorithm be capable of learning how to count these bananas accurately? Not likely. This is because of the limitations of what an ML model can and can’t do.

Consider what you see in the image.

  • A bunch of bananas with some in front of the others.

What does an ML model see?

  • A few shapes that indicate bananas, along with some yellow patches. (When zoomed in the bananas start to blend together, even to the human eye.)

You know that the yellow patches behind the main bananas are more bananas, but an ML model can’t tell. It has no concept of object permanence. No matter what annotations your dataset uses, your model will have an extremely difficult time distinguishing those yellow patches as separate bananas.

That’s not to say that bananas are your ML model’s kryptonite (that would be bananas). If you focus on bananas that are not obscured by other objects, then your model can identify them. You just need to define what percent of a banana has to be in the image for it to register as a banana.

When framing your question, you need to come up with a strict definition for things. For instance, if you want an ML model to tell when bananas are ripe, you have to decide what ripe bananas look like. Since people’s definitions of what is ripe can vary, you must have a strict definition for your model.

Bananas in various stages of ripeness are shown, with a number below each banana.

In the above photo, one person may consider number 5 to be ripe, while another waits until a banana looks like 6 or 7.

You also have to determine how specific or general you want your model to be. So if you’re identifying bananas, you have to figure out if you’re going to say all bananas are the same or if you are going to separate by variety.

Framing your issue calls for an understanding of machine learning’s limitations and for you to set specific definitions and parameters relevant to your problem. But there’s more to it than that. Framing your problem also depends on what your dataset looks like.

Framing Your Problem Through Datasets: How It Works Together

Your dataset dictates how your problem is framed. By teaching your ML algorithm what it’s identifying and what the environment looks like, your dataset helps your ML model understand the problem. There are two main ways you can set up your dataset, and each one has a different impact on your ML model.

One: you discover a problem that you want to solve and then build a dataset around the problem.

  • Advantages: because you’re building your dataset around your issue, you know that your images are relevant. You can also find exactly what you need when you’re annotating data.
  • Disadvantages: Since you’re focused on what you think you need, your data collection is biased, and you may miss valuable insights.

These biases may not be something you’re aware of until you start building your training dataset. Imagine you’re building an ML model that identifies wedding dresses. What is your definition of a wedding dress?

  • Long, lacy white dresses?
  • Elaborate, red dresses?
  • Colorful, brightly patterned dresses?

Depending on your culture and background, you probably chose only one of these options. However, they’re all considered wedding garments in different parts of the world. Restricting your ML model to your knowledge of the subject introduces bias, meaning you’ll miss out on certain types of wedding dresses.

The second way of setting up your dataset is by using data you’ve already collected and mapping a question to it. So if you have a large collection of car pictures, you may try to identify any jeeps in your pictures.

  • Advantages: Your data represents the real world because it is not based on a single question or idea. You also don’t have to build a dataset from scratch.
  • Disadvantages: You don’t know what’s in your images or where to find the items you want to identify, which leads to a lot of time spent curating images.

Since you don’t know what’s in your dataset, there’s also a chance you’ll miss out on important factors. A construction site safety management company ran into this when they tried to make a scene detection AI. They wanted their AI to identify demolition sites from images, but their customers quickly pointed out an issue with the finished product. The company hadn't considered indoor demolition.

If you want to avoid making a mistake like this, you have to be intentional when you’re building a dataset. Being intentional involves knowing the full range of the problem and ensuring your dataset will cover all aspects of the issue.

When it comes to training data, it has to be as close as possible to the real world conditions your ML model will be working in. For instance, your images can’t be overly staged, but they have to be clear enough to be useable.

Because your ML model is based on your dataset, you have to curate good data to make the most of your model. While each dataset is different depending on what you want your model to accomplish, there are some basic principles you don’t want to ignore.

The Basics of Building a Good Dataset

What would you do if you needed to speed up production on a new aircraft line? This was the problem Airbus had to solve. As the company started building its new A350 aircraft, it needed to accelerate its production time. The model had to identify disruptions in the process early and match these problems to applicable solutions.

Airbus knew what it wanted from its ML model. However, their ML algorithm wasn’t fully trained from the outset. Airbus collected data on all the problems and actions that took place during the production process. This data was fed into the model to teach it how the building process should run.

The result? Their ML model was so successful that Airbus has even bigger plans to use machine learning in the future. Because the training data accurately represented working conditions, was sufficiently comprehensive, and specific, Airbus’ model improved its production process.

Your training data will determine whether your machine learning model works as intended or not. The right dataset makes your model as effective as it can be. However, before you can build a good dataset, you have to decide on your question.

A bride in a white, Western style wedding dress.

Step 0: The Question

Every part of your ML model, including your dataset, depends on your question. The question needs to be as unambiguous as possible. Adding specificity to your question means you’re less likely to leave out important details in your dataset.

Going back to the wedding dress example, you may want to answer the question, “is there a wedding dress in this picture?” However, it’s almost impossible to input every type of wedding dress. Plus, some people wear regular, everyday dresses for their wedding. Restricting your question to a specific type of wedding dress improves your model’s accuracy.

Step 1: Ensuring the Quality and Relevancy of Your Data

Once you have the right question, your main focus shifts to ensuring that your dataset closely represents the environment your model will work in. Your data not only has to reflect the real world conditions, but it also has to take into account all possible variables.

If you’re building a dataset from scratch, setting up data collection sites or sending out field workers can help you get the full range of the subject. Your value comes from getting a large sample of data since it will cover more variables.

In contrast, datasets that have already been created need to be sorted through to find and mark relevant information. You also have to make sure your data relates to the problem you’re solving.

For example, self-driving car models use video that people have taken of roads as they drive. This gives the car a large amount of information about the city. However, this data doesn’t transfer from one place to another.

If you record data about San Francisco, the car will be able to navigate fairly well. But move the car to Las Vegas, and suddenly all the data it’s collected is useless. Even though Las Vegas is also a major city in the U.S., it’s different enough from San Francisco to require a specialized dataset.

Of course, there’s only so much data you can collect before you have to test your model and see if it’s feasible. It’s time for a beta test.

Step 2: Beta Testing

Beta testing shows you whether your ML model is even feasible. By training your ML model as early as possible, possibly with as little as 100 images, you can start to get a sense of its potential.

Testing your model also reveals the gaps in your dataset. You can then revisit these areas and fix the annotations or add more information. Finally, testing early gives you a human failsafe in case it doesn’t work. Your employees can step in to fix and refine your model as they discover issues.

Beta testing is a great way to find out if your dataset is working for your ML model or not. Either way, improving your data will help you frame your problem in a way your model can understand and solve.

It’s No Problem

Framing your problems does more than help your machine learning model work correctly. It makes it easier to build the best dataset for your needs, which saves you time and money.

Your training data has to suit your problem for it to be effective. At Labelbox, we can give you the tools to create superior datasets. We can also connect you with data labeling teams to speed up your annotation process.