Leading AI lab builds team of STEM experts to improve domain-specific, multimodal reasoning
Problem
A leading AI lab sought to identify areas within K-12 STEM education where their large language model (LLM) struggled to generate accurate responses. To do so, they needed help sourcing a reliable and diverse team of STEM experts to evaluate their models and provide new domain-specific training data that would push the limits of their LLM.
Solution
Labelbox's Labeling Services used the Alignerr network to assemble a team of highly skilled STEM experts with advanced degrees (PhDs and master's degrees) in fields like chemistry, biology, and engineering. Drawing on their expertise, these experts created original multimodal (text and image) prompts and accurate answers to evaluate and improve the model’s responses.
Result
Labelbox's team of STEM experts consistently generated unique multimodal reasoning prompts that identified the model's limitations, enabling the lab to target key areas for improvement and drive performance enhancements. Labelbox is now a pivotal, fully integrated partner in delivering high-quality, domain-specific STEM data for the lab’s real-time loss training workflow.

Introduction
A leading AI lab aimed to identify areas within elementary, middle, and high school (K-12) STEM where their cutting-edge LLM struggled to generate accurate responses. Their goal was to find a reliable team that could seamlessly integrate into their real-time loss workflow to generate complex, differentiated multimodal (image and text) prompts, pinpointing the exact areas where the model needed improvement.
However, they faced a critical obstacle in generating unique, domain-specific image and text pairs: sourcing a large group of qualified STEM experts with expertise across technical domains such as biology, physics, engineering, and earth sciences.
Delivering differentiated, multimodal STEM data
The AI lab required a reliable data vendor with expertise in multimodal STEM, and Labelbox’s Labeling Services rose to the challenge. Powered by Alignerr, our talent pool spans a wide range of industry-specific domains, supports multiple languages, and can draw on a diverse group of experts from around the world.
With the ability to quickly source experts, execute a 24-hour calibration period, and manage projects from start to finish, Labelbox swiftly assembled a team of skilled STEM experts to tackle the task.
The work went beyond simple labeling or evaluation: it was complex, came with detailed instructions, and required original thinking. Each AI trainer was asked to generate complex multimodal prompts, combining images and text, that covered a wide variety of STEM fields across all grade levels. Prompts were adjusted until they pushed the limits of the LLM, and then accurate responses were created to help train the model.
The complexity of the task demanded top-tier domain specialists who could create challenging prompt-response pairs. Labelbox carefully vetted hundreds of STEM experts in fields like engineering, math, and physics, ultimately selecting 150 highly qualified professionals with PhDs and master's degrees. Under strict guidelines, the final datasets had to include original prompts that were not easily searchable or available online, covering a broad range of STEM topics.
“My advanced mathematics degree and AP teaching experiences helped me craft nuanced and novel questions that challenged existing AI models. It was exciting to generate multimodal datasets and challenged me to think about certain topics differently. My domain expertise was crucial to creating impactful datasets that will help advance the capabilities of AI.” - Derek H., Math master's and AP math teacher
After creating the multimodal prompts and their corresponding responses, the team evaluated whether the AI lab’s model struggled with those prompts. A prompt was considered a 'winning label' only if the model answered it incorrectly multiple times.
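The 'winning label' rule can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `PromptTrial` structure, the `is_winning_label` function, the threshold of two failures (the case study only says "multiple times"), and the string comparison, which stands in for the correctness judgment that the STEM experts made themselves.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptTrial:
    """Hypothetical record: one expert prompt and the model's repeated answers."""
    prompt_id: str
    reference_answer: str      # the expert's accurate answer
    model_answers: List[str]   # one entry per model attempt


def _normalize(text: str) -> str:
    # Illustrative comparison only: lowercase and collapse whitespace.
    # In practice an expert reviewer would judge correctness.
    return " ".join(text.lower().split())


def count_failures(trial: PromptTrial) -> int:
    """Count the attempts whose answer does not match the reference."""
    reference = _normalize(trial.reference_answer)
    return sum(1 for answer in trial.model_answers
               if _normalize(answer) != reference)


def is_winning_label(trial: PromptTrial, min_failures: int = 2) -> bool:
    """Assumed rule: 'multiple times' is read as at least `min_failures` misses."""
    return count_failures(trial) >= min_failures


trials = [
    PromptTrial("p1", "4 N", ["4 N", "4 n", "5 N"]),          # one miss
    PromptTrial("p2", "12 cm", ["10 cm", "14 cm", "12 cm"]),  # two misses
]

winning = [t.prompt_id for t in trials if is_winning_label(t)]
print(winning)
```

Under these assumptions, only the second prompt qualifies, because the model misses it on two of three attempts; the first prompt is dropped since a single miss does not count as repeated failure.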
Improving domain-specific task performance with experts and software
The AI lab’s overarching goal was to enhance their real-time loss training workflow, which lacked a qualified team of experts to consistently evaluate their models and generate high-quality feedback. This workflow was crucial to their AI development process: it enabled continuous improvement, identified model weaknesses, and integrated humans in the loop, all of which are essential for improving the LLM’s performance on domain-specific tasks.
Labelbox’s multimodal chat editor was crucial in incorporating expert human feedback, enabling clear labeling instructions, and permitting direct evaluation of the generated prompts against the lab’s own LLM.
After multiple rounds of evaluation and expert review, Labelbox delivered a new multimodal dataset that significantly enhanced the AI lab’s model performance on complex STEM questions. With a reliable team of experts in place, the lab now has an efficient real-time loss workflow that continuously identifies weaknesses in STEM queries, allowing for precise improvements to their LLM.