TRINS: Towards Multimodal Language Models that Can Read

This paper introduces TRINS, a large-scale dataset specifically designed to improve multimodal models' ability to "read": that is, to interpret, reason about, and generate language grounded in both visual scenes and embedded text (such as signage, documents, and labels). The dataset contains 39,153 carefully annotated, text-rich images, each paired with a long-form caption averaging 65 words, along with more than 102,000 question–answer pairs designed to evaluate reading comprehension within images.
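To make that structure concrete, here is a minimal sketch of what a single TRINS-style sample could look like in code. The field names and example values are illustrative assumptions for this post, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TextRichSample:
    """Hypothetical structure of one sample; field names are assumptions, not the TRINS schema."""
    image_path: str                          # a text-rich image (signage, document, label, ...)
    caption: str                             # long-form caption, around 65 words on average
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs

# Invented example values, purely for illustration.
sample = TextRichSample(
    image_path="images/storefront_0001.jpg",
    caption=(
        "A corner bakery with a red awning sits at a busy intersection; the hand-painted "
        "sign above the door reads 'Fresh Bread Daily', and a smaller placard in the "
        "window lists opening hours from 7 AM to 6 PM."
    ),
    qa_pairs=[
        ("What does the sign above the door say?", "Fresh Bread Daily"),
        ("When does the bakery open?", "7 AM"),
    ],
)
```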
Most existing multimodal datasets (e.g., COCO, TextVQA) provide relatively short captions or sparse annotations. TRINS stands out by densely capturing both visual context and embedded text, making it especially useful for training models that need to perform OCR-like reasoning or multimodal understanding where text is a critical part of the image content.
To showcase the power of this dataset, the authors also introduce LaRA (Language–vision Reading Assistant), a compact and efficient multimodal model trained on TRINS. Despite being much smaller than models like GPT-4V or Gemini, LaRA achieves state-of-the-art performance on multiple benchmarks involving text-rich image understanding, including both captioning and visual question answering.
How Labelbox Was Used
The success of TRINS hinges on the quality and density of its annotations — and Labelbox played a central role in enabling this at scale.
- Human-in-the-loop annotation at scale:
The authors used Labelbox to annotate 40,576 images, involving 2,079 hours of human labeling and 159 hours of human review. Annotators were instructed to describe both the visual content and the embedded textual content, with attention to attributes like font style, color, position, and meaning.
- Structured annotation guidelines:
Labelbox was used to create and enforce detailed ontologies and annotation policies, helping annotators capture high-quality data that included:
- What is happening in the image.
- What the text says.
- Why the text is important.
- Optional deeper reasoning or interpretation.
- Quality assurance workflows:
After initial annotation, TRINS employed automated checks (including OCR vs. human text comparisons) and manual reviews via Labelbox to flag low-quality or incomplete examples. These entries were either rejected or routed for re-annotation, ensuring consistency and accuracy throughout the dataset; a minimal sketch of such a check appears after this list.
- Enabling long-form, expressive annotations:
Labelbox’s flexible interface made it easier to collect detailed, multi-sentence captions and diverse QA pairs, in sharp contrast to most datasets, which rely on much shorter labels.
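To illustrate how the pieces above might fit together, the sketch below pairs a hypothetical annotation record (mirroring the guideline fields: what is happening, what the text says, why it matters, optional reasoning) with a simple OCR-vs-human-text consistency check. The field names, the use of difflib, and the 0.8 similarity threshold are assumptions made for illustration; the paper does not specify the exact tooling behind its automated checks.

```python
from difflib import SequenceMatcher

# Hypothetical annotation record mirroring the guideline fields above
# (field names and values are invented for illustration, not the TRINS schema).
annotation = {
    "scene_description": "A corner bakery with a red awning on a busy street corner.",
    "transcribed_text": "Fresh Bread Daily",   # what the annotator says the embedded text reads
    "text_importance": "The sign identifies the shop's main offering to passersby.",
    "reasoning": "The wording suggests the bakery bakes on-site every morning.",
}

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are not flagged."""
    return " ".join(text.lower().split())

def needs_review(ocr_text: str, human_text: str, threshold: float = 0.8) -> bool:
    """Flag an example when OCR output and the human transcription disagree too much."""
    similarity = SequenceMatcher(None, normalize(ocr_text), normalize(human_text)).ratio()
    return similarity < threshold

# Simulated OCR output that dropped a word (invented for demonstration).
ocr_output = "Fresh Daily"

if needs_review(ocr_output, annotation["transcribed_text"]):
    print("Flag: route this example for manual review or re-annotation.")
else:
    print("Pass: OCR output and human transcription are consistent.")
```

In the workflow described above, flagged entries would then be rejected or sent back through Labelbox for re-annotation.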
Further research
TRINS represents a major leap forward in multimodal data quality and scale, especially in domains that require deep understanding of images with embedded text (e.g., street scenes, product images, forms). With Labelbox, the researchers were able to create a dataset that is not only large, but also rich, expressive, and verified, enabling the training of smaller yet more capable models like LaRA.
This combination of advanced human annotation and machine-efficient modeling signals a promising direction for building language–vision models that can truly read and reason. You can read the full paper here.