Extracting clinical signal from millions of unstructured medical records

Problem

Extracting data from paper copies of millions of medical records to train ML models is time-consuming and expensive: records routinely run 500 to thousands of pages, come from many providers, hospitals, and insurers with no consistent formatting, and arrive as PDFs, faxes, and scans.

Solution

AHP used Labelbox's platform for automation and active learning, which gave the team a rapid way to evaluate entity-extraction accuracy using a mix of contributors and in-house experts for human review.

Result

AHP dramatically sped up how it processes medical records into signal for its ML models, through easier collaboration and better QA. Classifying each page used to take about 13 seconds; with active learning and model-assisted labeling in Labelbox, the average per-label time dropped to just 8 seconds.

Advent Health Partners runs OCR and NLP to review medical records. Labelbox produced the expert-reviewed signal and active-learning loop, cutting per-page labeling from 13 to 8 seconds.

Note: This post is a shortened recap of a virtual talk given by Robert Coop, Chief AI Officer at Advent Health Partners, during Labelbox Accelerate (Nov 2022).

The challenge

Advent Health Partners (AHP) is a healthcare service and technology company focused on healthcare reimbursement, with proprietary medical-record review technology. Its flagship product, the CAVO platform, gives insurers and payers an interactive interface for medical-record review. The data science team applies optical character recognition (OCR) and NLP entity extraction to those records. The challenge is the data: paper records routinely run 500 to thousands of pages, come from many providers, hospitals, and insurers with no consistent formatting, and arrive as PDFs, faxes, and scans. AHP's goal is to feed the extracted data into clinical ML models for focused review, which means checking whether the documentation justifies the treatment a patient received.

The approach

AHP used Labelbox as a data engine to turn unstructured records into signal. The platform gave the team a fast way to evaluate entity-extraction accuracy using a mix of contributors and in-house experts for human review, and to build a page-classification model that auto-tags emergency department records, discharge notes, and other document types for experts to verify. Active-learning workflows surfaced where the classification models lacked accuracy: in-house domain experts reviewed low-confidence predictions to prevent errors, and the team used uncertainty sampling to decide what to label next. To sample efficiently, the team scored unlabeled data by the entropy of the classification output, took only 5-10% from the low-entropy (confident) bucket and the rest from the high-entropy (uncertain) bucket, and trained on the new classes. When a class imbalance appeared (one class made up 25% of the data, the rarest about 7%), the team used an earlier model to build a semi-supervised model that balanced the classes and improved performance.
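The entropy-based sampling described above can be sketched roughly as follows. This is a minimal illustration, not AHP's actual pipeline: the function names, the entropy threshold that separates the "confident" and "uncertain" buckets, and the reading of "5-10%" as a share of each review batch are all assumptions made for the example.

```python
import math
import random
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_review(
    predictions: Sequence[Sequence[float]],
    batch_size: int,
    confident_fraction: float = 0.1,  # assumed: share of the batch from the confident bucket
    entropy_threshold: float = 0.5,   # hypothetical cut-off between the two buckets
    seed: int = 0,
) -> List[int]:
    """Return indices of unlabeled pages to send to expert review."""
    rng = random.Random(seed)
    scored = [(entropy(p), i) for i, p in enumerate(predictions)]

    # Low entropy means the classifier is confident; high entropy means it is unsure.
    confident = [i for h, i in scored if h < entropy_threshold]
    uncertain = [
        i
        for h, i in sorted(scored, key=lambda pair: pair[0], reverse=True)
        if h >= entropy_threshold
    ]

    n_confident = min(len(confident), round(batch_size * confident_fraction))
    n_uncertain = min(len(uncertain), batch_size - n_confident)

    # Random spot-check of confident pages, then the most uncertain pages first.
    return rng.sample(confident, n_confident) + uncertain[:n_uncertain]


if __name__ == "__main__":
    # Toy softmax outputs for a three-class page classifier (made-up numbers).
    toy_predictions = [[0.98, 0.01, 0.01]] * 10 + [[0.34, 0.33, 0.33]] * 10
    print(select_for_review(toy_predictions, batch_size=10))
```

Keeping a small slice of confident pages in every batch lets reviewers catch cases where the model is confidently wrong, while most of the expert time goes to the pages the model is least sure about.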

The outcome

AHP dramatically sped up signal production for its ML models. Classifying and annotating each page used to take about 13 seconds; with active learning and model-assisted labeling, the average dropped to 8 seconds.

“By using Labelbox's model-assisted labeling workflows, we have been able to cut a full 5 seconds off of the amount of time for each specific labeling task, and we’ve found that our labelers have an easier time through a software-first approach,” said Robert Coop, Chief AI Officer at Advent Health Partners.

Where this goes

Healthcare claims are a document-understanding problem at massive scale. Expert-reviewed signal, sampled where the model is least sure, is how you train a model to read records as carefully as a reviewer would.