Google Cloud powers LLM evaluation service with Labelbox

After seeing the impact of using Labelbox internally, Google Cloud selected it to bring expert-graded LLM evaluation to its own customers — preference signal across customizable criteria, delivered as evaluation infrastructure built into Vertex AI.

The challenge

As teams building generative AI move from prototype to production, evaluating LLM performance becomes critical. The strongest evaluation techniques for LLMs and compound systems like RAG combine automated metrics with human judgment. Optimizing a model for human preference improves it, but producing that preference signal at scale is the most time-consuming, resource-intensive part of the process. Automated metrics cover part of the picture; human judgment remains the standard for nuances like relevance, bias, and overall quality.

The approach

Google Cloud first used Labelbox internally. After seeing its impact, Google Cloud selected Labelbox to give Vertex AI customers LLM evaluation as a managed solution, built into the platform. From the Vertex AI interface, a customer launches an evaluation job and sets its type (single model or side-by-side comparison) and its criteria (question-answer, multi-turn chat, or summarization). Labelbox's platform produces expert-graded preference signal against customizable dimensions like instruction following, verbosity, and relevance. Integrated APIs handle every step after task configuration, through QA. Results are visualized back in Vertex AI, so customers review and accept outputs and stay in full control of signal quality. The full suite of Labelbox products, including model distillation, RLHF, and LLM evaluation, is available on the Google Cloud Marketplace, with no-code integrations to BigQuery, CloudSQL, and Google Sheets.
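To make the side-by-side comparison concrete, here is a minimal Python sketch of how expert preference labels could be rolled up into per-dimension win rates. It is an illustration only: the `PreferenceLabel` record, its field names, and the `win_rates` function are hypothetical and do not reflect the Vertex AI or Labelbox APIs or data formats.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass

CHOICES = ("A", "B", "tie")


# Hypothetical record shape, assumed for illustration; not a Vertex AI or Labelbox schema.
@dataclass(frozen=True)
class PreferenceLabel:
    prompt_id: str
    dimension: str  # e.g. "instruction_following", "verbosity", "relevance"
    preferred: str  # "A", "B", or "tie"


def win_rates(labels):
    """Return {dimension: {"A": share, "B": share, "tie": share}}."""
    counts = defaultdict(Counter)
    for label in labels:
        if label.preferred not in CHOICES:
            raise ValueError(f"unexpected preference: {label.preferred!r}")
        counts[label.dimension][label.preferred] += 1

    rates = {}
    for dimension, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for choices that never appeared.
        rates[dimension] = {choice: counter[choice] / total for choice in CHOICES}
    return rates


# Made-up example labels comparing model A against model B.
labels = [
    PreferenceLabel("p1", "relevance", "A"),
    PreferenceLabel("p2", "relevance", "A"),
    PreferenceLabel("p3", "relevance", "B"),
    PreferenceLabel("p4", "relevance", "tie"),
    PreferenceLabel("p1", "verbosity", "B"),
    PreferenceLabel("p2", "verbosity", "B"),
]

rates = win_rates(labels)
print(rates["relevance"])  # {'A': 0.5, 'B': 0.25, 'tie': 0.25}
```

Keeping ties as their own bucket, rather than splitting them between the two models, shows how often graders saw no meaningful difference on a given dimension.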

The outcome

Vertex AI customers develop and ship LLM applications with confidence. They launch evaluation jobs in minutes and get quality-reviewed evaluation results within days.

Where this goes

Human preference judgment is the reward signal post-training runs on. Evaluation infrastructure like this — preference and grounding signal from expert knowledge workers — separates models that ship from models that stall.
