Human preference signal for evaluating LLMs inside Vertex AI

Problem

As LLMs grow more sophisticated, accurately evaluating their performance becomes critical. Automated metrics give insight, but human judgment remains the standard for nuances like relevance, bias, and overall quality. Producing large-scale, high-quality human preference signal is the hard part for most enterprises — it takes time, resources, and domain expertise.

Solution

After seeing the impact of using Labelbox internally, Google Cloud selected it to deliver LLM evaluation as a managed solution inside Vertex AI. Customers launch a human evaluation job and set criteria (question-answer, multi-turn chat, summarization), and Labelbox's platform produces expert-graded preference signal across customizable dimensions like instruction following, verbosity, and relevance.

Result

Customers develop and ship LLM applications with confidence. They launch evaluation jobs in minutes and receive quality-reviewed evaluation results within days.

Human preference signal for evaluating LLMs inside Vertex AI

After seeing the impact of using Labelbox internally, Google Cloud selected it to bring expert-graded LLM evaluation to its own customers — preference signal across customizable criteria, delivered as evaluation infrastructure built into Vertex AI.

The challenge

As teams building generative AI move from prototype to production, evaluating LLM performance becomes critical. The strongest evaluation techniques for LLMs and compound systems like RAG combine automated metrics with human judgment. Optimizing a model for human preference improves it, but producing that preference signal at scale is the most time-consuming, resource-intensive part of the process. Automated metrics give insight, but human judgment remains the standard for nuances like relevance, bias, and overall quality.

The approach

Google Cloud first used Labelbox internally. After seeing its impact, Google Cloud selected Labelbox to give Vertex AI customers LLM evaluation as a managed solution built into the platform. From the Vertex AI interface, a customer launches an evaluation job and sets its type (single model or side-by-side comparison) and its criteria (question-answer, multi-turn chat, or summarization). Labelbox's platform then produces expert-graded preference signal against customizable dimensions like instruction following, verbosity, and relevance. Once the task is configured, integrated APIs handle every step through QA. Results are visualized back in Vertex AI, where customers review and accept outputs and stay in full control of signal quality. The full suite of Labelbox products, including model distillation, RLHF, and LLM evaluation, is available on the Google Cloud Marketplace, with no-code integrations to BigQuery, Cloud SQL, and Google Sheets.
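
To make the job options above concrete, here is a minimal Python sketch of what an evaluation job configuration and a side-by-side result summary might look like. It is an illustration only: `EvalJobSpec`, `win_rates`, the field names, and the judgment format are all hypothetical and are not part of the Vertex AI or Labelbox APIs. The sample judgments are made-up inputs, not real results.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical names throughout: this is NOT the Vertex AI or Labelbox API.
EVAL_TYPES = {"single_model", "side_by_side"}
CRITERIA = {"question_answer", "multi_turn_chat", "summarization"}


@dataclass
class EvalJobSpec:
    """Illustrative job configuration mirroring the options described above."""

    eval_type: str
    criteria: str
    dimensions: list[str] = field(
        default_factory=lambda: ["instruction_following", "verbosity", "relevance"]
    )

    def __post_init__(self) -> None:
        # Reject options the text does not mention.
        if self.eval_type not in EVAL_TYPES:
            raise ValueError(f"unknown eval_type: {self.eval_type!r}")
        if self.criteria not in CRITERIA:
            raise ValueError(f"unknown criteria: {self.criteria!r}")


def win_rates(judgments: list[dict]) -> dict[str, dict[str, float]]:
    """Turn side-by-side expert judgments into per-dimension preference shares.

    Each judgment is assumed to look like
    {"dimension": str, "preferred": "A" | "B" | "tie"}.
    """
    counts: dict[str, Counter] = {}
    for judgment in judgments:
        counts.setdefault(judgment["dimension"], Counter())[judgment["preferred"]] += 1
    return {
        dim: {label: c[label] / sum(c.values()) for label in ("A", "B", "tie")}
        for dim, c in counts.items()
    }


# Made-up example inputs for a side-by-side summarization job.
spec = EvalJobSpec(eval_type="side_by_side", criteria="summarization")
judgments = [
    {"dimension": "relevance", "preferred": "A"},
    {"dimension": "relevance", "preferred": "A"},
    {"dimension": "relevance", "preferred": "B"},
    {"dimension": "relevance", "preferred": "tie"},
    {"dimension": "verbosity", "preferred": "B"},
    {"dimension": "verbosity", "preferred": "B"},
]
rates = win_rates(judgments)
```

Reporting shares per dimension rather than a single overall score keeps the customizable dimensions visible, so a model that wins on relevance but loses on verbosity shows up as exactly that.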

The outcome

Vertex AI customers develop and ship LLM applications with confidence. They launch evaluation jobs in minutes and get quality-reviewed evaluation results within days.

Where this goes

Human preference judgment is the reward signal that post-training runs on. Evaluation infrastructure like this, which supplies preference and grounding signal from expert knowledge workers, separates models that ship from models that stall.