Google Cloud powers LLM evaluation service with Labelbox


As Large Language Models (LLMs) become more sophisticated, accurately evaluating their performance becomes increasingly critical. While automated metrics provide insights, human evaluation remains the gold standard for understanding nuances like relevance, bias, and overall quality. However, conducting large-scale, high-quality human evaluations is a major challenge for most enterprises, requiring significant time, resources, and expertise.


Labelbox and Google Cloud have partnered to deliver a fully managed LLM evaluation solution directly integrated into the Vertex AI platform. This solution empowers Google Cloud customers to seamlessly launch human evaluation jobs and set specific evaluation criteria (e.g., question answering, summarization).


Customers can now develop and ship LLM applications with confidence: launching an LLM evaluation job takes minutes, and high-quality results arrive within days.

As teams building generative AI applications transition from prototypes to production, evaluating the performance of LLMs is becoming critical to their success. State-of-the-art techniques for evaluating LLMs and compound AI systems, such as retrieval-augmented generation (RAG), typically employ a hybrid strategy of automated and human evaluation. While optimizing LLMs for human preference judgments can improve their performance, human evaluation remains one of the most time-consuming and resource-intensive parts of the process.

To enable teams to evaluate and ship LLM applications confidently, Google Cloud has selected Labelbox to provide Vertex AI customers with an integrated LLM evaluation solution, delivered as a fully managed service.

Vertex AI LLM Evaluation

With this LLM evaluation solution, Vertex AI customers can launch an LLM evaluation job directly from the Vertex AI interface, set their desired evaluation type (e.g., single model or side-by-side comparison) and criteria (e.g., question answering, multi-turn chat, summarization), and get quality-reviewed results within days from skilled evaluation professionals.
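
To make the side-by-side evaluation type more concrete, the following is a minimal sketch of how rater preferences between two models might be summarized as a win rate. The data layout and the function name `side_by_side_win_rate` are hypothetical and chosen for illustration only; they are not part of the Vertex AI or Labelbox APIs, and the actual service may report results differently.

```python
from collections import Counter


def side_by_side_win_rate(preferences):
    """Summarize side-by-side judgments (hypothetical format).

    `preferences` is a list with one entry per rater judgment:
    "A" if model A's response was preferred, "B" for model B, or "tie".
    Win rates are computed over decided (non-tie) judgments only.
    """
    counts = Counter(preferences)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        # No rater expressed a preference, so neither model has a win rate.
        return {"A": 0.0, "B": 0.0, "ties": counts["tie"]}
    return {
        "A": counts["A"] / decided,
        "B": counts["B"] / decided,
        "ties": counts["tie"],
    }


# Illustrative judgments, not real evaluation data.
ratings = ["A", "B", "A", "tie", "A"]
summary = side_by_side_win_rate(ratings)
print(summary)
```

Ties are reported separately rather than split between the two models, which is one of several reasonable conventions for this kind of summary.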

The LLM evaluation solution from Labelbox gives teams easy access to human raters who help evaluate the effectiveness of their organization’s LLMs against a wide range of customizable criteria, from instruction following and verbosity to the relevance of any given response.

With integrated APIs, customers simply configure their task within the Vertex AI platform, and Labelbox takes care of everything else ahead of the QA process. The labeling team’s responses are visualized within the Vertex AI platform, where customers can review and accept outputs, giving them full control over annotation quality.

A full suite of Labelbox products now available on the Google Cloud Marketplace

For teams looking to combine AI assistance with human evaluation in a hybrid approach, a full suite of Labelbox products is now available for purchase on the Google Cloud Marketplace. Through native no-code integrations with Google Cloud’s BigQuery, Cloud SQL, and Google Sheets, customers can connect their data pipelines to Labelbox in minutes.

With this offering, Labelbox delivers a data-centric AI platform that covers data curation, AI-assisted labeling, premium data labeling services, and model diagnostics to align task-specific models and build intelligent applications. The latest updates to Labelbox’s products include model distillation, reinforcement learning from human feedback (RLHF), and LLM evaluation.