
Metrics-based RAG Development with Labelbox

Overview

One of the biggest challenges in RAG application development is evaluating the quality of responses. A range of benchmark metrics, tools, and frameworks exist to help with RAG application evaluation.

LLM evaluation tools (such as RAGAS and DeepEval) provide a variety of quality scores designed to evaluate every stage of a RAG pipeline, from retrieval to generation. In this guide, we’ll take a metrics-based approach to RAG development by focusing on two target metrics:

  • Context recall
    • Measures (on a scale from 0 to 1) how well the retrieved context covers the information in the ground truth answer.
  • Context precision
    • Ideally, the most relevant context chunks should be retrieved first. This metric measures whether the RAG application ranks the chunks most relevant to the ground truth at the top (rough formulas for both metrics follow this list).
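For intuition, the two metrics can be sketched roughly as follows. This is a paraphrase of the RAGAS definitions, and the notation (including the relevance indicator r_k) is introduced here only for illustration:

```latex
% Hedged paraphrase of the RAGAS metric definitions; notation introduced for illustration.
\[
\text{context recall} \;=\;
  \frac{\left|\{\text{ground truth statements supported by the retrieved context}\}\right|}
       {\left|\{\text{ground truth statements}\}\right|}
\]
\[
\text{context precision@}K \;=\;
  \frac{\sum_{k=1}^{K} \big(\text{precision@}k \cdot r_k\big)}
       {\left|\{\text{relevant chunks in the top } K\}\right|},
  \qquad r_k = 1 \text{ if the chunk at rank } k \text{ is relevant, else } 0
\]
```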

Generating ground truth data and ‘synthetic’ data

To calculate performance metrics, we have to compare the RAG response with ground truth. Ground truths are factually verified responses to a given user query and must be linked back to a source document.

  • Ground truth data is typically created manually by experts through the collection and verification of internal source documents. While labor intensive and slower to collect, manual data labeling optimizes for accuracy and quality.
  • On the other hand, ground truth data can be generated automatically by pre-trained models. LLM frameworks such as LangChain have modules that generate question/answer pairs from a source document. While creating synthetic ground truth is less labor intensive, drawbacks include quality issues, missing real-world nuance, and inherited model bias.

To start off, we’ll use the Labelbox Platform to orchestrate a ground truth creation workflow with two approaches:

  1. Manual ground truth generation
  2. Synthetic ground truth generation with Human-in-the-Loop

Manual ground truth generation

We can set up an annotation project by uploading common user questions (prompts) to the Labelbox Platform. Subject matter experts can then craft a thoughtful answer for each question and provide the appropriate source.
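As a rough sketch of this setup step, the snippet below uses the Labelbox Python SDK to upload user questions as text data rows that an annotation project can be attached to. The API key, dataset name, hosted text URLs, and global keys are placeholders, and the exact upload pattern may differ from the original workflow.

```python
import labelbox as lb

# Connect to Labelbox (the API key is a placeholder).
client = lb.Client(api_key="YOUR_LB_API_KEY")

# Create a dataset to hold the user questions (prompts).
dataset = client.create_dataset(name="rag-ground-truth-prompts")

# Each data row points at a hosted .txt file containing one user question;
# the global keys are arbitrary identifiers used to reference the rows later.
data_rows = [
    {"row_data": "https://storage.example.com/prompts/question_001.txt",
     "global_key": "question_001"},
    {"row_data": "https://storage.example.com/prompts/question_002.txt",
     "global_key": "question_002"},
]

task = dataset.create_data_rows(data_rows)
task.wait_till_done()
print(task.errors)  # None if the upload succeeded
```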

Synthetic ground truth generation with Human-in-the-Loop

Leveraging the latest foundation models on Labelbox Foundry, we can combine an automated ground truth generation process with a human-in-the-loop component for quality assurance.

To start off, we attach the source document for each user query (a PDF in this case). We then prompt a foundation model (Google Gemini 1.5 Pro) with each user question along with the attached PDF containing the context needed to answer it.

The provided prompt in this case is: *For the given text and PDF attachment, answer the following. Given the attached PDF, generate an answer using the information from the attachment. Respond with I don't know if the PDF attachment doesn't contain the answer to the question.*

Next, we can import the responses generated by Gemini as pre-labels into an annotation project.

Instead of answering each question from scratch, human experts can verify and modify the Gemini response alongside the corresponding document and user query, optimizing for both quality and efficiency.
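One way to attach the Gemini responses as editable pre-labels is the SDK's model-assisted labeling (MAL) import, sketched below under the assumption that the project ontology contains a free-text classification named "answer". The payload structure will vary with the actual project setup and SDK version.

```python
import uuid
import labelbox as lb
import labelbox.types as lb_types

client = lb.Client(api_key="YOUR_LB_API_KEY")

# Hypothetical mapping of data-row global keys to Gemini-generated draft answers.
gemini_answers = {
    "question_001": "Gemini's draft answer for question 1...",
    "question_002": "Gemini's draft answer for question 2...",
}

# One Label per data row, with the Gemini response as a free-text classification
# pre-label (assumes the ontology has a text classification named "answer").
# Note: older SDK versions expect lb_types.TextData(global_key=...) for `data`.
labels = [
    lb_types.Label(
        data={"global_key": global_key},
        annotations=[
            lb_types.ClassificationAnnotation(
                name="answer",
                value=lb_types.Text(answer=answer),
            )
        ],
    )
    for global_key, answer in gemini_answers.items()
]

# Import as MAL predictions so experts see them as editable pre-labels.
upload_job = lb.MALPredictionImport.create_from_objects(
    client=client,
    project_id="YOUR_PROJECT_ID",
    name=f"gemini-prelabels-{uuid.uuid4()}",
    predictions=labels,
)
upload_job.wait_until_done()
print(upload_job.errors)
```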


Improving RAG

A RAG-based application consists of many components. As summarized in a recent research survey by Gao et al. (2023), factors that impact RAG system performance include (but are not limited to):

  • Improving the retriever (e.g. Reranking)
  • Fine-tuning the embedding model for source documents
  • Adding document metadata tagging (for rerouting or semantic search)
  • Query rewriting
  • Optimizing the chunking strategy

We’ll focus on the following two strategies and evaluate their impact on our target metrics:

  1. Optimizing the chunking strategy
  2. Fine-tuning the embedding model

Chunking strategy

While keeping the LLM generation model (GPT-3.5 Turbo) and the embeddings model (sentence-transformers/distiluse-base-multilingual-cased-v2) constant, we will compare the retrieval metrics derived from RAGAS.

  • To start off, we will export the question and response pairs generated in Labelbox to a Python environment.
  • Next, we will generate the RAG responses using the following chunking strategies (a sketch of both splitters follows this list):
    • Recursive character splitter (500-token chunks with 150-token overlap)
      • This is a heuristic approach that dynamically splits text based on a set of rules (e.g. new lines, tables, page breaks) while maintaining coherence.
    • spaCy text splitter
      • spaCy uses a rule-based approach combined with machine learning models to perform tokenization.
  • Once the RAG responses and contexts have been generated, we create a table with the fields required for RAGAS metrics evaluation: question, ground truth, RAG answer, and context.
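As referenced above, here is a rough sketch of the two chunking setups using LangChain's text splitters. The package names, parameters, and source file are illustrative assumptions.

```python
# pip install langchain-text-splitters spacy && python -m spacy download en_core_web_sm
from langchain_text_splitters import RecursiveCharacterTextSplitter, SpacyTextSplitter

with open("source_document.txt") as f:  # placeholder source document
    text = f.read()

# Strategy 1: recursive character splitting (the guide uses 500-size chunks with
# 150 overlap; note this splitter counts characters by default, so a token-based
# length function would be needed for true token counts).
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150)
recursive_chunks = recursive_splitter.split_text(text)

# Strategy 2: spaCy-based splitting, which segments text using a spaCy pipeline.
spacy_splitter = SpacyTextSplitter(chunk_size=500, chunk_overlap=150)
spacy_chunks = spacy_splitter.split_text(text)

print(len(recursive_chunks), len(spacy_chunks))
```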

By running the RAGAS evaluation with the necessary query, response, ground truth, and context information, we can derive performance results for the entire dataset and aggregate the findings. The context precision and context recall metrics for the recursive and spaCy strategies are shown below.
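Before looking at those results, here is a minimal sketch of the evaluation step with RAGAS. The exact column names and metric objects depend on the RAGAS version, and the data shown is placeholder.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# Placeholder rows; in practice these come from the Labelbox ground truth export
# and the RAG pipeline's generated answers and retrieved contexts.
rows = {
    "question": ["What is the refund policy?"],
    "answer": ["Refunds are issued within 30 days of purchase."],        # RAG answer
    "contexts": [["Chunk 1 text...", "Chunk 2 text..."]],                # retrieved chunks
    "ground_truth": ["Customers may request a refund within 30 days."],  # expert answer
}

results = evaluate(
    Dataset.from_dict(rows),
    metrics=[context_precision, context_recall],
)
print(results)  # per-metric scores aggregated over the dataset
```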

Recursive text chunking

Spacy text chunking

From the results, we observe that spaCy yielded slightly better context precision. Given the size of our dataset (~100 QA pairs), the difference is likely not statistically significant. However, it is clear that the choice of chunking strategy and parameters affects the output of the chatbot.

Given the distribution of our recall and precision metrics, both techniques seem to struggle with the same data rows. This suggests that the key semantics are not being captured by our default embeddings model (sentence-transformers/distiluse-base-multilingual-cased-v2). We’ll test another embeddings model to see if we can improve the retrieval metrics.

Embeddings model

An embeddings model generates embeddings: vector representations of a piece of information, such as a chunk of text in a RAG application. A strong embeddings model better captures the semantic meaning of both the user question and the corpus of source documents, and therefore returns the most relevant chunks as context for answer generation.
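To make this concrete, the toy sketch below encodes a question and two candidate chunks with the guide's default embeddings model and ranks the chunks by cosine similarity. The example strings are made up, and this is not the actual retrieval pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# The default embeddings model used earlier in this guide.
model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v2")

question = "When was the warranty program introduced?"  # made-up example
chunks = [
    "The warranty program was introduced in 2019 and covers all hardware.",
    "Our offices are located in three countries.",
]

# Encode the question and chunks into dense vectors, then rank chunks by cosine similarity.
question_embedding = model.encode(question, convert_to_tensor=True)
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(question_embedding, chunk_embeddings)
print(scores)  # the semantically related chunk should score higher
```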

While holding the chunking strategy constant (recursive), we will change the underlying embeddings model in the RAG workflow before evaluating with RAGAS. The Hugging Face MTEB leaderboard includes performance benchmarks for various embeddings models. Beyond retrieval performance, model size, latency, and context length are also important considerations when choosing an embeddings model.

  • To start off, we will generate responses using the bge-base-en-v1.5 model and evaluate with RAGAS (a sketch of the embedding swap follows this list).
    • While ranked 35th on the leaderboard, this model is extremely lightweight (109 million parameters).
  • From the results, changing the embeddings model (from distiluse-base-multilingual-cased-v2 to bge-base-en-v1.5) significantly improved the retrieval metrics, by roughly 10 points for both context recall and context precision.
    • This indicates that the retrieved context chunks are more relevant to the ground truth and are ranked higher on average as well.
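As referenced in the list above, here is a rough sketch of what swapping the embeddings model might look like if the index were built with LangChain and FAISS. The vector store and retriever settings are assumptions, since the guide does not specify them.

```python
# pip install langchain-community faiss-cpu sentence-transformers
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

chunks = ["chunk 1 text...", "chunk 2 text..."]  # output of the chunking step

# Baseline model vs. the stronger general-purpose model; swapping `improved`
# back to `baseline` reproduces the earlier setup.
baseline = HuggingFaceEmbeddings(
    model_name="sentence-transformers/distiluse-base-multilingual-cased-v2")
improved = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

# Rebuild the index with the new embeddings; the rest of the RAG chain is unchanged.
index = FAISS.from_texts(chunks, embedding=improved)
retriever = index.as_retriever(search_kwargs={"k": 4})
print(retriever.invoke("example user question"))
```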

Fine-tuning an embeddings model

Most embeddings models are trained on general knowledge, which can limit their effectiveness in specific domains such as internal organizational information. We can therefore fine-tune an embeddings model with the goal of better understanding and representing domain-specific information.

  • To fine-tune embeddings, we need labeled examples of positive and negative context chunks.
    • Using a training dataset, we’ll upload the user query and the top X chunks to Labelbox.
  • Using Labelbox Annotate, subject matter experts can determine whether each chunk is relevant for answering the given question.
  • Using the Labelbox SDK, we can now export our labels and convert them to the following format for fine-tuning (a conversion sketch follows this list): {"query": str, "pos": List[str], "neg": List[str]}
    • The fine-tuning process for the BGE family of models is documented by the model authors.
  • Using the fine-tuned embeddings model, we can generate updated responses and re-evaluate.
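As referenced above, the export-and-convert step might look roughly like the sketch below. The field names in the exported rows (query, chunk_text, relevant) are hypothetical and depend on how the annotation ontology was configured.

```python
import json

# Hypothetical flattened export: one record per (query, chunk) pair with the
# human relevance judgment collected in the Labelbox project.
labeled_rows = [
    {"query": "What is the refund policy?",
     "chunk_text": "Refunds are issued within 30 days of purchase...",
     "relevant": True},
    {"query": "What is the refund policy?",
     "chunk_text": "References\n[1] Smith et al. ...",
     "relevant": False},
]

# Group chunks by query into the {"query": ..., "pos": [...], "neg": [...]} format
# expected for BGE-style fine-tuning, written out as JSON lines.
by_query = {}
for row in labeled_rows:
    entry = by_query.setdefault(row["query"], {"query": row["query"], "pos": [], "neg": []})
    entry["pos" if row["relevant"] else "neg"].append(row["chunk_text"])

with open("finetune_data.jsonl", "w") as f:
    for entry in by_query.values():
        f.write(json.dumps(entry) + "\n")
```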

In the example shown in the screenshot, bge tends to incorrectly retrieve the citations section, which expert labelers can mark as irrelevant.

From the resulting metrics, we can see that fine-tuning on just 50 data rows improves context recall and context precision by 2-3 points each. We expect this to improve further as more labels are provided to continuously fine-tune the embeddings model.


Conclusion

Evaluating the quality of responses is essential for building efficient and reliable RAG applications. By incorporating the Labelbox platform into metrics-based RAG development, from ground truth generation to feedback for embeddings models, developers can enhance the overall performance and reliability of their RAG applications.