
Unlocking precision: The "Needle-in-a-Haystack" test for LLM evaluation

Introduction

Selecting the optimal large language model (LLM) for specific tasks is crucial for maximizing efficiency and accuracy. One of the key challenges faced by teams is selecting the best models for pre-labeling tasks, especially when dealing with large datasets and complex annotations. 

Labelbox’s Model Foundry provides a robust platform for evaluating models and determining the most suitable one for various applications. To illustrate this, the Labelbox Labs team conducted an experiment simulating the "Needle-in-a-Haystack" test. This test involves identifying specific elements within vast amounts of data, probing the model’s precision and reliability.

By utilizing Labelbox Model Foundry’s advanced experiments and evaluation tools, teams can compare multiple LLMs to identify the one that delivers the highest accuracy and efficiency for pre-labeling on complex tasks, thus saving time and enhancing the quality of predictions.

In this blog post, we’ll dive into the intricacies of the "Needle-in-a-Haystack" test, exploring how to leverage Foundry to find the best model for your pre-labeling or data enrichment needs.


The "Needle-in-a-Haystack" test

The "Needle-in-a-Haystack" test is a specialized evaluation method designed to gauge the performance of large language models (LLMs) in identifying specific, often infrequent, elements in large datasets. 

Imagine you have a massive dataset filled with a mix of common and rare pieces of information, similar to a haystack with a few needles hidden inside. The challenge is to determine how effectively a model can find those needles (rare information) without getting distracted by the surrounding hay (common information). This rare information could be anything from specific keywords in a text document to unique objects in a video.
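To make the setup concrete, here is a minimal, self-contained sketch (plain Python, independent of Labelbox) of how a single "needle" sentence can be buried at a chosen depth inside filler text; the function name, sample sentences, and depth value are purely illustrative.

```python
import random

def build_haystack(hay_sentences, needle, depth=None, seed=0):
    """Bury one 'needle' sentence inside a list of filler ('hay') sentences.

    depth is the fractional position of the needle (0.0 = start, 1.0 = end);
    if None, a random position is drawn.
    """
    rng = random.Random(seed)
    position = int((depth if depth is not None else rng.random()) * len(hay_sentences))
    return " ".join(hay_sentences[:position] + [needle] + hay_sentences[position:])

# Hypothetical example: one rare fraud report buried in generic support chatter.
hay = ["Thanks for contacting support, how can I help you today?"] * 200
needle = "I noticed a wire transfer I never authorized on yesterday's statement."
document = build_haystack(hay, needle, depth=0.35)
```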

Previous "Needle-in-a-Haystack" analyses of Claude 2.1 and GPT-4, performed by Greg Kamradt, a member of the ARC Prize team.

Why is this a good test for Model Foundry? 

The "Needle-in-a-Haystack" test is a great fit for Model Foundry for several reasons: 

  • Real-world relevance: The test simulates real-world conditions in which critical information is buried in a large dataset, ensuring that models are evaluated in environments that match actual applications.
  • Comprehensive evaluation: Labelbox Model Foundry offers advanced tools that make setting up experiments, running evaluations, and comparing results efficient and easy. 
  • Better decision-making: The insights gained from the "Needle-in-a-Haystack" test support stronger decision-making when choosing the most suitable LLM for a task, ensuring investment in models that offer the best performance for the application.

Creating the "Needle-in-a-Haystack" internally

The first step in our experiment was to create a detailed set of labeling instructions that we could eventually send to LLMs for pre-labeling. Note that we used text data for this study; other asset types, such as video and image, can be used to run a similar test.

Instructions overview

We wanted to build a dataset that consisted of conversations between users and a customer support chatbot, focusing on banking and financial transactions. Each conversation would be categorized into specific issues related to accounts, banking services, and transactions. 

As a result, our instruction set would include detailed descriptions of each category, example conversations to guide the labeling process, and clear decision-making guidelines to help annotators distinguish between closely-related issues.

Ontology

The ontology included categories such as: 

  • ACCOUNT_TERMINATION
  • ACCOUNT_RECOVERY
  • ACCOUNT_SECURITY_BREACH
  • ACCOUNT_EXTERNAL_QUERY
  • ACCOUNT_UPDATE_DETAILS
  • ACCOUNT_ID_CONFIRMATION
  • ACCOUNT_MISC_QUERY
  • BANKING_LOAN_SERVICES
  • BANKING_FEE_DISPUTE
  • BANKING_OVERDRAFT_SERVICES
  • BANKING_WIRE_TRANSFER_HELP
  • BANKING_SAVINGS_PLANS
  • BANKING_INVESTMENT_SERVICES
  • BANKING_POLICY_INFO
  • BANKING_CREDIT_CARD_ISSUES
  • BANKING_INSURANCE_PRODUCTS
  • BANKING_MOBILE_APP_SUPPORT
  • BANKING_DEBIT_CARD_ACTIVATION
  • BANKING_SECURITY_FEATURES
  • TRANSACTION_DISPUTE
  • TRANSACTION_REFUND
  • TRANSACTION_VERIFICATION
  • TRANSACTION_FRAUD_REPORT
  • TRANSACTION_LIMIT_INCREASE
  • TRANSACTION_HISTORY_QUERY

As seen, we chose closely related categories and provided precise instructions: while there were many similarities between subcategories, there were slight differences and nuances that our chosen LLM would have to notice and use to drive the decision-making process.
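For teams that prefer to set this up programmatically rather than in the UI, a radio classification over these categories can be defined with the Labelbox Python SDK. The sketch below is illustrative only: it assumes the labelbox package, a placeholder API key, and a classification name (conversation_category) of our own choosing, and it lists just a few of the categories above.

```python
import labelbox as lb

# Subset of the ontology categories shown above; extend with the full list.
categories = [
    "ACCOUNT_TERMINATION",
    "ACCOUNT_RECOVERY",
    "BANKING_LOAN_SERVICES",
    "TRANSACTION_FRAUD_REPORT",
]

# A single radio (single-select) classification over the conversation categories.
ontology_builder = lb.OntologyBuilder(
    classifications=[
        lb.Classification(
            class_type=lb.Classification.Type.RADIO,
            name="conversation_category",  # hypothetical name, not the one used in our project
            options=[lb.Option(value=c) for c in categories],
        )
    ]
)

client = lb.Client(api_key="YOUR_LABELBOX_API_KEY")  # placeholder
ontology = client.create_ontology(
    "needle-in-a-haystack-banking",  # hypothetical ontology name
    ontology_builder.asdict(),
    media_type=lb.MediaType.Text,
)
```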

Building the dataset

Creating the dataset involved curating and structuring data rows to reflect real-world scenarios modeled on the ontology above. This ensured the dataset was comprehensive and challenging for the models.

Dataset composition

  • Data row content: Each data row represented a conversation between a user and a customer support chatbot.

Sample data rows

  • ACCOUNT_TERMINATION: User conversations requesting closure of their account
  • BANKING_LOAN_SERVICES: Inquiries about applying for or managing loans
  • TRANSACTION_FRAUD_REPORT: Reports of suspected fraudulent activities

To view our labeling instructions, click the link here.
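As a rough sketch of how such a dataset can be created with the Labelbox Python SDK (the sample conversations, dataset name, and global keys below are hypothetical), each conversation becomes one text data row:

```python
import labelbox as lb

client = lb.Client(api_key="YOUR_LABELBOX_API_KEY")  # placeholder
dataset = client.create_dataset(name="banking-support-conversations")  # hypothetical name

# Each data row is one chatbot conversation, stored as plain text.
conversations = [
    {
        "row_data": "User: I want to permanently close my account.\nBot: I can help with that...",
        "global_key": "conversation-0001",  # e.g. an ACCOUNT_TERMINATION example
    },
    {
        "row_data": "User: How do I apply for a personal loan?\nBot: Here are our current rates...",
        "global_key": "conversation-0002",  # e.g. a BANKING_LOAN_SERVICES example
    },
]

task = dataset.create_data_rows(conversations)
task.wait_till_done()
print(task.errors)  # None if the upload succeeded
```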


Model evaluation using Foundry LLMs

Model selection

We decided to evaluate our dataset on four leading LLMs currently on the market: 

Gemini 1.5 Pro

  • Released by Google as part of the Gemini series;
  • Known for its strong multimodal capabilities;
  • Designed for complex reasoning and task completion.

GPT-4o

  • Developed by OpenAI;
  • An advanced iteration of the GPT (Generative Pre-trained Transformer) series;
  • Known for its strong natural language understanding and generation;
  • Optimized for faster response times and efficient computational resource usage.

Claude 3.5 Sonnet

  • Created by Anthropic;
  • Part of the Claude 3.5 model family;
  • Known for its strong performance in writing and complex tasks;
  • Capable of engaging in nuanced conversations and providing detailed explanations.

Gemini 1.5 Flash

  • Another model in the Google Gemini series;
  • Optimized for speed and efficiency;
  • Designed for tasks requiring quick responses;
  • Suitable for applications where real-time responses are crucial.

Analysis and insights

Once we had created Model Foundry predictions on our dataset for all four LLMs, we placed them into a Model Experiment for evaluation. Creating an experiment allowed us to dive deeply into the intricacies of each model and determine its overall performance on a needle-in-a-haystack application.
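In our case the predictions were generated and attached through the Foundry app itself, but the same structure can be sketched with the SDK: one model run per LLM, the shared evaluation data rows upserted into each run, and the model's radio classifications uploaded as predictions. The names, IDs, and global keys below are placeholders, and exact annotation-class signatures vary across SDK versions, so treat this as an illustration rather than a copy-paste recipe.

```python
import labelbox as lb
import labelbox.types as lb_types

client = lb.Client(api_key="YOUR_LABELBOX_API_KEY")  # placeholder

# One model run per LLM under comparison (e.g. "gemini-1-5-pro", "gpt-4o", ...).
model = client.create_model(
    name="niah-banking-categorization",  # hypothetical experiment name
    ontology_id="YOUR_ONTOLOGY_ID",      # placeholder
)
model_run = model.create_model_run(name="gemini-1-5-pro")

# Attach the same evaluation data rows to every run so results stay comparable.
model_run.upsert_data_rows(global_keys=["conversation-0001", "conversation-0002"])

# One prediction: a radio classification naming the detected category.
prediction = lb_types.Label(
    data={"global_key": "conversation-0001"},
    annotations=[
        lb_types.ClassificationAnnotation(
            name="conversation_category",  # hypothetical classification name
            value=lb_types.Radio(
                answer=lb_types.ClassificationAnswer(name="ACCOUNT_TERMINATION")
            ),
        )
    ],
)
model_run.add_predictions(name="gemini-1-5-pro-predictions", predictions=[prediction])
```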

From the experiment results, we can see which models performed best from a precision perspective:

  • Gemini 1.5 Pro (81.55%)
  • Claude 3.5 Sonnet (80.98%)
  • GPT-4o (79.02%)
  • Gemini 1.5 Flash (76.96%)

Confusion matrices and precision graphs are also available in the Model Experiment, giving us a better understanding of the precision scores above.

From the graphs and further analysis, we can see the categories in the ontology that each model struggled with. Here, a "struggle" indicates a precision score of less than 0.75.
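Model Experiments compute these metrics in-app, but the same per-category check can be reproduced from exported ground-truth and predicted labels. The sketch below uses tiny, made-up label lists purely to show the calculation and the 0.75 threshold:

```python
from collections import defaultdict

STRUGGLE_THRESHOLD = 0.75  # a category "struggles" below this precision

def per_category_precision(y_true, y_pred):
    """Precision per predicted category: TP / (TP + FP)."""
    tp, fp = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[pred] += 1
        else:
            fp[pred] += 1
    return {cat: tp[cat] / (tp[cat] + fp[cat]) for cat in set(tp) | set(fp)}

# Hypothetical exported labels, for illustration only.
y_true = ["ACCOUNT_ID_CONFIRMATION", "ACCOUNT_ID_CONFIRMATION", "BANKING_LOAN_SERVICES",
          "TRANSACTION_FRAUD_REPORT", "BANKING_CREDIT_CARD_ISSUES"]
y_pred = ["ACCOUNT_ID_CONFIRMATION", "ACCOUNT_MISC_QUERY", "BANKING_LOAN_SERVICES",
          "TRANSACTION_FRAUD_REPORT", "ACCOUNT_ID_CONFIRMATION"]

for category, precision in sorted(per_category_precision(y_true, y_pred).items()):
    flag = "  <-- struggles" if precision < STRUGGLE_THRESHOLD else ""
    print(f"{category:30s} {precision:.2f}{flag}")
```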

Gemini 1.5 Pro

  • BANKING_CREDIT_CARD_ISSUES
  • ACCOUNT_MISC_QUERY
  • ACCOUNT_EXTERNAL_QUERY
  • ACCOUNT_ID_CONFIRMATION

Claude 3.5 Sonnet

  • BANKING_CREDIT_CARD_ISSUES
  • ACCOUNT_MISC_QUERY
  • ACCOUNT_EXTERNAL_QUERY
  • ACCOUNT_ID_CONFIRMATION

GPT-4o

  • BANKING_CREDIT_CARD_ISSUES
  • ACCOUNT_ID_CONFIRMATION
  • BANKING_SECURITY_FEATURES
  • BANKING_POLICY_INFO
  • ACCOUNT_EXTERNAL_QUERY
  • ACCOUNT_MISC_QUERY

Gemini 1.5 Flash

  • BANKING_FEE_DISPUTE
  • BANKING_INSURANCE_PRODUCTS
  • ACCOUNT_ID_CONFIRMATION
  • BANKING_SECURITY_FEATURES
  • BANKING_POLICY_INFO
  • ACCOUNT_EXTERNAL_QUERY
  • ACCOUNT_MISC_QUERY

Based on the performance breakdown, we can draw several insights: 

  • Top performers: Gemini 1.5 Pro and Claude 3.5 Sonnet emerge as the leading models for this particular needle in a haystack task, with very similar performance profiles.
  • Common challenges: All models struggled with certain categories, particularly ACCOUNT_ID_CONFIRMATION and BANKING_CREDIT_CARD_ISSUES. This suggests these categories may be inherently more difficult to classify or may require more specific training data.
  • Precision vs. speed: While Gemini 1.5 Pro achieved the highest precision, teams should consider their specific needs. If real-time responses are crucial, Gemini 1.5 Flash might be a better choice despite its lower precision.
  • Room for improvement: Even the top-performing models have areas where they struggle. This information can be valuable for fine-tuning models or adjusting the labeling instructions for future iterations.

Leveraging Model Foundry for decision making and pre-labeling

The experiment demonstrates the power of Labelbox Model Foundry in facilitating data-driven decision-making for model selection and optimizing the pre-labeling process. By providing comprehensive evaluation tools and visualizations, Model Foundry enables teams to: 

  • Compare multiple models simultaneously
  • Identify specific strengths and weaknesses of each model 
  • Make informed decisions based on precision, recall, and overall accuracy
  • Pinpoint areas for potential model improvement or fine-tuning

In addition to model evaluation, Model Foundry significantly enhances the pre-labeling workflow: 

  • Efficient pre-labeling: Once the best-performing model is identified, it can be seamlessly integrated into the pre-labeling pipeline (see the sketch after this list), significantly reducing manual labeling effort.
  • Quality assurance: By understanding model strengths and weaknesses, teams can strategically allocate human resources to review and correct pre-labels in categories where models struggle.
  • Iterative improvement: As more data is labeled and models are retrained, teams can continuously evaluate and update their pre-labeling model, ensuring ongoing optimization of the labeling process.
  • Cost reduction: By selecting the most accurate model for pre-labeling, teams can minimize the need for manual corrections, leading to substantial time and cost savings in large-scale labeling projects.
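As a sketch of the pre-labeling step referenced above (assuming the Labelbox Python SDK; the project ID, global key, and classification name are placeholders), the winning model's outputs can be imported as model-assisted labeling (MAL) pre-labels so annotators start from drafts rather than blank tasks:

```python
import labelbox as lb
import labelbox.types as lb_types

client = lb.Client(api_key="YOUR_LABELBOX_API_KEY")  # placeholder

# Pre-labels produced by the selected model: one radio classification per conversation.
pre_labels = [
    lb_types.Label(
        data={"global_key": "conversation-0001"},
        annotations=[
            lb_types.ClassificationAnnotation(
                name="conversation_category",  # hypothetical classification name
                value=lb_types.Radio(
                    answer=lb_types.ClassificationAnswer(name="ACCOUNT_TERMINATION")
                ),
            )
        ],
    )
]

# Import as MAL pre-labels into the labeling project for human review.
upload_job = lb.MALPredictionImport.create_from_objects(
    client=client,
    project_id="YOUR_PROJECT_ID",  # placeholder
    name="niah-prelabel-import",
    predictions=pre_labels,
)
upload_job.wait_until_done()
print(upload_job.errors)
```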

By leveraging Model Foundry for both decision-making and pre-labeling processes, teams can significantly enhance the efficiency and accuracy of their entire data labeling pipeline.


Next Steps

To further improve model performance and decision-making, consider the following steps: 

  • Fine-tune models on challenging categories.
  • Conduct additional experiments with different data types or industry-specific datasets.
  • Implement regular evaluations and feedback loops to identify areas for improvement and adapt to changing requirements.

By continually refining your approach and leveraging the insights gained from Model Foundry, you can ensure that your team is always using the most effective LLM for your specific needs, driving efficiency and accuracy in your AI-powered workflows.


Conclusion

The "Needle-in-a-Haystack" test, as implemented through Labelbox Model Foundry, proves to be an effective method for evaluating LLM performance on complex, nuanced tasks. By simulating real-world scenarios and leveraging Model’s advanced evaluation tools, teams can select the most suitable model for their specific pre-labeling needs.

In our experiment, Gemini 1.5 Pro and Claude 3.5 Sonnet demonstrated superior performance, but the choice between them (or other models) would depend on the specific requirements of the project, including factors like speed, resource efficiency, and more. 

As the field of AI continues to evolve rapidly, tools like Labelbox Model Foundry become increasingly valuable, enabling teams to stay at the forefront of the space by consistently evaluating and selecting the best models for their unique challenges.