How to turn LangSmith logs into conversational data with Labelbox

Designing task-specific LLMs calls for a step-by-step approach: improve day-to-day usability, ensure safety and relevance, and gather user feedback. Their success in real-world scenarios depends on reliable, high-quality training data aligned with human preferences.

LangChain, one of the most popular frameworks for building LLM-powered applications, is complemented by LangSmith, a unified developer platform for building, testing, and monitoring LLM applications. Together, LangChain and LangSmith provide the framework and platform for you to manage the entire LLM-powered application lifecycle.

Labelbox offers a comprehensive data-centric AI platform for aligning task-specific models and developing intelligent applications across data modalities including text, documents, audio, images, and video. The platform includes a native labeling editor, model-assisted labeling, human-labeling workflows, a human labeling workforce, and model diagnostics, along with a data catalog, model automation, and workforce services to optimize labeling operations and scale human evaluation.

The LangSmith + Labelbox Workflow

In this tutorial, we’ll walk through combining Labelbox and LangSmith by sending conversation data created with LangSmith to Labelbox.

A video guide is also available for more details.


What you’ll need

  • Labelbox API key
  • LangSmith API key
  • OpenAI API key

You can also follow along with this Colab notebook.


Using the LangSmith Python SDK, you can run a variety of LLMs against a test dataset of example prompts to evaluate a model’s performance. First, create a dataset inside LangSmith. Keep track of the name you give your dataset, and set its type to Chat:
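A minimal sketch of this step, assuming the langsmith Python SDK and a LANGCHAIN_API_KEY environment variable (the dataset name is a placeholder):

```python
import os

DATASET_NAME = "langsmith-conversations"  # placeholder; keep track of this name

# Guarded so the script is safe to run without credentials.
if os.environ.get("LANGCHAIN_API_KEY"):
    from langsmith import Client
    from langsmith.schemas import DataType

    client = Client()  # reads LANGCHAIN_API_KEY from the environment
    dataset = client.create_dataset(
        dataset_name=DATASET_NAME,
        description="Conversations for Labelbox annotation",
        data_type=DataType.chat,  # set the dataset type to Chat
    )
    print(f"Created dataset {dataset.id}")
```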

After you create your dataset, add a few example prompts that can be used to evaluate your model.
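For illustration, example prompts in LangSmith’s chat format (inputs hold the prior messages, outputs the expected reply) might be added like this; the prompt contents and dataset name are hypothetical:

```python
import os

# Hypothetical example prompts in LangSmith's chat format.
EXAMPLES = [
    {
        "inputs": {"input": [
            {"type": "system", "data": {"content": "You are a helpful assistant."}},
            {"type": "human", "data": {"content": "What is a data row?"}},
        ]},
        "outputs": {"output": {
            "type": "ai",
            "data": {"content": "A data row is a single asset in a dataset."},
        }},
    },
]

if os.environ.get("LANGCHAIN_API_KEY"):
    from langsmith import Client

    client = Client()
    for ex in EXAMPLES:
        client.create_example(
            inputs=ex["inputs"],
            outputs=ex["outputs"],
            dataset_name="langsmith-conversations",  # the dataset created earlier
        )
```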

Evaluate Model

Now that you are set up, you can evaluate your model with your example prompts. The notebook included in this tutorial goes over this process. We will be using the chain_results after running our model on our dataset. Below is an example of what that would look like:
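The exact chain setup lives in the notebook; as a hedged sketch, LangChain’s run_on_dataset helper can produce the chain results (the dataset name, model, and evaluator choice here are assumptions):

```python
import os

def evaluate_on_dataset(dataset_name: str):
    """Run a chat model over a LangSmith dataset and return the chain results."""
    from langchain.chat_models import ChatOpenAI
    from langchain.smith import RunEvalConfig, run_on_dataset
    from langsmith import Client

    return run_on_dataset(
        client=Client(),
        dataset_name=dataset_name,
        llm_or_chain_factory=ChatOpenAI(temperature=0),
        evaluation=RunEvalConfig(evaluators=["qa"]),  # built-in QA evaluator
    )

if os.environ.get("LANGCHAIN_API_KEY") and os.environ.get("OPENAI_API_KEY"):
    chain_results = evaluate_on_dataset("langsmith-conversations")
    # chain_results["results"] maps each example run to the model's output
```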

Create Labelbox Data Rows

With the chain results obtained above, you can format them as Labelbox conversation data using the function below. See the Labelbox Import Conversation Text Data developer guide for more information.

The function takes a Python dictionary parameter (user_id_dict) that maps LangSmith user types to Labelbox user IDs and their alignment inside the editor. The messages in this example are all marked "canLabel": True, meaning every message can be annotated inside the editor. The global key is generated randomly with Python’s uuid library, but you can instead set global_key to the ID of the example run to avoid importing duplicate information.
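The function is not reproduced here, so the following is a hypothetical reconstruction based on the description above and Labelbox’s conversational-text format; the message shape and the user_id_dict contents are illustrative:

```python
import uuid

def create_conversation(messages, user_id_dict, global_key=None):
    """Convert a list of {"type", "content"} messages into a Labelbox
    conversational-text data row (hypothetical reconstruction)."""
    return {
        "row_data": {
            "type": "application/vnd.labelbox.conversational",
            "version": 1,
            "messages": [
                {
                    "messageId": str(i),
                    "content": msg["content"],
                    # user_id_dict maps a LangSmith user type ("human", "ai", ...)
                    # to a Labelbox user ID and its alignment in the editor
                    "user": {
                        "userId": user_id_dict[msg["type"]]["id"],
                        "name": msg["type"],
                    },
                    "align": user_id_dict[msg["type"]]["align"],
                    "canLabel": True,  # every message is annotatable
                }
                for i, msg in enumerate(messages)
            ],
        },
        # random by default; pass the example run ID to avoid duplicate imports
        "global_key": global_key or str(uuid.uuid4()),
    }

# Usage with illustrative IDs and alignments
user_id_dict = {
    "human": {"id": "user-1", "align": "right"},
    "ai": {"id": "bot-1", "align": "left"},
}
row = create_conversation(
    [{"type": "human", "content": "Hi"}, {"type": "ai", "content": "Hello!"}],
    user_id_dict,
)
```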

Import Data Rows into Labelbox

Now that you have created your data rows with your conversational data, you can import them into Labelbox. See the Labelbox documentation for more information. The script below demonstrates creating a Labelbox dataset and importing your data rows.
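A sketch of the import, assuming the labelbox SDK and a LABELBOX_API_KEY environment variable (the dataset name is a placeholder):

```python
import os

def import_data_rows(data_rows, dataset_name="LangSmith chat data"):
    """Create a Labelbox dataset and import conversational data rows."""
    import labelbox as lb

    client = lb.Client(api_key=os.environ["LABELBOX_API_KEY"])
    dataset = client.create_dataset(name=dataset_name)
    task = dataset.create_data_rows(data_rows)
    task.wait_till_done()  # block until the asynchronous import finishes
    return task.errors     # None on success

if os.environ.get("LABELBOX_API_KEY"):
    data_rows = []  # the rows produced by the formatting step above
    print(import_data_rows(data_rows))
```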

Use Cases and Implications

Using Labelbox with LangSmith can help you develop better chatbots, for example by applying large language models (LLMs) to conversation intent classification. The combination enables continuous model tracking and evaluation: together, LangSmith and Labelbox help ensure chatbot performance adheres to human preferences through proper training data. Alongside a human-in-the-loop system, semantic search and pre-trained models for pre-labeling enable more nuanced analysis.

Combining Labelbox and LangSmith also enhances the performance of RAG-based agents by improving reranking models and preprocessing documents to make them more relevant. Considering a broader collection of retrieved text, or incorporating more human-generated labels rather than relying on a single document, facilitates high-quality training labels and thus more relevant, accurate chatbot responses.

The tutorial above demonstrates the combination of LangSmith with Labelbox, enabling some of the mentioned use cases.


Once your data rows are inside Labelbox, you can send them to an annotation project, where you can apply a variety of annotations or have a popular LLM generate pre-labels that are then evaluated with a human in the loop. After annotating your data rows, you can use the generated labels to further train your LLM, then reassess and annotate its predictions with the same workflow.
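As a hedged sketch, queueing the imported rows to a conversational project with the labelbox SDK might look like the following; the project name, batch name, and global keys are placeholders:

```python
import os

PROJECT_NAME = "Conversation annotation"  # placeholder project name

if os.environ.get("LABELBOX_API_KEY"):
    import labelbox as lb

    client = lb.Client(api_key=os.environ["LABELBOX_API_KEY"])
    # A conversational-media project to annotate the imported rows
    project = client.create_project(
        name=PROJECT_NAME,
        media_type=lb.MediaType.Conversational,
    )
    # Queue the imported data rows to the project by their global keys
    project.create_batch(
        "langsmith-batch-1",                  # placeholder batch name
        global_keys=["<your-global-keys>"],   # keys of the imported data rows
        priority=1,
    )
```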

If you’re interested in learning more about the capabilities offered by Labelbox and LangChain for supporting generative AI development, check out our partnership announcement for additional details and use cases.

Interested in joining the conversation? Let us know what you think in our original post here.

Reach out to the Labelbox sales team to start implementing a hybrid evaluation system for your production generative AI applications.