Val Brodsky•October 28, 2024
Faster, cleaner code: Labelbox SDK migrates to Pydantic v2
The Labelbox Python SDK allows you to programmatically access the majority of our platform features. The SDK simplifies the automation of repetitive tasks, including uploading and exporting data with annotations and writing custom scripts. It enables integration of Labelbox into custom workflows, making complex processes more efficient.
The Labelbox SDK relies on Pydantic to model data related to complex annotation structures, to validate user inputs, and to serialize/deserialize data for GraphQL API calls. With over 90 data models, many of which include custom dict() methods, validation logic would be complex without a data validation library like Pydantic.
In this article, we outline our use of Pydantic. We’ll explore why we chose to migrate to v2 from v1, how we completed that migration process, and key lessons learned.
Pydantic v1 in the Labelbox SDK
To understand our migration to Pydantic v2, let's first look at how the Labelbox SDK used Pydantic v1.
Validating user inputs
A project in Labelbox consists of data to be labeled. Creating a project using the SDK involves passing around 10 configuration parameters for 20 different project types, so it is crucial to ensure that project configurations are correct. When they are not, clear error messages make the problem easy to fix.
To simplify this, we use a Pydantic model inside the create_project() method. This model handles input validation automatically, separating project logic from validation and ensuring downstream code works with validated data.
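A minimal sketch of this pattern follows. The ProjectConfig model, its fields, and the body of create_project() are hypothetical stand-ins, not the SDK's actual definitions; the code is written so that it runs on either Pydantic major version.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ProjectConfig(BaseModel):
    """Hypothetical subset of the parameters a project might accept."""

    name: str = Field(min_length=1)
    media_type: str
    description: Optional[str] = None


def create_project(**kwargs) -> ProjectConfig:
    # All input validation happens in this one line; the code after it
    # only ever sees a checked, typed configuration.
    config = ProjectConfig(**kwargs)
    # ... a real implementation would now send the config to the API ...
    return config


project = create_project(name="Street scenes", media_type="IMAGE")

try:
    create_project(name="", media_type="IMAGE")
except ValidationError as exc:
    # The error names the offending field, so the caller knows what to fix.
    failed_fields = [error["loc"][0] for error in exc.errors()]
```

Keeping the checks in the model means the project logic never has to re-validate its inputs.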
Handling complex annotations
In addition to manual labeling, we support uploading labels (annotations) using the SDK. With nearly 90 annotation types, several of which combine other types, it is important that users structure them correctly and are notified of any issues before upload. To ensure this, we use Pydantic models.
For example:
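The sketch below is an illustration only, not the SDK's actual definition: the class names, fields, and constraints are assumptions chosen to show how small models can be composed into a larger annotation model.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class Confidence(BaseModel):
    # Hypothetical constraint: a confidence score must lie in [0, 1].
    confidence: Optional[float] = Field(default=None, ge=0.0, le=1.0)


class CustomMetric(BaseModel):
    name: str
    value: float


class CustomMetrics(BaseModel):
    custom_metrics: Optional[List[CustomMetric]] = None


class ObjectAnnotation(Confidence, CustomMetrics):
    """Inherits the fields and validation rules of both parent models."""

    name: str


annotation = ObjectAnnotation(
    name="car",
    confidence=0.9,
    custom_metrics=[{"name": "iou", "value": 0.75}],
)

try:
    ObjectAnnotation(name="car", confidence=1.5)
    out_of_range_rejected = False
except ValidationError:
    out_of_range_rejected = True
```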
This model combines annotations like Confidence and CustomMetrics, which handle their own validation and serialization.
GraphQL data handling
Labelbox SDK methods often execute GraphQL API calls, returning results as a dictionary. With Pydantic models, we can serialize, validate, and present a typed interface to users.
For example:
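The following sketch shows the general shape of such a model. The field names, the query string, and the FakeGraphQLClient class are hypothetical placeholders for the real SDK client and schema.

```python
from typing import Any, Dict

from pydantic import BaseModel, Field


class FakeGraphQLClient:
    """Stand-in for the SDK client; returns a canned response."""

    def execute(self, query: str, params: Dict[str, Any]) -> Dict[str, Any]:
        return {
            "labelingService": {
                "id": "svc-1",
                "projectId": params["projectId"],
                "status": "REQUESTED",
            }
        }


class LabelingService(BaseModel):
    id: str
    # The alias maps the camelCase API key onto a snake_case attribute.
    project_id: str = Field(alias="projectId")
    status: str

    @classmethod
    def get(cls, client: FakeGraphQLClient, project_id: str) -> "LabelingService":
        query = "query GetService($projectId: ID!) { ... }"  # placeholder query
        result = client.execute(query, {"projectId": project_id})
        # Pydantic validates the raw dictionary and returns a typed object.
        return cls(**result["labelingService"])


service = LabelingService.get(FakeGraphQLClient(), "proj-42")
```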
This model represents a labeling service and its attributes. The get class method uses a GraphQL client to retrieve data, then Pydantic validates and organizes the returned data into a model.
Dual Pydantic v1/v2 support
When Pydantic v2 was released in mid-2023, it offered key improvements in performance and features. However, because Pydantic v2 was not backward-compatible, customers who migrated to Pydantic v2 would not have been able to use the Labelbox SDK. To avoid this, we created a compatibility layer that supported both Pydantic v1 and v2.
This solution allowed users to use either version. However, Labelbox SDK development remained limited to Pydantic v1, which became more difficult as Pydantic v1 usage declined and Pydantic v2 usage grew. Despite knowing that Pydantic v2 offered a better development experience, we had to rely on Pydantic v1.
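A compatibility layer of this kind can be sketched as a single module that always exposes the v1 API, whichever major version is installed. The module alias pydantic_compat is an assumed name, and this is one common way to build such a shim rather than a copy of the SDK's code.

```python
try:
    # Pydantic v2 ships the complete v1 API under the pydantic.v1 namespace.
    import pydantic.v1 as pydantic_compat
except ImportError:
    # No pydantic.v1 namespace: the top-level package is already the v1 API.
    import pydantic as pydantic_compat

# The rest of the codebase imports these names from the shim instead of
# importing pydantic directly, so it always sees v1 semantics.
BaseModel = pydantic_compat.BaseModel
validator = pydantic_compat.validator
```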
Beyond dual Pydantic v1/v2 support
For these reasons, six months after releasing dual Pydantic v1/v2 compatibility, we explored replacing Pydantic v1 with a better data library. We considered three options:
- Migrating to Pydantic v2:
  - Included improvements over v1.
  - We were already familiar with it.
  - Migration was expected to be straightforward.
- Other data libraries:
  - We didn’t explore them because we preferred Pydantic v2.
- Writing our own library:
  - Would eliminate a dependency and reduce potential client issues.
  - Custom-built to meet our exact needs.
Given our efficiency-focused team, we chose Pydantic v2 to save time and lower our maintenance costs, which allowed us to focus on adding valuable SDK features for our customers.
Migration to Pydantic v2
This is how we migrated the Labelbox SDK to Pydantic v2.
Model config:
- V1 class Config → V2 model_config = ConfigDict(...)
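As an illustration, the sketch below configures a hypothetical Annotation model in the v2 style, with the v1 inner class Config it replaces shown as a comment. The extra="forbid" setting is only an example option.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class Annotation(BaseModel):
    # V1 equivalent:
    #     class Config:
    #         extra = "forbid"
    model_config = ConfigDict(extra="forbid")

    name: str


try:
    Annotation(name="car", colour="red")  # "colour" is not a declared field
    extra_rejected = False
except ValidationError:
    extra_rejected = True
```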
Optional fields: V2 serializes missing fields as None, unlike v1:
- V1: classifications: List['ClassificationAnnotation']
- V2: classifications: Optional[List['ClassificationAnnotation']] = None
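The sketch below shows the v2 behavior behind this change, using hypothetical model names: in v2 an Optional annotation alone only allows None as a value, and the field can be omitted only when it also has an explicit default.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class ClassificationAnnotation(BaseModel):
    name: str


class RequiredNullable(BaseModel):
    # No default: v2 treats the field as required even though None is allowed.
    classifications: Optional[List[ClassificationAnnotation]]


class OmittableNullable(BaseModel):
    # Explicit default: the field may be left out and then serializes as None.
    classifications: Optional[List[ClassificationAnnotation]] = None


omitted = OmittableNullable()
dumped = omitted.model_dump()

try:
    RequiredNullable()
    missing_is_error = False
except ValidationError:
    missing_is_error = True
```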
Validators:
- V1 root_validator(pre=True) → V2 @model_validator(mode="before")
- V1 root_validator → V2 @model_validator(mode="after")
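Both validator modes can be seen side by side in a small sketch. The TextEntity model and its legacy "stop" key are invented for illustration.

```python
from typing import Any

from pydantic import BaseModel, ValidationError, model_validator


class TextEntity(BaseModel):
    start: int
    end: int

    # V1: @root_validator(pre=True)
    @model_validator(mode="before")
    @classmethod
    def accept_legacy_key(cls, data: Any) -> Any:
        # Runs on the raw input, before any field is validated.
        if isinstance(data, dict) and "stop" in data:
            data = {k: v for k, v in data.items() if k != "stop"}
            data["end"] = data.get("end", 0)
        return data

    # V1: @root_validator
    @model_validator(mode="after")
    def check_order(self) -> "TextEntity":
        # Runs on the fully validated instance and must return it.
        if self.end < self.start:
            raise ValueError("end must not be less than start")
        return self


entity = TextEntity(start=0, end=5)

try:
    TextEntity(start=5, end=1)
    order_checked = False
except ValidationError:
    order_checked = True
```

Note that a "before" validator is a classmethod receiving raw data, while an "after" validator is an instance method.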
Field validators:
- V1 validator → V2 @field_validator(...)
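A minimal sketch of a migrated field validator, using a hypothetical Tool model:

```python
from pydantic import BaseModel, ValidationError, field_validator


class Tool(BaseModel):
    name: str

    # V1: @validator("name")
    @field_validator("name")
    @classmethod
    def name_must_not_be_blank(cls, value: str) -> str:
        stripped = value.strip()
        if not stripped:
            raise ValueError("name must not be blank")
        # The returned value replaces the original input on the model.
        return stripped


tool = Tool(name="  bounding box ")

try:
    Tool(name="   ")
    blank_rejected = False
except ValidationError:
    blank_rejected = True
```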
Serialization:
- Replace dict() with model_dump().
- For internal serialization, we used serialize_model() with wrap mode.
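The sketch below illustrates both points on a hypothetical Point model. It assumes that serialize_model() is a method decorated with Pydantic's @model_serializer(mode="wrap"), and the None-dropping rule is only an example of custom logic.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, model_serializer


class Point(BaseModel):
    x: float
    y: float
    note: Optional[str] = None

    @model_serializer(mode="wrap")
    def serialize_model(self, handler) -> Dict[str, Any]:
        # The handler runs Pydantic's default serialization; the wrapper
        # then post-processes the result (here: dropping None values).
        data = handler(self)
        return {key: value for key, value in data.items() if value is not None}


point = Point(x=1.0, y=2.0)

# V1: point.dict()
serialized = point.model_dump()
```

Wrap mode keeps the default field handling and limits the custom code to the part that differs.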
Type validation: In v2, accessing types within models required a more Pythonic approach:
- V1: cls.__fields__['annotations'].sub_fields
- V2: get_args(cls.model_fields['annotations'].annotation)
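A short sketch of the v2 approach, with hypothetical annotation classes: model_fields stores the plain type annotation, so the standard library's typing.get_args can unpack it.

```python
from typing import List, Union, get_args

from pydantic import BaseModel


class TextAnnotation(BaseModel):
    text: str


class BoxAnnotation(BaseModel):
    width: int
    height: int


class Label(BaseModel):
    annotations: List[Union[TextAnnotation, BoxAnnotation]]


# The stored annotation is List[Union[...]]; its only argument is the Union.
list_args = get_args(Label.model_fields["annotations"].annotation)

# Unpacking the Union yields the member classes in declaration order.
member_types = get_args(list_args[0])
```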
Discriminated Unions:
- V2’s “smart match” broke our v1 reliance on class ordering within Unions. Instead of using the legacy method, we rewrote the code to remove Union altogether.
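The behavior difference can be reproduced in a few lines. This sketch assumes that "the legacy method" refers to Pydantic's union_mode="left_to_right" field setting, which tries union members in declaration order; the model names are hypothetical, and the SDK itself removed the Union rather than using this setting.

```python
from typing import Union

from pydantic import BaseModel, Field


class SmartUnion(BaseModel):
    # Default "smart" mode: prefers the member that matches the input exactly.
    value: Union[int, str]


class OrderedUnion(BaseModel):
    # Order-dependent mode: the first member that validates wins.
    value: Union[int, str] = Field(union_mode="left_to_right")


smart = SmartUnion(value="456")      # stays a str, because str matches exactly
ordered = OrderedUnion(value="456")  # coerced to int, because int is tried first
```

Code that depends on the order of classes inside a Union can therefore change behavior under smart mode without any error being raised.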
Lessons learned
Our migration process was more challenging than expected due to the size and complexity of our codebase and our use of obscure Pydantic v1 features. Here are a few key takeaways from our experience for anyone going through a Pydantic v1 to v2 migration:
- Careful planning and a robust test suite are essential for migrating large projects.
- Focus on a few key migration patterns (fewer than 10) to streamline the process.
- Don’t get bogged down by model-specific customizations; it's better to rewrite or simplify.
Despite the large effort, the migration to Pydantic v2 was successful. The Labelbox SDK release that contained Pydantic v2 went smoothly, and our code is now more concise and readable.
Visit our docs page to learn more about the Labelbox Python SDK.