Dmytro Apollonin•December 20, 2024

Code Runner: Secure, scalable code execution for model evaluation

In the world of large language models (LLMs), evaluating their responses effectively is a fundamental aspect of improving model performance. We’re excited to announce the latest addition to the Labelbox platform: Code Runner. This new capability pushes the boundaries of interactivity by allowing users to execute written code directly within the evaluation workflow.

Code Runner helps eliminate errors, optimizes functionality, and validates outputs, leading to higher-quality datasets. In this post, we introduce the feature and then dive into the infrastructure that powers it, highlighting how it was designed with security, scalability, and robustness at its core.

What is Code Runner?

Code Runner is a new built-in feature of the Labelbox platform designed to improve the quality of responses and labels generated in any coding-related project. The feature enables users to:

Directly execute code found in either model responses or user-written responses
Receive precise outputs including:
- Standard output (stdout)
- Standard error (stderr)
- Execution time
- Warnings or runtime errors
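To make the outputs listed above concrete, here is a minimal local sketch in Python. It is not Labelbox's implementation: the `ExecutionResult` fields and the `run_snippet` helper are hypothetical names chosen for illustration, and a plain child process stands in for the real execution environment.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Hypothetical result shape; field names are illustrative only."""
    stdout: str
    stderr: str              # warnings and tracebacks usually land here
    execution_time_s: float
    exit_code: int           # non-zero signals a runtime error


def run_snippet(code: str, timeout_s: float = 10.0) -> ExecutionResult:
    """Run a Python snippet in a child process and capture what it produced."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        stdout, stderr, exit_code = proc.stdout, proc.stderr, proc.returncode
    except subprocess.TimeoutExpired:
        # Assumption for this sketch: a timeout is reported as a failed run.
        stdout, stderr, exit_code = "", f"timed out after {timeout_s}s", -1
    elapsed = time.perf_counter() - start
    return ExecutionResult(stdout, stderr, elapsed, exit_code)
```

Calling `run_snippet("print('hi')")` returns a result whose `stdout` holds the printed text, while a snippet that raises an exception yields a non-zero `exit_code` and a traceback in `stderr`.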

By integrating Code Runner into the evaluation pipeline, we aim to simplify the process of verifying the accuracy, efficiency, and functionality of code responses, all without users needing to leave the platform.

Our system automatically detects the language in the text area and suggests the appropriate environment for execution, whether Python or JavaScript (and more to come).
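This post does not describe how detection is implemented. As a rough illustration of the idea only, the sketch below scores a snippet against a few regular-expression signals per language and suggests the clear winner. The `SIGNALS` table and the `suggest_language` function are assumptions made for this example; a production detector would likely be far more robust.

```python
import re
from typing import Optional

# Hypothetical signal patterns, chosen for illustration only.
SIGNALS = {
    "python": [
        r"^\s*def \w+\(.*\):",
        r"^\s*import \w+",
        r"\bprint\(",
        r"^\s*elif\b.*:",
    ],
    "javascript": [
        r"\bconsole\.log\(",
        r"\b(const|let|var)\s+\w+\s*=",
        r"\bfunction\b\s*\w*\s*\(",
        r"=>",
    ],
}


def suggest_language(code: str) -> Optional[str]:
    """Return the language with the most matching signals, or None if unclear."""
    scores = {
        lang: sum(1 for pattern in patterns if re.search(pattern, code, re.MULTILINE))
        for lang, patterns in SIGNALS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    # No signal at all, or a tie between languages: make no suggestion.
    if best_score == 0 or best_score == runner_up_score:
        return None
    return best
```

Returning `None` on a tie keeps the sketch honest: when the signals are ambiguous, it is safer to let the user pick the environment than to guess.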

But what makes this feature stand out is the sophisticated infrastructure behind it, designed to ensure seamless execution while maintaining strict security and privacy standards.

Code Runner infrastructure: A deep dive

At the heart of Code Runner’s infrastructure lies Google Cloud Run, a fully managed compute platform that runs containerized applications in a secure, scalable manner. Here are the key components and principles driving the system:

1. Cloud Run for language-specific environments

Every code execution happens in a dedicated Cloud Run instance. Each instance is tailored to a specific programming language environment (e.g., Python or JavaScript) and is spun up dynamically based on the code type detected in the user response.

This design includes the following characteristics to ensure security and speed:

Isolation: Each execution is fully containerized, completely isolating the runtime environment from others.
Temporary directories: Code is executed in a temporary directory within the container, and it is deleted immediately after execution, leaving no trace behind.
Language-specific tools: Each environment comes preloaded with the necessary packages and libraries to ensure compatibility and speed.
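The temporary-directory pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's code: `run_in_temp_dir` is a hypothetical helper, and a local child process provides none of the container isolation that Cloud Run adds.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_temp_dir(code: str, timeout_s: float = 10.0):
    """Write the code into a throwaway directory, run it there, then delete it."""
    with tempfile.TemporaryDirectory(prefix="run-") as workdir:
        script = Path(workdir) / "main.py"
        script.write_text(code, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,             # files the code creates stay inside workdir
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    # Leaving the `with` block removes workdir and everything written into it.
    return proc.stdout, workdir
```

The returned path is only useful for confirming that the directory no longer exists once the call returns.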

2. Enhanced security with separate GCP projects

The Cloud Run service is hosted in a separate Google Cloud Platform (GCP) project, distinct from our main infrastructure. This segmentation provides an additional layer of security by isolating code execution from our core services. Even in the unlikely event of a compromise, the blast radius is contained.

3. Communication via Private Service Connect

To ensure secure and controlled communication, all interactions between the main evaluation system and the Cloud Run service occur over Private Service Connect, which provides the following advantages:

No public exposure: The Cloud Run endpoint is never exposed to the public internet, reducing the risk of unauthorized access.
One-way communication: The Private Service Connect setup restricts outbound networking from the Cloud Run service, ensuring that executed code cannot make arbitrary network requests.
Granular networking controls: The private network allows for precise control over what resources the Cloud Run service can access.

4. Automatic cleanup

To maintain a lightweight and secure runtime, the system relies on:

Ephemeral execution: Each execution request is handled in a stateless, temporary environment.
Automatic deletion: Files, logs, and temporary directories are wiped as soon as execution completes, leaving no residual data.
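One way to guarantee that nothing survives a run, even when the run fails partway through, is to tie cleanup to a `finally` block. The sketch below is a hypothetical helper (`ephemeral_workspace` is not a Labelbox or Cloud Run API) that shows the idea with the Python standard library.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def ephemeral_workspace():
    """Yield a scratch directory that is wiped on exit, even after an error."""
    workdir = Path(tempfile.mkdtemp(prefix="exec-"))
    try:
        yield workdir
    finally:
        # Runs on success and on failure, so logs and files never linger.
        shutil.rmtree(workdir, ignore_errors=True)
```

Any file or log written under the yielded path is removed when the `with` block exits, whether the code inside it finished normally or raised.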

How Code Runner works: A step-by-step overview

Now that you have an understanding of the powerful infrastructure underneath Code Runner, here is a summary of how the feature works from start to finish:

Code submission: A user requests code execution from the evaluation interface.
Language detection: The system detects the programming language and forwards the request to the corresponding Cloud Run service.
Execution: The Cloud Run service spins up a container instance, executes the code in a sandboxed environment, and collects the results.
Result delivery: The system returns the output (stdout, stderr, execution time, and any warnings) to the user for analysis.
Cleanup: The container and all related resources are terminated and deleted.
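The five steps above can be strung together in a compact local sketch. Everything here is an assumption made for illustration: the `RUNTIMES` registry stands in for the per-language Cloud Run services, `detect_language` is a toy check, and `handle_request` is a hypothetical name. The JavaScript entry also assumes a `node` binary is installed.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# Hypothetical registry: language -> (script file name, interpreter command).
# In the system described above, each language maps to its own Cloud Run service.
RUNTIMES = {
    "python": ("main.py", [sys.executable]),
    "javascript": ("main.js", ["node"]),
}


def detect_language(code: str) -> str:
    """Toy stand-in for language detection."""
    return "javascript" if "console.log(" in code else "python"


def handle_request(code: str, timeout_s: float = 10.0) -> dict:
    # Step 1 (submission) is the call itself. Step 2: detect and route.
    language = detect_language(code)
    filename, command = RUNTIMES[language]

    # Step 3: execute inside a throwaway working directory.
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / filename
        script.write_text(code, encoding="utf-8")
        start = time.perf_counter()
        proc = subprocess.run(
            command + [str(script)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        elapsed = time.perf_counter() - start
    # Step 5: the working directory is already gone at this point.

    # Step 4: hand the collected output back to the caller.
    return {
        "language": language,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
        "execution_time_s": elapsed,
    }
```

For example, `handle_request("print(2 + 3)")` routes the snippet to the Python runtime and returns its captured output together with the exit code and elapsed time.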

Advantages of Labelbox’s built-in code execution

Code Runner’s infrastructure was designed to deliver the benefits described above and to address several key challenges that other solutions may face:

Security: By isolating execution environments and ensuring no public exposure, we eliminate a significant attack surface.
Scalability: Cloud Run’s serverless nature allows us to scale dynamically with demand, handling thousands of requests efficiently.
Reliability: The use of ephemeral containers ensures that each execution starts from a clean slate, avoiding cross-contamination or resource conflicts.

Explore it yourself

With Code Runner, we’re empowering users to go beyond static evaluations, enabling dynamic, interactive testing that’s as secure as it is scalable. As always, we’re excited to hear your feedback and explore how we can push this feature even further.

If you want to explore Code Runner and other LLM evaluation tools, sign up for our platform today.

Stay tuned for updates, and happy coding!
