Dmytro Apollonin•December 20, 2024
Code Runner: Secure, scalable code execution for model evaluation
In the world of large language models (LLMs), evaluating their responses effectively is a fundamental aspect of improving model performance. We’re excited to announce the latest addition to the Labelbox platform: Code Runner. This new capability pushes the boundaries of interactivity by allowing users to execute written code directly within the evaluation workflow.
Code Runner helps eliminate errors, optimize functionality, and validate outputs, leading to higher-quality datasets. Today, we’ll introduce the feature and then dive into the technical details of the infrastructure that powers it, highlighting how it was designed with security, scalability, and robustness at its core.
What is Code Runner?
Code Runner is a new built-in feature of the Labelbox platform designed to improve the quality of responses and labels generated in any coding-related project. The feature enables users to:
- Directly execute code found in either model responses or user-written responses
- Receive precise outputs, including:
  - Standard output (stdout)
  - Standard error (stderr)
  - Execution time
  - Warnings or runtime errors
By integrating Code Runner into the evaluation pipeline, we aim to simplify the process of verifying the accuracy, efficiency, and functionality of code responses, all without users needing to leave the platform.
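To make the outputs listed above concrete, here is a minimal Python sketch of what capturing them could look like for a single snippet. It is not Labelbox's implementation: the `ExecutionResult` type, its field names, and the `run_snippet` helper are hypothetical, and the sketch runs the code in a local child process rather than in a remote container.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Hypothetical result shape; the field names are illustrative only."""
    stdout: str
    stderr: str  # runtime errors and warnings are written here by Python
    exit_code: int
    duration_seconds: float


def run_snippet(code: str, timeout_seconds: float = 10.0) -> ExecutionResult:
    """Run a Python snippet in a child process and capture its outputs.

    Raises subprocess.TimeoutExpired if the snippet exceeds the timeout.
    """
    started = time.perf_counter()
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,  # collect stdout and stderr separately
        text=True,            # decode bytes to str
        timeout=timeout_seconds,
    )
    elapsed = time.perf_counter() - started
    return ExecutionResult(
        stdout=completed.stdout,
        stderr=completed.stderr,
        exit_code=completed.returncode,
        duration_seconds=elapsed,
    )
```

A snippet that raises an exception would come back with a non-zero `exit_code` and the traceback in `stderr`, which is the kind of signal a reviewer needs when judging a code response.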
But what makes this feature stand out is the sophisticated infrastructure behind it, designed to ensure seamless execution while maintaining strict security and privacy standards.
Code Runner infrastructure: A deep dive
At the heart of Code Runner’s infrastructure lies Google Cloud Run, a fully managed compute platform that runs containerized applications in a secure, scalable manner. Here are the key components and principles driving the system:
1. Cloud Run for language-specific environments
Every code execution happens in a dedicated Cloud Run instance. Each instance is tailored to a specific programming language environment (such as Python or JavaScript) and is spun up dynamically based on the code type detected in the user response.
This design includes the following characteristics to ensure security and speed:
- Isolation: Each execution is fully containerized, completely isolating the runtime environment from others.
- Temporary directories: Code is executed in a temporary directory within the container. The directory is deleted immediately after execution, leaving no trace behind.
- Language-specific tools: Each environment comes preloaded with the necessary packages and libraries to ensure compatibility and speed.
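The temporary-directory pattern described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the production code: the `run_in_temp_dir` helper, the `main.py` file name, and the `code-runner-` prefix are all made up for the example. Note that a temporary directory only handles cleanup; in the design described here, the security boundary comes from the container, not from the directory.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_in_temp_dir(code: str, timeout_seconds: float = 10.0) -> Tuple[str, str, Path]:
    """Write code to a fresh temporary directory, run it there, then delete it.

    Returns stdout, stderr, and the (now deleted) working directory path so
    that callers can confirm nothing was left behind.
    """
    with tempfile.TemporaryDirectory(prefix="code-runner-") as workdir:
        script = Path(workdir) / "main.py"  # hypothetical file name
        script.write_text(code, encoding="utf-8")
        completed = subprocess.run(
            [sys.executable, script.name],
            cwd=workdir,  # files the snippet creates land inside workdir
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    # Leaving the `with` block removes the directory and everything in it,
    # including files created by the executed code.
    return completed.stdout, completed.stderr, Path(workdir)
```

Because the cleanup is tied to the context manager, the directory is removed even if the subprocess call raises, for example on a timeout.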
2. Enhanced security with separate GCP projects
The Cloud Run service is hosted in a separate Google Cloud Platform (GCP) project, distinct from our main infrastructure. This segmentation provides an additional layer of security by isolating code execution from our core services. Even in the unlikely event of a compromise, the blast radius is contained.
3. Communication via Private Service Connect
To ensure secure and controlled communication, all interactions between the main evaluation system and the Cloud Run service occur over Private Service Connect, which provides the following advantages:
- No public exposure: The Cloud Run endpoint is never exposed to the public internet, reducing the risk of unauthorized access.
- One-way communication: The Private Service Connect setup restricts outbound networking from the Cloud Run service, ensuring that executed code cannot make arbitrary network requests.
- Granular networking controls: The private network allows for precise control over what resources the Cloud Run service can access.
4. Automatic cleanup
To maintain a lightweight and secure runtime, the system provides:
- Ephemeral execution: Each execution request is handled in a stateless, temporary environment.
- Automatic deletion: Files, logs, and temporary directories are wiped as soon as execution completes, leaving no residual data.
How Code Runner works: A step-by-step overview
Now that you have an understanding of the powerful infrastructure underneath Code Runner, here is a summary of how the feature works from start to finish:
- Code submission: A user requests code execution from the evaluation interface.
- Language detection: The system detects the programming language and forwards the request to the corresponding Cloud Run service.
- Execution: The Cloud Run instance spins up a container, executes the code in a sandboxed environment, and collects the results.
- Result delivery: The system returns the output (stdout, stderr, execution time, and any warnings) to the user for analysis.
- Cleanup: The container and all related resources are terminated and deleted.
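The routing step in the flow above can be sketched as a small dispatch table. The post does not describe how language detection works, so this example assumes the language arrives as a label alongside the code. The `ALIASES` table, the `dispatch` function, and the runner callables are hypothetical stand-ins for the per-language Cloud Run services.

```python
from typing import Callable, Dict

# A runner takes source code and returns a result payload. In the described
# architecture this would be a call to a language-specific service; here it
# is just a local callable (an assumption made for illustration).
Runner = Callable[[str], Dict[str, object]]

# Hypothetical alias table so that "py" and "python3" reach the same runner.
ALIASES = {
    "py": "python",
    "python3": "python",
    "js": "javascript",
    "node": "javascript",
}


def normalize_language(label: str) -> str:
    """Map a free-form language label onto a canonical runner key."""
    key = label.strip().lower()
    return ALIASES.get(key, key)


def dispatch(label: str, code: str, runners: Dict[str, Runner]) -> Dict[str, object]:
    """Forward code to the runner registered for its language."""
    language = normalize_language(label)
    runner = runners.get(language)
    if runner is None:
        raise LookupError(f"no runner configured for language {language!r}")
    return runner(code)


# Usage with a stub runner that only echoes what it received.
stub_runners: Dict[str, Runner] = {
    "python": lambda code: {"language": "python", "code": code},
}
result = dispatch("py", "print(1)", stub_runners)
```

Failing loudly on an unknown language, rather than guessing a default, keeps unsupported code from being sent to the wrong environment.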
Advantages of Labelbox’s built-in code execution
Code Runner’s infrastructure was designed to deliver the benefits described above and to address several key challenges that other solutions may face:
- Security: By isolating execution environments and ensuring no public exposure, we eliminate a significant attack surface.
- Scalability: Cloud Run’s serverless nature allows us to scale dynamically with demand, handling thousands of requests efficiently.
- Reliability: The use of ephemeral containers ensures that each execution starts from a clean slate, avoiding cross-contamination or resource conflicts.
Explore it yourself
With Code Runner, we’re empowering users to go beyond static evaluations, enabling dynamic, interactive testing that’s as secure as it is scalable. As always, we’re excited to hear your feedback and explore how we can push this feature even further.
If you want to explore Code Runner and other LLM evaluation tools, sign up for our platform today.
Stay tuned for updates, and happy coding!