The specialist always beats the generalist
Agent Studio is the RL platform for developing, evaluating, and deploying specialist AI models that improve from real enterprise execution.
A customer service agent resolved more tickets with fewer hallucinations, faster responses, and lower inference cost
| Metric | Finetuned OS model | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|
| Resolution rate | 84% | 76% | 73% |
| Reduction in hallucinations | 72% | 46% | 41% |
| Cost per million tokens | $3.20 | $7.78 | $7.22 |
| Time to first token | 0.42s | 1.10s | 1.28s |

General agents produce general outcomes
General-purpose agents can appear strong in isolation, but they break in consistent ways once deployed into real enterprise environments: they struggle to maintain reliability across multi-step workflows, adapt to edge cases, and preserve quality as the environment changes underneath them.
The core issue isn't model capability. It's that general models are built to be broadly useful, not tuned to the specific structure, constraints, and decision patterns that define how your business actually works.
Without a tight loop between execution, measurement, and training, every run is just another task completed, not another signal the system gets to learn from.
A unified RL platform for specialist models
Agent Studio connects environments, evaluation, and training into a closed-loop reinforcement learning system. Rather than treating deployment as the endpoint, production becomes the training surface.
RL environments that reflect real work
Agent Studio turns workflows, tools, policies, and edge cases into executable environments for RL training and evaluation. Powered by WorldSim, these environments recreate the full enterprise software stack — with configurable world effects that generate diverse, realistic scenarios at scale.
0554-pipeline-lbo-returns
# Pipeline LBO Returns Model with Multi-Tier Waterfall
## Context
You are an associate at a private equity fund preparing an investment committee update for a crude oil pipeline project acquired through an LBO. The task is to build an analyst-ready workbook that connects operating performance, the pre-computed debt schedule, exit valuation, and sponsor-management waterfall.
Save the completed workbook to /workspace/output/model.xlsx.
## Workbook to build
- 1. Financing Assumptions
- 2. Operating Assumptions
- 3. Debt Schedule
- 4. Model
- 5. Returns
Task files
Optional. Files accessible to the model at the container mount path.
- inputs.xlsx: /workspace/files/problem/inputs.xlsx
- source.xlsx: /workspace/files/problem/source.xlsx
Evaluation systems that measure real execution
The difference between agents that stagnate and agents that improve is measurement. Agent Studio builds evaluation systems that score intelligence and skill at every level — final outcomes, intermediate decisions, and execution quality — so every run generates the signal your models need to get better.
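As a rough illustration of scoring a run at several levels, the sketch below combines three per-run scores into one weighted value. The class name, field names, and weights are hypothetical and chosen only for this example; they are not taken from Agent Studio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScores:
    """Scores in [0, 1] for one agent run. Field names are hypothetical."""
    final_outcome: float
    intermediate_decisions: float
    execution_quality: float


# Illustrative weights only (an assumption, not a platform default).
WEIGHTS = {
    "final_outcome": 0.5,
    "intermediate_decisions": 0.3,
    "execution_quality": 0.2,
}


def composite_score(scores: RunScores, weights: dict = WEIGHTS) -> float:
    """Return the weighted average of the per-level scores."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(getattr(scores, name) * w for name, w in weights.items())
    return weighted / total_weight


# A run that reached the right answer but made some weak intermediate choices.
run = RunScores(final_outcome=1.0, intermediate_decisions=0.5, execution_quality=0.75)
print(round(composite_score(run), 3))
```

Keeping the levels separate until the last step means a single run can still be inspected per level, even when only the composite value is used as a training signal.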
specialist-model-evals
Ranked: accuracy, efficiency, and latency across model candidates
Leader: GLM 5.1 FTv2

| Metric | What it measures | Models, in the order shown |
|---|---|---|
| Accuracy | Pass@4 score on held-out financial analysis tasks | GLM 5.1 FTv2, GPT 5.5, Claude 4.8, GLM 5.1 FTv1, GLM 5.1 base |
| Token efficiency | Normalized useful work per token | GLM 5.1 FTv2, GLM 5.1 FTv1, GLM 5.1 base, Claude 4.8, GPT 5.5 |
| Latency | Normalized responsiveness, raw latency shown | GLM 5.1 FTv2, GLM 5.1 FTv1, GLM 5.1 base, GPT 5.5, Claude 4.8 |
A training loop that compounds from real work
Every rollout produces graded trajectories that can feed fine-tuning and reinforcement learning. The result is a specialist model that improves from enterprise execution signals instead of synthetic benchmarks alone.
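One way graded rollouts can become a training signal is the group-relative advantage step commonly described for GRPO: each rollout's reward is normalized against the other rollouts for the same task. The sketch below shows only that step, under the assumption that every rollout has a single scalar grade. The function name and sample grades are hypothetical, and this is not a description of the platform's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each rollout's reward against its own group.

    `rewards` holds the grades of several rollouts for the same task.
    Rollouts above the group mean get a positive advantage, those below
    get a negative one. `eps` avoids division by zero when all grades match.
    """
    if not rewards:
        raise ValueError("need at least one rollout reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical grades for four attempts at one task.
graded_rollouts = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(graded_rollouts)
print(advantages)
```

When every rollout in a group receives the same grade, all advantages are zero, so that task contributes no gradient signal for that step. This is one reason task distributions with mixed outcomes are useful for this kind of training.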
fcp-sft-glm-5.1-v7-fits6h
Completed: GRPO run - GLM 5.1
Run result: 86% Pass@4 score (+30%)
Training run summary: Base 8h 20m, Training 11h 45m, Tuned 8h 55m
| Metric | Value |
|---|---|
| Source | 4,300 financial analysis tasks |
| Task family | DCF, LBO, acquisition, projection |
| Base model | GLM 5.1 |
| Training method | GRPO |
| Compute | GKE, H100 cluster |
| Endpoints | Baseline and tuned |
Evaluations (4 attempts per task)

| Problem | Baseline | Tuned | Delta (tuned - baseline) |
|---|---|---|---|
| 0195-company-abc-dcf-irr | 0.55 | 0.85 | +0.30 |
| 0248-pumptech-multicompany-acquisition | 0.60 | 0.89 | +0.29 |
| 0515-hotel-back-envelope | 0.51 | 0.83 | +0.32 |
| 0843-hartwell-manufacturing-dcf | 0.54 | 0.83 | +0.29 |
| 0120-smasco-financial-projection | 0.52 | 0.85 | +0.33 |
| 0631-proactivate-lbo-returns | 0.57 | 0.87 | +0.30 |
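The per-task scores listed above can be summarized with simple arithmetic. The sketch below recomputes each task's gain and the mean baseline and tuned scores for the six tasks shown. The variable names are illustrative, and the six tasks are only the sample displayed here, not the full evaluation set.

```python
from statistics import mean

# (baseline, tuned) scores copied from the evaluation list above.
scores = {
    "0195-company-abc-dcf-irr": (0.55, 0.85),
    "0248-pumptech-multicompany-acquisition": (0.60, 0.89),
    "0515-hotel-back-envelope": (0.51, 0.83),
    "0843-hartwell-manufacturing-dcf": (0.54, 0.83),
    "0120-smasco-financial-projection": (0.52, 0.85),
    "0631-proactivate-lbo-returns": (0.57, 0.87),
}

# Absolute gain per task, rounded to two decimals to match the source precision.
deltas = {task: round(tuned - base, 2) for task, (base, tuned) in scores.items()}

mean_base = mean([base for base, _ in scores.values()])
mean_tuned = mean([tuned for _, tuned in scores.values()])
mean_gain = mean_tuned - mean_base

for task, delta in deltas.items():
    print(f"{task}: +{delta:.2f}")
print(f"mean baseline {mean_base:.3f}, mean tuned {mean_tuned:.3f}, gain {mean_gain:.3f}")
```

On these six tasks the mean gain is about 0.30 in absolute score. That would be consistent with the reported "+30%" only if that figure denotes percentage points, which the source does not state.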
From workflow to specialist model — continuously
Define
Identify the knowledge work tasks: the workflows, decisions, and domain-specific judgments agents need to perform.
Connect
Labelbox agents connect to enterprise data sources and convert them into structured RL data representations.
Evaluate
Design the evaluation system: rubrics, success criteria, and scoring logic that define what good performance looks like.
Generate
Create task distributions that capture long-tail edge cases and operational variability.
Train
Run RL training on graded rollouts to produce specialist models that improve with every cycle.
Every cycle captures additional expertise. Every execution generates a new signal. Every improvement compounds your proprietary edge.
Privacy and security for enterprise agents
Agent Studio is built for high-stakes workflows where prompts, traces, reward signals, and training data need enterprise-grade governance. Labelbox applies the same privacy, security, and compliance posture across the systems that power specialist agents.

Your advantage is the learning loop, not the model
As foundation models improve and become accessible to everyone, sustainable advantage shifts away from the model itself. The organizations that win won't simply deploy intelligence; they'll own the learning loops that transform their expertise, workflows, and decisions into durable competitive advantage.
Every organization has unique workflows, judgments, and domain expertise embedded in its people. The question is whether that expertise remains trapped inside individuals or becomes a compounding organizational asset.
Agent Studio is designed to close that gap.