The specialist always beats the generalist

Agent Studio is the RL platform for developing, evaluating, and deploying specialist AI models that improve from real enterprise execution.

A customer service agent resolved more tickets with fewer hallucinations, faster responses, and lower inference cost

Metric	Finetuned OS model	GPT-5.5	Claude Opus 4.8
Resolution rate	84%	76%	73%
Reduction in hallucinations	72%	46%	41%
Cost per million tokens	$3.20	$7.78	$7.22
Time to first token	0.42s	1.10s	1.28s
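To make the cost and latency gap concrete, here is a minimal arithmetic sketch in Python using only the figures quoted above. The monthly token volume is a hypothetical assumption added for illustration, and the function names are not part of any Agent Studio API.

```python
# Figures copied from the comparison above. The monthly volume is a
# hypothetical assumption used only for illustration.
COST_PER_MILLION_TOKENS = {
    "Finetuned OS model": 3.20,
    "GPT-5.5": 7.78,
    "Claude Opus 4.8": 7.22,
}
TIME_TO_FIRST_TOKEN_S = {
    "Finetuned OS model": 0.42,
    "GPT-5.5": 1.10,
    "Claude Opus 4.8": 1.28,
}
SPECIALIST = "Finetuned OS model"
HYPOTHETICAL_TOKENS_PER_MONTH = 500_000_000  # assumption, not from the page


def relative_saving(specialist: float, baseline: float) -> float:
    """Fraction of the baseline price saved by switching to the specialist."""
    return 1.0 - specialist / baseline


def speedup(specialist: float, baseline: float) -> float:
    """How many times faster the specialist is than the baseline."""
    return baseline / specialist


def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Spend in dollars for a token volume at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    for model, price in COST_PER_MILLION_TOKENS.items():
        if model == SPECIALIST:
            continue
        saving = relative_saving(COST_PER_MILLION_TOKENS[SPECIALIST], price)
        faster = speedup(TIME_TO_FIRST_TOKEN_S[SPECIALIST], TIME_TO_FIRST_TOKEN_S[model])
        spend = monthly_cost(HYPOTHETICAL_TOKENS_PER_MONTH, price)
        print(f"{model}: {saving:.1%} saved per token, {faster:.2f}x slower "
              f"first token, ${spend:,.0f} per month at the assumed volume")
```

On these figures the fine-tuned model costs about 59% less per million tokens than GPT-5.5 and about 56% less than Claude Opus 4.8, and its time to first token is roughly 2.6x and 3.0x faster respectively.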

General agents produce general outcomes

General-purpose agents can appear strong in isolation, but they break in consistent ways once deployed into real enterprise environments: they struggle to maintain reliability across multi-step workflows, adapt to edge cases, and preserve quality as the environment changes underneath them.

The core issue isn't model capability. It's that general models are built to be broadly useful, not tuned to the specific structure, constraints, and decision patterns that define how your business actually works.

Without a tight loop between execution, measurement, and training, every run is just another task completed, not another signal the system gets to learn from.

A unified RL platform for specialist models

Agent Studio connects environments, evaluation, and training into a closed-loop reinforcement learning system. Rather than treating deployment as the endpoint, production becomes the training surface.

RL environments that reflect real work

Agent Studio turns workflows, tools, policies, and edge cases into executable environments for RL training and evaluation. Powered by WorldSim, these environments recreate the full enterprise software stack — with configurable world effects that generate diverse, realistic scenarios at scale.

0554-pipeline-lbo-returns

Task runs

Prompt (6798 chars)

# Pipeline LBO Returns Model with Multi-Tier Waterfall

## Context

You are an associate at a private equity fund preparing an investment committee update for a crude oil pipeline project acquired through an LBO. The task is to build an analyst-ready workbook that connects operating performance, the pre-computed debt schedule, exit valuation, and sponsor-management waterfall.

Save the completed workbook to /workspace/output/model.xlsx.

## Workbook to build

1. Financing Assumptions
2. Operating Assumptions
3. Debt Schedule
4. Model
5. Returns

Task files (optional)

Files accessible to the model at the container mount path.

File	Path
inputs.xlsx	/workspace/files/problem/inputs.xlsx
source.xlsx	/workspace/files/problem/source.xlsx

Evaluation systems that measure real execution

The difference between agents that stagnate and agents that improve is measurement. Agent Studio builds evaluation systems that score intelligence and skill at every level — final outcomes, intermediate decisions, and execution quality — so every run generates the signal your models need to get better.
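As a rough illustration of scoring a run at several levels, the sketch below combines a final-outcome score, an intermediate-decision score, and an execution-quality score into one number. The `RunScores` type, the weights, and the weighted-average rule are hypothetical assumptions; the page does not say how Agent Studio aggregates these signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScores:
    """Per-run scores in [0, 1], one for each level named in the text."""
    outcome: float    # did the final deliverable meet the success criteria?
    decisions: float  # were the intermediate decisions sound?
    execution: float  # how clean was the execution (tool use, retries)?


# Hypothetical weights chosen for illustration only.
DEFAULT_WEIGHTS = {"outcome": 0.5, "decisions": 0.3, "execution": 0.2}


def composite_score(scores: RunScores, weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of the per-level scores, normalized by total weight."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = 0.0
    for level, weight in weights.items():
        value = getattr(scores, level)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{level} score {value} is outside [0, 1]")
        weighted_sum += weight * value
    return weighted_sum / total_weight


if __name__ == "__main__":
    run = RunScores(outcome=1.0, decisions=0.5, execution=0.0)
    print(f"composite score: {composite_score(run):.2f}")
```

Keeping the three levels separate until the last step means a run with the right answer but a poor path still shows up as a distinct signal rather than a plain pass.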

specialist-model-evals (ranked)

Accuracy, efficiency, and latency across model candidates

Leader: GLM 5.1 FTv2

Accuracy: Pass@4 score on held-out financial analysis tasks
Token efficiency: Normalized useful work per token
Latency: Normalized responsiveness, raw latency shown

Model	Accuracy	Token efficiency	Latency
GLM 5.1 FTv2	86%	84%	1.1s
GPT 5.5	82%	38%	2.6s
Claude 4.8	79%	42%	2.9s
GLM 5.1 FTv1	78%	76%	1.7s
GLM 5.1 base	56%	54%	2.4s
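For readers unfamiliar with the accuracy metric, here is a minimal sketch of a Pass@4 computation, assuming the common definition: a task counts as solved if at least one of its 4 attempts passes. The page does not state how its score is computed, so treat this as an assumption. The task IDs below are hypothetical.

```python
from typing import Mapping, Sequence


def pass_at_k(results: Mapping[str, Sequence[bool]], k: int = 4) -> float:
    """Fraction of tasks with at least one passing attempt among the first k.

    `results` maps a task ID to its per-attempt pass/fail outcomes.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    solved = 0
    for task_id, attempts in results.items():
        if len(attempts) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} attempts")
        if any(attempts[:k]):
            solved += 1
    return solved / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for two tasks, 4 attempts each.
    example = {
        "task-a": [False, False, True, False],
        "task-b": [False, False, False, False],
    }
    print(f"Pass@4: {pass_at_k(example):.0%}")
```

When exactly k attempts are sampled per task, this matches the widely used unbiased pass@k estimator, which only differs when more than k samples are drawn.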

A training loop that compounds from real work

Every rollout produces graded trajectories that can feed fine-tuning and reinforcement learning. The result is a specialist model that improves from enterprise execution signals instead of synthetic benchmarks alone.
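The training run summarized in this section lists GRPO as its method. As a hedged illustration of how graded rollouts become a training signal, the sketch below computes group-relative advantages, the core step of GRPO: each rollout's reward is compared against the other rollouts for the same task. The reward values, the group size of 4, and the use of population standard deviation are assumptions, not details from the page.

```python
from statistics import mean, pstdev
from typing import Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group of rollouts.

    advantage_i = (reward_i - group mean) / (group std + eps)
    Rollouts that beat their siblings get a positive advantage, and
    the rest get a negative one. A group with identical rewards yields
    all zeros, so it contributes no learning signal.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two rollouts")
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    return [(r - group_mean) / (group_std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical grader scores for 4 rollouts of one task.
    graded_rollouts = [0.9, 0.4, 0.4, 0.7]
    for reward, adv in zip(graded_rollouts, group_relative_advantages(graded_rollouts)):
        print(f"reward={reward:.2f} advantage={adv:+.2f}")
```

Because the baseline is the group's own mean, no separate value model is needed. The quality of the grader that produces the rewards matters correspondingly more.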

fcp-sft-glm-5.1-v7-fits6h (completed)

GRPO run - GLM 5.1

Run result: 86% Pass@4 score (+30 percentage points over the GLM 5.1 base score of 56%)

Training run summary

Phase	Duration
Base	8h 20m
Training	11h 45m
Tuned	8h 55m

Setting	Value
Source	4,300 financial analysis tasks
Task family	DCF, LBO, acquisition, projection
Base model	GLM 5.1
Training method	GRPO
Compute	GKE, H100 cluster
Endpoints	Baseline and tuned

Evaluations (4 attempts per task)

Problem	Baseline	Tuned	Delta
0195-company-abc-dcf-irr	0.55	0.85	+0.30
0248-pumptech-multicompany-acquisition	0.60	0.89	+0.29
0515-hotel-back-envelope	0.51	0.83	+0.32
0843-hartwell-manufacturing-dcf	0.54	0.83	+0.29
0120-smasco-financial-projection	0.52	0.85	+0.33
0631-proactivate-lbo-returns	0.57	0.87	+0.30

From workflow to specialist model — continuously

Define

Identify the knowledge work tasks: the workflows, decisions, and domain-specific judgments agents need to perform.

Connect

Labelbox agents connect to enterprise data sources and convert them into structured RL data representations.

Evaluate

Design the evaluation system: rubrics, success criteria, and scoring logic that define what good performance looks like.

Generate

Create task distributions that capture long-tail edge cases and operational variability.

Train

Run RL training on graded rollouts to produce specialist models that improve with every cycle.

Every cycle captures additional expertise. Every execution generates a new signal. Every improvement compounds your proprietary edge.

Privacy and security for enterprise agents

Agent Studio is built for high-stakes workflows where prompts, traces, reward signals, and training data need enterprise-grade governance. Labelbox applies the same privacy, security, and compliance posture across the systems that power specialist agents.

Your advantage is the learning loop, not the model

As foundation models improve and become accessible to everyone, sustainable advantage shifts away from the model itself. The organizations that win won't simply deploy intelligence; they'll own the learning loops that transform their expertise, workflows, and decisions into durable competitive advantage.

Every organization has unique workflows, judgments, and domain expertise embedded in its people. The question is whether that expertise remains trapped inside individuals or becomes a compounding organizational asset.

Agent Studio is designed to close that gap.