Labelbox•June 24, 2026

Introducing Recursion: The RL platform for enterprise specialist agents

AI agents are moving from isolated experiments into real business workflows. They connect to enterprise tools, operate across internal systems, and execute increasingly complex multi-step tasks across the organization.

But there is a fundamental limitation that will not disappear with better foundation models: general intelligence alone does not create specialized outcomes.

As base models become more accessible, the differentiator will shift from who has access to intelligence to who can continuously improve it. The organizations that win in the AI era will not simply deploy AI. They will build learning systems that turn their expertise, workflows, and decisions into compounding intelligence.

Today, we are introducing Recursion, a unified platform for developing, evaluating, and deploying specialist AI models. Recursion is built on a simple idea: in the age of AI, the ability to learn and improve continuously matters more than any single model. The learning loop is your most durable competitive advantage.

The specialist always beats the generalist

Your business has accumulated something no foundation model has: years of domain expertise, decision patterns, and workflow knowledge embedded in your people and processes. General LLMs approximate this expensively, inconsistently, and at full inference cost with every prompt.

That same expertise, encoded into a specialist model through reinforcement learning, consistently outperforms general LLMs on your specific workflows at a fraction of the token cost. The model that knows your workflows beats the model that knows everything.

A smaller specialist model beats a larger generalist model

Comparison to other systems on the same held-out agentic finance tasks.

pass@1 gain vs 397B

+85%

pass@3 gain vs 397B

+46%

pass@5 gain vs 397B

+34%

Qwen3.6 35B (tuned)

ours

Mean reward: 0.23
pass@1: 0.37
pass@3: 0.60
pass@5: 0.67

Qwen3.5-397B

Mean reward: 0.25
pass@1: 0.20
pass@3: 0.41
pass@5: 0.50

System	Mean reward	pass@1	pass@3	pass@5
Qwen3.6 35B (tuned) ours	0.23	0.37	0.60	0.67
Qwen3.5-397B	0.25	0.20	0.41	0.50

In an agentic finance workflow, a Recursion-tuned Qwen3.5-35B open-source model reached 0.37 pass@1, 0.60 pass@3, and 0.67 pass@5, ahead of Qwen3.5-397B on each pass@k metric.

This comparison was measured on unseen tasks, which matters: the model learned more robust agentic behaviors rather than overfitting to specific financial procedures. In practice, this translates to a finance agent that reliably reaches correct outcomes across multi-step reasoning, tool use, and structured decision-making without iterative retries.

Customer Support

A customer service agent developed with Recursion achieves these results: more tickets resolved, fewer hallucinations, faster responses, and lower inference cost.

Resolution rate

Finetuned OS model

84%

GPT-5.5

76%

Claude Opus 4.8

73%

Reduction in hallucinations

Finetuned OS model

72%

GPT-5.5

46%

Claude Opus 4.8

41%

Cost per 1,000 support tickets

Finetuned OS model

$32

GPT-5.5

$158

Claude Opus 4.8

$176

Time to first token

Finetuned OS model

0.42s

GPT-5.5

1.10s

Claude Opus 4.8

1.28s

How Recursion works

Recursion connects environments, evaluation, and training into a unified reinforcement learning loop. Rather than treating deployment as the endpoint, production becomes the training surface. Every outcome becomes an opportunity to improve.

RL environments that reflect real work

Recursion turns workflows, tools, policies, and edge cases into executable environments for RL training and evaluation. Powered by WorldSim, these environments recreate the full enterprise software stack — with configurable world effects that generate diverse, realistic scenarios at scale.

0554-crude-pipeline-lbo-waterfall

Task runs

Show all 10 runs

Prompt6798 chars

# Pipeline LBO Returns Model with Multi-Tier Waterfall

## Context

You are an associate at a private equity fund preparing an investment committee update for a crude oil pipeline project acquired through an LBO. The task is to build an analyst-ready workbook that connects operating performance, the pre-computed debt schedule, exit valuation, and sponsor-management waterfall.

Save the completed workbook to /workspace/output/model.xlsx.

## Workbook to build

1. Financing Assumptions
2. Operating Assumptions
3. Debt Schedule
4. Model
5. Returns

AttachmentsGradingIssuesQAFormsSynthesizersSolver

Task files

Optional

Files accessible to the model at the container mount path.

inputs.xlsx

/workspace/files/problem/inputs.xlsx

source.xlsx

/workspace/files/problem/source.xlsx

Evaluation systems that measure real execution

The difference between agents that stagnate and agents that improve is measurement. Recursion builds evaluation systems that score intelligence and skill at every level — final outcomes, intermediate decisions, and execution quality — so every run generates the signal your models need to get better.

specialist-model-evals

ranked

Accuracy, efficiency, and latency across model candidates

Leader

GLM 5.1 FTv2

GLM 5.1 base

GLM 5.1 FTv1

GLM 5.1 FTv2

Claude 4.8

GPT 5.5

Accuracy

Pass@4 score on held-out financial analysis tasks

GLM 5.1 FTv2

86%

GPT 5.5

82%

Claude 4.8

79%

GLM 5.1 FTv1

78%

GLM 5.1 base

56%

Token efficiency

Normalized useful work per token

GLM 5.1 FTv2

84%

GLM 5.1 FTv1

76%

GLM 5.1 base

54%

Claude 4.8

42%

GPT 5.5

38%

Latency

Normalized responsiveness, raw latency shown

GLM 5.1 FTv2

1.1s

GLM 5.1 FTv1

1.7s

GLM 5.1 base

2.4s

GPT 5.5

2.6s

Claude 4.8

2.9s

A training loop that compounds from real work

Every rollout produces graded trajectories that can feed fine-tuning and reinforcement learning. The result is a specialist model that improves from enterprise execution signals instead of synthetic benchmarks alone.

financial-agent-glm-5.1-v7-fits6h

completed

GRPO run - GLM 5.1

Run result

86% Pass@4 score (+30%)

Training run summary

Base

8h 20m

Training

11h 45m

Tuned

8h 55m

Metric	Value
Source	4,300 financial analysis tasks
Task family	DCF, LBO, acquisition, projection
Base model	GLM 5.1
Training method	GRPO
Compute	GKE, H100 cluster
Endpoints	Baseline and tuned

Evaluations

4 attempts per task

Problem

Baseline

Tuned

Delta

0195-acme-industrials-dcf-irr

+0.30

0.55 -> 0.85

0248-pumptech-acquisition-rollup

+0.29

0.60 -> 0.89

0515-hotel-renovation-noi-yield

+0.32

0.51 -> 0.83

0843-hartwell-plant-expansion-dcf

+0.29

0.54 -> 0.83

0120-smasco-revenue-forecast

+0.33

0.52 -> 0.85

0631-proactivate-lbo-exit-returns

+0.30

0.57 -> 0.87

The enterprise advantage is a learning system

Historically, software helped organizations scale execution. AI introduces something fundamentally different: the ability to scale learning itself.

Every organization possesses a unique set of workflows, judgments, decision patterns, and domain expertise embedded within its people. The question is whether that expertise remains trapped inside individuals or becomes a durable organizational asset.

As foundation models improve and become accessible to everyone, sustainable advantage shifts away from the model itself and toward an organization's ability to continuously capture, evaluate, and improve its own expertise.

From workflow to specialist model — continuously

Every cycle captures additional expertise. Every execution generates a new signal. Every improvement compounds your proprietary edge.

Define

Identify the workflows, decisions, and domain-specific judgments agents need to perform. Start from real knowledge work so the learning loop is anchored in the tasks that matter.

Built for enterprise scale

AI will make intelligence broadly accessible. The enduring advantage will come from what organizations do with it.

The firms that thrive will be those that continuously transform expertise into systems, systems into learning, and learning into proprietary advantage. Recursion helps enterprises build company intelligence that outlasts any individual model.

Privacy and security for enterprise agents

Recursion is built for high-stakes workflows where prompts, traces, reward signals, and training data need enterprise-grade governance. Labelbox applies the same privacy, security, and compliance posture across the systems that power specialist agents.

Privacy and security illustration for the security page and Agent Studio privacy section.

Recursion supports a growing set of enterprise domains: finance, legal, security, insurance, manufacturing, and operations, with expansion into additional high-value workflows underway.

As agents take on more responsibility across the enterprise, the need for systems that can evaluate, learn, and improve from real execution will only grow. Reliable agents are not discovered. They are engineered through a compounding learning loop.

That is why we built Recursion. Each cycle evaluates execution, turns the result into training signal, and feeds that signal back into the next generation of specialist models agents. The system continually gets better because the learning loop repeats with better context, better measurements, and better outcomes.

Hundreds of AI teams build with Labelbox

Recursion helps teams turn enterprise execution into the signal their agents need to improve. Every evaluated task, rollout, and outcome becomes part of a learning system built for more reliable specialist agents.

Talk to our team

Introducing Recursion: The RL platform for enterprise specialist agents

The specialist always beats the generalist

A smaller specialist model beats a larger generalist model

Qwen3.6 35B (tuned)

Qwen3.5-397B

Customer Support

How Recursion works

RL environments that reflect real work

Evaluation systems that measure real execution

A training loop that compounds from real work

The enterprise advantage is a learning system

From workflow to specialist model — continuously

Define

Built for enterprise scale

Privacy and security for enterprise agents

Hundreds of AI teams build with Labelbox

How Meta built GIM with Labelbox data to evaluate frontier AI reasoning

Human preference signal for evaluating LLMs inside Vertex AI

Higher-quality training signal for personalized shopping AI

Tracking surgical instruments in video to advance robotic surgery