AI agents are moving from isolated experiments into real business workflows. They connect to enterprise tools, operate across internal systems, and execute increasingly complex multi-step tasks across the organization.
But there is a fundamental limitation that will not disappear with better foundation models: general intelligence alone does not create specialized outcomes.
As base models become more accessible, the differentiator will shift from who has access to intelligence to who can continuously improve it. The organizations that win in the AI era will not simply deploy AI. They will build learning systems that turn their expertise, workflows, and decisions into compounding intelligence.
Today, we are introducing Recursion, a unified platform for developing, evaluating, and deploying specialist AI models. Recursion is built on a simple idea: in the age of AI, the ability to learn and improve continuously matters more than any single model. The learning loop is your most durable competitive advantage.
The specialist always beats the generalist
Your business has accumulated something no foundation model has: years of domain expertise, decision patterns, and workflow knowledge embedded in your people and processes. General LLMs approximate this expensively, inconsistently, and at full inference cost with every prompt.
That same expertise, encoded into a specialist model through reinforcement learning, consistently outperforms general LLMs on your specific workflows at a fraction of the token cost. The model that knows your workflows beats the model that knows everything.
A smaller specialist model beats a larger generalist model
Comparison to other systems on the same held-out agentic finance tasks.
pass@1 gain vs 397B
+85%pass@3 gain vs 397B
+46%pass@5 gain vs 397B
+34%Qwen3.6 35B (tuned)
ours- Mean reward
- 0.23
- pass@1
- 0.37
- pass@3
- 0.60
- pass@5
- 0.67
Qwen3.5-397B
- Mean reward
- 0.25
- pass@1
- 0.20
- pass@3
- 0.41
- pass@5
- 0.50
| System | Mean reward | pass@1 | pass@3 | pass@5 |
|---|---|---|---|---|
Qwen3.6 35B (tuned)ours | 0.23 | 0.37 | 0.60 | 0.67 |
Qwen3.5-397B | 0.25 | 0.20 | 0.41 | 0.50 |
In an agentic finance workflow, a Recursion-tuned Qwen3.5-35B open-source model reached 0.37 pass@1, 0.60 pass@3, and 0.67 pass@5, ahead of Qwen3.5-397B on each pass@k metric.
This comparison was measured on unseen tasks, which matters: the model learned more robust agentic behaviors rather than overfitting to specific financial procedures. In practice, this translates to a finance agent that reliably reaches correct outcomes across multi-step reasoning, tool use, and structured decision-making without iterative retries.
Customer Support
A customer service agent developed with Recursion achieves these results: more tickets resolved, fewer hallucinations, faster responses, and lower inference cost.
Resolution rate
Finetuned OS model
84%GPT-5.5
76%Claude Opus 4.8
73%Reduction in hallucinations
Finetuned OS model
72%GPT-5.5
46%Claude Opus 4.8
41%Cost per 1,000 support tickets
Finetuned OS model
$32GPT-5.5
$158Claude Opus 4.8
$176Time to first token
Finetuned OS model
0.42sGPT-5.5
1.10sClaude Opus 4.8
1.28sHow Recursion works
Recursion connects environments, evaluation, and training into a unified reinforcement learning loop. Rather than treating deployment as the endpoint, production becomes the training surface. Every outcome becomes an opportunity to improve.
RL environments that reflect real work
Recursion turns workflows, tools, policies, and edge cases into executable environments for RL training and evaluation. Powered by WorldSim, these environments recreate the full enterprise software stack — with configurable world effects that generate diverse, realistic scenarios at scale.
0554-crude-pipeline-lbo-waterfall
Task runs
Show all 10 runs
# Pipeline LBO Returns Model with Multi-Tier Waterfall
## Context
You are an associate at a private equity fund preparing an investment committee update for a crude oil pipeline project acquired through an LBO. The task is to build an analyst-ready workbook that connects operating performance, the pre-computed debt schedule, exit valuation, and sponsor-management waterfall.
Save the completed workbook to /workspace/output/model.xlsx.
## Workbook to build
- 1. Financing Assumptions
- 2. Operating Assumptions
- 3. Debt Schedule
- 4. Model
- 5. Returns
Task files
OptionalFiles accessible to the model at the container mount path.
inputs.xlsx
/workspace/files/problem/inputs.xlsx
source.xlsx
/workspace/files/problem/source.xlsx
Evaluation systems that measure real execution
The difference between agents that stagnate and agents that improve is measurement. Recursion builds evaluation systems that score intelligence and skill at every level — final outcomes, intermediate decisions, and execution quality — so every run generates the signal your models need to get better.
specialist-model-evals
rankedAccuracy, efficiency, and latency across model candidates
Leader
GLM 5.1 FTv2
Accuracy
Pass@4 score on held-out financial analysis tasks
GLM 5.1 FTv2
GPT 5.5
Claude 4.8
GLM 5.1 FTv1
GLM 5.1 base
Token efficiency
Normalized useful work per token
GLM 5.1 FTv2
GLM 5.1 FTv1
GLM 5.1 base
Claude 4.8
GPT 5.5
Latency
Normalized responsiveness, raw latency shown
GLM 5.1 FTv2
GLM 5.1 FTv1
GLM 5.1 base
GPT 5.5
Claude 4.8
A training loop that compounds from real work
Every rollout produces graded trajectories that can feed fine-tuning and reinforcement learning. The result is a specialist model that improves from enterprise execution signals instead of synthetic benchmarks alone.
financial-agent-glm-5.1-v7-fits6h
completedGRPO run - GLM 5.1
Run result
86% Pass@4 score (+30%)
Training run summary
Base
8h 20m
Training
11h 45m
Tuned
8h 55m
| Metric | Value |
|---|---|
| Source | 4,300 financial analysis tasks |
| Task family | DCF, LBO, acquisition, projection |
| Base model | GLM 5.1 |
| Training method | GRPO |
| Compute | GKE, H100 cluster |
| Endpoints | Baseline and tuned |
Evaluations
4 attempts per task
Problem
Baseline
Tuned
Delta
0195-acme-industrials-dcf-irr
0.55 -> 0.85
0248-pumptech-acquisition-rollup
0.60 -> 0.89
0515-hotel-renovation-noi-yield
0.51 -> 0.83
0843-hartwell-plant-expansion-dcf
0.54 -> 0.83
0120-smasco-revenue-forecast
0.52 -> 0.85
0631-proactivate-lbo-exit-returns
0.57 -> 0.87
The enterprise advantage is a learning system
Historically, software helped organizations scale execution. AI introduces something fundamentally different: the ability to scale learning itself.
Every organization possesses a unique set of workflows, judgments, decision patterns, and domain expertise embedded within its people. The question is whether that expertise remains trapped inside individuals or becomes a durable organizational asset.
As foundation models improve and become accessible to everyone, sustainable advantage shifts away from the model itself and toward an organization's ability to continuously capture, evaluate, and improve its own expertise.
From workflow to specialist model — continuously
Every cycle captures additional expertise. Every execution generates a new signal. Every improvement compounds your proprietary edge.
Define
Identify the workflows, decisions, and domain-specific judgments agents need to perform. Start from real knowledge work so the learning loop is anchored in the tasks that matter.
Built for enterprise scale
AI will make intelligence broadly accessible. The enduring advantage will come from what organizations do with it.
The firms that thrive will be those that continuously transform expertise into systems, systems into learning, and learning into proprietary advantage. Recursion helps enterprises build company intelligence that outlasts any individual model.
Privacy and security for enterprise agents
Recursion is built for high-stakes workflows where prompts, traces, reward signals, and training data need enterprise-grade governance. Labelbox applies the same privacy, security, and compliance posture across the systems that power specialist agents.

Recursion supports a growing set of enterprise domains: finance, legal, security, insurance, manufacturing, and operations, with expansion into additional high-value workflows underway.
As agents take on more responsibility across the enterprise, the need for systems that can evaluate, learn, and improve from real execution will only grow. Reliable agents are not discovered. They are engineered through a compounding learning loop.
That is why we built Recursion. Each cycle evaluates execution, turns the result into training signal, and feeds that signal back into the next generation of specialist models agents. The system continually gets better because the learning loop repeats with better context, better measurements, and better outcomes.
Hundreds of AI teams build with Labelbox
Recursion helps teams turn enterprise execution into the signal their agents need to improve. Every evaluated task, rollout, and outcome becomes part of a learning system built for more reliable specialist agents.
Talk to our team




