
Arjun Nargolwala · July 1, 2025

Building true RL systems: An experiment on solving real business tasks


Overview (TL;DR): We ran a comprehensive experiment comparing sparse rewards against rubric-based rewards + GRPO on a complex e-commerce task. The results validate what many in the RL community have suspected: there's a much better way to train agents for complex business applications.


The reinforcement learning community has long debated whether traditional sparse reward approaches are sufficient for complex, multi-step business tasks. While techniques like rubric-based reward engineering and Group Relative Policy Optimization (GRPO) have shown promise in academic settings, there's been limited validation on realistic business applications.

At Labelbox, we decided to put these approaches to the test. We built a comprehensive e-commerce shopping agent and ran head-to-head comparisons between traditional sparse rewards and the combination of rubric engineering + GRPO. Our goal wasn't to invent new techniques but to validate whether these promising approaches actually deliver better results on the kind of complex, multi-step tasks that businesses care about.

The answer? A resounding yes. Our proof of concept showed roughly 300% better performance with rubric rewards + GRPO compared to sparse rewards alone.


The experiment: Real-world e-commerce complexity

For testing, we built a slightly simplified environment that still captures much of the genuine complexity of business tasks. Our shopping agent needed to:

  • Search through a product catalog
  • Handle dynamic pricing and stock availability
  • Navigate search result quality variations
  • Manage budget constraints and user preferences
  • Deal with common friction points (out-of-stock items, price changes)
  • Complete end-to-end checkout processes

This mirrors the kind of multi-step, partially observable, constraint-heavy environment that makes business RL notoriously difficult.
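
To make this concrete, here is a minimal, hypothetical sketch of the kind of environment interface such a task implies. The class, field, and action names are our own illustrative assumptions, not the actual implementation we used.

```python
from dataclasses import dataclass, field


@dataclass
class ShoppingState:
    """Partial observation the agent sees at each step (illustrative fields)."""
    search_results: list = field(default_factory=list)  # products returned by the last query
    cart: list = field(default_factory=list)            # items selected so far
    budget_remaining: float = 100.0                      # user's budget constraint
    steps_taken: int = 0
    done: bool = False


class ShoppingEnv:
    """Multi-step e-commerce task: search, select, and check out under a budget.
    Prices and stock can change between steps, which is where friction shows up."""

    def reset(self) -> ShoppingState:
        """Sample a user request, a budget, and a catalog snapshot."""
        ...

    def step(self, action: dict) -> tuple[ShoppingState, float, bool]:
        """Apply an action such as {'type': 'search', 'query': ...},
        {'type': 'add_to_cart', 'product_id': ...}, or {'type': 'checkout'},
        and return (next_state, reward, done)."""
        ...
```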


Three approaches tested

We implemented three training approaches to compare their performance:

1. Traditional Sparse Rewards (Baseline)

The standard approach used in many RL applications (see the sketch after this list):

  • Task Success → +100 points
  • Task Failure → 0 points
  • No intermediate feedback during the episode
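
In code, this baseline amounts to a single terminal check. A minimal sketch, assuming a boolean success flag at episode end (the function and argument names are ours):

```python
def sparse_reward(done: bool, checkout_succeeded: bool) -> float:
    """+100 only when the episode ends in a successful checkout; 0 otherwise.
    The agent receives no credit for intermediate progress within the episode."""
    return 100.0 if done and checkout_succeeded else 0.0
```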

2. Rubric-Based Rewards

We decomposed the task into measurable components (see the scoring sketch after this list):

  • Search Quality (0-20 points): Relevance of found products
  • Product Selection (0-30 points): Match to user criteria and budget
  • Navigation Efficiency (0-25 points): Steps taken vs. optimal path
  • Budget Compliance (0-15 points): Staying within financial constraints
  • Task Completion (0-10 points): Successfully finishing checkout
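
A rubric like this turns the reward into a sum of bounded component scores. Here is a minimal sketch of how those caps could be combined; the component names mirror the list above, and the per-component evaluators are assumed to exist elsewhere:

```python
# Per-component caps mirror the rubric above; the keys are illustrative names.
RUBRIC_MAX = {
    "search_quality": 20,         # relevance of found products
    "product_selection": 30,      # match to user criteria and budget
    "navigation_efficiency": 25,  # steps taken vs. optimal path
    "budget_compliance": 15,      # staying within financial constraints
    "task_completion": 10,        # successfully finishing checkout
}


def rubric_reward(component_scores: dict[str, float]) -> float:
    """Clip each component score to its cap and sum them (max 100),
    giving the agent dense feedback instead of a single terminal signal."""
    return sum(
        min(max(component_scores.get(name, 0.0), 0.0), cap)
        for name, cap in RUBRIC_MAX.items()
    )
```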

3. GRPO + Rubric Rewards (Combined)

We added Group Relative Policy Optimization on top of rubric rewards (see the sketch after this list):

  • Sample 6 candidate actions at each step
  • Evaluate each using learned heuristics
  • Execute the best action from the group
  • Learn from action comparisons
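
The core idea is to score a group of sampled actions and credit each one relative to the group average. A minimal sketch, assuming a policy object with a sample_action method and an evaluate function standing in for the learned heuristics:

```python
def grpo_step(policy, evaluate, state, group_size: int = 6):
    """Sample a group of candidate actions, score each, execute the best,
    and return group-relative advantages for the policy update."""
    candidates = [policy.sample_action(state) for _ in range(group_size)]
    scores = [evaluate(state, a) for a in candidates]  # learned heuristic or rubric score
    baseline = sum(scores) / len(scores)               # group mean acts as the relative baseline
    advantages = [s - baseline for s in scores]        # credit each candidate relative to its peers
    best = max(range(group_size), key=lambda i: scores[i])
    # Execute the best action; the (candidate, advantage) pairs feed the policy update.
    return candidates[best], list(zip(candidates, advantages))
```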

Results: A clear performance hierarchy

We trained each approach for 200 episodes and measured performance across multiple metrics:

Method          | Success Rate | Avg Reward | Training Time
Sparse Rewards  | 18%          | 45.3       | 100% (baseline)
Rubric Rewards  | 42%          | 167.8      | 60% of baseline
GRPO + Rubric   | 65%          | 234.5      | 40% of baseline

The performance hierarchy was consistent across different environment difficulty levels and held up during out-of-sample testing on unseen product catalogs.


What the data tells us

Rubric engineering provides a crucial learning signal

The jump from 18% to 42% success rate when adding rubric rewards validates a core hypothesis: agents need intermediate feedback to learn complex behaviors efficiently. Without it, they struggle to connect actions early in an episode to eventual outcomes.

GRPO improves exploration quality

The additional boost from 42% to 65% success rate with GRPO demonstrates that exploration strategy matters enormously. Traditional single-action sampling left significant performance on the table.

Training efficiency compounds

Not only did the combined approach achieve better final performance, it got there faster. The 60% reduction in training time means these techniques aren't just more effective—they're more economical.

Behavior quality improves

Beyond success rates, we observed qualitatively better agent behaviors:

  • More systematic search strategies
  • Better handling of edge cases (stock outages, budget constraints)
  • More robust performance across different environment configurations
  • Clearer learning progression through intermediate skills

Practical implementation insights

Through this proof of concept, we learned several practical lessons about implementing these techniques:

Rubric design matters

Not all rubric decompositions work equally well. Effective rubrics need:

  • Measurable components that can be objectively evaluated
  • Progressive difficulty that creates natural learning curricula
  • Business alignment with weights reflecting actual priorities
  • Immediate feedback rather than delayed evaluation

GRPO requires tuning

The group size (we used 6 candidates) and evaluation heuristics need optimization for each domain. Too few candidates limit the exploration benefits; too many slow down training.

Environment complexity should scale

Starting with simpler versions and gradually increasing difficulty helped agents build foundational skills systematically.
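
One simple way to express this is a staged difficulty schedule. The thresholds and stage parameters below are hypothetical, purely to illustrate the idea:

```python
# Hypothetical difficulty schedule: grow the catalog and friction rate as the
# agent's rolling success rate clears each stage's threshold.
CURRICULUM = [
    {"catalog_size": 50,   "friction_rate": 0.0, "advance_at": 0.5},
    {"catalog_size": 500,  "friction_rate": 0.1, "advance_at": 0.5},
    {"catalog_size": 5000, "friction_rate": 0.3, "advance_at": None},  # final stage
]


def next_stage(stage: int, rolling_success: float) -> int:
    """Advance to the next curriculum stage once performance clears the threshold."""
    threshold = CURRICULUM[stage]["advance_at"]
    if threshold is not None and rolling_success >= threshold:
        return min(stage + 1, len(CURRICULUM) - 1)
    return stage
```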


Broader applications

While we tested on e-commerce, these results have implications for any complex business RL application:

  • Customer service automation: Rubrics for response quality, resolution effectiveness
  • Supply chain optimization: Components for cost, speed, reliability
  • Content recommendation: Metrics for relevance, diversity, engagement
  • Financial trading: Factors for risk, return, compliance

The key insight is that business tasks rarely have simple binary success criteria; they involve optimizing multiple competing objectives that can be measured and rewarded incrementally.


Limitations and future work

This proof of concept has several limitations worth acknowledging:

  • Single-domain testing: We only validated on e-commerce tasks
  • Limited scale: 200 episodes per method, though results were consistent
  • Simplified environment: Real e-commerce has additional complexity that we didn't capture (e.g., much more variance in buyer-seller markets, far larger product catalogs)
  • Manual rubric design: We hand-crafted rubrics rather than learning them automatically

Future experiments worth exploring:

  • Cross-domain validation (customer service, logistics, etc.)
  • Automated rubric discovery techniques
  • Integration with other RL improvements (curriculum learning, meta-learning)
  • Longer-term training to understand convergence properties

The bottom line

Our experiment provides concrete evidence that the RL community's intuitions about sparse rewards are correct, at least for complex business applications. Rubric-based reward engineering combined with improved exploration techniques like GRPO isn't just theoretically appealing; it delivers measurable improvements in both performance and training efficiency.

For organizations considering RL for business applications, the message is clear: don't default to sparse rewards. The upfront investment in rubric design and exploration strategy pays significant dividends in agent performance and training costs.

The techniques we tested aren't novel, but validating them on realistic business complexity helps bridge the gap between academic promise and practical deployment. Sometimes the most valuable experiments aren't about inventing new methods, but about proving that existing good ideas actually work in practice.