
Arjun Nargolwala · July 1, 2025

Building true RL systems: An experiment on solving real business tasks


Overview (TL;DR): We ran a comprehensive experiment comparing sparse rewards against rubric-based rewards + GRPO on a complex e-commerce task. The results validate what many in the RL community have suspected: there's a much better way to train agents for complex business applications.


The reinforcement learning community has long debated whether traditional sparse reward approaches are sufficient for complex, multi-step business tasks. While techniques like rubric-based reward engineering and Group Relative Policy Optimization (GRPO) have shown promise in academic settings, there's been limited validation on realistic business applications.

At Labelbox, we decided to put these approaches to the test. We built a comprehensive e-commerce shopping agent and ran head-to-head comparisons between traditional sparse rewards and the combination of rubric engineering + GRPO. Our goal wasn't to invent new techniques but to validate whether these promising approaches actually deliver better results on the kind of complex, multi-step tasks that businesses care about.

The answer? A resounding yes. Our proof of concept showed roughly 300% better performance with rubric rewards + GRPO compared to sparse rewards alone.


The experiment: Real-world e-commerce complexity

For testing, we built a slightly simplified environment that still captures much of the genuine complexity of business tasks. Our shopping agent needed to:

  • Search through a product catalog
  • Handle dynamic pricing and stock availability
  • Navigate search result quality variations
  • Manage budget constraints and user preferences
  • Deal with common friction points (out-of-stock items, price changes)
  • Complete end-to-end checkout processes

This mirrors the kind of multi-step, partially observable, constraint-heavy environment that makes business RL notoriously difficult.
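
To make this concrete, here is a minimal, hypothetical sketch of the kind of environment interface such a task implies. The class, field, and action names are our own illustrative assumptions, not the actual implementation we used.

```python
from dataclasses import dataclass, field


@dataclass
class ShoppingState:
    """Partial observation the agent sees at each step (illustrative fields)."""
    search_results: list = field(default_factory=list)  # products returned by the last query
    cart: list = field(default_factory=list)            # items selected so far
    budget_remaining: float = 100.0                      # user's budget constraint
    steps_taken: int = 0
    done: bool = False


class ShoppingEnv:
    """Multi-step e-commerce task: search, select, and check out under a budget.
    Prices and stock can change between steps, which is where friction shows up."""

    def reset(self) -> ShoppingState:
        """Sample a user request, a budget, and a catalog snapshot."""
        ...

    def step(self, action: dict) -> tuple[ShoppingState, float, bool]:
        """Apply an action such as {'type': 'search', 'query': ...},
        {'type': 'add_to_cart', 'product_id': ...}, or {'type': 'checkout'},
        and return (next_state, reward, done)."""
        ...
```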


Three approaches tested

We implemented three training approaches to compare their performance:

1. Traditional Sparse Rewards (Baseline)

The standard approach used in many RL applications (see the sketch after this list):

  • Task Success → +100 points
  • Task Failure → 0 points
  • No intermediate feedback during the episode
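
In code, this baseline amounts to a single terminal check. A minimal sketch, assuming a boolean success flag at episode end (the function and argument names are ours):

```python
def sparse_reward(done: bool, checkout_succeeded: bool) -> float:
    """+100 only when the episode ends in a successful checkout; 0 otherwise.
    The agent receives no credit for intermediate progress within the episode."""
    return 100.0 if done and checkout_succeeded else 0.0
```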

2. Rubric-Based Rewards

We decomposed the task into measurable components (see the scoring sketch after this list):

  • Search Quality (0-20 points): Relevance of found products
  • Product Selection (0-30 points): Match to user criteria and budget
  • Navigation Efficiency (0-25 points): Steps taken vs. optimal path
  • Budget Compliance (0-15 points): Staying within financial constraints
  • Task Completion (0-10 points): Successfully finishing checkout
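
A rubric like this turns the reward into a sum of bounded component scores. Here is a minimal sketch of how those caps could be combined; the component names mirror the list above, and the per-component evaluators are assumed to exist elsewhere:

```python
# Per-component caps mirror the rubric above; the keys are illustrative names.
RUBRIC_MAX = {
    "search_quality": 20,         # relevance of found products
    "product_selection": 30,      # match to user criteria and budget
    "navigation_efficiency": 25,  # steps taken vs. optimal path
    "budget_compliance": 15,      # staying within financial constraints
    "task_completion": 10,        # successfully finishing checkout
}


def rubric_reward(component_scores: dict[str, float]) -> float:
    """Clip each component score to its cap and sum them (max 100),
    giving the agent dense feedback instead of a single terminal signal."""
    return sum(
        min(max(component_scores.get(name, 0.0), 0.0), cap)
        for name, cap in RUBRIC_MAX.items()
    )
```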

3. GRPO + Rubric Rewards (Combined)

We added Group Relative Policy Optimization on top of rubric rewards (see the sketch after this list):

  • Sample 6 candidate actions at each step
  • Evaluate each using learned heuristics
  • Execute the best action from the group
  • Learn from action comparisons
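
The core idea is to score a group of sampled actions and credit each one relative to the group average. A minimal sketch, assuming a policy object with a sample_action method and an evaluate function standing in for the learned heuristics:

```python
def grpo_step(policy, evaluate, state, group_size: int = 6):
    """Sample a group of candidate actions, score each, execute the best,
    and return group-relative advantages for the policy update."""
    candidates = [policy.sample_action(state) for _ in range(group_size)]
    scores = [evaluate(state, a) for a in candidates]  # learned heuristic or rubric score
    baseline = sum(scores) / len(scores)               # group mean acts as the relative baseline
    advantages = [s - baseline for s in scores]        # credit each candidate relative to its peers
    best = max(range(group_size), key=lambda i: scores[i])
    # Execute the best action; the (candidate, advantage) pairs feed the policy update.
    return candidates[best], list(zip(candidates, advantages))
```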

Results: A clear performance hierarchy

We trained each approach for 200 episodes and measured performance across multiple metrics:

Method          | Success Rate | Avg Reward | Training Time
Sparse Rewards  | 18%          | 45.3       | 100% (baseline)
Rubric Rewards  | 42%          | 167.8      | 60% of baseline
GRPO + Rubric   | 65%          | 234.5      | 40% of baseline

The performance hierarchy was consistent across different environment difficulty levels and held up during out-of-sample testing on unseen product catalogs.


What the data tells us

Rubric engineering provides a crucial learning signal

The jump from 18% to 42% success rate when adding rubric rewards validates a core hypothesis: agents need intermediate feedback to learn complex behaviors efficiently. Without it, they struggle to connect actions early in an episode to eventual outcomes.

GRPO improves exploration quality

The additional boost from 42% to 65% success rate with GRPO demonstrates that exploration strategy matters enormously. Traditional single-action sampling left significant performance on the table.

Training efficiency compounds

Not only did the combined approach achieve better final performance, it got there faster. The 60% reduction in training time means these techniques aren't just more effective—they're more economical.

Behavior quality improves

Beyond success rates, we observed qualitatively better agent behaviors:

  • More systematic search strategies
  • Better handling of edge cases (stock outages, budget constraints)
  • More robust performance across different environment configurations
  • Clearer learning progression through intermediate skills

Practical implementation insights

Through this proof of concept, we learned several practical lessons about implementing these techniques:

Rubric design matters

Not all rubric decompositions work equally well. Effective rubrics need:

  • Measurable components that can be objectively evaluated
  • Progressive difficulty that creates natural learning curricula
  • Business alignment with weights reflecting actual priorities
  • Immediate feedback rather than delayed evaluation

GRPO requires tuning

The group size (we used 6 candidates) and evaluation heuristics need optimization for each domain. Too few candidates limit the exploration benefits; too many slow down training.

Environment complexity should scale

Starting with simpler versions and gradually increasing difficulty helped agents build foundational skills systematically.
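
One simple way to express this is a staged difficulty schedule. The thresholds and stage parameters below are hypothetical, purely to illustrate the idea:

```python
# Hypothetical difficulty schedule: grow the catalog and friction rate as the
# agent's rolling success rate clears each stage's threshold.
CURRICULUM = [
    {"catalog_size": 50,   "friction_rate": 0.0, "advance_at": 0.5},
    {"catalog_size": 500,  "friction_rate": 0.1, "advance_at": 0.5},
    {"catalog_size": 5000, "friction_rate": 0.3, "advance_at": None},  # final stage
]


def next_stage(stage: int, rolling_success: float) -> int:
    """Advance to the next curriculum stage once performance clears the threshold."""
    threshold = CURRICULUM[stage]["advance_at"]
    if threshold is not None and rolling_success >= threshold:
        return min(stage + 1, len(CURRICULUM) - 1)
    return stage
```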


Broader applications

While we tested on e-commerce, these results have implications for any complex business RL application:

  • Customer service automation: Rubrics for response quality, resolution effectiveness
  • Supply chain optimization: Components for cost, speed, reliability
  • Content recommendation: Metrics for relevance, diversity, engagement
  • Financial trading: Factors for risk, return, compliance

The key insight is that business tasks rarely have simple binary success criteria; they involve optimizing multiple competing objectives that can be measured and rewarded incrementally.


Limitations and future work

This proof of concept has several limitations worth acknowledging:

  • Single-domain testing: We only validated on e-commerce tasks
  • Limited scale: 200 episodes per method, though results were consistent
  • Simplified environment: Real e-commerce has additional complexity that we didn't capture (e.g., much more variance in buyer-seller markets, far larger product catalogs)
  • Manual rubric design: We hand-crafted rubrics rather than learning them automatically

Future experiments worth exploring:

  • Cross-domain validation (customer service, logistics, etc.)
  • Automated rubric discovery techniques
  • Integration with other RL improvements (curriculum learning, meta-learning)
  • Longer-term training to understand convergence properties

The bottom line

Our experiment provides concrete evidence that the RL community's intuitions about sparse rewards are correct, at least for complex business applications. Rubric-based reward engineering combined with improved exploration techniques like GRPO isn't just theoretically appealing; it delivers measurable improvements in both performance and training efficiency.

For organizations considering RL for business applications, the message is clear: don't default to sparse rewards. The upfront investment in rubric design and exploration strategy pays significant dividends in agent performance and training costs.

The techniques we tested aren't novel, but validating them on realistic business complexity helps bridge the gap between academic promise and practical deployment. Sometimes the most valuable experiments aren't about inventing new methods, but about proving that existing good ideas actually work in practice.