Arjun Nargolwala•July 1, 2025
Building true RL systems: An experiment on solving real business tasks

Overview (TLDR): We ran a comprehensive experiment comparing sparse rewards vs. rubric-based rewards + GRPO on a complex e-commerce task. The results validate what many in the RL community have suspected: there's a much better way to train agents for complex business applications.
The reinforcement learning community has long debated whether traditional sparse reward approaches are sufficient for complex, multi-step business tasks. While techniques like rubric-based reward engineering and Group Relative Policy Optimization (GRPO) have shown promise in academic settings, there's been limited validation on realistic business applications.
At Labelbox, we decided to put these approaches to the test. We built a comprehensive e-commerce shopping agent and ran head-to-head comparisons between traditional sparse rewards versus the combination of rubric engineering + GRPO. Our goal wasn't to invent new techniques but to validate whether these promising approaches actually deliver better results on the kind of complex, multi-step tasks that businesses care about.
The answer? A resounding yes. Our proof of concept showed roughly 300% better performance with rubric + GRPO compared to sparse rewards alone.
The experiment: Real-world e-commerce complexity
For the experiment, we built a slightly simplified environment that still captures some of the genuine complexity of business tasks. Our shopping agent needed to:
- Search through a product catalog
- Handle dynamic pricing and stock availability
- Navigate search result quality variations
- Manage budget constraints and user preferences
- Deal with common friction points (out-of-stock items, price changes)
- Complete end-to-end checkout processes
This mirrors the kind of multi-step, partially observable, constraint-heavy environment that makes business RL notoriously difficult.
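To make the setup concrete, here is a rough, gym-style sketch of what such an environment interface could look like. This is not Labelbox's actual implementation; the class, its fields, and the action types are illustrative assumptions.

```python
# Minimal sketch of a gym-style shopping environment. All names
# (ShoppingEnv, observation fields, action types) are illustrative
# assumptions, not the experiment's actual code.
import random
from dataclasses import dataclass, field

@dataclass
class ShoppingEnv:
    catalog: list            # product dicts: {"name", "price", "in_stock"}
    budget: float
    max_steps: int = 20
    steps: int = field(default=0, init=False)

    def reset(self):
        self.steps = 0
        # Simulate dynamic pricing and stock availability each episode.
        for p in self.catalog:
            p.setdefault("base_price", p["price"])
            p["price"] = p["base_price"] * random.uniform(0.9, 1.1)
            p["in_stock"] = random.random() > 0.1
        return {"budget": self.budget, "results": [], "cart": []}

    def step(self, state, action):
        """Apply an action ('search', 'add_to_cart', 'checkout')
        and return (next_state, done)."""
        self.steps += 1
        if action["type"] == "search":
            state["results"] = [p for p in self.catalog
                                if action["query"].lower() in p["name"].lower()]
        elif action["type"] == "add_to_cart":
            item = action["item"]
            # Enforce stock availability and the budget constraint.
            if item["in_stock"] and item["price"] <= state["budget"]:
                state["cart"].append(item)
                state["budget"] -= item["price"]
        done = action["type"] == "checkout" or self.steps >= self.max_steps
        return state, done
```

An agent interacting with this interface alternates search, add-to-cart, and checkout actions while staying within the budget and stock constraints listed above.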
Three approaches tested
We implemented three training approaches to compare their performance head-to-head:
1. Traditional Sparse Rewards (Baseline)
The standard approach used in many RL applications:
- Task Success → +100 points
- Task Failure → 0 points
- No intermediate feedback during the episode
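In code, this baseline boils down to a single terminal signal. A minimal sketch, using the point values above:

```python
def sparse_reward(episode_succeeded: bool) -> float:
    """Baseline: one terminal reward, no intermediate feedback."""
    return 100.0 if episode_succeeded else 0.0
```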
2. Rubric-Based Rewards
Decomposed the task into measurable components:
- Search Quality (0-20 points): Relevance of found products
- Product Selection (0-30 points): Match to user criteria and budget
- Navigation Efficiency (0-25 points): Steps taken vs. optimal path
- Budget Compliance (0-15 points): Staying within financial constraints
- Task Completion (0-10 points): Successfully finishing checkout
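Below is a rough sketch of how such a rubric could be scored. Only the point ranges come from the rubric above; the per-component scoring heuristics and the `episode` fields are illustrative assumptions.

```python
def rubric_reward(episode: dict) -> float:
    """Score an episode against the five rubric components.
    `episode` is assumed to expose the fields used below."""
    score = 0.0
    # Search quality (0-20): fraction of retrieved products that are relevant.
    score += 20.0 * episode["relevant_results"] / max(episode["total_results"], 1)
    # Product selection (0-30): match to user criteria and budget, in [0, 1].
    score += 30.0 * episode["criteria_match"]
    # Navigation efficiency (0-25): optimal path length vs. steps actually taken.
    score += 25.0 * min(episode["optimal_steps"] / max(episode["steps_taken"], 1), 1.0)
    # Budget compliance (0-15): full points if the cart stays within budget.
    score += 15.0 if episode["total_spent"] <= episode["budget"] else 0.0
    # Task completion (0-10): checkout finished successfully.
    score += 10.0 if episode["checked_out"] else 0.0
    return score  # 0-100, same scale as the sparse reward
```

Keeping the total on the same 0-100 scale as the sparse baseline makes the two reward schemes directly comparable.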
3. GRPO + Rubric Rewards (Combined)
Added Group Relative Policy Optimization on top of rubric rewards:
- Sample 6 candidate actions at each step
- Evaluate each using learned heuristics
- Execute the best action from the group
- Learn from action comparisons
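The group-sampling step can be sketched roughly as follows. This is a simplified illustration of the procedure described above rather than a full GRPO training loop; the `policy` and `heuristic_value` callables and the advantage normalization details are assumptions.

```python
import statistics

GROUP_SIZE = 6  # candidates sampled per step, as in the experiment

def grpo_step(policy, heuristic_value, state):
    """Sample a group of candidate actions, score them, pick the best one
    to execute, and return group-relative advantages for the policy update."""
    candidates = [policy.sample(state) for _ in range(GROUP_SIZE)]
    scores = [heuristic_value(state, a) for a in candidates]

    # Group-relative advantages: each candidate is compared to the group
    # mean and scaled by the group's spread, so no separate critic is needed.
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0
    advantages = [(s - mean) / std for s in scores]

    best_action = candidates[scores.index(max(scores))]
    return best_action, list(zip(candidates, advantages))
```

Keeping the comparison within the sampled group is what makes the advantage estimate "group relative": candidates are judged against each other rather than against a separately trained value function.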
Results: A clear performance hierarchy
We trained each approach for 200 episodes and measured performance across multiple metrics.
The performance hierarchy was consistent across different environment difficulty levels and held up during out-of-sample testing on unseen product catalogs.
What the data tells us
Rubric engineering provides a crucial learning signal
The jump from 18% to 42% success rate when adding rubric rewards validates a core hypothesis: agents need intermediate feedback to learn complex behaviors efficiently. Without it, they struggle to connect actions early in an episode to eventual outcomes.
GRPO improves exploration quality
The additional boost from 42% to 65% success rate with GRPO demonstrates that exploration strategy matters enormously. Traditional single-action sampling left significant performance on the table.
Training efficiency compounds
Not only did the combined approach achieve better final performance, it got there faster. The 60% reduction in training time means these techniques aren't just more effective—they're more economical.
Behavior quality improves
Beyond success rates, we observed qualitatively better agent behaviors:
- More systematic search strategies
- Better handling of edge cases (stock outages, budget constraints)
- More robust performance across different environment configurations
- Clearer learning progression through intermediate skills
Practical implementation insights
Through this proof of concept, we learned several practical lessons about implementing these techniques:
Rubric design matters
Not all rubric decompositions work equally well. Effective rubrics need:
- Measurable components that can be objectively evaluated
- Progressive difficulty that creates natural learning curricula
- Business alignment with weights reflecting actual priorities
- Immediate feedback rather than delayed evaluation
GRPO requires tuning
The group size (we used 6 candidates) and the evaluation heuristics need tuning for each domain: too few candidates limit the exploration benefit, while too many slow training down (see the sketch below).
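One way to choose the group size is a simple sweep, trading success rate off against training cost. A rough sketch, where `run_training` is a hypothetical stand-in for the full training pipeline:

```python
def sweep_group_size(run_training, sizes=(2, 4, 6, 8, 12)):
    """Train once per candidate group size and record success rate vs. cost.
    `run_training(group_size)` is a hypothetical helper returning
    (success_rate, train_hours)."""
    results = {}
    for k in sizes:
        success_rate, train_hours = run_training(group_size=k)
        results[k] = {"success_rate": success_rate, "train_hours": train_hours}
    # Small groups explore less; large groups cost more per step, so pick
    # the knee of the curve rather than the absolute best success rate.
    return results
```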
Environment complexity should scale
Starting with simpler versions and gradually increasing difficulty helped agents build foundational skills systematically.
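A minimal sketch of that curriculum idea: promote the agent to a harder environment configuration once it clears a success-rate threshold. The difficulty levels, knobs, and threshold below are illustrative assumptions.

```python
def curriculum_difficulty(recent_success_rate, current_level,
                          max_level=3, promote_at=0.6):
    """Advance to a harder environment configuration once the agent is
    reliably solving the current one."""
    if recent_success_rate >= promote_at and current_level < max_level:
        return current_level + 1
    return current_level

# Example mapping from level to environment parameters (illustrative values).
DIFFICULTY = {
    0: {"catalog_size": 20,   "stockout_rate": 0.00, "price_noise": 0.00},
    1: {"catalog_size": 100,  "stockout_rate": 0.05, "price_noise": 0.05},
    2: {"catalog_size": 500,  "stockout_rate": 0.10, "price_noise": 0.10},
    3: {"catalog_size": 2000, "stockout_rate": 0.20, "price_noise": 0.15},
}
```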
Broader applications
While we tested on e-commerce, these results have implications for any complex business RL application:
- Customer service automation: Rubrics for response quality, resolution effectiveness
- Supply chain optimization: Components for cost, speed, reliability
- Content recommendation: Metrics for relevance, diversity, engagement
- Financial trading: Factors for risk, return, compliance
The key insight is that business tasks rarely have simple binary success criteria; they involve optimizing multiple competing objectives that can be measured and rewarded incrementally.
Limitations and future work
This proof of concept has several limitations worth acknowledging:
- Single domain testing: We only validated on e-commerce tasks
- Limited scale: 200 episodes per method, though results were consistent
- Simplified environment: Real e-commerce has additional complexity that we didn't capture (e.g., much more variance in buyer-seller markets, far larger product catalogs)
- Manual rubric design: We hand-crafted rubrics rather than learning them automatically
Future experiments worth exploring:
- Cross-domain validation (customer service, logistics, etc.)
- Automated rubric discovery techniques
- Integration with other RL improvements (curriculum learning, meta-learning)
- Longer-term training to understand convergence properties
The bottom line
Our experiment provides concrete evidence that the RL community's intuitions about the limitations of sparse rewards are correct, at least for complex business applications. Rubric-based reward engineering combined with improved exploration techniques like GRPO isn't just theoretically appealing; it delivers measurable improvements in both performance and training efficiency.
For organizations considering RL for business applications, the message is clear: don't default to sparse rewards. The upfront investment in rubric design and exploration strategy pays significant dividends in agent performance and training costs.
The techniques we tested aren't novel, but validating them on realistic business complexity helps bridge the gap between academic promise and practical deployment. Sometimes the most valuable experiments aren't about inventing new methods, but about proving that existing good ideas actually work in practice.