Abstract
Learning high-quality Q-value functions plays a key role in the success of many modern off-policy deep reinforcement learning (RL) algorithms. Previous works primarily focus on addressing the value overestimation issue, an outcome of adopting function approximators and off-policy learning. Deviating from the common viewpoint, we observe that Q-values are in fact underestimated in the later stages of RL training, primarily because Bellman updates bootstrap from current-policy actions that are inferior to the better action samples already stored in the replay buffer. We hypothesize that this long-neglected phenomenon potentially hinders policy learning and reduces sample efficiency. Our insight to address this issue is to incorporate sufficient exploitation of past successes while maintaining exploration optimism. We propose the Blended Exploitation and Exploration (BEE) operator, a simple yet effective approach that updates Q-values using both the historical best-performing actions and the current policy. Instantiations of our method in both model-free and model-based settings outperform state-of-the-art methods on various continuous control tasks and achieve strong performance in failure-prone scenarios and on real-world robot tasks.
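To make the idea concrete, below is a minimal PyTorch-style sketch of a blended Bellman target in the spirit of the BEE operator. The blending weight `lam`, the candidate actions `buffer_next_actions` drawn from the replay buffer, the in-sample maximum used as the exploitation estimate, and the SAC-style entropy term are illustrative assumptions rather than the exact formulation in the paper.

```python
import torch

def bee_target(q1, q2, policy, rewards, next_obs, buffer_next_actions,
               gamma=0.99, lam=0.5, alpha=0.2):
    """Illustrative blended Bellman target (a sketch, not the authors' exact method).

    Mixes an exploitation estimate built from historical actions stored in the
    replay buffer with the usual exploration estimate built from actions
    sampled by the current policy. Target networks are omitted for brevity.
    """
    with torch.no_grad():
        # Exploitation term: evaluate candidate actions taken from the replay
        # buffer at the next states and keep the best value per state.
        # buffer_next_actions is assumed to have shape [batch, n_candidates, act_dim].
        batch, n_cand, _ = buffer_next_actions.shape
        obs_rep = next_obs.unsqueeze(1).expand(batch, n_cand, next_obs.shape[-1])
        q_cand = torch.min(q1(obs_rep, buffer_next_actions),
                           q2(obs_rep, buffer_next_actions))   # [batch, n_candidates]
        v_exploit = q_cand.max(dim=1).values                   # [batch]

        # Exploration term: standard entropy-regularized estimate under the
        # current policy (SAC-style), using clipped double-Q values.
        new_actions, log_prob = policy(next_obs)
        v_explore = torch.min(q1(next_obs, new_actions),
                              q2(next_obs, new_actions)) - alpha * log_prob

        # Blend the two value estimates and form the one-step TD target.
        v_next = lam * v_exploit + (1.0 - lam) * v_explore
        return rewards + gamma * v_next
```

Under these assumptions, a larger `lam` leans more heavily on replayed successes, while `lam = 0` recovers the usual policy-only target.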
Real-World Validation
We conduct real-world validation of BAC on a cost-effective D'Kitty robot tasked with traversing complex terrains and reaching a target point. In this demanding setup, BAC surpasses TD3 and SAC, showcasing stable and natural gaits; TD3 tends to adopt lower postures such as knee-walking, while SAC exhibits more oscillatory gait patterns.
Smooth Road
Rough Stone Road
Uphill Stone Road
Grassland
DogRun
BAC successfully solves the complex Dog tasks, which have a 38-dimensional continuous action space, where other baselines struggle to learn meaningful behaviors. To the best of our knowledge, this is the first documented result of a model-free method effectively tackling these challenging Dog tasks.
HumanoidStandup
In the high-dimensional, failure-prone HumanoidStandup task, BAC achieves returns of roughly 280,000 at 2.5 million steps and 360,000 at 5 million steps, about 2.1x the evaluation scores of the strongest baseline. BAC swiftly attains a stable standing pose, whereas the SAC agent ends up in a wobbling kneeling posture, the DAC agent sits on the ground, and the RRS agent rolls around.
Benchmark Results
We propose BAC (a model-free algorithm) and MB-BAC (a model-based algorithm), both built on the BEE operator. Experimental results on several continuous control benchmark tasks illustrate the effectiveness of the BEE operator across both model-free and model-based paradigms.
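For illustration, a model-free critic update built around such a blended target might look like the sketch below; it reuses the hypothetical `bee_target` from the earlier sketch, and the optimizer handling and batch layout are assumptions, not the released implementation.

```python
import torch.nn.functional as F

def critic_step(q1, q2, q1_opt, q2_opt, policy, batch, gamma=0.99, lam=0.5):
    """Sketch of one critic update that regresses both Q-networks onto the
    blended target produced by `bee_target` above (hypothetical batch layout)."""
    target = bee_target(q1, q2, policy,
                        batch["rewards"], batch["next_obs"],
                        batch["buffer_next_actions"],
                        gamma=gamma, lam=lam)
    # Clipped double-Q regression toward the shared blended target.
    loss = (F.mse_loss(q1(batch["obs"], batch["actions"]), target)
            + F.mse_loss(q2(batch["obs"], batch["actions"]), target))
    q1_opt.zero_grad()
    q2_opt.zero_grad()
    loss.backward()
    q1_opt.step()
    q2_opt.step()
    return loss.item()
```

In a model-based instantiation, one would expect the same target computation to be applied to transitions generated by a learned dynamics model rather than only to environment data.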
BAC in MuJoCo Benchmarks
We evaluate BAC and five state-of-the-art baselines on a set of MuJoCo continuous control tasks, including Hopper, Walker2d, Swimmer, Ant, Humanoid, and HumanoidStandup. BAC surpasses all baselines in final performance while also achieving better sample efficiency.
MB-BAC in MuJoCo Benchmarks
We evaluate MB-BAC against six popular model-based and model-free baselines. The results show that MB-BAC learns faster than other modern model-based RL methods and achieves promising asymptotic performance compared with its model-free counterparts. These results highlight the universality of the BEE operator.