We present COPlanner, a planning-driven framework for model-based reinforcement learning that mitigates the impact of imperfect world models and boosts policy learning. COPlanner is a plug-and-play framework that can be applied to any dyna-style model-based method.
Abstract
Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning, and real-environment exploration using the current policy for dynamics-model learning. However, due to the complexity of real-world environments, it is inevitable to learn a dynamics model with prediction errors, which can mislead policy learning and result in sub-optimal solutions. In this paper, we propose COPlanner, a planning-driven framework for model-based methods that addresses the problem of an inaccurately learned dynamics model through conservative model rollouts and optimistic environment exploration. COPlanner leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. The estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real-environment exploration when choosing actions. Consequently, COPlanner can avoid model-uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it actively reduces model error by exploring high-reward model-uncertain regions through optimistic real-environment exploration. COPlanner is a plug-and-play framework that can be applied to any dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous-control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods improve significantly when combined with COPlanner.
Uncertainty-aware Policy-guided MPC (UP-MPC)
We adopt random shooting for action selection before both model rollouts and environment interaction. The process assesses the future uncertainty induced by each candidate action through multi-step planning. During model rollouts, this uncertainty serves as a penalty that restricts action choices, encouraging the agent to improve the policy within regions where the model is certain, thereby avoiding model error. During interaction with the real environment, the uncertainty acts as a bonus, motivating the agent to explore regions that are uncertain under the model but potentially high in reward, thereby reducing world-model error. A minimal sketch of this selection loop is given below.
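The following sketch illustrates UP-MPC-style action selection, assuming an ensemble world model whose per-step disagreement approximates model uncertainty. The names `policy.sample`, `m.step`, `n_candidates`, `horizon`, and `alpha` are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of uncertainty-aware, policy-guided action selection.
# Assumptions: `policy.sample(state)` draws a candidate action, and each
# ensemble member `m.step(state, action)` returns (next_state, reward).
import numpy as np

def up_mpc_select_action(state, policy, ensemble, n_candidates=10,
                         horizon=5, alpha=1.0, mode="rollout"):
    """Pick the candidate action whose imagined H-step trajectory scores best.

    mode="rollout": uncertainty is a penalty (conservative model rollouts).
    mode="explore": uncertainty is a bonus (optimistic environment exploration).
    """
    sign = -1.0 if mode == "rollout" else 1.0
    best_action, best_score = None, -np.inf
    for _ in range(n_candidates):
        first_action = policy.sample(state)   # random shooting around the policy
        s, a = state, first_action
        total_reward, total_unc = 0.0, 0.0
        for _ in range(horizon):
            # Every ensemble member predicts the next state and reward.
            preds = [m.step(s, a) for m in ensemble]
            next_states = np.stack([p[0] for p in preds])
            rewards = np.array([p[1] for p in preds])
            # Disagreement across members approximates epistemic model uncertainty.
            total_unc += next_states.std(axis=0).mean()
            total_reward += rewards.mean()
            s = next_states.mean(axis=0)      # continue the imagined trajectory
            a = policy.sample(s)              # later steps follow the current policy
        score = total_reward + sign * alpha * total_unc
        if score > best_score:
            best_score, best_action = score, first_action
    return best_action
```

The same routine serves both phases: `mode="rollout"` subtracts the accumulated uncertainty so that model rollouts stay conservative, while `mode="explore"` adds it to drive optimistic exploration of the real environment.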
Comparison with end-to-end world model-based methods
Comparison with latent world model-based methods
Visualized Comparison with DreamerV3
Paper
COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL
Xiyao Wang, Ruijie Zheng, Yanchao Sun, Ruonan Jia, Wichayaporn Wongkamjan, Huazhe Xu, Furong Huang
ICLR 2024
Citation
If you find our work useful, please consider citing the paper as follows:
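A BibTeX entry assembled from the title, author list, and venue above (the citation key is illustrative):

```bibtex
@inproceedings{wang2024coplanner,
  title     = {{COPlanner}: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based {RL}},
  author    = {Wang, Xiyao and Zheng, Ruijie and Sun, Yanchao and Jia, Ruonan and Wongkamjan, Wichayaporn and Xu, Huazhe and Huang, Furong},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}
}
```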