We present COPlanner, a planning-driven framework for model-based reinforcement learning that mitigates the impact of imperfect world models and boosts policy learning. COPlanner is a plug-and-play framework that can be applied to any dyna-style model-based method.
Abstract
Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning, and real-environment exploration using the current policy for dynamics-model learning. However, due to the complexity of real-world environments, it is inevitable to learn a dynamics model with prediction errors, which can mislead policy learning and result in sub-optimal solutions. In this paper, we propose COPlanner, a planning-driven framework for model-based methods that addresses the problem of an inaccurately learned dynamics model through conservative model rollouts and optimistic environment exploration. COPlanner leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. The estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real-environment exploration when choosing actions. Consequently, COPlanner can avoid model-uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it actively reduces model error by exploring high-reward model-uncertain regions through optimistic real-environment exploration. COPlanner is a plug-and-play framework that can be applied to any dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous-control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods improve significantly when combined with COPlanner.
Uncertainty-aware Policy-guided MPC (UP-MPC)
We adopt random shooting for action selection before both model rollouts and environment interaction. The process assesses the future uncertainty induced by each candidate action through multi-step planning. During model rollouts, this uncertainty serves as a penalty that restricts action choices, encouraging the agent to improve the policy within regions where the model is certain, thereby avoiding model error. During interaction with the real environment, the uncertainty acts as a bonus, motivating the agent to explore regions that are uncertain under the model but potentially high in reward, thereby reducing world-model error. A minimal sketch of this selection loop is given below.
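The following sketch illustrates UP-MPC-style action selection, assuming an ensemble world model whose per-step disagreement approximates model uncertainty. The names `policy.sample`, `m.step`, `n_candidates`, `horizon`, and `alpha` are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of uncertainty-aware, policy-guided action selection.
# Assumptions: `policy.sample(state)` draws a candidate action, and each
# ensemble member `m.step(state, action)` returns (next_state, reward).
import numpy as np

def up_mpc_select_action(state, policy, ensemble, n_candidates=10,
                         horizon=5, alpha=1.0, mode="rollout"):
    """Pick the candidate action whose imagined H-step trajectory scores best.

    mode="rollout": uncertainty is a penalty (conservative model rollouts).
    mode="explore": uncertainty is a bonus (optimistic environment exploration).
    """
    sign = -1.0 if mode == "rollout" else 1.0
    best_action, best_score = None, -np.inf
    for _ in range(n_candidates):
        first_action = policy.sample(state)   # random shooting around the policy
        s, a = state, first_action
        total_reward, total_unc = 0.0, 0.0
        for _ in range(horizon):
            # Every ensemble member predicts the next state and reward.
            preds = [m.step(s, a) for m in ensemble]
            next_states = np.stack([p[0] for p in preds])
            rewards = np.array([p[1] for p in preds])
            # Disagreement across members approximates epistemic model uncertainty.
            total_unc += next_states.std(axis=0).mean()
            total_reward += rewards.mean()
            s = next_states.mean(axis=0)      # continue the imagined trajectory
            a = policy.sample(s)              # later steps follow the current policy
        score = total_reward + sign * alpha * total_unc
        if score > best_score:
            best_score, best_action = score, first_action
    return best_action
```

The same routine serves both phases: `mode="rollout"` subtracts the accumulated uncertainty so that model rollouts stay conservative, while `mode="explore"` adds it to drive optimistic exploration of the real environment.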
Comparison with end-to-end world model-based methods
Comparison with latent world model-based methods
Visualized Comparison with DreamerV3
Paper
COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL
Xiyao Wang, Ruijie Zheng, Yanchao Sun, Ruonan Jia, Wichayaporn Wongkamjan, Huazhe Xu, Furong Huang
ICLR 2024
Citation
If you find our work useful, please consider citing the paper as follows:
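A BibTeX entry assembled from the title, author list, and venue above (the citation key is illustrative):

```bibtex
@inproceedings{wang2024coplanner,
  title     = {{COPlanner}: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based {RL}},
  author    = {Wang, Xiyao and Zheng, Ruijie and Sun, Yanchao and Jia, Ruonan and Wongkamjan, Wichayaporn and Xu, Huazhe and Huang, Furong},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2024}
}
```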