Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy


Xiyao Wang1    Wichayaporn Wongkamjan1   Ruonan Jia2     Furong Huang1
1 University of Maryland, College Park      2 Tsinghua University
[Paper]  [Code] 

Abstract


Model-based reinforcement learning (RL) often achieves higher sample efficiency in practice than model-free RL by learning a dynamics model to generate samples for policy learning. Previous works learn a dynamics model that fits under the empirical state-action visitation distribution for all historical policies, i.e., the sample replay buffer. However, in this paper, we observe that fitting the dynamics model under the distribution for all historical policies does not necessarily benefit model prediction for the current policy since the policy in use is constantly evolving over time. The evolving policy during training will cause state-action visitation distribution shifts. We theoretically analyze how this distribution shift over historical policies affects the model learning and model rollouts. We then propose a novel dynamics model learning method, named Policy-adapted Dynamics Model Learning (PDML). PDML dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance combined with the state-of-the-art model-based RL methods.


Mismatch between model learning and model rollouts


State-action visitation distribution shift during policy learning

Because the policy keeps changing during training, the state-action visitation distributions of policies at different environment steps differ substantially, which leads to a large state-action visitation distribution shift within the replay buffer.
[Figures: state-action visitation distributions of policies at different environment steps in HalfCheetah and Hopper]

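To make this shift concrete, here is a minimal sketch (not the analysis code used in the paper) that quantifies the movement between two policy checkpoints by fitting a Gaussian to each batch of state-action samples and computing the closed-form KL divergence between the two fits. The Gaussian approximation and the placeholder data are assumptions made purely for illustration.

# Minimal sketch (illustrative only): estimate the visitation shift between
# two policy snapshots via a Gaussian fit to their (s, a) samples.
import numpy as np

def gaussian_kl(x_old: np.ndarray, x_new: np.ndarray, eps: float = 1e-6) -> float:
    """KL( N(mu_old, S_old) || N(mu_new, S_new) ) fitted to (s, a) samples."""
    mu0, mu1 = x_old.mean(0), x_new.mean(0)
    s0 = np.cov(x_old, rowvar=False) + eps * np.eye(x_old.shape[1])
    s1 = np.cov(x_new, rowvar=False) + eps * np.eye(x_new.shape[1])
    s1_inv = np.linalg.inv(s1)
    d = mu1 - mu0
    return 0.5 * (np.trace(s1_inv @ s0) + d @ s1_inv @ d
                  - x_old.shape[1] + np.log(np.linalg.det(s1) / np.linalg.det(s0)))

# Usage: stack state-action pairs collected by two checkpoints of the policy.
old_sa = np.random.randn(5000, 20)        # placeholder for (s, a) at step t
new_sa = np.random.randn(5000, 20) + 0.5  # placeholder for (s, a) at step t + k
print("visitation shift (Gaussian KL):", gaussian_kl(old_sa, new_sa))
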
Prediction error curves of MBPO

Overall error denotes the model prediction error on samples from all historical policies, and current error denotes the model prediction error on samples from the current policy. We observe a gap between the two: although the agent can learn a dynamics model that fits all samples collected by historical policies well, this comes at the expense of prediction accuracy on the samples induced by the current policy.
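
The two curves can be reproduced in spirit with the sketch below. It assumes a hypothetical learned model exposing predict(state, action) and a replay buffer returning (states, actions, next_states) batches; these interfaces are placeholders rather than MBPO's actual API.

# Sketch of the two error curves discussed above (hypothetical interfaces).
import numpy as np

def one_step_error(model, transitions):
    """Mean squared one-step prediction error over a batch of transitions."""
    states, actions, next_states = transitions
    pred = model.predict(states, actions)
    return float(np.mean(np.sum((pred - next_states) ** 2, axis=-1)))

def overall_and_current_error(model, replay_buffer, current_policy_rollouts):
    # "Overall" error: transitions sampled uniformly over all historical policies.
    overall = one_step_error(model, replay_buffer.sample(10_000))
    # "Current" error: transitions freshly collected with the current policy.
    current = one_step_error(model, current_policy_rollouts)
    return overall, current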


Method: Policy-adapted Dynamics Model Learning


Our main idea is to adjust the policy mixture distribution so that it continuously adapts to the current policy; an illustrative sketch of the resulting data re-weighting follows the components below.
Weights designed for historical policies:

Weights designed for the current policy:

Estimation of the policy distribution shift:



Experiment results


MuJoCo experiment

We combine PDML with MBPO, currently one of the most popular MBRL methods, and compare our method (PDML-MBPO) against several previous state-of-the-art baselines, including MBPO, AMPO, VaGraM, SAC, and REDQ. PDML-MBPO outperforms all existing state-of-the-art methods, both model-based and model-free, in sample efficiency in five environments, and achieves competitive sample efficiency in Ant. In addition, PDML-MBPO obtains significantly better asymptotic performance than the other state-of-the-art model-based methods. It is worth noting that the asymptotic performance of PDML-MBPO is very close to that of SAC in four environments (Hopper, Walker2d, Humanoid, and Pusher) and occasionally even exceeds it. Furthermore, our method achieves an impressive improvement in the most complex environment, Humanoid.

Prediction errors compared with global dynamics model (MBPO)

We evaluate the one-step model prediction error for the current policy on Hopper, HalfCheetah, and Walker2d, and we compare the multi-step compounding error of model rollouts between the policy-adapted dynamics model and the original dynamics model on Hopper.
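
As a rough illustration of how the multi-step compounding error can be measured (hypothetical interfaces, not the evaluation code from the paper): roll the learned model open-loop under the same action sequence as the real environment and accumulate the per-step deviation.

# Sketch: compounding error of an open-loop model rollout vs. the real
# environment, using the same action sequence for both (hypothetical API).
import numpy as np

def compounding_error(env, model, policy, horizon=10):
    true_state = env.reset()
    model_state = true_state.copy()
    errors = []
    for _ in range(horizon):
        action = policy.act(true_state)                   # shared action sequence
        true_state, _, done, _ = env.step(action)         # ground-truth transition
        model_state = model.predict(model_state, action)  # open-loop model rollout
        errors.append(float(np.sum((model_state - true_state) ** 2)))
        if done:
            break
    return errors  # per-step squared error; grows as model error compounds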

Comparison with model-free experience replay methods

Traditional model-free prioritized experience replay methods fail to help in more complex environments (such as Walker2d), whereas our method still achieves the largest improvement.



Cite This Paper


@inproceedings{wang2023live,
  title={Live in the moment: Learning dynamics model adapted to evolving policy},
  author={Wang, Xiyao and Wongkamjan, Wichayaporn and Jia, Ruonan and Huang, Furong},
  booktitle={International Conference on Machine Learning},
  pages={36470--36493},
  year={2023},
  organization={PMLR}
}