Dyna-Validator: A Model-based Reinforcement Learning Method with Validated Simulated Experiences

Authors

  • Hengsheng Zhang School of Computer Science, Northwestern Polytechnical University Xi’an, 710072, Shaanxi, China
  • Jingchen Li School of Computer Science, Northwestern Polytechnical University Xi’an, 710072, Shaanxi, China
  • Ziming He School of Computer Science, Northwestern Polytechnical University Xi’an, 710072, Shaanxi, China
  • Jinhui Zhu School of Computer Science, Northwestern Polytechnical University Xi’an, 710072, Shaanxi, China
  • Haobin Shi School of Computer Science, Northwestern Polytechnical University Xi’an, 710072, Shaanxi, China

DOI:

https://doi.org/10.15837/ijccc.2023.5.5073

Keywords:

Model-based reinforcement learning (MBRL), Dyna, Simulated annealing

Abstract

Dyna is a planning paradigm that naturally weaves learning and planning together through an environment model. Dyna-style reinforcement learning improves sample efficiency by using simulated experience generated by the environment model to update the value function. However, existing Dyna-style planning methods are usually tabular and therefore suited only to tasks with low-dimensional, small-scale state spaces. In addition, the quality of the simulated experience they generate cannot be guaranteed, which significantly limits their application to tasks such as high-dimensional continuous robot control and autonomous driving. To this end, we propose a model-based approach that controls planning through a validator. The validator filters high-quality experiences for policy learning and decides when to stop planning. To address the exploration-exploitation dilemma in reinforcement learning, an action selection strategy is designed that combines an ϵ-greedy policy with a simulated annealing (SA) cooling schedule. The performance of the proposed method is demonstrated on a set of classic Atari games. Experimental results show that learning a dynamics model improves sample efficiency in some games, and this benefit is maximized by choosing a proper number of planning steps. During the planning phase, our method narrows the gap to the current state-of-the-art model-based method, MuZero. Achieving a good compromise between model accuracy and planning horizon requires controlling planning appropriately. Applying the method to a physical robot system helps reduce the influence of an imprecise depth prediction model on the task. Without human supervision, it becomes easier to collect training data and learn complex skills (such as grasping and carrying items), while generalizing more effectively to previously unseen tasks.
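
The two mechanisms described in the abstract, validator-gated planning and an ϵ-greedy action selector tempered by a simulated annealing (SA) cooling schedule, can be illustrated as follows. This is a minimal sketch only, assuming a tabular Q-function over discrete state indices; the model.predict, validator.score, and validator.should_stop interfaces, along with all parameter values, are hypothetical placeholders rather than the paper's actual implementation.

    import numpy as np

    def sa_epsilon_greedy(Q, state, epsilon, temperature, rng):
        # Exploration: with probability epsilon, sample from a Boltzmann
        # distribution whose sharpness is set by the SA temperature
        # (high temperature -> near-uniform, low temperature -> near-greedy).
        q = Q[state]
        if rng.random() < epsilon:
            prefs = (q - q.max()) / max(temperature, 1e-8)
            probs = np.exp(prefs)
            probs /= probs.sum()
            return int(rng.choice(len(q), p=probs))
        # Exploitation: otherwise act greedily with respect to Q.
        return int(np.argmax(q))

    def cool(temperature, decay=0.995, t_min=0.05):
        # Geometric cooling schedule, a standard choice in simulated annealing.
        return max(t_min, temperature * decay)

    def validated_planning(Q, model, validator, start_states, alpha=0.1,
                           gamma=0.99, max_steps=50, threshold=0.5, rng=None):
        # Dyna-style planning phase: simulated transitions come from the learned
        # model, but only those the validator trusts are used to update Q, and
        # the validator may also terminate planning early.
        rng = rng or np.random.default_rng()
        for step in range(max_steps):
            s = int(rng.choice(start_states))
            a = int(np.argmax(Q[s]))
            s_next, r = model.predict(s, a)         # hypothetical model interface
            score = validator.score(s, a, s_next, r)
            if score < threshold:                   # discard low-quality experience
                continue
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            if validator.should_stop(step, score):  # stop planning when not useful
                break

In this reading, the temperature would be cooled between episodes so that exploration gradually gives way to exploitation, while the validator threshold controls how aggressively simulated experience is filtered.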

References

Kim, H.J., Madhavi, S. A Reinforcement Learning Model for Quantum Network Data Aggregation and Analysis[J]. Journal of System and Management Sciences, 2022, 12(1): 283-293. https://doi.org/10.33168/JSMS.2022.0120.

https://doi.org/10.3390/app12073520

Park, H.J., Kim, S.C. An Efficient Packet Transmission Protocol Using Reinforcing Learning in Wireless Sensor Networks[J]. Journal of System and Management Sciences, 2021, 11(2): 65-76. https://doi.org/10.33168/JSMS.2021.0205.

Jie, W.J., Connie, T., Goh, M.K.O. Forward collision warning for autonomous driving[J]. Journal of Logistics, Informatics and Service Science, 2022, 9(3): 208-225. https://doi.org/10.33168/LISS.2022.0315.

Hussein, A.K. Feature weighting based food recognition system[J]. Journal of Logistics, Informatics and Service Science, 2022, 9(3): 191-207. https://doi.org/10.33168/LISS.2022.0314.

Anđelić N, Car Z, Šercer M. Neural Network-Based Model for Classification of Faults During Operation of a Robotic Manipulator[J]. Tehnički vjesnik, 2021, 28(4): 1380-1387. https://doi.org/10.17559/TV-20201112163731

Anđelić N, Car Z, Šercer M. Prediction of Robot Grasp Robustness using Artificial Intelligence Algorithms[J]. Tehnički vjesnik, 2022, 29(1): 101-107. https://doi.org/10.17559/TV-20210204092154

Li Y, He Z, Gu X, et al. AFedAvg: communication-efficient federated learning aggregation with adaptive communication frequency and gradient sparse[J]. Journal of Experimental & Theoretical Artificial Intelligence, 2022: 1-23. https://doi.org/10.1080/0952813X.2022.2079730

Shi H, Li J, Mao J, et al. Lateral transfer learning for multiagent reinforcement learning[J]. IEEE Transactions on Cybernetics, 2021.

Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. MIT press, 2018.

Chen L, Deng Y, Cheong K H. Probability transformation of mass function: A weighted network method based on the ordered visibility graph[J]. Engineering Applications of Artificial Intelligence, 2021, 105: 104438. https://doi.org/10.1016/j.engappai.2021.104438

Li J, Shi H, Hwang K S. An explainable ensemble feedforward method with Gaussian convolutional filter[J]. Knowledge-Based Systems, 2021, 225: 107103. https://doi.org/10.1016/j.knosys.2021.107103

Agarwal A, Kakade S, Yang L F. Model-based reinforcement learning with a generative model is minimax optimal[C]//Conference on Learning Theory. PMLR, 2020: 67-83.

Plaat A, Kosters W, Preuss M. High-Accuracy Model-Based Reinforcement Learning, a Survey[J]. arXiv preprint arXiv:2107.08241, 2021.

Zhang M, Vikram S, Smith L, et al. Solar: Deep structured representations for model-based reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2019: 7444-7453.

Janner M, Fu J, Zhang M, et al. When to trust your model: Model-based policy optimization[J]. arXiv preprint arXiv:1906.08253, 2019.

Oh J, Guo X, Lee H, et al. Action-conditional video prediction using deep networks in atari games[J]. arXiv preprint arXiv:1507.08750, 2015.

Ha D, Schmidhuber J. Recurrent world models facilitate policy evolution[J]. arXiv preprint arXiv:1809.01999, 2018.

Alaniz S. Deep reinforcement learning with model learning and monte carlo tree search in minecraft[J]. arXiv preprint arXiv:1803.08456, 2018.

Sutton R S. Dyna, an integrated architecture for learning, planning, and reacting[J]. ACM SIGART Bulletin, 1991, 2(4): 160-163. https://doi.org/10.1145/122344.122377

Holland G Z, Talvitie E J, Bowling M. The effect of planning shape on dyna-style planning in high-dimensional state spaces[J]. arXiv preprint arXiv:1806.01825, 2018.

Azizzadenesheli K, Yang B, Liu W, et al. Sample-efficient deep RL with generative adversarial tree search[J]. arXiv preprint arXiv:1806.05780, 2018.

Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[J]. Advances in neural information processing systems, 2014, 27.

Kaiser L, Babaeizadeh M, Milos P, et al. Model-based reinforcement learning for atari[J]. arXiv preprint arXiv:1903.00374, 2019.

Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of go without human knowledge[J]. Nature, 2017, 550(7676): 354-359. https://doi.org/10.1038/nature24270

Schrittwieser J, Antonoglou I, Hubert T, et al. Mastering atari, go, chess and shogi by planning with a learned model[J]. Nature, 2020, 588(7839): 604-609. https://doi.org/10.1038/s41586-020-03051-4

Deisenroth M P, Neumann G, Peters J. A survey on policy search for robotics[J]. Foundations and Trends in Robotics, 2013, 2(1-2): 388-403. https://doi.org/10.1561/2300000021

Veerapaneni R, Co-Reyes J D, Chang M, et al. Entity abstraction in visual model-based reinforcement learning[C]//Conference on Robot Learning. PMLR, 2020: 1439-1456.

Paxton C, Barnoy Y, Katyal K, et al. Visual robot task planning[C]//2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019: 8832-8838. https://doi.org/10.1109/ICRA.2019.8793736

Deng Z, Guan H, Huang R, et al. Combining model-based q-learning with structural knowledge transfer for robot skill learning[J]. IEEE Transactions on Cognitive and Developmental Systems, 2017, 11(1): 26-35. https://doi.org/10.1109/TCDS.2017.2718938

Hafner D, Lillicrap T, Fischer I, et al. Learning latent dynamics for planning from pixels[C]//International Conference on Machine Learning. PMLR, 2019: 2555-2565.

Machado M C, Bellemare M G, Talvitie E, et al. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents[J]. Journal of Artificial Intelligence Research, 2018, 61: 523-562. https://doi.org/10.1613/jair.5699

Pan Y, Zaheer M, White A, et al. Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains[J]. arXiv preprint arXiv:1806.04624, 2018. https://doi.org/10.24963/ijcai.2018/666

Heess N, Wayne G, Silver D, et al. Learning continuous control policies by stochastic value gradients[J]. arXiv preprint arXiv:1510.09142, 2015.

Feinberg V, Wan A, Stoica I, et al. Model-based value estimation for efficient model-free reinforcement learning[J]. arXiv preprint arXiv:1803.00101, 2018.

Kalweit G, Boedecker J. Uncertainty-driven imagination for continuous deep reinforcement learning[C]//Conference on Robot Learning. PMLR, 2017: 195-206.

Kurutach T, Clavera I, Duan Y, et al. Model-ensemble trust-region policy optimization[J]. arXiv preprint arXiv:1802.10592, 2018.

Gu S, Lillicrap T, Sutskever I, et al. Continuous deep q-learning with model-based acceleration[C]//International Conference on Machine Learning. PMLR, 2016: 2829-2838.

Peng B, Li X, Gao J, et al. Deep dyna-q: Integrating planning for task-completion dialogue policy learning[J]. arXiv preprint arXiv:1801.06176, 2018. https://doi.org/10.18653/v1/P18-1203

Sutton R S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming[M]//Machine Learning Proceedings 1990. Morgan Kaufmann, 1990: 216-224. https://doi.org/10.1016/B978-1-55860-141-3.50030-4

Azizzadenesheli K, Yang B, Liu W, et al. Surprising negative results for generative adversarial tree search[J]. arXiv preprint arXiv:1806.05780, 2018.

Isola P, Zhu J Y, Zhou T, et al. Image-to-image translation with conditional adversarial networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 1125-1134. https://doi.org/10.1109/CVPR.2017.632

Wang T, Bao X, Clavera I, et al. Benchmarking model-based reinforcement learning[J]. arXiv preprint arXiv:1907.02057, 2019.

Guo M, Liu Y, Malec J. A new Q-learning algorithm based on the metropolis criterion[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 2004, 34(5): 2140-2143. https://doi.org/10.1109/TSMCB.2004.832154

Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529-533. https://doi.org/10.1038/nature14236

Leibfried F, Kushman N, Hofmann K. A deep learning approach for joint video frame and reward prediction in atari games[J]. arXiv preprint arXiv:1611.07078, 2016.

Memisevic R. Learning to relate images[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1829-1846. https://doi.org/10.1109/TPAMI.2013.53

Mnih V, Badia A P, Mirza M, et al. Asynchronous methods for deep reinforcement learning[C]//International Conference on Machine Learning. PMLR, 2016: 1928-1937.

Published

2023-08-31
