A deep reinforcement learning-based optimization method for long-running applications container deployment

Authors

  • Lei Deng, School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129, Shaanxi, China
  • Zhaoyang Wang, Science and Technology on Electro-optic Control Laboratory, Luoyang, 471023, China
  • Haoyang Sun, School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129, Shaanxi, China
  • Bo Li, Science and Technology on Electro-optic Control Laboratory, Luoyang, 471023, China
  • Xiao Yang, School of Computer Science, Northwestern Polytechnical University, Xi’an, 710129, Shaanxi, China

DOI:

https://doi.org/10.15837/ijccc.2023.4.5013

Abstract

Unlike batch jobs, which have short execution cycles, intelligent algorithmic applications in the cloud typically run in long-lived containers as Long-Running Applications (LRAs). LRAs must meet strict service level objective (SLO) requirements, scale their performance to cope with peak loads, and contend with I/O dependencies, resource contention, and interference from co-located containers. These factors greatly complicate container deployment and can easily lead to performance bottlenecks. Optimizing LRA container deployment is therefore a key problem that the cloud computing model must address. This research uses deep reinforcement learning (DRL) to optimize the deployment of LRA containers. The proposed non-generic model customizes a dedicated model for each container group, providing high-quality placement with low training complexity; meanwhile, the proposed batch deployment scheme can optimize various scheduling objectives that existing constraint-based schedulers do not directly support, such as minimizing SLO violations. The experimental results show that the DRL deployment algorithm improves performance by 56.2% relative to the average RPS of the baseline, indicating that the manual deployment scheme can meet only the basic requirements and cannot account, from a global perspective, for the complex interactions between containers under constraints. This severely limits the performance of the whole pod. Moreover, based on previous experience, a single deployment scheme takes about 1 hour to produce, whereas the DRL deployment algorithm may take less than 7.5 minutes.


Published

2023-06-20
