Optimal Data File Allocation for All-to-All Comparison in Distributed System: A Case Study on Genetic Sequence Comparison

Authors

  • Leixiao Li College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010018, China Inner Mongolia Autonomous Region Engineering & Technology Research Center of Big Data Based Software Service, Hohhot 010018, China
  • Jing Gao College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
  • Ren Mu State key laboratory, Beijing Jiaotong University, Beijing 100044, China

Keywords:

distributed system, all-to-all comparison, mix integer linear programming (MILP), file allocation, load balancing

Abstract

In order to solve the problem of unbalanced load of data les in large-scale data all-to-all comparison under distributed system environment, the differences of les themselves arefully considered. This paper aims to fully utilize the advantages of distributed system to enhance the le allocation of all-to-all comparison between the data les in a large dataset. For this purpose, the author formally described the all-to-all comparison problem, and con-structed a data allocation model via mixed integer linear programming (MILP). Meanwhile, a data allocation algorithm was developed on the Matlab using the intlinprog function of branch-and-bound method. Finally, our model and algorithm were veried through several experiments. The results show that the proposed le allocation strategy can achieve the basic load balance of each node in the distributed system without exceeding the storage capacity of any node, and completely localize the data le. The research ndings can be applied to such elds as bioinformatics, biometrics and data mining.

References

Borodin, V.; Bourtembourg, J.; Hnaien, F., Labadie, N. (2018). COTS software integration for simulation optimization coupling: case of ARENA and CPLEX products, International Journal of Modelling and Simulation, (5), 1-12, 2018.

Dai, Y.; Wu, W.; Zhou, H.B.; Zhang, J.; Ma, F.Y. (2018). Numerical Simulation and Oprimization of Oil Jet Lubrication for Rotorcraft Meshing Gears, International Journal of Simulation Modelling, 17(2), 318-326, 2018. https://doi.org/10.2507/IJSIMM17(2)CO6

Dai, Y.; Zhu, X.; Zhou, H.; Mao, Z.; Wu, W. (2018). Trajectory Tracking Control for Seafloor Tracked Vehicle By Adaptive Neural-Fuzzy Inference System Algorithm, International Journal of Computers Communications & Control, 13(4), 465-476, 2018. https://doi.org/10.15837/ijccc.2018.4.3267

Deng, J. (2014). Research and Improvement of Mixed Integer Linear Programming Model for Unit Combination, Nanning: Guangxi University, 12-16, 2014.

Gao, Y.J. (2017). Research on Data Allocation Strategy for All-to-all Comparison of Large Data Sets, Taiyuan: Taiyuan University of Technology, 5-10, 2017.

Guo, J.W.; Li, Y.; Du, L.P.; Zhao, G.F.; Jiang, J.Y. (2014). Research on distributed data mining system based on hadoop platform, Advances in Intelligent Systems and Computing, 255, 629-636, 2014. https://doi.org/10.1007/978-81-322-1759-6_72

He, H.; Du, Z.H.; Zhang, W.Z.; Chen, A. (2016). Optimization strategy of Hadoop small file storage for big data in healthcare, Journal of Supercomputing, 72(10), 3696-3707, 2016. https://doi.org/10.1007/s11227-015-1462-4

Hess, M.; Sczyrba, A.; Egan, R.; Kim, T.W.; Chokhawala, H.; Schroth, G.; Luo, S.; Clark, D.S.; Chen, F.; Zhang, T.; Mackie, R.I.; Pennacchio, L.A.; Tringe, S.G.; Visel, A.; Woyke, T.; Wang, Z.; Rubin, E.M. (2011). Metagenomic discovery of biomass-degrading genes and genomes from cow rumen, Science, 331(6016), 463-467, 2011. https://doi.org/10.1126/science.1200387

Hu, S.R. (1991). Modern supercomputer system, Journal of computer science, (1), 47-56, 1991.

Jiao, X.P.; Mu, J.J. (2013). Improved check node decomposition for linear programming decoding, IEEE Communications Letters, 17(2), 377-380, 2013. https://doi.org/10.1109/LCOMM.2012.122012.122396

Liao, J.; Trahay, F.; Xiao, G.; Li, L.; Ishikawa, Y. (2017). Performing initiative data prefetching in distributed file systems for cloud computing, IEEE Transactions on Cloud Computing, 5(3), 550-562, 2017. https://doi.org/10.1109/TCC.2015.2417560

Mu, R.; Wu, J.J.; Li, N. (2018). MATLAB and mathematical modeling, Beijing: Science Press, 63-78, 2018.

MAzller, E.R.; Carlson, R.C.; Junior, W.K. (2016). Intersection control for automated vehicles with MILP, IFAC-PapersOnLine, 49(3), 37-42, 2016. https://doi.org/10.1016/j.ifacol.2016.07.007

Nayahi, J.J.V.; Kavitha, V. (2017). Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop, Future Generation Computer Systems, 74, 393- 408, 2017. https://doi.org/10.1016/j.future.2016.10.022

Pitty, S.S.; Karimi, I.A. (2008). Novel MILP models for scheduling permutation flowshops, Chemical Product and Process Modeling, 3(1), 35-42, 2008. https://doi.org/10.2202/1934-2659.1176

Sun, J.Y. (2016). Simulation experiment of operation research model based on MATLAB, Journal of Shenyang University (Natural Science Edition), 28(4), 337-339, 2016.

Schulman, J.; Duan, Y.; Ho, J.; Lee, A.; Awwal, I.; Bradlow, H. (2014). Motion planning with sequential convex optimization and convex collision checking, International Journal of Robotics Research, 33(9), 1251-1270, 2014. https://doi.org/10.1177/0278364914528132

Schmidt, B.; Hartmann, C. (2018). Wavepacket: a matlab package for numerical quantum dynamics. ii: open quantum systems, optimal control, and model reduction, Computer Physics Communications, 228, 229-244, 2018. https://doi.org/10.1016/j.cpc.2018.02.022

Ubarhande, V.; Popescu, A.; González-Vélez, H. (2015). Novel Data-Distribution Technique for Hadoop in Heterogeneous Cloud Environments, 2015 Ninth International Conference on Complex, Intelligent, and Software Intensive Systems, 217-224, 2015. https://doi.org/10.1109/CISIS.2015.37

Wang, L.Z.; Tao, J.; Ranjan, R.; Marten, H.; Streit, A.; Chen, J.Y.; Chen, D. (2013). GHadoop: MapReduce across distributed data centers for data-intensive computing, Future Generation Computer Systems, 29(3), 739-750, 2013. https://doi.org/10.1016/j.future.2012.09.001

Yang, X.P.; Zhou, X.G.; Cao, B.Y. (2015). Multi-level linear programming subject to addition-min fuzzy relation inequalities with application in Peer-to-Peer file sharing system, Journal of Intelligent and Fuzzy Systems, 28(6), 2679-2689, 2015 https://doi.org/10.3233/IFS-151546

Zhang, Y.F.; Tian, Y.C.; Fidge, C.; Kelly, W. (2016); Data-aware task scheduling for allto- all comparison problems in heterogeneous distributed systems, Journal of Parallel & Distributed Computing, 93(C), 87-101, 2016.

Zhang, Y.F.; Tian, Y.C.; Kelly, W.; Fidge, C. (2017). Scalable and efficient data distribution for distributed computing of all-to-all comparison problems, Future Generation Computer Systems, 67, 152-162, 2017. https://doi.org/10.1016/j.future.2016.08.020

Zhang, Y.F.; Tian, Y.C.; Kelly, W.; Fidge, C. (2014). A distributed computing framework for All-to-All comparison problems, IECON 2014 - 40th Annual Conference of the IEEE Industrial Electronics Society, 2499-2505, 2014.

Zhou, J.X.; Shao, X.M.; Qiao, J.Y.; Zhang, Y.W. (2012). MATLAB from the introduction to proficiency (2nd edition), Beijing: People's Post and Telecommunications Publishing House, 35-92, 2012.

Published

2019-04-14

Most read articles by the same author(s)

Obs.: This plugin requires at least one statistics/report plugin to be enabled. If your statistics plugins provide more than one metric then please also select a main metric on the admin's site settings page and/or on the journal manager's settings pages.