Optimal Data File Allocation for All-to-All Comparison in Distributed System: A Case Study on Genetic Sequence Comparison

  • Leixiao Li: College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China; College of Data Science and Application, Inner Mongolia University of Technology, Hohhot 010018, China; Inner Mongolia Autonomous Region Engineering & Technology Research Center of Big Data Based Software Service, Hohhot 010018, China
  • Jing Gao: College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China
  • Ren Mu: State Key Laboratory, Beijing Jiaotong University, Beijing 100044, China

Abstract

To address the unbalanced load of data files in large-scale all-to-all data comparison in a distributed system environment, this paper fully considers the differences among the files themselves. It aims to fully utilize the advantages of distributed systems to improve file allocation for all-to-all comparison between the data files in a large dataset. For this purpose, the authors formally described the all-to-all comparison problem and constructed a data allocation model via mixed integer linear programming (MILP). Meanwhile, a data allocation algorithm was developed in MATLAB using the intlinprog function, which is based on the branch-and-bound method. Finally, the model and algorithm were verified through several experiments. The results show that the proposed file allocation strategy can achieve a basic load balance across the nodes of the distributed system without exceeding the storage capacity of any node, and completely localize the data files. The research findings can be applied to fields such as bioinformatics, biometrics and data mining.
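
To make the allocation problem concrete, the following is a minimal Python sketch of the three requirements named in the abstract: every pair of files is compared on a node that stores both files (data locality), no node exceeds its storage capacity, and the largest node load is as small as possible. It is not the paper's MILP model or its MATLAB implementation. It uses exhaustive search on a toy instance instead of branch-and-bound, and the function name `allocate`, the example sizes and capacities, and the comparison cost (assumed here to be the sum of the two file sizes) are all illustrative assumptions.

```python
from itertools import combinations, product


def allocate(sizes, capacities):
    """Try every pair-to-node assignment and keep the best feasible one.

    sizes[i]      -- size of file i (hypothetical units)
    capacities[k] -- storage capacity of node k (same units)
    Returns (max_load, pair_to_node, files_per_node), or None if infeasible.
    """
    pairs = list(combinations(range(len(sizes)), 2))
    n_nodes = len(capacities)
    best = None
    for assignment in product(range(n_nodes), repeat=len(pairs)):
        stored = [set() for _ in range(n_nodes)]
        load = [0] * n_nodes
        for (i, j), node in zip(pairs, assignment):
            # Data locality: the node running a comparison stores both files.
            stored[node].update((i, j))
            # Assumed cost model: comparison time grows with both file sizes.
            load[node] += sizes[i] + sizes[j]
        # Storage constraint: stored files must fit on each node.
        if any(sum(sizes[f] for f in stored[k]) > capacities[k]
               for k in range(n_nodes)):
            continue
        # Load balancing: minimize the most heavily loaded node.
        if best is None or max(load) < best[0]:
            best = (max(load), dict(zip(pairs, assignment)), stored)
    return best


if __name__ == "__main__":
    result = allocate(sizes=[4, 3, 2, 1], capacities=[10, 10])
    if result is None:
        print("no feasible allocation")
    else:
        max_load, pair_to_node, files_per_node = result
        print("max node load:", max_load)
        print("files per node:", files_per_node)
```

Exhaustive search is only workable for a handful of files, since the number of assignments grows exponentially with the number of pairs. That growth is the reason a solver-based formulation such as the MILP described above is used for realistic datasets.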

Published
2019-04-14
How to Cite
LI, Leixiao; GAO, Jing; MU, Ren. Optimal Data File Allocation for All-to-All Comparison in Distributed System: A Case Study on Genetic Sequence Comparison. INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL, [S.l.], v. 14, n. 2, p. 199-211, apr. 2019. ISSN 1841-9844. Available at: <http://univagora.ro/jour/index.php/ijccc/article/view/3526>. Date accessed: 02 july 2020. doi: https://doi.org/10.15837/ijccc.2019.2.3526.

Keywords

distributed system, all-to-all comparison, mixed integer linear programming (MILP), file allocation, load balancing