Gene Sequences Parallel Alignment Model Based on Multiple Inputs and Outputs

  • Xiaolong Feng
  • Jing Gao

Abstract

Bioinformatics computing is a big data processing problem, typically characterized by large data volumes, heavy computational load, and long running times. The use of big data technology in bioinformatics has therefore become a research hotspot, and gene sequence alignment with Hadoop is one example. In biocomputing it is common to combine several tools to complete a single job, and most studies of parallel gene sequence alignment on Hadoop also rely on third-party tools. Few methods complete gene sequence alignment using Hadoop alone. Adding data processing steps performed by other tools to a Hadoop workflow not only limits performance gains but also complicates the application. This paper proposes a parallel alignment model for gene sequences based on multiple inputs and outputs, which completes parallel alignment on the Hadoop platform independently, without other tools. The model simplifies the gene sequence alignment workflow and improves performance compared with other methods. The paper describes in detail how gene sequences are manipulated with the multiple inputs and outputs modes on the Hadoop platform, presents the design of a computing model based on this method, and demonstrates the model's advantages through experiments.
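To make the idea of a job with multiple inputs and multiple outputs concrete, the following is a minimal pure-Python sketch of the general MapReduce pattern. It is not the model proposed in the paper and does not use Hadoop. The function names (map_reads, shuffle, reduce_pair, run_job), the input tags, and the output names "aligned" and "unaligned" are all hypothetical. The exact-substring search is only a toy stand-in for a real aligner such as BWA, and it ignores details such as reverse-complemented mates.

```python
from collections import defaultdict


def map_reads(source_tag, fastq_lines):
    """Map phase: parse 4-line FASTQ records and emit (read_id, (source_tag, sequence))."""
    for i in range(0, len(fastq_lines) - 3, 4):
        header, seq = fastq_lines[i], fastq_lines[i + 1]
        # "@r1/1" and "@r1/2" share the key "r1", so both mates meet in one reducer.
        read_id = header[1:].split("/")[0]
        yield read_id, (source_tag, seq)


def shuffle(pairs):
    """Shuffle phase: group all mapped values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_pair(read_id, values, reference):
    """Reduce phase: toy exact-match search (an assumption, not a real aligner)."""
    mates = dict(values)
    positions = {tag: reference.find(seq) for tag, seq in mates.items()}
    both_found = len(mates) == 2 and all(p >= 0 for p in positions.values())
    # The returned name selects which of the multiple outputs receives the record.
    return ("aligned" if both_found else "unaligned"), (read_id, positions)


def run_job(inputs, reference):
    """Run map, shuffle, and reduce over several tagged inputs; return named outputs."""
    pairs = [kv for tag, lines in inputs.items() for kv in map_reads(tag, lines)]
    outputs = defaultdict(list)
    for read_id, values in sorted(shuffle(pairs).items()):
        name, record = reduce_pair(read_id, values, reference)
        outputs[name].append(record)
    return dict(outputs)


# Made-up example data: a tiny reference and two paired-end read files.
reference = "ACGTTGCAGGCTA"
reads_1 = ["@r1/1", "ACGT", "+", "IIII", "@r2/1", "TTTT", "+", "IIII"]
reads_2 = ["@r1/2", "GGCT", "+", "IIII", "@r2/2", "GCAG", "+", "IIII"]
result = run_job({"read_1": reads_1, "read_2": reads_2}, reference)
print(result)
```

In this sketch each input carries a tag so the reducer can tell the two mates apart, and the reducer routes each record to a named output. That is the same role tagged inputs and named outputs play in a MapReduce job that reads several files and writes several result sets.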

Published
2019-04-14
How to Cite
FENG, Xiaolong; GAO, Jing. Gene Sequences Parallel Alignment Model Based on Multiple Inputs and Outputs. INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL, [S.l.], v. 14, n. 2, p. 141-153, apr. 2019. ISSN 1841-9844. Available at: <http://univagora.ro/jour/index.php/ijccc/article/view/3539>. Date accessed: 02 july 2020. doi: https://doi.org/10.15837/ijccc.2019.2.3539.

Keywords

Multiple inputs and outputs, MapReduce, gene sequence alignment, short reads mapping, BWA (Burrows-Wheeler aligner), parallel computing