Frequent Patterns Algorithm of Biological Sequences based on Pattern Prefix-tree

  • Linyan Xue College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Xiaoke Zhang College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Fei Xie College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Shuang Liu College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Peng Lin College of Management, Hebei University, Baoding 071002, China

Abstract

In the application of bioinformatics, the existing algorithms cannot be directly and efficiently implement sequence pattern mining. Two fast and efficient biological sequence pattern mining algorithms for biological single sequence and multiple sequences are proposed in this paper. The concept of the basic pattern is proposed, and on the basis of mining frequent basic patterns, the frequent pattern is excavated by constructing prefix trees for frequent basic patterns. The proposed algorithms implement rapid mining of frequent patterns of biological sequences based on pattern prefix trees. In experiment the family sequence data in the pfam protein database is used to verify the performance of the proposed algorithm. The prediction results confirm that the proposed algorithms can’t only obtain the mining results with effective biological significance, but also improve the running time efficiency of the biological sequence pattern mining.

References

[1] Apostolico, A.; Preparata, F. P. (1983). Optimal off-line detection of repetitions in a string, Theoretical Computer Science, 22(3), 297-315, 1983.
https://doi.org/10.1016/0304-3975(83)90109-3

[2] Ben-Hur, A.; Brutlag, D. (2003). Remote homology detection: A motif based approach, Bioinformatics, 19 Suppl 1(1), 26-33, 2003.
https://doi.org/10.1093/bioinformatics/btg1002

[3] Benkaddour, M. K., Bounoua, A. (2017). Feature extraction and classification using deep convolutional neural networks, PCA and SVC for face recognition, Traitement du Signal, 34(1-2), 77-91, 2017.
https://doi.org/10.3166/ts.34.77-91

[4] Cardon, L. R.; Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments, Journal of Molecular Biology, 223(1), 159-170, 1992.
https://doi.org/10.1016/0022-2836(92)90723-W

[5] Chen, L.; Liu, W. (2013). Frequent patterns mining in multiple biological sequences, Computers in Biology & Medicine, 43(10), 1444-1451, 2013.
https://doi.org/10.1016/j.compbiomed.2013.07.009

[6] Crochemore, M.; Ilie, L. (2008). Maximal repetitions in strings, Journal of Computer & System Sciences, 74(5), 796-807, 2008.
https://doi.org/10.1016/j.jcss.2007.09.003

[7] Dong, T. (2017). Assessment of data reliability of wireless sensor network for bioinformatics, International Journal Bioautomation, 21(1), 241-250, 2017.
https://doi.org/10.1016/j.aeue.2017.09.003

[8] Jiang, Q.; Li, S.; Guo, S. (2011). A new model for finding approximate tandem repeats in DNA sequences, Journal of Software, 6(3), 386-394, 2011.
https://doi.org/10.4304/jsw.6.3.386-394

[9] Karmaker, S.; Ruhi, F. Y.; Mallick U. K. (2018). Mathematical analysis of a model on guava for biological pest control, Mathematical Modelling of Engineering Problems, 5(4), 427-440, 2018.
https://doi.org/10.18280/mmep.050420

[10] Kurtz, S.; Choudhuri, J. V.; Ohlebusch, E.; Schleiermacher, C.; Stoye, J.; Giegerich, R. (2001). REPuter: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Research, 29(22), 4633-4642, 2001.
https://doi.org/10.1093/nar/29.22.4633

[11] Lawrence, C. E.; Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins Structure Function & Bioinformatics, 7(1), 41-51, 1990.
https://doi.org/10.1002/prot.340070105

[12] Liao, C. C.; Chen, M. S. (2014). DFSP: A Depth-First SPelling algorithm for sequential pattern mining of biological sequences, Knowledge & Information Systems, 38(3), 623-639, 2014.
https://doi.org/10.1007/s10115-012-0602-x

[13] Makalowski, W. (2003). Not junk after all, Science, 300(5623), 1246-1247, 2003.
https://doi.org/10.1126/science.1085690

[14] Pasquier, N.; Bastide, Y.; Taouil, R.; Lakhal, L. (1999). Efficient mining of association rules using closed itemset lattices, Information Systems, 24(1), 25-46, 1999.
https://doi.org/10.1016/S0306-4379(99)00003-4

[15] Rubin, E. M.; Lucas, S.; Richardson, P. (2004). Finishing the euchromatic sequence of the human genome, Nature, 431(7011), 931-945, 2004.
https://doi.org/10.1038/nature03001

[16] Saqhai Maroof, M. A.; Yang, G. P.; Biyashev, R. M.; Maughan, P. J.; Zhang, Q. (1996). Analysis of the barley and rice genomes by comparative RFLP linkage mapping, Theoretical & Applied Genetics, 92(5), 541-551, 1996.
https://doi.org/10.1007/s001220050161

[17] Shapiro, J. A.; Von, S. R. (2005). Why repetitive DNA is essential to genome function, Biological Reviews, 80(2), 227-250, 2005.
https://doi.org/10.1017/S1464793104006657

[18] Stansfield, W.; King, R.; Mulligan, P. (2006). A dictionary of genetics, Yale Journal of Biology & Medicine, 75(4), 236-245, 2006.

[19] Won, J. I.; Park, S.; Yoon, J. H.; Kim, S. W. (2006). An efficient approach for sequence matching in large DNA databases, Journal of Information Science, 32(1), 88-104, 2006.
https://doi.org/10.1177/0165551506059229
Published
2019-08-05
How to Cite
XUE, Linyan et al. Frequent Patterns Algorithm of Biological Sequences based on Pattern Prefix-tree. INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL, [S.l.], v. 14, n. 4, p. 574-589, aug. 2019. ISSN 1841-9844. Available at: <http://univagora.ro/jour/index.php/ijccc/article/view/3607>. Date accessed: 05 july 2020. doi: https://doi.org/10.15837/ijccc.2019.4.3607.

Keywords

Bioinformatics, frequent patterns, biological sequence, pattern prefixtree, data mining