Frequent Patterns Algorithm of Biological Sequences based on Pattern Prefix-tree

Authors

  • Linyan Xue College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Xiaoke Zhang College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Fei Xie College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Shuang Liu College of Quality and Technical Supervision, Hebei University, Baoding 071002, China
  • Peng Lin College of Management, Hebei University, Baoding 071002, China

Keywords:

Bioinformatics, frequent patterns, biological sequence, pattern prefixtree, data mining

Abstract

In the application of bioinformatics, the existing algorithms cannot be directly and efficiently implement sequence pattern mining. Two fast and efficient biological sequence pattern mining algorithms for biological single sequence and multiple sequences are proposed in this paper. The concept of the basic pattern is proposed, and on the basis of mining frequent basic patterns, the frequent pattern is excavated by constructing prefix trees for frequent basic patterns. The proposed algorithms implement rapid mining of frequent patterns of biological sequences based on pattern prefix trees. In experiment the family sequence data in the pfam protein database is used to verify the performance of the proposed algorithm. The prediction results confirm that the proposed algorithms can’t only obtain the mining results with effective biological significance, but also improve the running time efficiency of the biological sequence pattern mining.

References

Apostolico, A.; Preparata, F. P. (1983). Optimal off-line detection of repetitions in a string, Theoretical Computer Science, 22(3), 297-315, 1983. https://doi.org/10.1016/0304-3975(83)90109-3

Ben-Hur, A.; Brutlag, D. (2003). Remote homology detection: A motif based approach, Bioinformatics, 19 Suppl 1(1), 26-33, 2003. https://doi.org/10.1093/bioinformatics/btg1002

Benkaddour, M. K., Bounoua, A. (2017). Feature extraction and classification using deep convolutional neural networks, PCA and SVC for face recognition, Traitement du Signal, 34(1-2), 77-91, 2017. https://doi.org/10.3166/ts.34.77-91

Cardon, L. R.; Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments, Journal of Molecular Biology, 223(1), 159-170, 1992. https://doi.org/10.1016/0022-2836(92)90723-W

Chen, L.; Liu, W. (2013). Frequent patterns mining in multiple biological sequences, Computers in Biology & Medicine, 43(10), 1444-1451, 2013. https://doi.org/10.1016/j.compbiomed.2013.07.009

Crochemore, M.; Ilie, L. (2008). Maximal repetitions in strings, Journal of Computer & System Sciences, 74(5), 796-807, 2008. https://doi.org/10.1016/j.jcss.2007.09.003

Dong, T. (2017). Assessment of data reliability of wireless sensor network for bioinformatics, International Journal Bioautomation, 21(1), 241-250, 2017. https://doi.org/10.1016/j.aeue.2017.09.003

Jiang, Q.; Li, S.; Guo, S. (2011). A new model for finding approximate tandem repeats in DNA sequences, Journal of Software, 6(3), 386-394, 2011. https://doi.org/10.4304/jsw.6.3.386-394

Karmaker, S.; Ruhi, F. Y.; Mallick U. K. (2018). Mathematical analysis of a model on guava for biological pest control, Mathematical Modelling of Engineering Problems, 5(4), 427-440, 2018. https://doi.org/10.18280/mmep.050420

Kurtz, S.; Choudhuri, J. V.; Ohlebusch, E.; Schleiermacher, C.; Stoye, J.; Giegerich, R. (2001). REPuter: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Research, 29(22), 4633-4642, 2001. https://doi.org/10.1093/nar/29.22.4633

Lawrence, C. E.; Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins Structure Function & Bioinformatics, 7(1), 41-51, 1990. https://doi.org/10.1002/prot.340070105

Liao, C. C.; Chen, M. S. (2014). DFSP: A Depth-First SPelling algorithm for sequential pattern mining of biological sequences, Knowledge & Information Systems, 38(3), 623-639, 2014. https://doi.org/10.1007/s10115-012-0602-x

Makalowski, W. (2003). Not junk after all, Science, 300(5623), 1246-1247, 2003. https://doi.org/10.1126/science.1085690

Pasquier, N.; Bastide, Y.; Taouil, R.; Lakhal, L. (1999). Efficient mining of association rules using closed itemset lattices, Information Systems, 24(1), 25-46, 1999. https://doi.org/10.1016/S0306-4379(99)00003-4

Rubin, E. M.; Lucas, S.; Richardson, P. (2004). Finishing the euchromatic sequence of the human genome, Nature, 431(7011), 931-945, 2004. https://doi.org/10.1038/nature03001

Saqhai Maroof, M. A.; Yang, G. P.; Biyashev, R. M.; Maughan, P. J.; Zhang, Q. (1996). Analysis of the barley and rice genomes by comparative RFLP linkage mapping, Theoretical & Applied Genetics, 92(5), 541-551, 1996. https://doi.org/10.1007/s001220050161

Shapiro, J. A.; Von, S. R. (2005). Why repetitive DNA is essential to genome function, Biological Reviews, 80(2), 227-250, 2005. https://doi.org/10.1017/S1464793104006657

Stansfield, W.; King, R.; Mulligan, P. (2006). A dictionary of genetics, Yale Journal of Biology & Medicine, 75(4), 236-245, 2006.

Won, J. I.; Park, S.; Yoon, J. H.; Kim, S. W. (2006). An efficient approach for sequence matching in large DNA databases, Journal of Information Science, 32(1), 88-104, 2006. https://doi.org/10.1177/0165551506059229

Published

2019-08-05

Most read articles by the same author(s)

Obs.: This plugin requires at least one statistics/report plugin to be enabled. If your statistics plugins provide more than one metric then please also select a main metric on the admin's site settings page and/or on the journal manager's settings pages.