Frequent Patterns Algorithm of Biological Sequences based on Pattern Prefix-tree
Keywords:
Bioinformatics, frequent patterns, biological sequence, pattern prefixtree, data miningAbstract
In the application of bioinformatics, the existing algorithms cannot be directly and efficiently implement sequence pattern mining. Two fast and efficient biological sequence pattern mining algorithms for biological single sequence and multiple sequences are proposed in this paper. The concept of the basic pattern is proposed, and on the basis of mining frequent basic patterns, the frequent pattern is excavated by constructing prefix trees for frequent basic patterns. The proposed algorithms implement rapid mining of frequent patterns of biological sequences based on pattern prefix trees. In experiment the family sequence data in the pfam protein database is used to verify the performance of the proposed algorithm. The prediction results confirm that the proposed algorithms can’t only obtain the mining results with effective biological significance, but also improve the running time efficiency of the biological sequence pattern mining.References
Apostolico, A.; Preparata, F. P. (1983). Optimal off-line detection of repetitions in a string, Theoretical Computer Science, 22(3), 297-315, 1983. https://doi.org/10.1016/0304-3975(83)90109-3
Ben-Hur, A.; Brutlag, D. (2003). Remote homology detection: A motif based approach, Bioinformatics, 19 Suppl 1(1), 26-33, 2003. https://doi.org/10.1093/bioinformatics/btg1002
Benkaddour, M. K., Bounoua, A. (2017). Feature extraction and classification using deep convolutional neural networks, PCA and SVC for face recognition, Traitement du Signal, 34(1-2), 77-91, 2017. https://doi.org/10.3166/ts.34.77-91
Cardon, L. R.; Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments, Journal of Molecular Biology, 223(1), 159-170, 1992. https://doi.org/10.1016/0022-2836(92)90723-W
Chen, L.; Liu, W. (2013). Frequent patterns mining in multiple biological sequences, Computers in Biology & Medicine, 43(10), 1444-1451, 2013. https://doi.org/10.1016/j.compbiomed.2013.07.009
Crochemore, M.; Ilie, L. (2008). Maximal repetitions in strings, Journal of Computer & System Sciences, 74(5), 796-807, 2008. https://doi.org/10.1016/j.jcss.2007.09.003
Dong, T. (2017). Assessment of data reliability of wireless sensor network for bioinformatics, International Journal Bioautomation, 21(1), 241-250, 2017. https://doi.org/10.1016/j.aeue.2017.09.003
Jiang, Q.; Li, S.; Guo, S. (2011). A new model for finding approximate tandem repeats in DNA sequences, Journal of Software, 6(3), 386-394, 2011. https://doi.org/10.4304/jsw.6.3.386-394
Karmaker, S.; Ruhi, F. Y.; Mallick U. K. (2018). Mathematical analysis of a model on guava for biological pest control, Mathematical Modelling of Engineering Problems, 5(4), 427-440, 2018. https://doi.org/10.18280/mmep.050420
Kurtz, S.; Choudhuri, J. V.; Ohlebusch, E.; Schleiermacher, C.; Stoye, J.; Giegerich, R. (2001). REPuter: the manifold applications of repeat analysis on a genomic scale, Nucleic Acids Research, 29(22), 4633-4642, 2001. https://doi.org/10.1093/nar/29.22.4633
Lawrence, C. E.; Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, Proteins Structure Function & Bioinformatics, 7(1), 41-51, 1990. https://doi.org/10.1002/prot.340070105
Liao, C. C.; Chen, M. S. (2014). DFSP: A Depth-First SPelling algorithm for sequential pattern mining of biological sequences, Knowledge & Information Systems, 38(3), 623-639, 2014. https://doi.org/10.1007/s10115-012-0602-x
Makalowski, W. (2003). Not junk after all, Science, 300(5623), 1246-1247, 2003. https://doi.org/10.1126/science.1085690
Pasquier, N.; Bastide, Y.; Taouil, R.; Lakhal, L. (1999). Efficient mining of association rules using closed itemset lattices, Information Systems, 24(1), 25-46, 1999. https://doi.org/10.1016/S0306-4379(99)00003-4
Rubin, E. M.; Lucas, S.; Richardson, P. (2004). Finishing the euchromatic sequence of the human genome, Nature, 431(7011), 931-945, 2004. https://doi.org/10.1038/nature03001
Saqhai Maroof, M. A.; Yang, G. P.; Biyashev, R. M.; Maughan, P. J.; Zhang, Q. (1996). Analysis of the barley and rice genomes by comparative RFLP linkage mapping, Theoretical & Applied Genetics, 92(5), 541-551, 1996. https://doi.org/10.1007/s001220050161
Shapiro, J. A.; Von, S. R. (2005). Why repetitive DNA is essential to genome function, Biological Reviews, 80(2), 227-250, 2005. https://doi.org/10.1017/S1464793104006657
Stansfield, W.; King, R.; Mulligan, P. (2006). A dictionary of genetics, Yale Journal of Biology & Medicine, 75(4), 236-245, 2006.
Won, J. I.; Park, S.; Yoon, J. H.; Kim, S. W. (2006). An efficient approach for sequence matching in large DNA databases, Journal of Information Science, 32(1), 88-104, 2006. https://doi.org/10.1177/0165551506059229
Published
Issue
Section
License
ONLINE OPEN ACCES: Acces to full text of each article and each issue are allowed for free in respect of Attribution-NonCommercial 4.0 International (CC BY-NC 4.0.
You are free to:
-Share: copy and redistribute the material in any medium or format;
-Adapt: remix, transform, and build upon the material.
The licensor cannot revoke these freedoms as long as you follow the license terms.
DISCLAIMER: The author(s) of each article appearing in International Journal of Computers Communications & Control is/are solely responsible for the content thereof; the publication of an article shall not constitute or be deemed to constitute any representation by the Editors or Agora University Press that the data presented therein are original, correct or sufficient to support the conclusions reached or that the experiment design or methodology is adequate.