Detection of P53 Consensus Sequence: A Novel String Matching With Classes Algorithm

Gıyasettin ÖZCAN
547 133

Abstract


We present a novel fast string matching technique for special DNA pattern forms and compare performance of recent CPU architectures on the matching problem. In particular, we consider consensus P53 DNA-binding consensus sequence, which has an important contribution for cancer treatment. Based on biological findings, consensus P53 pattern may emerge in various sequence forms and its length is not deterministic. Therefore, classic string matching algorithms are not able to solve the problem. For efficient solution, we consider bitwise string matching algorithms with classes and present a novel search technique which is based on 64-bit packed variables. In order to prevent obstacles based on variable length of the pattern, we search specific indexes of P53 on databases. For experimental analysis, we make use of mus musculus DNA sequences with approximately 2.3 billion nucleotides. We compare algorithm performance and three architectures with various level CPU parallelism. Test results show that our technique presents search efficiency during P53 pattern search in each architecture platform. Due to its structure, the algorithm also introduces an efficient solution to similar string matching with class problems.

Keywords


Computer Engineering; Computational Biology; Text Processing; Consensus Sequences; Bitwise Matching; Hardware Counters

Full Text:

PDF


References


Appel, W. and George, L. (2000) Optimal spilling for CISC machines with few registers, ACM SIGPLAN Notices, Vol 36 No 5, 243-253, doi: 10.1145/378795.378854

Baeza-Yates, R. and Gonnet, G. H., (1992) A new approach to text searching, Communications of the ACM, 35(10) , 74–82, doi: 10.1145/135239.135243

Boyer, R.S. and Moore, J.S. (1977 ) A Fast String Searching Algorithm, Communications of the ACM, 20, 10, 762-772, doi: 10.1145/359842.359859

Browne, S., Dongarra, J., Garner, N., Ho, G. and Mucci, P. (2000) A Portable Programming Interface for Performance Evaluation on Modern Processors, The International Journal of High Performance Computing Applications, 14:3, 189-204, doi:10.1177/109434200001400303

Durian, B., Holub, J., Peltola, H., and Tarhio, J. (2009) Tuning BNDM with q-grams, Proceedings of the Workshop on Algorithm Engineering and Experiments ALENEX. 29–37, doi: 10.1137/1.9781611972894.3

El-Deiry W. (1998) Regulation of p53 downstream genes, Semin Cancer Biololgy, 8 :345-357.

Fan, H., Yao, N., and Ma, H. (2009) Fast variants of the backward-oracle-marching algorithm, Proceedings of the Fourth International Conference on Internet Computing for Science and Engineering, IEEE Computer Society, 56–59

Faro, S. and Lecroq, T. (2010) The Exact String Matching Problem: A Comprehensive Experimental Evaluation, doi: 10.1145/2431211.2431212

Fuyao, Z. and Qingwei, L. (2009) A string matching algorithm based on efficient hash function, Information Engineering and Computer Science – ICIECS, doi: 10.1109/ICIECS.2009.5363191

Horspool, R. N. (1980) Practical fast searching in strings, Software – Practice & Experience, New Jersey, Volume 10 Number 6.

http://www.boost.org, Erişim Tarihi: 01.01.2013, Konu:C++ Boost Libraries

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. Erişim Tarihi: 1.1.2011, Konu: Intel Manual

http://www.ncbi.nlm.nih.gov/pubmedhealth/. Erişim Tatihi: 1.1.2014, Konu: PubMed Health Bethesda (MD): National Library of Medicine, mus musculus DNA sekansları

Karp, R. M. and Rabin, M. O. (1987) Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, New Jersey Volume 31 Issue 2, doi: 10.1147/rd.312.0249

Kern, S. E., Kinzler, K. W., Bruskin, A., Jarosz, D., Friedman, P., Prives, C., and Vogelstein B. (1991) Identification of p53 as a sequence-specific DNA-binding protein. Science. 252(5013):1708–11, doi: 10.1126/science.2047879

Kim, S. (1999) A new string-pattern matching algorithm using partitioning and hashing efficiently, Journal of Experimental Algorithmics (JEA), JEA Homepage archive Volume 4, Article No. 2, doi: 10.1145/347792.347803

Külekci, O. (2012) On enumeration of DNA sequences, Proceedings of ACM Conference on Bioinformatics, Computational Biology, and Biomedicine, Orlando, doi: 10.1145/2382936.2382993

Külekci, O. (2008) A method to overcome computer word size limitation in bit-parallel pattern matching, S.-H. Hong, H. Nagamochi, and T. Fukunaga, editors, Proceedings of the 19th International Symposium on Algorithms and Computation, ISAAC, Lecture Notes in Computer Science, Springer-Verlag, Berlin volume 5369, 496–506, doi: 10.1007/978-3-540-92182-0_45

Knuth, D., Morris, J. H. and Pratt, V. (1977) Fast pattern matching in strings, SIAM Journal on Computing, 6 (2), 323–350, 10.1137/0206024

Navarro, G., and Raffinot, M. (2000) Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics, Volume 5, New York, doi: 10.1145/351827.384246

Özcan, G., and Ünsal, O. S. (2015) Fast bitwise pattern matching algorithm for DNA sequences on modern hardware, Turk J Elec Eng & Comp Sci, Vol 23 (2015), pp.1405-1417 doi: 10.3906/elk-1304-165

Patterson, D., Hennessy, J., Arpaci-Dusseau, A. (2007). Computer architecture: a quantitative approach. Morgan Kaufmann,

Vintan, L. N. "Towards a High Performance Neural Branch Predictor (1999) Proceedings of the Int'l J. Conf. on Neural Networks, doi: 10.1109/IJCNN.1999.831066

Sunday, D. M. (1990) A Very Fast Substring Search Algorithm. Communications of the ACM, New York. , 33, 8, 132-142, doi: 10.1145/79173.79184




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.