Global ETD Search

1	Efficient Computation of Regularities in Strings and Applications Yusufu, Munina 08 1900 (has links) <p> Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies. Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns, that is, repeats, multirepeats, repetitions, runs and others. Regularities in the form of repeating substrings were the basis of one of the earliest and still widely used compression algorithms and remain central in more recent approaches. Repeats and repetitions of lengthy substrings in DNA and protein sequences are important markers in biological research. </p> <p> A large proportion of the available algorithms for computing regularities in strings depends on the prior computation of a suffix tree, or, more recently, of a suffix array. The design of new algorithms for computing regularities should emphasize conceptual simplicity, as well as both time and space efficiency.</p> <p> In this thesis, we investigate mathematical and algorithmical aspects of the computation of regularities in strings.</p> <p> The first part of the thesis is the development of space and time efficient nonextendible (NE) and supernonextendible (SNE) repeats algorithms RPT, shown to be more efficient than previous methods based on tests using different real data sets. In particular, we describe four variants of a new fast algorithm RPT1 that, based on suffix array construction, computes all the complete NE repeats in a given string x whose length (period) p ≥ pmin, where pmin ≥ 1 is a user-specified minimum. RPT1 uses 5n bytes of space directly, but requires the LCP array, whose construction needs 6n bytes. The variants RPT1-3 and RPT1-4 execute in O(n) time independent of alphabet size and are faster than the two other algorithms previously proposed for this problem. To provide a basis of comparison for RPT1, we also describe a straightforward algorithm RPT2 that computes complete NE repeats without any recourse to suffix arrays and whose total space requirement is only 5n bytes; however, this algorithm is slower than RPT1. Furthermore, we describe new fast algorithms RPT3 for computing all complete SNE repeats in x. Of these, RPT3-2 executes in O(n) time independent of alphabet size, thus asymptotically faster than the methods previously proposed. We conclude with a brief discussion of applications to bioinformatics and data compression.</p> <p> The second part of the thesis deals with the issue of finding the NE multirepeats in a set of N strings of average length n under various constraints. A multirepeat is a repeat that occurs at least m times (m ≥ 2) in each of at least q ≥ 1 strings in a given set of strings. We show that RPT1 can be extended to locate the multirepeats based on the investigation of the properties of the multirepeats and various strategies. We describe algorithms to find complete NE multirepeats, first with no restriction on "gap length" (that is, the gap between occurrences of the multirepeat), then with bounded gaps. For the first problem, we propose two algorithms with worst-case time complexities O(Nn+αlog2N) and O(Nn+α) that use 9Nn and 10Nn bytes of space, respectively, where α is the alphabet size. For the second problem, we describe an algorithm with worst-case time complexity O(RNn) that requires approximately 10Nn bytes, where R is the number of multirepeats output. We remark that if we set the min and max constraints on gaps equal to zero in this algorithm, we can find all repetitions (tandem repeats) in arbitrary subsets of a given set. We demonstrate that our algorithms are faster, more flexible and much more space efficient than algorithms recently proposed for this problem.</p> <p> Finally, the third part of the thesis provides a convenient framework for comparing the LZ factorization algorithms which are used in the computation of regularities in strings rather than in the traditional application to text compression. LZ factorization is the computational bottleneck in numerous string processing algorithms, especially in regularity studies, such as computing repetitions, runs, repeats with fixed gap, branching repeats, sequence alignment, local periods, and data compression. Since 1977, when Ziv and Lempel described a kind of string factorization useful for data compression, there has been a succession of algorithms proposed for computing "LZ factorization". In particular, there have been several recent algorithms proposed that extend the usefulness of LZ factorization, especially to the computation of runs in a string x. We choose these algorithms and analyze each algorithm separately, and remark on them by comparing some of their important aspects, for example, additional space required and handling mechanism. We also address their output format differences and some special features. We then provide a complete theoretical comparison of their time and space efficiency. We conduct intensive testing on both time and space performance and analyze the results carefully to draw conclusions in which situations these algorithms perform best. We believe that our investigation and analysis will be very useful for researchers in their choice of the proper LZ factorization algorithms to deal with the problems related to computation of the regularities in strings.</p> / Thesis / Doctor of Philosophy (PhD)
2	PARALLEL COMPUTING ALGORITHMS FOR TANDEM 2013 April 1900 (has links) Tandem mass spectrometry, also known as MS/MS, is an analytical technique to measure the mass-to-charge ratio of charged ions and widely used in genomics, proteomics and metabolomics areas. There are two types of automatic ways to interpret tandem mass spectra: de novo methods and database searching methods. Both of them need to use massive computational resources and complicated comparison algorithms. The real-time peptide-spectrum matching (RT-PSM) algorithm is a database searching method to interpret tandem mass spectra with strict time constraints. Restricted by the hardware and architecture of an individual workstation the RT-PSM algorithm has to sacrifice the level of accuracy in order to provide prerequisite processing speed. The peptide-spectrum similarity scoring module is the most time-consuming part out of four modules in the RT-PSM algorithm, which is also the core of the algorithm. In this study, a multi-core computing algorithm is developed for individual workstations. Moreover, a distributed computing algorithm is designed for a cluster. The improved algorithms can achieve the speed requirement of RT-PSM without sacrificing the accuracy. With some expansion, this distributed computing algorithm can also support different PSM algorithms. Simulation results show that compared with the original RT-PSM, the parallelization version achieves 25 to 34 times speed-up based on different individual workstations. A cluster with 240 CPU cores could accelerate the similarity score module 210 times compare with the single-thread similarity score module and the whole peptide identification process 85 times compare with the single-thread peptide identification process. tandem mass spectrum parallel computing algorithm multi-core computing algorithm distributed computing algorithm peptide identification
3	Aplicativo computacional da função discriminante quadrática para utilização em ciências experimentais / Simeão, Sandra Fiorelli de Almeida Penteado, 1965- January 2006 (has links) Orientador: Carlos Roberto Padovani / Banca: Adriano Wagner Ballarin / Banca: Flávio Fekkari Aragon / Banca: José Carlos Martinez / Banca: Marie Oshiiwa / Resumo: Aspectos teóricos relacionados à Análise Discriminante Multivariada - Linear e Quadrática - foram discutidos, por meio de um extenso levantamento histórico da função discriminante, com seus primórdios no trabalho de Fisher e sua posterior evolução, enfocando o intenso desenvolvimento das técnicas classificatórias discriminantes com o advento dos computadores. Foi dada ênfase aos softwares estatísticos desenvolvidos para PC, que realizam a análise discriminante, e que representam uma grande contribuição para pesquisadores e usuários desta técnica. Considerando a dificuldade existente quanto a aplicativos computacionais acessíveis a pesquisadores da área de ciências agrárias, elaborou-se um programa que realiza a análise discriminante quadrática com as respectivas freqüências de classificação correta, bem como o manual explicativo do usuário. Verificou-se que a função discriminante quadrática trata de um procedimento bastante útil nas ciências agrárias, como, por exemplo, em estudos nas áreas de solos, cultivos diversos (soja, milho, cana de açúcar, pupunha, braquiária, frutas), criação de animais e classificação e seleção de madeiras; porém, subutilizada frente à dificuldade de programas computacionais de fácil manuseio e acesso a pesquisadores das áreas aplicadas. Os procedimentos estudados e discutidos foram ilustrados com exemplos de aplicação, utilizando dados experimentais agronômicos de espécies de Girassóis e Eucalyptus, submetidos ao aplicativo desenvolvido. / Abstract: A large historical study of the discriminant function has allowed a discussion on theoretical aspects related to the Multivaried Discriminant Analysis - Linear and Quadratic, showing its past in the work of Fisher and its later evolution, emphasizing the wide development of classificatory discriminant techniques with the happening of the computers, and specific statistic softwares which practice the discriminant analysis, representing a big contribution to researches and users of this technique. Considering the difficulty in relation to accessible softwares to researches of the agrarian area, a software which performs a linear and quadratic discriminant analysis was built with its frequencies of correct classification, as well as an explicative manual to users. The quadratic discriminant was studied as being a very useful process in agrarian sciences. Some examples of this usefulness is in studies of the ground, diversified cultivation (soybean, corn, sugarcane, pejibaye, brachiaria decumbens fruits), animal creation and wood selection, and classification; however, misused in relation to the difficulties of easy handing and access to researchers of applied areas. The studied and discussed procedures were illustrated with applications, using agronomic experimental data of Sunflower and Eucalyptus, submitted to developed software. / Doutor Análise discriminante. Programação quadratica. Análise multivariada. Discriminant analysis. eng Quadratic discriminant function. eng Wrong classification rate. eng Computing algorithm. eng
4	Aplicativo computacional da função discriminante quadrática para utilização em ciências experimentais Simeão, Sandra Fiorelli de Almeida Penteado [UNESP] 19 December 2006 (has links) (PDF) Made available in DSpace on 2014-06-11T19:31:35Z (GMT). No. of bitstreams: 0 Previous issue date: 2006-12-19Bitstream added on 2014-06-13T21:02:54Z : No. of bitstreams: 1 simeao_sfap_dr_botfca.pdf: 899191 bytes, checksum: da6ed77a45734c278c56395d23c51cd0 (MD5) / Universidade Estadual Paulista (UNESP) / Aspectos teóricos relacionados à Análise Discriminante Multivariada - Linear e Quadrática - foram discutidos, por meio de um extenso levantamento histórico da função discriminante, com seus primórdios no trabalho de Fisher e sua posterior evolução, enfocando o intenso desenvolvimento das técnicas classificatórias discriminantes com o advento dos computadores. Foi dada ênfase aos softwares estatísticos desenvolvidos para PC, que realizam a análise discriminante, e que representam uma grande contribuição para pesquisadores e usuários desta técnica. Considerando a dificuldade existente quanto a aplicativos computacionais acessíveis a pesquisadores da área de ciências agrárias, elaborou-se um programa que realiza a análise discriminante quadrática com as respectivas freqüências de classificação correta, bem como o manual explicativo do usuário. Verificou-se que a função discriminante quadrática trata de um procedimento bastante útil nas ciências agrárias, como, por exemplo, em estudos nas áreas de solos, cultivos diversos (soja, milho, cana de açúcar, pupunha, braquiária, frutas), criação de animais e classificação e seleção de madeiras; porém, subutilizada frente à dificuldade de programas computacionais de fácil manuseio e acesso a pesquisadores das áreas aplicadas. Os procedimentos estudados e discutidos foram ilustrados com exemplos de aplicação, utilizando dados experimentais agronômicos de espécies de Girassóis e Eucalyptus, submetidos ao aplicativo desenvolvido. / A large historical study of the discriminant function has allowed a discussion on theoretical aspects related to the Multivaried Discriminant Analysis - Linear and Quadratic, showing its past in the work of Fisher and its later evolution, emphasizing the wide development of classificatory discriminant techniques with the happening of the computers, and specific statistic softwares which practice the discriminant analysis, representing a big contribution to researches and users of this technique. Considering the difficulty in relation to accessible softwares to researches of the agrarian area, a software which performs a linear and quadratic discriminant analysis was built with its frequencies of correct classification, as well as an explicative manual to users. The quadratic discriminant was studied as being a very useful process in agrarian sciences. Some examples of this usefulness is in studies of the ground, diversified cultivation (soybean, corn, sugarcane, pejibaye, brachiaria decumbens fruits), animal creation and wood selection, and classification; however, misused in relation to the difficulties of easy handing and access to researchers of applied areas. The studied and discussed procedures were illustrated with applications, using agronomic experimental data of Sunflower and Eucalyptus, submitted to developed software. Análise discriminante Programação quadratica Analise multivariada Algoritmo computacional Função discriminante quadrática Discriminant analysis Quadratic discriminant function Wrong classification rate Computing algorithm

1

Page generated in 0.0632 seconds