1

An ACGT-Words Tree for Efficient Data Access in Genomic Databases

Hu, Jen-Wei 25 July 2003 (has links)
Genomic sequence databases, such as GenBank and EMBL, are widely used by molecular biologists for homology searching. As the size of these databases increases, indexing the sequences for fast queries becomes more and more important. DNA sequences are composed of four kinds of bases, so genomic sequences can be regarded as text strings. As in conventional databases, several approaches use indexes to provide efficient access to the data. The inverted-list indexing approach uses hashing to store the database sequences; however, a perfect hash function is difficult to construct, and collisions in the hash table may occur frequently. Unlike the inverted-list approach, other data structures, such as the suffix tree, the suffix array, and the suffix binary search tree, index the genomic sequences directly. One characteristic of these suffix-based data structures is that they store all suffixes of the sequences and do not break the sequences into words. The advantage of the suffix tree is its simplicity, but its storage requirement is very large. The suffix array and the suffix binary search tree need less storage than the suffix tree, but since they rely on binary search to locate the query sequence, they spend too much time on the search. Another data structure, the word suffix tree, uses the concept of words and stores only partial suffixes to index the DNA sequence. Although the word suffix tree reduces the storage space, it loses information during the search process. In this thesis, we propose a new index structure, the ACGT-Words tree, to efficiently support query processing in genomic databases. We define a concept of words that differs from the word definition used in the word suffix tree, and we split both the DNA sequences stored in the database and the query sequence into distinct words. Our approach does not store all suffixes of the database sequences and therefore needs less space than the suffix tree approach. We also propose an efficient search algorithm for sequence matching based on the ACGT-Words tree index structure, which completes the search in less time than the suffix array approach. Our approach also avoids the missed matches of the word suffix tree. Based on the ACGT-Words tree, we then propose one improved operation for data insertion and two improved operations for the search process. In the improved insertion operation, we sort the generated ACGT-Words and preprocess them before constructing the tree structure. The two improved search operations provide better performance when the query sequence satisfies certain conditions. The simulation results show that the ACGT-Words tree outperforms the suffix tree and the suffix array in terms of storage and processing time, respectively. Moreover, we show that the improved operations of the ACGT-Words tree also require less time for construction or search than the original procedures and the suffix array.
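As a rough illustration of the word-based indexing idea described in this abstract, the Python sketch below builds a simple index of overlapping fixed-length words (k-mers) and verifies candidate positions against the query. It is only a generic sketch, not the ACGT-Words tree itself, whose word definition and tree layout are specific to the thesis; the example sequence, word length, and function names are invented for the illustration.

    from collections import defaultdict

    def build_word_index(seq, k=4):
        # Map every overlapping length-k word of the database sequence to its start positions.
        index = defaultdict(list)
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append(i)
        return index

    def search(seq, index, query, k=4):
        # Look up candidate positions by the first word of the query, then verify the rest.
        candidates = index.get(query[:k], [])
        return [p for p in candidates if seq[p:p + len(query)] == query]

    db = "ACGTACGTTTGACCA"                  # toy database sequence
    idx = build_word_index(db)
    print(search(db, idx, "ACGTTTGA"))      # prints [4]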
2

LEDA-SM: external memory algorithms and data structures in theory and practice

Crauser, Andreas. Unknown Date (has links) (PDF)
Saarbrücken, Univ., Diss., 2001.
3

Inexact Mapping of Short Biological Sequences in High Performance Computational Environments

Salavert Torres, José 30 October 2014 (has links)
Bioinformatics is the application of computer science to the management and analysis of biological data. From 2005 onwards, with the appearance of new-generation DNA sequencers, what is known as Next Generation Sequencing (NGS) emerged. A single biological experiment run on an NGS sequencing machine can easily produce hundreds of gigabytes or even terabytes of data, and depending on the chosen technique this process can be completed in a few hours or days. The availability of affordable local resources, such as multicore processors or the new graphics cards designed for general-purpose computing (GPGPU, General Purpose Graphic Processing Unit), represents a great opportunity to tackle these problems. A topic frequently addressed at present is the alignment of DNA sequences. In bioinformatics, alignment makes it possible to compare two or more sequences of DNA, RNA, or primary protein structures, highlighting their regions of similarity. Such similarities may indicate functional or evolutionary relationships between the genes or proteins under study. Moreover, similarities between the sequences of a patient and those of another individual with a detected genetic disease could be used effectively in the field of diagnostic medicine. The problem around which this doctoral thesis revolves is the location of short sequence fragments within DNA, known as sequence mapping. This mapping must tolerate errors, so that sequences can be mapped even in the presence of genetic variability or read errors. Several techniques exist for mapping, but since the advent of NGS the most prominent has been searching over prefixes indexed and grouped by means of the Burrows-Wheeler transform [28] (BWT hereafter). This transform was originally used in data compression techniques, as in the bzip2 algorithm; its use as a tool for indexing and subsequently searching information is more recent [22]. Its advantage is that its computational complexity depends only on the length of the sequence to be mapped. On the other hand, a large number of alignment techniques are based on dynamic programming algorithms, either Smith-Waterman or hidden Markov models (HMMs). These provide greater sensitivity, allowing more errors, but their computational cost is higher and depends on the length of the sequence multiplied by that of the reference string. Many tools combine a first BWT-based search phase for candidate alignment regions with a second local-alignment phase in which strings are mapped using Smith-Waterman or HMMs. When mapping with few allowed errors, a second phase based on a dynamic programming algorithm is too costly, so an inexact search based solely on the BWT can be more efficient. The main motivation of this doctoral thesis is the implementation of an inexact search algorithm based only on the BWT, adapting it to modern parallel architectures, both on CPUs and on GPGPUs. The algorithm constitutes a new branch-and-bound method adapted to genomic information.
During the research stay, hidden Markov models will be studied and an implementation will be carried out on GTA (Aggregate, Test, Generate) functional computation models, as well as the shared-memory and distributed-memory parallelization of that functional programming platform. / Salavert Torres, J. (2014). Inexact Mapping of Short Biological Sequences in High Performance Computational Environments [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/43721 / TESIS
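Because the thesis builds its mapper on BWT-based search, a minimal Python sketch of exact backward search over an FM-style index may help illustrate the building block that inexact mapping extends. The naive suffix-sort construction and the toy sequence are assumptions made for brevity and would not scale to real genomes.

    def bwt_index(text):
        # Build the BWT of text (terminated with '$') via a naive suffix sort,
        # plus the cumulative count table C and prefix occurrence counts Occ.
        text += "$"
        sa = sorted(range(len(text)), key=lambda i: text[i:])
        bwt = "".join(text[i - 1] for i in sa)      # text[-1] is '$' when i == 0
        alphabet = sorted(set(text))
        C, total = {}, 0
        for c in alphabet:
            C[c] = total
            total += text.count(c)
        occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
        for i, ch in enumerate(bwt):
            for c in alphabet:
                occ[c][i + 1] = occ[c][i] + (ch == c)
        return sa, C, occ

    def backward_search(pattern, sa, C, occ):
        # Exact backward search: narrow the suffix array interval one character
        # at a time, from the last character of the pattern to the first.
        lo, hi = 0, len(sa)
        for c in reversed(pattern):
            if c not in C:
                return []
            lo = C[c] + occ[c][lo]
            hi = C[c] + occ[c][hi]
            if lo >= hi:
                return []
        return sorted(sa[lo:hi])

    sa, C, occ = bwt_index("ACGTACGTTTGA")          # toy reference sequence
    print(backward_search("ACGT", sa, C, occ))      # prints [0, 4]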
4

Bohaté rysy ve frázovém strojovém překladu / Rich Features in Phrase-Based Machine Translation

Kos, Kamil January 2010 (has links)
In this thesis we investigate several methods for improving the quality of statistical machine translation (MT) by using linguistically rich information. First, we describe SemPOS, a metric that uses a shallow semantic representation of sentences to evaluate translation quality. We show that even though this metric correlates well with human assessment of translation quality, it is not directly suitable for system parameter optimization. Second, we extend the log-linear model used in statistical MT with an additional source-context model that helps to better distinguish among possible translation options and select the most promising translation for a given context.
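A small worked example may clarify the log-linear model being extended here: each translation option is scored by a weighted sum of feature values, and a source-context model simply contributes one more term to that sum. The feature names, weights, and scores below are invented for the illustration and are not taken from the thesis.

    def loglinear_score(features, weights):
        # Log-linear model: the (unnormalized) log score is a weighted sum of feature values.
        return sum(weights[name] * value for name, value in features.items())

    # Hypothetical feature values for two competing translation options of one phrase.
    weights = {"tm": 0.6, "lm": 0.3, "source_context": 0.4}
    options = {
        "option_a": {"tm": -1.2, "lm": -2.0, "source_context": -0.5},
        "option_b": {"tm": -1.0, "lm": -2.4, "source_context": -1.5},
    }
    best = max(options, key=lambda name: loglinear_score(options[name], weights))
    print(best)   # "option_a": the option with the highest weighted feature sum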
5

Um algoritmo para a construção de vetores de sufixo generalizados em memória externa / External memory generalized suffix array construction algorithm

Louza, Felipe Alves da 17 December 2013 (has links)
The suffix array is an important data structure used in several string processing problems. In the literature, several approaches have been proposed to deal with external-memory suffix array construction. However, these approaches are not specifically aimed at indexing sets of strings; that is, they do not consider generalized suffix arrays. This limitation motivates this master's thesis, which presents eGSA, the first external-memory algorithm developed to construct generalized suffix arrays enhanced with the longest common prefix (LCP) array and the Burrows-Wheeler transform (BWT). We especially focus on the context of bioinformatics, as recent technological advances have increased the volume of available biological data, which are stored as strings. The eGSA algorithm was validated through performance tests with real data from DNA and protein sequences. Regarding performance tests with large DNA strings, we compared our algorithm with the most efficient related suffix array construction algorithm in the literature, which was adapted to construct generalized arrays. The results demonstrated that our algorithm reduced the time spent by a factor of 3.2 to 8.3 and consumed 50% less memory. For sets of small protein strings, tests were performed only with eGSA, since, to the best of our knowledge, there is no related work that can be adapted. Compared to the average time spent to index sets of large strings, eGSA obtained competitive times for sets of small strings. Therefore, the performance tests demonstrated that the proposed algorithm can be applied efficiently to index both sets of large strings and sets of small strings.
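To make the data structures concrete, the following naive in-memory Python sketch builds, for a small set of strings, a generalized suffix array together with its LCP array and BWT. It only illustrates what the eGSA output contains; the eGSA algorithm itself builds these structures in external memory, and the terminator handling and example strings here are simplifying assumptions.

    def generalized_suffix_array(strings):
        # Naive in-memory sketch: sort all suffixes of all strings together.
        # Each entry is a (string_id, offset) pair.
        texts = [s + "$" for s in strings]
        entries = [(i, j) for i, t in enumerate(texts) for j in range(len(t))]
        entries.sort(key=lambda e: texts[e[0]][e[1]:])

        def suffix(e):
            return texts[e[0]][e[1]:]

        def lcp(a, b):
            # Length of the longest common prefix of two suffixes.
            n = 0
            while n < min(len(a), len(b)) and a[n] == b[n]:
                n += 1
            return n

        lcps = [0] + [lcp(suffix(entries[k - 1]), suffix(entries[k]))
                      for k in range(1, len(entries))]
        bwt = "".join(texts[i][j - 1] if j > 0 else "$" for i, j in entries)
        return entries, lcps, bwt

    gsa, lcps, bwt = generalized_suffix_array(["ACGT", "ACGA"])   # toy input
    print(gsa)    # sorted (string_id, offset) pairs
    print(lcps)   # longest common prefix between neighbouring suffixes
    print(bwt)    # character preceding each suffix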
6

Succinct Indexes

He, Meng 30 January 2008 (has links)
This thesis defines and designs succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and that support an extended set of operations using the basic operators defined in the ADT. As opposed to succinct (integrated data/index) encodings, the main advantage of succinct indexes is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this thesis, we present succinct indexes for various data types, namely strings, binary relations, multi-labeled trees and multi-labeled graphs, as well as succinct text indexes. For strings, binary relations and multi-labeled trees, when the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we improve several previous results. We design succinct representations for strings and binary relations that are more compact than previous results, while supporting access/rank/select operations efficiently. Our high-order entropy compressed text index provides more efficient support for searches than previous results that occupy essentially the same amount of space. Our succinct representation for labeled trees supports more operations than previous results do. We also design the first succinct representations of labeled graphs. To design succinct indexes, we also present some preliminary results on succinct data structure design. We present a theorem that characterizes a permutation as a suffix array, based on which we design succinct text indexes. We design a succinct representation of ordinal trees that supports all the navigational operations supported by various succinct tree representations. In addition, this representation also supports two other encoding schemes of ordinal trees as abstract data types. Finally, we design succinct representations of planar triangulations and planar graphs that support the rank/select of edges in counterclockwise order in addition to other operations supported in previous work, and a succinct representation of k-page graphs that supports more efficient navigation than previous results for large values of k.
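The access/rank/select operators mentioned above can be illustrated with a toy, non-succinct bitvector index in Python. Real succinct structures answer these queries in constant time using o(n) extra bits; this sketch simply precomputes prefix sums for clarity, and the example bit string is made up.

    class BitVectorIndex:
        # Toy rank/select support over a bit string (not space-efficient).
        def __init__(self, bits):
            self.bits = bits
            self.prefix = [0]
            for b in bits:
                self.prefix.append(self.prefix[-1] + (b == "1"))

        def rank1(self, i):
            # Number of 1-bits in bits[0:i].
            return self.prefix[i]

        def select1(self, k):
            # Position of the k-th 1-bit (1-based), or -1 if there are fewer than k.
            for i, b in enumerate(self.bits):
                if b == "1":
                    k -= 1
                    if k == 0:
                        return i
            return -1

    bv = BitVectorIndex("0110100011")
    print(bv.rank1(5), bv.select1(3))   # prints 3 4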
7

Celogenomové zarovnání pomocí suffixových stromů / Whole genome alignment using suffix trees

Klouba, Lukáš January 2017 (has links)
The aim of this thesis is to create an algorithm that aligns the genomes of two organisms by means of suffix structures and to implement it in the R programming environment. The thesis describes the construction of suffix structures and methods of whole-genome alignment. The result is a working algorithm for whole-genome alignment based on suffix structures, implemented in R, together with its comparison against similar whole-genome alignment programs.
8

Predikce transpozonů v DNA / Prediction of Transposons in DNA

Černohub, Jan January 2014 (has links)
The aim of this thesis is to become familiar with how information is stored in DNA, to survey transposons and the bioinformatics tools and algorithms used to detect them in sequenced genomes, and thereby to provide a concise introduction to this broad topic, including its placement in the context of current research in the field. Based on this overview of existing algorithms and tools for transposon detection, a tool for finding so-called LTR transposons is designed and implemented.
