Global ETD Search

61	Um algoritmo para a construção de vetores de sufixo generalizados em memória externa / External memory generalized suffix array construction algorithm Louza, Felipe Alves da 17 December 2013 (has links) The suffix array is an important data structure used in many string processing problems. Several approaches have been proposed in the literature for constructing suffix arrays in external memory. However, these approaches do not target sets of strings; that is, they do not consider generalized suffix arrays. This limitation motivates this master's thesis, which presents eGSA, the first external memory algorithm for constructing generalized suffix arrays enhanced with the longest common prefix (LCP) array and the Burrows-Wheeler transform (BWT). The work was developed in the context of bioinformatics, since recent technological advances have increased the volume of available biological data, which are stored as strings. The eGSA algorithm was validated through performance tests on real data involving large sequences, such as DNA, and small sequences, such as proteins. For sets of large DNA strings, eGSA was compared with the most efficient related suffix array construction algorithm in the literature, which was adapted to construct generalized arrays. eGSA reduced the running time by a factor of 3.2 to 8.3 and consumed 50% less memory. For sets of small protein strings, tests were performed only with eGSA since, to the best of our knowledge, no related work can be adapted. Compared with the average time to index sets of large strings, eGSA obtained competitive times for sets of small strings. Therefore, the performance tests demonstrated that the proposed algorithm can efficiently index both sets of large strings and sets of small strings. Biological data External memory Generalized suffix array Genome assembly Indexing
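As a rough illustration of the three structures named in the record above (generalized suffix array, LCP array, BWT), the following Python sketch builds them naively in main memory for a tiny set of strings. It is not the eGSA algorithm and does not use external memory; the function names, the `$` terminator, and the tie-breaking rule (equal suffixes ordered by string index) are assumptions made for this example only.

```python
def generalized_suffix_array(strings):
    """Return (string_id, offset) pairs for all suffixes, in sorted order.

    Each string is treated as if it ended in a terminator smaller than any
    symbol, so the empty suffix (offset == len(s)) is included. Equal
    suffixes from different strings are ordered by string_id (assumption).
    """
    suffixes = [
        (s[j:], i, j)
        for i, s in enumerate(strings)
        for j in range(len(s) + 1)
    ]
    suffixes.sort()  # naive: materializes every suffix, quadratic space
    return [(i, j) for _, i, j in suffixes]


def lcp_array(strings, gsa):
    """lcp[k] = longest common prefix of suffixes k-1 and k (lcp[0] = 0).

    Terminators are not counted as matching characters.
    """
    def common(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    lcp = [0]
    for (i1, j1), (i2, j2) in zip(gsa, gsa[1:]):
        lcp.append(common(strings[i1][j1:], strings[i2][j2:]))
    return lcp


def bwt(strings, gsa, terminator="$"):
    """Symbol preceding each sorted suffix; terminator for offset 0."""
    return "".join(strings[i][j - 1] if j > 0 else terminator for i, j in gsa)


if __name__ == "__main__":
    data = ["ab", "b"]
    gsa = generalized_suffix_array(data)
    print(gsa)
    print(lcp_array(data, gsa))
    print(bwt(data, gsa))
```

The sketch sorts fully materialized suffixes, which is only workable for toy inputs; that cost is what motivates specialized construction algorithms for large collections.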
62	Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem Gallé, Matthias 15 February 2011 (has links) (PDF) Motivated by the automatic discovery of the hierarchical structure of DNA sequences, we study the classical problem of finding the smallest context-free grammar that generates exactly one given sequence. This NP-hard problem has been widely studied for applications such as data compression, structure discovery and algorithmic information theory. We propose to decompose this problem into two complementary optimization problems. The first consists of choosing the substrings of the sequence that will be the constituents of the final grammar, while the second, which we call "minimal grammar parsing", consists of finding a minimum-size grammar that parses these constituents. We give a polynomial-time solution to the "minimal grammar parsing" problem and show that this decomposition defines a complete search space for the smallest grammar problem. We then turn to practical algorithms that return an approximate solution in a time reasonable enough to be applied to large sequences such as genomic sequences. We analyze the impact of using different classes of maximal repeats for the choice of constituents, and the trade-off between efficiency and the size of the final grammar. We present algorithmic advances that improve the efficiency of existing offline algorithms, notably the incremental update of suffix arrays during recoding. Finally, the new decomposition of the problem allows us to propose new generic algorithms that find grammars 10% smaller than the state of the art. 
Lastly, we consider the impact of these ideas on applications. Regarding structure discovery, we study the number of minimal grammars and show that this number can be exponential in the worst case. Our experiments on sets of sequences nevertheless show a certain stability of structure among the minimal grammars obtained from a given set of constituents. Regarding data compression, we contribute to each of the three stages of grammar-based compression. We then define a new algorithm that optimizes the size of the final bit string instead of the size of the grammar. Applied to DNA sequences, our experiments show that this algorithm outperforms every other grammar-based DNA-specific compressor. We improve this result by using inexact repeats, improving compression rates by 25% over the best grammar-based DNA compressors. Besides yielding better compression rates, this approach also opens up interesting generalizations of these grammars. bioinformatics algorithmics data compression grammatical inference suffix array
63	Succinct Indexes He, Meng 30 January 2008 (has links) This thesis defines and designs succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures that ideally occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT. As opposed to succinct (integrated data/index) encodings, the main advantage of succinct indexes is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data. In this thesis, we present succinct indexes for various data types, namely strings, binary relations, multi-labeled trees and multi-labeled graphs, as well as succinct text indexes. For strings, binary relations and multi-labeled trees, when the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we improve several previous results. We design succinct representations for strings and binary relations that are more compact than previous results, while supporting access/rank/select operations efficiently. Our high-order entropy compressed text index provides more efficient support for searches than previous results that occupy essentially the same amount of space. Our succinct representation for labeled trees supports more operations than previous results do. We also design the first succinct representations of labeled graphs. To design succinct indexes, we also have some preliminary results on succinct data structure design. We present a theorem that characterizes a permutation as a suffix array, based on which we design succinct text indexes. 
We design a succinct representation of ordinal trees that supports all the navigational operations supported by various succinct tree representations. In addition, this representation also supports two other encoding schemes of ordinal trees as abstract data types. Finally, we design succinct representations of planar triangulations and planar graphs which support the rank/select of edges in counterclockwise order in addition to other operations supported in previous work, and a succinct representation of k-page graphs which supports more efficient navigation than previous results for large values of k. succinct data structures data structures string binary relation text index suffix array tree graph succinct index Computer Science
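For readers unfamiliar with the access/rank/select operations mentioned in the record above, the following Python sketch shows what they compute on a plain bit vector. It is deliberately not succinct (it stores full prefix counts and a list of positions), and the class and method names are assumptions for illustration, not the thesis's data structures.

```python
class BitVector:
    """Plain (non-succinct) bit vector with access, rank and select."""

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[i] = number of 1s in bits[0:i]
        self.prefix = [0]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)
        # positions of the 1 bits, in increasing order
        self.ones = [i for i, b in enumerate(self.bits) if b]

    def access(self, i):
        """Return the bit at position i (0-based)."""
        return self.bits[i]

    def rank1(self, i):
        """Return the number of 1s among the first i bits."""
        return self.prefix[i]

    def select1(self, k):
        """Return the position of the k-th 1 (k is 1-based)."""
        return self.ones[k - 1]


if __name__ == "__main__":
    bv = BitVector([1, 0, 1, 1, 0])
    print(bv.access(1), bv.rank1(3), bv.select1(3))
```

A succinct structure answers the same queries while keeping the auxiliary space asymptotically smaller than the data itself; this sketch only fixes the semantics of the operations.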
65	Computational Representation Of Protein Sequences For Homology Detection And Classification Ogul, Hasan 01 January 2006 (has links) (PDF) Machine learning techniques have been widely used for classification problems in computational biology. They require that the input be a collection of fixed-length feature vectors. Since proteins are of varying lengths, there is a need for a means of representing protein sequences by a fixed number of features. This thesis introduces three novel methods for this purpose: n-peptide compositions with reduced alphabets, pairwise similarity scores by maximal unique matches, and pairwise similarity scores by probabilistic suffix trees. The new sequence representations described in the thesis are applied to three challenging problems of computational biology: remote homology detection, subcellular localization prediction, and solvent accessibility prediction, with some problem-specific modifications. Rigorous experiments are conducted on common benchmarking datasets, and a comparative analysis is performed between the new methods and the existing ones for each problem. On remote homology detection tests, all three methods achieve accuracies competitive with the state-of-the-art methods, while being much more efficient. A combination of the new representations is used to devise a hybrid system, called PredLOC, for predicting subcellular localization of proteins, and it is tested on two distinct eukaryotic datasets. To the best of the author's knowledge, the accuracy achieved by PredLOC is the highest ever reported on those datasets. The maximal unique match method yields only a slight improvement in solvent accessibility predictions.
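The idea of an n-peptide composition over a reduced alphabet, mentioned in the record above, can be sketched as follows: map each amino acid to a small group label, count overlapping n-grams of labels, and normalize, so that every protein yields a vector of the same length. The four-group reduction, the function name, and the normalization below are hypothetical choices for illustration and are not taken from the thesis.

```python
from itertools import product

# Hypothetical 4-group reduction of the 20 standard amino acids
# (illustrative only; the thesis may use different alphabets).
GROUPS = {
    "h": "AVLIMFWC",  # mostly hydrophobic
    "p": "STNQYGP",   # polar or small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}
REDUCE = {aa: label for label, members in GROUPS.items() for aa in members}


def npeptide_composition(sequence, n=2):
    """Return a fixed-length vector of n-gram frequencies over GROUPS.

    The vector has len(GROUPS) ** n entries regardless of sequence length.
    Residues outside the 20 standard amino acids are skipped (assumption).
    """
    reduced = "".join(REDUCE[aa] for aa in sequence if aa in REDUCE)
    keys = ["".join(p) for p in product(sorted(GROUPS), repeat=n)]
    counts = dict.fromkeys(keys, 0)
    for i in range(len(reduced) - n + 1):
        counts[reduced[i:i + n]] += 1
    total = max(len(reduced) - n + 1, 1)  # avoid division by zero
    return [counts[k] / total for k in keys]


if __name__ == "__main__":
    short = npeptide_composition("AKDE")
    long_ = npeptide_composition("MKVLAAGIVDEKRHSTNQ" * 10)
    print(len(short), len(long_))
```

Because the vector length depends only on the number of groups and on n, sequences of any length can be fed to a classifier that expects fixed-length input.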
66	Estudo semântico e diacrônico do sufixo -dade na língua portuguesa / Semantic and diachronic study of the suffix -dade in the Portuguese language Lisângela Simões 23 October 2009 (has links) Some authors consider suffixes to be minimal units, empty of meaning, that do not change the grammatical class of the base to which they attach to form new words. However, numerous examples refute such claims: abstract nouns can be derived from adjectives (feliz → felicidade), and the Portuguese suffix -eiro can form nouns, such as pedreiro, as well as adjectives, such as interesseiro. Suffixes therefore carry meaning of their own and do more than change the grammatical class of a term. The aim of this research is to analyze the Portuguese suffix -dade, highlighting, through semantic paraphrases, the meanings diachronically attested in Portuguese grammars and dictionaries, especially the Dicionário eletrônico Houaiss da língua portuguesa (2001). We argue that this type of analysis contributes to studies of the meanings and datings attributed to the suffix. This research falls within the proposals of the Grupo de Morfologia Histórica do Português da Universidade de São Paulo (GMHP/USP, Group of Historical Morphology of the Portuguese Language, http://www.usp.br/gmhp), coordinated by Prof. Dr. Mário Eduardo Viaro. Historical morphology Portuguese language Semantics Suffix -dade Suffixal derivation
67	Aspectos sincrônicos e diacrônicos do sufixo -ístico(a) no português e no galego / Synchronic and diachronic aspects of the suffix -ístico(a) in Portuguese Language and in Galician Language Nilsa Arean Garcia 13 February 2012 (has links) This work, a result of research carried out by the GMHP (Grupo de Morfologia Histórica do Português, Group of Historical Morphology of the Portuguese Language), studies the synchronic and diachronic aspects of the suffix -ístico(a), as well as its relations with its offshoot -ística and with the suffixes -ismo and -ista in Portuguese and Galician, in order to account for its morphological change from deverbal to denominal, its other semantic changes, and the languages responsible for its dissemination. To that end, using the methodology developed by the group and lexicographical, historiographical, literary and journalistic corpora, we first carry out a general survey to establish when the suffix became active and to verify how it is understood by current linguistic works in the two languages under study. We then study its Greco-Latin genesis and, subsequently, its functioning in other languages, as well as its syntagmatic and paradigmatic relations with the other suffixes involved, in order to trace its evolution over the centuries through semantic-functional classifications for each period studied, and to assess the importance of languages of culture and of translation processes in the dissemination of the suffixes in question. In this respect, the glossary of datings and attestations of words formed with -ística, the offshoot of -ístico(a), shows the great importance of German, rather than French, in spreading the suffix. Finally, the analysis of the productivity of -ístico(a) in Portuguese shows that a suffix not only carries semantic and functional meaning but also has several other nuances, among them those that characterize the textual genre in which it operates. Historical linguistics Historical morphology Suffix -ístico(a) Suffixation Word formation
69	A descriptive analysis of the morphology of the Tshiguvhu dialect of Venda Mulaudzi, Phalandwa Abraham 01 1900 (has links) In this study an attempt is made to describe the morphological aspects of Tshiguvhu. In chapter 1, it is indicated that historically, there was extensive early contact between Vhaguvhu and Balobedu and Tlokwa. In chapters 2 and 3, nouns and pronouns are analysed morphologically. Some similarities and differences between Tshiguvhu and Tshivenda are highlighted. These differences are ascribed to influences from Lobedu and Tlokwa. In chapters 4 and 5, the form of the verb and the use of verb forms in various tenses, where applicable, are described morphologically. Some verb roots and extensions have been influenced by Northern Sotho dialects whereas some have not. In chapter 6, the morphology of adverbs, interrogatives, conjunctions, ideophones and interjections is briefly described. In conclusion, it is indicated that Tshiguvhu is a dialect of Venda because of its cultural and historical bonds with Venda, although linguistically it shares some features with certain Northern Sotho dialects. / African Languages / M.A. (African Languages) Tshivenda Tshiguvhu Lobedu Tlokwa Morpheme Morph Allomorph Prefix Root Suffix 496.39765 Venda language -- Morphology Venda language -- Dialects -- Morphology Tshiguvhu dialect
70	Towards a Human Genomic Coevolution Network Savel, Daniel M. 04 June 2018 (has links) No description available. Computer Science Bioinformatics Next-generation sequencing suffix trees sequence assembly coevolution co-conservation multiple hypothesis testing eQTL phylogenetic profiles
