121

Error Correction of Third-Generation Sequencing Data

Morisse, Pierre 26 September 2019 (has links)
This thesis addresses the broad problem of analyzing data from very-high-throughput sequencers, and more specifically long reads produced by third-generation sequencing technologies. Its main concerns are the correction of sequencing errors and the impact of correction on downstream analyses, particularly de novo assembly. The first objective of the thesis is to evaluate and compare the quality of the correction provided by state-of-the-art tools, whether they use a hybrid strategy (relying on complementary short reads) or a self-correction strategy (relying only on the information contained in the long reads themselves). Such an evaluation makes it easy to identify which correction method is best suited to a given case, according to the complexity of the genome under study, the sequencing depth, or the error rate of the reads. It also lets developers identify the limitations of existing methods, guiding their work toward new solutions that overcome those limitations. A new evaluation tool was therefore developed, providing many additional metrics compared to the only tool previously available. By combining a multiple sequence alignment approach with a segmentation strategy, it also drastically reduces the time required for evaluation. Using this tool, a benchmark of all available correction methods is presented, over a wide variety of datasets of varying sequencing depth, error rate, and complexity, from the bacterium A. baylyi to human. This benchmark revealed two major limitations of existing tools: reads with error rates above 30%, and reads longer than 50,000 base pairs. The second objective of the thesis is therefore the correction of extremely noisy reads. To this end, a hybrid correction tool combining several state-of-the-art approaches was developed to overcome the limitations of existing methods. In particular, it combines an alignment of the short reads onto the long reads with the use of a de Bruijn graph, with the distinctive feature of being of variable order. The graph is used to link the aligned short reads and thus correct the uncovered regions of the long reads. This method corrects reads with error rates as high as 44%, while scaling better to large genomes and reducing processing time compared to the most efficient state-of-the-art methods. Finally, the third objective of the thesis is the correction of extremely long reads. For this, a self-correction tool was developed, again combining several state-of-the-art methodologies: a strategy for computing overlaps between the reads, followed by a two-phase correction using multiple sequence alignment and then local de Bruijn graphs. To let this method scale efficiently to extremely long reads, the aforementioned segmentation strategy was generalized. This self-correction method corrects reads of up to 340,000 base pairs, while scaling very well to more complex genomes, such as the human genome.
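The gap-bridging idea described above can be sketched with a toy fixed-order de Bruijn graph built from short reads (the thesis tool uses a variable-order graph; all function names, the value of k, and the reads below are illustrative assumptions, not the actual implementation):

```python
# Toy sketch, not the thesis implementation: bridging an uncovered gap
# on a long read with a fixed-order de Bruijn graph built from short
# reads. All names, reads, and parameters are illustrative assumptions.

def build_graph(reads, k):
    """Map each (k-1)-mer to the set of characters that follow it in the reads."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[-1])
    return graph

def bridge(graph, source, target, k, max_len=100):
    """Depth-first search for a sequence of characters leading from the
    source (k-1)-mer to the target (k-1)-mer; returns the bridged sequence."""
    stack = [(source, "")]
    while stack:
        node, path = stack.pop()
        if node == target:
            return source + path
        if len(path) >= max_len:  # guard against cycles in the graph
            continue
        for c in graph.get(node, ()):
            stack.append(((node + c)[1:], path + c))
    return None

short_reads = ["GTACGGAT", "CGGATTTC"]
g = build_graph(short_reads, k=4)
print(bridge(g, "GTA", "TTC", k=4))  # GTACGGATTTC
```

Here the two (k-1)-mers play the role of solid anchors on either side of an uncovered long-read region, and the path found in the graph supplies the corrected sequence between them.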
122

Accelerated Large-Scale Multiple Sequence Alignment with Reconfigurable Computing

Lloyd, G Scott 20 May 2011 (has links) (PDF)
Multiple Sequence Alignment (MSA) is a fundamental analysis method used in bioinformatics and many comparative genomic applications. The time to compute an optimal MSA grows exponentially with respect to the number of sequences. Consequently, producing timely results on large problems requires more efficient algorithms and the use of parallel computing resources. Reconfigurable computing hardware provides one approach to the acceleration of biological sequence alignment. Other acceleration methods typically encounter scaling problems that arise from the overhead of inter-process communication and from the lack of parallelism. Reconfigurable computing allows a greater scale of parallelism with many custom processing elements that have a low-overhead interconnect. The proposed parallel algorithms and architecture accelerate the most computationally demanding portions of MSA. An overall speedup of up to 150 has been demonstrated on a large data set when compared to a single processor. The reduced runtime for MSA allows researchers to solve the larger problems that confront biologists today.
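The exponential growth mentioned above can be made concrete: exact dynamic-programming MSA of N sequences of length L fills an N-dimensional table of (L+1)^N cells, which is why heuristics and hardware acceleration are needed. A quick back-of-the-envelope sketch (the function name is ours, not from the dissertation):

```python
# Number of cells in the exact dynamic-programming table for a
# multiple sequence alignment of n_seqs sequences of equal length.
def dp_cells(n_seqs, length):
    return (length + 1) ** n_seqs

# Even modest inputs blow up quickly:
for n in (2, 3, 5, 10):
    print(n, dp_cells(n, length=100))
```

For two sequences of length 100 the table has about 10^4 cells; for ten sequences it already exceeds 10^20, far beyond what any single processor can fill.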
123

Ensemble Methods for Historical Machine-Printed Document Recognition

Lund, William B. 03 April 2014 (has links) (PDF)
The usefulness of digitized documents is directly related to the quality of the extracted text. Optical Character Recognition (OCR) has reached a point where well-formatted and clean machine-printed documents are easily recognizable by current commercial OCR products; however, older or degraded machine-printed documents present problems to OCR engines, resulting in word error rates (WER) that severely limit either automated or manual use of the extracted text. Major archives of historical machine-printed documents are being assembled around the globe, requiring an accurate transcription of the text for the automated creation of descriptive metadata, full-text searching, and information extraction. Given document images to be transcribed, ensemble recognition methods drawing on multiple sources of evidence, both from the original document image and from information sources external to the document, have been shown in this and related work to improve output. This research introduces new methods of evidence extraction, feature engineering, and evidence combination to correct errors from state-of-the-art OCR engines. It also investigates the success and failure of ensemble methods in the OCR error correction task, as well as the conditions under which these methods reduce the WER and improve the quality of the OCR transcription, showing that the average document WER can be reduced below that of a state-of-the-art commercial OCR system by between 7.4% and 28.6%, depending on the test corpus and methods. This research on OCR error correction contributes to the larger field of ensemble methods as follows. Four unique corpora for OCR error correction are introduced: the Eisenhower Communiqués, a collection of typewritten documents from 1944 to 1945; the Nineteenth Century Mormon Articles Newspaper Index, from 1831 to 1900; and two synthetic corpora based on the Enron (2001) and Reuters (1997) datasets.
The Reverse Dijkstra Heuristic is introduced as a novel admissible heuristic for the A* exact alignment algorithm. The impact of the heuristic is a dramatic reduction in the number of nodes processed during text alignment as compared to the baseline method. From the aligned text, the method developed here creates a lattice of competing hypotheses for word tokens. In contrast to much of the work in this field, the word token lattice is created from a character alignment, preserving split and merged tokens within the hypothesis columns of the lattice. This alignment method more explicitly identifies competing word hypotheses which may otherwise have been split apart by a word alignment. Lastly, this research explores, in order of increasing contribution to word error rate reduction: voting among hypotheses, decision lists based on an in-domain training set, ensemble recognition methods with novel feature sets, multiple binarizations of the same document image, and training on synthetic document images.
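The simplest of the ensemble steps listed above, voting among hypotheses, can be sketched as follows, assuming the OCR outputs are already character-aligned with `-` as a gap symbol (the actual system builds a word-token lattice from the character alignment; this toy shows only the column-wise voting step, and the inputs are invented):

```python
# Minimal sketch of column-wise voting among pre-aligned OCR
# hypotheses. Inputs and names are illustrative, not from the thesis.
from collections import Counter

def vote(hypotheses):
    """Pick the most frequent character in each aligned column,
    dropping gap symbols from the winning consensus."""
    out = []
    for column in zip(*hypotheses):
        char, _ = Counter(column).most_common(1)[0]
        if char != "-":
            out.append(char)
    return "".join(out)

aligned = ["c0mmunique", "communiqve", "communique"]
print(vote(aligned))  # communique
```

The thesis finds that voting alone contributes the least to WER reduction; the later steps (decision lists, novel feature sets, multiple binarizations, synthetic training data) each contribute more.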
124

Smith-Waterman Sequence Alignment For Massively Parallel High-Performance Computing Architectures

Steinfadt, Shannon Irene 19 April 2010 (has links)
No description available.
125

Sequence alignment

Chia, Nicholas Lee-Ping 13 September 2006 (has links)
No description available.
126

In Vivo RNAi Rescue in Drosophila melanogaster with Genomic Transgenes from Drosophila pseudoobscura

Schnorrer, Frank, Tomancak, Pavel, Schönbauer, Cornelia, Ejsmont, Radoslaw K., Langer, Christoph C. H. 10 December 2015 (has links) (PDF)
Background: Systematic, large-scale RNA interference (RNAi) approaches are very valuable for systematically investigating biological processes in cell culture or in tissues of organisms such as Drosophila. A notorious pitfall of all RNAi technologies is potential false positives caused by unspecific knock-down of genes other than the intended target gene. The ultimate proof of RNAi specificity is a rescue by a construct immune to RNAi, typically originating from a related species. Methodology/Principal Findings: We show that primary sequence divergence in the areas targeted by Drosophila melanogaster RNAi hairpins in five non-melanogaster species is sufficient to identify orthologs for 81% of the genes that are predicted to be RNAi-refractory. We use clones from a genomic fosmid library of Drosophila pseudoobscura to demonstrate the rescue of RNAi phenotypes in Drosophila melanogaster muscles. Four out of five fosmid clones we tested harbour cross-species functionality for the gene assayed, and three out of the four rescue an RNAi phenotype in Drosophila melanogaster. Conclusions/Significance: The Drosophila pseudoobscura fosmid library is designed for seamless cross-species transgenesis and can be readily used to demonstrate the specificity of RNAi phenotypes in a systematic manner.
127

Alignment and Comparison of Sequences

Araujo, Francisco Eloi Soares de 24 May 2012 (has links)
Comparison of finite sequences is a tool used to solve problems in several areas. To compare sequences, we infer which edit operations of substitution, insertion, and deletion of symbols transform one sequence into another. Scoring matrices are a widely used structure that defines a cost for each type of edit operation. A scoring matrix G is indexed by the symbols of an alphabet: the entry of G in row A and column B gives the cost of the edit operation that replaces symbol A with symbol B. Scoring matrices induce functions that assign a score to a set of edit operations, and several such functions for comparing two and multiple sequences are studied in this thesis. When each symbol of each sequence is edited exactly once in transforming one sequence into another, the set of edit operations can be represented by a structure known as an alignment. We describe a structure that represents sets of edit operations which cannot be represented by a conventional alignment, and we design an algorithm that finds the cost of an optimal sequence of edit operations using a known algorithm for the cost of an optimal conventional alignment. Considering three different induced scoring functions, we characterize, for each of them, the class of matrices for which the induced scoring function is a metric on sequences. Given two scoring matrices G and G', we say they are equivalent for a given alignment-scoring function induced by a scoring matrix if, for any two alignments A and B of two sequences, alignment A is "better" than alignment B under G if and only if A is "better" than B under G'. In this work, we determine necessary and sufficient conditions for two scoring matrices to be equivalent. Finally, we define three new criteria for scoring alignments of multiple sequences; every criterion considers the length of the alignment in addition to the edit operations it represents. For each criterion we propose an algorithm, and we show the corresponding decision problem to be NP-complete.
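The central object of this thesis, a scoring matrix inducing a cost on edit operations, can be illustrated with the classical pairwise alignment recurrence (the costs, alphabet, and function name below are illustrative assumptions; this is not the thesis's algorithm for edit sequences that fall outside conventional alignments):

```python
# Sketch of a scoring-matrix-induced alignment cost via the classical
# dynamic-programming recurrence. Costs and names are illustrative.

def align_cost(s, t, sub, gap):
    """Minimum total cost of edit operations turning s into t, given a
    substitution-cost matrix `sub` (a dict of dicts) and a gap (indel) cost."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * gap
    for j in range(1, n + 1):
        d[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + sub[s[i - 1]][t[j - 1]],  # substitution
                d[i - 1][j] + gap,                          # deletion
                d[i][j - 1] + gap,                          # insertion
            )
    return d[m][n]

# A unit-cost substitution matrix reduces this to plain Levenshtein distance.
unit = {a: {b: (0 if a == b else 1) for b in "abc"} for a in "abc"}
print(align_cost("abca", "acb", unit, gap=1))  # 2
```

Changing the entries of `sub` changes which alignment is optimal, which is exactly what the equivalence question above is about: two matrices are equivalent when they always rank any two alignments the same way.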
128

Epidemiology of Representations: An Empirical Approach

Lerique, Sébastien 27 October 2017 (has links)
We propose an empirical contribution to recent attempts to unify cognitive science and social science. We focus on Cultural Attraction Theory (CAT), a framework that proposes a common ontology of representations through which cognitive and social science can address interdisciplinary questions. CAT hypothesizes that, in spite of important transformations at the micro level, the overall distribution of representations remains stable due to cultural attractors. Testing this hypothesis is challenging, and existing approaches have several shortcomings; yet, by taking advantage of web technologies, one can combine the strengths of existing techniques to expand the range of possible empirical studies. We develop two case studies on short written utterances. The first examines the transformations that quotations undergo as they are propagated online. By connecting data-mining tools with psycholinguistics, we show that word substitutions in quotations are consistent with the hypothesis of cultural attractors and with known effects of lexical features. The second case study expands these results, making use of a purpose-built web experiment to gather large, high-quality transmission-chain data sets. By extending a bioinformatics alignment algorithm, we decompose transformations into simpler operations and propose a first descriptive model relating psycholinguistic knowledge of sentence transformation to high-level trends identified in the cultural evolution literature. Finally, we show that further understanding the evolution of such representations requires an account of the meaning of utterances in context, a task for which we flesh out possible empirical approaches.
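As a toy analogue of decomposing an utterance transformation into simpler operations (the thesis extends a bioinformatics alignment algorithm; here we merely use the standard library's `difflib` on word tokens, so the operation inventory and granularity differ from the actual model):

```python
# Toy decomposition of a sentence transformation into edit operations
# on word tokens, using difflib. Illustrative only.
import difflib

def decompose(before, after):
    """Return (operation, words_removed, words_inserted) triples that
    turn one utterance into another, skipping unchanged spans."""
    a, b = before.split(), after.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if tag != "equal":
            ops.append((tag, a[i1:i2], b[j1:j2]))
    return ops

print(decompose("to be or not to be", "to be or to live"))
```

Each triple is one micro-transformation of the kind whose statistics, aggregated over many transmission chains, the descriptive model relates to cultural-evolution trends.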
129

Reaction Modules: A New Concept to Study the Evolution of Metabolic Pathways

Barba, Matthieu 16 December 2011 (has links)
I designed a methodology to annotate enzyme superfamilies, trace their history, and place them in the context of metabolic pathway evolution. Three superfamilies were studied. (1) Cyclic amidohydrolases, including the DHOases (dihydroorotases, pyrimidine biosynthesis), for which I proposed a new classification; the phylogenetic tree also includes the dihydropyrimidinases (DHPases) and allantoinases (ALNases), which catalyze similar reactions in other pathways (pyrimidine and purine degradation, respectively). (2) The DHODase superfamily (acting after the DHOases) shows a phylogeny similar to that of the DHOases and likewise includes enzymes from other pathways, in particular the DHPDases (acting after the DHPases). This observation gave rise to the concept of the reaction module: a conserved series of similar reactions occurring in different metabolic pathways. (3) This concept was then applied to the study of the carbamoyltransferases (TCases), which include the ATCases (acting before the DHOases). I first showed the existence of a new TCase potentially involved in purine degradation and proposed a new role for it using the reaction-module concept (chaining with ALNase). In these three large families I also identified three groups of unidentified paralogs that nevertheless share the same genetic context, called "Yge", and would thus form a reaction module belonging to a new, hypothetical pathway. Applied to various pathways, the concept of reaction modules may thus reflect the ancestral metabolic pathways of which they would be the basic building blocks.
