31

The Systematic Design and Application of Robust DNA Barcodes

Buschmann, Tilo 19 September 2016 (has links) (PDF)
High-throughput sequencing technologies are improving in quality, capacity, and cost, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag, index, or barcode that is attached to the sequencing or amplification primer and hence accompanies every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence. Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error-correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular families of error-correcting codes are Hamming codes and codes based on the Levenshtein distance. Levenshtein-based codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this thesis we demonstrate the decreased error-correction capability of Levenshtein-based codes in a DNA context and suggest an adaptation of Levenshtein-based codes that is proven to efficiently correct nucleotide errors in DNA sequences. In our adaptation, we take any DNA context into account and impose stricter rules for the selection of barcode sets. In simulations we show the superior error-correction capability of the new method compared to traditional Levenshtein- and Hamming-based codes in the presence of multiple errors. We present an adaptation of Levenshtein-based codes to DNA contexts capable of guaranteed correction of a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of correcting, on average, more random mutations than traditional Levenshtein-based or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations, and the barcode length for highest performance. However, not every platform is susceptible to a large number of both indel and substitution errors. The Illumina “Sequencing by Synthesis” platform shows a very large number of substitution errors as well as a very specific shift of the read that results in inserted and deleted bases at the 5’-end and the 3’-end (which we call phaseshifts). We argue that in this scenario the application of Sequence-Levenshtein-based codes is not efficient, because it targets a category of errors that barely occurs on this platform, which reduces the code size needlessly. As a solution, we propose the “Phaseshift distance”, which exclusively supports the correction of substitutions and phaseshifts. Additionally, we enable the correction of arbitrary combinations of substitution and phaseshift errors. Thus, we address the lopsided number of substitutions compared to phaseshifts on the Illumina platform. To compare codes based on the Phaseshift distance to Hamming codes as well as codes based on the Sequence-Levenshtein distance, we simulated an experimental scenario based on the error pattern we identified on the Illumina platform.
Furthermore, we generated a large number of different sets of DNA barcodes using the Phaseshift distance and compared codes of different lengths and error-correction capabilities. We found that codes based on the Phaseshift distance can correct a number of errors comparable to codes based on the Sequence-Levenshtein distance, while offering a number of DNA barcodes comparable to Hamming codes. Thus, codes based on the Phaseshift distance show a higher efficiency in the targeted scenario. In some cases (e.g., with PacBio SMRT in Continuous Long Read mode), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert, so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required to detect barcoded reads and avoid a large number of false positives or negatives. For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads and suggest possible improvements.
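To make the distance-based decoding idea concrete, here is a minimal sketch of demultiplexing with the classical Levenshtein distance over a fixed-length read prefix. The barcode set and read are invented, and the fixed-window shortcut is precisely the weakness that the Sequence-Levenshtein adaptation described above is designed to remove.

```python
# Minimal distance-based demultiplexing with the classical Levenshtein distance
# over a fixed-length read prefix (a simplification; barcodes and read invented).
def levenshtein(a, b):
    """Classical edit distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def demultiplex(read, barcodes, max_dist=1):
    """Assign a read to the closest barcode; None if too distant or ambiguous."""
    k = len(barcodes[0])
    prefix = read[:k]                       # naive fixed-length window
    ranked = sorted((levenshtein(prefix, bc), bc) for bc in barcodes)
    best_d, best_bc = ranked[0]
    if best_d > max_dist or (len(ranked) > 1 and ranked[1][0] == best_d):
        return None                         # unrecoverable or ambiguous read
    return best_bc

barcodes = ["ACGT", "AGGA", "CTTC", "GACG"]   # invented 4-mer barcode set
print(demultiplex("ACTTGGATCC", barcodes))    # one edit away from "ACGT"
```

Because an indel inside the barcode shifts the true barcode boundary, the fixed prefix window systematically misjudges distances; this is the DNA-context problem that motivates the thesis.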
33

Contrôle des fausses découvertes lors de la sélection de variables en grande dimension / Control of false discoveries in high-dimensional variable selection

Bécu, Jean-Michel 10 March 2016 (has links)
In the regression framework, many studies address the so-called high-dimensional problem, in which the number of explanatory variables measured on each sample is much larger than the number of samples. Although variable selection is a classical question, standard methods do not apply in the high-dimensional setting. In this manuscript, we therefore present the transposition of classical statistical tests to high dimension. These tests are built on estimates of regression coefficients obtained by penalized linear regression, which is applicable in high dimension. Their main objective is to control the false discovery rate. The first contribution of this manuscript quantifies the uncertainty of regression coefficients estimated by ridge regression, which penalizes the coefficients by their l2 norm, in the high-dimensional setting; to this end, we devise a statistical test based on permutations. The second contribution is a two-step selection approach: a first screening step, based on the sparse Lasso regression, is followed by the selection step proper, in which the relevance of the pre-selected variables is tested. These tests are built on adaptive-ridge estimates, whose penalty is constructed from the Lasso coefficients learned during the screening step. A final contribution transposes this approach to the selection of groups of variables.
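As a rough illustration of the two-step idea (screening, then testing), the sketch below uses Lasso screening on one half of simulated data and ordinary t-tests with Benjamini-Hochberg control on the held-out half. It is not the adaptive-ridge permutation procedure developed in the thesis, and all data are simulated.

```python
# Screen-and-clean sketch of two-step selection in high dimension (not the
# thesis's adaptive-ridge permutation procedure; all data are simulated).
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 1000                                  # high dimension: p >> n
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                                    # five truly relevant variables
y = X @ beta + rng.standard_normal(n)

half = n // 2
screen = LassoCV(cv=5).fit(X[:half], y[:half])    # step 1: screening by the Lasso
keep = np.flatnonzero(screen.coef_ != 0)          # assumes far fewer than n/2 survive

# step 2: test the pre-selected variables on the held-out half
Xs, ys = X[half:, keep], y[half:]
ols = LinearRegression().fit(Xs, ys)
resid = ys - ols.predict(Xs)
dof = Xs.shape[0] - Xs.shape[1] - 1
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xs.T @ Xs)))
pvals = 2 * stats.t.sf(np.abs(ols.coef_ / se), dof)

# Benjamini-Hochberg at level q = 0.1
q, m = 0.1, len(pvals)
order = np.argsort(pvals)
below = pvals[order] <= q * np.arange(1, m + 1) / m
k = below.nonzero()[0].max() + 1 if below.any() else 0
print("selected variables:", keep[order[:k]])
```

Splitting the sample keeps the screening and the testing steps independent, which is what makes valid p-values possible after a data-driven pre-selection.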
34

Development of novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in MSAs

Gültas, Mehmet 18 September 2013 (has links)
Multiple sequence alignments (MSAs) of homologous proteins are useful tools for characterizing compensatory mutations between non-conserved residues. Identifying these residues in MSAs is an important task for better understanding the structural basis and molecular mechanisms of protein functions. Despite the extensive literature on compensatory mutations and on sequence conservation analysis for the detection of important residues, previous methods have mostly not taken the biochemical properties of amino acids into account, although these can be decisive for detecting compensatory mutation signals. Moreover, compensatory mutation signals in MSAs are often distorted by noise. A further bioinformatics problem therefore lies in separating significant signals from phylogenetic noise and unrelated pair signals. The goal of this work is to develop methods that integrate biochemical properties, such as similarities and dissimilarities of amino acids, into the identification of compensatory mutations and that cope with the noise. We therefore develop different methods based on classical and quantum information theory as well as on multiple testing procedures. Our first method is based on classical information theory; it mainly considers BLOSUM62-dissimilar pairs of amino acids as a model of compensatory mutations and integrates them into the identification of important residues. To complement this method, we develop a second method using the foundations of quantum information theory. This new method differs from the first by simultaneously modelling similar and dissimilar signals in the compensatory mutation analysis. Furthermore, to separate significant signals from noise, we develop an MSA-specific statistical model based on multiple testing procedures. We apply our methods to two human proteins, namely the epidermal growth factor receptor (EGFR) and glucokinase (GCK). The results show that the MSA-specific statistical model can separate significant signals from phylogenetic noise and unrelated pair signals. By considering only BLOSUM62-dissimilar pairs of amino acids, the first method successfully identifies the disease-associated important residues of both proteins. In contrast, by simultaneously modelling similar and dissimilar signals of amino acid pairs, the second method is more sensitive in identifying catalytic and allosteric residues.
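For readers unfamiliar with co-variation analysis, the sketch below computes plain mutual information between two alignment columns. This is only the classical-information starting point; it includes neither the BLOSUM62 weighting nor the quantum-information extension developed in the thesis, and the toy columns are invented.

```python
# Plain mutual information between two MSA columns -- the classical starting
# point for co-variation analysis (no BLOSUM62 weighting, no quantum extension).
import math
from collections import Counter

def column_mutual_information(col_i, col_j):
    """MI (in bits) between two aligned columns given as equal-length sequences."""
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# toy alignment columns (invented): residues co-vary between the two positions
col5 = list("KKKRRRKKRR")
col17 = list("EEEDDDEEDD")
print(round(column_mutual_information(col5, col17), 3))
```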
35

A Comparison of Microarray Analyses: A Mixed Models Approach Versus the Significance Analysis of Microarrays

Stephens, Nathan Wallace 20 November 2006 (has links) (PDF)
DNA microarrays are a relatively new technology for assessing the expression levels of thousands of genes simultaneously. Researchers hope to find genes that are differentially expressed by hybridizing cDNA from known treatment sources with various genes spotted on the microarrays. The large number of tests involved in analyzing microarrays has raised new questions in multiple testing. Several approaches for identifying differentially expressed genes have been proposed. This paper considers two: (1) a mixed models approach, and (2) the Significance Analysis of Microarrays.
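A bare-bones stand-in for either analysis is a gene-wise test followed by false discovery rate control. The sketch below runs per-gene t-tests with Benjamini-Hochberg adjustment on simulated expression values; all numbers are invented and this is neither the mixed-model nor the SAM procedure itself.

```python
# Gene-wise two-sample t-tests with Benjamini-Hochberg FDR control on simulated
# expression data (a stand-in for the mixed-model and SAM analyses compared above).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
genes, reps = 5000, 8
control = rng.normal(0.0, 1.0, size=(genes, reps))
treatment = rng.normal(0.0, 1.0, size=(genes, reps))
treatment[:100] += 1.5                                  # 100 truly differential genes

t, p = stats.ttest_ind(treatment, control, axis=1)      # one test per gene
reject, p_adj, _, _ = multipletests(p, alpha=0.05, method="fdr_bh")
print("genes called differentially expressed:", int(reject.sum()))
```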
36

How tissues tell time

Rosahl, Agnes Lioba 22 January 2015 (has links)
The circadian clock regulates physiological functions of many organs through its influence on gene expression timing. However, despite the common and well-studied core clock mechanism, the tissue-specific regulation of circadian genes remains poorly understood. Beyond their shared period of about 24 hours, circadian genes differ in the time of day at which their expression peaks and in the tissues in which they oscillate. Overrepresentation analysis is a tool for detecting transcription factor binding sites that might be involved in the regulation of co-expressed genes. To apply it to circadian genes, a clear definition of co-expressed gene subgroups and an appropriate choice of background genes are important prerequisites. In this setting of many subgroup comparisons, a hierarchical method for false discovery control helps to filter out significant findings. Based on two microarray time series in mouse macrophages and liver cells, the tissue-specific regulation of circadian genes in these two cell types was investigated by promoter analysis. Binding sites for the transcription factors CLOCK:BMAL1, NF-Y, and CREB were among the top candidates of overrepresented motifs in both cell types. Related transcription factors of the bHLH and bZIP families with specific complexation domains bind to motif variants with differing strengths, thereby arranging interactions with more tissue-specific regulators (e.g. HOX, GATA, FORKHEAD, REL, IRF, and ETS regulators, and nuclear receptors). Presumably, this influences the timing of pre-initiation complex formation at the promoter and hence tissue-specific transcription patterns. In this respect, the content of guanine (G) and cytosine (C) bases as well as of CpG dinucleotides are important promoter properties that direct the interaction probability of regulators, because the affinities with which transcription factors are attracted to promoters depend on these sequence characteristics.
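The overrepresentation test at the core of such a promoter analysis can be illustrated with a single Fisher's exact test comparing motif counts in a gene subgroup against background promoters; the counts below are invented for illustration.

```python
# One Fisher's exact test for motif overrepresentation in a promoter subgroup
# versus background promoters (counts invented for illustration).
from scipy.stats import fisher_exact

subgroup_with_motif, subgroup_total = 18, 40        # e.g. liver-peaking circadian genes
background_with_motif, background_total = 90, 600   # background promoter set

table = [
    [subgroup_with_motif, subgroup_total - subgroup_with_motif],
    [background_with_motif, background_total - background_with_motif],
]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, one-sided p = {p_value:.2e}")
```

Repeating such a test for every motif and every subgroup is what creates the multiple-comparison burden that the hierarchical false discovery control mentioned above addresses.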
37

Statistical co-analysis of high-dimensional association studies

Liley, Albert James January 2017 (has links)
Modern medical practice and science involve complex phenotypic definitions. Understanding patterns of association across this range of phenotypes requires co-analysis of high-dimensional association studies in order to characterise shared and distinct elements. In this thesis I address several problems in this area, with a general linking aim of making more efficient use of available data. The main application of these methods is in the analysis of genome-wide association studies (GWAS) and similar studies. Firstly, I developed methodology for a Bayesian conditional false discovery rate (cFDR) for leveraging GWAS results using summary statistics from a related disease. I extended an existing method to enable a shared control design, increasing power and applicability, and developed an approximate bound on false-discovery rate (FDR) for the procedure. Using the new method I identified several new variant-disease associations. I then developed a second application of shared control design in the context of study replication, enabling improvement in power at the cost of changing the spectrum of sensitivity to systematic errors in study cohorts. This has application in studies on rare diseases or in between-case analyses. I then developed a method for partially characterising heterogeneity within a disease by modelling the bivariate distribution of case-control and within-case effect sizes. Using an adaptation of a likelihood-ratio test, this allows an assessment to be made of whether disease heterogeneity corresponds to differences in disease pathology. I applied this method to a range of simulated and real datasets, enabling insight into the cause of heterogeneity in autoantibody positivity in type 1 diabetes (T1D). Finally, I investigated the relation of subtypes of juvenile idiopathic arthritis (JIA) to adult diseases, using modified genetic risk scores and linear discriminants in a penalised regression framework. The contribution of this thesis is in a range of methodological developments in the analysis of high-dimensional association study comparison. Methods such as these will have wide application in the analysis of GWAS and similar areas, particularly in the development of stratified medicine.
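The conditional FDR idea can be sketched with the basic empirical estimator cFDR(p|q) ≈ p · #{q_i ≤ q} / #{p_i ≤ p, q_i ≤ q}. The code below implements only this simple estimator on simulated p-value pairs, not the Bayesian extension or the shared-control adjustment developed in the thesis.

```python
# Basic empirical conditional FDR: cFDR(p|q) ~= p * #{q_i <= q} / #{p_i <= p, q_i <= q}.
# Simplified estimator on simulated p-value pairs; not the thesis's Bayesian method.
import numpy as np

def empirical_cfdr(p, q, p_all, q_all):
    """p_all, q_all: p-values of all SNPs in the principal and conditional study."""
    cond = q_all <= q
    joint = cond & (p_all <= p)
    if joint.sum() == 0:
        return 1.0
    return min(1.0, p * cond.sum() / joint.sum())

rng = np.random.default_rng(2)
p_all = rng.uniform(size=100_000)        # invented, independent (null) p-value pairs
q_all = rng.uniform(size=100_000)
# Under independence the estimate stays near 1; shared signal between the two
# studies concentrates small p_i among small q_i and drives the estimate down.
print(empirical_cfdr(0.01, 0.1, p_all, q_all))
```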
38

Efron’s Method on Large Scale Correlated Data and Its Refinements

Ghoshal, Asmita 11 August 2023 (has links)
No description available.
39

Predicting stock market trends using time-series classification with dynamic neural networks

Mocanu, Remus 09 1900 (has links)
The objective of this research was to evaluate the efficacy of a classification setting for predicting stock market trends. Traditional forecasting-based methods, which target the immediate next time step, often encounter challenges due to non-stationary data, compromising model accuracy and stability. In contrast, our classification approach predicts broader stock price movements over multiple time steps, aiming to reduce data non-stationarity. Our dataset, derived from various NASDAQ-100 stocks and informed by multiple technical indicators, was used with a Mixture of Experts composed of a soft gating mechanism and a transformer-based architecture. Although the main method of this experiment did not prove as successful as we had initially hoped, the methodology was capable of surpassing all baselines in certain instances within a few epochs, demonstrating the lowest false discovery rate while maintaining an acceptable, non-zero recall rate. Given these results, our approach not only encourages further research in this direction, in which finer tuning of the model can be implemented, but also offers traders who invest with the help of machine learning a different tool for predicting stock market trends, using a classification setting and a differently defined problem. It is important to note, however, that our study is based on NASDAQ-100 data, limiting the model's immediate applicability to other stock markets or varying economic conditions. Future research could enhance performance by integrating company fundamentals and conducting sentiment analysis on stock-related news, as our current work considers only technical indicators and stock-specific numerical features.
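The soft-gating mixture-of-experts architecture mentioned above can be sketched as follows. The expert networks here are small feed-forward layers rather than the transformer-based experts of the thesis, and the feature dimension and three-class trend labels (down/flat/up) are invented.

```python
# Minimal soft-gated mixture of experts over a vector of technical indicators --
# an illustrative stand-in for the thesis's transformer-based experts.
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, n_features, n_experts=4, n_classes=3):
        super().__init__()
        self.gate = nn.Linear(n_features, n_experts)          # soft gating weights
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_classes))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)                # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, n_experts, n_classes)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)          # weighted class logits

model = SoftMoE(n_features=20)
x = torch.randn(8, 20)                     # batch of 8 invented indicator vectors
print(model(x).shape)                      # torch.Size([8, 3])
```

Because the gate outputs a full softmax rather than a hard choice, every expert contributes to every prediction, which keeps the model differentiable end to end.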
