121 |
The Systematic Design and Application of Robust DNA Barcodes. Buschmann, Tilo, 02 September 2016 (has links)
High-throughput sequencing technologies are improving in quality, capacity, and costs, providing versatile applications in DNA and RNA research. For small genomes or fractions of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag, index, or barcode that is attached to the sequencing or amplification primer and hence accompanies every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence.
Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error-correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and codes based on the Levenshtein distance.
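As background, the two distance measures mentioned above can be sketched in a few lines of Python (the barcode sequences are made-up examples):

```python
def hamming(a, b):
    # Hamming distance: substitutions only, defined for equal-length words.
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    # Levenshtein distance: minimum number of substitutions, insertions,
    # and deletions, computed by dynamic programming.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

print(hamming("ACGT", "ACCT"))     # 1 (one substitution)
print(levenshtein("ACGT", "AGT"))  # 1 (one deletion)
```

A Hamming-based code models substitutions only, while a Levenshtein-based code also covers indels, which is why the thesis builds on the latter for DNA barcodes.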
Levenshtein-based codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this thesis we demonstrate the decreased error-correction capability of Levenshtein-based codes in a DNA context and suggest an adaptation of Levenshtein-based codes that provably corrects nucleotide errors in DNA sequences efficiently. In our adaptation, we take any DNA context into account and impose stricter rules for the selection of barcode sets. In simulations we show the superior error-correction capability of the new method compared to traditional Levenshtein- and Hamming-based codes in the presence of multiple errors.
We present an adaptation of Levenshtein-based codes to DNA contexts capable of guaranteed correction of a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of correcting on average more random mutations than traditional Levenshtein-based or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations, and the barcode length for highest performance.
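As an illustration of how such a generator might filter candidates, the following sketch greedily accepts barcodes that keep the set's minimum pairwise distance above a threshold. It uses plain Hamming distance and hypothetical parameters for brevity; the thesis software applies the stricter, context-aware Sequence-Levenshtein rules instead:

```python
from itertools import product

def hamming(a, b):
    # Positions at which two equal-length barcodes differ.
    return sum(x != y for x, y in zip(a, b))

def greedy_barcode_set(length, min_dist):
    # Greedily keep every candidate whose distance to all barcodes
    # accepted so far is at least `min_dist` (a code with minimum
    # distance 2e + 1 corrects e substitution errors).
    accepted = []
    for cand in product("ACGT", repeat=length):
        word = "".join(cand)
        if all(hamming(word, kept) >= min_dist for kept in accepted):
            accepted.append(word)
    return accepted

codes = greedy_barcode_set(length=4, min_dist=3)
print(len(codes), codes[:3])
```

The greedy pass is not optimal, but it guarantees the distance property that the decoder relies on.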
However, not every platform is susceptible to large numbers of both indel and substitution errors. The Illumina “Sequencing by Synthesis” platform shows a very large number of substitution errors as well as a very specific shift of the read that results in inserted and deleted bases at the 5’-end and the 3’-end (which we call phaseshifts). We argue that in this scenario the application of Sequence-Levenshtein-based codes is inefficient because it targets a category of errors that barely occurs on this platform, which needlessly reduces the code size. As a solution, we propose the “Phaseshift distance”, which exclusively supports the correction of substitutions and phaseshifts. Additionally, we enable the correction of arbitrary combinations of substitution and phaseshift errors. Thus, we address the lopsided ratio of substitutions to phaseshifts on the Illumina platform.
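The formal Phaseshift distance is defined in the thesis; as a rough, hypothetical sketch only, one can charge |s| for sliding the read by s bases (the insertions/deletions at the ends) plus one per substitution in the remaining overlap, minimised over shifts:

```python
def phaseshift_dist(a, b, max_shift=2):
    # Illustrative approximation, not the thesis's formal definition:
    # a phaseshift by s bases slides the read, deleting s bases at one
    # end and inserting s unknown bases at the other.  Cost = |s| plus
    # one per mismatch in the overlapping part, minimised over shifts.
    best = float("inf")
    n = len(a)
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            overlap = zip(a[s:], b[:n - s])
        else:
            overlap = zip(a[:n + s], b[-s:])
        cost = abs(s) + sum(x != y for x, y in overlap)
        best = min(best, cost)
    return best

print(phaseshift_dist("ACGTACGT", "CGTACGTA"))  # 1: a single one-base phaseshift
```

Because only shifts and substitutions are charged, no capacity is spent on interior indels that the platform rarely produces, which is what allows larger code sizes.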
To compare codes based on the Phaseshift distance to Hamming codes as well as codes based on the Sequence-Levenshtein distance, we simulated an experimental scenario based on the error pattern we identified on the Illumina platform. Furthermore, we generated a large number of different sets of DNA barcodes using the Phaseshift distance and compared codes of different lengths and error correction capabilities. We found that codes based on the Phaseshift distance can correct a number of errors comparable to codes based on the Sequence-Levenshtein distance while offering a number of DNA barcodes comparable to Hamming codes. Thus, codes based on the Phaseshift distance show a higher efficiency in the targeted scenario. In some cases (e.g., with PacBio SMRT in Continuous Long Read mode), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required to detect barcoded reads and avoid large numbers of false positives or negatives.
For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads and suggest possible improvements.
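The adapted method itself is not reproduced here; for background, the classical Benjamini-Hochberg step-up procedure that underlies most FDR control can be sketched as follows (the p-values are made up for illustration):

```python
def benjamini_hochberg(pvals, q=0.05):
    # Standard BH step-up procedure: sort the p-values, find the largest
    # rank k with p_(k) <= (k / m) * q, and reject hypotheses 1..k.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, 1):
        if pvals[i] <= rank / m * q:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]

# Example: per-read p-values for "this read carries a real barcode".
print(benjamini_hochberg([0.001, 0.008, 0.04, 0.2, 0.9], q=0.05))
# [True, True, False, False, False]
```

Controlling the FDR rather than a per-read error rate is what keeps the balance between false-positive and false-negative barcode assignments at scale.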
122 |
Modélisation des réseaux de régulation de l’expression des gènes par les microARN (Modelling gene-expression regulatory networks by microRNAs). Poirier-Morency, Guillaume, 12 1900 (has links)
MicroRNAs are small non-coding RNAs of approximately 22 nucleotides
involved in the regulation of gene expression. They target complementary
regions of the messenger RNA molecules that these genes encode and adjust their translation levels according to the needs of the cell.
As microRNAs and their RNA targets bind each other with imperfect complementarity, these two groups actively compete to form regulatory interactions. Consequently, quantitatively predicting the equilibrium concentrations of the duplexes formed is a task that must take several factors into account, including hybridization affinity, the ability to catalyze the target, cooperativity, and RNA accessibility.
In the model we propose, miRBooking 2.0, each possible interaction between
a microRNA and a binding site on a target RNA is characterized by an enzymatic
reaction. A reaction of this type operates in two phases: a reversible formation
of an enzyme-substrate complex, the microRNA-RNA duplex, followed by an irreversible conversion of the substrate into a product, a degraded target RNA, and the release of the enzyme, which can subsequently participate in other reactions.
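This two-phase scheme is the classical Michaelis-Menten mechanism E + S ⇌ ES → E + P; under the quasi-steady-state assumption, the duplex concentration follows [ES] = E_total·[S]/(K_M + [S]) with K_M = (k_off + k_cat)/k_on. A minimal sketch with made-up rate constants:

```python
def duplex_occupancy(mirna_total, target_free, k_on, k_off, k_cat):
    # Quasi-steady-state concentration of the microRNA-target duplex
    # for the scheme  E + S <=> ES -> E + P  (Michaelis-Menten):
    #   [ES] = E_total * [S] / (K_M + [S]),  K_M = (k_off + k_cat) / k_on.
    km = (k_off + k_cat) / k_on
    return mirna_total * target_free / (km + target_free)

# Illustrative (made-up) rate constants and concentrations:
print(duplex_occupancy(mirna_total=100.0, target_free=50.0,
                       k_on=1e-3, k_off=0.02, k_cat=0.03))  # 50.0
```

In the full model, one such relation holds for every microRNA/site pair, and the shared enzyme and substrate pools are what couple the millions of equations together.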
We show that the stationary state of this system, which in practice can involve up to 10 million equations, is unique and that its Jacobian has very few nonzero entries, allowing its efficient resolution using a sparse linear solver. This solution allows us to
characterize precisely the mechanism of regulation and to study the role of
microRNAs in a given cellular context. Predictions obtained on a HeLa S3 cell
model correlate significantly with an experimentally obtained dataset and remarkably explain the expression threshold effects of
genes. Using this solution as an initial condition and an explicit method of
numerical integration, we simulate in real time the response of the system to
changes in experimental conditions.
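As a generic illustration of explicit numerical integration (the thesis system couples millions of species; a toy first-order decay stands in here), the forward Euler method reads y_{n+1} = y_n + h·f(t_n, y_n):

```python
def euler(f, y0, t0, t1, steps):
    # Forward (explicit) Euler: y_{n+1} = y_n + h * f(t_n, y_n).
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        y = y + h * f(t, y)
        t += h
    return y

# Toy response: first-order decay dy/dt = -k * y after a condition change.
k = 0.5
y_end = euler(lambda t, y: -k * y, y0=1.0, t0=0.0, t1=2.0, steps=10000)
print(round(y_end, 4))  # close to exp(-1) ≈ 0.3679
```

Starting the integration from the computed stationary state means only the perturbation has to be simulated, not the approach to equilibrium.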
We apply this model to target elements involved in the epithelial-mesenchymal transition (EMT), a biological mechanism by which cells acquire the mobility essential to proliferate. By
identifying differentially transcribed elements between the epithelial and mesenchymal conditions, we design specific synthetic microRNAs to interfere with this transition. To do so, we
propose a method based on a parallel greedy best-first search to efficiently explore the microRNA sequence space and present preliminary results on
known EMT markers.
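A serial sketch of greedy best-first search over the space of single-base substitutions; the scoring function below (G/C content) is a hypothetical stand-in for the thesis's interference objective, and the real implementation parallelises the expansion:

```python
import heapq

def greedy_best_first(start, score, neighbours, max_expansions=50):
    # Greedy best-first search: always expand the highest-scoring
    # candidate discovered so far (min-heap on negated scores).
    frontier = [(-score(start), start)]
    seen = {start}
    best = start
    for _ in range(max_expansions):
        if not frontier:
            break
        neg, cand = heapq.heappop(frontier)
        if -neg > score(best):
            best = cand
        for nxt in neighbours(cand):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return best

def neighbours(seq):
    # All single-base substitutions of a 22-nt RNA sequence.
    for i in range(len(seq)):
        for base in "ACGU":
            if base != seq[i]:
                yield seq[:i] + base + seq[i + 1:]

# Hypothetical stand-in objective: G/C content (the real objective
# scores interference with EMT markers and is far more involved).
score = lambda seq: sum(b in "GC" for b in seq)
best_seq = greedy_best_first("A" * 22, score, neighbours)
print(best_seq, score(best_seq))
```

With 4^22 possible sequences, exhaustive search is infeasible; the greedy strategy trades optimality for a tractable number of objective evaluations.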
123 |
Analysing Non-Desired Output Data from High Throughput Sequencers for the Identification of the Source of Contamination. Martinez Maldonado, Mayra Guadalupe, January 2022 (has links)
High-throughput sequencing (HTS) technologies keep developing rapidly, increasing throughput and lowering the probability of errors. MGI Tech Co., Ltd.
(MGI) is a leading HTS brand that uses DNBSEQ technology and is present at the Centre for Translational Microbiome Research (CTMR). MGI’s sequencers are highly sensitive, so it is critical to follow the protocols when handling samples to avoid introducing contamination. This project explores data previously generated at CTMR to determine how and where in the sequencing process contamination has been introduced. The data are divided into two main categories: primary, or real, data (RD) and secondary data, further divided into Never Used Barcodes (NUB) and Non-Sequenced (NS). The RD is true to the sample, while NUB and NS are considered retrieved background noise. The RD, NUB, and NS were subjected to taxonomic analyses, at genus and species level, and to barcode analyses using the RStudio interface to identify and contrast the most frequent entries in each category. Moreover, the RD was also subjected to decontam analysis on two databases, VaMyGyn and KOLBIBAKT; decontam is used to identify contaminant species in a community. After the analysis, there was no strong evidence suggesting lab contamination or contaminated reagents. Some NUB shared substrings with RD barcodes, and their read counts were correlated across samples. This may indicate that RD barcodes with sequencing errors are falsely identified as NUB; however, more analyses are needed to verify this. CTMR is now aware that contamination from the lab, reagents, or manipulation is not the cause of the retrieved background noise.
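The kind of evidence described, shared substrings plus correlated read counts across samples, can be checked with a short sketch (barcodes and per-sample counts below are made up for illustration):

```python
def pearson(x, y):
    # Pearson correlation between two equal-length count vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def shares_substring(a, b, k=6):
    # True if the two barcodes share any k-mer.
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))

# Hypothetical per-sample read counts for an RD barcode and a NUB:
rd_counts = [1200, 300, 5000, 80, 950]
nub_counts = [13, 2, 49, 1, 10]
print(shares_substring("ACGTACGTAA", "CACGTACGTA"))   # True
print(round(pearson(rd_counts, nub_counts), 3))       # 0.999
```

A NUB whose counts track a similar RD barcode this closely is more plausibly a sequencing-error shadow of that barcode than an independent contaminant.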