1 |
RNA Sequence Classification Using Secondary Structure Fingerprints, Sequence-Based Features, and Deep Learning
Sutanto, Kevin, 12 March 2021
RNAs are involved in many facets of biological processes, including but not limited to controlling and inhibiting gene expression, enabling transcription and translation from DNA to proteins, disease processes such as cancer, and virus-host interactions. As such, studies and analyses involving RNAs may lead to useful applications, such as detecting cancer by measuring the abundance of specific RNAs, detecting and identifying infections involving RNA viruses, identifying the origins of and relationships between RNA viruses, and identifying potential targets when designing novel drugs.
Extracting sequences from RNA samples is usually not a major limitation anymore thanks to sequencing technologies such as RNA-Seq. However, accurately identifying and analyzing the extracted sequences is often still the bottleneck when it comes to developing RNA-based applications.
Like proteins, functional RNAs can fold into complex structures in order to perform specific functions throughout their lifecycle. This suggests that structural information, in addition to the sequence of the RNA itself, can be used to identify or classify RNA sequences. Furthermore, a strand of RNA may be able to fold into more than one structural conformation, and a strand may also form different structures in vivo and in vitro. However, past studies that used secondary structure information for RNA identification have relied on a single predicted secondary structure for each RNA sequence, despite the possible one-to-many relationship between a strand of RNA and its secondary structures. Therefore, we hypothesized that a representation that includes the multiple possible secondary structures of an RNA may improve classification performance.
We proposed and built a pipeline that, given an RNA sequence, produces secondary structure fingerprints that take into account the multiple possible secondary structures of a single RNA. Using this pipeline, we explored and developed different types of secondary structure fingerprints. One type serves as a high-level topological representation of the RNA structure, while another represents matches with common known RNA secondary structure motifs that we curated from databases and the literature. To test our hypothesis, we then used the different fingerprints with deep learning on different datasets, alone and together with various sequence-based features, to investigate how the secondary structure fingerprints affect classification performance.
Finally, by analyzing our findings, we also propose approaches that can be adopted by future studies to further improve our secondary structure fingerprints and classification performance.
|
2 |
A l’assaut du puzzle transcriptomique : optimisations, applications et nouvelles méthodes d’analyse pour le RNA-Seq / Unraveling the transcriptomic puzzle: optimizations, applications and new analysis methods for RNA-Sequencing
Audoux, Jérôme, 08 March 2017
Depuis leurs apparitions, les technologies de séquençage à haut débit (NGS) ont permis de révolutionner notre connaissance du transcriptome. Le RNA-Seq ou séquençage à haut-débit des transcrits, permet la numérisation rapide d’un transcriptome sous forme de millions de courtes séquences d’ADN. Contenue dans ces données brutes, l’information des transcrits peut être analysée quantitativement sous forme de profils d’expression. Les séquences obtenues contiennent également une multitude d’informations qualitatives comme les jonctions d’épissage, les variants génomiques ou post-transcriptionnels, ainsi que de nouvelles formes de transcriptions moins conventionnelles comme les ARN circulaires, les gènes de fusions ou les longs ARN non-codants. Peu à peu, le RNA-Seq s’impose comme une technologie de référence dans la recherche en biologie, et, demain dans la médecine génomique. Mes travaux de thèse proposent une vue transversale de la technologie RNA-Seq avec comme point de départ l’optimisation des méthodes d’analyses actuelles dans un contexte donné - via des procédures de benchmarking systématiques s’appuyant sur la simulation de données. Ces optimisations sont ensuite exploitées, dans le cadre d’applications sur la biologie des cancers (Leucémies et Hépatoblastome), afin d’identifier de nouveaux biomarqueurs, ainsi qu’une nouvelle stratification des patients dans le but de proposer des pistes thérapeutiques personnalisées. Enfin, mes derniers travaux portent sur la proposition de deux nouvelles méthodes d’analyse du RNA-Seq par décomposition en k-mers. La première, TranSiPedia, propose un nouveau paradigme, ayant pour objectif d'intégrer les données du transcriptome à très large échelle, via l'indexation systématique de données expérimentales. La seconde méthode, DE-kupl, propose une analyse différentielle - sans apriori - des données RNA-Seq pour l’identification de nouveaux biomarqueurs et la caractérisation de nouveaux mécanismes du transcriptome.
/ Since their introduction, next-generation sequencing (NGS) technologies have shaped our vision of the transcriptome. RNA-Seq, or high-throughput transcript sequencing, enables the fast digitization of a transcriptome in the form of millions of short DNA sequences. The information available in the raw data can be used quantitatively to extract expression profiles. The sequences obtained also provide a wide range of qualitative information such as splice junctions, genomic or post-transcriptional variants, as well as less conventional forms of transcription such as circular RNAs, fusion genes or long non-coding RNAs. Gradually, RNA-Seq is becoming a gold standard in molecular biology and, tomorrow, in genomic medicine. My thesis work proposes a global vision of the RNA-Seq technology, starting with the optimization of current analysis methods for a particular context through systematic benchmarking procedures relying on data simulation. These optimizations are then used in work on the biology of cancer in order to identify new biomarkers in leukemia, as well as a new stratification of hepatoblastoma patients to propose personalized treatments. Finally, my last work focuses on the proposal of two new analysis methods for RNA-Seq data, both based on the principle of k-mer decomposition. The first method, TranSiPedia, is a new paradigm for integrating transcriptome data at a very large scale through the systematic indexing of experimental data. The second method, DE-kupl, is a new strategy for performing differential analysis without a priori knowledge about the transcriptome. DE-kupl is designed to help the discovery of new biomarkers as well as the characterization of new mechanisms of the transcriptome.
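The k-mer decomposition that both methods in this abstract rely on can be illustrated with a minimal sketch. This is not the TranSiPedia or DE-kupl implementation; the function name `count_kmers` and the toy reads are assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a collection of reads.

    Hypothetical helper for illustration; not taken from any tool named above.
    """
    counts = Counter()
    for read in reads:
        # A read of length n yields n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input: two short overlapping reads (made-up sequences).
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, 3)
```

A table of k-mer counts per sample is the kind of representation on which a reference-free differential analysis could then operate, since it does not depend on an existing transcript annotation.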
|
3 |
[en] A NOVEL APPROACH FOR DE BRUIJN GRAPH CONSTRUCTION IN DE NOVO GENOME FRAGMENT ASSEMBLY / [pt] UMA NOVA ABORDAGEM PARA A CONSTRUÇÃO DO GRAFO DE BRUIJN NA MONTAGEM DE NOVO DE FRAGMENTOS DE GENOMA
ELVISMARY MOLINA DE ARMAS, 04 May 2020
[pt] A montagem de fragmentos de sequências biológicas é um problema fundamental na bioinformática. Na montagem de tipo De Novo, onde não existe um genoma de referência, é usada a estrutura de dados do grafo de Bruijn para auxiliar com o processamento computacional. Em particular, é necessário considerar um conjunto grande de k-mers, substrings das sequências biológicas. No entanto, a construção deste grafo tem grande custo computacional, especialmente muito consumo de memória principal, tornando-se inviável no caso da montagem de grandes conjuntos de k-mers. Há soluções na literatura que utilizam o modelo de memória externa para conseguir executar o procedimento. Porém, todas envolvem alta redundância nos cálculos envolvendo os k-mers, aumentando consideravelmente o número de operações de E/S. Esta tese propõe uma nova abordagem para a construção do grafo de Bruijn que torna desnecessária a geração de todos os k-mers. A solução permite uma redução dos requisitos computacionais e a viabilidade da execução, o que é confirmado com os resultados experimentais. / [en] Fragment assembly is a fundamental problem in bioinformatics. In de novo assembly, where no reference genome sequence is available to guide the process, the de Bruijn graph data structure is used to support the computational processing. In particular, a large set of k-mers (substrings of the biological sequences) must be considered. However, constructing a de Bruijn graph has a high computational cost, primarily due to main memory consumption, which makes it infeasible for large sets of k-mers. Some approaches in the literature use external memory processing to achieve feasibility. These solutions, however, generate all k-mers with high redundancy, increasing the amount of data managed and, consequently, the number of I/O operations. This thesis proposes a new approach for de Bruijn graph construction that does not need to generate all k-mers. The solution reduces computational requirements and makes execution feasible, as confirmed by the experimental results.
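For orientation, the following is a minimal sketch of the conventional, in-memory way of building a de Bruijn graph, in which every k-mer of every read is enumerated. It is the baseline whose cost the abstract describes, not the approach proposed in the thesis; the function name `build_de_bruijn` and the toy reads are assumptions for illustration.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Naive de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge.

    Hypothetical helper; it enumerates all k-mers, which is exactly the
    step the thesis above aims to avoid.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix (both of length k-1).
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy input: two overlapping reads (made-up sequences).
graph = build_de_bruijn(["ACGTC", "CGTCA"], 3)
```

Because overlapping reads produce the same k-mers repeatedly, this naive construction touches redundant data many times, which is the source of the memory and I/O pressure the abstract mentions.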
|
4 |
Clustering metagenome contigs using coverage with CONCOCT / Klustring av metagenom-kontiger baserat på abundans-profiler med CONCOCT
Bjarnason, Brynjar Smári, January 2017
Metagenomics allows the genetic potential of microorganisms to be studied without prior cultivation. Since metagenome assembly results in fragmented genomes, a key challenge is to cluster the genome fragments (contigs) into more or less complete genomes. The goal of this project was to investigate how well CONCOCT bins assembled contigs into taxonomically relevant clusters using the abundance profiles of the contigs over multiple samples. This was done by studying the effects of different CONCOCT parameter settings on the clustering results for metagenome contigs from in silico model communities generated by mixing data from isolate genomes. These parameters control how the model that CONCOCT trains is tuned and how the model then assigns contigs to clusters. Each parameter was tested in isolation while the others were kept at their default values. For each data set used, the number of clusters was kept constant at the known number of species and strains in that data set. The resulting configuration was a tied covariance model, using principal components explaining 90% of the variance, and filtering out contigs shorter than 3000 bp. The results also suggested that all available samples should be used for the abundance profiles. With these parameters, CONCOCT was then run so that it estimated the number of clusters automatically. This gave poor results, which led to the conclusion that the process for selecting the number of clusters implemented in CONCOCT at the time, the “Bayesian Information Criterion”, was not good enough. That led to the testing of another, similar mathematical model, the “Dirichlet Process Gaussian Mixture Model”, which uses a different algorithm to estimate the number of clusters. This new model gave much better results, and CONCOCT has adopted a similar model in later versions. / Metagenomik möjliggör analys av arvsmassor i mikrobiella floror utan att först behöva odla mikroorganismerna. 
Metoden innebär att man läser korta DNA-snuttar som sedan pusslas ihop till längre genomfragment (kontiger). Genom att gruppera kontiger som härstammar från samma organism kan man sedan återskapa mer eller mindre fullständiga genom, men detta är en svår bioinformatisk utmaning. Målsättningen med det här projektet var att utvärdera precisionen med vilken mjukvaran CONCOCT, som vi nyligen utvecklat, grupperar kontiger som härstammar från samma organism baserat på information om kontigernas sekvenskomposition och abundansprofil över olika prover. Vi testade hur olika parametrar påverkade klustringen av kontiger i artificiella metagenomdataset av olika komplexitet som vi skapade in silico genom att blanda data från tidigare sekvenserade genom. Parametrarna som testades rörde indata såväl som den statistiska modell som CONCOCT använder för att utföra klustringen. Parametrarna varierades en i taget medan de andra parametrarna hölls konstanta. Antalet kluster hölls också konstant och motsvarade antalet olika organismer i flororna. Bäst resultat erhölls då vi använde en låst kovariansmodell och använde principalkomponenter som förklarade 90% av variansen, samt filtrerade bort kontiger som var kortare än 3000 baspar. Vi fick också bäst resultat då vi använde alla tillgängliga prover. Därefter använde vi dessa parameterinställningar och lät CONCOCT själv bestämma lämpligt antal kluster i dataseten med “Bayesian Information Criterion” - metoden som då var implementerad i CONCOCT. Detta gav otillfredsställande resultat med i regel för få och för stora kluster. Därför testade vi en alternativ metod, “Dirichlet Process Gaussian Mixture Model”, för att uppskatta antal kluster. Denna metod gav avsevärt bättre resultat och i senare versioner av CONCOCT har en liknande metod implementerats.
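Two of the settings reported in this abstract, keeping the principal components that explain 90% of the variance and discarding contigs shorter than 3000 bp, can be sketched in a few lines. This is not CONCOCT's code; the function names, the per-component variance values, and the toy contigs are assumptions for illustration.

```python
def components_for_variance(explained_variance, threshold=0.90):
    """Smallest number of leading components whose variance share reaches threshold.

    Hypothetical helper; expects one variance value per principal component.
    """
    total = sum(explained_variance)
    cumulative = 0.0
    for n, var in enumerate(sorted(explained_variance, reverse=True), start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return n
    return len(explained_variance)


def filter_contigs(contigs, min_length=3000):
    """Keep only contigs that are at least min_length bases long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


# Made-up variances: cumulative shares are 0.60, 0.85, 0.95, 1.00.
n_components = components_for_variance([6.0, 2.5, 1.0, 0.5])
# Made-up contigs: one above and one below the length cutoff.
kept = filter_contigs({"c1": "A" * 3500, "c2": "A" * 1200})
```

The clustering step itself (a Gaussian mixture with tied covariance, and the choice between BIC and a Dirichlet process prior for the number of clusters) is not reproduced here.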
|
5 |
Optimizing Parameters for High-quality Metagenomic Assembly
Kumar, Ashwani, 29 July 2015
No description available.
|
6 |
Analysis of the relation between RNA and RBPs using machine learning / Analys av relationen mellan RNA och RBPs med hjälp av maskininlärning
Wassbjer, Mattias, January 2021
The study of RNA-binding proteins (RBPs) has recently grown in importance following discoveries of their larger role in cellular processes. One study currently conducted at Umeå University involves constructing a model that will improve our knowledge of T-cells by explaining how these cells behave in different diseases. Before this model can become a reality, Umeå University needs to investigate the relation between RNA and RNA-binding proteins and find the proteins that contribute most to the activity of the RNA-binding proteins. To do so, they have decided to use four penalized regression machine learning models to analyse protein sequences from CD4 cells: a ridge penalized model, an elastic net model, a neural network model, and a Bayesian model. The results show that the models share a number of RNA-binding protein sequences that they rank as highly decisive in their predictions.
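As a reminder of how the ridge and elastic net penalties named in this abstract relate, here is a minimal sketch using one common (glmnet-style) parametrization. Whether the thesis uses this exact form is an assumption; the function name, weights, and parameter values are illustrative only.

```python
def elastic_net_penalty(weights, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2^2).

    Assumed parametrization: alpha = 0 gives a pure ridge penalty and
    alpha = 1 a pure lasso penalty.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


# Made-up coefficient vector for illustration.
weights = [1.0, -2.0, 0.0]
ridge_penalty = elastic_net_penalty(weights, lam=0.5, alpha=0.0)
lasso_penalty = elastic_net_penalty(weights, lam=0.5, alpha=1.0)
```

The penalty is added to the model's fitting loss, so larger `lam` shrinks coefficients more strongly; the L1 part can set coefficients exactly to zero, which is one way such models single out a small set of decisive features.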
|
7 |
Développement de méthodes d'assemblage de génomes de novo adaptées aux bactéries endosymbiotes / Development of de novo genome assembly methods adapted to endosymbiotic bacteria
Théroux, Jean-François, 04 1900
Le but de ce projet était de développer des méthodes d'assemblage de novo dans le but d'assembler de petits génomes, principalement bactériens, à partir de données de séquençage de nouvelle-génération. Éventuellement, ces méthodes pourraient être appliquées à l'assemblage du génome de StachEndo, une Alpha-Protéobactérie inconnue endosymbiote de l'amibe Stachyamoeba lipophora. Suite à plusieurs analyses préliminaires, il fut observé que l’utilisation de lectures Illumina avec des assembleurs par graphe DeBruijn produisait les meilleurs résultats. Ces expériences ont également montré que les contigs produits à partir de différentes tailles de k-mères étaient complémentaires pour la finition des génomes. L’ajout de longues paires de lectures chevauchantes se montra essentiel pour la finition complète des grandes répétitions génomiques. Ces méthodes permirent d'assembler le génome de StachEndo (1,7 Mb). L'annotation de ce génome a permis de montrer que StachEndo possède plusieurs caractéristiques inhabituelles chez les endosymbiotes. StachEndo constitue une espèce d'intérêt pour l'étude du développement endosymbiotique. / The goal of this project was to develop de novo genome assembly methods adapted to small genomes, especially bacterial ones, using next-generation sequencing data. Eventually, these methods could be used to assemble the genome of StachEndo, an unknown Alpha-Proteobacteria endosymbiont of the amoeba Stachyamoeba lipophora. Preliminary findings showed that using Illumina reads with de Bruijn graph assemblers yielded the best results. These experiments also showed that contigs produced with k-mers of various sizes were complementary in genome finishing assays. The addition of long-range paired-end reads proved necessary to fully close genomic assembly gaps. These methods made the assembly of StachEndo’s genome (1.7 Mb) possible. Through the annotation of StachEndo’s genes, several features that are unusual for endosymbionts were identified. 
StachEndo seems to be an interesting species for the study of endosymbiotic evolution.
|