1 |
RNA Sequence Classification Using Secondary Structure Fingerprints, Sequence-Based Features, and Deep Learning
Sutanto, Kevin 12 March 2021
RNAs are involved in many facets of biological processes, including but not limited to controlling and inhibiting gene expression, enabling transcription and translation from DNA to proteins, disease processes such as cancer, and virus-host interactions. As such, studies and analyses involving RNAs have useful applications, such as detecting cancer by measuring the abundance of specific RNAs, detecting and identifying infections involving RNA viruses, identifying the origins of and relationships between RNA viruses, and identifying potential targets when designing novel drugs.
Extracting sequences from RNA samples is usually not a major limitation anymore thanks to sequencing technologies such as RNA-Seq. However, accurately identifying and analyzing the extracted sequences is often still the bottleneck when it comes to developing RNA-based applications.
Like proteins, functional RNAs can fold into complex structures to perform specific functions throughout their lifecycle. This suggests that structural information can be used to identify or classify RNA sequences, in addition to the sequence information of the RNA itself. Furthermore, a strand of RNA may have more than one possible structural conformation it can fold into, and a strand may also form different structures in vivo and in vitro. However, past studies that used secondary structure information for RNA identification have relied on one predicted secondary structure for each RNA sequence, despite the possible one-to-many relationship between a strand of RNA and its possible secondary structures. Therefore, we hypothesized that a representation that includes the multiple possible secondary structures of an RNA may improve classification performance.
We proposed and built a pipeline that produces secondary structure fingerprints from an RNA sequence and that takes into account the multiple possible secondary structures of a single RNA. Using this pipeline, we explored and developed different types of secondary structure fingerprints. One type of fingerprint serves as a high-level topological representation of the RNA structure, while another represents matches with common known RNA secondary structure motifs that we curated from databases and the literature. Next, to test our hypothesis, we used the different fingerprints with deep learning on different datasets, alone and together with various sequence-based features, to investigate how the secondary structure fingerprints affect classification performance.
Finally, by analyzing our findings, we also propose approaches that can be adopted by future studies to further improve our secondary structure fingerprints and classification performance.
2 |
A l’assaut du puzzle transcriptomique : optimisations, applications et nouvelles méthodes d’analyse pour le RNA-Seq / Unraveling the transcriptomic puzzle: optimizations, applications and new analysis methods for RNA-Sequencing
Audoux, Jérôme 08 March 2017
Since their introduction, next-generation sequencing (NGS) technologies have reshaped our understanding of the transcriptome. RNA-Seq, or high-throughput transcript sequencing, enables the fast digitization of a transcriptome in the form of millions of short DNA sequences. The information in the raw data can be analyzed quantitatively to extract expression profiles. The sequences also provide a wide range of qualitative information, such as splice junctions, genomic or post-transcriptional variants, and less conventional forms of transcription such as circular RNAs, fusion genes, and long non-coding RNAs. Gradually, RNA-Seq is becoming a gold standard in biological research and, tomorrow, in genomic medicine. My thesis work offers a cross-cutting view of RNA-Seq technology, starting with the optimization of current analysis methods for a given context through systematic benchmarking procedures that rely on simulated data. These optimizations are then applied to cancer biology in order to identify new biomarkers in leukemia and a new stratification of hepatoblastoma patients, with the aim of proposing personalized treatments. Finally, my latest work proposes two new analysis methods for RNA-Seq data, both based on k-mer decomposition. The first, TranSiPedia, is a new paradigm for integrating transcriptome data at a very large scale through the systematic indexing of experimental data. The second, DE-kupl, performs differential analysis of RNA-Seq data without a priori knowledge of the transcriptome, to help discover new biomarkers and characterize new transcriptomic mechanisms.
3 |
Fragment assembly is a fundamental problem in bioinformatics. In de novo assembly, where no reference genome is available to guide the process, the de Bruijn graph data structure is used to support the computational processing. In particular, a large set of k-mers (substrings of the biological sequences) must be considered. However, constructing the graph has a high computational cost, primarily in main memory consumption, which makes it infeasible for large sets of k-mers. Some approaches in the literature use the external memory model to make the procedure feasible. However, all of them process the k-mers with high redundancy, which considerably increases the amount of data handled and, consequently, the number of I/O operations. This thesis proposes a new approach to de Bruijn graph construction that does not need to generate all k-mers. The solution reduces the computational requirements and makes execution feasible, as confirmed by the experimental results.
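For orientation, the baseline that this abstract argues against can be sketched in a few lines of Python. This is a minimal, in-memory illustration of the naive construction, which enumerates every k-mer of every read; it is not the approach proposed in the thesis, and the function names and toy reads are assumptions made for the example.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Yield every k-length substring (k-mer) of a sequence, left to right."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_de_bruijn(reads, k):
    """Naive de Bruijn graph: nodes are (k-1)-mers, each k-mer is one edge.

    Every k-mer of every read is generated, which is exactly the
    redundancy that makes this construction memory- and I/O-hungry.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            prefix, suffix = kmer[:-1], kmer[1:]
            graph[prefix].add(suffix)
    return graph


# Toy reads (hypothetical); overlapping reads produce the same k-mers twice.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
```

Both toy reads contribute the same four k-mers, so half of the generated k-mers are redundant even in this tiny case; at genome scale that redundancy dominates the cost.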
4 |
Expanding the horizons of next generation sequencing with RUFUS
Farrell, Andrew R. January 2014
Thesis advisor: Gabor T. Marth / To help improve the analysis of forward genetic screens, we have developed an efficient and automated pipeline for mutational profiling using our reference-guided tools, including MOSAIK and FREEBAYES. Studies using next-generation sequencing technologies currently employ either reference-guided alignment or de novo assembly to analyze the massive amount of short-read data produced by second-generation sequencing technologies; reference-guided alignment is far more common because of the massive computational and sequencing costs associated with de novo assembly. The success of reference-guided alignment depends on three factors: the accuracy of the reference, the ability of the mapper to correctly place a read, and the degree to which a variant allele differs from the reference. Reference assemblies are not perfect, and none is entirely complete. Moreover, read mappers can only map reads to genomic locations that are unique enough to place reads confidently; paralogous sections, such as related gene families, cannot be characterized and are often ignored. Further, variant alleles that drastically alter the subject's DNA, such as insertions or deletions (INDELs), will not map to the reference and are either missed entirely or require further downstream analysis to characterize. Most importantly, reference-guided methods are restricted to organisms for which reference genomes have been assembled. The current alternative, de novo assembly of a genome, is prohibitively expensive for most labs, requiring deep read coverage from numerous different library preparations as well as massive computing power. To address the shortcomings of current methods while eliminating the costs intrinsic to de novo sequence assembly, we developed RUFUS, a novel, completely reference-independent variant discovery tool.
RUFUS directly compares raw sequence data from two or more samples and identifies groups of reads unique to one or the other sample. RUFUS has at least the same variant detection sensitivity as mapping methods, with greatly increased specificity for SNPs and INDEL variation events. RUFUS is also capable of extremely sensitive copy number detection, without any restriction on event length. By modeling the underlying k-mer distribution, RUFUS produces a specific copy number spectrum for each individual sample. Applying a Bayesian detection method to detect changes in k-mer content between two samples, RUFUS produces copy number calls that are as sensitive as traditional copy number detection methods, with far fewer false positives. Our data suggest that RUFUS' reference-free approach to variant discovery is able to substantially improve upon existing variant detection methods: reducing reference biases, reducing false positive variants, and detecting copy number variants with excellent sensitivity and specificity. / Thesis (PhD), Boston College, 2014. / Submitted to: Boston College. Graduate School of Arts and Sciences. / Discipline: Biology.
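The general idea of comparing raw reads from two samples without a reference can be illustrated with a small k-mer set difference. The sketch below is only an assumption-laden toy, not RUFUS's actual algorithm: the abstract describes a modeled k-mer distribution and a Bayesian detection step, whereas here a fixed `min_count` threshold (a hypothetical parameter) stands in for error filtering.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(subject_reads, control_reads, k, min_count=2):
    """Return k-mers seen at least min_count times in the subject sample
    and never in the control sample.

    min_count is a crude, hypothetical stand-in for a statistical model:
    k-mers seen only once are more likely to be sequencing errors.
    """
    subject = count_kmers(subject_reads, k)
    control = count_kmers(control_reads, k)
    return {
        kmer
        for kmer, n in subject.items()
        if n >= min_count and kmer not in control
    }


# Toy data (hypothetical): the subject carries an A>T change at position 5.
control = ["ACGTACGT", "ACGTACGT"]
subject = ["ACGTTCGT", "ACGTTCGT"]
unique = sample_specific_kmers(subject, control, k=4)
```

Every k-mer that overlaps the changed base is absent from the control, so the reads containing those k-mers are the ones a reference-free method would flag for closer inspection.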
5 |
Clustering metagenome contigs using coverage with CONCOCT / Klustring av metagenom-kontiger baserat på abundans-profiler med CONCOCT
Bjarnason, Brynjar Smári January 2017
Metagenomics allows the genetic potential of microorganisms to be studied without prior cultivation. Since metagenome assembly results in fragmented genomes, a key challenge is to cluster the genome fragments (contigs) into more or less complete genomes. The goal of this project was to investigate how well CONCOCT bins assembled contigs into taxonomically relevant clusters using the abundance profiles of the contigs over multiple samples. This was done by studying the effects of different CONCOCT parameter settings on the clustering results when clustering metagenome contigs from in silico model communities generated by mixing data from isolate genomes. These parameters control how the model that CONCOCT trains is tuned and how the model then assigns contigs to clusters. Each parameter was tested in isolation while the others were kept at their default values. For each data set used, the number of clusters was kept constant at the known number of species and strains in that data set. The resulting configuration was a tied covariance model, using principal components explaining 90% of the variance and filtering out contigs shorter than 3000 bp. The results also suggested that all available samples should be used for the abundance profiles. With these parameters, CONCOCT was then run so that it estimated the number of clusters automatically. This gave poor results, which led to the conclusion that the process implemented in CONCOCT for selecting the number of clusters, the “Bayesian Information Criterion”, was not good enough. That led to the testing of another, similar mathematical model, the “Dirichlet Process Gaussian Mixture Model”, which uses a different algorithm to estimate the number of clusters. This new model gave much better results, and CONCOCT has adopted a similar model in later versions.
/ Metagenomics makes it possible to analyze the genetic material of microbial communities without first having to culture the microorganisms. The method involves reading short pieces of DNA that are then pieced together into longer genome fragments (contigs). By grouping contigs that originate from the same organism, more or less complete genomes can be reconstructed, but this is a difficult bioinformatic challenge. The aim of this project was to evaluate the precision with which the software CONCOCT, which we recently developed, groups contigs originating from the same organism based on the contigs' sequence composition and abundance profiles across samples. We tested how different parameters affected the clustering of contigs in artificial metagenome data sets of varying complexity, created in silico by mixing data from previously sequenced genomes. The parameters tested concerned the input data as well as the statistical model that CONCOCT uses to perform the clustering. The parameters were varied one at a time while the others were held constant. The number of clusters was also held constant and matched the number of different organisms in the communities. The best results were obtained with a tied covariance model, using principal components explaining 90% of the variance and filtering out contigs shorter than 3000 base pairs. We also obtained the best results when using all available samples. We then used these parameter settings and let CONCOCT itself determine a suitable number of clusters in the data sets with the “Bayesian Information Criterion”, the method implemented in CONCOCT at the time. This gave unsatisfactory results, generally with too few and too large clusters. We therefore tested an alternative method, the “Dirichlet Process Gaussian Mixture Model”, to estimate the number of clusters. This method gave considerably better results, and a similar method has been implemented in later versions of CONCOCT.
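To make the model-selection step concrete, the following sketch shows how the Bayesian Information Criterion trades fit against model size for a Gaussian mixture with a tied (shared) covariance matrix. It uses the standard definition BIC = p ln(n) - 2 ln(L); the function names and the log-likelihood values are made up for illustration and are neither CONCOCT's code nor results from the thesis.

```python
import math


def tied_gmm_param_count(n_components, n_dims):
    """Free parameters of a Gaussian mixture with one shared covariance."""
    means = n_components * n_dims
    shared_covariance = n_dims * (n_dims + 1) // 2  # symmetric matrix
    weights = n_components - 1                      # weights sum to one
    return means + shared_covariance + weights


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def select_n_components(log_likelihoods, n_dims, n_samples):
    """Pick the component count with the lowest BIC.

    log_likelihoods maps a candidate number of clusters to the total
    log-likelihood of a model already fitted with that many clusters.
    """
    scores = {
        k: bic(ll, tied_gmm_param_count(k, n_dims), n_samples)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Illustrative, invented log-likelihoods for 1000 contigs in 5 dimensions.
candidates = {2: -5200.0, 3: -5000.0, 4: -4995.0}
best_k, scores = select_n_components(candidates, n_dims=5, n_samples=1000)
```

In this invented example the fourth cluster improves the likelihood only slightly, so the penalty term outweighs the gain and three clusters are selected. A Dirichlet process mixture avoids this explicit comparison by letting the number of occupied components be inferred during fitting.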
6 |
Optimizing Parameters for High-quality Metagenomic Assembly
Kumar, Ashwani 29 July 2015
No description available.
7 |
Analysis of the relation between RNA and RBPs using machine learning / Analys av relationen mellan RNA och RBPs med hjälp av maskininlärning
Wassbjer, Mattias January 2021
The study of RNA-binding proteins has recently increased in importance due to discoveries of their larger role in cellular processes. One study currently conducted at Umeå University involves constructing a model that will improve our knowledge of T-cells by explaining how these cells work in different diseases. But before this model can become a reality, Umeå University needs to investigate the relation between RNA and RNA-binding proteins and find the proteins that contribute strongly to the activity of the RNA-binding proteins. To do so, they have decided to use four penalized regression machine learning models to analyse protein sequences from CD4 cells: a ridge penalized model, an elastic net model, a neural network model, and a Bayesian model. The results show that the models have a number of RNA-binding protein sequences in common, which they rank as highly decisive in their predictions.
8 |
Comparative Analysis of Genomic Similarity Tools in Species Identification
Nerella, Chandra Sekhar 14 January 2025
This study presents the development and evaluation of an automated pipeline for genome comparison, leveraging four bioinformatics tools: alignment-based methods (pyANI, FastANI) and k-mer-based methods (Sourmash, BinDash 2.0). The analysis focuses on high-quality genomic datasets characterized by 100% completeness, ensuring consistency and accuracy in the comparison process. The pipeline processes genomes under uniform conditions, recording key performance metrics such as execution time and rank correlations.
Initial comparisons were conducted on a subset of five genomes, generating 10 unique pairwise comparisons to establish baseline performance. This preliminary analysis identified k = 10 as the optimal k-mer size for Sourmash and BinDash, significantly improving their comparability with alignment-based methods.
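The effect of the k-mer size on a k-mer-based similarity score can be seen in a toy calculation. The sketch below computes the exact Jaccard similarity between the k-mer sets of two sequences; tools such as Sourmash and BinDash estimate this kind of set similarity from compact MinHash-style sketches rather than from full sets, so this is an illustration of the underlying quantity, not of either tool's implementation. The sequences and function names are assumptions.

```python
def kmer_set(sequence, k):
    """Return the set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(seq_a, seq_b, k):
    """Exact Jaccard similarity of the two sequences' k-mer sets."""
    set_a, set_b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Two toy sequences (hypothetical) differing by a single base.
genome_a = "ACGTACGT"
genome_b = "ACGTTCGT"

similarity_k2 = jaccard(genome_a, genome_b, k=2)
similarity_k4 = jaccard(genome_a, genome_b, k=4)
```

With k = 2 the two toy sequences share half of their k-mers, while with k = 4 the single substitution disrupts almost every k-mer and the similarity drops to one eighth. This sensitivity is why the choice of k changes how well k-mer-based scores track alignment-based ones.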
For the expanded dataset of 175 genomes, encompassing C(175, 2) = 15,225 unique comparisons, pyANI and FastANI demonstrated high similarity values, often exceeding 90% for closely related genomes. Rank correlations, calculated using Spearman's ρ and Kendall's τ, highlighted strong agreement between pyANI and FastANI (ρ = 0.9630, τ = 0.8625) due to their shared alignment-based methodology. Similarly, Sourmash and BinDash, both employing k-mer-based approaches, exhibited moderate-to-strong rank correlations (ρ = 0.6967, τ = 0.5290). In contrast, the rank correlations between alignment-based and k-mer-based tools were lower, underscoring methodological differences in genome similarity calculations.
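The pair count and the two rank correlations named above can be reproduced with the Python standard library alone. This is a minimal sketch under stated assumptions: the similarity scores are invented, Kendall's τ is computed in its simplest tau-a form (which assumes no tied scores), and none of the function names come from the study's pipeline.

```python
import math
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / math.comb(len(x), 2)


n_pairs = math.comb(175, 2)  # unique unordered genome pairs

# Invented similarity scores from two hypothetical tools for five pairs.
tool_a = [99.1, 95.0, 80.2, 91.3, 85.7]
tool_b = [98.7, 96.2, 79.5, 90.0, 92.1]
rho = spearman_rho(tool_a, tool_b)
tau = kendall_tau(tool_a, tool_b)
```

In the invented data the two tools disagree on the order of only one pair of scores, which gives ρ = 0.9 and τ = 0.8. τ is typically smaller in magnitude than ρ for the same data, which matches the pattern in the reported values.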
Execution times revealed significant contrasts between the tools. Alignment-based methods required substantial computation time, with pyANI taking an average of 1.97 seconds per comparison and FastANI averaging 0.81 seconds per comparison. Conversely, k-mer-based methods demonstrated exceptional computational efficiency, with Sourmash completing comparisons in 2.1 milliseconds and BinDash in just 0.25 milliseconds per comparison, reflecting a difference of nearly three orders of magnitude between the two categories. These results underscore the trade-offs between computational cost and methodological approaches in genome similarity estimation.
This study provides valuable insights into the relative strengths and weaknesses of genome comparison tools, offering a comprehensive framework for selecting appropriate methods for diverse genomic research applications. The findings emphasize the importance of parameter optimization for k-mer-based tools and highlight the scalability of these methods for large-scale genomic analyses. / Master of Science / This study explores the strengths and weaknesses of different tools used to compare genomes, which are the complete set of DNA in living organisms. Comparing genomes allows scientists to understand how different species are related, uncover shared traits, and identify what makes each species unique. The tools we examined fall into two main categories: detailed tools (called alignment-based methods) and faster, more approximate tools (called k-mer-based methods). The detailed tools, such as pyANI and FastANI, compare DNA sequences piece by piece, providing very accurate results. In contrast, the faster tools, such as Sourmash and BinDash, look for patterns in smaller sections of DNA, which makes them much quicker but sometimes less precise.
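As a worked illustration of what these per-comparison averages imply at the scale of the full dataset, the arithmetic can be written out as below. It assumes the reported averages apply uniformly to all 15,225 pairs, which the abstract does not state; the totals are therefore rough extrapolations, and the dictionary and function names are hypothetical.

```python
import math

# Average time per comparison, in seconds, as reported in the abstract.
PER_COMPARISON_SECONDS = {
    "pyANI": 1.97,
    "FastANI": 0.81,
    "Sourmash": 2.1e-3,
    "BinDash": 0.25e-3,
}

N_COMPARISONS = math.comb(175, 2)  # unique pairs among 175 genomes


def total_seconds(tool):
    """Extrapolated wall time if every pair took the average time."""
    return PER_COMPARISON_SECONDS[tool] * N_COMPARISONS


def speedup(slow_tool, fast_tool):
    """Ratio of average per-comparison times."""
    return PER_COMPARISON_SECONDS[slow_tool] / PER_COMPARISON_SECONDS[fast_tool]


def orders_of_magnitude(slow_tool, fast_tool):
    """Base-10 logarithm of the speedup."""
    return math.log10(speedup(slow_tool, fast_tool))


pyani_hours = total_seconds("pyANI") / 3600
bindash_seconds = total_seconds("BinDash")
```

Under that assumption, pyANI would need roughly 8.3 hours for all pairs and BinDash under 4 seconds. The per-comparison ratios range from about 390x (FastANI versus Sourmash) to about 7,900x (pyANI versus BinDash), that is, between roughly 2.6 and 3.9 orders of magnitude depending on which pair of tools is compared.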
To start, we tested these tools on a small group of genomes to see how they performed. By adjusting a setting in the faster tools, we found that their results became more similar to the detailed tools, improving their reliability. Encouraged by these findings, we expanded the comparison to a much larger dataset of 175 genomes. For this larger dataset, the detailed tools provided highly accurate results but required much more time and computational power. On the other hand, the faster tools completed the comparisons in a fraction of the time, making them ideal for larger datasets where quick results are needed.
We also compared how the tools ranked genome similarities and found that tools using similar methods, like pyANI and FastANI, had very consistent rankings. Likewise, the faster tools, Sourmash and BinDash, also agreed with each other. However, the rankings between the two types of tools (detailed versus faster) were less consistent, reflecting their different approaches to genome comparison.
This research provides a practical guide for scientists choosing tools to compare genomes. If accuracy and detail are most important, alignment-based tools are the best choice, though they take more time and computational resources. If speed is critical, such as when working with very large datasets, k-mer-based tools offer an excellent alternative. By understanding the strengths and trade-offs of each method, researchers can make informed decisions to suit their specific needs, whether focusing on small, detailed studies or large-scale genome analyses.
9 |
Développement de méthodes d'assemblage de génomes de novo adaptées aux bactéries endosymbiotes [Development of de novo genome assembly methods adapted to endosymbiotic bacteria]
Théroux, Jean-François 04 1900
The goal of this project was to develop de novo genome assembly methods adapted to small genomes, mainly bacterial ones, using next-generation sequencing data. Eventually, these methods could be applied to assembling the genome of StachEndo, an unknown Alpha-Proteobacteria endosymbiont of the amoeba Stachyamoeba lipophora. Preliminary analyses showed that Illumina reads combined with de Bruijn graph assemblers yielded the best results. These experiments also showed that contigs produced with k-mers of various sizes were complementary for genome finishing. The addition of long-range paired-end reads proved necessary to fully close assembly gaps at large genomic repeats. These methods made it possible to assemble StachEndo's genome (1.7 Mb). Annotation of this genome revealed several features that are unusual for endosymbionts. StachEndo is a species of interest for the study of endosymbiotic evolution.