1 |
Development of a bioinformatics approach for the functional analysis of alternative splicing
Fuente Lorente, Lorena de la 02 September 2019
[ES] One of the most exciting aspects of transcription is the transcriptomic and proteomic plasticity mediated by post-transcriptional regulation (PTR) processes. PTR mechanisms such as alternative splicing (AS) and alternative polyadenylation (APA) have emerged as tightly regulated processes that play a key role in generating transcriptomic complexity and are associated with the coordination of cell differentiation and tissue development. However, our knowledge of how these mechanisms regulate the properties of the resulting products to define the phenotype is still very limited. The number of existing variants and the wide range of possible functional consequences make their functional validation impracticable if done case by case. In addition, the lack of tools for isoform-oriented functional evaluation has meant that much of the computational work has used ad hoc pipelines applied to specific biological systems or has simply relied on GO enrichment analyses, which are not informative about the impact on isoform properties that underlies PTR regulation.
In fact, despite more than sixty thousand publications on AS, very few isoforms have been associated with specific properties, while the number of new AS/APA variants of unknown function is growing exponentially owing to second-generation sequencing (NGS) techniques. Moreover, because of the technical limitations of NGS in reconstructing transcript structure, third-generation sequencing (TGS) technologies are defining a new era in which, for the first time, it is possible to know the sequence of structural and functional elements in mRNAs.
This thesis pursued three main aims in order to advance the functional study of isoforms. First, with TGS being increasingly used, assessing the quality of de novo transcriptomes is essential to ensure the reliability of the transcriptomic diversity found. The lack of quality analyses oriented to long sequences motivated the development of SQANTI, an automated pipeline for the exhaustive evaluation of TGS transcriptomes. Second, the gene-level information of most functional databases remains the main obstacle to studying variability between isoforms, especially for novel isoforms, which static databases cannot characterize. We therefore designed IsoAnnot, which builds a database of functional annotations at isoform resolution by integrating information scattered across multiple databases and prediction methods. Finally, the unavailability of methods to study the functional impact of isoform regulation motivated us to develop tappAS, a dynamic, flexible tool designed to make this type of study easier to undertake.
Thus, during this thesis we developed an infrastructure that resolves the main challenges of the functional analysis of isoforms, providing a set of new methods and tools that offer a unique opportunity to explore how the phenotype is specified post-transcriptionally through alteration of the functional properties of the expressed isoforms. Applying our analysis to a dual neural differentiation system in mouse defined the effect of isoform regulation between motor neuron and oligodendrocyte differentiation for multiple functional elements. Among them, we discovered transmembrane regions that are differentially included in the isoforms expressed by the two cell types and whose regulation might be contributing to the control of / [CA] One of the most exciting aspects of transcriptome biology is the contextual adaptability of eukaryotic transcriptomes and proteomes through post-transcriptional regulation (PTR). PTR mechanisms such as alternative splicing (AS) and alternative polyadenylation (APA) have emerged as tightly regulated processes that play a key role in generating transcriptome complexity and in coordinating cell differentiation and tissue development. Nevertheless, our knowledge of how these mechanisms imprint distinct functional characteristics on the resulting set of isoforms to define the observed phenotype is still scarce. The number of PTR variants and their potentially functional consequences make functional validation impractical if done case by case.
Moreover, the lack of isoform-oriented functional approaches has meant that much of the computational work addressing functional questions at the transcriptome level has consisted of ad hoc computational strategies applied to specific biological systems or has been based on simple GO enrichment analysis, neither of which provides information on the impact of PTR on isoform properties.
Thus, despite more than 60,000 existing publications on AS, few existing isoforms have been associated with specific properties, while the number of novel AS/APA variants with unknown and even unexplored functions is increasing exponentially thanks to next-generation sequencing (NGS). Because of the technical limitations of NGS in reconstructing transcript structure, high-throughput sequencing of full-length transcripts with third-generation technologies (TGS) opens a new era in transcriptomics, since it improves the definition of gene models and, for the first time, makes it possible to associate functional events precisely within the RNA molecule.
This thesis addresses three major challenges to progress in the study of isoform function. First, with the emergence and growing popularity of TGS, the accurate definition and complete characterization of de novo transcriptomes are essential to guarantee the quality of any conclusion about transcriptome diversity. The lack of quality analyses oriented to long reads motivated the development of SQANTI (https://bitbucket.org/ConesaLab/sqanti), an automated computational strategy for the structural characterization and quality assessment of full-length transcriptomes. Second, existing gene-centric functional resources are a major limitation to the extensive study of the functional variability of isoforms, especially novel isoforms, which cannot be characterized by static databases. We therefore designed IsoAnnot, which dynamically builds a database of isoform-level functional annotations, takes transcript sequences as input, and integrates information from several databases and prediction methods. Finally, since no method existed to interrogate the functional impact of PTR, we developed new approaches and easy-to-use tools such as tappAS (http://tappas.org/), designed to help researchers carry out transcriptome-wide functional studies of isoform regulation in specific contexts.
This thesis therefore describes the development of an analysis framework that addresses the fundamental challenges of the functional analysis of isoforms. Applying it to a murine neural differentiation system, we discovered isoform-specific transmembrane regions whose modulation by PTR might help control cell type-specific mitochondrial dynamics during neural fate determination. / [EN] One of the most exciting aspects of transcriptome biology is the contextual adaptability of eukaryotic transcriptomes and proteomes by post-transcriptional regulation (PTR). PTR mechanisms such as alternative splicing (AS) and alternative polyadenylation (APA) have emerged as tightly regulated processes playing a key role in generating transcriptome complexity and coordinating cell differentiation or tissue development. However, how these mechanisms imprint distinct functional characteristics on the resulting set of isoforms to define the observed phenotype remains poorly understood. The number of PTR variants and their range of potentially functional consequences make their functional validation impractical on a case-by-case basis. Besides, the lack of isoform-oriented functional profiling approaches has meant that much of the computational work done to elucidate transcriptome-wide functional questions has either involved ad hoc computational pipelines applied to specific biological systems or relied on simple GO-enrichment analyses, which are not informative about the impact of PTR on isoform properties.
Thus, despite more than 60,000 publications on AS, few existing isoforms have been associated with specific properties, while the number of novel AS/APA variants with unknown and even unexplored functions is increasing exponentially thanks to next-generation sequencing (NGS). Because of the technical limitations of NGS in reconstructing transcript structure, high-throughput sequencing of full-length transcripts using third-generation technologies (TGS) is opening up a new transcriptomics era that enhances the definition of gene models and, for the first time, makes it possible to associate functional events precisely within the RNA molecule.
This thesis addresses three major challenges to progress in the study of isoform function. First, with the emergence and increasing popularity of TGS, the accurate definition and comprehensive characterisation of de novo transcriptomes are essential to ensure the quality of any conclusions on transcriptome diversity drawn from these data. The lack of long-read-oriented, quality-aware analysis motivated the development of SQANTI (https://bitbucket.org/ConesaLab/sqanti), an automated pipeline for the structural characterization and quality assessment of full-length transcriptomes. Secondly, the gene-centric nature of functional resources remained the major limitation to the extended study of functional isoform variability, especially for novel isoforms, which cannot be characterised by static databases. Thus, we designed IsoAnnot, which dynamically constructs a rich, isoform-resolved database of functional annotations by taking transcript sequences as input and integrating information disseminated across several databases and prediction methods. Finally, because no methods to interrogate the functional impact of PTR were available, we developed novel approaches and user-friendly tools such as tappAS (http://tappas.org/), designed to help researchers carry out transcriptome-wide functional studies of context-specific isoform regulation.
Thus, this thesis describes the development of an analysis framework that tackles the fundamental challenges of isoform functional analysis by providing a set of novel methods and tools that offer a unique opportunity to explore how the phenotype is specified by altering the functional characteristics of expressed isoforms. Applied to a murine neural differentiation system, our pipeline profiled the effect of isoform regulation on the inclusion of several functional elements within transcripts between motor-neuron and oligodendrocyte differentiation systems. Specifically, we discovered isoform-specific transmembrane regions whose modulation by PTR might contribute to controlling cell type-specific mitochondrial dynamics during neural fate determination. / This work was funded by the following grants: From 2014 to 2018. FPU: Training programme for Academic Staff. Spanish Ministry of Education, FPU2013/02348. From 2016 to 2019. NOVELSEQ: Novel methods for new challenges in the analysis of high-throughput sequencing data. MINECO, BIO2015-1658-R. From 2014 to 2017. DEANN: Developing a European American NGS Network. EU Marie Curie IRSES, GA-612583. / Fuente Lorente, LDL. (2019). Development of a bioinformatics approach for the functional analysis of alternative splicing [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/124974
|
2 |
Understanding the relationship between neonatal dairy calves’ gut microbiota and incidence of diarrhea using full-length 16S rRNA gene amplicon sequencing and machine learning
Hawkins, Jalyn Grace 13 August 2024
A healthy gut microbiome is crucial for the development, growth, and health of dairy calves; however, diarrhea in pre-weaned calves is highly prevalent, difficult to treat, and detrimental to the dairy industry. This study characterized early gut microbiota using long-read-based 16S rRNA gene sequencing and investigated its associations with calf diarrhea and colostrum microbiota. The full-length 16S rRNA gene was amplified and sequenced on a Nanopore sequencer. We identified bacterial species shared between colostrum and calf feces, whose abundance in calf feces decreased with age. Diarrheic calves exhibited differing gut diversity before, during, and after diarrhea, and harbored more bacteria resistant to the antibiotic cefotaxime. Several bacterial species were associated with age and calf health. Additionally, a machine learning model identified bacteria that predict diarrhea. This study will be useful for the goal of reducing antibiotic use to promote gut health and prevent and treat neonatal calf diarrhea.
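The abstract does not say which diversity measure was used. As one common possibility, the following minimal sketch computes the Shannon index from per-species read counts. The function name `shannon_diversity` and the two count tables are hypothetical illustrations, not data or code from the study.

```python
import math
from collections import Counter


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (n / total) * math.log(n / total) for n in counts.values() if n > 0
    )


# Hypothetical per-species read counts for two fecal samples (made-up numbers).
even_sample = Counter({"species_a": 40, "species_b": 30, "species_c": 30})
skewed_sample = Counter({"species_a": 90, "species_b": 5, "species_c": 5})

print(round(shannon_diversity(even_sample), 3))
print(round(shannon_diversity(skewed_sample), 3))
```

A community dominated by one species scores lower than an even one, which is the kind of shift a comparison of diversity before, during, and after diarrhea would look for.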
|
3 |
Transcriptome-Wide Methods for Functional and Structural Annotation of Long Non-Coding RNAs
Daulatabad, Swapna Vidhur 05 1900
Indiana University-Purdue University Indianapolis (IUPUI) / Non-coding RNAs across the genome have been associated with various biological processes, ranging from regulation of splicing to remodeling of chromatin. Amongst the repertoire of non-coding sequences lies a critical species of RNAs called long non-coding RNAs (lncRNAs). LncRNAs contribute significantly to a large spectrum of human phenotypes, including cancers, heart failure, diabetes, and Alzheimer’s disease. This dissertation emphasizes the need to characterize the functional role of lncRNAs to improve our understanding of human diseases. This work consolidates a resource built from multiple computational genomics and natural language processing-based approaches to advance our ability to functionally annotate hundreds of lncRNAs and their interactions, providing a one-stop platform for lncRNA functional annotation, dynamic interaction networks, and multifaceted omics data visualization.
RNA interactions are vital in various cellular processes, from transcription to RNA processing. These interactions dictate the functional scope of the RNA. However, the multifaceted functional nature of RNA stems from its ability to form secondary structures. Therefore, this work establishes a computational method to characterize RNA secondary structure by integrating SHAPE-seq and long-read sequencing, to further enhance our understanding of the role of RNA structure in modulating post-transcriptional regulatory processes and to decipher its influence at several layers of biological features, ranging from structure composition to consequent protein occupancy.
This study will potentially impact the research community by providing methods, web interfaces, and computational pipelines, improving our functional understanding of long non-coding RNAs. This work also provides novel integration methods of technologies like Oxford Nanopore-based long-read sequencing, RNA structure-probing methods, and machine learning. The approaches developed in this dissertation are scalable and adaptable to investigate further the functional and regulatory role of RNA and its structure. Overall, this study accelerates the development of RNA-based diagnostics and the identification of therapeutic targets in human disease.
|
4 |
Genomic Structural Variation Across Five Continental Populations of Drosophila melanogaster
Long, Evan Michael 01 April 2018
Chromosomal structural variations (SVs), including insertions, deletions, inversions, and translocations, occur within the genome and can have a significant effect on organismal phenotype. Some of these effects are caused by structural variations that contain genes. Short-read sequencing makes the detection of large structural variations (> 1 kb) very difficult, yet large structural variations represent a significant amount of the genetic diversity within a population. We used a global sampling of Drosophila melanogaster (Ithaca, Zimbabwe, Beijing, Tasmania, and Netherlands) to represent diverse populations, and we used long-read sequencing and optical mapping technologies to identify SVs in these genomes. Because the average read length of these approaches is much longer than that of traditional short-read sequencing, these maps facilitate the identification of larger chromosomal SVs with more clarity. We found a wide diversity of structural variations in each of the five strains. These structural variations varied greatly in size and location, and significantly affected exonic regions of the genome. Structural variations accounted for a much larger difference in the number of base pairs between strains than single nucleotide polymorphisms (SNPs).
|
5 |
Quantitative microbial risk assessment of small water supply systems with simultaneous detection of pathogenic bacteria
Zeng, Jie 25 September 2023
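Kyoto University / New-system doctorate by coursework / Doctor of Engineering / Degree No. Kō 24898 / Doctor of Engineering No. 5178 / 新制||工||1988 (University Library) / Department of Urban and Environmental Engineering, Graduate School of Engineering, Kyoto University / (Chief examiner) Professor 伊藤 禎彦, Professor 松田 知成, Professor 越後 信哉 / Conferred under Article 4, Paragraph 1 of the Degree Regulations / Doctor of Philosophy (Engineering) / Kyoto University / DGAM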
|
6 |
Computational Tools for Improved Detection, Identification, and Classification of Plant Pathogens Using Genomics and Metagenomics
Johnson, Marcela Aguilera 13 February 2023
Plant pathogens are one of the biggest threats to plant health and food security worldwide. To contain plant disease outbreaks effectively, precise classification and identification of pathogens are crucial for determining treatment and preventive measures. Conventional detection methods such as PCR may not be sufficient when the pathogen in question is unknown. Advances in sequencing technology have made it possible to sequence entire genomes and metagenomes in real time and at relatively low cost, opening an opportunity to develop alternative methods for detecting novel and unknown plant pathogens. In this dissertation, an integrated approach is used to reclassify a high-impact group of plant pathogens. Additionally, the application of metagenomics and nanopore sequencing with the Oxford Nanopore Technologies (ONT) MinION to the detection and precise identification of fungal and bacterial plant pathogens is demonstrated.
To improve the classification of the strains belonging to the Ralstonia solanacearum species complex (RSSC), we performed a meta-analysis using a comparative genomics and a reverse ecology approach to accurately portray and refine the understanding of the diversity and evolution of the RSSC. The groups identified by these approaches were circumscribed and made publicly available through the LINbase web server so future isolates can be properly classified.
To develop a culture-free method for detecting plant pathogens, we used metagenomes of various plants and long-read nanopore sequencing to identify plant pathogens precisely at the strain level and performed phylogenetic analysis with SNP resolution. In the first paper, we used tomato plants to demonstrate the power of the method to detect bacterial plant pathogens, and we compared bioinformatics tools for strain-level detection using reads and assemblies. In the second paper, we used a read-based approach to test whether the methodology can precisely detect the fungal pathogen causing boxwood blight. Lastly, taking advantage of improvements in nanopore sequencing, we used grapevine petioles to investigate whether we could go beyond detection and identification and perform a phylogenetic analysis. We assembled a metagenome-assembled genome (MAG) of almost the same quality as genomes obtained from cultured isolates and performed a phylogenetic analysis with SNP resolution.
Finally, for cases where the database may contain no genome related to the pathogen in question, we used machine learning and metagenomics to develop a reference-free approach to detecting plant diseases. We trained eight different machine learning models with reads from healthy and infected plant metagenomes and compared their accuracy in classifying reads as belonging to a healthy or an infected plant. In this comparison, random forest was the best model in terms of the computational resources needed while maintaining high accuracy (> 0.90). / Doctor of Philosophy / Microbes are present in every environment on the planet and have been on Earth for billions of years. While some microbes are beneficial, others can cause diseases. To differentiate those that cause disease from those that do not, it is crucial to examine the evolutionary forces that make them different so that they can be classified and identified correctly. Although microorganisms cause diseases in humans and animals, those causing diseases in plants are one of the biggest threats to plant health and food security worldwide.
In a perfect world, plant diseases would be diagnosed by eye or with simple procedures. However, when a plant disease is present, it is not always obvious which organism, if any, is causing the disease, which makes it hard to detect and contain outbreaks promptly. With technological advances, it is now possible to obtain all the genetic information of not only one organism but all the organisms living in an environment at a time. This genetic information can then be used to identify precisely which organism is causing a disease in a plant, for faster disease diagnosis and, consequently, more efficient disease prevention and control.
In this dissertation, we used the bacterial group, called Ralstonia solanacearum species complex, which can cause different diseases in more than 200 crops, to investigate and understand the evolution and diversity of the members of this group. We also used newly developed technologies to obtain the genetic material of all the organisms living in multiple important plants including tomato, grapevine, and the ornamental bush, boxwood. Using this genetic material, we developed a methodology for the detection of bacteria and a fungus causing plant diseases.
While this works well when the suspected organism or a similar one is available for comparison, the detection of plant diseases in cases where this information is not available is challenging. Machine learning models, where computers can learn complex patterns from data, have the potential to detect pathogens without the need to compare the sequences to sequences of other pathogens. Here we also used the genetic material to train and compare different machine learning models to classify plants as either being infected or healthy.
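The reference-free idea of labeling reads from sequence composition alone can be sketched in a few lines. This is an illustration under stated assumptions, not the dissertation's pipeline: the features (3-mer frequencies), the nearest-centroid classifier (swapped in for the random forest so that the sketch needs only the standard library), the function names, and the toy reads are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length; an illustrative choice, not taken from the source
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(read):
    """Return the normalized k-mer frequency vector of a read."""
    counts = Counter(read[i:i + K] for i in range(len(read) - K + 1))
    total = sum(counts[k] for k in KMERS)
    return [counts[k] / total if total else 0.0 for k in KMERS]


def centroid(reads):
    """Average the k-mer profiles of all training reads of one class."""
    profiles = [kmer_profile(r) for r in reads]
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def classify(read, centroids):
    """Assign the label whose centroid is closest in Euclidean distance."""
    profile = kmer_profile(read)
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))


# Toy training reads (made up): one class AT-rich, the other GC-rich.
training = {
    "healthy": ["ATATATATATAT", "TATATATTATAT", "ATATTATATATA"],
    "infected": ["GCGCGCGCGCGC", "CGCGCGGCGCGC", "GCGCCGCGCGCG"],
}
centroids = {label: centroid(reads) for label, reads in training.items()}

print(classify("ATATATTATATA", centroids))
print(classify("GCGCGCCGCGCG", centroids))
```

No read is compared against a pathogen genome here; the classifier only learns which composition patterns separate the two labeled sets, which is what makes the approach usable when no related reference exists.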
|
7 |
Genetic basis and timing of a major mating system shift in Capsella
Bachmann, J.A., Tedder, Andrew, Laenen, B., Fracassetti, M., Désamoré, A., Lafon-Placette, C., Steige, K.A., Callot, C., Marande, W., Neuffer, B., Bergès, H., Köhler, C., Castric, V., Slotte, T. 13 September 2019
Yes / A crucial step in the transition from outcrossing to self-fertilization is the loss of genetic self-incompatibility (SI). In the Brassicaceae, SI involves the interaction of female and male specificity components, encoded by the genes SRK and SCR at the self-incompatibility locus (S-locus). Theory predicts that S-linked mutations, and especially dominant mutations in SCR, are likely to contribute to loss of SI. However, few studies have investigated the contribution of dominant mutations to loss of SI in wild plant species. Here, we investigate the genetic basis of loss of SI in the self-fertilizing crucifer species Capsella orientalis, by combining genetic mapping, long-read sequencing of complete S-haplotypes, gene expression analyses and controlled crosses. We show that loss of SI in C. orientalis occurred < 2.6 Mya and maps as a dominant trait to the S-locus. We identify a fixed frameshift deletion in the male specificity gene SCR and confirm loss of male SI specificity. We further identify an S-linked small RNA that is predicted to cause dominance of self-compatibility. Our results agree with predictions on the contribution of dominant S-linked mutations to loss of SI, and thus provide new insights into the molecular basis of mating system transitions. / Work at Uppsala Genome Center is funded by RFI/VR and Science for Life Laboratory, Sweden. The SNP&SEQ Platform is supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation. V.C. acknowledges support by a grant from the European Research Council (NOVEL project, grant #648321). The authors thank the French Ministère de l’Enseignement Supérieur et de la Recherche, the Hauts de France Region and the European Funds for Regional Economical Development for their financial support to this project. This work was supported by a grant from the Swedish Research Council (grant #D0432001) and by a grant from the Science for Life Laboratory, Swedish Biodiversity Program to T.S. The Swedish Biodiversity Program is supported by the Knut and Alice Wallenberg Foundation.
|
8 |
Long-Read RNA-Seq: Quality Control and Benchmarking
Pardo Palacios, Francisco José 18 November 2024
[ES] This thesis shows the use of long reads to overcome the limitations associated with conventional RNA-Seq, presenting significant innovations in this field. Long reads make it possible to capture complete transcripts and detect novel splicing variants, improving on the accuracy of results obtained with short reads, since there is no need to assemble reads, a step that could give rise to chimeric isoforms.
As part of this work, the SQANTI3 tool was developed for the evaluation and filtering of transcriptomes. SQANTI3 classifies long-read transcript models into structural categories based on their splice junctions (SJ) and annotates various quality features, such as the presence of non-canonical SJs or the reliability of the annotated transcription start and termination sites (TSS and TTS), using orthogonal data. It also offers an artifact filtering module based on machine learning and user-defined rules, as well as a "rescue" module to avoid losing entire genes through excessive filtering. Finally, SQANTI3 integrates the functional annotation of transcriptomes with isoAnnot Lite, facilitating the analysis of changes in isoform expression and their functional implications.
SQANTI3 was used in challenges 1 and 3 of the LRGASP project (Long-read RNA-seq Genome Annotation Assessment Project), an international, multicenter effort to benchmark bioinformatics tools for long-read RNA-Seq. Both challenges focused on the correct identification of transcripts, in highly annotated organisms (challenge 1) and in non-model organisms with limited prior information (challenge 3). LRGASP provided participants with data from different technologies and protocols so that they could submit the results obtained with their bioinformatics tools. These results were evaluated and compared using SQANTI3, making evident the differences between the transcriptomes obtained for the same sample depending on the data and methods used.
In summary, the work in this thesis highlights the importance that the use of long reads for RNA-Seq may have in the future and how SQANTI3 is and will be a key tool for evaluating and improving transcriptome quality. / [CA] This thesis shows the use of long reads to overcome the limitations associated with conventional RNA-Seq, presenting significant innovations in this field. Long reads make it possible to capture complete transcripts and detect novel splicing variants, improving on the accuracy of results obtained with short reads, since it is not necessary to assemble reads, a step that could give rise to chimeric isoforms.
As part of this work, the SQANTI3 tool was developed for the evaluation and filtering of transcriptomes. SQANTI3 classifies long-read transcript models into structural categories based on their splice junctions (SJ) and annotates various quality features, such as the presence of non-canonical SJs or the reliability of the annotated transcription start and termination sites (TSS and TTS), using orthogonal data. It also offers an artifact filtering module based on machine learning or user-defined rules, as well as a "rescue" module to avoid losing entire genes through excessive filtering. Finally, SQANTI3 integrates the functional annotation of transcriptomes with isoAnnot Lite, facilitating the analysis of changes in isoform expression and their functional implications.
SQANTI3 was used in challenges 1 and 3 of the LRGASP project (Long-read RNA-seq Genome Annotation Assessment Project), an international, multicenter effort to benchmark bioinformatics tools for long-read RNA-Seq. Both challenges focused on the correct identification of transcripts, in highly annotated organisms (challenge 1) and in non-model organisms with limited prior information (challenge 3). LRGASP provided participants with data from different technologies and protocols so that they could submit the results obtained with their bioinformatics tools. These results were evaluated and compared using SQANTI3, making evident the differences between the transcriptomes obtained for the same sample depending on the data and methods used.
In summary, this thesis highlights the importance that the use of long reads for RNA-Seq may have in the future and how SQANTI3 is and will be a key tool for evaluating and improving transcriptome quality. / [EN] This thesis presents the use of long-read sequencing to overcome the limitations associated with conventional RNA-Seq, introducing significant innovations in this field. Long-read sequencing enables the capture of full-length transcripts and the detection of novel splicing variants, improving the accuracy of results compared to short-read sequencing, as there is no need for assembly, which could otherwise lead to chimeric isoforms.
As part of this work, the SQANTI3 tool has been designed and developed for the evaluation and filtering of transcriptomes. SQANTI3 classifies long-read transcript models into structural categories based on their splice junctions (SJ) and annotates a wide variety of quality features, such as the presence of non-canonical SJs or the reliability of Transcription Start and Termination Sites (TSS and TTS) detected using orthogonal data. It also includes an artifact filtering module based on machine learning or user-defined rules, as well as a "rescue" module to prevent the loss of complete genes due to excessive filtering. Finally, SQANTI3 integrates the functional annotation of transcriptomes with isoAnnot Lite, facilitating the analysis of isoform expression changes and their functional implications.
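Classifying a transcript by its splice junctions can be illustrated with a reduced sketch. It is not SQANTI3's implementation: the function names and toy coordinates are hypothetical, the category labels loosely follow the SQANTI naming (full-splice match, incomplete-splice match, novel in catalog, novel not in catalog), and the matching rules are deliberately simplified (no strand, gene assignment, or TSS/TTS handling).

```python
def is_contiguous_subchain(query, chain):
    """True if the query junctions appear consecutively inside a reference chain."""
    n = len(query)
    return n > 0 and any(
        chain[i:i + n] == query for i in range(len(chain) - n + 1)
    )


def classify_transcript(junctions, reference):
    """Label a transcript by comparing its (donor, acceptor) junction chain
    with the junction chains of reference transcripts."""
    query = tuple(junctions)
    if not query:
        return "mono_exon"  # no junctions to compare
    ref_chains = [tuple(chain) for chain in reference.values()]
    if query in ref_chains:
        return "full_splice_match"
    if any(is_contiguous_subchain(query, chain) for chain in ref_chains):
        return "incomplete_splice_match"
    known_donors = {d for chain in ref_chains for d, _ in chain}
    known_acceptors = {a for chain in ref_chains for _, a in chain}
    if all(d in known_donors and a in known_acceptors for d, a in query):
        return "novel_in_catalog"  # known splice sites, new combination
    return "novel_not_in_catalog"  # at least one unannotated splice site


# Hypothetical reference annotation: transcript id -> ordered junction chain.
reference = {
    "tx1": [(100, 200), (300, 400), (500, 600)],
    "tx2": [(100, 200), (300, 600)],
}

print(classify_transcript([(100, 200), (300, 400), (500, 600)], reference))
print(classify_transcript([(300, 400), (500, 600)], reference))
print(classify_transcript([(100, 400), (500, 600)], reference))
print(classify_transcript([(100, 250), (300, 400)], reference))
```

Comparing whole junction chains rather than individual junctions is what lets a long read be placed in a category as a complete transcript model, without the assembly step that short reads require.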
SQANTI3 was used in challenges 1 and 3 of the Long-read RNA-seq Genome Annotation Assessment Project (LRGASP), an international and multicenter effort to benchmark bioinformatic tools for long-read RNA-Seq data. Both challenges focused on the correct identification of transcripts, in well-annotated organisms (challenge 1) and in non-model organisms with limited prior information (challenge 3). LRGASP provided participants with data from different sequencing technologies and protocols, and participants submitted the results obtained with their bioinformatics tools. These results were evaluated and compared using SQANTI3, highlighting the differences in transcriptomes obtained from the same sample depending on the data and methods used.
In summary, the work in this thesis emphasizes the importance that long-read RNA-Seq can have in the future and how SQANTI3 is and will continue to be a key tool for the evaluation and improvement of transcriptome quality. / The project is supported by the following grants: Pew Charitable Trust, NIGMS R35GM138122, NHGRI R21HG011280, Spanish Ministry of Science PID2020-119537RB-10, NIGMS R35GM142647, NIGMS R35GM133569, NHGRI U41HG007234, NHGRI F31HG010999, and UM1 HG009443, NHGRI R01HG008759 and R01HG011469, NHGRI R01HG007182, NHGRI UM1HG009402, NHMRC Investigator Grant GNT2017257, Comunitat Valenciana Grant ACIF/2018/290, Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation, Grant No. 2019-002443, an institutional fund from the Department of Biomedical Informatics, The Ohio State University, an institutional fund from the Department of Computational Medicine and Bioinformatics, University of Michigan, SPBU 73023672, AMED 22kk0305013h9903, 23kk0305024h0001, Wellcome Trust [WT222155/Z/20/Z], and European Molecular Biology Laboratory. We acknowledge the support of the Spanish Ministry of Science and Innovation to the EMBL partnership, Centro de Excelencia Severo Ochoa, and CERCA Programme / Generalitat de Catalunya and the support of the German Federal Ministry of Education and Research with the grant 161L0242A. This work has also been funded by NIH grant R21HG011280, by the Spanish Ministry of Science grants BES-2016-076994 and PID2020-119537RB-100, and by the Comunitat Valenciana grant ACIF/2018/290. / Pardo Palacios, FJ. (2024). Long-Read RNA-Seq: Quality Control and Benchmarking [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/212027
|
9 |
Co-transcriptional splicing in two yeasts
Herzel, Lydia 18 September 2015
Cellular function and physiology are largely established through regulated gene expression. The first step in gene expression, transcription of the genomic DNA into RNA, is a process that is highly regulated at the levels of initiation, elongation and termination. In eukaryotes, protein-coding genes are exclusively transcribed by RNA polymerase II (Pol II). Upon transcription of the first 15-20 nucleotides (nt), the emerging nascent RNA 5’ end is modified with a 7-methylguanosyl cap. This is one of several RNA modifications and processing steps that take place during transcription, i.e. co-transcriptionally. For example, protein-coding sequences (exons) are often disrupted by non-coding sequences (introns) that are removed by RNA splicing. The two transesterification reactions required for RNA splicing are catalyzed through the action of a large macromolecular machine, the spliceosome. Several non-coding small nuclear RNAs (snRNAs) and proteins form functional spliceosomal subcomplexes, termed snRNPs.
As the intron is synthesized, different snRNPs sequentially recognize sequence elements within it: first the 5’ splice site (5’ SS) at the intron start, then the branchpoint, and finally the 3’ splice site (3’ SS) at the intron end. Multiple conformational changes and concerted assembly steps lead to formation of the active spliceosome, cleavage of the exon-intron junction, intron lariat formation and finally exon-exon ligation with cleavage of the 3’ intron-exon junction. Estimates of pre-mRNA splicing duration range from 15 sec to several minutes; in terms of distance, the earliest detected splicing events occurred 500 nt downstream of the 3’ SS. However, the use of indirect assays, model genes and transcription induction/blocking leaves unanswered the question of when pre-mRNA splicing of endogenous transcripts occurs.
In recent years, global studies concluded that the majority of introns are removed during the course of transcription. In principle, co-transcriptional splicing reduces the need for post-transcriptional processing of the pre-mRNA. This could allow for quicker transcriptional responses to stimuli and optimal coordination between the different steps. In order to gain insight into how pre-mRNA splicing might be functionally linked to transcription, I wanted to determine when co-transcriptional splicing occurs, how transcripts with multiple introns are spliced and if and how the transcription termination process is influenced by pre-mRNA splicing.
I chose two yeast species, S. cerevisiae and S. pombe, to study co-transcriptional splicing. Small genomes, short genes and introns, but very different number of intron-containing genes and multi-intron genes in S. pombe, made the combination of both model organisms a promising system to study by next-generation sequencing and to learn about co-transcriptional splicing in a broad context with applicability to other species. I used nascent RNA-Seq to characterize co-transcriptional splicing in S. pombe and developed two strategies to obtain single-molecule information on co-transcriptional splicing of endogenous genes:
(1) with paired-end short read sequencing, I obtained the 3’ nascent transcript ends, which reflect the position of Pol II molecules during transcription, and the splicing status of the nascent RNAs. This is detected by sequencing the exon-intron or exon-exon junctions of the transcripts. Thus, this strategy links Pol II position with intron splicing of nascent RNA. The increase in the fraction of spliced transcripts with further distance from the intron end provides valuable information on when co-transcriptional splicing occurs.
(2) with Pacific Biosciences sequencing (PacBio) of full-length nascent RNA, it is possible to determine the splicing pattern of transcripts with multiple introns, e.g. sequentially with transcription or also non-sequentially. Part of transcription termination is cleavage of the nascent transcript at the polyA site. The splicing status of cleaved and non-cleaved transcripts can provide insights into links between splicing and transcription termination and can be obtained from PacBio data.
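Strategy (1) comes down to counting, for each distance of Pol II from the intron end, how many nascent reads are already spliced. The sketch below illustrates that bookkeeping under simplifying assumptions: a single plus-strand intron, reads already reduced to a pair of (Pol II position, spliced or not), and a hypothetical function name `spliced_fraction_by_distance`. It is not the pipeline used in the thesis.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def spliced_fraction_by_distance(reads: Iterable[Tuple[int, bool]],
                                 three_prime_ss: int,
                                 bin_size: int = 50) -> Dict[int, float]:
    """Fraction of spliced reads per bin of Pol II distance from the 3' SS.

    reads: (pol2_position, is_spliced) pairs, where pol2_position is the
    nascent 3' end of the read (assumed plus-strand coordinates).
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [spliced, total]
    for pol2_position, is_spliced in reads:
        distance = pol2_position - three_prime_ss
        if distance < 0:
            continue  # Pol II has not yet transcribed past the 3' SS
        bin_start = (distance // bin_size) * bin_size
        counts[bin_start][0] += int(is_spliced)
        counts[bin_start][1] += 1
    return {b: spliced / total for b, (spliced, total) in sorted(counts.items())}


# Toy reads for an intron whose 3' SS is at position 1000 (made-up values).
toy_reads = [(1010, False), (1020, False), (1040, True), (1060, True), (1090, True)]
print(spliced_fraction_by_distance(toy_reads, three_prime_ss=1000))
```

Plotting the returned fractions against the bin starts gives the kind of curve from which the onset and saturation of co-transcriptional splicing can be read.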
I found that co-transcriptional splicing in S. pombe is as prevalent as in other species and that most introns are removed co-transcriptionally. Co-transcriptional splicing levels depend on intron position, adjacent exon length, and GC-content, but not on splice site sequence. A high level of co-transcriptional splicing is correlated with high gene expression. In addition, I identified low-abundance circular RNAs in intron-containing as well as intronless genes, which could be side-products of RNA transcription and splicing.
The analysis of co-transcriptional splicing patterns of 88 endogenous S. cerevisiae genes showed that the majority of intron splicing occurs within 100 nt downstream of the 3’ SS. Saturation levels vary, and confirm results of a previous study. The onset of splicing is very close to the transcribing polymerase (within 27 nt) and implies that spliceosome assembly and conformational rearrangements must be completed immediately upon synthesis of the 3’ SS.
For S. pombe genes with multiple introns, most detected transcripts were completely spliced or completely unspliced. A smaller fraction showed partial splicing, with the first intron most often being the unspliced one. Close to the polyA site, most transcripts were spliced; however, uncleaved transcripts were often completely unspliced. This suggests a beneficial influence of pre-mRNA splicing on efficient transcription termination.
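The three pattern classes reported here (completely spliced, completely unspliced, partially spliced) can be tallied from full-length reads with a few lines of Python. This is an illustrative sketch only: it assumes each read has already been reduced to one boolean per intron (True meaning spliced), and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def splicing_pattern(intron_status: Sequence[bool]) -> str:
    """Classify one read from its per-intron status (True = intron spliced)."""
    if not intron_status:
        raise ValueError("read covers no introns")
    if all(intron_status):
        return "all spliced"
    if not any(intron_status):
        return "all unspliced"
    return "partially spliced"


def summarize_patterns(reads: Sequence[Sequence[bool]]) -> Counter:
    """Count how many reads fall into each pattern class."""
    return Counter(splicing_pattern(read) for read in reads)


def first_intron_unspliced_fraction(reads: Sequence[Sequence[bool]]) -> float:
    """Among partially spliced reads, the fraction whose first intron is unspliced."""
    partial = [r for r in reads if splicing_pattern(r) == "partially spliced"]
    if not partial:
        return 0.0
    return sum(1 for r in partial if not r[0]) / len(partial)


# Toy two-intron reads (made-up values, not data from the thesis).
toy_reads = [[True, True], [False, False], [False, True], [True, True]]
print(summarize_patterns(toy_reads))
```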
Overall, sequencing of nascent RNA with the two strategies developed in this work offers significant potential for the analysis of co-transcriptional splicing, transcription termination and also RNA polymerase pausing by profiling nascent 3’ ends. I could define the position of pre-mRNA splicing during the process of transcription and provide evidence for fast and efficient co-transcriptional splicing in S. cerevisiae and S. pombe, which is associated with highly expressed genes in both organisms. Differences in S. pombe co-transcriptional splicing could be linked to gene architecture features, like intron position, GC-content and exon length.
|
10 |
Étude des chromosomes sexuels et du déterminisme du sexe chez les plantes : comparaison des systèmes Silene et Coccinia / A study of sex chromosomes and sex determination in plants : Silene and Coccinia systems comparison
Fruchard, Cécile 09 July 2018
Bien que les sexes séparés (dioecie) soient plus rares que chez les animaux, ∼15 600 espèces dioiques ont évolué chez les angiospermes (∼6% de l'ensemble des espèces). La manière dont le sexe de ces plantes est contrôlé est une question centrale de la biologie végétale, mais également de l'agronomie car de nombreuses plantes cultivées sont des plantes dioiques (∼20% des espèces cultivées) mais dont un seul sexe (généralement les femelles) présente un intérêt agronomique. Pourtant, seulement trois gènes du déterminisme du sexe ont été identifiés à ce jour chez les plantes dioiques, chez le kaki, l'asperge et la fraise. La dioecie a vraisemblablement évolué plusieurs fois chez les angiospermes et il est possible que les gènes du déterminisme du sexe soient divers. Deux voies principales d'évolution vers la dioecie ont été identifiées. Les deux partent d'une espèce dont les fleurs sont hermaphrodites, le régime de reproduction ancestral chez les angiospermes, puis passent soit par un intermédiaire monoique (espèce avec des fleurs unisexuées mâles et femelles sur le même individu), soit par un intermédiaire gynodioique (espèce avec des femelles et des individus avec des fleurs hermaphrodites). Cette thèse a pour objet la comparaison de deux systèmes de plantes représentant ces deux voies. Chez Coccinia grandis, une cucurbitacée ayant également des chromosomes XY, l'évolution de la dioecie est passée par la monoecie. Chez Silene latifolia, une plante dioique bien étudiée avec des chromosomes sexuels XY, l'évolution de la dioecie s'est faite à partir de la gynodioecie. Trois gènes contrôlant la monoecie ont été identifiés chez le melon et il a été proposé que ces gènes soient les gènes du déterminisme dans les espèces dioiques proches du melon comme C. grandis. Nous avons donc opté pour une approche gène candidat dans cette espèce. Très peu de ressources génétiques et génomiques sont disponibles chez C. 
grandis, et nous avons choisi d'utiliser SEXDETector, une méthode probabiliste qui utilise des données RNA-seq pour génotyper des parents et leurs descendants, et qui infère les gènes lies au sexe sans génome de référence. Cette méthode m'a permis d'identifier 1 364 gènes présents sur les chromosomes sexuels de C. grandis. J'ai établi que les gènes differentiellement exprimés entre les sexes étaient plus abondants sur chromosomes sexuels que sur les autosomes. J'ai également observé des marques de la dégénérescence du chromosome Y chez cette plante, comme des diminutions d'expression ou des pertes de gènes. Enfin, mes résultats démontrent la présence de compensation de dosage chez C. grandis. Le test des gènes candidats est en cours. Chez S. latifolia, 3 grandes régions liées au déterminisme ont déjà été identifiées sur le chromosome Y. Pour identifier les gènes du déterminisme, nous avons choisi de séquencer ce chromosome. Le séquençage des chromosomes Y est encore un défi pour la génomique. La phase d'assemblage est très difficile à cause des répétitions présentes en grand nombre sur ces chromosomes. En conséquence, les séquences complètes de chromosome Y sont très rares, et principalement disponibles chez les animaux. Afin de minimiser les problèmes d'assemblage dus aux répétitions, nous avons utilisé des techniques dites de 3eme génération (avec de grandes lectures). J'ai moi-même généré des données MinION (Oxford Nanopore) à partir d'ADN de chromosome Y. L'assemblage a été réalisé en combinant des données Illumina, PacBio et MinION. Notre assemblage final fait une taille de 563 Mb pour un N50 de 6 114 pb, et contient 16 219 gènes annotés de novo / Although rarer than in animals, separate sexes (dioecy) have evolved in ∼15,600 angiosperm species (∼6% of all angiosperm species). How sex is controlled is a central question in plant sciences and also in agronomy as many crops are dioecious (∼20% of crops) with only one useful sex (usually female). 
Only three master sex-determining genes have been identified in dioecious plants so far, namely in persimmon, asparagus and strawberry. Dioecy likely evolved several times independently in angiosperms, suggesting that sex-determining genes are of diverse origins. Hermaphroditism is the predicted ancestral state of the angiosperm flower. Two main pathways have been identified that explain the evolution from hermaphroditism towards dioecy: either through a monoecious state (with both unisexual male and female flowers on the same individual) or a gynodioecious state (with females and individuals having hermaphroditic flowers). My aim is to compare two plant systems, each representing one of these two pathways. In Coccinia grandis, a Cucurbitaceae with an XY chromosome system, dioecy evolved through monoecy. In Silene latifolia, a well-studied dioecious plant with XY sex chromosomes, dioecy evolved through gynodioecy. Three genes controlling monoecy have been identified in melon, and it was suggested that these genes act as sex-determining genes in closely related dioecious species such as C. grandis. I therefore chose a candidate gene approach in this species. Very few genetic and genomic resources are available for C. grandis, so we chose to use SEX-DETector, a probabilistic method that uses RNA-seq data to genotype parents and their offspring, and infers sex-linked genes with no need for a reference genome. This method allowed me to identify 1,364 genes that are present on the sex chromosomes of C. grandis. I found that the sex chromosomes are enriched in sex-biased genes when compared to autosomes and I characterized Y chromosome degeneration in terms of decreased expression and gene loss. Finally, I showed that dosage compensation occurs in C. grandis. Testing of the three candidate genes is ongoing. In S. latifolia, three regions involved in sex determination have already been identified on the Y chromosome.
We chose to sequence this chromosome to identify sex-determining genes. The sequencing of Y chromosomes remains one of the greatest challenges of current genomics. The assembly step is very difficult because of their highly repeated content. Consequently, fully sequenced Y chromosomes are rare and mainly available in animals. To overcome the difficulty of assembling reads with many repeats, I used third generation sequencing (TGS, producing long reads). I produced a dataset using the Oxford Nanopore MinION sequencer with Y chromosome DNA. Assembly was performed using a combination of Illumina, MinION and PacBio sequencing data. The final assembly had a total length of 563 Mb with a scaffold N50 of 6,114 bp, and contained 16,219 de novo annotated genes.
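For readers unfamiliar with the N50 statistic quoted for the assembly: it is the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. A minimal Python sketch of that standard definition follows; the function name `n50` and the toy lengths are illustrative and unrelated to the thesis data.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the longest sequences cover half the total.
        if 2 * running >= total:
            return length
    return ordered[-1]  # not reached when all lengths are positive


# Toy example: total length is 20, and 6 + 5 = 11 covers at least half of it.
print(n50([2, 3, 4, 5, 6]))
```

A small N50 relative to the total assembly length, as reported here, indicates that the assembly is still split into many short scaffolds.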
|