Global ETD Search

11	Hypothesis-free detection of genome-changing events in pedigree sequencing
Garimella, Kiran, January 2016
In high-diversity populations, a complete accounting of de novo mutations can be difficult to obtain. Most analyses identify such mutations by sequencing pedigrees on second-generation sequencing platforms and aligning the short reads to a reference assembly, the genomic sequence of a canonical member (or members) of a species. Often, large regions of the genomes under study may be greatly diverged from the reference sequence, or not represented at all (e.g. the HLA, antigenic genes, or other regions under balancing selective pressure). If the haplotypic background upon which a mutation occurs is absent, events can easily be missed (as reads have nowhere to align) and false positives may abound (as the software forces the reads to align elsewhere). This thesis presents a novel method for de novo mutation discovery and allele identification. Rather than relying on alignment, our method is based on the de novo assembly of short-read sequence data using a multi-color de Bruijn graph. In this data structure, each sample is assigned a unique index (or "color"), reads from each sample are decomposed into smaller subsequences of length k (or "kmers"), and color-specific adjacency information between kmers is recorded. Mutations can be discovered in the graph itself by searching for characteristic motifs (e.g. "bubble" motifs, indicative of a SNP or indel, and "linear" motifs, indicative of allelic and non-allelic recombination). De novo mutations differ from inherited mutations in that the kmers spanning the variant allele are absent in the parents; in a sense, they facilitate their own discovery by generating "novel" sequence. We exploit this fact to limit processing of the graph to only those regions containing these novel kmers. We verified our approach using simulations, validation, and visualization.
For the simulations, we developed genome and read generation software driven by empirical distributions computed from real data to emit genomes with realistic features: recombinations, de novo variants, read fragment sizes, sequencing errors, and coverage profiles. In 20 artificial samples, we determined our sensitivity and specificity for novel kmer recovery to be approximately 98% and 100% at worst, respectively. Not every novel stretch can be reconstituted as a variant, owing to errors and homology in the graph. In simulations, our false discovery rate was 10% for "bubble" events and 12% for "linear" events. For validation, we obtained a high-quality draft assembly for a single P. falciparum child using a third-generation sequencing platform. We discovered three de novo events in the draft assembly, all three of which are recapitulated in our calls on the second-generation sequencing data for the same sample; no false positives are present. For visualization, we developed an interactive web application capable of rendering a multi-color subgraph that assists in visually distinguishing between true variation and sequencing artifacts. We applied our caller to real datasets: 115 progeny across four previously analyzed experimental crosses of Plasmodium falciparum. We demonstrate our ability to access subtelomeric compartments of the genome, regions harboring antigenic genes under tremendous selective pressure, which are therefore highly divergent between geographically distinct isolates and routinely masked and ignored in reference-based analyses. We also show our caller's ability to recover an important form of structural de novo variation: non-allelic homologous recombination (NAHR) events, an important mechanism for the pathogen to diversify its own antigenic repertoire. We demonstrate our ability to recover the few events in these samples known to exist, and overturn some previous findings indicating exchanges between "core" (non-subtelomeric) genes.
We compute the SNP mutation rate to be approximately 2.91 per sample, insertion and deletion mutation rates to be 0.55 and 1.04 per sample, respectively, multi-nucleotide polymorphisms to be 0.72 per sample, and NAHR events to be 0.33 per sample. These findings are consistent across crosses. Finally, we investigated our method's scaling capabilities by processing a quintet of previously analyzed Pan troglodytes verus (western chimpanzee) samples. The genome of the chimpanzee is two orders of magnitude larger than the malaria parasite's (3,300 Mbp versus 23 Mbp), diploid rather than haploid, and poorly assembled, and the read dataset is lower coverage (20x versus 120x). Compared with Sequenom validation data as well as visual validation, our sensitivity is low, as expected. However, this can be attributed to overaggressive data cleaning applied by the de novo assembler atop which our software is built. We discuss the precise changes that would likely need to be made in future work to adapt our method to low-coverage samples.
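The novel-kmer idea described in this abstract can be illustrated with a toy sketch. It assumes error-free reads, a small fixed k, and three hypothetical colors ("mother", "father", "child"); the function names are illustrative and are not taken from the thesis software, which additionally has to cope with sequencing errors and coverage.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_colored_graph(samples, k):
    """Map each kmer to {color: set of successor kmers seen in that color}."""
    graph = defaultdict(lambda: defaultdict(set))
    for color, reads in samples.items():
        for read in reads:
            ks = list(kmers(read, k))
            for a, b in zip(ks, ks[1:]):
                graph[a][color].add(b)   # color-specific adjacency
            if ks:
                graph[ks[-1]][color]     # register the last kmer (no successor)
    return graph

def novel_kmers(graph, child, parents):
    """Kmers present in the child's color but in neither parent's color."""
    return {
        km for km, colors in graph.items()
        if child in colors and not any(p in colors for p in parents)
    }

# Toy trio: the child carries a single-base change (T -> A) relative to both parents.
samples = {
    "mother": ["ACGTTGCA"],
    "father": ["ACGTTGCA"],
    "child":  ["ACGATGCA"],
}
graph = build_colored_graph(samples, k=3)
novel = novel_kmers(graph, "child", ["mother", "father"])
print(sorted(novel))  # ['ATG', 'CGA', 'GAT']
```

In this toy case a single substitution produces exactly k novel kmers, which is why restricting the search to novel kmers narrows processing to the neighborhood of the variant.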
12	Computational methods for de novo assembly of next-generation genome sequencing data
Chikhi, Rayan, 02 July 2012
In this thesis, we discuss computational methods (theoretical models and algorithms) to perform the reconstruction (de novo assembly) of DNA sequences produced by high-throughput sequencers. This problem is challenging, both theoretically and practically. The theoretical difficulty arises from the complex structure of genomes. The assembly process has to deal with reconstruction ambiguities. The output of sequencing is consistent with up to an exponential number of reconstructions, yet only one is correct. To deal with this problem, only a fragmented approximation of the genome is returned. The practical difficulty stems from the huge volume of highly redundant data produced by sequencers. Significant computing power is required to process it. As larger genomes and meta-genomes are being sequenced, the need for efficient computational methods for de novo assembly is increasing rapidly. This thesis introduces novel contributions to genome assembly, both in terms of incorporating more information to improve the quality of results, and efficiently processing data to reduce the computational complexity. Specifically, we propose a novel algorithm to quantify the maximum theoretical genome coverage achievable by sequencing data (paired reads), and apply this algorithm to several model genomes. We formulate a set of computational problems that take into account pairing information in assembly, and study their complexity. Then, two novel concepts that cover practical aspects of assembly are proposed: localized assembly and memory-efficient read indexing. Localized assembly consists of constructing and traversing a partial assembly graph. These ingredients are implemented in a complete de novo assembly software package, the Monument assembler. Monument is compared with other state-of-the-art assembly methods.
Finally, we conclude with a series of smaller projects, exploring concepts beyond classical de novo assembly.
Computer Science/Other; Algorithms; Bioinformatics; Genome assembly
13	Generating genomic resources for two crustacean species and their application to the study of White Spot Disease
Verbruggen, Bas, January 2016
Over the last decades the crustacean aquaculture sector has been growing steadily to meet global demand for its products. A major hurdle for further growth of the industry is the prevalence of viral disease epidemics that are facilitated by the intense culture conditions. A devastating virus affecting the sector is the White Spot Syndrome Virus (WSSV), responsible for over US$10 billion in losses in shrimp production and trade. The pathogenicity of WSSV is high, reaching 100% mortality within 3-10 days in penaeid shrimps. In contrast, the European shore crab Carcinus maenas has been shown to be relatively resistant to WSSV. Uncovering the basis of this resistance could inform the development of strategies to mitigate the WSSV threat. C. maenas has been used widely in studies on ecotoxicology and host-pathogen interactions. However, as for most aquatic crustaceans, the genomic resources available for this species are limited, impairing experimentation. Therefore, to facilitate interpretation of the exposure studies, we first produced a C. maenas transcriptome and genome scaffold assembly. We also produced a transcriptome for the European lobster (Homarus gammarus), an ecologically and commercially important crustacean species in United Kingdom waters, for use in comparing WSSV responses in this susceptible species and in C. maenas. For the C. maenas transcriptome assembly we isolated and pooled RNA from twelve different tissues and sequenced the RNA on an Illumina HiSeq 2500 platform. After de novo assembly, a transcriptome encompassing 212,427 transcripts was produced. Similarly, the H. gammarus transcriptome was based on RNA from nine tissues and contained 106,498 transcripts.
The transcripts were filtered and annotated using a variety of tools (including BLAST, MEGAN and RSEM) and databases (including GenBank, Gene Ontology and KEGG). The annotation rate for transcripts in both transcriptomes was around 20-25%, which appears to be common for aquatic crustacean species as a result of the lack of well-annotated gene sequences for this clade. Since it is likely that the host immune system plays an important role in WSSV infection, we characterized the IMD, JAK/STAT, Toll-like receptor and other innate immune system pathways. We found a strong overlap between the immune system pathways in C. maenas and H. gammarus. In addition, we investigated the sequence diversity of known WSSV-interacting proteins among susceptible penaeid shrimp/lobster and the more resistant C. maenas. There were differences in viral receptor sequences, such as Rab7, that correlate with a less efficient infection by WSSV. To produce the genome scaffold assembly for C. maenas we isolated DNA from muscle tissue and produced both paired-end and mate-pair libraries for processing on the Illumina HiSeq 2500 platform. A de novo draft genome assembly consisting of 338,980 scaffolds and covering 362 Mb (36% of estimated genome size) was produced, using SOAPdenovo2 coupled with the BESST scaffolding system. The generated assembly was highly fragmented due to the presence of repetitive areas in the C. maenas genome. Using a combination of ab initio predictors, RNA-sequencing data from the transcriptome datasets and curated C. maenas sequences, we produced a model encompassing 10,355 genes. The gene model for C. maenas Dscam, a gene potentially involved in (pan)crustacean immune memory, was investigated in greater detail, as manual curation can improve on the results of ab initio predictors. The scaffold containing C. maenas Dscam was fragmented and thus contained only the latter exons of the gene. The assembled draft genome and transcriptomes for C. maenas and H. gammarus are valuable molecular resources for studies involving these and other aquatic crustacean species.
To uncover the basis of its resistance to WSSV, we infected C. maenas with WSSV and measured mRNA and miRNA expression at 7 time points spread over a period of 28 days, using RNA-Seq and miRNA-Seq. The resistance of C. maenas to WSSV infection was confirmed by the fact that no mortalities occurred. In these animals, replicating WSSV was latent and was detected only after 7 days, and only in five out of 28 infected crabs. Differentially expressed transcripts and miRNAs were identified for each time point. In the first 12 hours post exposure we observed decreased expression of important regulators of endocytosis. Since it is established that WSSV enters host cells through endocytosis and that interactions between the viral protein VP28 and Rab7 are important in successful infection, it is likely that changes in this process could affect WSSV infection success. Additionally, we observed increased expression of transcripts involved in RNA interference pathways across many time points, indicating a longer-term response to initial viral exposure. miRNA sequencing showed several miRNAs that were differentially expressed. The most striking finding was a novel C. maenas miRNA that we found to be significantly downregulated in every WSSV-infected individual, suggesting that it may play an important role in mediating the response of the host to the virus. In silico target prediction pointed to the involvement of this miRNA in endocytosis regulation. Taken together, we hypothesize that C. maenas resistance to WSSV involves obstruction of viral entry by endocytosis, a process probably regulated through miRNAs, resulting in inefficient uptake of virions.
14	Caracterização da região Bru1 no genoma da cultivar RB867515 (Saccharum spp.) utilizando sequenciamento de nova geração / Characterization of Bru1 region of sugarcane cultivar RB867515 using next generation sequencing
Souza, Isabela Pavanelli de, 25 September 2014
Made available in DSpace on 2017-04-18. Previous issue date: 2014-09-25. Funding: Financiadora de Estudos e Projetos (Finep) / Other.
Sugarcane is one of the world's most important crops because of the many uses of its by-products. Four countries, led by Brazil, supply the international sugar trade. Ethanol is another important sugarcane by-product, recognized as an alternative product to sugar, and is in great demand in the Brazilian market as a non-fossil fuel. The sugarcane genome is one of the most complex among crops, at about 10 Gb. Its complete genome is not available, but recent innovations in genomics tools open up new possibilities for investigating the sugarcane genome. We assembled and annotated a genomic region of a Brazilian sugarcane cultivar (RB867515) corresponding to eight previously published homologous sequences from cultivar R570. We used high-quality paired-end libraries produced on the Illumina HiSeq 2000 sequencing platform. The reads were aligned with Bowtie2 against eight R570 BAC (bacterial artificial chromosome) sequences deposited at NCBI. We used MaSuRCA to assemble the aligned reads de novo, and the consensus sequences were obtained with the SAMtools mpileup option. The transposable elements were identified using RepeatMasker and the gene regions were annotated with Blastx against the GenBank non-redundant protein database. The consensus sequences were then aligned with the matching reference (R570) using ClustalW in the MEGA software, to determine the percentage of mismatches and conserved sites between them. The number of scaffolds longer than 1 kb ranged from 607 to 2,884, and the longest scaffold was nearly 21 kb. The consensus sequence length ranged from 81 to 142 kb, and the recovery rate relative to the reference ranged from 82% to 97%. Together the sequences amounted to nearly 1 Mb of the RB867515 genome. We identified 5,145 repetitive elements, of which 4,662 were microsatellites and 460 were transposable elements, amounting to 225 kb of repeated sequence. Among the mobile elements, retrotransposons made up 15% of the nucleotide composition, ranging from 8% to 29% among BACs. The 134 genes identified in the eight BAC consensus sequences comprised a total of 243 kb, with an average density of one gene per 7.2 kb. The average number of genes per BAC was 16, with an average gene length of 1,841 bp. The percentage of mismatches between the RB867515 and R570 BACs ranged from 0.27% to 1.32%. The sugarcane BACs correspond to homeologous genomic regions; this alignment suggests high divergence within a homeologous group.
Portuguese abstract (translated): Sugarcane is recognized as one of the world's most important crops because of the use of its by-products. The sugarcane genome is one of the most complex among cultivated plants, at approximately 10 Gb. Its complete genome has not yet been sequenced, but the emergence and popularization of new genomic analysis tools have enabled refined studies of this crop. Given the large volume of information that can be generated, the current demand is for efficient tools for data processing. An assembly and annotation were carried out for a region of the genome of cultivar RB867515 corresponding to the sequences of 8 BACs from cultivar R570. The corresponding regions were obtained by alignment using Bowtie2 with reads from paired-end libraries produced by a next-generation automated sequencer and assembled de novo using MaSuRCA. The scaffolds were aligned to the reference sequences using BWA-SW, and the consensus sequences were obtained with the mpileup option of SAMtools. cDNA reads from five plant tissues, from 30 sugarcane genotypes, obtained by RNA-seq, were mapped to the consensus sequences to identify gene regions, which were annotated using Blastx against the non-redundant protein database in GenBank. Repetitive regions were determined with RepeatMasker and microsatellites with IMEX. To compare the sequences of the two cultivars, the corresponding sequences in the two genomes were aligned using ClustalW in the MEGA software. The assembly of the eight regions generated from 607 to 2,884 scaffolds longer than 1 kb, with the largest scaffold reaching 21 kb. The consensus sequences range from 81 to 142 kb in length, representing a recovery rate relative to the reference of 82% to 97%. The total size of the assembled sequences came to almost 1 Mb of the sugarcane cultivar's genome. Regarding annotation, 5,145 repetitive genetic elements were identified, of which 4,662 are microsatellites and 460 are transposons, totaling 225 kb of repeated sequence along the BACs. Within the group of mobile genetic elements, retrotransposons are the majority, with 15% of the nucleotide composition, ranging from 8% to 29% among BACs. A total of 134 genes were identified in the eight sugarcane sequences analyzed, totaling 243 kb. The number of genes per BAC ranged from 11 to 26, with a mean of 16 genes per BAC. The genes found had a mean size of 1,841 bp, ranging from 443 bp (BAC1) to 6,316 bp (BAC3). The mean gene density was 1 gene per 7.2 kb. The percentage of mismatches between the RB867515 BAC sequences ranged from 0.27% to 1.32%. The sugarcane BACs correspond to homeologous genomic regions; the alignment performed with the two cultivars suggests that there is high divergence within the homeology group.
Sugarcane; Genomics; Bioinformatics; Genome assembly; Saccharum spp; Illumina reads; Genetics::Plant genetics
15	Improving genome assemblies of non-model non-vertebrate animals with long reads and Hi-C
Guiglielmoni, Nadege, 07 September 2021
The corpus of reference genomes is rapidly expanding as more and more genome assemblies are released for a wide variety of species. The constant progress in sequencing technologies has led to the release in 2021 of a first complete, telomere-to-telomere, gap-less assembly of a human genome, yet a myriad of eukaryote species still lack genomic resources. For animals, genomic projects have focused on species closely related to humans (vertebrates) and those with an impact on health and agriculture. By contrast, there is still a dearth of non-vertebrate genomes, which poorly represents their tremendous diversity (about 95% of animal diversity). Haploid chromosome-level genome assemblies using long reads and chromosome conformation capture (such as Hi-C) have become a standard in recent publications. To provide a haploid representation of diploid and polyploid genomes, assemblers collapse haplotypes into a single sequence, yet they are sensitive to high levels of heterozygosity and often yield fragmented assemblies with artefactual duplications. I tackled these shortcomings with two strategies: improving collapsed assemblies with a comprehensive long-read assembly methodology tuned for highly heterozygous genomes; and separating haplotypes to obtain phased assemblies using long reads and Hi-C. The assemblies were finally brought to chromosome-level scaffolds with a new Hi-C scaffolder, which demonstrated its efficiency on genomes of non-model organisms. These methods were applied to generate chromosome-level assemblies of three species for which few or no assemblies of closely related species were available: the bdelloid rotifer Adineta vaga, the coral Astrangia poculata, and the chaetognath Flaccisagitta enflata.
These high-quality assemblies contribute to filling the current gaps in non-vertebrate genomics and pave the way for future sequencing initiatives aiming to generate such reference assemblies for all the species on Earth.
Doctorate in Sciences
Biomedical and agricultural sciences; Genome assembly; Long reads; Hi-C; Non-vertebrate animals
16	De novo assembly of the rooibos genome
Stander, Allison Anne, January 2020
Magister Scientiae - MSc
Rooibos (Aspalathus linearis) is endemic to the Cederberg region of South Africa, and one of the few indigenous medicinal plants commercially cultivated in the country. International interest in rooibos is growing, and currently most of the rooibos harvest is exported to more than 30 countries. Various problems hamper the growth of the rooibos industry, including insect pests, diseases, drought and a decreasing lifespan of the plants. The availability of whole-genome data for rooibos can contribute to the selection of genetically superior plants, facilitating not only the identification of important genes and metabolic pathways in rooibos, but also the establishment of breeding programs.
South Africa; Medicinal plants; Rooibos; Next-generation sequencing; De novo genome assembly
17	Relationships Among AA-Genome Chenopodium Diploids and a Whole-Genome Assembly of the North American Species, C. watsonii
Young, Lauren Amillicent, 06 June 2022
Chenopodium quinoa Willd., an ancient Andean pseudocereal almost exclusively consumed in South America, jumped onto the global stage when Western cultures noted quinoa's advantageous nutritional profile. Quinoa seed's high protein content, nutritionally balanced amino acid profile, low glycemic index, and high fiber, vitamin, and mineral content make it a highly sought-after 'superfood'. Pitseed goosefoot (C. berlandieri Moq.), a closely related North American species sharing quinoa's genome composition (AABB), grows across the North American continent, inhabiting diverse environments including the saline coastal soils of the Gulf of Texas and the drought-prone regions of the Southwest. Quinoa and pitseed goosefoot, along with South American avian goosefoot (C. hircinum Schrad.), make up the Allotetraploid Goosefoot Complex (ATGC). We hypothesize that an ancient hybridization event between A- and B-genome diploids, with a subsequent whole-genome duplication, gave rise to the common ancestor of the ATGC. Prior data indicate that allopolyploidization most likely occurred within North America, with long-range dispersal of the ATGC to South America. We have sequenced the genome of the North American AA-genome diploid C. watsonii and identified via DNA marker analyses the closest extant species to the AA-genome diploid ancestor of the ATGC from among a panel of 41 AA-genome diploid resequenced accessions, encompassing 30 putative AA-genome diploid species, from North and South America. We also present evidence for reciprocal long-range dispersal of Chenopodium diploids between North and South America.
Chenopodium berlandieri; Chenopodium quinoa; Chenopodium watsonii; AA-genome diploid species; whole-genome assembly; phylogenomics; Life Sciences
18	Whole-Genome Assembly of Atriplex hortensis L. Using Oxford Nanopore Technology with Chromatin-Contact Mapping
Hunt, Spencer Philip, 01 July 2019
Atriplex hortensis (2n = 2x = 18, 1C genome size ~1.1 gigabases), also known as garden orach, is a highly nutritious, broadleaf annual of the Amaranthaceae-Chenopodiaceae family that has spread from its native Eurasia to other temperate and subtropical environments worldwide. Atriplex is a highly complex and polyphyletic genus of generally halophytic and/or xerophytic plants, some of which have been used as food sources for humans and animals alike. Although there is some literature describing the taxonomy and ecology of orach, there is a lack of genetic and genomic data that would otherwise help elucidate the genetic variation, phylogenetic position, and future potential of this species. Here, we report the assembly of the first high-quality, chromosome-scale reference genome for orach cv. ‘Golden’. Sequence data was produced using Oxford Nanopore’s MinION sequencing technology in conjunction with Illumina short reads and chromatin-contact mapping. Genome assembly was accomplished using the high-noise, single-molecule sequencing assembler, Canu. The genome is enriched for highly repetitive DNA (68%). The Canu assembly combined with the Hi-C chromatin-proximity data yielded a final assembly containing 1,325 scaffolds with a contig N50 of 98.9 Mb and with 94.7% of the assembly represented in the nine largest, chromosome-scale scaffolds. Sixty-eight percent of the genome was classified as highly repetitive DNA, with the most common repetitive elements being Gypsy and Copia-like LTRs. The annotation was completed using MAKER, which identified 31,010 gene models and 2,555 tRNA genes. Completeness of the genome was assessed using the Benchmarking Universal Single Copy Orthologs (BUSCO) platform, which quantifies functional gene content using a large core set of highly conserved orthologous genes (COGs).
Of the 1,375 plant-specific COGs in the Embryophyta database, 1,330 (96.7%) were identified in the Atriplex assembly. We also report the results of a resequencing panel consisting of 21 accessions, which illustrates a high degree of genetic similarity among cultivars and wild material from various locations in North America and Europe. These genome resources provide vital information to better understand orach and facilitate future study and comparison.
Atriplex hortensis; orach; Oxford Nanopore; DNA sequencing; proximity-guided assembly; genome assembly; Life Sciences; Plant Sciences
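For readers unfamiliar with the contiguity statistic quoted in this abstract: N50 is the sequence length L such that sequences of length at least L together contain at least half of the assembled bases. The sketch below computes it for a made-up list of lengths (not the orach assembly, whose individual sequence lengths are not given here) and recomputes the BUSCO percentage from the two counts stated in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical lengths: total = 290, so half is 145; 100 + 80 = 180 crosses it.
toy_lengths = [100, 80, 50, 30, 20, 10]
print(n50(toy_lengths))  # 80

# BUSCO completeness from the counts reported in the abstract.
busco_pct = round(100 * 1330 / 1375, 1)
print(busco_pct)  # 96.7
```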
19	Advancing the analysis of bisulfite sequencing data in its application to ecological plant epigenetics
Nunn, Adam, 27 October 2022
The aim of this thesis is to bridge the gap between the state-of-the-art bioinformatic tools and resources, currently at the forefront of epigenetic analysis, and their emerging applications to non-model species in the context of plant ecology. New, high-resolution research tools are presented; first in a specific sense, by providing new genomic resources for a selected non-model plant species, and also in a broader sense, by developing new software pipelines to streamline the analysis of bisulfite sequencing data, in a manner which is applicable to a wide range of non-model plant species. The selected species is the annual field pennycress, Thlaspi arvense, which belongs in the same lineage of the Brassicaceae as the closely related model species, Arabidopsis thaliana, and yet does not benefit from such extensive genomic resources. It is one of three key species in a Europe-wide initiative to understand how epigenetic mechanisms contribute to natural variation, stress responses and long-term adaptation of plants. To this end, this thesis provides a high-quality, chromosome-level assembly for T. arvense, alongside a rich complement of feature annotations of particular relevance to the study of epigenetics. The genome assembly encompasses a hybrid approach, involving both PacBio continuous long reads and circular consensus sequences, alongside Hi-C sequencing, PCR-free Illumina sequencing and genetic maps. The result is a significant improvement in contiguity over the existing draft state from earlier studies. Much of the basis for building an understanding of epigenetic mechanisms in non-model species centres around the study of DNA methylation, and in particular the analysis of bisulfite sequencing data to bring methylation patterns into nucleotide-level resolution.
In order to maintain a broad level of comparison between T. arvense and the other selected species under the same initiative, a suite of software pipelines has also been developed, covering mapping, the quantification of methylation values, differential methylation between groups, and epigenome-wide association studies. Furthermore, presented herein is a novel algorithm which can facilitate accurate variant calling from bisulfite sequencing data using conventional approaches, such as FreeBayes or Genome Analysis ToolKit (GATK), which until now was feasible only with specifically adapted software. This enables researchers to obtain high-quality genetic variants, often essential for contextualising the results of epigenetic experiments, without the need for additional sequencing libraries. Each of these aspects is thoroughly benchmarked, integrated into a robust workflow management system, and adheres to the principles of FAIR (Findability, Accessibility, Interoperability and Reusability). Finally, further consideration is given to the unique difficulties presented by population-scale data, and a number of concepts and ideas are explored in order to improve the feasibility of such analyses. In summary, this thesis introduces new high-resolution tools to facilitate the analysis of epigenetic mechanisms, specifically relating to DNA methylation, in non-model plant data. In addition, thorough benchmarking standards are applied, showcasing the range of technical considerations which are of principal importance when developing new pipelines and tools for the analysis of bisulfite sequencing data.
The complete “Epidiverse Toolkit” is available at https://github.com/EpiDiverse and will continue to be updated and improved in the future.
ABSTRACT
ACKNOWLEDGEMENTS
1 INTRODUCTION 1.1 ABOUT THIS WORK 1.2 BIOLOGICAL BACKGROUND 1.2.1 Epigenetics in plant ecology 1.2.2 DNA methylation 1.2.3 Maintenance of 5mC patterns in plants 1.2.4 Distribution of 5mC patterns in plants 1.3 TECHNICAL BACKGROUND 1.3.1 DNA sequencing 1.3.2 The case for a high-quality genome assembly 1.3.3 Sequence alignment for NGS 1.3.4 Variant calling approaches
2 BUILDING A SUITABLE REFERENCE GENOME 2.1 INTRODUCTION 2.2 MATERIALS AND METHODS 2.2.1 Seeds for the reference genome development 2.2.2 Sample collection, library preparation, and DNA sequencing 2.2.3 Contig assembly and initial scaffolding 2.2.4 Re-scaffolding 2.2.5 Comparative genomics 2.3 RESULTS 2.3.1 An improved reference genome sequence 2.3.2 Comparative genomics 2.4 DISCUSSION
3 FEATURE ANNOTATION FOR EPIGENOMICS 3.1 INTRODUCTION 3.2 MATERIALS AND METHODS 3.2.1 Tissue preparation for RNA sequencing 3.2.2 RNA extraction and sequencing 3.2.3 Transcriptome assembly 3.2.4 Genome annotation 3.2.5 Transposable element annotations 3.2.6 Small RNA annotations 3.2.7 Expression atlas 3.2.8 DNA methylation 3.3 RESULTS 3.3.1 Transcriptome assembly 3.3.2 Protein-coding genes 3.3.3 Non-coding loci 3.3.4 Transposable elements 3.3.5 Small RNA 3.3.6 Pseudogenes 3.3.7 Gene expression atlas 3.3.8 DNA Methylation 3.4 DISCUSSION
4 BISULFITE SEQUENCING METHODS 4.1 INTRODUCTION 4.2 PRINCIPLES OF BISULFITE SEQUENCING 4.3 EXPERIMENTAL DESIGN 4.4 LIBRARY PREPARATION 4.4.1 Whole Genome Bisulfite Sequencing (WGBS) 4.4.2 Reduced Representation Bisulfite Sequencing (RRBS) 4.4.3 Target capture bisulfite sequencing 4.5 BIOINFORMATIC ANALYSIS OF BISULFITE DATA 4.5.1 Quality Control 4.5.2 Read Alignment 4.5.3 Methylation Calling 4.6 ALTERNATIVE METHODS
5 FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS 5.1 INTRODUCTION 5.2 MATERIALS AND METHODS 5.2.1 Reference species 5.2.2 Natural accessions 5.2.3 Read simulation 5.2.4 Read alignment 5.2.5 Mapping rates 5.2.6 Precision-recall 5.2.7 Coverage deviation 5.2.8 DNA methylation analysis 5.3 RESULTS 5.4 DISCUSSION 5.5 A PIPELINE FOR WGBS ANALYSIS
6 THERE AND BACK AGAIN: INFERRING GENOMIC INFORMATION 6.1 INTRODUCTION 6.1.1 Implementing a new approach 6.2 MATERIALS AND METHODS 6.2.1 Validation datasets 6.2.2 Read processing and alignment 6.2.3 Variant calling 6.2.4 Benchmarking 6.3 RESULTS 6.4 DISCUSSION 6.5 A PIPELINE FOR SNP VARIANT ANALYSIS
7 POPULATION-LEVEL EPIGENOMICS 7.1 INTRODUCTION 7.2 CHALLENGES IN POPULATION-LEVEL EPIGENOMICS 7.3 DIFFERENTIAL METHYLATION 7.3.1 A pipeline for case/control DMRs 7.3.2 A pipeline for population-level DMRs 7.4 EPIGENOME-WIDE ASSOCIATION STUDIES (EWAS) 7.4.1 A pipeline for EWAS analysis 7.5 GENOTYPING-BY-SEQUENCING (EPIGBS) 7.5.1 Extending the epiGBS pipeline 7.6 POPULATION-LEVEL HAPLOTYPES 7.6.1 Extending the EpiDiverse/SNP pipeline
8 CONCLUSION
APPENDICES A. SUPPLEMENT: BUILDING A SUITABLE REFERENCE GENOME B. SUPPLEMENT: FEATURE ANNOTATION FOR EPIGENOMICS C. SUPPLEMENT: FROM READ ALIGNMENT TO DNA METHYLATION ANALYSIS D. SUPPLEMENT: INFERRING GENOMIC INFORMATION
BIBLIOGRAPHY
20	[en] A NOVEL APPROACH FOR DE BRUIJN GRAPH CONSTRUCTION IN DE NOVO GENOME FRAGMENT ASSEMBLY / [pt] UMA NOVA ABORDAGEM PARA A CONSTRUÇÃO DO GRAFO DE BRUIJN NA MONTAGEM DE NOVO DE FRAGMENTOS DE GENOMA ELVISMARY MOLINA DE ARMAS 04 May 2020 (has links) [en] Fragment assembly is a fundamental problem in bioinformatics. In de novo assembly, where no reference genome is available to guide the process, the de Bruijn graph data structure is used to support the computational processing. In particular, a large set of k-mers (substrings of the biological sequences) must be considered. However, constructing this graph has a high computational cost, chiefly in main memory consumption, and becomes infeasible when assembling large sets of k-mers. Some solutions in the literature use the external-memory model to make the procedure feasible, but all of them involve highly redundant computation over the k-mers, which considerably increases the number of I/O operations. This thesis proposes a new approach to de Bruijn graph construction that makes it unnecessary to generate all k-mers. The solution reduces computational requirements and makes execution feasible, as confirmed by the experimental results.
[pt] MONTAGEM DE GENOMAS [pt] K MER [pt] GRAFO DE BRUIJN [en] GENOME ASSEMBLY [en] K MER [en] DE BRUIJN GRAPH
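For readers unfamiliar with the data structure named in this abstract, the following is a minimal sketch of the conventional, naive in-memory construction that generates every k-mer. It is not the thesis's method, which is designed to avoid exactly this step. The function names (`kmers`, `build_de_bruijn`) and the toy reads are illustrative assumptions, not taken from the thesis.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k in the read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Naive de Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer.

    Every k-mer of every read is generated, so k-mers shared between
    reads are produced repeatedly (hypothetical helper for illustration).
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            # Edge from the k-mer's prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy input: two overlapping reads (assumed data, for illustration only).
reads = ["ACGTAC", "GTACG"]
graph = build_de_bruijn(reads, 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy input the two reads produce seven k-mers in total but only four distinct edges, a small-scale instance of the redundancy the abstract attributes to approaches that generate all k-mers.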
