261 |
Computational Methods for Solving Next Generation Sequencing Challenges
Aldwairi, Tamer Ali 13 December 2014
In this study we build solutions to three common challenges in the field of bioinformatics by applying statistical methods and developing computational approaches. First, we address a common problem in genome-wide association studies: linking genotype features within organisms of the same species to their phenotype characteristics. We specifically studied FHA domain genes in Arabidopsis thaliana distributed within Eurasian regions by clustering plants that share similar genotype characteristics and comparing the clusters to the regions from which the plants were taken. Second, we developed a tool for calculating transposable element density within different regions of a genome. The tool is built to use the information provided by other transposable element annotation tools and gives the user a number of options for calculating the density for various genomic elements, such as genes, piRNA and miRNA, or for the whole genome. It also provides a detailed calculation of densities for each family and subfamily of transposable elements. Finally, we address the problem of mapping multi-reads in the genome and their effects on gene expression. To accomplish this, we implemented methods to determine the statistical significance of expression values within genes using both a unique-read and a multi-read weighting scheme. We believe this approach provides a much more accurate measure of gene expression than existing methods, such as discarding multi-reads completely or assigning them randomly to a set of best assignments, while also providing a better estimate of the proper mapping locations of ambiguous reads. Overall, the solutions we built in these studies provide researchers with tools and approaches that help solve some of the common challenges that arise in the analysis of high-throughput sequence data.
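The density calculation described above can be illustrated with a minimal sketch. It assumes that density means the fraction of a region's bases covered by annotated transposable elements, that coordinates are half-open `[start, end)` intervals on a single sequence, and that overlapping annotations are counted once; the abstract does not state the tool's actual definition, and the function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumption of this sketch


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def te_density(region: Interval, te_intervals: List[Interval]) -> float:
    """Fraction of the region's bases covered by at least one TE annotation."""
    r_start, r_end = region
    if r_end <= r_start:
        raise ValueError("region must have positive length")
    covered = 0
    for start, end in merge_intervals(te_intervals):
        overlap = min(end, r_end) - max(start, r_start)
        if overlap > 0:
            covered += overlap
    return covered / (r_end - r_start)


def te_density_by_family(
    region: Interval, annotations: List[Tuple[str, int, int]]
) -> Dict[str, float]:
    """Per-family density; annotations are (family, start, end) tuples."""
    by_family: Dict[str, List[Interval]] = defaultdict(list)
    for family, start, end in annotations:
        by_family[family].append((start, end))
    return {fam: te_density(region, ivs) for fam, ivs in by_family.items()}
```

The same function could be applied to a gene, a piRNA cluster, or a whole chromosome simply by changing the `region` argument, which is one plausible way to offer the per-element options the abstract mentions.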
|
262 |
DECODING THE TRANSCRIPTIONAL LANDSCAPE OF TRIPLE-NEGATIVE BREAST CANCER USING NEXT GENERATION WHOLE TRANSCRIPTOME SEQUENCING
Radovich, Milan 16 March 2012
Indiana University-Purdue University Indianapolis (IUPUI) / Triple-negative breast cancers (TNBCs) are negative for the expression of estrogen (ER), progesterone (PR), and HER-2 receptors. TNBC accounts for 15% of all breast cancers and results in disproportionately higher mortality than ER- and HER2-positive tumours. Moreover, there is a paucity of therapies for this subtype of breast cancer, resulting primarily from an inadequate understanding of the transcriptional differences that distinguish TNBC from normal breast. To this end, we embarked on a comprehensive examination of the transcriptomes of TNBCs and normal breast tissues using next-generation whole transcriptome sequencing (RNA-Seq). By comparing RNA-seq data from these tissues, we report the presence of differentially expressed coding and non-coding genes, novel transcribed regions, and mutations not previously reported in breast cancer. From these data we have identified two major themes. First, BRCA1 mutations are well known to be associated with the development of TNBC. We have identified many dysregulated genes that work in concert with BRCA1, suggesting a role for BRCA1-associated genes in sporadic TNBC. In addition, we observe a mutational profile in genes associated with BRCA1 and DNA repair that lends further evidence to this role. Second, we demonstrate that microdissected normal epithelium may be an optimal comparator when searching for novel therapeutic targets for TNBC. Previous studies have used other controls, such as reduction mammoplasties, adjacent normal tissue, or other breast cancer subtypes, which may be sub-optimal and may have led to the identification of ineffective therapeutic targets. Our data suggest that comparing microdissected ductal epithelium to TNBC can identify potential therapeutic targets that may lead to better clinical efficacy.
In summation, with these data, we provide a detailed transcriptional landscape of TNBC and normal breast that we believe will lead to a better understanding of this complex disease.
|
263 |
USING SYSTEMS BIOLOGY APPROACHES TO UNDERSTAND THE TRANSCRIPTIONAL REGULATION UNDERLYING PLANT DEFENSE AND GROWTH
Liang Tang (14226836) 06 December 2022
<p>Complex plant traits are controlled by multiple layers of dynamic and complicated gene networks regulated at different levels. To better inform crop breeding for desired traits, a comprehensive and fundamental understanding of their genetic basis is much needed. With the rapid development of <em>omics</em> platforms and next generation sequencing technology, we now have large-scale data from the genome, epigenome, transcriptome, metabolome, and others for crop plants. Integration of these multiple <em>omics</em> data with computational approaches led to the establishment of a novel science known as systems biology. The research described in this thesis used systems biology approaches to dissect complex crop traits such as the disease response of tomato (Chapter 2 and Chapter 3) and the heterosis of nitrogen use efficiency in maize (Chapter 4).</p>
<p>Plant disease response is an elaborate, multilayered complex trait involving several lines of defense signaling. In the past decades, progress in molecular analyses of the plant immune system has revealed key elements of a complex response network in Arabidopsis, a model species. Histone modifications, a type of epigenetic regulation, have emerged as key modulators of defense responses, but our understanding of the role of histone-modifying enzymes in this process is still in its infancy. Here, we describe the immune function of two histone methyltransferases, SDG33 and SDG34, in tomato. We found that the single mutants <em>sdg33</em> and <em>sdg34</em> showed increased susceptibility to the hemibiotrophic bacterial pathogen <em>Pseudomonas syringae</em>, whereas the double mutant <em>sdg33sdg34</em> was comparable to wild type. Using RNA-seq and histone ChIP-seq approaches, we investigated the possible underlying mechanisms and found that the expression of a set of immune-related genes is misregulated by <em>P. syringae</em> in the single mutants but not in the double mutant. Integrating epigenomic data, we found that the misexpression of those SDG33/SDG34-dependent immune-response genes was associated with altered histone methylation status in the single mutants. Intriguingly, the double mutant also showed altered histone methylation but unaffected gene expression, suggesting a compensating regulatory mechanism at play. The function of SDG33 and SDG34 in the immune response seems to be pathogen-specific: when treated with the necrotrophic fungal pathogen <em>Botrytis cinerea</em>, the double mutants exhibited enhanced resistance whereas the single mutants showed no altered response. Network analysis found that most of the regulatory genes affected by <em>B. cinerea</em> in an SDG33/SDG34-dependent manner have been implicated in biotic stress response, such as <em>ERF4, TOPLESS, PUB23 </em>and<em> RCD1</em>. Comparing the immune response of the double mutant to <em>P. syringae</em> and <em>B. cinerea</em>, we found that disease-related genes are misregulated only under <em>B. cinerea</em> treatment and not under <em>P. syringae</em> treatment, which could explain why the double mutants show enhanced resistance to <em>B. cinerea</em> but not to <em>P. syringae</em>. In summary, we found that the histone methyltransferases SDG33 and SDG34 have different functions in the immune response against <em>P. syringae</em> and <em>B. cinerea</em>, which might be directly or indirectly related to the histone methylation level and the expression of downstream immune-related genes.</p>
<p>In addition to biotic stress, another complex trait studied in this thesis is the heterosis of nitrogen use efficiency (NUE) in maize. NUE is associated with multiple physiological processes, including N sensing, uptake, assimilation, transport, and storage. Heterosis refers to a phenomenon in which the progeny generated by crossing two different cultivars of the same species exhibit fitness superior to that of the inbred parents. Even though heterosis has been exploited to improve complex traits including NUE, the underlying molecular mechanisms are not completely understood. Here, we analyzed N-responsive transcriptomes and physiological traits of a panel of six maize hybrids and their corresponding inbreds grown in the field at two different N levels. We observed diverse levels of trait heterosis that depend on the N conditions and organ types. We discovered a dramatic shift in the pattern of beyond-parental-range gene expression in hybrids in response to varying N levels. Through integrative analyses, we identified a set of genes whose expression heterosis is quantitatively correlated with trait heterosis. These genes are involved in response to stimulus, photosynthesis, and N metabolism, and likely mediate the heterosis phenotype of N-use and growth traits in maize. In summary, our integrated analysis provided insights into the mechanistic basis of the heterosis of NUE.</p>
<p>Together, applying systems and functional genomics approaches to investigate important agricultural traits could lead to a comprehensive understanding of plant complex traits to inform future engineering and breeding for better crops.</p>
|
264 |
Molecular Evolution of Odonata Opsins, Odonata Phylogenomics and Detection of False Positive Sequence Homology Using Machine Learning
Suvorov, Anton 01 March 2018
My dissertation comprises three related topics of evolutionary and computational biology, which correspond to the three Chapters. Chapter 1 focuses on tempo and mode of evolution in visual genes, namely opsins, via duplication events and subsequent molecular adaptation in Odonata (dragonflies and damselflies). Gene duplication plays a central role in adaptation to novel environments by providing new genetic material for functional divergence and evolution of biological complexity. Odonata have the largest opsin repertoire of any insect currently known. In particular our results suggest that both the blue sensitive (BS) and long-wave sensitive (LWS) opsin classes were subjected to strong positive selection that greatly weakens after multiple duplication events, a pattern that is consistent with the permanent heterozygote model. Due to the immense interspecific variation and duplicability potential of opsin genes among odonates, they represent a unique model system to test hypotheses regarding opsin gene duplication and diversification at the molecular level. Chapter 2 primarily focuses on reconstruction of the phylogenetic backbone of Odonata using RNA-seq data. In order to reconstruct the evolutionary history of Odonata, we performed comprehensive phylotranscriptomic analyses of 83 species covering 75% of all extant odonate families. Using maximum likelihood, Bayesian, coalescent-based and alignment free tree inference frameworks we were able to test, refine and resolve previously controversial relationships within the order. In particular, we confirmed the monophyly of Zygoptera, recovered Gomphidae and Petaluridae as sister groups with high confidence and identified Calopterygoidea as monophyletic. Fossil calibration coupled with diversification analyses provided insight into key events that influenced the evolution of Odonata. 
Specifically, we determined that there was a possible mass extinction of ancient odonate diversity during the P-Tr crisis and that a single odonate lineage persisted following this extinction event. Lastly, Chapter 3 focuses on the identification of erroneously assigned sequence homology using the intelligent agents of machine learning techniques. Accurate detection of homologous relationships among biological sequences (DNA or amino acid) across organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. We developed biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further used in the context of guided experimentation to verify false positive outcomes.
|
265 |
Implication du récepteur nucléaire orphelin Nur77 (Nr4a1) dans les effets des antipsychotiques par une approche de transcriptomique chez des rats déficients en Nur77
Majeur, Simon 11 1900
Although antipsychotic drugs have been used for several decades, their precise mechanism of action, beyond their interaction with dopaminergic and serotonergic receptors, remains poorly understood. Nur77 (Nr4a1 or NGFI-B) is a transcription factor of the nuclear receptor family associated with the effects of antipsychotics. That said, the mechanism of action of Nur77 is also poorly understood. To better understand the elements involved in antipsychotic action and Nur77 activity, we compared transcript levels in the striatum following haloperidol treatment in wild-type and Nur77-deficient rats using high-throughput sequencing (RNAseq) and a bioinformatic analysis. Haloperidol and Nur77 modulated important groups of genes associated with dopamine receptor signaling and the glutamatergic synapse. The analysis revealed modulation of key genes related to G protein signaling. Among the transcripts significantly modulated in haloperidol-treated rats and in Nur77-deficient rats, dual specificity phosphatase 5 (Dusp5) represents a new candidate of interest. Indeed, we confirmed that Dusp5 mRNA and protein levels in the striatum are associated with abnormal involuntary movements (dyskinesia) in a non-human primate model chronically treated with haloperidol. This transcriptomic analysis demonstrated rapid and substantial haloperidol-induced alterations of elements involved in G protein signaling, and identified, for the first time, Nur77-dependent expression of Dusp5 as a new component related to tardive dyskinesia. / Despite antipsychotic drugs being used for several decades, their precise mechanism of action remains elusive. 
Nur77 (Nr4a1 or NGFI-B) is a transcription factor of the nuclear receptor family associated with antipsychotic drug effects. However, the mechanism of action of Nur77 is also not well understood. To better understand the signaling components implicated with antipsychotic drug use and Nur77 activity, we compared striatal gene transcripts following haloperidol in wild-type and Nur77-deficient rats using Next Generation RNA Sequencing (RNAseq) and a bioinformatics analysis. Haloperidol and Nur77 modulated important subsets of striatal genes associated with dopamine receptor signaling and glutamate synapses. The analysis revealed modulations of key components of G protein signaling that are consistent with a rapid adaptation of striatal cells that may partially explain long-term haloperidol-induced dopamine D2 receptor upregulation. Amongst significantly modulated transcripts in rats treated with haloperidol and rats deficient in Nur77, dual specificity phosphatase 5 (Dusp5) represents a new and very interesting candidate. Indeed, we confirmed that striatal Dusp5 mRNA and protein levels were associated with abnormal involuntary movements (dyskinesia) in non-human primates chronically exposed to haloperidol. This transcriptomic analysis showed important haloperidol-induced G protein-coupled receptor signaling alterations that may support a regulatory role of Nur77 in dopamine D2 receptor signaling pathways and identified, for the first time, a putative Nur77-dependent expression of Dusp5 as a new signaling component for antipsychotic drug-induced tardive dyskinesia.
|
266 |
Towards a Human Genomic Coevolution Network
Savel, Daniel M. 04 June 2018
No description available.
|
267 |
Pipeline for Next Generation Sequencing data of phage displayed libraries to support affinity ligand discovery
Schleimann-Jensen, Ella January 2022
Affinity ligands are important molecules used in affinity chromatography to purify significant substances from complex mixtures. Finding affinity ligands specific to important target molecules can be a challenging process. Cytiva uses the powerful phage display technique to find promising new affinity ligands. Phage display is run in several enrichment cycles. When developing new affinity ligands, a protein scaffold library with a diversity of up to 10^10 to 10^11 different protein scaffold variants is run through the enrichment cycles. The output of the phage display rounds is screened for target molecule binding and then sequenced, usually with one of the conventional screening methods, ELISA or Biacore, followed by Sanger sequencing. However, the throughput of these analyses is very low, often with only a few hundred screened clones. Next Generation Sequencing (NGS), which generates millions of sequences from each phage display round, has therefore become an increasingly popular screening method for phage display libraries. This creates a need for a robust data analysis pipeline to interpret the large amounts of data. In this project, a pipeline for the analysis of NGS data from phage displayed libraries has been developed at Cytiva. Cytiva uses NGS as one of its screening methods for phage displayed protein libraries because of its high throughput compared to the conventional screening methods. The purpose is to find new affinity ligands for the purification of essential substances used in drugs. The pipeline has been created in the programming language R and consists of several analyses covering the most important steps needed to find promising results in the NGS data. With the developed pipeline the user can analyze the data at both the DNA and the protein sequence level, obtain a per-position residue breakdown, and filter the data based on specific amino acids and positions. 
This gives a robust and thorough analysis which can lead to promising results that can be used in the development of novel affinity ligands for future purification products.
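The per-position residue breakdown and the position-based filtering mentioned above can be sketched as follows. This is an illustration in Python rather than the R pipeline the abstract describes, and it assumes equal-length protein sequences and 0-based positions; the function names are hypothetical and not taken from the thesis.

```python
from collections import Counter
from typing import Dict, List


def residue_breakdown(sequences: List[str]) -> List[Dict[str, float]]:
    """For each position, return the fraction of sequences carrying each residue.

    Assumes all sequences have the same length (for example, variants of one
    scaffold); raises ValueError otherwise.
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    total = len(sequences)
    # zip(*sequences) walks the alignment column by column.
    return [
        {residue: count / total for residue, count in Counter(column).items()}
        for column in zip(*sequences)
    ]


def filter_by_residue(sequences: List[str], position: int, residue: str) -> List[str]:
    """Keep only sequences with the given residue at the given 0-based position."""
    return [seq for seq in sequences if seq[position] == residue]


# Usage with toy data: four three-residue variants.
variants = ["ACD", "ACE", "GCD", "ACD"]
breakdown = residue_breakdown(variants)
with_d_at_last = filter_by_residue(variants, 2, "D")
```

Comparing the breakdown between successive enrichment rounds is one way such a table could be used to spot positions where a residue becomes enriched, although the abstract does not say how the pipeline presents this.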
|
268 |
Droplet-Based Microfluidics for High-Throughput Single-Cell Omics Profiling
Zhang, Qiang 06 September 2022
Droplet-based microfluidics is a powerful tool permitting massive-scale single-cell analysis in pico-/nanoliter water-in-oil droplets. It has been integrated into various library preparation techniques to accomplish high-throughput scRNA-seq, scDNA-seq, scATAC-seq, scChIP-seq, as well as scMulti-omics-seq. These advanced technologies have been providing unique and novel insights into both normal differentiation and disease development at the single-cell level. In this thesis, we develop four new droplet-based tools for single-cell omics profiling. First, Drop-BS is the first droplet-based platform to construct single-cell bisulfite sequencing libraries for DNA methylome profiling and allows production of BS libraries from 2,000-10,000 single cells within 2 days. We applied the technology to separately profile mixed cell lines, mouse brain tissues, and human brain tissues to reveal cell type heterogeneity. Second, the new Drop-ChIP platform requires only two steps of droplet generation to carry out multiple reaction steps in droplets, such as single-cell lysis, chromatin fragmentation, ChIP, and barcoding. Third, we aim to establish a droplet-based platform for high-throughput full-length RNA-seq (Drop-full-seq), which neither current tube-based nor current droplet-based methods can achieve. Last, we constructed an in-house droplet-based tool to assist single-cell ATAC-seq library preparation (Drop-ATAC), which provides a low-cost and facile protocol for conducting scATAC-seq in laboratories without the expensive instrument. / Doctor of Philosophy / Microfluidics is a collection of techniques for manipulating fluids at the micrometer scale. One of these techniques is called "droplet-based microfluidics". It can manipulate (i.e., generate, merge, sort, split, etc.) pico-/nanoliter water-in-oil droplets. First, since the water phase is separated by the continuous oil phase, these droplets are discrete and individual reactors. 
Second, droplet-based microfluidics can achieve highly parallel manipulation of thousands to millions of droplets. These two advantages make droplet-based microfluidics an ideal tool for single-cell assays. Over the past 10 years, various droplet-based platforms have been developed to study the single-cell transcriptome, genome, epigenome, and multi-ome. To expand the set of droplet-based tools for single-cell analysis, we aim to develop four novel platforms in this thesis. First, Drop-BS, by integrating droplet generation and droplet fusion techniques, can achieve high-throughput single-cell bisulfite sequencing library preparation. It can generate 10,000 single-cell BS libraries within 2 days, which is difficult to achieve with conventional library preparation in tubes or microwells. Second, we developed a novel and facile Drop-ChIP platform to prepare single-cell ChIP-seq libraries. It is easy to operate since it requires only two steps of droplet generation. It also generates higher-quality data than previous work. In addition, we are working on the development and characterization of the other two droplet-based tools to achieve full-length single-cell RNA-seq and single-cell ATAC-seq.
|
269 |
Low-Input Multi-Omic Studies of Brain Neuroscience Involved in Mental Diseases
Zhu, Bohan 13 September 2022
Psychiatric disorders are believed to result from the combination of genetic predisposition and many environmental triggers. While a large number of disease-associated genetic variations have been recognized by previous genome-wide association studies (GWAS), the role of epigenetic mechanisms that mediate the effects of environmental factors on CNS gene activity in the etiology of most mental illnesses is still largely unclear. A growing body of evidence suggests that the abnormalities (changes in gene expression, formation of neural circuits, and behavior) involved in most psychiatric syndromes are preserved by epigenetic modifications identified in several specific brain regions. In this thesis, we developed the second generation of one of our microfluidic technologies (MOWChIP-seq) and used it to profile genome-wide histone modifications in three mental illness-related biological studies: the effect of psychedelics in mice, schizophrenia, and the effect of maternal immune activation in mouse offspring. The second generation of MOWChIP-seq was designed to generate histone modification profiles from as few as 100 cells per assay with a throughput as high as eight assays in each run. We then applied the new MOWChIP-seq and SMART-seq2 to profile the histone modification H3K27ac and the transcriptome, respectively, using NeuN+ neuronal nuclei from the mouse frontal cortex after a single dose of a psychedelic. The epigenomic and transcriptomic changes induced by 2,5-Dimethoxy-4-iodoamphetamine (DOI), a psychedelic, in mouse neuronal nuclei at various time points suggest that the long-lasting effects of the psychedelic are more closely related to epigenomic alterations than to changes in transcriptomic patterns. Next, we comprehensively characterized epigenomic and transcriptomic features from the frontal cortex of 29 individuals with schizophrenia and 29 controls individually matched by gender and age. 
We found that schizophrenia subjects exhibited thousands of neuronal vs. glial epigenetic differences at regions that included several susceptibility genetic loci, such as NRXN1, RGS4 and GRIN3A. Finally, we investigated the epigenetic and transcriptomic alterations induced by maternal immune activation (MIA) in the frontal cortex of mouse offspring. Pregnant mice were injected with influenza virus at GD 9.5, and the frontal cortex of the pups was examined at 10 weeks of age. The results offered some insights into the contribution of MIA to the etiology of mental disorders such as schizophrenia and autism. / Doctor of Philosophy / While this field is still in its early stages, epigenetic studies of mental disorders promise to expand our understanding of how environmental stimuli, interacting with genetic factors, contribute to the etiology of various psychiatric syndromes, such as major depression and schizophrenia. Previous clinical trials suggested that psychedelics may represent a promising long-lasting treatment for patients with depression and other psychiatric conditions. This research presented the therapeutic potential of psychedelic compounds for treating major depression and demonstrated the capability of psychedelics to increase dendritic density and stimulate synapse formation. However, the molecular mechanism mediating the clinical effectiveness of psychedelics remains largely unexplored. Our study revealed that epigenomic-driven changes in synaptic plasticity sustain psychedelics' long-lasting antidepressant action. Another serious mental illness is schizophrenia, which can affect how an individual feels, thinks, and behaves. Like most other mental disorders, schizophrenia results from a combination of genetic and environmental causes. Epigenetic marks allow a dynamic impact of environmental factors, including antipsychotic medications, on the access to genes and regulatory elements. 
Despite this, no study so far has profiled cell-type-specific genome-wide histone modifications in postmortem brain samples from schizophrenia subjects or the effect of antipsychotic treatment on such epigenetic marks. Here we show the first comprehensive epigenomic characterization of the frontal cortex of 29 individuals with schizophrenia and 29 matched controls. The process of brain development is surprisingly sensitive to many environmental insults. Epidemiological studies have recognized maternal immune activation as a risk factor that may change the normal developmental trajectory of the fetal brain and increase the offspring's lifetime odds of developing a range of psychiatric disorders, including schizophrenia and autism. Given the prevalence of the coronavirus, uncovering the molecular mechanism underlying these phenotypic alterations has become more urgent than before, for both prevention and treatment.
|
270 |
Improved Error Correction of NGS Data
Alic, Andrei Stefan 15 July 2016
Thesis by compendium of publications / [EN] The work done for this doctoral thesis focuses on error correction of Next Generation Sequencing (NGS) data in the context of High Performance Computing (HPC).
Due to the reduction in sequencing cost, the increasing output of the sequencers and the advancements in the biological and medical sciences, the amount of NGS data has increased tremendously.
Humans alone cannot keep pace with this explosion of information; computers must therefore help them handle the deluge of data generated by the sequencing machines.
Since NGS is no longer just a research topic (it is used in clinical routine to detect cancer mutations, for instance), requirements for performance and accuracy are more stringent.
For sequencing to be useful outside research, the analysis software must work accurately and fast.
This is where HPC comes into play.
NGS processing tools should leverage the full potential of multi-core and even distributed computing, as those platforms are extensively available.
Moreover, as the performance of the individual core has hit a barrier, current computing trends focus on adding more cores and explicitly splitting the computation to take advantage of them.
This thesis starts with a deep analysis of all these problems in a general and comprehensive way (to reach out to a very wide audience), in the form of an exhaustive and objective review of the NGS error correction field.
We dedicate a chapter to this topic to introduce the reader gradually and gently into the world of sequencing.
It presents real problems and applications of NGS that demonstrate the impact this technology has on science.
The review leads to the following conclusions: the need to understand the specificities of NGS data samples (given the wide variety of technologies and features) and the need for flexible, efficient and accurate error correction tools as a preliminary step of any NGS postprocessing.
As a result of the explosion of NGS data, we introduce MuffinInfo.
It is a piece of software capable of extracting information from the raw data produced by the sequencer to help the user understand the data.
MuffinInfo uses HTML5, therefore it runs in almost any software and hardware environment.
It supports custom statistics to mould itself to specific requirements.
MuffinInfo can reload the results of a run which are stored in JSON format for easier integration with third party applications.
Finally, our application uses threads to perform the calculations, to load the data from the disk and to handle the UI.
Continuing our research, and in response to the single-core performance limitation, we leverage the power of multi-core computers to develop a new error correction tool.
The error correction of the NGS data is normally the first step of any analysis targeting NGS.
As we conclude from the review performed within the frame of this thesis, many projects in different real-life applications have opted for this step before further analysis.
In this sense, we propose MuffinEC, a corrector that supports multiple technologies (Illumina, Roche 454, Ion Torrent and, experimentally, PacBio) and handles any type of error (mismatches, deletions, insertions and unknown values).
It surpasses other similar software by providing higher accuracy (demonstrated by three types of tests) and using fewer computational resources.
It follows a multi-step approach that starts by grouping all the reads using a k-mer based metric.
Next, it employs the powerful Smith-Waterman algorithm to refine the groups and generate Multiple Sequence Alignments (MSAs).
These MSAs are corrected by taking each column and looking for the correct base, determined by a user-adjustable percentage.
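The column-wise correction step can be illustrated with a minimal sketch. It is not MuffinEC's implementation: it assumes the MSA is given as equal-length strings, that `N` marks an unknown value, that gaps (`-`) are treated like any other symbol, and that the "user-adjustable percentage" is a majority threshold (`min_fraction`, a hypothetical parameter name) applied to the informative symbols in each column.

```python
from collections import Counter
from typing import List

UNKNOWN = "N"  # assumed marker for unknown values


def correct_msa(rows: List[str], min_fraction: float = 0.6) -> List[str]:
    """Replace every symbol in a column with the column's majority symbol
    when that symbol reaches `min_fraction` of the informative (non-N) entries.

    Columns without a sufficiently dominant symbol are left untouched.
    """
    if not 0.5 < min_fraction <= 1.0:
        # Above 0.5 the dominant symbol is unique, so ties cannot occur.
        raise ValueError("min_fraction must be in (0.5, 1.0]")
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all MSA rows must have the same length")

    corrected = [list(row) for row in rows]
    for col in range(width):
        counts = Counter(row[col] for row in rows if row[col] != UNKNOWN)
        if not counts:
            continue  # column holds only unknown values; nothing to infer
        symbol, count = counts.most_common(1)[0]
        if count / sum(counts.values()) >= min_fraction:
            for row in corrected:
                row[col] = symbol
    return ["".join(row) for row in corrected]


# Usage with a toy alignment: one mismatch (row 3) and one unknown value (row 4).
alignment = ["ACGT", "ACGT", "ATGT", "ACNT"]
lenient = correct_msa(alignment, min_fraction=0.6)
strict = correct_msa(alignment, min_fraction=0.8)
```

With the lenient threshold the mismatch in the third read is overwritten because `C` covers 75% of that column, while the strict threshold leaves it in place; the unknown value is resolved in both cases because every informative entry in its column agrees.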
This manuscript is structured in chapters based on material that has previously been published in prestigious journals indexed in the Journal Citation Reports (in prominent positions) and presented at relevant conferences. / [ES] The work carried out in this doctoral thesis focuses on correcting errors in data from NGS techniques using compute-intensive techniques.
Owing to falling costs and improved sequencer performance, the amount of available NGS data has grown considerably. Computers have become indispensable for analysing these samples and coping with the flood of information these techniques generate. The use of NGS extends beyond research, with numerous examples of clinical and agronomic use, which creates new requirements for processing time and the reliability of results. To maximise their clinical applicability, NGS data processing techniques must become faster and produce more accurate data. It is in this context that compute-intensive techniques play a relevant role. Today it is common to have computers with several processing cores, and even to use multiple computers through distributed parallel computing. Current trends towards architectures with a larger number of cores show that this is a relevant approach.
This thesis begins with an analysis of the fundamental problems of NGS data processing, presented in general terms and adapted for a broad audience, through an exhaustive review of the state of the art in NGS data correction. This review gradually introduces the reader to massive sequencing techniques, presenting real problems and applications of NGS techniques and highlighting the impact of this technology on science. Two main ideas emerge from this study: the need to analyse the characteristics of NGS data properly, given the enormous intrinsic variety among the different NGS techniques; and the need for a versatile, efficient and accurate error correction tool.
In the context of data analysis, the thesis presents MuffinInfo. MuffinInfo is a software application implemented in HTML5. It extracts relevant information from raw NGS data to support understanding of their characteristics and the application of error correction techniques, and it can be extended with functions that implement user-defined statistics. MuffinInfo stores the results of the processing in JSON files. Because it uses HTML5, MuffinInfo can run in almost any hardware and software environment. The tool is implemented to take advantage of multiple execution threads for managing the interface.
The second conclusion of the state-of-the-art analysis leads to the opportunity of extensively applying high-performance computing techniques to error correction in order to develop a tool that supports multiple technologies (Illumina, Roche 454, Ion Torrent and, experimentally, PacBio). The proposed tool, MuffinEC, handles different types of errors (substitutions, indels and unknown values). MuffinEC outperforms the results obtained by existing tools in this field. It offers a better correction rate in much less time and using fewer resources, which also makes it easier to apply to larger samples on conventional computers. MuffinEC uses a multi-stage approach. First, it groups all the sequences using the k-mer metric. Second, it refines the groups by Smith-Waterman alignment, generating contigs. These contigs result from column-wise correction based on the individual frequency of each base.
The thesis is structured in chapters whose basis has previously been published in journals indexed in prominent… / [CA] The work carried out in this doctoral thesis focuses on correcting errors in data from NGS techniques using compute-intensive techniques.
Owing to falling costs and improved sequencer performance, the amount of available NGS data has grown considerably. Computers have become indispensable for analysing these samples and coping with the flood of information these techniques generate. The use of NGS extends beyond research, with numerous examples of clinical and agronomic use, which creates new requirements for processing time and the reliability of results. To maximise their clinical applicability, NGS data processing techniques must become faster and produce more accurate data. It is in this context that compute-intensive techniques play a relevant role. Today it is common to have computers with several processing cores, and even to use multiple computers through distributed parallel computing. Current trends towards architectures with a larger number of cores show that this is a relevant approach.
This thesis begins with an analysis of the fundamental problems of NGS data processing, presented in general terms and adapted for a broad audience, through an exhaustive review of the state of the art in NGS data correction. This review gradually introduces the reader to massive sequencing techniques, presenting real problems and applications of NGS techniques and highlighting the impact of this technology on science. Two main ideas emerge from this study: the need to analyse the characteristics of NGS data properly, given the enormous intrinsic variety among the different NGS techniques; and the need for a versatile, efficient and accurate error correction tool.
In the context of data analysis, the thesis presents MuffinInfo. MuffinInfo is a software application implemented in HTML5. It extracts relevant information from raw NGS data to support understanding of their characteristics and the application of error correction techniques, and it can be extended with functions that implement user-defined statistics. MuffinInfo stores the results of the processing in JSON files. Because it uses HTML5, MuffinInfo can run in almost any hardware and software environment. The tool is implemented to take advantage of multiple execution threads for managing the interface.
The second conclusion of the state-of-the-art analysis leads to the opportunity of extensively applying high-performance computing techniques to error correction in order to develop a tool that supports multiple technologies (Illumina, Roche 454, Ion Torrent and, experimentally, PacBio). The proposed tool, MuffinEC, handles different types of errors (substitutions, indels and unknown values). MuffinEC outperforms the results obtained by existing tools in this field. It offers a better correction rate in much less time and using fewer resources, which also makes it easier to apply to larger samples on conventional computers. MuffinEC uses a multi-stage approach. First, it groups all the sequences using the k-mer metric. Second, it refines the groups by Smith-Waterman alignment, generating contigs. These contigs result from column-wise correction based on the individual frequency of each base.
The thesis is structured in chapters whose basis has previously been published in journals indexed in prominent positions of the Journal of Citation Repor / Alic, AS. (2016). Improved Error Correction of NGS Data [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/67630 / Compendium