11 |
Bayesian Inference for Genomic Data Analysis
Ogundijo, Oyetunji Enoch, January 2019
High-throughput genomic data contain a vast amount of information shaped by the complex biological processes in the cell. Appropriate mathematical modeling frameworks are therefore required to understand the data and the data-generating processes. This dissertation focuses on the formulation of mathematical models and the description of appropriate computational algorithms to obtain insights from genomic data.
Specifically, the characterization of intra-tumor heterogeneity is studied. Based on the total number of allele copies at the genomic locations in the tumor subclones, the problem is viewed from two perspectives: with and without the copy-neutrality assumption. Under copy-neutrality, it is assumed that the genome contains mutational variability and that any of the three possible genotypes may be present at each genomic location. The genotypes of all the genomic locations in the tumor subclones are therefore modeled by a ternary matrix. In the second case, in addition to mutational variability, it is assumed that the genomic locations may be affected by structural variability such as copy number variation (CNV). The genotypes are then modeled with a pair of (Q + 1)-ary matrices. Using the categorical Indian buffet process (cIBP), a state-space modeling framework is employed to describe the two processes, and sequential Monte Carlo (SMC) methods for dynamic models are applied to perform inference on important model parameters.
Moreover, the problem of estimating a gene regulatory network (GRN) from measurements with missing values is presented. Gene expression time series data may lack all expression values at a single time point or at a set of consecutive time points, yet complete data are often needed to make inferences about the underlying GRN. A dynamic stochastic model is used to describe the evolution of gene expression in the presence of missing measurements, and point-based Gaussian approximation (PBGA) filters with one-step or two-step missing measurements are applied for the inference. Finally, the problem of deconvolving gene expression data from complex heterogeneous biological samples is examined, where the observed data are a mixture of different cell types. A statistical description of the problem is used and the SMC method for static models is applied to estimate the cell-type-specific expressions and the cell-type proportions in the heterogeneous samples.
|
12 |
Intégration de données génomiques (mutations, gènes majeurs, marqueurs SNP, haplotypes) dans les modèles d'évaluations génétiques des chèvres laitières pour améliorer l'efficacité de la sélection / Integration of genomic data (QTL, major gene, SNPs, haplotypes) in genomic evaluation models to improve efficiency of selection in French dairy goats
Teissier, Marc, 05 February 2019
Following Céline Carillier's PhD work (2012-2015), genomic evaluations based on ssGBLUP were implemented in 2018 for the Alpine and Saanen dairy goat breeds. The objective is to improve the accuracy of genomic evaluations in order to maximize genetic gain for traits of interest. In our first study, we looked at the effect of the size of the reference population (limited for these breeds) on the accuracy of genomic evaluations. Increasing the training population did not systematically increase accuracy. ssGBLUP is biased and tends to overestimate or underestimate genomic values. To limit these biases, hyperparameters were introduced into the construction of the ssGBLUP genomic relationship matrix. An analysis of these hyperparameters showed that their choice can reduce bias while having a limited impact on accuracy. For the Alpine and Saanen breeds, the biases are close to 1 when one hyperparameter lies between 0.1 and 0.3 and another between 3 and 4. The remaining hyperparameter has little effect on accuracy and bias, and its default value (0.95) seems to be optimal. In the second part, we focused on the integration of causal mutations or QTLs into genomic evaluation models to improve accuracy. Causal mutations and QTLs have been detected in the goat breeds, such as the αs1-casein gene for protein content or DGAT1 for fat content. Other studies have identified a QTL, located on chromosome 19, in the Saanen breed. It was detected for several traits: milk, fat and protein yields, udder floor position and rear udder attachment. Using αs1-casein or DGAT1 genotypes in genomic evaluation models (gene content) did not improve evaluation accuracy. Gene content is a multi-trait method in which the "gene content" is a second trait correlated with the trait under selection.
For protein and fat content, accuracies with gene content were between 0% and 11% lower than ssGBLUP accuracies. Weighting the SNPs appropriately within ssGBLUP (an approach called Weighted ssGBLUP and denoted WssGBLUP) improved evaluation accuracy. This method assigns weights to SNPs based on their association with the traits. These weights are integrated into the construction of the genomic relationship matrix. Gains of up to +5% for the Alpine breed and +14% for the Saanen breed were observed compared to ssGBLUP. WssGBLUP is better suited to the Saanen breed because QTLs are present for the majority of traits. For the Alpine breed, WssGBLUP was of interest for protein content. ssGBLUP remained the best method when the trait had a polygenic genetic architecture. Finally, in the last study, we focused on haplotype-based genomic evaluation models. Haplotypes were constructed either by grouping several consecutive SNPs or by using the linkage disequilibrium (LD) between SNPs. The haplotypes are then used to build a haplotypic relationship matrix, or converted to pseudo-SNPs to build a genomic relationship matrix. In the Alpine breed, the accuracy of the haplotypic (or pseudo-SNP) ssGBLUP changed by between -1% and +19% compared to an ssGBLUP based on SNP information. In the Saanen breed, the accuracy changed by between -3% and +6% compared to ssGBLUP. We also applied the WssGBLUP approach using pseudo-SNPs. In the Saanen breed, an improvement in accuracy of up to +16% compared to ssGBLUP was observed. The highest gains (above +10%) were obtained for traits with an identified QTL (milk, fat and protein yields, protein content, udder floor position and rear udder attachment). In the Alpine breed, accuracy changes between -8% and +5% compared to ssGBLUP were observed depending on the trait, except for fat yield and fat content, where the gains reached +19%.
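The abstract says that SNP weights enter the construction of the genomic relationship matrix but does not give the formula. As a minimal sketch, assuming the commonly used VanRaden-style form G = Z D Z' / (2 Σ p(1 − p)) with D a diagonal matrix of SNP weights, the construction could look as follows. The function name `weighted_grm`, the 0/1/2 genotype coding and the rescaling of weights to mean 1 are illustrative assumptions, not details taken from the thesis.

```python
def weighted_grm(genotypes, weights):
    """Weighted genomic relationship matrix G = Z D Z' / (2 * sum p(1-p)).

    genotypes: one row per animal, each entry 0, 1 or 2 copies of an allele.
    weights:   one non-negative weight per SNP (rescaled here to mean 1).
    Assumes at least one SNP is polymorphic, otherwise the scale is zero.
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    # Allele frequency of each SNP, estimated from the genotyped animals.
    freqs = [sum(row[j] for row in genotypes) / (2 * n_animals)
             for j in range(n_snps)]
    # Centre each genotype by twice the allele frequency (the Z matrix).
    z = [[row[j] - 2 * freqs[j] for j in range(n_snps)] for row in genotypes]
    # Rescale weights so that equal weights reproduce the unweighted matrix.
    mean_w = sum(weights) / n_snps
    d = [w / mean_w for w in weights]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    return [[sum(z[a][j] * d[j] * z[b][j] for j in range(n_snps)) / scale
             for b in range(n_animals)]
            for a in range(n_animals)]


# Hypothetical genotypes for four animals at three SNPs.
example_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
unweighted = weighted_grm(example_genotypes, [1, 1, 1])
upweighted_first_snp = weighted_grm(example_genotypes, [3, 0, 0])
```

With equal weights the sketch reduces to the unweighted matrix, which is one way to see why a weighted approach brings little when the trait is polygenic and no SNP stands out.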
|
13 |
Topics in Signal Processing: applications in genomics and genetics
Elmas, Abdulkadir, January 2016
The information in genomic or genetic data is influenced by various complex processes, and appropriate mathematical modeling is required to study the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics and on the development of algorithms that solve them efficiently. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms and a sequential Monte Carlo sampling method is employed to estimate the unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a related context, to detect redundant information in single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced. An information-theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure, reflecting a biological principle of population genetics, namely linkage disequilibrium.
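The multi-locus measure itself is not specified in the abstract. As a simplified illustration only, the pairwise mutual information between two SNP sites can be computed from observed alleles as below; the function name `mutual_information` and the 0/1 allele coding are assumptions made for this sketch, and the dissertation's measure extends over more than two loci.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equally long allele sequences.

    x[i] and y[i] are the alleles observed at two SNP sites in sample i.
    """
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum p(a, b) * log2(p(a, b) / (p(a) * p(b))) over observed allele pairs.
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


# Two sites in complete linkage disequilibrium share all their information,
# so either one could serve as a tag for the other.
linked = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# Two sites whose alleles co-occur independently share none.
unlinked = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

High mutual information between sites signals redundancy, which is what makes a small set of tag SNPs sufficient to represent the rest.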
|
14 |
Normalization and analysis of high-dimensional genomics data
Landfors, Mattias, January 2012
Microarray technology was introduced in the mid-1990s. The technology allowed genome-wide analysis of gene expression in a single experiment. Since its introduction, similar high-throughput methods have been developed in other fields of molecular biology. These high-throughput methods provide measurements for hundreds to millions of variables in a single experiment, and rigorous data analysis is necessary in order to answer the underlying biological questions. Data analysis is further complicated by technical variation, which is introduced into the data by the complexity of the experimental procedures. This technical variation needs to be removed in order to draw relevant biological conclusions from the data. The process of removing the technical variation is referred to as normalization or pre-processing. During the last decade a large number of normalization and data analysis methods have been proposed. In this thesis, data from two types of high-throughput methods are used to evaluate the effect pre-processing methods have on downstream analyses. In areas where problems with current methods are identified, novel normalization methods are proposed. The evaluations of known and novel methods are performed on simulated data, real data and data from a spike-in experiment produced in-house.
|
15 |
Genomic data mining for the computational prediction of small non-coding RNA genes
Tran, Thao Thanh Thi, 20 January 2009
The objective of this research is to develop a novel computational prediction algorithm for non-coding RNA (ncRNA) genes using features computable for any genomic sequence without the need for comparative analysis. Existing comparative methods require knowledge of closely related organisms in order to search for sequence and structural similarities. This approach imposes constraints on the type of ncRNAs, the organism, and the regions where the ncRNAs can be found. We have developed a novel approach for ncRNA gene prediction without the limitations of current comparative methods. Our work has established an ncRNA database required for subsequent feature and genomic analysis. Furthermore, we have identified significant features from folding-, structural-, and ensemble-based statistics for use in ncRNA prediction. We have also examined higher-order gene structures, namely operons, to discover potential insights into how ncRNAs are transcribed. The ability to identify ncRNAs automatically on a genome-wide scale makes the approach well suited to incorporation into a pipeline for large-scale genome annotation. This work will contribute to a more comprehensive annotation of ncRNA genes in microbial genomes to meet the demands of functional and regulatory genomic studies.
|
16 |
A novel framework for binning environmental genomic fragments
Yang, Bin, 杨彬, January 2010
Computer Science / Master / Master of Philosophy
|
17 |
The role of parallel computing in bioinformatics
Akhurst, Timothy John, January 2005
The need to intelligibly capture, manage and analyse the ever-increasing amount of publicly available genomic data is one of the challenges facing bioinformaticians today. Such analyses are in fact impractical using uniprocessor machines, which has led to an increasing reliance on clusters of commodity-priced computers. An existing network of cheap, commodity PCs was utilised as a single computational resource for parallel computing. The performance of the cluster was investigated using a whole genome-scanning program written in the Java programming language. The TSpaces framework, based on the Linda parallel programming model, was used to parallelise the application. Maximum speedup was achieved at between 30 and 50 processors, depending on the size of the genome being scanned. Together with this, the associated significant reductions in wall-clock time suggest that both parallel computing and Java have a significant role to play in the field of bioinformatics.
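For readers unfamiliar with the terms, speedup is the serial wall-clock time divided by the parallel wall-clock time, and efficiency is speedup divided by the number of processors. The sketch below illustrates how a maximum-speedup processor count can be read off a set of timings; the function names and all timing values are hypothetical and are not measurements from the thesis.

```python
def speedup_and_efficiency(serial_seconds, parallel_seconds, processors):
    """Return (speedup, efficiency) for one parallel run."""
    speedup = serial_seconds / parallel_seconds
    return speedup, speedup / processors


def best_processor_count(serial_seconds, timings):
    """Processor count with the highest speedup.

    timings maps a processor count to the wall-clock seconds measured with it.
    """
    return max(timings, key=lambda p: serial_seconds / timings[p])


# Hypothetical wall-clock times (seconds) for one genome scan.
hypothetical_serial = 1200.0
hypothetical_timings = {10: 130.0, 30: 52.0, 50: 48.0, 70: 55.0}
best = best_processor_count(hypothetical_serial, hypothetical_timings)
```

In timings of this shape, adding processors beyond the best count makes the run slower again, typically because coordination overhead starts to outweigh the work saved per processor.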
|
18 |
Mapeamento de dados genômicos usando escalonamento multidimensional / Representation of genomics data with multidimensional scaling
Espezúa Llerena, Soledad, 04 June 2008
This work explores several multidimensional scaling (MDS) techniques in order to study their applicability to mapping genomic data obtained with the RFLP-PCR technique. The mapping is done in a low-dimensional space (2D or 3D) so as to exploit the human capacity for visual analysis and interpretation. A comparative analysis of several MDS algorithms was carried out to assess their suitability for mapping genomic data. The analysis covers performance indices such as mapping precision, computational cost, and the capacity to induce good groupings. For this analysis, a software tool called "MDSExplorer" was developed, which integrates the algorithms studied and several options for comparing the algorithms and visualizing the mappings. The analysis, carried out on several datasets cited in the literature, suggests that the LANDMARK algorithm has the lowest computational time, a mapping precision similar to that of the other algorithms, and a good capacity to preserve the existing structures in the data. Finally, MDSExplorer was used to map a real genomic dataset: the RFLP-PCR images of a Brazilian collection of nitrogen-fixing bacterial strains belonging to the genus Bradyrhizobium (known for their capability to transform atmospheric nitrogen into compounds useful to the host plants), with the objective of helping the specialist visually infer a taxonomy of these strains. The dimensionality-reduction results for this dataset suggest that the relevant information (above 60% of the accumulated variance) for the 16S, 23S and IGS regions lies in the first 5, 4 and 9 dimensions, respectively.
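The abstract does not say how mapping precision was measured. One common choice, shown here purely as an assumption, is Kruskal's stress-1 with the raw dissimilarities used as targets: the square root of the summed squared differences between target dissimilarities and embedded distances, divided by the summed squared embedded distances. The function name `stress1` and the example coordinates are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def stress1(dissimilarities, points):
    """Stress-1 of a low-dimensional embedding (0 means a perfect fit).

    dissimilarities: symmetric matrix of target dissimilarities.
    points:          embedded coordinates, one tuple per object.
    """
    num = den = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])  # Euclidean distance in the map
        num += (dissimilarities[i][j] - d) ** 2
        den += d ** 2
    return sqrt(num / den)


# Hypothetical 2D map of three objects and the dissimilarities it should match.
example_points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
example_targets = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
perfect_fit = stress1(example_targets, example_points)
```

A lower value means the 2D or 3D map distorts the original dissimilarities less, which gives a single number for comparing algorithms on the same dataset.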
|
19 |
Genome-wide analyses of single cell phenotypes using cell microarrays
Narayanaswamy, Rammohan, 1978-, 29 August 2008
The past few decades have witnessed a revolution in recombinant DNA and nucleic acid sequencing technologies. More recently, however, technologies capable of massively high-throughput, genome-wide data collection, combined with computational and statistical tools for data mining, integration and modeling, have enabled the construction of predictive networks that capture cellular regulatory states, paving the way for ‘systems biology’. Consequently, protein interactions can be captured in the context of a cellular interaction network, and emergent ‘system’ properties can be identified that may not have been accessible to conventional biology. The ability to generate data from multiple, non-redundant experimental sources is one of the important facets of systems biology. Towards this end, we have established a novel platform called ‘spotted cell microarrays’ for conducting image-based genetic screens. We have subsequently used spotted cell microarrays to study multidimensional phenotypes in yeast under different regulatory states. In particular, we studied the response to mating pheromone using a cell microarray comprising the yeast non-essential deletion library and analyzed morphology changes to identify novel genes involved in mating. An important aspect of the mating response pathway is large-scale spatiotemporal change in the proteome, an aspect of proteomics that is still largely obscure. In our next study, we used an imaging screen and a computational approach to predict and validate the complement of proteins that polarize and change localization towards the mating projection tip. By adopting such hybrid approaches, we have been able to study not only proteins involved in specific pathways but also their behavior in a systemic context, leading to a broader comprehension of cell function. Lastly, we have performed a novel metabolic starvation-based screen using the GFP-tagged collection to study proteome dynamics in response to nutrient limitation, and we are currently rationalizing our observations through follow-up experiments. We believe this study has implications for evolutionarily conserved cellular mechanisms such as protein turnover, quiescence and aging. Our technique has therefore been applied to several interesting aspects of yeast cellular physiology and behavior and is now being extended to mammalian cells.
|