1 |
Bayesian Biclustering on Discrete Data: Variable Selection Methods
Guo, Lei (18 October 2013)
Biclustering is a technique for clustering rows and columns of a data matrix simultaneously. Over the past few years, we have seen its applications in biology-related fields, as well as in many data mining projects. As opposed to classical clustering methods, biclustering groups objects that are similar only on a subset of variables. Many biclustering algorithms on continuous data have emerged over the last decade. In this dissertation, we will focus on two Bayesian biclustering algorithms we developed for discrete data, more specifically categorical data and ordinal data. / Statistics
|
2 |
Caracterização da estrutura de dependência do genoma humano usando campos markovianos: estudo de populações mundiais e dados de SNPs / Characterization of the human genome dependence structure using Markov random fields: populations worldwide study and SNP data
Fernandes, Francisco José de Almeida (01 February 2016)
The identification of chromosomal regions, or dependency blocks within the human genome, that are transmitted together to offspring (haplotypes) has been a challenge and the target of several research initiatives, many of them using data from high-density SNP (Single Nucleotide Polymorphism) marker platforms. This work uses stochastic modeling with variable-range Markov random fields, on a stratified sample of different populations, to find blocks of SNPs that are independent of each other, thus structuring the genome into isolated regions of dependence. Public SNP data from different worldwide populations (HapMap project) were used, in addition to a sample of the Brazilian population. The dependence regions form windows of influence, which were used to characterize the different populations according to their ancestry; the results showed that the windows of the Brazilian population are larger on average, reflecting its recent history of admixture. An optimization of the likelihood function of the problem is also proposed to obtain the maximal consensus windows across all populations. Given a particular consensus window, a distance measure appropriate for categorical variables is adopted to evaluate its homogeneity or heterogeneity. Homogeneous windows were identified in the HLA (Human Leukocyte Antigen) region of the genome, which is associated with the immune response. The average size of these windows was larger than the average found in the rest of the chromosome, confirming the high dependence in this region, which is considered highly conserved in human evolution.
Finally, considering the distribution of SNPs among the populations in the most heterogeneous windows, Correspondence Analysis was applied to build a classifier able to determine an individual's relative ancestry proportion from each population considered; under validation, it identified the population of origin with 90% accuracy.
|
4 |
Bioinformatics for the Comparative Genomic Analysis of the Cotton (Gossypium) Polyploid Complex
Page, Justin Thomas (01 June 2015)
Understanding the composition, evolution, and function of the cotton (Gossypium) genome is complicated by the joint presence of two genomes in its nucleus (AT and DT genomes). Specifically, read-mapping (a fundamental part of next-generation sequence analysis) cannot adequately differentiate reads as belonging to one genome or the other. These two genomes were derived from progenitor A-genome and D-genome diploids involved in ancestral allopolyploidization. To better understand the allopolyploid genome, we developed PolyCat to categorize reads according to their genome of origin based on homoeo-SNPs that differentiate the two genomes. We re-sequenced the genomes of extant diploid relatives of tetraploid cotton that contain the A1 (Gossypium herbaceum), A2 (Gossypium arboreum), or D5 (Gossypium raimondii) genomes. We identified 24 million SNPs between the A-diploid and D-diploid genomes. These analyses facilitated the construction of a robust index of conserved SNPs between the A-genomes and D-genomes at all detected polymorphic loci. This index can be used by PolyCat to assign reads from an allotetraploid to their genome of origin. Continued characterization of the Gossypium genomes will further enhance our ability to manipulate fiber and agronomic production of cotton. We re-sequenced 34 allotetraploid cotton lines, representing all 7 tetraploid cotton species. The analysis of these genomes, using PolyCat and PolyDog, provides the beginnings of a HapMap-like resource for cotton species, including indices of both homoeo-SNPs and allele-SNPs. With this information, we explore the phylogenetic relationships among cotton species, including the newly characterized species G. ekmanianum and G. stephensii.
We examine both recent and ancient gene conversion, finding that recent gene conversion is extremely rare and that ancient gene conversion is far less extensive than previously believed; many previously identified conversion events are more probably due to autapomorphic SNPs in the descent of diploid relatives. To carry out these experiments, many tools for next-generation sequence analysis were developed. These tools, along with PolyCat and PolyDog, make up the tool suite BamBam.
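The read-categorization idea described in this abstract can be illustrated with a minimal sketch. It is not PolyCat's actual implementation: the function name `categorize_read`, the index layout (`position -> (A-genome base, D-genome base)`), and the category labels are all hypothetical, and the sketch assumes an ungapped alignment (no indels or clipping).

```python
from collections import Counter


def categorize_read(read_start, read_seq, snp_index, min_votes=1):
    """Assign a read to the A or D genome by voting over homoeo-SNPs.

    read_start: reference position of the first base (hypothetical, ungapped).
    snp_index:  {position: (a_base, d_base)}, an assumed index layout.
    """
    votes = Counter()
    for offset, base in enumerate(read_seq):
        alleles = snp_index.get(read_start + offset)
        if alleles is None:
            continue  # this position is not a known homoeo-SNP
        a_base, d_base = alleles
        if base == a_base:
            votes["A"] += 1
        elif base == d_base:
            votes["D"] += 1
    if votes["A"] and votes["D"]:
        return "chimeric"  # evidence for both genomes
    if votes["A"] >= min_votes:
        return "A"
    if votes["D"] >= min_votes:
        return "D"
    return "unassigned"  # no (or too little) informative evidence


index = {101: ("A", "G"), 103: ("T", "C")}
label = categorize_read(100, "CAGTC", index)  # bases at 101 and 103 are A, T
```

A real tool would also have to handle alignment gaps, base qualities, and paired reads; the sketch only shows the voting step.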
|
5 |
Incidence and Regulatory Implications of Single Nucleotide Polymorphisms among Established Ovarian Cancer Genes.
Ramdayal, Kavisha (January 2009)
Ovarian cancer research focuses on answering important questions related to the disease, determining whether new approaches are feasible to contribute towards improving current treatments or discovering new ones. This study focused on the transcriptional regulation of genes that have been implicated in ovarian cancer, based on the occurrences of single nucleotide polymorphisms (SNPs) within transcription factor binding sites (TFBSs). Through the application of several in silico tools, databases and custom programs, this research aimed to contribute toward the identification of potentially bio-medically important genes or SNPs for pre-diagnosis and subsequent treatment planning of ovarian cancer. A total of 379 candidate genes that have been experimentally associated with ovarian cancer were analyzed. This led to the identification of 121 SNPs that were found to coincide with putative TFBSs, potentially influencing a total of 57 transcription factors that would normally bind to these TFBSs. These SNPs with potential phenotypic effect were then evaluated among several population groups defined by the International HapMap Consortium, resulting in the identification of three SNPs present in five or more of the eleven population groups that have been sampled.
|
7 |
Incidence and regulatory implications of single Nucleotide polymorphisms among established ovarian cancer genes
Ramdayal, Kavisha (January 2009)
Magister Scientiae - MSc / Ovarian cancer research focuses on answering important questions related to the disease, determining whether new approaches are feasible to contribute towards improving current treatments or discovering new ones. This study focused on the transcriptional regulation of genes that have been implicated in ovarian cancer, based on the occurrences of single nucleotide polymorphisms (SNPs) within transcription factor binding sites (TFBSs). Through the application of several in silico tools, databases and custom programs, this research aimed to contribute toward the identification of potentially bio-medically important genes or SNPs for pre-diagnosis and subsequent treatment planning of ovarian cancer. A total of 379 candidate genes that have been experimentally associated with ovarian cancer were analyzed. This led to the identification of 121 SNPs that were found to coincide with putative TFBSs, potentially influencing a total of 57 transcription factors that would normally bind to these TFBSs. These SNPs with potential phenotypic effect were then evaluated among several population groups defined by the International HapMap Consortium, resulting in the identification of three SNPs present in five or more of the eleven population groups that have been sampled. / South Africa
|
8 |
La consanguinité à l'ère du génome haut-débit : estimations et applications / Consanguinity in the High-Throughput Genome Era: Estimations and Applications
Gazal, Steven (24 June 2014)
An individual is said to be inbred if his or her parents are related, so that the genealogy contains at least one inbreeding loop leading to a common ancestor. The inbreeding coefficient of an individual is defined as the probability that, at a random point on the genome, the individual has received two alleles identical by descent, both copies of a single allele present in one of the common ancestors. The inbreeding coefficient is a central parameter in genetics; it is used in population genetics to characterize population structure, and in genetic epidemiology to search for genetic factors involved in recessive diseases.
The inbreeding coefficient was traditionally estimated from genealogies, but methods have been developed to dispense with genealogies and estimate it from the information provided by genetic markers distributed across the genome. With advances in high-throughput genotyping, it is now possible to genotype an individual at hundreds of thousands of markers and to use these methods to reconstruct the regions of identity by descent on the genome and estimate a genomic inbreeding coefficient. There is currently no consensus on the best strategy for these dense marker maps, in particular for handling dependencies between alleles at different markers (linkage disequilibrium).
In this thesis, we evaluated the available methods through simulations based on real data with realistic patterns of linkage disequilibrium. We showed that a useful approach is to generate several marker submaps in which linkage disequilibrium is minimal, estimate an inbreeding coefficient on each submap with a hidden Markov model method implemented in the FEstim software, and take the median of these estimates as the estimator. The advantage of this approach is that it works with any sample size, even a single individual, since it does not require estimating linkage disequilibrium. Because the FEstim estimator is a maximum likelihood estimator, it is also possible to test whether the inbreeding coefficient differs significantly from zero and to determine the most likely relationship between the parents among a set of candidate relationships. Finally, by identifying homozygous regions shared by several inbred patients, our strategy can help identify recessive mutations involved in monogenic or multifactorial diseases.
To make the method easy to use, we developed the FSuite pipeline, which helps interpret the results of population genetics and genetic epidemiology studies, as illustrated on the HapMap III reference panel and on a case-control data set for Alzheimer's disease.
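The "median over sparse submaps" strategy described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the per-submap estimator here is a simple excess-homozygosity (method-of-moments) estimate swapped in for the hidden Markov model used by FEstim, the submaps are built by picking one random marker per fixed-width window (a hypothetical thinning rule), and all function and parameter names are invented for the example.

```python
import random
from statistics import median


def excess_homozygosity_f(genotypes, freqs):
    """Stand-in estimator (NOT the FEstim HMM): (O - E) / (n - E).

    genotypes: minor-allele counts per marker (0, 1 or 2).
    freqs:     allele frequency per marker.
    """
    n = len(genotypes)
    observed = sum(1 for g in genotypes if g in (0, 2))
    expected = sum(1 - 2 * p * (1 - p) for p in freqs)
    if n == expected:  # no informative markers
        return 0.0
    return (observed - expected) / (n - expected)


def make_submap(positions, window, rng):
    """Pick one random marker index per window (assumed thinning rule)."""
    by_window = {}
    for i, pos in enumerate(positions):
        by_window.setdefault(pos // window, []).append(i)
    return [rng.choice(idx) for _, idx in sorted(by_window.items())]


def median_f(genotypes, freqs, positions, window, n_submaps=100, seed=0):
    """Estimate F on each random submap and return the median estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_submaps):
        keep = make_submap(positions, window, rng)
        estimates.append(
            excess_homozygosity_f([genotypes[i] for i in keep],
                                  [freqs[i] for i in keep])
        )
    return median(estimates)


# Toy individual: 40 markers, all homozygous, allele frequency 0.5 everywhere.
positions = list(range(0, 4000, 100))
f_hat = median_f([0, 2] * 20, [0.5] * 40, positions, window=500)
```

Taking the median rather than the mean keeps a few unlucky submaps from dominating the final estimate; the window width controls how aggressively markers are thinned.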
|
9 |
Identifizierung genetischer Biomarker für die Wirksamkeit von Oxaliplatin: Kandidatengen-bezogene und Genom-weite Analysen / Identification of genetic biomarkers for the efficacy of oxaliplatin - candidate gene and genome-wide approaches
Saman, Sadik (02 December 2014)
No description available.
|
10 |
Genetics of ankylosing spondylitis
Karaderi, Tugce (January 2012)
Ankylosing spondylitis (AS) is a common inflammatory arthritis of the spine and other affected joints, which is highly heritable, being strongly influenced by the HLA-B27 status, as well as hundreds of mostly unknown genetic variants of smaller effect. The aim of my research was to confirm some of the previously observed genetic associations and to identify new associations, many of which are in biological pathways relevant to AS pathogenesis, most notably the IL-23/TH17 axis (IL23R) and antigen presentation (ERAP1 and ERAP2). Studies presented in this thesis include replication and refinement of several potential associations initially identified by earlier GWAS (WTCCC-TASC, 2007 and TASC, 2010). I conducted an extended study of IL23R association with AS and undertook a meta-analysis, confirming the association between AS and IL23R (non-synonymous SNP rs11209026, p=1.5 x 10^-9, OR=0.61). An extensive re-sequencing and fine mapping project, including a meta-analysis, to replicate and refine the association of TNFRSF1A with AS was also undertaken; a novel variant in intron 6 was identified and a weak association with a low frequency variant, rs4149584 (p=0.01, OR=1.58), was detected. Somewhat stronger associations were seen with rs4149577 (p=0.002, OR=0.91) and rs4149578 (p=0.015, OR=1.14) in the meta-analysis. Associations at several additional loci had been identified by a more recent GWAS (WTCCC2-TASC, 2011). I used in silico techniques, including imputation using a denser panel of variants from the 1000 Genomes Project, conditional analysis and rare/low frequency variant analysis, to refine these associations. Imputation analysis (1782 cases/5167 controls) revealed novel associations with ERAP2 (rs4869313, p=7.3 x 10^-8, OR=0.79) and several additional candidate loci including IL6R, UBE2L3 and 2p16.3.
Ten SNPs were then directly typed in an independent sample (1804 cases/1848 controls) to replicate selected associations and to determine the imputation accuracy. I established that imputation using the 1000 Genomes Project pilot data was largely reliable, specifically for common variants (genotype concordance ~97%). However, more accurate imputation of low frequency variants may require larger reference populations, like the most recent 1000 Genomes reference panels. The results of my research provide a better understanding of the complex genetics of AS, and help identify future targets for genetic and functional studies.
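The genotype concordance figure quoted in this abstract compares imputed genotypes with directly typed ones. A minimal sketch of such a comparison is given below; the function name, the 0/1/2 allele-count coding, and the use of `None` for missing calls are assumptions made for illustration, not details taken from the thesis.

```python
def genotype_concordance(imputed, typed):
    """Fraction of samples whose imputed genotype equals the typed genotype.

    Genotypes are coded as allele counts (0, 1, 2); None marks a missing
    call, and pairs with a missing value are excluded (assumed convention).
    """
    if len(imputed) != len(typed):
        raise ValueError("genotype lists must have the same length")
    pairs = [(i, t) for i, t in zip(imputed, typed)
             if i is not None and t is not None]
    if not pairs:
        raise ValueError("no comparable genotype calls")
    return sum(i == t for i, t in pairs) / len(pairs)


# Toy example: one mismatch among four comparable calls, one missing call.
rate = genotype_concordance([0, 1, 2, 1, None], [0, 1, 2, 2, 1])
```

In practice concordance is usually reported per SNP and stratified by allele frequency, which is how a statement such as "reliable for common variants" would be checked.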
|