Global ETD Search

41	Seleção de características a partir da integração de dados por meio de análise de variação de número de cópias (CNV) para associação genótipo-fenótipo de doenças complexas [Feature selection from data integration through copy number variation (CNV) analysis for genotype-phenotype association of complex diseases] Meneguin, Christian Reis January 2018 Advisor: Prof. Dr. David Corrêa Martins Júnior / Dissertation (master's) - Universidade Federal do ABC, Programa de Pós-Graduação em Ciência da Computação, Santo André, 2018. / Research in systems biology is characterized by interdisciplinarity and by a broad view of the interactions that occur within biological organisms, of heredity, and of the influence of environmental factors. In this setting, a complex network of interactions arises whose components are of different types, such as copy number variations (CNVs) and genes, among others. The complex diseases that occur in this context are usually consequences of intracellular and intercellular perturbations in tissues and organs and develop in a multifactorial way, i.e., their cause and development result from various genetic and environmental factors. In recent years, high-throughput sequencing techniques have produced a very large volume of biological data, which calls for research on the integrated analysis of these data. CNVs, i.e., variation between individuals in the number of repetitions of DNA subsequences, are useful because they are related to other types of data such as genes and gene expression data (abundances of mRNAs transcribed by genes in different contexts). Owing to the heterogeneous nature and the immense amount of data, integrative analysis is a computational challenge for which approaches have been proposed.
This dissertation proposes a method that integrates data (CNVs, gene expression data, haploinsufficiency, imprinting, among others) through a process that identifies CNV segments shared between samples of different individuals, whether case or control samples, and annotates them with the information obtained from the integration. This sets the proposed method apart from methods that integrate data by analyzing the overlap of biological data but do not generate new data containing the CNV intervals shared between samples. The proposed method was evaluated in a case study of autism spectrum disorder (ASD). Besides being considered a complex disease, ASD has some peculiarities that make it harder to study than other complex diseases such as cancer. Two experiments were carried out with CNV data from individuals with ASD (cases) and individuals without the disorder (controls). A further experiment used CNV samples from ASD and CNV samples related to other neurodevelopmental diseases. The experiments involved the integration of the proposed data types. The method identified CNV segments that are present only in case samples and not in controls, as well as CNV segments present in ASD samples and absent from samples of other neurodevelopmental diseases, and vice versa. The results also reflected the tendency for males to be affected by ASD more often than females. For the CNV segments in certain results, it was also possible to identify associated genes and information such as their biotype and whether they appear in haploinsufficiency, imprinting, or expression data grouped by regions and periods.
Finally, enrichment analyses of the gene lists from the resulting CNVs point to several signaling pathways related to ASD, such as TRIF-dependent toll-like receptor signaling, gamma-aminobutyric acid (GABA), synaptic transmission and neurotransmitter secretion, insulin reception, olfactory sensory perception, and calcium-independent cell-cell adhesion.
COPY NUMBER VARIATION; GENE EXPRESSION DATA; COMPLEX DISEASES; DATA INTEGRATION; DATA MINING
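The abstract of entry 41 describes identifying CNV segments shared between samples and then separating those seen only in cases from those also seen in controls. The following is a minimal sketch of that idea, not the dissertation's actual implementation: it assumes a single chromosome, half-open integer intervals, and hypothetical names (`elementary_segments`, `case_only_segments`, `min_samples`) chosen purely for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome (assumption)


def elementary_segments(
    samples: Dict[str, List[Interval]]
) -> List[Tuple[int, int, FrozenSet[str]]]:
    """Cut the axis at every CNV breakpoint and record which samples cover each piece."""
    points = sorted({p for ivs in samples.values() for iv in ivs for p in iv})
    segments = []
    for a, b in zip(points, points[1:]):
        covering = frozenset(
            name
            for name, ivs in samples.items()
            if any(s <= a and b <= e for s, e in ivs)
        )
        if covering:  # skip gaps that no sample covers
            segments.append((a, b, covering))
    return segments


def case_only_segments(
    samples: Dict[str, List[Interval]],
    cases: Iterable[str],
    min_samples: int = 2,
) -> List[Tuple[int, int, List[str]]]:
    """Keep segments shared by at least `min_samples` cases and by no control."""
    case_set = set(cases)
    return [
        (a, b, sorted(cov))
        for a, b, cov in elementary_segments(samples)
        if cov <= case_set and len(cov) >= min_samples
    ]


if __name__ == "__main__":
    toy = {
        "case1": [(100, 500)],
        "case2": [(300, 700)],
        "ctrl1": [(450, 600)],
    }
    print(case_only_segments(toy, cases=["case1", "case2"]))
```

Because every breakpoint starts a new elementary segment, the output consists of newly derived intervals rather than the input CNVs themselves, which is the property the abstract emphasizes. Annotation with genes, haploinsufficiency, or expression data would be a further join on these intervals and is not shown.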
42	應用存活分析在微陣列資料的基因表面定型之探討 / Gene Expression Profiling with Survival Analysis on Microarray Data 張仲凱, Chang, Chunf-Kai Unknown Date / Analyzing censored survival data with high-dimensional covariates arising from microarray data has been an important issue. The main goal is to find, among a large number of genes, the small subset that is significantly related to patients' survival time or other important clinical outcomes. The Threshold Gradient Directed Regularization (TGDR) method has been used for simultaneous variable selection and model building in high-dimensional regression problems. However, TGDR adopts a gradient-projection type of algorithm, which makes its convergence slow. In this thesis, we propose modified TGDR algorithms that incorporate a Newton-Raphson type of search. The proposed approaches have characteristics similar to those of TGDR but converge faster. A real cancer microarray data set with censored survival times is used for demonstration. The second part of this thesis concerns a proposed resampling-based Peto-Peto test for survival functions on interval-censored data. The test can evaluate the power of survival function estimation methods, such as Turnbull's procedure and the Kaplan-Meier estimate. It shows that, on interval-censored data, the power based on the Kaplan-Meier estimate is lower than that based on Turnbull's estimate.
The proposed test is demonstrated on simulated data and on real interval-censored data from a breast cancer study. Gene expression data; Censored survival data; Cox proportional hazards model; Resampling-based Peto-Peto test
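To make the TGDR idea in entry 42 concrete, here is a small sketch of the thresholded gradient update: at each step only the coefficients whose gradient magnitude is at least a fraction `tau` of the largest one are moved. This is an illustration under stated assumptions, not the thesis's method: it uses a squared-error loss in place of the Cox partial likelihood, it omits the Newton-Raphson modification the thesis proposes, and the function name and default parameters are hypothetical.

```python
from typing import List


def tgdr_least_squares(
    X: List[List[float]],
    y: List[float],
    tau: float = 0.8,
    step: float = 0.01,
    n_iter: int = 500,
) -> List[float]:
    """Threshold gradient descent on 0.5 * mean squared error (stand-in loss)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        # Residuals under the current coefficients.
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Negative gradient of the loss with respect to each coefficient.
        g = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        g_max = max(abs(v) for v in g)
        if g_max == 0.0:
            break  # stationary point reached
        for j in range(p):
            # Thresholding: update only the coefficients with the steepest gradients.
            if abs(g[j]) >= tau * g_max:
                beta[j] += step * g[j]
    return beta


if __name__ == "__main__":
    # The first column drives y; the second is a weak nuisance covariate.
    X = [[1.0, 0.1], [2.0, -0.1], [3.0, 0.1], [4.0, -0.1]]
    y = [2.0, 4.0, 6.0, 8.0]
    print(tgdr_least_squares(X, y))
```

With `tau` close to 1 few coefficients move per step, which gives sparse fits; with `tau = 0` the update reduces to ordinary gradient descent. The many small steps this requires are the slow convergence that the abstract says motivates a Newton-Raphson type of search.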
43	A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data Isik, Zerrin, Ersahin, Tulin, Atalay, Volkan, Aykanat, Cevdet, Cetin-Atalay, Rengul 08 April 2014 Determination of cell signalling behaviour is crucial for understanding the physiological response to a specific stimulus or drug treatment. Current approaches for large-scale data analysis do not effectively incorporate critical topological information provided by the signalling network. We herein describe a novel model- and data-driven hybrid approach, or signal transduction score flow algorithm, which allows quantitative visualization of cyclic cell signalling pathways that lead to ultimate cell responses such as survival, migration or death. This score flow algorithm represents signalling pathways as directed graphs and maps experimental data, including negative and positive feedbacks, onto gene nodes as scores, which then computationally traverse the signalling pathway until a pre-defined biological target response is attained. Initially, experimental data-driven enrichment scores of the genes were computed in a pathway; then a heuristic approach was applied, using the gene score partition as a solution for protein node stoichiometry during dynamic scoring of the pathway of interest. Incorporating a score partition during the signal flow, together with cyclic feedback loops in the signalling pathway, significantly improves the usefulness of this model compared to other approaches. Evaluation of the score flow algorithm using both transcriptome and ChIP-seq data-generated signalling pathways showed good correlation with expected cellular behaviour on both KEGG and manually generated pathways. Implementation of the algorithm as a Cytoscape plug-in allows interactive visualization and analysis of KEGG pathways as well as user-generated and curated Cytoscape pathways.
Moreover, the algorithm accurately predicts gene-level and global impacts of single or multiple in silico gene knockouts. / This article is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence. gene expression; biological pathways; computer-aided data analysis; breast cancer; molecular networks; Kyoto Encyclopedia of Genes and Genomes; KEGG; ChIP-Seq; gene expression data; collaborative construction; binding sites; tool; deregulation; environment; ontology; ddc:540; ddc:610; ddc:570; rvk:VA 1120; rvk:XA 10000; rvk:WA 15000
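The score flow described in entry 43 (scores mapped onto gene nodes, partitioned among outgoing edges, and pushed through a directed graph that may contain feedback loops) can be illustrated with a toy propagation. This sketch is an assumption-laden simplification and not the published algorithm: the equal split among children, the multiplication by the child's own score, and the rule that each edge carries flow only once (so cycles terminate) are illustrative choices, as are all names used.

```python
from collections import deque
from typing import Dict, List, Tuple

# node -> list of (child, sign); sign is +1 for activation, -1 for inhibition
Graph = Dict[str, List[Tuple[str, int]]]


def score_flow(
    edges: Graph,
    node_scores: Dict[str, float],
    sources: List[str],
    target: str,
) -> float:
    """Propagate scores from the source nodes and return the flow reaching `target`."""
    flow = {node: 0.0 for node in node_scores}
    for s in sources:
        flow[s] = node_scores[s]
    used_edges = set()  # each edge carries flow at most once, so cycles terminate
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        children = edges.get(node, [])
        if not children:
            continue
        share = flow[node] / len(children)  # score partition among outgoing edges
        for child, sign in children:
            if (node, child) in used_edges:
                continue
            used_edges.add((node, child))
            flow[child] += sign * share * node_scores[child]
            queue.append(child)
    return flow[target]


if __name__ == "__main__":
    pathway: Graph = {
        "A": [("B", 1), ("C", 1)],
        "B": [("D", 1)],
        "C": [("D", -1)],   # inhibitory edge
        "D": [("A", 1)],    # feedback loop back to the source
    }
    scores = {"A": 4.0, "B": 1.0, "C": 0.5, "D": 1.0}
    print(score_flow(pathway, scores, sources=["A"], target="D"))
```

An in silico knockout of the kind mentioned in the abstract could be mimicked in this toy by setting a node's score to zero and comparing the flow that reaches the target before and after.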
44	Identification and assessment of gene signatures in human breast cancer / Identification et évaluation de signatures géniques dans le cancer du sein humain Haibe-Kains, Benjamin 02 April 2009 This thesis addresses the use of machine learning techniques to develop clinical diagnostic tools based on molecular data for breast cancer, one of the most frequently diagnosed cancers. These tools are designed to assist physicians in their evaluation of the clinical outcome of breast cancer (referred to as prognosis).
The traditional approach to evaluating breast cancer prognosis is based on the assessment of clinico-pathologic factors known to be associated with breast cancer survival. These factors are used to make recommendations about whether further treatment is required after the removal of a tumor by surgery. Treatment such as chemotherapy depends on the estimation of patients' risk of relapse. Although current approaches do provide good prognostic assessment of breast cancer survival, clinicians are aware that there is still room for improvement in the accuracy of their prognostic estimations.
In the late nineties, new high-throughput technologies such as gene expression profiling with microarrays emerged. Microarrays allowed scientists to analyze, for the first time, the expression of the whole human genome (the "transcriptome"). It was hoped that the analysis of genome-wide molecular data would bring new insights into the critical, underlying biological mechanisms involved in breast cancer progression, as well as significantly improve prognostic prediction. However, the analysis of microarray data is a difficult task because of their intrinsic characteristics: (i) the expression of thousands of genes is measured for only a few samples; (ii) the measurements are usually "noisy"; and (iii) they are highly correlated because of gene co-expression.
Since traditional statistical methods were not adapted to these settings, machine learning methods were adopted as good candidates to overcome these difficulties. However, applying machine learning methods to microarray analysis involves numerous steps, and the results are prone to overfitting. Several authors highlighted the major pitfalls of this process in the early publications, shedding new light on the promising but overoptimistic results.
Since 2002, large comparative studies have been conducted in order to identify the key characteristics of successful methods for class discovery and classification. Yet methods able to identify robust molecular signatures that can predict breast cancer prognosis have been lacking. To fill this important gap, this thesis presents an original methodology dealing specifically with the analysis of microarray and survival data in order to build prognostic models and provide an honest estimation of their performance. The approach used for signature extraction consists of a set of original methods for feature transformation, feature selection and prediction model building. A novel statistical framework is presented for performance assessment and comparison of risk prediction models.
In terms of applications, we show that these methods, used in combination with a priori biological knowledge of breast cancer and numerous public microarray datasets, have resulted in some important discoveries. In particular, the research presented here develops (i) a robust model for the identification of breast cancer molecular subtypes and (ii) a new prognostic model that takes into account the previously observed molecular heterogeneity of breast cancers, in order to improve on traditional clinical guidelines and state-of-the-art gene signatures. / Doctorate in Sciences / info:eu-repo/semantics/nonPublished General computer science; Exact and natural sciences; Breast -- Cancer -- Data processing; DNA microarrays; Gene expression -- Data processing; machine learning
45	A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data Isik, Zerrin, Ersahin, Tulin, Atalay, Volkan, Aykanat, Cevdet, Cetin-Atalay, Rengul January 2012 Determination of cell signalling behaviour is crucial for understanding the physiological response to a specific stimulus or drug treatment. Current approaches for large-scale data analysis do not effectively incorporate critical topological information provided by the signalling network. We herein describe a novel model- and data-driven hybrid approach, or signal transduction score flow algorithm, which allows quantitative visualization of cyclic cell signalling pathways that lead to ultimate cell responses such as survival, migration or death. This score flow algorithm represents signalling pathways as directed graphs and maps experimental data, including negative and positive feedbacks, onto gene nodes as scores, which then computationally traverse the signalling pathway until a pre-defined biological target response is attained. Initially, experimental data-driven enrichment scores of the genes were computed in a pathway; then a heuristic approach was applied, using the gene score partition as a solution for protein node stoichiometry during dynamic scoring of the pathway of interest. Incorporating a score partition during the signal flow, together with cyclic feedback loops in the signalling pathway, significantly improves the usefulness of this model compared to other approaches. Evaluation of the score flow algorithm using both transcriptome and ChIP-seq data-generated signalling pathways showed good correlation with expected cellular behaviour on both KEGG and manually generated pathways. Implementation of the algorithm as a Cytoscape plug-in allows interactive visualization and analysis of KEGG pathways as well as user-generated and curated Cytoscape pathways.
Moreover, the algorithm accurately predicts gene-level and global impacts of single or multiple in silico gene knockouts. / This article is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence. info:eu-repo/classification/ddc/540 ddc:540 info:eu-repo/classification/ddc/610 ddc:610 info:eu-repo/classification/ddc/570 ddc:570
