41 |
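Seleção de características a partir da integração de dados por meio de análise de variação de número de cópias (CNV) para associação genótipo-fenótipo de doenças complexas / Feature selection from data integration through copy number variation (CNV) analysis for genotype-phenotype association of complex diseases
Meneguin, Christian Reis. January 2018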
Advisor: Prof. Dr. David Corrêa Martins Júnior / Dissertation (master's) - Universidade Federal do ABC, Graduate Program in Computer Science, Santo André, 2018. / Research in systems biology is characterized by interdisciplinarity and a wide-ranging understanding
of interactions within biological organisms, heredity, and the influence of
environmental factors. In this scenario, a complex network of interactions is constituted of
different types of components, such as CNVs (Copy Number Variations), genes, and others.
Complex diseases that occur in this context are usually consequences of intracellular and
intercellular perturbations in tissues and organs and develop in a multifactorial way, i.e., the cause and development
of these diseases are the result of various genetic and environmental factors. In recent
years, high-throughput sequencing techniques have produced a very large volume of biological data,
which calls for research on the integrated analysis of
these data. CNVs, i.e., variations in the number of repetitions of DNA subsequences between individuals,
are useful because they are related to other types of data such as genes and
gene expression data (abundances of mRNAs transcribed by genes in different contexts).
Due to the heterogeneous nature and the immense amount of data, integrative analysis
is a computational challenge for which approaches have been proposed. This
dissertation proposes a method that integrates data (CNVs, gene
expression data, haploinsufficiency, imprinting, among others) through a process that
identifies CNV segments shared between samples of different individuals, whether these are
case or control samples, and annotates those segments with the information obtained from the integration.
The proposed method thus differs from methods that integrate data
by analyzing the overlap of biological data but do not generate
new data containing the CNV intervals shared between the samples. The proposed method
was evaluated using autism spectrum disorder (ASD) as a case study. Besides
being considered a complex disease, ASD has some peculiarities that make it harder to study
than other complex diseases such as cancer.
Two experiments were carried out with CNV data from individuals
with ASD (case) and individuals without this disorder (control). A further experiment
used CNV samples from ASD and CNV samples related to other neurodevelopmental
diseases. The experiments involved the integration of the proposed data types. Among the
results, the method identified CNV segments that are present only in samples associated
with the cases and not in controls, as well as CNV segments present in ASD samples
and absent from samples of other neurodevelopmental diseases, and vice versa. The results
also reflected the tendency for males to be affected by ASD more often than females.
For the CNV segments in some of the results, it was possible to identify the associated genes
and information such as their biotype and whether they appear in haploinsufficiency data, imprinting data,
or expression data grouped by region and period. Finally, enrichment analyses
of the gene lists from the resulting CNVs point to several pathways related
to ASD, such as TRIF-dependent toll-like receptor signaling, gamma-aminobutyric acid
(GABA) signaling, synaptic transmission and neurotransmitter secretion, insulin reception, olfactory
sensory perception, and calcium-independent cell-cell adhesion.
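
The idea of deriving CNV segments shared between samples, and then keeping those seen only in cases, can be illustrated with a minimal sketch. It is not the dissertation's implementation: the names `Cnv`, `shared_segments`, and `case_only` are hypothetical, intervals are assumed to be half-open `[start, end)`, and splitting each chromosome at every CNV breakpoint is an assumed technique chosen for clarity.

```python
from collections import namedtuple

# Hypothetical record: one CNV call for one sample, half-open [start, end).
Cnv = namedtuple("Cnv", "sample chrom start end")

def shared_segments(cnvs):
    """Split each chromosome at every CNV breakpoint and report, for each
    elementary segment, the set of samples whose CNVs cover it."""
    by_chrom = {}
    for c in cnvs:
        by_chrom.setdefault(c.chrom, []).append(c)
    out = []
    for chrom, items in sorted(by_chrom.items()):
        # All distinct breakpoints on this chromosome, in order.
        points = sorted({p for c in items for p in (c.start, c.end)})
        for lo, hi in zip(points, points[1:]):
            samples = frozenset(
                c.sample for c in items if c.start <= lo and hi <= c.end
            )
            if samples:  # skip gaps that no CNV covers
                out.append((chrom, lo, hi, samples))
    return out

def case_only(segments, cases, controls):
    """Keep segments covered by at least one case and by no control."""
    return [s for s in segments if s[3] & cases and not (s[3] & controls)]

calls = [
    Cnv("case1", "chr1", 100, 500),
    Cnv("case2", "chr1", 300, 700),
    Cnv("ctrl1", "chr1", 450, 900),
]
for chrom, lo, hi, samples in case_only(
    shared_segments(calls), {"case1", "case2"}, {"ctrl1"}
):
    print(chrom, lo, hi, sorted(samples))
```

Each reported segment is a new interval that did not exist in any single input call, which is the kind of derived data the abstract contrasts with plain overlap analysis. Annotation with genes, haploinsufficiency, or imprinting data would be a further lookup per segment and is left out here.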
|
42 |
應用存活分析在微陣列資料的基因表面定型之探討 / Gene Expression Profiling with Survival Analysis on Microarray Data
張仲凱, Chang, Chunf-Kai. Unknown Date
Analyzing censored survival data with high-dimensional covariates arising from microarray data has been an important issue. The main goal is to identify, among a large number of genes, the small subset that is significantly associated with patients' survival time or other important clinical outcomes. The Threshold Gradient Directed Regularization (TGDR) method has been used for simultaneous variable selection and model building in high-dimensional regression problems. However, TGDR adopts a gradient-projection type of algorithm and therefore converges slowly. In this thesis, we propose modified TGDR algorithms that incorporate a Newton-Raphson type of search. The proposed approaches have characteristics similar to those of TGDR but converge faster. A real cancer microarray dataset with censored survival times is used for demonstration.
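
As a rough illustration of the thresholded gradient update that TGDR is generally described as using, the sketch below updates only the coefficients whose gradient magnitude is within a fraction `tau` of the largest one. It is a sketch under stated assumptions, not the thesis's method: a least-squares loss replaces the censored-survival likelihood, no Newton-Raphson step is included, and the function name `tgdr_least_squares` and all parameter names are hypothetical.

```python
def tgdr_least_squares(X, y, tau=0.8, step=0.1, n_iter=200):
    """Thresholded gradient descent on the loss (1/2n) * ||y - X beta||^2.

    Assumption: least squares stands in for the survival likelihood.
    tau in (0, 1] controls sparsity: at each iteration only coefficients
    with |gradient| >= tau * max|gradient| are updated.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Negative gradient of the loss with respect to each coefficient.
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        gmax = max(abs(g) for g in grad)
        if gmax == 0.0:
            break  # stationary point reached
        for j in range(p):
            if abs(grad[j]) >= tau * gmax:  # threshold indicator
                beta[j] += step * grad[j]
    return beta

# Toy data: the response depends on the first covariate only.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2 * row[0] for row in X]
print(tgdr_least_squares(X, y))
```

With `tau` close to 1 the update behaves like a greedy, sparse selection of covariates, and with `tau` close to 0 it approaches ordinary gradient descent. The fixed small step is also what makes convergence slow, which is the limitation the abstract's Newton-Raphson variants target.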
The second part of this thesis proposes a resampling-based Peto-Peto test for survival functions on interval-censored data. The test can be used to evaluate the power obtained with different survival function estimation methods, such as Turnbull's procedure and the Kaplan-Meier estimator. It shows that, on interval-censored data, the power based on the Kaplan-Meier estimator is lower than that based on Turnbull's estimator. The proposed test is demonstrated on simulated interval-censored data and on a real interval-censored dataset from a breast cancer study.
|
43 |
A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data
Isik, Zerrin; Ersahin, Tulin; Atalay, Volkan; Aykanat, Cevdet; Cetin-Atalay, Rengul. 08 April 2014
Determination of cell signalling behaviour is crucial for understanding the physiological response to a specific stimulus or drug treatment. Current approaches for large-scale data analysis do not effectively incorporate critical topological information provided by the signalling network. We herein describe a novel model- and data-driven hybrid approach, the signal transduction score flow algorithm, which allows quantitative visualization of cyclic cell signalling pathways that lead to ultimate cell responses such as survival, migration or death. The score flow algorithm represents signalling pathways as directed graphs and maps experimental data, including negative and positive feedbacks, onto gene nodes as scores, which then computationally traverse the signalling pathway until a pre-defined biological target response is attained. Initially, experimental data-driven enrichment scores of the genes were computed in a pathway, then a heuristic approach was applied using the gene score partition as a solution for protein node stoichiometry during dynamic scoring of the pathway of interest. Incorporation of a score partition during the signal flow and cyclic feedback loops in the signalling pathway significantly improves the usefulness of this model, as compared to other approaches. Evaluation of the score flow algorithm using both transcriptome and ChIP-seq data-generated signalling pathways showed good correlation with expected cellular behaviour on both KEGG and manually generated pathways. Implementation of the algorithm as a Cytoscape plug-in allows interactive visualization and analysis of KEGG pathways as well as user-generated and curated Cytoscape pathways. Moreover, the algorithm accurately predicts gene-level and global impacts of single or multiple in silico gene knockouts. / This contribution is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence.
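
The general notion of scores flowing along a directed, possibly cyclic, pathway can be sketched as follows. This is a deliberately simplified illustration and not the published algorithm: it assumes each node splits its score equally among its outgoing edges, treats inhibition as a sign flip, and handles cycles by running a fixed number of synchronous rounds. The function name `score_flow` and its parameters are hypothetical.

```python
def score_flow(edges, self_score, n_rounds=10):
    """Propagate node scores along a signed directed graph.

    edges: list of (src, dst, sign), sign = +1 activation, -1 inhibition.
    self_score: dict mapping node -> experimental score (assumed given).
    Assumption: a node's score is partitioned equally over its out-edges.
    """
    out = {}
    for src, dst, sign in edges:
        out.setdefault(src, []).append((dst, sign))
    nodes = set(self_score) | {n for s, d, _ in edges for n in (s, d)}
    total = {n: self_score.get(n, 0.0) for n in nodes}
    for _ in range(n_rounds):  # bounded rounds keep cyclic pathways finite
        incoming = {n: 0.0 for n in nodes}
        for src, targets in out.items():
            share = total[src] / len(targets)  # equal score partition
            for dst, sign in targets:
                incoming[dst] += sign * share
        total = {n: self_score.get(n, 0.0) + incoming[n] for n in nodes}
    return total

pathway = [("A", "B", +1), ("A", "C", +1), ("B", "D", +1), ("C", "D", -1)]
scores = score_flow(pathway, {"A": 4.0, "B": 1.0})
print(sorted(scores.items()))
```

In this toy pathway the target node D receives activation through B and inhibition through C, so its final score reflects the balance of the two branches. An in silico knockout could be mimicked by dropping a node's outgoing edges and zeroing its score before calling the function.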
|
44 |
Identification and assessment of gene signatures in human breast cancer / Identification et évaluation de signatures géniques dans le cancer du sein humain
Haibe-Kains, Benjamin. 02 April 2009
This thesis addresses the use of machine learning techniques to develop clinical diagnostic tools for breast cancer using molecular data. These tools are designed to assist physicians in their evaluation of the clinical outcome of breast cancer (referred to as prognosis).
The traditional approach to evaluating breast cancer prognosis is based on the assessment of clinico-pathologic factors known to be associated with breast cancer survival. These factors are used to make recommendations about whether further treatment is required after the removal of a tumor by surgery. Treatment such as chemotherapy depends on the estimation of patients' risk of relapse. Although current approaches do provide good prognostic assessment of breast cancer survival, clinicians are aware that there is still room for improvement in the accuracy of their prognostic estimations.
In the late nineties, new high-throughput technologies such as gene expression profiling with microarrays emerged. Microarrays allowed scientists to analyze for the first time the expression of the whole human genome ("transcriptome"). It was hoped that the analysis of genome-wide molecular data would bring new insights into the critical, underlying biological mechanisms involved in breast cancer progression, as well as significantly improve prognostic prediction. However, the analysis of microarray data is a difficult task due to their intrinsic characteristics: (i) thousands of gene expressions are measured for only a few samples; (ii) the measurements are usually "noisy"; and (iii) they are highly correlated due to gene co-expression. Since traditional statistical methods were not adapted to these settings, machine learning methods were picked up as good candidates to overcome these difficulties. However, applying machine learning methods to microarray analysis involves numerous steps, and the results are prone to overfitting. Several authors highlighted the major pitfalls of this process in the early publications, shedding new light on the promising but overoptimistic results.
Since 2002, large comparative studies have been conducted in order to identify the key characteristics of successful methods for class discovery and classification. Yet methods able to identify robust molecular signatures that can predict breast cancer prognosis have been lacking. To fill this important gap, this thesis presents an original methodology dealing specifically with the analysis of microarray and survival data in order to build prognostic models and provide an honest estimation of their performance. The approach used for signature extraction consists of a set of original methods for feature transformation, feature selection and prediction model building. A novel statistical framework is presented for performance assessment and comparison of risk prediction models.
In terms of applications, we show that these methods, used in combination with a priori biological knowledge of breast cancer and numerous public microarray datasets, have resulted in some important discoveries. In particular, the research presented here develops (i) a robust model for the identification of breast molecular subtypes and (ii) a new prognostic model that takes into account the molecular heterogeneity of breast cancers observed previously, in order to improve traditional clinical guidelines and state-of-the-art gene signatures. / This thesis concerns the development of machine learning techniques for building new clinical tools based on molecular data. We focused our research on breast cancer, one of the most frequently diagnosed cancers. These tools are developed to help physicians assess the clinical outcome of cancer patients (i.e., the prognosis).
Traditional approaches to assessing a cancer patient's prognosis rely on clinico-pathological criteria known to be predictive of survival. This assessment allows physicians to decide whether treatment is needed after removal of the tumor. Although traditional assessment tools are of great help, clinicians are aware of the need to improve them.
In the 1990s, new high-throughput technologies, such as gene expression profiling with DNA microarrays, were developed to allow scientists to analyze the expression of the entire genome of cancer cells. This new type of molecular data carries the hope of improving traditional prognostic tools and of deepening our knowledge of the genesis of breast cancer. However, these data are extremely difficult to analyze because of (i) their high dimensionality (several tens of thousands of genes for only a few hundred experiments); (ii) the substantial noise in the measurements; and (iii) the collinearity between measurements due to gene co-expression.
Since 2002, large-scale comparative studies have identified the methods that perform well for cluster analysis and classification of microarray data, while neglecting the survival analysis that is relevant to breast cancer prognosis. To fill this gap, this thesis presents an original methodology suited to the analysis of microarray and survival data in order to build accurate and robust prognostic models.
In terms of applications, we show that this methodology, used in combination with a priori biological knowledge and numerous public datasets, has led to important discoveries. In particular, the research presented in this thesis resulted in the development of a robust model for identifying the molecular subtypes of breast cancer and of several gene signatures that significantly improve on the state of the art in prognosis. / Doctorate in Sciences / info:eu-repo/semantics/nonPublished
|
45 |
A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data
Isik, Zerrin; Ersahin, Tulin; Atalay, Volkan; Aykanat, Cevdet; Cetin-Atalay, Rengul. January 2012
Determination of cell signalling behaviour is crucial for understanding the physiological response to a specific stimulus or drug treatment. Current approaches for large-scale data analysis do not effectively incorporate critical topological information provided by the signalling network. We herein describe a novel model- and data-driven hybrid approach, the signal transduction score flow algorithm, which allows quantitative visualization of cyclic cell signalling pathways that lead to ultimate cell responses such as survival, migration or death. The score flow algorithm represents signalling pathways as directed graphs and maps experimental data, including negative and positive feedbacks, onto gene nodes as scores, which then computationally traverse the signalling pathway until a pre-defined biological target response is attained. Initially, experimental data-driven enrichment scores of the genes were computed in a pathway, then a heuristic approach was applied using the gene score partition as a solution for protein node stoichiometry during dynamic scoring of the pathway of interest. Incorporation of a score partition during the signal flow and cyclic feedback loops in the signalling pathway significantly improves the usefulness of this model, as compared to other approaches. Evaluation of the score flow algorithm using both transcriptome and ChIP-seq data-generated signalling pathways showed good correlation with expected cellular behaviour on both KEGG and manually generated pathways. Implementation of the algorithm as a Cytoscape plug-in allows interactive visualization and analysis of KEGG pathways as well as user-generated and curated Cytoscape pathways. Moreover, the algorithm accurately predicts gene-level and global impacts of single or multiple in silico gene knockouts. / This contribution is freely accessible with the consent of the rights holder under a (DFG-funded) Alliance or National Licence.
|