481

Searching for the contemporary and temporal causal relations from data. / 数据中的时间因果关联分析 / CUHK electronic theses & dissertations collection

January 2012 (has links)
Causal analysis has drawn much attention because it provides deep insight into the relations between random events. Graphical models are a dominant tool for representing causal relations. Under the graphical-model framework, the causal relations implied in a data set are captured by a Bayesian network defined on that data set, and causal discovery is achieved by constructing a Bayesian network from the data set. Therefore, Bayesian network learning plays an important role in causal relation discovery. In this thesis, we develop a Two-Phase Bayesian network learning algorithm that learns a Bayesian network from data. Phase one of the algorithm learns Markov random fields from data, and phase two constructs Bayesian networks based on the Markov random fields obtained. We show that the Two-Phase algorithm provides state-of-the-art accuracy, and that the techniques proposed in this work can easily be adopted by other Bayesian network learning algorithms. Furthermore, we show that the Two-Phase algorithm can be used for time series analysis by evaluating it against a series of time-series causal learning algorithms, including VAR and SVAR. Its practical applicability is also demonstrated through empirical evaluation on real-world data sets.
/ We start by presenting a constraint-based Bayesian network learning framework that is a generalization of the SGS algorithm [86]. We show that the key step in making Bayesian network learning efficient is restricting the search space of conditioning sets. This leads to the core of this thesis: the Two-Phase Bayesian network learning algorithm. Here we show that by learning Bayesian networks from Markov random fields, we efficiently reduce the computational complexity and enhance the reliability of the algorithm. Besides proposing this Bayesian network learning algorithm, we use zero partial correlation as an indicator of conditional independence. We show that partial correlation can be applied to arbitrary distributions provided the data are generated by linear models. In addition, we prove that the Gaussian distribution is a special case of the linear structural equation model. We then compare our Two-Phase algorithm with other state-of-the-art Bayesian network learning algorithms on several real-world Bayesian networks that are used as benchmarks by many related works.
/ Having built an efficient and accurate Bayesian network learning algorithm, we then apply it to causal relation discovery on time series. First we show that the SVAR model is incapable of identifying contemporaneous causal order for Gaussian processes because it fails to discover structures faithful to the underlying distributions. We also develop a framework, distinct from existing works, to learn true SVAR and VAR models using Bayesian networks. Finally, we show its applicability to a real-world problem.
/ Wang, Zhenxing. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 184-195). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
/ Abstract --- p.i / Acknowledgement --- p.v / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Causal Relation and Directed Graphical Model --- p.1 / Chapter 1.2 --- A Brief History of Bayesian Network Learning --- p.3 / Chapter 1.3 --- Some Important Issues for Causal Bayesian Network Learning --- p.5 / Chapter 1.3.1 --- Learning Bayesian network locally --- p.6 / Chapter 1.3.2 --- Conditional independence test --- p.7 / Chapter 1.3.3 --- Causation discovery for time series --- p.8 / Chapter 1.4 --- Road Map of the Thesis --- p.10 / Chapter 1.5 --- Summary of the Remaining Chapters --- p.12 / Chapter 2 --- Background Study --- p.14 / Chapter 2.1 --- Notations --- p.14 / Chapter 2.2 --- Formal Preliminaries --- p.15 / Chapter 2.3 --- Constraint-Based Bayesian Network Learning --- p.24 / Chapter 3 --- Two-Phase Bayesian Network Learning --- p.33 / Chapter 3.1 --- Two-Phase Bayesian Network Learning Algorithm --- p.35 / Chapter 3.1.1 --- Basic Two-Phase algorithm --- p.37 / Chapter 3.1.2 --- Two-Phase algorithm with Markov blanket information --- p.59 / Chapter 3.2 --- Correctness Proof and Complexity Analysis --- p.73 / Chapter 3.2.1 --- Correctness proof --- p.73 / Chapter 3.2.2 --- Complexity analysis --- p.81 / Chapter 3.3 --- Related Works --- p.83 / Chapter 3.3.1 --- Search-and-score algorithms --- p.84 / Chapter 3.3.2 --- Constraint-based algorithms --- p.85 / Chapter 3.3.3 --- Other algorithms --- p.86 / Chapter 4 --- Measuring Conditional Independence --- p.88 / Chapter 4.1 --- Formal Definition of Conditional Independence --- p.88 / Chapter 4.2 --- Measuring Conditional Independence --- p.96 / Chapter 4.2.1 --- Measuring independence with partial correlation --- p.96 / Chapter 4.2.2 --- Measuring independence with mutual information --- p.104 / Chapter 4.3 --- Non-Gaussian Distributions and Equivalent Class --- p.108 / Chapter 4.4 --- Heuristic CI Tests Under Monotone Faithfulness Condition --- p.116 / Chapter 5 --- Empirical Results of Two-Phase Algorithms --- p.125 / Chapter 5.1 --- Experimental Setup --- p.126 / Chapter 5.2 --- Structure Error After Each Phase of Two-Phase Algorithms --- p.129 / Chapter 5.3 --- Maximal and Average Sizes of Conditioning Sets --- p.131 / Chapter 5.4 --- Comparison of the Number of CI Tests Required by Dependency Analysis Approaches --- p.133 / Chapter 5.5 --- Reason for Which the Number of CI Tests Required Grows with Sample Size --- p.135 / Chapter 5.6 --- Two-Phase Algorithms on Linear Gaussian Data --- p.136 / Chapter 5.7 --- Two-Phase Algorithms on Linear Non-Gaussian Data --- p.139 / Chapter 5.8 --- Compare Two-Phase Algorithms with Search-and-Score Algorithms and Lasso Regression --- p.142 / Chapter 6 --- Causal Mining in Time Series Data --- p.146 / Chapter 6.1 --- A Brief Review of Causation Discovery in Time Series --- p.146 / Chapter 6.2 --- Limitations of Constructing SVAR from VAR --- p.150 / Chapter 6.3 --- SVAR Being Incapable of Identifying Contemporaneous Causal Order for Gaussian Process --- p.152 / Chapter 6.4 --- Estimating the SVARs by Bayesian Network Learning Algorithm --- p.157 / Chapter 6.4.1 --- Represent SVARs by Bayesian networks --- p.158 / Chapter 6.4.2 --- Getting back SVARs and VARs from Bayesian networks --- p.159 / Chapter 6.5 --- Experimental Results --- p.162 / Chapter 6.5.1 --- Experiment on artificial data --- p.162 / Chapter 6.5.2 --- Application in finance --- p.172 / Chapter 6.6 --- Comparison with Related Works --- p.174 / Chapter 7 --- Concluding Remarks --- p.178 / Bibliography --- p.184
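
The second abstract paragraph turns on testing conditional independence through zero partial correlation. As a rough illustration of how a constraint-based learner consumes such a test, here is a minimal Python sketch; the function names and the Fisher-z significance test are standard choices assumed here, not code from the thesis.

```python
import numpy as np
from scipy import stats

def partial_correlation(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond,
    read off the precision (inverse covariance) matrix of the involved columns."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def is_conditionally_independent(data, i, j, cond, alpha=0.05):
    """Fisher-z test of zero partial correlation; True means X_i and X_j
    are judged conditionally independent given the variables in cond."""
    n = data.shape[0]
    r = partial_correlation(data, i, j, cond)
    z = np.arctanh(r)                            # Fisher transform
    stat = np.sqrt(n - len(cond) - 3) * abs(z)   # approx. N(0, 1) under H0
    return 2 * stats.norm.sf(stat) > alpha
```

Restricting `cond` to the small sets suggested by phase one's Markov random field is exactly the kind of search-space reduction the abstract credits for the algorithm's efficiency.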
482

Statistical inferences for a pure birth process

Hsu, Jyh-Ping January 2010 (has links)
Typescript (photocopy). / Digitized by Kansas Correctional Industries
483

Nonparametric density estimation for univariate and bivariate distributions with applications in discriminant analysis for the bivariate case

Haug, Mark January 2010 (has links)
Typescript (photocopy). / Digitized by Kansas Correctional Industries / Department: Statistics.
484

Adsorção sequencial unidimensional: Modelos para automontagem de moléculas / Unidimensional sequential adsorption: Models for self-assembly of molecules.

Alexandre Martins Melo 15 August 2005 (has links)
In this work we study particle deposition on one-dimensional lattices through a statistical approach, using models such as random sequential adsorption (RSA) and cooperative sequential adsorption (CSA). The goal is to simulate the formation of oligomeric self-assembled monolayers (SAMs) on substrates, a topic that is relevant firstly because of recent advances in the experimental area, with the manufacturing of nanostructures using SAMs, and secondly because of progress in the theoretical field, with the development of stochastic models of sequential adsorption processes. We introduce the classical model of random sequential adsorption, the associated nomenclature, and some of its characteristics. We analyze the fraction of the lattice that remains empty after reaching saturation, for oligomers formed of K units. Subsequently we study the kinetics of an RSA process, initially for the case of monomers, then for dimers, and lastly for larger species. For dimers, the structure of the monolayer is analyzed in terms of grain or domain size (the probability distribution p(n) of n contiguous oligomers) and the correlation function. Data obtained in our Monte Carlo simulations are compared with existing results in the literature. From this point on, we study variations of the simple RSA process. We add another oligomer to the flux, obtaining RSA of mixtures. We examine how the fraction of each oligomer in the flux affects the filling of the monolayer: not only the final saturation coverage, but also the kinetics of the process. We further analyze the influence of the additional oligomer on the structure of the monolayer obtained. Here too we compare our results with models from the literature. We then add relaxation to the adsorption processes, obtaining adsorption-diffusion (RSAD), adsorption-desorption, and adsorption-diffusion-desorption, always compiling the stochastic models available in the literature and comparing our results where analytical models are available. As done previously with RSA, when studying these more complex processes we analyze both the kinetics (including the final coverage) and the resulting structure (correlation function and domain-size distribution), highlighting the changes caused by these relaxation processes. In the final chapter we study cooperative processes, in which the fate of oligomers being adsorbed depends on their neighborhood. As is usual in statistical models of sequential adsorption, interactions are represented by changes in the rates of some event (in this case, diffusion). We present new data concerning the cooperative adsorption-diffusion model we simulated, aiming to demonstrate the influence that an attractive interaction between oligomers may have on the structure of the monolayer formed on the substrate. The objective is to illustrate how the structure of a SAM may give clues about the interactions between oligomers.
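
To make the RSA kinetics concrete, here is a toy Monte Carlo sketch (my own illustration, not the thesis's code) of dimer deposition on a 1D lattice; at jamming, the coverage approaches the classical 1 - e^(-2) ≈ 0.8647 value.

```python
import numpy as np

def rsa_dimer_coverage(L=10_000, rng=None):
    """Random sequential adsorption of dimers on a 1D lattice of L sites.
    Sampling the landing site uniformly among currently feasible pairs is
    equivalent to attempt-and-reject dynamics, since rejected attempts
    leave the lattice unchanged."""
    rng = rng or np.random.default_rng()
    occupied = np.zeros(L, dtype=bool)
    while True:
        feasible = np.flatnonzero(~occupied[:-1] & ~occupied[1:])  # adjacent empty pairs
        if feasible.size == 0:      # jamming: no dimer can land anywhere
            break
        s = rng.choice(feasible)
        occupied[s:s + 2] = True    # deposit a dimer on sites s and s+1
    return occupied.mean()

print(rsa_dimer_coverage())  # about 0.8647 = 1 - exp(-2) for large L
```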
485

Interrelações da produtividade de cana-de-açúcar com atributos químicos de um argissolo vermelho eutrófico paulista / [Interrelations between sugarcane yield and the chemical attributes of a eutrophic Red Argisol in São Paulo state]

Dal Bem, Edjair Augusto. January 2013 (has links)
Advisor: Morel de Passos e Carvalho / Committee: Marcelo Andreotti / Committee: Cassiano Garcia Roque / Abstract: Sugarcane cultivation today represents a major source of revenue for Brazil, through the production of both sugar and ethyl alcohol. Geospatial modeling allows the quantitative description of the spatial variability of soil attributes and the unbiased, minimum-variance estimation of these attributes at unsampled locations. Capturing this variability makes geostatistics an efficient decision-support tool for soil management aimed at increasing crop productivity. This work aimed to analyze the linear and spatial correlations between the yield of sugarcane (Saccharum spp.) and soil chemical attributes, determining those best related to the increase in agricultural productivity at issue. The work was conducted in the municipality of Rubinéia (SP), where the soil of the area is a eutrophic Red Ultisol. A geostatistical grid was installed for soil and plant data collection, with 121 sampling points at two depths (0-0.20 and 0.20-0.40 m), over an area of 1.30 ha and a spacing of 10 x 13 m between points. Yield was assessed in the field by counting the number of plants and weighing the stalks over 9 m² per sampling point; technological analyses of the plants were performed in the laboratory of the Vale do Paraná S/A mill using the CONSECANA methodology. The data series was normal for all technological attributes of sugarcane, whereas among the soil attributes those presenting normality were pH, potassium (K) and base saturation (V%) at the 0-0.20 m depth, and pH, magnesium (Mg) and V% at the 0.20-0.40 m depth. Average stalk yield was 94.6 t ha-1 and the average stalk population was 11.48 stalks m-2. For the soil attributes, the highest averages were potassium in the 0-0.20 and 0.20-0.40 m layers (K1 and K2), magnesium in the layers... (Complete abstract: click electronic access below) / Master's
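
The geostatistical analysis the abstract describes rests on estimating spatial dependence from the sampled grid. Below is a minimal sketch of the standard (Matheron) empirical semivariogram, run on hypothetical pH values over the abstract's 121-point, 10 x 13 m grid; the estimator is standard, but the data and function names are illustrative assumptions, not the thesis's code.

```python
import numpy as np

def empirical_semivariogram(coords, values, lags, tol):
    """Classical Matheron estimator: gamma(h) = (1 / 2N(h)) * sum (z_i - z_j)^2
    over point pairs whose separation is within tol of the lag h."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    sqdiff = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)          # count each pair once
    dist, sqdiff = dist[iu], sqdiff[iu]
    out = []
    for h in lags:
        sel = np.abs(dist - h) < tol
        out.append(0.5 * sqdiff[sel].mean() if sel.any() else np.nan)
    return np.array(out)

# Hypothetical use on an 11 x 11 grid with the abstract's 10 m x 13 m spacing:
xy = np.array([(10.0 * i, 13.0 * j) for i in range(11) for j in range(11)])
ph = np.random.default_rng(1).normal(5.5, 0.3, len(xy))   # stand-in pH values
print(empirical_semivariogram(xy, ph, lags=np.arange(10, 60, 10), tol=5))
```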
486

Bayesian analysis for complex structural equation models. / CUHK electronic theses & dissertations collection

January 2000 (has links)
Xin-Yuan Song. / "December 2000." / Thesis (Ph.D.)--Chinese University of Hong Kong, 2000. / Includes bibliographical references (p. 128-142). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Mode of access: World Wide Web. / Abstracts in English and Chinese.
487

Interaction-Based Learning for High-Dimensional Data with Continuous Predictors

Huang, Chien-Hsun January 2014 (has links)
High-dimensional data, such as gene expression data from microarray experiments, may contain a substantial amount of useful information to be explored. However, that information, namely the relevant variables and their joint interactions, is usually diluted by noise from a large number of non-informative variables. Consequently, variable selection plays a pivotal role in learning from high-dimensional problems. Traditional feature selection methods, such as Pearson's correlation between response and predictors, stepwise linear regression, and LASSO, are among the popular linear methods. These methods are effective in identifying linear marginal effects but are limited in detecting non-linear or higher-order interaction effects. It is well known that epistasis (gene-gene interactions) may play an important role in gene expression, where unknown functional forms are difficult to identify. In this thesis, we propose a novel nonparametric measure to first screen and then select features based on information from nearest neighborhoods. The method is inspired by Lo and Zheng's earlier work (2002) on detecting interactions for discrete predictors. We apply a backward elimination algorithm based on this measure, which leads to the identification of many influential clusters of variables. The identified groups of variables can capture both marginal and interactive effects. Second, each identified cluster has the potential to perform predictions and classifications more accurately. We also study procedures for combining these groups of individual classifiers into a final predictor. Through simulation and real data analysis, the proposed measure proves capable of identifying important variable sets and patterns, including higher-order interaction sets. The proposed procedure outperforms existing methods on three different microarray datasets. Moreover, the nonparametric measure is quite flexible and can easily be extended and applied to other areas of high-dimensional data and studies.
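
As a concrete, if simplified, stand-in for the backward elimination the abstract describes, the sketch below greedily drops variables using cross-validated nearest-neighbour accuracy as the score; the thesis's actual neighbourhood-based influence measure differs, so treat this only as the shape of the algorithm.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def backward_eliminate(X, y, min_features=1, cv=5):
    """Greedy backward elimination: repeatedly drop the variable whose
    removal best preserves (or improves) a nearest-neighbour score; the
    CV'd kNN accuracy is a placeholder for the thesis's measure."""
    active = list(range(X.shape[1]))
    path = []
    while len(active) > min_features:
        scores = []
        for j in active:
            keep = [k for k in active if k != j]
            s = cross_val_score(KNeighborsClassifier(5), X[:, keep], y, cv=cv).mean()
            scores.append((s, j))
        best_score, drop = max(scores)   # keep the subset scoring best after one removal
        active.remove(drop)
        path.append((drop, best_score))
    return active, path
```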
488

Statistical modeling and statistical learning for disease prediction and classification

Chen, Tianle January 2014 (has links)
This dissertation studies prediction and classification models for disease risk through semiparametric modeling and statistical learning. It consists of three parts. In the first part, we propose several survival models to analyze data from the Cooperative Huntington's Observational Research Trial (COHORT) study, accounting for the missing mutation status of participating relatives (Kieburtz and Huntington Study Group, 1996a). Huntington's disease (HD) is a progressive neurodegenerative disorder caused by an expansion of cytosine-adenine-guanine (CAG) repeats at the IT15 gene. A CAG repeat number greater than or equal to 36 is defined as carrying the mutation, and carriers will eventually show symptoms if not censored by other events. There is an inverse relationship between the age-at-onset of HD and the CAG repeat length: the greater the CAG expansion, the earlier the age-at-onset. Accurate estimation of age-at-onset based on CAG repeat length is important for genetic counseling and the design of clinical trials for HD. Participants in COHORT (denoted as probands) undergo a genetic test and their CAG repeat number is determined. Family members of the probands do not undergo the genetic test, and their HD onset information is provided by the probands. Several methods have been proposed in the literature to model the age-specific cumulative distribution function (CDF) of HD onset as a function of the CAG repeat length. However, none of the existing methods can be directly used to analyze the COHORT proband and family data because family members' mutation status is not always known. In this work, we treat the presence or absence of an expanded CAG repeat in first-degree family members as missing data and use the expectation-maximization (EM) algorithm to carry out maximum likelihood estimation on the COHORT proband and family data jointly. We perform simulation studies to examine the finite-sample performance of the proposed methods and apply them to estimate the CDF of HD age-at-onset from the combined COHORT proband and family data. Our results show a slightly lower estimated cumulative risk of HD with the combined data compared to using proband data alone. We then extend the approach to predict the cumulative risk of disease, accommodating predictors with time-varying effects and outcomes subject to censoring. We model the time-specific effect through a nonparametric varying-coefficient function and handle censoring through self-consistency equations that redistribute the probability mass of censored outcomes to the right. The computational procedure is extremely convenient and can be implemented with standard software. We prove large-sample properties of the proposed estimator and evaluate its finite-sample performance through simulation studies. We apply the method to estimate the cumulative risk of developing HD for the mutation carriers in the COHORT data and illustrate an inverse relationship between the cumulative risk of HD and the length of CAG repeats at the IT15 gene. In the second part of the dissertation, we develop methods to accurately predict whether pre-symptomatic individuals are at risk of a disease based on their marker profiles, which offers an opportunity for early intervention well before definitive clinical diagnosis. For many diseases, the existing clinical literature may suggest that the risk of disease varies with some markers of biological and etiological importance, for example age.
To identify effective prediction rules using nonparametric decision functions, standard statistical learning approaches treat markers with clear biological importance (e.g., age) and other markers without prior knowledge on disease etiology interchangeably as input variables. Therefore, these approaches may be inadequate in singling out and preserving the effects of the biologically important variables, especially in the presence of potential noise markers. Using age as an example of a salient marker that receives special care in the analysis, we propose a local smoothing large-margin classifier implemented with support vector machines to construct effective age-dependent classification rules. The method adaptively adjusts the age effect and tunes age and other markers separately to achieve optimal performance. We derive the asymptotic risk bound of the local smoothing support vector machine and perform extensive simulation studies to compare it with standard approaches. We apply the proposed method to two studies of premanifest HD subjects and controls to construct age-sensitive predictive scores for the risk of HD and the risk of receiving an HD diagnosis during the study period. In the third part of the dissertation, we develop a novel statistical learning method for longitudinal data. Predicting disease risk and progression is one of the main goals in many clinical studies. Cohort studies on the natural history and etiology of chronic diseases span years, and data are collected at multiple visits. Although kernel-based statistical learning methods are proven to be powerful for a wide range of disease prediction problems, these methods are only well studied for independent data, not for longitudinal data. It is thus important to develop time-sensitive prediction rules that make use of the longitudinal nature of the data. We develop a statistical learning method for longitudinal data by introducing subject-specific long-term and short-term latent effects through designed kernels to account for within-subject correlation of longitudinal measurements. Since the presence of multiple sources of data is increasingly common, we embed our method in a multiple kernel learning framework and propose a regularized multiple kernel statistical learning method with random effects to construct effective nonparametric prediction rules. Our method allows easy integration of various heterogeneous data sources and takes advantage of correlation among longitudinal measures to increase prediction power. We use a different kernel for each data source, taking advantage of the distinctive features of each data modality, and then optimally combine data across modalities. We apply the developed methods to two large epidemiological studies, one on Huntington's disease and the other on Alzheimer's disease (the Alzheimer's Disease Neuroimaging Initiative, ADNI), where we explore a unique opportunity to combine imaging and genetic data to predict the conversion from mild cognitive impairment to dementia, and show a substantial gain in performance while accounting for the longitudinal nature of the data.
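
The missing carrier status that motivates the first part's EM approach can be seen in a toy Bayes calculation: an unaffected relative's posterior probability of carrying the expansion, given a carriers' age-at-onset CDF. This is only a sketch; the onset CDF below is hypothetical, the 0.5 prior assumes a first-degree relative under autosomal dominant transmission, and the thesis additionally models the CAG dependence and censoring.

```python
import numpy as np

def carrier_posterior(age, onset_cdf, prior=0.5):
    """P(relative carries the expansion | still unaffected at `age`).
    Non-carriers are assumed never to develop HD, so remaining
    symptom-free is evidence against carrying."""
    surv = 1.0 - onset_cdf(age)          # carrier still symptom-free at this age
    return prior * surv / (prior * surv + (1 - prior))

onset_cdf = lambda a: 1 / (1 + np.exp(-(a - 45) / 5))   # hypothetical onset CDF
print(carrier_posterior(60, onset_cdf))  # ~0.045: long symptom-free survival lowers the posterior
```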
489

On Identifying Rare Variants for Complex Human Traits

Fan, Ruixue January 2015 (has links)
This thesis focuses on developing novel statistical tests for rare-variant association analysis, incorporating both marginal effects and interaction effects among rare variants. Compared with common variants, rare variants have lower minor allele frequencies (typically less than 5%), and hence traditional association tests for common variants lose power on rare variants. Therefore, there is a pressing need for new analytical tools to tackle the problem of rare-variant association with complex human traits. Several collapsing methods have been proposed that aggregate information across rare variants in a region and test them together. They can be divided into burden tests and non-burden tests based on their aggregation strategies. They are all variations of regression-based methods, with the assumption that the phenotype is associated with the genotype via a (linear) regression model. Most of these methods consider only marginal effects of rare variants and fail to take into account gene-gene and gene-environment interaction effects, which are ubiquitous and of utmost importance in biological systems. In this thesis, we propose a summation of partition approach (SPA), a nonparametric strategy for rare-variant association analysis. Extensive simulation studies show that SPA is powerful in detecting not only marginal effects but also gene-gene interaction effects of rare variants. Moreover, extensions of SPA are able to detect gene-environment interactions and other interactions existing in complicated biological systems as well. We also obtain the asymptotic behavior of the marginal SPA score, which guarantees the power of the proposed method. Inspired by the idea of stepwise variable selection, a significance-based backward dropping algorithm (SDA) is proposed to locate truly influential rare variants in a genetic region that has been identified as significant. Unlike traditional backward dropping approaches, which remove the least significant variables first, SDA introduces the idea of eliminating the most significant variable at each round. The removed variables are collected and their effects are evaluated by an influence ratio score, the relative p-value change. Our simulation studies show that SDA is powerful in detecting causal variables and that SDA has a lower false discovery rate than LASSO. We also demonstrate our method on the dataset provided by Genetic Analysis Workshop (GAW) 17, and the results support the superiority of SDA over LASSO. The general partition-retention framework can also be applied to detect gene-environment interaction effects for common variants. We demonstrate this method using the dataset from Genetic Analysis Workshop (GAW) 18. Our nonparametric approach is able to identify many more possible influential gene-environment pairs than traditional linear regression models. We propose in this thesis a "SPA-SDA" two-step approach for rare-variant association analysis at genomic scale: first identify significant regions of moderate size using SPA, and then apply SDA to the identified regions to pinpoint truly influential variables. This approach is computationally efficient for genomic data and has the capacity to detect gene-gene and gene-environment interactions.
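
A compressed sketch of the SDA loop as the abstract outlines it: drop the currently most significant variable each round and score it by the relative change in a region-level p-value. The OLS t- and F-tests below are placeholders I am assuming for the thesis's association statistics, so this shows only the control flow, not the real scoring.

```python
import numpy as np
import statsmodels.api as sm

def region_pvalue(X, y, cols):
    """Overall F-test p-value of regressing y on the given columns."""
    if not cols:
        return 1.0
    return sm.OLS(y, sm.add_constant(X[:, cols])).fit().f_pvalue

def sda(X, y):
    """Significance-based backward dropping (sketch): remove the most
    significant variable each round; its influence ratio is the relative
    change this removal causes in the region-level p-value."""
    active = list(range(X.shape[1]))
    influence = {}
    while active:
        fit = sm.OLS(y, sm.add_constant(X[:, active])).fit()
        p_before = fit.f_pvalue
        j = active[int(np.argmin(fit.pvalues[1:]))]   # most significant variable
        active.remove(j)
        p_after = region_pvalue(X, y, active)
        influence[j] = (p_after - p_before) / max(p_before, 1e-300)
    return influence  # large scores mark variables whose removal hurts the region most
```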
490

Statistical Learning Methods for Personalized Medical Decision Making

Liu, Ying January 2016 (has links)
The theme of my dissertation is merging statistical modeling with medical domain knowledge and machine learning algorithms to assist in making personalized medical decisions. In its simplest form, making personalized medical decisions about treatment choices and disease diagnosis modality choices can be transformed into classification or prediction problems in machine learning, where the optimal decision for an individual is a decision rule that yields the best future clinical outcome or maximizes diagnosis accuracy. However, challenges emerge when analyzing complex medical data. On one hand, statistical modeling is needed to deal with inherent practical complications such as missing data, patients' loss to follow-up, and ethical and resource constraints in randomized controlled clinical trials. On the other hand, new data types and larger scales of data call for innovations combining statistical modeling, domain knowledge, and information technologies. This dissertation contains three parts, addressing the estimation of optimal personalized rules for choosing treatments, the estimation of optimal individualized rules for choosing disease diagnosis modalities, and methods for variable selection in the presence of missing data. In the first part of this dissertation, we propose a method to find optimal dynamic treatment regimens (DTRs) in Sequential Multiple Assignment Randomized Trial (SMART) data. DTRs are sequential decision rules tailored at each stage of treatment by potentially time-varying patient features and intermediate outcomes observed in previous stages. The complexity, patient heterogeneity, and chronicity of many diseases and disorders call for learning optimal DTRs that best dynamically tailor treatment to each individual's response over time. We propose a robust and efficient approach, referred to as Augmented Multistage Outcome-Weighted Learning (AMOL), to identify optimal DTRs from sequential multiple assignment randomized trials. We improve outcome-weighted learning (Zhao et al. 2012) to allow for negative outcomes; we propose methods to reduce the variability of weights to achieve numeric stability and higher efficiency; and finally, for multiple-stage trials, we introduce robust augmentation to improve efficiency by drawing information from Q-function regression models at each stage. The proposed AMOL remains valid even if the regression model is misspecified. We formally justify that a proper choice of augmentation guarantees smaller stochastic errors in value function estimation for AMOL, and we then establish the convergence rates for AMOL. The comparative advantage of AMOL over existing methods is demonstrated in extensive simulation studies and in applications to two SMART data sets: a two-stage trial for attention deficit hyperactivity disorder and the STAR*D trial for major depressive disorder. The second part of the dissertation introduces a machine learning algorithm to estimate personalized decision rules for medical diagnosis/screening that maximize a weighted combination of sensitivity and specificity. Using subject-specific risk factors and feature variables, such rules administer screening tests with balanced sensitivity and specificity, and thus protect low-risk subjects from the unnecessary pain and stress caused by false positive tests, while achieving high sensitivity for subjects at high risk.
We conducted a simulation study mimicking a real breast cancer study and found significant improvements in sensitivity and specificity when comparing our personalized screening strategy (assigning mammography+MRI to high-risk patients and mammography alone to low-risk subjects based on a composite score of their risk factors) with a one-size-fits-all strategy (assigning mammography+MRI or mammography alone to all subjects). When applied to Parkinson's disease (PD) FDG-PET and fMRI data, we showed that the method provides individualized modality selection that can improve AUC, and that it can provide interpretable decision rules for choosing a brain imaging modality for early detection of PD. To the best of our knowledge, this is the first proposal in the literature of automatic, data-driven learning methods for personalized diagnosis/screening strategies. In the last part of the dissertation, we propose a method, Multiple Imputation Random Lasso (MIRL), to select important variables and to predict the outcome for an epidemiological study of Eating and Activity in Teens in the presence of missing data. In this study, 80% of individuals have at least one variable missing. Therefore, using variable selection methods developed for complete data after list-wise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and it has a greater advantage when the correlation among variables is high and the missing proportion is high. MIRL shows improved performance compared with other applicable methods when applied to the study of Eating and Activity in Teens for boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
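
Two of the dissertation's components lend themselves to short sketches. First, the outcome-weighted learning idea that AMOL builds on reduces, in its simplest single-stage randomized-trial form, to weighted classification: predict the received treatment with weights proportional to outcome over propensity. The min-shift of outcomes echoes the abstract's fix for negative outcomes; everything else (augmentation, multi-stage backward induction) is omitted, and the names here are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def owl_rule(X, A, Y, propensity=0.5):
    """Single-stage outcome-weighted learning sketch: a rule f maximizing
    E[Y * 1{A == f(X)} / propensity] is found by weighted classification
    of the received treatment A with weights Y / propensity."""
    w = (Y - Y.min()) / propensity       # shift so weights are nonnegative
    # .predict(x) on the fitted classifier returns the recommended treatment
    return SVC(kernel="rbf").fit(X, A, sample_weight=w)
```

Second, a compressed sketch of the MIRL recipe (multiple imputation, bootstrap lasso, stability-style selection frequencies); the imputation model, weighting, and final threshold in the actual method follow the dissertation, and the function names are mine.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LassoCV

def mirl_selection_frequency(X, y, n_imputations=5, n_boot=50, seed=0):
    """Rank variables by how often the lasso keeps them across
    multiply-imputed, bootstrap-resampled versions of the data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for m in range(n_imputations):
        Xc = IterativeImputer(random_state=m).fit_transform(X)  # one completed data set
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)                         # bootstrap resample
            freq += LassoCV(cv=5).fit(Xc[idx], y[idx]).coef_ != 0
    return freq / (n_imputations * n_boot)
```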
