About
The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Hypothesis Testing for High-Dimensional Regression Under Extreme Phenotype Sampling of Continuous Traits

Xu, Chao January 2018 (has links)
Extreme phenotype sampling (EPS) is a broadly used design for identifying candidate genetic factors contributing to the variation of quantitative traits. By enriching the signal in samples drawn from the top and bottom phenotype percentiles, EPS can boost study power compared with random sampling of the same size. Existing statistical methods for EPS data test the variants/regions individually. However, many disorders are caused by multiple genetic factors, so it is critical to model the effects of genetic factors simultaneously, which may increase the power of current genetic studies and identify novel disease-associated genetic factors under EPS. The challenge of such simultaneous analysis is that the number of genetic factors (p ~ 10,000) is typically greater than the sample size (n ~ 1,000) in a single study. The standard linear model is inappropriate for this p > n problem because the design matrix is rank deficient. An alternative is a penalized regression method, the least absolute shrinkage and selection operator (LASSO), which handles the high-dimensional (p > n) problem by forcing certain regression coefficients to be exactly zero. Although the application of LASSO in genetic studies under random sampling has been widely studied, its statistical inference and testing under EPS remain unexplored. We propose a novel sparse model (EPS-LASSO) with a hypothesis test for high-dimensional regression under EPS, based on a decorrelated score function, to investigate genetic associations, including gene expression and rare variant analyses. Comprehensive simulations show that EPS-LASSO outperforms existing methods, with superior power when the effects are large and with stable type I error and FDR control. Together with a real-data analysis from a genetic study of obesity, our results indicate that EPS-LASSO is an effective method for EPS data analysis that can account for correlated predictors. / Chao Xu
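The following is a minimal sketch of the two ingredients described in this abstract, an extreme phenotype sampling step and a LASSO fit for a p > n design, using scikit-learn's Lasso on simulated genotypes. The simulated data, the 10% tail cutoff, and the penalty value are illustrative assumptions; the EPS-LASSO decorrelated-score test itself is not shown.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated genotypes: n = 1,000 subjects, p = 10,000 variants (p > n).
n, p = 1000, 10_000
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)

# Quantitative trait driven by a handful of causal variants plus noise.
beta = np.zeros(p)
beta[:5] = 0.5
y = X @ beta + rng.normal(size=n)

# Extreme phenotype sampling: keep only the top and bottom 10% of the trait.
lo, hi = np.quantile(y, [0.10, 0.90])
extreme = (y <= lo) | (y >= hi)
X_eps, y_eps = X[extreme], y[extreme]

# LASSO handles p > n by shrinking most coefficients exactly to zero.
model = Lasso(alpha=0.1, max_iter=10_000).fit(X_eps, y_eps)
selected = np.flatnonzero(model.coef_)
print(f"{extreme.sum()} extreme samples, {selected.size} variants selected")
```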
2

Regression on Manifolds with Implications for System Identification

Ohlsson, Henrik January 2008 (has links)
The trend today is to use many inexpensive sensors instead of a few expensive ones, since the same accuracy can generally be obtained by fusing several dependent measurements. It also follows that the robustness against failing sensors is improved. As a result, the need for high-dimensional regression techniques is increasing. As measurements are dependent, the regressors will be constrained to some manifold. There is then a representation of the regressors, of the same dimension as the manifold, containing all predictive information. Since the manifold is commonly unknown, this representation has to be estimated from data. For this, manifold learning can be utilized. Having found a representation of the manifold-constrained regressors, this low-dimensional representation can be used in an ordinary regression algorithm to find a prediction of the output. This has been further developed in the Weight Determination by Manifold Regularization (WDMR) approach. In most regression problems, prior information can improve prediction results. This is also true for high-dimensional regression problems. Research on including physical prior knowledge in high-dimensional regression, i.e., gray-box high-dimensional regression, has been rather limited, however. We explore the possibilities of including prior knowledge in high-dimensional manifold-constrained regression by means of regularization. The result will be called gray-box WDMR. In gray-box WDMR we have the possibility to restrict ourselves to predictions which are physically plausible. This is done by incorporating dynamical models for how the regressors evolve on the manifold. / MOVIII
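A minimal sketch of the generic pipeline this abstract describes, not of WDMR itself: learn a low-dimensional representation of manifold-constrained regressors with a manifold-learning method, then run an ordinary regression on those coordinates. The S-curve data, the choice of Isomap, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Regressors that live on a two-dimensional manifold embedded in 3-D space,
# with a response that varies along the manifold.
X, t = make_s_curve(n_samples=1500, noise=0.05, random_state=0)
y = t + 0.1 * np.random.default_rng(0).normal(size=t.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: estimate a representation with the same dimension as the manifold.
embed = Isomap(n_neighbors=10, n_components=2)
Z_train = embed.fit_transform(X_train)
Z_test = embed.transform(X_test)

# Step 2: use the low-dimensional representation in an ordinary regression.
reg = LinearRegression().fit(Z_train, y_train)
print("held-out R^2 on manifold coordinates:", round(reg.score(Z_test, y_test), 3))
```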
3

Methodology for Estimation and Model Selection in High-Dimensional Regression with Endogeneity

Du, Fan 05 May 2023 (has links)
No description available.
4

Topics in Modern Bayesian Computation

Qamar, Shaan January 2015 (has links)
Collections of large volumes of rich and complex data have become ubiquitous in recent years, posing new challenges in methodological and theoretical statistics alike. Today, statisticians are tasked with developing flexible methods capable of adapting to the degree of complexity and noise in increasingly rich data gathered across a variety of disciplines and settings. This has spurred the need for novel multivariate regression techniques that can efficiently capture a wide range of naturally occurring predictor-response relations, identify important predictors and their interactions, and do so even when the number of predictors is large but the sample size remains limited. Meanwhile, efficient model-fitting tools must evolve quickly to keep pace with the rapidly growing dimension and complexity of the data they are applied to. Aided by the tremendous success of modern computing, Bayesian methods have gained great popularity in recent years. These methods provide a natural probabilistic characterization of uncertainty in the parameters and in predictions. In addition, they provide a practical way of encoding model structure that can lead to large gains in statistical estimation and more interpretable results. However, this flexibility is often hindered in applications to modern data, which are increasingly high dimensional both in the number of observations n and the number of predictors p. Here, computational complexity and the curse of dimensionality typically render posterior computation inefficient. In particular, Markov chain Monte Carlo (MCMC) methods, which remain the workhorse for Bayesian computation (owing to their generality and asymptotic accuracy guarantees), typically suffer data-processing and computational bottlenecks as a consequence of (i) the need to hold the entire dataset (or available sufficient statistics) in memory at once; and (ii) having to evaluate the (often expensive to compute) data likelihood at each sampling iteration. This thesis divides into two parts. The first part concerns itself with developing efficient MCMC methods for posterior computation in the high-dimensional large-n, large-p setting. In particular, we develop an efficient and widely applicable approximate inference algorithm that extends MCMC to the online data setting, and separately propose a novel stochastic search sampling scheme for variable selection in high-dimensional predictor settings. The second part of this thesis develops novel methods for structured sparsity in the high-dimensional large-p, small-n regression setting. Here, statistical methods should scale well with the predictor dimension and be able to efficiently identify low-dimensional structure so as to facilitate optimal statistical estimation in the presence of limited data. Importantly, these methods must be flexible enough to accommodate potentially complex relationships between the response and its associated explanatory variables. The first work proposes a nonparametric additive Gaussian process model to learn predictor-response relations that may be highly nonlinear and include numerous lower-order interaction effects, possibly in different parts of the predictor space. A second work proposes a novel class of Bayesian shrinkage priors for multivariate regression with a tensor-valued predictor. Dimension reduction is achieved using a low-rank additive decomposition for the latter, enabling a highly flexible and rich structure within which excellent cell estimation and region selection may be obtained through state-of-the-art shrinkage methods. In addition, the methods developed in these works come with strong theoretical guarantees. / Dissertation
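As a concrete illustration of the per-iteration likelihood bottleneck mentioned above, here is a minimal random-walk Metropolis sampler for a toy Bayesian linear regression; every iteration touches the full dataset in the log-posterior call. The data, prior, and step size are illustrative assumptions, not methods from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data for a Bayesian linear regression with a standard normal prior on beta.
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

def log_post(beta, sigma=1.0):
    # Full-data log likelihood plus log prior: every call touches the whole
    # dataset, which is exactly the per-iteration bottleneck noted above.
    resid = y - X @ beta
    return -0.5 * np.sum(resid**2) / sigma**2 - 0.5 * np.sum(beta**2)

# Random-walk Metropolis: generic and asymptotically exact, but one full
# likelihood evaluation per proposal.
beta = np.zeros(p)
current = log_post(beta)
samples = []
for _ in range(5000):
    proposal = beta + 0.05 * rng.normal(size=p)
    cand = log_post(proposal)
    if np.log(rng.uniform()) < cand - current:
        beta, current = proposal, cand
    samples.append(beta.copy())

print("posterior mean:", np.mean(samples[1000:], axis=0).round(2))
```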
5

Forecasting the Business Cycle using Partial Least Squares / Prediktion av ekonomiska cykler med hjälp av partiella minsta kvadratmetoden

Lannsjö, Fredrik January 2014 (has links)
Partial Least Squares (PLS) is both a regression method and a tool for variable selection that is especially appropriate for models based on numerous (possibly correlated) variables. While PLS is a well-established modeling tool in chemometrics, this thesis adapts it to financial data to predict the movements of the business cycle, represented by the OECD Composite Leading Indicators. High-dimensional data are used, and a model with automated variable selection through a genetic algorithm is developed to forecast different economic regions with good results in out-of-sample tests.
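A minimal sketch of the core PLS regression step on synthetic, factor-driven data with many correlated predictors, using scikit-learn's PLSRegression and a time-ordered out-of-sample split. The data, the number of components, and the split point are illustrative assumptions, and the genetic-algorithm variable selection from the thesis is not shown.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for many correlated indicators driven by a few latent factors.
n, p = 240, 120                    # e.g. 20 years of monthly observations
factors = rng.normal(size=(n, 5))
X = factors @ rng.normal(size=(5, p)) + 0.5 * rng.normal(size=(n, p))
y = factors @ np.array([1.0, -0.5, 0.3, 0.0, 0.0]) + 0.2 * rng.normal(size=n)

# PLS projects the correlated predictors onto a few components chosen for
# their covariance with the response, then regresses on those components.
cut = 200                          # time-ordered out-of-sample split
pls = PLSRegression(n_components=3).fit(X[:cut], y[:cut])
pred = pls.predict(X[cut:]).ravel()
print("out-of-sample R^2:", round(r2_score(y[cut:], pred), 3))
```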
6

Quelques contributions à la sélection de variables et aux tests non-paramétriques / A few contributions to variable selection and nonparametric tests

Comminges, Laëtitia 12 December 2012 (has links)
Real-world data are often extremely high-dimensional, severely underconstrained, and interspersed with a large number of irrelevant or redundant features. Relevant variable selection is a compelling approach for addressing statistical issues in the scenario of high-dimensional, noisy data with small sample size. First, we address the issue of variable selection in the regression model when the number of variables is very large. The main focus is on the situation where the number of relevant variables is much smaller than the ambient dimension. Without assuming any parametric form for the underlying regression function, we obtain tight conditions making it possible to consistently estimate the set of relevant variables; these conditions relate the intrinsic dimension to the ambient dimension and the sample size. Second, we consider the problem of testing a particular type of composite null hypothesis under a nonparametric multivariate regression model. For a given quadratic functional Q, the null hypothesis states that the regression function f satisfies the constraint Q[f] = 0, while the alternative corresponds to the functions for which |Q[f]| is bounded away from zero. We provide minimax rates of testing and the exact separation constants, along with a sharp-optimal testing procedure, for diagonal and nonnegative quadratic functionals. These results can be used to test the relevance of one or several explanatory variables. The study of minimax rates for quadratic functionals which are neither positive nor negative reveals two different regimes: a "regular" regime and an "irregular" regime. We apply this to testing the equality of the norms of two functions observed in noisy environments.
7

Semiparametric and Nonparametric Methods for Complex Data

Kim, Byung-Jun 26 June 2020 (has links)
A variety of complex data has become common in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technology, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary disease outcomes and covariates measured with error within a certain period by stratifying on subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are needed to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes. Because of this great diversity, we encounter three problems in analyzing such complex data in this dissertation, and we provide several contributions to semiparametric and nonparametric methods for dealing with them: the first is to propose a method for testing the significance of a functional association under the matched study; the second is to develop a method to simultaneously identify important variables and build a network in HCHD data; the third is to propose a multi-class dynamic model for recognizing patterns in time-trend analysis. For the first topic, we propose a semiparametric omnibus test for the significance of a functional association between clustered binary outcomes and covariates with measurement error, taking into account the effect modification of matching covariates. We develop a flexible omnibus test that does not require a specific alternative form of the hypothesis. The advantages of our omnibus test are demonstrated through simulation studies and 1-4 bidirectional matched data analyses from an epidemiology study. For the second topic, we propose a joint semiparametric kernel machine network approach that connects variable selection and network estimation. Our approach is a unified and integrated method that can simultaneously identify important variables and build a network among them. We develop our approach under a semiparametric kernel machine regression framework, which allows for the possibility that each variable might act nonlinearly and is likely to interact with the others in a complicated way. We demonstrate our approach using simulation studies and a real application to genetic pathway analysis. Lastly, for the third project, we propose a Bayesian focal-area detection method for a multi-class dynamic model under a Bayesian hierarchical framework. Two-step Bayesian sequential procedures are developed to estimate patterns and detect focal intervals, which can be used for gas chromatography. We demonstrate the performance of our proposed method using a simulation study and a real application to gas chromatography data from a Fast Odor Chromatographic Sniffer (FOX) system. / Doctor of Philosophy / A variety of complex data has become common in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technology, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary disease outcomes and covariates measured with error within a certain period by stratifying on subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are needed to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes. Because of this great diversity, we encounter three problems in analyzing three types of data: (1) matched case-crossover data, (2) HCHD data, and (3) time series data. We contribute to the development of statistical methods to deal with such complex data. First, under the matched study, we discuss hypothesis testing to effectively determine the association between observed factors and the risk of a disease of interest. Because, in practice, we do not know the specific form of the association, it can be challenging to set a specific alternative hypothesis. To reflect reality, we consider the possibility that some observations are measured with error, and with these measurement errors in mind we develop a testing procedure under the matched case-crossover framework. This testing procedure has the flexibility to make inferences under various hypothesis settings. Second, we consider data where the number of variables is very large compared to the sample size and the variables are correlated with each other. In this case, our goal is to identify variables that are important for the outcome among a large number of variables and to build their network. For example, identifying the few genes in the whole genome that are associated with diabetes can be used to develop biomarkers. With our proposed approach in the second project, we can identify differentially expressed and important genes and their network structure while accounting for the outcome. Lastly, we consider the scenario of patterns of interest changing over time, with application to gas chromatography. We propose an efficient detection method to effectively distinguish the patterns of multi-level subjects in time-trend analysis. Our proposed method can give valuable information for an efficient search for distinguishable patterns, reducing the burden of examining all observations in the data.
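A minimal sketch of the kernel machine regression idea underlying the second topic, using kernel ridge regression with an RBF kernel as a stand-in; this is not the authors' joint variable-selection and network method, and the simulated data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data with a nonlinear main effect and an interaction between two variables.
n, p = 300, 10
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.3 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An RBF kernel machine captures nonlinearities and interactions without
# specifying their functional form in advance.
km = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1).fit(X_tr, y_tr)
print("held-out R^2:", round(km.score(X_te, y_te), 3))
```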
