41 |
Výběr modelu na základě penalizované věrohodnosti / Variable selection based on penalized likelihood
Chlubnová, Tereza (January 2016)
Selection of variables and estimation of regression coefficients in datasets where the number of variables exceeds the number of observations is a frequently discussed topic in modern statistics. Today, this problem is commonly solved by the maximum penalized likelihood method, with an appropriately selected function of the parameter as the penalty. The penalty should evaluate the benefit of the variable and, where appropriate, shrink the respective regression coefficient or set it to zero. The SCAD and LASSO penalty functions are popular for their ability to choose appropriate regressors and estimate the model parameters at the same time. This thesis presents an overview of up-to-date results on the properties of estimates obtained by these two methods, for both a small number of regressors and multidimensional datasets, in a normal linear model. Because the amount of penalty, and therefore also the choice of model, is heavily influenced by the tuning parameter, this thesis further discusses its selection. The behavior of the LASSO and SCAD penalty functions for different values of the tuning parameter, and for different ways of selecting it, is tested with various numbers of regressors on simulated datasets.
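To illustrate how the two penalties shrink or zero out a coefficient, here is a minimal sketch of the LASSO and SCAD thresholding rules. It assumes an orthonormal design, so each coefficient can be treated separately, and it uses the conventional SCAD constant a = 3.7; neither assumption is taken from the thesis. The function names and the tuning parameter name `lam` are illustrative.

```python
def soft_threshold(z, lam):
    """LASSO estimate of one coefficient from its least-squares value z
    (assumes an orthonormal design)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def scad_threshold(z, lam, a=3.7):
    """SCAD estimate of one coefficient from its least-squares value z
    (assumes an orthonormal design and a > 2)."""
    sign = 1.0 if z >= 0 else -1.0
    if abs(z) <= 2 * lam:
        # Small coefficients: same behavior as the LASSO.
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        # Intermediate coefficients: the shrinkage is gradually relaxed.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large coefficients are left unpenalized.
    return z
```

Both rules set small coefficients exactly to zero, which is what performs the variable selection. The difference is that the LASSO subtracts `lam` from every surviving coefficient, while SCAD leaves coefficients larger than `a * lam` untouched.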
|
42 |
Parametric, Nonparametric and Semiparametric Approaches in Profile Monitoring of Poisson Data
Piri, Sepehr (01 January 2017)
Profile monitoring is a relatively new approach in quality control, best used when the process data follow a profile (or curve). The majority of previous studies in profile monitoring focused on the parametric modeling of either linear or nonlinear profiles under the assumption of correct model specification. Our work considers those cases where the parametric model for the family of profiles is unknown or at least uncertain. Consequently, we consider monitoring Poisson profiles via three methods: a nonparametric (NP) method using penalized splines, an NP method using wavelets, and a semiparametric (SP) procedure that combines both parametric and NP profile fits. Our simulation results show that the SP method is robust to the common problem of misspecification of the user's proposed parametric model. We also show that Haar wavelets are a better choice than penalized splines in situations where a sudden jump happens or the jump is edgy.
In addition, we show that penalized splines are better than wavelets when the shape of the profiles is smooth. The proposed novel techniques have been applied to a real data set and compared with some state-of-the-art methods.
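The remark about sudden jumps can be illustrated with one level of the Haar transform. This is a minimal sketch on a made-up, noise-free profile; the Poisson setting and the monitoring statistics of the dissertation are not modeled here, and the names `haar_step` and `profile` are hypothetical.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform (even-length input)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]  # local averages
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]  # local differences
    return approx, detail


# Hypothetical profile with a single sudden jump between positions 2 and 3.
profile = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0]
approx, detail = haar_step(profile)
```

Only the detail coefficient of the pair that straddles the jump is nonzero. A piecewise-constant basis therefore represents an abrupt change with very few coefficients, whereas a smooth spline basis has to bend around it.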
|
43 |
Ensemble Learning Method on Machine Maintenance Data
Zhao, Xiaochuang (05 November 2015)
Many companies in industry are facing an explosion of big data. With so much information stored, companies want to make sense of the data and use it to support better decision making, especially for future prediction. Big data can save a great deal of money and generate substantial revenue. When building statistical learning models for prediction, companies aim for models that are both efficient and highly accurate. After the learning models have been deployed to production, new data will be generated, and the models have to be updated along with the data. For this reason, the model that performs best today will not necessarily perform best tomorrow, so it is very hard to decide which algorithm should be used to build the learning model. This paper introduces a new method that combines the information generated by two different statistical learning classification algorithms and uses it as input to another learning model to increase the final predictive power.
The dataset used in this paper is NASA's Turbofan Engine Degradation data. There are 49 numeric features (X), and the response Y is binary, with 0 indicating that the engine is working properly and 1 indicating engine failure. The model's purpose is to predict whether the engine is going to pass or fail. The dataset is divided into a training set and a testing set. First, the training set is used twice, to build a support vector machine (SVM) model and a neural network model. Second, the trained SVM and neural network models take X of the training set as input to predict Y1 and Y2. Then, Y1 and Y2 are used as inputs to build the penalized logistic regression model, which is the ensemble model here. Finally, the testing set is passed through the same steps to obtain the final prediction result. Model accuracy is measured by overall classification accuracy. The results show that the ensemble model has 92% accuracy. The prediction accuracies of the SVM, neural network and ensemble models are compared to show that the ensemble model successfully captures the power of the two individual learning models.
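The final stage of this pipeline, where the base predictions Y1 and Y2 become the inputs of a penalized logistic regression, can be sketched as follows. This is a minimal sketch with several assumptions: the base-model outputs are made-up toy values rather than results from the Turbofan data, the penalty is taken to be an L2 (ridge) penalty because the abstract does not name one, and the model is fitted by plain gradient descent. All function and variable names are illustrative.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_penalized_logistic(features, labels, lam=0.1, lr=0.5, n_iter=2000):
    """Gradient descent on the mean logistic loss plus (lam / 2) * ||w||^2.
    The intercept b is not penalized. The L2 penalty is an assumption."""
    n, p = len(features), len(features[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - y
            for j in range(p):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0


# Hypothetical base-model predictions on the training set:
# y1 stands in for the SVM output, y2 for the neural network output.
y1 = [0, 0, 1, 1, 0, 1, 1, 0]
y2 = [0, 1, 1, 1, 0, 0, 1, 0]
y = [0, 0, 1, 1, 0, 1, 1, 0]  # true labels

meta_features = list(zip(y1, y2))
w, b = fit_penalized_logistic(meta_features, y)
ensemble_pred = [predict(w, b, x) for x in meta_features]
```

In this toy data the first base model agrees with the labels on every case, so the meta-model learns to weight it more heavily than the second. For the testing set, the same fitted `w` and `b` would be applied to the base predictions made on the test features.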
|
44 |
New approaches to identify gene-by-gene interactions in genome wide association studies
Lu, Chen (22 January 2016)
Genetic variants identified to date by genome-wide association studies only explain a small fraction of total heritability. Gene-by-gene interaction is one important potential source of unexplained heritability. In the first part of this dissertation, a novel approach to detect such interactions is proposed. This approach utilizes penalized regression and sparse estimation principles, and incorporates outside biological knowledge through a network-based penalty. The method is tested on simulated data under various scenarios. Simulations show that with reasonable outside biological knowledge, the new method performs noticeably better than current stage-wise strategies in finding true interactions, especially when the marginal strength of main effects is weak.
The proposed method is designed for single-cohort analyses. However, it is generally acknowledged that only multi-cohort analyses have sufficient power to uncover genes and gene-by-gene interactions with moderate effects on traits, such as those that likely underlie complex diseases. Multi-cohort, meta-analysis approaches for penalized regressions are developed and investigated in the second part of this dissertation. Specifically, I propose two different ways of utilizing data-splitting principles in multi-cohort settings and develop three procedures to conduct meta-analysis. Using the method developed in the first part of this dissertation as an example of penalized regression, the three proposed meta-analysis procedures are compared to mega-analysis in a simulation study. The results suggest that the best approach is to split the participating cohorts into two groups, to perform variable selection for each cohort in the first group, to fit a regular regression model on the union of selected variables for each cohort in the second group, and lastly to conduct a meta-analysis across cohorts in the second group.
In the last part of this dissertation, the novel method developed in the first part is applied to the Framingham Heart Study measures on total plasma Immunoglobulin E (IgE) concentrations, C-reactive protein levels, and Fasting Glucose. The effect of incorporating various sources of biological information on the ability to detect gene-gene interaction is explored. For IgE, for example, a number of potentially interesting interactions are identified. Some of these interactions involve pairs in human leukocyte antigen genes, which encode proteins that are the key regulators of the immune response. The remaining interactions are among genes previously found to be associated with IgE as main effects. Identification of these interactions may provide new insights into the genetic basis and mechanisms of atopic diseases.
|
45 |
Complexity penalized methods for structured and unstructured data
Goeva, Aleksandrina (08 November 2017)
A fundamental goal of statisticians is to make inferences from the sample about characteristics of the underlying population. This is an inverse problem, since we are trying to recover a feature of the input with the availability of observations on an output. Towards this end, we consider complexity penalized methods, because they balance goodness of fit and generalizability of the solution. The data from the underlying population may come in diverse formats - structured or unstructured - such as probability distributions, text tokens, or graph characteristics. Depending on the defining features of the problem, we can choose the appropriate complexity penalized approach and assess the quality of the estimate it produces. Favorable characteristics are strong theoretical guarantees of closeness to the true value and interpretability. Our work fits within this framework and spans the areas of simulation optimization, text mining and network inference. The first problem we consider is model calibration under the assumption that, given a hypothesized input model, we can use stochastic simulation to obtain its corresponding output observations. We formulate it as a stochastic program by maximizing the entropy of the input distribution subject to moment matching. We then propose an iterative scheme via simulation to approximately solve it. We prove convergence of the proposed algorithm under appropriate conditions and demonstrate the performance via numerical studies. The second problem we consider is summarizing text documents through an inferred set of topics. We propose a frequentist reformulation of a Bayesian regularization scheme. Through our complexity-penalized perspective we lend further insight into the nature of the loss function and the regularization achieved through the priors in the Bayesian formulation. The third problem is concerned with the impact of sampling on the degree distribution of a network.
Under many sampling designs, we have a linear inverse problem characterized by an ill-conditioned matrix. We investigate the theoretical properties of an approximate solution for the degree distribution found by regularizing the solution of the ill-conditioned least squares objective. Particularly, we study the rate at which the penalized solution tends to the true value as a function of network size and sampling rate.
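The effect of regularizing an ill-conditioned least squares problem can be seen on a tiny example. This is a minimal sketch assuming a ridge (squared L2) penalty and a made-up 2x2 system; the abstract specifies neither the penalty nor the matrix, and the names `solve_2x2` and `ridge_solution` are hypothetical.

```python
def solve_2x2(m, v):
    """Solve the 2x2 linear system m x = v by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    x0 = (v[0] * m[1][1] - m[0][1] * v[1]) / det
    x1 = (m[0][0] * v[1] - m[1][0] * v[0]) / det
    return [x0, x1]


def ridge_solution(a, b, lam):
    """Minimize ||a x - b||^2 + lam * ||x||^2 for a 2x2 matrix a
    by solving the normal equations (a^T a + lam I) x = a^T b."""
    ata = [[sum(a[k][i] * a[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]
    atb = [sum(a[k][i] * b[k] for k in range(2)) for i in range(2)]
    ata[0][0] += lam
    ata[1][1] += lam
    return solve_2x2(ata, atb)


# Nearly singular matrix; the noise-free right-hand side for x = (1, 1)
# would be (2.0, 2.0001). The second entry is perturbed by 0.001.
a = [[1.0, 1.0], [1.0, 1.0001]]
b_noisy = [2.0, 2.0011]

naive = solve_2x2(a, b_noisy)                  # roughly (-9, 11)
ridge = ridge_solution(a, b_noisy, lam=0.01)   # both entries close to 1
```

A perturbation of 0.001 in the data moves the unregularized solution from (1, 1) to roughly (-9, 11), while the penalized solution stays close to the true value. The price is a small bias that depends on the choice of `lam`.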
|
46 |
Generalized Minimum Penalized Hellinger Distance Estimation and Generalized Penalized Hellinger Deviance Testing for Generalized Linear Models: The Discrete Case
Yan, Huey (01 May 2001)
In this dissertation, robust and efficient alternatives to quasi-likelihood estimation and likelihood ratio tests are developed for discrete generalized linear models. The estimation method considered is a penalized minimum Hellinger distance procedure that generalizes a procedure developed by Harris and Basu for estimating parameters of a single discrete probability distribution from a random sample. A bootstrap algorithm is proposed to select the weight of the penalty term. Simulations are carried out to compare the new estimators with quasi-likelihood estimation. The robustness of the estimation procedure is demonstrated by simulation work and by Hampel's α-influence curve. Penalized minimum Hellinger deviance tests for goodness-of-fit and for testing nested linear hypotheses are proposed and simulated. A nonparametric bootstrap algorithm is proposed to obtain critical values for the testing procedure.
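For the single-distribution case that the dissertation generalizes, the idea of a penalized Hellinger distance can be sketched as below. This is a minimal sketch under stated assumptions: cells with no observations contribute the model probability multiplied by a penalty weight `h`, instead of the usual squared-root difference, and any constant scaling factor is omitted. The exact form and the bootstrap choice of the weight used in the dissertation may differ; the function name and the argument names are illustrative.

```python
import math


def penalized_hellinger(d, f, h=0.5):
    """Penalized squared Hellinger distance between empirical proportions d
    and model probabilities f on the same finite support.

    Empty cells (d[x] == 0) contribute h * f[x]; with h = 1 this equals
    the ordinary sum of (sqrt(d[x]) - sqrt(f[x]))**2 over all cells."""
    total = 0.0
    for dx, fx in zip(d, f):
        if dx > 0:
            total += (math.sqrt(dx) - math.sqrt(fx)) ** 2
        else:
            total += h * fx
    return total


# Hypothetical example: the third cell was never observed.
d = [0.5, 0.5, 0.0]
f = [0.25, 0.25, 0.5]
ordinary = penalized_hellinger(d, f, h=1.0)
penalized = penalized_hellinger(d, f, h=0.5)
```

Choosing `h` below 1 reduces the influence of empty cells on the distance. An estimator would then minimize this quantity over the parameters that determine `f`.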
|
47 |
Penalized spline modeling of the ex-vivo assays dose-response curves and the HIV-infected patients' bodyweight change
Sarwat, Samiha (05 June 2015)
Indiana University-Purdue University Indianapolis (IUPUI) / A semi-parametric approach incorporates parametric and nonparametric functions in the model and is very useful in situations where a fully parametric model is inadequate. The objective of this dissertation is to extend statistical methodology employing the semi-parametric modeling approach to analyze data in health science research areas. This dissertation has three parts. The first part discusses the modeling of the dose-response relationship with correlated data by introducing overall drug effects in addition to the deviation of each subject-specific curve from the population average. Here, a penalized spline regression method that allows modeling of the smooth dose-response relationship is applied to data from studies monitoring malaria drug resistance through ex-vivo assays. The second part of the dissertation extends the SiZer map, a powerful exploratory visualization tool, to detect underlying significant features (increase, decrease, or no change) of the curve at various smoothing levels. Here, Penalized Spline Significant Zero Crossings of Derivatives (PS-SiZer), using a penalized spline regression, is introduced to investigate significant features in correlated data arising from longitudinal settings. The third part of the dissertation applies the proposed PS-SiZer methodology to analyze HIV data. The durability of significant weight change over a period is explored from the PS-SiZer visualization. PS-SiZer is a graphical tool for exploring structures in curves by mapping areas where the rate of change is significantly increasing, significantly decreasing, or not changing. PS-SiZer maps provide information about the significant rate of weight change that occurs in two ART regimens at various levels of smoothing. A penalized spline regression model at an optimum smoothing level is applied to estimate the first time point at which weight no longer increases for different treatment regimens.
|
48 |
Two-Stage SCAD Lasso for Linear Mixed Model Selection
Yousef, Mohammed A. (07 August 2019)
No description available.
|
49 |
Sequential Change-point Detection in Linear Regression and Linear Quantile Regression Models Under High Dimensionality
Ratnasingam, Suthakaran (06 August 2020)
No description available.
|
50 |
Robust Approaches for Matrix-Valued Parameters
Jing, Naimin (January 2021)
Modern large data sets inevitably contain outliers that deviate from the model assumptions. However, many widely used estimators, such as maximum likelihood estimators and least squares estimators, perform poorly in the presence of outliers. At the same time, many statistical modeling approaches have matrices as their parameters. We consider penalized estimators for matrix-valued parameters with a focus on their robustness properties in the presence of outliers. We propose a general framework for robust modeling with matrix-valued parameters by minimizing robust loss functions with penalization. However, there are challenges to this approach in both computation and theoretical analysis. To tackle the computational challenges arising from the large size of the data, the non-smoothness of robust loss functions, and the slow speed of matrix operations, we propose to apply the Frank-Wolfe algorithm, a first-order algorithm for optimization on a restricted region with a low computational burden per iteration. Theoretically, we establish finite-sample error bounds under high-dimensional settings. We show that the estimation errors are bounded by small terms and converge in probability to zero under mild conditions in a neighborhood of the true model. Our method accommodates a broad class of modeling problems using robust loss functions with penalization. Concretely, we study three cases: matrix completion, multivariate regression, and network estimation. For all cases, we illustrate the robustness of the proposed method both theoretically and numerically. / Statistics
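The structure of a Frank-Wolfe iteration can be sketched on a much simpler problem than the matrix-valued ones studied in the thesis. This is a minimal sketch that assumes a vector parameter, a Huber loss as the robust loss, an L1 ball as the restricted region, and the standard step size 2 / (k + 2); none of these choices is taken from the thesis, and all names are illustrative.

```python
def huber(t, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails."""
    return 0.5 * t * t if abs(t) <= delta else delta * (abs(t) - 0.5 * delta)


def huber_grad(t, delta=1.0):
    """Derivative of the Huber loss: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, t))


def frank_wolfe_l1(y, radius=1.0, n_iter=50):
    """Minimize sum_i huber(x_i - y_i) subject to ||x||_1 <= radius."""
    x = [0.0] * len(y)
    for k in range(n_iter):
        grad = [huber_grad(xi - yi) for xi, yi in zip(x, y)]
        # Linear minimization over the L1 ball: the minimizer is the vertex
        # on the coordinate with the largest absolute gradient entry.
        i = max(range(len(y)), key=lambda j: abs(grad[j]))
        vertex = [0.0] * len(y)
        vertex[i] = -radius if grad[i] > 0 else radius
        gamma = 2.0 / (k + 2)
        # The convex combination keeps the iterate inside the ball,
        # so no projection step is needed.
        x = [(1 - gamma) * xi + gamma * vi for xi, vi in zip(x, vertex)]
    return x


y = [3.0, -0.5, 0.2]  # hypothetical observations; 3.0 acts like an outlier
x_hat = frank_wolfe_l1(y)
```

Each iteration needs only a gradient and the solution of a linear problem over the constraint set, which is the low per-iteration cost the abstract refers to. For a matrix parameter constrained to a nuclear-norm ball, the linear step would instead require a leading singular vector pair.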
|