171

THE FAMILY OF CONDITIONAL PENALIZED METHODS WITH THEIR APPLICATION IN SUFFICIENT VARIABLE SELECTION

Xie, Jin 01 January 2018 (has links)
When scientists know in advance that some features (variables) are important in modeling data, these important features should be kept in the model. How can we utilize this prior information to effectively find other important features? This dissertation provides a solution that uses such prior information. We propose the Conditional Adaptive Lasso (CAL) estimates to exploit this knowledge. By choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. The asymptotic and oracle properties are proved. Simulations, especially for large p, small n problems, are performed with comparisons to other existing methods. We further extend the linear model setup to generalized linear models (GLMs). Instead of least squares, we consider the likelihood function with an L1 penalty, that is, the penalized likelihood method. We propose the Generalized Conditional Adaptive Lasso (GCAL) for generalized linear models. We then further extend the method to any penalty terms that satisfy certain regularity conditions, namely the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples evaluate our method in comparison with existing methods. GCAL is also evaluated on a real data set on leukemia.
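The conditioning idea lends itself to a compact sketch. Below is a minimal illustration under assumed simplifications: variables in a known conditioning set are kept unpenalized by partialling them out (Frisch-Waugh), and an adaptive lasso, with weights from an initial ridge fit, is applied to the remaining columns. Function and variable names are illustrative, not taken from the dissertation.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def conditional_adaptive_lasso(X, y, cond_idx, alpha=0.1, gamma=1.0):
    """Hypothetical sketch: keep cond_idx unpenalized, adaptive lasso on the rest."""
    n, p = X.shape
    free_idx = [j for j in range(p) if j not in cond_idx]
    Xc, Xf = X[:, cond_idx], X[:, free_idx]

    # Partial out the conditioning set from y and the penalized columns,
    # so the conditioning (prior-information) variables are unpenalized.
    Pc = Xc @ np.linalg.pinv(Xc)
    y_res, Xf_res = y - Pc @ y, Xf - Pc @ Xf

    # Adaptive weights from an initial ridge fit on the residualized data.
    init = Ridge(alpha=1e-3).fit(Xf_res, y_res).coef_
    w = 1.0 / (np.abs(init) ** gamma + 1e-8)

    # Adaptive lasso = ordinary lasso on columns rescaled by 1/w.
    lasso = Lasso(alpha=alpha).fit(Xf_res / w, y_res)
    beta_free = lasso.coef_ / w

    # Coefficients of the conditioning set recovered by back-substitution.
    beta_cond = np.linalg.pinv(Xc) @ (y - Xf @ beta_free)
    beta = np.zeros(p)
    beta[cond_idx], beta[free_idx] = beta_cond, beta_free
    return beta
```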
172

Automatic 13C Chemical Shift Reference Correction of Protein NMR Spectral Data Using Data Mining and Bayesian Statistical Modeling

Chen, Xi 01 January 2019 (has links)
Nuclear magnetic resonance (NMR) is a highly versatile analytical technique for studying molecular configuration, conformation, and dynamics, especially of biomacromolecules such as proteins. However, due to the intrinsic properties of NMR experiments, results from NMR instruments require a referencing step before downstream analysis. Poor chemical shift referencing, especially for 13C in protein NMR experiments, fundamentally limits and even prevents effective study of biomacromolecules via NMR. No available method can re-reference carbon chemical shifts from protein NMR without secondary experimental information such as structure or resonance assignment. To solve this problem, we constructed a Bayesian probabilistic framework that circumvents the limitations of previous reference correction methods, which required protein resonance assignment and/or a three-dimensional protein structure. Our algorithm, named Bayesian Model Optimized Reference Correction (BaMORC), can detect and correct 13C chemical shift referencing errors before the protein resonance assignment step of analysis and without a three-dimensional structure. By combining the BaMORC methodology with a new intra-peaklist grouping algorithm, we created a combined method called Unassigned BaMORC that utilizes only unassigned experimental peak lists and the amino acid sequence. Unassigned BaMORC kept all experimental three-dimensional HN(CO)CACB-type peak lists tested within ± 0.4 ppm of the correct 13C reference value. On a much larger unassigned chemical shift test set, the base method kept 13C chemical shift referencing errors within ± 0.45 ppm at a 90% confidence interval. With chemical shift assignments, Assigned BaMORC can detect and correct 13C chemical shift referencing errors to within ± 0.22 ppm at a 90% confidence interval. Therefore, Unassigned BaMORC can correct 13C chemical shift referencing errors when it will have the most impact: right before protein resonance assignment and other downstream analyses are started. After assignment, the chemical shift reference correction can be further refined with Assigned BaMORC. To further support broader usage of these new methods, we also created a software package with a web-based interface for the NMR community. This software will allow non-NMR experts to detect and correct 13C referencing errors at critical early data-analysis steps, lowering the bar of NMR expertise required for effective protein NMR analysis.
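As a rough illustration of the kind of statistical model involved, the sketch below estimates a single global 13C referencing offset by maximum likelihood, assuming observed CA shifts follow per-residue-type Gaussians shifted by a common offset. The expected means/SDs and function names are invented placeholders, not BaMORC's actual statistics or API.

```python
import numpy as np
from scipy.stats import norm

# Illustrative expected CA chemical shifts (mean, sd) in ppm by residue type;
# rough literature-style values, assumed for this toy example only.
EXPECTED_CA = {"ALA": (53.1, 2.0), "GLY": (45.4, 1.3), "LEU": (55.6, 2.1)}

def estimate_offset(residues, observed_ca, grid=np.arange(-5, 5, 0.01)):
    """Return the offset maximizing the likelihood of the observed shifts."""
    loglik = np.zeros_like(grid)
    for res, obs in zip(residues, observed_ca):
        mu, sd = EXPECTED_CA[res]
        loglik += norm.logpdf(obs - grid, mu, sd)
    return grid[np.argmax(loglik)]

# Example: shifts recorded with a systematic +1.2 ppm referencing error.
rng = np.random.default_rng(0)
residues = ["ALA", "GLY", "LEU"] * 20
true = np.array([rng.normal(*EXPECTED_CA[r]) for r in residues])
print(estimate_offset(residues, true + 1.2))  # recovers ~1.2
```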
173

Serial Testing for Detection of Multilocus Genetic Interactions

Al-Khaledi, Zaid T. 01 January 2019 (has links)
A method to detect relationships between disease susceptibility and multilocus genetic interactions is the Multifactor-Dimensionality Reduction (MDR) technique pioneered by Ritchie et al. (2001). Since its introduction, many extensions have been pursued to deal with non-binary outcomes and/or to account for multiple interactions simultaneously. Studying the effects of multilocus genetic interactions on continuous traits (blood pressure, weight, etc.) is one case that MDR does not handle. Culverhouse et al. (2004) and Gui et al. (2013) proposed two different methods to analyze such a case. In their research, Gui et al. (2013) introduced the Quantitative Multifactor-Dimensionality Reduction (QMDR), which uses the overall average of the response variable to classify individuals into risk groups. This classification mechanism may not be efficient under some circumstances, especially when the overall mean is close to some of the multilocus means. To address such difficulties, we propose a new algorithm, the Ordered Combinatorial Quantitative Multifactor-Dimensionality Reduction (OQMDR), which uses a series of tests, based on the ascending order of multilocus means, to identify the best interactions of different orders with risk patterns that minimize the prediction error. Ten-fold cross-validation is used to choose among the resulting models. Regular permutation tests are used to assess the significance of the selected model. The assessment procedure is also modified by utilizing the Generalized Extreme-Value distribution to enhance the efficiency of the evaluation process. We present results from a simulation study to illustrate the performance of the algorithm. The proposed algorithm is also applied to a genetic data set associated with Alzheimer's Disease.
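A minimal sketch of the QMDR-style classification step described above: each multilocus genotype cell is labeled high- or low-risk by comparing its trait mean with the overall mean (the step that OQMDR refines by ordering cell means and testing them serially). Names and data are illustrative, not from the dissertation.

```python
import numpy as np

def qmdr_risk_labels(geno, trait):
    """geno: (n, k) array of genotype codes; trait: (n,) quantitative trait."""
    overall = trait.mean()
    labels = {}
    for cell in set(map(tuple, geno)):          # each multilocus genotype cell
        mask = np.all(geno == cell, axis=1)
        labels[cell] = "high" if trait[mask].mean() > overall else "low"
    return labels

rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(200, 2))        # two SNPs coded 0/1/2
trait = rng.normal(120, 15, 200)                # e.g. blood pressure
print(qmdr_risk_labels(geno, trait))
```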
174

Comparing the Structural Components Variance Estimator and U-Statistics Variance Estimator When Assessing the Difference Between Correlated AUCs with Finite Samples

Bosse, Anna L 01 January 2017 (has links)
Introduction: The structural components variance estimator proposed by DeLong et al. (1988) is a popular approach for comparing two correlated AUCs. However, this variance estimator is biased and can be problematic with small sample sizes. Methods: A U-statistics-based variance estimator is presented and compared with the structural components variance estimator through a large-scale simulation study under different finite-sample-size configurations. Results: The U-statistics variance estimator was unbiased for the true variance of the difference between correlated AUCs regardless of the sample size and had lower RMSE than the structural components variance estimator, providing better Type I error control and greater power. The structural components variance estimator provided increasingly biased variance estimates as the correlation between biomarkers increased. Discussion: When comparing two correlated AUCs, it is recommended that the U-statistics variance estimator be used whenever possible, especially for finite sample sizes and highly correlated biomarkers.
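For reference, the sketch below implements DeLong's structural-components computation for the variance of the difference between two correlated AUCs, the estimator under scrutiny above; the U-statistics alternative differs in its variance components and is not shown. This is an assumed minimal implementation, not the thesis code.

```python
import numpy as np

def placements(pos, neg):
    """V10[i] = P(diseased score i exceeds a random non-diseased score), ties at 1/2."""
    v10 = np.array([(np.sum(p > neg) + 0.5 * np.sum(p == neg)) / len(neg)
                    for p in pos])
    v01 = np.array([(np.sum(pos > q) + 0.5 * np.sum(pos == q)) / len(pos)
                    for q in neg])
    return v10, v01

def delong_diff_var(pos_scores, neg_scores):
    """pos_scores: (n_diseased, 2), neg_scores: (n_healthy, 2) -- two biomarkers
    measured on the same subjects. Returns (AUC1 - AUC2, DeLong variance)."""
    V10, V01 = [], []
    for k in range(2):
        v10, v01 = placements(pos_scores[:, k], neg_scores[:, k])
        V10.append(v10)
        V01.append(v01)
    V10, V01 = np.column_stack(V10), np.column_stack(V01)
    auc = V10.mean(axis=0)
    c = np.array([1.0, -1.0])                   # contrast: AUC1 - AUC2
    var = (c @ np.cov(V10.T) @ c) / len(V10) + (c @ np.cov(V01.T) @ c) / len(V01)
    return auc[0] - auc[1], var
```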
175

Bayesian Logistic Regression Model for Siting Biomass-using Facilities

Huang, Xia 01 December 2010 (has links)
Key sources of oil for Western markets are located in complex geopolitical environments that increase economic and social risk. The amalgamation of economic, environmental, social, and national security concerns for petroleum-based economies has created a renewed emphasis on alternative sources of energy, including biomass. The stability of sustainable biomass markets hinges on improved methods to predict and visualize business risk and cost to the supply chain. This thesis develops Bayesian logistic regression models, with comparisons to classical maximum likelihood models, to quantify significant factors that influence the siting of biomass-using facilities and to predict potential locations in the 13-state Southeastern United States for three types of biomass-using facilities. Group I combines all biomass-using mills, biorefineries using agricultural residues, and wood-using bioenergy/biofuels plants. Group II includes pulp and paper mills, and biorefineries that use agricultural and wood residues. Group III includes food processing mills and biorefineries that use agricultural and wood residues. The resolution of this research is the 5-digit ZIP Code Tabulation Area (ZCTA); there are 9,416 ZCTAs in the 13-state Southeastern study region. For both the classical and Bayesian approaches, the data were split into a training set and a separate validation (hold-out) set using a pseudo-random number-generating function in SAS® Enterprise Miner. Four predefined priors are constructed. Bayesian estimation assuming a Gaussian prior distribution provides the highest correct classification rate of 86.40% for Group I; Bayesian estimation assuming a non-informative uniform prior has the highest correct classification rate of 95.97% for Group II; and Bayesian estimation assuming a Gaussian prior gives the highest correct classification rate of 92.67% for Group III. Given the comparatively low sensitivity for Groups II and III, a hybrid model that integrates classification trees and local Bayesian logistic regression was developed as part of this research to further improve the predictive power. The hybrid model increases the sensitivity for Group II from 58.54% to 64.40%, and significantly improves both the specificity and sensitivity for Group III, from 98.69% to 99.42% and from 39.35% to 46.45%, respectively. Twenty-five optimal locations for the biomass-using facility groupings at the 5-digit ZCTA resolution, based upon the best-fitted Bayesian logistic regression model and the hybrid model, are predicted and plotted for the 13-state Southeastern study region.
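A minimal sketch of the core model, assuming a MAP reading of Bayesian logistic regression with a Gaussian prior (which coincides with L2-penalized logistic regression). The covariates are invented placeholders, not the thesis's ZCTA-level predictors.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(beta, X, y, prior_var=10.0):
    z = X @ beta
    # Bernoulli log-likelihood plus Gaussian log-prior (up to constants).
    loglik = np.sum(y * z - np.logaddexp(0, z))
    logprior = -0.5 * beta @ beta / prior_var
    return -(loglik + logprior)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500),                 # intercept
                     rng.normal(size=(500, 3))])   # placeholder siting factors
true_beta = np.array([-1.0, 0.8, -0.5, 1.2])
y = rng.random(500) < 1 / (1 + np.exp(-X @ true_beta))

fit = minimize(neg_log_posterior, np.zeros(4), args=(X, y.astype(float)))
print(fit.x)  # MAP estimate, close to true_beta under this prior
```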
176

A Study of Missing Data Imputation and Predictive Modeling of Strength Properties of Wood Composites

Zeng, Yan 01 August 2011 (has links)
Problem: Real-time process and destructive test data were collected from a wood composite manufacturer in the U.S. to develop real-time predictive models of two key strength properties, Modulus of Rupture (MOR) and Internal Bond (IB), of a wood composite manufacturing process. Sensor malfunctions and data send/retrieval problems led to null fields in the company's data warehouse, resulting in information loss. Many manufacturers attempt to build accurate predictive models by excluding entire records with null fields or by using summary statistics such as the mean or median in place of the null field. However, predictive-model errors in validation may be higher in the presence of information loss. In addition, the selection of predictive modeling methods poses another challenge to many wood composite manufacturers. Approach: This thesis consists of two parts addressing the above issues: 1) how to improve data quality using missing data imputation; and 2) which predictive modeling method is better in terms of prediction precision (measured by root mean square error, or RMSE). The first part summarizes an application of missing data imputation methods in predictive modeling. After variable selection, two missing data imputation methods were selected after comparing six candidate methods. Predictive models of the imputed data were developed using partial least squares regression (PLSR) and compared with models of non-imputed data using ten-fold cross-validation. Root mean square error of prediction (RMSEP) and normalized RMSEP (NRMSEP) were calculated. The second part presents a series of comparisons among four predictive modeling methods using imputed data without variable selection. Results: The first part concludes that the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov Chain Monte Carlo (MCMC) simulation achieved the most precise results. Predictive models based on imputed datasets generated more precise predictions (average NRMSEP of 5.8% for the MOR model and 7.2% for the IB model) than models of non-imputed datasets (average NRMSEP of 6.3% for MOR and 8.1% for IB). The second part finds that the Bayesian Additive Regression Tree (BART) produced more precise predictions (average NRMSEP of 7.7% for the MOR model and 8.6% for the IB model) than the other three methods: PLSR, LASSO, and Adaptive LASSO.
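The contrast between naive and model-based imputation can be sketched briefly. The example below compares mean imputation with scikit-learn's iterative (EM-like) imputer on synthetic correlated data; it echoes the thesis finding directionally but uses made-up data, not the wood-composite process data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
# Make the last column depend on the others, so a model-based imputer can exploit it.
X[:, 4] = X[:, :4] @ np.array([0.5, -0.3, 0.8, 0.1]) + rng.normal(0, 0.1, 300)
mask = rng.random(X.shape) < 0.15                 # ~15% missing completely at random
X_missing = np.where(mask, np.nan, X)

for imputer in (SimpleImputer(strategy="mean"), IterativeImputer(random_state=0)):
    X_imp = imputer.fit_transform(X_missing)
    rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
    print(type(imputer).__name__, round(rmse, 3))  # iterative should be lower
```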
177

Random Walks with Elastic and Reflective Lower Boundaries

Devore, Lucas Clay 01 December 2009 (has links)
No description available.
178

Optimisation of food overloading at long distance flights

Eger, Karl-Heinz, Uranchimeg, Tudevdagva 22 August 2009 (has links) (PDF)
This paper deals with the optimisation of food overloading on long-distance flights. It describes how, in the case of two offered meals and two distinct passenger groups, reserve meals should be distributed between the two meals such that the probability that each passenger gets the meal of their choice is maximised. A statistical procedure is presented for estimating the required demand probabilities.
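A sketch of one plausible reading of the allocation problem: n passengers each prefer meal A with probability p, and r reserve meals are split between the two meal types to maximise the probability that every passenger receives their chosen meal. The paper models two passenger groups; this simplified single-group version and all names are assumptions.

```python
from scipy.stats import binom

def best_split(n, p, r):
    """Split r reserve meals between types A and B to maximise the chance
    that every one of n passengers (each preferring A w.p. p) is served."""
    base_a = round(n * p)                    # baseline meals of type A loaded
    base_b = n - base_a                      # baseline meals of type B loaded
    best = None
    for k in range(r + 1):                   # k reserves assigned to meal A
        s_a, s_b = base_a + k, base_b + (r - k)
        # All passengers served iff  n - s_b <= (demand for A) <= s_a.
        prob = binom.cdf(s_a, n, p) - binom.cdf(n - s_b - 1, n, p)
        if best is None or prob > best[2]:
            best = (k, r - k, prob)
    return best

print(best_split(n=200, p=0.6, r=10))        # (reserves for A, for B, P(success))
```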
179

Applied statistics in the classroom

Rodriguez, Christopher Jessie 10 December 2013 (has links)
The purpose of this report is to give teachers of AP Statistics a way to enrich student learning with an engaging, rigorous, and relevant project. The report details the rationale for student-based learning, along with examples of classrooms where projects were successful. The project is centered on categorical data analysis involving tests of proportions, chi-squared distributions, and confidence intervals. Supplemental worksheets are provided to show students the relevance of what they are learning and its application to actual studies. Finally, a rubric is provided for students to align and focus their projects, as well as for teachers to assess student learning.
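As a concrete classroom-style example of the categorical analysis the project centers on, the snippet below runs a chi-squared test of independence and computes a confidence interval for a proportion; the survey counts are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Survey: preference for two cafeteria options by grade level (invented counts).
table = np.array([[30, 20],    # grade 11: option A, option B
                  [18, 32]])   # grade 12: option A, option B
chi2, pval, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {pval:.3f}")

# 95% confidence interval for the proportion of grade-11 students preferring A.
p_hat, n = 30 / 50, 50
moe = norm.ppf(0.975) * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.2f} ± {moe:.2f}")
```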
180

MULTI-STATE MODELS FOR INTERVAL CENSORED DATA WITH COMPETING RISK

Wei, Shaoceng 01 January 2015 (has links)
Multi-state models are often used to evaluate the effect of death as a competing event to the development of dementia in a longitudinal study of the cognitive status of elderly subjects. In this dissertation, both a multi-state Markov model and a semi-Markov model are used to characterize the flow of subjects from intact cognition to dementia, with mild cognitive impairment and global impairment as intervening transient cognitive states and death as a competing risk. First, a multi-state Markov model with three transient states, intact cognition, mild cognitive impairment (MCI), and global impairment (GI), and one absorbing state, dementia, is used to model the cognitive panel data. A Weibull model and a Cox proportional hazards (Cox PH) model are used to fit the time to death based on age at entry and APOE4 status. A shared random effect correlates this survival time with the transition model. Second, we apply a semi-Markov process in which we assume that the waiting times are Weibull distributed, except for transitions from the baseline state, which are exponentially distributed; we also assume that no additional changes in cognition occur between two assessments. We implement a quasi-Monte Carlo (QMC) method to calculate the higher-order integration needed for likelihood-based estimation. At the end of this dissertation, we extend a non-parametric "local EM algorithm" to obtain a smooth estimator of the cause-specific hazard function (CSH) in the presence of competing risk. All the proposed methods are justified by simulation studies and applications to the Nun Study data, a longitudinal study of late-life cognition in a cohort of 461 subjects.
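A minimal discrete-time sketch of the multi-state structure described above: three transient cognitive states, dementia as an absorbing state, and death as a competing absorbing state. The transition probabilities are invented for illustration; the dissertation estimates continuous-time models from panel data.

```python
import numpy as np

states = ["intact", "MCI", "GI", "dementia", "death"]
# Illustrative one-step (per-assessment-wave) transition matrix; rows sum to 1.
P = np.array([
    [0.85, 0.08, 0.03, 0.01, 0.03],   # intact
    [0.10, 0.70, 0.10, 0.05, 0.05],   # MCI
    [0.02, 0.08, 0.70, 0.12, 0.08],   # GI
    [0.00, 0.00, 0.00, 0.92, 0.08],   # dementia (absorbing except for death)
    [0.00, 0.00, 0.00, 0.00, 1.00],   # death (absorbing, competing risk)
])

# Distribution over states after 10 assessment waves, starting from intact.
dist = np.linalg.matrix_power(P, 10)[0]
for s, pr in zip(states, dist):
    print(f"{s:9s} {pr:.3f}")
```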
