41

Denoising Tandem Mass Spectrometry Data

Offei, Felix 01 May 2017 (has links)
Tandem mass spectrometry (MS/MS) has proven to be an effective way to identify proteins in a biological sample. An observed spectrum is constructed from the data produced by the tandem mass spectrometer, and a protein can be identified if the observed spectrum aligns with the theoretical spectrum. However, the data generated by the instrument are affected by errors such as miscalibration, instrument distortion, and noise, which make protein identification challenging. In this thesis, we present a pre-processing method that focuses on removing noisy data with the aim of improving protein identification. We employ binning to reduce the number of noise peaks in the data without sacrificing the alignment of the observed spectrum with the theoretical spectrum; in some cases, the alignment of the two spectra improved.
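As a rough illustration of the binning idea (a sketch only, not the thesis's implementation; the bin width and peaks-per-bin settings are assumptions), the following keeps only the strongest peaks within each m/z bin:

```python
import numpy as np

def denoise_by_binning(mz, intensity, bin_width=1.0, peaks_per_bin=1):
    """Keep only the most intense peak(s) in each m/z bin.

    mz, intensity : peak positions and heights from an observed spectrum.
    bin_width     : width of each m/z bin (illustrative value, not from the thesis).
    peaks_per_bin : number of strongest peaks retained per bin.
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    bins = np.floor(mz / bin_width).astype(int)

    keep = []
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        # retain the strongest peaks in this bin; the rest are treated as noise
        strongest = idx[np.argsort(intensity[idx])[::-1][:peaks_per_bin]]
        keep.extend(strongest.tolist())
    keep = np.sort(keep)
    return mz[keep], intensity[keep]

# toy usage
mz = np.array([100.1, 100.4, 100.7, 251.2, 251.3, 399.9])
inten = np.array([5.0, 120.0, 3.0, 80.0, 7.0, 60.0])
print(denoise_by_binning(mz, inten))
```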
42

THE FAMILY OF CONDITIONAL PENALIZED METHODS WITH THEIR APPLICATION IN SUFFICIENT VARIABLE SELECTION

Xie, Jin 01 January 2018 (has links)
When scientists know in advance that some features (variables) are important for modeling the data, those features should be kept in the model. How can this prior information be used to find other important features effectively? This dissertation provides a solution. We propose the Conditional Adaptive Lasso (CAL) estimator to exploit this knowledge: by choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. The asymptotic and oracle properties are proved. Simulations, especially for large-p-small-n problems, compare the proposed methods with existing ones. We further extend the linear-model setup to generalized linear models (GLMs): instead of least squares, we consider the likelihood function with an L1 penalty, that is, penalized likelihood methods. We propose the Generalized Conditional Adaptive Lasso (GCAL) for GLMs and then extend the method to any penalty term satisfying certain regularity conditions, namely the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples compare our method with existing methods, and GCAL is also evaluated on a real leukemia data set.
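A minimal sketch of the adaptive-lasso reparameterization with a conditioning set left essentially unpenalized (the weighting scheme and tuning values are assumptions for illustration, not the dissertation's CAL estimator):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def conditional_adaptive_lasso(X, y, conditioning_set, alpha=0.1, gamma=1.0):
    """Adaptive-lasso-style fit that leaves a conditioning set of known-important
    variables essentially unpenalized.

    Assumptions (not from the dissertation): weights come from an initial OLS fit,
    and conditioning-set weights are set to a tiny value so those coefficients are
    effectively not shrunk.
    """
    beta_init = LinearRegression().fit(X, y).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)   # adaptive weights
    w[list(conditioning_set)] = 1e-6                 # prior-known variables: ~no penalty

    # standard reparameterization trick: scale columns by 1/w, fit a plain lasso,
    # then rescale coefficients back to the original scale
    X_scaled = X / w
    fit = Lasso(alpha=alpha, max_iter=50_000).fit(X_scaled, y)
    return fit.coef_ / w

# toy usage: variables 0 and 1 are known a priori to matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 5] + rng.normal(size=100)
print(np.round(conditional_adaptive_lasso(X, y, {0, 1}), 2))
```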
43

Comparing the Structural Components Variance Estimator and U-Statistics Variance Estimator When Assessing the Difference Between Correlated AUCs with Finite Samples

Bosse, Anna L 01 January 2017 (has links)
Introduction: The structural components variance estimator proposed by DeLong et al. (1988) is a popular approach for comparing two correlated AUCs. However, this variance estimator is biased and can be problematic with small sample sizes. Methods: A U-statistics-based variance estimator is presented and compared with the structural components variance estimator through a large-scale simulation study under different finite-sample configurations. Results: The U-statistics variance estimator was unbiased for the true variance of the difference between correlated AUCs regardless of sample size and had lower RMSE than the structural components variance estimator, providing better type I error control and greater power. The structural components variance estimator produced increasingly biased variance estimates as the correlation between biomarkers increased. Discussion: When comparing two correlated AUCs, the U-statistics variance estimator is recommended whenever possible, especially for finite sample sizes and highly correlated biomarkers.
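The structural components calculation being compared can be sketched as follows; this is a generic implementation of the DeLong et al. (1988) variance of the difference between two correlated AUCs, not the author's simulation code:

```python
import numpy as np

def delong_difference(cases, controls):
    """Structural-components (DeLong) estimate of Var(AUC1 - AUC2) for two
    markers measured on the same subjects.

    cases, controls : arrays of shape (m, 2) and (n, 2); column k holds marker k.
    Returns (auc1 - auc2, estimated variance of the difference).
    """
    cases, controls = np.asarray(cases, float), np.asarray(controls, float)
    m, n = len(cases), len(controls)

    # psi(x, y) = 1 if x > y, 0.5 if tied, 0 otherwise, evaluated per marker
    psi = (cases[:, None, :] > controls[None, :, :]).astype(float)
    psi += 0.5 * (cases[:, None, :] == controls[None, :, :])

    auc = psi.mean(axis=(0, 1))          # AUC for each marker
    v10 = psi.mean(axis=1)               # structural components over cases (m, 2)
    v01 = psi.mean(axis=0)               # structural components over controls (n, 2)

    s10 = np.cov(v10, rowvar=False)      # 2x2 covariance across cases
    s01 = np.cov(v01, rowvar=False)      # 2x2 covariance across controls
    c = np.array([1.0, -1.0])
    var_diff = c @ s10 @ c / m + c @ s01 @ c / n
    return auc[0] - auc[1], var_diff
```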
44

Bayesian Logistic Regression Model for Siting Biomass-using Facilities

Huang, Xia 01 December 2010 (has links)
Key sources of oil for Western markets are located in complex geopolitical environments that increase economic and social risk. The amalgamation of economic, environmental, social, and national security concerns for petroleum-based economies has created a renewed emphasis on alternative sources of energy, including biomass. The stability of sustainable biomass markets hinges on improved methods to predict and visualize business risk and cost to the supply chain. This thesis develops Bayesian logistic regression models, with comparisons to classical maximum likelihood models, to quantify significant factors that influence the siting of biomass-using facilities and to predict potential locations in the 13-state Southeastern United States for three types of biomass-using facilities. Group I combines all biomass-using mills, biorefineries using agricultural residues, and wood-using bioenergy/biofuels plants. Group II includes pulp and paper mills and biorefineries that use agricultural and wood residues. Group III includes food processing mills and biorefineries that use agricultural and wood residues. The resolution of this research is the 5-digit ZIP Code Tabulation Area (ZCTA); there are 9,416 ZCTAs in the 13-state Southeastern study region. For both the classical and Bayesian approaches, the data were split into a training set and a separate validation (hold-out) set using a pseudo-random number-generating function in SAS® Enterprise Miner. Four predefined priors are constructed. Bayesian estimation assuming a Gaussian prior distribution provides the highest correct classification rate of 86.40% for Group I; Bayesian methods assuming a non-informative uniform prior provide the highest correct classification rate of 95.97% for Group II; and Bayesian methods assuming a Gaussian prior give the highest correct classification rate of 92.67% for Group III. Given the comparatively low sensitivity for Groups II and III, a hybrid model that integrates classification trees and local Bayesian logistic regression was developed as part of this research to further improve predictive power. The hybrid model increases the sensitivity of Group II from 58.54% to 64.40%, and for Group III improves the specificity from 98.69% to 99.42% and the sensitivity from 39.35% to 46.45%. Twenty-five optimal locations for the biomass-using facility groupings at the 5-digit ZCTA resolution, based on the best-fitting Bayesian logistic regression model and the hybrid model, are predicted and plotted for the 13-state Southeastern study region.
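A minimal sketch of the Bayesian idea, using a MAP estimate of a logistic regression under independent Gaussian priors on simulated data (the prior standard deviation and the data are assumptions; the thesis fits its models and priors in a different toolchain):

```python
import numpy as np
from scipy.optimize import minimize

def map_logistic(X, y, prior_sd=2.5):
    """MAP estimate of logistic regression coefficients under independent
    Gaussian priors (sigma = prior_sd) on each coefficient."""
    X1 = np.column_stack([np.ones(len(X)), X])      # add intercept

    def neg_log_posterior(beta):
        z = X1 @ beta
        loglik = np.sum(y * z - np.log1p(np.exp(z)))      # Bernoulli log-likelihood
        logprior = -0.5 * np.sum(beta ** 2) / prior_sd ** 2
        return -(loglik + logprior)

    beta0 = np.zeros(X1.shape[1])
    return minimize(neg_log_posterior, beta0, method="BFGS").x

# toy usage: correct classification rate on a held-out set
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.0]))))
y = rng.binomial(1, p)
beta_hat = map_logistic(X[:400], y[:400])
pred = 1 / (1 + np.exp(-(np.column_stack([np.ones(100), X[400:]]) @ beta_hat))) > 0.5
print("correct classification rate:", (pred == y[400:]).mean())
```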
45

A Study of Missing Data Imputation and Predictive Modeling of Strength Properties of Wood Composites

Zeng, Yan 01 August 2011 (has links)
Problem: Real-time process and destructive test data were collected from a wood composite manufacturer in the U.S. to develop real-time predictive models of two key strength properties, Modulus of Rupture (MOR) and Internal Bond (IB), of a wood composite manufacturing process. Sensor malfunction and data send/retrieval problems led to null fields in the company’s data warehouse, resulting in information loss. Many manufacturers attempt to build accurate predictive models by excluding entire records with null fields or by substituting summary statistics such as the mean or median for the null field. However, predictive model errors in validation may be higher in the presence of such information loss. In addition, the selection of predictive modeling methods poses another challenge for many wood composite manufacturers. Approach: This thesis consists of two parts addressing the above issues: 1) how to improve data quality using missing data imputation; and 2) which predictive modeling method is better in terms of prediction precision, measured by root mean square error (RMSE). The first part summarizes an application of missing data imputation methods in predictive modeling. After variable selection, two missing data imputation methods were selected from six candidate methods. Predictive models of imputed data were developed using partial least squares regression (PLSR) and compared with models of non-imputed data using ten-fold cross-validation. Root mean square error of prediction (RMSEP) and normalized RMSEP (NRMSEP) were calculated. The second part presents a series of comparisons among four predictive modeling methods using imputed data without variable selection. Results: The first part concludes that the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov chain Monte Carlo (MCMC) simulation achieved the most precise results. Predictive models based on imputed data sets generated more precise predictions (average NRMSEP of 5.8% for the MOR model and 7.2% for the IB model) than models based on non-imputed data sets (average NRMSEP of 6.3% for MOR and 8.1% for IB). The second part finds that Bayesian Additive Regression Trees (BART) produced more precise predictions (average NRMSEP of 7.7% for the MOR model and 8.6% for the IB model) than the other three methods: PLSR, LASSO, and adaptive LASSO.
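A hedged sketch of the imputation-plus-PLSR workflow with ten-fold cross-validated RMSEP, using scikit-learn's IterativeImputer as a stand-in for the EM/MCMC imputation methods used in the thesis (the data and settings are illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold

# toy data standing in for process variables (X) and a strength property (y);
# roughly 10% of the sensor readings are missing
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = X[:, :3] @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.2, size=300)
X[rng.random(X.shape) < 0.10] = np.nan

model = make_pipeline(
    IterativeImputer(max_iter=20, random_state=0),   # iterative (MICE-style) imputation
    PLSRegression(n_components=3),
)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
rmsep = np.sqrt(-cross_val_score(model, X, y, cv=cv,
                                 scoring="neg_mean_squared_error")).mean()
print("ten-fold RMSEP:", round(rmsep, 3))
print("NRMSEP (% of response range):", round(100 * rmsep / (y.max() - y.min()), 1))
```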
46

STATISTICS IN THE BILLERA-HOLMES-VOGTMANN TREESPACE

Weyenberg, Grady S. 01 January 2015 (has links)
This dissertation is an effort to adapt two classical non-parametric statistical techniques, kernel density estimation (KDE) and principal components analysis (PCA), to the Billera-Holmes-Vogtmann (BHV) metric space for phylogenetic trees. This adaptation gives a more general framework than currently exists for developing and testing hypotheses about apparent differences or similarities between sets of phylogenetic trees. For example, while the majority of gene histories found in a clade of organisms are expected to be generated by a common evolutionary process, numerous other coexisting processes (e.g. horizontal gene transfer, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history quite distinct from the histories of the majority of genes. Such “outlying” gene trees are considered biologically interesting, and identifying these genes has become an important problem in phylogenetics. The R software package kdetrees, developed in Chapter 2, contains an implementation of the kernel density estimation method. The primary theoretical difficulty involved in this adaptation concerns the normalization of the kernel functions in the BHV metric space; this problem is addressed in Chapter 3. In both chapters, the software package is applied to simulated and empirical datasets to demonstrate the properties of the method. The first theoretical steps in adapting principal components analysis to the BHV space are presented in Chapter 4. It becomes necessary to generalize the notion of a set of perpendicular vectors in Euclidean space to the BHV metric space, but there is some ambiguity about how best to proceed. We show that convex hulls are one reasonable approach to the problem. The Nye PCA algorithm provides a method of projecting onto arbitrary convex hulls in BHV space, providing the core of a modified PCA-type method.
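A toy sketch of distance-based kernel density scoring for flagging outlying trees, assuming pairwise BHV distances are already available and ignoring the BHV normalization issue the dissertation addresses:

```python
import numpy as np

def kde_scores(dist_matrix, bandwidth=1.0):
    """Score each tree by the average Gaussian kernel of its distances to all
    other trees (leave-one-out style). dist_matrix is an (n, n) matrix of
    pairwise BHV (or any) distances computed elsewhere; the proper normalizing
    constant in BHV space is simply ignored in this sketch."""
    d = np.asarray(dist_matrix, float)
    k = np.exp(-0.5 * (d / bandwidth) ** 2)
    np.fill_diagonal(k, 0.0)
    return k.sum(axis=1) / (len(d) - 1)

# toy usage with Euclidean points standing in for trees; items with unusually
# low scores are candidate "outlying" gene trees
pts = np.random.default_rng(0).normal(size=(10, 4))
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
scores = kde_scores(d, bandwidth=1.0)
print(np.argsort(scores)[:3])
```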
47

If and How Many 'Races'? The Application of Mixture Modeling to World-Wide Human Craniometric Variation

Algee-Hewitt, Bridget Frances Beatrice 01 December 2011 (has links)
Studies in human cranial variation are extensive and widely discussed. While skeletal biologists continue to focus on questions of biological distance and population history, group-specific knowledge is increasingly used for human identification in medico-legal contexts. The importance of this research has often been overshadowed by both philosophical and methodological concerns. Many analyses have been constrained in scope by the limited availability of representative samples and readily criticized for adopting statistical techniques that require user guidance and a priori information. A multi-part project is presented here that implements model-based clustering as an alternative approach for population studies using craniometric traits. The project also introduces the use of force-directed graphing and mixture-based supervised classification methods as statistically robust and practically useful techniques. It considers three well-documented craniometric sources whose samples collectively permit large-scale analyses and tests of population structure at a variety of partitions and for different goals. The craniofacial measurements drawn from the world-wide data sets collected by Howells and Hanihara permit rigorous tests for group differences and cryptic population structure. The inclusion of modern American samples from the Forensic Anthropology Data Bank allows for investigations into the importance of biosocial race and biogeographic ancestry in forensic anthropology. Demographic information from the United States Census Bureau is used to contextualize these samples within the range of racial diversity represented in the American population at large. The project's findings support the presence of population structure, the utility of finite mixture methods for questions of biological classification, and the validity of supervised discrimination methods as reliable tools. They also attest to the importance of context for producing the most useful information on identity and affinity. These results suggest that a meaningful relationship between statistically inferred clusters and predefined groups does exist and that population-informative differences in cranial morphology can be detected with measured degrees of statistical certainty, even when true memberships are unknown. They imply, in turn, that the estimation of biogeographic ancestry and the identification of biosocial race in forensic anthropology can provide useful information for modern American casework that can be supported by scientific methods.
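A minimal sketch of model-based clustering with BIC used to ask how many groups the data support (simulated measurements stand in for craniometric samples; this is not the project's actual analysis):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# stand-in data: rows are individuals, columns are craniometric-like measurements
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(2.5, 1.0, (40, 5))])

# fit finite Gaussian mixtures with 1..6 components and compare by BIC
bics = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         n_init=5, random_state=0).fit(X)
    bics[k] = gm.bic(X)

best_k = min(bics, key=bics.get)
print("BIC-preferred number of clusters:", best_k)
labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(X)
```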
48

INFORMATIONAL INDEX AND ITS APPLICATIONS IN HIGH DIMENSIONAL DATA

Yuan, Qingcong 01 January 2017 (has links)
We introduce a new class of measures for testing independence between two random vectors based on the expected difference between conditional and marginal characteristic functions. By choosing a particular weight function in the class, we propose a new index for measuring independence and study its properties. Two empirical versions are developed; their properties, asymptotics, connections with existing measures, and applications are discussed. Implementation details and Monte Carlo results are also presented. We propose a two-stage sufficient variable selection method based on the new index to deal with large-p-small-n data. The method does not require model specification and focuses especially on categorical responses. Our approach improves on typical screening approaches that use only marginal relations. Numerical studies demonstrate the advantages of the method. We also introduce a novel approach to sufficient dimension reduction problems using the new measure. The proposed method requires very mild conditions on the predictors, estimates the central subspace effectively, and is especially useful when the response is categorical. It keeps the model-free advantage without estimating a link function. Under regularity conditions, root-n consistency and asymptotic normality are established. The proposed method is competitive and robust compared with existing dimension reduction methods in simulation studies.
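A generic sketch of a characteristic-function-based independence statistic with a permutation p-value; the random-frequency weighting used here is an assumption, and this is not the dissertation's exact index:

```python
import numpy as np

def ecf_dependence(x, y, n_freq=64, seed=0):
    """Average of |phi_{X,Y}(s,t) - phi_X(s)phi_Y(t)|^2 over randomly drawn
    frequency pairs (s, t). x: (n, p) array, y: (n, q) array."""
    rng = np.random.default_rng(seed)
    s = rng.normal(size=(n_freq, x.shape[1]))
    t = rng.normal(size=(n_freq, y.shape[1]))
    ex = np.exp(1j * x @ s.T)           # (n, n_freq) terms exp(i s'X_k)
    ey = np.exp(1j * y @ t.T)           # (n, n_freq) terms exp(i t'Y_k)
    joint = (ex * ey).mean(axis=0)      # empirical joint CF at each (s_j, t_j)
    prod = ex.mean(axis=0) * ey.mean(axis=0)
    return np.mean(np.abs(joint - prod) ** 2)

def perm_pvalue(x, y, n_perm=500, seed=0):
    """Permutation p-value for independence using the statistic above."""
    rng = np.random.default_rng(seed)
    obs = ecf_dependence(x, y)
    null = [ecf_dependence(x, y[rng.permutation(len(y))]) for _ in range(n_perm)]
    return (np.sum(np.array(null) >= obs) + 1) / (n_perm + 1)

# toy usage: y depends on x through its first coordinate
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = np.column_stack([x[:, 0] ** 2, rng.normal(size=200)])
print(perm_pvalue(x, y, n_perm=200))
```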
49

Communications and Methodologies in Crime Geography: Contemporary Approaches to Disseminating Criminal Incidence and Research

Ogden, Mitchell 01 December 2019 (has links)
Many tools exist to assist law enforcement agencies in mitigating criminal activity. For centuries, academics have used statistics in the study of crime and criminals, and more recently police departments have made use of spatial statistics and geographic information systems in that pursuit. Clustering and hot-spot methods of analysis are popular in this application for their relative simplicity of interpretation and ease of implementation. With recent advancements in geospatial technology, it is easier than ever to share data publicly through visual communication tools such as web applications and dashboards. Sharing data and the results of analyses boosts transparency and the public image of police agencies, an image important to maintaining public trust in law enforcement and active participation in community safety.
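As a small illustration of the clustering/hot-spot idea mentioned above (the coordinates and parameters are invented, not drawn from the thesis):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# simulated incident coordinates: two dense pockets plus scattered background
rng = np.random.default_rng(0)
incidents = np.vstack([
    rng.normal([0.0, 0.0], 0.05, (80, 2)),
    rng.normal([1.0, 1.0], 0.05, (60, 2)),
    rng.uniform(-1, 2, (40, 2)),
])

# density-based clustering; label -1 marks background (non-hot-spot) incidents
labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(incidents)
for lab in sorted(set(labels) - {-1}):
    pts = incidents[labels == lab]
    print(f"hot spot {lab}: {len(pts)} incidents, centroid {pts.mean(axis=0).round(2)}")
```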
50

TESTING FOR TREATMENT HETEROGENEITY BETWEEN THE INDIVIDUAL OUTCOMES WITHIN A COMPOSITE OUTCOME

Pogue, Janice M. 04 1900 (has links)
This series of papers explores the value of and mechanisms for using a heterogeneity test to compare treatment differences between the individual outcomes included in a composite outcome. Trialists often combine a group of outcomes into a single composite outcome based on the belief that all will share a common treatment effect. The question addressed here is how this assumption of homogeneity of treatment effect can be assessed in the analysis of a trial that uses such a composite outcome. A class of models that can be used to form such a test involves the analysis of multiple outcomes per person, adjusting for the association that arises because repeated outcomes are observed on the same individuals. We compare heterogeneity tests from multiple models for binary and time-to-event composite outcomes to determine which have the greatest power to detect treatment differences for the individual outcomes within a composite outcome. Generally, both marginal and random effects models are shown to be reasonable choices for such tests. We show that a treatment heterogeneity test may be used to help design a study with a composite outcome and how it can aid in the interpretation of trial results. / Doctor of Philosophy (PhD)
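A hedged sketch of one such heterogeneity test: a marginal (GEE) logistic model with a treatment-by-outcome-type interaction fitted to simulated long-format data, where the interaction term asks whether the treatment effect is common across the component outcomes. The data and model settings are illustrative, not the thesis's analyses:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# simulate a two-component binary composite: each patient contributes one row
# per component outcome; treatment helps outcome 0 but not outcome 1
rng = np.random.default_rng(0)
n = 400
trt = rng.integers(0, 2, n)
p1 = 1 / (1 + np.exp(-(-1.0 - 0.5 * trt)))
p2 = 1 / (1 + np.exp(-(-1.0 + 0.0 * trt)))
events = np.column_stack([rng.binomial(1, p1), rng.binomial(1, p2)])

long = pd.DataFrame({
    "id": np.repeat(np.arange(n), 2),
    "outcome_type": np.tile([0, 1], n),
    "treatment": np.repeat(trt, 2),
    "event": events.ravel(),
})

# marginal logistic model with exchangeable working correlation within patient;
# the treatment:outcome_type interaction is the heterogeneity test
model = sm.GEE.from_formula(
    "event ~ treatment * C(outcome_type)",
    groups="id",
    data=long,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
res = model.fit()
print(res.summary().tables[1])
```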
