
Denoising Tandem Mass Spectrometry Data

Offei, Felix 01 May 2017
Protein identification using tandem mass spectrometry (MS/MS) has proven to be an effective way to identify proteins in a biological sample. An observed spectrum is constructed from the data produced by the tandem mass spectrometer, and a protein can be identified if the observed spectrum aligns with the theoretical spectrum. However, the data generated by the tandem mass spectrometer are affected by errors, including incorrect calibration of the instrument, instrument distortion, and noise, which makes protein identification challenging in the field of proteomics. In this thesis, we present a pre-processing method that focuses on removing noisy data with the aim of aiding better identification of proteins. We employ binning to reduce the number of noise peaks in the data without sacrificing the alignment of the observed spectrum with the theoretical spectrum. In some cases, the alignment of the two spectra improved.
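For readers who want a concrete picture of the binning step, the sketch below keeps only the most intense peak(s) in each m/z bin; the bin width, the number of retained peaks, and the toy spectrum are illustrative assumptions, not the thesis's actual parameters or data.

```python
import numpy as np

def denoise_by_binning(mz, intensity, bin_width=1.0, peaks_per_bin=1):
    """Keep only the most intense peak(s) in each m/z bin (illustrative parameters)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    bins = np.floor(mz / bin_width).astype(int)

    keep = []
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        # retain the strongest peak(s) in this bin; treat the rest as noise
        top = idx[np.argsort(intensity[idx])[::-1][:peaks_per_bin]]
        keep.extend(top.tolist())

    keep = np.sort(keep)
    return mz[keep], intensity[keep]

# Toy spectrum: clusters of low-intensity peaks around a few true peaks
mz = [100.1, 100.4, 100.7, 250.2, 250.3, 400.0]
inten = [5, 120, 8, 300, 12, 90]
print(denoise_by_binning(mz, inten, bin_width=1.0))
```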

THE FAMILY OF CONDITIONAL PENALIZED METHODS WITH THEIR APPLICATION IN SUFFICIENT VARIABLE SELECTION

Xie, Jin 01 January 2018
When scientists know in advance that some features (variables) are important in modeling data, these important features should be kept in the model. How can we use this prior information to effectively find other important features? This dissertation provides a solution that exploits such prior information. We propose the Conditional Adaptive Lasso (CAL) estimator to exploit this knowledge. By choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. The asymptotic and oracle properties are proved. Simulations, especially for large-p-small-n problems, are performed with comparisons to other existing methods. We then extend the linear-model setup to generalized linear models (GLMs): instead of least squares, we consider the likelihood function with an L1 penalty, that is, penalized likelihood methods. We propose the Generalized Conditional Adaptive Lasso (GCAL) for generalized linear models, and further extend the method to any penalty term that satisfies certain regularity conditions, yielding the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples are evaluated for our method with comparisons to existing methods. GCAL is also evaluated on a real data set on leukemia.
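As a rough illustration of how a conditioning set can be exempted from the penalty, the sketch below implements a weighted adaptive lasso in which the prior-specified features receive a near-zero penalty weight; the reweighting-by-rescaling trick, the toy data, and the tuning values are assumptions for illustration, not the dissertation's CAL estimator.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def conditional_adaptive_lasso(X, y, cond_idx, alpha=0.1, gamma=1.0, eps=1e-6):
    """Weighted adaptive lasso in which the conditioning set is (almost) unpenalized.

    Ordinary adaptive-lasso weights w_j = 1/|beta_init_j|^gamma are used for the
    remaining features; the weighted L1 penalty is fit by rescaling columns and
    calling a plain Lasso solver. This is an illustration, not the CAL estimator.
    """
    beta_init = LinearRegression().fit(X, y).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + eps)
    w[list(cond_idx)] = 1e-3          # near-zero penalty keeps these features in the model

    X_scaled = X / w                  # column-wise rescaling encodes the weights
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
    return fit.coef_ / w              # map coefficients back to the original scale

# Toy usage: feature 0 is known a priori to matter and is placed in the conditioning set
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(size=100)
print(conditional_adaptive_lasso(X, y, cond_idx=[0]))
```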

Comparing the Structural Components Variance Estimator and U-Statistics Variance Estimator When Assessing the Difference Between Correlated AUCs with Finite Samples

Bosse, Anna L 01 January 2017
Introduction: The structural components variance estimator proposed by DeLong et al. (1988) is a popular approach for comparing two correlated AUCs. However, this variance estimator is biased and can be problematic with small sample sizes. Methods: A U-statistics-based variance estimator is presented and compared with the structural components variance estimator through a large-scale simulation study under different finite-sample-size configurations. Results: The U-statistics variance estimator was unbiased for the true variance of the difference between correlated AUCs regardless of sample size and had lower RMSE than the structural components variance estimator, providing better type I error control and greater power. The structural components variance estimator produced increasingly biased variance estimates as the correlation between biomarkers increased. Discussion: When comparing two correlated AUCs, it is recommended that the U-statistics variance estimator be used whenever possible, especially for finite sample sizes and highly correlated biomarkers.
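For reference, a compact sketch of the structural-components (DeLong-style) calculation being compared here is given below; the placement-value construction follows the classical description, while the function names, toy data, and z-statistic summary are illustrative, and the U-statistics estimator studied in the thesis is not shown.

```python
import numpy as np

def structural_components(x_pos, x_neg):
    """Placement values (structural components) for one biomarker."""
    m, n = len(x_pos), len(x_neg)
    v10 = np.array([(np.sum(x_neg < xp) + 0.5 * np.sum(x_neg == xp)) / n for xp in x_pos])
    v01 = np.array([(np.sum(x_pos > xn) + 0.5 * np.sum(x_pos == xn)) / m for xn in x_neg])
    return v10.mean(), v10, v01        # AUC and its case/control components

def delong_difference(pos, neg):
    """Difference of two correlated AUCs with the structural-components variance.

    `pos` and `neg` are (cases x 2) and (controls x 2) arrays of the two
    biomarkers measured on the same subjects. Returns (difference, variance, z).
    """
    aucs, V10, V01 = [], [], []
    for k in range(2):
        auc, v10, v01 = structural_components(pos[:, k], neg[:, k])
        aucs.append(auc); V10.append(v10); V01.append(v01)
    S10 = np.cov(np.column_stack(V10), rowvar=False)   # covariance of case components
    S01 = np.cov(np.column_stack(V01), rowvar=False)   # covariance of control components
    S = S10 / len(pos) + S01 / len(neg)
    diff = aucs[0] - aucs[1]
    var = S[0, 0] + S[1, 1] - 2.0 * S[0, 1]
    return diff, var, diff / np.sqrt(var)

# Toy example: two correlated biomarkers on the same cases and controls
rng = np.random.default_rng(1)
shared_c, shared_n = rng.normal(size=(30, 1)), rng.normal(size=(50, 1))
pos = 1.0 + 0.6 * shared_c + rng.normal(size=(30, 2))
neg = 0.6 * shared_n + rng.normal(size=(50, 2))
print(delong_difference(pos, neg))
```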

Bayesian Logistic Regression Model for Siting Biomass-using Facilities

Huang, Xia 01 December 2010
Key sources of oil for western markets are located in complex geopolitical environments that increase economic and social risk. The amalgamation of economic, environmental, social, and national security concerns for petroleum-based economies has created a renewed emphasis on alternative sources of energy, including biomass. The stability of sustainable biomass markets hinges on improved methods to predict and visualize business risk and cost to the supply chain. This thesis develops Bayesian logistic regression models, with comparisons to classical maximum likelihood models, to quantify significant factors that influence the siting of biomass-using facilities and to predict potential locations in the 13-state Southeastern United States for three types of biomass-using facilities. Group I combines all biomass-using mills, biorefineries using agricultural residues, and wood-using bioenergy/biofuels plants. Group II includes pulp and paper mills and biorefineries that use agricultural and wood residues. Group III includes food processing mills and biorefineries that use agricultural and wood residues. The resolution of this research is the 5-digit ZIP Code Tabulation Area (ZCTA), and there are 9,416 ZCTAs in the 13-state Southeastern study region. For both the classical and Bayesian approaches, the data were split into a training set and a separate validation (hold-out) set using a pseudo-random number-generating function in SAS® Enterprise Miner. Four predefined priors are constructed. Bayesian estimation assuming a Gaussian prior distribution provides the highest correct classification rate of 86.40% for Group I; Bayesian estimation assuming a non-informative uniform prior has the highest correct classification rate of 95.97% for Group II; and Bayesian estimation assuming a Gaussian prior gives the highest correct classification rate of 92.67% for Group III. Given the comparatively low sensitivity for Groups II and III, a hybrid model that integrates classification trees and local Bayesian logistic regression was developed as part of this research to further improve predictive power. The hybrid model increases the sensitivity of Group II from 58.54% to 64.40%, and significantly improves both the specificity and sensitivity for Group III, from 98.69% to 99.42% and from 39.35% to 46.45%, respectively. Twenty-five optimal locations for the biomass-using facility groupings at the 5-digit ZCTA resolution, based on the best-fitting Bayesian logistic regression model and the hybrid model, are predicted and plotted for the 13-state Southeastern study region.
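One hedged way to picture the Bayesian step: under a Gaussian prior on the coefficients, the posterior mode of a logistic regression coincides with an L2-penalized (ridge) fit, so a MAP approximation can be sketched with a standard solver, as below. The simulated covariates, prior scale, and train/validation split are placeholders, not the thesis's ZCTA-level data or SAS® workflow.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# With a N(0, prior_sd^2) prior on each coefficient (and a flat prior on the
# intercept), the posterior mode equals a ridge-penalized logistic fit with
# C = prior_sd**2, so a standard solver approximates the MAP estimate.
rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 4))            # hypothetical siting covariates (placeholders)
logit = 1.2 * X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

prior_sd = 2.0
map_fit = LogisticRegression(penalty="l2", C=prior_sd**2, max_iter=1000).fit(X_tr, y_tr)

print("MAP coefficients:", map_fit.coef_.ravel())
print("correct classification rate on the hold-out set:", map_fit.score(X_va, y_va))
```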

A Study of Missing Data Imputation and Predictive Modeling of Strength Properties of Wood Composites

Zeng, Yan 01 August 2011
Problem: Real-time process and destructive test data were collected from a wood composite manufacturer in the U.S. to develop real-time predictive models of two key strength properties (Modulus of Rupture (MOR) and Internal Bond (IB)) of a wood composite manufacturing process. Sensor malfunctions and data “send/retrieval” problems led to null fields in the company’s data warehouse, which resulted in information loss. Many manufacturers attempt to build accurate predictive models by excluding entire records with null fields or by using summary statistics such as the mean or median in place of the null field. However, predictive model errors in validation may be higher in the presence of information loss. In addition, the selection of predictive modeling methods poses another challenge to many wood composite manufacturers. Approach: This thesis consists of two parts addressing the above issues: 1) how to improve data quality using missing data imputation; and 2) which predictive modeling method is better in terms of prediction precision (measured by root mean square error, or RMSE). The first part summarizes an application of missing data imputation methods in predictive modeling. After variable selection, two missing data imputation methods were selected after comparing six possible methods. Predictive models of the imputed data were developed using partial least squares regression (PLSR) and compared with models of non-imputed data using ten-fold cross-validation. Root mean square error of prediction (RMSEP) and normalized RMSEP (NRMSEP) were calculated. The second part presents a series of comparisons among four predictive modeling methods using imputed data without variable selection. Results: The first part concludes that the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov chain Monte Carlo (MCMC) simulation achieved the most precise results. Predictive models based on imputed datasets generated more precise predictions (average NRMSEP of 5.8% for the MOR model and 7.2% for the IB model) than models based on non-imputed datasets (average NRMSEP of 6.3% for MOR and 8.1% for IB). The second part finds that the Bayesian Additive Regression Tree (BART) produced more precise predictions (average NRMSEP of 7.7% for the MOR model and 8.6% for the IB model) than the other three methods: PLSR, LASSO, and adaptive LASSO.
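A minimal sketch of the imputation-then-PLSR workflow is shown below, using an iterative imputer as a stand-in for the EM/MI imputation and ten-fold cross-validated RMSEP for scoring; the toy sensor data, the number of PLS components, and the range-based NRMSEP normalization are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 12))                       # stand-in for process sensor readings
y = X[:, :3] @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.3, size=200)  # e.g. MOR
X[rng.random(X.shape) < 0.10] = np.nan               # simulate sensor dropouts / null fields

# Impute the null fields (a stand-in for the EM / MCMC-based MI of the thesis)
X_imp = IterativeImputer(max_iter=20, random_state=0).fit_transform(X)

# Ten-fold cross-validated RMSEP for a PLSR model on the imputed data
pls = PLSRegression(n_components=5)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
rmsep = -cross_val_score(pls, X_imp, y, cv=cv, scoring="neg_root_mean_squared_error")
print("10-fold RMSEP: %.3f   NRMSEP: %.1f%%" % (rmsep.mean(), 100 * rmsep.mean() / y.ptp()))
```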

STATISTICS IN THE BILLERA-HOLMES-VOGTMANN TREESPACE

Weyenberg, Grady S. 01 January 2015
This dissertation is an effort to adapt two classical non-parametric statistical techniques, kernel density estimation (KDE) and principal components analysis (PCA), to the Billera-Holmes-Vogtmann (BHV) metric space for phylogenetic trees. This adaptation gives a more general framework than currently exists for developing and testing hypotheses about apparent differences or similarities between sets of phylogenetic trees. For example, while the majority of gene histories found in a clade of organisms are expected to be generated by a common evolutionary process, numerous other coexisting processes (e.g. horizontal gene transfer, gene duplication and subsequent neofunctionalization) will cause some genes to exhibit a history quite distinct from the histories of the majority of genes. Such “outlying” gene trees are considered to be biologically interesting, and identifying these genes has become an important problem in phylogenetics. The R software package kdetrees, developed in Chapter 2, contains an implementation of the kernel density estimation method. The primary theoretical difficulty involved in this adaptation concerns the normalization of the kernel functions in the BHV metric space; this problem is addressed in Chapter 3. In both chapters, the software package is applied to simulated and empirical datasets to demonstrate the properties of the method. A few first theoretical steps in the adaptation of principal components analysis to the BHV space are presented in Chapter 4. It becomes necessary to generalize the notion of a set of perpendicular vectors in Euclidean space to the BHV metric space, but there is some ambiguity about how best to proceed. We show that convex hulls are one reasonable approach to the problem. The Nye PCA algorithm provides a method for projecting onto arbitrary convex hulls in BHV space, providing the core of a modified PCA-type method.
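As a schematic of the kdetrees idea only: given a matrix of pairwise BHV geodesic distances between gene trees (computed elsewhere), a kernel score can be assigned to each tree and low-scoring trees flagged as possible outliers. The sketch below ignores the kernel-normalization issue that Chapter 3 addresses, and its bandwidth heuristic and outlier cutoff are illustrative assumptions.

```python
import numpy as np

def kde_outlier_scores(dist, bandwidth=None):
    """Kernel scores for trees from a pairwise (e.g. BHV geodesic) distance matrix.

    Low-scoring trees are flagged as candidate outliers. Kernel normalization in
    BHV space is ignored here; the bandwidth and cutoff are heuristic choices.
    """
    dist = np.asarray(dist, dtype=float)
    if bandwidth is None:
        bandwidth = np.median(dist[dist > 0])
    k = np.exp(-0.5 * (dist / bandwidth) ** 2)
    np.fill_diagonal(k, 0.0)                      # leave each tree out of its own score
    scores = k.sum(axis=1)
    cutoff = scores.mean() - 1.5 * scores.std()   # illustrative outlier rule
    return scores, np.where(scores < cutoff)[0]

# Toy usage with a random symmetric matrix standing in for BHV distances
rng = np.random.default_rng(4)
d = rng.random((30, 30)); d = (d + d.T) / 2.0; np.fill_diagonal(d, 0.0)
scores, outliers = kde_outlier_scores(d)
print("flagged trees:", outliers)
```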

If and How Many 'Races'? The Application of Mixture Modeling to World-Wide Human Craniometric Variation

Algee-Hewitt, Bridget Frances Beatrice 01 December 2011
Studies in human cranial variation are extensive and widely discussed. While skeletal biologists continue to focus on questions of biological distance and population history, group-specific knowledge is increasingly used for human identification in medico-legal contexts. The importance of this research has often been overshadowed by both philosophical and methodological concerns. Many analyses have been constrained in their scope by the limited availability of representative samples and readily criticized for adopting statistical techniques that require user guidance and a priori information. A multi-part project is presented here that implements model-based clustering as an alternative approach for population studies using craniometric traits. This project also introduces the use of force-directed graphing and mixture-based supervised classification methods as statistically robust and practically useful techniques. This project considers three well-documented craniometric sources, whose samples collectively permit large-scale analyses and tests of population structure at a variety of partitions and for different goals. The craniofacial measurements drawn from the world-wide data sets collected by Howells and Hanihara permit rigorous tests for group differences and cryptic population structure. The inclusion of modern American samples from the Forensic Anthropology Data Bank allows for investigations into the importance of biosocial race and biogeographic ancestry in forensic anthropology. Demographic information from the United States Census Bureau is used to contextualize these samples within the range of racial diversity represented in the American population at large. This project's findings support the presence of population structure, the utility of finite mixture methods for questions of biological classification, and the validity of supervised discrimination methods as reliable tools. They also attest to the importance of context for producing the most useful information on identity and affinity. These results suggest that a meaningful relationship between statistically inferred clusters and predefined groups does exist and that population-informative differences in cranial morphology can be detected with measured degrees of statistical certainty, even when true memberships are unknown. They imply, in turn, that the estimation of biogeographic ancestry and the identification of biosocial race in forensic anthropology can provide useful information for modern American casework that can be supported by scientific methods.
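A minimal illustration of the model-based clustering approach is sketched below: Gaussian mixtures with varying numbers of components are fit to craniometric-style measurements and BIC selects among them, so the number of groups is inferred rather than fixed a priori. The simulated measurements are placeholders for the Howells, Hanihara, and Forensic Anthropology Data Bank variables, not the actual data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# Simulated stand-ins for craniometric measurements from three latent groups
X = np.vstack([rng.normal(loc=mu, scale=2.0, size=(150, 5))
               for mu in ([0, 0, 0, 0, 0], [3, 3, 3, 3, 3], [-3, 0, 3, 0, -3])])

# Fit finite Gaussian mixtures with 1..6 components and let BIC choose among them
fits = {k: GaussianMixture(n_components=k, covariance_type="full",
                           random_state=0).fit(X) for k in range(1, 7)}
bic = {k: m.bic(X) for k, m in fits.items()}
best_k = min(bic, key=bic.get)
labels = fits[best_k].predict(X)       # hard cluster assignments from the posterior
print("BIC-preferred number of clusters:", best_k)
```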

INFORMATIONAL INDEX AND ITS APPLICATIONS IN HIGH DIMENSIONAL DATA

Yuan, Qingcong 01 January 2017
We introduce a new class of measures for testing independence between two random vectors, based on the expected difference between conditional and marginal characteristic functions. By choosing a particular weight function in this class, we propose a new index for measuring independence and study its properties. Two empirical versions are developed; their properties, asymptotics, connections with existing measures, and applications are discussed. Implementation and Monte Carlo results are also presented. We propose a two-stage sufficient variable selection method based on the new index to deal with large-p-small-n data. The method does not require model specification and focuses especially on categorical responses. Our approach consistently improves on other typical screening approaches, which use only marginal relations. Numerical studies are provided to demonstrate the advantages of the method. We also introduce a novel approach to sufficient dimension reduction problems using the new measure. The proposed method requires very mild conditions on the predictors, estimates the central subspace effectively, and is especially useful when the response is categorical. It retains the model-free advantage without estimating a link function. Under regularity conditions, root-n consistency and asymptotic normality are established. The proposed method is very competitive and robust compared to existing dimension reduction methods in simulation results.
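The abstract does not give the new index in closed form, so the sketch below instead computes the classical distance covariance, a familiar member of the same characteristic-function-based family of dependence measures that the new index is connected to; it is shown purely as a point of reference, not as the proposed index.

```python
import numpy as np

def _centered_distances(z):
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(-1))     # pairwise distances
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()  # double centering

def distance_covariance(x, y):
    """Sample distance covariance (Szekely-Rizzo): a weighted L2 gap between the
    joint and product-of-marginals characteristic functions."""
    A, B = _centered_distances(x), _centered_distances(y)
    return np.sqrt(max((A * B).mean(), 0.0))

rng = np.random.default_rng(6)
u, v = rng.normal(size=500), rng.normal(size=500)
print(distance_covariance(u, v))                 # near zero: independent samples
print(distance_covariance(u, u**2 + 0.1 * v))    # clearly positive: dependent samples
```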

TESTING FOR TREATMENT HETEROGENEITY BETWEEN THE INDIVIDUAL OUTCOMES WITHIN A COMPOSITE OUTCOME

Pogue, Janice M. 04 1900
This series of papers explores the value of and mechanisms for using a heterogeneity test to compare treatment differences between the individual outcomes included in a composite outcome. Trialists often combine a group of outcomes into a single composite outcome based on the belief that all will share a common treatment effect. The question addressed here is how this assumption of homogeneity of treatment effect can be assessed in the analysis of a trial that uses this type of composite outcome. A class of models that can be used to form such a test involves the analysis of multiple outcomes per person, adjusting for the association due to repeated outcomes being observed on the same individuals. We compare heterogeneity tests from multiple models for binary and time-to-event composite outcomes to determine which have the greatest power to detect treatment differences for the individual outcomes within a composite outcome. Generally, both marginal and random effects models are shown to be reasonable choices for such tests. We show that a treatment heterogeneity test may be used to help design a study with a composite outcome and how it can help in the interpretation of trial results. / Doctor of Philosophy (PhD)
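A hedged sketch of one marginal-model version of such a test is given below: each subject contributes one binary record per component outcome, a GEE with an exchangeable working correlation accounts for the repeated outcomes within a person, and the treatment-by-outcome interaction is the heterogeneity test. The simulated trial, the outcome labels, and the model are illustrative, not the thesis's analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 600
treat = rng.integers(0, 2, n)
rows = []
for i in range(n):
    frailty = rng.normal(scale=0.5)                      # induces within-person correlation
    for outcome, beta in [("outcomeA", -0.5), ("outcomeB", -0.1)]:  # heterogeneous effects
        lp = -1.0 + beta * treat[i] + frailty
        rows.append({"id": i, "treat": int(treat[i]), "outcome": outcome,
                     "event": int(rng.binomial(1, 1.0 / (1.0 + np.exp(-lp))))})
df = pd.DataFrame(rows)

# One record per subject per component outcome; exchangeable working correlation
model = smf.gee("event ~ treat * outcome", groups="id", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(result.summary())   # the treat:outcome interaction row is the heterogeneity test
```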

A statistical framework to detect gene-environment interactions influencing complex traits

Deng, Wei Q. 27 August 2014
Advancements in human genomic technology have helped to improve our understanding of how genetic variation plays a central role in the mechanism of disease susceptibility. However, the very high dimensional nature of the data generated from large-scale genetic association studies has limited our ability to thoroughly examine genetic interactions. A prioritization scheme – Variance Prioritization (VP) – has been developed to select genetic variants based on differences in the quantitative trait variance between the possible genotypes using Levene’s test (Pare et al., 2010). Genetic variants with Levene’s test p-values lower than a pre-determined level of significance are selected to test for interactions using linear regression models. Under a variety of scenarios, VP has increased power to detect interactions over an exhaustive search as a result of reduced search space. Nevertheless, the use of Levene’s test does not take into account that the variance will either monotonically increase or decrease with the number of minor alleles when interactions are present. To address this issue, I propose a maximum likelihood approach to test for trends in variance between the genotypes, and derive a closed-form representation of the likelihood ratio test (LRT) statistic. Using simulations, I examine the performance of LRT in assessing the inequality of quantitative traits variance stratified by genotypes, and subsequently in identifying potentially interacting genetic variants. LRT is also used in an empirical dataset of 2,161 individuals to prioritize genetic variants for gene-environment interactions. The interaction p-values of the prioritized genetic variants are consistently lower than expected by chance compared to the non-prioritized, suggesting improved statistical power to detect interactions in the set of prioritized genetic variants. This new statistical test is expected to complement the existing VP framework and accelerate the process of genetic interaction discovery in future genome-wide studies and meta-analyses. / Master of Health Sciences (MSc)
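A minimal sketch of the Variance Prioritization step and the follow-up interaction test is given below: for each variant, Levene's test compares trait variance across genotype groups, and variants passing a preset threshold are carried forward to an explicit gene-environment interaction model. The simulated genotypes, the exposure, and the 0.05 threshold are illustrative, and the likelihood-ratio trend test proposed in the thesis is not implemented here.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, n_snps = 2000, 50
geno = rng.binomial(2, 0.3, size=(n, n_snps))        # minor-allele counts 0/1/2
expo = rng.normal(size=n)                            # an environmental exposure
trait = 0.3 * geno[:, 0] * expo + rng.normal(size=n) # SNP 0 interacts with the exposure

# Stage 1: prioritize variants whose trait variance differs across genotype groups
prioritized = []
for j in range(n_snps):
    groups = [trait[geno[:, j] == g] for g in (0, 1, 2) if np.any(geno[:, j] == g)]
    if stats.levene(*groups).pvalue < 0.05:          # illustrative threshold
        prioritized.append(j)

# Stage 2: explicit gene-environment interaction test for the prioritized variants
for j in prioritized:
    design = sm.add_constant(np.column_stack([geno[:, j], expo, geno[:, j] * expo]))
    p_int = sm.OLS(trait, design).fit().pvalues[-1]
    print(f"SNP {j}: interaction p-value = {p_int:.3g}")
```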
