121

Are Highly Dispersed Variables More Extreme? The Case of Distributions with Compact Support

Adjogah, Benedict E 01 May 2014 (has links)
We consider discrete and continuous symmetric random variables X taking values in [0, 1], and thus having expected value 1/2. The main thrust of this investigation is to study the relationship between the variance Var(X) of X and the expected maximum E(Mn) = E(max(X1, X2, ..., Xn)) of n independent and identically distributed random variables X1, X2, ..., Xn, each distributed as X. Many special cases are studied, some leading to very interesting alternating sums, and some progress is made towards a general theory.
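The question in the title lends itself to a quick Monte Carlo check. The sketch below (an illustration, not taken from the thesis) compares the expected maximum of n = 10 i.i.d. draws for two symmetric distributions on [0, 1] with mean 1/2 but very different variances: a Beta(5, 5) and the two-point distribution on {0, 1}, which has the maximum possible variance of 1/4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000  # sample size per maximum, Monte Carlo replications

# Two symmetric distributions on [0, 1], both with mean 1/2:
# Beta(5, 5) has variance 1/44; the two-point {0, 1} distribution has variance 1/4.
low_var = rng.beta(5, 5, size=(reps, n))
high_var = rng.integers(0, 2, size=(reps, n)).astype(float)

for name, x in [("Beta(5,5)", low_var), ("Bernoulli(1/2)", high_var)]:
    print(f"{name:15s} Var(X) ~ {x[:, 0].var():.4f}   E(M_n) ~ {x.max(axis=1).mean():.4f}")
```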
122

Multilevel Models for Longitudinal Data

Khatiwada, Aastha 01 August 2016 (has links)
Longitudinal data arise when individuals are measured several times during an observation period, so the measurements for each individual are not independent. There are several ways of analyzing longitudinal data when different treatments are compared. Multilevel models are used to analyze data that are clustered in some way. In this work, multilevel models are used to analyze longitudinal data from a case study, and the results are compared with those from other, more commonly used methods. The output of two software packages, SAS and R, is also compared. Finally, a method is proposed that fits a separate model to each individual and then performs an ANOVA-type analysis on the estimated parameters; its power for different sample sizes and effect sizes is studied by simulation.
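As a rough illustration of the kind of multilevel (mixed-effects) model discussed above, the Python sketch below fits a random-intercept model with a treatment-by-time interaction to synthetic longitudinal data using statsmodels; it is not the case-study analysis, and all variable names and effect sizes are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subj, n_times = 40, 5

# Synthetic longitudinal data: repeated measures nested within subjects.
df = pd.DataFrame({
    "subject":   np.repeat(np.arange(n_subj), n_times),
    "time":      np.tile(np.arange(n_times), n_subj),
    "treatment": np.repeat(rng.integers(0, 2, n_subj), n_times),
})
subj_effect = np.repeat(rng.normal(0, 1.0, n_subj), n_times)   # random intercepts
df["y"] = (2 + 0.5 * df["time"] + 1.0 * df["treatment"] * df["time"]
           + subj_effect + rng.normal(0, 0.5, len(df)))

# Multilevel model: fixed effects for time, treatment and their interaction,
# plus a random intercept for each subject.
model = smf.mixedlm("y ~ time * treatment", df, groups=df["subject"])
print(model.fit().summary())
```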
123

THE FAMILY OF CONDITIONAL PENALIZED METHODS WITH THEIR APPLICATION IN SUFFICIENT VARIABLE SELECTION

Xie, Jin 01 January 2018 (has links)
When scientists know in advance that some features (variables) are important for modeling the data, those features should be kept in the model. How can we use this prior information to effectively find other important features? This dissertation provides a solution that exploits such prior information. We propose the Conditional Adaptive Lasso (CAL) estimator. By choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. The asymptotic and oracle properties are proved. Simulations, especially for large p, small n problems, are performed with comparisons to other existing methods. We further extend the linear model setup to generalized linear models (GLMs). Instead of least squares, we consider the likelihood function with an L1 penalty, that is, penalized likelihood methods. We propose the Generalized Conditional Adaptive Lasso (GCAL) for generalized linear models, and then extend the method to any penalty term that satisfies certain regularity conditions, namely the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples are evaluated for our method with comparisons to existing methods. GCAL is also evaluated on a real leukemia data set.
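The central idea of conditioning the penalty on known-important variables can be illustrated with the standard adaptive-lasso reweighting trick, where the conditioning set simply receives a near-zero penalty weight. This is a rough sketch of the concept on synthetic data, not the CAL estimator or its theory as developed in the dissertation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[[0, 1, 5]] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.normal(scale=0.5, size=n)

cond_set = [0, 1]                       # features known a priori to matter

# Adaptive weights from an initial OLS fit: w_j = 1 / |beta_ols_j|.
beta_init = LinearRegression().fit(X, y).coef_
w = 1.0 / np.abs(beta_init)
w[cond_set] = 1e-3                      # tiny penalty weight ~ unpenalized conditioning set

# Reweighting trick: a lasso on X_j / w_j is equivalent to a weighted-L1 penalty.
X_scaled = X / w
fit = Lasso(alpha=0.05).fit(X_scaled, y)
beta_hat = fit.coef_ / w
print("selected features:", np.flatnonzero(np.abs(beta_hat) > 1e-8))
```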
124

Automatic 13C Chemical Shift Reference Correction of Protein NMR Spectral Data Using Data Mining and Bayesian Statistical Modeling

Chen, Xi 01 January 2019 (has links)
Nuclear magnetic resonance (NMR) is a highly versatile analytical technique for studying molecular configuration, conformation, and dynamics, especially of biomacromolecules such as proteins. However, due to the intrinsic properties of NMR experiments, results from NMR instruments require a referencing step before downstream analysis. Poor chemical shift referencing, especially for 13C in protein NMR experiments, fundamentally limits and can even prevent effective study of biomacromolecules via NMR. No available method can re-reference carbon chemical shifts from protein NMR without secondary experimental information such as structure or resonance assignment. To solve this problem, we constructed a Bayesian probabilistic framework that circumvents the limitations of previous reference correction methods, which required protein resonance assignment and/or a three-dimensional protein structure. Our algorithm, Bayesian Model Optimized Reference Correction (BaMORC), can detect and correct 13C chemical shift referencing errors before the protein resonance assignment step of analysis and without a three-dimensional structure. By combining the BaMORC methodology with a new intra-peaklist grouping algorithm, we created a combined method called Unassigned BaMORC that utilizes only unassigned experimental peak lists and the amino acid sequence. Unassigned BaMORC kept all experimental three-dimensional HN(CO)CACB-type peak lists tested within ± 0.4 ppm of the correct 13C reference value. On a much larger unassigned chemical shift test set, the base method kept 13C chemical shift referencing errors within ± 0.45 ppm at a 90% confidence interval. With chemical shift assignments, Assigned BaMORC can detect and correct 13C chemical shift referencing errors to within ± 0.22 ppm at a 90% confidence interval. Therefore, Unassigned BaMORC can correct 13C chemical shift referencing errors when it will have the most impact: right before protein resonance assignment and other downstream analyses are started. After assignment, the chemical shift reference correction can be further refined with Assigned BaMORC. To support broader usage of these new methods, we also created a software package with a web-based interface for the NMR community. This software allows non-NMR experts to detect and correct 13C referencing errors at critical early data-analysis steps, lowering the level of NMR expertise required for effective protein NMR analysis.
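At its core, reference correction of this kind treats the referencing error as a single global offset and chooses the offset that makes the observed shifts most probable under an expected distribution. The toy sketch below illustrates only that general idea, not the BaMORC algorithm; the reference mean and standard deviation are hypothetical placeholders rather than real amino-acid shift statistics.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Hypothetical reference distribution for a carbon shift (placeholder values).
ref_mean, ref_sd = 56.0, 2.0

# Simulated "observed" shifts carrying an unknown referencing offset of +1.3 ppm.
true_offset = 1.3
observed = rng.normal(ref_mean, ref_sd, size=200) + true_offset

# Grid search: pick the correction that maximizes the log-likelihood of the
# corrected shifts under the reference distribution.
grid = np.arange(-5.0, 5.0, 0.01)
loglik = [norm.logpdf(observed - c, ref_mean, ref_sd).sum() for c in grid]
print(f"estimated referencing offset: {grid[int(np.argmax(loglik))]:+.2f} ppm")
```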
125

EFFECT OF SOCIOECONOMIC AND DEMOGRAPHIC FACTORS ON KENTUCKY CRASHES

Cambron, Aaron Berry 01 January 2018 (has links)
The goal of this research was to examine the potential of socioeconomic and demographic data on drivers to predict Kentucky crash occurrence. Identifying unique background characteristics of at-fault drivers that contribute to crash rates and crash severity may lead to improved and more specific interventions to reduce the negative impacts of motor vehicle crashes. The driver-residence ZIP code was used as a spatial unit to connect five years of Kentucky crash data with socioeconomic factors from the U.S. Census, such as income, employment, education, and age, along with terrain and vehicle age. At-fault driver crash counts, normalized by the driving population, were used as the dependent variable in a multivariate linear regression relating socioeconomic variables to motor vehicle crashes. The final model consisted of nine socioeconomic and demographic variables and yielded an R-squared of 0.279, which indicates a linear correlation but a lack of strong predictive power. The model produced both positive and negative correlations between socioeconomic variables and crash rates. Positive associations were found with the terrain index (a composite measure of road curviness), travel time, high school graduation, and vehicle age. Negative associations were found with younger drivers, unemployment, college education, and terrain difference, which compares the terrain index at the driver residence with that at the crash location. Further research seems warranted to fully understand the role that socioeconomic and demographic characteristics play in driving behavior and crash risk.
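The modeling step described, a multivariate linear regression of population-normalized crash counts on ZIP-code-level socioeconomic variables, can be sketched as follows. The data and variable names are synthetic placeholders; the nine-variable model and the R-squared of 0.279 reported above come from the actual Kentucky data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_zcta = 500  # hypothetical ZIP-code-level records

df = pd.DataFrame({
    "median_income": rng.normal(50, 15, n_zcta),
    "pct_college":   rng.uniform(5, 60, n_zcta),
    "terrain_index": rng.uniform(0, 1, n_zcta),
    "vehicle_age":   rng.normal(10, 3, n_zcta),
})
# Synthetic at-fault crashes per 1,000 licensed drivers.
df["crash_rate"] = (5 + 3 * df["terrain_index"] - 0.03 * df["pct_college"]
                    + rng.normal(0, 2, n_zcta))

ols = smf.ols("crash_rate ~ median_income + pct_college + terrain_index + vehicle_age",
              data=df).fit()
print(f"R-squared: {ols.rsquared:.3f}")
print(ols.params)
```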
126

Lifetime value modelling / Frederick Jacques van der Westhuizen

Van der Westhuizen, Frederick Jacques January 2009 (has links)
Given the increase in popularity of Lifetime Value (LTV), the argument is that the topic will assume an increasingly central role in research and marketing. Hence the decision to assess the state of the field in Lifetime Value modelling and to outline challenges unique to choice researchers in customer relationship management (CRM). As the research argues, there are many issues and analytical challenges that remain unresolved, and the researcher hopes that this thesis inspires new answers and new approaches to LTV. The scope of this project covers the building of an LTV model through multiple regression. The thesis focuses exclusively on modelling tenure. In this regard, there is a variety of benchmark statistical techniques arising from survival analysis that could be applied to tenure modelling. Tenure prediction is examined using survival analysis and compared with "crossbreed" data mining techniques that use multiple regression in concurrence with statistical techniques. It is demonstrated how data mining tools complement the statistical models, and that their combined use overcomes many of the shortcomings of each individual tool set, resulting in LTV models that are both accurate and comprehensible. Bank XYZ is used as an example and is based on a real scenario at one of the banks of South Africa. / Thesis (M.Sc. (Computer Science))--North-West University, Vaal Triangle Campus, 2009.
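A minimal sketch of the tenure-then-value idea: estimate a customer tenure (survival) curve and convert expected tenure into a simple LTV figure. The example uses the lifelines Kaplan-Meier estimator on synthetic data; the monthly margin and column names are hypothetical and have nothing to do with the Bank XYZ case.

```python
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(5)
n = 1000

# Synthetic customer data: tenure in months, 1 = churned, 0 = still active (censored).
tenure = rng.exponential(scale=36, size=n).round().clip(1, 120)
churned = rng.integers(0, 2, size=n)

kmf = KaplanMeierFitter()
kmf.fit(durations=tenure, event_observed=churned)

# Discrete (monthly) approximation of the area under the survival curve
# over a 60-month horizon, i.e. restricted expected tenure.
timeline = np.arange(0, 61)
surv = kmf.survival_function_at_times(timeline).values
expected_tenure = float(surv.sum())

monthly_margin = 25.0  # hypothetical margin per customer per month
print(f"expected tenure ~ {expected_tenure:.1f} months, "
      f"simple LTV ~ {expected_tenure * monthly_margin:.0f} currency units")
```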
127

Bayesian Logistic Regression Model for Siting Biomass-using Facilities

Huang, Xia 01 December 2010 (has links)
Key sources of oil for Western markets are located in complex geopolitical environments that increase economic and social risk. The amalgamation of economic, environmental, social, and national security concerns for petroleum-based economies has created a renewed emphasis on alternative sources of energy, including biomass. The stability of sustainable biomass markets hinges on improved methods to predict and visualize business risk and cost to the supply chain. This thesis develops Bayesian logistic regression models, with comparisons to classical maximum likelihood models, to quantify significant factors that influence the siting of biomass-using facilities and to predict potential locations in the 13-state Southeastern United States for three types of biomass-using facilities. Group I combines all biomass-using mills, biorefineries using agricultural residues, and wood-using bioenergy/biofuels plants. Group II includes pulp and paper mills, and biorefineries that use agricultural and wood residues. Group III includes food processing mills and biorefineries that use agricultural and wood residues. The resolution of this research is the 5-digit ZIP Code Tabulation Area (ZCTA); there are 9,416 ZCTAs in the 13-state Southeastern study region. For both the classical and Bayesian approaches, the data were split into a training set and a separate validation (hold-out) set using a pseudo-random number-generating function in SAS® Enterprise Miner. Four predefined priors are constructed. Bayesian estimation assuming a Gaussian prior distribution provides the highest correct classification rate of 86.40% for Group I; Bayesian methods assuming a non-informative uniform prior have the highest correct classification rate of 95.97% for Group II; and Bayesian methods assuming a Gaussian prior give the highest correct classification rate of 92.67% for Group III. Given the comparatively low sensitivity for Groups II and III, a hybrid model that integrates classification trees and local Bayesian logistic regression was developed as part of this research to further improve the predictive power. The hybrid model increases the sensitivity for Group II from 58.54% to 64.40%, and significantly improves both the specificity and the sensitivity for Group III, from 98.69% to 99.42% and from 39.35% to 46.45%, respectively. Twenty-five optimal locations for the biomass-using facility groupings at the 5-digit ZCTA resolution, based upon the best-fitted Bayesian logistic regression model and the hybrid model, are predicted and plotted for the 13-state Southeastern study region.
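As a hedged illustration of Bayesian logistic regression with a Gaussian prior (the best-performing specification reported for Groups I and III), the sketch below computes a MAP fit on synthetic siting data with scipy; it is not the SAS-based estimation used in the thesis, and the prior standard deviation is an assumed value.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, p = 500, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 3 covariates
beta_true = np.array([-1.0, 1.5, -0.8, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)  # facility present?

prior_sd = 2.0  # assumed Gaussian prior N(0, prior_sd^2) on each coefficient

def neg_log_posterior(beta):
    eta = X @ beta
    log_lik = y @ eta - np.logaddexp(0.0, eta).sum()      # Bernoulli log-likelihood
    log_prior = -0.5 * np.sum(beta ** 2) / prior_sd ** 2  # Gaussian prior
    return -(log_lik + log_prior)

map_fit = minimize(neg_log_posterior, x0=np.zeros(p), method="BFGS")
prob = 1 / (1 + np.exp(-X @ map_fit.x))
print("correct classification rate:", ((prob > 0.5) == (y == 1)).mean())
```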
128

A Study of Missing Data Imputation and Predictive Modeling of Strength Properties of Wood Composites

Zeng, Yan 01 August 2011 (has links)
Problem: Real-time process and destructive test data were collected from a wood composite manufacturer in the U.S. to develop real-time predictive models of two key strength properties, Modulus of Rupture (MOR) and Internal Bond (IB), of a wood composite manufacturing process. Sensor malfunctions and data "send/retrieval" problems led to null fields in the company's data warehouse, which resulted in information loss. Many manufacturers attempt to build accurate predictive models by excluding entire records with null fields or by using summary statistics such as the mean or median in place of the null field. However, predictive model errors in validation may be higher in the presence of information loss. In addition, the selection of predictive modeling methods poses another challenge to many wood composite manufacturers. Approach: This thesis consists of two parts addressing the above issues: 1) how to improve data quality using missing data imputation; 2) which predictive modeling method is better in terms of prediction precision (measured by root mean square error, or RMSE). The first part summarizes an application of missing data imputation methods in predictive modeling. After variable selection, two missing data imputation methods were selected after comparing six possible methods. Predictive models of imputed data were developed using partial least squares regression (PLSR) and compared with models of non-imputed data using ten-fold cross-validation. The root mean square error of prediction (RMSEP) and normalized RMSEP (NRMSEP) were calculated. The second part presents a series of comparisons among four predictive modeling methods using imputed data without variable selection. Results: The first part concludes that the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov Chain Monte Carlo (MCMC) simulation achieved more precise results. Predictive models based on imputed datasets generated more precise predictions (average NRMSEP of 5.8% for the MOR model and 7.2% for the IB model) than models based on non-imputed datasets (average NRMSEP of 6.3% for MOR and 8.1% for IB). The second part finds that Bayesian Additive Regression Trees (BART) produced more precise predictions (average NRMSEP of 7.7% for the MOR model and 8.6% for the IB model) than the other three models: PLSR, LASSO, and Adaptive LASSO.
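A compact sketch of the imputation-then-PLSR workflow on synthetic data: scikit-learn's IterativeImputer (a chained-equations imputer standing in for the EM/MCMC multiple imputation used in the thesis) feeds a partial least squares regression, and RMSEP is estimated by ten-fold cross-validation. Column counts and the missing-data rate are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
n, p = 300, 12
X = rng.normal(size=(n, p))                       # synthetic process variables
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)  # e.g. MOR

# Inject ~10% missing values to mimic sensor / "send-retrieval" null fields.
mask = rng.random(X.shape) < 0.10
X_missing = X.copy()
X_missing[mask] = np.nan

model = make_pipeline(IterativeImputer(random_state=0), PLSRegression(n_components=5))
rmsep = -cross_val_score(model, X_missing, y, cv=10,
                         scoring="neg_root_mean_squared_error")
print(f"10-fold RMSEP: {rmsep.mean():.3f}")
```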
129

Generalized Bathtub Hazard Models for Binary-Transformed Climate Data

Polcer, James 01 May 2011 (has links)
In this study, we use hazard-based modeling as an alternative statistical framework to time series methods as applied to climate data. Data collected from the Kentucky Mesonet are used to study the distributional properties of the duration of high- and low-energy wind events relative to an arbitrary threshold. Our objectives were to fit bathtub models proposed in the literature, propose a generalized bathtub model, apply these models to Kentucky Mesonet data, and make recommendations as to the feasibility of wind power generation. Using two different thresholds (1.8 and 10 mph, respectively), results show that the Hjorth bathtub model consistently performed better than all other models considered, with R-squared values of 0.95 or higher. However, fewer sites and months could be included in the analysis when the threshold was increased to 10 mph. Based on a 10 mph threshold, Bowling Green (FARM), Hopkinsville (PGHL), and Columbia (CMBA) posted the top three wind duration times in February 2009. Further studies are needed to establish long-term trends.
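For reference, one common parameterization of the Hjorth bathtub model has hazard h(t) = δt + θ/(1 + βt) and survival S(t) = exp(-δt²/2) / (1 + βt)^(θ/β). The sketch below fits that form by maximum likelihood to synthetic (not Mesonet) duration data; it illustrates the model form only, not the thesis's analysis.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
t = rng.weibull(1.5, size=500) * 4.0 + 0.01   # synthetic event-duration data (hours)

# Hjorth bathtub model, one common parameterization:
#   hazard   h(t) = delta*t + theta/(1 + beta*t)
#   survival S(t) = exp(-delta*t**2/2) / (1 + beta*t)**(theta/beta)
def neg_log_lik(params):
    delta, theta, beta = np.exp(params)       # enforce positivity via log-parameters
    h = delta * t + theta / (1 + beta * t)
    log_S = -delta * t**2 / 2 - (theta / beta) * np.log1p(beta * t)
    return -(np.log(h) + log_S).sum()

fit = minimize(neg_log_lik, x0=np.log([0.1, 0.5, 0.5]), method="Nelder-Mead")
print("delta, theta, beta =", np.exp(fit.x))
```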
130

Cagan Type Rational Expectations Model on Time Scales with Their Applications to Economics

Ekiz, Funda 01 November 2011 (has links)
Rational expectations describe how people or economic agents make decisions about the future using the available information and past experience. The first approach to the idea of rational expectations was given approximately fifty years ago by John F. Muth. Many models in economics have been studied using the rational expectations idea. The most familiar among them is the rational expectations version of Cagan's hyperinflation model, where the expectation for tomorrow is formed using all the information available today. This model was reinterpreted by Thomas J. Sargent and Neil Wallace in 1973. Since that time, many solution techniques have been suggested for the Cagan type rational expectations (CTRE) model. Some economists, such as Muth [13], Taylor [26] and Shiller [27], consider solutions admitting an infinite moving-average representation. Blanchard and Kahn [28] find solutions by using a recursive procedure. A general characterization of the solution was obtained using the martingale approach by Broze, Gourieroux and Szafarz in [22], [23]. We choose to study the martingale solution of the CTRE model. This thesis comprises five chapters whose main aim is to study the CTRE model on isolated time scales. Most of the models studied in economics are continuous or discrete. Discrete models are preferred by economists since they give more meaningful and accurate results, but they are restricted to uniform time domains. Time scale calculus enables us to work on m-periodic time domains as well as non-periodic time domains. In the first chapter, we give the basics of time scale calculus and stochastic calculus. The second chapter is a brief introduction to rational expectations and the CTRE model; many other solution techniques are also examined there. After introducing the necessary background, in the third chapter we construct the CTRE model on isolated time scales and give its general solution in terms of martingales. We continue by defining the linear system and higher-order CTRE on isolated time scales, use the Putzer algorithm to solve the system of CTRE models, and examine the existence and uniqueness of the solution of the CTRE model. In the fourth chapter, we apply the solution algorithm developed in the previous chapter to models in finance and to stochastic growth models in economics.
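For context, the standard discrete-time Cagan money-demand equation under rational expectations and its forward solution take the textbook form below; this is the classical version, not the time-scales generalization developed in the thesis.

```latex
% Cagan money demand under rational expectations (alpha > 0):
\begin{align}
  m_t - p_t &= -\alpha\,\bigl(E_t[p_{t+1}] - p_t\bigr),\\
  p_t &= \frac{1}{1+\alpha}\sum_{j=0}^{\infty}
         \Bigl(\frac{\alpha}{1+\alpha}\Bigr)^{\!j} E_t[m_{t+j}] \;+\; b_t,
  \qquad b_t = \frac{\alpha}{1+\alpha}\,E_t[b_{t+1}].
\end{align}
% The sum is the no-bubble "fundamental" solution obtained by solving forward;
% the bubble term b_t, pinned down only by a martingale-type condition, is why
% martingale methods characterize the full solution set.
```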
