31

Identifying mixtures of mixtures using Bayesian estimation

Malsiner-Walli, Gertraud, Frühwirth-Schnatter, Sylvia, Grün, Bettina January 2017 (has links) (PDF)
The use of a finite mixture of normal distributions in model-based clustering makes it possible to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and is, in general, achieved either by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior in which the hyperparameters are carefully selected so that they reflect the cluster structure aimed at. In addition, this prior allows the model to be estimated using standard MCMC sampling methods. In combination with a post-processing approach which resolves the label switching issue and results in an identified model, our approach makes it possible to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semi-parametric way using finite mixtures of normals, and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark data sets.
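The sparse-finite-mixture idea can be illustrated with a short sketch. The snippet below is not the authors' method (they use a carefully tuned hierarchical prior, MCMC sampling, and a label-switching post-processing step); it uses scikit-learn's variational BayesianGaussianMixture with a deliberately overfitted number of components and a small Dirichlet weight-concentration prior, so that superfluous components are emptied and the count of occupied components gives a rough estimate of the number of clusters. The toy data and the 0.01 thresholds are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Toy "mixture of mixtures": two non-Gaussian clusters, each built from two normals.
cluster1 = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(3, 1, (150, 2))])
cluster2 = np.vstack([rng.normal(10, 1, (150, 2)), rng.normal(13, 1, (150, 2))])
X = np.vstack([cluster1, cluster2])

bgm = BayesianGaussianMixture(
    n_components=10,                                 # deliberately overfitted
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.01,                 # sparse prior on the weights
    covariance_type="full",
    max_iter=500,
    random_state=1,
).fit(X)

# Components whose estimated weight stays above a small threshold are "occupied".
occupied = np.sum(bgm.weights_ > 0.01)
print("non-empty mixture components:", occupied)
```

Grouping the surviving normal components into the two non-Gaussian clusters, as the paper does, would still require a separate merging or identification step.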
32

Nonparametric estimation of the mixing distribution in mixed models with random intercepts and slopes

Saab, Rabih 24 April 2013 (has links)
Generalized linear mixed models (GLMMs) are widely used in statistical applications to model count and binary data. We consider the problem of nonparametric likelihood estimation of mixing distributions in GLMMs with multiple random effects. The log-likelihood to be maximized has the general form l(G) = Σ_i log ∫ f(y_i, γ) dG(γ), where f(., γ) is a parametric family of component densities, y_i is the ith observed response, and G is a mixing distribution function of the random effects vector γ defined on Ω. The literature presents many algorithms for maximum likelihood estimation (MLE) of G in the univariate random effect case, such as the EM algorithm (Laird, 1978), the intra-simplex direction method, ISDM (Lesperance and Kalbfleisch, 1992), and the vertex exchange method, VEM (Böhning, 1985). In this dissertation, the constrained Newton method (CNM) in Wang (2007), which fits GLMMs with random intercepts only, is extended to fit clustered datasets with multiple random effects. Owing to the general equivalence theorem from the geometry of mixture likelihoods (see Lindsay, 1995), many NPMLE algorithms, including CNM and ISDM, maximize the directional derivative of the log-likelihood to add potential support points to the mixing distribution G. Our method, Direct Search Directional Derivative (DSDD), uses a directional search method to find local maxima of the multi-dimensional directional derivative function. The DSDD's performance is investigated in GLMMs where f is a Bernoulli or Poisson distribution function. The algorithm is also extended to cover GLMMs with zero-inflated data. Goodness-of-fit (GOF) and selection methods for mixed models have been developed in the literature; however, their application in models with nonparametric random effects distributions is vague and ad hoc. Some popular measures, such as the deviance information criterion (DIC), the conditional Akaike information criterion (cAIC), and R² statistics, are potentially useful in this context. Additionally, some cross-validation goodness-of-fit methods popular in Bayesian applications, such as the conditional predictive ordinate (CPO) and numerical posterior predictive checks, can be applied with some minor modifications to suit the non-Bayesian approach. / Graduate / 0463 / rabihsaab@gmail.com
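A one-dimensional sketch of the directional-derivative criterion that NPMLE algorithms such as VEM, ISDM and CNM exploit may help: for a current mixing distribution G with support points γ_j and weights π_j, D(γ; G) = Σ_i f(y_i, γ)/f_G(y_i) − n, and support points with D > 0 can improve the likelihood. The Poisson random-intercept data, the current G, and the grid search below are illustrative assumptions; the dissertation's DSDD searches the multi-dimensional directional-derivative surface with a direct search method rather than a grid.

```python
import numpy as np
from scipy.stats import poisson

y = np.array([0, 1, 1, 2, 5, 6, 7, 9])          # toy counts
support = np.array([1.0, 6.0])                  # current support points of G
weights = np.array([0.5, 0.5])                  # current mixing weights

def mixture_density(y, support, weights):
    # f_G(y_i) = sum_j pi_j f(y_i, gamma_j), with Poisson(gamma_j) components
    return (weights * poisson.pmf(y[:, None], support[None, :])).sum(axis=1)

def directional_derivative(gamma, y, support, weights):
    f_G = mixture_density(y, support, weights)
    return np.sum(poisson.pmf(y, gamma) / f_G) - len(y)

grid = np.linspace(0.1, 12.0, 200)
D = np.array([directional_derivative(g, y, support, weights) for g in grid])
print("candidate new support point:", grid[np.argmax(D)], "D =", D.max())
```

If the maximal D is positive, the maximizer is added to the support and the weights are re-optimized; iterating this step is the common skeleton of the algorithms cited above.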
34

Recurrent-Event Models for Change-Points Detection

Li, Qing 23 December 2015 (has links)
The driving risk of novice teenage drivers is highest during the initial period after licensure but decreases rapidly. This dissertation develops recurrent-event change-point models to detect the time when driving risk decreases significantly for novice teenage drivers. The dissertation consists of three major parts: the first part applies recurrent-event change-point models with identical change-points for all subjects; the second part proposes a hierarchical Bayesian finite mixture model that allows change-points to vary among drivers; the third part develops a non-parametric Bayesian model with a Dirichlet process prior. In the first part, two recurrent-event change-point models to detect the time of change in driving risk are developed. The models are based on a non-homogeneous Poisson process with piecewise constant intensity functions. It is shown that the change-points only occur at the event times and that the maximum likelihood estimators are consistent. The proposed models are applied to the Naturalistic Teenage Driving Study, which continuously recorded in situ driving behaviour of 42 novice teenage drivers for the first 18 months after licensure using sophisticated in-vehicle instrumentation. The results indicate that the crash and near-crash rate decreases significantly after 73 hours of independent driving after licensure. The models in part one assume identical change-points for all drivers. However, several studies showed that different patterns of risk change over time might exist among the teenagers, which implies that the change-points might not be identical among drivers. In the second part, change-points are allowed to vary among drivers through a hierarchical Bayesian finite mixture model, reflecting the clusters that exist among the teenagers. The prior for the mixture proportions is a Dirichlet distribution, and a Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions. DIC is used to determine the best number of clusters. Based on the simulation study, the model performs well under different scenarios. For the Naturalistic Teenage Driving Study data, three clusters exist among the teenagers: the change-points are 52.30, 108.99 and 150.20 hours of driving after first licensure for the three clusters, respectively; the intensity rate increases for the first cluster and decreases for the other two; the change-point of the first cluster is the earliest and its average intensity rate is the highest. In the second part, model selection is conducted to determine the number of clusters. An alternative is the Bayesian non-parametric approach. In the third part, a Dirichlet process mixture model is proposed, in which the change-points are assigned a Dirichlet process prior. A Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions. Clustering based on the change-points is obtained automatically, without specifying the number of latent clusters. Based on the Dirichlet process mixture model, three clusters exist among the teenage drivers in the Naturalistic Teenage Driving Study. The change-points of the three clusters are 96.31, 163.83, and 279.19 hours. The results provide critical information for safety education, safety countermeasure development, and Graduated Driver Licensing policy making. / Ph. D.
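The first part's model can be sketched for a single subject: with a non-homogeneous Poisson process whose intensity is constant before and after a change-point τ, the change-point candidates can be restricted to the observed event times and the segment rates have closed-form MLEs, so τ can be found by profiling the log-likelihood. The simulated event times, rates, and observation window below are illustrative assumptions, and the full model pools information across drivers.

```python
import numpy as np

def profile_changepoint(event_times, T):
    """Single change-point MLE for a piecewise-constant Poisson intensity on [0, T]."""
    event_times = np.sort(np.asarray(event_times, dtype=float))
    best_loglik, best_tau = -np.inf, None
    for tau in event_times[:-1]:                   # candidate change-points
        n1 = np.sum(event_times <= tau)
        n2 = len(event_times) - n1
        lam1, lam2 = n1 / tau, n2 / (T - tau)      # segment-wise rate MLEs
        loglik = (n1 * np.log(lam1) - lam1 * tau
                  + n2 * np.log(lam2) - lam2 * (T - tau))
        if loglik > best_loglik:
            best_loglik, best_tau = loglik, tau
    return best_tau, best_loglik

rng = np.random.default_rng(0)
# Simulated event hours: rate 0.10/hour before 73 h of driving, 0.03/hour after.
early = np.cumsum(rng.exponential(1 / 0.10, 20)); early = early[early < 73]
late = 73 + np.cumsum(rng.exponential(1 / 0.03, 20)); late = late[late < 500]
print(profile_changepoint(np.concatenate([early, late]), T=500.0))
```

The hierarchical Bayesian and Dirichlet process extensions in parts two and three replace this single shared τ with driver-specific change-points sampled by MCMC.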
35

Statistical Approaches for Handling Missing Data in Cluster Randomized Trials

Fiero, Mallorie H. January 2016 (has links)
In cluster randomized trials (CRTs), groups of participants are randomized rather than individual participants. This design is often chosen to minimize treatment arm contamination or to enhance compliance among participants. In CRTs, we cannot assume independence among individuals within the same cluster because of their similarity, which leads to decreased statistical power compared to individually randomized trials. The intracluster correlation coefficient (ICC) is crucial in the design and analysis of CRTs and measures the proportion of total variance due to clustering. Missing data are a common problem in CRTs and should be accommodated with appropriate statistical techniques because they can compromise the advantages created by randomization and are a potential source of bias. In three papers, I investigate statistical approaches for handling missing data in CRTs. In the first paper, I carry out a systematic review evaluating current practice in handling missing data in CRTs. The results show high rates of missing data in the majority of CRTs, yet handling of missing data remains suboptimal. Fourteen (16%) of the 86 reviewed trials reported carrying out a sensitivity analysis for missing data. Despite recommendations that such analyses weaken the missing data assumption made in the primary analysis, only five of the trials weakened the assumption. None of the trials reported using missing not at random (MNAR) models. Because of the low proportion of CRTs reporting an appropriate sensitivity analysis for missing data, the second paper aims to facilitate performing a sensitivity analysis for missing data in CRTs by extending the pattern mixture approach for missing clustered data under the MNAR assumption. I implement multilevel multiple imputation (MI) in order to account for the hierarchical structure found in CRTs, and multiply the imputed values by a sensitivity parameter, k, to examine parameters of interest under different missing data assumptions. The simulation results show that estimates of parameters of interest in CRTs can vary widely under different missing data assumptions. A high proportion of missing data can occur in CRTs because missing data can be found at the individual level as well as the cluster level. In the third paper, I use a simulation study to compare missing data strategies for handling missing cluster-level covariates, including the linear mixed effects model, single imputation, single-level MI ignoring clustering, MI incorporating clusters as fixed effects, and MI at the cluster level using aggregated data. The results show that when the ICC is small (ICC ≤ 0.1) and the proportion of missing data is low (≤ 25%), the mixed model generates unbiased estimates of regression coefficients and the ICC. When the ICC is higher (ICC > 0.1), MI at the cluster level using aggregated data performs well for missing cluster-level covariates, though caution should be taken if the percentage of missing data is high.
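A minimal sketch of the second paper's sensitivity-analysis idea is given below, under two stated assumptions: the completed datasets from the (multilevel) multiple imputation are taken as given (here they are simulated placeholders), and the sensitivity parameter k is applied as a multiplicative adjustment to the imputed outcome values before the analysis is re-run and pooled with Rubin's rules. The multilevel imputation model itself and the cluster-level analysis are not shown.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Rubin's rules for combining an estimate across M imputed datasets."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    qbar = estimates.mean()                       # pooled point estimate
    within = variances.mean()                     # average within-imputation variance
    between = estimates.var(ddof=1)               # between-imputation variance
    total_var = within + (1 + 1 / len(estimates)) * between
    return qbar, np.sqrt(total_var)

def sensitivity_analysis(imputed_outcomes, missing_mask, k):
    """Re-run a toy complete-data analysis (an overall mean) under departure k."""
    ests, vars_ = [], []
    for y in imputed_outcomes:
        y = y.copy()
        y[missing_mask] *= k                      # shift imputed values away from MAR
        ests.append(y.mean())
        vars_.append(y.var(ddof=1) / len(y))
    return pool_rubin(ests, vars_)

rng = np.random.default_rng(2)
mask = rng.random(200) < 0.3                                  # 30% missing outcomes
imps = [rng.normal(1.0, 1.0, 200) for _ in range(5)]          # placeholder imputations
for k in (1.0, 0.8, 1.2):
    print(k, sensitivity_analysis(imps, mask, k))
```

Plotting the pooled estimate against k shows how strongly the trial's conclusion depends on the missing-at-random assumption.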
36

Estimating Freeway Travel Time Reliability for Traffic Operations and Planning

Yang, Shu January 2016 (has links)
Travel time reliability (TTR) has attracted increasing attention in recent years and is often listed as one of the major roadway performance and service quality measures for both traffic engineers and travelers. Measuring travel time reliability is the first step towards improving travel time reliability, ensuring on-time arrivals, and reducing travel costs. Four components may be primarily considered: travel time estimation/collection, selection of the quantity of travel time data, probability distribution selection, and TTR measure selection. Travel time is a key transportation performance measure because of its diverse applications, and it also serves as the foundation for estimating travel time reliability. Various modelling approaches to estimating freeway travel time have been well developed owing to the widespread installation of intelligent transportation system sensors. However, estimating accurate travel times using existing freeway travel time models is still challenging under congested conditions. Therefore, this study aimed to develop an innovative freeway travel time estimation model based on the General Motors (GM) car-following model. Since the GM model is usually used in a micro-simulation environment, the concepts of virtual leading and virtual following vehicles are proposed to allow the GM model to be used in macro-scale environments with aggregated traffic sensor data. Travel time data collected from three study corridors on I-270 in St. Louis, Missouri were used to verify the estimated travel times produced by the proposed General Motors Travel Time Estimation (GMTTE) model and two existing models, the instantaneous model and the time-slice model. The results showed that the GMTTE model outperformed the two existing models, with lower mean average percentage errors of 1.62% in free-flow conditions and 6.66% in two congested conditions. Overall, the GMTTE model demonstrated its robustness and accuracy for estimating freeway travel times. Most travel time reliability measures are derived from continuous probability distributions and applied directly to the traffic data. However, little previous research shows a consensus on the selection of a probability distribution family for travel time reliability. Different probability distribution families could yield different values for the same travel time reliability measure (e.g. standard deviation). It is believed that the specific selection of the probability distribution family has little effect on measuring travel time reliability. Therefore, two hypotheses are proposed in the hope of accurately measuring travel time reliability, and an experiment is designed to test them. The first hypothesis is tested by conducting the Kolmogorov–Smirnov test and by checking the convergence of the log-likelihood, the Akaike information criterion with a correction for finite sample sizes (AICc), and the Bayesian information criterion (BIC); the second hypothesis is tested by examining both moment-based and percentile-based travel time reliability measures. The results from testing the two hypotheses suggest that 1) underfitting may cause disagreement in distribution selection, 2) travel time can be precisely fitted using mixture models with a larger number of mixture components (K), regardless of the distribution family, and 3) the travel time reliability measures are insensitive to the selection of the distribution family.
The findings of this research allow researchers and practitioners to avoid the work of testing various distributions, and travel time reliability can be measured more accurately using mixture models owing to their higher log-likelihood values. As with travel time collection, the accuracy of the observed travel time and the optimal quantity of travel time data should be determined before using the TTR data. The statistical accuracy of TTR measures should be evaluated so that their statistical behavior can be fully understood. More specifically, this issue can be formulated as a question: using a certain amount of travel time data, how accurate is the travel time reliability for a specific freeway corridor, time of day (TOD), and day of week (DOW)? A framework for answering this question has not been proposed in the past. Our study proposes a framework based on bootstrapping to evaluate the accuracy of TTR measures and answer the question. Bootstrapping is a computer-based method for assigning measures of accuracy to many types of statistical estimators without requiring a specific probability distribution. Three scenarios representing three traffic flow conditions (free-flow, congestion, and transition) were used to fully understand the accuracy of TTR measures under different traffic conditions. The results of the accuracy measurements primarily showed that: 1) the proposed framework can facilitate assessment of the accuracy of TTR, and 2) stabilization of the TTR measures did not necessarily correspond to statistical accuracy. The findings of our study also suggested that moment-based TTR measures may not be statistically sufficient for measuring freeway TTR. Additionally, our study suggested that 4 or 5 weeks of travel time data are sufficient for measuring freeway TTR under free-flow conditions, 40 weeks for congested conditions, and 35 weeks for transition conditions. A considerable number of studies have contributed to measuring travel time reliability. Travel time distribution estimation is considered an important starting input for measuring travel time reliability. Kernel density estimation (KDE) is used to estimate the travel time distribution, instead of parametric probability distributions such as the lognormal distribution or two-state models. The Hasofer–Lind–Rackwitz–Fiessler (HL-RF) algorithm, widely used in the field of reliability engineering, is applied in this work. It is used to compute the reliability index of a system based on its previous performance. The computing procedure for travel time reliability of corridors on a freeway is first introduced, and network travel time reliability is developed afterwards. Given probability distributions estimated by the KDE technique, and an anticipated travel time from travelers, the two equations for corridor and network travel time reliability can be used to address the question, "How reliable is my perceived travel time?" Travel time reliability is defined here in the sense of "on-time performance", and it is formulated inherently from the perspective of travelers. Further, the major advantages of the proposed method are: 1) it demonstrates an alternative way to estimate travel time distributions when the choice of probability distribution family is uncertain; and 2) it is flexible enough to be applied at different levels of the roadway system (e.g. an individual roadway segment or a network).
A user-defined anticipated travel time can be input, and travelers can utilize the computed travel time reliability information to plan their trips in advance, in order to better manage trip time, reduce cost, and avoid frustration.
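The bootstrap framework for judging the accuracy of a TTR measure can be sketched as follows. The buffer-index-style measure, the simulated lognormal travel times, and the use of interval width as an accuracy gauge are illustrative assumptions standing in for the study's corridor data and full set of TTR measures.

```python
import numpy as np

def buffer_index(tt):
    """A percentile-based TTR measure: 95th percentile relative to the mean."""
    return (np.percentile(tt, 95) - tt.mean()) / tt.mean()

def bootstrap_ci(tt, stat, n_boot=2000, alpha=0.05, seed=0):
    """Nonparametric bootstrap confidence interval for a TTR statistic."""
    rng = np.random.default_rng(seed)
    boots = np.array([stat(rng.choice(tt, size=len(tt), replace=True))
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return stat(tt), (lo, hi)

rng = np.random.default_rng(3)
travel_times = rng.lognormal(mean=np.log(12), sigma=0.25, size=400)   # minutes
est, ci = bootstrap_ci(travel_times, buffer_index)
print(f"buffer index = {est:.3f}, 95% bootstrap CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

A wide interval for a given corridor, time of day, and day of week signals that more weeks of travel time data are needed before the TTR estimate can be trusted, which is the question the framework is designed to answer.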
37

Robust mixture regression modeling with Pearson type VII distribution

Zhang, Jingyi January 1900 (has links)
Master of Science / Department of Statistics / Weixing Song / A robust estimation procedure for parametric regression models is proposed by assuming that the error terms follow a Pearson type VII distribution. The estimation procedure is implemented by an EM algorithm, based on the fact that a Pearson type VII distribution is a scale mixture of normal distributions with a Gamma mixing distribution. A trimmed version of the proposed procedure, which can successfully trim high-leverage points away from the data, is also discussed. The finite-sample performance of the proposed algorithm is evaluated in extensive simulation studies, together with comparisons with other existing procedures in the literature.
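A compact sketch of the scale-mixture EM idea is shown below for Student-t errors, which are a special case of the Pearson type VII family; the degrees of freedom are held fixed and the trimming step is omitted, whereas the report works with the full Pearson type VII parametrization. The simulated data are placeholders.

```python
import numpy as np

def robust_regression_em(x, y, df=4.0, n_iter=100):
    """EM for linear regression with heavy-tailed (t / Pearson VII type) errors."""
    X = np.column_stack([np.ones(len(y)), x])       # design matrix with intercept
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS start
    sigma2 = np.mean((y - X @ beta) ** 2)
    for _ in range(n_iter):
        resid = y - X @ beta
        # E-step: posterior mean of the Gamma-distributed latent scale factors
        w = (df + 1.0) / (df + resid ** 2 / sigma2)
        # M-step: weighted least squares downweights high-residual observations
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        sigma2 = np.sum(w * (y - X @ beta) ** 2) / len(y)
    return beta, sigma2

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=200)  # heavy-tailed noise
y[:5] += 40                                         # a few gross outliers
print(robust_regression_em(x, y))                   # intercept/slope resist the outliers
```

The Gamma-weight E-step is exactly where the scale-mixture representation enters: outlying observations receive small weights, which is what makes the fit robust.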
38

Predicting Hearing Loss Using Auditory Steady-State Responses

Li, Yiwen 14 January 2009 (has links)
Auditory Steady-State Response (ASSR) is a promising tool for detecting hearing loss. In this project, we analyzed hearing threshold data obtained from two ASSR methods and a gold standard, pure tone audiometry, applied to both normal and hearing-impaired subjects. We constructed a repeated measures linear model to identify factors that show significant differences in the mean response. The analysis shows that there are significant differences due to hearing status (normal or impaired) and ASSR method, and that there is a significant interaction between hearing status and test signal frequency. The second task of this project was to predict the PTA threshold (gold standard) from the ASSR-A and ASSR-B thresholds separately at each frequency, in order to measure how accurate the ASSR measurements are and to obtain a "correction function" to correct the bias in the ASSR measurements. We used two approaches. In the first, we modeled the relation of the PTA responses to the ASSR values for the two hearing status groups as a mixture model and tried two prediction methods. The mixture modeling was successful, but the predictions gave disappointing results. A second approach, using logistic regression to predict group membership based on ASSR value and then using those predictions to obtain a predictor of the PTA value, gave successful results.
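A rough sketch of the second approach is given below with simulated placeholder values in place of the study's ASSR and PTA thresholds; combining the group-specific regressions by the logistic posterior probabilities is one plausible reading of "using those predictions to obtain a predictor of the PTA value", not necessarily the exact form used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
n = 200
group = rng.integers(0, 2, n)                        # 0 = normal, 1 = impaired
pta = np.where(group == 0, rng.normal(15, 5, n), rng.normal(60, 10, n))
assr = pta + rng.normal(10, 8, n)                    # ASSR overestimates PTA (biased)

clf = LogisticRegression().fit(assr.reshape(-1, 1), group)          # group membership
regs = [LinearRegression().fit(assr[group == g].reshape(-1, 1),     # per-group PTA~ASSR
                               pta[group == g]) for g in (0, 1)]

def predict_pta(assr_new):
    a = np.asarray(assr_new, dtype=float).reshape(-1, 1)
    probs = clf.predict_proba(a)                     # P(group | ASSR)
    preds = np.column_stack([r.predict(a) for r in regs])
    return (probs * preds).sum(axis=1)               # posterior-weighted prediction

print(predict_pta([20, 45, 75]))
```

The group-specific regressions act as the bias-correction functions, and the logistic step decides how much each correction contributes for a new ASSR reading.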
39

Defining and predicting fast-selling clothing options

Jesperson, Sara January 2019 (has links)
This thesis aims to find a definition of fast-selling clothing options and a way to predict them using only a few weeks of sales data as input. The data used for this project contain daily sales and intake quantities for seasonal options with sale starts in 2016–2018, provided by the department store chain Åhléns. A definition is found that describes fast-selling clothing options as those having sold a certain percentage of their intake after a fixed number of days. An alternative definition based on cluster affiliation is shown to be less effective. Two predictive models are tested, the first being a probabilistic classifier and the second a k-nearest-neighbor classifier using the Euclidean distance. The probabilistic model is divided into three steps: transformation, clustering, and classification. The time series are transformed with B-splines to reduce dimensionality, so that each time series is represented by a vector containing its length and its B-spline coefficients. As a tool to improve the quality of the predictions, the B-spline vectors are clustered with a Gaussian mixture model, and every cluster is assigned one of the two labels, fast-selling or ordinary, thus dividing the clusters into disjoint sets: one containing fast-selling clusters and the other containing ordinary clusters. Lastly, the time series to be predicted are assumed to be Laplace-distributed around a B-spline, and, using the probability distributions provided by the clustering, the posterior probability of each class is used to classify the new observations. In the transformation step, the number of knots for the B-splines is chosen by cross-validation, and the Gaussian mixture models from the clustering step are evaluated with the Bayesian information criterion (BIC). The predictive performance of both classifiers is evaluated with accuracy, precision, and recall. The probabilistic model outperforms the k-nearest-neighbor model, with considerably higher accuracy, precision, and recall. The performance of each model improves when more data are used to make the predictions, most prominently for the probabilistic model.
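A condensed sketch of the transformation and clustering steps is given below on simulated placeholder sales curves (the Åhléns data are not reproduced): each daily-sales series is summarized by its cubic B-spline coefficients over a shared knot sequence, and Gaussian mixture models on the coefficient vectors are compared with BIC. The thesis additionally includes the series length in the feature vector, labels each cluster as fast-selling or ordinary, and classifies new series via posterior probabilities, none of which is shown here.

```python
import numpy as np
from scipy.interpolate import splrep
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
days = np.arange(1, 31, dtype=float)                  # 30 days of sales per option
knots = np.array([8.0, 15.0, 22.0])                   # shared interior knots

def simulate_series(fast):
    rate = 3.0 if fast else 1.0                       # fast sellers accumulate quicker
    return np.cumsum(rng.poisson(rate, size=days.size)).astype(float)

series = [simulate_series(fast=i < 40) for i in range(120)]   # 40 fast, 80 ordinary
# Least-squares cubic B-spline fit with fixed knots; keep the meaningful coefficients.
coefs = np.array([splrep(days, s, t=knots, k=3)[1][: knots.size + 4] for s in series])

models = {k: GaussianMixture(n_components=k, random_state=0).fit(coefs)
          for k in range(2, 7)}
best_k = min(models, key=lambda k: models[k].bic(coefs))
labels = models[best_k].predict(coefs)
print("K selected by BIC:", best_k, "cluster sizes:", np.bincount(labels))
```

Because the knots are fixed across series, the coefficient vectors are directly comparable, which is what makes the dimensionality reduction compatible with the subsequent mixture clustering.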
40

Classification of phylogenetic data via Bayesian mixture modelling

Loza Reyes, Elisa January 2010 (has links)
Conventional probabilistic models for phylogenetic inference assume that an evolutionary tree, and a single set of branch lengths and stochastic process of DNA evolution, are sufficient to characterise the generating process across an entire DNA alignment. Unfortunately, such a simplistic, homogeneous formulation may be a poor description of reality when the data arise from heterogeneous processes. A well-known example is when sites evolve at heterogeneous rates. This thesis is a contribution to the modelling and understanding of heterogeneity in phylogenetic data. We propose a method for the classification of DNA sites based on Bayesian mixture modelling. Our method not only accounts for heterogeneous data but also identifies the underlying classes and enables their interpretation. We also introduce novel MCMC methodology with the same, or greater, estimation performance than existing algorithms but with lower computational cost. We find that our mixture model can successfully detect evolutionary heterogeneity and demonstrate its direct relevance by applying it to real DNA data. One of these applications is the analysis of sixteen strains of one of the bacterial species that cause Lyme disease. Results from that analysis have helped in understanding the evolutionary paths of these bacterial strains and, therefore, the dynamics of the spread of Lyme disease. Our method is discussed in the context of DNA, but it may be extended to other types of molecular data. Moreover, the classification scheme that we propose is evidence of the breadth of application of mixture modelling and a step forwards in the search for more realistic models of the processes that underlie phylogenetic data.
