1 |
Hypothesis Testing in Finite Mixture Models. Li, Pengfei. 11 December 2007
Mixture models provide a natural framework for modelling
unobserved heterogeneity in a population.
They are widely applied in astronomy, biology,
engineering, finance, genetics, medicine, social sciences,
and other areas.
An important first step for using mixture models is the test
of homogeneity. Before one tries to fit a mixture model,
it might be of value to know whether the data arise from a
homogeneous or heterogeneous population. If the data are
homogeneous, it is not even necessary to go into mixture modeling.
The rejection of the homogeneous model may also have scientific implications.
For example, in classical statistical genetics,
it is often suspected that only a subgroup of patients have a
disease gene which is linked to the marker. Detecting
the existence of this subgroup amounts to the rejection of
a homogeneous null model in favour of a two-component
mixture model. This problem has attracted intensive
research recently. This thesis makes substantial contributions
in this area of research.
Due to partial loss of identifiability, classic inference methods
such as the likelihood ratio test (LRT) lose their usual elegant
statistical properties. The limiting distribution of the LRT
often involves complex Gaussian processes,
which can be hard to implement in data analysis.
The modified likelihood ratio test (MLRT) is found to be a good
alternative to the LRT. It restores identifiability by introducing
a penalty to the log-likelihood function.
Under some mild conditions,
the limiting distribution of the MLRT is
1/2\chi^2_0+1/2\chi^2_1,
where \chi^2_{0} is a point mass at 0.
This limiting distribution is convenient to use in real data analysis.
The choice of the penalty functions in the MLRT is very flexible.
A good choice of the penalty enhances the power of the MLRT.
In this thesis, we first introduce a new class of penalty functions,
with which the MLRT enjoys a significantly improved power for testing
homogeneity.
The main contribution of this thesis is to propose a new class of
methods for testing homogeneity. Most existing methods in the
literature for testing of homogeneity, explicitly or implicitly, are
derived under the condition of finite Fisher information and a
compactness assumption on the space of the mixing parameters. The
finite Fisher information condition prevents their application to many
important mixture models, such as the mixture of geometric
distributions, the mixture of exponential distributions and, more
generally, mixture models in scale distribution families. The
compactness assumption often forces practitioners to set artificial
bounds for the parameters of interest and makes the resulting
limiting distribution dependent on these bounds. Consequently,
a method free of such restrictions has long been sought by
researchers. As will be seen, the EM-test proposed in this thesis
is free of these shortcomings.
The EM-test combines the merits of the classic LRT and score test.
The properties of the EM-test are particularly easy to investigate
under single-parameter mixture models, where the test statistic
has the simple limiting distribution
0.5\chi^2_0+0.5\chi^2_1, the same as the MLRT.
This result is applicable to mixture models without requiring
the restrictive regularity conditions described earlier.
The normal mixture model is a very popular model in applications.
However, it does not satisfy the strong identifiability condition,
which creates substantial technical difficulties in the study of
asymptotic properties. Most existing methods do not directly apply
to normal mixture models, so the asymptotic properties have to
be developed separately. We investigate the application of the EM-test to
normal mixture models and derive its limiting distributions.
For the homogeneity test in the presence of a structural
parameter, the limiting distribution is a simple function of the
0.5\chi^2_0+0.5\chi^2_1 and \chi^2_1 distributions. The test
with this limiting distribution is still very convenient to
implement. For normal mixtures in both mean and variance parameters,
the limiting distribution of the EM-test is found to be \chi^2_2.
Mixture models are also widely used in the analysis of
directional data. The von Mises distribution is often regarded as
the circular normal model. Interestingly, it satisfies the strong
identifiability condition and the parameter space of the mean
direction is compact. However, the theoretical results for single-parameter
mixture models cannot be applied directly to von Mises
mixture models. Because of this, we also study the application of
the EM-test to von Mises mixture models in the presence of a
structural parameter. The limiting distribution of the EM-test is
also found to be 0.5\chi^2_0+0.5\chi^2_1.
Extensive simulation results are obtained to examine the precision
of the approximation of the limiting distributions to the finite
sample distributions of the EM-test. The type I errors with the
critical values determined by the limiting distributions are found
to be close to nominal values. We also propose
several precision-enhancing methods, which are found to work well.
Real data examples are used to illustrate the use of the EM-test.
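As a small illustration of why the 0.5\chi^2_0+0.5\chi^2_1 limit quoted above is convenient in practice, the following sketch computes a p-value and a critical value for a test statistic with that limiting distribution. The function names are hypothetical and not taken from the thesis; the only fact used is that \chi^2_0 is a point mass at 0, so for a positive statistic the tail probability is half the \chi^2_1 tail.

```python
import math


def chibar_pvalue(stat: float) -> float:
    """Tail probability P(T >= stat) when T ~ 0.5*chi2_0 + 0.5*chi2_1.

    chi2_0 is a point mass at 0, so for stat > 0 only the chi2_1 half
    contributes. For one degree of freedom,
    P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    if stat <= 0.0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))


def chibar_critical_value(alpha: float, tol: float = 1e-10) -> float:
    """Smallest c with P(T > c) <= alpha, found by bisection (hypothetical helper)."""
    if alpha >= 0.5:
        return 0.0  # the point mass at 0 already covers half the probability
    lo, hi = 0.0, 100.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chibar_pvalue(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    print(chibar_pvalue(3.2))
    print(chibar_critical_value(0.05))
```

Under these assumptions the 5% critical value coincides with the 10% critical value of an ordinary \chi^2_1 distribution, which is the practical sense in which the mixture limit is easy to use.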
|
3 |
Statistical methods for species richness estimation using count data from multiple sampling units. Argyle, Angus Gordon. 23 April 2012
The planet is experiencing a dramatic loss of species. The majority of species are unknown to science, and it is usually infeasible to conduct a census of a region to acquire a complete inventory of all life forms. Therefore, it is important to estimate and conduct statistical inference on the total number of species in a region based on samples obtained from field observations. Such estimates may suggest the number of species new to science and at potential risk of extinction.
In this thesis, we develop novel methodology to conduct statistical inference, based on abundance-based data collected from multiple sampling locations, on the number of species within a taxonomic group residing in a region. The primary contribution of this work is the formulation of novel statistical methodology for analysis in this setting, where abundances of species are recorded at multiple sampling units across a region. This particular area has received relatively little attention in the literature.
In the first chapter, the problem of estimating the number of species is formulated in a broad context, one that occurs in several seemingly unrelated fields of study. Estimators are commonly developed from statistical sampling models. Depending on the organisms or objects under study, different sampling techniques are used, and consequently, a variety of statistical models have been developed for this problem. A review of existing estimation methods, categorized by the associated sampling model, is presented in the second chapter.
The third chapter develops a new negative binomial mixture model. The negative binomial model is employed to account for the common tendency of individuals of a particular species to occur in clusters. An exponential mixing distribution permits inference on the number of species that exist in the region, but were in fact absent from the sampling units. Adopting a classical approach for statistical inference, we develop the maximum likelihood estimator, and a corresponding profile-log-likelihood interval estimate of species richness. In addition, a Gaussian-based confidence interval based on large-sample theory is presented.
The fourth chapter further extends the hierarchical model developed in Chapter 3 into a Bayesian framework. The motivation for the Bayesian paradigm is explained, and a hierarchical model based on random effects and discrete latent variables is presented. Computing the posterior distribution in this case is not straightforward. A data augmentation technique that indirectly places priors on species richness is employed to fit the model using a Metropolis-Hastings algorithm.
The fifth chapter examines the performance of our new methodology. Simulation studies are used to examine the mean-squared error of our proposed estimators. Comparisons to several commonly-used non-parametric estimators are made. Several conclusions emerge, and settings where our approaches can yield superior performance are clarified.
In the sixth chapter, we present a case study. The methodology is applied to a real data set of oribatid mites (a taxonomic order of micro-arthropods) collected from multiple sites in a tropical rainforest in Panama. We adjust our statistical sampling models to account for the varying masses of material sampled from the sites. The resulting estimates of species richness for the oribatid mites are useful, and contribute to a wider investigation, currently underway, examining the species richness of all arthropods in the rainforest.
Our approaches are the only existing methods that can make full use of the abundance-based data from multiple sampling units located in a single region. The seventh and final chapter concludes the thesis with a discussion of key considerations related to implementation and modeling assumptions, and describes potential avenues for further investigation. / Graduate
|
4 |
Statistical Inferences under a semiparametric finite mixture model. Zhang, Shiju. January 2005
No description available.
|
5 |
On Clustering: Mixture Model Averaging with the Generalized Hyperbolic Distribution. Ricciuti, Sarah. 11 1900
Cluster analysis is commonly described as the classification of unlabeled observations into groups such that observations within a group are more similar to one another than to observations in other groups. Model-based clustering assumes that the data arise from a statistical (mixture) model, and typically many models are fitted to the data, from which the `best' model is selected by a model selection criterion (often the BIC in mixture model applications). This chosen model is then the only model that is used for making inferences on the data. Although this is common practice, proceeding in this way ignores a large component of model selection uncertainty, especially in situations where the difference in the model selection criterion between two competing models is small. For this reason, there has been recent interest in selecting a subset of models that are close to the selected best model and using a weighted averaging approach to incorporate information from multiple models in this set. Model averaging is not a novel approach, yet its presence in a clustering framework is minimal. Here, we use Occam's window to select a subset of models eligible for two types of averaging techniques: averaging a posteriori probabilities, and direct averaging of model parameters. The efficacy of these model-based averaging approaches is demonstrated for a family of generalized hyperbolic mixture models using real and simulated data. / Thesis / Master of Science (MSc)
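The idea of selecting models with Occam's window and then averaging a posteriori probabilities can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's procedure: BIC is taken in the "lower is better" form, posterior model probabilities are approximated by exp(-BIC/2), the window ratio of 20 is an arbitrary choice, and all retained models are assumed to have the same number of components with labels already aligned. All function names are hypothetical.

```python
import math


def occams_window_weights(bic: dict, ratio: float = 20.0) -> dict:
    """Return normalised weights for the models inside Occam's window.

    Assumption: BIC = -2 log L + k log n (lower is better), so the
    approximate posterior model probability is proportional to
    exp(-0.5 * (BIC - best BIC)). A model is kept when that quantity
    is at least 1/ratio of the best model's.
    """
    best = min(bic.values())
    rel = {m: math.exp(-0.5 * (b - best)) for m, b in bic.items()}
    kept = {m: r for m, r in rel.items() if r >= 1.0 / ratio}
    total = sum(kept.values())
    return {m: r / total for m, r in kept.items()}


def average_memberships(probs: dict, weights: dict) -> list:
    """Weighted average of n-by-G membership probability matrices.

    Assumption: every retained model has G components and the component
    labels have already been matched across models.
    """
    names = list(weights)
    n = len(probs[names[0]])
    g = len(probs[names[0]][0])
    return [
        [sum(weights[m] * probs[m][i][j] for m in names) for j in range(g)]
        for i in range(n)
    ]


if __name__ == "__main__":
    w = occams_window_weights({"A": 100.0, "B": 102.0, "C": 120.0})
    z = average_memberships(
        {"A": [[0.9, 0.1], [0.2, 0.8]], "B": [[0.7, 0.3], [0.4, 0.6]]}, w
    )
    print(w)
    print(z)
```

In this toy input, model C falls outside the window and receives no weight; the averaged membership rows still sum to one because each retained model's rows do.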
|
6 |
Identifying mixtures of mixtures using Bayesian estimation. Malsiner-Walli, Gertraud; Frühwirth-Schnatter, Sylvia; Grün, Bettina. January 2017
The use of a finite mixture of normal distributions in model-based clustering makes it possible to
capture non-Gaussian data clusters. However, identifying the clusters from the normal components
is challenging and is in general achieved either by imposing constraints on the model or
by using post-processing procedures.
Within the Bayesian framework we propose a different approach based on sparse finite
mixtures to achieve identifiability. We specify a hierarchical prior where the hyperparameters
are carefully selected such that they reflect the cluster structure aimed at. In addition,
this prior allows the model to be estimated using standard MCMC sampling methods. In combination
with a post-processing approach which resolves the label switching issue and results in
an identified model, our approach makes it possible to simultaneously (1) determine the number of clusters,
(2) flexibly approximate the cluster distributions in a semi-parametric way using finite
mixtures of normals and (3) identify cluster-specific parameters and classify observations. The
proposed approach is illustrated in two simulation studies and on benchmark data sets.
|
7 |
Dynamic Food Demand in China and International Nutrition Transition. Zhou, De. 12 May 2014
No description available.
|
8 |
Essays on Transaction Costs and Food Diversity in Developing Countries. Steffen David, Christoph. 28 June 2017
No description available.
|
9 |
Recurrent-Event Models for Change-Points Detection. Li, Qing. 23 December 2015
The driving risk of novice teenagers is the highest during the initial period after licensure but decreases rapidly. This dissertation develops recurrent-event change-point models to detect the time when driving risk decreases significantly for novice teenage drivers. The dissertation consists of three major parts: the first part applies recurrent-event change-point models with identical change-points for all subjects; the second part proposes models that allow change-points to vary among drivers through a hierarchical Bayesian finite mixture model; the third part develops a non-parametric Bayesian model with a Dirichlet process prior. In the first part, two recurrent-event change-point models to detect the time of change in driving risks are developed. The models are based on a non-homogeneous Poisson process with piecewise constant intensity functions. It is shown that the change-points only occur at the event times and the maximum likelihood estimators are consistent. The proposed models are applied to the Naturalistic Teenage Driving Study, which continuously recorded in situ driving behaviour of 42 novice teenage drivers for the first 18 months after licensure using sophisticated in-vehicle instrumentation. The results indicate that the crash and near-crash rate decreases significantly after 73 hours of independent driving after licensure. The models in part one assume identical change-points for all drivers. However, several studies have shown that different patterns of risk change over time might exist among teenagers, which implies that the change-points might not be identical among drivers. In the second part, change-points are allowed to vary among drivers through a hierarchical Bayesian finite mixture model, which assumes that clusters exist among the teenagers. The prior for the mixture proportions is a Dirichlet distribution, and a Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions.
The deviance information criterion (DIC) is used to determine the best number of clusters. Based on the simulation study, the model performs well under different scenarios. For the Naturalistic Teenage Driving Study data, three clusters exist among the teenagers: the change-points are 52.30, 108.99 and 150.20 hours of driving after first licensure for the three clusters, respectively; the intensity rates increase for the first cluster but decrease for the other two clusters; the change-point of the first cluster is the earliest and its average intensity rate is the highest. In the second part, model selection is conducted to determine the number of clusters. An alternative is the Bayesian non-parametric approach. In the third part, a Dirichlet process mixture model is proposed, in which the change-points are assigned a Dirichlet process prior. A Markov chain Monte Carlo algorithm is developed to sample from the posterior distributions. Automatic clustering based on change-points is expected, without specifying the number of latent clusters. Based on the Dirichlet process mixture model, three clusters exist among the teenage drivers in the Naturalistic Teenage Driving Study. The change-points of the three clusters are 96.31, 163.83, and 279.19 hours. The results provide critical information for safety education, safety countermeasure development, and Graduated Driver Licensing policy making. / Ph. D.
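The first part of the abstract above describes a Poisson process with piecewise constant intensity whose change-point can only occur at an event time. A minimal sketch of that idea, for a single subject and a single change-point, is given below. It is not the dissertation's model (which pools many drivers and, in later parts, is Bayesian); the function names and the toy event times are hypothetical, and the intensities on each side of the candidate change-point are simply replaced by their maximum likelihood values (count divided by exposure).

```python
import math


def _segment_loglik(n: int, length: float) -> float:
    """Maximised log-likelihood of a constant-rate segment.

    With rate = n / length, the Poisson-process log-likelihood
    n * log(rate) - rate * length reduces to n * log(n / length) - n.
    An empty segment contributes 0.
    """
    return n * math.log(n / length) - n if n > 0 else 0.0


def profile_loglik(events: list, tau: float, horizon: float) -> float:
    """Profile log-likelihood of a change-point at tau on (0, horizon)."""
    n1 = sum(1 for t in events if t <= tau)
    n2 = len(events) - n1
    return _segment_loglik(n1, tau) + _segment_loglik(n2, horizon - tau)


def estimate_change_point(events: list, horizon: float) -> float:
    """Search only over event times, as the abstract's result suggests."""
    candidates = [t for t in sorted(events) if 0.0 < t < horizon]
    return max(candidates, key=lambda tau: profile_loglik(events, tau, horizon))


if __name__ == "__main__":
    # Toy data: one event per hour for 10 hours, then one every 10 hours.
    toy_events = [float(t) for t in range(1, 11)] + [20.0, 30.0, 40.0]
    print(estimate_change_point(toy_events, horizon=50.0))
```

Restricting the search to observed event times keeps the optimisation a finite scan instead of a continuous search over the whole observation window.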
|