51 |
Construction of amino acid rate matrices and extensions of the Barry and Hartigan model for phylogenetic inference. Zou, Liwen, 09 August 2011.
This thesis considers two distinct topics in phylogenetic analysis. The first is the construction of empirical rate matrices for amino acid models. The second topic, which constitutes the majority of the thesis, involves analysis of and extensions to the BH model of Barry and Hartigan (1987).
A number of rate matrices are used for phylogenetic analysis, including the PAM (Dayhoff et al. 1979), JTT (Jones et al. 1992) and WAG (Whelan and Goldman 2001) matrices. The construction of each of these presents difficulties. To avoid adjusting for multiple substitutions, the PAM and JTT matrices were constructed using only a subset of the data consisting of closely related species. The WAG model used an incomplete maximum likelihood estimation to reduce computational cost. We develop a modification of the pairwise methods first described in Arvestad and Bruno that better adjusts for some of the sparseness difficulties that arise with amino acid data.
The BH model is very flexible, allowing separate discrete-time Markov processes along different edges. We show, however, that an identifiability problem arises for the BH model, making it difficult to estimate character state frequencies at internal nodes. To obtain such frequencies and edge lengths for BH model fits, we define a nonstationary GTR (NSGTR) model along an edge, and find the NSGTR model that best approximates the fitted BH model. The NSGTR model is slightly more restrictive, but allows for estimation of internal node frequencies and interpretable edge lengths.
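As a hedged illustration of the kind of edge-wise model involved (a minimal sketch with hypothetical exchangeabilities, frequencies and edge length, not the estimation procedure developed in the thesis), a GTR rate matrix can be built and exponentiated to give the transition probabilities along a single edge; a nonstationary variant simply allows these parameters to differ from edge to edge:
```python
import numpy as np
from scipy.linalg import expm

def gtr_rate_matrix(exchangeabilities, freqs):
    """Build a GTR rate matrix Q with rows summing to zero.
    exchangeabilities: symmetric 4x4 array of relative rates (diagonal ignored).
    freqs: base frequencies (sum to 1)."""
    Q = exchangeabilities * freqs          # q_ij = s_ij * pi_j for i != j
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    # Normalise so one unit of edge length = one expected substitution per site.
    scale = -np.dot(freqs, np.diag(Q))
    return Q / scale

# Hypothetical per-edge parameters; a nonstationary GTR lets these differ by edge.
s = np.array([[0, 1, 2, 1],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [1, 2, 1, 0]], dtype=float)
pi_edge = np.array([0.30, 0.20, 0.25, 0.25])
t_edge = 0.1                               # edge length (expected substitutions per site)

Q = gtr_rate_matrix(s, pi_edge)
P = expm(Q * t_edge)                       # transition probabilities along this edge
print(P.round(3))
```
The BH model, by contrast, leaves each edge's transition matrix unconstrained, which is what makes internal node frequencies hard to pin down.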
While adjusting for rates-across-sites variation is now common practice in phylogenetic analyses, it is widely recognized that in reality evolutionary processes can change over both sites and lineages. As an adjustment for this, we introduce a BH mixture model that not only allows completely different models along edges of a topology, but also allows for different site classes whose evolutionary dynamics can take any form.
|
52 |
Extreme Value Mixture Modelling with Simulation Study and Applications in Finance and Insurance. Hu, Yang, January 2013.
Extreme value theory has been used to develop models for describing the distribution of rare events. These models can be used to asymptotically approximate the behavior of the tail(s) of a distribution function. An important challenge in applying such extreme value models is the choice of a threshold, beyond which the asymptotically justified extreme value models can provide good extrapolation. One approach for determining the threshold is to fit all the available data with an extreme value mixture model.
This thesis will review most of the existing extreme value mixture models in the literature and implement them in a package for the statistical programming language R, making them more readily usable by practitioners, as they are not commonly available in any software. There are many different forms of extreme value mixture model in the literature (e.g. parametric, semi-parametric and non-parametric), which provide an automated approach for estimating the threshold and for taking into account the uncertainty associated with threshold selection.
However, it is not clear how the proportion above the threshold, or tail fraction, should be treated, as there is no consistency in the existing model derivations. This thesis will develop some new models by adapting existing ones in the literature and placing them all within a more generalised framework that accounts for how the tail fraction is defined in the model. Various new models are proposed by extending some of the existing parametric mixture models to have a continuous density at the threshold, which has the advantage of using fewer model parameters and being more physically plausible. The generalised framework within which all the mixture models are placed can be used to demonstrate the importance of the specification of the tail fraction. An R package called evmix has been created to enable these mixture models to be more easily applied and further developed. For every mixture model, the density, distribution, quantile, random number generation, likelihood and fitting functions are provided (Bayesian inference via MCMC is also implemented for the non-parametric extreme value mixture models).
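As a hedged illustration of the tail-fraction issue (a sketch under an assumed normal bulk and GPD tail, not the evmix implementation), the density of a simple extreme value mixture model can be written with either a bulk-model-based tail fraction, phi = 1 - H(u), or a separately parameterised one:
```python
import numpy as np
from scipy.stats import norm, genpareto

def evmix_density(x, u, bulk_mu, bulk_sigma, xi, gpd_sigma, phi=None):
    """Normal bulk below threshold u, GPD tail above it.
    If phi is None, use the bulk-model-based tail fraction phi = 1 - H(u);
    otherwise phi is a free parameter (parameterised tail fraction)."""
    x = np.asarray(x, dtype=float)
    H_u = norm.cdf(u, bulk_mu, bulk_sigma)
    if phi is None:
        phi = 1.0 - H_u                               # bulk-model-based tail fraction
    below = (1.0 - phi) * norm.pdf(x, bulk_mu, bulk_sigma) / H_u
    above = phi * genpareto.pdf(x, xi, loc=u, scale=gpd_sigma)
    return np.where(x <= u, below, above)

xs = np.linspace(-3, 8, 5)
print(evmix_density(xs, u=1.5, bulk_mu=0.0, bulk_sigma=1.0, xi=0.2, gpd_sigma=1.0))
```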
A simulation study investigates the performance of the various extreme value mixture models under different population distributions with a representative variety of lower and upper tail behaviors. The results show that the non-parametric mixture model based on a kernel density estimator provides good tail estimation in general, whilst the parametric and semi-parametric mixture models can give a reasonable fit if the distribution below the threshold is correctly specified. Somewhat surprisingly, it is found that including a constraint of continuity at the threshold does not substantially improve the model fit in the upper tail. The hybrid Pareto model performs poorly as it does not include the tail fraction term. The relevant mixture models are applied to insurance and financial applications, which highlight the practical usefulness of these models.
|
53 |
Contaminated Chi-square Modeling and Its Application in Microarray Data Analysis. Zhou, Feng, 01 January 2014.
Mixture modeling has numerous applications; one of particular interest is microarray data analysis. My dissertation research is focused on Contaminated Chi-Square (CCS) modeling and its application to microarray data. A moment-based method and two likelihood-based methods, the Modified Likelihood Ratio Test (MLRT) and the Expectation-Maximization (EM) test, are developed for testing the omnibus null hypothesis of no contamination of a central chi-square distribution by a non-central chi-square distribution. When the omnibus null hypothesis is rejected, we further develop the moment-based test and the EM test for testing for an extra component in the contaminated chi-square model (CCS+EC). The moment-based approach is simple to use, and there is no need for re-sampling or random field theory to obtain critical values. When the statistical models are complicated, such as large mixtures of high-dimensional distributions, the MLRT and EM tests may have better power than moment-based approaches, and the MLRT and EM tests developed herein enjoy an elegant asymptotic theory.
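A minimal, hedged sketch of the contaminated chi-square density and a crude maximum-likelihood fit (assumed degrees of freedom and simulated data; a stand-in for, not a reproduction of, the moment-based, MLRT and EM procedures developed in the dissertation):
```python
import numpy as np
from scipy.stats import chi2, ncx2
from scipy.optimize import minimize

def ccs_density(x, p, df, ncp):
    """(1 - p) * central chi-square + p * non-central chi-square, shared df."""
    return (1.0 - p) * chi2.pdf(x, df) + p * ncx2.pdf(x, df, ncp)

def fit_ccs(x, df):
    """Crude direct maximisation of the log-likelihood over (p, ncp) for fixed df."""
    def nll(theta):
        p, ncp = theta
        if not (0.0 < p < 1.0 and ncp > 0.0):
            return np.inf
        return -np.sum(np.log(ccs_density(x, p, df, ncp) + 1e-300))
    return minimize(nll, x0=[0.1, 2.0], method="Nelder-Mead")

rng = np.random.default_rng(0)
x = np.concatenate([chi2.rvs(1, size=900, random_state=rng),
                    ncx2.rvs(1, 5.0, size=100, random_state=rng)])
print(fit_ccs(x, df=1).x)   # estimated (contamination proportion, non-centrality)
```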
|
54 |
INFORMATION THEORETIC CRITERIA FOR IMAGE QUALITY ASSESSMENT BASED ON NATURAL SCENE STATISTICS. Zhang, Di, January 2006.
Measurement of visual quality is crucial for various image and video processing applications.
The goal of objective image quality assessment is to introduce a computational quality metric that can predict image or video quality. Many methods have been proposed in the past decades. Traditionally, measurements convert the spatial data into some other feature domain, such as the Fourier domain, and compute a similarity, such as the mean square or Minkowski distance, between the test data and the reference or perfect data; however, only limited success has been achieved. None of the complicated metrics shows any great advantage over other existing metrics.
The common idea shared among many proposed objective quality metrics is that human visual error sensitivities vary across spatial and temporal frequency and directional channels. In this thesis, image quality assessment is approached by proposing a novel framework to compute the information lost in each channel, rather than the similarities used in previous methods. Based on natural scene statistics and several image models, an information theoretic framework is designed to compute the perceptual information contained in images and to evaluate image quality in the form of entropy.
The thesis is organized as follows. Chapter I gives a general introduction to previous work in this research area and a brief description of the human visual system. In Chapter II, statistical models for natural scenes are reviewed. Chapter III proposes the core ideas for computing the perceptual information contained in images. In Chapter IV, information theoretic criteria for image quality assessment are defined. Chapter V presents the simulation results in detail. In the last chapter, future directions and improvements of this research are discussed.
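An illustrative sketch only (a Gaussian-channel mutual information calculation under assumed subband signal and noise variances, not the specific criteria defined in Chapter IV): the information lost in a single channel can be measured in bits as the difference between what the reference and the distorted image convey about the source:
```python
import numpy as np

def channel_information(signal_var, gain, noise_var):
    """Mutual information (bits) between a Gaussian source of variance signal_var
    and its observation through multiplicative gain plus additive Gaussian noise."""
    return 0.5 * np.log2(1.0 + (gain ** 2) * signal_var / noise_var)

# Hypothetical subband statistics; in practice they would come from a natural scene model.
signal_var = 4.0          # variance of the reference subband coefficients
neural_noise = 0.1        # visual (neural) noise variance
distortion_gain = 0.7     # attenuation introduced by the distortion
distortion_noise = 0.5    # additive noise introduced by the distortion

info_reference = channel_information(signal_var, 1.0, neural_noise)
info_distorted = channel_information(signal_var, distortion_gain,
                                     distortion_noise + neural_noise)
print("information lost in this channel (bits):", info_reference - info_distorted)
```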
|
55 |
Statistical methods for species richness estimation using count data from multiple sampling units. Argyle, Angus Gordon, 23 April 2012.
The planet is experiencing a dramatic loss of species. The majority of species are unknown to science, and it is usually infeasible to conduct a census of a region to acquire a complete inventory of all life forms. Therefore, it is important to estimate and conduct statistical inference on the total number of species in a region based on samples obtained from field observations. Such estimates may suggest the number of species new to science and at potential risk of extinction.
In this thesis, we develop novel methodology to conduct statistical inference, based on abundance-based data collected from multiple sampling locations, on the number of species within a taxonomic group residing in a region. The primary contribution of this work is the formulation of novel statistical methodology for analysis in this setting, where abundances of species are recorded at multiple sampling units across a region. This particular area has received relatively little attention in the literature.
In the first chapter, the problem of estimating the number of species is formulated in a broad context, one that occurs in several seemingly unrelated fields of study. Estimators are commonly developed from statistical sampling models. Depending on the organisms or objects under study, different sampling techniques are used, and consequently, a variety of statistical models have been developed for this problem. A review of existing estimation methods, categorized by the associated sampling model, is presented in the second chapter.
The third chapter develops a new negative binomial mixture model. The negative binomial model is employed to account for the common tendency of individuals of a particular species to occur in clusters. An exponential mixing distribution permits inference on the number of species that exist in the region but were in fact absent from the sampling units. Adopting a classical approach to statistical inference, we develop the maximum likelihood estimator and a corresponding profile log-likelihood interval estimate of species richness. In addition, a Gaussian confidence interval based on large-sample theory is presented.
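A much-simplified, hedged sketch of the estimation idea (assuming, for illustration, Poisson counts with an exponential mixing distribution rather than the thesis's negative binomial formulation, and treating the mixing parameter as known): the probability that a species goes undetected in all sampling units determines how many unseen species the observed richness implies.
```python
import numpy as np
from scipy.special import gammaln

def p_undetected(beta, T):
    """P(a species is missed in all T units) when its per-unit counts are Poisson
    with a species-level mean drawn from an Exponential distribution with mean beta."""
    return 1.0 / (1.0 + T * beta)

def loglik_N(N, s_obs, beta, T):
    """Binomial log-likelihood of observing s_obs of N species, each detected
    independently with probability 1 - p_undetected."""
    p0 = p_undetected(beta, T)
    return (gammaln(N + 1) - gammaln(N - s_obs + 1) - gammaln(s_obs + 1)
            + (N - s_obs) * np.log(p0) + s_obs * np.log1p(-p0))

# Hypothetical survey: 120 species observed across T = 30 sampling units;
# beta is treated as known here (in practice it is estimated from the counts).
s_obs, T, beta = 120, 30, 0.05
Ns = np.arange(s_obs, 400)
ll = np.array([loglik_N(N, s_obs, beta, T) for N in Ns])
print("richness estimate:", Ns[np.argmax(ll)])
```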
The fourth chapter further extends the hierarchical model developed in Chapter 3 into a Bayesian framework. The motivation for the Bayesian paradigm is explained, and a hierarchical model based on random effects and discrete latent variables is presented. Computing the posterior distribution in this case is not straightforward. A data augmentation technique that indirectly places priors on species richness is employed to fit the model using a Metropolis-Hastings algorithm.
The fifth chapter examines the performance of our new methodology. Simulation studies are used to examine the mean-squared error of our proposed estimators. Comparisons to several commonly-used non-parametric estimators are made. Several conclusions emerge, and settings where our approaches can yield superior performance are clarified.
In the sixth chapter, we present a case study. The methodology is applied to a real data set of oribatid mites (a taxonomic order of micro-arthropods) collected from multiple sites in a tropical rainforest in Panama. We adjust our statistical sampling models to account for the varying masses of material sampled from the sites. The resulting estimates of species richness for the oribatid mites are useful, and contribute to a wider investigation, currently underway, examining the species richness of all arthropods in the rainforest.
Our approaches are the only existing methods that can make full use of the abundance-based data from multiple sampling units located in a single region. The seventh and final chapter concludes the thesis with a discussion of key considerations related to implementation and modeling assumptions, and describes potential avenues for further investigation.
|
56 |
New tools for unsupervised learning. Xiao, Ying, 12 January 2015.
In an unsupervised learning problem, one is given an unlabelled dataset and hopes to find some hidden structure; the prototypical example is clustering similar data. Such problems often arise in machine learning and statistics, but also in signal processing, theoretical computer science, and any number of quantitative scientific fields. The distinguishing feature of unsupervised learning is that there are no privileged variables or labels which are particularly informative, and thus the greatest challenge is often to differentiate between what is relevant or irrelevant in any particular dataset or problem.
In the course of this thesis, we study a number of problems which span the breadth of unsupervised learning. We make progress in Gaussian mixtures, independent component analysis (where we solve the open problem of underdetermined ICA), and we formulate and solve a feature selection/dimension reduction model. Throughout, our goal is to give finite sample complexity bounds for our algorithms -- these are essentially the strongest type of quantitative bound that one can prove for such algorithms. Some of our algorithmic techniques turn out to be very efficient in practice as well.
Our major technical tool is tensor spectral decomposition: tensors are generalisations of matrices, and often allow access to the "fine structure" of data. Thus, they are often the right tools for unravelling the hidden structure in an unsupervised learning setting. However, naive generalisations of matrix algorithms to tensors run into NP-hardness results almost immediately, and thus to solve our problems, we are obliged to develop two new tensor decompositions (with robust analyses) from scratch. Both of these decompositions are polynomial time, and can be viewed as efficient generalisations of PCA extended to tensors.
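A hedged sketch of the basic idea behind spectral tensor decomposition (a textbook simultaneous-diagonalisation argument on a noiseless synthetic tensor, not the two new decompositions developed in the thesis): a third-order tensor whose components are linearly independent can be recovered from eigendecompositions of two random contractions.
```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 3
A = rng.standard_normal((d, k))            # ground-truth components (columns a_1..a_k)
w = np.array([1.0, 2.0, 3.0])              # component weights

# Third-order moment-like tensor: T = sum_i w_i * a_i (x) a_i (x) a_i
T = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

# Contract T along its third mode with two random vectors.
x, y = rng.standard_normal(d), rng.standard_normal(d)
Tx = np.einsum('abc,c->ab', T, x)          # equals A diag(w_i a_i.x) A^T
Ty = np.einsum('abc,c->ab', T, y)

# Jennrich-style step: the top eigenvectors of Tx @ pinv(Ty) recover the
# components (up to sign and scale) when the a_i are linearly independent.
M = Tx @ np.linalg.pinv(Ty)
eigvals, eigvecs = np.linalg.eig(M)
top = np.argsort(-np.abs(eigvals))[:k]
est = np.real(eigvecs[:, top])

# Compare estimated and true components via absolute cosines.
A_unit = A / np.linalg.norm(A, axis=0)
est_unit = est / np.linalg.norm(est, axis=0)
print(np.round(np.abs(A_unit.T @ est_unit), 2))   # one entry near 1 per row and column
```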
|
57 |
Cough Detection and Forecasting for Radiation Treatment of Lung Cancer. Qiu, Zigang Jimmy, 06 April 2010.
In radiation therapy, a treatment plan is designed to make the delivery of radiation to a target more accurate, effective, and less damaging to surrounding healthy tissues. In lung sites, the tumor is affected by the patient’s respiratory motion. Despite tumor motion, current practice still uses a static delivery plan. Unexpected changes due to coughs and sneezes are not taken into account and as a result, the tumor is not treated accurately and healthy tissues are damaged.
In this thesis we detail a framework for using an accelerometer device to detect and forecast coughs. The accelerometer measurements are modeled as an ARMA process to make forecasts. We draw from studies of cough physiology and use the amplitudes and durations of the forecasted breathing cycles as features to estimate the parameters of Gaussian mixture models for the cough and normal breathing classes. The system was tested on 10 volunteers, where each data set consisted of one 3-5 minute accelerometer measurement to train the system and two 1-3 minute measurements for testing.
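A hedged sketch of the pipeline described above, using simulated accelerometer-like data, a generic ARIMA fit in place of the thesis's ARMA modelling, and hypothetical (amplitude, duration) features for the Gaussian mixture classes:
```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Simulated respiratory accelerometer trace: slow oscillation plus noise.
t = np.arange(600)
signal = np.sin(2 * np.pi * t / 40) + 0.1 * rng.standard_normal(t.size)

# Fit an ARMA-type model and forecast the next breathing cycle.
model = ARIMA(signal, order=(4, 0, 2)).fit()
forecast = model.forecast(steps=40)

# Hypothetical (amplitude, duration) features for breathing cycles:
# normal breaths here are small and long, coughs large and short.
normal = np.column_stack([rng.normal(1.0, 0.1, 200), rng.normal(3.0, 0.3, 200)])
cough = np.column_stack([rng.normal(4.0, 0.5, 50), rng.normal(0.8, 0.2, 50)])

gmm_normal = GaussianMixture(n_components=2, random_state=0).fit(normal)
gmm_cough = GaussianMixture(n_components=2, random_state=0).fit(cough)

# Classify the forecasted cycle by comparing class log-likelihoods.
feat = np.array([[forecast.max() - forecast.min(), 4.0]])  # amplitude, duration (s, hypothetical)
label = "cough" if gmm_cough.score(feat) > gmm_normal.score(feat) else "normal"
print(label)
```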
|
59 |
Forecasting seat sales in passenger airlines: introducing the round-trip model. Varedi, Mehrdad, 07 January 2010.
This thesis aims to improve sales forecasting in the context of passenger airlines. We study two important issues that could potentially improve forecasting accuracy: using the day-to-day price change rather than the price itself, and linking flights that passengers are likely to consider as pairs for a round trip; we refer to the latter as the Round-Trip Model (RTM). We find that price change is a significant variable throughout the last three weeks before departure, regardless of the number of days remaining to the flight, which opens the possibility of planning revenue-maximizing price change patterns. We also find that the RTM can improve the precision of the forecasting models and provide an improved pricing strategy for planners.
In the study of the effect of price change on sales, analysis of variance is applied; finite regression mixture models are tested to identify linked traffic in the two directions, i.e., linked flights on a route in reverse directions; and an adaptive neuro-fuzzy inference system (ANFIS) is applied to develop comparative models for studying the sales effect of price versus price change and of the one-way versus round-trip models. The price change model demonstrated more robust results with comparable estimation errors, and the concept model for the round trip with only one linked flight reduced estimation error by 5%. This empirical study is performed on a database of 22,900 flights obtained from a major North American passenger airline.
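A hedged, much-simplified sketch of the comparison (synthetic booking data and an ordinary linear regression standing in for the ANFIS and mixture models used in the thesis), contrasting a one-way feature set with a round-trip feature set that adds the linked return flight's price change:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500

# Hypothetical daily booking records for an outbound flight and its paired return.
days_to_departure = rng.integers(1, 22, n)        # last three weeks before departure
price_change = rng.normal(0, 10, n)               # day-to-day fare change, not the fare itself
return_price_change = rng.normal(0, 10, n)        # fare change on the linked return flight
sales = (5 - 0.08 * price_change - 0.04 * return_price_change
         + 0.1 * (21 - days_to_departure) + rng.normal(0, 1, n))

one_way = np.column_stack([price_change, days_to_departure])
round_trip = np.column_stack([price_change, days_to_departure, return_price_change])

for name, X in [("one-way model", one_way), ("round-trip model", round_trip)]:
    print(name, "R^2:", round(LinearRegression().fit(X, sales).score(X, sales), 3))
```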
|
60 |
Towards Finding Optimal Mixture of Subspaces for Data Classification. Musa, Mohamed Elhafiz Mustafa, 01 October 2003.
In pattern recognition, when data has different structures in different parts of the input space, fitting one global model can be slow and inaccurate. Learning methods can quickly learn the structure of the data in local regions, consequently offering faster and more accurate model fitting. Breaking the training data set into smaller subsets may lead to a curse-of-dimensionality problem, as a training sample subset may not be enough for estimating the required set of parameters for the submodels. Increasing the size of the training data may not be possible in many situations. Interestingly, the data in local regions becomes more correlated. Therefore, by decorrelation methods we can reduce data dimensions and hence the number of parameters. In other words, we can find uncorrelated low-dimensional subspaces that capture most of the data variability. The current subspace modelling methods have shown better performance than global modelling methods for this type of training data structure. Nevertheless, these methods still need more research work, as they suffer from two limitations:
- There is no standard method to specify the optimal number of subspaces.
- There is no standard method to specify the optimal dimensionality for each subspace.
In the current models these two parameters are determined beforehand. In this dissertation we propose and test algorithms that try to find a suboptimal number of principal subspaces and a suboptimal dimensionality for each principal subspace automatically.
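A hedged sketch of the mixture-of-subspaces idea (a simple local-PCA stand-in on synthetic data: cluster first, then choose each subspace's dimensionality from its explained variance, rather than the algorithms proposed in the dissertation):
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def local_pca(X, n_subspaces, var_explained=0.95):
    """Partition the data, then fit one PCA per local region and pick the smallest
    dimensionality whose cumulative explained variance reaches var_explained."""
    labels = KMeans(n_clusters=n_subspaces, n_init=10, random_state=0).fit_predict(X)
    subspaces = []
    for c in range(n_subspaces):
        Xc = X[labels == c]
        pca = PCA().fit(Xc)
        dim = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), var_explained) + 1)
        subspaces.append((dim, pca.components_[:dim]))
    return labels, subspaces

rng = np.random.default_rng(3)
# Two hypothetical local structures of different intrinsic dimension, embedded in 10 dimensions.
X1 = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 10))
X2 = 5 + rng.standard_normal((200, 3)) @ rng.standard_normal((3, 10))
labels, subspaces = local_pca(np.vstack([X1, X2]), n_subspaces=2)
print([dim for dim, _ in subspaces])   # local dimensionalities chosen automatically
```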
|