  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
61

Statistical analysis of high dimensional data

Ruan, Lingyan 05 November 2010
This century is surely the century of data (Donoho, 2000). Data analysis has been an emerging activity over the last few decades. High dimensional data in particular is increasingly pervasive with the advance of massive data collection systems, such as microarrays, satellite imagery, and financial data. However, analysis of high dimensional data is challenging because of the so-called curse of dimensionality (Bellman, 1961). This dissertation presents several methodologies for high dimensional data analysis. The first part discusses a joint analysis of multiple microarray gene expression data sets. Microarray analysis dates back to Golub et al. (1999) and has drawn much attention since. One common goal of microarray analysis is to determine which genes are differentially expressed, that is, which genes behave significantly differently between groups of individuals. However, in microarray analysis there are thousands of genes but few arrays (samples, individuals), so reproducibility remains relatively low. It is natural to consider joint analyses that combine microarrays from different experiments effectively in order to achieve improved accuracy. In particular, we present a model-based approach for better identification of differentially expressed genes by incorporating data from different studies. The model can seamlessly accommodate a wide range of studies, including those performed on different platforms and/or under different but overlapping biological conditions. Model-based inferences can be done in an empirical Bayes fashion. Because of the information sharing among studies, the joint analysis dramatically improves on inferences based on individual analyses. Simulation studies and real data examples are presented to demonstrate the effectiveness of the proposed approach under a variety of complications that often arise in practice. The second part is about covariance matrix estimation in high dimensional data. 
First, we propose a penalized likelihood estimator for the high dimensional t-distribution. The Student t-distribution is of increasing interest in mathematical finance, education and many other applications. However, the application of the t-distribution is limited by the difficulty of estimating the covariance matrix for high dimensional data. We show that by imposing a LASSO penalty on the Cholesky factors of the covariance matrix, an EM algorithm can efficiently compute the estimator, and that it performs much better than other popular estimators. Secondly, we propose an estimator for high dimensional Gaussian mixture models. Finite Gaussian mixture models are widely used in statistics thanks to their great flexibility. However, parameter estimation for Gaussian mixture models with high dimensionality can be rather challenging because of the huge number of parameters that need to be estimated. We propose a penalized likelihood estimator to specifically address such difficulties. The LASSO penalty we impose on the inverse covariance matrices encourages sparsity in their entries and therefore helps reduce the dimensionality of the problem. We show that the proposed estimator can be efficiently computed via an Expectation-Maximization algorithm. To illustrate the practical merits of the proposed method, we consider its application in model-based clustering and mixture discriminant analysis. Numerical experiments with both simulated and real data show that the new method is a valuable tool in handling high dimensional data. Finally, we present structured estimators for high dimensional Gaussian mixture models. The graphical representation of every cluster in Gaussian mixture models may have the same or similar structure, which is an important feature in many applications, such as image processing, speech recognition and gene network analysis. Failure to consider the shared structure would degrade the estimation accuracy. 
To address such issues, we propose two structured estimators, a hierarchical Lasso estimator and a group Lasso estimator. An EM algorithm can be applied to conveniently solve the estimation problem. We show that when clusters share similar structures, the proposed estimators perform much better than the separate Lasso estimator.
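The penalized likelihood idea for Gaussian mixtures described in this abstract can be written compactly. What follows is a sketch under stated assumptions, not the dissertation's exact formulation: the symbols $\pi_k$ (mixing weights), $\mu_k$ (component means), $\Omega_k$ (inverse covariance matrices), $\phi$ (multivariate normal density) and $\lambda$ (penalty level) are notation chosen here, and penalizing every entry of $\Omega_k$, diagonal included, is likewise an assumption.

```latex
\hat{\Theta}
  = \arg\max_{\Theta}\;
    \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k\,
      \phi\!\left(x_i;\ \mu_k,\ \Omega_k^{-1}\right)
    \;-\; \lambda \sum_{k=1}^{K} \lVert \Omega_k \rVert_1 ,
\qquad
\lVert \Omega_k \rVert_1 = \sum_{j,l} \lvert \Omega_{k,jl} \rvert ,
\qquad
\Theta = \{\pi_k, \mu_k, \Omega_k\}_{k=1}^{K}
```

With $\lambda = 0$ this reduces to the ordinary mixture log-likelihood; larger values of $\lambda$ push more entries of each $\Omega_k$ to exactly zero, which is the sparsity the abstract refers to.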
62

Model Likelihoods and Bayes Factors for Switching and Mixture Models

Frühwirth-Schnatter, Sylvia January 2000
In the present paper we explore various approaches to computing model likelihoods from the MCMC output for mixture and switching models, among them the candidate's formula, importance sampling, reciprocal importance sampling and bridge sampling. We demonstrate that the candidate's formula is sensitive to label switching. It turns out that the best method to estimate the model likelihood is the bridge sampling technique, where the MCMC sample is combined with an iid sample from an importance density. The importance density is constructed in an unsupervised manner from the MCMC output using a mixture of complete data posteriors. Whereas the importance sampling estimator as well as the reciprocal importance sampling estimator are sensitive to the tail behaviour of the importance density, we demonstrate that the bridge sampling estimator is far more robust in this respect. Our case studies range from selecting the number of classes in a mixture of multivariate normal distributions, through testing for the inhomogeneity of a discrete time Poisson process, to testing for the presence of Markov switching and order selection in the MSAR model. (author's abstract) / Series: Forschungsberichte / Institut für Statistik
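To make the bridge sampling idea concrete, here is a minimal sketch on a toy problem. It uses the iterative estimator with the optimal bridge function of Meng and Wong, not the paper's mixture-specific construction. The target is an unnormalized standard normal whose true normalizing constant is sqrt(2*pi); exact normal draws stand in for MCMC output, and the importance density is a wider normal. All function and variable names are hypothetical.

```python
import math
import random


def bridge_sampling(q, g_pdf, post_draws, prop_draws, iters=50):
    """Iteratively estimate the normalizing constant of the unnormalized density q.

    post_draws: draws from the normalized target (stand-in for MCMC output).
    prop_draws: iid draws from the importance density with pdf g_pdf.
    """
    n1, n2 = len(post_draws), len(prop_draws)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    l1 = [q(x) / g_pdf(x) for x in post_draws]  # ratios at target draws
    l2 = [q(x) / g_pdf(x) for x in prop_draws]  # ratios at importance draws
    r = 1.0  # starting value for the fixed-point iteration
    for _ in range(iters):
        num = sum(l / (s1 * l + s2 * r) for l in l2) / n2
        den = sum(1.0 / (s1 * l + s2 * r) for l in l1) / n1
        r = num / den
    return r


def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def unnormalized_target(x):
    return math.exp(-0.5 * x * x)  # standard normal without its constant


rng = random.Random(0)
post_draws = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prop_draws = [rng.gauss(0.0, 1.5) for _ in range(5000)]
estimate = bridge_sampling(
    unnormalized_target, lambda x: normal_pdf(x, 1.5), post_draws, prop_draws
)
print(estimate, math.sqrt(2.0 * math.pi))
```

Because both sets of draws enter the estimator, a mismatch in the tails of the importance density matters less than it does for plain importance sampling, which is the robustness property the abstract highlights.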
63

Nonlinear orbit uncertainty prediction and rectification for space situational awareness

DeMars, Kyle Jordan 07 February 2011
A new method for predicting the uncertainty in a nonlinear dynamical system is developed and analyzed in the context of uncertainty evolution for resident space objects (RSOs) in the near-geosynchronous orbit regime under the influence of central body gravitational acceleration, third body perturbations, and attitude-dependent solar radiation pressure (SRP) accelerations and torques. The new method, termed the splitting Gaussian mixture unscented Kalman filter (SGMUKF), exploits properties of the differential entropy or Renyi entropy for a linearized dynamical system to determine when a higher-order prediction of uncertainty reaches a level of disagreement with a first-order prediction, and then applies a multivariate Gaussian splitting algorithm to reduce the impact of induced nonlinearity. In order to address the relative accuracy of the new method with respect to the more traditional approaches of the extended Kalman filter (EKF) and unscented Kalman filter (UKF), several concepts regarding the comparison of probability density functions (pdfs) are introduced and utilized in the analysis. The research also describes high-fidelity modeling of the nonlinear dynamical system which drives the motion of an RSO, and includes models for evaluation of the central body gravitational acceleration, the gravitational acceleration due to other celestial bodies, and attitude-dependent SRP accelerations and torques when employing a macro plate model of an RSO. Furthermore, a high-fidelity model of the measurement of the line-of-sight of a spacecraft from a ground station is presented, which applies light-time and stellar aberration corrections, and accounts for observer and target lighting conditions, as well as for the sensor field of view. The developed algorithms are applied to the problem of forward predicting the time evolution of the region of uncertainty for RSO tracking, and uncertainty rectification via the fusion of incoming measurement data with prior knowledge. 
It is demonstrated that the SGMUKF method is significantly better able to forward predict the region of uncertainty and is subsequently better able to utilize new measurement data. / text
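The entropy-based test for "when to split" can be illustrated in one dimension. This is a heavily simplified sketch and not the SGMUKF itself: it propagates a scalar Gaussian through a function once by linearization and once with an unscented transform, then compares the differential entropies of the two Gaussian approximations. The function names, the choice kappa = 2 and the threshold are all assumptions made for illustration.

```python
import math


def gaussian_entropy(var):
    """Differential entropy of a scalar Gaussian with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * var)


def linearized_variance(df, mean, var):
    """First-order (EKF-style) variance after mapping through f with derivative df."""
    return df(mean) ** 2 * var


def unscented_variance(f, mean, var, kappa=2.0):
    """Variance after mapping through f using a 3-point unscented transform."""
    n = 1  # state dimension
    spread = math.sqrt((n + kappa) * var)
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    values = [f(p) for p in points]
    m = sum(w * v for w, v in zip(weights, values))
    return sum(w * (v - m) ** 2 for w, v in zip(weights, values))


def entropy_gap(f, df, mean, var):
    """Absolute disagreement between unscented and linearized entropy."""
    return abs(
        gaussian_entropy(unscented_variance(f, mean, var))
        - gaussian_entropy(linearized_variance(df, mean, var))
    )


# A linear map shows no disagreement; a quadratic map does.
gap_linear = entropy_gap(lambda x: 2.0 * x + 1.0, lambda x: 2.0, 1.0, 0.25)
gap_quadratic = entropy_gap(lambda x: x * x, lambda x: 2.0 * x, 1.0, 0.25)

SPLIT_THRESHOLD = 0.01  # hypothetical tolerance
should_split = gap_quadratic > SPLIT_THRESHOLD
print(gap_linear, gap_quadratic, should_split)
```

In a filter built on this idea, a gap above the tolerance would trigger replacing the single Gaussian with several narrower components, each of which is then closer to the linear regime.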
64

Prediction with Mixture Models

Haider, Peter January 2013
Learning a model for the relationship between the attributes and the annotated labels of data examples serves two purposes. Firstly, it enables the prediction of the label for examples without annotation. Secondly, the parameters of the model can provide useful insights into the structure of the data. If the data has an inherent partitioned structure, it is natural to mirror this structure in the model. Such mixture models predict by combining the individual predictions generated by the mixture components which correspond to the partitions in the data. Often the partitioned structure is latent, and has to be inferred when learning the mixture model. Directly evaluating the accuracy of the inferred partition structure is, in many cases, impossible because the ground truth cannot be obtained for comparison. However it can be assessed indirectly by measuring the prediction accuracy of the mixture model that arises from it. This thesis addresses the interplay between the improvement of predictive accuracy by uncovering latent cluster structure in data, and further addresses the validation of the estimated structure by measuring the accuracy of the resulting predictive model. In the application of filtering unsolicited emails, the emails in the training set are latently clustered into advertisement campaigns. Uncovering this latent structure allows filtering of future emails with very low false positive rates. In order to model the cluster structure, a Bayesian clustering model for dependent binary features is developed in this thesis. Knowing the clustering of emails into campaigns can also aid in uncovering which emails have been sent on behalf of the same network of captured hosts, so-called botnets. This association of emails to networks is another layer of latent clustering. Uncovering this latent structure allows service providers to further increase the accuracy of email filtering and to effectively defend against distributed denial-of-service attacks. 
To this end, a discriminative clustering model is derived in this thesis that is based on the graph of observed emails. The partitionings inferred using this model are evaluated through their capacity to predict the campaigns of new emails. Furthermore, when classifying the content of emails, statistical information about the sending server can be valuable. Learning a model that is able to make use of it requires training data that includes server statistics. In order to also use training data where the server statistics are missing, a model that is a mixture over potentially all substitutions thereof is developed. Another application is to predict the navigation behavior of the users of a website. Here, there is no a priori partitioning of the users into clusters, but to understand different usage scenarios and design different layouts for them, imposing a partitioning is necessary. The presented approach simultaneously optimizes the discriminative as well as the predictive power of the clusters. Each model is evaluated on real-world data and compared to baseline methods. The results show that explicitly modeling the assumptions about the latent cluster structure leads to improved predictions compared to the baselines. It is beneficial to incorporate a small number of hyperparameters that can be tuned to yield the best predictions in cases where the prediction accuracy cannot be optimized directly. / Learning a model of the relationship between the input attributes and the annotated target attributes of data instances serves two purposes. On the one hand, it enables prediction of the target attribute for instances without annotation. On the other hand, the parameters of the model can provide useful insights into the structure of the data. If the data have an inherent partition structure, it is natural to mirror this structure in the model. 
Such mixture models generate predictions by combining the individual predictions of the mixture components, which correspond to the partitions of the data. Often the partition structure is latent and has to be inferred along with the mixture model during learning. Directly evaluating the accuracy of the inferred partition structure is in many cases impossible because no ground-truth reference data are available for comparison. It can, however, be assessed indirectly by measuring the predictive accuracy of the mixture model based on it. This thesis deals with the interplay between improving predictive accuracy by uncovering latent partitionings in data and assessing the estimated structure by measuring the accuracy of the resulting predictive model. In the application of filtering unsolicited emails, the emails in the training set are latently partitioned into advertising campaigns. Uncovering this latent structure allows future emails to be filtered with very low false positive rates. In this thesis, a Bayesian partitioning model is developed to model this partitioning structure. Knowledge of the partitioning of emails into campaigns also helps to determine which emails were sent at the instigation of the same network of infiltrated computers, so-called botnets. This is a further layer of latent partitioning. Uncovering this latent structure makes it possible to increase the accuracy of email filters and to defend effectively against distributed denial-of-service attacks. To this end, a discriminative partitioning model based on the graph of observed emails is derived in this thesis. The partitionings inferred with this model are evaluated via their performance in predicting the campaigns of new emails. 
Furthermore, statistical information about the sending server can be valuable when classifying the content of an email. Learning a model that can use this information requires training data that contain server statistics. In order to additionally use training data in which the server statistics are missing, a model is developed that is a mixture over potentially all substitutions thereof. Another application is predicting the navigation behavior of the users of a website. Here there is no a priori partitioning of the users. However, it is necessary to create a partitioning in order to understand different usage scenarios and design different layouts for them. The presented approach simultaneously optimizes the model's ability both to determine the best partition and to generate predictions about behavior by means of this partition. Each model is evaluated on real data and compared with reference methods. The results show that explicitly modeling the assumptions about the latent partitioning structure leads to improved predictions. In cases where predictive accuracy cannot be optimized directly, adding a small number of higher-level, directly adjustable parameters proves useful.
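The way a mixture model combines component predictions can be shown in a few lines. This is a generic sketch of the rule p(y | x) = sum over k of p(k | x) * p(y | x, k), not the thesis's specific email or web-navigation models. The weights, labels and probabilities below are invented for illustration.

```python
def mixture_predict(component_weights, component_predictions):
    """Combine per-component label probabilities into one predictive distribution.

    component_weights: p(k | x) for each component k (normalized here if needed).
    component_predictions: one dict per component mapping label -> p(y | x, k).
    """
    total = sum(component_weights)
    labels = component_predictions[0].keys()
    return {
        y: sum(
            (w / total) * pred[y]
            for w, pred in zip(component_weights, component_predictions)
        )
        for y in labels
    }


# Two hypothetical latent clusters (e.g. campaigns) with different label profiles.
weights = [0.7, 0.3]
predictions = [
    {"spam": 0.9, "ham": 0.1},
    {"spam": 0.2, "ham": 0.8},
]
combined = mixture_predict(weights, predictions)
print(combined)
```

Since the component weights depend on the inferred latent structure, the accuracy of the combined prediction gives an indirect check on that structure, which is the validation strategy the abstract describes.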
65

Model-based Learning: t-Families, Variable Selection, and Parameter Estimation

Andrews, Jeffrey Lambert 27 August 2012
The phrase model-based learning describes the use of mixture models in machine learning problems. This thesis focuses on a number of issues surrounding the use of mixture models in statistical learning tasks, including clustering, classification, discriminant analysis, variable selection, and parameter estimation. After motivating the importance of statistical learning via mixture models, five papers are presented. For ease of consumption, the papers are organized into three parts: mixtures of multivariate t-families, variable selection, and parameter estimation. / Natural Sciences and Engineering Research Council of Canada through a doctoral postgraduate scholarship.
66

Econometric Models of Crop Yields: Two Essays

Tolhurst, Tor 17 May 2013
This thesis is an investigation of econometric crop yield models divided into two essays. In the first essay, I propose estimating a single heteroscedasticity coefficient for all counties within a crop-reporting district by pooling county-level crop yield data in a two-stage estimation process. In the context of crop insurance, where heteroscedasticity has significant economic implications, I demonstrate that the pooling approach provides economically and statistically significant improvements in rating crop insurance contracts over contemporary methods. In the second essay, I propose a new method for measuring the rate of technological change in crop yields. To date the agricultural economics literature has measured technological change exclusively at the mean; in contrast, the proposed model can measure the rate of technological change in endogenously defined yield subpopulations. I find evidence of different rates of technological change in yield subpopulations, which leads to interesting questions about the effect of technological change on agricultural production. / Ontario Ministry of Agriculture and Food
67

The Global Epidemic of Childhood Obesity and Its Non-medical Costs

Fu, Qiang January 2015
This dissertation consists of three parts of empirical analyses investigating temporal patterns and consequences of (childhood) overweight and obesity, mainly in the United States and the People's Republic of China. Based on the China Health and Nutrition Survey, the first part conducts hierarchical age-period-cohort analyses of childhood overweight in China and finds a strong cohort effect driving the overweight epidemic. Results from the growth-curve models show that childhood overweight and underweight are related such that certain socio-economic groups with higher levels of childhood overweight also exhibit lower levels of childhood underweight. The second part situates the discussion on childhood obesity in a broader context. It compares temporal patterns of childhood overweight in China with those of adulthood overweight and finds that the salient cohort component is absent in rising adulthood overweight, which is dominated by strong period effects. A positive association between human development index and overweight/obesity prevalence across countries is also documented. Using multiple waves of survey data from the National Longitudinal Study of Adolescent Health, the third part analyzes the (latent) trajectory of childhood overweight/obesity in the United States. It finds that individuals with obesity growth trajectories are less likely to avoid mental depression, tend to have higher levels of neuroticism and lower levels of agreeableness/conscientiousness, and show less delinquent behaviors. / Dissertation
68

MCMC Estimation of Classical and Dynamic Switching and Mixture Models

Frühwirth-Schnatter, Sylvia January 1998
In the present paper we discuss Bayesian estimation of a very general model class where the distribution of the observations is assumed to depend on a latent mixture or switching variable taking values in a discrete state space. This model class covers e.g. finite mixture modelling, Markov switching autoregressive modelling and dynamic linear models with switching. Joint Bayesian estimation of all latent variables, model parameters and parameters determining the probability law of the switching variable is carried out by a new Markov Chain Monte Carlo method called permutation sampling. Estimation of switching and mixture models is known to face identifiability problems, as switching and mixture models are identifiable only up to permutations of the indices of the states. For a Bayesian analysis the posterior has to be constrained in such a way that identifiability constraints are fulfilled. The permutation sampler is designed to sample efficiently from the constrained posterior, by first sampling from the unconstrained posterior - which often can be done in a convenient multimove manner - and then by applying a suitable permutation, if the identifiability constraint is violated. We present simple conditions on the prior which ensure that this method is a valid Markov Chain Monte Carlo method (that is, invariance, irreducibility and aperiodicity hold). Three case studies are presented, including finite mixture modelling of fetal lamb data, Markov switching autoregressive modelling of the U.S. quarterly real GDP data, and modelling the U.S./U.K. real exchange rate by a dynamic linear model with Markov switching heteroscedasticity. (author's abstract) / Series: Forschungsberichte / Institut für Statistik
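The relabeling step of a permutation sampler can be sketched as follows. The example assumes the identifiability constraint is an ordering of the component means (mu_1 < mu_2 < ...); the paper allows other constraints, and the function name and data layout here are hypothetical. Given one draw from the unconstrained posterior, the states are permuted so that the constraint holds, and the allocations are relabeled consistently.

```python
def enforce_ordering(weights, means, allocations):
    """Permute state labels of one posterior draw so that means are increasing.

    weights, means: per-state parameters of the draw.
    allocations: state index assigned to each observation.
    Returns the permuted (weights, means, allocations).
    """
    # order[new] = old label that should take position `new`
    order = sorted(range(len(means)), key=lambda k: means[k])
    new_label = {old: new for new, old in enumerate(order)}
    return (
        [weights[k] for k in order],
        [means[k] for k in order],
        [new_label[z] for z in allocations],
    )


# One draw in which the labels violate the ordering constraint.
draw_weights = [0.6, 0.4]
draw_means = [3.0, -1.0]
draw_allocations = [0, 1, 1, 0]

w, m, z = enforce_ordering(draw_weights, draw_means, draw_allocations)
print(w, m, z)
```

A draw that already satisfies the constraint passes through unchanged, so the step only acts when the constraint is violated, as in the abstract's description.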
69

Session Clustering Using Mixtures of Proportional Hazards Models

Mair, Patrick, Hudec, Marcus January 2008
Starting from classical Weibull mixture models, we propose a framework for clustering survival data with various proportionality restrictions imposed. By introducing mixtures of Weibull proportional hazards models on a multivariate data set, a parametric cluster approach based on the EM algorithm is carried out. The problem of non-response in the data is considered. The application example is a real-life data set stemming from the analysis of a world-wide operating eCommerce application. Sessions are clustered according to the dwell times a user spends on certain page areas. The solution allows for the interpretation of the navigation behavior in terms of survival and hazard functions. A software implementation by means of an R package is provided. (author's abstract) / Series: Research Report Series / Department of Statistics and Mathematics
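The building blocks of a Weibull mixture for dwell times can be sketched briefly. This is an illustrative sketch for a single dwell time, not the R package mentioned in the abstract: it defines the Weibull survival and hazard functions and the E-step responsibilities of an EM algorithm. The parameter values and names are assumptions; giving both components the same shape parameter is one simple way to obtain proportional hazards between them.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_density(t, shape, scale):
    """f(t) = h(t) * S(t)."""
    return weibull_hazard(t, shape, scale) * weibull_survival(t, shape, scale)


def responsibilities(t, weights, shapes, scales):
    """E-step: posterior probability that dwell time t came from each component."""
    joint = [
        w * weibull_density(t, a, b) for w, a, b in zip(weights, shapes, scales)
    ]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical clusters: short visits (scale 5 s) and long visits (scale 60 s).
mix_weights = [0.5, 0.5]
mix_shapes = [1.0, 1.0]  # equal shapes give a constant hazard ratio
mix_scales = [5.0, 60.0]

resp = responsibilities(30.0, mix_weights, mix_shapes, mix_scales)
hazard_ratio = weibull_hazard(10.0, 1.0, 5.0) / weibull_hazard(10.0, 1.0, 60.0)
print(resp, hazard_ratio)
```

A 30-second dwell time is assigned almost entirely to the long-visit component, and the hazard ratio between the two components is the same at every time point because the shapes are equal.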
70

Model Likelihoods and Bayes Factors for Switching and Mixture Models

Frühwirth-Schnatter, Sylvia January 2002
In the present paper we discuss the problem of estimating model likelihoods from the MCMC output for a general mixture and switching model. Estimation is based on the method of bridge sampling (Meng and Wong, 1996), where the MCMC sample is combined with an iid sample from an importance density. The importance density is constructed in an unsupervised manner from the MCMC output using a mixture of complete data posteriors. Whereas the importance sampling estimator as well as the reciprocal importance sampling estimator are sensitive to the tail behaviour of the importance density, we demonstrate that the bridge sampling estimator is far more robust in this concern. Our case studies range from computing marginal likelihoods for a mixture of multivariate normal distributions, testing for the inhomogeneity of a discrete time Poisson process, to testing for the presence of Markov switching and order selection in the MSAR model. (author's abstract) / Series: Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
