21 |
Modelos bayesianos semi-paramétricos para dados binários / Bayesian semi-parametric models for binary data
Márcio Augusto Diniz 11 June 2015 (has links)
This work proposes semi-parametric Bayesian models for binary data. The first model is a scale mixture that handles discrepancies related to the kurtosis of the logistic model. It extends the proposal of Basu and Mukhopadhyay (2000) by allowing the prior distribution of the parameters to be interpreted through odds ratios. The second model combines the scale mixture with the transformation proposed by Yeo and Johnson (2000), so that kurtosis as well as skewness can be adjusted and an informative skewness parameter estimated. This transformation handles negative values far better than the Box and Cox (1964) transformation used by Guerrero and Johnson (1982), and it is simpler than the model proposed by Stukel (1988). Finally, the third model is the most general of the three: a location-scale mixture that can describe kurtosis, skewness and also bimodality. The model proposed by Newton et al. (1996), although quite general, does not give researchers in applied areas a tangible interpretation of the prior distribution. The models are evaluated through the Cramér-von Mises, Kolmogorov-Smirnov and Anderson-Darling probability distance measures and through the Conditional Predictive Ordinates.
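As a minimal illustration of the scale-mixture idea (not the thesis's specific model), the sketch below builds a Student-t link as a gamma scale mixture of probit links: the degrees of freedom control tail heaviness, i.e. kurtosis, relative to the logit. The function names and the choice nu = 8 are illustrative assumptions.

```python
import math
import random

def probit_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def t_link_cdf(x, nu=8.0, draws=20000, seed=1):
    # Monte Carlo estimate of a Student-t link as a scale mixture of
    # probits: F(x) = E[Phi(x * sqrt(lam))], lam ~ Gamma(nu/2, rate nu/2).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        lam = rng.gammavariate(nu / 2.0, 2.0 / nu)  # (shape, scale)
        total += probit_cdf(x * math.sqrt(lam))
    return total / draws

# Both links are symmetric around 0; the mixture changes tail behaviour.
for x in (0.0, 1.0, 3.0):
    print(x, round(logistic_cdf(x), 3), round(t_link_cdf(x), 3))
```

Varying nu in the mixing distribution is what lets the fitted link adapt its kurtosis to the data instead of fixing it, as the logit does.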
|
22 |
Models for fitting correlated non-identical bernoulli random variables with applications to an airline data problem
Perez Romo Leroux, Andres January 2021 (has links)
Our research deals with the problem of devising models for fitting non-identical dependent Bernoulli variables and using these models to predict future Bernoulli trials. We focus on modelling and predicting random Bernoulli response variables that meet all of the following conditions:
1. Each observed as well as future response corresponds to a Bernoulli trial
2. The trials are non-identical, having possibly different probabilities of occurrence
3. The trials are mutually correlated, with an underlying complex trial cluster correlation structure that also allows trials within clusters to be partitioned into groups; within-cluster, group-level correlation is reflected in the correlation structure.
4. The probability of occurrence and the correlation structure for both observed and future trials can depend on a set of observed covariates.
A number of proposed approaches meeting some of the above conditions are present in the current literature. Our research expands on existing statistical and machine learning methods.
We propose three extensions to existing models that make use of the above conditions. Each proposed method brings specific advantages for dealing with correlated binary data. The proposed models allow within-cluster trial grouping to be reflected in the correlation structure. We partition sets of trials into groups that are either explicitly estimated or implicitly inferred: explicit groups arise from the determination of common covariates, while inferred groups arise from imposing mixture models. The main motivation of our research is to model, and further understand the potential of introducing, binary trial group-level correlations. In a number of applications, models that allow for these types of trial groupings can be beneficial, both for improved predictions and for a better understanding of the behavior of trials.
The first model extension builds on the Multivariate Probit model. This model makes use of covariates and other information from former trials to determine explicit trial groupings and predict the occurrence of future trials. We call this the Explicit Groups model.
The second model extension uses mixtures of univariate Probit models. This model predicts the occurrence of current trials using estimators of parameters supporting mixture models for the observed trials. We call this the Inferred Groups model.
Our third method extends a gradient-descent-based boosting algorithm, called WL2Boost, which allows for correlation of binary outcomes. We refer to our extension of this algorithm as GWL2Boost.
Bernoulli trials are divided into observed and future trials, with all trials having associated known covariate information. We apply our methodology to the problem of predicting the set and total number of passengers who will not show up on commercial flights, using covariate information and past passenger data.
The models and algorithms are evaluated with regard to their capacity to predict future Bernoulli responses. We compare the proposed models against a set of competing existing models and algorithms using available airline passenger no-show data. We show that our proposed algorithm extension GWL2Boost outperforms top existing algorithms and models that assume independence of binary outcomes on various prediction metrics. / Statistics
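A hedged sketch of the kind of data these models target: correlated, non-identical Bernoulli trials generated from a latent Gaussian with a shared within-group factor, a standard multivariate-probit-style construction rather than the thesis's estimators. All names and parameter values below are illustrative assumptions.

```python
import math
import random

def phi(x):
    # Standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0):
    # Inverse normal CDF by bisection (sufficient precision for a sketch)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate_groups(n_groups, p_list, rho, rng):
    # Trial i in group g: Z = sqrt(rho)*U_g + sqrt(1-rho)*e_i, Y = 1 iff
    # Z < Phi^{-1}(p_i). Marginally P(Y=1) = p_i (non-identical trials),
    # but trials sharing the group factor U_g are positively correlated.
    thresholds = [phi_inv(p) for p in p_list]
    ys = []
    for _ in range(n_groups):
        u = rng.gauss(0, 1)
        row = []
        for thr in thresholds:
            z = math.sqrt(rho) * u + math.sqrt(1 - rho) * rng.gauss(0, 1)
            row.append(1 if z < thr else 0)
        ys.append(row)
    return ys

rng = random.Random(3)
data = simulate_groups(4000, [0.3, 0.7], rho=0.6, rng=rng)
m0 = sum(r[0] for r in data) / len(data)
m1 = sum(r[1] for r in data) / len(data)
both = sum(r[0] * r[1] for r in data) / len(data)
# Under positive rho, 'both' exceeds the independence product m0*m1
print(m0, m1, both, m0 * m1)
```

The gap between the joint frequency and the independence product is exactly the group-level dependence that independence-assuming models ignore.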
|
23 |
Testing for spatial correlation and semiparametric spatial modeling of binary outcomes with application to aberrant crypt foci in colon carcinogenesis experiments
Apanasovich, Tatiyana Vladimirovna 01 November 2005 (has links)
In an experiment to understand colon carcinogenesis, all animals were exposed to a carcinogen while half the animals were also exposed to radiation. Spatially, we measured the existence of aberrant crypt foci (ACF), namely morphologically changed colonic crypts that are known to be precursors of colon cancer development. The biological question of interest is whether the locations of these ACFs are spatially correlated: if so, this indicates that damage to the colon due to carcinogens and radiation is localized. Statistically, the data take the form of binary outcomes (corresponding to the existence of an ACF) on a regular grid. We develop score-type methods based upon the Matérn and conditionally autoregressive (CAR) correlation models to test for spatial correlation in such data, while allowing for nonstationarity. Because of a technical peculiarity of the score-type test, we also develop robust versions of the method. The methods are compared to a generalization of Moran's test for continuous outcomes, and are shown via simulation to have the potential for increased power. When applied to our data, the methods indicate the existence of spatial correlation, and hence indicate localization of damage. Assuming that there are correlations in the locations of the ACF, the questions are how great these correlations are, and whether the correlation structures differ when an animal is exposed to radiation. To understand the extent of the correlation, we cast the problem as a spatial binary regression, where binary responses arise from an underlying Gaussian latent process. We model these marginal probabilities of ACF semiparametrically, using fixed-knot penalized regression splines and single-index models. We fit the models using pairwise pseudolikelihood methods. Assuming that the underlying latent process is strongly mixing, known to be the case for many Gaussian processes, we prove asymptotic normality of the methods.
The penalized regression splines have penalty parameters that must converge to zero asymptotically: we derive rates for these parameters that do and do not lead to an asymptotic bias, and we derive the optimal rate of convergence for them. Finally, we apply the methods to the data from our experiment.
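The score tests themselves are beyond the scope of an abstract, but a permutation version of a Moran-style statistic conveys the basic idea of testing for spatial correlation in binary grid data. This is a generic sketch with invented helper names, not the Matérn/CAR score test developed in the thesis.

```python
import random

def moran_like_stat(grid):
    # Average product of mean-centered neighboring cells on a regular
    # grid (rook adjacency); positive values suggest spatial clustering.
    n_rows, n_cols = len(grid), len(grid[0])
    mean = sum(sum(row) for row in grid) / (n_rows * n_cols)
    num, pairs = 0.0, 0
    for i in range(n_rows):
        for j in range(n_cols):
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < n_rows and nj < n_cols:
                    num += (grid[i][j] - mean) * (grid[ni][nj] - mean)
                    pairs += 1
    return num / pairs

def permutation_pvalue(grid, n_perm=500, seed=11):
    # Monte Carlo null: shuffling cell labels destroys any spatial
    # arrangement while preserving the overall prevalence of 1s.
    rng = random.Random(seed)
    observed = moran_like_stat(grid)
    flat = [v for row in grid for v in row]
    n_cols = len(grid[0])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(flat)
        shuffled = [flat[k:k + n_cols] for k in range(0, len(flat), n_cols)]
        if moran_like_stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# A grid with a contiguous patch of 1s should yield a small p-value,
# mirroring the localized-damage conclusion described above.
clustered = [[1 if (i < 4 and j < 4) else 0 for j in range(10)]
             for i in range(10)]
print(permutation_pvalue(clustered))
```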
|
25 |
Modelo de regressão para dados binários com mistura de funções de ligação / Regression model with mixture of link functions for binary data
Eugenio, Nicholas Wagner 08 February 2017 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / A regression model for binary data with a mixture of four link functions (logit, probit, complementary log-log and Stukel) is presented; these functions are particular cases of the model. The frequentist estimation procedure is described and, through simulation studies, the proposed link function is shown to outperform other models in the estimation of proportions, while for predictions all models perform equally. Its flexibility in acting as either a symmetric or an asymmetric link function is corroborated by the results of analyses of three real data sets, as well as by the simulations. Furthermore, a case is shown in which the mixture assigns total weight to a single link function, because mixing in the other functions cannot improve the results.
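A convex mixture of standard links can be sketched directly. This toy version uses invented names, and for brevity replaces the two-parameter Stukel link with the three closed-form components; it shows how each component is recovered by putting all the mixture weight on it.

```python
import math

def logit_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def probit_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cloglog_cdf(x):
    # Complementary log-log link: asymmetric around p = 0.5
    return 1.0 - math.exp(-math.exp(x))

def mixture_link(x, w):
    # Success probability under a convex mixture of link functions;
    # w = (w_logit, w_probit, w_cloglog), nonnegative and summing to 1.
    w_logit, w_probit, w_cloglog = w
    return (w_logit * logit_cdf(x)
            + w_probit * probit_cdf(x)
            + w_cloglog * cloglog_cdf(x))

# Equal weights interpolate between symmetric and asymmetric shapes
for x in (-2.0, 0.0, 2.0):
    print(x, round(mixture_link(x, (1 / 3, 1 / 3, 1 / 3)), 4))
```

In the fitted model the weights are estimated from the data, which is why the mixture can collapse onto a single component when no combination does better.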
|
26 |
Analytics tool for radar data / Analysverktyg för radardata
Naumanen, Hampus, Malmgård, Torsten, Waade, Eystein January 2018 (has links)
Analytics tool for radar data was a project that started when radar specialists at Saab needed to modernize the tools they use to analyze binary-encoded radar data. Today, the analysis is carried out with inadequate and ineffective applications that were not designed for the purpose, which makes it tedious and more difficult than it would be with an appropriate interface. The applications were also significantly restricted by their limited support for different radar systems. The solution was to design new software that imports, translates and visualizes the data independently of the radar system. The software consists of several parts that communicate with each other to translate a binary file. A binary file is a series of bytes containing the information of the targets, with markers separating the revolutions of the radar. The byte stream is split according to the ASTERIX protocol, which defines the length of each Data Item, and the extracted positional values are stored in arrays. The code then converts the positional values to Cartesian coordinates and plots them on the screen. The software implements features such as play, pause, reverse and a plotting history, which allow the user to analyze the data in a simple and user-friendly manner. There are also numerous ways the software could be extended: the code is structured so that new features providing additional analytical abilities can be implemented without affecting the components already in place.
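The real ASTERIX encoding uses variable-length Data Items, so the sketch below substitutes a simplified fixed-length toy record purely to illustrate the parse-then-convert pipeline the abstract describes (byte stream, split into records, polar values converted to Cartesian coordinates). The record layout and all names are invented, not ASTERIX.

```python
import math
import struct

# Toy record layout (NOT the real ASTERIX encoding): 12 bytes each,
# big-endian: uint16 category, uint16 length, float32 range (m),
# float32 azimuth (rad).
RECORD = struct.Struct(">HHff")

def parse_plots(buf):
    # Walk the byte stream record by record and convert each polar
    # measurement to Cartesian plotting coordinates.
    plots = []
    for off in range(0, len(buf), RECORD.size):
        cat, length, rng, az = RECORD.unpack_from(buf, off)
        x = rng * math.cos(az)
        y = rng * math.sin(az)
        plots.append((cat, x, y))
    return plots

# Pack two sample plots and parse them back
raw = (RECORD.pack(48, 12, 1000.0, 0.0)
       + RECORD.pack(48, 12, 500.0, math.pi / 2))
plots = parse_plots(raw)
for cat, x, y in plots:
    print(cat, round(x, 1), round(y, 1))
```

A real implementation would dispatch on the category and length fields per the ASTERIX specification instead of assuming a fixed record size.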
|
27 |
Klasifikace elektronických dokumentů s využitím shlukové analýzy / Classification of electronic documents using cluster analysis
Ševčík, Radim January 2009 (has links)
The current age is characterised by unprecedented information growth, both in volume and in complexity. Most information is available in digital form, so it can be studied with cluster analysis. We classified the documents of the 20 Newsgroups collection in terms of their content only. The aim was to assess the available clustering methods across a variety of applications. After transforming the documents into a binary vector representation, we performed several experiments in the CLUTO application and measured entropy, purity and execution time. For a small number of clusters the direct method (essentially hierarchical) offered the best results, but for larger numbers repeated bisection (a divisive method) was best. The agglomerative method proved unsuitable. Using simulation, we estimated the optimal number of clusters to be 10. For this solution we described in detail the features of each cluster obtained with the repeated bisection method and the i2 criterion function. Future work should focus on implementing binary clustering in a programming language such as Perl or C++. The results of this work may interest web search engine developers and electronic catalogue administrators.
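Entropy and purity, the two quality measures used above, can be computed directly from the class-label counts within each cluster. A minimal sketch with hypothetical data (labels echo two newsgroup name fragments):

```python
import math
from collections import Counter

def purity(clusters):
    # clusters: list of lists of true class labels, one list per cluster.
    # Purity: fraction of documents matching their cluster's majority class.
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def entropy(clusters):
    # Size-weighted average of per-cluster label entropies (lower is better).
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        counts = Counter(c)
        hc = -sum((n / len(c)) * math.log2(n / len(c))
                  for n in counts.values())
        h += (len(c) / total) * hc
    return h

good = [["sci", "sci", "sci"], ["rec", "rec"]]   # clusters match classes
mixed = [["sci", "rec", "sci"], ["rec", "sci"]]  # classes are mixed up
print(purity(good), entropy(good))
print(purity(mixed), entropy(mixed))
```

A perfect clustering has purity 1 and entropy 0; mixing classes within clusters lowers purity and raises entropy, which is the trade-off the experiments measured.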
|
28 |
Synantropní květena vesnic na gradientu nadmořské výšky v jižní části Čech / Synanthropic flora of villages on altitudinal gradient in southern part of the Czech Republic
JENČOVÁ, Dana January 2011 (has links)
The study is a floristic survey of 131 villages in the southern part of South Bohemia. In total, 27,773 floristic records were collected, documenting the occurrence of 585 taxa of wild vascular plants; 548 taxa were used in the statistical analyses. Environmental factors with a potential effect on the composition and diversity of village flora were recorded in the field or extracted from various sources. The relations between diversity (number of species) and environmental factors were studied, and species composition was compared with these variables using multivariate statistical methods.
|
29 |
Statistical inference for joint modelling of longitudinal and survival data
Li, Qiuju January 2014 (has links)
In longitudinal studies, data collected within a subject or cluster are by their very nature somewhat correlated, and special care is needed to account for such correlation in the analysis. Under the framework of longitudinal studies, three topics are discussed in this thesis. In chapter 2, the joint modelling of a multivariate longitudinal process consisting of different types of outcomes is discussed. In the large cohort study of the UK North Staffordshire osteoarthritis project, longitudinal trivariate outcomes of continuous, binary and ordinal data are observed at baseline, year 3 and year 6. Instead of analysing each process separately, joint modelling is proposed for the trivariate outcomes to account for their inherent association by introducing random effects and the covariance matrix G. The influence of the covariance matrix G on statistical inference about the fixed-effects parameters has been investigated within the Bayesian framework. The study shows that jointly modelling the multivariate longitudinal process reduces bias and provides more reliable results than modelling each process separately. Together with longitudinal measurements taken intermittently, a counting process of events in time is often observed as well during a longitudinal study. It is of interest to investigate the relationship between time to event and the longitudinal process; on the other hand, measurements of the longitudinal process may be truncated by terminating events, such as death. Thus, it may be crucial to jointly model the survival and longitudinal data. It is popular to propose a linear mixed-effects model for a longitudinal process of continuous outcomes and a Cox regression model for the survival data to characterize the relationship between time to event and the longitudinal process, under some standard assumptions.
In chapter 3, we investigate the influence on statistical inference for the survival data when the assumption of mutual independence of the random errors in the linear mixed-effects model has been violated. The study uses the conditional score estimation approach, which provides robust estimators and is computationally advantageous. A generalised sufficient statistic of the random effects is proposed to account for the correlation remaining among the random errors, characterized by the data-driven method of modified Cholesky decomposition. The simulation study shows that this provides nearly unbiased estimation and efficient statistical inference. In chapter 4, we account for both the current and the past information of the longitudinal process in the survival component of the joint model. Over the last 15 to 20 years it has been popular, even standard, to assume that the longitudinal process affects the counting process of events in time only through its current value; as recognised in more recent studies, however, this need not always hold. An integral over the trajectory of the longitudinal process, weighted by a curve, is proposed to account for both current and past information, improving inference and reducing the underestimation of the effects of the longitudinal process on the hazard. A plausible approach to statistical inference for the proposed models is presented, along with a real data analysis and a simulation study.
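The weighted-trajectory idea described for chapter 4 can be illustrated numerically: the hazard is driven by a decaying-weight integral over the longitudinal trajectory rather than by its current value alone. The trajectory, the weight function and every parameter value below are invented for illustration, not taken from the thesis.

```python
import math

def trajectory(s):
    # Example longitudinal mean trajectory m(s): linear growth over time
    return 0.5 + 0.1 * s

def weight(u, decay=0.3):
    # Exponentially decaying weight: recent history counts more
    return decay * math.exp(-decay * u)

def weighted_history(t, n=200):
    # Numerically integrate w(t - s) * m(s) over [0, t] (trapezoid rule)
    if t == 0:
        return 0.0
    h = t / n
    vals = [weight(t - s) * trajectory(s)
            for s in (i * h for i in range(n + 1))]
    return h * (vals[0] / 2 + sum(vals[1:-1]) + vals[-1] / 2)

def hazard(t, h0=0.01, alpha=0.8):
    # Hazard driven by the weighted history of the trajectory,
    # not just its current value m(t)
    return h0 * math.exp(alpha * weighted_history(t))

for t in (0.0, 2.0, 5.0):
    print(t, round(hazard(t), 5))
```

Under a current-value-only association the hazard would track m(t) directly; here earlier trajectory values still contribute through the decaying weight.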
|
30 |
Hard and fuzzy block clustering algorithms for high dimensional data / Algorithmes de block-clustering dur et flou pour les données en grande dimension
Laclau, Charlotte 14 April 2016 (has links)
With the increasing amount of data available, unsupervised learning has become an important tool for discovering underlying patterns without having to label instances manually. Among the approaches proposed to tackle this problem, clustering is arguably the most popular. Clustering is usually based on the assumption that each group, also called a cluster, is distributed around a center defined in terms of all the features, but in some real-world applications dealing with high-dimensional data this assumption may be false. To this end, co-clustering algorithms were proposed: they describe clusters of instances by the subsets of features most relevant to them. The resulting latent structure of the data is composed of blocks usually called co-clusters. In the first two chapters, we describe two co-clustering methods that differentiate the relevance of features with respect to their capacity to reveal the latent structure of the data, in a probabilistic and in a distance-based framework. The probabilistic approach uses the mixture-model framework, where the irrelevant features are assumed to follow a probability distribution that is independent of the co-clustering structure. The distance-based (also called metric-based) approach relies on an adaptive metric in which each variable is assigned a weight defining its contribution to the resulting co-clustering. From the theoretical point of view, we show the global convergence of the proposed algorithms using Zangwill's convergence theorem.
In the last two chapters, we consider a special case of co-clustering in which, contrary to the original setting, each subset of instances is described by a unique subset of features; reorganizing the original matrix according to the partitions obtained under this assumption reveals a structure of homogeneous diagonal blocks. As in the first two contributions, we consider both probabilistic and metric-based approaches. The main idea of the proposed methods is to impose two kinds of constraints: (1) we set the number of row clusters equal to the number of column clusters; (2) we seek a structure of the original data matrix that has maximal values on its diagonal (for instance, for binary data, we look for diagonal blocks composed mostly of ones, with zeros outside the main diagonal).
The proposed approaches enjoy the convergence guarantees derived from the results of the previous chapters. Finally, we present both hard and fuzzy versions of the proposed algorithms. We evaluate our contributions on a wide variety of synthetic and real-world benchmark binary and continuous data sets related to text-mining applications, and analyze the advantages and drawbacks of each approach. To conclude, we believe that this thesis explicitly covers a vast majority of the possible scenarios arising in hard and fuzzy co-clustering, and can be seen as a generalization of some popular biclustering approaches.
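Constraint (2), maximal values on the diagonal blocks, can be scored directly for binary data. The sketch below uses a hypothetical helper, not the thesis's algorithms: it compares the density of ones inside diagonal co-clusters (row cluster k paired with column cluster k) against the density outside them.

```python
def diagonal_block_score(matrix, row_labels, col_labels):
    # Density of 1s inside diagonal co-clusters minus the density of 1s
    # outside them; a score near 1 indicates the diagonal structure the
    # constrained co-clustering seeks for binary data.
    in_ones = in_cells = out_ones = out_cells = 0
    for i, row in enumerate(matrix):
        for j, v in enumerate(row):
            if row_labels[i] == col_labels[j]:
                in_cells += 1
                in_ones += v
            else:
                out_cells += 1
                out_ones += v
    return in_ones / in_cells - out_ones / out_cells

# A perfect two-block diagonal binary matrix
m = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
perfect = diagonal_block_score(m, [0, 0, 1, 1], [0, 0, 1, 1])
scrambled = diagonal_block_score(m, [0, 1, 0, 1], [0, 0, 1, 1])
print(perfect, scrambled)
```

An algorithm enforcing the diagonal constraint can be seen as searching over row and column partitions to maximize a criterion of this flavor.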
|