Global ETD Search

11	GENERATIVE MODELS WITH MARGINAL CONSTRAINTS Bingjing Tang (16380291) 16 June 2023 Generative models are powerful tools for learning data distributions and simulating new samples. Recent years have seen significant advances in the flexibility and applicability of such models, with Bayesian approaches such as nonparametric Bayesian models and deep neural network models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) finding use in a wide range of domains. However, the black-box nature of these models means that they are often hard to interpret, and their modeling implications are often inconsistent with side knowledge drawn from the application domain. This thesis studies situations where the modeler has side knowledge represented as probability distributions on functionals of the objects being modeled, and develops methods to incorporate this kind of side knowledge into flexible generative models. The dissertation has three main parts. The first part focuses on incorporating a special case of this side knowledge into flexible nonparametric Bayesian models. Practitioners often have additional distributional information about a subset of the coordinates of the observations being modeled. The flexibility of nonparametric Bayesian models usually implies incompatibility with this side information, which makes it necessary to develop methods that incorporate the side knowledge into such models. We design a specialized generative process to build in this side knowledge and propose a novel sigmoid Gaussian process conditional model. We also develop a corresponding posterior sampling method based on data augmentation to overcome a doubly intractable problem. We illustrate the efficacy of our proposed constrained nonparametric Bayesian model in a variety of real-world scenarios, including modeling environmental and earthquake data. The second part of the dissertation discusses neural network approaches to satisfying the general form of side knowledge, and the generative models considered in this part broaden to black-box models. We formulate the side knowledge incorporation problem as a constrained divergence minimization problem and propose two scalable neural network approaches as its solution. We demonstrate their practicality using various synthetic and real examples. The third part of the dissertation concentrates on a specific generative model of individual pixels of fMRI data constructed from a latent group image. Usually there is two-fold side knowledge about the latent group image: spatial structure and partial activation zones. The former can be captured by modeling the prior for the group image with Markov random fields. The latter, which is often obtained from previous related studies, is left for future research. We propose a novel Bayesian model with Markov random fields and aim to obtain the maximum a posteriori estimate of the group image. We also derive a variational Bayes algorithm to overcome local optima in the optimization. Computational statistics Statistical data science Knowledge Constraints Nonparametric Bayesian Black-box Neural Networks Conditional Density Estimation Density Ratio Estimation Sigmoid Gaussian Processes
12	Advanced Nonparametric Bayesian Functional Modeling Gao, Wenyu 04 September 2020 Functional analyses have gained more interest as massive data sets have become easier to access. However, such data sets often exhibit substantial heterogeneity, noise, and high dimensionality. When generalizing the analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality and functional clustering to group similar observations and thus reduce the sample size. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model, or developed from a more generic one by changing the prior distributions. Hence, this dissertation focuses on the development of Bayesian approaches for functional analyses because of their flexibility. A nonparametric Bayesian approach, such as the Dirichlet process mixture (DPM) model, has a nonparametric distribution as the prior. This approach provides flexibility and reduces assumptions, especially for functional clustering, because the DPM model has an automatic clustering property, so the number of clusters does not need to be specified in advance. Furthermore, a weighted Dirichlet process mixture (WDPM) model accommodates more heterogeneity in the data by assuming more than one unknown prior distribution. It also gathers more information from the data by introducing a weight function that assigns different candidate priors, such that less similar observations are more separated. Thus, the WDPM model will improve the clustering and model estimation results. In this dissertation, we used an advanced nonparametric Bayesian approach to study functional variable selection and functional clustering methods.
We proposed 1) a stochastic search functional selection method with application to 1-M matched case-crossover studies for aseptic meningitis, to examine the time-varying unknown relationship and identify important covariates affecting disease contraction; 2) a functional clustering method via the WDPM model, with application to three pathways related to genetic diabetes data, to identify essential genes distinguishing between normal and disease groups; and 3) a combined approach to functional clustering (with the WDPM model) and variable selection, with application to high-frequency spectral data, to select wavelengths associated with breast cancer racial disparities. / Doctor of Philosophy / As massive data sets have become easier to access, functional analyses have gained more interest as a way to analyze data on curves, surfaces, or other objects varying over a continuum. However, such data sets often contain substantial heterogeneity and noise. When generalizing the analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality and functional clustering to group similar observations and thus reduce the sample size. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model due to its flexibility. Hence, this dissertation focuses on the development of nonparametric Bayesian approaches for functional analyses. Our proposed methods can be applied in various settings: epidemiological studies of aseptic meningitis with clustered binary data, genetic diabetes data, and breast cancer racial disparities. Breast Cancer Racial Disparities Dirichlet Process Mixture (DPM) Functional Clustering Functional Selection Genetic Type II Diabetes Matched Case-Crossover Study Nonparametric Bayesian Modeling
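The automatic clustering property described in this abstract can be illustrated with the Chinese restaurant process, the partition distribution induced by a Dirichlet process. The following is a minimal sketch and not the dissertation's WDPM method; the function name `crp_assignments` and the concentration value used in the usage lines are assumptions chosen for illustration.

```python
import random


def crp_assignments(n, alpha, seed=0):
    """Draw cluster labels for n observations from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (hypothetical value chosen by the caller).
    Returns the label of each observation and the size of each cluster.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of observations currently in cluster k
    for _ in range(n):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n=200, alpha=1.0, seed=1)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of clusters is never passed in: it emerges from the draws, and larger values of `alpha` tend to produce more clusters.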
13	Statistical methods for variant discovery and functional genomic analysis using next-generation sequencing data Tang, Man 03 January 2020 The development of high-throughput next-generation sequencing (NGS) techniques produces massive amounts of data, allowing the identification of biomarkers in early disease diagnosis and driving the transformation of most disciplines in biology and medicine. Greater effort is needed to develop novel, powerful, and efficient tools for NGS data analysis. This dissertation focuses on modeling "omics" data in various NGS applications with a primary goal of developing novel statistical methods to identify sequence variants, find transcription factor (TF) binding patterns, and decode the relationship between TF and gene expression levels. Accurate and reliable identification of sequence variants, including single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs), plays a fundamental role in NGS applications. Existing methods for calling these variants often make the simplifying assumption of positional independence and fail to leverage the dependence of genotypes at nearby loci induced by linkage disequilibrium. We propose vi-HMM, a hidden Markov model (HMM)-based method for calling SNPs and INDELs in mapped short read data. Simulation experiments show that, under various sequencing depths, vi-HMM outperforms existing methods in terms of sensitivity and F1 score. When applied to human whole-genome sequencing data, vi-HMM demonstrates higher accuracy in calling SNPs and INDELs. One important NGS application is chromatin immunoprecipitation followed by sequencing (ChIP-seq), which characterizes protein-DNA relations through genome-wide mapping of TF binding sites. Multiple TFs, binding to DNA sequences, often show complex binding patterns, which indicate how TFs with similar functionalities work together to regulate the expression of target genes.
To help uncover the transcriptional regulation mechanism, we propose a novel nonparametric Bayesian method to detect the clustering pattern of multiple-TF bindings from ChIP-seq datasets. A simulation study demonstrates that our method performs best with regard to precision, recall, and F1 score, in comparison to traditional methods. We also apply the method to real data and observe several TF clusters that have been recognized previously in mouse embryonic stem cells. Recent advances in ChIP-seq and RNA sequencing (RNA-Seq) technologies provide more reliable and accurate characterization of TF binding sites and gene expression measurements, which serve as a basis for studying the regulatory functions of TFs on gene expression. We propose a log Gaussian Cox process with a wavelet-based functional model to quantify the relationship between TF binding site locations and gene expression levels. Through the simulation study, we demonstrate that our method performs well, especially with large sample size and small variance. It also shows a remarkable ability to distinguish real local features in the function estimates. / Doctor of Philosophy / The development of high-throughput next-generation sequencing (NGS) techniques produces massive amounts of data and drives innovation in biology and medicine. Greater effort is needed to develop novel, powerful, and efficient tools for NGS data analysis. In this dissertation, we mainly focus on three problems closely related to NGS and its applications: (1) how to improve variant calling accuracy, (2) how to model transcription factor (TF) binding patterns, and (3) how to quantify the contribution of TF binding to gene expression. We develop novel statistical methods to identify sequence variants, find TF binding patterns, and explore the relationship between TF binding and gene expression.
We expect our findings will be helpful in promoting a better understanding of disease causality and facilitating the design of personalized treatments. next-generation sequencing hidden Markov model variant calling transcription factor nonparametric Bayesian log Gaussian Cox process Dirichlet process mixture gene expression wavelet-based functional model
14	Semiparametric Bayesian Approach using Weighted Dirichlet Process Mixture For Finance Statistical Models Sun, Peng 07 March 2016 The Dirichlet process mixture (DPM) has been widely used as a flexible prior in the nonparametric Bayesian literature, and the weighted Dirichlet process mixture (WDPM) can be viewed as an extension of the DPM that relaxes model distribution assumptions. However, the WDPM requires weight functions to be specified and can add computational burden. In this dissertation, we develop more efficient and flexible WDPM approaches under three research topics. The first is semiparametric cubic spline regression, where we adopt a nonparametric prior for the error terms in order to automatically handle heterogeneity of measurement errors or an unknown mixture distribution. The second provides an innovative way to construct the weight function and illustrates some desirable properties and the computational efficiency of this weight under a semiparametric stochastic volatility (SV) model. The last develops a WDPM approach for the Generalized AutoRegressive Conditional Heteroskedasticity (GARCH) model (as an alternative to the SV model) and proposes a new model evaluation approach for GARCH that produces easier-to-interpret results than the canonical marginal likelihood approach. In the first topic, the response variable is modeled as the sum of three parts. One part is a linear function of covariates that enter the model parametrically. The second part is an additive nonparametric model: covariates whose relationships to the response variable are unclear are included in the model nonparametrically using Lancaster and Šalkauskas bases. The third part is the error terms, whose means and variances are assumed to follow nonparametric priors. We therefore call our model dual-semiparametric regression, because nonparametric ideas are used for both the mean part and the error terms.
Instead of assuming that all of the error terms follow the same prior, as in the DPM, our WDPM provides multiple candidate priors for each observation to select from with a certain probability. This probability (or weight) is modeled by relevant predictive covariates using a Gaussian kernel. We propose several WDPMs with different weights that depend on distance in the covariates. We provide efficient Markov chain Monte Carlo (MCMC) algorithms and compare our WDPMs to a parametric model and the DPM model in terms of the Bayes factor, using simulation and empirical studies. In the second topic, we propose an innovative way to construct the weight function for the WDPM and apply it to the SV model. The SV model is used for time series data where the constant variance assumption is violated. One essential issue is specifying the distribution of the conditional return. We assume a WDPM prior for the conditional return and propose a new way to model the weights. Our approach has several advantages, including computational efficiency, over the weight constructed using a Gaussian kernel. We list six properties of the proposed weight function and prove them. Because of the additional Metropolis-Hastings steps introduced by the WDPM prior, we find conditions that ensure the uniform geometric ergodicity of the transition kernel in our MCMC. Because asset price data contain zero values, our SV model is semiparametric: we employ a WDPM prior for non-zero values and a parametric prior for zero values. In the third topic, we develop a WDPM approach for GARCH-type models and compare different types of weight functions, including the innovative method proposed in the second topic. The GARCH model can be viewed as an alternative to SV for analyzing daily stock price data where the constant variance assumption does not hold. While the response variable of our SV models is the transformed log return (based on the log-square transformation), GARCH directly models the log return itself.
This means that, theoretically speaking, we are able to predict stock returns using GARCH models, whereas this is not feasible with the SV model, because SV models ignore the sign of log returns and provide predictive densities for squared log returns only. Motivated by this property, we propose a new model evaluation approach called back testing return (BTR), designed specifically for GARCH. The BTR approach produces model evaluation results that are easier to interpret than marginal likelihood, and it makes it straightforward to draw conclusions about model profitability. Since the BTR approach is only applicable to GARCH, we also illustrate how to properly calculate marginal likelihood to compare GARCH and SV. Based on our MCMC algorithms and model evaluation approaches, we have conducted a large number of model fittings to compare models in both simulation and empirical studies. / Ph. D. Additive Model Bayes factor Cubic Splines Dual-Semiparametric Regression Generalized Polya urn Geometric ergodicity Gibbs sampling Metropolis-Hastings Nonparametric Bayesian Model Ordinal data Parameterization Semiparametric Regr
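The contrast drawn in this abstract, that GARCH models the log return itself while the log-square transformation used for SV discards its sign, can be seen in a small simulation. The sketch below is a plain GARCH(1,1) recursion with Gaussian innovations, which is an assumption made for illustration and not the WDPM-based model of the dissertation; the function names, parameter values and the small `offset` added before taking logs are likewise hypothetical.

```python
import math
import random


def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate n log returns from a GARCH(1,1) model with Gaussian innovations.

    Conditional variance: var_t = omega + alpha * r_{t-1}^2 + beta * var_{t-1}.
    """
    if not (omega > 0 and alpha >= 0 and beta >= 0 and alpha + beta < 1):
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    returns = []
    for _ in range(n):
        r = math.sqrt(var) * rng.gauss(0.0, 1.0)
        returns.append(r)
        var = omega + alpha * r * r + beta * var
    return returns


def log_square(returns, offset=1e-8):
    """Log-square transformation; the offset guards against log(0) for zero returns."""
    return [math.log(r * r + offset) for r in returns]


if __name__ == "__main__":
    r = simulate_garch11(n=1000, omega=0.05, alpha=0.10, beta=0.85, seed=3)
    flipped = [-x for x in r]
    # The transformed series is identical whether returns are positive or negative.
    print(log_square(r) == log_square(flipped))
```

Because the transformed series is unchanged when every return is negated, a model fitted to it cannot recover the direction of a return, which is the limitation the abstract attributes to the SV formulation.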
15	Efficient Bayesian methods for mixture models with genetic applications / Métodos Bayesianos eficientes para modelos de mistura com aplicações em genética Zuanetti, Daiane Aparecida 14 December 2016 We propose Bayesian methods for selecting and estimating different types of mixture models which are widely used in Genetics and Molecular Biology. We specifically propose data-driven selection and estimation methods for a generalized mixture model, which accommodates the usual (independent) and the first-order (dependent) models in one framework, and QTL (quantitative trait locus) mapping models for independent and pedigree data. For clustering genes through a mixture model, we propose three nonparametric Bayesian methods: a marginal nested Dirichlet process (NDP), which is able to cluster distributions; a predictive recursion clustering scheme (PRC); and a subset nonparametric Bayesian (SNOB) clustering algorithm for clustering big data. We analyze and compare the performance of the proposed methods and traditional procedures of selection, estimation and clustering in simulated and real datasets. The proposed methods are more flexible, improve the convergence of the algorithms and provide more accurate estimates in many situations. In addition, we propose methods for estimating non-observable QTL genotypes and missing parents, and improve the Mendelian probability of inheritance of nonfounder genotypes using conditional independence structures. We also suggest applying diagnostic measures to check the goodness of fit of QTL mapping models.
16	On New Constructive Tools in Bayesian Nonparametric Inference Al Labadi, Luai 22 June 2012 Bayesian nonparametric inference requires the construction of priors on infinite dimensional spaces such as the space of cumulative distribution functions and the space of cumulative hazard functions. Well-known priors on the space of cumulative distribution functions are the Dirichlet process, the two-parameter Poisson-Dirichlet process and the beta-Stacy process. On the other hand, the beta process is a popular prior on the space of cumulative hazard functions. This thesis is divided into three parts. In the first part, we tackle the problem of sampling from the above-mentioned processes. Sampling from these processes plays a crucial role in many applications in Bayesian nonparametric inference. However, having exact samples from these processes is impossible. The existing algorithms are either slow or very complex and may be difficult to apply for many users. We derive new approximation techniques for simulating the above processes. These new approximations provide simple, yet efficient, procedures for simulating these important processes. We compare the efficiency of the new approximations to several other well-known approximations and demonstrate a significant improvement. In the second part, we develop explicit expressions for calculating the Kolmogorov, Levy and Cramer-von Mises distances between the Dirichlet process and its base measure. The derived expressions of each distance are used to select the concentration parameter of a Dirichlet process. We also propose a Bayesian goodness of fit test for simple and composite hypotheses for non-censored and censored observations. Illustrative examples and simulation results are included. Finally, we describe the relationship between the frequentist and Bayesian nonparametric statistics. We show that, when the concentration parameter is large, the two-parameter Poisson-Dirichlet process and its corresponding quantile process share many asymptotic properties with the frequentist empirical process and the frequentist quantile process. Some of these properties are the functional central limit theorem, the strong law of large numbers and the Glivenko-Cantelli theorem. Dirichlet process Nonparametric Bayesian inference Ferguson and Klass Representation Brownian bridge Quantile process Weak convergence Simulation Gamma process Levy measure Stick-breaking representation Stable law process Two-parameter Poisson-Dirichlet process Beta-Stacy process Goodness of fit test Kolmogorov distance Wolpert and Ickstadt representation Cramer-von Mises distance Levy distance Beta process
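The stick-breaking representation listed among the keywords above gives a simple, approximate way to simulate a Dirichlet process. The sketch below is not one of the thesis's new approximations; it is the standard truncated stick-breaking construction, with a Uniform(0, 1) base measure chosen purely for illustration, followed by the Kolmogorov distance between the sampled distribution and that base measure. The function names, the truncation level and the choice of base measure are all assumptions made for this example.

```python
import random


def stick_breaking_dp(alpha, n_atoms, rng):
    """Truncated stick-breaking draw from a Dirichlet process DP(alpha, Uniform(0, 1)).

    Returns atom locations and their weights; the weights sum to one because the
    mass left over after n_atoms - 1 breaks is assigned to the last atom.
    """
    weights, remaining = [], 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, alpha)   # V_k ~ Beta(1, alpha)
        weights.append(remaining * v)     # break off a fraction v of the stick
        remaining *= 1.0 - v
    weights.append(remaining)
    atoms = [rng.random() for _ in range(n_atoms)]  # i.i.d. draws from the base measure
    return atoms, weights


def kolmogorov_distance_to_uniform(atoms, weights):
    """sup_x |F(x) - x| for the discrete CDF F defined by the atoms and weights."""
    cum, dist = 0.0, 0.0
    for x, w in sorted(zip(atoms, weights)):
        dist = max(dist, abs(cum - x))    # left limit of F at the jump
        cum += w
        dist = max(dist, abs(cum - x))    # value of F at the jump
    return dist


if __name__ == "__main__":
    rng = random.Random(0)
    for alpha in (1.0, 50.0):
        atoms, weights = stick_breaking_dp(alpha, n_atoms=2000, rng=rng)
        print(alpha, kolmogorov_distance_to_uniform(atoms, weights))
```

Averaged over repeated draws, the distance tends to shrink as `alpha` grows, reflecting the concentration of the process around its base measure. The truncation level has to grow with `alpha` for the leftover mass on the last atom to stay negligible.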
18	On New Constructive Tools in Bayesian Nonparametric Inference Al Labadi, Luai January 2012 Bayesian nonparametric inference requires the construction of priors on infinite dimensional spaces such as the space of cumulative distribution functions and the space of cumulative hazard functions. Well-known priors on the space of cumulative distribution functions are the Dirichlet process, the two-parameter Poisson-Dirichlet process and the beta-Stacy process. On the other hand, the beta process is a popular prior on the space of cumulative hazard functions. This thesis is divided into three parts. In the first part, we tackle the problem of sampling from the above-mentioned processes. Sampling from these processes plays a crucial role in many applications in Bayesian nonparametric inference. However, having exact samples from these processes is impossible. The existing algorithms are either slow or very complex and may be difficult to apply for many users. We derive new approximation techniques for simulating the above processes. These new approximations provide simple, yet efficient, procedures for simulating these important processes. We compare the efficiency of the new approximations to several other well-known approximations and demonstrate a significant improvement. In the second part, we develop explicit expressions for calculating the Kolmogorov, Levy and Cramer-von Mises distances between the Dirichlet process and its base measure. The derived expressions of each distance are used to select the concentration parameter of a Dirichlet process. We also propose a Bayesian goodness of fit test for simple and composite hypotheses for non-censored and censored observations. Illustrative examples and simulation results are included. Finally, we describe the relationship between the frequentist and Bayesian nonparametric statistics. We show that, when the concentration parameter is large, the two-parameter Poisson-Dirichlet process and its corresponding quantile process share many asymptotic properties with the frequentist empirical process and the frequentist quantile process. Some of these properties are the functional central limit theorem, the strong law of large numbers and the Glivenko-Cantelli theorem. Dirichlet process Nonparametric Bayesian inference Ferguson and Klass Representation Brownian bridge Quantile process Weak convergence Simulation Gamma process Levy measure Stick-breaking representation Stable law process Two-parameter Poisson-Dirichlet process Beta-Stacy process Goodness of fit test Kolmogorov distance Wolpert and Ickstadt representation Cramer-von Mises distance Levy distance Beta process
19	Analyse intégrative de données de grande dimension appliquée à la recherche vaccinale / Integrative analysis of high-dimensional data applied to vaccine research Hejblum, Boris 06 March 2015 Gene expression data is recognized as high-dimensional data that needs specific statistical tools for its analysis. But in the context of vaccine trials, other measures, such as flow-cytometry measurements, are also high-dimensional. In addition, such measurements are often repeated over time. This work is built on the idea that using the maximum of available information, by modeling prior knowledge and integrating all data at hand, will improve the inference and the interpretation of biological results from high-dimensional data. First, we present an original methodological development, Time-course Gene Set Analysis (TcGSA), for the analysis of longitudinal gene expression data, taking into account prior biological knowledge in the form of predefined gene sets. Second, we describe two integrative analyses of two different vaccine studies. The first study reveals lower expression of inflammatory pathways consistently associated with lower viral rebound following an HIV therapeutic vaccine. The second study highlights the role of a testosterone-mediated group of genes linked to lipid metabolism in sex differences in immunological response to a flu vaccine. Finally, we introduce a new model-based clustering approach for the automated treatment of cell populations from flow-cytometry data, namely a Dirichlet process mixture of skew t-distributions, with a sequential posterior approximation strategy for dealing with repeated measurements. Hence, the automatic recognition of the cell populations could allow a practical improvement of the daily work of immunologists as well as a better interpretation of gene expression data after taking into account the frequency of all cell populations. Automated gating Dirichlet process Flow cytometry Flu Gene set analysis High-dimensional data HIV Integrative analysis Mixture model Nonparametric Bayesian Prior knowledge Sexual dimorphism Skew t-distribution Statistical genomics Vaccine
20	A nonparametric Bayesian perspective for machine learning in partially-observed settings Akova, Ferit 31 July 2014 Indiana University-Purdue University Indianapolis (IUPUI) / Robustness and generalizability of supervised learning algorithms depend on how well the labeled data set represents the real-life problem. In many real-world domains, however, we may not have full knowledge of the underlying data-generating mechanism, which may even evolve, continually introducing new classes. This constitutes a partially-observed setting, where it would be impractical to obtain a labeled data set exhaustively defined by a fixed set of classes. Traditional supervised learning algorithms, assuming an exhaustive training library, would misclassify a future sample of an unobserved class with probability one, leading to an ill-defined classification problem. Our goal is to address situations where this assumption is violated by a non-exhaustive training library, a realistic yet overlooked issue in supervised learning. In this dissertation we pursue a new direction for supervised learning by defining self-adjusting models to relax the fixed model assumption imposed on classes and their distributions. We let the model adapt itself to the prospective data by dynamically adding new classes/components as the data demand, which in turn gradually makes the model more representative of the entire population. In this framework, we first employ suitably chosen nonparametric priors to model class distributions for observed as well as unobserved classes, and then use new inference methods to classify samples from observed classes and to discover and model novel classes for samples from unobserved classes.
This thesis presents the initiating steps of an ongoing effort to address one of the most overlooked bottlenecks in supervised learning and indicates the potential for taking new perspectives in some of the most heavily studied areas of machine learning: novelty detection, online class discovery and semi-supervised learning. Statistical decision Nonparametric statistics -- Research Mathematical statistics Stochastic processes Boosting (Algorithms) Statistics -- Data processing Machine learning Computational linguistics Data mining Computational intelligence
