About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Advances in Machine Learning for Compositional Data

Gordon Rodriguez, Elliott January 2022 (has links)
Compositional data refers to simplex-valued data, or equivalently, nonnegative vectors whose totals are uninformative. This data modality is of relevance across several scientific domains. A classical example of compositional data is the chemical composition of geological samples, e.g., major-oxide concentrations. A more modern example arises from the microbial populations recorded using high-throughput genetic sequencing technologies, e.g., the gut microbiome. This dissertation presents a set of methodological and theoretical contributions that advance the state of the art in the analysis of compositional data. Our work can be divided into two categories: problems in which compositional data represent the input to a predictive model, and problems in which they represent the output of the model. For the first class of problems, we build on the popular log-ratio framework to develop an efficient learning algorithm for high-dimensional compositional data. Our algorithm runs orders of magnitude faster than competing alternatives, without sacrificing model quality. For the second class of problems, we define a novel exponential family of probability distributions supported on the simplex. This distribution enjoys attractive mathematical properties and provides a performant probability model for simplex-valued outcomes. Taken together, our results constitute a broad contribution to the toolkit of researchers and practitioners studying compositional data.
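As a quick illustration of the log-ratio framework this abstract builds on, the sketch below implements the centered log-ratio (CLR) transform, the standard map from the simplex into Euclidean space; the toy data and function names are illustrative and not taken from the thesis.

```python
import numpy as np

def clr(x, eps=1e-9):
    """Centered log-ratio transform: maps simplex-valued rows into
    Euclidean coordinates where standard learning methods apply."""
    x = np.asarray(x, dtype=float) + eps       # guard against zero parts
    x = x / x.sum(axis=-1, keepdims=True)      # renormalize to the simplex
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)

# Toy compositions: relative abundances of three microbial taxa
samples = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.1, 0.8]])
print(clr(samples))  # each transformed row sums to ~0
```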
2

Advances in Machine Learning for Complex Structured Functional Data

Tang, Chengliang January 2022 (has links)
Functional data analysis (FDA) refers to a broad collection of statistical and machine learning methods that deal with data in the form of random functions. In general, functional data are assumed to lie in a constrained functional space, e.g., images and smooth curves, rather than a conventional Euclidean space, e.g., vectors of scalars. The explosion of massive data and high-performance computational resources brings exciting opportunities as well as new challenges to this field. On one hand, the rich information in modern functional data enables investigation of the underlying data patterns at an unprecedented scale and resolution. On the other hand, the inherent complex structures and huge sizes of modern functional data pose additional practical challenges to model building, model training, and model interpretation under various circumstances. This dissertation discusses recent advances in machine learning for analyzing complex structured functional data. Chapter 1 begins with a general introduction to examples of modern functional data and related data analysis challenges. Chapter 2 introduces a novel machine learning framework, artificial perceptual learning (APL), to tackle the problem of weakly supervised learning in functional remote sensing data. Chapter 3 develops a flexible function-on-scalar regression framework, Wasserstein distributional learning (WDL), to address the challenge of modeling density functional outputs. Chapter 4 concludes the dissertation and discusses future directions.
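The density-valued outputs that WDL targets are naturally compared in the Wasserstein metric. As a minimal illustration (not the authors' implementation), the sketch below computes the 1-Wasserstein distance between two distribution-valued observations using SciPy.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two density-valued "observations", each represented by a sample
f1 = rng.normal(loc=0.0, scale=1.0, size=1000)
f2 = rng.normal(loc=0.5, scale=1.2, size=1000)

# 1-Wasserstein distance between the two empirical distributions,
# the kind of loss a distributional regression framework minimizes
print(wasserstein_distance(f1, f2))
```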
3

Some topics in dimension reduction and clustering

Zhao, Jianhua, 赵建华 January 2009 (has links)
Published or final version. Statistics and Actuarial Science, Doctor of Philosophy.
4

Flexible Sparse Learning of Feature Subspaces

Ma, Yuting January 2017 (has links)
It is widely observed that the performance of many traditional statistical learning methods degenerates when confronted with high-dimensional data. One promising approach to preventing this downfall is to identify the intrinsic low-dimensional spaces in which the true signals are embedded and to pursue the learning process on these informative feature subspaces. This thesis focuses on the development of flexible sparse learning methods of feature subspaces for classification. Motivated by the success of some existing methods, we aim at learning informative feature subspaces for high-dimensional data of complex nature with better flexibility, sparsity and scalability. The first part of this thesis is inspired by the success of distance metric learning in casting flexible feature transformations by utilizing local information. We propose a nonlinear sparse metric learning algorithm, named sDist, that uses a boosting-based nonparametric solution to address the metric learning problem for high-dimensional data. Leveraging a rank-one decomposition of the symmetric positive semi-definite weight matrix of the Mahalanobis distance metric, we restructure a hard global optimization problem into a forward stage-wise learning of weak learners through a gradient boosting algorithm. In each step, the algorithm progressively learns a sparse rank-one update of the weight matrix by imposing an L1 regularization. Nonlinear feature mappings are adaptively learned by a hierarchical expansion of interactions integrated within the boosting framework. Meanwhile, an early stopping rule is imposed to control the overall complexity of the learned metric. As a result, without relying on computationally intensive tools, our approach automatically guarantees three desirable properties of the final metric: positive semi-definiteness, low rank and element-wise sparsity. Numerical experiments show that our learning model compares favorably with state-of-the-art methods in the metric learning literature. The second problem arises from the observation of high instability and feature selection bias when online methods are applied to highly sparse, high-dimensional data in sparse learning problems. Due to the heterogeneity in feature sparsity, existing truncation-based methods incur slow convergence and high variance. To mitigate this problem, we introduce a stabilized truncated stochastic gradient descent algorithm. We employ a soft-thresholding scheme on the weight vector in which the imposed shrinkage is adaptive to the amount of information available in each feature. The variability of the resulting sparse weight vector is further controlled by stability selection integrated with the informative truncation. To facilitate better convergence, we adopt an annealing strategy on the truncation rate. We show that, when the true parameter space is of low dimension, the stabilization with annealing strategy helps to achieve a lower expected regret bound.
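To make the truncation idea in the second part concrete, here is a minimal sketch of stochastic gradient descent with periodic soft-thresholding on a logistic loss. It uses a single uniform threshold, whereas the thesis's algorithm adapts the shrinkage per feature and adds stability selection and annealing; all names and constants here are illustrative.

```python
import numpy as np

def soft_threshold(w, tau):
    """Shrink every coordinate toward zero; those below tau vanish."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def truncated_sgd(X, y, lr=0.1, tau=0.01, truncate_every=10):
    """Logistic-loss SGD with periodic soft-threshold truncation,
    producing an element-wise sparse weight vector."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n):
        z = np.clip(X[t] @ w, -30, 30)         # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (p - y[t]) * X[t]            # stochastic gradient step
        if (t + 1) % truncate_every == 0:
            w = soft_threshold(w, tau)         # sparsify periodically
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 50))
true_w = np.zeros(50)
true_w[:5] = 1.0                               # low-dimensional true signal
y = (X @ true_w + rng.normal(size=500) > 0).astype(float)
print(np.nonzero(truncated_sgd(X, y))[0])      # mostly the first 5 features
```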
5

Advances in imbalanced data learning

Lu, Yang 29 August 2019 (has links)
With the increasing availability of large amounts of data across a wide range of applications, in both industry and academia, it becomes crucial to understand the nature of complex raw data in order to extract more value from data engineering. Although many problems have been successfully solved by mature machine learning techniques, learning from imbalanced data remains one of the challenges in data engineering and machine learning, and it has attracted growing attention in recent years due to its complexity. In this thesis, we focus on four aspects of imbalanced data learning and propose solutions to the key problems. The first aspect concerns ensemble methods for imbalanced data classification. Ensemble methods, e.g. bagging and boosting, can mitigate class imbalance when integrated with sampling methods. However, problems remain in the integration. One problem is that undersampling and oversampling are complementary to each other, and the sampling ratio is crucial to classification performance. This thesis introduces a new method, HSBagging, based on bagging with hybrid sampling. Experiments show that HSBagging outperforms other state-of-the-art bagging methods on imbalanced data. Another problem concerns the integration of boosting and sampling for imbalanced data classification. The classifier weights of existing AdaBoost-based methods are inconsistent with the objective of class-imbalanced classification. In this thesis, we propose a novel boosting optimization framework, GOBoost, which can be applied to any boosting-based method for class-imbalanced classification by simply replacing the calculation of the classifier weights. Experiments show that GOBoost-based methods significantly outperform the corresponding boosting-based methods. The second aspect concerns online learning for imbalanced data streams with concept drift. In the online learning scenario, if the data stream is imbalanced, it is difficult to detect concept drifts and adapt the online learner to them. The ensemble classifier weights are hard to adjust to achieve a balance between stability and adaptability. Besides, a classifier built on the samples in a fixed-size chunk, which may be highly imbalanced, is unstable in the ensemble. In this thesis, we propose Adaptive Chunk-based Dynamic Weighted Majority (ACDWM), which dynamically weighs the individual classifiers according to their performance on the current data chunk, while the chunk size is adaptively selected by statistical hypothesis tests. Experiments on both synthetic and real datasets with concept drift show that ACDWM outperforms both state-of-the-art chunk-based and online methods. In addition to imbalanced data classification, the third aspect concerns clustering of imbalanced data. This thesis studies a key problem of imbalanced data clustering, called the uniform effect, within the k-means-type framework, whereby clustering results tend to be balanced. This thesis introduces a new method called Self-adaptive Multi-prototype-based Competitive Learning (SMCL) for imbalanced clusters. It uses multiple subclusters to represent each cluster, with automatic adjustment of the number of subclusters; the subclusters are then merged into the final clusters based on a novel separation measure. Experimental results show the efficacy of SMCL for imbalanced clusters and its superiority over competitors.
Rather than a specific algorithm for imbalanced data learning, the final aspect concerns a measure of class imbalance in a dataset for classification. Recent studies have shown that the imbalance ratio is not the only cause of performance loss for a classifier on imbalanced data. To the best of our knowledge, there is no existing measure of the extent to which class imbalance influences the classification performance on a dataset. Accordingly, this thesis proposes a data measure called the Bayes Imbalance Impact Index (BI³) to reflect the extent of the influence attributable purely to imbalance for the whole dataset. We can therefore use BI³ to judge whether it is worthwhile to use imbalance recovery methods, such as sampling or cost-sensitive methods, to recover the performance loss of a classifier. Experiments show that BI³ is highly consistent with the improvement in F1 score achieved by imbalance recovery methods on both synthetic and real benchmark datasets. In summary, the contributions of this thesis are:
1. Two ensemble frameworks for imbalanced data classification, for sampling rate selection and boosting weight optimization, respectively.
2. A chunk-based online learning algorithm that dynamically adjusts the ensemble classifiers and selects the chunk size for imbalanced data streams with concept drift.
3. A multi-prototype competitive learning algorithm for clustering of imbalanced data.
4. A measure of imbalanced data that evaluates how the classification performance on a dataset is influenced by the factor of imbalance.
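As a rough sketch of the bagging-with-hybrid-sampling idea behind HSBagging (not the thesis's algorithm; in particular, the geometric-mean target size below is an arbitrary stand-in for its sampling-ratio selection), consider:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def hybrid_sample(X, y, rng):
    """Undersample the majority class and oversample the minority class
    toward a common intermediate size (here, their geometric mean)."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    target = int(np.sqrt(len(minority) * len(majority)))
    idx = np.concatenate([
        rng.choice(majority, size=target, replace=False),  # undersample
        rng.choice(minority, size=target, replace=True),   # oversample
    ])
    return X[idx], y[idx]

def hybrid_bagging(X, y, n_estimators=25, seed=0):
    """Bagging ensemble in which each tree is fit on a hybrid resample."""
    rng = np.random.default_rng(seed)
    return [DecisionTreeClassifier(max_depth=4).fit(*hybrid_sample(X, y, rng))
            for _ in range(n_estimators)]

def predict(models, X):
    """Majority vote over the ensemble."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```

Because each base learner sees a class-balanced resample, the majority class no longer dominates the ensemble vote.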
6

Essays on the use of probabilistic machine learning for estimating customer preferences with limited information

Padilla, Nicolas January 2021 (has links)
In this thesis, I explore in two essays how to augment thin historical purchase data with other sources of information, using Bayesian and probabilistic machine learning frameworks, to better infer customers' preferences and their future behavior. In the first essay, I posit that firms can better manage recently-acquired customers by using information from acquisition to inform future demand preferences for those customers. I develop a probabilistic machine learning model based on Deep Exponential Families to relate multiple acquisition characteristics to individual-level demand parameters, and I show that the model can flexibly capture non-linear relationships between acquisition behaviors and demand parameters. I estimate the model using data from a retail context and show that firms can better identify which new customers are the most valuable. In the second essay, I explore how to combine the information collected through the customer journey (search queries, clicks, and purchases, both within and across journeys) to infer the customer's preferences and likelihood of buying, in settings with thin purchase history and where preferences might change from one purchase journey to another. I propose a non-parametric Bayesian model that combines these different sources of information and accounts for what I call context heterogeneity: journey-specific preferences that depend on the context of the specific journey. I apply the model in the context of airline ticket purchases, using data from one of the largest travel search websites, and show that the model is able to accurately infer preferences and predict choice in an environment characterized by very thin historical data. I find strong context heterogeneity across journeys, reinforcing the idea that treating all journeys as stemming from the same set of preferences may lead to erroneous inferences.
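For readers unfamiliar with Deep Exponential Families, the following toy generative sketch (written from the published DEF literature, not from this thesis's model, with arbitrary layer sizes) shows the basic structure: layers of gamma latents whose means are propagated downward through positive weights to produce observed counts.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_def(n_customers, n_top=5, n_mid=10, n_obs=20):
    """Draw from a toy two-layer deep exponential family."""
    W1 = rng.gamma(1.0, 1.0, size=(n_top, n_mid))        # positive weights
    W0 = rng.gamma(1.0, 1.0, size=(n_mid, n_obs))
    z2 = rng.gamma(1.0, 1.0, size=(n_customers, n_top))  # top-layer latents
    z1 = rng.gamma(1.0, z2 @ W1)   # mean of each latent set by layer above
    return rng.poisson(z1 @ W0)    # observed counts, e.g. purchase events

print(sample_def(3))
```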
7

Interaction-Based Learning for High-Dimensional Data with Continuous Predictors

Huang, Chien-Hsun January 2014 (has links)
High-dimensional data, such as gene expression data from microarray experiments, may contain a substantial amount of useful information to be explored. However, the information, the relevant variables, and their joint interactions are usually diluted by noise from a large number of non-informative variables. Consequently, variable selection plays a pivotal role in learning for high-dimensional problems. Most traditional feature selection methods, such as Pearson's correlation between response and predictors, stepwise linear regression, and the LASSO, are popular linear methods. These methods are effective in identifying linear marginal effects but are limited in detecting non-linear or higher-order interaction effects. It is well known that epistasis (gene-gene interactions) may play an important role in gene expression, where unknown functional forms are difficult to identify. In this thesis, we propose a novel nonparametric measure to screen and select features based on information from nearest neighborhoods. The method is inspired by Lo and Zheng's earlier work (2002) on detecting interactions among discrete predictors. We apply a backward elimination algorithm based on this measure, which leads to the identification of many influential clusters of variables. The identified groups of variables can capture both marginal and interactive effects. Second, each identified cluster has the potential to perform predictions and classifications more accurately. We also study procedures for combining these groups of individual classifiers into a final predictor. Through simulations and real data analysis, the proposed measure is shown to be capable of identifying important variable sets and patterns, including higher-order interaction sets. The proposed procedure outperforms existing methods on three different microarray datasets. Moreover, the nonparametric measure is quite flexible and can easily be extended and applied to other areas of high-dimensional data and studies.
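A generic neighborhood-based screening loop in the spirit described here (not the thesis's specific measure) can be sketched as follows: score a variable subset by how well nearest neighbors in that subspace predict the response, then backward-eliminate variables whose removal does not hurt the score.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

def neighborhood_score(X, y, subset, k=5):
    """Predictive strength of a variable subset, measured by how well
    nearest neighbors in that subspace explain the response."""
    knn = KNeighborsRegressor(n_neighbors=k)
    return cross_val_score(knn, X[:, subset], y, cv=5).mean()

def backward_elimination(X, y, k=5):
    """Greedily drop variables whose removal does not lower the score;
    nonlinear and interaction effects are captured implicitly, since
    neighborhoods are formed jointly over the retained variables."""
    subset = list(range(X.shape[1]))
    best = neighborhood_score(X, y, subset, k)
    improved = True
    while improved and len(subset) > 1:
        improved = False
        for j in list(subset):
            trial = [v for v in subset if v != j]
            score = neighborhood_score(X, y, trial, k)
            if score >= best:
                best, subset, improved = score, trial, True
                break
    return subset, best
```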
8

Statistical Learning Methods for Personalized Medical Decision Making

Liu, Ying January 2016 (has links)
The theme of my dissertation is merging statistical modeling with medical domain knowledge and machine learning algorithms to assist in making personalized medical decisions. In its simplest form, making personalized medical decisions, for treatment choices and disease diagnosis modality choices, can be transformed into classification or prediction problems in machine learning, where the optimal decision for an individual is a decision rule that yields the best future clinical outcome or maximizes diagnostic accuracy. However, challenges emerge when analyzing complex medical data. On one hand, statistical modeling is needed to deal with inherent practical complications such as missing data, patients' loss to follow-up, and ethical and resource constraints in randomized controlled clinical trials. On the other hand, new data types and larger scales of data call for innovations combining statistical modeling, domain knowledge, and information technologies. This dissertation contains three parts, addressing the estimation of optimal personalized rules for choosing treatment, the estimation of optimal individualized rules for choosing a disease diagnosis modality, and methods for variable selection in the presence of missing data. In the first part of this dissertation, we propose a method to find optimal dynamic treatment regimens (DTRs) from Sequential Multiple Assignment Randomized Trial (SMART) data. DTRs are sequential decision rules, tailored at each stage of treatment by potentially time-varying patient features and intermediate outcomes observed at previous stages. The complexity, patient heterogeneity, and chronicity of many diseases and disorders call for learning optimal DTRs that best dynamically tailor treatment to each individual's response over time. We propose a robust and efficient approach, referred to as Augmented Multistage Outcome-Weighted Learning (AMOL), to identify optimal DTRs from sequential multiple assignment randomized trials. We improve outcome-weighted learning (Zhao et al., 2012) to allow for negative outcomes; we propose methods to reduce the variability of weights to achieve numerical stability and higher efficiency; and, for multiple-stage trials, we introduce robust augmentation to improve efficiency by drawing information from Q-function regression models at each stage. The proposed AMOL remains valid even if the regression model is misspecified. We formally justify that a proper choice of augmentation guarantees smaller stochastic errors in value function estimation for AMOL, and we establish convergence rates for AMOL. The comparative advantage of AMOL over existing methods is demonstrated in extensive simulation studies and in applications to two SMART data sets: a two-stage trial for attention deficit hyperactivity disorder and the STAR*D trial for major depressive disorder. The second part of the dissertation introduces a machine learning algorithm to estimate personalized decision rules for medical diagnosis/screening that maximize a weighted combination of sensitivity and specificity. Using subject-specific risk factors and feature variables, such rules administer screening tests with balanced sensitivity and specificity, and thus protect low-risk subjects from the unnecessary pain and stress caused by false positive tests, while achieving high sensitivity for subjects at high risk.
We conducted a simulation study mimicking a real breast cancer study and found significant improvements in sensitivity and specificity when comparing our personalized screening strategy (assigning mammography plus MRI to high-risk patients and mammography alone to low-risk subjects, based on a composite score of their risk factors) to one-size-fits-all strategies (assigning mammography plus MRI, or mammography alone, to all subjects). When applied to Parkinson's disease (PD) FDG-PET and fMRI data, we showed that the method provides individualized modality selection that can improve AUC, and that it yields interpretable decision rules for choosing a brain imaging modality for early detection of PD. To the best of our knowledge, this is the first work in the literature to propose automatic, data-driven learning algorithms for personalized diagnosis/screening strategies. In the last part of the dissertation, we propose a method, Multiple Imputation Random Lasso (MIRL), to select important variables and predict the outcome for an epidemiological study of Eating and Activity in Teens, in the presence of missing data. In this study, 80% of individuals have at least one variable missing, so using variable selection methods developed for complete data after list-wise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables is high and the missing proportion is high. MIRL shows improved performance compared with other applicable methods when applied to the Eating and Activity in Teens study for boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
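The core of MIRL, combining multiple imputation, resampled lasso fits, and stability selection, can be sketched roughly as follows; mean imputation with jitter stands in for a proper multiple imputation procedure, and the random-lasso weight perturbation is omitted, so this is an outline rather than the published method.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.impute import SimpleImputer

def selection_frequencies(X, y, n_imputations=5, n_boot=50, alpha=0.1, seed=0):
    """Fraction of imputation-bootstrap fits in which each variable gets a
    nonzero lasso coefficient; high-frequency variables are deemed stable."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    freq = np.zeros(d)
    for _ in range(n_imputations):
        # Stand-in for multiple imputation: mean-impute, then jitter
        Xi = SimpleImputer(strategy="mean").fit_transform(X)
        Xi = Xi + rng.normal(scale=1e-3, size=Xi.shape)
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)        # bootstrap resample
            coef = Lasso(alpha=alpha).fit(Xi[idx], y[idx]).coef_
            freq += (coef != 0)
    return freq / (n_imputations * n_boot)
```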
9

Statistical machine learning for data mining and collaborative multimedia retrieval

Hoi, Chu Hong January 2006 (has links)
Statistical machine learning techniques have been widely applied in data mining and multimedia information retrieval. While traditional methods, such as supervised learning, unsupervised learning, and active learning, have been extensively studied separately, there are few comprehensive schemes that investigate these techniques in a unified approach. This thesis proposes a unified learning paradigm (ULP) framework that integrates several machine learning techniques, including supervised learning, unsupervised learning, semi-supervised learning, active learning, and metric learning, in a synergistic way to maximize the effectiveness of a learning task.
Based on this unified learning framework, a novel scheme is suggested for learning Unified Kernel Machines (UKM). The UKM scheme combines supervised kernel machine learning, unsupervised kernel design, semi-supervised kernel learning, and active learning in an effective fashion. A key component of the UKM scheme is learning kernels from both labeled and unlabeled data. To this end, a new Spectral Kernel Learning (SKL) algorithm is proposed, which is formulated as a quadratic program. Empirical results show that the UKM technique is promising for classification tasks.
Within the unified learning framework, this thesis further explores two important and challenging tasks. One is Batch Mode Active Learning (BMAL). In contrast to traditional approaches, the BMAL method searches for a batch of informative examples for labeling. To develop an effective algorithm, the BMAL task is formulated as a convex optimization problem, and a novel bound optimization algorithm is proposed to solve it efficiently with global optima. Extensive evaluations on text categorization tasks show that the BMAL algorithm is superior to traditional methods.
The other issue studied in the framework is Distance Metric Learning (DML). Learning distance metrics is critical to many machine learning tasks, especially when contextual information is available. To learn effective metrics from pairwise contextual constraints, two novel methods, Discriminative Component Analysis (DCA) and Kernel DCA, are proposed to learn both linear and nonlinear distance metrics. Empirical results on data clustering validate the advantages of the algorithms.
In addition to the above methodologies, this thesis also addresses practical issues in applying machine learning techniques to real-world applications. For example, in a time-dependent data mining application, marginalized kernel techniques are suggested to formulate an effective domain-specific kernel aimed at web data mining tasks.
Last, the thesis investigates statistical machine learning techniques with applications to multimedia retrieval and addresses practical issues such as robustness to noise and scalability. To bridge the semantic gap in multimedia retrieval, a Collaborative Multimedia Retrieval (CMR) scheme is proposed that exploits the historical log data of users' relevance feedback to improve retrieval tasks. Two types of learning tasks in the CMR scheme are identified, and two innovative algorithms are proposed to solve them effectively.
Thesis (Ph.D.)--Chinese University of Hong Kong, September 2006. Adviser: Michael R. Lyu. Includes bibliographical references (p. 203-223). Abstracts in English and Chinese.
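As a point of comparison for the BMAL formulation above, here is the simplest batch selection baseline, greedy uncertainty sampling; the thesis instead solves a convex optimization with a bound optimization algorithm, so treat this purely as an illustrative stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_batch(model, X_pool, batch_size=10):
    """Pick the pool points whose predicted probability is closest to 0.5."""
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:batch_size]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

labeled = np.arange(20)                           # small labeled seed set
pool = np.setdiff1d(np.arange(len(X)), labeled)   # unlabeled pool
model = LogisticRegression().fit(X[labeled], y[labeled])
query = pool[select_batch(model, X[pool])]        # indices to label next
print(query)
```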
10

Statistical Perspectives on Modern Network Embedding Methods

Davison, Andrew January 2022 (has links)
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering, and link prediction, performed on diverse data sets including protein-protein interaction networks, social networks, and citation networks. A frequent approach to these tasks begins by learning a Euclidean embedding of the network, to which machine learning algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods in which the sub-sampling scheme can be freely chosen; this distinguishes the setting from that of traditional i.i.d. data, where there is essentially only one way of subsampling the data: selecting the data points uniformly and without replacement. Despite the strong empirical performance of embeddings produced in this manner, they are not well understood theoretically, particularly with regard to the role of the sampling scheme. Here, we develop a unifying framework that encapsulates representation learning methods for networks trained via gradient updates obtained by subsampling the network, including random-walk based approaches such as node2vec. In particular, we prove, under the assumption that the network has an exchangeable law, that the distribution of the learned embedding vectors asymptotically decouples. We characterize the asymptotic distribution of the learned embedding vectors and give the corresponding rates of convergence, which depend on factors such as the sampling scheme, the choice of loss function, and the choice of embedding dimension. This provides a theoretical foundation for understanding what the embedding vectors represent and how well these methods perform on downstream tasks; in particular, we apply our results to argue that the embedding vectors produced by node2vec can be used to perform weakly consistent community detection.
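To make the role of the subsampling scheme concrete, the sketch below generates uniform random walks and the (center, context) pairs on which embedding gradient updates are performed; node2vec additionally biases the walks with return and in-out parameters p and q, which this simplified version omits.

```python
import numpy as np

def random_walks(adj, walk_length=10, walks_per_node=5, seed=0):
    """Uniform random walks over a graph given as an adjacency list."""
    rng = np.random.default_rng(seed)
    walks = []
    for start in range(len(adj)):
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_length - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:                      # dead end: stop the walk
                    break
                walk.append(int(rng.choice(nbrs)))
            walks.append(walk)
    return walks

def skipgram_pairs(walks, window=2):
    """(center, context) pairs: the subsample of node co-occurrences
    on which the embedding vectors receive gradient updates."""
    for walk in walks:
        for i, u in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield u, walk[j]

adj = [[1, 2], [0, 2], [0, 1, 3], [2]]            # toy 4-node graph
pairs = list(skipgram_pairs(random_walks(adj)))
print(pairs[:5])
```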
