1 |
Feature distribution learning for covariate shift adaptation using sparse filtering. Zennaro, Fabio. January 2017.
This thesis studies a family of unsupervised learning algorithms called feature distribution learning and their extension to perform covariate shift adaptation. Unsupervised learning is one of the most active areas of research in machine learning, and a central challenge in this field is to develop simple and robust algorithms able to work in real-world scenarios. A traditional assumption of machine learning is that data are independent and identically distributed. Unfortunately, in realistic conditions this assumption is often unmet and the performance of traditional algorithms may be severely compromised. Covariate shift adaptation has thus developed as a lively sub-field concerned with designing algorithms that can account for covariate shift, that is, a difference between the distributions of training and test samples.

The first part of this dissertation focuses on the study of a family of unsupervised learning algorithms that has recently been proposed and has shown promise: feature distribution learning. In particular, sparse filtering, the most representative feature distribution learning algorithm, has commanded interest because of its simplicity and state-of-the-art performance. Despite its success and its frequent adoption, sparse filtering lacks any strong theoretical justification. This research asks how feature distribution learning can be rigorously formalized and how the dynamics of sparse filtering can be explained. These questions are answered by first putting forward a new definition of feature distribution learning based on concepts from information theory and optimization theory; relying on this, a theoretical analysis of sparse filtering is carried out and validated on both synthetic and real-world data sets.

In the second part, the use of feature distribution learning algorithms to perform covariate shift adaptation is considered. Indeed, because of their definition and apparent insensitivity to the problem of modelling data distributions, feature distribution learning algorithms seem particularly well suited to dealing with covariate shift. This research asks whether and how feature distribution learning may be fruitfully employed to perform covariate shift adaptation. After making explicit the conditions of success for performing covariate shift adaptation, a theoretical analysis of sparse filtering and of a novel algorithm, periodic sparse filtering, is carried out; this determines the specific conditions under which these algorithms work successfully. Finally, these sparse filtering-based algorithms are compared against other traditional algorithms for covariate shift adaptation, showing that the novel algorithm achieves competitive performance.

In conclusion, this thesis provides a new rigorous framework to analyse and design feature distribution learning algorithms; it sheds light on the hidden assumptions behind sparse filtering, offering a clear understanding of its conditions of success; and it uncovers the potential and the limitations of sparse filtering-based algorithms in performing covariate shift adaptation. These results are relevant both for researchers interested in furthering the understanding of unsupervised learning algorithms and for practitioners interested in deploying feature distribution learning in an informed way.
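As a rough illustration of the algorithm this abstract centres on, the sketch below implements the sparse filtering objective as introduced by Ngiam et al. (2011): linear features are passed through a soft absolute value, normalised first per feature and then per example, and the sum of the resulting activations is minimised. The layer size, the soft-absolute epsilon, and the use of a generic L-BFGS optimiser with numerical gradients are illustrative assumptions, not details taken from the thesis.

```python
# Minimal sparse filtering sketch (objective only); a practical
# implementation would supply the analytic gradient instead of relying
# on finite differences.
import numpy as np
from scipy.optimize import minimize

def sparse_filtering_objective(w_flat, X, n_features):
    """Soft-absolute features, normalised per feature then per example,
    summed as an L1 sparsity penalty."""
    n_inputs, _ = X.shape
    W = w_flat.reshape(n_features, n_inputs)
    F = W @ X                                              # linear features (n_features x n_samples)
    Fs = np.sqrt(F ** 2 + 1e-8)                            # soft absolute value
    Fs = Fs / np.linalg.norm(Fs, axis=1, keepdims=True)    # normalise each feature across samples
    Fs = Fs / np.linalg.norm(Fs, axis=0, keepdims=True)    # normalise each sample across features
    return Fs.sum()

def fit_sparse_filter(X, n_features=16, seed=0):
    """Learn a small sparse filtering layer for data X of shape (n_inputs, n_samples)."""
    rng = np.random.default_rng(seed)
    w0 = rng.standard_normal(n_features * X.shape[0])
    result = minimize(sparse_filtering_objective, w0, args=(X, n_features),
                      method="L-BFGS-B")
    return result.x.reshape(n_features, X.shape[0])
```

The notable design choice, and the reason the abstract calls this a feature distribution learning method, is that the objective constrains only the distribution of the normalised feature activations (their sparsity), not an explicit model of the input density.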
|
2 |
Towards Robust and Adaptive Machine Learning: A Fresh Perspective on Evaluation and Adaptation Methodologies in Non-Stationary Environments. Bayram, Firas. January 2023.
Machine learning (ML) has become ubiquitous in various disciplines and applications, serving as a powerful tool for developing predictive models to analyze diverse variables of interest. With the advent of the digital era, the proliferation of data has presented numerous opportunities for growth and expansion across various domains. However, along with these opportunities, there is a unique set of challenges that arises due to the dynamic and ever-changing nature of data. These challenges include concept drift, which refers to shifting data distributions over time, and other data-related issues that can be framed as learning problems. Traditional static models are inadequate in handling these issues, underscoring the need for novel approaches to enhance the performance robustness and reliability of ML models to effectively navigate the inherent non-stationarity in the online world. The field of concept drift is characterized by several intricate aspects that challenge learning algorithms, including the analysis of model performance, which requires evaluating and understanding how the ML model's predictive capability is affected by different problem settings. Additionally, determining the magnitude of drift necessary for change detection is an indispensable task, as it involves identifying substantial shifts in data distributions. Moreover, the integration of adaptive methodologies is essential for updating ML models in response to data dynamics, enabling them to maintain their effectiveness and reliability in evolving environments. In light of the significance and complexity of the topic, this dissertation offers a fresh perspective on the performance robustness and adaptivity of ML models in non-stationary environments. The main contributions of this research include exploring and organizing the literature, analyzing the performance of ML models in the presence of different types of drift, and proposing innovative methodologies for drift detection and adaptation that solve real-world problems. By addressing these challenges, this research paves the way for the development of more robust and adaptive ML solutions capable of thriving in dynamic and evolving data landscapes. / Machine learning (ML) is widely used in various disciplines as a powerful tool for developing predictive models to analyze diverse variables. In the digital era, the abundance of data has created growth opportunities, but it also brings challenges due to the dynamic nature of data. One of these challenges is concept drift, the shifting data distributions over time. Consequently, traditional static models are inadequate for handling these challenges in the online world. Concept drift, with its intricate aspects, presents a challenge for learning algorithms. Analyzing model performance and detecting substantial shifts in data distributions are crucial for integrating adaptive methodologies to update ML models in response to data dynamics, maintaining effectiveness and reliability in evolving environments. In this dissertation, a fresh perspective is offered on the robustness and adaptivity of ML models in non-stationary environments. This research explores and organizes existing literature, analyzes ML model performance in the presence of drift, and proposes innovative methodologies for detecting and adapting to drift in real-world problems. The aim is to develop more robust and adaptive ML solutions capable of thriving in dynamic and evolving data landscapes.
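As a rough illustration of the drift-detection task discussed above, the sketch below flags concept drift by comparing a reference window of data against a recent window with a per-feature two-sample Kolmogorov-Smirnov test. The window-based setup, the significance level, and the Bonferroni correction are illustrative assumptions, not the methodologies proposed in the dissertation.

```python
# Minimal window-based drift detection sketch using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, recent, alpha=0.01):
    """Flag drift if any feature's distribution differs significantly
    between the reference window and the recent window.
    Both inputs are arrays of shape (n_samples, n_features)."""
    n_features = reference.shape[1]
    p_values = [ks_2samp(reference[:, j], recent[:, j]).pvalue
                for j in range(n_features)]
    # Bonferroni correction across features to limit false alarms.
    drift_detected = min(p_values) < alpha / n_features
    return drift_detected, p_values
```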
|
3 |
Integral Equations For Machine Learning Problems. Que, Qichao. 28 September 2016.
No description available.
|
4 |
Learning under differing training and test distributions. Bickel, Steffen. January 2008.
One of the main problems in machine learning is to train a predictive model from training data and to make predictions on test data. Most predictive models are constructed under the assumption that the training data are governed by the same distribution to which the model will later be exposed. In practice, control over the data collection process is often imperfect. A typical scenario is when labels are collected by questionnaires and one does not have access to the test population. For example, parts of the test population are underrepresented in the survey, out of reach, or do not return the questionnaire. In many applications, training data from the test distribution are scarce because they are difficult to obtain or very expensive. Data from auxiliary sources drawn from similar distributions are often cheaply available.
This thesis centers on learning under differing training and test distributions and covers several problem settings with different assumptions on the relationship between training and test distributions, including multi-task learning and learning under covariate shift and sample selection bias. Several new models are derived that directly characterize the divergence between training and test distributions, without the intermediate step of estimating the training and test distributions separately. An integral part of these models is a set of rescaling weights that matches the rescaled or resampled training distribution to the test distribution. Integrated models are studied in which only one optimization problem needs to be solved for learning under differing distributions. With a two-step approximation to the integrated models, almost any supervised learning algorithm can be adapted to biased training data.
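As a rough illustration of the rescaling-weight idea described above, the sketch below estimates importance weights by training a probabilistic classifier to separate training inputs from test inputs and converting its predicted odds into per-example weights; training examples reweighted this way approximate a sample from the test distribution. This is a generic logistic-regression sketch rather than the integrated models derived in the thesis; the clipping bound is an illustrative assumption.

```python
# Minimal covariate-shift reweighting sketch via a train/test discriminator.
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_train, X_test, clip=20.0):
    """Return one importance weight per training example,
    proportional to p_test(x) / p_train(x)."""
    X = np.vstack([X_train, X_test])
    s = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])  # 0 = train, 1 = test
    clf = LogisticRegression(max_iter=1000).fit(X, s)
    p_test = clf.predict_proba(X_train)[:, 1]             # P(test | x) for each training point
    # Odds ratio times the sample-size constant gives the density ratio.
    w = (p_test / (1.0 - p_test)) * (len(X_train) / len(X_test))
    return np.clip(w, 0.0, clip)                          # clip to limit variance from extreme weights

# Usage: pass the weights to any weighted learner,
# e.g. model.fit(X_train, y_train, sample_weight=covariate_shift_weights(X_train, X_test)).
```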
In case studies on spam filtering, HIV therapy screening, targeted advertising, and other applications, the performance of the new models is compared to state-of-the-art reference methods. / One of the most important problems in machine learning is to train predictive models from training data and to derive predictions for test data. Predictive models are usually based on the assumption that the training data are drawn from the same distribution as the test data. In practice, this assumption is often violated, for example when training data are collected by questionnaires. In that case, usually only a biased target population is available, since parts of the population may be underrepresented, out of reach, or may ignore the request to fill out the questionnaire. In many applications, only very few training data from the test distribution are available, because such data are expensive or laborious to collect. Data from alternative sources drawn from similar distributions are often much easier and cheaper to obtain.
This thesis deals with learning predictive models from training data whose distribution differs from the test distribution. Several problem settings are treated, based on different assumptions about the relationship between training and test distributions, including multi-task learning and learning under covariate shift and sample selection bias. Several new models are derived that directly characterize the difference between training and test distributions, without requiring separate estimates of the two distributions. Central components of these models are weighting factors with which the training distribution is mapped onto the test distribution by reweighting. Combined models for learning under differing training and test distributions are studied, whose estimation requires solving only a single optimization problem. The combined models can be approximated in two optimization steps, so that almost any common predictive model can be extended to correct for biased training distributions.
In case studies on email spam filtering, HIV therapy recommendation, targeted marketing, and other applications, the new models are compared to reference methods.
|