In the last decade, machine learning models have grown in size and in the amount of data they use, which has led to improved performance on many tasks. Most notably, there have been significant developments in end-to-end deep learning and reinforcement learning models, with new learning algorithms and architectures proposed frequently. Furthermore, while earlier methods focused on supervised learning, many models proposed in the last five years learn in semi-supervised or self-supervised ways. Such a model is then fine-tuned to a specific task or a different data domain. Adapting machine learning models learned on one type of data to similar but different data is called domain adaptation. This thesis discusses various challenges in the domain adaptation of machine learning models to specific tasks and real-world applications and proposes solutions for those challenges.
Data in real-world applications have different properties from the clean machine-learning datasets commonly used for the experimental evaluation of proposed models. Learning appropriate representations from high-dimensional complex data with internal dependencies is arduous due to the curse of dimensionality and spurious correlations. However, most real-world data have these properties, along with a small number of labeled samples, since labeling is expensive and tedious. Additionally, accuracy drops drastically when models are applied to domain-specific datasets and unbalanced problems. Moreover, state-of-the-art models cannot handle missing data. In this thesis, I strive to create frameworks that can learn a good representation of high-dimensional small data with correlations between variables.
The first chapter of this thesis describes the motivation, background, and research objectives. It also gives an overview of contributions and publications. The second chapter provides the background needed to understand this thesis, and the third chapter introduces domain adaptation. The fourth chapter discusses domain adaptation with small target data. It describes an algorithm for semi-supervised learning over domain-specific short texts such as reviews or tweets. The proposed framework achieves up to 12.6% improvement when only 5000 labeled examples are available. The fifth chapter explores the influence of unanticipated bias in fine-tuning data. This chapter outlines how bias in news data influences the classification performance on domain-specific text, where the domain is U.S. politics. It shows that fine-tuning with domain-specific data is not always beneficial, especially if bias towards one label is present. The sixth chapter examines domain adaptation on datasets with high missing rates. It reviews a system created to learn from high-dimensional small data from psychological studies, which have up to 70% missingness. The proposed framework achieves 9.3% lower imputation error and 33% lower prediction error. The seventh chapter discusses the curse of dimensionality problem in domain adaptation. It presents a methodology for discovering research articles containing evolutionary timetrees. That system can search for, download, and filter research articles in which timetrees are imported. It scans 5 million articles in a few days. The proposed method also decreases the error of finding research papers by 21% compared to the baseline, which cannot properly handle high-dimensional data. The eighth and final chapter summarizes the findings of this thesis and suggests future prospects. / Computer and Information Science
Identifier | oai:union.ndltd.org:TEMPLE/oai:scholarshare.temple.edu:20.500.12613/8499 |
Date | January 2023 |
Creators | Stanojevic, Marija, 0000-0001-8227-6577 |
Contributors | Obradovic, Zoran, Dragut, Eduard Constantin, Vucetic, Slobodan, Kumar, Sudhir |
Publisher | Temple University. Libraries |
Source Sets | Temple University |
Language | English |
Detected Language | English |
Type | Thesis/Dissertation, Text |
Rights | IN COPYRIGHT- This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available., http://rightsstatements.org/vocab/InC/1.0/ |
Relation | http://dx.doi.org/10.34944/dspace/8463, Theses and Dissertations |