1 |
Class discovery via feature selection in unsupervised settings
Curtis, Jessica 13 February 2016
Identifying genes linked to the appearance of certain types of cancers and their phenotypes is a well-known and challenging problem in bioinformatics. Discovering marker genes which, upon genetic mutation, drive the proliferation of different types and subtypes of cancer is critical for the development of advanced tests and therapies that will specifically identify, target, and treat certain cancers. Therefore, it is crucial to find methods that are successful in recovering "cancer-critical genes" from the (usually much larger) set of all genes in the human genome.
We approach this problem in the statistical context as a feature (or variable) selection problem for clustering, in the case where the number of important features is typically small (rare) and the signal of each important feature is typically minimal (weak). Genetic datasets typically consist of hundreds of samples (n), each with tens of thousands of gene-level measurements (p), resulting in the well-known statistical "large p small n" problem. The class or cluster identification is based on the clinical information associated with the type or subtype of the cancer (either known or unknown) for each individual. We discuss and develop novel feature ranking methods, which complement and build upon current methods in the field. These ranking methods are used to select the features that contain the most significant information for clustering. Retaining only a small set of useful features based on this ranking both reduces data dimensionality and identifies a set of genes that are crucial to understanding cancer subtypes.
In this paper, we present an outline of cutting-edge feature selection methods and provide a detailed explanation of our own contributions to the field. We explain both the practical properties and theoretical advantages of the new tools that we have developed. Additionally, we present a well-developed case study that applies these new feature selection methods to different levels of genetic data in order to examine their practical implementation within the field of bioinformatics.
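The abstract does not specify the ranking methods themselves, so the following is only a minimal sketch of the general idea of ranking features for clustering and keeping the top few. As a stand-in score it uses the best two-group split of each feature (between-group sum of squares divided by total sum of squares); this score, and the names `split_score` and `rank_features`, are assumptions for illustration and are not taken from the thesis.

```python
def split_score(values):
    """Best two-group split of one feature: between-group SS / total SS."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    mean = total / n
    total_ss = sum((x - mean) ** 2 for x in xs)
    if total_ss == 0.0:
        return 0.0  # a constant feature carries no clustering signal
    best = 0.0
    left_sum = 0.0
    for i in range(1, n):  # left group = xs[:i], right group = xs[i:]
        left_sum += xs[i - 1]
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        between = i * (left_mean - mean) ** 2 + (n - i) * (right_mean - mean) ** 2
        best = max(best, between / total_ss)
    return best


def rank_features(rows, k):
    """Return the indices of the k highest-scoring columns (rows: n samples x p features)."""
    p = len(rows[0])
    scores = [split_score([row[j] for row in rows]) for j in range(p)]
    order = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return order[:k]


# Toy data (illustrative only): column 0 is clearly bimodal, columns 1 and 2 are not.
rows = [
    [0.0, 0.0, 3.0],
    [0.1, 1.0, 1.0],
    [-0.1, 2.0, 4.0],
    [5.0, 3.0, 1.0],
    [5.1, 4.0, 5.0],
    [4.9, 5.0, 9.0],
]
print(rank_features(rows, 2))
```

In a real "large p small n" dataset the same ranking would run over tens of thousands of columns, and only the retained columns would be passed to a clustering algorithm.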
|
2 |
Machine Learning in the Open World
Yicheng Cheng (11197908) 29 July 2021
By Machine Learning in the Open World, we mean building models that can be used in a more realistic setting where something "unknown" can always occur. Beyond traditional machine learning tasks such as classification and segmentation, where all classes are predefined, we deal with the challenges posed by newly emerged classes, irrelevant classes, outliers, and class imbalance.
We first focus on the Non-Exhaustive Learning (NEL) problem from a statistical perspective. In NEL, we assume that the training classes are non-exhaustive, so the testing data may contain unknown classes, and we aim to build models that simultaneously perform classification and class discovery. We proposed a non-parametric Bayesian model that learns hyper-parameters from both the training classes and the discovered classes (the latter set is empty at the beginning), then infers the label partitioning under the guidance of the learned hyper-parameters, and repeats this procedure until convergence.
After obtaining good results on applications with plain, low-dimensional data, such as flow cytometry and some benchmark datasets, we moved on to Non-Exhaustive Feature Learning (NEFL). For NEFL, we extended our work with deep learning techniques to learn representations on datasets with complex structural and spatial correlations. We proposed a metric learning approach that learns a feature space which discriminates well among the training classes and generalizes well to unknown classes. We then developed variants of this metric learning algorithm to deal with outliers and irrelevant classes. We applied our final model to applications such as open world image classification, image segmentation, and SRS hyperspectral image segmentation, and obtained promising results.
Finally, we explored Out of Distribution (OOD) detection to identify irrelevant samples and outliers, completing the story.
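The goal of simultaneous classification and class discovery can be illustrated with a much simpler stand-in than the non-parametric Bayesian model described above. The sketch below assigns each test sample to its nearest class centroid and opens a new class when no centroid lies within a distance threshold. The function name `classify_and_discover`, the threshold rule, and the `new_k` labels are assumptions for illustration only; the thesis's model infers the partition probabilistically rather than with a hard threshold.

```python
import math


def classify_and_discover(samples, known_centroids, threshold):
    """Label each sample with an existing class, or open a new class if none is near.

    Assumes known_centroids contains at least one class.
    """
    # Each class keeps a running mean and a count; known classes start with count 1.
    classes = {label: (list(c), 1) for label, c in known_centroids.items()}
    labels = []
    n_new = 0
    for x in samples:
        nearest = min(classes, key=lambda lab: math.dist(x, classes[lab][0]))
        if math.dist(x, classes[nearest][0]) > threshold:
            # No existing class is close enough: discover a new one (hypothetical naming).
            nearest = f"new_{n_new}"
            n_new += 1
            classes[nearest] = (list(x), 1)
        else:
            # Update the running mean of the class that received the sample.
            mean, count = classes[nearest]
            count += 1
            mean = [m + (xi - m) / count for m, xi in zip(mean, x)]
            classes[nearest] = (mean, count)
        labels.append(nearest)
    return labels


known = {"a": [0.0, 0.0], "b": [10.0, 0.0]}
test_samples = [[0.5, 0.0], [9.5, 0.0], [0.0, 10.0], [0.2, 9.8]]
print(classify_and_discover(test_samples, known, threshold=3.0))
```

The discovered classes feed back into later assignments, which loosely mirrors the abstract's loop of learning from both training and discovered classes.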
|
3 |
Knowledge transfer and retention in deep neural networks
Fini, Enrico 17 April 2023
This thesis addresses the crucial problem of knowledge transfer and retention in deep neural networks. The ability to transfer knowledge from previously learned tasks and retain it for future use is essential for machine learning models to continually adapt to new tasks and improve their overall performance. In principle, knowledge can be transferred between any type of task, but we believe it to be particularly challenging in the field of computer vision, where the size and diversity of visual data often result in high compute requirements and the need for large, complex models. Hence, we analyze knowledge transfer and retention between unsupervised and supervised visual tasks, which form the main focus of this thesis. We categorize our efforts into several knowledge transfer and retention paradigms, and we tackle them with several contributions for the scientific community. The thesis proposes settings and methods based on knowledge distillation and self-supervised learning techniques. In particular, we devise two novel continual learning settings and seven new methods for knowledge transfer and retention, setting a new state of the art in a wide range of tasks. In conclusion, this thesis provides a valuable contribution to the field of computer vision and machine learning and sets a foundation for future work in this area.
|
4 |
Contrastive Filtering And Dual-Objective Supervised Learning For Novel Class Discovery In Document-Level Relation Extraction
Hansen, Nicholas 01 June 2024
Relation extraction (RE) is a task within natural language processing focused on the classification of relationships between entities in a given text. Primary applications of RE can be seen in various contexts such as knowledge graph construction and question answering systems. Traditional approaches to RE tend towards the prediction of relationships between exactly two entity mentions in small text snippets. However, with the introduction of datasets such as DocRED, research in this niche has progressed into examining RE at the document level. Document-level relation extraction (DocRE) disrupts conventional approaches, as it inherently introduces the possibility of multiple mentions of each unique entity throughout the document, along with a significantly higher probability of multiple relationships between entity pairs.
There have been many effective approaches to document-level RE in recent years utilizing various architectures, such as transformers and graph neural networks. However, all of these approaches focus on the classification of a fixed number of known relationships. Because of the large quantity of possible unique relationships in a given corpus, it is unlikely that all interesting and valuable relationship types are labeled beforehand. Furthermore, traditional naive approaches to clustering on unlabeled data to discover novel classes are not effective because of the unique problem of large true negative presence. Therefore, in this work we propose a multi-step filter-and-train approach leveraging the notion of contrastive representation learning to discover novel relationships at the document level. Additionally, we propose the use of an alternative pretrained encoder in an existing DocRE solution architecture to improve F1 performance in base multi-label classification on the DocRED dataset by 0.46.
To the best of our knowledge, this is the first exploration of novel class discovery applied to the document-level RE task. Based upon our holdout evaluation method, we increase novel class instance representation in the clustering solution by 5.5 times compared to the naive approach and increase the purity of novel class clusters by nearly 4 times. We then further enable the retrieval of both novel and known classes at test time, provided human labeling of cluster propositions, achieving a macro F1 score of 0.292 for novel classes. Finally, we note only a slight macro F1 decrease on previously known classes, from 0.402 with fully supervised training to 0.391 with our novel class discovery training approach.
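The two evaluation quantities named above, cluster purity and macro F1, can be sketched under their standard single-label definitions as follows. The thesis's exact evaluation protocol (including how multi-label relation instances are handled) is not given in the abstract, so the functions `purity` and `macro_f1` and the toy labels are illustrative assumptions only.

```python
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(true_labels)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy labels (illustrative only, not data from the thesis).
clusters = [0, 0, 0, 1, 1, 1]
labels = ["x", "x", "y", "y", "y", "z"]
print(purity(clusters, labels))
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Because macro F1 weights every class equally, rare novel classes count as much as frequent known ones, which is why it is a natural summary when novel and known classes are reported separately.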
|