1 |
Robust Approaches for Learning with Noisy Labels. Lu, Yangdi. January 2022 (has links)
Deep neural networks (DNNs) have achieved remarkable success in data-intensive applications, but this success relies heavily on massive, carefully labeled data. In practice, obtaining large-scale datasets with correct labels is often expensive, time-consuming, and sometimes even impossible. Common approaches to constructing datasets involve some degree of error-prone processing, such as automatic labeling or crowdsourcing, which inherently introduces noisy labels. It has been observed that noisy labels severely degrade the generalization performance of classifiers, especially overparameterized (deep) neural networks. Therefore, studying noisy labels and developing techniques for training accurate classifiers in their presence is of great practical significance. In this thesis, we conduct a thorough study of learning with noisy labels (LNL) and provide a comprehensive error decomposition that reveals its core issue: the empirical risk minimizer is unreliable, i.e., DNNs are prone to overfitting noisy labels during training. To reduce the learning errors, we propose five methods: 1) Co-matching, a framework consisting of two networks that prevents the model from memorizing noisy labels; 2) SELC, a simple method that progressively corrects noisy labels and refines the model; 3) NAL, a regularization method that automatically distinguishes mislabeled samples and prevents the model from memorizing them; 4) EM-enhanced loss, a family of robust loss functions that not only mitigates the influence of noisy labels but also avoids the underfitting problem; 5) MixNN, a framework that trains the model on newly synthesized samples to mitigate the impact of noisy labels. Our experimental results demonstrate that the proposed approaches achieve comparable or better performance than state-of-the-art approaches on benchmark datasets with simulated label noise and on large-scale datasets with real-world label noise. / Dissertation / Doctor of Philosophy (PhD) / Machine learning has been highly successful in data-intensive applications but is often hampered when datasets contain noisy labels. Learning with Noisy Labels (LNL) has recently been proposed to tackle this problem. Using techniques from LNL, models can still generalize well even when trained on data containing noisy supervision. In this thesis, we study this crucial problem and provide a comprehensive analysis to reveal the core issue of LNL. We then propose five different methods to effectively reduce the learning errors in LNL. We show that our approaches achieve comparable or better performance than state-of-the-art approaches on benchmark datasets with simulated label noise and on real-world noisy datasets.
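The abstract describes SELC only at a high level. As a rough, hypothetical illustration of progressive label correction (not the thesis's exact formulation), the sketch below blends a model's running predictions into the noisy training targets with an exponential moving average, so a mislabeled sample gradually drifts toward the network's belief; the update rule, the hyper-parameter alpha, and the toy numbers are assumptions.

```python
import numpy as np

def update_soft_labels(soft_labels, model_probs, alpha=0.9):
    """Illustrative EMA update: keep most of the current targets, mix in model predictions."""
    return alpha * soft_labels + (1.0 - alpha) * model_probs

# toy setup: 3 samples, 2 classes; the second sample carries a wrong (noisy) label
noisy_onehot = np.array([[1.0, 0.0],
                         [1.0, 0.0],   # mislabeled sample
                         [0.0, 1.0]])
soft_labels = noisy_onehot.copy()
model_probs = np.array([[0.9, 0.1],
                        [0.2, 0.8],
                        [0.1, 0.9]])   # stand-in for softmax outputs of a trained network
for epoch in range(20):
    soft_labels = update_soft_labels(soft_labels, model_probs)
print(soft_labels.round(2))            # the mislabeled row drifts toward the model's belief [0.2, 0.8]
```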
|
2 |
Textová klasifikace s limitovanými trénovacími daty / Text classification with limited training data. Laitoch, Petr. January 2021 (has links)
The aim of this thesis is to minimize the manual work needed to create training data for text classification tasks. Various research areas, including weak supervision, interactive learning, and transfer learning, explore how to minimize the effort of creating training data. We combine ideas from the available literature to design a comprehensive text classification framework that employs keyword-based labeling instead of traditional text annotation. Keyword-based labeling aims to label texts based on keywords contained in the texts that are highly correlated with individual classification labels. As noted repeatedly in previous work, coming up with many new keywords is challenging for humans. To address this issue, we propose an interactive keyword labeler that uses word similarity to guide the user in keyword labeling. To verify the effectiveness of our novel approach, we implement a minimum viable prototype of the designed framework and use it to perform a user study on a restaurant-review multi-label classification problem.
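The interactive keyword labeler is described here only in terms of word similarity. As a hedged sketch of that idea, the snippet below ranks candidate keywords by cosine similarity to a seed keyword using toy word vectors; the vectors, vocabulary, and function names are hypothetical, and a real system would load pretrained embeddings such as word2vec.

```python
import numpy as np

def suggest_keywords(seed, vocab_vectors, top_k=3):
    """Rank vocabulary words by cosine similarity to a seed keyword's vector."""
    seed_vec = vocab_vectors[seed]
    scores = {}
    for word, vec in vocab_vectors.items():
        if word == seed:
            continue
        scores[word] = float(np.dot(seed_vec, vec) /
                             (np.linalg.norm(seed_vec) * np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# toy 3-d "word vectors" for a restaurant-review domain
vocab = {
    "pizza":   np.array([0.9, 0.1, 0.0]),
    "pasta":   np.array([0.8, 0.2, 0.1]),
    "service": np.array([0.1, 0.9, 0.2]),
    "waiter":  np.array([0.2, 0.8, 0.3]),
}
print(suggest_keywords("pizza", vocab))  # ['pasta', 'waiter', 'service']
```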
|
3 |
Towards Learning Compact Visual Embeddings using Deep Neural Networks. January 2019 (has links)
abstract: Feature embeddings differ from raw features in that the former obey certain properties, such as a notion of similarity/dissimilarity in their embedding space. word2vec is a preeminent example in this direction, where similarity in the embedding space is measured by cosine similarity. Such language embedding models have seen numerous applications in both the language and vision communities because they capture the information in their modality (the English language) efficiently. Inspired by these language models, this work focuses on learning embedding spaces for two visual computing tasks: 1) Image Hashing and 2) Zero-Shot Learning. The training set is used to learn embedding spaces over which similarity/dissimilarity is measured using several distance metrics, such as Hamming, Euclidean, and cosine distances. While the above-mentioned language models learn generic word embeddings, in this work task-specific embeddings are learned that can be used for Image Retrieval and Classification separately.
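To make the distance metrics mentioned above concrete, here is a minimal sketch (toy data, not the thesis's code) comparing a query to a database item under Hamming distance for binary hash codes and Euclidean/cosine measures for real-valued embeddings.

```python
import numpy as np

def hamming(a, b):
    """Number of positions where two binary codes differ."""
    return int(np.count_nonzero(a != b))

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 8-bit hash codes and 4-d real-valued embeddings for a query/database pair
code_q  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
code_db = np.array([1, 0, 0, 1, 0, 1, 1, 0])
emb_q   = np.array([0.3, 0.8, 0.1, 0.5])
emb_db  = np.array([0.4, 0.7, 0.0, 0.6])
print(hamming(code_q, code_db))               # 2 differing bits
print(round(euclidean(emb_q, emb_db), 3))
print(round(cosine_sim(emb_q, emb_db), 3))
```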
Image Hashing is the task of mapping images to binary codes such that some notion of user-defined similarity is preserved. The first part of this work focuses on designing a new framework that uses the hashtags associated with web images to learn the binary codes. Such codes can be used in several applications, such as Image Retrieval and Image Classification. Further, this framework requires no labelled data, making it very inexpensive. Results show that the proposed approach surpasses state-of-the-art approaches by a significant margin.
Zero-shot classification is the task of classifying a test sample into a new class that was not seen during training. This is possible by establishing a relationship between the training and testing classes using auxiliary information. In the second part of this thesis, a framework is designed that trains using hand-crafted attribute vectors and word vectors but does not require the expensive attribute vectors at test time. More specifically, an intermediate space is learned between the word vector space and the image feature space using the hand-crafted attribute vectors. Preliminary results on two zero-shot classification datasets show that this is a promising direction to explore. / Dissertation/Thesis / Masters Thesis Computer Engineering 2019
|
4 |
Supervision Beyond Manual Annotations for Learning Visual Representations. Doersch, Carl. 01 April 2016 (has links)
For both humans and machines, understanding the visual world requires relating new percepts with past experience. We argue that a good visual representation for an image should encode what makes it similar to other images, enabling the recall of associated experiences. Current machine implementations of visual representations can capture some aspects of similarity, but fall far short of human ability overall. Even if one explicitly labels objects in millions of images to tell the computer what should be considered similar, a very expensive procedure, the labels still do not capture everything that might be relevant. This thesis shows that one can often train a representation which captures similarity beyond what is labeled in a given dataset. That means we can begin with a dataset that has uninteresting labels, or no labels at all, and still build a useful representation. To do this, we propose using pretext tasks: tasks that are not useful in and of themselves, but serve as an excuse to learn a more general-purpose representation. The labels for a pretext task can be inexpensive or even free. Furthermore, since this approach assumes training labels differ from the desired outputs, it can handle output spaces where the correct answer is ambiguous, and therefore impossible to annotate by hand. The thesis explores two broad classes of supervision. The first is weak image-level supervision, which is exploited to train mid-level discriminative patch classifiers. For example, given a dataset of street-level imagery labeled only with GPS coordinates, patch classifiers are trained to differentiate one specific geographical region (e.g. the city of Paris) from others. The resulting classifiers each automatically collect and associate a set of patches which all depict the same distinctive architectural element. In this way, we can learn to detect elements like balconies, signs, and lamps without annotations. The second type of supervision requires no information about images other than the pixels themselves. Instead, the algorithm is trained to predict the context around image patches. The context serves as a sort of weak label: to predict well, the algorithm must associate similar-looking patches which also have similar contexts. After training, the feature representation learned using this within-image context indeed captures visual similarity across images, which ultimately makes it useful for real tasks like object detection and geometry estimation.
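The within-image context supervision is only summarized above. The sketch below shows one plausible way such a pretext task can be set up: sample a centre patch and one of its eight neighbours, and use the neighbour's relative position as a free label for a classifier to predict. Patch size, offsets, and function names are illustrative assumptions, not the thesis's exact pipeline.

```python
import numpy as np

# the eight possible positions of a neighbouring patch relative to a centre patch
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_context_pair(image, patch=32, rng=np.random.default_rng(0)):
    """Cut a centre patch and one of its 8 neighbours; the neighbour's index is the free label."""
    h, w = image.shape[:2]
    cy = rng.integers(patch, h - 2 * patch)
    cx = rng.integers(patch, w - 2 * patch)
    label = int(rng.integers(len(OFFSETS)))
    dy, dx = OFFSETS[label]
    centre = image[cy:cy + patch, cx:cx + patch]
    neighbour = image[cy + dy * patch:cy + (dy + 1) * patch,
                      cx + dx * patch:cx + (dx + 1) * patch]
    return centre, neighbour, label  # a network would be trained to predict `label` from the two patches

img = np.zeros((256, 256, 3), dtype=np.uint8)   # stand-in for a real photograph
c, n, y = sample_context_pair(img)
print(c.shape, n.shape, y)                      # (32, 32, 3) (32, 32, 3) and a label in [0, 8)
```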
|
5 |
Weakly supervised methods for learning actions and objects. Prest, Alessandro. 04 September 2012 (links) (PDF)
Modern Computer Vision systems learn visual concepts through examples (i.e. images) which have been manually annotated by humans. While this paradigm allowed the field to progress tremendously in the last decade, it has now become one of its major bottlenecks. Teaching a new visual concept requires an expensive human annotation effort, preventing systems from scaling beyond the few dozen visual concepts that work today to thousands. The exponential growth of visual data available on the net represents an invaluable resource for visual learning algorithms and calls for new methods able to exploit this information to learn visual concepts without a major human annotation effort. As a first contribution, we introduce an approach for learning human actions as interactions between persons and objects in realistic images. By exploiting the spatial structure of human-object interactions, we are able to learn action models automatically from a set of still images annotated only with the action label (weakly supervised). Extensive experimental evaluation demonstrates that our weakly-supervised approach achieves the same performance as popular fully-supervised methods despite using substantially less supervision. In the second part of this thesis we extend this reasoning to human-object interactions in realistic video and feature-length movies. Popular methods represent actions with low-level features such as image gradients or optical flow. In our approach, instead, interactions are modeled as the trajectory of the object with respect to the person's position, providing a rich and natural description of actions. Our interaction descriptor is an informative cue on its own and is complementary to traditional low-level features. Finally, in the third part we propose an approach for learning object detectors from real-world web videos (i.e. YouTube). As opposed to the standard paradigm of learning from still images annotated with bounding boxes, we propose a technique to learn from videos known only to contain objects of a target class. We demonstrate that learning detectors from video alone already delivers good performance while requiring much less supervision compared to training from images annotated with bounding boxes. We additionally show that training on a combination of weakly annotated videos and fully annotated still images improves over training on still images alone.
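As a minimal sketch of the interaction descriptor idea (the object trajectory expressed relative to the person), the snippet below subtracts the person's per-frame position from the object's; the trajectories, frame count, and function name are toy assumptions rather than the thesis's actual feature extraction.

```python
import numpy as np

def interaction_descriptor(person_xy, object_xy):
    """Trajectory of the object expressed relative to the person's position, one (dx, dy) per frame."""
    rel = np.asarray(object_xy, dtype=float) - np.asarray(person_xy, dtype=float)
    return rel.flatten()  # fixed-length vector when all clips share the same frame count

# toy 5-frame trajectories (x, y per frame): an object moving towards a mostly static person
person = [(100, 200), (100, 200), (101, 199), (101, 199), (102, 198)]
obj    = [(140, 260), (135, 245), (130, 230), (125, 215), (120, 205)]
print(interaction_descriptor(person, obj))
```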
|
6 |
Modeling Structured Data with Invertible Generative Models. Lu, You. 01 February 2022 (has links)
Data is complex and comes in a variety of structures and formats. Modeling datasets is a core problem in modern artificial intelligence. Generative models are machine learning models that model datasets with probability distributions. Deep generative models combine deep learning with probability theory so that they can model complicated datasets with flexible models. They have become one of the most popular model classes in machine learning and have been applied to many problems.
Normalizing flows are a novel class of deep generative models that allow efficient exact likelihood calculation, exact latent variable inference, and sampling. They are constructed using functions whose inverse and Jacobian determinant can be computed efficiently. In this dissertation, we develop normalizing-flow-based generative models to model complex datasets. In general, data can be categorized into unlabeled data, labeled data, and weakly labeled data. We develop models for each of these three types of data.
First, we develop Woodbury transformations, flow layers for general unsupervised normalizing flows that improve the flexibility and scalability of current flow-based models. Woodbury transformations achieve efficient invertibility via the Woodbury matrix identity and efficient determinant calculation via Sylvester's determinant identity. In contrast with other operations used in state-of-the-art normalizing flows, Woodbury transformations enable (1) high-dimensional interactions, (2) efficient sampling, and (3) efficient likelihood evaluation. Other similar operations, such as 1x1 convolutions, emerging convolutions, or periodic convolutions, allow at most two of these three advantages. In our experiments on multiple image datasets, we find that Woodbury transformations allow learning higher-likelihood models than other flow architectures while still enjoying their efficiency advantages.
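For readers unfamiliar with the two identities named above, the short numerical check below shows how a low-rank update to an easily inverted matrix can be inverted, and its determinant computed, through small k x k computations. It illustrates the identities themselves, not the thesis's flow-layer implementation; matrix sizes and the diagonal base matrix are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 2                                    # full dimension vs. low-rank update size
A = np.diag(rng.uniform(1.0, 2.0, d))          # cheap-to-invert base matrix
U = rng.normal(size=(d, k))
V = rng.normal(size=(k, d))
C = np.eye(k)

# Woodbury matrix identity: invert (A + U C V) using only a k x k inverse
A_inv = np.diag(1.0 / np.diag(A))
small = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
woodbury_inv = A_inv - A_inv @ U @ small @ V @ A_inv
print(np.allclose(woodbury_inv, np.linalg.inv(A + U @ C @ V)))   # True

# determinant of a rank-k update via a k x k determinant
# (matrix determinant lemma, which follows from Sylvester's determinant identity)
lhs = np.linalg.det(A + U @ V)
rhs = np.linalg.det(A) * np.linalg.det(np.eye(k) + V @ A_inv @ U)
print(np.isclose(lhs, rhs))                                       # True
```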
Second, we propose conditional Glow (c-Glow), a conditional generative flow for structured output learning, an advanced variant of supervised learning with structured labels. Traditional structured prediction models try to learn a conditional likelihood, i.e., p(y|x), to capture the relationship between the structured output y and the input features x. For many models, computing this likelihood is intractable, so they are hard to train, requiring surrogate objectives or variational inference to approximate the likelihood. c-Glow benefits from the ability of flow-based models to compute p(y|x) exactly and efficiently, so learning with c-Glow requires neither a surrogate objective nor inference during training. Once trained, we can directly and efficiently generate conditional samples. We develop a sample-based prediction method that uses this advantage to perform efficient and effective inference. In our experiments, we test c-Glow on five different tasks. c-Glow outperforms the state-of-the-art baselines on some tasks and predicts comparable outputs on the others. The results show that c-Glow is applicable to many different structured prediction problems.
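The exact conditional likelihood of a flow comes from the change-of-variables formula, log p(y|x) = log p_Z(z) + log|det dz/dy| with z = f(y; x). The toy one-dimensional sketch below evaluates this for a single affine conditional layer; the conditioner functions mu(x) and s(x) are simple stand-ins, not c-Glow's architecture.

```python
import numpy as np

def conditional_affine_flow_logprob(y, x):
    """log p(y|x) for a toy one-layer conditional flow: z = (y - mu(x)) * exp(-s(x)).

    Change of variables: log p(y|x) = log N(z; 0, 1) + log|dz/dy| = log N(z; 0, 1) - s(x).
    mu and s stand in for the conditioning network of a real conditional flow.
    """
    mu, s = 2.0 * x, 0.5 * x                      # toy conditioners
    z = (y - mu) * np.exp(-s)
    log_base = -0.5 * (z ** 2 + np.log(2 * np.pi))  # standard normal base density
    log_det = -s                                    # dz/dy = exp(-s)
    return log_base + log_det

print(conditional_affine_flow_logprob(y=1.3, x=0.4))
```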
Third, we develop label learning flows (LLF), a general framework for weakly supervised learning problems. Our method is a generative model based on normalizing flows. The main idea of LLF is to optimize the conditional likelihoods of all possible labelings of the data within a constrained space defined by weak signals. We develop a training method for LLF that trains the conditional flow inversely and avoids estimating the labels. Once a model is trained, we can make predictions with a sampling algorithm. We apply LLF to three weakly supervised learning problems. Experimental results show that our method outperforms many state-of-the-art alternatives.
Our research shows the advantages and versatility of normalizing flows. / Doctor of Philosophy / Data is now more affordable and accessible. At the same time, datasets are more and more complicated. Modeling data is a key problem in modern artificial intelligence and data analysis. Deep generative models combine deep learning and probability theory and are now a major way to model complex datasets. In this dissertation, we focus on a novel class of deep generative models, normalizing flows, which are becoming popular because of their ability to efficiently compute exact likelihoods, infer exact latent variables, and draw samples. We develop flow-based generative models for different types of data, i.e., unlabeled data, labeled data, and weakly labeled data. First, we develop Woodbury transformations for unsupervised normalizing flows, which improve the flexibility and expressiveness of flow-based models. Second, we develop conditional generative flows for an advanced supervised learning problem, structured output learning, which removes the need for approximations and surrogate objectives in traditional (deep) structured prediction models. Third, we develop label learning flows, a general framework for weakly supervised learning problems. Our research improves the performance of normalizing flows and extends their applications to many supervised and weakly supervised problems.
|
7 |
Interpretable Fine-Grained Visual Categorization. Guo, Pei. 16 June 2021 (has links)
Not all categories are created equal in object recognition. Fine-grained visual categorization (FGVC) is a branch of visual object recognition that aims to distinguish subordinate categories within a basic-level category. Examples include classifying an image of a bird into a specific species like "Western Gull" or "California Gull". Such subordinate categories exhibit small inter-category variation and large intra-class variation, making them extremely difficult to distinguish. To address these challenges, an algorithm should be able to focus on object parts and be invariant to object pose. Like many other computer vision tasks, FGVC has witnessed phenomenal advancement following the resurgence of deep neural networks. However, the proposed deep models are usually treated as black boxes. Network interpretation and understanding aim to unveil the features learned by neural networks and explain the reasons behind network decisions. This is not only a necessary component for building trust between humans and algorithms, but also an essential step towards continuous improvement in this field. This dissertation is a collection of papers that contribute to FGVC and to neural network interpretation and understanding. Our first contribution is an algorithm named Pose and Appearance Integration for Recognizing Subcategories (PAIRS), which performs pose estimation and generates a unified object representation as the concatenation of pose-aligned region features. As the second contribution, we propose the task of semantic network interpretation. For filter interpretation, we represent the concepts a filter detects using an attribute probability density function. We propose the task of semantic attribution using textual summarization, which generates an explanatory sentence consisting of the most important visual attributes for decision-making, as found by a general Bayesian inference algorithm. Pooling has been a key component in convolutional neural networks and is of special interest in FGVC. Our third contribution is an empirical and experimental study towards a thorough yet intuitive understanding and extensive benchmark of popular pooling approaches. Our fourth contribution is a novel LMPNet for weakly-supervised keypoint discovery. A novel leaky max pooling layer is proposed to explicitly encourage sparse feature maps to be learned, and a learnable clustering layer is proposed to group the keypoint proposals into final keypoint predictions. 2020 marks the 10th year since the beginning of fine-grained visual categorization, so it is of great importance to summarize the representative works in this domain. Our last contribution is a comprehensive survey of FGVC covering nearly 200 relevant papers across 7 common themes.
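As background for the pooling benchmark mentioned above, the snippet below contrasts the two most common pooling operators on a toy feature map: max pooling keeps the strongest activation in each window while average pooling smooths over it. This illustrates the standard operators being benchmarked, not the proposed leaky max pooling layer.

```python
import torch
import torch.nn.functional as F

# a toy 1x1x4x4 feature map
fmap = torch.tensor([[[[0.0, 1.0, 2.0, 0.0],
                       [0.5, 4.0, 0.0, 1.0],
                       [1.0, 0.0, 3.0, 0.0],
                       [0.0, 2.0, 0.0, 0.5]]]])
print(F.max_pool2d(fmap, kernel_size=2))   # keeps the peak activation per 2x2 window
print(F.avg_pool2d(fmap, kernel_size=2))   # averages each 2x2 window
```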
|
8 |
Classification of brain tumors in weakly annotated histopathology images with deep learning. Hrabovszki, Dávid. January 2021 (has links)
Brain and nervous system tumors were responsible for around 250,000 deaths worldwide in 2020. Correctly identifying different tumors is very important, because treatment options largely depend on the diagnosis. This is an expert task, but recently machine learning, and especially deep learning models, have shown huge potential in tumor classification problems and can provide fast and reliable support for pathologists in the decision-making process. This thesis investigates the classification of two brain tumors, glioblastoma multiforme and lower grade glioma, in high-resolution H&E-stained histology images using deep learning. The dataset is publicly available from TCGA, and 220 whole slide images were used in this study. Ground truth labels were only available at the whole-slide level, but due to their large size, the slides could not be processed directly by convolutional neural networks. Therefore, patches were extracted from the whole slide images at two sizes and fed into separate networks for training. Preprocessing steps ensured that irrelevant information about the background was excluded and that the images were stain normalized. The patch-level predictions were then aggregated to the slide level, and classification performance was measured on a test set. Experiments were conducted on the usefulness of pre-trained CNN models and data augmentation techniques, and the best method was selected after statistical comparisons. Following the patch-level training, five slide aggregation approaches were studied and compared to build a whole-slide classifier model. The best performance was achieved when using small patches (336 x 336 pixels), a pre-trained CNN model without frozen layers, and mirroring data augmentation. The majority-voting slide aggregation method resulted in the best whole-slide classifier, with 91.7% test accuracy and 100% sensitivity. In many comparisons, however, statistical significance could not be shown because of the relatively small size of the test set.
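A minimal sketch of the majority-voting slide aggregation described above: each patch votes with its predicted class, and the slide takes the most frequent label. The class names and counts are toy examples, not results from the thesis.

```python
from collections import Counter

def slide_prediction(patch_predictions):
    """Aggregate patch-level class predictions to one slide-level label by majority vote."""
    return Counter(patch_predictions).most_common(1)[0][0]

# toy patch predictions for one whole slide image (glioblastoma vs. lower grade glioma)
patches = ["GBM", "GBM", "LGG", "GBM", "LGG", "GBM", "GBM"]
print(slide_prediction(patches))  # GBM
```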
|
9 |
A Study Of Localization And Latency Reduction For Action Recognition. Masood, Syed Zain. 01 January 2012 (has links)
The success of recognizing periodic actions in single-person, simple-background datasets, such as Weizmann and KTH, has created a need for more complex datasets to push the performance of action recognition systems. In this work, we create a new synthetic action dataset and use it to highlight weaknesses in current recognition systems. Experiments show that introducing background complexity to action video sequences causes a significant degradation in recognition performance. Moreover, this degradation cannot be fixed by fine-tuning system parameters or by selecting better feature points. Instead, we show that the problem lies in the spatio-temporal cuboid volume extracted from the interest point locations. Having identified the problem, we show how improved results can be achieved by simple modifications to the cuboids. The above method, however, requires near-perfect localization of the action within a video sequence. To achieve this objective, we present a two-stage weakly supervised probabilistic model for simultaneous localization and recognition of actions in videos. Different from previous approaches, our method is novel in that it (1) eliminates the need for manual annotations in the training procedure and (2) does not require any human detection or tracking in the classification stage. The first stage of our framework is a probabilistic action localization model which extracts the most promising sub-windows in a video sequence where an action can take place. We use a non-linear classifier in the second stage of our framework for the final classification task. We show the effectiveness of our proposed model on two well-known real-world datasets: the UCF Sports and UCF11 datasets. Another application of the weakly supervised probabilistic model proposed above is in the gaming environment. An important aspect of designing interactive, action-based interfaces is reliably recognizing actions with minimal latency. High latency causes the system's feedback to lag behind, significantly degrading the interactivity of the user experience. With slight modification to the weakly supervised probabilistic model we proposed for action localization, we show how it can be used to reduce latency when recognizing actions in Human Computer Interaction (HCI) environments. This latency-aware learning formulation trains a logistic regression-based classifier that automatically determines distinctive canonical poses from the data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel (publicly released) dataset for the purpose of our experiments. Comparisons of our method against both a Bag of Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks.
|
10 |
Human Action Localization And Recognition In Unconstrained Videos. Boyraz, Hakan. 01 January 2013 (has links)
As imaging systems become ubiquitous, the ability to recognize human actions is becoming increasingly important. Just as in the object detection and recognition literature, action recognition can be roughly divided into classification tasks, where the goal is to classify a video according to the action depicted in it, and detection tasks, where the goal is to detect and localize a human performing a particular action. A growing literature is demonstrating the benefits of localizing discriminative sub-regions of images and videos when performing recognition tasks. In this thesis, we address the action detection and recognition problems. Action detection in video is a particularly difficult problem because actions must not only be recognized correctly, but must also be localized in the 3D spatio-temporal volume. We introduce a technique that transforms the 3D localization problem into a series of 2D detection tasks. This is accomplished by dividing the video into overlapping segments, then representing each segment with a 2D video projection. The advantage of the 2D projection is that it makes it convenient to apply the best techniques from object detection to the action detection problem. We also introduce a novel, straightforward method for searching the 2D projections to localize actions, termed Two-Point Subwindow Search (TPSS). Finally, we show how to connect the local detections in time using a chaining algorithm to identify the entire extent of the action. Our experiments show that video projection outperforms the latest results on action detection in a direct comparison. Second, we present a probabilistic model that learns to identify discriminative regions in videos from weakly-supervised data, where each video clip is only assigned a label describing what action is present in the frame or clip. While our first system requires every action to be manually outlined in every frame of the video, this second system only requires that the video be given a single high-level tag. From this data, the system is able to identify discriminative regions that correspond well to the regions containing the actual actions. Our experiments on both the MSR Action Dataset II and the UCF Sports Dataset show that the localizations produced by this weakly supervised system are comparable in quality to localizations produced by systems that require each frame to be manually annotated. This system is able to detect actions in both 1) non-temporally segmented action videos and 2) recognition tasks where a single label is assigned to the clip. We also demonstrate the action recognition performance of our method on two complex datasets, i.e. HMDB and UCF101. Third, we extend our weakly-supervised framework by replacing the recognition stage with a two-stage neural network and applying dropout to prevent overfitting of the parameters on the training data. The dropout technique was recently introduced to prevent overfitting of the parameters in deep neural networks and has been applied successfully to the object recognition problem. To our knowledge, this is the first system to use dropout for the action recognition problem. We demonstrate that using dropout improves the action recognition accuracies on the HMDB and UCF101 datasets.
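As a hedged illustration of a dropout-regularized recognition stage, the sketch below builds a small fully connected head with a dropout layer between its two linear layers. The layer sizes and dropout rate are assumptions; only the 51-class output matches HMDB, and this is not the thesis's actual network.

```python
import torch
import torch.nn as nn

# toy two-layer recognition head with dropout (sizes and p are assumptions)
head = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training to reduce overfitting
    nn.Linear(256, 51),  # 51 action classes, as in the HMDB dataset
)
head.train()                      # dropout is active only in training mode
features = torch.randn(8, 512)    # stand-in for per-video features from the localization stage
logits = head(features)
print(logits.shape)               # torch.Size([8, 51])
```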
|