New Methods for Learning from Heterogeneous and Strategic Agents

Divya, Padmanabhan January 2017
In this doctoral thesis, we address several representative problems that arise when learning from multiple heterogeneous agents. These problems are relevant to many modern applications such as crowdsourcing and internet advertising. In scenarios such as crowdsourcing, a planner is interested in learning a task, and a set of noisy agents provides the training data. Any learning algorithm using the data provided by these noisy agents must account for their noise levels, which are unknown to the planner; this is a non-trivial difficulty. Further, the agents are heterogeneous, differing in their noise levels. A key challenge in such settings is to learn the noise levels of the agents while simultaneously learning the underlying model. Another challenge arises when the agents are strategic: when required to perform a task, they may be strategic about the effort they put in, and when required to report the costs incurred, they may not report truthfully. In general, the performance of a learning algorithm can be severely degraded if the information elicited from the agents is incorrect. We address these challenges in the following representative learning problems.

Multi-label Classification from Heterogeneous Noisy Agents

Multi-label classification is a well-known supervised machine learning problem in which each instance is associated with multiple classes. Since several labels can be assigned to a single instance, one of the key challenges is to learn the correlations between the classes. We first assume labels from a perfect source and propose a novel topic model called Multi-Label Presence-Absence Latent Dirichlet Allocation (ML-PA-LDA). In the current day, a natural source for procuring a training dataset is mining user-generated content or eliciting labels directly from users on a crowdsourcing platform. In this more practical crowdsourcing scenario, an additional challenge arises because the labels are provided by noisy, heterogeneous crowd-workers of unknown quality. With this as the motivation, we adapt our topic model to the setting where labels are provided by multiple noisy sources and refer to this model as ML-PA-LDA-MNS (ML-PA-LDA with Multiple Noisy Sources). Experiments on standard datasets show that the proposed models achieve superior performance over existing methods.

Active Linear Regression with Heterogeneous, Noisy and Strategic Agents

In this work, we study the problem of training a linear regression model by procuring labels from multiple noisy agents or crowd annotators under a budget constraint. We propose a Bayesian model for linear regression from multiple noisy sources and use variational inference for parameter estimation. Since every call to an agent incurs a cost, it is important to minimize the number of labels procured; towards this, we adopt an active learning approach. In this specific context, we prove the equivalence of well-studied active learning criteria such as entropy minimization and expected error reduction. For annotator selection in active learning, we observe a useful connection to the multi-armed bandit framework. Due to the nature of the reward distributions on the arms, we resort to the Robust Upper Confidence Bound (UCB) scheme with a truncated empirical mean estimator to solve the annotator selection problem, which yields provable guarantees on the regret. We then apply our model to the scenario where annotators are strategic and design suitable incentives to induce them to put in their best efforts.

Ranking with Heterogeneous Strategic Agents

We look at the problem where a planner must rank multiple strategic agents, a problem with many applications including sponsored search auctions (SSA). Stochastic multi-armed bandit (MAB) mechanisms have been used in the literature to solve this problem. Existing stochastic MAB mechanisms with a deterministic payment rule necessarily suffer a regret of Ω(T^(2/3)), where T is the number of time steps. This happens because these mechanisms address the worst case, in which the means of the agents' stochastic rewards are separated by a very small amount that depends on T. We instead take a detour and allow the planner to indicate the resolution Δ with which the agents must be distinguished, which immediately leads us to introduce the notion of Δ-regret. We propose a dominant strategy incentive compatible (DSIC) and individually rational (IR) deterministic MAB mechanism, Δ-UCB, based on ideas from the Upper Confidence Bound (UCB) family of MAB algorithms. The proposed mechanism achieves a Δ-regret of O(log T). We first establish the results for single-slot SSA and then non-trivially extend them to the case of multi-slot SSA.
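As a rough illustration of the annotator-selection step, the sketch below implements a Robust UCB-style rule with a truncated empirical mean, in the spirit of Bubeck et al.'s Robust UCB with the moment parameter fixed at ε = 1. The reward model, moment bound u, and the simulated annotator qualities are illustrative assumptions, not the thesis's exact formulation.

```python
import math
import random

def truncated_mean(samples, u, log_inv_delta):
    """Truncated empirical mean: drop samples whose magnitude exceeds
    a threshold that grows with the sample index (Robust UCB, eps = 1)."""
    total = 0.0
    for s, x in enumerate(samples, start=1):
        if abs(x) <= math.sqrt(u * s / log_inv_delta):
            total += x
    return total / len(samples)

def select_annotator(history, u, t):
    """Pick the annotator maximizing the Robust UCB index.
    history: one list of observed rewards per annotator."""
    log_inv_delta = 2.0 * math.log(max(t, 2))  # delta = t^-2
    best, best_index = 0, float("-inf")
    for i, samples in enumerate(history):
        if not samples:                        # query each annotator once first
            return i
        mu = truncated_mean(samples, u, log_inv_delta)
        index = mu + 4.0 * math.sqrt(u * log_inv_delta / len(samples))
        if index > best_index:
            best, best_index = i, index
    return best

# Illustrative simulation: 3 annotators with hidden qualities (assumed values).
random.seed(0)
qualities = [0.9, 0.6, 0.4]
history = [[] for _ in qualities]
for t in range(1, 501):
    i = select_annotator(history, u=1.0, t=t)
    history[i].append(random.gauss(qualities[i], 0.5))
print("pulls per annotator:", [len(h) for h in history])
```

Under these assumptions, the truncation makes the mean estimate robust to heavy-tailed rewards, and the best annotator accumulates most of the pulls while weaker ones are sampled only logarithmically often.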

Automatic Categorization of News Articles With Contextualized Language Models / Automatisk kategorisering av nyhetsartiklar med kontextualiserade språkmodeller

Borggren, Lukas January 2021
This thesis investigates how pre-trained contextualized language models can be adapted for multi-label text classification of Swedish news articles. Various classifiers are built on pre-trained BERT and ELECTRA models, exploring global and local classifier approaches. Furthermore, the effects of domain specialization, additional metadata features, and model compression are investigated. Several hundred thousand news articles are gathered to create unlabeled and labeled datasets for pre-training and fine-tuning, respectively. The findings show that a local classifier approach is superior to a global classifier approach and that BERT significantly outperforms ELECTRA. Notably, a baseline classifier built on SVMs yields competitive performance. The effect of further in-domain pre-training varies: ELECTRA's performance improves while BERT's is largely unaffected. Utilizing metadata features in combination with text representations improves performance. Both BERT and ELECTRA exhibit robustness to quantization and pruning, allowing model sizes to be cut in half without any performance loss.
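As a minimal sketch of the fine-tuning setup described above, the snippet below builds a multi-label classifier on a pre-trained Swedish BERT with Hugging Face Transformers. The checkpoint name, label count, and toy inputs are placeholders, not the thesis's exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_LABELS = 20  # hypothetical size of the news-category taxonomy

# "KB/bert-base-swedish-cased" is a public Swedish BERT checkpoint;
# the thesis's exact model and hyperparameters may differ.
tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "KB/bert-base-swedish-cased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # uses BCE-with-logits loss
)

texts = ["Regeringen presenterade i dag en ny budget."]  # toy article
batch = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

# Multi-hot target: an article may belong to several categories at once.
labels = torch.zeros(len(texts), NUM_LABELS)
labels[0, [3, 7]] = 1.0

outputs = model(**batch, labels=labels)
outputs.loss.backward()                 # one illustrative training step
probs = torch.sigmoid(outputs.logits)   # independent per-label probabilities
print(probs.shape)                      # (1, NUM_LABELS)
```

The key design choice is the sigmoid-per-label output rather than a softmax, which lets each article carry any subset of categories.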

Methods for data and user efficient annotation for multi-label topic classification / Effektiva annoteringsmetoder för klassificering med multipla klasser

Miszkurka, Agnieszka January 2022
Machine Learning models trained using supervised learning can achieve great results when a sufficient amount of labeled data is used. However, annotation is a costly and time-consuming process, and many methods have been devised to make the annotation pipeline more user- and data-efficient. This thesis explores techniques from the Active Learning, Zero-shot Learning, and Data Augmentation domains, as well as pre-annotation with revision, in the context of multi-label classification. Active Learning's goal is to choose the most informative samples for labeling; Contrastive Active Learning, a state-of-the-art Active Learning technique, was adapted to the multi-label case. Once some labeled data exists, samples can be augmented to make the dataset more diverse; English-German-English backtranslation was used to perform Data Augmentation. Zero-shot Learning is a setup in which a Machine Learning model can make predictions for classes it was not trained to predict; Zero-shot via Textual Entailment was leveraged in this study and its usefulness for pre-annotation with revision was assessed. The results on the Reviews of Electric Vehicle Charging Stations dataset show that it may be beneficial to use Active Learning and Data Augmentation in the annotation pipeline: Active Learning methods such as Contrastive Active Learning can identify samples belonging to the rarest classes, while Data Augmentation via backtranslation can improve performance, especially when little training data is available. The results of the Zero-shot Learning via Textual Entailment experiments show that this technique is not suitable for a production environment.
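As a rough illustration of zero-shot classification via textual entailment, the snippet below uses a public NLI model through the Hugging Face zero-shot pipeline. The model name, example review, and candidate labels are illustrative assumptions, not those used in the thesis.

```python
from transformers import pipeline

# An NLI model repurposed for zero-shot classification: each candidate
# label is turned into a hypothesis like "This example is about {label}."
# and scored by entailment. The model choice here is an assumption.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

review = "The charging station was broken and support never answered."
labels = ["reliability", "customer service", "pricing", "location"]

# multi_label=True scores each label independently, matching the
# multi-label setting: a review can raise several topics at once.
result = classifier(review, candidate_labels=labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```

In a pre-annotation-with-revision workflow, these scores would be surfaced as suggested labels for a human annotator to accept or correct.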

An evaluation of U-Net’s multi-label segmentation performance on PDF documents in a medical context / En utvärdering av U-Nets flerklassiga segmenteringsprestanda på PDF-dokument i ett medicinskt sammanhang

Sebek, Fredrik January 2021
The Portable Document Format (PDF) is an ideal format for viewing and printing documents, and today many companies store their documents as PDFs. However, converting a PDF document to any other structured format is inherently difficult. As a result, much of the information contained in a PDF document is not directly accessible, which is problematic. Manual intervention is required to accurately convert a PDF into another file format, which is both strenuous and exhaustive work; an automated solution could greatly improve access to information in many companies. A significant body of literature has investigated extracting information from PDF documents in a structured way, and in recent years these methods have become heavily dependent on computer vision. This thesis evaluates how the U-Net model handles multi-label segmentation of PDF documents in a medical context, extending Stahl et al.'s work from 2018, and compares two newer extensions of the U-Net model, MultiResUNet (2019) and SS-U-Net (2021). Additionally, it assesses how each of the models performs in a data-sparse environment. The three models were implemented, trained, and evaluated, with performance measured using the Dice coefficient, Jaccard coefficient, and percentage similarity; visual inspection was also used to analyze how the models performed from a perceptual standpoint. The results indicate that both U-Net and SS-U-Net are exceptional at segmenting PDF documents in a data-abundant environment, while SS-U-Net outperformed both U-Net and MultiResUNet in the data-sparse environment. MultiResUNet significantly underperformed in comparison to both U-Net and SS-U-Net in both environments. The impressive results achieved by U-Net and SS-U-Net suggest that they can be combined with a larger system enabling accurate, structured extraction of information from PDF documents.
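For reference, the sketch below computes the Dice and Jaccard coefficients used as evaluation metrics, per label on a multi-label segmentation mask. The mask shapes, label names, and smoothing constant are illustrative assumptions.

```python
import numpy as np

def dice_and_jaccard(pred, target, eps=1e-7):
    """Per-label Dice and Jaccard for one-hot masks of shape
    (num_labels, H, W). eps guards against empty labels."""
    axes = (1, 2)
    intersection = np.sum(pred * target, axis=axes)
    pred_area = np.sum(pred, axis=axes)
    target_area = np.sum(target, axis=axes)
    dice = (2 * intersection + eps) / (pred_area + target_area + eps)
    union = pred_area + target_area - intersection
    jaccard = (intersection + eps) / (union + eps)
    return dice, jaccard

# Toy 4x4 masks with 2 labels (e.g. "text" and "table" page regions).
pred = np.zeros((2, 4, 4)); target = np.zeros((2, 4, 4))
pred[0, :2, :] = 1; target[0, :3, :] = 1   # overlapping "text" regions
pred[1, 3, :] = 1;  target[1, 3, :] = 1    # identical "table" regions
dice, jacc = dice_and_jaccard(pred, target)
print("Dice:", dice.round(3), "Jaccard:", jacc.round(3))
```

Both metrics reward overlap between predicted and ground-truth regions; Dice weights the intersection twice, so it is always at least as large as Jaccard for the same masks.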

Visualization of live search / Visualisering av realtidssök

Nilsson, Olof January 2013
The classical search engine result page is used for many interactions with search results. While such pages are effective at communicating relevance, they do not present context well. By giving the user an overview in the form of a spatialized display, in a domain with a physical analog the user is familiar with, context should become pre-attentive and obvious. A prototype has been built that takes public medical information articles and assigns them to parts of the human body. The articles are indexed and made searchable, and a visualization presents the coverage of a query on the human body, allowing the user to interact with it to explore the results. Usage cases demonstrate the function and utility of the approach.

A Machine Learning Model of Perturb-Seq Data for use in Space Flight Gene Expression Profile Analysis

Liam Fitzpatric Johnson 27 April 2024
The genetic perturbations caused by spaceflight on biological systems tend to have a system-wide effect that is often difficult to deconvolute into individual signals with specific points of origin. Single-cell multi-omic data can provide a profile of the perturbational effects but does not necessarily indicate the initial point of interference within a network. The objective of this project is to take advantage of large-scale, genome-wide perturbational (Perturb-Seq) datasets by using them to pre-train a generalist machine learning model capable of predicting the effects of unseen perturbations in new data. Perturb-Seq datasets are large libraries of single-cell RNA sequencing data collected from CRISPR knock-out screens in cell culture. The advent of generative machine learning algorithms, particularly transformers, makes it an ideal time to re-assess large-scale data libraries in order to grasp cell- and even organism-wide genomic expression motifs. By tailoring an algorithm to learn the downstream effects of genetic perturbations, we present a pre-trained generalist model capable of predicting the effects of multiple perturbations in combination, locating points of origin for perturbations in new datasets, predicting the effects of known perturbations in new datasets, and annotating large-scale network motifs. We demonstrate the utility of this model by identifying key perturbational signatures in RNA sequencing data from spaceflown biological samples from the NASA Open Science Data Repository.
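To ground the data layout, the sketch below builds a toy Perturb-Seq-style AnnData matrix and computes a naive baseline of the kind generalist models are compared against: per-perturbation mean expression shifts relative to control, matched to a new expression profile by cosine similarity. The field name "perturbation", the synthetic counts, and the matching rule are all illustrative assumptions, not the thesis's transformer model.

```python
import numpy as np
import anndata as ad

# Toy Perturb-Seq-like matrix: 200 cells x 50 genes, each cell tagged
# with the CRISPR target it received ("control" for non-targeting guides).
rng = np.random.default_rng(0)
X = rng.poisson(5, size=(200, 50)).astype(float)
adata = ad.AnnData(X)
adata.obs["perturbation"] = rng.choice(
    ["control", "geneA_KO", "geneB_KO"], size=200)

# Baseline "effect profile": mean expression shift of each perturbation
# relative to the control pseudo-bulk. A pre-trained transformer would
# instead learn to predict such shifts for unseen perturbations.
control_mean = adata[adata.obs["perturbation"] == "control"].X.mean(axis=0)
effects = {}
for p in ["geneA_KO", "geneB_KO"]:
    sub = adata[adata.obs["perturbation"] == p]
    effects[p] = np.asarray(sub.X.mean(axis=0) - control_mean).ravel()

# Score a new (e.g. spaceflight) expression shift against each known
# perturbation signature to suggest candidate points of origin.
new_shift = effects["geneA_KO"] + rng.normal(0, 0.2, size=50)
for p, sig in effects.items():
    cos = sig @ new_shift / (np.linalg.norm(sig) * np.linalg.norm(new_shift))
    print(f"{p}: cosine similarity {cos:.2f}")
```

Signature matching of this kind is the simplest version of "locating points of origin": the perturbation whose known effect profile best explains an observed shift is the leading candidate.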
