  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Comparing Feature Extraction Methods and Effects of Pre-Processing Methods for Multi-Label Classification of Textual Data / Utvärdering av Metoder för Extraktion av Särdrag och Förbehandling av Data för Multi-Taggning av Textdata

Eklund, Martin, January 2018
This thesis investigates how different feature extraction methods applied to textual data affect the results of multi-label classification. Two Bag of Words extraction methods are used: the Count Vector and the TF-IDF approaches. A word embedding method, GloVe, is also investigated. Multi-label classification can be useful for categorizing items, such as pieces of music or news articles, that may belong to multiple classes or topics. The effect of different pre-processing methods is also investigated, namely the use of N-grams, stop-word elimination, and stemming (reducing inflected forms of a word to a common stem). Two classifiers, an SVM and an ANN, are used for multi-label classification with a Binary Relevance approach. The results indicate that the choice of extraction method has a meaningful impact on the resulting classifications, but that no single method consistently outperforms the others across all tests. Instead, the results show that the GloVe extraction method performs best on the recall metrics, while the Bag of Words methods perform best on the precision metrics.
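As a rough illustration of the techniques named in this abstract, the following is a minimal sketch in plain Python of Count Vector and TF-IDF extraction with optional stop-word elimination and N-grams, plus the Binary Relevance decomposition of multi-label targets. It is not the thesis's implementation: the stop-word list, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, the function names, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the thesis's list).
STOP_WORDS = {"the", "a", "of", "and", "is"}

def tokenize(text, ngram_max=1, remove_stop_words=True):
    """Lowercase, split on whitespace, optionally drop stop words, emit 1..n-grams."""
    words = [w for w in text.lower().split()
             if not (remove_stop_words and w in STOP_WORDS)]
    tokens = []
    for n in range(1, ngram_max + 1):
        tokens += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return tokens

def count_vectors(docs, **kw):
    """Bag of Words (Count Vector): one raw term-count mapping per document."""
    return [Counter(tokenize(d, **kw)) for d in docs]

def tfidf_vectors(docs, **kw):
    """TF-IDF: term count scaled by log(N / document frequency), no smoothing."""
    counts = count_vectors(docs, **kw)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document.
    df = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
            for c in counts]

def binary_relevance_targets(label_sets):
    """Binary Relevance: one independent 0/1 target vector per label."""
    labels = sorted(set().union(*label_sets))
    return {lab: [int(lab in s) for s in label_sets] for lab in labels}

# Hypothetical toy data: three documents, two possible labels.
docs = ["the band plays jazz", "news of the jazz festival", "election news"]
labels = [{"music"}, {"music", "news"}, {"news"}]

counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
targets = binary_relevance_targets(labels)
```

Under Binary Relevance, each entry of `targets` would be used to train one separate binary classifier (an SVM or ANN in the thesis) on the same feature vectors; a document's predicted label set is the union of the labels whose classifier fires.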
2

Topical Classification of Images in Wikipedia : Development of topical classification models followed by a study of the visual content of Wikipedia / Ämneklassificering av bilder i Wikipedia : Utveckling av ämneklassificeringsmodeller följd av studier av Wikipedias bilddata

Vieira Bernat, Matheus, January 2023
With over 53 million articles and 11 million images, Wikipedia is the greatest encyclopedia in history. The number of users is equally significant, with daily views surpassing 1 billion. Such an enormous system needs task automation to remain maintainable by its volunteers. For textual data, a machine learning system called ORES automates tasks such as article quality estimation and article topic routing. A visual counterpart also needs to be developed, both to support tasks such as vandalism detection in images and to better understand the visual data of Wikipedia. Researchers from the Wikimedia Foundation identified a hindrance to implementing the visual counterpart of ORES: the images of Wikipedia lack topical metadata. This work therefore aims to develop a deep learning model that classifies images into a set of topics, which were pre-determined in parallel work. State-of-the-art image classification models are used, along with methods to mitigate the existing class imbalance. The experiments show, among other things, that: using data that takes the label hierarchy into account performs better; resampling techniques are ineffective at mitigating imbalance because of the high label co-occurrence; sample-weighting improves metrics; and initializing parameters from ImageNet pre-training rather than randomly yields better metrics. Moreover, we find interesting outlier labels that, despite having fewer samples, obtain better performance metrics, which is believed to be due either to bias from pre-training or simply to more signal in the label. The distribution of the visual data as predicted by the models is displayed. Finally, some qualitative examples of the model's predictions on selected images are presented, demonstrating the ability of the model to find correct labels that are missing in the ground truth.
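The abstract does not say how the label hierarchy or the sample weights were computed, so the following is only a sketch of one common way to do each, in plain Python. The inverse-frequency formula, the choice to average a sample's label weights, the function names, and the toy labels and hierarchy are all assumptions for illustration.

```python
from collections import Counter

def label_weights(label_sets):
    """Inverse-frequency weight per label: n_samples / (n_labels * label_count)."""
    counts = Counter(lab for s in label_sets for lab in s)
    n, k = len(label_sets), len(counts)
    return {lab: n / (k * c) for lab, c in counts.items()}

def sample_weights(label_sets, weights):
    """Weight each sample by the mean weight of its labels (1.0 if unlabeled)."""
    return [sum(weights[lab] for lab in s) / len(s) if s else 1.0
            for s in label_sets]

def add_ancestors(labels, parent):
    """Expand a label set with every ancestor found in a child -> parent mapping."""
    out = set(labels)
    for lab in labels:
        while lab in parent:
            lab = parent[lab]
            out.add(lab)
    return out

# Hypothetical imbalanced data: "nature" is frequent, "sports" is rare.
image_labels = [{"nature"}, {"nature"}, {"nature", "sports"}, {"nature"}]
weights = label_weights(image_labels)
per_sample = sample_weights(image_labels, weights)

# Hypothetical two-level hierarchy: football -> sports -> culture.
parent = {"football": "sports", "sports": "culture"}
expanded = add_ancestors({"football"}, parent)
```

The toy data also hints at why resampling struggles when labels co-occur: the only image carrying the rare label also carries the frequent one, so oversampling it would add another instance of the majority label as well. Weighting the loss per sample avoids duplicating data.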
