1.
Paper Categorization Using Naive Bayes. Cui, Man. 29 April 2013.
Literature survey is a time-consuming process as researchers spend a lot of time in
searching the papers of interest. While search engines can be useful in finding papers
that contain a certain set of keywords, one still has to go through these papers in order
to decide whether they are of interest. On the other hand, one can quickly decide
which papers are of interest if each one of them is labelled with a category. The process
of labelling each paper with a category is termed paper categorization, an instance of
a more general problem called text classification. In this thesis, we present a text
classifier called Iris that makes use of the popular Naive Bayes algorithm. With Iris,
we were able to (1) evaluate Naive Bayes using a number of popular datasets, (2)
propose a GUI for assisting users with document categorization and searching, and
(3) demonstrate how the GUI can be utilized for paper categorization and searching.
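As an illustration of the underlying technique (not the Iris system itself), a minimal Naive Bayes paper categorizer can be sketched with scikit-learn; the titles, labels, and test query below are invented placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

titles = [
    "support vector machines for image recognition",
    "bayesian inference in graphical models",
    "routing protocols for wireless sensor networks",
    "congestion control in transport protocols",
]
categories = ["machine learning", "machine learning", "networking", "networking"]

# Bag-of-words counts feed the multinomial Naive Bayes model.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(titles, categories)

print(classifier.predict(["bayesian models for classification"]))
```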
2.
The Use of Distributional Semantics in Text Classification Models: Comparative performance analysis of popular word embeddings. Norlund, Tobias. January 2016.
In the field of Natural Language Processing, supervised machine learning is commonly used to solve classification tasks such as sentiment analysis and text categorization. The classical way of representing the text has been to use the well-known Bag-Of-Words representation. Lately, however, low-dimensional dense word vectors have come to dominate the input to state-of-the-art models. Few studies have made a fair comparison of the models' sensitivity to the text representation, and this thesis tries to fill that gap. We especially seek insight into the impact various unsupervised pre-trained vectors have on performance. In addition, we take a closer look at the Random Indexing representation and try to optimize it jointly with the classification task. The results show that while low-dimensional pre-trained representations often have computational benefits and have also reported state-of-the-art performance, they do not necessarily outperform the classical representations in all cases.
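As a minimal sketch of the comparison in question, the same classifier can be fed either representation; random vectors stand in here for the pre-trained embeddings the thesis evaluates, and the toy texts and labels are invented:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie loved the acting", "terrible plot and bad acting",
         "loved it great fun", "bad movie terrible pacing"]
labels = [1, 0, 1, 0]

# Representation 1: high-dimensional, sparse Bag-Of-Words counts.
bow = CountVectorizer().fit_transform(texts)
print("BoW accuracy:", LogisticRegression().fit(bow, labels).score(bow, labels))

# Representation 2: low-dimensional dense vectors; each text is the mean of
# its word vectors (random stand-ins for Word2Vec/GloVe-style embeddings).
rng = np.random.default_rng(0)
vocab = {w for t in texts for w in t.split()}
embedding = {w: rng.normal(size=50) for w in vocab}
dense = np.array([np.mean([embedding[w] for w in t.split()], axis=0)
                  for t in texts])
print("Dense accuracy:",
      LogisticRegression().fit(dense, labels).score(dense, labels))
```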
3.
Cross-lingual genre classification. Petrenz, Philipp. January 2014.
Automated classification of texts into genres can benefit NLP applications, in that the structure, location and even interpretation of information within a text are dictated by its genre. Cross-lingual methods promise such benefits to languages which lack genre-annotated training data. While there has been work on genre classification for over two decades, none of it had considered cross-lingual methods before the start of this project. My research aims to fill this gap. It follows previous approaches to monolingual genre classification that exploit simple, low-level text features, many of which can be extracted in different languages and serve similar functions. This contrasts with work on cross-lingual topic or sentiment classification of texts, which typically uses word frequencies as features. These have been shown to be of limited use when it comes to genres. Many such methods also assume cross-lingual resources, such as machine translation, which limits the range of their application. A selection of these approaches is used as baselines in my experiments. I report the results of two semi-supervised methods for exploiting genre-labelled source language texts and unlabelled target language texts. The first is a relatively simple algorithm that bridges the language gap by exploiting cross-lingual features and then iteratively re-trains a classification model on previously predicted target texts. My results show that this approach works well where only a few cross-lingual resources are available and texts are to be classified into broad genre categories. It is also shown that further improvements can be achieved through multi-lingual training or cross-lingual feature selection if genre-annotated texts are available in several source languages. The second is a variant of the label propagation algorithm. This graph-based classifier learns genre-specific feature set weights from both source and target language texts and uses them to adjust the propagation channels for each text. This allows further feature sets to be added as additional resources, such as part-of-speech taggers, become available. While the method performs well even with basic text features, it is shown to benefit from additional feature sets. Results also indicate that it handles fine-grained genre classes better than the iterative re-labelling method.
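To make the first method concrete, here is a schematic sketch of the iterative re-labelling loop; the surface features and the 0.8 confidence cut-off are illustrative assumptions, not the thesis's exact choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(texts):
    # Low-level features extractable in any language: token count plus
    # digit and punctuation ratios (a stand-in for the thesis's feature set).
    rows = []
    for t in texts:
        n = max(len(t), 1)
        rows.append([len(t.split()),
                     sum(c.isdigit() for c in t) / n,
                     sum(c in ".,;:!?" for c in t) / n])
    return np.array(rows)

def iterative_relabel(src_texts, src_labels, tgt_texts, rounds=5):
    X_src, y_src = featurize(src_texts), np.array(src_labels)
    X_tgt = featurize(tgt_texts)
    clf = LogisticRegression().fit(X_src, y_src)
    for _ in range(rounds):
        proba = clf.predict_proba(X_tgt)
        keep = proba.max(axis=1) > 0.8        # confident target predictions
        if not keep.any():
            break
        X = np.vstack([X_src, X_tgt[keep]])
        y = np.concatenate([y_src, clf.predict(X_tgt)[keep]])
        clf = LogisticRegression().fit(X, y)  # re-train on source + targets
    return clf
```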
4.
Hierarchical Classification of Web Documents Based on Anchor Texts and Hyperlinks. Suzuki, Yusuke; Matsubara, Shigeki; Yoshikawa, Masatoshi. 06 1900.
No description available.
5.
Social Fairness in Semi-Supervised Toxicity Text Classification. Shayesteh, Shahriar. 11 July 2023.
The rapid growth of user-generated text on social media platforms has made manually moderating toxic language an increasingly challenging task. Consequently, researchers have turned to artificial intelligence (AI) and machine learning (ML) models to detect and classify toxic comments automatically. However, these models often exhibit unintended bias against comments containing sensitive terms related to demographic groups, such as race and gender, which leads to unfair classification of samples. In addition, most existing research on this topic focuses on fully supervised learning frameworks. There is therefore a growing need to explore fairness in semi-supervised toxicity detection, given the difficulty of annotating large amounts of data. In this thesis, we aim to address this gap by developing a fair generative-based semi-supervised framework for mitigating social bias in toxicity text classification. The framework has two parts: first, we train a semi-supervised generative-based text classification model on benchmark toxicity datasets; second, we mitigate social bias in the trained classifier using adversarial debiasing to improve fairness. We use two different semi-supervised generative-based text classification models, NDAGAN and GANBERT (the former adds negative data augmentation to address some of the problems in GANBERT), to propose two fair semi-supervised models called FairNDAGAN and FairGANBERT. Finally, we compare the proposed fair semi-supervised models against baselines in terms of accuracy and fairness (equalized odds difference), clarifying for the first time the challenges of social fairness in semi-supervised toxicity text classification.

Based on the experimental results, the key contributions of this research are as follows. First, we propose a novel fair semi-supervised generative-based framework for toxicity text classification. Second, we show that fairness can be achieved in semi-supervised toxicity text classification without considerable loss of accuracy. Third, we demonstrate that achieving fairness at the coarse-grained level improves fairness at the fine-grained level but does not always guarantee it. Fourth, we analyze the impact of the labeled and unlabeled data on fairness and accuracy in the studied semi-supervised framework. Finally, we demonstrate the susceptibility of both supervised and semi-supervised models to data imbalance in terms of accuracy and fairness.
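As a concrete reference for the fairness measure used above, equalized odds difference can be sketched as the larger of the true-positive-rate and false-positive-rate gaps between two demographic groups; the toy arrays below are illustrative placeholders, not the thesis's data:

```python
import numpy as np

def equalized_odds_difference(y_true, y_pred, group):
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def rates(g):
        t, p = y_true[group == g], y_pred[group == g]
        tpr = (p[t == 1] == 1).mean() if (t == 1).any() else 0.0
        fpr = (p[t == 0] == 1).mean() if (t == 0).any() else 0.0
        return tpr, fpr
    (tpr_a, fpr_a), (tpr_b, fpr_b) = rates(0), rates(1)
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

# Toy example: true and predicted toxicity for comments from two groups.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
group  = [0, 0, 0, 1, 1, 1]
print(equalized_odds_difference(y_true, y_pred, group))  # 0.5
```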
6.
A Large Collection Learning Optimizer Framework. Chakravarty, Saurabh. 30 June 2017.
Content is generated on the web at an increasing rate. The type of content varies from text on a traditional webpage to text on social media portals (e.g., social network sites and microblogs). One such example of social media is the microblogging site Twitter. Twitter is known for its high level of activity during live events, natural disasters, and events of global importance. Challenges with the data in the Twitter universe include the limit of 140 characters on the text length. Because of this limitation, the vocabulary in the Twitter universe includes short abbreviations of sentences, emojis, hashtags, and other non-standard usage. Consequently, traditional text classification techniques are not very effective on tweets. Fortunately, text processing techniques like cleaning, lemmatizing, and removal of stop words and special characters give us clean text, which can be further processed to derive richer semantic and syntactic word relationships using state-of-the-art feature learning techniques like Word2Vec. Machine learning techniques using word features that capture semantic and context relationships can benefit classification accuracy.
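The cleaning steps mentioned above might look as follows; NLTK's stop-word list and lemmatizer are one plausible choice rather than the thesis's exact pipeline, and the example tweet is invented:

```python
# Requires: nltk.download("stopwords"); nltk.download("wordnet")
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)    # strip URLs
    text = re.sub(r"[@#](\w+)", r"\1", text)     # unwrap mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)        # drop special characters
    tokens = [lemmatizer.lemmatize(w) for w in text.split()
              if w not in stop_words]
    return " ".join(tokens)

print(clean_tweet("Flooding in #Houston, stay safe! https://t.co/xyz @user"))
# -> "flooding houston stay safe user"
```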
Improving text classification results on Twitter data would pave the way to categorizing tweets relative to human-defined real-world events. This would allow diverse stakeholder communities to interactively collect, organize, browse, visualize, analyze, summarize, and explore content and sources related to crises, disasters, human rights, inequality, population growth, resiliency, shootings, sustainability, violence, etc. Having the events classified into different categories would help us study causality and correlations among real-world events.
To check the efficacy of our classifier, we compare our experimental results with an Association Rules (AR) classifier. This classifier composes its rules around the most discriminating words in the training data. The hierarchy of rules, along with the ability to tune a support threshold, makes it an effective classifier for scenarios involving short text.
Traditionally, developing classification systems for these purposes requires a great degree of human intervention. Constantly monitoring new events and curating training and validation sets is tedious and time intensive. Significant human capital is required for such annotation endeavors. Considerable effort is also required to tune the classifier for best performance. Developing and tuning classifiers manually would not be a viable option if we are to monitor events and trends in real time. We want to build a framework that requires very little human intervention to build and choose the best among the available classification techniques in our system.
Another challenge with classification systems is their performance on unseen data. When classifying tweets, we continually face situations where a given event is closely tied to certain keywords. A classifier built for a particular event may overfit to what is a biased sample with limited generality; faced with new tweets containing different keywords, its accuracy may drop. We propose building a system that uses very little training data in the initial iteration and is augmented with automatically labelled training data from a collection that stores all the incoming tweets. A system trained on incoming tweets that are labelled using techniques based on rich word vector representations would perform better than a system trained only on the initial set of tweets.
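The proposed augmentation step could be sketched as follows: score each incoming tweet by cosine similarity to the centroid of the known-relevant tweets and auto-label confident matches. Random vectors stand in for trained Word2Vec embeddings, and the 0.7 threshold is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
_vectors = {}

def word_vec(word, dim=100):
    # Stand-in for a trained Word2Vec lookup: a fixed random vector per word.
    if word not in _vectors:
        _vectors[word] = rng.normal(size=dim)
    return _vectors[word]

def tweet_vec(tweet):
    return np.mean([word_vec(w) for w in tweet.split()], axis=0)

def auto_label(relevant_tweets, incoming_tweets, threshold=0.7):
    centroid = np.mean([tweet_vec(t) for t in relevant_tweets], axis=0)
    labelled = []
    for tweet in incoming_tweets:
        v = tweet_vec(tweet)
        sim = v @ centroid / (np.linalg.norm(v) * np.linalg.norm(centroid))
        if sim > threshold:
            labelled.append(tweet)    # added to the training data as relevant
    return labelled
```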
We also propose to use deep learning techniques like Convolutional Neural Networks (CNNs) that can capture combinations of words using an n-gram feature representation. Such a feature representation can account for instances where words occur together.
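A minimal sketch of that idea in Keras, where 1-D convolutions over word embeddings act as learned n-gram detectors (the kernel size playing the role of n); all hyperparameters here are illustrative assumptions:

```python
import tensorflow as tf

vocab_size, seq_len = 20000, 50
model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, 100),
    tf.keras.layers.Conv1D(128, 3, activation="relu"),   # tri-gram detectors
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),      # relevant vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```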
We divide our case studies into two phases: preliminary and final. The preliminary case studies focus on selecting the better feature representation and classification methodology out of the AR and the Word2Vec-based Logistic Regression techniques. The final case studies focus on developing the augmented semi-supervised training methodology and the large collection learning optimizer framework that generates a highly performant classifier.
For our preliminary case studies, we are able to achieve an F1 score of 0.96 using Word2Vec and Logistic Regression. The AR classifier achieved an F1 score of 0.90 on the same data.
For our final case studies, we are able to show improvements in F1 score from 0.58 to 0.94 in certain cases based on our augmented training methodology. Overall, we see improvement from using the augmented training methodology on all datasets.

Content is generated on social media at a very fast pace. Social media content in the form of tweets generated by the microblogging site Twitter is quite popular for understanding the events and trends prevalent at a given point in time across various geographies. Categorizing these tweets into their real-world event categories would be useful for researchers, students, academics and the government. Categorizing tweets into their real-world categories is a challenging task. Our framework involves building a classification system that can learn how to categorize tweets for a given category if it is provided with a few samples of relevant and non-relevant tweets. The system retrieves additional tweets from an auxiliary data source to further learn what is relevant and irrelevant based on how similar a tweet is to a positive example. Categorizing tweets in an automated way would be useful in analyzing and studying past and future real-world events and trends.
7.
Knowledge-enhanced text classification: descriptive modelling and new approaches. Martinez-Alvarez, Miguel. January 2014.
The knowledge available to be exploited by text classification and information retrieval systems has significantly changed, both in nature and quantity, in recent years. Nowadays, there are several sources of information that can potentially improve the classification process, and systems should be able to adapt to incorporate multiple sources of available data in different formats. This fact is especially important in environments where the required information changes rapidly and its utility may be contingent on timely implementation. For these reasons, the importance of adaptability and flexibility in information systems is rapidly growing. Current systems are usually developed for specific scenarios. As a result, significant engineering effort is needed to adapt them when new knowledge appears or the information needs change. This research investigates the usage of knowledge within text classification from two different perspectives. On the one hand, it applies descriptive approaches to the seamless modelling of text classification, focusing on knowledge integration and complex data representation. The main goal is to achieve a scalable and efficient approach to rapid prototyping for text classification that can incorporate different sources and types of knowledge, and to minimise the gap between the mathematical definition and the modelling of a solution. On the other hand, it improves different steps of the classification process where knowledge exploitation has traditionally not been applied. In particular, this thesis introduces two classification sub-tasks, namely Semi-Automatic Text Classification (SATC) and Document Performance Prediction (DPP), and several methods to address them. SATC focuses on selecting the documents that are most likely to be wrongly assigned by the system, so that they can be manually classified while the rest are automatically labelled. Document Performance Prediction estimates the classification quality that will be achieved for a document, given a classifier. In addition, we also propose a family of evaluation metrics to measure degrees of misclassification, and an adaptive variation of k-NN.
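As an illustration of the SATC idea (not the thesis's actual method), a confidence-threshold sketch in scikit-learn: confident predictions are accepted automatically and the rest are routed to human annotators. The 0.8 cut-off and the toy data are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["stocks fell sharply today", "the team won the cup final",
              "the market rally continues", "the striker scored twice"]
train_labels = ["finance", "sport", "finance", "sport"]

vectorizer = TfidfVectorizer().fit(train_docs)
clf = LogisticRegression().fit(vectorizer.transform(train_docs), train_labels)

new_docs = ["shares dropped at the open", "an ambiguous headline"]
for doc, proba in zip(new_docs,
                      clf.predict_proba(vectorizer.transform(new_docs))):
    if proba.max() >= 0.8:                  # confident: label automatically
        print(doc, "->", clf.classes_[proba.argmax()])
    else:                                   # uncertain: send to a human
        print(doc, "-> manual classification")
```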
8.
Using machine learning to classify news articles. Lagerkrants, Eleonor; Holmström, Jesper. January 2016.
In today's society a large portion of the world's population gets their news on electronic devices. This opens up the possibility to enhance their reading experience by personalizing news for readers based on their previous preferences. We have conducted an experiment to find out how accurately a Naïve Bayes classifier can select articles that a user might find interesting. Our experiments were done on two users who read and classified 200 articles as interesting or not interesting. Those articles were divided into four datasets with the sizes 50, 100, 150 and 200. We used a Naïve Bayes classifier with 16 different settings configurations to classify the articles into two categories. From these experiments we found several settings configurations that showed good results, and one was chosen as a good general setting for this kind of problem. We found that for datasets larger than 50 articles there was no significant increase in classification confidence.
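The dataset-size experiment can be sketched as follows with synthetic stand-ins for the rated articles; the data and any plateau it shows are illustrative, not the thesis's results:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
words = {1: "economy budget tax policy reform".split(),
         0: "celebrity gossip fashion style party".split()}
labels = rng.integers(0, 2, size=200)                 # interesting or not
articles = [" ".join(rng.choice(words[l], size=8)) for l in labels]

X = CountVectorizer().fit_transform(articles)
for n in (50, 100, 150, 200):                         # the four dataset sizes
    score = cross_val_score(MultinomialNB(), X[:n], labels[:n], cv=5).mean()
    print(n, "articles:", round(score, 3))
```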
9.
Automatic Document Classification Applied to Swedish News. Blein, Florent. January 2005.
The first part of this paper briefly presents the ELIN[1] system, an electronic newspaper project. ELIN is a framework that stores news and displays them to the end user. The news items are formatted in XML[2]. The project partner Corren[3] provided ELIN with XML articles; however, the format used was not the same. My first task was to develop software that converts the news from one XML format (Corren) to another (ELIN).

The second and main part addresses the problem of automatic document classification and tries to find a solution for a specific issue. The goal is to automatically classify news articles from a Swedish newspaper company (Corren) into the IPTC[4] news categories.

This work has been carried out by implementing several classification algorithms, testing them and comparing their accuracy with existing software. The training and test documents were 3 weeks of the Corren newspaper that had to be classified into 2 categories.

The last tests were run with only one algorithm (Naïve Bayes) over a larger amount of data (7, then 10 weeks) and more categories (12) to simulate a more realistic environment.

The results show that the Naïve Bayes algorithm, although the oldest, was the most accurate in this particular case. An issue raised by the results is that feature selection improves speed but can reduce accuracy if too many features are removed.
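The feature-selection trade-off noted in the conclusion can be sketched with chi-squared selection: a smaller vocabulary trains faster, but an overly aggressive cut can discard informative words. The toy data and k values are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["match goal score team", "election vote parliament minister",
        "goal team win league", "vote minister parliament law"] * 10
labels = ["sport", "politics", "sport", "politics"] * 10

for k in (2, 5, "all"):                   # number of word features kept
    model = make_pipeline(CountVectorizer(),
                          SelectKBest(chi2, k=k),
                          MultinomialNB())
    print(k, "features:", model.fit(docs, labels).score(docs, labels))
```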
10.
Improving Multiclass Text Classification with the Support Vector Machine. Rennie, Jason D. M.; Rifkin, Ryan. 16 October 2001.
We compare Naive Bayes and Support Vector Machines on the task of multiclass text classification. Using a variety of approaches to combine the underlying binary classifiers, we find that SVMs substantially outperform Naive Bayes. We present full multiclass results on two well-known text data sets, including the lowest error to date on both data sets. We develop a new indicator of binary performance to show that the SVM's lower multiclass error is a result of its improved binary performance. Furthermore, we demonstrate and explore the surprising result that one-vs-all classification performs favorably compared to other approaches even though it has no error-correcting properties.
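A minimal sketch of the one-vs-all scheme the paper examines: one binary SVM per class, with the highest decision score winning. The data here is an invented placeholder, not the paper's benchmark sets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["interest rates rose again", "the striker scored a late goal",
        "new processor benchmarks leaked", "bond yields fell",
        "midfielder signs a new contract", "gpu prices keep climbing"]
labels = ["finance", "sport", "tech", "finance", "sport", "tech"]

# One binary SVM is trained per class; prediction takes the argmax of the
# per-class decision scores.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(docs, labels)
print(clf.predict(["quarterly earnings beat forecasts"]))
```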