11 |
Using sentence-level classification to predict sentiment at the document-level
Hutton, Amanda Rachel, 21 August 2012
This report explores various aspects of sentiment mining. It had two research goals: (1) to determine which methods are useful for increasing recall of negative sentences and (2) to determine the best method for applying sentence-level classification at the document level. The methods were applied to the Movie Reviews corpus at both the document and the sentence level. The basic approach was first to identify polar and neutral sentences within the text and then to classify the polar sentences as either positive or negative. A Maximum Entropy classifier served as the baseline system on which further methods were explored. Part-of-speech tagging was tested to determine whether its inclusion increased recall of negative sentences; it was also used to aid the handling of negations at the sentence level. Smoothing was investigated, and various metrics describing the sentiment composition were explored to address goal (2). Negative recall was shown to increase with adjustment of the classification threshold and was also seen to increase through the methods used to address goal (2). Overall, classifying at the sentence level using bigrams and a cutoff value of one was observed to result in the highest evaluation scores.
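The threshold adjustment and the step from sentence level to document level described in this abstract can be illustrated with a minimal sketch. It assumes that some sentence classifier has already produced a probability of "negative" for each polar sentence and that the document label is a simple majority vote; the function names, the vote rule, and the tie-breaking are illustrative assumptions, not the method used in the report.

```python
from typing import List, Optional


def label_sentence(p_negative: float, threshold: float = 0.5) -> str:
    """Label one polar sentence; a lower threshold labels more sentences 'neg'."""
    return "neg" if p_negative >= threshold else "pos"


def label_document(p_negatives: List[float], threshold: float = 0.5) -> Optional[str]:
    """Majority vote over polar sentences (assumed aggregation rule).

    Returns None when the document has no polar sentences.
    Ties go to 'pos', which is an arbitrary choice for this sketch.
    """
    if not p_negatives:
        return None
    labels = [label_sentence(p, threshold) for p in p_negatives]
    n_neg = labels.count("neg")
    return "neg" if n_neg > len(labels) - n_neg else "pos"


# Hypothetical P(negative) scores for the four polar sentences of one review.
scores = [0.45, 0.40, 0.70, 0.20]
print(label_document(scores, threshold=0.5))   # pos
print(label_document(scores, threshold=0.35))  # neg
```

Lowering the threshold from 0.5 to 0.35 flips two borderline sentences to negative, which is the mechanism by which a threshold change can raise negative recall, usually at some cost in precision.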
|
12 |
State-of-Mind Classification From Unstructured Texts Using Statistical Features and Lexical Network Features
Bayram, Ulya, 01 October 2019
No description available.
|
13 |
Information Retrieval for Call Center Quality Assurance
McMurtry, William F., 02 October 2020
No description available.
|
14 |
Information and Representation Tradeoffs in Document Classification
Jin, Timothy, 23 May 2022
No description available.
|
15 |
Deep Active Learning for Short-Text Classification / Aktiv inlärning i djupa nätverk för klassificering av korta texter
Zhao, Wenquan, January 2017
In this paper, we propose a novel active learning algorithm for short-text (Chinese) classification applied to a deep learning architecture. The topic thus lies at the intersection of active learning and deep learning. One of the bottlenecks of deep learning for classification is that it relies on a large number of labeled samples, which are expensive and time-consuming to obtain. Active learning aims to overcome this disadvantage by asking the most useful queries, in the form of unlabeled samples to be labeled. In other words, active learning aims to achieve high classification accuracy using as few labeled samples as possible. Such ideas have been investigated in conventional machine learning algorithms, such as support vector machines (SVM) for image classification, and in deep neural networks, including convolutional neural networks (CNN) and deep belief networks (DBN) for image classification. Yet research on combining active learning with recurrent neural networks (RNNs) for short-text classification is rare. We demonstrate results for short-text classification on datasets from Zhuiyi Inc. Importantly, aiming at better classification accuracy with less computational overhead, the proposed algorithm substantially reduces the number of labeled training samples needed compared with random sampling. Moreover, the proposed algorithm is slightly better than the conventional sampling method, uncertainty sampling. The proposed active learning algorithm dramatically decreases the number of labeled samples without significantly affecting the test classification accuracy of the original RNN classifier trained on the whole data set. In some cases, the proposed algorithm even achieves better classification accuracy than the original RNN classifier.
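For orientation, the conventional baseline named in this abstract, uncertainty sampling, can be sketched in a few lines. The sketch uses the least-confidence criterion and assumes the current model has already produced class probabilities for each unlabeled sample; the function names and the toy probabilities are illustrative assumptions, and this is not the algorithm proposed in the thesis.

```python
from typing import List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty of one prediction: 1 minus the top class probability."""
    return 1.0 - max(probs)


def select_queries(pool_probs: List[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k most uncertain samples in the unlabeled pool."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: least_confidence(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical class probabilities for four unlabeled short texts.
pool = [[0.90, 0.10], [0.55, 0.45], [0.60, 0.40], [0.99, 0.01]]
print(select_queries(pool, k=2))  # [1, 2]
```

In an active learning loop, the selected samples would be sent to an annotator, added to the labeled set, and the classifier retrained before the next round of selection.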
|
16 |
Utterances classifier for chatbots’ intents
Joigneau, Axel, January 2018
Chatbots are the next big improvement in the era of conversational services. A chatbot is a virtual person who can carry out a conversation with a human about a certain subject, using interactive textual skills. Many cloud-based chatbot services are currently being developed and improved, such as IBM Watson, well known for winning the quiz show “Jeopardy!” in 2011. Chatbots are based on a large amount of structured data. They contain many examples of questions, each associated with a specific intent that represents what the user wants to say. These associations are currently made by hand, and this project focuses on improving this data structuring using both supervised and unsupervised algorithms. A supervised reclassification using an improved Barycenter method reached 85% precision and 75% recall on a data set containing 2005 questions. Questions that did not match any intent were then clustered in an unsupervised way using a K-means algorithm that reached a purity of 0.5 for the optimal K chosen.
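The purity figure quoted in this abstract can be made concrete with a short sketch of the standard cluster-purity computation: each cluster is credited with its most frequent reference label, and the credited counts are divided by the total number of items. The function name and the toy intents are illustrative assumptions; the thesis may compute purity with different tooling.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], true_labels: Sequence[Hashable]) -> float:
    """Fraction of items that carry the majority label of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty clustering")
    by_cluster: Dict[Hashable, Counter] = {}
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Six hypothetical questions, two clusters, two reference intents.
clusters = [0, 0, 0, 1, 1, 1]
intents = ["pay", "pay", "help", "help", "help", "pay"]
print(round(purity(clusters, intents), 3))  # 0.667
```

A purity of 0.5, as reported, therefore means that half of the clustered questions share the dominant intent of their cluster. Purity tends to rise as K grows, so it is usually read together with the chosen K.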
|
17 |
Document Classification using Characteristic Signatures
Mondal, Abhro Jyoti, January 2017
No description available.
|
18 |
Short Text Classification in Twitter to Improve Information Filtering
Sriram, Bharath, 03 September 2010
No description available.
|
19 |
Improving Text Classification Using Graph-based Methods
Karajeh, Ola Abdel-Raheem Mohammed, 05 June 2024
Text classification is a fundamental natural language processing task. However, in real-world applications, class distributions are usually skewed, e.g., due to inherent class imbalance. In addition, the task difficulty changes based on the underlying language. When rich morphological structure and high ambiguity are exhibited, natural language understanding can become challenging. For example, Arabic, ranked the fifth most widely used language, has a rich morphological structure and high ambiguity that result from Arabic orthography. Thus, Arabic natural language processing is challenging. Several studies employ Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), but Graph Convolutional Networks (GCNs) have not yet been investigated for the task. Sequence-based models can successfully capture semantics in local consecutive text sequences. On the other hand, graph-based models can preserve global co-occurrences that capture non-consecutive and long-distance semantics. A text representation approach that combines local and global information can enhance performance in practical class imbalance text classification scenarios. Yet, multi-view graph-based text representations have received limited attention.
In this research, first we introduce Multi-view Minority Class Text Graph Convolutional Network (MMCT-GCN), a transductive multi-view text classification model that captures textual graph representations for the minority class alongside sequence-based text representations. Experimental results show that MMCT-GCN obtains consistent improvements over baselines. Second, we develop an Arabic Bidirectional Encoder Representations from Transformers (BERT) Graph Convolutional Network (AraBERT-GCN), a hybrid model that combines the large-scale pre-trained models that encode the local context and semantics alongside graph-based features that are capable of extracting the global word co-occurrences in non-consecutive extended semantics by only one or two hops. Experimental results show that AraBERT-GCN outperforms the state-of-the-art (SOTA) on our Arabic text datasets. Finally, we propose an Arabic Multidimensional Edge Graph Convolutional Network (AraMEGraph) designed for text classification that encapsulates richer and context-aware representations of word and phrase relationships, thus mitigating the impact of the complexity and ambiguity of the Arabic language. / Doctor of Philosophy / The text classification task is an important step in understanding natural language. However, this task has many challenges, such as uneven data distributions and language difficulty. For example, Arabic is the fifth most spoken language. It has many different word forms and meanings, which can make things harder to understand. Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) are widely utilized for text classification. However, another kind of network called graph convolutional network (GCN) has yet to be explored for this task. Graph-based models keep track of how words are connected, even if they are not right next to each other in a sentence. This helps with better understanding the meaning of words. 
On the other hand, sequence-based models do well in understanding the meaning of words that are right next to each other. Mixing both types of information in text understanding can work better, especially when dealing with unevenly distributed data. In this research, we introduce a new text classification method called Multi-view Minority Class Text Graph Convolutional Network (MMCT-GCN). This model looks at text from different angles and combines information from graphs and sequence-based models. Our experiments show that this model performs better than other ones proposed in the literature. Additionally, we propose an Arabic BERT Graph Convolutional Network (AraBERT-GCN). It combines pre-trained models that understand words in context and graph features that look at how words relate to each other globally. This helps AraBERT-GCN do better than other models when working with Arabic text. Finally, we develop a special network called Arabic Multidimensional Edge Graph Convolutional Network (AraMEGraph) for Arabic text. It is designed to better understand Arabic and classify text more accurately. We do this by adding special edge features with multiple dimensions to help the network learn the relationships between words and phrases.
|
20 |
Beyond Supervised Learning: Applications and Implications of Zero-shot Text Classification
Borst-Graetz, Janos, 25 October 2024
This dissertation explores the application of zero-shot text classification, a technique for categorizing texts without annotated data in the target domain. A true zero-shot setting breaks with the conventions of the traditional supervised machine learning paradigm, which relies on quantitative in-domain evaluation for optimization, performance measurement, and model selection. The dissertation summarizes existing research to build a theoretical foundation for zero-shot methods, emphasizing efficiency and transparency. It benchmarks selected approaches across various tasks and datasets to understand their general performance, strengths, and weaknesses, mirroring the model selection process. On this foundation, two case studies demonstrate the application of zero-shot text classification. The first engages with historical German stock market reports, using zero-shot methods for aspect-based sentiment classification. This case study reveals that although there are qualitative differences between finetuned and zero-shot approaches, the aggregated results are not easily distinguishable, sparking a discussion about the practical implications. The second case study integrates zero-shot text classification into a civil engineering document management system, showcasing how the flexibility of zero-shot models and the omission of the training process can benefit the development of prototype software, at the cost of unknown performance. These findings indicate that, although zero-shot text classification works for the exemplary cases, the results are not generalizable. Taking up the findings of these case studies, the dissertation discusses dilemmas and theoretical considerations that arise from omitting the in-domain evaluation when applying zero-shot text classification. It concludes by advocating a broader focus beyond traditional quantitative metrics in order to build trust in zero-shot text classification, highlighting its practical utility as well as the need for further exploration as these technologies evolve.
1 Introduction
1.1 Problem Context
1.2 Related Work
1.3 Research Questions & Contribution
1.4 Author’s Publications
1.5 Structure of This Work
2 Research Context
2.1 The Current State of Text Classification
2.2 Efficiency
2.3 Approaches to Addressing Data Scarcity in Machine Learning
2.4 Challenges of Recent Developments
2.5 Model Sizes and Hardware Resources
2.6 Conclusion
3 Zero-shot Text Classification
3.1 Text Classification
3.2 State-of-the-Art in Text Classification
3.3 Neural Network Approaches to Data-Efficient Text Classification
3.4 Zero-shot Text Classification
3.5 Application
3.6 Requirements for Zero-shot Models
3.7 Approaches to Transfer Zero-shot
3.7.1 Terminology
3.7.2 Similarity-based and Siamese Networks
3.7.3 Language Model Token Predictions
3.7.4 Sentence Pair Classification
3.7.5 Instruction-following Models or Dialog-based Systems
3.8 Class Name Encoding in Text Classification
3.9 Approach Selection
3.10 Conclusion
4 Model Performance Survey
4.1 Experiments
4.1.1 Datasets
4.1.2 Model Selection
4.1.3 Hypothesis Templates
4.2 Zero-shot Model Evaluation
4.3 Dataset Complexity
4.4 Conclusion
5 Case Study: Historic German Stock Market Reports
5.1 Project
5.2 Motivation
5.3 Related Work
5.4 The Corpus and Dataset - Berliner Börsenzeitung
5.4.1 Corpus
5.4.2 Sentiment Aspects
5.4.3 Annotations
5.5 Methodology
5.5.1 Evaluation Approach
5.5.2 Trained Pipeline
5.5.3 Zero-shot Pipeline
5.5.4 Dictionary Pipeline
5.5.5 Tradeoffs
5.5.6 Label Space Definitions
5.6 Evaluation - Comparison of the Pipelines on BBZ
5.6.1 Sentence-based Sentiment
5.6.2 Aspect-based Sentiment
5.6.3 Qualitative Evaluation
5.7 Discussion and Conclusion
6 Case Study: Document Management in Civil Engineering
6.1 Project
6.2 Motivation
6.3 Related Work
6.4 The Corpus and Knowledge Graph
6.4.1 Data
6.4.2 BauGraph – The Knowledge Graph
6.5 Methodology
6.5.1 Document Insertion Pipeline
6.5.2 Frontend Integration
6.6 Discussion and Conclusion
7 MLMC
7.1 How it works
7.2 Motivation
7.3 Extensions of the Framework
7.4 Other Projects
7.4.1 Product Classification
7.4.2 Democracy Monitor
7.4.3 Climate Change Adaptation Finance
7.5 Conclusion
8 Discussion: The Five Dilemmas of Zero-shot
8.1 On Evaluation
8.2 The Five Dilemmas of Zero-shot
8.2.1 Dilemma of Evaluation or Are You Working at All?
8.2.2 Dilemma of Comparison or How Do I Get the Best Model?
8.2.3 Dilemma of Annotation and Label Definition or Are We Talking about the Same Thing?
8.2.4 Dilemma of Interpretation or Am I Biased?
8.2.5 Dilemma of Unsupervised Text Classification or Do I Have to Trust You?
8.3 Trust in Zero-shot Capabilities
8.4 Conclusion
9 Conclusion
9.1 Summary
9.1.1 RQ1: Strengths and Weaknesses
9.1.2 RQ2: Application Studies
9.1.3 RQ3: Implications
9.2 Final Thoughts & Future Directions
References
A Appendix for Survey Chapter
A.1 Model Selection
A.2 Task-specific Hypothesis Templates
A.3 Fractions of SotA
B Uncertainty vs. Accuracy
C Declaration of Authorship
D Declaration: Use of AI-Tools
E Bibliographic Data
|