• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 3
  • Tagged with
  • 4
  • 4
  • 4
  • 4
  • 4
  • 4
  • 4
  • 2
  • 2
  • 2
  • 2
  • 2
  • 1
  • 1
  • 1
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy

Ma, Long 08 August 2017 (has links)
Text classification, the task of metadata to documents, requires significant time and effort when performed by humans. Moreover, with online-generated content explosively growing, it becomes a challenge for manually annotating with large scale and unstructured data. Currently, lots of state-or-art text mining methods have been applied to classification process, many of them based on the key word extraction. However, when using these key words as features in classification task, it is common that feature dimension is huge. In addition, how to select key words from tons of documents as features in classification task is also a challenge. Especially when using tradition machine learning algorithm in the large data set, the computation cost would be high. In addition, almost 80% of real data is unstructured and non-labeled. The advanced supervised feature selection methods cannot be used directly in selecting entities from massive of data. Usually, extracting features from unlabeled data for classification tasks, statistical strategies have been utilized to discover key features. However, we propose a nova method to extract important features effectively before feeding them into the classification assignment. There is another challenge in the text classification is the multi-label problem, the assignment of multiple non-exclusive labels to the documents. This problem makes text classification more complicated when compared with single label classification. Considering above issues, we develop a framework for extracting and eliminating data dimensionality, solving the multi-label problem on labeled and unlabeled data set. To reduce data dimension, we provide 1) a hybrid feature selection method that extracts meaningful features according to the importance of each feature. 2) we apply the Word2Vec to represent each document with a lower feature dimension when doing the document categorization for the big data set. 3) An unsupervised approach to extract features from real online generated data for text classification and prediction. On the other hand, to solve the multi-label classification task, we design a new Multi-Instance Multi-Label (MIML) algorithm in the proposed framework.
2

Balancing Performance and Usage Cost: A Comparative Study of Language Models for Scientific Text Classification / Balansera prestanda och användningskostnader: En jämförande undersökning av språkmodeller för klassificering av vetenskapliga texter

Engel, Eva January 2023 (has links)
The emergence of large language models, such as BERT and GPT-3, has revolutionized natural language processing tasks. However, the development and deployment of these models pose challenges, including concerns about computational resources and environmental impact. This study aims to compare discriminative language models for text classification based on their performance and usage cost. We evaluate the models using a hierarchical multi-label text classification task and assess their performance using primarly F1-score. Additionally, we analyze the usage cost by calculating the Floating Point Operations (FLOPs) required for inference. We compare a baseline model, which consists of a classifier chain with logistic regression models, with fine-tuned discriminative language models, including BERT with two different sequence lengths and DistilBERT, a distilled version of BERT. Results show that the DistilBERT model performs optimally in terms of performance, achieving an F1-score of 0.56 averaged on all classification layers. The baseline model and BERT with a maximal sequence length of 128 achieve F1-scores of 0.51. However, the baseline model outperforms the transformers at the most specific classification level with an F1-score of 0.33. Regarding usage cost, the baseline model significantly requires fewer FLOPs compared to the transformers. Furthermore, restricting BERT to a maximum sequence length of 128 tokens instead of 512 sacrifices some performance but offers substantial gains in usage cost. The code and dataset are available on GitHub. / Uppkomsten av stora språkmodeller, som BERT och GPT-3, har revolutionerat språkteknologi. Dock ger utvecklingen och implementeringen av dessa modeller upphov till utmaningar, bland annat gällande beräkningsresurser och miljöpåverkan. Denna studie syftar till att jämföra diskriminativa språkmodeller för textklassificering baserat på deras prestanda och användningskostnad. Vi utvärderar modellerna genom att använda en hierarkisk textklassificeringsuppgift och bedöma deras prestanda primärt genom F1-score. Dessutom analyserar vi användningskostnaden genom att beräkna antalet flyttalsoperationer (FLOPs) som krävs för inferens. Vi jämför en grundläggande modell, som består av en klassifikationskedja med logistisk regression, med finjusterande diskriminativa språkmodeller, inklusive BERT med två olika sekvenslängder och DistilBERT, en destillerad version av BERT. Resultaten visar att DistilBERT-modellen presterar optimalt i fråga om prestanda och uppnår en genomsnittlig F1-score på 0,56 för alla klassificeringsnivåer. Den grundläggande modellen och BERT med en maximal sekvenslängd på 128 uppnår ett F1-score på 0,51. Dock överträffar den grundläggande modellen transformermodellerna på den mest specifika klassificeringsnivån med en F1-score på 0,33. När det gäller användningskostnaden kräver den grundläggande modellen betydligt färre FLOPs jämfört med transformermodellerna. Att begränsa BERT till en maximal sekvenslängd av 128 tokens ger vissa prestandaförluster men erbjuder betydande besparingar i användningskostnaden. Koden och datamängden är tillgängliga på GitHub.
3

Automatic Categorization of News Articles With Contextualized Language Models / Automatisk kategorisering av nyhetsartiklar med kontextualiserade språkmodeller

Borggren, Lukas January 2021 (has links)
This thesis investigates how pre-trained contextualized language models can be adapted for multi-label text classification of Swedish news articles. Various classifiers are built on pre-trained BERT and ELECTRA models, exploring global and local classifier approaches. Furthermore, the effects of domain specialization, using additional metadata features and model compression are investigated. Several hundred thousand news articles are gathered to create unlabeled and labeled datasets for pre-training and fine-tuning, respectively. The findings show that a local classifier approach is superior to a global classifier approach and that BERT outperforms ELECTRA significantly. Notably, a baseline classifier built on SVMs yields competitive performance. The effect of further in-domain pre-training varies; ELECTRA’s performance improves while BERT’s is largely unaffected. It is found that utilizing metadata features in combination with text representations improves performance. Both BERT and ELECTRA exhibit robustness to quantization and pruning, allowing model sizes to be cut in half without any performance loss.
4

Methods for data and user efficient annotation for multi-label topic classification / Effektiva annoteringsmetoder för klassificering med multipla klasser

Miszkurka, Agnieszka January 2022 (has links)
Machine Learning models trained using supervised learning can achieve great results when a sufficient amount of labeled data is used. However, the annotation process is a costly and time-consuming task. There are many methods devised to make the annotation pipeline more user and data efficient. This thesis explores techniques from Active Learning, Zero-shot Learning, Data Augmentation domains as well as pre-annotation with revision in the context of multi-label classification. Active ’Learnings goal is to choose the most informative samples for labeling. As an Active Learning state-of-the-art technique Contrastive Active Learning was adapted to a multi-label case. Once there is some labeled data, we can augment samples to make the dataset more diverse. English-German-English Backtranslation was used to perform Data Augmentation. Zero-shot learning is a setup in which a Machine Learning model can make predictions for classes it was not trained to predict. Zero-shot via Textual Entailment was leveraged in this study and its usefulness for pre-annotation with revision was reported. The results on the Reviews of Electric Vehicle Charging Stations dataset show that it may be beneficial to use Active Learning and Data Augmentation in the annotation pipeline. Active Learning methods such as Contrastive Active Learning can identify samples belonging to the rarest classes while Data Augmentation via Backtranslation can improve performance especially when little training data is available. The results for Zero-shot Learning via Textual Entailment experiments show that this technique is not suitable for the production environment. / Klassificeringsmodeller som tränas med övervakad inlärning kan uppnå goda resultat när en tillräcklig mängd annoterad data används. Annoteringsprocessen är dock en kostsam och tidskrävande uppgift. Det finns många metoder utarbetade för att göra annoteringspipelinen mer användar- och dataeffektiv. Detta examensarbete utforskar tekniker från områdena Active Learning, Zero-shot Learning, Data Augmentation, samt pre-annotering, där annoterarens roll är att verifiera eller revidera en klass föreslagen av systemet. Målet med Active Learning är att välja de mest informativa datapunkterna för annotering. Contrastive Active Learning utökades till fallet där en datapunkt kan tillhöra flera klasser. Om det redan finns några annoterade data kan vi utöka datamängden med artificiella datapunkter, med syfte att göra datasetet mer mångsidigt. Engelsk-Tysk-Engelsk översättning användes för att konstruera sådana artificiella datapunkter. Zero-shot-inlärning är en teknik i vilken en maskininlärningsmodell kan göra förutsägelser för klasser som den inte var tränad att förutsäga. Zero-shot via Textual Entailment utnyttjades i denna studie för att utöka datamängden med artificiella datapunkter. Resultat från datamängden “Reviews of Electric Vehicle Charging ”Stations visar att det kan vara fördelaktigt att använda Active Learning och Data Augmentation i annoteringspipelinen. Active Learning-metoder som Contrastive Active Learning kan identifiera datapunkter som tillhör de mest sällsynta klasserna, medan Data Augmentation via Backtranslation kan förbättra klassificerarens prestanda, särskilt när få träningsdata finns tillgänglig. Resultaten för Zero-shot Learning visar att denna teknik inte är lämplig för en produktionsmiljö.

Page generated in 0.1121 seconds