Global ETD Search

111	Mapping Java Source Code To Architectural Concerns Through Machine Learning Florean, Alexander, Jalal, Laoa January 2021 The explosive growth of software systems in both size and complexity has led to a recognised need for techniques to combat architectural degradation. Reflexion Modelling is a method commonly used for Software Architectural Consistency Checking (SACC). However, the steps needed to utilise the method involve manual mapping, which can become tedious depending on the system's size. Recently, machine learning has shown promising results, outperforming other approaches. However, neither a comparison of different classifiers nor a comprehensive investigation of how best to pre-process source code has yet been performed. This thesis compares different classifiers, relating their performance to the manual effort needed to train them, and examines how different pre-processing settings affect their accuracy. The study can be divided into two areas: pre-processing and how large the manual mapping should be to achieve satisfactory performance. Across the three software systems used in this study, the overall best performing model, MaxEnt, achieved the following average results: accuracy 0.88, weighted precision 0.89 and weighted recall 0.88. SVM performed almost identically to MaxEnt. Furthermore, the results show that Naive-Bayes, the algorithm used in recent related work, performs worse than SVM and MaxEnt. The results showed that the pre-processing that extracts packages and libraries, together with the feature representation method Bag-of-Words, had the best performance. Furthermore, it was found that manual mapping of a minimum of ten files per concern is needed for satisfactory performance. The research results represent a further step towards automating code-to-architecture mappings, as required in reflexion modelling and similar techniques. 
software architecture software architecture consistency code-to-architecture-mapping text classification machine learning Computer Sciences Datavetenskap (datalogi)
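The pre-processing that the abstract above reports as best (extracting packages and libraries, then building a Bag-of-Words representation) might look roughly like the following. This is a minimal sketch only: the regular expressions, the function names `extract_tokens` and `bag_of_words`, and the choice to split dotted names into lowercase tokens are assumptions made for illustration, not the thesis's actual rules.

```python
import re
from collections import Counter

# Hypothetical patterns for this sketch; the thesis does not state its extraction rules.
PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)(?:\.\*)?\s*;", re.MULTILINE)


def extract_tokens(java_source):
    """Return lowercase tokens taken from the package and import declarations."""
    names = PACKAGE_RE.findall(java_source) + IMPORT_RE.findall(java_source)
    tokens = []
    for name in names:
        # Split dotted names such as "java.util.List" into their parts.
        tokens.extend(part.lower() for part in name.split(".") if part)
    return tokens


def bag_of_words(java_source):
    """Count how often each token occurs; word order is discarded."""
    return Counter(extract_tokens(java_source))


example = (
    "package com.shop.billing;\n"
    "import java.util.List;\n"
    "import com.shop.db.*;\n"
    "public class Invoice {}\n"
)
print(bag_of_words(example))
```

The resulting counts could then be fed to any classifier (MaxEnt, SVM, Naive-Bayes) as the feature vector of a file; class bodies are deliberately ignored in this sketch.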
112	Information Extraction for Test Identification in Repair Reports in the Automotive Domain Jie, Huang January 2023 The knowledge of tests conducted on a problematic vehicle is essential for enhancing the efficiency of mechanics. Therefore, identifying the tests performed in each repair case is of utmost importance. This thesis explores techniques for extracting data from unstructured repair reports to identify component tests. The main emphasis is on developing a supervised multi-class classifier to categorize data and extract sentences that describe repair diagnoses and actions. It has been shown that incorporating a category-aware contrastive learning objective can improve the repair report classifier’s performance. The proposed approach involves training a sentence representation model based on a pre-trained model using a category-aware contrastive learning objective. Subsequently, the sentence representation model is further trained on the classification task using a loss function that combines the cross-entropy and supervised contrastive learning losses. By applying this method, the macro F1-score on the test set is increased from 90.45 to 90.73. The attempt to enhance the performance of the repair report classifier using a noisy data classifier proves unsuccessful. The noisy data classifier is trained using a prompt-based fine-tuning method, incorporating open-ended questions and two examples in the prompt. This approach achieves an F1-score of 91.09 and the resulting repair report classification datasets are found easier to classify. However, they do not contribute to an improvement in the repair report classifier’s performance. Ultimately, the repair report classifier is utilized to aid in creating the input necessary for identifying component tests. An information retrieval method is used to conduct the test identification. 
The incorporation of this classifier and the existing labels when creating queries leads to an improvement in the mean average precision at the top 3, 5, and 10 positions by 0.62, 0.81, and 0.35, respectively, although with a slight decrease of 0.14 at the top 1 position. text classification information retrieval contrastive learning prompt-based fine-tuning large language models
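For readers unfamiliar with the retrieval metric quoted above, mean average precision at the top k positions can be sketched as follows. The abstract does not say which normalisation the thesis uses, so dividing by `min(k, number of relevant items)` is an assumption, as are the function names and the toy test identifiers.

```python
def average_precision_at_k(ranked, relevant, k):
    """Average precision over the top-k ranked items for one query."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    # Assumed normalisation: best achievable number of hits within the top k.
    return score / min(k, len(relevant))


def mean_average_precision_at_k(rankings, relevants, k):
    """Mean of the per-query average precision values."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)


# Two hypothetical repair cases with ranked candidate tests.
rankings = [["t3", "t1", "t7"], ["a", "b", "c"]]
relevants = [{"t1"}, {"a", "c"}]
print(mean_average_precision_at_k(rankings, relevants, k=3))
```

With k = 1 the metric only rewards a correct first result, which is why a method can gain at the top 3, 5, and 10 positions while losing slightly at the top 1 position.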
113	Optimizing the Performance of Text Classification Models by Improving the Isotropy of the Embeddings using a Joint Loss Function Attieh, Joseph January 2022 Recent studies show that the spatial distribution of the sentence representations generated from pre-trained language models is highly anisotropic, meaning that the representations are not uniformly distributed among the directions of the embedding space. Thus, the expressiveness of the embedding space is limited, as the embeddings are less distinguishable and less diverse. This results in a degradation in the performance of the models on the downstream task. Most methods that define the state-of-the-art in this area proceed by improving the isotropy of the sentence embeddings by refining the corresponding contextual word representations, then deriving the sentence embeddings from these refined representations. In this thesis, we propose to improve the quality and distribution of the sentence embeddings extracted from the [CLS] token of the pre-trained language models by improving the isotropy of the embeddings. We add one feed-forward layer, referred to as the Isotropy Layer, between the model and the downstream task layers. We train this layer using a novel joint loss function that optimizes an isotropy quality measure and the downstream task loss. This joint loss pushes the embeddings outputted by the Isotropy Layer to be more isotropic, and it also retains the semantics needed to perform the downstream task. The proposed approach results in transformed embeddings with better isotropy, that generalize better on the downstream task. Furthermore, the approach requires training one feed-forward layer, instead of retraining the whole network. We quantify and evaluate the isotropy through multiple metrics, mainly the Explained Variance and the IsoScore. 
Experimental results on 3 GLUE datasets with classification as the downstream task show that our proposed method is on par with the state-of-the-art, as it achieves performance gains of around 2-3% on the downstream tasks compared to the baseline. We also present a small case study on one language abuse detection dataset, then interpret some of the findings in light of the results. / Nya studier visar att den rumsliga fördelningen av de meningsrepresentationer som genereras från förtränade språkmodeller är mycket anisotropisk, vilket innebär att representationerna mellan riktningarna i inbäddningsutrymmet inte är jämnt fördelade. Inbäddningsutrymmets uttrycksförmåga är således begränsad, eftersom inbäddningarna är mindre särskiljbara och mindre varierande. Detta leder till att modellernas prestanda försämras i nedströmsuppgiften. De flesta metoder som definierar den senaste tekniken på detta område går ut på att förbättra isotropin hos inbäddningarna av meningar genom att förädla motsvarande kontextuella ordrepresentationer och sedan härleda inbäddningarna av meningar från dessa förädlade representationer. I den här avhandlingen föreslår vi att kvaliteten och fördelningen av de inbäddningar av meningar som utvinns från [CLS]-tokenet i de förtränade språkmodellerna förbättras genom inbäddningarnas isotropi. Vi lägger till ett feed-forward-skikt, kallat det isotropa skiktet, mellan modellen och de nedströms liggande uppgiftsskikten. Detta lager tränas med hjälp av en ny gemensam förlustfunktion som optimerar ett kvalitetsmått för isotropi och förlusten av nedströmsuppgiften. Den gemensamma förlusten resulterar i att de inbäddningar som produceras av det isotropa lagret blir mer isotropa, samtidigt som den semantik som behövs för att utföra den nedströms liggande uppgiften bibehålls. Det föreslagna tillvägagångssättet resulterar i transformerade inbäddningar med bättre isotropi, som generaliseras bättre för den efterföljande uppgiften. 
Dessutom kräver tillvägagångssättet träning av ett feed-forward-skikt, i stället för omskolning av hela nätverket. Vi kvantifierar och utvärderar isotropin med hjälp av flera mått, främst Förklarad Varians och IsoScore. Experimentella resultat på tre GLUE-dataset visar att vår föreslagna metod är likvärdig med den senaste tekniken, eftersom den uppnår prestandaökningar på cirka 2-3 % på nedströmsuppgifterna jämfört med baslinjen. Vi presenterar även en liten fallstudie på ett dataset för upptäckt av språkmissbruk och tolkar sedan några av resultaten mot bakgrund av dessa. Text Classification Isotropy Embeddings BERT IsoScore Klassificering av Text Isotropi Inbäddningar BERT IsoScore Computer and Information Sciences Data- och informationsvetenskap
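To make the notion of anisotropy in the abstract above concrete, the sketch below computes the share of total variance captured by the first principal component of a set of embeddings. This is only one common reading of "explained variance"; the abstract does not define the metric, and IsoScore is not reproduced here. The power-iteration helper, the function names, and the toy 2-D points are all assumptions for illustration.

```python
import math


def covariance_matrix(vectors):
    """Population covariance matrix of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    return [
        [sum(row[a] * row[b] for row in centered) / n for b in range(dim)]
        for a in range(dim)
    ]


def top_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration.

    The start vector is arbitrary; a start exactly orthogonal to the top
    eigenvector would give a wrong answer, which this sketch does not guard against.
    """
    dim = len(matrix)
    vec = [float(j + 1) for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in product]
        eigenvalue = norm
    return eigenvalue


def first_component_explained_variance(vectors):
    """Fraction of total variance along the dominant direction (1/dim if isotropic)."""
    cov = covariance_matrix(vectors)
    total = sum(cov[j][j] for j in range(len(cov)))
    return top_eigenvalue(cov) / total if total > 0.0 else 0.0


anisotropic = [[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(first_component_explained_variance(anisotropic))
print(first_component_explained_variance(isotropic))
```

A value close to 1 means the embeddings are concentrated along a single direction, while a value close to 1/dim indicates that variance is spread evenly, which is the property the Isotropy Layer is trained to encourage.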
114	Parliament proceeding classification via Machine Learning algorithms: A case of Greek parliament proceedings Kavallos, Christos-Sotirios January 2023 The Greek Parliament is a critical institution for the Greek Democracy, where important decisions are made that affect the lives of millions of people. It consists of representatives from different political parties, and each party has a unique political ideology, stance, and agenda. The proposed research aims to automatically classify parliamentary proceedings to their respective political parties based on the content of their speeches, debates, and discussions. The goal of this research is to assess the feasibility of classifying Greek parliament proceedings to their respective political party via machine learning and neural network algorithms. By using machine learning algorithms and neural networks, the system can learn from large amounts of data and make accurate predictions about the category of a given proceeding. One possible approach is to use supervised learning algorithms, where the system is trained on a dataset of parliamentary proceedings labeled with the respective political parties. The machine learning algorithms can then learn the underlying patterns and features in the text data and accurately classify new proceedings to their respective parties. Another potential approach is to use deep learning neural networks, such as recurrent neural networks (RNNs), to classify the proceedings. These networks can be trained on large amounts of labeled data and can learn the complex relationships between the text features and political parties. The results of this research can be used to gain insights into the political landscape and the positions of different parties on various issues. 
The ability to automatically classify parliamentary proceedings to their political parties can also aid in political analysis, including tracking the voting patterns of different parties and their representatives; more generally, it has the potential to transform the social and human sciences. Moreover, the proposed research can have implications for policy-making and governance. By analyzing the proceedings and identifying the political parties' positions and priorities, policymakers can better understand the political landscape and craft policies that align with the values and priorities of different parties. In conclusion, the classification of parliament proceedings, in our case Greek, to their political parties via NLP with machine learning algorithms is a promising research topic that has potential applications in political analysis and decision-making. The ability to automatically classify parliamentary proceedings to their respective parties can enhance transparency and accountability in the democratic system and aid in policy-making and governance. Machine learning supervised learning text classification parliamentary proceedings Greek language. Computer Sciences Datavetenskap (datalogi) Computer Systems Datorsystem
115	Comparing Text Classification Libraries in Scala and Python : A comparison of precision and recall Garamvölgyi, Filip, Henning Bruce, August January 2021 In today’s internet era, more text than ever is being uploaded online. The text comes in many forms, such as social media posts, business reviews, and many more. For various reasons, there is an interest in analyzing the uploaded text. For instance, an airline business could ask their customers to review the service they have received. The feedback would be collected by asking the customer to leave a review and a score. A common scenario is a review with a good score that contains negative aspects. It is preferable to avoid a situation where the entirety of the review is regarded as positive because of the score if there are negative aspects mentioned. A solution to this would be to analyze each sentence of a review and classify it as negative, neutral, or positive depending on how the sentence is perceived. With the amount of text uploaded today, it is not feasible to manually analyze text. Automatically classifying text by a set of criteria is called text classification. The process of specifically classifying text by how it is perceived is a subcategory of text classification known as sentiment analysis. Positive, neutral, and negative would be the sentiments to classify. The most popular frameworks associated with the implementation of sentiment analyzers are developed in the programming language Python. However, over the years, text classification has increased in popularity. This increase has caused new frameworks to be developed in new programming languages. Scala is one of the programming languages for which new frameworks for sentiment analysis have been developed. However, in comparison to Python, it has fewer available resources. Python has more available libraries, documentation, and online community support. 
There are even fewer resources regarding sentiment analysis in a less common language such as Swedish. The problem is that no one has compared a sentiment analyzer for Swedish text implemented in Scala with one implemented in Python. The purpose of this thesis is to compare the recall and precision of a sentiment analyzer implemented in Scala with one implemented in Python. The goal of this thesis is to increase the knowledge regarding the state of text classification for less common natural languages in Scala. To conduct the study, a qualitative approach with the support of quantitative data was used. Two kinds of sentiment analyzers were implemented in Scala and Python. The first classified text as either positive or negative (binary sentiment analysis); the second also classified text as neutral (multiclass sentiment analysis). To perform the comparative study, the implemented analyzers performed classification on text with known sentiments. The quality of the classifications was measured using their F1-score. The results showed that Python had better recall and quality for both tasks. In the binary task, the difference between the two implementations was smaller. The resources from Python were more specialized for Swedish and did not seem to be as affected by the small dataset used as the resources in Scala. Scala had an F1-score of 0.78 for binary sentiment analysis and 0.65 for multiclass sentiment analysis. Python had an F1-score of 0.83 for binary sentiment analysis and 0.78 for multiclass sentiment analysis. / I dagens internetera laddas mer text upp än någonsin online. Texten finns i många former, till exempel inlägg på sociala medier, företagsrecensioner och många fler. Av olika skäl finns det ett intresse av att analysera den uppladdade texten. Till exempel kan ett flygbolag be sina kunder att lämna omdömen om tjänsten de nyttjat. Feedbacken samlas in genom att be kunden lämna ett omdöme och ett betyg. 
Ett vanligt scenario är en recension med ett bra betyg som innehåller negativa aspekter. Det är att föredra att undvika en situation där hela recensionen anses vara positiv på grund av poängen, om det nämnts negativa aspekter. En lösning på detta skulle vara att analysera varje mening i en recension och klassificera den som negativ, neutral eller positiv beroende på hur meningen uppfattas. Med den mängd text som laddas upp idag är det inte möjligt att manuellt analysera text. Att automatiskt klassificera text efter en uppsättning kriterier kallas textklassificering. Processen att specifikt klassificera text efter hur den uppfattas är en underkategori av textklassificering som kallas sentimentanalys. Positivt, neutralt och negativt skulle vara sentiment att klassificera. De mest populära ramverken för implementering av sentimentanalysatorer utvecklas i programmeringsspråket Python. Men genom åren har textklassificering ökat i popularitet. Ökningen i popularitet har gjort att nya ramverk utvecklats för nya programmeringsspråk. Scala är ett av programmeringsspråken som har utvecklat nya ramverk för att arbeta med sentimentanalys. I jämförelse med Python har den dock mindre tillgängliga resurser. Python har mer bibliotek, dokumentation och mer stöd online. Det finns ännu färre resurser när det gäller sentimentanalyser på ett mindre vanligt språk som svenska. Problemet är att ingen har jämfört en sentimentanalysator för svensk text implementerad med Scala och jämfört den med Python. Syftet med denna avhandling är att jämföra precision och recall på en sentimentanalysator implementerad i Scala med Python. Målet med denna avhandling är att öka kunskapen om tillståndet för textklassificering för mindre vanliga naturliga språk i Scala. För att genomföra studien användes ett kvalitativt tillvägagångssätt med stöd av kvantitativa data. Två typer av sentimentanalysatorer implementerades i Scala och Python. 
Den första klassificerade texten som antingen positiv eller negativ (binär sentimentanalys), den andra sentimentanalysatorn skulle också klassificera text som neutral (sentimentanalys i flera klasser). För att utföra den jämförande studien skulle de implementerade analysatorerna utföra klassificering på text med kända sentiment. Klassificeringarnas kvalitet mättes med deras F1-poäng. Resultaten visade att Python hade bättre precision och recall för båda uppgifterna. I den binära uppgiften var det inte lika stor skillnad mellan de två implementeringarna. Resurserna från Python var mer specialiserade för svenska och verkade inte påverkas lika mycket av den lilla dataset som används som resurserna i Scala. Scala hade ett F1-poäng på 0,78 för binär sentimentanalys och 0,65 för sentimentanalys i flera klasser. Python hade ett F1-poäng på 0,83 för binär sentimentanalys och 0,78 för sentimentanalys i flera klasser. LaBSE Spark NLP NLP Text classification Scala LaBSE Spark NLP NLP Textklassificering Scala Computer and Information Sciences Data- och informationsvetenskap
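The precision, recall, and F1-score used in the comparison above can be computed per class and then averaged. The sketch below is a plain-Python illustration; the abstract does not state how the multiclass F1-score was averaged, so the unweighted (macro) average, the function names, and the toy labels are assumptions.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores (assumed averaging scheme)."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical sentence-level sentiment labels.
y_true = ["pos", "pos", "neg", "neu", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neu", "neg", "pos"]
for label, (p, r, f1) in per_class_scores(y_true, y_pred).items():
    print(label, round(p, 2), round(r, 2), round(f1, 2))
print(round(macro_f1(y_true, y_pred), 3))
```

Because F1 is the harmonic mean of precision and recall, a classifier cannot reach a high F1-score by excelling at only one of the two.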
116	Extending a Text Classifier to Multiple Languages / Utöka en textklassificeringsmodell till flera språk Byström, Albin January 2021 This thesis explores the possibility of extending monolingual and bilingual text classifiers to multiple languages. Two different language models are explored: language-aligned word embeddings and a transformer model. The goal was to take a classifier based on Swedish and English samples and extend it to Danish, German, and Finnish samples. The results show that extending a text classifier by word embedding alignment or by fine-tuning a multilingual transformer model is possible, but with varying accuracy depending on the language. / Denna avhandling undersöker möjligheten att utvidga enspråkiga och tvåspråkiga textklassificatorer till flera språk. Två olika språkmodeller utforskas, justeras ordinbäddningar och en transformatormodell. Målet var att ta en klassificerare baserad på svenska och engelska texter och utvidga den till danska, tyska och finska texter. Resultatet visar att det är möjligt att utöka en textklassificering med ordinbäddning eller genom att finjustera en flerspråkig transformatormodell, men träffsäkerheten varierar beroende på språk. Natural language processing Multilingual Transformer Word embeddings Text classification Språkteknologi Flerspråkig Transformator Ordinbäddningar Textklassificering Computer and Information Sciences Data- och informationsvetenskap
117	Uncertainty Estimation on Natural Language Processing He, Jianfeng 15 May 2024 Text plays a pivotal role in our daily lives, encompassing various forms such as social media posts, news articles, books, reports, and more. Consequently, Natural Language Processing (NLP) has garnered widespread attention. This technology empowers us to undertake tasks like text classification, entity recognition, and even crafting responses within a dialogue context. However, despite the expansive utility of NLP, it frequently necessitates a critical decision: whether to place trust in a model's predictions. To illustrate, consider a state-of-the-art (SOTA) model entrusted with diagnosing a disease or assessing the veracity of a rumor. An incorrect prediction in such scenarios can have dire consequences, impacting individuals' health or tarnishing their reputation. Consequently, it becomes imperative to establish a reliable method for evaluating the reliability of an NLP model's predictions, which is our focus: uncertainty estimation in NLP. Though many works have researched uncertainty estimation or NLP, the combination of these two domains is rare. This is because most NLP research emphasizes model prediction performance but tends to overlook the reliability of NLP model predictions. Additionally, current uncertainty estimation models may not be suitable for NLP due to the unique characteristics of NLP tasks, such as the need for more fine-grained information in named entity recognition. Therefore, this dissertation proposes novel uncertainty estimation methods for different NLP tasks by considering the NLP task's distinct characteristics. The NLP tasks are categorized into natural language understanding (NLU) and natural language generation (NLG, such as text summarization). Among the NLU tasks, understanding can be considered from two views: a global view (e.g. text classification at document level) and a local view (e.g. 
natural language inference at sentence level and named entity recognition at token level). As a result, we research uncertainty estimation on three tasks: text classification, named entity recognition, and text summarization. Besides, because few-shot text classification has captured much attention recently, we also research uncertainty estimation on few-shot text classification. For the first topic, uncertainty estimation on text classification, few uncertainty models focus on improving the performance of text classification where human resources are involved. In response to this gap, our research focuses on enhancing the accuracy of uncertainty scores by bolstering the confidence associated with winning scores. We introduce MSD, a novel model comprising three distinct components: 'mix-up,' 'self-ensembling,' and 'distinctiveness score.' The primary objective of MSD is to refine the accuracy of uncertainty scores by mitigating the issue of overconfidence in winning scores while simultaneously considering various categories of uncertainty. MSD integrates seamlessly with different deep neural networks. Extensive experiments with ablation settings are conducted on four real-world datasets, resulting in consistently competitive improvements. Our second topic focuses on uncertainty estimation on few-shot text classification (UEFTC), which has few or even only one available support sample for each class. UEFTC represents an underexplored research domain where, due to limited data samples, a UEFTC model predicts an uncertainty score to assess the likelihood of classification errors. However, traditional uncertainty estimation models in text classification are ill-suited for UEFTC since they demand extensive training data, while UEFTC operates in a few-shot scenario, typically providing just a few support samples, or even just one, per class. To tackle this challenge, we introduce Contrastive Learning from Uncertainty Relations (CLUR) as a solution tailored for UEFTC. 
CLUR exhibits the unique capability to be effectively trained with only one support sample per class, aided by pseudo uncertainty scores. A distinguishing feature of CLUR is its autonomous learning of these pseudo uncertainty scores, in contrast to previous approaches that relied on manual specification. Our investigation of CLUR encompasses four model structures, allowing us to evaluate the performance of three commonly employed contrastive learning components in the context of UEFTC. Our findings highlight the effectiveness of two of these components. Our third topic focuses on uncertainty estimation on sequential labeling. Sequential labeling involves the task of assigning labels to individual tokens in a sequence, exemplified by Named Entity Recognition (NER). Despite significant advancements in enhancing NER performance in prior research, the realm of uncertainty estimation for NER (UE-NER) remains relatively uncharted but is of paramount importance. This topic focuses on UE-NER, seeking to gauge uncertainty scores for NER predictions. Previous models for uncertainty estimation often overlook two distinctive attributes of NER: the interrelation among entities (where the learning of one entity's embedding depends on others) and the challenges posed by incorrect span predictions in entity extraction. To address these issues, we introduce the Sequential Labeling Posterior Network (SLPN), designed to estimate uncertainty scores for the extracted entities while considering uncertainty propagation from other tokens. Additionally, we have devised an evaluation methodology tailored to the specific nuances of wrong-span cases. Our fourth topic focuses on an overlooked question that persists regarding the evaluation reliability of uncertainty estimation in text summarization (UE-TS). 
Text summarization, a key task in natural language generation (NLG), holds significant importance, particularly in domains where inaccuracies can have serious consequences, such as healthcare. UE-TS has garnered attention due to the potential risks associated with erroneous summaries. However, the reliability of evaluating UE-TS methods raises concerns, stemming from the interdependence between uncertainty model metrics and the wide array of NLG metrics. To address these concerns, we introduce a comprehensive UE-TS benchmark incorporating twenty-six NLG metrics across four dimensions. This benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model across two datasets. Additionally, it assesses the effectiveness of fourteen common uncertainty estimation methods. Our study underscores the necessity of utilizing diverse, uncorrelated NLG metrics and uncertainty estimation techniques for a robust evaluation of UE-TS methods. / Doctor of Philosophy / Text is integral to our daily activities, appearing in various forms such as social media posts, news articles, books, and reports. We rely on text for communication, information dissemination, and decision-making. Given its ubiquity, the ability to process and understand text through Natural Language Processing (NLP) has become increasingly important. NLP technology enables us to perform tasks like text classification, which involves categorizing text into predefined labels, and named entity recognition (NER), which identifies specific entities such as names, dates, and locations within text. Additionally, NLP facilitates generating coherent and contextually appropriate responses in conversational agents, enhancing human-computer interaction. However, the reliability of NLP models is crucial, especially in sensitive applications like medical diagnoses, where errors can have severe consequences. 
This dissertation focuses on uncertainty estimation in NLP, a less explored but essential area. Uncertainty estimation helps evaluate the confidence of NLP model predictions. We propose new methods tailored to various NLP tasks, acknowledging their unique needs. NLP tasks are divided into natural language understanding (NLU) and natural language generation (NLG). Within NLU, we look at tasks from two perspectives: a global view (e.g., document-level text classification) and a local view (e.g., sentence-level inference and token-level entity recognition). Our research spans text classification, named entity recognition (NER), and text summarization, with a special focus on few-shot text classification due to its recent prominence. For text classification, we introduce the MSD model, which includes three components to enhance uncertainty score accuracy and address overconfidence issues. This model integrates seamlessly with different neural networks and shows consistent improvements in experiments. For few-shot text classification, we develop Contrastive Learning from Uncertainty Relations (CLUR), designed to work effectively with minimal support samples per class. CLUR autonomously learns pseudo uncertainty scores, demonstrating effectiveness with various contrastive learning components. In NER, we address the unique challenges of entity interrelation and span prediction errors. We propose the Sequential Labeling Posterior Network (SLPN) to estimate uncertainty scores while considering uncertainty propagation from other tokens. For text summarization, we create a benchmark with tens of metrics to evaluate uncertainty estimation methods across two datasets. This benchmark helps assess the reliability of these methods, highlighting the need for diverse, uncorrelated metrics. Overall, our work advances the understanding and implementation of uncertainty estimation in NLP, providing more reliable and accurate predictions across different tasks. 
Uncertainty Estimation Bayesian Neural Network Evidential Neural Network Text Classification Few-Shot Named Entity Recognition Text Summarization
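As background for the "winning scores" mentioned in the abstract above, the sketch below shows two generic baseline uncertainty signals for a classifier: the winning softmax probability and the predictive entropy. These are standard textbook baselines, not the MSD, CLUR, or SLPN methods of the dissertation; the function names and the example logits are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def winning_score(probs):
    """Probability of the predicted class; high values can reflect overconfidence."""
    return max(probs)


def predictive_entropy(probs):
    """Shannon entropy in nats; larger values indicate a more uncertain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


confident = softmax([10.0, 0.0, 0.0])   # hypothetical logits for a 3-class task
uncertain = softmax([0.0, 0.0, 0.0])
print(winning_score(confident), predictive_entropy(confident))
print(winning_score(uncertain), predictive_entropy(uncertain))
```

A model can assign a winning score close to 1 to a wrong prediction, which is the overconfidence problem that dedicated uncertainty estimation methods aim to mitigate.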
118	Arabic language processing for text classification : contributions to Arabic root extraction techniques, building an Arabic corpus, and to Arabic text classification techniques Al-Nashashibi, May Yacoub Adib January 2012 (has links) The impact and dynamics of Internet-based resources for Arabic-speaking users is increasing in significance, depth and breadth at highest pace than ever, and thus requires updated mechanisms for computational processing of Arabic texts. Arabic is a complex language and as such requires in depth investigation for analysis and improvement of available automatic processing techniques such as root extraction methods or text classification techniques, and for developing text collections that are already labeled, whether with single or multiple labels. This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique would require data in order to be used and critically reviewed and assessed, and here an attempt to develop a labeled Arabic corpus is also proposed. This thesis is composed of three parts: 1- Arabic corpus development, 2- proposing, improving and implementing root extraction techniques, and 3- proposing and investigating the effect of different pre-processing methods on single-labeled text classification methods for Arabic. This thesis first develops an Arabic corpus that is prepared to be used here for testing root extraction methods as well as single-label text classification techniques. It also enhances a rule-based root extraction method by handling irregular cases (that appear in about 34% of texts). It proposes and implements two expanded algorithms as well as an adjustment for a weight-based method. It also includes the algorithm that handles irregular cases to all and compares the performances of these proposed methods with original ones. 
This thesis thus develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words. The technique with the best accuracy in extracting the correct stem and root for the respective words in texts, an enhanced rule-based method, is used in the third part of the thesis. Finally, the thesis proposes and implements a variant term frequency-inverse document frequency (TF-IDF) weighting method, and investigates how different choices of features in document representation (words, stems or roots, with or without their respective phrases) affect single-label text classification performance. It applies forty-seven classifiers to all proposed representations and compares their performance. One challenge for researchers in Arabic text processing is that the root extraction techniques reported in the literature are either not accessible or take a long time to reproduce, while no labeled benchmark Arabic text corpus is fully available online. Also, few machine learning techniques have so far been investigated for Arabic, and those studies chose the usual preprocessing steps before classification. These challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques. The results show that the proposed algorithm for handling irregular words in Arabic improved the performance of all implemented root extraction techniques. The performance of this algorithm is evaluated in terms of accuracy improvement and execution time. Its efficiency is investigated with different document lengths and is found empirically to be linear in time for document lengths of less than about 8,000. Among the implemented root extraction methods, the rule-based technique improves the most when the irregular-case handling algorithm is included. 
This thesis confirms that choosing roots or stems instead of words in document representations significantly improves single-label classification performance for most of the classifiers used. However, extending such representations with their respective phrases yields no significant improvement in single-label text classification performance. Many classifiers, such as the ripple-down rule classifier, had not previously been tested for Arabic. The comparison of the classifiers' performance concludes that the Bayesian network classifier is significantly the best in terms of accuracy, training time, and root mean square error for all proposed and implemented representations. 492.7
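As a point of reference for the weighting scheme mentioned in entry 118, the following is a minimal sketch of the standard TF-IDF formulation, not the variant the thesis proposes (the abstract does not describe that variant). The function name `tf_idf` and the toy token lists are assumptions made for illustration; the tokens stand in for words, stems or roots.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (standard TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Weight = relative term frequency * log(N / document frequency).
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus; each inner list is one tokenized document.
docs = [["ktb", "drs", "ktb"], ["drs", "elm"], ["ktb", "elm", "elm"]]
weights = tf_idf(docs)
```

A term that occurs in every document receives a weight of zero under this formulation, which is why the choice of representation unit (word, stem or root) changes which terms remain discriminative.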
119	[en] SQLLOMINING: FINDING LEARNING OBJECTS USING MACHINE LEARNING METHODS / [pt] SQLLOMINING: OBTENÇÃO DE OBJETOS DE APRENDIZAGEM UTILIZANDO TÉCNICAS DE APRENDIZADO DE MÁQUINA SUSANA ROSICH SOARES VELLOSO 04 December 2007 (has links) [en] Learning Objects (LOs) are pieces of instructional material, such as texts, that can be reused in the composition of larger objects such as classes or courses. One difficulty in reusing LOs is discovering them in their original contexts or text documents, such as books and articles. This work presents a process for obtaining LOs that starts by extracting, transforming and loading a text database and then clusters these texts using a machine learning method that combines EM (Expectation-Maximization) and a Bayesian classifier. The process was implemented in a system called SQLLOMining, which uses SQL as its programming language and text mining techniques in the search for LOs. [en] MACHINE LEARNING [en] ONTOLOGY [en] DATABASE [en] E-LEARNING [en] LEARNING OBJECTS [en] TEXT CLASSIFICATION
120	[en] A STUDY OF MULTILABEL TEXT CLASSIFICATION ALGORITHMS USING NAIVE-BAYES / [pt] UM ESTUDO DE ALGORITMOS PARA CLASSIFICAÇÃO AUTOMÁTICA DE TEXTOS UTILIZANDO NAIVE-BAYES DAVID STEINBRUCH 12 March 2007 (has links) [en] The amount of electronic information has been growing rapidly, driven mainly by the ease of publication and dissemination that the Internet provides. It is therefore necessary to organise information in a way that facilitates its retrieval. Many works have addressed this problem through automatic text classification, associating several labels with each text (multilabel classification). However, those works transform the problem into binary classification subproblems, assuming that the categories are independent. Moreover, they use thresholds that are very specific to the training set used and so do not generalise well in the learning process. This thesis proposes two text classification algorithms based on the multinomial naive Bayes algorithm, and their use in an on-line text classification environment with user relevance feedback. To test the efficiency of the proposed algorithms, experiments were performed on the Reuters 21578 news collection and on the Ohsumed medical document collection. [en] MACHINE LEARNING [en] INTERNET [en] TEXT CATEGORIZATION [en] TEXT CLASSIFICATION [en] MULTILABEL [en] NAIVE-BAYES
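For readers unfamiliar with the baseline named in entry 120, here is a minimal sketch of a plain single-label multinomial naive Bayes classifier with add-one (Laplace) smoothing. It is not either of the two multilabel algorithms the thesis proposes, which the abstract does not specify. The function names `train` and `predict` and the toy training data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Collect class counts and per-class term counts from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log-posterior for the given tokens."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        total = sum(term_counts[label].values())
        score = math.log(n / n_docs)  # log prior
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                # Add-one smoothed log likelihood of the term given the class.
                score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, loosely echoing news and medical topics.
training = [
    (["oil", "price", "barrel"], "energy"),
    (["oil", "export"], "energy"),
    (["patient", "dose", "trial"], "medicine"),
    (["patient", "trial"], "medicine"),
]
model = train(training)
```

This sketch returns exactly one label per document. The multilabel setting discussed in the abstract needs a rule for assigning several labels, and that rule is where the thresholds the thesis criticises usually come in.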
