101

Classificação supervisionada com programação probabilística

Lucena, Danilo Carlos Gouveia de 10 February 2014 (has links)
Probabilistic inference mechanisms lie at the intersection of three main areas: statistics, programming languages, and probability. These mechanisms are used to create probabilistic models and help in treating uncertainty. Probabilistic programming languages support the high-level description of such models; they ease development by abstracting away the lower-level inference mechanisms, allow code reuse, and aid in analyzing results. This study analyzes the inference engines implemented by probabilistic programming languages and presents a case study of a supervised text classifier built with probabilistic programming.
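For orientation, here is a minimal sketch of the kind of model such a language expresses: a supervised multinomial naive Bayes text classifier with a Dirichlet smoothing prior, written directly in NumPy rather than in a probabilistic programming language. The toy corpus is hypothetical, and this is not the thesis's implementation.

```python
# A minimal Bayesian-flavored supervised text classifier: multinomial naive
# Bayes with Dirichlet (add-alpha) smoothing. A plain-NumPy sketch of the kind
# of model a probabilistic programming language lets you state declaratively.
import numpy as np
from collections import Counter

def train(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: list of class names."""
    classes = sorted(set(labels))
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(classes), len(vocab)))
    prior = np.zeros(len(classes))
    for d, y in zip(docs, labels):
        c = classes.index(y)
        prior[c] += 1
        for w, n in Counter(d).items():
            counts[c, widx[w]] += n
    # Posterior mean of word probabilities under a symmetric Dirichlet prior.
    smoothed = counts + alpha
    word_logp = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    class_logp = np.log(prior) - np.log(prior.sum())
    return classes, widx, class_logp, word_logp

def predict(doc, model):
    classes, widx, class_logp, word_logp = model
    score = class_logp.copy()
    for w in doc:
        if w in widx:                     # ignore out-of-vocabulary words
            score += word_logp[:, widx[w]]
    return classes[int(np.argmax(score))]

# Hypothetical toy corpus.
docs = [["bom", "filme"], ["filme", "ruim"], ["muito", "bom"], ["muito", "ruim"]]
labels = ["pos", "neg", "pos", "neg"]
model = train(docs, labels)
print(predict(["filme", "muito", "bom"], model))   # expected: "pos"
```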
102

A novel approach to text classification

Zechner, Niklas January 2017 (has links)
This thesis explores the foundations of text classification, using both empirical and deductive methods, with a focus on author identification and syntactic methods. We strive for a thorough theoretical understanding of what affects the effectiveness of classification in general.

To begin with, we systematically investigate the effects of some parameters on the accuracy of author identification. How is the accuracy affected by the number of candidate authors, and by the amount of data per candidate? Are there differences in how methods react to changes in these parameters? Using the same techniques, we see indications that methods previously thought to be topic-independent might not be, but that syntactic methods may be the best option for avoiding topic dependence. This means that previous studies may have overestimated the power of lexical methods. We also briefly look for ways of spotting which particular features might be the most effective for classification. Apart from author identification, we apply similar methods to identifying properties of the author, including age and gender, and attempt to estimate the number of distinct authors in a text sample. In all cases, the techniques prove viable if not overwhelmingly accurate, and we see that lexical and syntactic methods give very similar results.

In the final parts, we present some results from automata theory that can be of use for syntactic analysis and classification. First, we generalise a known algorithm for listing the best-ranked strings according to a weighted automaton so that it lists the best-ranked trees according to a weighted tree automaton. This result can help speed up parsing, which often runs in several steps, where each step needs several trees from the previous step as input. Second, we use a compressed version of deterministic finite automata, known as failure automata, and prove that finding the optimal compression is NP-complete, but that there are efficient algorithms for finding good approximations. Third, we derive and prove the derivatives of regular expressions with cuts. Derivatives are an operation on expressions that calculates the remaining expression after reading a given symbol, and cuts are an extension to regular expressions found in many programming languages. Together, these findings may improve the syntactic analysis that we have seen is a valuable tool for text classification.
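The derivative construction mentioned in the abstract is easiest to see on plain regular expressions. The following sketch implements standard Brzozowski derivatives in Python, without the cut operator the thesis extends them to: `nullable(r)` tests whether `r` accepts the empty string, and repeated derivation yields a matcher.

```python
# Standard Brzozowski derivatives of regular expressions. The derivative of r
# with respect to symbol a matches the rest of any r-match beginning with a.
from dataclasses import dataclass

class Re:
    pass

@dataclass(frozen=True)
class Empty(Re):          # matches nothing
    pass

@dataclass(frozen=True)
class Eps(Re):            # matches only the empty string
    pass

@dataclass(frozen=True)
class Sym(Re):            # matches a single symbol
    c: str

@dataclass(frozen=True)
class Alt(Re):            # union r | s
    l: Re
    r: Re

@dataclass(frozen=True)
class Cat(Re):            # concatenation r s
    l: Re
    r: Re

@dataclass(frozen=True)
class Star(Re):           # Kleene star r*
    r: Re

def nullable(r):
    """Does r accept the empty string?"""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.l) or nullable(r.r)
    if isinstance(r, Cat):
        return nullable(r.l) and nullable(r.r)
    return False

def deriv(r, a):
    """The Brzozowski derivative of r with respect to symbol a."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Sym):
        return Eps() if r.c == a else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.l, a), deriv(r.r, a))
    if isinstance(r, Star):
        return Cat(deriv(r.r, a), r)
    # Cat: derive the head; if the head is nullable, also derive the tail.
    d = Cat(deriv(r.l, a), r.r)
    return Alt(d, deriv(r.r, a)) if nullable(r.l) else d

def matches(r, s):
    for a in s:
        r = deriv(r, a)
    return nullable(r)

r = Star(Cat(Sym("a"), Sym("b")))              # (ab)*
print(matches(r, "abab"), matches(r, "aba"))   # True False
```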
103

DETECTION OF EMERGING DISRUPTIVE FIELDS USING ABSTRACTS OF SCIENTIFIC ARTICLES

Vorgianitis, Georgios January 2017 (has links)
With the significant advancements taking place over the last three decades in the field of Information Technology (IT), we are witnessing an era unprecedented by the standards mankind was used to for centuries. Having access to a huge amount of data almost instantly entails certain advantages. One of these is the ability to observe in which segments of their expertise scientists focus their research. That kind of knowledge, if properly appraised, could hold the key to explaining the new directions of the applied sciences and thus help construct a "map" of the future developments coming from the Research and Development labs of industries worldwide. Though the above statement may be considered too "futuristic", there have already been documented attempts in the literature that have fruitfully used vast amounts of scientific data to outline future scientific trends and thus scientific discoveries. The purpose of this research is to take a pioneering method of modeling text corpora that has already been used to map the history of scientific discovery, Latent Dirichlet Allocation (LDA), and evaluate its usability for detecting emerging research trends using only the "Abstracts" from a collection of scientific articles. To do that, an experimental set is utilized and the process is repeated over three experimental runs. The results, although not the ones that would validate the hypothesis, show that with certain improvements to the processing the hypothesis could be confirmed.
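As an illustration of the modeling step described above, here is a minimal LDA sketch using scikit-learn; the four-abstract corpus and the parameter choices are hypothetical stand-ins, not the thesis's experimental setup.

```python
# A hedged sketch of the LDA pipeline: fit a topic model on article abstracts
# only, then inspect the top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "deep learning methods for image recognition",
    "convolutional networks improve object detection",
    "gene expression analysis in cancer cells",
    "sequencing reveals mutations in tumor genomes",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)          # per-document topic mixtures

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = comp.argsort()[-5:][::-1]        # five highest-weight words
    print(f"topic {k}:", [terms[i] for i in top])
```

Tracking how the weight of each topic shifts across publication years is then one way to look for emerging trends.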
104

Trajectory-based methods to predict user churn in online health communities

Joshi, Apoorva 01 May 2018 (has links)
Online Health Communities (OHCs) have positively disrupted the modern global healthcare system, as patients and caregivers interact online with similar peers to improve their quality of life. Social support is the pillar of OHCs and, hence, analyzing the different types of social support activities contributes to a better understanding and prediction of future user engagement in OHCs. This thesis used data from a popular OHC, Breastcancer.org, to first classify user posts into the different categories of social support, using Word2Vec for language processing; six different classifiers were explored, and Random Forest proved the best approach for classifying the user posts. This exercise helped identify the different types of social support activities that users participate in and detect the most common type of social support activity among users in the community. Thereafter, three trajectory-based methods were proposed and implemented to predict user churn (attrition) from the OHC. Comparison of the proposed trajectory-based methods with two non-trajectory-based benchmark methods established that user trajectories, which represent the month-to-month change in the type of a user's social support activity, are effective pointers to user churn from the community. The results and findings from this thesis could help OHC managers better understand the needs of users in the community and take the steps necessary to improve user retention and community management.
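A minimal sketch of the first step described above, assuming averaged Word2Vec vectors as post features and a Random Forest over social-support categories; the posts and labels are invented examples, not data from Breastcancer.org.

```python
# Represent each post by the average of its Word2Vec token vectors, then
# train a Random Forest over hypothetical social-support categories.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier

posts = [["sending", "hugs", "and", "prayers"],
         ["ask", "your", "oncologist", "about", "dosage"],
         ["you", "are", "so", "strong"],
         ["this", "drug", "caused", "nausea", "for", "me"]]
labels = ["emotional", "informational", "emotional", "informational"]

w2v = Word2Vec(posts, vector_size=50, min_count=1, seed=0)

def post_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([post_vector(p) for p in posts])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([post_vector(["thinking", "of", "you"])]))
```

The monthly sequence of predicted categories per user is then the trajectory the churn-prediction methods operate on.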
105

DECEPTIVE REVIEW IDENTIFICATION VIA REVIEWER NETWORK REPRESENTATION LEARNING

Shih-Feng Yang (11502553) 19 December 2021 (has links)
With the growth in popularity of e-commerce and mobile apps during the past decade, people rely on online reviews more than ever before when purchasing products, booking hotels, and choosing all kinds of services. Users share their opinions by posting product reviews on merchant sites or online review websites (e.g., Yelp, Amazon, TripAdvisor). Although online reviews are valuable information for people interested in products and services, many reviews are manipulated by spammers to provide untruthful information for business competition. Since deceptive reviews can damage the reputation of brands and mislead customers' buying behavior, the identification of fake reviews has become an important topic for online merchants. Among the computational approaches proposed for fake review identification, network-based fake review analysis jointly considers information from review text, reviewer behaviors, and product information. Researchers have proposed network-based methods (e.g., metapath) on heterogeneous networks, which have shown promising results.

However, we identified two research gaps. 1) We argue that previous network-based reviewer representations are not sufficient to preserve the relationships of reviewers in networks. Specifically, previous studies only considered first-order proximity, which indicates the observable connection between reviewers, but not second-order proximity, which captures the neighborhood structures where two vertices overlap. Moreover, although previous network-based fake review studies (e.g., metapath) connect reviewers through feature nodes across heterogeneous networks, they ignored the multi-view nature of reviewers. A view is derived from a single type of proximity or relationship between the nodes, which can be characterized by a set of edges; in other words, reviewers can form different networks with regard to different relationships. 2) The text embeddings of reviews in previous network-based fake review studies were not considered together with reviewer embeddings.

To tackle the first gap, we generated reviewer embeddings via MVE (Qu et al., 2017), a framework for multi-view network representation learning, and conducted spammer classification experiments to examine the effectiveness of the learned embeddings for distinguishing spammers from non-spammers. In addition, we performed unsupervised hierarchical clustering to observe the clusters of the reviewer embeddings. Our results show that the clusters generated from reviewer embeddings capture the difference between spammers and non-spammers better than those generated from reviewers' features.

To fill the second gap, we proposed hybrid embeddings that combine review text embeddings with reviewer embeddings (i.e., the vector that represents a reviewer's characteristics, such as writing or behavioral patterns). We conducted fake review classification experiments to compare the performance of hybrid (text+reviewer) embeddings as features against text-only embeddings. Our results suggest that hybrid embeddings are more effective than text-only embeddings for fake review identification. Moreover, we compared the prediction performance of the hybrid embeddings with baselines and showed that our approach outperformed them in fake review identification experiments.

The contributions of this study are four-fold: 1) We adopted a multi-view representation learning approach for reviewer embedding learning and analyzed the efficacy of the embeddings for spammer classification and fake review classification. 2) We proposed hybrid embeddings that consider the characteristics of both the review text and the reviewer. Our results are promising and suggest that hybrid embeddings are very effective for fake review identification. 3) We proposed a heuristic network construction approach that builds a user network based on user features. 4) We evaluated how different spammer thresholds impact the performance of fake review classification. Several studies have used the same datasets as this study, but most followed the spammer definition of Jindal and Liu (2008). We argued that the spammer definition should be configurable for different datasets. Our findings showed that by carefully choosing the spammer thresholds for the target datasets, hybrid embeddings achieve higher efficacy for fake review classification.
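A minimal sketch of the hybrid-embedding idea, with random vectors standing in for the text embeddings and the MVE reviewer embeddings; only the concatenate-then-classify pattern mirrors the study.

```python
# Concatenate a review's text embedding with its reviewer's network embedding
# and feed the result to a classifier. The embeddings below are random
# stand-ins for vectors from a text encoder and an MVE-style network model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_reviews, text_dim, reviewer_dim = 200, 64, 32

text_emb = rng.normal(size=(n_reviews, text_dim))    # from a text encoder
reviewer_ids = rng.integers(0, 40, size=n_reviews)   # 40 distinct reviewers
reviewer_emb = rng.normal(size=(40, reviewer_dim))   # from network learning
y = rng.integers(0, 2, size=n_reviews)               # fake/genuine labels

# Hybrid feature per review: [text ; reviewer].
X = np.hstack([text_emb, reviewer_emb[reviewer_ids]])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```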
106

Automatic Dispatching of Issues using Machine Learning / Automatisk fördelning av ärenden genom maskininlärning

Bengtsson, Fredrik, Combler, Adam January 2019 (has links)
Many software companies use issue tracking systems to organize their work. However, when working on large projects across multiple teams, the problem arises of finding the correct team to solve a certain issue. One team might detect a problem that must be solved by another team. This can take time from employees tasked with finding the correct team, and automating the dispatching of these issues can bring large benefits to the company. In this thesis, machine learning methods, mainly convolutional neural networks (CNNs) for text classification, have been applied to this problem. For natural language processing, both word- and character-level representations are commonly used. The results in this thesis suggest that the CNN learns different information depending on whether a word- or character-level representation is used. Furthermore, it was concluded that the CNN models performed at a level similar to the classical Support Vector Machine on this task. When compared to a human expert working with dispatching issues, the best CNN model performed at a similar level when given the same information. The high throughput of a computer model therefore suggests that automation of this task is very much possible.
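A minimal sketch of a word-level CNN text classifier of the kind evaluated in the thesis, written in Keras; the vocabulary size, layer sizes, and team count are hypothetical, and the dummy data stands in for tokenized issue texts.

```python
# A small word-level CNN for routing issues to teams: embed token ids,
# convolve over positions, max-pool, and classify.
import numpy as np
import tensorflow as tf

vocab_size, seq_len, n_teams = 5000, 200, 8

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),      # pool over token positions
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_teams, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy batch of tokenized issues (integer word ids) with team labels.
X = np.random.randint(1, vocab_size, size=(32, seq_len))
y = np.random.randint(0, n_teams, size=(32,))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:1], verbose=0).argmax())   # predicted team index
```

A character-level variant would swap the word-id vocabulary for a character vocabulary and use longer sequences with smaller embeddings.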
107

Optimizing Deep Neural Networks for Classification of Short Texts

Pettersson, Fredrik January 2019 (has links)
This master's thesis investigates how a state-of-the-art (SOTA) deep neural network (NN) model can be created for a specific natural language processing (NLP) dataset, the effects of using different dimensionality reduction techniques on common pre-trained word embeddings, and how well such a model generalizes to a secondary dataset. The research is motivated by two factors. One is that the construction of a machine learning (ML) text classification (TC) model is typically done around a specific dataset and often requires a lot of manual intervention; it is therefore hard to know exactly which procedures to implement for a specific dataset and how the result will be affected. The other is that, if the dimensionality of pre-trained embedding vectors can be lowered without losing accuracy, thus saving execution time, other techniques can be applied during the time saved to achieve even higher accuracy.

A handful of deep neural network architectures are used, namely a convolutional neural network (CNN), a long short-term memory neural network (LSTM), and a bidirectional LSTM (Bi-LSTM) architecture. These architectures are combined with four different word embeddings: GoogleNews-vectors-negative300, glove.840B.300d, paragram_300_sl999 and wiki-news-300d-1M.

Three main experiments are conducted in this thesis. In the first experiment, a top-performing TC model is created for a recent NLP competition held at Kaggle.com. Each implemented procedure is benchmarked on how it affects the accuracy and execution time of the model. In the second experiment, principal component analysis (PCA) and random projection (RP) are applied to the pre-trained word embeddings used in the top-performing model to investigate how accuracy and execution time are affected when creating lower-dimensional embedding vectors. In the third experiment, the same model is benchmarked on a separate dataset (Sentiment140) to investigate how well it generalizes to other data and how each implemented procedure affects the accuracy compared to the original dataset.

The first experiment results in a bidirectional LSTM model and a combination of three embeddings concatenated together: glove, paragram and wiki-news. The model gives predictions with an F1 score of 71%, good enough to reach 9th place out of 1,401 participating teams in the competition. In the second experiment, execution time is improved by 13% by using PCA, while lowering the dimensionality of the embeddings by 66% and losing only half a percentage point of F1 score. RP gives a constant accuracy of 66-67% regardless of the projected dimensions, compared to over 70% when using PCA. In the third experiment, the model gains around 12% accuracy from the initial to the final benchmarks, compared to 19% on the competition dataset. The best accuracy achieved on the Sentiment140 dataset is 86%, higher than the 71% achieved on the Quora dataset.
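A minimal sketch of the dimensionality-reduction experiment, with a random matrix standing in for a real 300-dimensional embedding table such as glove.840B.300d; only the 300-to-100 PCA reduction pattern mirrors the thesis.

```python
# Project 300-d pre-trained word vectors down with PCA and reuse the reduced
# vectors in place of the originals.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10000, 300))   # stand-in for an embedding table

pca = PCA(n_components=100)                  # 300 -> 100 dims (~66% smaller)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)                                      # (10000, 100)
print("variance kept:", pca.explained_variance_ratio_.sum())
```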
108

Evaluation of text classification techniques for log file classification / Utvärdering av textklassificeringstekniker för klassificering avloggfiler

Olin, Per January 2020 (has links)
System log files are filled with logged events, status codes, and other messages. By analyzing the log files, the system's current state can be determined and it can be found out whether something went wrong during execution. Log file analysis has been studied for some time, with recent studies showing state-of-the-art performance using machine learning techniques. In this thesis, document classification solutions were tested on log files in order to classify regular system runs versus abnormal system runs. To solve this task, supervised and unsupervised learning methods were combined: Doc2Vec was used to extract document features, and Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) based architectures were applied to the classification task. Using these machine learning models and preprocessing techniques, the tested models yielded an F1 score and accuracy above 95% when classifying log files.
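A minimal sketch of the feature-extraction step, assuming tokenized log files and Doc2Vec document vectors; a logistic regression stands in for the CNN- and LSTM-based classifiers the thesis actually tests, and the log lines and labels are hypothetical.

```python
# Doc2Vec turns each log file into a fixed-size vector; a supervised
# classifier then separates normal from abnormal runs.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

logs = [["job", "started", "status", "200"],
        ["job", "started", "error", "timeout"],
        ["status", "200", "job", "finished"],
        ["error", "stacktrace", "exit", "1"]]
labels = [0, 1, 0, 1]                        # 0 = normal run, 1 = abnormal

corpus = [TaggedDocument(words, [i]) for i, words in enumerate(logs)]
d2v = Doc2Vec(vector_size=32, min_count=1, epochs=50, seed=0)
d2v.build_vocab(corpus)
d2v.train(corpus, total_examples=d2v.corpus_count, epochs=d2v.epochs)

X = [d2v.infer_vector(words) for words in logs]
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict([d2v.infer_vector(["error", "timeout"])]))
```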
109

Parafrasidentifiering med maskinklassificerad data : utvärdering av olika metoder / Paraphrase identification with computer classified paraphrases : An evaluation of different methods

Johansson, Oskar January 2020 (has links)
This thesis investigates how the language model BERT and a MaLSTM architecture perform at identifying paraphrases in the 'Microsoft Paraphrase Research Corpus' (MPRC) when trained on automatically identified paraphrases from the 'Paraphrase Database' (PPDB). The methods are compared against each other to determine which performs best, and the approach of training on machine-classified data for use on human-classified data is evaluated against other classification work on the same dataset. The sentence pairs used to train the models are taken from the highest-ranked paraphrases in PPDB, together with a generation method that creates non-paraphrases from the same dataset. The results show that BERT is capable of identifying some paraphrases in MPRC, while the MaLSTM architecture fails at this despite being able to distinguish paraphrases from non-paraphrases during training. Both BERT and MaLSTM performed worse at identifying paraphrases in MPRC than models such as StructBERT, which was trained and evaluated on the same dataset. Reasons why MaLSTM fails at the task are discussed; chiefly, the sentences in the non-paraphrases of the training data are too dissimilar from each other compared to how they look in MPRC. Finally, the importance of further research into how machine-derived paraphrases can be used in paraphrase-related research is discussed.
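For readers unfamiliar with the MaLSTM architecture mentioned above, here is a minimal Keras sketch after Mueller and Thyagarajan (2016): two sentence inputs share one LSTM encoder, and similarity is the exponential of the negative L1 distance between the final states. All sizes are hypothetical, and this is not the thesis's code.

```python
# MaLSTM (Manhattan LSTM) similarity model sketch.
import tensorflow as tf

vocab_size, seq_len, emb_dim, hidden = 2000, 30, 64, 50

# One embedding layer and one LSTM, shared by both sentence inputs.
embed = tf.keras.layers.Embedding(vocab_size, emb_dim)
lstm = tf.keras.layers.LSTM(hidden)

in_a = tf.keras.Input(shape=(seq_len,))
in_b = tf.keras.Input(shape=(seq_len,))
h_a = lstm(embed(in_a))
h_b = lstm(embed(in_b))

# MaLSTM similarity: exp(-||h_a - h_b||_1), which lies in (0, 1].
diff = tf.keras.layers.Subtract()([h_a, h_b])
sim = tf.keras.layers.Lambda(
    lambda d: tf.exp(-tf.reduce_sum(tf.abs(d), axis=1, keepdims=True))
)(diff)

model = tf.keras.Model([in_a, in_b], sim)
model.compile(optimizer="adam", loss="mse")   # target: 1 = paraphrase, 0 = not
model.summary()
```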
110

Predicting Political Party Affiliation in the Swedish Parliament using Natural Language Processing

Zetterberg, Johannes January 2022 (has links)
Text classification is a fundamental part of natural language processing. In this thesis, text classification methods are used in an attempt to predict the political party affiliation of members of parliament (MPs). The objective is to evaluate the performance of Support Vector Machines (SVM), naive Bayes, and a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model in predicting MPs' political party affiliation based on speeches given in the Chamber of the Swedish Parliament. This study shows that BERT outperforms SVM and naive Bayes in correctly classifying MPs, and that SVM makes better predictions than naive Bayes while performing reasonably well compared to BERT. All models predict MPs representing the Sweden Democrats most accurately. Both BERT and SVM correctly classify roughly every other speech, which is much better than random guessing. These results indicate the potential of methods for automatically classifying political speeches.
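A minimal sketch of the two classical baselines evaluated above, assuming TF-IDF features; the speeches and party labels are invented stand-ins for the parliamentary data.

```python
# TF-IDF features with a linear SVM and with multinomial naive Bayes.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

speeches = ["vi måste sänka skatterna",
            "vi behöver mer resurser till välfärden",
            "skattesänkningar gynnar tillväxten",
            "välfärden kräver höjda anslag"]
parties = ["M", "S", "M", "S"]

for clf in (LinearSVC(), MultinomialNB()):
    pipe = make_pipeline(TfidfVectorizer(), clf).fit(speeches, parties)
    print(type(clf).__name__, pipe.predict(["sänk skatterna nu"]))
```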
