31.
Disaster tweet classification using parts-of-speech tags: a domain adaptation approach / Robinson, Tyler (January 1900)
Master of Science / Department of Computer Science / Doina Caragea / Twitter is one of the most active social media sites today. Almost everyone uses it, as it is a medium by which people stay in touch and inform others about events in their lives. Among many other types of events, people tweet about disaster events. Both man-made and natural disasters, unfortunately, occur all the time. When these tragedies transpire, people tend to cope in their own ways. One of the most popular ways people convey their feelings towards disaster events is by offering or asking for support, providing valuable information about the disaster, and voicing their disapproval towards those who may be the cause. However, not all of the tweets posted during a disaster are guaranteed to be useful or informative to authorities or to the general public. As the number of tweets posted during a disaster can reach the hundreds of thousands, it is necessary to automatically distinguish tweets that provide useful information from those that do not.
Manual annotation cannot scale up to the large number of tweets, as it takes significant time and effort, which makes it unsuitable for real-time disaster tweet annotation. Alternatively, supervised machine learning has traditionally been used to learn classifiers that can quickly annotate new, unseen tweets. But supervised machine learning algorithms make use of labeled training data from the disaster of interest, which is presumably not available for a current target disaster. However, it is reasonable to assume that some amount of labeled data is available for a prior source disaster. Therefore, domain adaptation algorithms that make use of labeled data from a source disaster to learn classifiers for the target disaster provide a promising direction in the area of tweet classification for disaster management. In prior work, domain adaptation algorithms have been trained on tweets represented as bag-of-words. In this research, I studied the effect of part-of-speech (POS) tag unigrams and bigrams on the performance of the domain adaptation classifiers. Specifically, I used POS tag unigram and bigram features in conjunction with a Naive Bayes domain adaptation algorithm to learn classifiers from source labeled data together with target unlabeled data, and subsequently used the resulting classifiers to classify target disaster tweets. The main research question addressed through this work was whether POS tags can help improve the performance of classifiers learned from tweet bag-of-words representations alone. Experimental results have shown that POS tags can improve the performance of the classifiers learned from words only, but not always. Furthermore, the results show that POS tag bigrams contain more information than POS tag unigrams, as the classifiers learned from bigrams perform better than those learned from unigrams.
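The feature representation studied in this abstract can be sketched in a few lines. This is an illustrative sketch only: the sample tweet, the Penn Treebank-style tags, and the feature-name prefixes are assumptions, and the Naive Bayes domain adaptation learner the thesis pairs these features with is not reproduced here.

```python
from collections import Counter

def tweet_features(tokens, pos_tags):
    """Combine word unigrams (bag-of-words) with POS tag unigrams
    and POS tag bigrams, the three feature types compared above."""
    feats = Counter(w.lower() for w in tokens)                # bag-of-words
    feats.update(f"POS:{t}" for t in pos_tags)                # POS unigrams
    feats.update(f"POS:{a}_{b}"                               # POS bigrams
                 for a, b in zip(pos_tags, pos_tags[1:]))
    return feats

# hypothetical tweet, already tokenised and POS-tagged upstream
toks = ["Floods", "hit", "the", "city"]
tags = ["NNS", "VBD", "DT", "NN"]
print(tweet_features(toks, tags)["POS:VBD_DT"])  # 1
```

The resulting counts can be fed to any bag-of-features classifier; the bigram features are what gives the tag-order information the abstract credits with the better performance.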
32.
Using dated training sets for classifying recent news articles with Naive Bayes and Support Vector Machines: An experiment comparing the accuracy of classifications using test sets from 2005 and 2017 / Rydberg, Filip; Tornfors, Jonas (January 2017)
Text categorisation is an important technique for organising text data and making it easier to find information on the world wide web. The categorisation of text data can be done with machine learning classifiers, which need to be trained with data in order to predict results for future input. The authors investigated how accurately two classifiers categorise recent news articles when the classifier models are trained on older news articles. To this end, the authors chose the Naive Bayes and Support Vector Machine classifiers and conducted an experiment: models of both classifiers were trained on news articles from 2005 and tested on news articles from both 2005 and 2017, and the results were compared. The results showed that both classifiers did considerably worse when classifying the news articles from 2017 than when classifying the news articles from the same year as the training data.
33.
Classifying receipts or invoices from images based on text extraction / Kaci, Iuliia (January 2016)
Nowadays, most documents are stored in electronic form, and there is a high demand to organize and categorize them efficiently. The field of automated text classification has therefore gained significant attention from both science and industry. The technology has been applied to information retrieval, information filtering, news classification, and more. The goal of this project is the automated classification of photos as invoices or receipts in Visma Mobile Scanner, based on previously extracted text. Firstly, several OCR tools available on the market were evaluated in order to find the most accurate one for the text extraction; this turned out to be ABBYY FineReader. The machine learning tool WEKA was used for the text classification, with a focus on the Naïve Bayes classifier. Since the Naïve Bayes implementation provided by WEKA does not support some advances in the text classification field, such as N-grams and Laplace smoothing, an improved version of the Naïve Bayes classifier, specialized for text classification and for the invoice/receipt task, was implemented. Improving the Naïve Bayes classifier, investigating how it can be adapted to the problem domain, and evaluating the obtained classification accuracy against the generic Naïve Bayes are the main parts of this research. Experimental results show that the specialized Naïve Bayes classifier has the highest accuracy. By applying the Fixed penalty feature, the best result of 95.6522% accuracy in cross-validation mode was achieved. With more accurate text extraction, the accuracy is even higher.
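One of the additions the abstract names, add-one (Laplace) smoothing in a multinomial Naive Bayes, can be sketched as follows. This is a generic textbook implementation, not the thesis code; the token lists and class names are invented for illustration.

```python
import math
from collections import Counter

class LaplaceNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing,
    so unseen words never zero out a class score."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
            self.vocab.update(doc)
        self.total = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        V = len(self.vocab)
        def score(c):
            s = math.log(self.prior[c])
            for w in doc:
                # +1 in the numerator and +V in the denominator: Laplace smoothing
                s += math.log((self.counts[c][w] + 1) / (self.total[c] + V))
            return s
        return max(self.classes, key=score)

# hypothetical token lists, as if extracted by OCR from two document types
train = [["invoice", "number", "due"], ["invoice", "total", "due"],
         ["receipt", "cash", "total"], ["receipt", "change", "cash"]]
labels = ["invoice", "invoice", "receipt", "receipt"]
nb = LaplaceNB().fit(train, labels)
print(nb.predict(["invoice", "due"]))  # invoice
```

N-gram support, the other improvement mentioned, amounts to feeding token n-grams instead of single tokens into the same counts.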
34.
Production planning of combined heat and power plants with regards to electricity price spikes: A machine learning approach / Fransson, Nathalie (January 2017)
District heating systems could help manage the expected increase in volatility on the Nordic electricity market by starting a combined heat and power (CHP) production plant instead of a heat-only production plant when electricity prices are expected to be high. Fortum Värme is interested in adjusting the production planning of their district heating system more towards high electricity prices, and their system includes a peak load CHP unit that could be utilised for this purpose. The economic potential of starting the CHP unit, instead of a heat-only production unit, when profitable was approximated for 2013-2016. Three machine learning classification algorithms, Support Vector Machine (SVM), Naive Bayes, and an ensemble of decision trees, were implemented and compared with the purpose of predicting price spikes in price area SE3, where Fortum Värme operates, and of assisting production planning. The results show that the SVM model achieved the highest performance and could be useful in planning production towards high electricity prices. The results also show a potential profit from adjusting production planning, a potential that might increase if the electricity market becomes more volatile.
35.
Analýza sentimentu zákaznických recenzí / Sentiment Analysis of Customer Reviews / Hrabák, Jan (January 2016)
This thesis is focused on sentiment analysis of unstructured text and its practical application to real data downloaded from the website Yelp.com. The objective of the theoretical part of this thesis is to sum up the information related to the history, methods, and possible applications of sentiment analysis. The reader is acquainted with the important terms and processes of sentiment analysis. The theoretical part focuses on the Naive Bayes classifier, which is then used in the practical part of this thesis. The practical part gives a detailed description of the data set and of the construction and testing of the model. Finally, the pros and cons of the chosen model are presented, together with some possibilities for its usage.
36.
Modelos probabilísticos e não probabilísticos de classificação binária para pacientes com ou sem demência como auxílio na prática clínica em geriatria / Probabilistic and non-probabilistic binary classification models for patients with or without dementia as an aid to clinical practice in geriatrics / Galdino, Maicon Vinícius (January 2020)
Advisor: Liciana Vaz de Arruda Silveira / Abstract: The objectives of this work were to present classification models (Logistic Regression, Naive Bayes, Classification Trees, Random Forest, k-Nearest Neighbours, and Artificial Neural Networks) and to compare them using resampling procedures on a data set from the field of geriatrics (dementia diagnosis), analysing the assumptions of each methodology, its advantages and disadvantages, and the scenarios in which each methodology is best applied. The justification and relevance of this project rest on the importance and usefulness of the proposed topic: as the elderly population grows throughout the world (in developed countries and in developing countries such as Brazil), classification models can be useful to medical professionals, especially general practitioners, in the diagnosis of dementias, since in many situations the diagnosis is not simple. / Doctorate
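The resampling-based comparison of classifiers mentioned above can be sketched as a plain k-fold cross-validation loop. The fold splitter, the majority-class baseline, and the toy labels below are assumptions for illustration, not the study's code; any of the six models listed would be plugged in via the `fit`/`predict` pair.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Mean held-out accuracy over k folds: the resampling comparison."""
    scores = []
    for test_idx in kfold_indices(len(X), k):
        test = set(test_idx)
        Xtr = [x for i, x in enumerate(X) if i not in test]
        ytr = [v for i, v in enumerate(y) if i not in test]
        model = fit(Xtr, ytr)
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# toy check with a majority-class baseline (illustrative only)
fit = lambda X, y: max(set(y), key=y.count)   # "model" = most frequent label
predict = lambda m, x: m
X = list(range(10)); y = ["dementia"] * 7 + ["control"] * 3
print(round(cross_val_accuracy(fit, predict, X, y, k=5), 2))  # 0.7
```

Running each candidate model through the same folds and comparing the mean accuracies is the comparison procedure the abstract refers to.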
37.
An Automated Digital Analysis of Depictions of Child Maltreatment in Ancient Roman Writings / Browne, Alexander (January 2019)
Historians, mostly engaging with written evidence, have argued that the Christianisation of the Roman Empire resulted in changes in both attitudes and behaviour towards children, resulting in a decrease in their maltreatment by society. I begin with a working hypothesis that this attitude change was real and resulted in a reduction in the maltreatment of children, and that this reduction is evident in the literature. The approach to investigating this hypothesis belongs to the emerging field of digital humanities: using programming techniques developed in the field of sentiment analysis, I create two sentiment-analysis-like tools, one a lexicon-based approach, the other an application of a naive Bayes machine learning approach. The latter is favoured as more accurate. The tool is used to automatically tag sentences that mention children, extracted from a corpus of texts written between 100 B.C. and 600 A.D., according to whether or not they feature the maltreatment of children. The results are then quantitatively analysed with reference to the year in which each text was written, with no statistically significant result found. However, the high accuracy of the tool in tagging sentences, at above 88%, suggests that similar tools may be able to play an important role, alongside traditional research techniques, in historical and social-science research in the future.
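The first of the two tools, the lexicon-based approach, can be sketched as a simple term-matching tagger. The lexicon entries below are invented English placeholders, not the Latin lexicon used in the thesis, and the threshold rule is an assumption.

```python
# Hypothetical maltreatment lexicon; the actual tool targeted Latin texts.
LEXICON = {"beat", "abandon", "starve", "sell"}

def tag_sentence(tokens, threshold=1):
    """Lexicon-based tagging: flag a sentence as depicting maltreatment
    when it contains at least `threshold` lexicon terms."""
    hits = sum(1 for t in tokens if t.lower() in LEXICON)
    return hits >= threshold

print(tag_sentence(["The", "master", "beat", "the", "child"]))  # True
print(tag_sentence(["The", "child", "played", "happily"]))      # False
```

The naive Bayes tool, which the thesis found more accurate, replaces this fixed word list with per-class word probabilities learned from manually tagged sentences.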
38.
Data Analysis of Minimally-Structured Heterogeneous Logs: An experimental study of log template extraction and anomaly detection based on Recurrent Neural Network and Naive Bayes / Liu, Chang (January 2016)
Nowadays, the ideas of continuous integration and continuous delivery are in heavy use in order to achieve rapid software development and quick product delivery to customers with good quality. In modern software development, the testing stage has always been of great significance, ensuring that the delivered software meets all the requirements with high quality, maintainability, sustainability, scalability, etc. The key assignment of software testing is to find bugs in every test and resolve them. The developers and test engineers at Ericsson, who work on a large-scale software architecture, rely mainly on the logs generated during testing, which contain important information regarding system behavior and software status, to debug the software. However, the volume of this data is too large, and its variety too complex and unpredictable, so manually locating and resolving bugs in such a vast amount of log data is very time-consuming and laborious. The objective of this thesis project is to explore a way to conduct log analysis efficiently and effectively by applying relevant machine learning algorithms, in order to help people quickly detect test failures and their possible causes. In this project, a method for preprocessing and clustering original logs is designed and implemented in order to obtain useful data that can be fed to machine learning algorithms. A comparative log analysis, based on two machine learning algorithms, Recurrent Neural Network and Naive Bayes, is conducted for locating system failures and anomalies. Finally, relevant experimental results are provided and analyzed.
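One common way to preprocess and cluster raw logs, in the spirit of the template extraction described above, is to mask the variable fields of each line and group lines that share a template. The regular expressions and sample log lines below are illustrative assumptions, not Ericsson's log format or the thesis implementation.

```python
import re

def log_template(line):
    """Reduce a raw log line to a template by masking variable fields
    (hex ids, paths, numbers); lines with identical templates were
    most likely produced by the same logging statement."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)   # hex identifiers
    line = re.sub(r"/\S+", "<PATH>", line)            # file system paths
    line = re.sub(r"\d+", "<NUM>", line)              # timestamps, codes
    return line.strip()

logs = [
    "2016-03-01 12:00:01 open /var/log/a.txt failed code 5",
    "2016-03-02 09:10:44 open /tmp/b.txt failed code 17",
]
templates = {log_template(l) for l in logs}
print(sorted(templates)[0])
# <NUM>-<NUM>-<NUM> <NUM>:<NUM>:<NUM> open <PATH> failed code <NUM>
```

The resulting template clusters, rather than the raw lines, then become the sequence elements fed to the learning algorithms.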
39.
Taskfinder: Comparison of NLP techniques for text classification within FMCG stores / Jensen, Julius (January 2022)
Natural language processing has many important applications today, such as translation, spam filters, and other useful products. Supervised and unsupervised machine learning models have proven successful in building such applications. The most important aspect of these models is what they can achieve with different datasets. This article examines how RNN models compare with Naive Bayes in text classification. The chosen RNN models are long short-term memory (LSTM) and gated recurrent unit (GRU); both are trained using the flair framework. The models are trained on three separate datasets with different compositions, and the trend within each model is examined and compared with the other models. The results showed that Naive Bayes performed better than the RNN models at classifying short sentences, but worse on longer ones. When trained on a small dataset, LSTM and GRU obtained better results than Naive Bayes. Overall, the best performing model was Naive Bayes, which had the highest accuracy score on two out of the three datasets.
40.
Exploration of infectious disease transmission dynamics using the relative probability of direct transmission between patients / Leavitt, Sarah Van Ness (06 October 2020)
The question "who infected whom" is a perennial one in the study of infectious disease dynamics. To understand characteristics of infectious diseases, such as how many new cases one case will produce over the course of infection (the reproductive number), how much time passes between the infections of two connected cases (the generation interval), and what factors are associated with transmission, one must ascertain who infected whom. The current best practices for linking cases are contact investigations and pathogen whole genome sequencing (WGS). However, these data sources cannot perfectly link cases, are expensive to obtain, and are often not available for all cases in a study. This lack of discriminatory data limits the use of established methods in many existing infectious disease datasets.
We developed a method to estimate the relative probability of direct transmission between any two infectious disease cases. We used a subset of cases that have pathogen WGS or contact investigation data to train a model and then used demographic, spatial, clinical, and temporal data to predict the relative transmission probabilities for all case-pairs using a simple machine learning algorithm called naive Bayes. We adapted existing methods to estimate the reproductive number and generation interval to use these probabilities. Finally, we explored the associations between various covariates and transmission and how they related to the associations between covariates and pathogen genetic relatedness. We applied these methods to a tuberculosis outbreak in Hamburg, Germany and to surveillance data in Massachusetts, USA.
Through simulations we found that our estimated transmission probabilities accurately classified pairs as links and non-links and were able to accurately estimate the reproductive number and the generation interval. We also found that the association between covariates and genetic relatedness captures the direction, but not the absolute magnitude, of the association between covariates and transmission, and that the bias was reduced by using effect estimates from the naive Bayes algorithm. The methods developed in this dissertation can be used to explore transmission dynamics and estimate infectious disease parameters in established datasets where this was not previously feasible because of a lack of highly discriminatory information, and thereby expand our understanding of many infectious diseases.
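The core naive Bayes step, estimating the relative probability that a case-pair is a direct transmission link from covariate features, might be sketched as follows. The feature names and training pairs are invented for illustration; the dissertation's full method (training on WGS- or contact-linked pairs, then estimating the reproductive number and generation interval) is not reproduced here.

```python
import math
from collections import defaultdict

def train_counts(pairs, labels):
    """Count feature=True occurrences per link status, plus class totals."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for feats, linked in zip(pairs, labels):
        totals[linked] += 1
        for f, v in feats.items():
            if v:
                counts[linked][f] += 1
    return counts, totals

def link_probability(feats, counts, totals):
    """P(link | features) via Bernoulli naive Bayes with add-one smoothing."""
    n = totals[True] + totals[False]
    logp = {}
    for c in (True, False):
        s = math.log(totals[c] / n)
        for f, v in feats.items():
            p_true = (counts[c][f] + 1) / (totals[c] + 2)
            s += math.log(p_true if v else 1 - p_true)
        logp[c] = s
    m = max(logp.values())            # normalise in log space for stability
    num = math.exp(logp[True] - m)
    return num / (num + math.exp(logp[False] - m))

# hypothetical labeled subset: covariate agreement -> linked or not
pairs = [{"same_county": True,  "close_in_time": True},
         {"same_county": True,  "close_in_time": True},
         {"same_county": False, "close_in_time": True},
         {"same_county": False, "close_in_time": False},
         {"same_county": False, "close_in_time": False}]
labels = [True, True, False, False, False]
counts, totals = train_counts(pairs, labels)
p = link_probability({"same_county": True, "close_in_time": True}, counts, totals)
print(round(p, 2))  # 0.82
```

Scoring every case-pair with this probability is what lets the downstream estimators of the reproductive number and generation interval work without definitive links.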