71 |
Machine Learning explainability in text classification for Fake News detection
Kurasinski, Lukas, January 2020
Fake news detection has gained interest in recent years. This has led researchers to seek models that can classify text for fake news detection. While new models are developed, researchers mostly focus on a model's accuracy. Little research has been done on the explainability of Neural Network (NN) models constructed for text classification and fake news detection. When trying to add a level of explainability to a Neural Network model, a lot of different aspects have to be taken into consideration. Text length, pre-processing, and complexity play an important role in achieving successful classification. The model's architecture has to be taken into consideration as well. All these aspects are analyzed in this thesis. In this work, an analysis of attention weights is performed to give an insight into NN reasoning about texts. Visualizations are used to show how two models, a Bidirectional Long-Short Term Memory Convolutional Neural Network (BiDir-LSTM-CNN) and Bidirectional Encoder Representations from Transformers (BERT), distribute their attention while training and classifying texts. In addition, statistical data is gathered to deepen the analysis. After the analysis, it is concluded that explainability can positively influence the decisions made while constructing an NN model for text classification and fake news detection. Although explainability is useful, it is not a definitive answer to the problem. Architects should test and experiment with different solutions to be successful in effective model construction.
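As a rough illustration of the attention analysis this abstract describes, the sketch below extracts per-token attention weights from a pre-trained BERT model using the Hugging Face transformers library. The model name, the averaging over heads in the last layer, and the focus on the [CLS] token are illustrative assumptions, not the thesis's exact procedure.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

text = "Breaking: scientists confirm shocking truth the government hides."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
# Average the heads of the final layer and read off the attention paid by [CLS].
last_layer = outputs.attentions[-1].mean(dim=1)  # (batch, seq, seq)
cls_attention = last_layer[0, 0]                 # row for the [CLS] token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, weight in sorted(zip(tokens, cls_attention.tolist()),
                            key=lambda pair: -pair[1])[:10]:
    print(f"{token:15s} {weight:.4f}")
```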
|
72 |
Automatic Retail Product Identification System for Cashierless Stores
Zhong, Shiting, January 2021
The introduction of artificial intelligence techniques in the retail market is making a revolution in the shopping experience. It allows shoppers to walk into a store, grab what they want, and simply walk out without scanning barcodes or having to stand in long queues. That is what we call cashierless stores. This project aims to provide an efficient solution to automatic retail product identification. The solution presents an artifact one can use to build an end-to-end smart system for cashierless stores. Hence, a solution based on text classification is proposed to recognize and identify the products. For that, deep learning techniques such as RNNs and LSTMs are used to build the classifier. The performance of this classifier is evaluated using various metrics, and it shows its efficiency with an accuracy exceeding 86%.
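A minimal sketch of an LSTM-based product classifier along the lines the abstract suggests; the vocabulary size, sequence length, bidirectional wrapper, and number of product classes are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_PRODUCTS = 20_000, 40, 500  # assumed sizes

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Bidirectional(layers.LSTM(64)),   # LSTM over the description tokens
    layers.Dropout(0.3),
    layers.Dense(NUM_PRODUCTS, activation="softmax"),  # one class per product
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stand-in data: integer-encoded product descriptions and product-id labels.
X = np.random.randint(1, VOCAB_SIZE, size=(256, MAX_LEN))
y = np.random.randint(0, NUM_PRODUCTS, size=(256,))
model.fit(X, y, epochs=1, batch_size=32)
```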
|
73 |
Klassificering av transkriberade telefonsamtal med Support Vector Machines för ökad effektivitet inom vården / Classification of transcribed telephone calls with support vector machines for increased efficiency in healthcare
Höglind, Sanna, Sundström, Emelie, January 2019
Patientnämndens förvaltning i Stockholm tar årligen emot tusentals samtal som önskar framföra klagomål på vården i Region Stockholm. Syftet med arbetet är att undersöka hur en NLP-robot för klassificering av inkomna klagomål skulle kunna bidra till en ökad effektivitet av verksamheten. Klassificeringen av klagomålen har utförts med hjälp av en metod baserad på Support Vector Machines. För att optimera modellens korrekthet undersöktes hur längden av ordvektorerna påverkar korrektheten. Modellen gav en slutgiltig korrekthet 53,10 %. Detta resultat analyserades sedan med målsättningen att identifiera potentiella förbättringsmöjligheter hos modellen. För framtida arbeten kan det därför vara intressant att undersöka hur antalet samtal, antalet personer som spelar in samtal och klassfördelningen i datamängden påverkar korrektheten. För att undersöka hur effektiviteten hos Patientnämndens förvaltning i Stockholm skulle påverkas av implementeringen av en NLP-robot användes en SWOT-analys. Denna analys visade på tydliga fördelar med automatisering av klagomålshanteringen, men att en sådan implementation måste ske med försiktighet där det säkerställs att tillgången på kompetens är tillräcklig för att förebygga potentiella hot. / Every year Patientnämnden receives thousands of phone calls from patients wishing to make complaints about the healthcare in Stockholm. The aim of this work is to investigate how an NLP-robot for classification of received complaints would contribute to increased efficiency of the operation. The classification of the complaints has been made using a method based on Support Vector Machines. In order to optimize the accuracy of the model, the impact of the length of the word vectors has been investigated. The final result was an accuracy of 53.10%. The result was analyzed with the goal of identifying potential opportunities for improvement of the model. For future work it could therefore be interesting to investigate how the number of calls, the number of people recording the calls, and the distribution between the classes affect the accuracy. A SWOT analysis was performed in order to investigate how the efficiency of Patientnämnden would be affected by the implementation of an NLP-robot. The analysis showed apparent benefits of automating complaint management, but also that such an implementation must be done with great caution in order to ensure that the available competence is high enough to prevent potential threats.
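A compact sketch of the kind of SVM-based complaint classifier the thesis describes, using scikit-learn; the toy complaint categories and the TF-IDF features are illustrative assumptions (the thesis specifically studies how word-vector length affects accuracy).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical complaint snippets and categories; the real data is transcribed calls.
texts = ["long wait for an appointment", "billing error on the invoice",
         "rude treatment at the reception", "wrong medication was prescribed"] * 25
labels = ["access", "billing", "treatment", "patient safety"] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(X_train, y_train)
print(f"accuracy: {accuracy_score(y_test, clf.predict(X_test)):.2%}")
```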
|
74 |
Predicting SNI Codes from Company Descriptions: A Machine Learning Solution
Lindholm, Erik, Nilsson, Jonas, January 2023
This study aims to develop an automated solution for assigning industry codes to businesses based on the contents of their business descriptions. The Swedish Standard Industrial Classification (SNI) is a system used by Statistics Sweden (SCB) to categorize businesses for its statistics reports. Assignment of SNI codes has so far been done manually by the person registering a new company, but this is a far from optimal solution. Some of the 88 main groups of industry are hard to tell apart from one another, and this often leads to incorrect assignments. Our approach to this problem was to train machine learning models using the Naive Bayes and SVM classifier algorithms and conduct an experiment. In 2019, Dahlqvist and Strandlund attempted this and reached an accuracy score of 52 percent using a gradient boosting classifier, but this was considered too low for real-world implementation. Our main goal was to achieve a higher accuracy than that of Dahlqvist and Strandlund, which we eventually succeeded in: our best-performing SVM model reached a score of 60.11 percent. Like Dahlqvist and Strandlund, we concluded that the low quality of the dataset was the main obstacle to achieving higher scores. The dataset we used was severely imbalanced, and much time was spent investigating and applying oversampling and undersampling as strategies for mitigating this problem. However, we found during the testing phase that none of these strategies had any positive effect on the accuracy scores.
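The oversampling and undersampling strategies the authors tested can be sketched with the imbalanced-learn library; the synthetic data and sampler choices below are assumptions for illustration, not the study's setup.

```python
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Imitate a skewed label distribution like the SNI main groups.
X, y = make_classification(n_samples=1000, n_classes=3,
                           weights=[0.80, 0.15, 0.05],
                           n_informative=5, random_state=0)
print("original:    ", Counter(y))

# Oversampling duplicates minority-class examples ...
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled: ", Counter(y_over))

# ... while undersampling throws away majority-class examples.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_under))
```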
|
75 |
NOVA: Automated Detection of Violent Threats in Swedish Online Environments
Lindén, Kevin, Moshfegh, Arvin, January 2023
Social media and online environments have become an integral part of society, allowing for self-expression, information sharing, and discussions online. However, these platforms are also used to express hate and threats of violence. Violent threats online lead to negative consequences, such as an unsafe online environment, self-censorship, and endangering democracy. Manually detecting and moderating threats online is challenging due to the vast amounts of data uploaded daily. Scholars have called for efficient tools based on machine learning to tackle this problem. Another challenge is that few threat-focused datasets and models exist, especially for low-resource languages such as Swedish, making identifying and detecting threats challenging. Therefore, this study aims to develop a practical and effective tool to automatically detect and identify online threats in Swedish. A tailored Swedish threat dataset is generated to fine-tune KBLab's Swedish BERT model. The research question that guides this project is: "How effective is a fine-tuned BERT model in classifying texts as threatening or non-threatening in Swedish online environments?" To the authors' knowledge, no existing model can detect threats in Swedish. This study uses design science research to develop the artifact and evaluates the artifact's performance using experiments. The dataset is generated during design and development by manually annotating translated English, synthetic, and authentic Swedish data. The BERT model is fine-tuned using hyperparameters from previous research. The generated dataset comprised 6,040 posts, split into 39% threats and 61% non-threats. The model, NOVA, achieved good performance on the test set and in the wild, successfully differentiating threats from non-threats. NOVA achieved almost perfect recall but lower precision, indicating room for improvement. NOVA might be too lenient when classifying threats, which could be attributed to the complexity and ambiguity of threats and the relatively small dataset. Nevertheless, NOVA can be used as a filter to identify threatening posts online among vast amounts of data.
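A hedged sketch of fine-tuning KBLab's Swedish BERT for binary threat classification, as the abstract describes, using the Hugging Face Trainer API; the toy examples and hyperparameters are placeholders, not the study's dataset or settings.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "KB/bert-base-swedish-cased"  # KBLab's Swedish BERT
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Placeholder posts: label 1 = threat, 0 = non-threat.
data = Dataset.from_dict({
    "text": ["Jag ska hitta dig ikväll", "Vilken fin dag det är idag"] * 50,
    "label": [1, 0] * 50,
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=64),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nova", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```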
|
76 |
Genre classification using syntactic features
Brigadoi, Ivan, January 2021
This thesis addresses text classification in relation to genre identification using different feature sets, with a focus on syntax-based features. We built our models by means of traditional machine learning algorithms, i.e. Naive Bayes, k-nearest neighbour, Support Vector Machine, and Random Forest, in order to predict the literary genre of books. We trained our models using bag-of-words (BOW), bigrams, syntactic bigrams, and emotional features as feature sets, as well as combinations of features. On the test set, the best feature set, BOW combined with bigrams based on syntactic relations between words, improved performance by 2% in F1-score over the BOW baseline, indicating a positive impact of syntactic information on the task of text classification.
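A small sketch contrasting plain bigrams with dependency-based (syntactic) bigrams of the kind the thesis uses as features; spaCy and its English model are tooling assumptions, not necessarily what the author worked with.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The old captain quietly watched the stormy sea.")

# Plain bigrams: adjacent token pairs, sensitive to word order only.
plain = [(a.text, b.text) for a, b in zip(doc, doc[1:])]

# Syntactic bigrams: head-dependent pairs from the dependency parse,
# capturing relations like (watched, captain) that adjacency misses.
syntactic = [(tok.head.text, tok.text) for tok in doc if tok.head is not tok]

print("plain:    ", plain[:4])
print("syntactic:", syntactic[:4])
```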
|
77 |
Automatic Document Classification in Small Environments
McElroy, Jonathan David, 01 January 2012
Document classification is used to sort and label documents. This gives users quicker access to relevant data. Users that work with a large inflow of documents spend time filing and categorizing them to allow for easier retrieval. The Automatic Classification and Document Filing (ACDF) system proposed here is designed to allow users working with files or documents to rely on the system to classify and store them with little manual attention. Built on Hidden Markov Models, the system categorizes documents in a smaller desktop environment with better results than a traditional Naive Bayes classifier.
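A rough sketch of the one-HMM-per-category scheme the ACDF approach implies: each category trains its own Hidden Markov Model over word symbols, and a new document is assigned to the category whose model gives it the highest log-likelihood. hmmlearn, the tiny vocabulary, and the smoothing step are all illustrative assumptions, not the thesis's implementation.

```python
import numpy as np
from hmmlearn.hmm import CategoricalHMM

VOCAB = {"invoice": 0, "payment": 1, "due": 2, "meeting": 3, "agenda": 4}

def encode(words):
    # hmmlearn expects a column vector of integer symbol ids.
    return np.array([[VOCAB[w]] for w in words])

train = {
    "finance":  encode("invoice payment due invoice payment due".split()),
    "planning": encode("meeting agenda meeting agenda meeting".split()),
}

models = {}
for label, seq in train.items():
    hmm = CategoricalHMM(n_components=2, n_features=len(VOCAB), random_state=0)
    hmm.fit(seq)
    # Add-epsilon smoothing so unseen words never get exactly zero probability.
    hmm.emissionprob_ = hmm.emissionprob_ + 1e-3
    hmm.emissionprob_ /= hmm.emissionprob_.sum(axis=1, keepdims=True)
    models[label] = hmm

doc = encode("payment due invoice".split())
scores = {label: m.score(doc) for label, m in models.items()}
print(max(scores, key=scores.get), scores)  # highest log-likelihood wins
```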
|
78 |
Comparative Study of the Combined Performance of Learning Algorithms and Preprocessing Techniques for Text Classification
Grancharova, Mila, Jangefalk, Michaela, January 2018
With the development in the area of machine learning, society has become more dependent on applications that build on machine learning techniques. Despite this, there are extensive classification tasks which are still performed by humans. This is time-consuming and often results in errors. One application of machine learning is text classification, which has been researched extensively over the past twenty years. Text classification tasks can be automated through supervised learning, which can lead to increased performance compared to manual classification. When handling text data, the data often has to be preprocessed in different ways to ensure good classification. Preprocessing techniques have been shown to increase the performance of text classification through supervised learning. Different preprocessing techniques affect performance differently depending on the choice of learning algorithm and the characteristics of the data set. This thesis investigates how classification accuracy is affected by different learning algorithms and different preprocessing techniques for a specific customer feedback data set. The researched algorithms are Naïve Bayes, Support Vector Machine, and Decision Tree. The research is done by experiments with dependency on algorithm and combinations of preprocessing techniques. The results show that spelling correction and removing stop words increase the accuracy for all classifiers, while stemming lowers the accuracy for all classifiers. Furthermore, Decision Tree was most positively affected by preprocessing, while Support Vector Machine was most negatively affected. A deeper study on why the preprocessing techniques affected the algorithms in this way is recommended for future work. / I och med utvecklingen inom området maskininlärning har samhället blivit mer beroende av applikationer som bygger på maskininlärningstekniker. Trots detta finns omfattande klassificeringsuppgifter som fortfarande utförs av människor. Detta är tidskrävande och resulterar ofta i olika typer av fel. En uppgift inom maskininlärning är textklassificering som har forskats mycket under de senaste tjugo åren. Textklassificering kan automatiseras genom övervakad maskininlärningsteknik vilket kan leda till effektiviseringar jämfört med manuell klassificering. Ofta måste textdata förbehandlas på olika sätt för att säkerställa en god klassificering. Förbehandlingstekniker har visat sig öka textklassificeringens prestanda genom övervakad inlärning. Olika förbehandlingstekniker påverkar prestandan olika beroende på valet av inlärningsalgoritm och egenskaper hos datamängden. Denna avhandling undersöker hur klassificeringsnoggrannheten påverkas av olika inlärningsalgoritmer och olika förbehandlingstekniker för en specifik datamängd som utgörs av kunddata. De undersökta algoritmerna är naïve Bayes, supportvektormaskin och beslutsträd. Undersökningen görs genom experiment med beroende av algoritm och kombinationer av förbehandlingstekniker. Resultaten visar att stavningskorrektion och borttagning av stoppord ökar noggrannheten för alla klassificerare medan stemming sänker noggrannheten för alla. Decision Tree var dessutom mest positivt påverkad av de olika förbehandlingsmetoderna medan Support Vector Machine påverkades mest negativt. En djupare studie om varför förbehandlingsteknikerna påverkade algoritmerna på ett sådant sätt rekommenderas för framtida arbete.
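A minimal sketch of two of the preprocessing steps compared in the thesis, stop-word removal and stemming (spelling correction is omitted for brevity), with NLTK as an assumed toolkit.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def preprocess(text, remove_stopwords=True, stem=True):
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:  # raised accuracy for all classifiers in the thesis
        tokens = [t for t in tokens if t not in STOP]
    if stem:              # lowered accuracy for all classifiers in the thesis
        tokens = [STEMMER.stem(t) for t in tokens]
    return tokens

print(preprocess("The delivery was delayed and the support never responded."))
```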
|
79 |
Classification of explicit music content using lyrics and music metadata / Klassificering av stötande innehåll i musik med hjälp av låttexter och musik-metadata
Bergelid, Linn, January 2018
In a world where online information is growing rapidly, the need for more efficient methods to search for and create music collections is greater than ever. Looking at the most recent trends, the application of machine learning to automate different categorization problems, such as genre and mood classification, has shown promising results. In this thesis we investigate the problem of classifying explicit music content using machine learning. Different data sets containing lyrics and music metadata, vectorization methods, and algorithms including Support Vector Machine, Random Forest, k-Nearest Neighbor, and Multinomial Naive Bayes are combined to create 32 different configurations. The configurations are then evaluated using precision-recall curves. The investigation shows that the configuration with the lyric data set together with TF-IDF vectorization and Random Forest as the algorithm outperforms all other configurations. / I en värld där online-information växer snabbt, ökar behovet av effektivare metoder för att söka i och skapa musiksamlingar. De senaste trenderna visar att användandet av maskininlärning för att automatisera olika kategoriseringsproblem så som klassificering av genre och humör har gett lovande resultat. I denna rapport undersöker vi problemet att klassificera stötande innehåll i musik med maskininlärning. Genom att kombinera olika datamängder med låttexter och musik-metadata, vektoriseringsmetoder samt algoritmer så som Support Vector Machine, Random Forest, k-Nearest Neighbor och Multinomial Naive Bayes skapas 32 olika konfigurationer som tränas och utvärderas med precision-recall-kurvor. Resultaten visar att konfigurationen med datamängden som endast innehåller låttexter tillsammans med TF-IDF-vektorisering och algoritmen Random Forest presterar bättre än alla andra konfigurationer.
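The best-performing configuration reported here (lyrics with TF-IDF vectorization and Random Forest, evaluated with precision-recall curves) can be sketched in scikit-learn as follows; the toy lyrics and labels are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder lyrics; 1 marks explicit content.
lyrics = ["la la sunshine and love", "offensive words here",
          "dancing in the rain", "more offensive content"] * 30
explicit = [0, 1, 0, 1] * 30

X_train, X_test, y_train, y_test = train_test_split(
    lyrics, explicit, test_size=0.25, random_state=0, stratify=explicit)

clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
clf.fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]  # probability of the explicit class
precision, recall, _ = precision_recall_curve(y_test, scores)
print(f"PR-AUC: {auc(recall, precision):.3f}")
```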
|
80 |
Unstructured to Actionable: Extracting wind event impact data for enhanced infrastructure resilience
Pham, An Huy, 28 August 2023
The United States experiences more extreme wind events than any other country, owing to its extensive coastlines, central regions prone to tornadoes, and a varied climate that together create a wide array of wind phenomena. Despite advanced meteorological forecasts, these events continue to have significant impacts on infrastructure due to the knowledge gap between hazard prediction and tangible impact. Consequently, disaster managers are increasingly interested in understanding the impacts of past wind events, which can assist in formulating strategies to enhance community resilience. However, this data is often unstructured and embedded in various agency documents. This makes it challenging to access and use the data effectively. Therefore, it is important to investigate approaches that can distinguish and extract impact data from non-essential information.
This research aims to explore methods that can identify, extract, and summarize sentences containing impact data. The significance of this study lies in addressing the scarcity of historical impact data related to structural and community damage, given that such information is dispersed across multiple briefings and damage reports.
The research has two main objectives. The first is to extract sentences providing information on infrastructure or community damage. This task uses zero-shot text classification with the large version of the Bidirectional and Auto-Regressive Transformers model (BART-large) pre-trained on the Multi-Genre Natural Language Inference (MNLI) dataset. The model identifies the impact sentences by evaluating entailment probabilities against user-defined impact keywords. This method addresses the absence of manually labeled data and establishes a framework applicable to various reports. The second objective transforms this extracted data into easily digestible summaries. This is achieved by using a BART-large model pre-trained on the Cable News Network (CNN) Daily Mail dataset to generate abstractive summaries, making it easier to understand the key points of the extracted impact data.
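A hedged sketch of the two-stage approach described above, using Hugging Face pipelines: zero-shot classification with a BART-large model fine-tuned on MNLI to flag impact sentences, then abstractive summarization with a BART-large model fine-tuned on CNN/Daily Mail. The example sentences and candidate labels are illustrative assumptions, not the thesis's keyword set.

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

sentences = [
    "Hurricane Ian made landfall near Cayo Costa on September 28.",
    "Several coastal homes lost their roofs and two bridges washed out.",
    "Power outages affected over two million customers across the region.",
]
labels = ["infrastructure damage", "community impact", "no impact"]

# Keep a sentence if its top-scoring label is an impact label.
impact = [s for s in sentences
          if classifier(s, labels)["labels"][0] != "no impact"]

summary = summarizer(" ".join(impact), max_length=40, min_length=10)
print(summary[0]["summary_text"])
```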
This approach is versatile, given its dependence on user-defined keywords, and can adapt to different disasters, including tornadoes, hurricanes, earthquakes, floods, and more. A case study will demonstrate this methodology, specifically examining the Hurricane Ian impact data found in the Structural Extreme Events Reconnaissance (StEER) damage report. / Master of Science / The U.S. sees more severe windstorms than any other country. These storms can cause significant damage, despite the availability of warnings and alerts generated from weather forecast systems up to 72 hours before the storm hits. One challenge is the ineffective communication between emergency managers and at-risk communities, which can hinder timely evacuation and preparation. Additionally, data about past storm damages are often mixed up with non-actionable information in many different reports, making it difficult to use the data to enhance future warnings and readiness for upcoming storms.
This study tries to solve this problem by finding ways to identify, extract, and summarize information about damage caused by windstorms. It is an important step toward using historical data to prepare for future events.
Two main objectives guide this research. The first involves extracting sentences in these reports that provide information on damage to buildings, infrastructure, or communities. We're using a machine learning model to sort the sentences into two groups: those that contain useful information and those that do not. The second objective revolves around transforming this extracted data into easily digestible summaries. The same type of machine learning model, trained in a different way, is then used to create these summaries. As a result, critical data can be presented in a more user-friendly and effective format, enhancing its usefulness to disaster managers.
|