Global ETD Search

81	Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification : Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs Hu, Li Ang, Ma, Long January 2023 (has links) This thesis investigates imbalanced Swedish text financial datasets, specifically receipt classification using machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms, collaborating with Fortnox for a controlled experiment. Evaluation metrics compare balancing methods regarding the accuracy, Matthews's correlation coefficient (MCC) , F1 score, precision, and recall. Findings contribute to Swedish text classification, providing insights into balancing methods. The thesis report examines balancing methods and parameter tuning on machine learning models for imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms in Natural language processing (NLP) are studied, with potential application in image classification for assessing industrial thin component deformation. Experiments show balancing methods significantly affect MCC and recall, with a recall-MCC-accuracy tradeoff. Smaller alpha values generally improve accuracy. Synthetic Minority Oversampling Technique (SMOTE) and Tomek's algorithm for removing links developed in 1976 by Ivan Tomek. First Tomek, then SMOTE (TomekSMOTE) yield promising accuracy improvements. Due to time constraints, Over-sampling using SMOTE and cleaning using Tomek links. First SMOTE, then Tomek (SMOTETomek) training is incomplete. This thesis report finds the best MCC is achieved when $\alpha$ is 0.01 on imbalanced datasets. Imbalanced datasets Swedish text financial datasets Accuracy Matthews correlation coefficient Recall Multinomial Naive Bayes SMOTE TomekLinks Performance optimization Computer Sciences Datavetenskap (datalogi)
82	AUTOMATED IMAGE LOCALIZATION AND DAMAGE LEVEL EVALUATION FOR RAPID POST-EVENT BUILDING ASSESSMENT Xiaoyu Liu (13989906) 25 October 2022 (has links) <p> </p> <p>Image data remains an important tool for post-event building assessment and documentation. After each natural hazard event, significant efforts are made by teams of engineers to visit the affected regions and collect useful image data. In general, a global positioning system (GPS) can provide useful spatial information for localizing image data. However, it is challenging to collect such information when images are captured in places where GPS signals are weak or interrupted, such as the indoor spaces of buildings. An inability to document the images’ locations would hinder the analysis, organization, and documentation of these images as they lack sufficient spatial context. This problem becomes more urgent to solve for the inspection mission covering a large area, like a community. To address this issue, the objective of this research is to generate a tool to automatically process the image data collected during such a mission and provide the location of each image. Towards this goal, the following tasks are performed. First, I develop a methodology to localize images and link them to locations on a structural drawing (Task 1). Second, this methodology is extended to be able to process data collected from a large scale area, and perform indoor localization for images collected on each of the indoor floors of each individual building (Task 2). Third, I develop an automated technique to render the damage condition decision of buildings by fusing the image data collected within (Task 3). The methods developed through each task have been evaluated with data collected from real world buildings. This research may also lead to automated assessment of buildings over a large scale area. </p> post-event building assessment indoor localization visual odometry 3D reconstruction information fusion naive Bayes fusion optimization
83	COVID-19: Анализ эмоциональной окраски сообщений в социальных сетях (на материале сети «Twitter») : магистерская диссертация / COVID-19: Social network sentiment analysis (based on the material of "Twitter" messages) Денисова, П. А., Denisova, P. A. January 2021 (has links) Работа посвящена изучению анализа тональности текстов в социальных сетях на примере сообщений-твитов из социальной сети Twitter. Материал исследования составили 818 224 сообщения по 17-ти ключевым словам, из которых 89 025 твитов содержали слова «COVID-19» и «Сoronavirus». В первой части работы рассматриваются общие теоретические и методологические вопросы: вводится понятие Sentiment Analysis, анализируются различные подходы к классификации тональности текстов. Особое внимание в задачах классификации текстов уделяется Байесовскому классификатору, который показывает высокую точность работы. Изучаются особенности анализа тональности текстов в социальных сетях во время эпидемий и вспышек болезней. Описывается процедура и алгоритм анализа тональности текста. Большое внимание уделяется анализу тональности текстов в Python с помощью библиотеки TextBlob, а также выбирается ещё один из инструментов «SaaS» - программное обеспечение как услуга, который позволяет реализовать анализ тональности текстов в режиме реального времени, где нет необходимости в большом опыте машинного обучения и обработке естественного языка, в сравнении с языком программирования Python. Вторая часть исследования начинается с построения выборок, т.е. определения ключевых слов, по которым в работе осуществляется поиск и экспорт необходимых твитов. Для этой цели используется корпус - Coronavirus Corpus, предназначенный для отражения социальных, культурных и экономических последствий коронавируса (COVID-19) в 2020 году и в последующий период. Анализируется динамика использования слов по изучаемой тематике в течение 2020 года и проводится аналогия между частотой их использования и происходящими событиями. Далее по выбранным ключевым словам осуществляется поиск твитов и, основываясь на полученных данных, реализуется анализ тональности cообщений с помощью библиотеки Python - TextBlob, созданной для обработки текстовых данных, и онлайн - сервиса Brand24. Сравнивая данные инструменты, отмечается схожесть полученных результатов. Исследование помогает быстро и в реальном времени понять общественные настроения по поводу вспышки COVID-19, способствуя тем самым пониманию развивающихся событий. Также данная работа может быть использована в качестве модели для определения эмоционального состояния интернет-пользователей в различных ситуациях. / The work is devoted to the sentiment analysis study of messages in Twitter social network. The research material consisted of 818,224 messages and 17 keywords, whereas 89,025 tweets contained the words "COVID-19" and "Coronavirus". In the first part, theoretical and methodological issues are considered: the concept of sentiment analysis is introduced, various approaches to text classification are analyzed. Particular attention in the problems of text classification is given to Naive Bayes classifier, which shows high accuracy of work. The features of sentiment analysis in social networks during epidemics and disease outbreaks are studied. The procedure and algorithm for analyzing the sentiment of the text are described. Much attention is paid to the analysis of sentiment of texts in Python using TextBlob library, and also one of the SaaS tools is chosen - software as a service, which allows real-time sentiment analysis of texts, where there is no need for extensive experience in machine learning and natural language processing against Python programming language. The second part of the study begins with sampling, i.e. definition of keywords by which the search and export of the necessary tweets is carried out. For this purpose, the Coronavirus Corpus is used, designed to reflect the social, cultural and economic consequences of the coronavirus (COVID-19) in 2020 and beyond. The dynamics of the topic words usage during 2020 is analyzed and an analogy is drawn between the frequency of their usage and the events in place. Next, the selected keywords are used to search for tweets and, based on the data obtained, the sentiment analysis of messages is carried out using the Python library - TextBlob, created for processing textual data, and the Brand24 online service. Comparing these tools, the results are similar. The study helps to understand quickly and in real-time public sentiments about the COVID-19 outbreak, thereby contributing to the understanding of developing events. Also, this work can be used as a model for determining the emotional state of Internet users in various situations. COVID-19 ПАНДЕМИЯ КОРОНАВИРУС TWITTER TEXTBLOB BRAND24 MASTER'S THESIS COVID-19 PANDEMIC CORONAVIRUS TWITTER SENTIMENT ANALYSIS TEXTBLOB NAIVE BAYES CLASSIFIER BRAND24
84	Simulating ADS-B vulnerabilities by imitating aircrafts : Using an air traffic management simulator / Simulering av ADS-B sårbarheter genom imitering av flygplan : Med hjälp av en flyglednings simulator Boström, Axel, Börjesson, Oliver January 2022 (has links) Air traffic communication is one of the most vital systems for air traffic management controllers. It is used every day to allow millions of people to travel safely and efficiently across the globe. But many of the systems considered industry-standard are used without any sort of encryption and authentication meaning that they are vulnerable to different wireless attacks. In this thesis vulnerabilities within an air traffic management system called ADS-B will be investigated. The structure and theory behind this system will be described as well as the reasons why ADS-B is unencrypted. Two attacks will then be implemented and performed in an open-source air traffic management simulator called openScope. ADS-B data from these attacks will be gathered and combined with actual ADS-B data from genuine aircrafts. The collected data will be cleaned and used for machine learning purposes where three different algorithms will be applied to detect attacks. Based on our findings, where two out of the three machine learning algorithms used were able to detect 99.99% of the attacks, we propose that machine learning algorithms should be used to improve ADS-B security. We also think that educating air traffic controllers on how to detect and handle attacks is an important part of the future of air traffic management. ADS-B ATC Impersonation attack Sybil attack Air Traffic Communication openScope Security Machine learning K nearest neighbour Naive Bayes Decision tree Other Computer and Information Science Annan data- och informationsvetenskap
85	The Influence of Artificial Intelligence on Education: Sentiment Analysis on YouTube Comments : What is people´s sentiment on ChatGPT for educational purposes? Rodríguez Roldán, Javier January 2024 (has links) The use of artificial intelligence (AI), especially ChatGPT, has increased exponentially in the past years, and it can be seen how AI-based tools are being used in several fields, including education. The literature on AI on education (AIEd), how it has been used, its potential uses, opportunities and challenges were reviewed as well as the literature on sentiment analysis on social media to evaluate the best approach. Since education might face notorious changes due to this technology, assessing how people feel about this potential change in the paradigm is essential. Sentiment analysis on YouTube comments of videos related to ChatGPT, the most popular AI tool for education across learners and educators, was performed. It was found that 62.1% of thes ample had a positive feeling regarding this technology for educational purposes, whereas 19.4% had a negative sentiment and 18.5% were neutral. To contribute to the literature on sentiment analysis of YouTube comments, the two most used and best-performing algorithms were used for this task: Naive Bayes and Support Vector Machine. The results show that the first algorithm had a 61.30% accuracy, whereas SVM had a 71.79%. Education YouTube Artificial Intelligence Chatbot ChatGPT Sentiment Analysis Lexicon-based VADER Machine Learning Naive Bayes SVM Information Systems
86	Federated Online Learning with Streaming Data for Intrusion Detection Systems : Comparing Federated and Centralized Learning Methods in Online and Offline Settings Arvidsson, Victor January 2024 (has links) Background. With increased pressure from both regulatory bodies and end-users, interest in privacy preserving machine learning methods have increased among companies and researchers in the last few years. One of the main areas of research regarding this is federated learning. Further, with the current situation in the world, interest in cybersecurity is also at an all time high, where intrusion detection systems are one component of interest. With anomaly-based intrusion detection systems using machine learning methods, it is desirable that these can adapt automatically over time as the network patterns change, resulting in online learning being highly relevant for this application. Previous research has studied offline federated intrusion detection systems. However, there have been very little work performed in the study of online federated learning for intrusion detection systems. Objectives. The objective of this thesis is to evaluate the performance of online federated machine learning methods for intrusion detection systems. Furthermore, the thesis will study the performance relationship between offline and online models for both centralized and federated learning, in order to draw conclusions about the ability to extrapolate from results between the different types of models. Methods. This thesis uses a quasi-experiment to evaluate two different types of models, Naive Bayes and Semi-supervised Federated Learning on Evolving Data Streams (SFLEDS), on three different datasets, NSL-KDD, UNSW-NB15, and CIC-IDS2017. For each model, four variants are implemented: centralized offline, centralized online, federated offline and federated online, and in the federated setting the models are evaluated with 20, 30, and 40 clients. Results. The results show that the best performing model in general is the federated online SFLEDS. They also highlight an important problem with using imbalanced datasets without proper care for data preprocessing and model design. Finally, the results show that there are no general relationships between offline and online models that hold in both the centralized and federated settings in terms of prediction performance. Conclusions. The main conclusion of the thesis is that online federated learning has a lot of potential for the application of intrusion detection systems, but more research is required to find the optimal models and parameters that result in satisfactory performance. / Bakgrund. Med ökat tryck från både tillsynsorgan och slutanvändare har intresset för integritetsbevarande maskininlärning ökat hos företag och forskare under de senaste åren. Ett av huvudområdena där det forskas om detta är inom federerad inlärning. Vidare, med det nuvarande läget i världen är intresset för cybersäkerhet högre än någonsin, där bland annat intrångsdetekteringssystem är av intresse. Med avvikelsebaserade intrångsdetekteringssystem som använder sig av maskininlärning så är det önskvärt att dessa automatiskt kan anpassa sig över tid när nätverksmönster förändras, vilket resulterar i att online maskininlärning är högst relevant för området. Tidigare forskning har studerat federerade offline intrångsdetekteringssystem, men det finns väldigt lite forskning gällande federerad online maskininlärning för intrångsdetekteringssystem. Syfte. Syftet med det här arbetet är att utvärdera prestandan av federerad online maskininlärning för intrångsdetekteringssystem. Vidare kommer det här arbetet att studera prestandaförhållandet mellan offline och online modeller för både centraliserad och federerad inlärning, för att kunna dra slutsatser om förmågan att extrapolera resultat mellan olika typer av modeller. \newline\textbf{Metod.} Det här arbetet använder sig av ett kvasiexperiment för att utvärdera två olika modeller, Naive Bayes och Semi-supervised Federated Learning on Evolving Data Streams (SFLEDS), på tre olika dataset, NSL-KDD, UNSW-NB15 och CIC-IDS2017. För varje modell implementeras fyra varianter: centraliserad offline, centraliserad online, federerad offline och federerad online. De federerade modellerna utvärderas med 20, 30 och 40 klienter. Resultat. Resultaten visar att den generellt bästa modellen är online SFLEDS. De belyser även ett viktigt problem med att använda obalanserade dataset utan tillräcklig hänsyn till förbearbetning av datan och modelldesign. Slutligen visar resultaten att det inte finns något generellt samband mellan offline och online modeller som stämmer för både centraliserad och federerad inlärning när det gäller modellprestanda. Slutsatser. Den huvudsakliga slutsatsen från arbetet är att federerad online maskininlärning har stor potential för intrångsdetekteringssystem, men mer forskning krävs för att hitta den bästa modellen och de bästa parametrarna för att nå ett tillfredsställande resultat. Naive Bayes SFLEDS Semi-Supervised Learning Cybersecurity Centralized Federated Learning Naive Bayes SFLEDS Semi-övervakad maskininlärning Cybersäkerhet Centraliserad federerad maskininlärning Computer Sciences Datavetenskap (datalogi)
87	Detection of bullying with MachineLearning : Using Supervised Machine Learning and LLMs to classify bullying in text Yousef, Seif-Alamir, Svensson, Ludvig January 2024 (has links) In recent years, there has been an increase in the issue of bullying, particularly in academic settings. This degree project examines the use of supervised machine learning techniques to identify bullying in text data from school surveys provided by the Friends Foundation. It evaluates various traditional algorithms such as Logistic Regression, Naive Bayes, SVM, Convolutional neural networks (CNN), alongside a Retrieval-Augmented Generation (RAG) model using Llama 3, with a primary goal of achieving high recall on the texts consisting of bullying while also considering precision, which is reflected in the use of the F3-score. The SVM model emerged as the most effective among the traditional methods, achieving the highest F3-score of 0.83. Although the RAG model showed promising recall, it suffered from very low precision, resulting in a slightly lower F3-score of 0.79. The study also addresses challenges such as the small and imbalanced dataset as well as emphasizes the importance of retaining stop words to maintain context in the text data. The findings highlight the potential of advanced machine learning models to significantly assist in bullying detection with adequate resources and further refinement. Natural Language Processing Bullying Text Classification SVM Logistic Regression Naive Bayes Convolutional Neural Networks Retrieval-augmented generation GPT-4o Computer Sciences Datavetenskap (datalogi)
88	It’s a Match: Predicting Potential Buyers of Commercial Real Estate Using Machine Learning Hellsing, Edvin, Klingberg, Joel January 2021 (has links) This thesis has explored the development and potential effects of an intelligent decision support system (IDSS) to predict potential buyers for commercial real estate property. The overarching need for an IDSS of this type has been identified exists due to information overload, which the IDSS aims to reduce. By shortening the time needed to process data, time can be allocated to make sense of the environment with colleagues. The system architecture explored consisted of clustering commercial real estate buyers into groups based on their characteristics, and training a prediction model on historical transaction data from the Swedish market from the cadastral and land registration authority. The prediction model was trained to predict which out of the cluster groups most likely will buy a given property. For the clustering, three different clustering algorithms were used and evaluated, one density based, one centroid based and one hierarchical based. The best performing clustering model was the centroid based (K-means). For the predictions, three supervised Machine learning algorithms were used and evaluated. The different algorithms used were Naive Bayes, Random Forests and Support Vector Machines. The model based on Random Forests performed the best, with an accuracy of 99.9%. / Denna uppsats har undersökt utvecklingen av och potentiella effekter med ett intelligent beslutsstödssystem (IDSS) för att prediktera potentiella köpare av kommersiella fastigheter. Det övergripande behovet av ett sådant system har identifierats existerar på grund av informtaionsöverflöd, vilket systemet avser att reducera. Genom att förkorta bearbetningstiden av data kan tid allokeras till att skapa förståelse av omvärlden med kollegor. Systemarkitekturen som undersöktes bestod av att gruppera köpare av kommersiella fastigheter i kluster baserat på deras köparegenskaper, och sedan träna en prediktionsmodell på historiska transkationsdata från den svenska fastighetsmarknaden från Lantmäteriet. Prediktionsmodellen tränades på att prediktera vilken av grupperna som mest sannolikt kommer köpa en given fastighet. Tre olika klusteralgoritmer användes och utvärderades för grupperingen, en densitetsbaserad, en centroidbaserad och en hierarkiskt baserad. Den som presterade bäst var var den centroidbaserade (K-means). Tre övervakade maskininlärningsalgoritmer användes och utvärderades för prediktionerna. Dessa var Naive Bayes, Random Forests och Support Vector Machines. Modellen baserad p ̊a Random Forests presterade bäst, med en noggrannhet om 99,9%. Intelligent decision support system Decision support system Commercial real estate Information overload Decision making Sensemaking Machine learning Clustering Prediction algorithms Naive bayes Random forests Support vector machines Intelligenta beslutsstödssystem Beslutsstödssystem Kommersiella fastigheter Informationsöverflöd Beslutsfattande Meningsskapande Maskininlärning Klustring Prediktionsalgoritmer Naive bayes Random forests Support vector machines Information Systems, Social aspects Communication Studies Kommunikationsvetenskap
89	在Spark大數據平台上分析DBpedia開放式資料：以電影票房預測為例 / Analyzing DBpedia Linked Open Data (LOD) on Spark：Movie Box Office Prediction as an Example 劉文友, Liu, Wen Yu Unknown Date (has links) 近年來鏈結開放式資料 (Linked Open Data，簡稱LOD) 被認定含有大量潛在價值。如何蒐集與整合多元化的LOD並提供給資料分析人員進行資料的萃取與分析，已成為當前研究的重要挑戰。LOD資料是RDF (Resource Description Framework) 的資料格式。我們可以利用SPARQL來查詢RDF資料，但是目前對於大量RDF的資料除了缺少一個高性能且易擴展的儲存和查詢分析整合性系統之外，對於RDF大數據資料分析流程的研究也不夠完備。本研究以預測電影票房為例，使用DBpedia LOD資料集並連結外部電影資料庫 (例如：IMDb)，並在Spark大數據平台上進行巨量圖形的分析。首先利用簡單貝氏分類與貝氏網路兩種演算法進行電影票房預測模型實例的建構，並使用貝氏訊息準則 (Bayesian Information Criterion，簡稱BIC) 找到最佳的貝氏網路結構。接著計算多元分類的ROC曲線與AUC值來評估本案例預測模型的準確率。 / Recent years, Linked Open Data (LOD) has been identified as containing large amount of potential value. How to collect and integrate multiple LOD contents for effective analytics has become a research challenge. LOD is represented as a Resource Description Framework (RDF) format, which can be queried through SPARQL language. But large amount of RDF data is lack of a high performance and scalable storage analysis system. Moreover, big RDF data analytics pipeline is far from perfect. The purpose of this study is to exploit the above research issue. A movie box office sale prediction scenario is demonstrated by using DBpedia with external IMDb movie database. We perform the DBpedia big graph analytics on the Apache Spark platform. The movie box office prediction for optimal model selection is first evaluated by BIC. Then, Naïve Bayes and Bayesian Network optimal model’s ROC and AUC values are obtained to justify our approach. 開放式鏈結資料資源描述框架巨量資料 Spark 簡單貝氏分類貝氏網路 LOD RDF Big Data Spark Naive Bayes Bayesian Network
90	Evaluation of computational methods for data prediction Erickson, Joshua N. 03 September 2014 (has links) Given the overall increase in the availability of computational resources, and the importance of forecasting the future, it should come as no surprise that prediction is considered to be one of the most compelling and challenging problems for both academia and industry in the world of data analytics. But how is prediction done, what factors make it easier or harder to do, how accurate can we expect the results to be, and can we harness the available computational resources in meaningful ways? With efforts ranging from those designed to save lives in the moments before a near field tsunami to others attempting to predict the performance of Major League Baseball players, future generations need to have realistic expectations about prediction methods and analytics. This thesis takes a broad look at the problem, including motivation, methodology, accuracy, and infrastructure. In particular, a careful study involving experiments in regression, the prediction of continuous, numerical values, and classification, the assignment of a class to each sample, is provided. The results and conclusions of these experiments cover only the included data sets and the applied algorithms as implemented by the Python library. The evaluation includes accuracy and running time of different algorithms across several data sets to establish tradeoffs between the approaches, and determine the impact of variations in the size of the data sets involved. As scalability is a key characteristic required to meet the needs of future prediction problems, a discussion of some of the challenges associated with parallelization is included. / Graduate / 0984 / erickson@uvic.ca regression classification evaluation data analysis prediction machine learning supervised learning linear regression support vector machine nearest neighbor logistic regression gaussian naive bayes stochastic gradient descent scikit-learn decision tree

Search results