Global ETD Search

221	Data Mining Methods For Malware Detection Siddiqui, Muazzam 01 January 2008 (has links) This research investigates the use of data mining methods for malware (malicious program) detection and proposes a framework as an alternative to traditional signature detection methods. The traditional approach of using signatures to detect malicious programs fails for new and unknown malware, for which signatures are not available. We present a data mining framework to detect malicious programs. We collected, analyzed and processed several thousand malicious and clean programs to find the best features and build models that can classify a given program into a malware or a clean class. Our research is closely related to information retrieval and classification techniques and borrows a number of ideas from the field. We used a vector space model to represent the programs in our collection. Our data mining framework includes two separate and distinct classes of experiments. The first are the supervised learning experiments, which used a dataset consisting of several thousand malicious and clean program samples to train, validate and test an array of classifiers. In the second class of experiments, we proposed using sequential association analysis for feature selection and automatic signature extraction. In our experiments, we achieved a detection rate as high as 98.4% and a false positive rate as low as 1.9% on novel malware. Data Mining Malware Detection Machine Learning Classification Instruction Sequences Signature Extraction Predictive Modeling Supervised Learning Unsupervised Learning Feature Selection Feature Reduction Categorical Data Analysis
222	Unsupervised Machine Learning Based Anomaly Detection in Stockholm Road Traffic / Oövervakad Maskininlärning baserad Anomali Detektion i Stockholms Trafikdata Hellström, Vilma January 2023 (has links) This thesis is a study of anomaly detection in vehicle traffic data in central Stockholm. Anomaly detection is an important tool in the analysis of traffic data for improved urban planning. Two unsupervised machine learning models are used, the DBSCAN clustering model and the LSTM deep learning neural network. Modified versions of the models are also employed, incorporating adaptations that exploit diurnal traffic variations to improve the quality of the results. Subsequently, the model performance is analysed and compared. For evaluating the models, we employed two types of synthetic anomalies: a straightforward one and a more complex variant. The results indicate that all models show some ability to detect both anomalies. The models show better performance on the simpler anomaly, with both LSTM and DBSCAN giving comparable results. In contrast, LSTM outperforms DBSCAN on the more complex anomaly. Notably, the modified versions of both models consistently show enhanced performance. This suggests that LSTM outperforms DBSCAN as anomalies become more complex, presumably owing to LSTM’s proficiency in identifying intricate patterns. However, this relationship warrants further investigation in future research. / This thesis addresses anomaly detection in vehicle traffic data in central Stockholm. Anomaly detection is an important tool in the analysis of traffic data for improved urban planning. Two unsupervised machine learning models are used: the clustering model DBSCAN and the deep learning neural network LSTM. A modified version of the models is also applied; this modification involves adaptations that exploit daily traffic variations to improve the quality of the results. The models are analysed and their performance is compared. 
To evaluate the models, two types of synthetic anomalies were used: a simple anomaly and a more complex one. The results show that the models are able to detect both anomalies. The models perform better on the simpler anomaly, where LSTM and DBSCAN give comparable results. For the more complex anomaly, LSTM gives better results than DBSCAN. The modified versions of both models consistently produced better results than the more conventional application. The result suggests that LSTM outperforms DBSCAN as the anomalies become more complex, owing to LSTM's ability to identify non-trivial patterns. However, this requires further investigation in future research. Anomaly detection DBSCAN LSTM Machine learning Synthetic anomalies Unsupervised learning Anomalidetektering DBSCAN LSTM maskininlärning syntetiska anomalier oövervakad inlärning Elektroteknik och elektronik
223	Unsupervised Anomaly Detection on Multi-Process Event Time Series Vendramin, Nicoló January 2018 (has links) Establishing whether the observed data are anomalous or not is an important task that has been widely investigated in the literature, and it becomes an even more complex problem when combined with high-dimensional representations and multiple sources independently generating the patterns to be analyzed. The work presented in this master thesis employs a data-driven pipeline for the definition of a recurrent auto-encoder architecture to analyze, in an unsupervised fashion, high-dimensional event time series generated by multiple and variable processes interacting with a system. Facing the above-mentioned problem, the work investigates whether it is possible to use a single model to analyze patterns produced by different sources. The analysis of log files that record events of interaction between users and the radio network infrastructure is employed as a real-world case study for the given problem. The investigation aims to verify the performance of a single machine learning model applied to the learning of multiple patterns developed through time by distinct sources. The work proposes a pipeline to deal with the complex representation of the data source and the definition and tuning of the anomaly detection model; the pipeline is based on no domain-specific knowledge and can thus be adapted to different problem settings. The model has been implemented in four different variants that have been evaluated over both normal and anomalous data, gathered partially from real network cells and partially from the simulation of anomalous behaviours. The empirical results show the applicability of the model for the detection of anomalous sequences and events in the described conditions, with scores reaching above 80% in terms of F1-score, and varying depending on the specific threshold setting. 
In addition, their deeper interpretation gives insights about the difference between the variants of the model and thus their limitations and strong points. / Establishing whether observed data are anomalous or not is an important task that has been studied extensively in the literature, and the problem becomes even more complex when combined with high-dimensional representations and multiple sources that independently generate the patterns to be analysed. The work presented in this thesis uses a data-driven pipeline for the definition of a recurrent auto-encoder architecture to analyse, in an unsupervised manner, high-dimensional event time series generated by multiple and variable processes interacting with a system. Against the background of the above problem, the work investigates whether it is possible to use a single model to analyse patterns produced by different sources. Analysis of log files that record events of interaction between users and radio network infrastructure is used as a case study for the given problem. The investigation aims to verify the performance of a single machine learning model applied to learning multiple patterns developed over time by different sources. The work proposes a pipeline to handle the complex representation of the data sources and the definition and tuning of the anomaly detection model; the pipeline is not based on domain-specific knowledge and can therefore be adapted to different problem settings. The model has been implemented in four different variants that have been evaluated on both normal and anomalous data, collected partly from real network cells and partly from simulation of anomalous behaviours. The empirical results show the applicability of the model for detecting anomalous sequences and events in the proposed framework, with F1-scores above 80%, varying depending on the specific threshold setting. 
In addition, their deeper interpretation gives insights into the difference between the variants of the model and thus their limitations and strengths. Anomaly Detection Recurrent Neural Networks Time Series Analysis Unsupervised Learning Anomalitetsdetektering Återkommande neurala nätverk Tidsserieanalys Oövervakat lärande Computer and Information Sciences Data- och informationsvetenskap
224	Models and Representation Learning Mechanisms for Graph Data Susheel Suresh (14228138) 15 December 2022 (has links) Graph representation learning (GRL) has been increasingly used to model and understand data from a wide variety of complex systems spanning social, technological, bio-chemical and physical domains. GRL consists of two main components: (1) a parametrized encoder that provides representations of graph data and (2) a learning process to train the encoder parameters. Designing flexible encoders that capture the underlying invariances and characteristics of graph data is crucial to the success of GRL. On the other hand, the learning process drives the quality of the encoder representations, and developing principled learning mechanisms is vital for a number of growing applications in self-supervised, transfer and federated learning settings. To this end, we propose a suite of models and learning algorithms for GRL which form the two main thrusts of this dissertation.
In Thrust I, we propose two novel encoders which build upon a widely popular GRL encoder class called graph neural networks (GNNs). First, we empirically study the prediction performance of current GNN-based encoders when applied to graphs with heterogeneous node mixing patterns using our proposed notion of local assortativity. We find that GNN performance in node prediction tasks strongly correlates with our local assortativity metric, thereby introducing a limit. We propose to transform the input graph into a computation graph with proximity and structural information as distinct types of edges. We then propose a novel GNN-based encoder that operates on this computation graph and adaptively chooses between structure and proximity information. Empirically, adopting our transformation and encoder framework leads to improved node classification performance compared to baselines in real-world graphs that exhibit diverse mixing.
Secondly, we study the trade-off between expressivity and efficiency of GNNs when applied to temporal graphs for the task of link ranking. We develop an encoder that incorporates a labeling approach designed to allow for efficient inference over the candidate set jointly, while provably boosting expressivity. We also propose to optimize a list-wise loss for improved ranking. With extensive evaluation on real-world temporal graphs, we demonstrate its improved performance and efficiency compared to baselines.
In Thrust II, we propose two principled encoder learning mechanisms for challenging and realistic graph data settings. First, we consider a scenario where only limited or even no labelled data is available for GRL. Recent research has converged on graph contrastive learning (GCL), where GNNs are trained to maximize the correspondence between representations of the same graph in its different augmented forms. However, we find that GNNs trained by traditional GCL often risk capturing redundant graph features and thus may be brittle and provide sub-par performance in downstream tasks. We then propose a novel principle, termed adversarial-GCL (AD-GCL), which enables GNNs to avoid capturing redundant information during training by optimizing the adversarial graph augmentation strategies used in GCL. We pair AD-GCL with theoretical explanations and design a practical instantiation based on trainable edge-dropping graph augmentation. We experimentally validate AD-GCL by comparing it with state-of-the-art GCL methods and achieve performance gains in semi-supervised, unsupervised and transfer learning settings using benchmark chemical and biological molecule datasets.
Secondly, we consider a scenario where graph data is siloed across clients for GRL. We focus on two unique challenges encountered when applying distributed training to GRL: (i) client task heterogeneity and (ii) label scarcity. We propose a novel learning framework called federated self-supervised graph learning (FedSGL), which first utilizes a self-supervised objective to train GNNs in a federated fashion across clients; each client then fine-tunes the obtained GNNs based on its local task and available labels. Our framework enables the federated GNN model to extract patterns from the common feature (attribute and graph topology) space without the need for labels and without being biased by heterogeneous local tasks. Extensive empirical study of FedSGL on both node and graph classification tasks yields fruitful insights into how the level of feature / task heterogeneity, the adopted federated algorithm and the level of label scarcity affect the clients’ performance in their tasks. Data mining and knowledge discovery Graph, social and multimedia data Deep learning Neural networks Semi- and unsupervised learning Graph Neural Networks (GNNs) Deep Learning Self Supervised Learning Federated Learning frameworks
225	Image-based Machine Learning Applications in Nitrate Sensor Quality Assessment and Inkjet Print Quality Stability Qingyu Yang (6634961) 21 December 2022 (has links) An on-line quality assessment system in the industry is essential to prevent artifacts and guide manufacturing processes. Some well-developed systems can diagnose problems and help control the output qualities. However, some of the conventional methods are limited by their time consumption and the cost of expensive human labor. So, more efficient solutions are needed to guide future decisions and improve productivity. This thesis focuses on developing two image-based machine learning systems to accelerate the manufacturing process: one is to benefit nitrate sensor fabrication, and the other is to help image quality control for inkjet printers.
In the first work, we propose a system for predicting the nitrate sensor's performance based on non-contact images. Nitrate sensors are commonly used to reflect the nitrate levels of soil conditions in agriculture. In a roll-to-roll system for manufacturing thin-film nitrate sensors, varying characteristics of the ion-selective membrane on screen-printed electrodes are inevitable and affect sensor performance. It is essential to monitor the sensor performance in real time to guarantee the quality of the sensor. We also develop a system for predicting the sensor performance in on-line scenarios and making the neural networks efficiently adapt to the new data.
Streaks are the number one image quality problem in inkjet printers. In the second work, we focus on developing an efficient method to model and predict missing jets, which are the main contributor to streaks. In inkjet printing, the missing jets typically increase over printing time, and the print head needs to be purged frequently to recover missing jets and maintain print quality. We leverage machine learning techniques for developing spatio-temporal models to predict when and where the missing jets are likely to occur. The prediction system helps the inkjet printers make more intelligent decisions during customer jobs. In addition, we propose another system that will automatically identify missing jet patterns from a large-scale database that can be used in a diagnostic system to identify potential failures. Image Processing Machine Learning Computer Vision CNN Unsupervised Learning Image Retrieval Online Deep Learning Online Learning System Image Quality Assessment
226	Unsupervised multiple object tracking on video with no ego motion / Oövervakad spårning av flera objekt på video utan egorörelse Wu, Shuai January 2022 (has links) Multiple-object tracking is a task within the field of computer vision. As the name states, the task consists of tracking multiple objects in a video; algorithms that complete such a task are called trackers. Many of the existing trackers require supervision, meaning that the location and identity of each object which appears in the training data must be labeled. The procedure of generating these labels, usually through manual annotation of video material, is highly resource-consuming. On the other hand, unlike well-known labeled multiple-object tracking datasets, there exists a massive amount of unlabeled video with different objects, environments, and video specifications. Using such unlabeled video can therefore contribute to cheaper and more diverse datasets. There have been numerous attempts at unsupervised object tracking, but most rely on evaluating the tracker performance on a labeled dataset. The reason behind this is the lack of an evaluation method for unlabeled datasets. This project explores unsupervised pedestrian tracking on video taken from a stationary camera over a long duration. On top of a simple baseline tracker, two methods are proposed to extend the baseline and increase its performance. We then propose an evaluation method that works for unlabeled video, which we use to evaluate the proposed methods. The evaluation method consists of the trajectory completion rate and the number of ID switches. The trajectory completion rate is a novel metric proposed for pedestrian tracking. In video taken by a stationary camera, pedestrians generally enter and exit the scene at specific locations. We define a complete trajectory as a trajectory that goes from one area to another. The completion rate is calculated as the number of complete trajectories divided by the total number of trajectories. 
The results showed that the two proposed methods increased the trajectory completion rate over the original baseline. Moreover, both proposed methods did so without significantly increasing the number of ID switches. / Multiple-object tracking is a task within the field of computer vision. As the name indicates, the task consists of tracking multiple objects in the video; an algorithm that completes such a task is called a tracker. Many of the existing trackers require supervision, which means that the location and identity of each object appearing in the training data must be labeled. The procedure for generating these labels, usually through manual annotation of video material, is very resource-intensive. On the other hand, unlike well-known labeled datasets for multiple-object tracking, there is an enormous amount of unlabeled video with different objects, environments, and video specifications. Using such unlabeled video can therefore contribute to cheaper and more varied datasets. Many attempts at unsupervised object tracking have been made, but most rely on evaluating tracking performance on a labeled dataset. The reason for this is the lack of an evaluation method for unlabeled datasets. This project explores unsupervised pedestrian tracking on video taken from a stationary camera over a long period. On top of a simple baseline tracker, two methods are proposed to extend the baseline and increase its performance. We then propose an evaluation method that works for unlabeled video, which we use to evaluate the proposed methods. The evaluation method consists of the trajectory completion rate and the number of ID switches. The trajectory completion rate is a new metric proposed for pedestrian tracking. In video taken with a stationary camera, pedestrians usually enter and leave the scene at specific locations. We define a complete trajectory as a trajectory that goes from one area to another. The completion rate is calculated as the number of complete trajectories divided by the number of all trajectories. 
The results showed that the two proposed methods increased the trajectory completion rate over the original baseline performance. Moreover, both proposed methods did so without appreciably increasing the number of ID switches. Object tracking Multiple-object tracking Unsupervised learning Evaluation metric Pedestrian tracking Objektspårning Spårning av flera objekt Oövervakad inlärning Utvärderingsmått Fotgängarspårning Computer and Information Sciences Data- och informationsvetenskap
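The trajectory completion rate defined in entry 226 (complete trajectories divided by all trajectories, where a complete trajectory goes from one area to another) can be illustrated with a minimal sketch. The rectangular entry/exit zones, the representation of a trajectory as a list of (x, y) points, and all function names below are assumptions made for illustration; the abstract does not say how areas are defined or detected.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
# Hypothetical zone layout: axis-aligned rectangle (x_min, y_min, x_max, y_max).
Zone = Tuple[float, float, float, float]


def zone_of(point: Point, zones: Sequence[Zone]) -> Optional[int]:
    """Return the index of the first zone containing the point, or None."""
    x, y = point
    for idx, (x0, y0, x1, y1) in enumerate(zones):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return idx
    return None


def is_complete(trajectory: Sequence[Point], zones: Sequence[Zone]) -> bool:
    """A trajectory counts as complete if it starts in one zone and ends in a different one."""
    if not trajectory:
        return False
    start = zone_of(trajectory[0], zones)
    end = zone_of(trajectory[-1], zones)
    return start is not None and end is not None and start != end


def completion_rate(trajectories: Sequence[Sequence[Point]],
                    zones: Sequence[Zone]) -> float:
    """Number of complete trajectories divided by the number of all trajectories."""
    if not trajectories:
        return 0.0
    complete = sum(is_complete(t, zones) for t in trajectories)
    return complete / len(trajectories)


# Invented toy data: two zones in opposite corners of the scene.
zones = [(0.0, 0.0, 1.0, 1.0), (9.0, 9.0, 10.0, 10.0)]
trajectories = [
    [(0.5, 0.5), (5.0, 5.0), (9.5, 9.5)],  # zone 0 -> zone 1: complete
    [(0.5, 0.5), (5.0, 5.0)],              # ends outside any zone: incomplete
    [(0.5, 0.5), (0.2, 0.2)],              # starts and ends in the same zone: incomplete
    [(9.5, 9.5), (0.1, 0.9)],              # zone 1 -> zone 0: complete
]
print(completion_rate(trajectories, zones))
```

Because the metric needs only the tracker's own output and a description of where people enter and leave the scene, it can be computed on unlabeled video, which is the property the abstract emphasizes.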
227	Chronic Pain as a Continuum: Autoencoder and Unsupervised Learning Methods for Archetype Clustering and Identifying Co-existing Chronic Pain Mechanisms / Chronic Pain as a Continuum: Unsupervised Learning for Identification of Co-existing Chronic Pain Mechanisms Khan, Md Asif January 2022 (has links) Chronic pain (CP) is a personal and economic burden that affects more than 30% of the world's population. While being the leading cause of disability, it is complicated to diagnose and manage. The optimal way to treat CP is to identify the pain mechanism or the underlying cause. The substantial overlap of the pain mechanisms (i.e., Nociceptive, Neuropathic, and Nociplastic) usually makes identification unreachable in a clinical setting where finding the dominant mechanism is complicated. Additionally, many specialists regard CP classification as a spectrum or continuum. Despite its importance, a data-driven way to identify and quantify co-existing CP mechanisms is still absent. This work successfully identified the co-existing CP mechanisms within a patient using Unsupervised Learning while quantifying them without the help of a diagnosis established by clinicians. Two different datasets from different cohorts, comprising patient-reported history and questionnaires, were used in this work. Unsupervised Learning (k-prototypes) revealed notable overlaps in the data. This was further emphasized by the outcomes of the Semi-supervised Learning algorithms, where the same trend was observed with some diagnosis or class information. It became evident that the CP mechanisms overlap and cannot be classified as distinct conditions. Additionally, mixed pain mechanisms do not form an individual cluster or class, and CP should be considered as a continuum. An Autoencoder was used to reduce the data dimension and extract hidden features. Using an overlapping clustering technique, the pain mechanisms were identified. 
The pain mechanisms were also quantified while elucidating overlaps, and the dominant CP mechanism was successfully pointed out with an explainable element. A Hamming loss of 0.43 and an average precision of 0.5 were achieved when the task was treated as a multi-label classification problem. This work is a data-driven validation that there are significant overlaps in CP conditions, and CP should be considered a continuum where all CP mechanisms may co-exist. / Thesis / Master of Applied Science (MASc) / Chronic pain (CP) is a global burden and the primary cause for patients to seek medical attention. Despite continuous efforts in this area, CP remains clinically challenging to manage. The most effective method of treating CP is identifying the underlying cause or mechanism, which is often unattainable. This thesis attempted to identify the CP mechanisms existing in a patient while quantifying them from patient-reported history and questionnaire data. Unsupervised Learning was used to identify clinically meaningful clusters that revealed the three main CP mechanisms, i.e., Nociceptive, Neuropathic, and Nociplastic, achieving acceptable Hamming loss (0.43) and average precision (0.5). The results showed that the CP mechanisms co-exist and CP should be regarded as a continuum rather than distinct entities. The algorithm successfully indicated the dominant CP mechanism, a goal for optimal CP management and treatment. The results were also validated by a comparative analysis with data from another cohort that demonstrated a similar trend. Unsupervised Learning Chronic Pain Semi-supervised Learning Co-existing Chronic Pain Mechanism Overlapping Clustering Autoencoder Chronic Pain as a Continuum Chronic Pain Quantification Machine Learning Artificial Intelligence
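For readers unfamiliar with the Hamming loss reported in entry 227, the following is a minimal sketch of the standard multi-label definition: the fraction of individual label slots that are predicted incorrectly. The three label columns follow the mechanisms named in the abstract, but the patient rows are invented toy data and the function name is an assumption; this is not the thesis's evaluation code.

```python
from typing import Sequence


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of label slots where the prediction differs from the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each row must have the same number of labels")
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    if total == 0:
        raise ValueError("no labels to compare")
    return wrong / total


# Columns: Nociceptive, Neuropathic, Nociplastic (1 = mechanism present).
# Invented toy rows, one per patient.
y_true = [[1, 0, 1],
          [0, 1, 0]]
y_pred = [[1, 1, 1],
          [0, 1, 1]]
print(hamming_loss(y_true, y_pred))  # 2 wrong slots out of 6
```

A loss of 0 means every mechanism label matched for every patient, so lower values are better. Under this definition, the reported 0.43 corresponds to roughly 43% of individual label slots disagreeing with the reference labels.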
228	Discover patterns within train log data using unsupervised learning and network analysis Guo, Zehua January 2022 (has links) With the development of information technology in recent years, log analysis has gradually become a hot research topic. However, manual log analysis requires specialized knowledge and is a time-consuming task. Therefore, more and more researchers are searching for ways to automate log analysis. In this project, we explore methods for train log analysis using natural language processing and unsupervised machine learning. Multiple language models are used in this project to extract word embeddings, one of which is the traditional language model TF-IDF, and the other three are the popular transformer-based model BERT and its variants DistilBERT and RoBERTa. In addition, we compare two unsupervised clustering algorithms, DBSCAN and Mini-Batch k-means. The silhouette coefficient and Davies-Bouldin score are utilized for evaluating the clustering performance. Moreover, the metadata of the train logs is used to verify the effectiveness of the unsupervised methods. Apart from unsupervised learning, network analysis is applied to the train log data in order to explore the connections between the patterns, which are identified by train control system experts. Network visualization and centrality analysis are investigated to analyze the relationships between the patterns and their importance in graph-theoretic terms. In general, this project provides a feasible direction for conducting log analysis and processing in the future. / With the development of information technology in recent years, log analysis has gradually become a hot research topic. However, manual log analysis requires specialist knowledge and is a time-consuming task. Therefore, more and more researchers are looking for ways to automate log analysis. In this project, we explore methods for train log analysis using natural language processing and unsupervised machine learning. 
Several language models are used in this project to extract word embeddings, one of which is the traditional language model TF-IDF, while the other three are the very popular transformer-based model BERT and its variants DistilBERT and RoBERTa. In addition, we compare two unsupervised clustering algorithms, DBSCAN and Mini-Batch k-means. The silhouette coefficient and the Davies-Bouldin score are used to evaluate clustering performance. Furthermore, the metadata of the train logs is used to verify the effectiveness of the unsupervised methods. Besides unsupervised learning, network analysis is applied to the train log data to explore the connections between the patterns, which are identified by train control system experts. Network visualization and centrality analysis are investigated to analyse the relationships between the patterns and their graph-theoretic importance. Overall, this project provides a feasible direction for conducting log analysis and processing in the future. Log analysis Natural language processing Unsupervised learning Clustering Network analysis Logganalys Bearbetning av naturligt språk Oövervakat lärande Clustering Nätverksanalys Computer and Information Sciences Data- och informationsvetenskap
229	ISAR Imaging Enhancement Without High-Resolution Ground Truth Enåkander, Moltas January 2023 (has links) In synthetic aperture radar (SAR) and inverse synthetic aperture radar (ISAR), an imaging radar emits electromagnetic waves of varying frequencies towards a target and the backscattered waves are collected. By either moving the radar antenna or rotating the target and combining the collected waves, a much longer synthetic aperture can be created. These radar measurements can be used to determine the radar cross-section (RCS) of the target and to reconstruct an estimate of the target. However, the reconstructed images will suffer from spectral leakage effects and are limited in resolution. Many methods of enhancing the images exist and some are based on deep learning. Most commonly the deep learning methods rely on high-resolution ground truth data of the scene to train a neural network to enhance the radar images. In this thesis, a method that does not rely on any high-resolution ground truth data is applied to train a convolutional neural network to enhance radar images. The network takes a conventional ISAR image subject to spectral leakage effects as input and outputs an enhanced ISAR image which contains much more defined features. New RCS measurements are created from the enhanced ISAR image and the network is trained to minimise the difference between the original RCS measurements and the new RCS measurements. A sparsity constraint is added to ensure that the proposed enhanced ISAR image is sparse. The synthetic training data consists of scenes containing point scatterers that are either individual or grouped together to form shapes. The scenes are used to create synthetic radar measurements which are then used to reconstruct ISAR images of the scenes. The network is tested using both synthetic data and measurement data from a cylinder and two aeroplane models. 
The network manages to minimise spectral leakage and increase the resolution of the ISAR images created from both synthetic and measured RCSs, especially on measured data from target models which have similar features to the synthetic training data. This thesis makes two contributions. The first is a convolutional neural network that enhances ISAR images affected by spectral leakage. The neural network handles complex-valued signals as a single channel and does not perform any rescaling of the input. Secondly, it is shown that, to train the neural network, it is sufficient to calculate the new RCS for far fewer frequency samples and angular positions and compare those measurements to the corresponding frequency samples and angular positions in the original RCS. SAR SAR Imaging ISAR ISAR Imaging Machine learning Convolutional neural network CNN neural network Super resolution Unsupervised learning Signal Processing Signalbehandling Computer Systems Datorsystem
230	Bullying Detection through Graph Machine Learning : Applying Neo4j’s Unsupervised Graph Learning Techniques to the Friends Dataset Enström, Olof, Eid, Christoffer January 2023 (has links) In recent years, the pervasive issue of bullying, particularly in academic institutions, has witnessed a surge in attention. This report centers around the utilization of the Friends Dataset and Graph Machine Learning to detect possible instances of bullying in an educational setting. The importance of this research lies in the potential it has to enhance early detection and prevention mechanisms, thereby creating safer environments for students. Leveraging graph theory, Neo4j, Graph Data Science Library, and similarity algorithms, among other tools and methods, we devised an approach for processing and analyzing the dataset. Our method involves data preprocessing, application of similarity and community detection algorithms, and result validation with domain experts. The findings of our research indicate that Graph Machine Learning can be effectively utilized to identify potential bullying scenarios, with a particular focus on discerning community structures and their influence on bullying. Our results, albeit preliminary, represent a promising step towards leveraging technology for bullying detection and prevention. Bullying Graph Machine Learning Community Detection Neo4j Data Preprocessing Similarity Algorithms Friends Neo4j Unsupervised Learning Anti-bullying Computer Sciences Datavetenskap (datalogi)
