1

Distributed Local Outlier Factor with Locality-Sensitive Hashing

Zheng, Lining 08 November 2019 (has links)
Outlier detection remains an active research area due to its essential role in a wide range of applications, including intrusion detection, financial fraud detection, and medical diagnosis. Local Outlier Factor (LOF) has been one of the most influential outlier detection techniques over the past decades. LOF has distinctive advantages on skewed datasets with regions of varying density. However, the traditional centralized LOF faces new challenges in the era of big data: because of its high computational overhead, it no longer satisfies the strict time constraints required by many modern applications. A few researchers have explored distributed solutions for LOF, but existing methods are limited by their grid-based data partitioning strategy, which falls short when applied to high-dimensional data. In this thesis, we study efficient distributed solutions for LOF. A baseline MapReduce solution for LOF implemented with Apache Spark, named MR-LOF, is introduced. We demonstrate its disadvantages in communication cost and execution time through complexity analysis and experimental evaluation. Then an approximate LOF method, named MR-LOF-LSH, is proposed; it relies on locality-sensitive hashing (LSH) to partition the data and enables fully distributed local computation. To further improve the approximation, we introduce a process called cross-partition updating, in which the actual global k-nearest neighbors (k-NN) of the outlier candidates are found and information about those neighbors is used to update the candidates' outlier scores. The experimental results show that MR-LOF achieves a speedup of up to 29 times over the centralized LOF. MR-LOF-LSH further reduces the execution time by a factor of up to 9.9 compared to MR-LOF. The results also highlight that MR-LOF-LSH scales well as the cluster size increases. Moreover, with a sufficient candidate size, MR-LOF-LSH detects in most scenarios over 90% of the top outliers with the highest LOF scores computed by the centralized LOF algorithm.
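The sketch below illustrates the core idea on a single machine, outside Spark: points are bucketed with random-hyperplane LSH and LOF is computed locally inside each bucket, mirroring the fully distributed local computation described above. It is not the thesis's MR-LOF-LSH implementation; the hash family, bucket handling, and all parameter values are illustrative assumptions.

# Single-machine sketch of LSH-partitioned LOF (illustrative, not MR-LOF-LSH itself).
import numpy as np
from collections import defaultdict
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 16))          # toy dataset (n points, d features)

n_hashes = 4                               # fewer hash bits -> larger buckets
planes = rng.normal(size=(X.shape[1], n_hashes))
keys = (X @ planes > 0)                    # sign-bit signature per point

buckets = defaultdict(list)
for idx, key in enumerate(map(tuple, keys)):
    buckets[key].append(idx)

k = 20
scores = np.empty(len(X))
for idx_list in buckets.values():
    members = np.asarray(idx_list)
    if len(members) <= k:                  # tiny bucket: too few neighbors for LOF
        scores[members] = 1.0              # treat as inliers in this toy sketch
        continue
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X[members])                    # LOF computed locally within the bucket
    scores[members] = -lof.negative_outlier_factor_   # higher = more outlying

top_candidates = np.argsort(scores)[-50:]  # candidates for cross-partition updating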
2

Variable Shaped Detector: A Negative Selection Algorithm

Ataser, Zafer 01 February 2013 (has links) (PDF)
Artificial Immune Systems (AIS) are a class of computational intelligence methods developed from the principles and processes of the biological immune system. AIS methods are categorized into four main types according to the immune principles and processes they draw on: clonal selection, negative selection, immune networks, and danger theory. The negative selection algorithm (NSA) is one of the major AIS models. NSA is a supervised learning algorithm that imitates the maturation of T cells in the thymus: detectors mimic the cells, and the maturation process is simulated to generate detectors. NSA then classifies the given data either as normal (self) or as anomalous (non-self). In this classification task, NSA methods can make two kinds of errors: a self sample is classified as anomalous, or a non-self sample is classified as normal. In this thesis, a novel negative selection method, the variable-shaped detector (V-shaped detector), is proposed to increase classification accuracy, in other words to decrease classification errors. In the V-shaped detector, new approaches are introduced to define the self set and to represent detectors. The V-shaped detector combines the Local Outlier Factor (LOF) and the k-th nearest neighbor (k-NN) distance to determine a different radius for each self sample, which makes it possible to model the self space using the self samples and their radii. In addition, cubic B-splines are proposed to generate variable-shaped detectors. In detector representation, applying a cubic spline is meaningful when edge points are used, so an Edge Detection (ED) algorithm is developed to find the edge points of the given self samples. The V-shaped detector was tested on several data sets and compared with a well-known one-class classification method, SVM, and with a similar, popular negative selection method, the NSA with variable-sized detectors termed V-detector. The experiments show that the proposed method produces reasonable and comparable results.
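As a rough illustration of the LOF/k-NN combination mentioned above, the sketch below assigns each self sample a radius derived from its k-th nearest-neighbour distance, shrunk for samples with high LOF scores. The scaling rule, parameters, and data are illustrative assumptions; the thesis's cubic-B-spline detectors and Edge Detection step are not reproduced.

# Hedged sketch: per-sample radii from the k-th NN distance, scaled by LOF.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(1)
self_samples = rng.normal(size=(500, 2))           # training (self) data

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(self_samples)
dist, _ = nn.kneighbors(self_samples)              # first column is the sample itself
kth_dist = dist[:, -1]                             # distance to the k-th neighbour

lof = LocalOutlierFactor(n_neighbors=k)
lof.fit(self_samples)
lof_scores = -lof.negative_outlier_factor_         # ~1 for inliers, larger for outliers

# Dense samples (LOF ~ 1) keep their full k-NN radius; sparse or borderline
# samples get a shrunken radius so the self region is not over-generalised.
radii = kth_dist / lof_scores

def is_self(x, self_samples=self_samples, radii=radii):
    """Classify x as self if it lies inside any self sample's radius."""
    d = np.linalg.norm(self_samples - x, axis=1)
    return bool(np.any(d <= radii))

print(is_self(np.zeros(2)), is_self(np.array([8.0, 8.0])))   # True, False (typically)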
3

On pruning and feature engineering in Random Forests

Fawagreh, Khaled January 2016 (has links)
Random Forest (RF) is an ensemble classification technique developed by Leo Breiman over a decade ago. Compared with other ensemble techniques, it has proved its accuracy and superiority. Many researchers, however, believe that there is still room for optimizing RF further by enhancing and improving its predictive accuracy. This explains why there have been many extensions of RF, each employing a variety of techniques and strategies to improve certain aspects of it. The main focus of this dissertation is to develop new extensions of RF using optimization techniques that, to the best of our knowledge, have never been used before to optimize RF: clustering, the local outlier factor, diversified weighted subspaces, and replicator dynamics. Applying these techniques to RF produced four extensions, termed CLUB-DRF, LOFB-DRF, DSB-RF, and RDB-DR respectively. Experimental studies on 15 real datasets showed favorable results, demonstrating the potential of the proposed methods. Performance-wise, CLUB-DRF ranks first in terms of accuracy and classification speed, making it ideal for real-time applications and for machines and devices with limited memory and processing power.
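A generic, hedged sketch of the clustering-based pruning idea behind CLUB-DRF: each tree is represented by its prediction vector on a held-out set, the trees are clustered, and one representative per cluster forms the pruned ensemble. The cluster count, the representative-selection rule, and the dataset are illustrative assumptions, not the dissertation's exact procedure.

# Hedged sketch of clustering-based forest pruning (illustrative parameters).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Represent each tree by its prediction vector on the validation set.
preds = np.array([tree.predict(X_val) for tree in rf.estimators_])

n_clusters = 20
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(preds)

# Keep the most accurate tree from each cluster.
acc = (preds == y_val).mean(axis=1)
pruned = [rf.estimators_[np.where(labels == c)[0][np.argmax(acc[labels == c])]]
          for c in range(n_clusters)]

# Majority vote of the pruned (10x smaller) ensemble.
votes = np.array([t.predict(X_val) for t in pruned])
pruned_acc = ((votes.mean(axis=0) > 0.5).astype(int) == y_val).mean()
print(f"full forest: {rf.score(X_val, y_val):.3f}  pruned: {pruned_acc:.3f}")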
4

Exogenous Fault Detection in Aerial Swarms of UAVs / Exogen Feldetektering i Svärmar med UAV:er

Westberg, Maja January 2023 (has links)
In this thesis, the main focus is to formulate and test a suitable model for exogenous fault detection in swarms of unmanned aerial vehicles (UAVs), which are aerial autonomous systems. FOI, the Swedish Defense Research Agency, provided the thesis project and research question. Inspired by previous work, the implementation uses behavioral feature vectors (BFVs) to simulate the movements of the UAVs and to identify anomalies in their behaviors. The chosen algorithm for fault detection is the density-based cluster analysis method known as the Local Outlier Factor (LOF). This method is built on the k-Nearest Neighbor (kNN) algorithm and employs densities to detect outliers. In this thesis, it is implemented to detect faulty agents within the swarm based on their behavior. A confusion matrix and some associated equations are used to evaluate the accuracy of the method. Six features are selected for examination in the LOF algorithm. The first two features assess the number of neighbors in a circle around the agent, while the others consider traversed distance, height, velocity, and rotation. Three different fault types are implemented and induced in one of the agents within the swarm. The first two faults are motor failures, and the last one is a sensor failure. The algorithm is successfully implemented, and the evaluation of the faults is conducted using three different metrics. Several sets of experiments are performed to assess the optimal value for the LOF threshold and to understand the model's performance. The thesis work results in a strong LOF value which yields an acceptable F1 score, signifying that the accuracy of the implementation is at a satisfactory level.
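A minimal sketch of the evaluation loop described above: LOF scores computed over per-agent behavioural feature vectors are thresholded to flag faulty agents and scored with a confusion matrix and F1. The synthetic feature values and the threshold are illustrative assumptions rather than values from the thesis.

# Hedged sketch: LOF over behavioural feature vectors, evaluated with F1.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
healthy = rng.normal(0.0, 1.0, size=(29, 6))    # 6 behavioural features per agent
faulty = rng.normal(4.0, 1.0, size=(1, 6))      # one agent with an induced fault
bfv = np.vstack([healthy, faulty])
y_true = np.array([0] * 29 + [1])               # 1 = faulty

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(bfv)
scores = -lof.negative_outlier_factor_          # higher = more anomalous

threshold = 2.0                                 # LOF threshold (would be tuned experimentally)
y_pred = (scores > threshold).astype(int)

print(confusion_matrix(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))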
5

Anomaly Detection for Water Quality Data

YAN, YAN January 2019 (has links)
Real-time water quality monitoring using automated systems with sensors is becoming increasingly common, which both enables and demands timely identification of unexpected values. Technical issues create anomalies, and at the rate at which data arrive, manual detection of problematic data is impractical. This thesis deals with the problem of anomaly detection for water quality data using machine learning and statistical learning approaches. Anomalies in data can cause serious problems in posterior analysis and lead to poor decisions or incorrect conclusions. Five time-series anomaly detection techniques have been analyzed: local outlier factor (machine learning), isolation forest (machine learning), robust random cut forest (machine learning), seasonal hybrid extreme studentized deviate (statistical learning), and exponential moving average (statistical learning). Extensive experimental analysis of these techniques has been performed on data sets collected from sensors deployed in a wastewater treatment plant. The results are very promising. In the experiments, three approaches successfully detected anomalies in the ammonia data set. With the temperature data set, the local outlier factor successfully detected all twenty-six outliers, whereas the seasonal hybrid extreme studentized deviate detected only one anomalous point. The exponential moving average identified ten time ranges with anomalies, eight of which cover a total of fourteen anomalies. The reproducible experiments demonstrate that the local outlier factor is a feasible approach for detecting anomalies in water quality data. Isolation forest and robust random cut forest also assign high anomaly scores to the anomalies. The result of the primary experiment confirms that the local outlier factor is much faster than isolation forest, robust random cut forest, seasonal hybrid extreme studentized deviate, and exponential moving average. / Thesis / Master of Computer Science (MCS)
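The sketch below reproduces the flavour of this comparison on a synthetic sensor series with injected spikes (the actual experiments used wastewater-treatment sensor data); the two-feature embedding and contamination settings are illustrative assumptions.

# Hedged sketch: LOF vs. isolation forest on a synthetic sensor series.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(3)
series = np.sin(np.linspace(0, 40, 2000)) + rng.normal(0, 0.1, 2000)
anomaly_idx = rng.choice(2000, size=10, replace=False)
series[anomaly_idx] += 3.0                       # injected anomalies

# Use (value, difference to previous value) as a simple two-feature embedding.
X = np.column_stack([series[1:], np.diff(series)])

lof_pred = LocalOutlierFactor(n_neighbors=50, contamination=0.01).fit_predict(X)
if_pred = IsolationForest(contamination=0.01, random_state=0).fit_predict(X)

print("LOF flagged:", np.where(lof_pred == -1)[0] + 1)
print("IF  flagged:", np.where(if_pred == -1)[0] + 1)
print("true anomalies:", np.sort(anomaly_idx))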
6

Exploring Integration of Predictive Maintenance using Anomaly Detection : Enhancing Productivity in Manufacturing / Utforska integration av prediktivt underhåll med hjälp av avvikelsedetektering : Förbättra produktiviteten inom tillverkning

Bülund, Malin January 2024 (has links)
In the manufacturing industry, predictive maintenance (PdM) stands out by leveraging data analytics and IoT technologies to predict machine failures, offering a significant advancement over traditional reactive and scheduled maintenance practices. The aim of this thesis was to examine how anomaly detection algorithms could be utilized to anticipate potential breakdowns in manufacturing operations, while also investigating the feasibility and potential benefits of integrating PdM strategies into a production line. The methodology of this project consisted of a literature review, application of machine learning (ML) algorithms, and interviews. Firstly, the literature review provided a foundation for exploring the benefits of PdM and its impact on production-line productivity, thereby shaping the development of the interview questions. Secondly, ML algorithms were employed to analyze data and predict equipment failures. The algorithms used in this project were: Isolation Forest (IF), Local Outlier Factor (LOF), Logistic Regression (LR), One-Class Support Vector Machine (OC-SVM), and Random Forest (RF). Lastly, interviews with production line personnel provided qualitative insights into current maintenance practices and perceptions of PdM. The findings from this project underscore the efficacy of the IF model in identifying potential equipment failures, emphasizing its key role in improving future PdM strategies to enhance maintenance schedules and boost operational efficiency. Insights gained from both the literature and the interviews underscore the transformative potential of PdM in refining maintenance strategies, enhancing operational efficiency, and minimizing unplanned downtime. More broadly, the successful implementation of these technologies is expected to revolutionize manufacturing processes, driving towards more sustainable and efficient industrial operations.
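One hedged way the IF model highlighted above could be applied in practice is sketched below: the model is fitted on sensor readings from healthy operation only, and new readings with low scores trigger an inspection. The features, alert rule, and data are illustrative assumptions, not details from the thesis.

# Hedged sketch: isolation forest fitted on healthy-operation data for PdM alerts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
healthy = rng.normal(loc=[50.0, 0.30], scale=[2.0, 0.02], size=(5000, 2))  # temperature, vibration
model = IsolationForest(n_estimators=200, random_state=0).fit(healthy)

incoming = np.array([[51.0, 0.31],     # normal reading
                     [63.0, 0.55]])    # drifting towards failure
scores = model.decision_function(incoming)   # lower = more anomalous
for reading, s in zip(incoming, scores):
    status = "ALERT: schedule inspection" if s < 0 else "ok"
    print(reading, round(float(s), 3), status)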
7

Data Driven Energy Efficiency of Ships

Taspinar, Tarik January 2022 (has links)
Decreasing the fuel consumption, and thus the greenhouse gas emissions, of vessels has emerged as a critical topic for both ship operators and policy makers in recent years. The speed of vessels has long been recognized as having the highest impact on fuel consumption, and proposed solutions such as "speed optimization" and "speed reduction" are ongoing discussion topics at the International Maritime Organization. The aim of this study is to develop a speed optimization model using time-constrained genetic algorithms (GA). Subsequently, this paper also presents the application of machine learning (ML) regression methods to build a model for predicting the fuel consumption of vessels. The local outlier factor algorithm is used to eliminate outliers in the prediction features. In the boosting and tree-based regression methods, overfitting is observed after hyperparameter tuning, so an early stopping technique is applied to the overfitted models. In this study, speed is also found to be the most important feature for fuel consumption prediction models. On the other hand, the GA evaluation results show that random modifications to the default speed profile can increase GA performance, and thus fuel savings, more than constant speed limits during voyages. The GA results also indicate that using high crossover rates and low mutation rates can increase fuel savings. Further research is recommended that includes fuel and bunker prices to determine fuel efficiency more accurately.
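A brief sketch of the preprocessing step described above, under illustrative assumptions: LOF filters outlying samples before a boosted regression model with early stopping is fitted to predict fuel consumption. The toy cubic speed-fuel relation, feature set, and regressor choice are placeholders, not the study's actual data or model.

# Hedged sketch: LOF-based outlier removal before a fuel-consumption regressor.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
speed = rng.uniform(8, 20, 1000)                      # knots
draft = rng.uniform(7, 12, 1000)                      # metres
fuel = 0.02 * speed**3 + 1.5 * draft + rng.normal(0, 2, 1000)  # tonnes/day (toy cubic law)
fuel[:15] += rng.normal(60, 10, 15)                   # corrupted log entries

X = np.column_stack([speed, draft])
mask = LocalOutlierFactor(n_neighbors=30, contamination=0.02).fit_predict(
    np.column_stack([X, fuel])) == 1                  # keep only inliers

model = GradientBoostingRegressor(n_estimators=2000,
                                  validation_fraction=0.2,
                                  n_iter_no_change=20,     # early stopping
                                  random_state=0)
model.fit(X[mask], fuel[mask])
print("kept", mask.sum(), "of", len(mask), "samples; trees used:", model.n_estimators_)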
8

Signal Processing Methods for Reliable Extraction of Neural Responses in Developmental EEG

Kumaravel, Velu Prabhakar 27 February 2023 (has links)
Studying newborns in the first days of life, prior to their experiencing the world, provides remarkable insights into the neurocognitive predispositions that humans are endowed with. First, it helps us improve our current knowledge of the development of a typical brain. Second, it potentially opens new pathways for earlier diagnosis of several developmental neurocognitive disorders such as Autism Spectrum Disorder (ASD). While most studies investigating early cognition in the literature are purely behavioural, there has recently been an increasing number of neuroimaging studies in newborns and infants. Electroencephalography (EEG) is one of the most suitable neuroimaging techniques for investigating neurocognitive functions in human newborns because it is non-invasive and quick and easy to mount on the head. Since EEG offers a versatile design with a custom number of channels/electrodes, an ergonomic wearable solution could help study newborns outside clinical settings, such as in their homes. Compared to adult EEG, newborn EEG data differ in two main aspects: 1) in experimental designs investigating stimulus-related neural responses, the collected data are extremely short in length due to the reduced attentional span of newborns; 2) the data are heavily contaminated with noise due to uncontrollable movement artifacts. Since EEG processing methods for adults are not adapted to very short data lengths and usually deal with well-defined, stereotyped artifacts, they are unsuitable for newborn EEG. As a result, researchers clean the data manually, which is a subjective and time-consuming task. This thesis work is specifically dedicated to developing novel (semi-)automated signal processing methods for noise removal and for extracting reliable neural responses specific to this population. Solutions are proposed both for high-density EEG for traditional lab-based research and for wearable EEG for clinical applications. To this end, this thesis first presents novel signal processing methods applied to newborn EEG: 1) Local Outlier Factor (LOF) for detecting and removing bad/noisy channels; 2) Artifacts Subspace Reconstruction (ASR) for detecting and removing or correcting bad/noisy segments. Then, based on these algorithms and other preprocessing functionalities, a robust preprocessing pipeline, Newborn EEG Artifact Removal (NEAR), is proposed. Notably, this is the first time LOF has been explored for EEG bad channel detection, despite being a popular outlier detection technique for other kinds of data such as the electrocardiogram (ECG). Even though ASR is already an established artifact removal algorithm originally developed for mobile adult EEG, this thesis explores the possibility of adapting ASR to short newborn EEG data, which is the first work of its kind. NEAR is validated on simulated, real newborn, and infant EEG datasets. We used the SEREEGA toolbox to simulate neurologically plausible synthetic data and contaminated a certain number of channels and segments with artifacts commonly manifested in developmental EEG. We used newborn EEG data (n = 10, age range: 1 to 4 days) recorded in our lab based on a frequency-tagging paradigm. The chosen paradigm consists of visual stimuli to investigate the cortical bases of face-like pattern processing, and the results were published in 2019. To test NEAR performance on an older population with an event-related potential (ERP) design and with data recorded in another lab, we also evaluated NEAR on EEG data recorded from 9-month-old infants (n = 14) with an ERP paradigm. The experimental paradigm for these datasets consists of auditory stimuli to investigate the electrophysiological evidence for understanding maternal speech, and the results were published in 2012. Since the authors of these independent studies employed manual artifact removal, the obtained neural responses serve as ground truth for validating NEAR's artifact removal performance. For comparative evaluation, we considered the performance of two state-of-the-art pipelines designed for older infants. Results show that NEAR is successful in recovering the neural responses (specific to the EEG paradigm and the stimuli) compared to the other pipelines. In sum, this thesis presents a set of methods for artifact removal and extraction of stimulus-related neural responses specifically adapted to newborn and infant EEG data that will hopefully contribute to strengthening the reliability and reproducibility of developmental cognitive neuroscience studies, both in research laboratories and in clinical applications.
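A simplified sketch of LOF-based bad-channel detection in the spirit of NEAR: each channel's time course is treated as one observation, LOF is computed across channels, and channels above a cut-off are flagged. The synthetic data and cut-off value are illustrative assumptions, not NEAR's defaults.

# Hedged sketch: flagging bad EEG channels with LOF (illustrative data and cut-off).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
n_channels, n_samples = 64, 5000
eeg = rng.normal(0, 10e-6, size=(n_channels, n_samples))   # clean channels (~10 uV)
eeg[7] = rng.normal(0, 200e-6, size=n_samples)             # high-amplitude artifactual channel
eeg[23] += np.linspace(0, 500e-6, n_samples)               # channel with a large slow drift

lof = LocalOutlierFactor(n_neighbors=20, metric="euclidean")
lof.fit(eeg)                                 # observations = channels, features = time points
scores = -lof.negative_outlier_factor_

cutoff = 2.5
bad = np.where(scores > cutoff)[0]
print("flagged channels:", bad, "scores:", np.round(scores[bad], 2))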
