221 |
Finding Interesting Subgraphs with Guarantees. Cadena, Jose, 29 January 2018 (has links)
Networks are a mathematical abstraction of the interactions between a set of entities, with extensive applications in social science, epidemiology, bioinformatics, and cybersecurity, among others. There are many fundamental problems when analyzing network data, such as anomaly detection, dense subgraph mining, motif finding, information diffusion, and epidemic spread. A common underlying task in all these problems is finding an "interesting subgraph"; that is, finding a part of the graph---usually small relative to the whole---that optimizes a score function and has some property of interest, such as connectivity or a minimum density.
Finding subgraphs that satisfy common constraints of interest, such as the ones above, is computationally hard in general, and state-of-the-art algorithms for many problems in network analysis are heuristic in nature. These methods are fast and usually easy to implement. However, they come with no theoretical guarantees on the quality of the solution, which makes it difficult to assess how the discovered subgraphs compare to an optimal solution, which in turn affects the data mining task at hand. For instance, in anomaly detection, solutions with low anomaly score lead to sub-optimal detection power. On the other end of the spectrum, there have been significant advances on approximation algorithms for these challenging graph problems in the theoretical computer science community. However, these algorithms tend to be slow, difficult to implement, and they do not scale to the large datasets that are common nowadays.
The goal of this dissertation is to develop scalable algorithms with theoretical guarantees for various network analysis problems, where the underlying task is to find subgraphs with constraints. We find interesting subgraphs with guarantees by adapting techniques from parameterized complexity, convex optimization, and submodular optimization. These techniques are well known in the algorithm design literature, but applied directly they lead to slow and impractical algorithms. One unifying theme in the problems that we study is that our methods are scalable without sacrificing the theoretical guarantees of these algorithm design techniques. We achieve this combination of scalability and rigorous bounds by exploiting properties of the problems we are trying to optimize, by decomposing or compressing the input graph to a manageable size, and by parallelization.
We consider problems on network analysis for both static and dynamic network models, and we illustrate the power of our methods in applications such as public health, sensor data analysis, and event detection using social media data. / Ph. D. / Networks are a mathematical abstraction of the interactions between a set of entities, with extensive applications in social science, epidemiology, bioinformatics, and cybersecurity, among others. There are many fundamental problems when analyzing network data, such as anomaly detection, dense subgraph mining, motif finding, information diffusion, and epidemic spread. A common underlying task in all these problems is finding an "interesting subgraph"; that is, finding a part of the graph—usually small relative to the whole—that optimizes a score function and has some property of interest, such as being connected.
Finding subgraphs that satisfy common constraints of interest is computationally hard, and existing techniques for many problems of this kind are heuristic in nature. Heuristics are fast and usually easy to implement. However, they come with no theoretical guarantees on the quality of the solution, which makes it difficult to assess how the discovered subgraphs compare to an optimal solution, which in turn affects the data mining task at hand. For instance, in anomaly detection, solutions with low anomaly score lead to sub-optimal detection power. On the other end of the spectrum, there has been significant progress on these challenging graph problems in the theoretical computer science community. However, these techniques tend to be slow, difficult to implement, and they do not scale to the large datasets that are common nowadays.
The goal of this dissertation is to develop scalable algorithms with theoretical guarantees for various network analysis problems, where the underlying task is to find subgraphs with constraints. One unifying theme in the problems that we study is that our methods are scalable without sacrificing theoretical guarantees. We achieve this combination of scalability and rigorous bounds by exploiting properties of the problems we are trying to optimize, by decomposing or compressing the input graph to a manageable size, and by parallelization.
We consider problems on network analysis for both static and dynamic network models, and we illustrate the power of our methods in applications such as public health, sensor data analysis, and event detection using social media data.
|
222 |
Machine Learning-Assisted Log Analysis for Uncovering Anomalies. Rurling, Samuel, January 2024 (has links)
Logs, which are semi-structured records of system runtime information, contain many valuable insights. By looking at the logs, developers and operators can analyse their system's behavior. This is especially necessary when something in the system goes wrong, as nonconforming logs may indicate a root cause. With the growing complexity and size of IT systems, however, millions of logs are generated hourly, and reviewing them manually can become an all-consuming task. A potential aid in log analysis is machine learning: by leveraging their ability to learn automatically from experience, machine learning algorithms can be trained to analyse logs. In this thesis, machine learning is used to perform anomaly detection, the discovery of such nonconforming logs. An experiment is created in which four feature extraction methods, that is, four ways of creating data representations from the logs, are tested in combination with three machine learning models: LogCluster, PCA, and SVM. Additionally, a neural network architecture called an LSTM network, which can craft its own features and analyse them, is explored as well. The results show that the LSTM performed best in terms of precision, recall, and F1 score, followed by SVM, LogCluster, and PCA, in combination with a feature extraction method using word embeddings.
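One of the compared models, PCA, can detect log anomalies via reconstruction error against a subspace learned from normal data. The numpy sketch below illustrates the idea under assumptions: event-count features (one of several possible log representations) and a toy matrix invented for illustration, not the thesis's actual data or pipeline.

```python
import numpy as np

def fit_pca(X_train, n_components=2):
    """Learn the 'normal' subspace from anomaly-free log feature vectors."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    return mu, Vt[:n_components].T  # mean and top principal directions

def anomaly_score(x, mu, V):
    """Reconstruction error: distance from x to the learned normal subspace."""
    xc = x - mu
    return float(np.linalg.norm(xc - xc @ V @ V.T))

# Toy event-count features: each row = one log window, columns = event types.
train = np.array([[5, 1, 0], [6, 1, 0], [5, 2, 0], [6, 2, 0]], float)
mu, V = fit_pca(train)
print(anomaly_score(np.array([5.0, 2.0, 0.0]), mu, V))  # low: fits the normal pattern
print(anomaly_score(np.array([0.0, 0.0, 9.0]), mu, V))  # high: unseen event type dominates
```

A window whose event mix lies outside the learned subspace gets a large residual, which is the signal the PCA-based detector thresholds on.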
|
223 |
Enhancing Computational Efficiency in Anomaly Detection with a Cascaded Machine Learning Model. Yu, Teng-Sung, January 2024 (has links)
This thesis presents and evaluates a new cascading machine learning model framework for anomaly detection, which is essential for modern industrial applications where computing efficiency is crucial. Traditional deep learning algorithms are often difficult to deploy effectively in edge computing due to limited processing power and memory. This study addresses the challenge by creating a cascading model framework that strategically combines lightweight and more complex models to improve inference efficiency while maintaining detection accuracy. We propose a cascading framework consisting of a One-Class Support Vector Machine (OCSVM) for rapid initial anomaly detection and a Variational Autoencoder (VAE) for more precise prediction in uncertain cases. The cascade between the OCSVM and the VAE enables the system to handle regular data instances efficiently, invoking the more complex analysis only when required. The framework was tested in real-world scenarios, including anomaly detection in an automotive air pressure system, as well as on the MNIST dataset. These tests demonstrate the framework's practical applicability and effectiveness across diverse settings, underscoring its potential for broad use in industrial applications.
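The routing logic of such a cascade can be sketched independently of the concrete models. Everything below is an assumption for illustration: the thresholds, and the stand-in scorers in place of a trained OCSVM and VAE.

```python
def cascade_predict(x, fast_model, slow_model, low=0.2, high=0.8):
    """Route a sample through a cheap model first; escalate uncertain cases.

    fast_model/slow_model return an anomaly probability in [0, 1].
    Only samples the fast model is unsure about pay the slow model's cost.
    """
    p = fast_model(x)
    if p <= low:
        return "normal", "fast"
    if p >= high:
        return "anomaly", "fast"
    # Uncertain region: defer to the expensive, more precise model.
    p = slow_model(x)
    return ("anomaly" if p >= 0.5 else "normal"), "slow"

# Stand-in scorers (a real deployment would use a trained OCSVM and VAE).
fast = lambda x: min(abs(x) / 10, 1.0)       # crude distance-based score
slow = lambda x: 1.0 if abs(x) > 6 else 0.0  # pretend reconstruction-error score

print(cascade_predict(1, fast, slow))  # ('normal', 'fast')
print(cascade_predict(5, fast, slow))  # ('normal', 'slow')  escalated
print(cascade_predict(9, fast, slow))  # ('anomaly', 'fast')
```

The efficiency gain comes from the middle branch being rare: most traffic is resolved by the cheap model, so the expensive model's latency is paid only on ambiguous inputs.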
|
224 |
Anomaly Detection for Insider Threats : Comparative Evaluation of LSTM Autoencoders, Isolation Forest, and Elasticsearch on Two Datasets. / Anomalidetektion för interna hot : Utvärdering av LSTM-autoencoders, Isolation Forest och Elasticsearch på två dataset. Fagerlund, Martin, January 2024 (has links)
Insider threat detection is one of cybersecurity's most challenging and costly problems. Anomalous behaviour can take multiple shapes, which places great demands on the anomaly detection system. Significant research has been conducted in the area, but the existing experimental datasets' absence of real data leaves uncertainty about the proposed systems' realistic performance. This thesis introduces a new insider threat dataset consisting exclusively of events from real users. The dataset is used to comparatively evaluate the performance of various anomaly detection techniques. Three techniques were evaluated: LSTM autoencoder, isolation forest, and Elasticsearch's anomaly detection. The dataset's properties inhibited any hyperparameter tuning of the LSTM autoencoders, since the data lacks sufficient positive instances; therefore, the architecture and hyperparameter settings are taken from previously published research. The implemented anomaly detection models were also evaluated on the commonly used CERT v4.2 insider threat test dataset. The results show that the LSTM autoencoder provides better anomaly detection on the CERT v4.2 dataset in terms of accuracy, precision, recall, F1 score, and false positive rate than the other tested models. However, the investigated systems performed more similarly on the introduced dataset with real data. The LSTM autoencoder achieved the best recall, precision, and F1 score; the isolation forest showed an almost as good F1 score with a lower false positive rate; and Elasticsearch's anomaly detection reported the best accuracy and false positive rate. Additionally, the LSTM autoencoder generated the best ROC curve and precision-recall curve. While Elasticsearch's anomaly detection showed promising results concerning accuracy, it performed with low precision and was explicitly implemented to detect certain anomalies, which reduced its generalisability.
In conclusion, the results show that the LSTM autoencoder is a feasible anomaly detection model for detecting abnormal behaviour in real user-behaviour logs. Secondly, Elasticsearch's anomaly detection can be used but is better suited to less complex data analysis tasks. Further, the thesis analyzes the introduced dataset and problematizes its application. In the closing chapter, the study identifies domains where further research should be conducted. / Insider threats are one of the most difficult and costly problems in cybersecurity. Anomalous behaviour can take many different forms, which places great demands on the systems meant to detect it. Much research has been conducted in this area to provide powerful systems. Unfortunately, the existing datasets used in research lack real data, which makes evaluation of the systems' real-world capability uncertain. This report introduces a new dataset with data exclusively from real users. The dataset is used to analyse the performance of three anomaly detection systems: LSTM autoencoder, isolation forest, and Elasticsearch's built-in anomaly detection. The dataset's properties prevented hyperparameter tuning of the LSTM autoencoders, as the dataset contains too few positive data points. Therefore, the architecture and hyperparameter settings were taken from previous research. The implemented models were also compared on the widely used CERT v4.2 dataset. The results on the CERT v4.2 dataset showed that the LSTM autoencoder gives better anomaly detection than the other models in terms of accuracy, precision, recall, F1 score, and false positive rate. When the models were tested on the introduced dataset, they performed more similarly. The LSTM autoencoder performs best in recall, precision, and F1 score, while isolation forest reached almost as high an F1 score but with a lower share of false positive predictions.
Elasticsearch's anomaly detection achieved the highest accuracy with the lowest share of false positives, though with low precision compared to the other two models. Elasticsearch's anomaly detection also had to be implemented to target specifically the anomalies it was meant to detect, which makes its area of application less general. In summary, the results show that LSTM autoencoders are an adequate option for detecting abnormalities in logs of events from real users. Furthermore, Elasticsearch's anomaly detection can be used for these purposes to a certain extent, but it is better suited to tasks of lower complexity. Beyond the model results, the introduced dataset is analysed and some properties that complicate its use and credibility are identified. Finally, interesting related areas where further research should take place are specified.
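The metrics this comparison rests on (precision, recall, F1 score, and false positive rate) can be computed directly from binary labels. The sketch below uses toy labels invented for illustration, with the class imbalance typical of user logs.

```python
def detection_metrics(y_true, y_pred):
    """Precision, recall, F1, and false positive rate for binary anomaly labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}

# 1 = insider-threat event, 0 = benign; imbalance mirrors real user logs.
y_true = [0] * 8 + [1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
print(detection_metrics(y_true, y_pred))
```

Note why accuracy alone misleads here: a detector that labels everything benign scores 80% accuracy on this data while catching no threats, which is why the thesis also reports recall, F1, and the false positive rate.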
|
225 |
Deep-Learning Conveyor Belt Anomaly Detection Using Synthetic Data and Domain Adaptation. Fridesjö, Jakob, January 2024 (has links)
Conveyor belts are essential components in the mining and mineral processing industry, used to transport granular material and objects. However, foreign objects/anomalies transported along the conveyor belts can have catastrophic and costly consequences. A solution to the problem is to use machine vision systems based on AI algorithms to detect anomalies before any incidents occur. The challenge, however, is obtaining sufficient training data when images containing anomalous objects are, by definition, scarce. This thesis investigates how synthetic data generated by a granular simulator can be used to train a YOLOv8-based model to detect foreign objects in a real-world setting. Furthermore, the domain gap between the synthetic and real-world data domains is bridged by style transfer using CycleGAN. Results show that instance segmentation of conveyors with YOLOv8s-seg is possible even when the model is trained on synthetic data. It is also shown that domain adaptation by style transfer using CycleGAN can improve the performance of the synthetically trained model, even when the real-world data lacks anomalies.
|
226 |
Adaptive Anomaly Prediction Models. Farhangi, Ashkan, 01 January 2024 (has links) (PDF)
Anomalies are rare in nature. This rarity makes it difficult for models to provide accurate and reliable predictions. Deep learning models typically excel at identifying underlying patterns from abundant data through supervised learning but struggle with anomalies due to their limited representation. As a result, a significant portion of prediction errors arises from these rare and poorly represented events. Here, we present various methods and frameworks that develop models' specialized ability to better detect and predict anomalies. Additionally, we improve the interpretability of these models by enhancing their anomaly awareness, leading to stronger performance on real-world datasets that often contain such anomalies. Because our models dynamically adapt to the significance of anomalies, they benefit from increased accuracy and prioritization of rare events in predictions. We demonstrate these capabilities on real-world datasets across multiple domains. Our results show that this framework enhances accuracy and interpretability, improving upon existing methods in anomaly prediction tasks.
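One simple way to make a model prioritize rare events is to upweight them in the training loss. The sketch below is a generic illustration of that idea, not the dissertation's actual framework; the weight value and toy series are assumptions.

```python
import numpy as np

def weighted_mse(y_true, y_pred, anomaly_mask, anomaly_weight=5.0):
    """MSE that counts errors on rare (anomalous) points more heavily."""
    w = np.where(anomaly_mask, anomaly_weight, 1.0)
    return float(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))

y_true = np.array([1.0, 1.0, 1.0, 10.0])    # last point is a rare spike
mask = np.array([False, False, False, True])
flat = np.full(4, 1.0)                       # a forecast that ignores the spike
plain = float(np.mean((y_true - flat) ** 2))
weighted = weighted_mse(y_true, flat, mask)
print(plain, weighted)  # the weighted loss penalizes missing the spike harder
```

Under a plain MSE, a forecaster can score well while always missing the spikes; the weighted variant makes that strategy expensive, pushing training toward the rare events.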
|
227 |
Event Sequence Identification and Deep Learning Classification for Anomaly Detection and Predication on High-Performance Computing Systems. Li, Zongze, 12 1900 (has links)
High-performance computing (HPC) systems continue to grow in both scale and complexity. These large-scale, heterogeneous systems generate tens of millions of log messages every day. Effective log analysis for understanding system behaviors and identifying system anomalies and failures is highly challenging. Existing log analysis approaches use line-by-line message processing; they are not effective at discovering subtle behavior patterns and their transitions, and thus may overlook critical anomalies. In this dissertation research, I propose a system log event block detection (SLEBD) method which can accurately and automatically extract the log messages that belong to a component or system event into an event block (EB). At the event level, we can discover new event patterns, the evolution of system behavior, and the interaction among different system components. To find critical event sequences, existing sequence mining methods are mostly based on the Apriori algorithm, which is compute-intensive and long-running. I develop a novel, topology-aware sequence mining (TSM) algorithm which efficiently generates sequence patterns from the extracted event block lists. I also train a long short-term memory (LSTM) model to cluster sequences preceding specific events. With the generated sequence patterns and the trained LSTM model, we can predict whether an event is going to occur normally or not. To accelerate such predictions, I propose a design flow by which recurrent neural network (RNN) designs can be converted into register-transfer level (RTL) implementations deployed on FPGAs. Due to their high parallelism and low power, FPGAs achieve a greater speedup and better energy efficiency than CPUs and GPUs in our experiments.
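The idea of grouping log messages into event blocks can be caricatured in a few lines. Real SLEBD is far more involved (templates, timing, topology), so the component-prefix heuristic and the sample log lines below are purely illustrative assumptions.

```python
def extract_event_blocks(log_lines):
    """Group consecutive log messages from the same component into event blocks.

    A much-simplified stand-in for event block detection: each "component: message"
    line is attributed to its component, and a run of lines from one component
    forms one block.
    """
    blocks = []
    current_component, current_block = None, []
    for line in log_lines:
        component, _, message = line.partition(": ")
        if component != current_component and current_block:
            blocks.append((current_component, current_block))
            current_block = []
        current_component = component
        current_block.append(message)
    if current_block:
        blocks.append((current_component, current_block))
    return blocks

logs = [
    "sched: job 7 queued",
    "sched: job 7 dispatched",
    "node42: ECC error corrected",
    "sched: job 8 queued",
]
for component, block in extract_event_blocks(logs):
    print(component, block)
```

Once messages are lifted from lines to blocks like these, sequence mining operates over block lists rather than raw text, which is what makes pattern discovery at the event level tractable.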
|
228 |
Data-driven Algorithms for Critical Detection Problems: From Healthcare to Cybersecurity Defenses. Song, Wenjia, 16 January 2025 (has links)
Machine learning and data-driven approaches have been widely applied to critical detection problems, but their performance is often hindered by data-related challenges. This dissertation seeks to address three key challenges: data imbalance, scarcity of high-quality labels, and excessive data processing requirements, through studies in healthcare and cybersecurity.
We study healthcare problems with imbalanced clinical datasets that lead to performance disparities across prediction classes and demographic groups. We systematically evaluate these disparities and propose a Double Prioritized (DP) bias correction method that significantly improves model performance for underrepresented groups and reduces biases. Cyber threats, such as ransomware and advanced persistent threats (APTs), have grown in recent years. Existing ransomware defenses often rely on black-box models trained on unverified traces, providing limited interpretability. To address the scarcity of reliably labeled training data, we experimentally profile the runtime behaviors of real-world ransomware samples and identify core patterns, enabling explainable and trustworthy detection. For APT detection, the large size of system audit logs hinders real-time detection. We introduce Madeline, a lightweight system that efficiently processes voluminous logs with compact representations, overcoming real-time detection bottlenecks.
These contributions provide deployable and effective solutions, offering insights for future research within and beyond the fields of healthcare and cybersecurity. / Doctor of Philosophy / Machine learning and data-driven methods have been widely used to solve important detection problems, but their effectiveness is often limited by challenges related to the data they rely on. This dissertation focuses on three key challenges: imbalanced data, a lack of high-quality information, and the need to process large amounts of data quickly. We address these issues through studies in healthcare and cybersecurity.
Data from clinical studies is often unbalanced, with certain patient groups or outcomes underrepresented. This imbalance leads to inconsistent prediction accuracy across groups. We address this by developing a method called Double Prioritized (DP) bias correction, which significantly improves accuracy for minority groups and reduces biases. Cyber threats pose increasingly serious risks. One prevalent type of malware is ransomware, which encrypts the victim's data and demands payment for recovery. Current ransomware defenses often learn from unverified data and make decisions without clear explanations. To improve this, we analyze how real-world ransomware behaves, identifying patterns that allow for more explainable and reliable detection. Another type of threat is the advanced persistent threat (APT), which aims to stay undetected in the victim's system for a long time and exfiltrate data gradually. For APT detection, the challenge lies in analyzing the vast amount of activity data the system generates, which slows down detection. We introduce Madeline, a system designed to process large logs efficiently, enabling fast and accurate threat detection.
These contributions provide practical solutions to pressing problems in healthcare and cybersecurity and offer ideas for future improvements within and beyond these fields.
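The oversampling intuition behind bias correction for imbalanced clinical data can be sketched as follows. This is a simplified stand-in, not the dissertation's actual Double Prioritized (DP) procedure, and the toy records and replication factor are invented.

```python
import random

def prioritized_oversample(samples, group_of, target_group, factor):
    """Replicate samples from one underrepresented group.

    Boosting an underrepresented group's presence in the training data gives
    the model more chances to learn that group's patterns.
    """
    boosted = list(samples)
    minority = [s for s in samples if group_of(s) == target_group]
    boosted.extend(minority * (factor - 1))
    random.shuffle(boosted)
    return boosted

# Toy clinical records: (patient_group, label); group "B" is underrepresented.
data = [("A", 0)] * 8 + [("B", 1)] * 2
balanced = prioritized_oversample(data, lambda s: s[0], "B", factor=4)
print(sum(1 for s in balanced if s[0] == "B"), "of", len(balanced))  # 8 of 16
```

The design question the full method addresses, which this sketch does not, is choosing which group to prioritize and by how much without degrading the majority groups.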
|
229 |
Autonomic Failure Identification and Diagnosis for Building Dependable Cloud Computing Systems. Guan, Qiang, 05 1900 (has links)
The increasingly popular cloud-computing paradigm provides on-demand access to computing and storage with the appearance of unlimited resources. Users are given access to a variety of data and software utilities to manage their work. Users rent virtual resources and pay for only what they use. In spite of the many benefits that cloud computing promises, the lack of dependability in shared virtualized infrastructures is a major obstacle to its wider adoption, especially for mission-critical applications. Virtualization and multi-tenancy increase system complexity and dynamicity, and they introduce new sources of failure that degrade the dependability of cloud computing systems. To assure cloud dependability, in my dissertation research, I develop autonomic failure identification and diagnosis techniques that are crucial for understanding emergent, cloud-wide phenomena and self-managing resource burdens for cloud availability and productivity enhancement. We study runtime cloud performance data collected from a cloud test-bed and traces from production cloud systems. We define cloud signatures comprising the metrics that are most relevant to failure instances. We exploit profiled cloud performance data in both the time and frequency domains to identify anomalous cloud behaviors, and we leverage cloud metric subspace analysis to automate the diagnosis of observed failures. We implement a prototype of the anomaly identification system and conduct experiments on an on-campus cloud computing test-bed and with the Google datacenter traces. Our experimental results show that the proposed anomaly detection mechanism achieves 93% detection sensitivity while keeping the false positive rate as low as 6.1%, outperforming the other tested anomaly detection schemes. In addition, the anomaly detector adapts itself by recursively learning from newly verified detection results to refine future detection.
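As a rough illustration of frequency-domain profiling of a performance metric, the numpy sketch below flags a change in a metric's dominant frequency. The signals and sampling settings are made up for illustration; the dissertation's actual signatures and subspace analysis are more involved.

```python
import numpy as np

def dominant_frequency(signal, sample_rate=1.0):
    """Return the strongest frequency (cycles/sample) in a metric trace."""
    spectrum = np.abs(np.fft.rfft(signal - np.mean(signal)))  # drop the DC offset
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

t = np.arange(256)
healthy = np.sin(2 * np.pi * 0.05 * t)  # e.g. a regular 20-step duty cycle
faulty = np.sin(2 * np.pi * 0.20 * t)   # the same metric thrashing 4x faster
print(dominant_frequency(healthy))  # close to 0.05
print(dominant_frequency(faulty))   # close to 0.20
```

A shift like this in the spectrum of a CPU or I/O metric, invisible in simple time-domain thresholds, is the kind of behavioral change frequency-domain profiling is meant to surface.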
|
230 |
A data analytics approach to gas turbine prognostics and health management. Diallo, Ousmane Nasr, 19 November 2010
As a consequence of the recent deregulation of the electrical power production industry, there has been a shift in the traditional ownership of power plants and the way they are operated. To hedge their business risks, many new private entrepreneurs enter into long-term service agreements (LTSA) with third parties for their operation and maintenance activities. As the major LTSA providers, original equipment manufacturers have invested huge amounts of money to develop preventive maintenance strategies that minimize the occurrence of costly unplanned outages resulting from failures of the equipment covered under LTSA contracts. As a matter of fact, a recent study by the Electric Power Research Institute estimates the cost benefit of preventing a failure of a General Electric 7FA or 9FA technology compressor at $10 to $20 million.
Therefore, in this dissertation, a two-phase data analytics approach is proposed to use the existing gas path and vibration monitoring sensor data, first to develop a proactive strategy that systematically detects and validates catastrophic failure precursors so as to avoid the failure, and second to estimate the residual time to failure of the unhealthy items. For the first part of this work, the time-frequency technique of the wavelet packet transform is used to de-noise the noisy sensor data. Next, the time-series signal of each sensor is decomposed in a multi-resolution analysis to extract its features. After that, probabilistic principal component analysis is applied as a data fusion technique to reduce the potentially correlated multi-sensor measurements to a few uncorrelated principal components. The last step of the failure precursor detection methodology, the anomaly detection decision, is itself a multi-stage process. The principal components obtained from the data fusion step are first combined into a one-dimensional reconstructed signal representing the overall health of the monitored system. Then, two damage indicators of the reconstructed signal are defined and monitored for defects using a statistical process control approach. Finally, the Bayesian evaluation method for hypothesis testing is applied against a computed threshold to test for deviations from the healthy band.
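The statistical process control step can be illustrated with Shewhart-style mean ± 3σ control limits on a health signal. The baseline signal, the threshold multiplier, and the monitored values below are assumptions for illustration, not the dissertation's actual damage indicators.

```python
import numpy as np

def control_limits(healthy_signal, k=3.0):
    """Mean +/- k*sigma limits estimated from a healthy baseline."""
    mu, sigma = np.mean(healthy_signal), np.std(healthy_signal)
    return mu - k * sigma, mu + k * sigma

def out_of_control(signal, limits):
    """Indices of samples that escape the healthy band."""
    lo, hi = limits
    return [i for i, v in enumerate(signal) if v < lo or v > hi]

# Deterministic stand-in for a healthy reconstructed health signal.
baseline = np.sin(np.linspace(0, 20, 200))
limits = control_limits(baseline)
print(out_of_control([0.5, -0.3, 5.0, 0.1], limits))  # only the excursion is flagged
```

In the dissertation's pipeline, an excursion like this would not immediately raise an alarm; it feeds the Bayesian hypothesis test, which decides whether the deviation from the healthy band is a genuine precursor.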
To model the residual time to failure, the anomaly severity index and the anomaly duration index are defined as defect characteristics. Two modeling techniques are investigated for prognostication of the survival time after an anomaly is detected: a deterministic regression approach, and parametric approximation of the non-parametric Kaplan-Meier estimator. It is established that deterministic regression provides poor prediction estimates. The non-parametric survival analysis technique of the Kaplan-Meier estimator provides the empirical survivor function of a data set comprising both non-censored and right-censored data. Though powerful because no lifetime distribution is assumed a priori, the Kaplan-Meier result lacks the flexibility to be transplanted to other units of a given fleet. The parametric analysis of the survival data is performed with two popular failure analysis distributions: the exponential distribution and the Weibull distribution. The conclusion from the parametric analysis of the Kaplan-Meier plot is that the larger the data set, the more accurate the prognostication of the residual time to failure.
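The Kaplan-Meier estimator itself fits in a few lines of plain Python. The toy failure times and censoring flags below are invented for illustration, not data from the dissertation.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survivor function from (time, event) pairs.

    events[i] is 1 if a failure was observed at times[i], 0 if the unit was
    right-censored (still running) at that time. Returns [(t, S(t))] at each
    observed failure time, using S(t) = product of (1 - d_i / n_i).
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    n_at_risk = len(times)
    survival, curve = 1.0, []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = at_t = 0
        while i < len(order) and times[order[i]] == t:
            at_t += 1
            deaths += events[order[i]]
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= at_t
    return curve

# Toy fleet: hours to failure after an anomaly; 0 marks censored (still running).
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
print(kaplan_meier(times, events))
```

The censored units illustrate the estimator's key property noted above: they contribute to the at-risk counts without being treated as failures, so no lifetime distribution has to be assumed; the parametric step then fits exponential or Weibull curves to a staircase like this one.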
|