
Anomaly-based Self-Healing Framework in Distributed Systems

Reliability and robustness to hardware and software failures are important design criteria for distributed systems and their applications. The growing complexity, interconnectedness, and interdependence of their components, together with the asynchronous interactions among them, make fault detection and fault tolerance a challenging research problem; these components include hardware resources (computers, servers, network devices) and software (application services, middleware, web services, etc.). In this dissertation, we present a self-healing methodology based on the principles of autonomic computing and on statistical and data mining techniques to detect faults (hardware or software) and to identify their source. In our approach, we monitor and analyze in real time all the interactions among the components of a distributed system using two software modules: the Component Fault Manager (CFM), which monitors the full set of measurement attributes for applications and nodes, and the Application Fault Manager (AFM), which is responsible for several activities such as monitoring, anomaly analysis, root cause analysis, and recovery. We use a three-dimensional array of features to capture spatial and temporal behavior; an anomaly analysis engine uses these features to generate an alert immediately when an abnormal behavior pattern is detected due to a software or hardware failure. We use several fault tolerance metrics (false positive, false negative, precision, recall, missed alarm rate, detection accuracy, latency, and overhead) to evaluate the effectiveness of our self-healing approach compared with other techniques. We applied our approach to an industry-standard web e-commerce application to emulate a complex e-commerce environment. We evaluate the effectiveness and performance of our approach in detecting software faults that we inject asynchronously, and compare the results for different noise levels.
Our experimental results show that our anomaly-based approach significantly improves the false positive rate, false negative rate, missed alarm rate, and detection accuracy. For example, when detecting asynchronously injected faults, the approach achieves a detection rate above 99.9% with no false alarms across a wide range of faulty and normal operational scenarios.
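The detection metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard confusion-matrix definitions, which may differ from the definitions used in the dissertation itself. The class `DetectionCounts`, the function `detection_metrics`, and the counts in the usage lines are hypothetical and are not taken from the dissertation's experiments.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Confusion-matrix counts for a fault detector (hypothetical container)."""
    true_positives: int   # faulty observations that raised an alert
    false_positives: int  # normal observations that raised an alert
    true_negatives: int   # normal observations with no alert
    false_negatives: int  # faulty observations with no alert (missed)


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(c: DetectionCounts) -> dict:
    """Compute textbook detection metrics from confusion-matrix counts."""
    faulty = c.true_positives + c.false_negatives    # all truly faulty cases
    normal = c.true_negatives + c.false_positives    # all truly normal cases
    flagged = c.true_positives + c.false_positives   # all alerts raised
    total = faulty + normal
    return {
        "false_positive_rate": _ratio(c.false_positives, normal),
        "false_negative_rate": _ratio(c.false_negatives, faulty),
        "precision": _ratio(c.true_positives, flagged),
        "recall": _ratio(c.true_positives, faulty),
        "accuracy": _ratio(c.true_positives + c.true_negatives, total),
    }


# Illustrative counts only, chosen to show the arithmetic.
example = detection_metrics(DetectionCounts(999, 0, 1000, 1))
```

With these illustrative counts, recall is 999 / 1000 and the false positive rate is 0 / 1000. Under these definitions, a high detection rate with no false alarms corresponds to a recall close to 1 together with a false positive rate of 0. Latency and overhead are not covered here because they require timing and resource measurements rather than counts.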

Identifier: oai:union.ndltd.org:arizona.edu/oai:arizona.openrepository.com:10150/193660
Date: January 2008
Creators: Kim, Byoung Uk
Contributors: Hariri, Salim; Rozenblit, Jerzy W.; Akoglu, Ali
Publisher: The University of Arizona.
Source Sets: University of Arizona
Language: English
Detected Language: English
Type: text, Electronic Dissertation
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
