Success for many businesses depends on their information software systems.
Keeping these systems operational is critical, as failure in these systems is
costly. Such systems are in many cases sophisticated, distributed and
dynamically composed.
To ensure high availability and correct operation, it is essential that
failures be detected promptly, their causes diagnosed and remedial actions
taken. Although automated recovery approaches exist for specific problem
domains, the problem-resolution process is in many cases manual and painstaking.
Computer support personnel put a great deal of effort into resolving the reported
failures. The growing size and complexity of these systems create the need to
automate this process.
The primary focus of our research is on automated fault diagnosis and recovery
using discrete monitoring data such as log files and notifications. Our goal is
to quickly pinpoint the root cause of a failure. Our contributions are:
modelling discrete monitoring data for automated analysis; using such models to
automatically leverage common symptoms of failures from historic monitoring
data to pinpoint faults; and providing a model for decision-making under
uncertainty so that appropriate recovery actions are chosen.
Failures in such systems are caused by software defects, human error, hardware
failures, environmental conditions and malicious behaviour. Our primary focus
in this thesis is on software defects and misconfiguration.
Identifier | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OWTU.10012/6757 |
Date | 18 May 2012 |
Creators | Reidemeister, Thomas |
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada |
Language | English |
Type | Thesis or Dissertation |