Software systems supporting networked, transaction-oriented services are large and complex;
they comprise a multitude of inter-dependent layers and components,
and they implement many dynamic optimization mechanisms.
In addition, these systems are subject to workloads that are hard to predict.
These factors make monitoring and problem determination
challenging and costly.
In this thesis we tackle these challenges with the goal of lowering the cost and
improving the effectiveness of monitoring and problem determination
by reducing the dependence on human operators.
Specifically, this thesis presents and demonstrates the effectiveness of an efficient,
automated monitoring approach which enables detection of errors and failures,
and which assists in localizing faults.
Software systems expose various types of monitoring data;
this thesis focuses on the use of management metrics to monitor a system's health.
We devise a system modeling approach which entails modeling stable,
statistical correlations among management metrics; these correlations
characterize a system's normal behaviour.
This approach allows a system model to be built automatically and efficiently
using the monitoring data alone.
In order to control the monitoring overhead, and yet allow a system's health
to be assessed reliably, we design an adaptive monitoring approach.
This adaptive capability builds on the flexible nature of our system modeling approach,
which allows the set of monitored metrics to be altered at runtime.
We develop methods to automatically select management metrics to collect
at the minimal monitoring level, without any domain knowledge.
In addition, we devise an automated fault localization approach,
which leverages the ability of the monitoring system to analyze individual metrics.
Using a realistic, multi-tier software system, including different applications based on
Java Enterprise Edition and industrial-strength products, we evaluate our system modeling approach.
We show that stable metric correlations exist in complex software systems and
that many of these correlations can be modeled using simple, efficient
techniques.
We investigate the effect of the collection of management metrics on system performance.
We show that the monitoring overhead can be high and thus needs to be controlled.
We employ fault injection experiments to evaluate the effectiveness of our
adaptive monitoring and fault localization approaches.
We demonstrate that our approach is cost-effective,
has high fault coverage and, in the majority of the cases studied,
provides pertinent diagnosis information.
The main contribution of this work is to show how to monitor complex software systems
and determine problems in them automatically and efficiently.
Our solution approach has wide applicability and the techniques we use are simple
and yet effective.
Our work suggests that the cost of monitoring software systems is not necessarily
a function of their complexity, providing hope that the health of increasingly large and
complex systems can be tracked with a limited amount of human resources and without
sacrificing much system performance.
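
To make the correlation-based modeling idea concrete, the following is a minimal sketch and not the implementation evaluated in the thesis. It assumes that pairwise least-squares lines stand in for the "simple, efficient techniques" mentioned above, that a pair is treated as stable when its R² exceeds a hypothetical threshold `min_r2`, and that fault localization amounts to ranking metrics by how many of their learned correlations are violated. All function names, metric names, and thresholds are illustrative assumptions.

```python
from itertools import combinations
from statistics import mean


def fit_line(x, y):
    """Least-squares fit y ~ a*x + b; returns (a, b, r_squared) or None."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return None  # a constant metric carries no usable correlation
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = my - a * mx
    return a, b, (sxy * sxy) / (sxx * syy)


def learn_models(history, min_r2=0.9):
    """Build the system model from fault-free monitoring data alone.

    history maps a metric name to its list of samples; every list is
    assumed to have the same length and sampling times.
    """
    models = {}
    for m1, m2 in combinations(sorted(history), 2):
        fit = fit_line(history[m1], history[m2])
        if fit is None or fit[2] < min_r2:
            continue  # keep only strongly correlated pairs
        a, b, _ = fit
        worst = max(abs(y - (a * x + b))
                    for x, y in zip(history[m1], history[m2]))
        # Assumed tolerance: 1.5x the largest residual seen in training.
        models[(m1, m2)] = (a, b, 1.5 * worst + 1e-9)
    return models


def violated(models, sample):
    """Return the metric pairs whose learned relation fails for one sample."""
    return [pair for pair, (a, b, tol) in models.items()
            if abs(sample[pair[1]] - (a * sample[pair[0]] + b)) > tol]


def rank_suspects(violations):
    """Rank metrics by the number of violated correlations they appear in."""
    counts = {}
    for m1, m2 in violations:
        counts[m1] = counts.get(m1, 0) + 1
        counts[m2] = counts.get(m2, 0) + 1
    return sorted(counts, key=lambda m: (-counts[m], m))
```

In this sketch, an adaptive monitor would collect only a small subset of metrics while `violated` returns an empty list, and would widen collection to the remaining metrics once a violation appears, so that `rank_suspects` has more correlations to work with.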
Identifier | oai:union.ndltd.org:LACETR/oai:collectionscanada.gc.ca:OWTU.10012/4797 |
Date | 30 September 2009 |
Creators | Munawar, Mohammad Ahmad |
Source Sets | Library and Archives Canada ETDs Repository / Centre d'archives des thèses électroniques de Bibliothèque et Archives Canada |
Language | English |
Detected Language | English |
Type | Thesis or Dissertation |