Large-scale distributed systems---such as supercomputers, cloud computing platforms,
and distributed applications---routinely suffer from slowdowns and crashes due to
software and hardware problems, resulting in reduced efficiency and wasted
resources. These large-scale systems typically deploy monitoring or tracing
systems that gather a variety of statistics about the state of the hardware
and the software. State-of-the-art approaches either analyze this data manually
or design a separate automated method for each specific problem. This thesis
builds on the vision that generalized automated analytics methods applied to the data
sets collected from these complex computing systems provide critical
information about the causes of the problems, and this analysis can then enable
proactive management to improve performance, resilience, efficiency, or security
significantly beyond current limits.
This thesis seeks to design scalable, automated analytics methods and frameworks
for large-scale distributed systems that minimize dependency on expert
knowledge, automate parts of the solution process, and help make systems more
resilient. In addition to analyzing data that is already collected from systems,
our frameworks also identify what to collect from where in the system, such that
the collected data would be concise and useful for manual analytics. We focus on
two data sources for conducting analytics: numeric telemetry data, which is
typically collected from operating system or hardware counters, and end-to-end
traces collected from distributed applications.
This thesis makes the following contributions to large-scale distributed
systems: (1) designing a framework for accurately diagnosing previously
encountered performance variations, (2) designing a technique for detecting
(unwanted) applications running on the systems, (3) developing a suite for
reproducing performance variations that can be used to systematically develop
analytics methods, (4) designing a method to explain predictions of black-box
machine learning frameworks, and (5) constructing an end-to-end tracing
framework that can dynamically adjust instrumentation for effective diagnosis of
performance problems.
Identifier | oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/41472 |
Date | 28 September 2020 |
Creators | Ateş, Emre |
Contributors | Coskun, Ayse K. |
Source Sets | Boston University |
Language | en_US |
Detected Language | English |
Type | Thesis/Dissertation |
Rights | Attribution 4.0 International, http://creativecommons.org/licenses/by/4.0/ |