Performance unpredictability of the cloud hinders widespread adoption of cloud systems and adversely impacts costs and revenue. To mitigate this challenge, cloud systems typically incorporate monitoring and tracing mechanisms to collect a diverse set of metrics on applications' state to facilitate the analysis of performance fluctuations. Drawing on this collected data, engineers devote considerable effort to diagnosing performance issues and expediting the delivery of superior-quality software to enhance performance, aligning with changing demands.
To capture unanticipated performance problems, engineers utilize state-of-the-art diagnostic systems to meticulously trace and record the behavior of distributed applications running on the cloud. However, this level of detailed tracing incurs considerable costs in terms of storage, computation, and network overheads. Even after engineers have resolved these performance problems, they may face challenges in deploying new code to the cloud. Gradual deployment approaches are available to mitigate risk by enabling faulty versions to be rolled back, but these systems lack the necessary statistical sophistication to accurately assess and compare application versions, potentially leading to further performance issues.
This thesis argues that integrating automated, statistically-driven methods is imperative to achieve substantial improvements in efficiency when diagnosing application performance and delivering new code in the cloud. This vision has the potential to enable efficient and proactive performance management beyond the state-of-the-art by reducing time, effort, and cost spent on diagnosis and code delivery. To support this vision, the thesis makes two specific contributions. First, we demonstrate that dynamically adjusting instrumentation using statistically-driven techniques significantly enhances diagnosis efficiency. Our distributed tracing approach enables accurate tracing of sources of performance issues using only a small fraction of the available tracing instrumentation. Second, we demonstrate that an online learning-based approach, which intelligently adjusts the user traffic split among competing deployments, substantially improves code delivery efficiency. Our online experimentation approach reduces performance variations by directing user traffic to the optimal deployment during code delivery. / 2025-05-24T00:00:00Z
Identifier | oai:union.ndltd.org:bu.edu/oai:open.bu.edu:2144/46255 |
Date | 24 May 2023 |
Creators | Toslali, Mert |
Contributors | Coskun, Ayse K. |
Source Sets | Boston University |
Language | en_US |
Detected Language | English |
Type | Thesis/Dissertation |