1.
Fehleranalyse in Microservices mithilfe von verteiltem Tracing (Error Analysis in Microservices Using Distributed Tracing)
Sinner, Robin Andreas, 26 April 2022
With the microservice architectural style and the growing number of heterogeneous services, new challenges arise for debugging, monitoring, and testing such applications. Distributed tracing offers one approach to addressing these challenges. The goal of this thesis is to investigate how distributed tracing can be used for automated error analysis of microservices. To that end, the following research question is posed: How can traces be evaluated to identify the causes of errors when testing microservices? To answer this question, a data format for the automated evaluation of tracing data was defined. Algorithms were then designed that resolve error propagation between services based on causal relationships. This approach was implemented as a prototype in Python and its functionality was evaluated. The results show that in roughly 77 % of the test scenarios, the root cause could be correctly derived from the tracing data using the prototype. Without the prototype and without further debugging, the root cause could be identified from the application's own error output in only about 5 % of the test scenarios. The concept and the prototype thus make debugging Python-based microservice applications easier.
Contents:
1. Introduction
1.1. Motivation
1.2. Scope
1.3. Methodology
2. Fundamentals
2.1. Related Work
2.1.1. Automated Analysis of Tracing Information
2.1.2. Automated Root Cause Analysis
2.1.3. Root Cause Analysis in Microservices
2.1.4. Root Cause Analysis of Runtime Errors in Distributed Systems
2.1.5. Tracing Tool for Error Detection
2.2. Theoretical Background
2.2.1. Microservices
2.2.2. Distributed Tracing
2.2.3. OpenTracing
2.2.4. Jaeger
2.2.5. Example Application for the Investigations
2.2.6. Continuous Integration / Continuous Delivery / Continuous Deployment
3. Concept
3.1. Definition of the Data Format
3.1.1. Analysis of the Data Format of the OpenTracing Specification
3.1.2. Extensions
3.1.3. Resulting Data Format for Automated Evaluation
3.1.4. Clock Skew in Distributed Systems
3.2. Algorithms for Root Cause Analysis
3.2.1. Construction of a Dependency Graph
3.2.2. Path-Based Investigation of Error Causes
3.2.3. Evaluation by Temporal Order and Causal Relationship
3.2.4. Assessment of Potential Root Causes
3.3. Design of the Prototype
3.3.1. Integration into the Development Cycle
3.3.2. Functional Requirements
3.3.3. Architecture of the Prototype
4. Implementation
4.1. Implementation of the Root Cause Analysis Prototype
4.2. Integration of the Prototype into Test Scenarios and Continuous Integration
4.3. Tests for Evaluating the Prototype
5. Results
5.1. Evaluation of the Concept and Prototype
5.2. Evaluation of the Root Cause Assessment Methods
5.3. Economic Considerations
6. Conclusion and Outlook
6.1. Conclusion
6.2. Outlook
References
Declaration of Authorship
A. Figures
A.1. Mockups of the Prototype's Evaluation Reports
B. Tables
B.1. Tag Fields According to OpenTracing
B.2. Log Fields According to OpenTracing
B.3. Evaluation of the Test Results
C. Listings
C.1. Data Format for Automated Evaluation
C.2. Definition of Evaluation Rules
D. Guide for the Prototype
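The abstract above describes algorithms that resolve error propagation between services via causal relationships in the trace. As a rough, hypothetical illustration of that kind of analysis (not the thesis's actual algorithm; the span fields and the ranking rule are assumed for this sketch), a trace can be reduced to a parent/child graph and the deepest failing spans treated as root-cause candidates:

```python
# Hypothetical sketch of the kind of causal analysis described in the abstract:
# build a parent/child graph from spans and treat failing spans whose callees
# all succeeded as root-cause candidates. Span fields are assumptions.
from collections import defaultdict

def root_cause_candidates(spans):
    """spans: list of dicts with 'span_id', 'parent_id', 'service', 'error', 'start'."""
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    for s in spans:
        if s["parent_id"] is not None:
            children[s["parent_id"]].append(s["span_id"])

    def has_failing_callee(span_id):
        return any(by_id[child]["error"] for child in children[span_id])

    # An error that originated in a span propagates upwards along the call chain,
    # so the deepest failing spans are the most plausible origins.
    candidates = [s for s in spans if s["error"] and not has_failing_callee(s["span_id"])]
    return sorted(candidates, key=lambda s: s["start"])  # earliest failure first

spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway",  "error": True,  "start": 0},
    {"span_id": "b", "parent_id": "a",  "service": "orders",   "error": True,  "start": 5},
    {"span_id": "c", "parent_id": "b",  "service": "payments", "error": True,  "start": 9},
    {"span_id": "d", "parent_id": "a",  "service": "catalog",  "error": False, "start": 3},
]
print([s["service"] for s in root_cause_candidates(spans)])  # ['payments']
```

In the example, the error raised in the payments service propagates up through orders and the gateway, and the sketch reports payments as the candidate origin.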
2.
Performance modelling of reactive web applications using trace data from automated testing
Anderson, Michael, 29 April 2019
This thesis evaluates a method for extracting architectural dependencies and performance measures from an evolving distributed software system. The research goal was to establish methods for determining potential scalability issues in a distributed software system as it is being iteratively developed. The research evaluated the use of industry-available distributed tracing methods to extract performance measures and queuing network model parameters for common user activities. Additionally, a method was developed to trace and collect the system operations that correspond to these user activities using automated acceptance testing. Performance measure extraction was tested with this method across several historical releases of a real-world distributed software system. The trends in performance measures across releases correspond to several scalability issues identified in the production software system.
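As a hedged sketch of what extracting queuing-model parameters from trace data might look like (this is not the thesis's tooling; the span fields, the test duration, and the M/M/1-style utilisation estimate are assumptions), per-service arrival rates and mean service times can be derived directly from spans gathered during an automated test run:

```python
# A minimal sketch, not the thesis's tooling: estimating simple queuing-model
# parameters per service from spans collected during one automated test run.
# Field names, the test duration, and the utilisation estimate are assumptions.
from collections import defaultdict

def queuing_parameters(spans, test_duration_s):
    durations = defaultdict(list)
    for span in spans:
        durations[span["service"]].append(span["duration_s"])
    params = {}
    for service, ds in durations.items():
        arrival_rate = len(ds) / test_duration_s   # lambda: requests per second
        service_time = sum(ds) / len(ds)           # S: mean seconds per request
        params[service] = {
            "lambda": arrival_rate,
            "service_time": service_time,
            # Utilisation rho = lambda * S; values approaching 1 flag a scalability risk.
            "utilisation": arrival_rate * service_time,
        }
    return params

spans = [{"service": "search", "duration_s": 0.12} for _ in range(300)]
print(queuing_parameters(spans, test_duration_s=60.0))
# roughly: {'search': {'lambda': 5.0, 'service_time': 0.12, 'utilisation': 0.6}}
```

Tracking such estimates across releases, as the thesis does with real measurements, is what makes creeping saturation visible before it becomes a production incident.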
3.
Tail Based Sampling Framework for Distributed Tracing Using Stream Processing
Shuvo, G Kibria, January 2021
In recent years, microservice architecture has surpassed monolithic architecture in popularity among developers by providing a flexible way of developing complex distributed applications. Whereas a monolithic application functions as a single indivisible unit, a microservices-based application comprises a collection of loosely coupled services that communicate with each other to fulfill the requirements of the application. Consequently, different services in a microservices-based application can be developed and deployed independently. However, this flexibility comes at the expense of reduced observability, which complicates the debugging of such applications. The reduced observability can be compensated for by performing distributed tracing in microservices-based applications. Distributed tracing refers to observing requests as they propagate through a distributed system in order to collect observability data that can aid in understanding the interactions among the services and pinpoint failures and performance issues in the system. OpenTelemetry, an open-source observability framework supported by the Cloud Native Computing Foundation (CNCF), defines a standardized specification for generating observability data. Nevertheless, instrumenting an application with an observability framework incurs performance overhead. To tackle this deterioration of performance and to reduce the cost of persisting observability data, typically only a subset of the requests are traced, using head-based or tail-based sampling. In this work, we present a tail-based sampling framework using stream processing techniques. The developed framework demonstrated promising performance in our experiments, saving approximately a third of memory-based storage compared to an OpenTelemetry tail-based sampling module. Moreover, being compliant with the OpenTelemetry specifications, our framework aligns well with the OpenTelemetry ecosystem.
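For illustration only, the core of a tail-based sampling decision can be sketched as follows. This is not the framework developed in the thesis; the buffering model, thresholds, and span fields are assumptions. The point is that the keep/drop decision is made only after all spans of a trace have been seen, which is what makes error-aware sampling possible at the cost of buffering state in the stream processor:

```python
# Illustrative sketch only, not the framework built in the thesis: a tail-based
# keep/drop decision made once all spans of a trace have been seen. The buffering
# model, thresholds, and span fields are assumptions.
import random
from collections import defaultdict

KEEP_PROBABILITY = 0.05        # baseline rate for unremarkable traces
LATENCY_THRESHOLD_MS = 500     # always keep unusually slow traces

buffers = defaultdict(list)    # trace_id -> spans buffered by the stream processor

def on_span(span):
    """Called for every span arriving on the stream."""
    buffers[span["trace_id"]].append(span)

def on_trace_complete(trace_id):
    """Called when the stream processor considers the trace complete (e.g. after a timeout)."""
    spans = buffers.pop(trace_id)
    has_error = any(s.get("error") for s in spans)
    too_slow = max(s["duration_ms"] for s in spans) > LATENCY_THRESHOLD_MS
    if has_error or too_slow or random.random() < KEEP_PROBABILITY:
        export(spans)          # forward the whole trace to the tracing backend

def export(spans):
    print(f"keeping trace {spans[0]['trace_id']} with {len(spans)} spans")
```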
4.
Telemetry for Debugging Software Issues in Privacy-sensitive Systems
Landgren, Kasper; Tavakoli, Payam, January 2023
Traditionally, when debugging a software system, developers rely on having access to the input data that caused the error. However, with data privacy concerns on the rise, it is becoming increasingly challenging to rely on input data, as it might be sensitive or even classified. Telemetry, a method of collecting information about a running system, can offer assistance here. This thesis proposes a telemetry solution for debugging systems where the input data is sensitive. The telemetry solution was implemented in a geographical 3D visualization system and evaluated on two aspects: its effectiveness in helping the developers working on the system locate if and where a problem has occurred, and whether the input data that generated the output can be deduced from the collected telemetry. The results indicate that the telemetry solution succeeds in helping developers identify system errors. Additionally, the system's privacy is maintained, as it is not feasible to directly ascertain the responsible input data from the output. However, no formal proof is presented, and as such, no guarantee can be made. We conclude that telemetry can be a useful tool for developers, making the debugging process more effective while protecting sensitive input data, although this might not hold for different customers or systems. The thesis demonstrates the potential of using telemetry for debugging privacy-sensitive systems, with a proof-of-concept solution that can be improved upon in the future.
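As a loose sketch of the general idea of privacy-preserving telemetry (not the solution built in the thesis; the event fields and salting scheme are assumptions), a system can report what happened and correlate events about the same input via a salted hash, without ever emitting the sensitive input itself:

```python
# A loose sketch of the general idea, not the thesis's solution: report what the
# system did and correlate events about the same input via a salted hash, without
# emitting the sensitive input. Event fields and the salting scheme are assumptions.
import hashlib
import json
import os
import time

SALT = os.urandom(16)  # per-deployment salt so digests cannot be looked up offline

def telemetry_event(stage, input_payload, status):
    return {
        "timestamp": time.time(),
        "stage": stage,                  # where in the pipeline the event was produced
        "status": status,                # e.g. "ok" or "error"
        # Lets events about the same input be correlated without revealing it:
        "input_digest": hashlib.sha256(SALT + input_payload).hexdigest(),
        # Only coarse, non-identifying properties of the input are recorded:
        "input_size_bytes": len(input_payload),
    }

print(json.dumps(telemetry_event("terrain_loader", b"<sensitive sensor data>", "error"), indent=2))
```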
5.
Protractor: Leveraging distributed tracing in service meshes for application profiling at scale
Carosi, Robert, January 2018
Large-scale Internet services are increasingly implemented as distributed systems in order to achieve fault tolerance, availability, and scalability. When requests traverse multiple services, end-to-end metrics no longer tell a clear picture. Distributed tracing emerged to break down end-to-end latency on a per-service basis, but it only answers where a problem occurs, not why. From user research we found that root-cause analysis of performance problems is often still done by manually correlating information from logs, stack traces, and monitoring tools. Profilers provide fine-grained information, but we found they are rarely used in production systems because of the required changes to existing applications, the substantial storage requirements they introduce, and the difficulty of correlating profiling data with information from other sources. The proliferation of modern low-overhead profilers opens up possibilities to do online, always-on profiling in production environments. We propose Protractor as the missing link that exploits these possibilities to provide distributed profiling. It features a novel approach that leverages service meshes for application-level transparency and uses anomaly detection to selectively store relevant profiling information. Profiling information is correlated with distributed traces to provide contextual information for root-cause analysis. Protractor has support for different profilers, and experimental work shows that the impact on end-to-end request latency is less than 3%. The utility of Protractor is further substantiated by a survey showing that the majority of the participants would use it frequently.
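A much-simplified sketch of the selective-retention idea described above (not the Protractor implementation; the anomaly detector, window size, and threshold are assumptions): keep a profile only when the request it belongs to looks anomalous, and key it by trace id so it can later be joined with the distributed trace for root-cause analysis.

```python
# A much-simplified sketch, not the Protractor implementation: a z-score style
# anomaly check that decides whether a request's profile is worth storing, keyed
# by trace id so it can later be joined with the distributed trace.
import statistics
from collections import deque

recent_latencies = deque(maxlen=1000)   # sliding window (one endpoint, for brevity)

def is_anomalous(latency_ms):
    if len(recent_latencies) < 30:           # warm-up: not enough history yet
        recent_latencies.append(latency_ms)
        return False
    mean = statistics.fmean(recent_latencies)
    stdev = statistics.pstdev(recent_latencies) or 1e-9
    recent_latencies.append(latency_ms)
    return (latency_ms - mean) / stdev > 3.0  # a production detector would be sturdier

def on_request_finished(trace_id, latency_ms, profile_blob, storage):
    if is_anomalous(latency_ms):
        storage[trace_id] = profile_blob      # keep the profile, keyed by trace id
```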
6.
Optimizing Distributed Tracing Overhead in a Cloud Environment with OpenTelemetry
Elias, Norgren, January 2024
To gain observability in distributed systems, some form of telemetry generation and gathering must be implemented. This is especially important when systems have layers of dependencies on other microservices. One method for observability is distributed tracing: building causal event chains between microservices, called traces. With these traces it is possible to find bottlenecks and dependencies within each call chain. One framework for implementing distributed tracing is OpenTelemetry. The developer must make design choices when deploying OpenTelemetry in a Kubernetes cluster. For example, OpenTelemetry provides a collector that receives spans, the parts of a trace emitted by microservices. These collectors can be deployed one per node, as a daemonset, or one per service, as sidecars. This study compared the performance impact of the sidecar and daemonset setups to that of having no OpenTelemetry implemented. The resources analyzed were CPU usage, network usage, and RAM usage. Tests were run as permutations of four scenarios: on two and four nodes, each with a balanced and an unbalanced service placement. The experiments were run in a cloud environment using Kubernetes. The tested system was an emulation of one of Nasdaq's systems, based on real data from the company. The study concluded that adding OpenTelemetry introduced overhead and increased resource usage in all cases. Compared to no OpenTelemetry, the daemonset setup increased CPU usage by 46.5 %, network usage by 18.25 %, and memory usage by 47.5 % on average. The sidecar setup performed worse than the daemonset setup in most cases and for most resources, especially RAM and CPU usage.
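From the application's point of view, the daemonset and sidecar topologies compared above mostly differ in where the collector endpoint lives. The sketch below uses the OpenTelemetry Python SDK; the NODE_IP environment variable (assumed to be injected via the Kubernetes downward API), the service name, and the port are illustrative assumptions rather than the study's actual configuration:

```python
# Sketch only: how the application's exporter configuration differs between the
# sidecar and daemonset collector topologies, using the OpenTelemetry Python SDK.
# NODE_IP (assumed to come from the Kubernetes downward API), the service name,
# and the port are illustrative assumptions.
import os
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Sidecar: a collector container in the same pod, reachable on localhost.
# Daemonset: one collector per node, reachable via the node's IP.
collector_host = os.environ.get("NODE_IP", "localhost")

provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=f"http://{collector_host}:4317", insecure=True))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("handle-order"):
    pass  # application work; the span is batched and exported to the chosen collector
```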
7.
Performance Overhead Of OpenTelemetry Sampling Methods In A Cloud Infrastructure
Karkan, Tahir Mert, January 2024
This thesis explores the overhead of distributed tracing with OpenTelemetry in a cloud environment, using different sampling strategies. Distributed tracing produces telemetry data that allows developers to analyse causal events in a system with temporal information. This comes at the cost of overhead, in terms of CPU, memory, and network usage, as the telemetry data has to be generated and sent through collectors that handle traces and finally forward them to a backend. By sampling with three different strategies, head-based and tail-based sampling and a mixture of the two, overhead can be reduced at the price of losing some information. To gain a measure of the impact of this information loss, synthetic error messages are introduced into traces and used to gauge how many traces with errors each sampling strategy can detect. All three sampling strategies were compared for services that sent more and less data between nodes in Kubernetes. The experiments were also run in two-node and four-node setups. This thesis was conducted with Nasdaq, which has an interest in high-performing monitoring tools, and their systems were analysed and emulated for relevance. The thesis concluded that tail-based sampling had the highest overhead (71.33 % CPU, 23.7 % memory, and 5.6 % network average overhead compared to head-based sampling) with the benefit of capturing all the errors. Head-based sampling had the least overhead, except on the node running Jaeger as the trace backend, where its higher total sampling rate added on average 12.75 % CPU overhead in the four-node setup compared to mixed sampling; mixed sampling, however, captured more errors. Measuring the overall time taken for the experiments, the highest impact was observed when more requests had to be sent between nodes.
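For reference, a head-based sampling setup in the OpenTelemetry Python SDK looks roughly like the sketch below; the 10 % ratio and span names are arbitrary examples, not the configuration used in the thesis. Because the decision is made when the root span starts, errors that occur later cannot influence it, which is why the tail-based and mixed strategies above capture more errors at higher overhead:

```python
# Illustrative sketch of head-based sampling with the OpenTelemetry Python SDK;
# the 10 % ratio and span names are arbitrary examples, not the thesis's setup.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample roughly 10 % of new traces; child spans inherit the parent's decision,
# so a trace is either recorded end to end or not at all.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("error", True)  # too late to influence the head-based decision
```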
8.
Distributed Trace Comparisons for Code Review: A System Design and Practical Evaluation
Rabo, Hannes, January 2020
Ensuring the health of a distributed system with frequent updates is complicated. Many tools exist to improve developers' comprehension and productivity in this task, but there is room for improvement. Based on previous research within request flow comparison, we propose a system design for using distributed tracing data in the process of reviewing code changes. The design is evaluated from the perspectives of system performance and developer productivity using a critical production system at a large software company. The results show that the design has minimal negative performance implications while providing a useful service to the developers. They also show a positive but statistically insignificant effect on productivity during the evaluation period. To a large extent, developers adopted the tool into their workflow to explore and improve their understanding of the system. This use case deviates from the design target of providing a method to compare changes between software versions. We conclude that the design is successful, but more optimization of functionality and a higher rate of adoption would likely improve the effects the tool could have.
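As a minimal, hypothetical sketch of the request-flow comparison idea (not the system evaluated in the thesis; the span fields are assumptions), each version's traces can be reduced to a set of caller-to-callee edges and the two sets diffed to surface calls that a code change added or removed:

```python
# Minimal sketch of request-flow comparison, not the evaluated system: reduce each
# version's traces to caller->callee edges and diff them. Span fields are assumed.
def call_edges(traces):
    edges = set()
    for spans in traces:                       # one trace = list of span dicts
        by_id = {s["span_id"]: s for s in spans}
        for s in spans:
            parent = by_id.get(s["parent_id"])
            if parent is not None:
                edges.add((parent["service"], s["service"], s["operation"]))
    return edges

def compare_versions(baseline_traces, candidate_traces):
    before, after = call_edges(baseline_traces), call_edges(candidate_traces)
    return {"added_calls": after - before, "removed_calls": before - after}

baseline = [[{"span_id": "1", "parent_id": None, "service": "api", "operation": "GET /cart"},
             {"span_id": "2", "parent_id": "1", "service": "cart", "operation": "load"}]]
candidate = [[{"span_id": "1", "parent_id": None, "service": "api", "operation": "GET /cart"},
              {"span_id": "2", "parent_id": "1", "service": "cart", "operation": "load"},
              {"span_id": "3", "parent_id": "1", "service": "pricing", "operation": "quote"}]]
print(compare_versions(baseline, candidate))
# {'added_calls': {('api', 'pricing', 'quote')}, 'removed_calls': set()}
```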