1

Event Mining for System and Service Management

Tang, Liang 18 April 2014 (has links)
Modern IT infrastructures are built from large-scale computing systems and administered by IT service providers. Manually maintaining such large computing systems is costly and inefficient. Service providers therefore seek automatic or semi-automatic methods for detecting and resolving system issues in order to improve their service quality and efficiency. This dissertation investigates several data-driven approaches for assisting service providers in achieving this goal. The problems studied by these approaches fall into three aspects of the service workflow: 1) preprocessing raw textual system logs into structured events; 2) refining monitoring configurations to eliminate false positives and false negatives; 3) improving the efficiency of system diagnosis on detected alerts. Solving these problems usually requires a large amount of domain knowledge about the particular computing systems. The approaches investigated in this dissertation are built on event mining algorithms, which can automatically derive part of that knowledge from historical system logs, events and tickets. In particular, two textual clustering algorithms are developed for converting raw textual logs into system events. For refining the monitoring configuration, a rule-based alert prediction algorithm is proposed for eliminating false alerts (false positives) without losing any real alert, and a textual classification method is applied to identify missing alerts (false negatives) from manual incident tickets. For system diagnosis, this dissertation presents an efficient algorithm for discovering temporal dependencies between system events together with their time lags, which can help administrators determine redundancies among deployed monitoring situations and dependencies among system components. To improve the efficiency of incident ticket resolution, several KNN-based algorithms that recommend relevant historical tickets with resolutions for incoming tickets are investigated. Finally, this dissertation offers a novel algorithm for searching similar textual event segments over large system logs, which assists administrators in locating similar system behaviors in the logs. Extensive empirical evaluation on system logs, events and tickets from real IT infrastructures demonstrates the effectiveness and efficiency of the proposed approaches.
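The dissertation's own implementations are not reproduced in this record; the following is a minimal sketch of the KNN-style ticket recommendation idea it describes, assuming TF-IDF features over ticket text and cosine similarity. The library choice (scikit-learn), ticket texts, and resolutions are illustrative placeholders, not the author's data or code.

```python
# Minimal sketch of KNN-style resolution recommendation for incident tickets.
# Historical tickets and resolutions below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

historical_tickets = [
    ("disk /dev/sda1 usage above 95% on host db01", "Extended the filesystem and archived old logs."),
    ("service httpd not responding on web03", "Restarted httpd and cleared a stale PID file."),
    ("high swap usage detected on app02", "Identified a memory leak and restarted the offending process."),
]

texts = [text for text, _ in historical_tickets]
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(texts)

# Cosine distance over TF-IDF vectors; the k nearest historical tickets are recommended.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(matrix)

def recommend(incoming_ticket: str):
    """Return the most similar historical tickets with their resolutions."""
    query = vectorizer.transform([incoming_ticket])
    distances, indices = knn.kneighbors(query)
    return [(texts[i], historical_tickets[i][1], 1.0 - d)
            for i, d in zip(indices[0], distances[0])]

for text, resolution, similarity in recommend("filesystem /dev/sda1 almost full on db07"):
    print(f"{similarity:.2f}  {text}  ->  {resolution}")
```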
2

Understanding a large-scale IPTV network via system logs

Qiu, Tongqing 08 July 2011 (has links)
Recently, there has been a global trend in the telecommunication industry toward rapid deployment of IPTV (Internet Protocol Television) infrastructure and services. While the industry rushes into the IPTV era, a comprehensive understanding of the status and dynamics of IPTV networks lags behind. Filling this gap requires in-depth analysis of large amounts of measurement data across the IPTV network. One type of data of particular interest is the device or system log, which has not been systematically studied before. In this dissertation, we explore the possibility of utilizing system logs to serve a wide range of IPTV network management purposes, including health monitoring, troubleshooting, and performance evaluation. In particular, we develop a tool to convert raw router syslogs into meaningful network events. In addition, by analyzing set-top box (STB) logs, we propose a series of models to capture both channel popularity and dynamics, and users' activity on the IPTV network.
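The dissertation's syslog-to-event tool is not included in this record; below is a minimal sketch of the general idea of turning raw router syslog lines into coarse events by masking variable fields into templates. The sample lines, regexes, and grouping are illustrative assumptions, not the author's method.

```python
# Minimal sketch: turn raw router syslog lines into coarse "events" by
# masking variable fields (IPs, interface names, numbers) to form templates.
import re
from collections import Counter

RAW_LINES = [
    "Jan 12 03:22:41 router7 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down",
    "Jan 12 03:22:44 router7 %LINK-3-UPDOWN: Interface Gi0/2, changed state to down",
    "Jan 12 03:23:01 router7 %SYS-5-CONFIG_I: Configured from console by 10.0.0.5",
]

def to_template(message: str) -> str:
    """Mask variable tokens so that lines of the same type share one template."""
    masked = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<ip>", message)      # IPv4 addresses
    masked = re.sub(r"\b[A-Za-z]+\d+(/\d+)*\b", "<iface>", masked)   # interface names
    masked = re.sub(r"\b\d+\b", "<num>", masked)                     # remaining numbers
    return masked

events = Counter()
for line in RAW_LINES:
    # Syslog prefix assumed as "<month> <day> <time> <host> <message>"
    _, _, _, _, message = line.split(" ", 4)
    events[to_template(message)] += 1

for template, count in events.most_common():
    print(count, template)
```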
3

Toward Resilience in High Performance Computing: A Prototype to Analyze and Predict System Behavior

Ghiasvand, Siavash 16 October 2020 (has links)
Following the growth of high performance computing (HPC) systems in size and complexity, and the advent of faster and more complex Exascale systems, failures have become the norm rather than the exception. Hence, the protection mechanisms need to be improved. Established mechanisms such as checkpoint/restart or redundancy may also fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a new protection approach that is beneficial for HPC systems with a short mean time between failures. The failure prediction mechanism extends the existing protection mechanisms via dynamic adjustment of the protection level. This work provides a prototype to analyze and predict system behavior using statistical analysis, to pave the path toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and further improve the accuracy of failure predictions, up to 85%. The fully automatic, unsupervised behavior analysis approach proposed in this work is a novel solution to protect future extreme-scale systems against failures.

Contents:
1 Introduction
  1.1 Background and Statement of the Problem
  1.2 Purpose and Significance of the Study
  1.3 Jam-e Jam: A System Behavior Analyzer
2 Review of the Literature
  2.1 Syslog Analysis
  2.2 Users and Systems Privacy
  2.3 Failure Detection and Prediction
    2.3.1 Failure Correlation
    2.3.2 Anomaly Detection
    2.3.3 Prediction Methods
    2.3.4 Prediction Accuracy and Lead Time
3 Data Collection and Preparation
  3.1 Taurus HPC Cluster
  3.2 Monitoring Data
    3.2.1 Data Collection
    3.2.2 Taurus System Log Dataset
  3.3 Data Preparation
    3.3.1 Users and Systems Privacy
    3.3.2 Storage and Size Reduction
    3.3.3 Automation and Improvements
    3.3.4 Data Discretization and Noise Mitigation
    3.3.5 Cleansed Taurus System Log Dataset
  3.4 Marking Potential Failures
4 Failure Prediction
  4.1 Null Hypothesis
  4.2 Failure Correlation
    4.2.1 Node Vicinities
    4.2.2 Impact of Vicinities
  4.3 Anomaly Detection
    4.3.1 Statistical Analysis (frequency)
    4.3.2 Pattern Detection (order)
    4.3.3 Machine Learning
  4.4 Adaptive Resilience
5 Results
  5.1 Taurus System Logs
  5.2 System-wide Failure Patterns
  5.3 Failure Correlations
  5.4 Taurus Failures Statistics
  5.5 Jam-e Jam Prototype
  5.6 Summary and Discussion
6 Conclusion and Future Works
Bibliography
List of Figures
List of Tables
Appendix A Neural Network Models
Appendix B External Tools
Appendix C Structure of Failure Metadata Database
Appendix D Reproducibility
Appendix E Publicly Available HPC Monitoring Datasets
Appendix F Glossary
Appendix G Acronyms
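The prototype itself is not part of this record; the following is a minimal sketch of the kind of frequency-based anomaly detection the abstract refers to, flagging time bins whose per-node log volume deviates strongly from recent history. The window size, threshold, and counts are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch: flag time bins whose per-node log frequency deviates strongly
# from the recent history (rolling mean and standard deviation, z-score threshold).
import statistics

# Messages per 1-minute bin for one node; roughly 30 minutes of illustrative history.
counts = [12, 14, 11, 13, 12, 15, 13, 12, 14, 13,
          12, 11, 13, 14, 12, 13, 15, 12, 11, 13,
          14, 12, 13, 12, 95, 13, 12, 14, 13, 12]

WINDOW = 10      # bins of history used as the baseline
THRESHOLD = 3.0  # z-score above which a bin is considered anomalous

anomalies = []
for i in range(WINDOW, len(counts)):
    history = counts[i - WINDOW:i]
    mean = statistics.mean(history)
    std = statistics.pstdev(history) or 1.0  # avoid division by zero on a flat history
    z = (counts[i] - mean) / std
    if z > THRESHOLD:
        anomalies.append((i, counts[i], round(z, 1)))

print(anomalies)  # the bin with 95 messages stands out as anomalous
```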
4

Anomaly Detection for Root Cause Analysis in System Logs using Long Short-Term Memory / Anomalidetektion för Grundorsaksanalys i Loggar från Mjukvara med hjälp av Long Short-Term Memory

von Hacht, Johan January 2021 (has links)
Many software systems are under test to ensure that they function as expected. Sometimes a test fails, and in that case it is essential to understand the cause of the failure. However, as systems grow larger and become more complex, this task can become non-trivial and time-consuming. Therefore, automating the process of root cause analysis, even partially, can save time for the developers involved. This thesis investigates the use of a Long Short-Term Memory (LSTM) anomaly detector on system logs for root cause analysis. The implementation is evaluated in a quantitative and a qualitative experiment. The quantitative experiment evaluates the performance of the anomaly detector in terms of precision, recall, and F1 measure. Anomaly injection is used to measure these metrics, since there are no labels in the data. Additionally, the LSTM is compared with a baseline model. The qualitative experiment evaluates how effective the anomaly detector could be for root cause analysis of the test failures. This was evaluated in interviews with an expert in the software system that produced the log data used in this thesis. The results show that the LSTM anomaly detector achieved a higher F1 measure than the proposed baseline implementation, thanks to its ability to detect unusual events and events happening out of order. The qualitative results indicate that the anomaly detector could be used for root cause analysis. In many of the evaluated test failures, the expert being interviewed could deduce the cause of the failure. Even when the detector did not find the exact issue, it could highlight a particular part of the software as producing many anomalous log messages. With this information, the expert could contact the people responsible for that part of the application for help. In conclusion, the anomaly detector automatically collects the information necessary for the expert to perform root cause analysis. As a result, it could save the expert time on this task. With further improvements, it could also be possible for non-experts to utilise the anomaly detector, reducing the need for an expert. / Many software systems are tested to ensure that they work as intended. Sometimes a test fails, and in that case it is important to understand why. This can become problematic as software systems grow and become more complex, since the task can become non-trivial and time-consuming. Automating the troubleshooting process could save the developers involved a great deal of time. This thesis investigates the use of a Long Short-Term Memory (LSTM) anomaly detector for root cause analysis in logs. The implementation is evaluated in a quantitative and a qualitative study. The quantitative study evaluates the performance of the anomaly detector using precision, recall, and F1 measures. Artificially injected anomalies are used to compute these measures, since the data has no labels. The implementation is also compared with a simple baseline anomaly detector. The qualitative study evaluates how useful the anomaly detector is for root cause analysis of failed tests. This was evaluated through interviews with an expert in the software that produced the data used in this thesis. The results show that the LSTM anomaly detector achieved a higher F1 measure than the simple model, thanks to its ability to detect unusual log messages and log messages occurring in the wrong order. The qualitative results indicate that the anomaly detector can be used for root cause analysis of failed tests. In many of the evaluated test failures, the expert could find the reason for the failure from what the anomaly detector reported. Even when the detector did not find the exact cause, it could highlight a particular part of the software, meaning that that part produced many anomalous log messages. With this information, the expert can contact people who know that part of the software better for help. The anomaly detector automatically collects the information the expert needs to perform root cause analysis, so the expert can spend less time on this task. With some improvements, it could also be possible for less experienced developers to use the anomaly detector, reducing the need for an expert.
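The thesis's model, data, and hyperparameters are not reproduced in this record; below is a minimal sketch of the general LSTM approach to log anomaly detection, predicting the next log key in a sequence and flagging keys that fall outside the top-k predictions. PyTorch is assumed, and the vocabulary size, window length, and top-k value are illustrative choices, not the thesis's parameters.

```python
# Minimal sketch: next-log-key prediction with an LSTM; a log key is flagged as
# anomalous if it is not among the model's top-k predictions for its position.
import torch
import torch.nn as nn

VOCAB = 32     # number of distinct log templates ("log keys"); illustrative
WINDOW = 10    # history length fed to the model
TOP_K = 3      # observed key must appear in the top-k predictions to count as normal

class NextKeyLSTM(nn.Module):
    def __init__(self, vocab=VOCAB, embed=16, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, x):                 # x: (batch, WINDOW) of integer key ids
        h, _ = self.lstm(self.embed(x))   # (batch, WINDOW, hidden)
        return self.out(h[:, -1, :])      # logits over the next key

def is_anomalous(model, history, next_key):
    """Flag next_key if it is outside the model's top-k predicted keys."""
    with torch.no_grad():
        logits = model(torch.tensor([history]))
        topk = torch.topk(logits, TOP_K, dim=-1).indices[0].tolist()
    return next_key not in topk

# Training on sliding windows of normal sequences (cross-entropy on the next key)
# is omitted; with an untrained model the call below only demonstrates the interface.
model = NextKeyLSTM()
print(is_anomalous(model, history=[1, 4, 2, 7, 3, 1, 4, 2, 7, 3], next_key=9))
```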
