
Log Data Analysis for Software Diagnosis: The Machine Learning Theories and Applications / Machine Learning for Log Data Analysis

This research investigates software failure and fault analysis through data-driven machine learning approaches. Faults can occur in any software system and can severely degrade system reliability and user experience. Log data, the machine-generated records of system status, is often the primary source of information for tracking down a fault. This study aims to develop automated systems that recognize recurring faults by analyzing system log data. The methodology developed in this research applies to the Ford SYNC vehicle infotainment system as well as other systems that produce similar log data.
Log data has traditionally been examined manually to trace and localize faults. This manual process can be effective and is sometimes the only feasible way to troubleshoot software faults. However, as the volume of log data grows with the increasing complexity and scale of software, the manual workload becomes overwhelming. During system-level validation tests, all system components produce log data simultaneously, yielding tens of thousands of lines of log messages in just a few minutes. Automated diagnosis is therefore a promising approach to log data analysis.
Three machine learning approaches are investigated in this research to tackle the fault diagnosis problem: 1) the data mining approach; 2) the statistical feature approach; and 3) the deep learning approach. The first method attempts to mimic how human experts examine log data: log sequences representing a fault are extracted through data mining techniques and used to identify anomalies. This method is effective on small volumes of data, but its computational cost becomes an issue when scaling to larger datasets. The second method, as its name suggests, extracts statistical and numerical features from the log data and trains a machine learning model on them for decision making (a brief sketch follows below). Describing log data with numerical features offers a significant computational efficiency improvement over working directly with sequences. The third approach adopts deep learning models that process the log data in sequential form, enabling more sophisticated feature extraction that often exceeds human capability. In this research, all three methods are implemented in a controlled testing environment, and their strengths and weaknesses are evaluated comparatively.
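As an illustration of the statistical feature approach, the following minimal sketch summarizes each window of parsed log events as an event-count vector and trains an off-the-shelf classifier on it. The parsing step, the feature choice, and all names here (extract_features, the toy windows and labels) are illustrative assumptions, not the implementation developed in the thesis.

    # Illustrative sketch of the statistical feature approach: each window of
    # parsed log event IDs becomes a fixed-length count vector, which a
    # standard classifier then labels as normal or faulty.
    from collections import Counter
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def extract_features(window, n_event_types):
        # One count per event type, so every window maps to the same length.
        counts = Counter(window)
        return np.array([counts.get(e, 0) for e in range(n_event_types)])

    # Toy data (hypothetical): three windows of parsed event IDs with labels.
    windows = [[0, 1, 1, 2], [0, 2, 2, 2], [1, 1, 3, 3]]
    labels = [0, 0, 1]  # 0 = normal, 1 = faulty

    X = np.stack([extract_features(w, n_event_types=4) for w in windows])
    clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
    print(clf.predict(X))

Because every window collapses to a short numeric vector, training and inference scale far better than sequence matching, which is the efficiency advantage the paragraph above describes.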
This study also reports a novel finding: the time information in a log sequence plays an important role in distinguishing a faulty condition from a normal one. In most software systems, log sequences are unevenly spaced, meaning that the timestamps associated with log data are nonuniform. Existing log analysis studies have generally emphasized the log sequences while overlooking this time information. This research proposes a novel deep learning structure that unifies the processing of timestamps and log sequences: the timestamps are integrated through interpolation at an intermediate layer of a neural network. Testing results demonstrate that including timestamps contributes significantly to identifying faults and pushes model performance to a higher level.
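To make the idea concrete, here is a minimal sketch of one way timestamps could be fused with a log sequence at an intermediate layer: the hidden states produced for an unevenly spaced sequence are resampled onto a uniform time grid before classification. This is an assumption-laden illustration, not the dissertation's actual architecture; the class name, layer sizes, and the nearest-neighbor resampling (standing in for the interpolation step) are all invented for the example.

    # Hedged sketch: fuse nonuniform timestamps with a log sequence by
    # resampling intermediate hidden states onto a uniform time grid.
    # Not the thesis architecture; all names and sizes are made up.
    import torch
    import torch.nn as nn

    class TimeAwareLogClassifier(nn.Module):
        def __init__(self, n_event_types, hidden=32, grid_size=16):
            super().__init__()
            self.embed = nn.Embedding(n_event_types, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            self.grid_size = grid_size                    # uniform grid length
            self.head = nn.Linear(hidden * grid_size, 2)  # normal vs. faulty

        def forward(self, events, times):
            # events: (batch, seq) event IDs; times: (batch, seq) sorted stamps
            h, _ = self.encoder(self.embed(events))       # (batch, seq, hidden)
            out = []
            for hb, tb in zip(h, times):
                # Uniform time grid spanning this window's timestamps.
                grid = torch.linspace(float(tb[0]), float(tb[-1]), self.grid_size)
                # Nearest-neighbor resampling of hidden states in time
                # (the simplest stand-in for an interpolation layer).
                idx = torch.searchsorted(tb.contiguous(), grid)
                idx = idx.clamp(max=len(tb) - 1)
                out.append(hb[idx].flatten())
            return self.head(torch.stack(out))

    model = TimeAwareLogClassifier(n_event_types=50)
    events = torch.randint(0, 50, (4, 20))               # 4 windows, 20 events
    times = torch.sort(torch.rand(4, 20), dim=1).values  # uneven timestamps
    print(model(events, times).shape)                    # torch.Size([4, 2])

The key point the sketch captures is that the same event sequence with different inter-arrival times yields different grid-aligned features, so timing itself becomes a signal the classifier can use.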

Dissertation / Doctor of Engineering (DEng)
Identifier oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/27412
Date January 2022
Creators Huangfu, Yixin
Contributors Habibi, Saeid, Mechanical Engineering
Source Sets McMaster University
Language English
Detected Language English
Type Thesis
