About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.

Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1. Scalable Multi-Parameter Outlier Detection Technology

Wang, Jiayuan 23 December 2013 (has links)
"The real-time detection of anomalous phenomena on streaming data has become increasingly important for applications ranging from fraud detection, financial analysis to traffic management. In these streaming applications, often a large number of similar continuous outlier detection queries are executed concurrently. In the light of the high algorithmic complexity of detecting and maintaining outlier patterns for different parameter settings independently, we propose a shared execution methodology called SOP that handles a large batch of requests with diverse pattern configurations. First, our systematic analysis reveals opportunities for maximum resource sharing by leveraging commonalities among outlier detection queries. For that, we introduce a sharing strategy that integrates all computation results into one compact data structure. It leverages temporal relationships among stream data points to prioritize the probing process. Second, this work is the first to consider predicate constraints in the outlier detection context. By distinguishing between target and scope constraints, customized fragment sharing and block selection strategies can be effectively applied to maximize the efficiency of system resource utilization. Our experimental studies utilizing real stream data demonstrate that our approach performs 3 orders of magnitude faster than the start-of-the-art and scales to 1000s of queries."
2. Outlier Detection with Applications in Graph Data Mining

Ranga Suri, N N R January 2013 (has links) (PDF)
Outlier detection is an important data mining task due to its applicability in many contemporary applications such as fraud detection and anomaly detection in networks. It assumes significance due to the general perception that outliers represent evolving novel patterns in data that are critical to many discovery tasks. Extensive use of various data mining techniques in different application domains gave rise to the rapid proliferation of research work on the outlier detection problem. This has led to the development of numerous methods for detecting outliers in various problem settings. However, most of these methods deal primarily with numeric data. Therefore, the problem of outlier detection in categorical data has been considered in this work for developing some novel methods addressing various research issues. Firstly, a ranking-based algorithm for detecting a likely set of outliers in given categorical data has been developed employing two independent ranking schemes. Subsequently, the issue of data dimensionality has been addressed by proposing a novel unsupervised feature selection algorithm for categorical data. Similarly, the uncertainty associated with the outlier detection task has also been suitably dealt with by developing a novel rough-sets-based categorical clustering algorithm. Due to the networked nature of the data pertaining to many real-life applications such as computer communication networks, social networks of friends, citation networks of documents, and hyper-linked networks of web pages, outlier detection (also known as anomaly detection) in graph representations of network data turns out to be an important pattern discovery activity. Accordingly, a novel graph mining method has been envisaged in this thesis based on the concept of community detection in graphs. In addition to finding anomalous nodes and anomalous edges, this method is capable of detecting various higher-level anomalies that are arbitrary sub-graphs of the input graph. Subsequently, these ideas have been further extended in this thesis to characterize the time-varying behavior of outliers (anomalies) in dynamic network data by defining various categories of temporal outliers (anomalies). Characterizing the behavior of such outliers during the evolution of the network over time is critical for discovering different anomalous connectivity patterns with potentially adverse effects, such as intrusions into a computer network. In order to deal with temporal outlier detection in single-instance network/graph data, the link prediction task has been leveraged in this thesis to produce multiple instances of the input graph. Thus, various outlier detection principles have been successfully applied for mining various categories of temporal outliers (anomalies) in the graph representation of network data.
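The abstract does not spell out the thesis's two ranking schemes. As a flavor of what ranking-based outlier detection on categorical data looks like, here is a sketch of the simple Attribute Value Frequency (AVF) baseline — not the thesis's algorithm — which ranks records by the rarity of their attribute values:

```python
import pandas as pd

def avf_scores(df: pd.DataFrame) -> pd.Series:
    """Attribute Value Frequency: score each record by the mean relative
    frequency of its attribute values; rare value combinations rank lowest."""
    freq = {c: df[c].value_counts(normalize=True) for c in df.columns}
    return df.apply(
        lambda row: sum(freq[c][row[c]] for c in df.columns) / len(df.columns),
        axis=1,
    )

data = pd.DataFrame({
    "colour": ["red", "red", "red", "blue", "red"],
    "shape":  ["box", "box", "box", "pin", "box"],
})
# Ascending sort puts the most outlying record first: ("blue", "pin").
print(avf_scores(data).sort_values())
```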
3. Use of Machine Learning for Outlier Detection in Healthy Human Brain Magnetic Resonance Imaging (MRI) Diffusion Tensor (DT) Datasets / Outlier Detection in Brain MRI Diffusion Datasets

MacPhee, Neil January 2022 (has links)
Machine learning (ML) and deep learning (DL) are powerful techniques that allow for analysis and classification of large MRI datasets. With the growing accessibility of high-powered computing and large data storage, there has been an explosive interest in their uses for assisting clinical analysis and interpretation. Though these methods can provide insights into the data which are not possible through human analysis alone, they require significantly large datasets for training, which can be difficult for anyone (researcher or clinician) to obtain on their own. The growing use of publicly available, multi-site databases helps solve this problem. Inadvertently, however, these databases can sometimes contain outliers or incorrectly labeled data, as the subjects may or may not have subclinical or underlying pathology unbeknownst to them or to those who did the data collection. Due to the outlier sensitivity of ML and DL techniques, inclusion of such data can lead to poor classification rates and subsequent low specificity and sensitivity. Thus, the focus of this work was to evaluate large brain MRI datasets, specifically diffusion tensor imaging (DTI), for the presence of anomalies and to validate and compare different methods of anomaly detection. A total of 1029 male and female subjects aged 22 to 35 were downloaded from a global imaging repository and divided into 6 cohorts depending on their age and sex. Care was taken to minimize variance due to hardware; hence only data from a specific vendor (General Electric Healthcare) and MRI B0 field strength (i.e. 3 Tesla) were obtained. The raw DTI data (in this case DICOM images) was first preprocessed into scalar metrics (i.e. FA, RD, AD, MD) and warped to MNI152 T1 1mm standardized space using the FMRIB Software Library (FSL). Subsequently, the data was segmented into regions of interest (ROI) using the JHU DTI-based white-matter atlas and a mean was calculated for each ROI defined by that atlas. The ROI data was standardized and a Z-score was calculated for each ROI over all subjects. Four different algorithms were used for anomaly detection: Z-score outlier detection, maximum likelihood estimator (MLE) and minimum covariance determinant (MCD) based Mahalanobis distance outlier detection, one-class support vector machine (OCSVM) outlier detection, and OCSVM novelty detection trained on MCD-based Mahalanobis distance data. The best outlier detector was found to be MCD-based Mahalanobis distance, with the OCSVM novelty detector performing exceptionally well on the MCD-based Mahalanobis distance data. From the results of this study, it is clear that these global databases contain outliers within their healthy control datasets, further reinforcing the need for the inclusion of outlier or novelty detection as part of the preprocessing pipeline for ML and DL related studies. / Thesis / Master of Applied Science (MASc) / Artificial intelligence (AI) refers to the ability of a computer or robot to mimic human traits such as problem solving or learning. Recently there has been an explosive interest in its uses for assisting in clinical analysis. However, successful use of these methods requires a significantly large training set, which can often contain outliers or incorrectly labeled data. Due to the sensitivity of these techniques to outliers, this often leads to poor classification rates as well as low specificity and sensitivity.
The focus of this work was to evaluate different methods of outlier detection and investigate the presence of anomalies in large brain MRI datasets. The results of this study show that these large brain MRI datasets contain anomalies and provide a method best fit for identifying them.
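The two best-performing detectors from this study map onto standard scikit-learn components. A minimal sketch under stated assumptions — the random matrix stands in for the z-scored ROI means, and the chi-squared cutoff and OCSVM parameters are illustrative, not the thesis's settings:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # stand-in for z-scored ROI means

# MCD-based Mahalanobis distances: the robust covariance fit keeps
# outliers from inflating the estimate the distances are computed from.
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)              # squared robust distances
outlier = d2 > chi2.ppf(0.975, df=X.shape[1])   # chi^2 cutoff (assumed)

# Novelty detection: train a one-class SVM on the distance profile of the
# retained subjects, then flag new subjects whose distances look unusual.
ocsvm = OneClassSVM(nu=0.05, gamma="scale").fit(d2[~outlier].reshape(-1, 1))
novel = ocsvm.predict(d2.reshape(-1, 1)) == -1
print(outlier.sum(), novel.sum())
```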
4. High-dimensional data mining: subspace clustering, outlier detection and applications to classification

Foss, Andrew 06 1900 (has links)
Data mining in high dimensionality almost inevitably faces the consequences of increasing sparsity and declining differentiation between points. This is problematic because we usually exploit these differences for approaches such as clustering and outlier detection. In addition, the exponentially increasing sparsity tends to increase false negatives when clustering. In this thesis, we address the problem of solving high-dimensional problems using low-dimensional solutions. In clustering, we provide a new framework MAXCLUS for finding candidate subspaces and the clusters within them using only two-dimensional clustering. We demonstrate this through an implementation GCLUS that outperforms many state-of-the-art clustering algorithms and is particularly robust with respect to noise. It also handles overlapping clusters and provides either 'hard' or 'fuzzy' clustering results as desired. In order to handle extremely high dimensional problems, such as genome microarrays, given some sample-level diagnostic labels, we provide a simple but effective classifier GSEP which weights the features so that the most important can be fed to GCLUS. We show that this leads to small numbers of features (e.g. genes) that can distinguish the diagnostic classes and thus are candidates for research for developing therapeutic applications. In the field of outlier detection, several novel algorithms suited to high-dimensional data are presented (T*ENT, T*ROF, FASTOUT). It is shown that these algorithms outperform the state-of-the-art outlier detection algorithms in ranking outlierness for many datasets regardless of whether they contain rare classes or not. Our research into high-dimensional outlier detection has even shown that our approach can be a powerful means of classification for heavily overlapping classes given sufficiently high dimensionality and that this phenomenon occurs solely due to the differences in variance among the classes. On some difficult datasets, this unsupervised approach yielded better separation than the very best supervised classifiers and on other data, the results are competitive with state-of-the-art supervised approaches. The elucidation of this novel approach to classification opens a new field in data mining: classification through differences in variance rather than spatial location. As an appendix, we provide an algorithm for estimating false negative and positive rates so these can be compensated for.
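The closing claim — that heavily overlapping classes differing only in variance can be separated without labels — is easy to illustrate with a toy example. This sketch uses a plain k-NN distance score rather than the thesis's T*ENT, T*ROF, or FASTOUT algorithms, and the data and threshold are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# Two zero-mean classes, identical in location, differing only in variance:
# hopeless for any classifier that relies on spatial position alone.
low = rng.normal(0.0, 1.0, size=(500, 10))
high = rng.normal(0.0, 2.0, size=(500, 10))
X = np.vstack([low, high])
labels = np.r_[np.zeros(500, bool), np.ones(500, bool)]

# Outlierness as distance to the k-th nearest neighbor: high-variance
# points sit in sparser regions, so their scores are systematically larger.
nn = NearestNeighbors(n_neighbors=11).fit(X)   # 10 neighbors + self
dist, _ = nn.kneighbors(X)
score = dist[:, -1]
pred = score > np.median(score)                # crude 50/50 split (assumed)
print(f"separation accuracy: {(pred == labels).mean():.2f}")
```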
5. Relational Outlier Detection: Techniques and Applications

Lu, Yen-Cheng 10 June 2021 (has links)
Nowadays, outlier detection has attracted growing interest. Unlike typical outlier detection problems, relational outlier detection focuses on detecting abnormal patterns in datasets that contain relational implications within each data point. Furthermore, unlike traditional outlier detection, which focuses only on numerical data, modern outlier detection models must be able to handle data in various types and structures. Detecting relational outliers should consider (1) Dependencies among different data types, (2) Data types that are not continuous or do not have ordinal characteristics, such as binary, categorical or multi-label, and (3) Special structures in the data. This thesis focuses on the development of relational outlier detection methods and real-world applications in datasets that contain non-numerical, mixed-type, and special-structure data in three tasks, namely (1) outlier detection in mixed-type data, (2) categorical outlier detection in music genre data, and (3) outlier detection in categorized time series data. For the first task, existing solutions for mixed-type data mostly focus on computational efficiency, and their strategies are mostly heuristic-driven, lacking a statistical foundation. The proposed contributions of our work include: (1) Constructing a novel unsupervised framework based on a robust generalized linear model (GLM), (2) Developing a model that is capable of capturing large variances of outliers and dependencies among mixed-type observations, and designing an approach for approximating the analytically intractable Bayesian inference, and (3) Conducting extensive experiments to validate effectiveness and efficiency. For the second task, we extended and applied the modeling strategy to a real-world problem. The existing solutions to this specific task are mostly supervised, and traditional outlier detection methods only focus on detecting outliers by the data distributions, ignoring the input-output relation between the genres and the extracted features. The proposed contributions of our work for this task include: (1) Proposing an unsupervised outlier detection framework for music genre data, (2) Extending the GLM-based model from the first task to handle categorical responses and developing an approach to approximate the analytically intractable Bayesian inference, and (3) Conducting experiments to demonstrate that the proposed method outperforms the benchmark methods. For the third task, we focused on improving the outlier detection performance in the second task by proposing a novel framework, and expanded the research scope to general categorized time-series data. Existing studies have suggested a large number of methods for automatic time series classification. However, there is a lack of research focusing on detecting outliers from manually categorized time series. The proposed contributions of our work for this task include: (1) Proposing a novel semi-supervised robust outlier detection framework for categorized time-series datasets, (2) Further extending the new framework to an active learning system that takes user insights into account, and (3) Conducting a comprehensive set of experiments to demonstrate the performance of the proposed method in real-world applications. / Doctor of Philosophy / In recent years, outlier detection has been one of the most important topics in the data mining and machine learning research domain.
Unlike typical outlier detection problems, relational outlier detection focuses on detecting abnormal patterns in datasets that contain relational implications within each data point. Detecting relational outliers should consider (1) Dependencies among different data types, (2) Data types that are not continuous or do not have ordinal characteristics, such as binary, categorical or multi-label, and (3) Special structures in the data. This thesis focuses on the development of relational outlier detection methods and real-world applications in datasets that contain non-numerical, mixed-type, and special-structure data in three tasks, namely (1) outlier detection in mixed-type data, (2) categorical outlier detection in music genre data, and (3) outlier detection in categorized time series data. The first task aims at constructing a novel unsupervised framework, developing a model that is capable of capturing the normal pattern and the effects, and designing an approach for model fitting. In the second task, we further extended and applied the modeling strategy to a real-world problem in the music technology domain. For the third task, we expanded the research scope from the previous task to general categorized time-series data, and focused on improving the outlier detection performance by proposing a novel semi-supervised framework.
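The robust GLM framework itself is not detailed in the abstract. The following much-simplified statsmodels sketch only illustrates the underlying intuition — records that violate a dependency learned across attribute types earn large residuals; the data, coefficients, and planted violations are all assumptions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x = rng.normal(size=(n, 2))                    # numerical attributes
p = 1.0 / (1.0 + np.exp(-(x @ np.array([2.0, -1.0]))))
y = (rng.uniform(size=n) < p).astype(float)    # dependent binary attribute
y[:5] = 1.0 - y[:5]                            # plant records breaking the dependency

# Fit a logistic GLM of the binary column on the numeric columns; records
# violating the learned dependency get large deviance residuals.
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()
score = np.abs(fit.resid_deviance)
print(np.argsort(score)[::-1][:10])            # candidate relational outliers
```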
6. Outlier Detection In Big Data

Cao, Lei 29 March 2016 (has links)
The dissertation focuses on scaling outlier detection to work both on huge static and on dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention and network intrusion detection to stock investment tactical planning. For such mission-critical applications, a timely response is often of paramount importance. Yet processing outlier detection requests is algorithmically complex and resource-intensive. In this dissertation we investigate the challenges of detecting outliers in big data -- in particular those caused by the high velocity of streaming data, the big volume of static data, and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data. In this dissertation we first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP not only is able to continuously deliver outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive-scale static data sets. In this big data era, as the volume of the data advances to new levels, the power of distributed compute clusters must be employed to detect outliers in a short turnaround time. In this research, our approach optimizes the key factors determining the efficiency of distributed data analytics, namely communication costs and load balancing. In particular we prove that the traditional frequency-based load balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional one-detection-algorithm-for-all-compute-nodes approach and instead propose a novel multi-tactic methodology which adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to assist analysts to quickly extract, interpret and understand the outliers of interest. Our experimental studies, including performance evaluation and user studies conducted on real-world datasets including stock, sensor, moving object, and geolocation datasets, confirm both the effectiveness and efficiency of the proposed approaches.
7. Exploration Framework For Detecting Outliers In Data Streams

Sean, Viseth 27 April 2016 (has links)
Current real-world applications are generating a large volume of datasets that are often continuously updated over time. Detecting outliers on such evolving datasets requires us to continuously update the result. Furthermore, the response time is very important for these time-critical applications. This is challenging. First, the algorithm is complex; even mining outliers from a static dataset once is already very expensive. Second, users need to specify input parameters to approach the true outliers. Since the number of possible parameter settings is large, using a trial-and-error approach online would be not only impractical and expensive but also tedious for the analysts. Worse yet, since the dataset is changing, the best parameter setting will need to be updated to respond to user exploration requests. Overall, the large number of parameter settings and the evolving datasets make the problem of efficiently mining outliers from dynamic datasets very challenging. Thus, in this thesis, we design an exploration framework for detecting outliers in data streams, called EFO, which enables analysts to continuously explore anomalies in dynamic datasets. EFO is a continuous lightweight preprocessing framework. EFO embraces two optimization principles, namely "best life expectancy" and "minimal trial," to compress evolving datasets into a knowledge-rich abstraction of important interrelationships among data. An incremental sorting technique is also used to leverage the almost-ordered lists in this framework. Thereafter, the knowledge abstraction generated by EFO not only supports traditional outlier detection requests but also novel outlier exploration operations on evolving datasets. Our experimental study conducted on two real datasets demonstrates that EFO outperforms the state-of-the-art technique in terms of CPU processing costs when varying stream volume, velocity and outlier rate.
8. The selection of different averaging approaches on whole-body vibration exposure levels of a driver utilising the ISO 2631-1 standard

Bester, Duane January 2014 (has links)
Limited research has been conducted on inconsistencies relating to whole-body vibration (WBV) field assessments. Therefore, this study aimed to investigate one possible contributor to inconsistencies in vibration assessment work, namely averaging intervals. To our knowledge, this was the first study investigating the effect of multiple averaging approaches on WBV results. WBV parameters were measured for a driver operating a vehicle on a preselected test route utilising ISO 2631-1:1997. This was achieved utilising a Quest HavPro vibration monitor with a fitted tri-axial Integrated Circuit Piezoelectric (ICP) accelerometer pad mounted on the driver’s seat. Furthermore, in an attempt to decrease differences between observed WBV results, an outlier detection method, part of the Stata software package, was utilised to clean the data. Statistical analyses included hypothesis testing in the form of one-way ANOVA and the Kruskal-Wallis one-way analysis of variance by ranks to determine significant differences between integration intervals. Logged data time-series durations showed a W0 = 0.04, indicating unequal variance. Omission of the 60 s interval from statistical analyses showed a W0 = 0.28. The observed difference occurs when data is averaged over longer intervals, resulting in portions of data not being reflected in the final dataset. In addition, frequency-weighted root mean squared acceleration results reflected significant differences between the 1 s, 10 s, 30 s, 60 s and SLOW averaging approaches, while non-significant differences were observed for crest factors and instantaneous peak accelerations. Vibration Dose Value results reflected non-significant differences after omission of the 60 s averaging-interval data. Cleaned data showed significant differences between the various averaging approaches, as well as significant differences when compared with raw vibration data. The study therefore outlined certain inconsistencies pertaining to the selection of multiple integration intervals during the assessment of WBV exposure. Data filtering could not provide a conclusion on a suitable averaging period and, as such, further research is required to determine the correct averaging interval to be used for WBV assessment. / Dissertation (MPH)--University of Pretoria, 2014. / tm2015 / School of Health Systems and Public Health (SHSPH) / MPH / Unrestricted
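To make the averaging-interval effect concrete, here is a sketch of the two ISO 2631-1 quantities involved — interval r.m.s. acceleration and the Vibration Dose Value (VDV) — on a synthetic signal assumed to be already frequency-weighted; the sample rate, signal, and shock event are illustrative assumptions (a real assessment applies the standard's frequency weighting first):

```python
import numpy as np

fs = 100                                   # sample rate, Hz (assumed)
t = np.arange(0, 600, 1 / fs)              # a 10-minute measurement
rng = np.random.default_rng(3)
a_w = 0.3 * np.sin(2 * np.pi * 4 * t) + 0.05 * rng.normal(size=t.size)
a_w[30000:30050] += 3.0                    # a brief shock event

def interval_rms(a, fs, interval_s):
    """r.m.s. acceleration per averaging interval: sqrt(mean(a_w^2)).
    Trailing samples that do not fill an interval are dropped -- the same
    data loss the study observed with long averaging intervals."""
    n = int(interval_s * fs)
    blocks = a[: (a.size // n) * n].reshape(-1, n)
    return np.sqrt((blocks ** 2).mean(axis=1))

vdv = (np.sum(a_w ** 4) / fs) ** 0.25      # VDV = (integral of a_w^4 dt)^(1/4)
for T in (1, 10, 30, 60):
    print(f"{T:>3}s max r.m.s.: {interval_rms(a_w, fs, T).max():.3f} m/s^2")
print(f"VDV: {vdv:.2f} m/s^1.75")
```

Longer intervals smear the shock across more samples, so the maximum interval r.m.s. shrinks as T grows — the inconsistency the study attributes to the choice of averaging approach.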
9. Efficient Algorithms for Mining Large Spatio-Temporal Data

Chen, Feng 21 January 2013 (has links)
Knowledge discovery on spatio-temporal datasets has attracted growing interest. Recent advances in remote sensing technology mean that massive amounts of spatio-temporal data are being collected, and its volume keeps increasing at an ever faster pace. It becomes critical to design efficient algorithms for identifying novel and meaningful patterns from massive spatio-temporal datasets. Unlike other data sources, this data exhibits significant space-time statistical dependence, and the i.i.d. assumption is no longer valid. Exact modeling of space-time dependence causes model complexity to grow exponentially as the data size increases. This research focuses on the construction of efficient and effective approaches using approximate inference techniques for three main mining tasks: spatial outlier detection, robust spatio-temporal prediction, and novel applications to real-world problems.

Spatial novelty patterns, or spatial outliers, are those data points whose characteristics are markedly different from their spatial neighbors. There are two major branches of spatial outlier detection methodologies, which are either global (Kriging-based) or local (Laplacian-smoothing-based). The former approach requires the exact modeling of spatial dependence, which is computationally expensive; the latter approach requires the i.i.d. assumption for the smoothed observations, which is not statistically solid. Both approaches are constrained to numerical data, but in real-world applications we are often faced with a variety of non-numerical data types, such as count, binary, nominal, and ordinal. To summarize, the main research challenges are: 1) how much spatial dependence can be eliminated via Laplacian smoothing; 2) how to effectively and efficiently detect outliers in large numerical spatial datasets; 3) how to generalize numerical detection methods and develop a unified outlier detection framework suitable for large non-numerical datasets; 4) how to achieve accurate spatial prediction even when the training data has been contaminated by outliers; and 5) how to deal with spatio-temporal data for the preceding problems.

To address the first and second challenges, we mathematically validated the effectiveness of Laplacian smoothing in eliminating spatial autocorrelations. This work provides fundamental support for existing Laplacian-smoothing-based methods. We also discovered a nontrivial side-effect of Laplacian smoothing, which injects additional spatial variation into the data due to convolution effects. To capture this extra variability, we proposed a generalized local statistical model, and designed two fast forward and backward outlier detection methods that achieve a better balance between computational efficiency and accuracy than most existing methods and are well suited to large numerical spatial datasets.

We addressed the third challenge by mapping non-numerical variables to latent numerical variables via a link function, such as the logit function used in logistic regression, and then utilizing error-buffer artificial variables, which follow a Student-t distribution, to capture the large variations caused by outliers. We proposed a unified statistical framework that integrates the advantages of the spatial generalized linear mixed model, the robust spatial linear model, reduced-rank dimension reduction, and Bayesian hierarchical modeling. A linear-time approximate inference algorithm was designed to infer the posterior distribution of the error-buffer artificial variables conditioned on observations. We demonstrated that traditional numerical outlier detection methods can be directly applied to the estimated artificial variables for outlier detection. To the best of our knowledge, this is the first linear-time outlier detection algorithm that supports a variety of spatial attribute types, such as binary, count, ordinal, and nominal.

To address the fourth and fifth challenges, we proposed a robust version of the Spatio-Temporal Random Effects (STRE) model, namely the Robust STRE (R-STRE) model. The regular STRE model is a recently proposed statistical model for large spatio-temporal data that has linear time complexity, but it is not best suited for non-Gaussian and contaminated datasets. This deficiency can be systematically addressed by increasing the robustness of the model using heavy-tailed distributions, such as the Huber, Laplace, or Student-t distribution, to model the measurement error instead of the traditional Gaussian. However, the resulting R-STRE model becomes analytically intractable, and direct application of approximate inference techniques still has cubic time complexity. To address the computational challenge, we reformulated the prediction problem as a maximum a posteriori (MAP) problem with a non-smooth objective function, transformed it into an equivalent quadratic programming problem, and developed an efficient interior-point numerical algorithm with near-linear complexity. This work presents the first near-linear-time robust prediction approach for large spatio-temporal datasets in both offline and online cases. / Ph. D.
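The local branch of spatial outlier detection discussed above reduces to a short computation: compare each cell with the average of its spatial neighbors and standardize the residuals. A sketch on an assumed grid with an assumed cutoff (the thesis's generalized local model additionally corrects for the convolution side-effect noted above):

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(4)
field = rng.normal(size=(64, 64))          # gridded spatial attribute
field[20, 20] += 6.0                       # plant one spatial outlier

# Neighborhood average excluding the cell itself (3x3 window; edges are
# handled by replicating border values, which is fine for a sketch).
window_sum = uniform_filter(field, size=3, mode="nearest") * 9
neighbor_mean = (window_sum - field) / 8

# Standardized local residuals: large |z| marks a spatial outlier.
resid = field - neighbor_mean
z = (resid - resid.mean()) / resid.std()
print(np.argwhere(np.abs(z) > 4.0))        # 4-sigma cutoff (assumed)
```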
