1 |
Towards outlier detection for high-dimensional data streams using projected outlier analysis strategy
Zhang, Ji January 2008
[Abstract]: Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large data sets. Most existing outlier detection methods only deal with static data with relatively low dimensionality. Recently, outlier detection for high-dimensional stream data has become an emerging research problem. A key observation that motivates this research is that outliers in high-dimensional data are projected outliers, i.e., they are embedded in lower-dimensional subspaces. Detecting projected outliers from high-dimensional stream data is a very challenging task for several reasons. First, detecting projected outliers is difficult even for high-dimensional static data. The exhaustive search for the outlying subspaces where projected outliers are embedded is an NP problem. Second, the algorithms for handling data streams are constrained to take only one pass to process the streaming data under the conditions of space limitation and time criticality. The currently existing methods for outlier detection are found to be ineffective for detecting projected outliers in high-dimensional data streams. In this thesis, we present a new technique, called the Stream Project Outlier deTector (SPOT), which attempts to detect projected outliers in high-dimensional data streams. SPOT employs an innovative window-based time model in capturing dynamic statistics from stream data, and a novel data structure containing a set of top sparse subspaces to detect projected outliers effectively. SPOT also employs a multi-objective genetic algorithm as an effective search method for finding the outlying subspaces where most projected outliers are embedded. The experimental results demonstrate that SPOT is efficient and effective in detecting projected outliers for high-dimensional data streams.
The main contribution of this thesis is that it provides a backbone in tackling the challenging problem of outlier detection for high-dimensional data streams. SPOT can facilitate the discovery of useful abnormal patterns and can be potentially applied to a variety of high demand applications, such as for sensor network data monitoring, online transaction protection, etc.
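To give a rough sense of why the exhaustive subspace search mentioned in the abstract is impractical, the following sketch counts the candidate subspaces for a data set with d attributes. It is only an illustration of the combinatorics, not part of SPOT; the function name count_subspaces and the max_dim cap are assumptions introduced here.

```python
from math import comb
from typing import Optional


def count_subspaces(d: int, max_dim: Optional[int] = None) -> int:
    """Count the non-empty attribute subsets (candidate subspaces) of d attributes.

    If max_dim is given, only subspaces with at most max_dim attributes are counted.
    Both names are hypothetical and used for illustration only.
    """
    if max_dim is None:
        max_dim = d
    return sum(comb(d, k) for k in range(1, max_dim + 1))


# The full count is 2**d - 1, so it doubles with every added attribute.
for d in (10, 20, 50):
    print(d, count_subspaces(d), count_subspaces(d, max_dim=3))
```

Even when the search is capped at low-dimensional subspaces, the count grows polynomially in d, which is consistent with the abstract's motivation for using a heuristic search instead of enumeration.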
|
2 |
Some problems in the detection and accommodation of outliers in gamma samples
Kimber, A. C. January 1980
No description available.
|
3 |
Exploration Framework For Detecting Outliers In Data Streams
Sean, Viseth 27 April 2016
Current real-world applications are generating a large volume of datasets that are often continuously updated over time. Detecting outliers on such evolving datasets requires us to continuously update the result. Furthermore, the response time is very important for these time-critical applications. This is challenging. First, the algorithm is complex; even mining outliers from a static dataset once is already very expensive. Second, users need to specify input parameters to approach the true outliers. Because the number of parameters is large, using a trial-and-error approach online would be not only impractical and expensive but also tedious for the analysts. Worse yet, since the dataset is changing, the best parameter will need to be updated to respond to user exploration requests. Overall, the large number of parameter settings and evolving datasets make the problem of efficiently mining outliers from dynamic datasets very challenging. Thus, in this thesis, we design an exploration framework for detecting outliers in data streams, called EFO, which enables analysts to continuously explore anomalies in dynamic datasets. EFO is a continuous lightweight preprocessing framework. EFO embraces two optimization principles, namely "best life expectancy" and "minimal trial," to compress evolving datasets into a knowledge-rich abstraction of important interrelationships among data. An incremental sorting technique is also used to leverage the almost ordered lists in this framework. Thereafter, the knowledge abstraction generated by EFO supports not only traditional outlier detection requests but also novel outlier exploration operations on evolving datasets. Our experimental study conducted on two real datasets demonstrates that EFO outperforms the state-of-the-art technique in terms of CPU processing costs when varying stream volume, velocity and outlier rate.
|
4 |
A simple univariate outlier identification procedure on ratio data collected by the Department of Revenue for the state of Kansas
Jun, Hyoungjin January 1900
Master of Science / Department of Statistics / John E. Boyer Jr / In order to impose fair taxes on properties, it is required that appraisers annually estimate prices of all the properties in each of the counties in Kansas. The Department of Revenue of Kansas oversees the quality of work of appraisers in each county. The Department of Revenue uses ratio data which is appraisal price divided by sale price for those parcels which are sold during the year as a basis for evaluating the work of the appraisers. They know that there are outliers in these ratio data sets and these outliers can impact their evaluations of the county appraisers.
The Department of Revenue has been using a simple box plot procedure to identify outliers for the previous 10 years. Staff members have questioned whether there might be a need for improvement in the procedure. They considered the possibility of tuning the procedure to depend on distributions and sample sizes. One possible solution is the methodology suggested by Iglewicz et al. (2007).
In this report, we examine the new methodology and attempt to apply it to ratio data sets provided by the Department of Revenue.
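As a rough illustration of a simple box plot procedure on ratio data, the sketch below flags ratios outside the conventional Tukey fences (quartiles plus or minus 1.5 times the interquartile range). The abstract does not state which fences the Department actually uses, so the multiplier k, the function name boxplot_outliers, and the sample ratios are all assumptions made for illustration.

```python
from statistics import quantiles


def boxplot_outliers(ratios, k=1.5):
    """Return the ratios lying outside the box plot fences.

    k = 1.5 is the conventional Tukey multiplier, assumed here; it is the
    kind of setting one might tune by distribution or sample size.
    """
    q1, _, q3 = quantiles(ratios, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [r for r in ratios if r < lower or r > upper]


# Hypothetical appraisal/sale ratios for one county (not real data).
ratios = [0.92, 0.95, 0.97, 0.98, 1.00, 1.01, 1.03, 1.05, 1.60]
print(boxplot_outliers(ratios))  # prints [1.6]
```

With these made-up values the quartiles are 0.97 and 1.03, so the fences sit at about 0.88 and 1.12 and only the ratio 1.60 is flagged.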
|
5 |
Scalable Multi-Parameter Outlier Detection Technology
Wang, Jiayuan 23 December 2013
"The real-time detection of anomalous phenomena on streaming data has become increasingly important for applications ranging from fraud detection, financial analysis to traffic management. In these streaming applications, often a large number of similar continuous outlier detection queries are executed concurrently. In the light of the high algorithmic complexity of detecting and maintaining outlier patterns for different parameter settings independently, we propose a shared execution methodology called SOP that handles a large batch of requests with diverse pattern configurations. First, our systematic analysis reveals opportunities for maximum resource sharing by leveraging commonalities among outlier detection queries. For that, we introduce a sharing strategy that integrates all computation results into one compact data structure. It leverages temporal relationships among stream data points to prioritize the probing process. Second, this work is the first to consider predicate constraints in the outlier detection context. By distinguishing between target and scope constraints, customized fragment sharing and block selection strategies can be effectively applied to maximize the efficiency of system resource utilization. Our experimental studies utilizing real stream data demonstrate that our approach performs 3 orders of magnitude faster than the start-of-the-art and scales to 1000s of queries."
|
6 |
Anomaly Handling in Visual Analytics
Nguyen, Quyen Do 23 December 2007
"Visual analytics is an emerging field which uses visual techniques to interact with users in the analytical reasoning process. Users can choose the most appropriate representation that conveys the important content of their data by acting upon different visual displays. The data itself has many features of interest, including clusters, trends (commonalities) and anomalies. Most visualization techniques currently focus on the discovery of trends and other relations, where uncommon phenomena are treated as outliers and are either removed from the datasets or de-emphasized on the visual displays. Much less work has been done on the visual analysis of outliers, or anomalies. In this thesis, I will introduce a method to identify the different levels of “outlierness†by using interactive selection and other approaches to process outliers after detection. In one approach, the values of these outliers will be estimated from the values of their k-Nearest Neighbors and replaced to increase the consistency of the whole dataset. Other approaches will leave users with the choice of removing the outliers from the graphs or highlighting the unusual patterns on the graphs if points of interest lie in these anomalous regions. I will develop and test these anomaly handling methods within the XMDV Tool."
|
7 |
Exploring Ways of Identifying Outliers in Spatial Point Patterns
Liu, Jie 01 May 2015
This work discusses alternative methods to detect outliers in spatial point patterns.
Outliers are defined based on location only and also with respect to associated variables. Throughout the thesis we discuss five case studies: three come from experiments with spiders and bees, and the other two use earthquake data from a certain region. One of the main conclusions is that when detecting outliers from the point of view of location, we need to take into consideration both the degree of clustering of the events and the context of the study. When detecting outliers from the point of view of an associated variable, outliers can be identified from a global or local perspective. For global outliers, one of the main questions addressed is whether the outliers tend to be clustered or randomly distributed in the region. All the work was done using the R programming language.
|
8 |
Integrated circuit outlier identification by multiple parameter correlation
Sabade, Sagar Suresh 30 September 2004
Semiconductor manufacturers must ensure that chips conform to their specifications before they are shipped to customers. This is achieved by testing various parameters of a chip to determine whether it is defective or not. Separating defective chips from fault-free ones is relatively straightforward for functional or other Boolean tests that produce a go/no-go type of result. However, making this distinction is extremely challenging for parametric tests. Owing to continuous distributions of parameters, any pass/fail threshold results in yield loss and/or test escapes. The continuous advances in process technology, increased process variations and inaccurate fault models all make this even worse. The pass/fail thresholds for such tests are usually set using prior experience or by a combination of visual inspection and engineering judgment. Many chips have parameters that exceed certain thresholds but pass Boolean tests. Owing to the imperfect nature of tests, to determine whether these chips (called "outliers") are indeed defective is nontrivial. To avoid wasted investment in packaging or further testing it is important to screen defective chips early in a test flow. Moreover, if seemingly strange behavior of outlier chips can be explained with the help of certain process parameters or by correlating additional test data, such chips can be retained in the test flow before they are proved to be fatally flawed. In this research, we investigate several methods to identify true outliers (defective chips, or chips that lead to functional failure) from apparent outliers (seemingly defective, but fault-free chips). The outlier identification methods in this research primarily rely on wafer-level spatial correlation, but also use additional test parameters. These methods are evaluated and validated using industrial test data. The potential of these methods to reduce burn-in is discussed.
|
10 |
Outlier Detection with Applications in Graph Data Mining
Ranga Suri, N N R January 2013
Outlier detection is an important data mining task due to its applicability in many contemporary applications such as fraud detection and anomaly detection in networks. It assumes significance due to the general perception that outliers represent evolving novel patterns in data that are critical to many discovery tasks. Extensive use of various data mining techniques in different application domains has given rise to the rapid proliferation of research work on the outlier detection problem. This has led to the development of numerous methods for detecting outliers in various problem settings. However, most of these methods deal primarily with numeric data. Therefore, the problem of outlier detection in categorical data has been considered in this work for developing some novel methods addressing various research issues. Firstly, a ranking-based algorithm for detecting a likely set of outliers in a given categorical data set has been developed employing two independent ranking schemes. Subsequently, the issue of data dimensionality has been addressed by proposing a novel unsupervised feature selection algorithm on categorical data. Similarly, the uncertainty associated with the outlier detection task has also been suitably dealt with by developing a novel rough-sets-based categorical clustering algorithm.
Due to the networked nature of the data pertaining to many real-life applications such as computer communication networks, social networks of friends, the citation networks of documents, hyper-linked networks of web pages, etc., outlier detection (also known as anomaly detection) in the graph representation of network data turns out to be an important pattern discovery activity. Accordingly, a novel graph mining method has been envisaged in this thesis based on the concept of community detection in graphs. In addition to finding anomalous nodes and anomalous edges, this method is capable of detecting various higher-level anomalies that are arbitrary sub-graphs of the input graph. Subsequently, these ideas have been further extended in this thesis to characterize the time-varying behavior of outliers (anomalies) in dynamic network data by defining various categories of temporal outliers (anomalies). Characterizing the behavior of such outliers during the evolution of the network over time is critical for discovering different anomalous connectivity patterns with potential adverse effects such as intrusions into a computer network. In order to deal with temporal outlier detection in single-instance network/graph data, the link prediction task has been leveraged in this thesis to produce multiple instances of the input graph. Thus, various outlier detection principles have been successfully applied for mining various categories of temporal outliers (anomalies) in the graph representation of network data.
|