11

Use of Machine Learning for Outlier Detection in Healthy Human Brain Magnetic Resonance Imaging (MRI) Diffusion Tensor (DT) Datasets / Outlier Detection in Brain MRI Diffusion Datasets

MacPhee, Neil January 2022 (has links)
Machine learning (ML) and deep learning (DL) are powerful techniques that allow for analysis and classification of large MRI datasets. With the growing accessibility of high-powered computing and large data storage, there has been an explosive interest in their uses for assisting clinical analysis and interpretation. Though these methods can provide insights into the data which are not possible through human analysis alone, they require significantly large datasets for training, which can be difficult for any researcher or clinician to obtain on their own. The growing use of publicly available, multi-site databases helps solve this problem. Inadvertently, however, these databases can sometimes contain outliers or incorrectly labeled data, as subjects may have subclinical or underlying pathology unknown to them or to those who collected the data. Due to the outlier sensitivity of ML and DL techniques, inclusion of such data can lead to poor classification rates and subsequent low specificity and sensitivity. Thus, the focus of this work was to evaluate large brain MRI datasets, specifically diffusion tensor imaging (DTI), for the presence of anomalies and to validate and compare different methods of anomaly detection. Data from a total of 1029 male and female subjects aged 22 to 35 were downloaded from a global imaging repository and divided into 6 cohorts depending on their age and sex. Care was taken to minimize variance due to hardware, and hence only data from a specific vendor (General Electric Healthcare) and MRI B0 field strength (i.e. 3 Tesla) were obtained. The raw DTI data (in this case, DICOM images) were first preprocessed into scalar metrics (FA, RD, AD, MD) and warped to MNI152 T1 1mm standardized space using the FMRIB Software Library (FSL). Subsequently, the data were segmented into regions of interest (ROIs) using the JHU DTI-based white-matter atlas, and a mean was calculated for each ROI defined by that atlas. The ROI data were standardized and a Z-score, for each ROI over all subjects, was calculated. Four different algorithms were used for anomaly detection: Z-score outlier detection, maximum likelihood estimator (MLE) and minimum covariance determinant (MCD) based Mahalanobis distance outlier detection, one-class support vector machine (OCSVM) outlier detection, and OCSVM novelty detection trained on MCD-based Mahalanobis distance data. The best outlier detector was found to be MCD-based Mahalanobis distance, with the OCSVM novelty detector performing exceptionally well on the MCD-based Mahalanobis distance data. From the results of this study, it is clear that these global databases contain outliers within their healthy control datasets, further reinforcing the need for the inclusion of outlier or novelty detection as part of the preprocessing pipeline for ML- and DL-related studies. / Thesis / Master of Applied Science (MASc) / Artificial intelligence (AI) refers to the ability of a computer or robot to mimic human traits such as problem solving or learning. Recently there has been an explosive interest in its uses for assisting in clinical analysis. However, successful use of these methods requires a significantly large training set, which can often contain outliers or incorrectly labeled data. Due to the sensitivity of these techniques to outliers, this often leads to poor classification rates as well as low specificity and sensitivity.
The focus of this work was to evaluate different methods of outlier detection and investigate the presence of anomalies in large brain MRI datasets. The results of this study show that these large brain MRI datasets do contain anomalies and identify the method best suited to detecting them.
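The kind of screening described above can be illustrated with a short, hedged sketch using scikit-learn: robust (MCD-based) Mahalanobis distances flag outlying subjects, and a one-class SVM novelty detector is then trained on the screened data. The array `X`, the 0.975 chi-square cutoff, and the `nu` setting are illustrative assumptions standing in for the thesis's ROI features and tuning; this is not the author's pipeline.

```python
# Hedged sketch: MCD-based Mahalanobis outlier scoring and OCSVM novelty
# detection on standardized per-subject features (synthetic stand-in data).
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 48))           # stand-in for per-subject ROI means
X[:5] += 6                               # inject a few synthetic outliers

Z = StandardScaler().fit_transform(X)    # z-score each ROI across subjects

# Robust (MCD) Mahalanobis distances; flag points beyond a chi-square cutoff.
mcd = MinCovDet(random_state=0).fit(Z)
d2 = mcd.mahalanobis(Z)                  # squared distances
cutoff = chi2.ppf(0.975, df=Z.shape[1])  # illustrative threshold choice
outliers = np.where(d2 > cutoff)[0]

# One-class SVM novelty detector trained on the distance-screened "clean" set.
clean = Z[d2 <= cutoff]
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(clean)
novel = np.where(ocsvm.predict(Z) == -1)[0]

print("MCD-flagged subjects:", outliers)
print("OCSVM-flagged subjects:", novel)
```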
12

Cluster-Based Bounded Influence Regression

Lawrence, David E. 14 August 2003 (has links)
In the field of linear regression analysis, a single outlier can dramatically influence ordinary least squares estimation, while low-breakdown procedures such as M regression and bounded influence regression may be unable to combat a small percentage of outliers. A high-breakdown procedure such as least trimmed squares (LTS) regression can accommodate up to 50% of the data (in the limit) being outlying with respect to the general trend. Two available one-step improvement procedures based on LTS are Mallows 1-step (M1S) regression and Schweppe 1-step (S1S) regression (the current state-of-the-art method). Issues with these methods include (1) computational approximations and sub-sampling variability, (2) dramatic coefficient sensitivity with respect to very slight differences in initial values, (3) internal instability when determining the general trend, and (4) performance in low-breakdown scenarios. A new high-breakdown regression procedure is introduced that addresses these issues and offers an insightful summary regarding the presence and structure of multivariate outliers. This proposed method blends a cluster analysis phase with a controlled bounded influence regression phase and is hence referred to as cluster-based bounded influence regression, or CBI. Representing the data space via a special set of anchor points, a collection of point-addition OLS regression estimators forms the basis of a metric used in defining the similarity between any two observations. Cluster analysis then yields a main cluster "halfset" of observations, with the remaining observations becoming one or more minor clusters. An initial regression estimator arises from the main cluster, with a multiple-point-addition DFFITS argument used to carefully activate the minor clusters through a bounded influence regression framework. CBI achieves a 50% breakdown point; it is regression, scale, and affine equivariant, and it is asymptotically normal in distribution. Case studies and Monte Carlo studies demonstrate the performance advantage of CBI over S1S and the other high-breakdown methods regarding coefficient stability, scale estimation and standard errors. A dendrogram of the clustering process is one graphical display available for multivariate outlier detection. Overall, the proposed methodology represents an advancement in the field of robust regression, offering a distinct philosophical viewpoint towards data analysis and the marriage of estimation with diagnostic summary. / Ph. D.
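LTS is the high-breakdown building block referred to throughout the abstract. A minimal, hedged sketch of an LTS fit via random elemental starts and concentration (C-) steps, in the spirit of FAST-LTS, is shown below; it is not the CBI procedure itself, and the function name `lts_fit` and its parameter choices are illustrative assumptions.

```python
# Hedged sketch: a basic least trimmed squares (LTS) fit via random starts
# and concentration steps (a simplified FAST-LTS), not the CBI procedure.
import numpy as np

def lts_fit(X, y, h=None, n_starts=200, n_csteps=10, rng=None):
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])        # add intercept column
    h = h or (n + p + 2) // 2                    # trimmed subset size (~50%)
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # elemental start
        beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
        for _ in range(n_csteps):                # concentration steps
            r2 = (y - Xd @ beta) ** 2
            keep = np.argsort(r2)[:h]            # h smallest squared residuals
            beta, *_ = np.linalg.lstsq(Xd[keep], y[keep], rcond=None)
        obj = np.sort((y - Xd @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)
y[:20] += 25                                     # 20% gross outliers
print("LTS coefficients (intercept, b1, b2):", lts_fit(X, y, rng=2))
```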
13

Distributed Local Outlier Factor with Locality-Sensitive Hashing

Zheng, Lining 08 November 2019 (has links)
Outlier detection remains an active research area due to its essential role in a wide range of applications, including intrusion detection, fraud detection in finance, medical diagnosis, etc. Local Outlier Factor (LOF) has been one of the most influential outlier detection techniques over the past decades. LOF has distinctive advantages on skewed datasets with regions of various densities. However, the traditional centralized LOF faces new challenges in the era of big data and no longer satisfies the rigid time constraints required by many modern applications, due to its expensive computational overhead. A few researchers have explored distributed solutions for LOF, but existing methods are limited by their grid-based data partitioning strategy, which falls short when applied to high-dimensional data. In this thesis, we study efficient distributed solutions for LOF. A baseline MapReduce solution for LOF implemented with Apache Spark, named MR-LOF, is introduced. We demonstrate its disadvantages in communication cost and execution time through complexity analysis and experimental evaluation. Then an approximate LOF method is proposed, which relies on locality-sensitive hashing (LSH) for partitioning data and enables fully distributed local computation. We name it MR-LOF-LSH. To further improve the approximate LOF, we introduce a process called cross-partition updating. With cross-partition updating, the actual global k-nearest neighbors (k-NN) of the outlier candidates are found, and the related information of the neighbors is used to update the outlier scores of the candidates. The experimental results show that MR-LOF achieves a speedup of up to 29 times over the centralized LOF. MR-LOF-LSH further reduces the execution time by a factor of up to 9.9 compared to MR-LOF. The results also highlight that MR-LOF-LSH scales well as the cluster size increases. Moreover, with a sufficient candidate size, MR-LOF-LSH is able to detect, in most scenarios, over 90% of the top outliers with the highest LOF scores computed by the centralized LOF algorithm.
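A toy, single-machine stand-in for the partition-then-local-LOF idea is sketched below using scikit-learn: a few random projections define LSH buckets, and LOF runs independently inside each bucket. This is not the Spark MR-LOF-LSH implementation; in particular, tiny buckets are simply left unscored here, whereas the thesis resolves boundary cases with cross-partition updating.

```python
# Hedged sketch: random-projection LSH buckets + per-bucket LOF, a toy
# single-machine illustration of the MR-LOF-LSH idea (not the Spark code).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 10)),
               rng.normal(8, 1, (500, 10)),
               rng.uniform(-10, 18, (10, 10))])   # a few scattered outliers

# LSH: the sign pattern of a few random projections defines each point's bucket.
n_bits, k = 2, 20
planes = rng.normal(size=(X.shape[1], n_bits))
codes = (X @ planes > 0).astype(int)
buckets = codes @ (1 << np.arange(n_bits))

scores = np.ones(len(X))                          # LOF near 1 means "inlier"
for b in np.unique(buckets):
    idx = np.where(buckets == b)[0]
    if len(idx) <= k:     # tiny bucket: left unscored in this toy version;
        continue          # the thesis handles such cases via cross-partition updating
    lof = LocalOutlierFactor(n_neighbors=k).fit(X[idx])
    scores[idx] = -lof.negative_outlier_factor_   # larger = more outlying

print("top-10 candidate outlier indices:", np.argsort(-scores)[:10])
```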
14

High-dimensional data mining: subspace clustering, outlier detection and applications to classification

Foss, Andrew 06 1900 (has links)
Data mining in high dimensionality almost inevitably faces the consequences of increasing sparsity and declining differentiation between points. This is problematic because we usually exploit these differences for approaches such as clustering and outlier detection. In addition, the exponentially increasing sparsity tends to increase false negatives when clustering. In this thesis, we address the problem of solving high-dimensional problems using low-dimensional solutions. In clustering, we provide a new framework MAXCLUS for finding candidate subspaces and the clusters within them using only two-dimensional clustering. We demonstrate this through an implementation GCLUS that outperforms many state-of-the-art clustering algorithms and is particularly robust with respect to noise. It also handles overlapping clusters and provides either 'hard' or 'fuzzy' clustering results as desired. In order to handle extremely high dimensional problems, such as genome microarrays, given some sample-level diagnostic labels, we provide a simple but effective classifier GSEP which weights the features so that the most important can be fed to GCLUS. We show that this leads to small numbers of features (e.g. genes) that can distinguish the diagnostic classes and thus are candidates for research for developing therapeutic applications. In the field of outlier detection, several novel algorithms suited to high-dimensional data are presented (T*ENT, T*ROF, FASTOUT). It is shown that these algorithms outperform the state-of-the-art outlier detection algorithms in ranking outlierness for many datasets regardless of whether they contain rare classes or not. Our research into high-dimensional outlier detection has even shown that our approach can be a powerful means of classification for heavily overlapping classes given sufficiently high dimensionality and that this phenomenon occurs solely due to the differences in variance among the classes. On some difficult datasets, this unsupervised approach yielded better separation than the very best supervised classifiers and on other data, the results are competitive with state-of-the-art supervised approaches. The elucidation of this novel approach to classification opens a new field in data mining, classification through differences in variance rather than spatial location. As an appendix, we provide an algorithm for estimating false negative and positive rates so these can be compensated for.
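The variance-difference phenomenon described above can be demonstrated with a small hedged sketch: two classes with identical means but different variances overlap heavily in space, yet an unsupervised outlier-style score separates them once the dimensionality is high. A plain kNN-distance score is used here purely as an illustration; it is not a reimplementation of T*ENT, T*ROF, or FASTOUT.

```python
# Hedged sketch: equal-mean, different-variance classes separated by an
# unsupervised outlier-style score in high dimensions (illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
d = 200
low_var = rng.normal(0, 1.0, (300, d))    # class A: unit variance
high_var = rng.normal(0, 1.5, (300, d))   # class B: same mean, larger variance
X = np.vstack([low_var, high_var])
labels = np.array([0] * 300 + [1] * 300)

# Score each point by its mean distance to its 10 nearest neighbours.
nn = NearestNeighbors(n_neighbors=11).fit(X)
dist, _ = nn.kneighbors(X)
score = dist[:, 1:].mean(axis=1)          # drop the self-distance in column 0

# Thresholding the score at its median largely recovers the two classes.
pred = (score > np.median(score)).astype(int)
print("agreement with true class:", (pred == labels).mean())
```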
16

Relational Outlier Detection: Techniques and Applications

Lu, Yen-Cheng 10 June 2021 (has links)
Nowadays, outlier detection has attracted growing interest. Unlike typical outlier detection problems, relational outlier detection focuses on detecting abnormal patterns in datasets that contain relational implications within each data point. Furthermore, unlike traditional outlier detection, which focuses only on numerical data, modern outlier detection models must be able to handle data in various types and structures. Detecting relational outliers should consider (1) dependencies among different data types, (2) data types that are not continuous or do not have ordinal characteristics, such as binary, categorical or multi-label, and (3) special structures in the data. This thesis focuses on the development of relational outlier detection methods and real-world applications in datasets that contain non-numerical, mixed-type, and special structure data in three tasks, namely (1) outlier detection in mixed-type data, (2) categorical outlier detection in music genre data, and (3) outlier detection in categorized time series data. For the first task, existing solutions for mixed-type data mostly focus on computational efficiency, and their strategies are mostly heuristic-driven, lacking a statistical foundation. The proposed contributions of our work include: (1) constructing a novel unsupervised framework based on a robust generalized linear model (GLM), (2) developing a model that is capable of capturing large variances of outliers and dependencies among mixed-type observations, and designing an approach for approximating the analytically intractable Bayesian inference, and (3) conducting extensive experiments to validate effectiveness and efficiency. For the second task, we extended and applied the modeling strategy to a real-world problem. The existing solutions to the specific task are mostly supervised, and the traditional outlier detection methods focus only on detecting outliers from the data distributions, ignoring the input-output relation between the genres and the extracted features. The proposed contributions of our work for this task include: (1) proposing an unsupervised outlier detection framework for music genre data, (2) extending the GLM-based model in the first task to handle categorical responses and developing an approach to approximate the analytically intractable Bayesian inference, and (3) conducting experiments to demonstrate that the proposed method outperforms the benchmark methods. For the third task, we focused on improving the outlier detection performance in the second task by proposing a novel framework and expanded the research scope to general categorized time-series data. Existing studies have suggested a large number of methods for automatic time series classification. However, there is a lack of research focusing on detecting outliers from manually categorized time series. The proposed contributions of our work for this task include: (1) proposing a novel semi-supervised robust outlier detection framework for categorized time-series datasets, (2) further extending the new framework to an active learning system that takes user insights into account, and (3) conducting a comprehensive set of experiments to demonstrate the performance of the proposed method in real-world applications. / Doctor of Philosophy / In recent years, outlier detection has been one of the most important topics in the data mining and machine learning research domain.
Unlike typical outlier detection problems, relational outlier detection focuses on detecting abnormal patterns in datasets that contain relational implications within each data point. Detecting relational outliers should consider (1) dependencies among different data types, (2) data types that are not continuous or do not have ordinal characteristics, such as binary, categorical or multi-label, and (3) special structures in the data. This thesis focuses on the development of relational outlier detection methods and real-world applications in datasets that contain non-numerical, mixed-type, and special structure data in three tasks, namely (1) outlier detection in mixed-type data, (2) categorical outlier detection in music genre data, and (3) outlier detection in categorized time series data. The first task aims at constructing a novel unsupervised framework, developing a model that is capable of capturing the normal pattern and the effects, and designing an approach for model fitting. In the second task, we further extended and applied the modeling strategy to a real-world problem in the music technology domain. For the third task, we expanded the research scope from the previous task to general categorized time-series data, and focused on improving the outlier detection performance by proposing a novel semi-supervised framework.
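The core idea of modelling dependencies among mixed-type attributes and scoring records by how poorly each attribute is explained by the others can be illustrated with a plain frequentist stand-in: per-attribute GLM fits and residual or deviance scores. This is not the thesis's robust Bayesian GLM framework; the column types, the score combination, and the injected anomalies below are invented for illustration.

```python
# Hedged sketch: a simple per-attribute GLM residual score for mixed-type
# outlier detection; a plain stand-in for dependency modelling among
# mixed-type attributes, not the thesis's robust Bayesian GLM framework.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 500
x_num = rng.normal(size=n)
x_bin = (x_num + rng.normal(scale=0.7, size=n) > 0).astype(int)  # depends on x_num
x_num2 = 0.8 * x_num + rng.normal(scale=0.5, size=n)
X = np.column_stack([x_num, x_bin, x_num2])
X[:5, 1] = 1 - X[:5, 1]           # flip the binary attribute for a few rows
X[:5, 2] += 4                     # and shift a numeric attribute

score = np.zeros(n)
for j, kind in enumerate(["num", "bin", "num"]):
    others = np.delete(X, j, axis=1)
    if kind == "num":
        model = LinearRegression().fit(others, X[:, j])
        resid = X[:, j] - model.predict(others)
        score += (resid / resid.std()) ** 2      # squared standardized residual
    else:
        model = LogisticRegression().fit(others, X[:, j])
        p = model.predict_proba(others)[:, 1]
        score += -2 * (X[:, j] * np.log(p) + (1 - X[:, j]) * np.log(1 - p))  # deviance

print("highest-scoring rows:", np.argsort(-score)[:5])
```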
17

Detecção de outliers baseada em caminhada determinística do turista / Outlier detection based on deterministic tourist walk

Rodrigues, Rafael Delalibera 03 April 2018 (has links)
Outlier detection is a fundamental task for knowledge discovery in data mining. It aims to detect data items that deviate from the general pattern of a given data set. In this work, we present a new outlier detection technique based on the deterministic tourist walk. Specifically, a walker is started from each data sample while the memory size is varied; a sample receives a high outlier score if it participates in few tourist-walk attractors and a low score if it participates in a large number of attractors. Experimental results on artificial and real data sets show good performance of the proposed method. In comparison to classical methods, the proposed one shows the following salient features: 1) It finds outliers by identifying the structure of the input data set instead of considering only physical features, such as distance, similarity or density. 2) It can detect not only external outliers, as classical methods do, but also internal outliers lying among various normal data groups. 3) By varying the memory size, the tourist walks can characterize both local and global structures of the data set. 4) The proposed method is deterministic, so a single run is sufficient, in contrast to stochastic techniques, which require many runs. Moreover, in this work we find, for the first time, that tourist walks can generate complex attractors with various crossing shapes. Such complex attractors reveal data structures in more detail and can consequently improve outlier detection.
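One reading of this scoring scheme is sketched below: run a deterministic tourist walk from every point for several memory sizes, collect the attractor (final cycle) of each walk, and score points inversely to how many attractors they join. This is a hedged illustration, not the author's implementation; the cycle-detection rule, the memory convention, and the toy dataset are simplifying assumptions.

```python
# Hedged sketch: tourist-walk-based outlier scoring. One walker per starting
# point with memory size mu; points appearing in few walk attractors get a
# higher outlier score. A simplified reading of the method, not the author's code.
import numpy as np
from scipy.spatial.distance import cdist

def tourist_attractor(D, start, mu, max_steps=2000):
    """Walk to the nearest point not among the last mu visited; return the
    set of points on the final cycle (the attractor)."""
    path, seen = [start], {}
    for _ in range(max_steps):
        window = tuple(path[-mu:])
        state = (path[-1], window)
        if state in seen:                  # repeated state: the walk has cycled
            return set(path[seen[state]:])
        seen[state] = len(path) - 1
        order = np.argsort(D[path[-1]])
        nxt = int(next(i for i in order if i not in window))
        path.append(nxt)
    return set(path[-mu:])                 # fallback, rarely reached

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)),
               rng.normal(6, 1, (60, 2)),
               [[3.0, 3.0]]])              # one point isolated between the clusters
D = cdist(X, X)
np.fill_diagonal(D, np.inf)

counts = np.zeros(len(X))
for mu in (2, 3, 4):                       # vary the memory size
    for s in range(len(X)):
        for i in tourist_attractor(D, s, mu):
            counts[i] += 1

score = 1.0 / (counts + 1.0)               # few attractor memberships -> high score
print("score of the isolated point:", score[-1])
print("mean score of the cluster points:", score[:-1].mean())
```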
19

ANOVA - The Effect of Outliers

Halldestam, Markus January 2016 (has links)
This bachelor's thesis focuses on the effect of outliers on the one-way analysis of variance (ANOVA) and examines whether the estimates in ANOVA are robust and whether the test itself is robust to the influence of extreme outliers. The robustness of the estimates is examined using the breakdown point, while the robustness of the test is examined by simulating the hypothesis test under some extreme situations. This study finds evidence that the estimates in ANOVA are sensitive to outliers, i.e. that the procedure is not robust. Samples with a larger proportion of extreme outliers have a higher type-I error probability than the nominal level.
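A small Monte Carlo check in the spirit of the simulation described above is sketched below: the one-way ANOVA type-I error rate is estimated with and without extreme observations injected into one group. The contamination scheme (four observations shifted by six standard deviations in a single group) is an assumption for illustration, not the thesis's exact design.

```python
# Hedged sketch: Monte Carlo estimate of the one-way ANOVA type-I error rate
# with and without extreme outliers in one group (illustrative assumptions).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
n_sim, n, alpha = 5000, 20, 0.05

def rejection_rate(n_outliers=0, shift=6.0):
    rejections = 0
    for _ in range(n_sim):
        groups = [rng.normal(0, 1, n) for _ in range(3)]  # H0: all means equal
        groups[0][:n_outliers] += shift                    # contaminate one group
        _, p = f_oneway(*groups)
        rejections += p < alpha
    return rejections / n_sim

print("type-I error, clean data:      ", rejection_rate(0))
print("type-I error, 4 outliers at +6:", rejection_rate(4))
```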
20

Correlation between American mortality and DJIA index price

Ong, Li Kee 14 September 2016 (has links)
For equity-linked insurance, the death benefit is linked to the performance of the company's investment portfolio. Hence, both mortality risk and equity return must be considered when pricing such insurance. Several studies have found some dependence between mortality improvement and economic growth. In this thesis, we showed that the American mortality rate and the Dow Jones Industrial Average (DJIA) index price are negatively dependent, using several copulas to define the joint distribution. Then, we used these copulas to forecast mortality rates and index prices, and calculated the payoffs of a 10-year term equity-linked insurance. We showed that the predicted insurance payoffs will be smaller if the dependence between mortality and index price is taken into account. / October 2016
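The copula workflow can be illustrated with a hedged sketch on synthetic stand-in series: build pseudo-observations from empirical ranks, fit a Gaussian copula with a negative dependence parameter, and simulate joint draws mapped back through the empirical marginals. The thesis compares several copula families and uses real mortality and DJIA data, neither of which is reproduced here; the series names and parameters below are illustrative assumptions.

```python
# Hedged sketch: fitting a Gaussian copula to two negatively dependent series
# via normal scores and simulating joint draws (synthetic stand-in data).
import numpy as np
from scipy.stats import norm, rankdata, kendalltau

rng = np.random.default_rng(0)
n = 300
equity_return = rng.normal(0.06, 0.15, n)
mortality_change = -0.3 * equity_return + rng.normal(0.0, 0.02, n)  # built-in negative dependence

# Empirical marginals -> pseudo-observations -> normal scores.
u = rankdata(equity_return) / (n + 1)
v = rankdata(mortality_change) / (n + 1)
z = norm.ppf(np.column_stack([u, v]))

tau, _ = kendalltau(equity_return, mortality_change)
rho = np.corrcoef(z.T)[0, 1]                     # Gaussian copula parameter
print("Kendall's tau:", tau)
print("fitted copula correlation:", rho)

# Simulate joint draws from the fitted copula, mapped back through the marginals.
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
sim_u = norm.cdf(rng.normal(size=(10000, 2)) @ L.T)
sim_equity = np.quantile(equity_return, sim_u[:, 0])
sim_mortality = np.quantile(mortality_change, sim_u[:, 1])
tau_sim, _ = kendalltau(sim_equity, sim_mortality)
print("Kendall's tau of simulated draws:", tau_sim)
```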
