Multiple outlier detection and cluster analysis of multivariate normal data

Thesis (MScEng)--Stellenbosch University, 2003. / ENGLISH ABSTRACT: Outliers may be defined as observations that are sufficiently aberrant to arouse the
suspicion of the analyst as to their origin. They could be the result of human error, in
which case they should be corrected, but they may also be interesting exceptions
that deserve further investigation.
Identification of outliers typically consists of an informal inspection of a plot of
the data, but this is unreliable for dimensions greater than two. A formal procedure
for detecting outliers allows for consistency when classifying observations. It also
enables one to automate the detection of outliers by using computers.
The special case of univariate data is treated separately to introduce essential
concepts, and also because it may well be of interest in its own right. We then consider
techniques used for detecting multiple outliers in a multivariate normal sample,
and go on to explain how these may be generalized to include cluster analysis.
Multivariate outlier detection is based on the Minimum Covariance Determinant
(MCD) subset, which is therefore treated in detail. Exact bivariate algorithms were
refined and implemented, and the solutions were used to establish the performance
of the commonly used heuristic, Fast–MCD. / AFRIKAANSE OPSOMMING (Afrikaans abstract, in translation): Outliers are defined as observations that deviate from the expected
behaviour to such a degree that the analyst becomes suspicious of their origin. These
observations may be the result of human error, in which case they must be corrected.
They may, however, also be an interesting phenomenon that requires further
investigation.
The identification of outliers is typically carried out informally by inspecting a
graphical representation of the data, but this approach is unreliable
for dimensions greater than two. A formal procedure for determining outliers
will result in more consistent classification of sample data. It also
allows for effective computer implementation of the techniques.
Initially the special case of univariate data is treated in order to introduce essential
concepts, but also because it is an area of great importance in its own
right. Techniques for identifying multiple outliers in
multivariate, normally distributed data are then considered. It is also investigated how
these ideas can be generalized to include cluster analysis.
The so-called Minimum Covariance Determinant (MCD) subset is
fundamental to the identification of multivariate outliers, and is therefore
examined in detail. Deterministic bivariate algorithms were refined and implemented,
and used to investigate the effectiveness of the commonly used heuristic algorithm,
Fast–MCD.
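
The idea of an exact MCD subset for bivariate data can be illustrated with a minimal sketch. It is not the refined exact algorithm of the thesis, nor Fast–MCD: it simply enumerates every subset of size h, which is only feasible for very small samples. The function names (`exact_mcd`, `flag_outliers`), the sample points, the choice h = 6 and the 0.975 cutoff are all assumptions made for illustration, and the consistency correction and reweighting steps commonly applied to MCD estimates are omitted, so the cutoff should be read as rough.

```python
from itertools import combinations
from math import log


def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def exact_mcd(points, h):
    """Brute force: the h-subset whose covariance determinant is smallest."""
    best = None
    for idx in combinations(range(len(points)), h):
        centre, (sxx, sxy, syy) = mean_cov([points[i] for i in idx])
        det = sxx * syy - sxy * sxy
        if best is None or det < best[0]:
            best = (det, idx, centre, (sxx, sxy, syy))
    return best


def sq_mahalanobis(p, centre, cov):
    """Squared Mahalanobis distance of p, using the 2x2 inverse in closed form."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - centre[0], p[1] - centre[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def flag_outliers(points, h, level=0.975):
    """Indices whose robust squared distance exceeds a chi-square(2) quantile."""
    # For 2 degrees of freedom the chi-square quantile is -2 * ln(1 - level).
    cutoff = -2.0 * log(1.0 - level)
    _, _, centre, cov = exact_mcd(points, h)
    return [i for i, p in enumerate(points)
            if sq_mahalanobis(p, centre, cov) > cutoff]


# Hypothetical sample: eight points near the origin, two far away (indices 8, 9).
points = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (-1.0, -0.5), (-0.5, -1.0),
          (1.0, -1.0), (-1.0, 1.0), (0.2, -0.3), (10.0, 10.0), (9.0, 11.0)]

det, subset, centre, cov = exact_mcd(points, h=6)
flagged = flag_outliers(points, h=6)
print("MCD subset:", subset)
print("flagged:", flagged)
```

Because the location and covariance come from the tightest subset rather than from the whole sample, the two distant points cannot mask each other. Without the consistency correction, some borderline points from the main cloud may be flagged as well.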

Identifier: oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:sun/oai:scholar.sun.ac.za:10019.1/53508
Date: 12 1900
Creators: Robson, Geoffrey
Contributors: Herbst, B. M., Muller, N. L., Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences.
Publisher: Stellenbosch : Stellenbosch University
Source Sets: South African National ETD Portal
Language: en_ZA
Detected Language: Unknown
Type: Thesis
Format: 127 p. : ill.
Rights: Stellenbosch University
