
Aspects of the pre- and post-selection classification performance of discriminant analysis and logistic regression

Thesis (PhD)--Stellenbosch University, 1997. / One copy microfiche. / ENGLISH ABSTRACT: Discriminant analysis and logistic regression are techniques that can be used to classify
entities of unknown origin into one of a number of groups. However, the underlying
models and assumptions for application of the two techniques differ. In this study, the
two techniques are compared with respect to classification of entities.
Firstly, the two techniques were compared in situations where no data-dependent
variable selection took place. Several underlying distributions were studied: the
normal distribution, the double exponential distribution and the lognormal distribution.
The number of variables, sample sizes from the different groups and the correlation
structure between the variables were varied to obtain a large number of different
configurations. The cases of two and three groups were studied. The most important
conclusions are: for normal and double exponential data linear discriminant analysis
outperforms logistic regression, especially in cases where the ratio of the number of
variables to the total sample size is large. For lognormal data, logistic regression
should be preferred, except in cases where the ratio of the number of variables to the
total sample size is large.
Variable selection is frequently the first step in statistical analyses. A large number of
potentially important variables are observed, and an optimal subset has to be selected
for use in further analyses. Despite the fact that variable selection is often used, the
influence of a selection step on further analyses of the same data is often completely
ignored. An important aim of this study was to develop new selection techniques for
use in discriminant analysis and logistic regression. New estimators of the post-selection
error rate were also developed. A new selection technique, cross model
validation (CMV), that can be applied both in discriminant analysis and logistic
regression, was developed. This technique combines the selection of variables and the
estimation of the post-selection error rate. It provides a method to determine the
optimal model dimension, to select the variables for the final model and to estimate the
post-selection error rate of the discriminant rule. An extensive Monte Carlo simulation
study comparing the CMV technique to existing procedures in the literature was
undertaken. In general, this technique outperformed the other methods, especially
with respect to the accuracy of estimating the post-selection error rate.
Finally, pre-test type variable selection was considered. A pre-test estimation
procedure was adapted for use as a selection technique in linear discriminant analysis. In
a simulation study, this technique was compared to CMV, and was found to perform
well, especially with respect to correct selection. However, this technique is only valid
for uncorrelated normal variables, and its applicability is therefore limited.
A numerically intensive approach was used throughout the study, since the problems
that were investigated are not amenable to an analytical approach. / AFRIKAANSE OPSOMMING: Lineêre diskriminantanalise en logistiese regressie is tegnieke wat gebruik kan word vir die
klassifikasie van items van onbekende oorsprong in een van 'n aantal groepe. Die
agterliggende modelle en aannames vir die gebruik van die twee tegnieke is egter
verskillend. In die studie is die twee tegnieke vergelyk ten opsigte van klassifikasie van
items.
Eerstens is die twee tegnieke vergelyk in 'n opset waar daar geen data-afhanklike seleksie
van veranderlikes plaasvind nie. Verskeie onderliggende verdelings is bestudeer: die
normaalverdeling, die dubbeleksponensiaal-verdeling, en die lognormaal verdeling. Die
aantal veranderlikes, steekproefgroottes uit die onderskeie groepe en die
korrelasiestruktuur tussen die veranderlikes is gevarieer om 'n groot aantal konfigurasies
te verkry. Die geval van twee en drie groepe is bestudeer. Die belangrikste
gevolgtrekkings wat op grond van die studie gemaak kan word is: vir normaal en
dubbeleksponensiaal data vaar lineêre diskriminantanalise beter as logistiese regressie,
veral in gevalle waar die verhouding van die aantal veranderlikes tot die totale
steekproefgrootte groot is. In die geval van data uit 'n lognormaalverdeling, behoort
logistiese regressie die metode van keuse te wees, tensy die verhouding van die aantal
veranderlikes tot die totale steekproefgrootte groot is.
Veranderlike seleksie is dikwels die eerste stap in statistiese ontledings. 'n Groot aantal
potensieel belangrike veranderlikes word waargeneem, en 'n subversameling wat optimaal
is, word gekies om in die verdere ontledings te gebruik. Ten spyte van die feit dat
veranderlike seleksie dikwels gebruik word, word die invloed wat 'n seleksie-stap op
verdere ontledings van dieselfde data het, dikwels heeltemal geïgnoreer. 'n Belangrike
doelwit van die studie was om nuwe seleksietegnieke te ontwikkel wat gebruik kan word
in diskriminantanalise en logistiese regressie. Verder is ook aandag gegee aan
ontwikkeling van beramers van die foutkoers van 'n diskriminantfunksie wat met
geselekteerde veranderlikes gevorm word. 'n Nuwe seleksietegniek, kruis-model validasie
(KMV) wat gebruik kan word vir die seleksie van veranderlikes in beide
diskriminantanalise en logistiese regressie is ontwikkel. Hierdie tegniek hanteer die
seleksie van veranderlikes en die beraming van die na-seleksie foutkoers in een stap, en
verskaf 'n metode om die optimale modeldimensie te bepaal, die veranderlikes wat in die
model bevat moet word te kies, en ook die na-seleksie foutkoers van die
diskriminantfunksie te beraam. 'n Uitgebreide simulasiestudie waarin die voorgestelde
KMV-tegniek met ander prosedures in die literatuur vergelyk is, is vir beide
diskriminantanalise en logistiese regressie onderneem. In die algemeen het hierdie tegniek
beter gevaar as die ander metodes wat beskou is, veral ten opsigte van die akkuraatheid
waarmee die na-seleksie foutkoers beraam word.
Ten slotte is daar ook aandag gegee aan voor-toets tipe seleksie. 'n Tegniek is ontwikkel
wat gebruik maak van 'n voor-toets beramingsmetode om veranderlikes vir insluiting in 'n
lineêre diskriminantfunksie te selekteer. Die tegniek is in 'n simulasiestudie met die KMV-tegniek vergelyk, en vaar baie goed, veral t.o.v. korrekte seleksie. Hierdie tegniek is egter
slegs geldig vir ongekorreleerde normaalveranderlikes, wat die gebruik daarvan beperk.
'n Numeries intensiewe benadering is deurgaans in die studie gebruik. Dit is genoodsaak
deur die feit dat die probleme wat ondersoek is, nie deur middel van 'n analitiese
benadering hanteer kan word nie.
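
As a toy illustration of the kind of Monte Carlo comparison the English abstract describes, the Python sketch below estimates the test error of a one-variable linear discriminant rule and of logistic regression on simulated normal data. It is not the simulation design of the thesis. The single variable, equal group sizes, unit variances, mean shift, number of repetitions, the gradient-ascent fit and all function names are assumptions chosen only to keep the example short.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lda_1d(x0, x1):
    # Assumption: equal priors and a common variance. The one-variable
    # linear discriminant rule then cuts at the midpoint of the two means.
    m0 = sum(x0) / len(x0)
    m1 = sum(x1) / len(x1)
    cut = (m0 + m1) / 2.0
    sign = 1.0 if m1 >= m0 else -1.0
    return lambda x: 1 if sign * (x - cut) > 0 else 0


def fit_logistic_1d(x0, x1, steps=300, rate=0.5):
    # Plain gradient ascent on the mean log-likelihood of
    # P(group 1 | x) = sigmoid(b0 + b1 * x); a fixed step count is an
    # illustrative shortcut, not a tuned optimiser.
    data = [(x, 0) for x in x0] + [(x, 1) for x in x1]
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in data:
            r = y - sigmoid(b0 + b1 * x)
            g0 += r
            g1 += r * x
        b0 += rate * g0 / len(data)
        b1 += rate * g1 / len(data)
    return lambda x: 1 if b0 + b1 * x > 0 else 0


def error_rate(rule, x0, x1):
    # Proportion of test entities assigned to the wrong group.
    wrong = sum(rule(x) != 0 for x in x0) + sum(rule(x) != 1 for x in x1)
    return wrong / (len(x0) + len(x1))


def simulate(n_train=20, n_test=500, reps=50, shift=2.0, seed=1):
    # Monte Carlo average of the test error over repeated training samples
    # drawn from N(0, 1) for group 0 and N(shift, 1) for group 1.
    rng = random.Random(seed)

    def draw(mu, n):
        return [rng.gauss(mu, 1.0) for _ in range(n)]

    totals = {"lda": 0.0, "logistic": 0.0}
    for _ in range(reps):
        tr0, tr1 = draw(0.0, n_train), draw(shift, n_train)
        te0, te1 = draw(0.0, n_test), draw(shift, n_test)
        totals["lda"] += error_rate(fit_lda_1d(tr0, tr1), te0, te1)
        totals["logistic"] += error_rate(fit_logistic_1d(tr0, tr1), te0, te1)
    return {name: total / reps for name, total in totals.items()}


if __name__ == "__main__":
    print(simulate())
```

Calling `simulate()` returns the average test error of each rule. Replacing the `rng.gauss` draws with another distribution would give a rough analogue of the non-normal configurations mentioned in the abstract, again only as an illustration.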

Identifier: oai:union.ndltd.org:netd.ac.za/oai:union.ndltd.org:sun/oai:scholar.sun.ac.za:10019.1/55402
Date: 12 1900
Creators: Louw, Nelmarie
Contributors: Le Roux, N. J., Steel, S. J., Stellenbosch University. Faculty of Science. Dept. of Mathematical Sciences.
Publisher: Stellenbosch : Stellenbosch University
Source Sets: South African National ETD Portal
Language: en_ZA
Detected Language: Unknown
Type: Thesis
Format: 338 p. : ill.
Rights: Stellenbosch University
