11.
Estimating the window period and incidence of recently infected HIV patients. Du Toit, Cari.
Thesis (MComm (Statistics and Actuarial Science))--University of Stellenbosch, 2009.
Incidence can be defined as the rate of occurrence of new infections of a disease like HIV and
is a useful estimate of trends in the epidemic. Annualised incidence can be expressed as a
proportion, namely the number of recent infections per year divided by the number of people at
risk of infection. This number of recent infections is dependent on the window period, which
is the period from seroconversion until an infection is first classified as long-term.
The BED capture enzyme immunoassay was developed to provide a way to
distinguish between recent and long-term infections. An optical density (OD) measurement is
obtained from this assay. The window period is defined as the number of days from seroconversion,
at a baseline OD value of 0.0476, until an OD of 0.8 is reached. The
aim of this study is to describe different techniques to estimate the window period which may
subsequently lead to alternative estimates of annualised incidence of HIV infection. These
various techniques are applied to different subsets of the Zimbabwe Vitamin A for Mothers and
Babies (ZVITAMBO) dataset.
Three different approaches are described to analyse window periods: a non-parametric survival
analysis approach, the fitting of a general linear mixed model in a longitudinal data setting and
a Bayesian approach of assigning probability distributions to the parameters of interest. These
techniques are applied to different subsets and transformations of the data and the estimated
mean and median window periods are obtained and utilised in the calculation of incidence.
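As a rough illustration of how such a window period estimate feeds into annualised incidence (the figures below are invented for illustration and are not taken from the ZVITAMBO data), a minimal R sketch:

```r
# Minimal sketch: annualised incidence from a recency assay, assuming
# n_recent people test "recent", n_at_risk are susceptible, and the
# mean window period is omega_days days (all values hypothetical).
annualised_incidence <- function(n_recent, n_at_risk, omega_days) {
  # scale the proportion of recent infections to a one-year window
  (n_recent / n_at_risk) * (365 / omega_days)
}
annualised_incidence(n_recent = 120, n_at_risk = 8000, omega_days = 187)
```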
12.
A comparison of support vector machines and traditional techniques for statistical regression and classification. Hechter, Trudie.
Thesis (MComm)--Stellenbosch University, 2004.
ENGLISH ABSTRACT: Since its introduction in Boser et al. (1992), the support vector machine has become a
popular tool in a variety of machine learning applications. More recently, the support
vector machine has also been receiving increasing attention in the statistical
community as a tool for classification and regression. In this thesis support vector
machines are compared to more traditional techniques for statistical classification and
regression. The techniques are applied to data from a life assurance environment for a
binary classification problem and a regression problem. In the classification case the
problem is the prediction of policy lapses using a variety of input variables, while in
the regression case the goal is to estimate the income of clients from these variables.
The performance of the support vector machine is compared to that of discriminant
analysis and classification trees in the case of classification, and to that of multiple
linear regression and regression trees in regression, and it is found that support vector
machines generally perform well compared to the traditional techniques.
AFRIKAANSE OPSOMMING: Sedert die bekendstelling van die ondersteuningspuntalgoritme in Boser et al. (1992),
het dit 'n populêre tegniek in 'n verskeidenheid masjienleerteorie applikasies geword.
Meer onlangs het die ondersteuningspuntalgoritme ook meer aandag in die statistiese
gemeenskap begin geniet as 'n tegniek vir klassifikasie en regressie. In hierdie tesis
word ondersteuningspuntalgoritmes vergelyk met meer tradisionele tegnieke vir
statistiese klassifikasie en regressie. Die tegnieke word toegepas op data uit 'n
lewensversekeringomgewing vir 'n binêre klassifikasie probleem sowel as 'n
regressie probleem. In die klassifikasiegeval is die probleem die voorspelling van
polisvervallings deur 'n verskeidenheid invoer veranderlikes te gebruik, terwyl in die
regressiegeval gepoog word om die inkomste van kliënte met behulp van hierdie
veranderlikes te voorspel. Die resultate van die ondersteuningspuntalgoritme word
met dié van diskriminant analise en klassifikasiebome vergelyk in die
klassifikasiegeval, en met veelvoudige lineêre regressie en regressiebome in die
regressiegeval. Die gevolgtrekking is dat ondersteuningspuntalgoritmes oor die
algemeen goed vaar in vergelyking met die tradisionele tegnieke.
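A minimal R sketch of the kind of comparison described above (simulated data standing in for the life assurance data, which is not reproduced here; the package choices are ours, not necessarily the thesis's):

```r
# Compare an SVM with LDA and a classification tree on a toy binary
# "lapse" problem; a proper comparison would use a test set or CV.
library(e1071)   # svm
library(MASS)    # lda
library(rpart)   # classification trees
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(x1 + x2 + rnorm(n, sd = 0.5) > 0, "lapse", "stay"))
dat <- data.frame(x1, x2, y)
fit_svm  <- svm(y ~ ., data = dat, kernel = "radial")
fit_lda  <- lda(y ~ ., data = dat)
fit_tree <- rpart(y ~ ., data = dat, method = "class")
mean(fitted(fit_svm) != dat$y)                   # SVM training error
mean(predict(fit_lda)$class != dat$y)            # LDA training error
mean(predict(fit_tree, type = "class") != dat$y) # tree training error
```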
13.
Empirical Bayes estimation of the extreme value index in an ANOVA setting. Jordaan, Aletta Gertruida.
Thesis (MComm)--Stellenbosch University, 2014.
ENGLISH ABSTRACT: Extreme value theory (EVT) involves the development of statistical models and techniques in order to describe and model extreme events. In order to make inferences about extreme quantiles, it is necessary to estimate the extreme value index (EVI). Numerous estimators of the EVI exist in the literature. However, these estimators are only applicable in the single sample setting. The aim of this study is to obtain an improved estimator of the EVI that is applicable to an ANOVA setting.
An ANOVA setting lends itself naturally to empirical Bayes (EB) estimators, which are the main estimators under consideration in this study. EB estimators have not received much attention in the literature.
The study begins with a literature study, covering the areas of application of EVT, Bayesian theory and EB theory. Different estimation methods of the EVI are discussed, focusing also on possible methods of determining the optimal threshold. Specifically, two adaptive methods of threshold selection are considered.
A simulation study is carried out to compare the performance of different estimation methods, applied only in the single sample setting. First order and second order estimation methods are considered. In the case of second order estimation, possible methods of estimating the second order parameter are also explored.
With regard to obtaining an estimator that is applicable to an ANOVA setting, a first order EB estimator and a second order EB estimator of the EVI are derived. A case study of five insurance claims portfolios is used to examine whether the two EB estimators improve the accuracy of estimating the EVI, when compared to viewing the portfolios in isolation.
The results showed that the first order EB estimator performed better than the Hill estimator. However, the second order EB estimator did not perform better than the “benchmark” second order estimator, namely fitting the perturbed Pareto distribution to all observations above a pre-determined threshold by means of maximum likelihood estimation.
AFRIKAANSE OPSOMMING: Ekstreemwaardeteorie (EWT) behels die ontwikkeling van statistiese modelle en tegnieke wat gebruik word om ekstreme gebeurtenisse te beskryf en te modelleer. Ten einde inferensies aangaande ekstreem kwantiele te maak, is dit nodig om die ekstreem waarde indeks (EWI) te beraam. Daar bestaan talle beramers van die EWI in die literatuur. Hierdie beramers is egter slegs van toepassing in die enkele steekproef geval. Die doel van hierdie studie is om ’n meer akkurate beramer van die EWI te verkry wat van toepassing is in ’n ANOVA opset.
’n ANOVA opset leen homself tot die gebruik van empiriese Bayes (EB) beramers, wat die fokus van hierdie studie sal wees. Hierdie beramers is nog nie in literatuur ondersoek nie.
Die studie begin met ’n literatuurstudie, wat die areas van toepassing vir EWT, Bayes teorie en EB teorie insluit. Verskillende metodes van EWI beraming word bespreek, insluitend ’n bespreking oor hoe die optimale drempel bepaal kan word. Spesifiek word twee aanpasbare metodes van drempelseleksie beskou.
’n Simulasiestudie is uitgevoer om die akkuraatheid van beraming van verskillende beramingsmetodes te vergelyk, in die enkele steekproef geval. Eerste orde en tweede orde beramingsmetodes word beskou. In die geval van tweede orde beraming, word moontlike beramingsmetodes van die tweede orde parameter ook ondersoek.
’n Eerste orde en ’n tweede orde EB beramer van die EWI is afgelei met die doel om ’n beramer te kry wat van toepassing is vir die ANAVA opset. ’n Gevallestudie van vyf versekeringsportefeuljes word gebruik om ondersoek in te stel of die twee EB beramers die akkuraatheid van beraming van die EWI verbeter, in vergelyking met die EWI beramers wat verkry word deur die portefeuljes afsonderlik te ontleed. Die resultate toon dat die eerste orde EB beramer beter gevaar het as die Hill beramer. Die tweede orde EB beramer het egter slegter gevaar as die tweede orde beramer wat gebruik is as maatstaf, naamlik die passing van die gesteurde Pareto verdeling (PPD) aan alle waarnemings bo ’n gegewe drempel, met behulp van maksimum aanneemlikheidsberaming.
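For reference, the first order Hill estimator mentioned above can be sketched in a few lines of R (a toy Pareto-type sample, not the insurance data; the choice of k is precisely the threshold selection problem the thesis studies):

```r
# Hill estimator of the extreme value index from the k largest
# observations; xs[k + 1] plays the role of the threshold.
hill <- function(x, k) {
  xs <- sort(x, decreasing = TRUE)
  mean(log(xs[1:k])) - log(xs[k + 1])
}
set.seed(1)
x <- 1 / runif(1000)^0.5   # Pareto-type sample with true EVI = 0.5
hill(x, k = 100)           # should be close to 0.5
```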
14.
Nearest hypersphere classification: a comparison with other classification techniques. Van der Westhuizen, Cornelius Stephanus.
Thesis (MCom)--Stellenbosch University, 2014.
ENGLISH ABSTRACT: Classification is a widely used statistical procedure to classify objects into two or more
classes according to some rule which is based on the input variables. Examples of such
techniques are Linear and Quadratic Discriminant Analysis (LDA and QDA). However,
classification of objects with these methods can become complicated when the number of input
variables in the data becomes too large (n ≪ p, i.e. more input variables than observations), when the assumption of normality is no
longer met or when classes are not linearly separable. Vapnik et al. (1995) introduced the
Support Vector Machine (SVM), a kernel-based technique, which can perform classification
in cases where LDA and QDA are not valid. SVM makes use of an optimal separating
hyperplane and a kernel function to derive a rule which can be used for classifying objects.
Another kernel-based technique was proposed by Tax and Duin (1999) where a hypersphere
is used for domain description of a single class. The idea of a hypersphere for a single class
can be easily extended to classification when dealing with multiple classes by just classifying
objects to the nearest hypersphere.
Although the theory of hyperspheres is well developed, not much research has gone into
using hyperspheres for classification and the performance thereof compared to other
classification techniques. In this thesis we will give an overview of Nearest Hypersphere
Classification (NHC) as well as provide further insight regarding the performance of NHC
compared to other classification techniques (LDA, QDA and SVM) under different
simulation configurations.
We begin with a literature study, where the theory of the classification techniques LDA,
QDA, SVM and NHC will be dealt with. In the discussion of each technique, applications in
the statistical software R will also be provided. An extensive simulation study is carried out
to compare the performance of LDA, QDA, SVM and NHC for the two-class case. Various
data scenarios will be considered in the simulation study. This will give further insight in
terms of which classification technique performs better under the different data scenarios.
Finally, the thesis ends with the comparison of these techniques on real-world data.
AFRIKAANSE OPSOMMING: Klassifikasie is ’n statistiese metode wat gebruik word om objekte in twee of meer klasse te
klassifiseer gebaseer op ’n reël wat gebou is op die onafhanklike veranderlikes. Voorbeelde
van hierdie metodes sluit in Lineêre en Kwadratiese Diskriminant Analise (LDA en KDA).
Wanneer die aantal onafhanklike veranderlikes in ’n datastel te veel raak, die aanname van
normaliteit nie meer geld nie of die klasse nie meer lineêr skeibaar is nie, raak die toepassing
van metodes soos LDA en KDA egter te moeilik. Vapnik et al. (1995) het ’n kern gebaseerde
metode bekendgestel, die Steun Vektor Masjien (SVM), wat wel vir klassifisering gebruik
kan word in situasies waar metodes soos LDA en KDA misluk. SVM maak gebruik van ’n
optimale skeibare hipervlak en ’n kern funksie om ’n reël af te lei wat gebruik kan word om
objekte te klassifiseer. ’n Ander kern gebaseerde tegniek is voorgestel deur Tax and Duin
(1999) waar ’n hipersfeer gebruik kan word om ’n gebied beskrywing op te stel vir ’n datastel
met net een klas. Dié idee van ’n enkele klas wat beskryf kan word deur ’n hipersfeer, kan
maklik uitgebrei word na ’n multi-klas klassifikasie probleem. Dit kan gedoen word deur
slegs die objekte te klassifiseer na die naaste hipersfeer.
Alhoewel die teorie van hipersfere goed ontwikkeld is, is daar egter nog nie baie navorsing
gedoen rondom die gebruik van hipersfere vir klassifikasie nie. Daar is ook nog nie baie
gekyk na die prestasie van hipersfere in vergelyking met ander klassifikasie tegnieke nie. In
hierdie tesis gaan ons ’n oorsig gee van Naaste Hipersfeer Klassifikasie (NHK) asook verdere
insig in terme van die prestasie van NHK in vergelyking met ander klassifikasie tegnieke
(LDA, KDA en SVM) onder sekere simulasie konfigurasies.
Ons gaan begin met ’n literatuurstudie, waar die teorie van die klassifikasie tegnieke LDA,
KDA, SVM en NHK behandel gaan word. Vir elke tegniek gaan toepassings in die statistiese
sagteware R ook gewys word. ’n Omvattende simulasie studie word uitgevoer om die
prestasie van die tegnieke LDA, KDA, SVM en NHK te vergelyk. Die vergelyking word
gedoen vir situasies waar die data slegs twee klasse het. ’n Verskeidenheid van data situasies
gaan ook ondersoek word om verdere insig te toon in terme van wanneer watter tegniek die
beste vaar. Die tesis gaan afsluit deur die genoemde tegnieke toe te pas op praktiese
datastelle.
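To make the idea of classifying to the nearest hypersphere concrete, a simplified R sketch follows (spheres fitted in the input space with class means as centres; the thesis works with kernel-based hyperspheres in the spirit of Tax and Duin (1999), so this is illustrative only):

```r
# Fit one sphere per class: centre = class mean, radius = a quantile
# of the within-class distances to that centre.
fit_spheres <- function(X, y, q = 0.95) {
  lapply(split(as.data.frame(X), y), function(Xc) {
    centre <- colMeans(Xc)
    d <- sqrt(rowSums(sweep(as.matrix(Xc), 2, centre)^2))
    list(centre = centre, radius = unname(quantile(d, q)))
  })
}
# Classify each new point to the sphere whose surface is nearest
# (signed distance: distance to centre minus radius).
predict_spheres <- function(spheres, Xnew) {
  scores <- sapply(spheres, function(s) {
    sqrt(rowSums(sweep(as.matrix(Xnew), 2, s$centre)^2)) - s$radius
  })
  colnames(scores)[apply(scores, 1, which.min)]
}
set.seed(1)
X <- rbind(matrix(rnorm(100), ncol = 2), matrix(rnorm(100, 3), ncol = 2))
y <- rep(c("A", "B"), each = 50)
table(predicted = predict_spheres(fit_spheres(X, y), X), true = y)
```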
15.
Modelling of multi-state panel data: the importance of the model assumptions. Mafu, Thandile John.
Thesis (MCom)--Stellenbosch University, 2014.
ENGLISH ABSTRACT: A multi-state model is a way of describing a process in which a subject moves through a series
of states in continuous time. The states might, for example, represent stages of a disease: in
state 1 subjects are free of the disease, in state 2 subjects have a mild form of the disease,
in state 3 subjects have severe disease, and in state 4 subjects have died of the disease. A
Markov model estimates the transition probabilities and transition intensity rates that
describe the movement of subjects between these states. For example, a patient who is mildly
ill at age 30 may be considerably worse five years later; the Markov model estimates the
probability of such a patient moving from state 2 to state 3.
Markov multi-state models were studied in this thesis with a view to assessing the Markov
model assumptions, such as homogeneity of the transition rates through time, homogeneity of
the transition rates across the subject population, and the Markov property itself.
The assessment of these assumptions was based on a simulated panel (longitudinal) dataset,
generated using the R package msm developed by Christopher Jackson (2014). The R code
written using this package is attached as an appendix.
A longitudinal dataset consists of repeated measurements of the state of a subject and the
times between observations. Observations are made on a subject at regular or irregular time
intervals until the subject dies, at which point the study ends.
AFRIKAANSE OPSOMMING: ’n Meertoestandmodel is ’n manier om ’n proses te beskryf waarin ’n subjek in ’n ononderbroke
tydperk deur verskeie toestande beweeg. Die verskillende toestande kan byvoorbeeld vir die
meting van siekte gebruik word, waar toestand 1 uit gesonde subjekte bestaan, toestand 2 uit
subjekte wat siek is, dog slegs matig, toestand 3 uit subjekte wat ernstig siek is, en toestand 4
uit subjekte wat aan die siekte sterf. ’n Markov-model raam die oorgangswaarskynlikhede en
-intensiteit wat die subjekte se vordering deur hierdie toestande beskryf. Die oorgang is
byvoorbeeld wanneer ’n bepaalde subjek of pasiënt op 30-jarige ouderdom net lig aangetas is,
maar na vyf jaar veel ernstiger siek is. Die Markov-model raam dus die waarskynlikheid dat so
’n pasiënt van toestand 2 tot toestand 3 sal vorder.
Hierdie tesis het ondersoek ingestel na Markov-meertoestandmodelle ten einde die aannames
van die modelle, soos die homogeniteit van oorgangstempo’s oor tyd, die homogeniteit van
oorgangstempo’s oor die subjekpopulasie en tipiese Markov-eienskappe, te beoordeel.
Die beoordeling van hierdie aannames was gegrond op ’n gesimuleerde paneel of longitudinale
datastel wat met behulp van Christopher Jackson (2014) se R-pakket genaamd msm gesimuleer
is. Die R-kode wat met behulp van hierdie pakket geskryf is, word as bylae aangeheg. Die
longitudinale datastel bestaan uit herhaalde metings van die toestand waarin ’n subjek verkeer
en die tydsverloop tussen waarnemings. Waarnemings van die longitudinale datastel word met
gereelde of ongereelde tussenposes onderneem totdat die subjek sterf, wanneer die studie dan
ook ten einde loop.
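Since the simulation and fitting in this thesis are built on the msm package, a minimal sketch of its use may help (the reversible four-state structure and cav data below come from the msm documentation, not from the thesis's simulated models):

```r
library(msm)
# Allowed transitions and crude initial intensities; state 4 (death)
# is absorbing and its times are treated as exactly observed.
Q <- rbind(c(0,     0.25, 0,     0.25),
           c(0.166, 0,    0.166, 0.166),
           c(0,     0.25, 0,     0.25),
           c(0,     0,    0,     0))
cav.msm <- msm(state ~ years, subject = PTNUM, data = cav,
               qmatrix = Q, deathexact = 4)
pmatrix.msm(cav.msm, t = 5)   # estimated 5-year transition probabilities
```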
16.
Multi-label feature selection with application to musical instrument recognition. Sandrock, Trudie.
Thesis (PhD)--Stellenbosch University, 2013.
ENGLISH ABSTRACT: An area of data mining and statistics that is currently receiving considerable attention is the field of multi-label learning. Problems in this field are concerned with scenarios where each data case can be associated with a set of labels instead of only one. In this thesis, we review the field of multi-label learning and discuss the lack of suitable benchmark data available for evaluating multi-label algorithms. We propose a technique for simulating multi-label data, which allows good control over different data characteristics and which could be useful for conducting comparative studies in the multi-label field.
We also discuss the explosion in data in recent years, and highlight the need for some form of dimension reduction in order to alleviate some of the challenges presented by working with large datasets. Feature (or variable) selection is one way of achieving dimension reduction, and after a brief discussion of different feature selection techniques, we propose a new technique for feature selection in a multi-label context, based on the concept of independent probes. This technique is empirically evaluated by using simulated multi-label data and it is shown to achieve classification accuracy with a reduced set of features similar to that achieved with a full set of features.
The proposed technique for feature selection is then also applied to the field of music information retrieval (MIR), specifically the problem of musical instrument recognition. An overview of the field of MIR is given, with particular emphasis on the instrument recognition problem. The particular goal of (polyphonic) musical instrument recognition is to automatically identify the instruments playing simultaneously in an audio clip, which is not a simple task. We specifically consider the case of duets – in other words, where two instruments are playing simultaneously – and approach the problem as a multi-label classification one. In our empirical study, we illustrate the complexity of musical instrument data and again show that our proposed feature selection technique is effective in identifying relevant features and thereby reducing the complexity of the dataset without negatively impacting on performance.
AFRIKAANSE OPSOMMING: ’n Area van dataontginning en statistiek wat tans baie aandag ontvang, is die veld van multi-etiket leerteorie. Probleme in hierdie veld beskou scenarios waar elke datageval met ’n stel etikette geassosieer kan word, instede van slegs een. In hierdie skripsie gee ons ’n oorsig oor die veld van multi-etiket leerteorie en bespreek die gebrek aan geskikte standaard datastelle beskikbaar vir die evaluering van multi-etiket algoritmes. Ons stel ’n tegniek vir die simulasie van multi-etiket data voor, wat goeie kontrole oor verskillende data eienskappe bied en wat nuttig kan wees om vergelykende studies in die multi-etiket veld uit te voer. Ons bespreek ook die onlangse ontploffing in data, en beklemtoon die behoefte aan ’n vorm van dimensie reduksie om sommige van die uitdagings wat deur sulke groot datastelle gestel word die hoof te bied. Veranderlike seleksie is een manier van dimensie reduksie, en na ’n vlugtige bespreking van verskillende veranderlike seleksie tegnieke, stel ons ’n nuwe tegniek vir veranderlike seleksie in ’n multi-etiket konteks voor, gebaseer op die konsep van onafhanklike soek-veranderlikes. Hierdie tegniek word empiries ge-evalueer deur die gebruik van gesimuleerde multi-etiket data en daar word gewys dat dieselfde klassifikasie akkuraatheid behaal kan word met ’n verminderde stel veranderlikes as met die volle stel veranderlikes.
Die voorgestelde tegniek vir veranderlike seleksie word ook toegepas in die veld van musiek dataontginning, spesifiek die probleem van die herkenning van musiekinstrumente. ’n Oorsig van die musiek dataontginning veld word gegee, met spesifieke klem op die herkenning van musiekinstrumente. Die spesifieke doel van (polifoniese) musiekinstrument-herkenning is om instrumente te identifiseer wat saam in ’n oudiosnit speel. Ons oorweeg spesifiek die geval van duette – met ander woorde, waar twee instrumente saam speel – en hanteer die probleem as ’n multi-etiket klassifikasie een. In ons empiriese studie illustreer ons die kompleksiteit van musiekinstrumentdata en wys weereens dat ons voorgestelde veranderlike seleksie tegniek effektief daarin slaag om relevante veranderlikes te identifiseer en sodoende die kompleksiteit van die datastel te verminder sonder ’n negatiewe impak op klassifikasie akkuraatheid.
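The principle behind the independent-probe idea can be sketched in R as follows (a single simulated label and random-forest importance stand in for the multi-label machinery developed in the thesis; the details here are ours):

```r
# Permuted copies of the features act as probes that are irrelevant by
# construction; a real feature is retained only if its importance
# exceeds that of the best probe.
library(randomForest)
set.seed(1)
n <- 300; p <- 10
X <- matrix(rnorm(n * p), ncol = p, dimnames = list(NULL, paste0("x", 1:p)))
y <- factor(X[, 1] - X[, 2] + rnorm(n) > 0)   # one label of a label set
probes <- apply(X, 2, sample)                 # column-wise permutations
colnames(probes) <- paste0("probe", 1:p)
rf  <- randomForest(cbind(X, probes), y, importance = TRUE)
imp <- importance(rf, type = 1)               # mean decrease in accuracy
keep <- rownames(imp)[imp > max(imp[colnames(probes), ]) &
                        !rownames(imp) %in% colnames(probes)]
keep                                          # typically recovers x1, x2
```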
17.
Robust principal component analysis biplots. Wedlake, Ryan Stuart.
Thesis (MSc (Mathematical Statistics))--University of Stellenbosch, 2008.
In this study several procedures for finding robust principal components (RPCs) for low and high dimensional data sets are investigated in parallel with robust principal component analysis (RPCA) biplots. These RPCA biplots will be used for the simultaneous visualisation of the observations and variables in the subspace spanned by the RPCs. Chapter 1 contains: a brief overview of the difficulties that are encountered when graphically investigating patterns and relationships in multidimensional data and why PCA can be used to circumvent these difficulties; the objectives of this study; a summary of the work done in order to meet these objectives; certain results in matrix algebra that are needed throughout this study. In Chapter 2 the derivation of the classic sample principal components (SPCs) is first discussed in detail since they are the “building blocks” of classic principal component analysis (CPCA) biplots. Secondly, the traditional CPCA biplot of Gabriel (1971) is reviewed. Thirdly, modifications to this biplot using the new philosophy of Gower & Hand (1996) are given attention. Reasons why this modified biplot has several advantages over the traditional biplot – some of which are aesthetical in nature – are given. Lastly, changes that can be made to the Gower & Hand (1996) PCA biplot to optimally visualise the correlations between the variables are discussed.
Because the SPCs determine the position of the observations as well as the orientation of the arrows (traditional biplot) or axes (Gower and Hand biplot) in the PCA biplot subspace, it is useful to give estimates of the standard errors of the SPCs together with the biplot display as an indication of the stability of the biplot. A computer-intensive statistical technique called the Bootstrap, which is used to calculate the standard errors of the SPCs without making underlying distributional assumptions, is firstly discussed. Secondly, the influence of outliers on Bootstrap results is investigated. Lastly, a robust form of the Bootstrap is briefly discussed for calculating standard error estimates that remain stable with or without the presence of outliers in the sample. All the preceding topics are the subject matter of Chapter 3.
In Chapter 4, reasons why a PC analysis should be made robust in the presence of outliers are firstly discussed. Secondly, different types of outliers are discussed. Thirdly, a method for identifying influential observations and a method for identifying outlying observations are investigated. Lastly, different methods for constructing robust estimates of location and dispersion for the observations receive attention. These robust estimates are used in numerical procedures that calculate RPCs.
In Chapter 5, an overview of some of the procedures that are used to calculate RPCs for lower and higher dimensional data sets is firstly given. Secondly, two numerical procedures that can be used to calculate RPCs for lower dimensional data sets are discussed and compared in detail. Details and examples of robust versions of the Gower & Hand (1996) PCA biplot that can be constructed using these RPCs are also provided.
In Chapter 6, five numerical procedures for calculating RPCs for higher dimensional data sets are discussed in detail. Once RPCs have been obtained by using these methods, they are used to construct robust versions of the PCA biplot of Gower & Hand (1996). Details and examples of these robust PCA biplots are also provided.
An extensive software library has been developed so that the biplot methodology discussed in this study can be used in practice. The functions in this library are given in an appendix at the end of this study. This software library is used on data sets from various fields so that the merit of the theory developed in this study can be visually appraised.
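As a small illustration of the contrast that motivates this work, robust principal components can be computed in R with, for example, the ROBPCA procedure in the rrcov package (one of several RPC procedures; the thesis's own software library is given in its appendix):

```r
library(rrcov)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
X[1:5, ] <- X[1:5, ] + 10            # contaminate with five gross outliers
pca_classic <- prcomp(X)             # SPCs, attracted towards the outliers
pca_robust  <- PcaHubert(X, k = 2)   # RPCs, resistant to the outliers
getLoadings(pca_robust)              # robust loadings usable in a biplot
```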
18.
Bayesian approaches of Markov models embedded in unbalanced panel data. Muller, Christoffel Joseph Brand.
Thesis (PhD)--Stellenbosch University, 2012.
ENGLISH ABSTRACT: Multi-state models are used in this dissertation to model panel data, also known as longitudinal
or cross-sectional time-series data. These are data sets which include units that are observed
across two or more points in time. These models have been used extensively in medical studies
where the disease states of patients are recorded over time.
A theoretical overview of the current multi-state Markov models when applied to panel data
is presented and based on this theory, a simulation procedure is developed to generate panel
data sets for given Markov models. Through the use of this procedure a simulation study
is undertaken to investigate the properties of the standard likelihood approach when fitting
Markov models and then to assess its shortcomings. One of the main shortcomings highlighted
by the simulation study is the instability of the estimates obtained from the standard likelihood models,
especially when fitted to small data sets.
A Bayesian approach is introduced to develop multi-state models that can overcome these
unstable estimates by incorporating prior knowledge into the modelling process. Two Bayesian
techniques are developed and presented, and their properties are assessed through the use of
extensive simulation studies.
Firstly, Bayesian multi-state models are developed by specifying prior distributions for the
transition rates, constructing a likelihood using standard Markov theory and then obtaining
the posterior distributions of the transition rates. A selected few priors are used in these
models. Secondly, Bayesian multi-state imputation techniques are presented that make use
of suitable prior information to impute missing observations in the panel data sets. Once
imputed, standard likelihood-based Markov models are fitted to the imputed data sets to
estimate the transition rates. Two different Bayesian imputation techniques are presented.
The first approach makes use of the Dirichlet distribution and imputes the unknown states at
all time points with missing observations. The second approach uses a Dirichlet process to
estimate the time at which a transition occurred between two known observations and then a
state is imputed at that estimated transition time.
The simulation studies show that these Bayesian methods resulted in more stable results, even
when small samples are available.
AFRIKAANSE OPSOMMING: Meerstadium-modelle word in hierdie verhandeling gebruik om paneeldata, ook bekend as
longitudinale of deursnee tydreeksdata, te modelleer. Hierdie is datastelle wat eenhede insluit
wat oor twee of meer punte in tyd waargeneem word. Hierdie tipe modelle word dikwels in
mediese studies gebruik indien verskillende stadiums van ’n siekte oor tyd waargeneem word.
’n Teoretiese oorsig van die huidige meerstadium Markov-modelle toegepas op paneeldata word
gegee. Gebaseer op hierdie teorie word ’n simulasieprosedure ontwikkel om paneeldatastelle
te simuleer vir gegewe Markov-modelle. Hierdie prosedure word dan gebruik in ’n simulasiestudie
om die eienskappe van die standaard aanneemlikheidsbenadering tot die pas van Markov-modelle
te ondersoek en dan enige tekortkominge hieruit te beoordeel. Een van die hoof
tekortkominge wat uitgewys word deur die simulasiestudie, is die onstabiele beramings wat
verkry word indien dit gepas word op veral klein datastelle.
’n Bayes-benadering tot die modellering van meerstadiumpaneeldata word ontwikkel om hierdie
onstabiliteit te oorkom deur a priori-inligting in die modelleringsproses te inkorporeer. Twee
Bayes-tegnieke word ontwikkel en aangebied, en hulle eienskappe word ondersoek deur ’n
omvattende simulasiestudie.
Eerstens word Bayes-meerstadium-modelle ontwikkel deur a priori-verdelings vir die oorgangskoerse
te spesifiseer en dan die aanneemlikheidsfunksie te konstrueer deur van standaard
Markov-teorie gebruik te maak en die a posteriori-verdelings van die oorgangskoerse te bepaal.
’n Gekose aantal a priori-verdelings word gebruik in hierdie modelle. Tweedens word Bayes-meerstadium
invul tegnieke voorgestel wat gebruik maak van a priori-inligting om ontbrekende
waardes in die paneeldatastelle in te vul of te imputeer. Nadat die waardes ge-imputeer is,
word standaard Markov-modelle gepas op die ge-imputeerde datastel om die oorgangskoerse te
beraam. Twee verskillende Bayes-meerstadium imputasie tegnieke word bespreek. Die eerste
tegniek maak gebruik van ’n Dirichletverdeling om die ontbrekende stadium te imputeer by alle
tydspunte met ’n ontbrekende waarneming. Die tweede benadering gebruik ’n Dirichlet-proses
om die oorgangstyd tussen twee waarnemings te beraam en dan die ontbrekende stadium te
imputeer op daardie beraamde oorgangstyd.
Die simulasiestudies toon dat die Bayes-metodes resultate oplewer wat meer stabiel is, selfs
wanneer klein datastelle beskikbaar is.
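A minimal sketch of the first imputation idea, drawing a missing state from a Dirichlet-based distribution over the state space (the prior counts below are invented; the thesis develops the full procedure):

```r
library(gtools)                  # rdirichlet
set.seed(1)
alpha <- c(2, 2, 1, 1)           # prior pseudo-counts for states 1 to 4
p <- rdirichlet(1, alpha)        # one draw of the state probabilities
sample(1:4, size = 1, prob = p)  # impute the unknown state at this time
```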
19.
Exploratory and inferential multivariate statistical techniques for multidimensional count and binary data with applications in R. Ntushelo, Nombasa Sheroline.
Thesis (MComm)--Stellenbosch University, 2011.
ENGLISH ABSTRACT: The analysis of multidimensional (multivariate) data sets is a very important area of
research in applied statistics. Over the decades many techniques have been developed to
deal with such datasets. The multivariate techniques that have been developed include
inferential analysis, regression analysis, discriminant analysis, cluster analysis and many
more exploratory methods. Most of these methods deal with cases where the data contain
numerical variables. However, there are powerful methods in the literature that also deal
with multidimensional binary and count data.
The primary purpose of this thesis is to discuss the exploratory and inferential techniques
that can be used for binary and count data. In Chapter 2 of this thesis we give the detail of
correspondence analysis and canonical correspondence analysis. These methods are used
to analyze the data in contingency tables. Chapter 3 is devoted to cluster analysis. In this
chapter we explain four well-known clustering methods and we also discuss the distance
(dissimilarity) measures available in the literature for binary and count data. Chapter 4
contains an explanation of metric and non-metric multidimensional scaling. These
methods can be used to represent binary or count data in a lower dimensional Euclidean
space. In Chapter 5 we give a method for inferential analysis called the analysis of
distance. This method uses reasoning similar to that of the analysis of variance, but the
inference is based on a pseudo F-statistic with the p-value obtained using permutations of
the data. Chapter 6 contains real-world applications of the above methods on two
special data sets called the Biolog data and Barents Fish data.
The secondary purpose of the thesis is to demonstrate how the above techniques can be
performed in the software package R. Several R packages and functions are discussed
throughout this thesis. The usage of these functions is also demonstrated with appropriate
examples. Attention is also given to the interpretation of the output and graphics. The
thesis ends with some general conclusions and ideas for further research.
AFRIKAANSE OPSOMMING: Die analise van meerdimensionele (meerveranderlike) datastelle is ’n belangrike area van
navorsing in toegepaste statistiek. Oor die afgelope dekades is daar verskeie tegnieke
ontwikkel om sulke data te ontleed. Die meerveranderlike tegnieke wat ontwikkel is sluit
in inferensie analise, regressie analise, diskriminant analise, tros analise en vele meer
verkennende data analise tegnieke. Die meerderheid van hierdie metodes hanteer gevalle
waar die data numeriese veranderlikes bevat. Daar bestaan ook kragtige metodes in die
literatuur vir die analise van meerdimensionele binêre en telling data.
Die primêre doel van hierdie tesis is om tegnieke vir verkennende en inferensiële analise
van binêre en telling data te bespreek. In Hoofstuk 2 van hierdie tesis bespreek ons
ooreenkoms analise en kanoniese ooreenkoms analise. Hierdie metodes word gebruik om
data in gebeurlikheidstabelle te analiseer. Hoofstuk 3 bevat tegnieke vir tros analise. In
hierdie hoofstuk verduidelik ons vier gewilde tros analise metodes. Ons bespreek ook die
afstand maatstawwe wat beskikbaar is in die literatuur vir binêre en telling data. Hoofstuk
4 bevat ’n verduideliking van metriese en nie-metriese meerdimensionele skalering.
Hierdie metodes kan gebruik word om binêre of telling data in ’n lae dimensionele
Euclidiese ruimte voor te stel. In Hoofstuk 5 beskryf ons ’n inferensie metode wat bekend
staan as die analise van afstande. Hierdie metode gebruik ’n soortgelyke redenasie as die
analise van variansie. Die inferensie hier is gebaseer op ’n pseudo F-toetsstatistiek en die
p-waardes word verkry deur gebruik te maak van permutasies van die data. Hoofstuk 6
bevat toepassings van bogenoemde tegnieke op werklike datastelle wat bekend staan as
die Biolog data en die Barents Fish data.
Die sekondêre doel van die tesis is om te demonstreer hoe hierdie tegnieke uitgevoer
word in die R sagteware. Verskeie R pakette en funksies word deurgaans bespreek in die
tesis. Die gebruik van die funksies word gedemonstreer met toepaslike voorbeelde.
Aandag word ook gegee aan die interpretasie van die afvoer en die grafieke. Die tesis
sluit af met algemene gevolgtrekkings en voorstelle vir verdere navorsing.
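As a brief taste of the R workflow demonstrated in the thesis, correspondence analysis and an analysis-of-distance style permutation test can be run as follows (using standard example data shipped with the ca and vegan packages, not the Biolog or Barents Fish data):

```r
library(ca)      # correspondence analysis
library(vegan)   # permutation tests on distance matrices
data(smoke)                       # a small contingency table
fit_ca <- ca(smoke)
summary(fit_ca)                   # inertia explained per dimension
data(dune, dune.env)              # count data with a grouping factor
adonis2(dune ~ Management, data = dune.env, method = "bray")
```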
20.
Aspects of copulas and goodness-of-fit. Kpanzou, Tchilabalo Abozou.
Thesis (MComm (Statistics and Actuarial Science))--Stellenbosch University, 2008.
The goodness-of-fit of a statistical model describes how well it fits a set of observations. Measures
of goodness-of-fit typically summarize the discrepancy between observed values and the values
expected under the model in question. Such measures can be used in statistical hypothesis
testing, for example to test for normality, to test whether two samples are drawn from identical
distributions, or whether outcome frequencies follow a specified distribution. Goodness-of-fit
for copulas is a special case of the more general problem of testing multivariate models, but is
complicated due to the difficulty of specifying marginal distributions.
In this thesis, the goodness-of-fit test statistics for general distributions and the tests for copulas
are investigated, but prior to that an understanding of copulas and their properties is developed.
In fact copulas are useful tools for understanding relationships among multivariate variables, and
are important tools for describing the dependence structure between random variables. Several
univariate, bivariate and multivariate test statistics are investigated, the emphasis being on
tests for normality. Among goodness-of-fit tests for copulas, tests based on the probability integral
transform, Rosenblatt's transformation, as well as some dimension reduction techniques are
considered. Bootstrap procedures are also described. Simulation studies are conducted to first
compare the power of rejection of the null hypothesis of the Clayton copula by four different test
statistics under the alternative of the Gumbel-Hougaard copula, and also to compare the power
of rejection of the null hypothesis of the Gumbel-Hougaard copula under the alternative of the
Clayton copula. An application of the described techniques is made to a practical data set.
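A minimal sketch of such a power comparison in R, using the copula package (one test statistic and a small number of bootstrap replicates for speed; the thesis compares four statistics under its own settings):

```r
library(copula)
set.seed(1)
u <- rCopula(200, gumbelCopula(2, dim = 2))   # Gumbel-Hougaard data
# Parametric-bootstrap goodness-of-fit test of the Clayton null;
# a small p-value here is a correct rejection.
gofCopula(claytonCopula(dim = 2), x = u, N = 100)
```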