41
Comparative approaches to handling missing data, with particular focus on multiple imputation for both cross-sectional and longitudinal models. Hassan, Ali Satty Ali. January 2012.
Much data-based research is characterized by the unavoidable problem of incompleteness
as a result of missing or erroneous values. This thesis discusses some of the
various strategies and basic issues in statistical data analysis for addressing the missing
data problem, and deals with both missing covariates and missing outcomes.
We restrict our attention to methodologies which address a specific
missing data pattern, namely monotone missingness.
The thesis is divided into two parts. The first part places particular emphasis on
the so-called missing at random (MAR) assumption, but focuses the bulk of its attention
on multiple imputation techniques. The main aim of this part is to investigate various
modelling techniques using application studies, to identify the most appropriate
techniques, and to gain insight into their appropriateness for the analysis of
incomplete data. The thesis first deals with the problem of missing
covariate values when estimating regression parameters under a monotone missing covariate
pattern. The study is devoted to a comparison of different imputation techniques,
namely Markov chain Monte Carlo (MCMC), regression, propensity score (PS) and last
observation carried forward (LOCF). The results from the application study revealed
that, when the missing data pattern is monotone, some imputation methods handle missing
covariates consistently better than others. Of the methods explored, the MCMC and regression methods
of imputation for estimating regression parameters with monotone missingness were
preferable to the PS and LOCF methods. The study is also concerned with a comparative
analysis of the techniques applied to incomplete Gaussian longitudinal outcome
or response data due to random dropout. Three different methods are assessed and
investigated, namely multiple imputation (MI), inverse probability weighting (IPW)
and direct likelihood analysis. The findings in general favoured MI over IPW in the
case of continuous outcomes, even when the MAR mechanism holds. The findings further suggest that the use of MI and direct likelihood techniques leads to accurate and
equivalent results, as both techniques arrive at the same substantive conclusions. The
study also compares and contrasts several statistical methods for analyzing incomplete
non-Gaussian longitudinal outcomes when the underlying study is subject to ignorable
dropout. The methods considered include weighted generalized estimating equations
(WGEE), multiple imputation after generalized estimating equations (MI-GEE) and
generalized linear mixed model (GLMM). The current study found that the MI-GEE
method was considerably robust, performing better than all the other methods for both
small and large sample sizes, regardless of the dropout rate.
The primary interest of the second part of the thesis falls under the non-ignorable
dropout (MNAR) modelling frameworks that rely on sensitivity analysis in modelling
incomplete Gaussian longitudinal data. The aim of this part is to deal with non-random
dropout by explicitly modelling the assumptions behind the dropout process,
incorporating this additional sub-model into the model for the measurement data, and
assessing the sensitivity of the modelling assumptions. The study pays attention to
the analysis of repeated Gaussian measures subject to potentially non-random dropout,
in order to study the influence that the dropout process might exert on inference.
We consider the construction of a particular type of selection model,
namely the Diggle-Kenward model, as a tool for assessing the sensitivity of a selection
model in terms of the modelling assumptions. The major conclusions drawn were that
there was evidence in favour of the MAR process rather than an MCAR process in
the context of the assumed model. In addition, there was the need to obtain further
insight into the data by comparing various sensitivity analysis frameworks. Lastly,
two families of models were also compared and contrasted to investigate the potential
influence on inference that dropout might exert on the dependent measurement
data considered, and to deal with incomplete sequences. The models were based on
the selection and pattern-mixture frameworks, used for sensitivity analysis to jointly model
the distribution of the dropout process and longitudinal measurement process. The
results of the sensitivity analysis were in agreement and hence led to similar parameter
estimates. Additional confidence in the findings was gained as both models led to
similar results for significant effects such as marginal treatment effects. / Thesis (M.Sc.)-University of KwaZulu-Natal, Pietermaritzburg, 2012.
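As a rough illustration of the multiple imputation idea compared in the first part of the thesis, the sketch below carries out regression-based multiple imputation for a single monotonely missing covariate and pools the regression estimates with Rubin's rules. It is a minimal sketch only, using simulated data, an arbitrary number of imputations and simple linear models; it is not the imputation procedure or the application data used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (hypothetical): outcome y on covariates x1 (complete) and x2
# (monotonely missing: once missing, it stays missing for later-ordered units).
n = 300
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)
missing = np.arange(n) > int(0.7 * n)           # last 30% of x2 unobserved
x2_obs = np.where(missing, np.nan, x2)

def ols(X, y):
    """Least-squares fit returning coefficients, their covariance and sigma^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = X.shape[0] - X.shape[1]
    sigma2 = np.sum((y - X @ beta) ** 2) / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, cov, sigma2

M = 20                                          # number of imputations (arbitrary)
estimates, variances = [], []
obs = ~missing
# Imputation model x2 | x1, fitted on the complete cases.
a, a_cov, a_sig2 = ols(np.column_stack([np.ones(obs.sum()), x1[obs]]), x2_obs[obs])
for _ in range(M):
    # "Proper" imputation: draw imputation-model coefficients, then draw x2
    # (the residual variance is kept fixed here for brevity).
    a_draw = rng.multivariate_normal(a, a_cov)
    x2_imp = x2_obs.copy()
    x2_imp[missing] = (a_draw[0] + a_draw[1] * x1[missing]
                       + rng.normal(scale=np.sqrt(a_sig2), size=missing.sum()))
    X = np.column_stack([np.ones(n), x1, x2_imp])
    beta, cov, _ = ols(X, y)
    estimates.append(beta)
    variances.append(np.diag(cov))

# Rubin's rules: pooled estimate, within- and between-imputation variance.
est = np.mean(estimates, axis=0)
W = np.mean(variances, axis=0)
B = np.var(estimates, axis=0, ddof=1)
total_var = W + (1 + 1 / M) * B
print("pooled coefficients:", est)
print("pooled standard errors:", np.sqrt(total_var))
```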
42
Likelihood based statistical methods for estimating HIV incidence rate. Gabaitiri, Lesego. January 2013.
Estimation of current levels of human immunodeficiency virus (HIV) incidence is essential
for monitoring the impact of an epidemic, determining public health priorities,
assessing the impact of interventions and for planning purposes. However, there is
often insufficient data on incidence as compared to prevalence. A direct approach
is to estimate incidence from longitudinal cohort studies. Although this approach
can provide a direct and unbiased measure of incidence for the settings where the study is
conducted, it is often too expensive and time-consuming. An alternative approach is
to estimate incidence from cross-sectional surveys using biomarkers that distinguish
between recent and non-recent/long-standing infections. The original biomarker-based
approach proposes the detection of HIV-1 p24 antigen in the pre-seroconversion period
to identify persons with acute infection for estimating HIV incidence. However,
this approach requires large sample sizes in order to obtain reliable estimates of HIV
incidence, because the duration of antigenemia before antibody detection is short,
about 22.5 days. Subsequently, another method, involving a dual antibody testing
system, was developed. In stage one, a sensitive test is used to diagnose HIV infection,
and a less sensitive test is used in the second stage to distinguish between long-standing
and recent infections among those who tested positive for HIV
in stage one. The question is: how do we combine these data with other relevant information,
such as the time an individual takes to go from being undetectable to being detectable
by the less sensitive test, to estimate incidence?
The main objective of this thesis is therefore to develop likelihood-based methods
that can be used to estimate HIV incidence when data are derived from cross-sectional
surveys and the disease classification is achieved by combining two biomarker or
assay tests. The thesis builds on the dual antibody testing approach and extends the
statistical framework that uses the multinomial distribution to derive the maximum
likelihood estimators of HIV incidence for different settings.
In order to improve incidence estimation, we develop a model for estimating HIV
incidence that incorporates information on past prevalence, and derive
maximum likelihood estimators of incidence assuming the incidence density is constant
over a specified period. Later, we extend the method to settings where a proportion
of subjects remain non-reactive to a less sensitive test long after seroconversion.
Diagnostic tests used to determine recent infections are prone to errors. To address
this problem, we considered a method that simultaneously adjusts for
sensitivity and specificity. In addition, we also showed that the sensitivity is similar to
the proportion of subjects who eventually pass through the “recent infection” state.
We also relax the assumption of constant incidence density by proposing a linear incidence
density to accommodate settings where incidence might be declining or increasing.
We extend the standard adjusted model for estimating incidence to settings where
some subjects who tested positive for HIV antibodies were not tested by the less sensitive
test, resulting in missing outcome data. Models for the risk factors (covariates)
of HIV incidence are considered in the penultimate chapter. We used data from
the Botswana AIDS Impact Survey (BAIS) III of 2008 to illustrate the proposed methods. The
general conclusions and recommendations for future work are provided in the final
chapter. / Thesis (Ph.D.)-University of KwaZulu-Natal, Pietermaritzburg, 2013.
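A highly simplified version of the cross-sectional, biomarker-based incidence estimation described above is sketched below: survey respondents are classified as HIV-negative, recently infected or long-standing infected, and incidence is estimated from the count of recent infections via the snapshot relation lambda ≈ R / (omega × N_neg), with a multinomial bootstrap confidence interval. The counts and the assumed mean window period omega are hypothetical, and the sketch omits the prevalence information, misclassification adjustments and missing-data extensions developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical cross-sectional survey counts (not from BAIS III):
n_neg = 8200      # HIV-negative
n_recent = 45     # HIV-positive, non-reactive on the less sensitive test ("recent")
n_long = 1350     # HIV-positive and reactive ("long-standing")
omega = 0.5       # assumed mean window period (years) spent in the "recent" state

def incidence(n_neg, n_recent, omega):
    """Simple snapshot estimator: recent infections accrue at rate
    lambda * omega per susceptible person, so lambda ~= R / (omega * N_neg)."""
    return n_recent / (omega * n_neg)

lam_hat = incidence(n_neg, n_recent, omega)

# Multinomial (parametric) bootstrap for a percentile confidence interval.
n = n_neg + n_recent + n_long
p_hat = np.array([n_neg, n_recent, n_long]) / n
boot = []
for _ in range(5000):
    b_neg, b_recent, b_long = rng.multinomial(n, p_hat)
    boot.append(incidence(b_neg, b_recent, omega))
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"estimated incidence: {lam_hat:.4f} per person-year")
print(f"95% bootstrap CI: ({lo:.4f}, {hi:.4f})")
```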
43
Statistical methods for analysing complex survey data: an application to HIV/AIDS in Ethiopia. Mohammed, Mohammed O. M. 12 February 2014.
The HIV/AIDS pandemic is currently the most challenging public health matter that
faces third world countries, especially those in Sub-Saharan Africa. Ethiopia, in East
Africa, with a generalised and highly heterogeneous epidemic, is no exception, with
HIV/AIDS affecting most sectors of the economy. The first case of HIV in Ethiopia
was reported in 1984. Since then, HIV/AIDS has become a major public health concern,
leading the Government of Ethiopia to declare a public health emergency in
2002. In 2011, the adult HIV/AIDS prevalence in Ethiopia was estimated at 1.5%.
Approximately 1.2 million Ethiopians were living with HIV/AIDS in 2010.
Surveys are an important and popular tool for collecting data. Analytical use of survey
data, especially health survey data, has become very common, with a focus on the association of particular outcome variables with explanatory variables at the population
level. In this study we used the data from the 2005 Ethiopian Demographic and Health
Survey (EDHS 2005), and identified key demographic, socioeconomic, sociocultural,
behavioral and proximate determinants of HIV/AIDS risk. Most survey
analysts ignore complex survey design issues such as clustering, stratification and unequal probabilities of selection (weights). This study takes the design aspect
into account, because failure to do so leads to biased parameter estimates and standard errors, wide confidence intervals and incorrect statistical tests.
In this study, three statistical approaches were used to analyse the complex survey
data. The first approach was survey logistic regression, used to model the binary
outcome (HIV serostatus) in terms of a set of explanatory variables (the HIV risk
factors). The difference between survey logistic regression and ordinary
logistic regression is that the survey logistic regression approach takes the study design
into account during the analysis. The second approach was a multilevel logistic regression
model, which assumed that the data structure in the population was hierarchical, and
that individuals within households were selected from clusters that were randomly selected
from a national sampling frame. We considered a three-level model for our analysis.
This second approach considered results from both frequentist and Bayesian multilevel
models. Bayesian methods can provide accurate estimates of the parameters and of the
uncertainty associated with them. The third approach was a spatial modelling
approach, in which model parameters were estimated under the Integrated Nested Laplace
Approximation (INLA) paradigm. / Thesis (Ph.D.)-University of KwaZulu-Natal, Pietermaritzburg, 2013.
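The sketch below illustrates, on simulated data, why the survey design matters in the first (survey logistic regression) approach: the regression coefficients are estimated by weighted (pseudo maximum likelihood) logistic regression, and a linearized variance estimator that sums score contributions within primary sampling units is compared with the naive model-based variance. The data, weights and cluster structure are invented, stratification is ignored for brevity, and this is not the EDHS 2005 analysis itself.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical survey data: y = binary outcome, X = design matrix, w = sampling
# weights, psu = primary sampling unit (cluster) identifiers.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.integers(0, 2, size=n)])
psu = rng.integers(0, 50, size=n)
w = rng.uniform(0.5, 3.0, size=n)               # unequal selection probabilities
true_beta = np.array([-2.0, 0.6, 0.8])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

def weighted_logit(X, y, w, n_iter=50):
    """Pseudo maximum likelihood: Newton-Raphson on the weighted log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = -(X * (w * p * (1 - p))[:, None]).T @ X
        beta = beta - np.linalg.solve(hess, grad)
    return beta, hess

beta_hat, hess = weighted_logit(X, y, w)

# Linearized (Taylor series) variance estimate: sum score contributions within
# each PSU and form a sandwich around the inverse of the negative Hessian.
p = 1 / (1 + np.exp(-X @ beta_hat))
scores = X * (w * (y - p))[:, None]
psu_totals = np.array([scores[psu == g].sum(axis=0) for g in np.unique(psu)])
G = psu_totals.shape[0]
centred = psu_totals - psu_totals.mean(axis=0)
meat = (G / (G - 1)) * centred.T @ centred
bread = np.linalg.inv(-hess)
cov_design = bread @ meat @ bread

print("coefficients:", beta_hat)
print("design-based SEs:", np.sqrt(np.diag(cov_design)))
print("naive (model-based) SEs:", np.sqrt(np.diag(bread)))
```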
44
Multivariate analysis of the BRICS financial markets. Ijumba, Claire. January 2013.
The co-movements and integration of financial markets have been a subject of great concern among
many researchers and economists, due to an interest in the impacts of stock market integration in
terms of international portfolio diversification, asset allocation and asset pricing efficiency. Understanding
the interdependence among financial markets is thus of immense importance especially to
investors and stakeholders in making viable decisions, managing risks and monitoring portfolio performances.
In this thesis, we investigated the levels of interdependence and dynamic linkages among
the five emerging economies known as the BRICS: Brazil, Russia, India, China and South Africa,
using vector autoregressive (VAR), univariate GARCH(1,1) and multivariate GARCH models. Our
data sample consisted of the BRICS weekly returns from the period of January 2000 to December
2012. We used a VAR model to examine the linear dependence among the BRICS markets. The
results from the VAR model analysis provided some evidence of unidirectional linear dependencies
of the Indian and Chinese markets on the Brazilian stock market. The univariate GARCH(1,1) and
multivariate GARCH models were employed to explore the volatility and dynamic correlation in the
BRICS stock returns respectively. The results of the univariate GARCH model suggested volatility
persistence among all the BRICS stock returns where China appeared to be the most volatile
followed by the Russian stock market while the South African market was found to be the least
volatile. Results from the multivariate GARCH models revealed similar volatility persistence. Furthermore,
we found that the correlations among the five emerging markets varied with time. From
this study, evidence of interdependence among the BRICS cannot be rejected. Moreover, it appears
that there are other factors apart from the internal markets themselves that may affect the volatility
and correlation among the BRICS. / M.Sc. University of KwaZulu-Natal, Durban, 2013.
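As a small illustration of the univariate GARCH(1,1) component of the analysis above, the following sketch fits a Gaussian GARCH(1,1) model to a simulated return series by numerically maximizing the likelihood. The series, starting values and optimizer choice are arbitrary and not taken from the BRICS data.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Simulated weekly returns (hypothetical, not the BRICS series used in the thesis).
T = 700
omega_t, alpha_t, beta_t = 0.05, 0.08, 0.90
r = np.empty(T)
h = omega_t / (1 - alpha_t - beta_t)            # unconditional variance
for t in range(T):
    r[t] = np.sqrt(h) * rng.normal()
    h = omega_t + alpha_t * r[t] ** 2 + beta_t * h

def garch11_negloglik(params, r):
    """Gaussian GARCH(1,1) negative log-likelihood with variance recursion
    h_t = omega + alpha * r_{t-1}^2 + beta * h_{t-1}."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf
    h = np.empty_like(r)
    h[0] = np.var(r)                            # initialise with the sample variance
    for t in range(1, len(r)):
        h[t] = omega + alpha * r[t - 1] ** 2 + beta * h[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi * h) + r ** 2 / h)

res = minimize(garch11_negloglik, x0=[0.1, 0.1, 0.8], args=(r,), method="Nelder-Mead")
omega_hat, alpha_hat, beta_hat = res.x
print(f"omega={omega_hat:.4f}, alpha={alpha_hat:.4f}, beta={beta_hat:.4f}")
print(f"volatility persistence (alpha + beta) = {alpha_hat + beta_hat:.3f}")
```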
45
Robust principal component analysis biplots. Wedlake, Ryan Stuart.
Thesis (MSc (Mathematical Statistics))--University of Stellenbosch, 2008. / In this study several procedures for finding robust principal components (RPCs) for low and high dimensional data sets are investigated in parallel with robust principal component analysis (RPCA) biplots. These RPCA biplots will be used for the simultaneous visualisation of the observations and variables in the subspace spanned by the RPCs. Chapter 1 contains: a brief overview of the difficulties that are encountered when graphically investigating patterns and relationships in multidimensional data and why PCA can be used to circumvent these difficulties; the objectives of this study; a summary of the work done in order to meet these objectives; certain results in matrix algebra that are needed throughout this study.

In Chapter 2 the derivation of the classic sample principal components (SPCs) is first discussed in detail since they are the 'building blocks' of classic principal component analysis (CPCA) biplots. Secondly, the traditional CPCA biplot of Gabriel (1971) is reviewed. Thirdly, modifications to this biplot using the new philosophy of Gower & Hand (1996) are given attention. Reasons why this modified biplot has several advantages over the traditional biplot – some of which are aesthetical in nature – are given. Lastly, changes that can be made to the Gower & Hand (1996) PCA biplot to optimally visualise the correlations between the variables are discussed.

Because the SPCs determine the position of the observations as well as the orientation of the arrows (traditional biplot) or axes (Gower and Hand biplot) in the PCA biplot subspace, it is useful to give estimates of the standard errors of the SPCs together with the biplot display as an indication of the stability of the biplot. A computer-intensive statistical technique called the Bootstrap is firstly discussed that is used to calculate the standard errors of the SPCs without making underlying distributional assumptions. Secondly, the influence of outliers on Bootstrap results is investigated. Lastly, a robust form of the Bootstrap is briefly discussed for calculating standard error estimates that remain stable with or without the presence of outliers in the sample. All the preceding topics are the subject matter of Chapter 3.

In Chapter 4, reasons why a PC analysis should be made robust in the presence of outliers are firstly discussed. Secondly, different types of outliers are discussed. Thirdly, a method for identifying influential observations and a method for identifying outlying observations are investigated. Lastly, different methods for constructing robust estimates of location and dispersion for the observations receive attention. These robust estimates are used in numerical procedures that calculate RPCs.

In Chapter 5, an overview of some of the procedures that are used to calculate RPCs for lower and higher dimensional data sets is firstly discussed. Secondly, two numerical procedures that can be used to calculate RPCs for lower dimensional data sets are discussed and compared in detail. Details and examples of robust versions of the Gower & Hand (1996) PCA biplot that can be constructed using these RPCs are also provided.

In Chapter 6, five numerical procedures for calculating RPCs for higher dimensional data sets are discussed in detail. Once RPCs have been obtained by using these methods, they are used to construct robust versions of the PCA biplot of Gower & Hand (1996). Details and examples of these robust PCA biplots are also provided. An extensive software library has been developed so that the biplot methodology discussed in this study can be used in practice. The functions in this library are given in an appendix at the end of this study. This software library is used on data sets from various fields so that the merit of the theory developed in this study can be visually appraised.
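A minimal sketch of the biplot machinery discussed above: principal component scores and variable loadings are obtained from the singular value decomposition of the centred and scaled data, once with the classical mean and standard deviation and once with a crude robust alternative (median and MAD). This is only an illustration on simulated data; it is not one of the specific RPC procedures compared in the study, which robustify the component estimation itself rather than only the centring and scaling.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data with a few gross outliers (hypothetical example).
n, p = 100, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X[:5] += 15.0                                   # contaminate the first five rows

def biplot_coordinates(X, center, scale):
    """Return 2-D row (observation) and column (variable) biplot coordinates
    from the SVD of the centred and scaled data matrix."""
    Z = (X - center) / scale
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    rows = U[:, :2] * d[:2]                     # principal component scores
    cols = Vt[:2].T                             # variable loadings (axis directions)
    return rows, cols

# Classical biplot: mean / standard deviation.
rows_c, cols_c = biplot_coordinates(X, X.mean(axis=0), X.std(axis=0, ddof=1))

# Crude "robust" biplot: median / MAD (scaled for consistency under normality).
mad = 1.4826 * np.median(np.abs(X - np.median(X, axis=0)), axis=0)
rows_r, cols_r = biplot_coordinates(X, np.median(X, axis=0), mad)

print("classical loadings (first 2 PCs):\n", np.round(cols_c, 2))
print("robustly scaled loadings (first 2 PCs):\n", np.round(cols_r, 2))
```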
46
Bayesian approaches of Markov models embedded in unbalanced panel data. Muller, Christoffel Joseph Brand.
Thesis (PhD)--Stellenbosch University, 2012. / ENGLISH ABSTRACT: Multi-state models are used in this dissertation to model panel data, also known as longitudinal
or cross-sectional time-series data. These are data sets which include units that are observed
across two or more points in time. These models have been used extensively in medical studies
where the disease states of patients are recorded over time.
A theoretical overview of the current multi-state Markov models when applied to panel data
is presented and, based on this theory, a simulation procedure is developed to generate panel
data sets for given Markov models. Through the use of this procedure a simulation study
is undertaken to investigate the properties of the standard likelihood approach when fitting
Markov models and then to assess its shortcomings. One of the main shortcomings highlighted
by the simulation study is the unstable estimates obtained by the standard likelihood models,
especially when fitted to small data sets.
A Bayesian approach is introduced to develop multi-state models that can overcome these
unstable estimates by incorporating prior knowledge into the modelling process. Two Bayesian
techniques are developed and presented, and their properties are assessed through the use of
extensive simulation studies.
Firstly, Bayesian multi-state models are developed by specifying prior distributions for the
transition rates, constructing a likelihood using standard Markov theory and then obtaining
the posterior distributions of the transition rates. A selected few priors are used in these
models. Secondly, Bayesian multi-state imputation techniques are presented that make use
of suitable prior information to impute missing observations in the panel data sets. Once
imputed, standard likelihood-based Markov models are fitted to the imputed data sets to
estimate the transition rates. Two different Bayesian imputation techniques are presented.
The first approach makes use of the Dirichlet distribution and imputes the unknown states at
all time points with missing observations. The second approach uses a Dirichlet process to
estimate the time at which a transition occurred between two known observations and then a
state is imputed at that estimated transition time.
The simulation studies show that these Bayesian methods yield more stable results, even
when small samples are available. / AFRIKAANSE OPSOMMING: Multi-state models are used in this dissertation to model panel data, also known as
longitudinal or cross-sectional time-series data. These are data sets which include units
that are observed across two or more points in time. This type of model is often used in
medical studies when different stages of a disease are observed over time.
A theoretical overview of the current multi-state Markov models applied to panel data is
given. Based on this theory, a simulation procedure is developed to simulate panel data sets
for given Markov models. This procedure is then used in a simulation study
to investigate the properties of the standard likelihood approach to fitting Markov
models and then to assess any resulting shortcomings. One of the main
shortcomings highlighted by the simulation study is the unstable estimates that are
obtained when fitting to small data sets in particular.
A Bayesian approach to the modelling of multi-state panel data is developed to overcome this
instability by incorporating prior information into the modelling process. Two
Bayesian techniques are developed and presented, and their properties are investigated through a
comprehensive simulation study.
Firstly, Bayesian multi-state models are developed by specifying prior distributions for the
transition rates, constructing the likelihood function using standard
Markov theory and determining the posterior distributions of the transition rates.
A selected number of prior distributions are used in these models. Secondly, Bayesian multi-state
imputation techniques are proposed that make use of prior information to fill in, or impute,
missing values in the panel data sets. Once the values have been imputed,
standard Markov models are fitted to the imputed data set to estimate the transition rates.
Two different Bayesian multi-state imputation techniques are discussed. The first
technique makes use of a Dirichlet distribution to impute the missing state at all
time points with a missing observation. The second approach uses a Dirichlet process
to estimate the transition time between two observations and then imputes the missing state
at that estimated transition time.
The simulation studies show that the Bayesian methods yield results that are more stable, even
when small data sets are available.
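As a rough illustration of the standard likelihood machinery that the Bayesian methods above are designed to improve on, the sketch below simulates panel observations from a three-state continuous-time Markov model and evaluates the panel log-likelihood, in which the transition probability matrix over each observation interval is the matrix exponential of the intensity matrix. The states, intensities and observation times are hypothetical.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 3-state continuous-time Markov model (e.g. healthy / ill / dead),
# specified through a transition intensity matrix Q with rows summing to zero.
Q = np.array([[-0.15, 0.10, 0.05],
              [0.07, -0.20, 0.13],
              [0.00, 0.00, 0.00]])              # third state absorbing

rng = np.random.default_rng(6)

def simulate_panel(Q, times, n_units):
    """Simulate states observed only at the given panel times (jumps in between
    are not recorded), starting every unit in state 0."""
    records = []
    for _ in range(n_units):
        state, prev_t = 0, times[0]
        for t in times[1:]:
            P = expm(Q * (t - prev_t))
            probs = P[state] / P[state].sum()   # guard against rounding error
            new_state = rng.choice(len(Q), p=probs)
            records.append((state, new_state, t - prev_t))
            state, prev_t = new_state, t
    return records

def panel_loglik(Q, records):
    """Log-likelihood of observed state pairs: sum of log P(t)[from, to],
    with P(t) = expm(Q t) for each observation interval."""
    ll = 0.0
    for frm, to, dt in records:
        ll += np.log(expm(Q * dt)[frm, to])
    return ll

records = simulate_panel(Q, times=[0, 1, 2, 3, 4], n_units=200)
print("log-likelihood at the true intensities:", round(panel_loglik(Q, records), 2))
```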
47
Exploratory and inferential multivariate statistical techniques for multidimensional count and binary data with applications in R. Ntushelo, Nombasa Sheroline.
Thesis (MComm)--Stellenbosch University, 2011. / ENGLISH ABSTRACT: The analysis of multidimensional (multivariate) data sets is a very important area of
research in applied statistics. Over the decades many techniques have been developed to
deal with such datasets. The multivariate techniques that have been developed include
inferential analysis, regression analysis, discriminant analysis, cluster analysis and many
more exploratory methods. Most of these methods deal with cases where the data contain
numerical variables. However, there are powerful methods in the literature that also deal
with multidimensional binary and count data.
The primary purpose of this thesis is to discuss the exploratory and inferential techniques
that can be used for binary and count data. In Chapter 2 of this thesis we give the detail of
correspondence analysis and canonical correspondence analysis. These methods are used
to analyze the data in contingency tables. Chapter 3 is devoted to cluster analysis. In this
chapter we explain four well-known clustering methods and we also discuss the distance
(dissimilarity) measures available in the literature for binary and count data. Chapter 4
contains an explanation of metric and non-metric multidimensional scaling. These
methods can be used to represent binary or count data in a lower dimensional Euclidean
space. In Chapter 5 we give a method for inferential analysis called the analysis of
distance. This method uses similar reasoning to the analysis of variance, but the
inference is based on a pseudo F-statistic, with the p-value obtained using permutations of
the data. Chapter 6 contains real-world applications of the above methods to two
special data sets, the Biolog data and the Barents Fish data.
The secondary purpose of the thesis is to demonstrate how the above techniques can be
performed in the software package R. Several R packages and functions are discussed
throughout this thesis. The usage of these functions is also demonstrated with appropriate
examples. Attention is also given to the interpretation of the output and graphics. The
thesis ends with some general conclusions and ideas for further research. / AFRIKAANSE OPSOMMING: The analysis of multidimensional (multivariate) data sets is an important area of
research in applied statistics. Over the past decades various techniques have been developed to
analyse such data. The multivariate techniques that have been developed include
inferential analysis, regression analysis, discriminant analysis, cluster analysis and many
more exploratory data analysis techniques. The majority of these methods handle cases
where the data contain numerical variables. There are also powerful methods in the
literature for the analysis of multidimensional binary and count data.
The primary aim of this thesis is to discuss techniques for the exploratory and inferential
analysis of binary and count data. In Chapter 2 of this thesis we discuss
correspondence analysis and canonical correspondence analysis. These methods are used
to analyse data in contingency tables. Chapter 3 contains techniques for cluster analysis. In
this chapter we explain four popular cluster analysis methods. We also discuss the distance
measures available in the literature for binary and count data. Chapter 4
contains an explanation of metric and non-metric multidimensional scaling. These
methods can be used to represent binary or count data in a low-dimensional Euclidean
space. In Chapter 5 we describe an inferential method known as the analysis of
distances. This method uses similar reasoning to the analysis of variance. The
inference here is based on a pseudo F-test statistic and the p-values are obtained by making
use of permutations of the data. Chapter 6 contains applications of the above techniques
to real data sets known as the Biolog data and the Barents Fish data.
The secondary aim of the thesis is to demonstrate how these techniques are carried out
in the R software. Various R packages and functions are discussed throughout the
thesis. The use of the functions is demonstrated with appropriate examples.
Attention is also given to the interpretation of the output and the graphics. The thesis
concludes with general conclusions and suggestions for further research.
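As a small companion to the correspondence analysis of Chapter 2 described above, the following sketch computes row and column principal coordinates from the singular value decomposition of the standardised residuals of a contingency table. It is written in Python rather than R, the contingency table is invented for illustration, and in R the same analysis could be obtained from dedicated packages.

```python
import numpy as np

# Hypothetical two-way contingency table of counts (rows = groups, cols = categories).
N = np.array([[30, 12,  8],
              [10, 25, 15],
              [ 5,  9, 36]], dtype=float)

def correspondence_analysis(N):
    """Simple correspondence analysis: SVD of the matrix of standardised
    residuals, returning principal row/column coordinates and inertias."""
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, d, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * d) / np.sqrt(r)[:, None]      # principal coordinates
    col_coords = (Vt.T * d) / np.sqrt(c)[:, None]
    inertia = d ** 2                      # principal inertias (eigenvalues)
    return row_coords, col_coords, inertia

rows, cols, inertia = correspondence_analysis(N)
print("share of inertia per dimension:", np.round(inertia / inertia.sum(), 3))
print("row coordinates (first 2 dims):\n", np.round(rows[:, :2], 3))
print("column coordinates (first 2 dims):\n", np.round(cols[:, :2], 3))
```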
48
Aspects of copulas and goodness-of-fit. Kpanzou, Tchilabalo Abozou.
Thesis (MComm (Statistics and Actuarial Science))--Stellenbosch University, 2008. / The goodness-of-fit of a statistical model describes how well it fits a set of observations. Measures
of goodness-of-fit typically summarize the discrepancy between observed values and the values
expected under the model in question. Such measures can be used in statistical hypothesis
testing, for example to test for normality, to test whether two samples are drawn from identical
distributions, or whether outcome frequencies follow a specified distribution. Goodness-of-fit
for copulas is a special case of the more general problem of testing multivariate models, but is
complicated due to the difficulty of specifying marginal distributions.
In this thesis, the goodness-of-fit test statistics for general distributions and the tests for copulas
are investigated, but prior to that an understanding of copulas and their properties is developed.
In fact copulas are useful tools for understanding relationships among multivariate variables, and
are important tools for describing the dependence structure between random variables. Several
univariate, bivariate and multivariate test statistics are investigated, the emphasis being on
tests for normality. Among goodness-of-fit tests for copulas, tests based on the probability integral
transform, Rosenblatt's transformation, as well as some dimension reduction techniques are
considered. Bootstrap procedures are also described. Simulation studies are conducted to first
compare the power of rejection of the null hypothesis of the Clayton copula by four different test
statistics under the alternative of the Gumbel-Hougaard copula, and also to compare the power
of rejection of the null hypothesis of the Gumbel-Hougaard copula under the alternative of the
Clayton copula. An application of the described techniques is made to a practical data set.
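A small illustration of the kind of copula goodness-of-fit testing discussed above: the Clayton copula parameter is estimated from pseudo-observations by inverting Kendall's tau, a Rosenblatt-type transform is applied, and the transformed margin is compared with the uniform distribution. The data are simulated, not the practical data set of the thesis, and a proper test of this kind would normally obtain p-values by parametric bootstrap rather than from a single Kolmogorov-Smirnov statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulate bivariate data from a Clayton copula (hypothetical sample).
theta_true, n = 2.0, 500
u = rng.uniform(size=n)
w = rng.uniform(size=n)
# Conditional inversion for the Clayton copula: draw v given u.
v = ((w ** (-theta_true / (1 + theta_true)) - 1) * u ** (-theta_true) + 1) ** (-1 / theta_true)

# Pseudo-observations (ranks scaled to (0, 1)), as used when margins are unknown.
u_hat = stats.rankdata(u) / (n + 1)
v_hat = stats.rankdata(v) / (n + 1)

# Estimate theta by inverting Kendall's tau: for Clayton, tau = theta / (theta + 2).
tau, _ = stats.kendalltau(u_hat, v_hat)
theta_hat = 2 * tau / (1 - tau)

def rosenblatt_clayton(u, v, theta):
    """Rosenblatt transform for the Clayton copula: Z1 = u and
    Z2 = dC(u, v)/du, which are independent Uniform(0,1) under the null."""
    z2 = u ** (-theta - 1) * (u ** (-theta) + v ** (-theta) - 1) ** (-1 / theta - 1)
    return u, z2

z1, z2 = rosenblatt_clayton(u_hat, v_hat, theta_hat)
ks_stat, p_value = stats.kstest(z2, "uniform")
print(f"theta_hat = {theta_hat:.2f}, KS statistic = {ks_stat:.3f}, p-value = {p_value:.3f}")
```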
49
'n Ondersoek na die eindige steekproefgedrag van inferensiemetodes in ekstreemwaarde-teorie. Van Deventer, Dewald.
Thesis (MComm (Statistics and Actuarial Science))--University of Stellenbosch, 2005. / Extremes are unusual or rare events. However, when such events – for example
earthquakes, tidal waves and market crashes – do take place, they typically cause
enormous losses, both in terms of human lives and monetary value. For this reason,
it is of critical importance to accurately model extremal events. Extreme value theory
entails the development of statistical models and techniques in order to describe and
model such rare observations.
In this document we discuss aspects of extreme value theory. This theory consists of
two approaches: the classical maxima method, based on the properties of the
maximum of a sample, and the more popular threshold theory, based upon the
properties of exceedances of a specified threshold value. This document provides
the practitioner with the theoretical and practical tools for both these approaches.
This will enable him/her to perform extreme value analyses with confidence.
Extreme value theory – for both approaches – is based upon asymptotic arguments.
For finite samples, the limiting result for the sample maximum holds only
approximately. Similarly, for finite choices of the threshold, the limiting distribution for
exceedances of that threshold holds only approximately. In this document we
investigate the quality of extreme value based inferences with regard to the unknown
underlying distribution when the sample size or threshold is finite. Estimation of
extreme tail quantiles of the underlying distribution, as well as the calculation of
confidence intervals, are typically the most important objectives of an extreme value
analysis. For that reason, we evaluate the accuracy of extreme value based inferences in
terms of these estimates. This investigation was carried out using a simulation study,
performed with the software package S-Plus.
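As a rough illustration of the threshold (peaks-over-threshold) approach referred to above, the sketch below fits a generalized Pareto distribution to exceedances of a chosen threshold by maximum likelihood and converts the fit into an extreme tail quantile estimate. The data, threshold choice and quantile level are arbitrary, and the sketch is in Python although the study itself used S-Plus.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Hypothetical heavy-tailed sample (Student-t with 3 degrees of freedom).
n = 5000
x = stats.t.rvs(df=3, size=n, random_state=rng)

# Peaks-over-threshold: keep exceedances above a chosen high threshold.
u = np.quantile(x, 0.95)                  # threshold = empirical 95% quantile (arbitrary)
exceedances = x[x > u] - u
n_u = exceedances.size

# Fit the generalized Pareto distribution (GPD) to the exceedances by maximum
# likelihood, with the location parameter fixed at zero.
xi_hat, _, sigma_hat = stats.genpareto.fit(exceedances, floc=0)

def tail_quantile(p, u, xi, sigma, n, n_u):
    """POT estimate of the p-quantile (p close to 1):
    x_p = u + (sigma / xi) * (((n / n_u) * (1 - p)) ** (-xi) - 1)."""
    return u + (sigma / xi) * (((n / n_u) * (1 - p)) ** (-xi) - 1)

p = 0.999
print(f"threshold u = {u:.3f}, xi_hat = {xi_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"estimated {p:.1%} quantile: {tail_quantile(p, u, xi_hat, sigma_hat, n, n_u):.3f}")
print(f"empirical {p:.1%} quantile: {np.quantile(x, p):.3f}")
```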
50
Confidence intervals for estimators of welfare indices under complex sampling. Kirchoff, Retha.
Thesis (MComm (Statistics and Actuarial Science))--University of Stellenbosch, 2010. / ENGLISH ABSTRACT: The aim of this study is to obtain estimates and confidence intervals for welfare
indices under complex sampling. It begins by looking at sampling in general with
specific focus on complex sampling and weighting. For the estimation of the welfare
indices, two resampling techniques, viz. jackknife and bootstrap, are discussed.
They are used for the estimation of bias and standard error under simple random
sampling and complex sampling. Three confidence intervals are discussed, viz. standard
(asymptotic), percentile and bootstrap-t. An overview of welfare indices and
their estimation is given. The indices are categorized into measures of poverty and
measures of inequality. Two Laeken indices, viz. at-risk-of-poverty and quintile
share ratio, are included in the discussion. The study considers two poverty lines,
namely an absolute poverty line based on percy (ratio of total household income
to household size) and a relative poverty line based on equivalized income (ratio of
total household income to equivalized household size). The data set used as surrogate
population for the study is the Income and Expenditure Survey 2005/2006
conducted by Statistics South Africa, and details of it are provided and discussed.
An analysis of simulation data from the surrogate population was carried out using
techniques mentioned above and the results were graphed, tabulated and discussed.
Two issues were considered, namely whether the design of the survey should be taken
into account and whether resampling techniques provide reliable results, especially for
confidence intervals. The results were a mixed bag. Overall, however, it was found
that weighting showed promise in many cases, especially in the improvement of the
coverage probabilities of the confidence intervals. It was also found that the bootstrap
resampling technique was reliable (judging by the standard errors). Further
research options are mentioned as possible solutions towards the mixed results. / AFRIKAANSE OPSOMMING: The aim of the study is to obtain estimates and confidence intervals for
welfare indices under complex sampling. A general discussion of sampling is given,
with a specific focus on complex sampling and weighting. Two resampling techniques,
viz. jackknife and bootstrap resampling, are discussed as methods for estimating the
indices. These techniques are used for the estimation of bias as well as of standard errors
under simple random sampling and complex sampling. Three confidence intervals are discussed, viz.
the standard (asymptotic), the percentile and the bootstrap-t confidence intervals.
There is also an overview of welfare indices and their estimation. These welfare
indices form two categories, viz. measures of poverty and measures of inequality. Also
included in this discussion are the at-risk-of-poverty and quintile share ratio indices,
which form part of the Laeken indices. Two poverty lines, an absolute and a relative
line, are used in this study. The absolute poverty line is based on percy, the ratio of
the total household income to the household size, while the relative poverty line is based
on equivalized income, the ratio of the total household income to the equivalized
household size. The data set that served as surrogate population in this study is the
Income and Expenditure Survey of 2005/2006 conducted by Statistics South Africa.
Information regarding this survey is also given. Simulated data from the surrogate
population were analysed by means of the resampling techniques mentioned. The
results of the simulation are presented and discussed by means of graphs and tables.
Two questions arose from the simulation, viz. whether the design of a survey, and thus
weighting, ought to be taken into account, and whether the resampling techniques
deliver reliable results, especially in the case of the confidence intervals. The results
obtained varied considerably. It was, however, determined that weighting in general
delivered promising results in many of the cases, but not in all of them. In particular,
it improved the coverage probabilities of the confidence intervals. It was also
determined, by looking at the standard errors of the bootstrap estimators, that the
bootstrap technique delivered reliable results. Further research possibilities are
mentioned as potential improvements on the mixed results obtained.
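To make the welfare indices and the resampling ideas above concrete, the sketch below computes a weighted at-risk-of-poverty rate (share of the population below 60% of the weighted median income) and a quintile share ratio from simulated data, with percentile bootstrap confidence intervals obtained by resampling whole primary sampling units within strata. The data, weights and design structure are invented, and the simple PSU bootstrap shown here is cruder than the procedures evaluated in the study.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical complex sample: income, weight, stratum and PSU per household member.
n = 4000
stratum = rng.integers(0, 5, size=n)
psu = stratum * 100 + rng.integers(0, 20, size=n)     # 20 PSUs per stratum
income = rng.lognormal(mean=8.0 + 0.1 * stratum, sigma=0.7, size=n)
weight = rng.uniform(1.0, 4.0, size=n)

def weighted_quantile(x, w, q):
    order = np.argsort(x)
    cw = np.cumsum(w[order])
    return np.interp(q * cw[-1], cw, x[order])

def welfare_indices(income, weight):
    """At-risk-of-poverty rate (below 60% of the weighted median) and an
    approximate quintile share ratio S80/S20 (top vs bottom 20% income share)."""
    poverty_line = 0.6 * weighted_quantile(income, weight, 0.5)
    arpr = np.sum(weight[income < poverty_line]) / np.sum(weight)
    q20 = weighted_quantile(income, weight, 0.2)
    q80 = weighted_quantile(income, weight, 0.8)
    s20 = np.sum((weight * income)[income <= q20])
    s80 = np.sum((weight * income)[income >= q80])
    return arpr, s80 / s20

arpr, qsr = welfare_indices(income, weight)

# Design-respecting bootstrap: resample whole PSUs with replacement within strata.
boot = []
for _ in range(200):
    idx = []
    for s in np.unique(stratum):
        psus = np.unique(psu[stratum == s])
        for g in rng.choice(psus, size=psus.size, replace=True):
            idx.append(np.flatnonzero(psu == g))
    idx = np.concatenate(idx)
    boot.append(welfare_indices(income[idx], weight[idx]))
boot = np.array(boot)
ci = np.percentile(boot, [2.5, 97.5], axis=0)

print(f"at-risk-of-poverty rate: {arpr:.3f}, 95% CI ({ci[0, 0]:.3f}, {ci[1, 0]:.3f})")
print(f"quintile share ratio:    {qsr:.2f}, 95% CI ({ci[0, 1]:.2f}, {ci[1, 1]:.2f})")
```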