1. Time series forecasting and model selection in singular spectrum analysis
De Klerk, Jacques (2002)
Dissertation (PhD)--University of Stellenbosch, 2002

ENGLISH ABSTRACT: Singular spectrum analysis (SSA) originated in the field of Physics. The technique is
non-parametric by nature and inter alia finds application in atmospheric sciences,
signal processing and recently in financial markets. The technique can handle a very
broad class of time series that can contain combinations of complex periodicities,
polynomial or exponential trends. Forecasting techniques are reviewed in this study, and a new coordinate-free joint-horizon k-period-ahead forecasting formulation is derived. The study also considers model selection in SSA, from which it becomes apparent that forward validation results in more stable model selection.
The roots of SSA are outlined and distributional assumptions of signal series are
considered ab initio. Pitfalls that arise in the multivariate statistical theory are
identified.
Different approaches of recurrent one-period-ahead forecasting are then reviewed.
The forecasting approaches are all supplied in algorithmic form to ensure effortless
adaptation to computer programs. Theoretical considerations underlying the forecasting algorithms are also discussed. A new coordinate-free joint-horizon k-period-ahead forecasting formulation is derived and adapted for the multichannel SSA case.
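The recurrent approach can be sketched in a few lines of NumPy. This is a minimal single-channel illustration of the usual embed, decompose, reconstruct and extend recipe, not the thesis's coordinate-free formulation or its Fortran implementation; the function name, window length and rank below are illustrative assumptions:

```python
import numpy as np

def ssa_forecast(series, window, rank, horizon):
    """Single-channel SSA with recurrent k-period-ahead forecasting (sketch)."""
    x = np.asarray(series, dtype=float)
    n, L = len(x), window
    K = n - L + 1
    # Step 1: embed the series into the L x K trajectory (Hankel) matrix.
    X = np.column_stack([x[i:i + L] for i in range(K)])
    # Step 2: truncate the SVD to the leading `rank` eigentriples.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Ur = U[:, :rank]
    Xr = Ur @ np.diag(s[:rank]) @ Vt[:rank]
    # Step 3: diagonal averaging (Hankelisation) recovers the signal.
    signal = np.array([Xr[::-1].diagonal(k).mean() for k in range(-L + 1, K)])
    # Step 4: the last coordinates of the leading eigenvectors yield the
    # coefficients of the linear recurrence used for recurrent forecasting.
    pi = Ur[-1, :]
    R = (Ur[:-1, :] @ pi) / (1.0 - pi @ pi)
    out = list(signal)
    for _ in range(horizon):
        out.append(R @ out[-(L - 1):])
    return np.array(out[n:])
```

The window length trades off resolution of complex periodicities against estimation stability, while the rank determines how many eigentriples of the trajectory matrix are retained as signal.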
Different model selection techniques are then considered. The use of scree diagrams, phase-space portraits, percentage variation explained by eigenvectors, and cross- and forward validation is considered in detail. The non-parametric nature of SSA essentially results in the use of non-parametric model selection techniques.
Finally, the study also considers a commercially available software package and compares it with Fortran code that was developed as part of the study.
AFRIKAANS SUMMARY: Singular spectrum analysis (SSA) has its origin in physics. The technique is non-parametric in nature and finds application in fields such as the atmospheric sciences, signal processing and, recently, financial markets. The technique can handle a wide variety of time series containing combinations of complex periodicities and polynomial and exponential trends. Forecasting techniques are also considered in this study, and a new coordinate-free joint-horizon k-period-ahead forecasting formulation is derived. The study also considers model selection in SSA, from which it is clear that forward validation results in more stable model selection.

The background of SSA is sketched ab initio and distributional assumptions of signal series are considered. Problem cases occurring in the multivariate statistical theory are clearly identified.

Various techniques of repeated one-period-ahead forecasting are then considered. The approaches to forecasting are supplied in algorithmic format, which eases their adaptation to computer programming. Theoretical questions underlying the forecasting algorithms are also considered. A new coordinate-free joint-horizon k-period-ahead forecasting formulation is derived and adapted for the multichannel SSA case.

Different model selection techniques are also considered. The use of "scree" diagrams, phase-space plots, percentage variation explained by eigenvectors, and cross- and forward validation is also addressed. The non-parametric nature of SSA compels the use of non-parametric model selection techniques.

Finally, the study compares a commercial software package with the Fortran source code that was developed as part of this study.
2. Estimating measurement error in blood pressure, using structural equations modelling
Kepe, Lulama Patrick (2004)
Thesis (MSc)--Stellenbosch University, 2004.

ENGLISH ABSTRACT: Any branch of science experiences measurement error to some extent. This may be due to
conditions under which measurements are taken, which may include the subject, the observer, the measurement instrument, and the data collection method. The inexactness
(error) can be reduced to some extent through the study design, but at some level further
reduction becomes difficult or impractical. It then becomes important to determine or
evaluate the magnitude of measurement error and perhaps evaluate its effect on the
investigated relationships. All this is particularly true for blood pressure measurement.
The gold standard for measuring blood pressure (BP) is a 24-hour ambulatory
measurement. However, this technology is not available in Primary Care Clinics in South
Africa and a set of three mercury-based BP measurements is the norm for a clinic visit.
The quality of the standard combination of the repeated measurements can be improved
by modelling the measurement error of each of the diastolic and systolic measurements
and determining optimal weights for the combination of measurements, which will give a
better estimate of the patient's true BP. The optimal weights can be determined through
the method of structural equations modelling (SEM), which allows a richer model than the standard repeated measures ANOVA. SEM models are less restrictive and give more detail than the traditional approaches.
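The role of the optimal weights can be illustrated with a classical true-score sketch: if each clinic reading equals the true BP plus an independent error with its own variance, the minimum-variance combination weights each reading inversely to its error variance. The numbers below are hypothetical, and the SEM fitted in the thesis estimates a richer error structure than this simple model:

```python
import numpy as np

# Hypothetical error variances (mmHg^2) for three successive clinic
# readings, e.g. as estimated from a fitted measurement-error model;
# the first reading is assumed noisiest (a "white-coat" effect).
error_var = np.array([64.0, 36.0, 25.0])

# Minimum-variance (inverse-variance) weights, normalised to sum to one.
w = (1.0 / error_var) / np.sum(1.0 / error_var)   # approx [0.19, 0.33, 0.48]

readings = np.array([142.0, 135.0, 133.0])        # systolic readings, mmHg
print("weighted estimate:", w @ readings)          # approx 135.4
print("unweighted mean  :", readings.mean())       # approx 136.7

# The weighted combination has variance 1 / sum(1/sigma_i^2), about 12.0,
# versus (64 + 36 + 25) / 9, about 13.9, for the simple mean.
```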
Structural equations modelling, which is a special case of covariance structure modelling, has proven useful in the social sciences over the years. Its appeal stems from the fact that it includes multiple regression and factor analysis as special cases. Multi-type
multi-time (MTMT) models are a specific type of structural equations model suited to the modelling of BP measurements. These designs (MTMT models) constitute a variant
of repeated measurement designs and are based on Campbell and Fiske's (1959)
suggestion that the quality of methods (time in our case) can be determined by comparing
them with other methods in order to reveal both the systematic and random errors. MTMT models also showed superiority over other data analysis methods because of their
accommodation of the theory of BP. In particular they proved to be a strong alternative to
be considered for the analysis of BP measurement whenever repeated measures are
available, even when such measures do not constitute equivalent replicates. This thesis focuses on SEM and its application to BP studies conducted in a community survey of Mamre and the Mitchells Plain hypertensive clinic population.
AFRIKAANS SUMMARY: Every branch of science is subject to measurement error to a greater or lesser extent. This is a consequence of the circumstances under which measurements are made, such as the unit being measured, the observer, the measurement instrument and the data collection method. Measurement error can be reduced through the study design, but at a certain point further improvement in precision becomes difficult and impractical. It is then important to determine the extent of the measurement error and to investigate its effect on relationships. These aspects are especially true for the measurement of blood pressure in humans.

The gold standard for measuring blood pressure is a 24-hour ambulatory measurement. This technology is, however, not available in primary health clinics in South Africa, and a set of three mercury-based blood pressure measurements is the norm at a clinic visit. The quality of the standard combination of the repeated measurements can be improved by modelling the measurement error of the diastolic and systolic blood pressure measurements. Determining optimal weights for the linear combination of the measurements leads to a better estimate of the patient's true blood pressure. The weights can be calculated with the method of structural equations modelling (SEM), which offers a richer class of models than the standard repeated measures analysis of variance models. This model has fewer restrictions and thus gives more information than the traditional approaches.

Structural equations modelling, which is a special case of covariance structure modelling, has over the years been applied usefully in the social sciences. Its appeal is a consequence of the fact that multiple linear regression and factor analysis are also special cases of the method. Multiple-type multiple-time (MTMT) models are a specific structural equations model that suits the modelling of blood pressure. This type of model is a variant of the repeated measurement design and is based on Campbell and Fiske's (1959) suggestion that the quality of different methods can be determined by comparing them with other methods, in order thereby to distinguish systematic and stochastic errors. The MTMT model also fits in well with the underlying physiological aspects of blood pressure and its measurement. It is thus a good alternative for studies where the repeated measurements are not equivalent replicates.

This thesis focuses on the structural equations model and its application in hypertension studies conducted in the Mamre community and a hypertension clinic population in Mitchells Plain.
3. Some statistical aspects of LULU smoothers
Jankowitz, Maria Dorothea (2007)
Thesis (PhD (Statistics and Actuarial Science))--University of Stellenbosch, 2007.

The smoothing of time series plays a very important role in various practical applications. Estimating
the signal and removing the noise is the main goal of smoothing. Traditionally linear smoothers were
used, but nonlinear smoothers became more popular through the years.
From the family of nonlinear smoothers, the class of median smoothers, based on order statistics, is the
most popular. A new class of nonlinear smoothers, called LULU smoothers, was developed by using
the minimum and maximum selectors. These smoothers have very attractive mathematical properties.
In this thesis their statistical properties are investigated and compared to those of the class of median smoothers.
Smoothing, together with related concepts, is discussed in general. Thereafter, the class of median smoothers from the literature is discussed. The class of LULU smoothers is defined, their properties
are explained and new contributions are made. The compound LULU smoother is introduced and its
property of variation decomposition is discussed. The probability distributions of some LULU smoothers
with independent data are derived. LULU smoothers and median smoothers are compared according
to the properties of monotonicity, idempotency, co-idempotency, stability, edge preservation, output
distributions and variation decomposition. A comparison is made of their respective abilities for signal
recovery by means of simulations. The success of the smoothers in recovering the signal is measured
by the integrated mean square error and the regression coefficient calculated from the least squares
regression of the smoothed sequence on the signal. Finally, LULU smoothers are practically applied.
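The minimum and maximum selectors underlying the LULU operators can be sketched directly. The code below gives the basic L_n and U_n smoothers (a maximum of running minima and a minimum of running maxima, respectively); boundary padding is omitted for clarity, so each pass shortens the sequence, and compound smoothers arise by composing such operators:

```python
import numpy as np

def running_min(x, n):
    """Minimum over each window of n + 1 consecutive points."""
    return np.array([x[i:i + n + 1].min() for i in range(len(x) - n)])

def running_max(x, n):
    """Maximum over each window of n + 1 consecutive points."""
    return np.array([x[i:i + n + 1].max() for i in range(len(x) - n)])

def L_n(x, n):
    """L_n: max of running minima; removes upward impulses of width <= n."""
    return running_max(running_min(x, n), n)

def U_n(x, n):
    """U_n: min of running maxima; removes downward impulses of width <= n."""
    return running_min(running_max(x, n), n)

x = np.array([0, 0, 9, 0, 0, 0, -7, 0, 0], dtype=float)
print(L_n(x, 1))          # upward spike removed, downward one kept
print(U_n(L_n(x, 1), 1))  # composition removes both impulses
```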
4. Variable selection for kernel methods with application to binary classification
Oosthuizen, Surette (2008)
Thesis (PhD (Statistics and Actuarial Science))--University of Stellenbosch, 2008.

The problem of variable selection in binary kernel classification is addressed in this thesis.
Kernel methods are fairly recent additions to the statistical toolbox, having originated
approximately two decades ago in machine learning and artificial intelligence. These
methods are growing in popularity and are already frequently applied in regression and
classification problems.
Variable selection is an important step in many statistical applications: it leads to a better understanding of the problem being investigated, and subsequent analyses of the data frequently yield more accurate results if irrelevant variables have been eliminated.
It is therefore obviously important to investigate aspects of variable selection for kernel
methods.
Chapter 2 of the thesis is an introduction to the main part presented in Chapters 3 to 6. In
Chapter 2 some general background material on kernel methods is firstly provided, along
with an introduction to variable selection. Empirical evidence is presented substantiating
the claim that variable selection is a worthwhile enterprise in kernel classification
problems. Several aspects which complicate variable selection in kernel methods are
discussed.
An important property of kernel methods is that the original data are effectively transformed before a classification algorithm is applied to them. The space in which the
original data reside is called input space, while the transformed data occupy part of a
feature space. In Chapter 3 we investigate whether variable selection should be performed
in input space or rather in feature space. A new approach to selection, so-called feature-to-input space selection, is also proposed. This approach has the attractive property of combining information generated in feature space with easy interpretation in input space. An empirical study reveals that effective variable selection requires utilisation of at least
some information from feature space.
Having confirmed in Chapter 3 that variable selection should preferably be done in feature
space, the focus in Chapter 4 is on two classes of selection criteria operating in feature
space: criteria which are independent of the specific kernel classification algorithm and
criteria which depend on this algorithm. In this regard we concentrate on two kernel
classifiers, viz. support vector machines and kernel Fisher discriminant analysis, both of
which are described in some detail in Chapter 4. The chapter closes with a simulation
study showing that two of the algorithm-independent criteria are very competitive with the
more sophisticated algorithm-dependent ones.
In Chapter 5 we incorporate a specific strategy for searching through the space of variable
subsets into our investigation. Evidence in the literature strongly suggests that backward
elimination is preferable to forward selection in this regard, and we therefore focus on
recursive feature elimination. Zero- and first-order forms of the new selection criteria
proposed earlier in the thesis are presented for use in recursive feature elimination and their
properties are investigated in a numerical study. It is found that some of the simpler zero-order criteria perform better than the more complicated first-order ones.
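Recursive feature elimination of this kind can be illustrated with a generic backward-elimination wrapper around a support vector machine. The criterion below (drop the variable whose removal costs the least cross-validated accuracy) is a simple stand-in for the zero- and first-order feature-space criteria of the thesis, and all names and settings are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def recursive_elimination(X, y, n_keep):
    """Backward elimination for a kernel classifier (sketch)."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        scores = []
        for j in active:
            subset = [k for k in active if k != j]
            # Score each candidate subset by cross-validated accuracy.
            acc = cross_val_score(SVC(kernel="rbf"), X[:, subset], y, cv=5).mean()
            scores.append((acc, j))
        best_acc, drop = max(scores)   # least harmful removal
        active.remove(drop)
    return active

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
print(recursive_elimination(X, y, n_keep=3))
```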
Up to the end of Chapter 5 it is assumed that the number of variables to select is known.
We do away with this restriction in Chapter 6 and propose a simple criterion which uses the
data to identify this number when a support vector machine is used. The proposed criterion
is investigated in a simulation study and compared to cross-validation, which can also be
used for this purpose. We find that the proposed criterion performs well.
The thesis concludes in Chapter 7 with a summary and several suggestions for further research.
5. Statistical inference for inequality measures based on semi-parametric estimators
Kpanzou, Tchilabalo Abozou (2011)
Thesis (PhD)--Stellenbosch University, 2011.

ENGLISH ABSTRACT: Measures of inequality, also used as measures of concentration or diversity, are very popular in economics
and especially in measuring the inequality in income or wealth within a population and between
populations. However, they have applications in many other fields, e.g. in ecology, linguistics, sociology,
demography, epidemiology and information science.
A large number of measures have been proposed to measure inequality. Examples include the Gini
index, the generalized entropy, the Atkinson and the quintile share ratio measures. Inequality measures
are inherently dependent on the tails of the population (underlying distribution) and therefore their
estimators are typically sensitive to data from these tails (nonrobust). For example, income distributions
often exhibit a long tail to the right, leading to the frequent occurrence of large values in samples. Since
the usual estimators are based on the empirical distribution function, they are usually nonrobust to such
large values. Furthermore, heavy-tailed distributions often occur in real-life data sets, and remedial action therefore needs to be taken in such cases.
The remedial action can be either a trimming of the extreme data or a modification of the (traditional)
estimator to make it more robust to extreme observations. In this thesis we follow the second option,
modifying the traditional empirical distribution function as estimator to make it more robust. Using results
from extreme value theory, we develop more reliable distribution estimators in a semi-parametric
setting. These new estimators of the distribution then form the basis for more robust estimators of the
measures of inequality. These estimators are developed for the four most popular classes of measures,
viz. Gini, generalized entropy, Atkinson and quintile share ratio. Properties of such estimators
are studied, especially via simulation. Using limiting distribution theory and the bootstrap methodology, approximate confidence intervals are derived. Through the various simulation studies, the proposed
estimators are compared to the standard ones in terms of mean squared error, relative impact of contamination,
confidence interval length and coverage probability. In these studies the semi-parametric
methods show a clear improvement over the standard ones. The theoretical properties of the quintile
share ratio have not been studied much. Consequently, we also derive its influence function as well as
the limiting normal distribution of its nonparametric estimator. These results have not previously been
published.
In order to illustrate the methods developed, we apply them to a number of real life data sets. Using
such data sets, we show how the methods can be used in practice for inference. In order to choose
between the candidate parametric distributions, use is made of a measure of sample representativeness
from the literature. These illustrations show that the proposed methods can be used to reach
satisfactory conclusions in real-life problems.
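As a concrete instance of the inference problem, the sketch below computes the standard empirical Gini index for a simulated heavy-tailed sample together with a percentile bootstrap confidence interval. The semi-parametric estimators of the thesis would first replace the empirical tail with an extreme-value fit; this sketch uses the ordinary estimator purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(x):
    """Gini index from the empirical distribution function."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    # Standard order-statistics form: sum((2i - n - 1) x_(i)) / (n sum x).
    return np.sum((2 * i - n - 1) * x) / (n * np.sum(x))

# Illustrative heavy-tailed "income" sample (classical Pareto, alpha = 2.5).
sample = rng.pareto(2.5, size=500) + 1.0

# Percentile bootstrap confidence interval for the Gini index.
boot = np.array([gini(rng.choice(sample, size=sample.size, replace=True))
                 for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Gini = {gini(sample):.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```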
AFRIKAANS SUMMARY: Measures of inequality, which are also used as measures of concentration or diversity, are very popular in economics, especially for the quantification of inequality in income or wealth within a population and between populations. They also have applications in many other disciplines, for example ecology, linguistics, sociology, demography, epidemiology and information science.

Various measures for measuring inequality already exist. Examples include the Gini index, the generalized entropy measure, the Atkinson measure and the quintile share ratio. Measures of inequality are inherently dependent on the tails of the population (underlying distribution), and estimators for them are thus typically sensitive to data from such tails (nonrobust). Income distributions, for example, often have long right tails, which can lead to the occurrence of large values in samples. The traditional estimators are based on the empirical distribution function and are thus usually nonrobust to such large values. Since heavy-tailed distributions often occur in real data, corrections must be made in such cases.

These corrections can consist of either trimming extreme data or adjusting traditional estimators to make them more robust to extreme values. In this thesis the second option is followed, in that the traditional empirical distribution function as estimator is adjusted to make it more robust. By making use of results from extreme value theory, more reliable estimators of distributions are developed in a semi-parametric setting. These new estimators of the distribution then form the basis for more robust estimators of measures of inequality. These estimators are developed for the four most popular classes of measures, namely the Gini, generalized entropy, Atkinson and quintile share ratio measures. Properties of these estimators are studied, mainly by means of simulation studies. Approximate confidence intervals are developed by making use of limiting distribution theory and the bootstrap methodology. The proposed estimators are compared with traditional estimators by means of various simulation studies. The comparison is done in terms of mean squared error, relative impact of contamination, confidence interval length and coverage probability. In these studies the semi-parametric methods show a clear improvement over the traditional methods. The theoretical properties of the quintile share ratio have not yet received much attention in the literature. Consequently we derive its influence function as well as the asymptotic distribution of its non-parametric estimator.

In order to illustrate the methods that have been developed, they are applied to a number of real data sets. These applications show how the methods can be used for inference in practice. A method from the literature for sample representativeness is proposed and used to make a choice between the candidate parametric distributions. These examples show that the proposed methods can be used fruitfully to reach satisfactory conclusions in practice.
6. Aspects of model development using regression quantiles and elemental regressions
Ranganai, Edmore (2007)
Dissertation (PhD)--University of Stellenbosch, 2007.

ENGLISH ABSTRACT: It is well known that ordinary least squares (OLS) procedures are sensitive to deviations from
the classical Gaussian assumptions (outliers) as well as data aberrations in the design space.
The two major data aberrations in the design space are collinearity and high leverage.
Leverage points can also induce or hide collinearity in the design space. Such leverage points
are referred to as collinearity influential points. As a consequence, over the years, many
diagnostic tools to detect these anomalies as well as alternative procedures to counter them
were developed. To counter deviations from the classical Gaussian assumptions many robust
procedures have been proposed. One such class of procedures is the Koenker and Bassett (1978) Regression Quantiles (RQs), which are natural extensions of order statistics to the linear model. RQs can be found as solutions to linear programming problems (LPs). The basic
optimal solutions to these LPs (which are RQs) correspond to elemental subset (ES)
regressions, which consist of subsets of minimum size to estimate the necessary parameters of
the model.
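The LP formulation can be made concrete as follows: the tau-th regression quantile minimises the asymmetrically weighted absolute residuals, which becomes a linear programme once the coefficient vector is split into positive and negative parts. A minimal sketch (illustrative names; scipy's general-purpose solver rather than the specialised algorithms used in practice):

```python
import numpy as np
from scipy.optimize import linprog

def regression_quantile(X, y, tau):
    """Koenker-Bassett regression quantile via its LP formulation.

    Minimise tau*sum(u) + (1-tau)*sum(v) subject to y = X beta + u - v,
    with u, v >= 0 and beta = beta_plus - beta_minus, so that all LP
    variables are non-negative.
    """
    n, p = X.shape
    c = np.concatenate([np.zeros(2 * p), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * p + 2 * n), method="highs")
    return res.x[:p] - res.x[p:2 * p]

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])
y = 2 + 0.5 * X[:, 1] + rng.standard_normal(50)
print(regression_quantile(X, y, tau=0.5))   # median regression, near [2, 0.5]
```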
On the one hand, some ESs correspond to RQs. On the other hand, in the literature it is shown
that many OLS statistics (estimators) are related to ES regression statistics (estimators).
Therefore there is an inherent relationship amongst the three sets of procedures. The
relationship between the ES procedure and the RQ one has been noted almost "casually" in the literature, while the latter has been fairly widely explored. Using these existing
relationships between the ES procedure and the OLS one as well as new ones, collinearity,
leverage and outlier problems in the RQ scenario were investigated. Also, a lasso procedure was proposed as a variable selection technique in the RQ scenario, and some tentative results were given for it. These results are promising.
Single case diagnostics were considered as well as their relationships to multiple case ones. In
particular, multiple cases of the minimum size to estimate the necessary parameters of the
model, were considered, corresponding to a RQ (ES). In this way regression diagnostics were
developed for both ESs and RQs. The main problems that affect RQs adversely are
collinearity and leverage due to the nature of the computational procedures and the fact that
RQs’ influence functions are unbounded in the design space but bounded in the response
variable. As a consequence of this, RQs have a high affinity for leverage points and a high
exclusion rate of outliers. The influential picture exhibited in the presence of both leverage points and outliers is the net result of these two antagonistic forces. Although RQs are
bounded in the response variable (and therefore fairly robust to outliers), outlier diagnostics
were also considered in order to have a more holistic picture.
The investigations comprised analytic methods as well as simulation. Furthermore,
applications were made to artificial computer generated data sets as well as standard data sets
from the literature. These revealed that the ES based statistics can be used to address
problems arising in the RQ scenario to some degree of success. However, due to the
interdependence between the different aspects, viz. the one between leverage and collinearity
and the one between leverage and outliers, “solutions” are often dependent on the particular
situation. In spite of this complexity, the research did produce some fairly general guidelines
that can be fruitfully used in practice.
AFRIKAANS SUMMARY: It is known that ordinary least squares (OLS) procedures are sensitive to deviations from the classical Gaussian assumptions (outliers), as well as to data aberrations in the design space. Two types of aberrations of importance in the latter case are collinearity and points with high leverage. The latter points can also induce or hide collinearity in the design. Such points are referred to as collinearity-influential leverage points. Over the years many diagnostic tools have been developed to identify these aberrations, and alternative procedures have been developed against them. To counter deviations from the Gaussian assumption, a good number of robust procedures have been developed. One such class of procedures is the Koenker and Bassett (1978) Regression Quantiles (RQs), which are natural extensions of order statistics to the linear model. RQs can be determined as solutions of linear programming problems (LPs). The basic optimal solutions of these LPs (which are RQs) correspond to the elemental subset (ES) regressions, which consist of subsets of minimum size with which the parameters of the model can be estimated.

On the one hand, certain ESs correspond to RQs. On the other hand, it is known from the literature that many OLS statistics (estimators) are related to ES regression statistics (estimators). This implies that there is an inherent relationship between the three classes of procedures. The relationship between the ES and the corresponding RQ procedures has been mentioned rather "casually" in the literature, while the latter procedures have been investigated fairly extensively. By making use of existing relationships between ES and OLS procedures, as well as new ones that were developed, collinearity, points with high leverage values and outlier problems in the RQ setting were investigated. Furthermore, a lasso procedure was proposed as a variable selection technique in the RQ situation, and some tentative results were given for it. These results appear to be promising, also with a view to further research.

Single-case diagnostic techniques were considered, as well as their relationship with multiple-case techniques. In particular, multiple cases of minimum size for estimating the parameters of the model, corresponding to an RQ (ES), were considered. With this approach, regression diagnostic techniques were developed for both ESs and RQs. The most important problems affecting RQs negatively are collinearity and points with high leverage values, owing to the nature of the computational procedures and the fact that the influence functions of RQs are bounded in the space of the dependent variable but unbounded in the design space. Consequently RQs have a high affinity for points with high leverage values and usually attempt to exclude outliers. The final outcome obtained when both points with high leverage values and outliers occur is then the net result of these two conflicting tendencies. Although RQs are bounded in the dependent variable (and thus reasonably robust with respect to outliers), outlier diagnostic techniques were also considered in order to obtain a more holistic picture.

The investigation used analytic as well as simulation techniques. Furthermore, use was also made of artificial data sets and standard data sets from the literature. These investigations showed that the ES-based statistics can be used with a reasonable degree of success to address problems in the RQ setting. It is, however, important to note that as a result of the interdependence between collinearity and points with high leverage values, as well as that between points with high leverage values and outliers, "solutions" are often dependent on the particular situation. In spite of this complexity, the research that was done nevertheless yielded fairly general guidelines that can be used fruitfully in practice.
7. Improved estimation procedures for a positive extreme value index
Berning, Thomas Louw (2010)
Thesis (PhD (Statistics))--University of Stellenbosch, 2010.

ENGLISH ABSTRACT: In extreme value theory (EVT) the emphasis is on extreme (very small or very large) observations. The crucial parameter when making inferences about extreme quantiles is called the extreme value index (EVI). This thesis concentrates on only the right tail of the underlying distribution (extremely large observations), and specifically on situations where the EVI is assumed to be positive. A positive EVI indicates that the underlying distribution of the data has a heavy right tail, as is the case with, for example, insurance claims data.
There are numerous areas of application of EVT, since there are a vast number of situations in which one would be interested in predicting extreme events accurately. Accurate prediction requires accurate estimation of the EVI, which has received ample attention in the literature from a theoretical as well as practical point of view.
Countless estimators of the EVI exist in the literature, but the practitioner has little information on how these estimators compare. An extensive simulation study was designed and conducted to compare the performance of a wide range of estimators, over a wide range of sample sizes and distributions.
A new procedure for the estimation of a positive EVI was developed, based on fitting the perturbed Pareto distribution (PPD) to observations above a threshold, using Bayesian methodology. Attention was also given to the development of a threshold selection technique.
One of the major contributions of this thesis is a measure which quantifies the stability (or rather instability) of estimates across a range of thresholds. This measure can be used to objectively obtain the range of thresholds over which the estimates are most stable. It is this measure which is used for the purpose of threshold selection for the proposed PPD estimator.
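To illustrate the stability idea, the sketch below computes the classical Hill estimator over a range of thresholds (numbers of upper order statistics k) and scores bands of thresholds by the variability of the estimates within them. The rolling standard deviation used here is only a crude stand-in for the stability measure developed in the thesis, and all settings are illustrative:

```python
import numpy as np

def hill(x, k):
    """Hill estimator of a positive EVI using the k largest observations."""
    xs = np.sort(x)[::-1]                     # descending order statistics
    return np.mean(np.log(xs[:k])) - np.log(xs[k])

rng = np.random.default_rng(0)
claims = rng.pareto(2.0, size=1000) + 1.0     # classical Pareto, true EVI = 0.5

ks = np.arange(10, 500)
est = np.array([hill(claims, k) for k in ks])

# Crude stability score: rolling standard deviation of the estimates over
# a band of thresholds; the band minimising it is taken as "most stable".
width = 50
instab = np.array([est[i:i + width].std() for i in range(len(ks) - width)])
best = np.argmin(instab)
print(f"most stable band starts at k = {ks[best]}, "
      f"EVI estimate there = {est[best:best + width].mean():.3f}")
```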
A case study of five insurance claims data sets illustrates how data sets can be analyzed in practice. It is shown to what extent discretion can/should be applied, as well as how different estimators can be used in a complementary fashion to give more insight into the nature of the data and the extreme tail of the underlying distribution. The analysis is carried out from the raw data through to the construction of tables which can be used directly to gauge the risk of the insurance portfolio over a given time frame.
AFRIKAANS SUMMARY: The field of extreme value theory (EVT) is concerned with extreme (very small or very large) observations. The parameter that is decisive when inferences about extreme quantiles are at issue is the so-called extreme value index (EVI). This dissertation concentrates on only the right tail of the underlying distribution (very large observations), and more specifically on situations where it is assumed that the EVI is positive. A positive EVI indicates that the underlying distribution has a heavy right tail, as is for example the case with insurance claims data.

There are various fields in which EVT is applied, since there are a large number of situations in which one would be interested in accurately predicting extreme events. Accurate prediction requires the accurate estimation of the EVI, which has already received ample attention in the literature, from both theoretical and practical points of view.

A large number of estimators of the EVI exist in the literature, but anyone intending to apply EVT in practice has little information on how these estimators compare with one another. An extensive simulation study was designed and carried out to compare the estimation accuracy of a large variety of estimators from the literature. The study includes a large variety of sample sizes and underlying distributions.

A new procedure for the estimation of a positive EVI was developed, based on fitting the perturbed Pareto distribution (PPD) to observations exceeding a given threshold, making use of Bayesian techniques. Attention was also given to the development of a threshold selection method.

One of the main contributions of this dissertation is a measure that quantifies the stability (or rather instability) of estimates across several thresholds. This measure offers an objective way to obtain a region (a set of threshold values) over which the estimates are most stable. It is this measure that is used to perform threshold selection in the case of the PPD estimator.

A case study of five sets of insurance claims data demonstrates how data can be analyzed in practice. It is shown to what extent discretion can/must be applied, as well as how different estimators can be employed in a complementary fashion to gain more insight into the nature of the data and the tail of the underlying distribution. The analysis is carried out from the point where only raw data are available to the point where tables have been compiled that can be used directly to determine the risk of the insurance portfolio over a given period.
8. Edgeworth-corrected small-sample confidence intervals for ratio parameters in linear regression
Binyavanga, Kamanzi-wa (2002)
Dissertation (PhD)--Stellenbosch University, 2002.

ENGLISH ABSTRACT: In this thesis we construct a central confidence interval for a smooth scalar non-linear function of
the parameter vector β in a single general linear regression model Y = Xβ + ε. We do this by first developing an Edgeworth expansion for the distribution function of a standardised point estimator.
The confidence interval is then constructed in the manner discussed. Simulation studies reported at
the end of the thesis show the interval to perform well in many small-sample situations.
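For orientation, the generic one-term Edgeworth expansion for a standardised estimator T_n with third cumulant κ₃ has the textbook form below; the thesis derives the regression-specific version of such an expansion using index notation:

```latex
% Generic one-term Edgeworth expansion (textbook form), where \Phi and
% \phi are the standard normal distribution function and density:
P(T_n \le x) = \Phi(x) - \phi(x)\,\frac{\kappa_3}{6\sqrt{n}}\,(x^2 - 1) + O(n^{-1})
```

The skewness term of order n^{-1/2} is precisely what small-sample corrections target, and its removal is the aim of skewness-reduction methods such as Hall's.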
Central to the development of the Edgeworth expansion is our use of the index notation which, in
statistics, has been popularised by McCullagh (1984, 1987).
The contributions made in this thesis are of two kinds. We revisit the complex McCullagh Index
Notation, modify and extend it in certain respects, as well as repackage it in a manner that is more accessible to other researchers.
Regarding the new contributions, in addition to the introduction of a new small-sample confidence interval, we extend the theory of stochastic polynomials (SPs) in three respects. Firstly, a method, which we believe to be the simplest and most transparent to date, is proposed for deriving their cumulants. Secondly, the theory of the cumulants of SPs is developed both in the context of Edgeworth expansions and in the regression setting. Thirdly, our new method enables us to propose a natural alternative to
the method of Hall (1992a, 1992b) regarding skewness-reduction in Edgeworth expansions.
AFRIKAANS SUMMARY: In this dissertation attention is given to the construction of a central confidence interval for a smooth scalar non-linear function of the parameter vector β in a single general linear regression model y = Xβ + ε. This firstly involves the development of an Edgeworth expansion for the distribution function of a standardised point estimator. The confidence interval is then constructed on the basis of this expansion. Simulation studies reported at the end of the dissertation show that the proposed interval performs well in various small-sample situations.

The use of index notation, which was introduced into statistics by McCullagh (1984, 1987), plays a central role in the development of the Edgeworth expansion.

The contribution made in this dissertation is of a twofold nature. The complicated Index Notation of McCullagh is examined, adapted and extended with respect to certain aspects. The notation is also presented in a form that will hopefully make it more accessible to other researchers.

Concerning the contributions made, a new small-sample confidence interval is proposed, and the theory of stochastic polynomials (SPs) is also extended in three respects. Firstly, a method is proposed for deriving the cumulants of SPs; we believe this to be the clearest and simplest method proposed for this purpose to date. Secondly, the theory of the cumulants of SPs is developed within the context of Edgeworth expansions, as well as in the context of regression. Thirdly, our new method enables us to propose a natural alternative to the method of Hall (1992a, 1992b) for the reduction of skewness in Edgeworth expansions.
9. Influential data cases when the C-p criterion is used for variable selection in multiple linear regression
Uys, Daniel Wilhelm (2003)
Dissertation (PhD)--Stellenbosch University, 2003.

ENGLISH ABSTRACT: In this dissertation we study the influence of data cases when the Cp criterion of Mallows (1973)
is used for variable selection in multiple linear regression. The influence is investigated in
terms of the predictive power and the predictor variables included in the resulting model when
variable selection is applied. In particular, we focus on the importance of identifying and
dealing with these so-called selection influential data cases before model selection and fitting
are performed. For this purpose we develop two new selection influence measures, both based
on the Cp criterion. The first measure is specifically developed to identify individual selection
influential data cases, whereas the second identifies subsets of selection influential data cases.
The success with which these influence measures identify selection influential data cases is evaluated in example data sets and in simulation. All results are derived in the coordinate-free context, with special application in multiple linear regression.
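The criterion itself is simple to compute: for a subset model with p parameters, Cp = RSS_p/s² − n + 2p, where s² is the residual mean square of the full model, and subsets whose Cp is close to p are consistent with unbiasedness. A small sketch with simulated data (all names illustrative):

```python
import numpy as np

def mallows_cp(X_full, X_sub, y):
    """Mallows (1973) Cp for a subset model: RSS_p / s^2 - n + 2p."""
    n = len(y)
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r
    s2 = rss(X_full) / (n - X_full.shape[1])   # full-model error variance
    p = X_sub.shape[1]
    return rss(X_sub) / s2 - n + 2 * p

rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)
# The subset containing the two truly active predictors (plus intercept)
# should have Cp close to its parameter count of 3.
print(mallows_cp(X, X[:, :3], y))
```

Because both RSS_p and s² depend on individual observations, a single influential case can change which subset minimises Cp, which is the selection influence the dissertation quantifies.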
AFRIKAANS SUMMARY: Influential observations when the Cp criterion is used for variable selection in multiple linear regression: In this dissertation we investigate the influence of observations when the Cp criterion of Mallows (1973) is used for variable selection in multiple linear regression. The influence of observations on the predictive power and on the independent variables included in the final selected model is investigated. In particular, we focus on the importance of the identification and handling of so-called selection-influential observations before model selection and fitting are done. For this purpose two new influence measures, both based on the Cp criterion, are developed. The first measure is developed specifically to measure the influence of individual observations, while the second measures the influence of subsets of observations on the selection process. The success with which these influence measures identify selection-influential observations is assessed in example data sets and in simulation. All results are derived within the coordinate-free context, with special application in multiple linear regression.
10. Evaluating the properties of sensory tests using computer intensive and biplot methodologies
Meintjes, M. M. (Maria Magdalena) (2007)
Assignment (MComm)--University of Stellenbosch, 2007.

ENGLISH ABSTRACT: This study is the result of part-time work done at a product development centre. The organisation makes extensive use of trained panels in sensory trials designed to assess the quality of its products. Although standard statistical procedures are used for analysing the results arising from these trials, circumstances necessitate deviations from the prescribed protocols. The validity of conclusions drawn from these testing procedures might therefore be questionable. This assignment deals with these questions.
Sensory trials are vital in the development of new products, control of quality levels and the exploration of improvement in current products. Standard test procedures used to explore such questions exist but are in practice often implemented by investigators who have little or no statistical background. Thus test methods are implemented as black boxes and procedures are used blindly without checking all the appropriate assumptions and other statistical requirements. The specific product under consideration often warrants certain modifications to the standard methodology. These changes may have some unknown effect on the obtained results and therefore should be scrutinized to ensure that the results remain valid.
The aim of this study is to investigate the distribution and other characteristics of sensory data, comparing the hypothesised, observed and bootstrap distributions. Furthermore, the standard testing methods used to analyse sensory data sets will be evaluated. After comparing these methods, alternative testing methods may be introduced and then tested using newly generated data sets.
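As a small illustration of comparing hypothesised, observed and bootstrap distributions, consider a triangle test, a common sensory protocol in which each panellist must pick the odd sample out of three, so the null success probability is 1/3. The counts below are hypothetical, and the trials analysed in the assignment may follow different protocols:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical triangle-test results: 18 of 42 panellists identified the
# odd sample; under the null hypothesis the success probability is 1/3.
n, correct = 42, 18

# Hypothesised null distribution of the count is Binomial(n, 1/3);
# the p-value is P(X >= correct) under that distribution.
p_value = stats.binom.sf(correct - 1, n, 1.0 / 3.0)

# Bootstrap counterpart: resample the observed 0/1 responses to see how
# the observed distribution of the count compares with the hypothesised one.
responses = np.array([1] * correct + [0] * (n - correct))
boot = rng.choice(responses, size=(5000, n), replace=True).sum(axis=1)

print(f"binomial p-value = {p_value:.3f}")
print(f"bootstrap 95% interval for the count: "
      f"{np.percentile(boot, 2.5):.0f} to {np.percentile(boot, 97.5):.0f}")
```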
Graphical displays are also useful to get an overall impression of the data under consideration. Biplots are especially useful in the investigation of multivariate sensory data. The underlying relationships among attributes and their combined effect on the panellists' decisions can be visually investigated by constructing a biplot. Results obtained by implementing biplot methods are compared to those of sensory tests, i.e. whether a significant difference between objects corresponds to large distances between the points representing those objects in the display. In conclusion, some recommendations are made as to how the organisation under consideration should implement sensory procedures in future trials. However, these proposals are preliminary, and further research is necessary before final adoption. Some issues for further investigation are suggested.
AFRIKAANS SUMMARY: This study arises from part-time work at a product development centre. In all its sensory trials concerning the quality of its products, the organisation makes large-scale use of trained panels. Although standard procedures are used to analyse the results, certain circumstances necessitate that the prescribed protocol be implemented in an adapted form. These adaptations may mean that conclusions based on the results are invalid. This assignment investigates the above problem.

Sensory trials are essential in quality control, the improvement of existing products, and the development of new products. Standard test procedures exist to explore such questions, but these are often applied by researchers with little or no statistical knowledge. This leads to test procedures being implemented blindly and results being interpreted without checking the necessary assumptions and other statistical requirements. Although a specific product may justify modification of the standard method, these changes can have a large influence on the results. The validity of the results must therefore be examined carefully.

The aim of this study is to study the distribution as well as other properties of sensory data, by considering the distribution under the null hypothesis as well as the observed and bootstrap distributions. Furthermore, the standard test procedure currently in use for analysing sensory data also receives attention. Thereafter, alternative test procedures are proposed and evaluated on newly generated data sets.

Graphical displays are also useful for obtaining an overall picture of the data under discussion. Biplots are particularly handy for studying multidimensional sensory data. The underlying relationship between the attributes of a product, as well as their combined effect on a panel's decision, can thereby be investigated visually. Results obtained in the displays are compared with those of sensory test procedures to determine whether statistically significant differences in a product correspond to large distances between the relevant points in the biplot display.

Finally, certain recommendations regarding the implementation of sensory trials in future are made to the organisation concerned. These recommendations are made on the basis of the preceding investigations, but further research is needed before their final adoption. Where possible, suggestions for further investigation are made.