Global ETD Search

361	Aspects of the pre- and post-selection classification performance of discriminant analysis and logistic regression Louw, Nelmarie 12 1900 (has links) Thesis (PhD)--Stellenbosch University, 1997. / One copy microfiche. / ENGLISH ABSTRACT: Discriminant analysis and logistic regression are techniques that can be used to classify entities of unknown origin into one of a number of groups. However, the underlying models and assumptions for application of the two techniques differ. In this study, the two techniques are compared with respect to classification of entities. Firstly, the two techniques were compared in situations where no data-dependent variable selection took place. Several underlying distributions were studied: the normal distribution, the double exponential distribution and the lognormal distribution. The number of variables, sample sizes from the different groups and the correlation structure between the variables were varied to obtain a large number of different configurations. The cases of two and three groups were studied. The most important conclusions are: for normal and double exponential data, linear discriminant analysis outperforms logistic regression, especially in cases where the ratio of the number of variables to the total sample size is large. For lognormal data, logistic regression should be preferred, except in cases where the ratio of the number of variables to the total sample size is large. Variable selection is frequently the first step in statistical analyses. A large number of potentially important variables are observed, and an optimal subset has to be selected for use in further analyses. Despite the fact that variable selection is often used, the influence of a selection step on further analyses of the same data is often completely ignored. An important aim of this study was to develop new selection techniques for use in discriminant analysis and logistic regression. 
New estimators of the post-selection error rate were also developed. A new selection technique, cross model validation (CMV), that can be applied both in discriminant analysis and logistic regression, was developed. This technique combines the selection of variables and the estimation of the post-selection error rate. It provides a method to determine the optimal model dimension, to select the variables for the final model and to estimate the post-selection error rate of the discriminant rule. An extensive Monte Carlo simulation study comparing the CMV technique to existing procedures in the literature was undertaken. In general, this technique outperformed the other methods, especially with respect to the accuracy of estimating the post-selection error rate. Finally, pre-test type variable selection was considered. A pre-test estimation procedure was adapted for use as a selection technique in linear discriminant analysis. In a simulation study, this technique was compared to CMV, and was found to perform well, especially with respect to correct selection. However, this technique is only valid for uncorrelated normal variables, and its applicability is therefore limited. A numerically intensive approach was used throughout the study, since the problems that were investigated are not amenable to an analytical approach. / AFRIKAANS SUMMARY: Linear discriminant analysis and logistic regression are techniques that can be used to classify items of unknown origin into one of a number of groups. The underlying models and assumptions for the use of the two techniques differ, however. In this study the two techniques are compared with respect to the classification of items. Firstly, the two techniques are compared in a setting where no data-dependent selection of variables takes place. Several underlying distributions were studied: the normal distribution, the double exponential distribution and the lognormal distribution. 
The number of variables, the sample sizes from the respective groups and the correlation structure between the variables were varied to obtain a large number of configurations. The cases of two and three groups were studied. The most important conclusions that can be drawn from the study are: for normal and double exponential data, linear discriminant analysis performs better than logistic regression, especially in cases where the ratio of the number of variables to the total sample size is large. For data from a lognormal distribution, logistic regression should be the method of choice, unless the ratio of the number of variables to the total sample size is large. Variable selection is often the first step in statistical analyses. A large number of potentially important variables are observed, and an optimal subset is chosen for use in the further analyses. Despite the fact that variable selection is often used, the influence of a selection step on further analyses of the same data is often completely ignored. An important aim of the study was to develop new selection techniques that can be used in discriminant analysis and logistic regression. Attention was also given to the development of estimators of the error rate of a discriminant function formed with selected variables. A new selection technique, cross model validation (CMV), which can be used for the selection of variables in both discriminant analysis and logistic regression, was developed. This technique handles the selection of variables and the estimation of the post-selection error rate in one step, and provides a method to determine the optimal model dimension, to choose the variables that should be included in the model, and to estimate the post-selection error rate of the discriminant function. 
An extensive simulation study, in which the proposed CMV technique was compared with other procedures in the literature, was undertaken for both discriminant analysis and logistic regression. In general, this technique performed better than the other methods considered, especially with respect to the accuracy with which the post-selection error rate is estimated. Finally, attention was also given to pre-test type selection. A technique was developed that uses a pre-test estimation method to select variables for inclusion in a linear discriminant function. The technique was compared with the CMV technique in a simulation study and performs very well, especially with respect to correct selection. However, this technique is valid only for uncorrelated normal variables, which limits its use. A numerically intensive approach was used throughout the study. This was necessitated by the fact that the problems investigated cannot be handled by means of an analytical approach. Discriminant analysis Multivariate analysis Regression analysis Dissertations -- Mathematical statistics
362	Positioning patterns from multidimensional data and its applications in meteorology Wong, Ka-yan, 王嘉欣 January 2008 (has links) published_or_final_version / abstract / Computer Science / Doctoral / Doctor of Philosophy Meteorology - Data processing. Genetic algorithms.
363	Statistical analysis of proteomic mass spectrometry data Handley, Kelly January 2007 (has links) This thesis considers the statistical modelling and analysis of proteomic mass spectrometry data. Proteomics is a relatively new field of study and tried and tested methods of analysis do not yet exist. Mass spectrometry output is high-dimensional and so we firstly develop an algorithm to identify peaks in the spectra in order to reduce the dimensionality of the datasets. We use the results along with a variety of classification methods to examine the classification of new spectra based on a training set. Another method to reduce the complexity of the problem is to fit a parametric model to the data. We model the data as a mixture of Gaussian peaks with parameters representing the peak locations, heights and variances, and apply a Bayesian Markov chain Monte Carlo (MCMC) algorithm to obtain their estimates. These resulting estimates are used to identify m/z values where differences are apparent between groups, where the m/z value of an ion is its mass divided by its charge. A multilevel modelling framework is also considered to incorporate the structure in the data, and locations exhibiting differences are again obtained. We consider two mass spectrometry datasets in detail. The first consists of mass spectra from breast cancer cells which either have or have not been treated with the chemotherapeutic agent Taxol. The second consists of mass spectra from melanoma cells classified as stage I or stage IV using the TNM system. Using the MCMC and multilevel techniques described above we show that, in both datasets, small subsets of the available m/z values can be identified which exhibit significant differences in protein expression between groups. We also see that good classification of new data can be achieved using a small number of m/z values and that the classification rate does not fall greatly when compared with results from the complete spectra. 
For both datasets we compare our results with those in the literature which use other techniques on the same data. We conclude by discussing potential areas for further research. 572.36015195
364	Robustness of Parametric and Nonparametric Tests When Distances between Points Change on an Ordinal Measurement Scale Chen, Andrew H. (Andrew Hwa-Fen) 08 1900 (has links) The purpose of this research was to evaluate the effect on parametric and nonparametric tests using ordinal data when the distances between points changed on the measurement scale. The research examined the performance of Type I and Type II error rates using selected parametric and nonparametric tests. parametric tests nonparametric tests Mathematical statistics. Statistical hypothesis testing.
365	Multivariate copulas in financial market risk with particular focus on trading strategies and asset allocation 05 November 2012 (has links) D.Comm. / Copulas provide a useful way to model different types of dependence structures explicitly. Instead of having one correlation number that encapsulates everything known about the dependence between two variables, copulas capture information on the level of dependence as well as whether the two variables exhibit other types of dependence, for example tail dependence. Tail dependence refers to the instance where the variables show higher dependence between their extreme values. A copula is defined as a multivariate distribution function with uniform marginals. A useful class of copulas is known as Archimedean copulas that are constructed from generator functions with very specific properties. The main aim of this thesis is to construct multivariate Archimedean copulas by nesting different bivariate Archimedean copulas using the vine construction approach. A characteristic of the vine construction is that not all combinations of generator functions lead to valid multivariate copulas. Established research is limited in that it presents constraints that lead to valid multivariate copulas that can be used to model positive dependence only. The research in this thesis extends the theory by deriving the necessary constraints to model negative dependence as well. Specifically, it ensures that the multivariate copulas that are constructed from bivariate copulas that capture negative dependence, will be able to capture negative dependence as well. Constraints are successfully derived for trivariate copulas. It is, however, shown that the constraints cannot easily be extended to higher-order copulas. The rules on the types of dependence structures that can be nested are also established. 
A number of practical applications in the financial markets where copula theory can be utilized to enhance the more established methodologies, are considered. The first application considers trading strategies based on statistical arbitrage where the information in the bivariate copula structure is utilised to identify trading opportunities in the equity market. It is shown that trading costs adversely affect the profits generated. The second application considers the impact of wrong-way risk on counterparty credit exposure. A trivariate copula is used to model the wrong-way risk. The aim of the analysis is to show how the theory developed in this thesis should be applied where negative correlation is present in a trivariate copula structure. Approaches are considered where conditional and unconditional risk driver scenarios are derived by means of the trivariate copula structure. It is argued that by not allowing for wrong-way risk, a financial institution’s credit pricing and regulatory capital calculations may be adversely affected. The final application compares the philosophy behind cointegration and copula asset allocation techniques to test which approach produces the most profitable index-tracking portfolios over time. The copula asset allocation approach performs well over time; however, it is very computationally intensive. Copulas (Mathematical statistics) Variables (Mathematics) Asset allocation Financial risk
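The Archimedean construction described in entry 365 can be made concrete for the bivariate case. The following is a minimal sketch, not taken from the thesis: it assumes the Clayton generator, a standard Archimedean family that the abstract does not name, and all function names are hypothetical. It builds the copula as C(u, v) = phi^{-1}(phi(u) + phi(v)) and compares the result with the known Clayton closed form.

```python
def clayton_generator(t, theta):
    """Clayton generator phi(t) = (t**-theta - 1) / theta, for theta > 0 and 0 < t <= 1.

    Assumption: the Clayton family is used only as an illustration.
    """
    return (t ** -theta - 1.0) / theta


def clayton_generator_inverse(s, theta):
    """Inverse generator phi^{-1}(s) = (1 + theta * s) ** (-1 / theta), for s >= 0."""
    return (1.0 + theta * s) ** (-1.0 / theta)


def archimedean_copula(u, v, theta):
    """Bivariate Archimedean copula C(u, v) = phi^{-1}(phi(u) + phi(v))."""
    total = clayton_generator(u, theta) + clayton_generator(v, theta)
    return clayton_generator_inverse(total, theta)


def clayton_closed_form(u, v, theta):
    """Closed-form Clayton copula (u**-theta + v**-theta - 1) ** (-1 / theta)."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the functions.
    print(archimedean_copula(0.3, 0.7, 2.0))
    print(clayton_closed_form(0.3, 0.7, 2.0))
```

Because phi(1) = 0, the construction returns C(u, 1) = u, which is the uniform-marginal property stated in the abstract. Nesting several such bivariate copulas, as the thesis does, requires additional constraints on the generators that this sketch does not attempt to reproduce.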
366	Country risk analysis: an application of logistic regression and neural networks Ncube, Gugulethu January 2017 (has links) A research report submitted to the Faculty of Science, School of Statistics and Actuarial Science in partial fulfilment of the requirements for the degree of Master of Science, University of the Witwatersrand. Johannesburg, 08 June 2017. Mathematical Statistics degree, 2017 / Country risk evaluation is a crucial exercise when determining the ability of countries to repay their debts. The global environment is volatile and is filled with macro-economic, financial and political factors that may affect a country’s commercial environment, resulting in its inability to service its debt. This research report compares the ability of conventional neural network models and traditional panel logistic regression models in assessing country risk. The models are developed using a set of economic, financial and political risk factors obtained from the World Bank for the years 1996 to 2013 for 214 economies. These variables are used to assess the debt servicing capacity of the economies as this has a direct impact on the return on investments for financial institutions, investors, policy makers as well as researchers. The models developed may act as early warning systems to reduce exposure to country risk. Keywords: Country risk, Debt rescheduling, Panel logit model, Neural network models / XL2017 Country risk Investments Neural networks Mathematical statistics World Bank
367	Categorical data imputation using non-parametric or semi-parametric imputation methods Khosa, Floyd Vukosi 11 May 2016 (has links) A research report submitted to the Faculty of Science, University of the Witwatersrand, for the degree of Master of Science by Coursework and Research Report. / Researchers and data analysts often encounter a problem when analysing data with missing values. Methods for imputing continuous data are well developed in the literature. However, methods for imputing categorical data are not well established. This research report focuses on categorical data imputation using non-parametric and semi-parametric methods. The aims of the study are to compare different imputation methods for categorical data and to assess the quality of the imputation. Three imputation methods are compared, namely multiple imputation, hot deck imputation and random forest imputation. Missing data are created on a complete data set using the missing completely at random mechanism. The imputed data sets are compared with the original complete data set, and the imputed values which are the same as the values in the original data set are counted. The analysis revealed that the hot deck imputation method is more precise than the random forest and multiple imputation methods. Logistic regression is fitted on the imputed data sets and the original data set and the resulting models are compared. The analysis shows that the multiple imputation method affects the model fit of the logistic regression negatively. Data integrity. Quality control. Nonparametric statistics. Mathematical statistics.
368	The enumeration of lattice paths and walks Unknown Date (has links) A well-known, long-standing problem in combinatorics and statistical mechanics is to find the generating function for self-avoiding walks (SAW) on a two-dimensional lattice, enumerated by perimeter. A SAW is a sequence of moves on a square lattice which does not visit the same point more than once. It has been considered by more than one hundred researchers in the past one hundred years, including George Polya, Tony Guttmann, Laszlo Lovasz, Donald Knuth, Richard Stanley, Doron Zeilberger, Mireille Bousquet-Mélou, Thomas Prellberg, Neal Madras, Gordon Slade, Agnes Dittel, E.J. Janse van Rensburg, Harry Kesten, Stuart G. Whittington, Lincoln Chayes, Iwan Jensen, Arthur T. Benjamin, and many others. More than three hundred papers and a few volumes of books were published in this area. A SAW is interesting for simulations because its properties cannot be calculated analytically. Calculating the number of self-avoiding walks is a common computational problem. A recently proposed model called prudent self-avoiding walks (PSAW) was first introduced to the mathematics community in an unpublished manuscript of Préa, who called them exterior walks. A prudent walk is a connected path on a square lattice such that, at each step, the extension of that step along its current trajectory will never intersect any previously occupied vertex. A lattice path is composed of connected horizontal and vertical line segments, each passing between adjacent lattice points. We will discuss some enumerative problems in self-avoiding walks, lattice paths and walks with several step vectors. Many open problems are posed. / by Shanzhen Gao. / Thesis (Ph.D.)--Florida Atlantic University, 2011. / Includes bibliography. / Electronic reproduction. Boca Raton, Fla., 2011. Mode of access: World Wide Web. Combinatorial analysis Approximation theory Mathematical statistics Limit theorems (Probability theory)
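The counting problem described in entry 368 can be illustrated for short walks by brute force. The following is a minimal sketch, not taken from the thesis: the function name count_saws and the backtracking approach are assumptions chosen for illustration. It counts walks by number of steps on the square lattice, starting from the origin, and its running time grows exponentially with the walk length.

```python
def count_saws(n_steps):
    """Count self-avoiding walks with n_steps steps on the square lattice.

    Walks start at the origin and never revisit a lattice point.
    Hypothetical helper for illustration; practical only for small n_steps.
    """
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))

    def extend(position, visited, remaining):
        # A walk with no steps left to take counts as one completed walk.
        if remaining == 0:
            return 1
        total = 0
        x, y = position
        for dx, dy in moves:
            nxt = (x + dx, y + dy)
            if nxt not in visited:
                visited.add(nxt)        # occupy the new point
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)     # backtrack
        return total

    origin = (0, 0)
    return extend(origin, {origin}, n_steps)


if __name__ == "__main__":
    for n in range(7):
        print(n, count_saws(n))
```

The counts for 0 to 6 steps are 1, 4, 12, 36, 100, 284 and 780. The rapid growth is one reason exhaustive enumeration stops being practical quickly and generating-function or simulation approaches are studied instead.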
369	Bayesian approach to an exponential hazard regression model with a change point Unknown Date (has links) This thesis contains two parts. The first part derives the Bayesian estimator of the parameters in a piecewise exponential Cox proportional hazard regression model, with one unknown change point, for right-censored survival data. The second part surveys the applications of change point problems to various types of data, such as long-term survival data, longitudinal data and time series data. Furthermore, the proposed method is used to analyse a real survival data set. / Includes bibliography. / Thesis (M.S.)--Florida Atlantic University, 2014. / FAU Electronic Theses and Dissertations Collection Bayesian statistical decision theory Mathematical statistics Multivariate analysis -- Data processing
370	Asymptotic statistics and spline functions / CUHK electronic theses & dissertations collection January 2014 (has links) In this thesis, two topics in asymptotic statistics and spline functions are studied. / The first one is a study of testing the equality of Sharpe ratios. We compare two approaches to testing for the equality of many Sharpe ratios: the multivariate test of Wright et al. (2014) and Ledoit and Wolf's (2008) pairwise test. Firstly, Ledoit and Wolf's pairwise test is generalized to a multivariate one for direct comparison. We conclude by proposing a modified version that incorporates the Warp-Speed calibration method of Giacomini et al. (2013). The resulting procedure is much less computationally expensive but is comparable in its accuracy. / The second one is a study of the theoretical properties of an exponential weighting aggregated (EWA) penalized spline estimator, where the smoothing parameter is averaged over an exponentially reweighted posterior distribution. We show that the finite sample mean squared error of the EWA estimator is smaller than that of the smooth penalized spline estimator when the smoothing parameter is chosen to be a fixed value. Consistency and asymptotic normality of the EWA estimator are also developed under general situations. / This thesis studies two topics concerning asymptotic statistics and spline functions. / The first is an application to the problem of hypothesis testing for the equality of Sharpe ratios. We first compare the multivariate test of Wright et al. (2014) with the pairwise test of Ledoit & Wolf (2008). We begin by extending the pairwise test of Ledoit & Wolf (2008) to the multivariate setting to facilitate comparison. Finally, we propose a bootstrap test modified by the Warp-Speed method of Giacomini et al. (2013). This method greatly reduces the practical computational cost while maintaining accuracy. / The second is a study of the asymptotic behaviour of spline estimation. We study the exponential weighting aggregated (EWA) smoothing spline estimator, which uses a Bayesian approach to assign a posterior distribution to the smoothing parameter and aggregates over it with weights. By extending an oracle inequality, we show that the variance of the EWA estimator is smaller than that of an ordinary smoothing spline estimator with a single chosen smoothing parameter. In addition, we establish the consistency and asymptotic normality of the EWA estimator under general conditions. / Huang, Wei. / Thesis M.Phil. Chinese University of Hong Kong 2014. / Includes bibliographical references (leaves 80-84). / Abstracts also in Chinese. / Title from PDF title page (viewed on 13 September 2016). / Detailed summary in vernacular field only. Spline theory QA276 .H825 2014
