1

Sampling designs for exploratory multivariate analysis

Hopkins, Julie Anne January 2000 (has links)
This thesis is concerned with problems of variable selection, influence of sample size and related issues in the applications of various techniques of exploratory multivariate analysis (in particular, correspondence analysis, biplots and canonical correspondence analysis) to archaeology and ecology. Data sets (both published and new) are used to illustrate these methods and to highlight the problems that arise - these practical examples are returned to throughout as the various issues are discussed. Much of the motivation for the development of the methodology has been driven by the needs of the archaeologists providing the data, who were consulted extensively during the study. The first (introductory) chapter includes a detailed description of the data sets examined and the archaeological background to their collection. Chapters Two, Three and Four explain in detail the mathematical theory behind the three techniques. Their uses are illustrated on the various examples of interest, raising data-driven questions which become the focus of the later chapters. The main objectives are to investigate the influence of various design quantities on the inferences made from such multivariate techniques. Quantities such as the sample size (e.g. number of artefacts collected), the number of categories of classification (e.g. of sites, wares, contexts) and the number of variables measured compete for fixed resources in archaeological and ecological applications. Methods of variable selection and the assessment of the stability of the results are further issues of interest and are investigated using bootstrapping and Procrustes analysis. Jackknife methods are used to detect influential sites, wares, contexts, species and artefacts. Some existing methods of investigating issues such as those raised above are applied and extended to correspondence analysis in Chapters Five and Six. Adaptations of them are proposed for biplots in Chapters Seven and Eight and for canonical correspondence analysis in Chapter Nine. Chapter Ten concludes the thesis.
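As a toy illustration of the kind of bootstrap-and-Procrustes stability check the abstract describes (not the thesis's own implementation), the sketch below resamples a hypothetical site-by-ware count table, recomputes a two-dimensional correspondence-analysis ordination from the standardised residuals, and measures how far each bootstrap configuration drifts from the original configuration using Procrustes disparity. The table `counts`, the bootstrap scheme and all settings are assumptions made up for the example.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)

def ca_row_coords(counts, n_dims=2):
    """Row principal coordinates of a plain correspondence analysis."""
    P = counts / counts.sum()
    r = P.sum(axis=1)                                    # row masses
    c = P.sum(axis=0)                                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    return ((U * sv) / np.sqrt(r)[:, None])[:, :n_dims]

# hypothetical site-by-ware count table (rows = sites, columns = ware types)
counts = rng.integers(1, 50, size=(8, 5)).astype(float)
base = ca_row_coords(counts)

n_total = int(counts.sum())
probs = (counts / counts.sum()).ravel()
disparities = []
for _ in range(200):
    # parametric bootstrap: redraw the whole table from the observed cell proportions
    boot = rng.multinomial(n_total, probs).reshape(counts.shape).astype(float)
    boot += 0.5                                          # guard against empty rows/columns
    _, _, d = procrustes(base, ca_row_coords(boot))
    disparities.append(d)

print(f"median Procrustes disparity across bootstrap tables: {np.median(disparities):.4f}")
```

A small median disparity suggests the two-dimensional configuration is stable under resampling; large or highly variable disparities flag configurations that depend heavily on the particular sample of artefacts.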
2

Properties of the SCOOP Method of Selecting Gene Sets

Liu, Yushi 27 September 2010 (has links)
No description available.
3

Model selection and estimation in high dimensional settings

Ngueyep Tzoumpe, Rodrigue 08 June 2015 (has links)
Several statistical problems can be described as estimation problems, where the goal is to learn a set of parameters from data by maximizing a criterion. This type of problem is typically encountered in a supervised learning setting, where we want to relate an output (or many outputs) to multiple inputs. The relationship between these outputs and inputs can be complex, and this complexity can be attributed to the high dimensionality of the space containing the inputs and the outputs; the existence of structural prior knowledge within the inputs or the outputs that, if ignored, may lead to inefficient estimates of the parameters; and the presence of a non-trivial noise structure in the data. In this thesis we propose new statistical methods to achieve model selection and estimation when there are more predictors than observations. We also design a new set of algorithms to efficiently solve the proposed statistical models. We apply the implemented methods to genetic data sets of cancer patients and to some economic data.
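The thesis develops its own penalised estimators; purely as a baseline illustration of model selection with more predictors than observations (not the methods proposed here), the sketch below fits a cross-validated lasso to synthetic data with p = 200 predictors and n = 50 observations and reports which coefficients survive the penalty. All names and settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 50, 200                                  # far more predictors than observations
X = rng.standard_normal((n, p))
true_coef = np.zeros(p)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]     # only 5 truly active predictors
y = X @ true_coef + 0.5 * rng.standard_normal(n)

# Cross-validated lasso: the L1 penalty drives most coefficients to exactly zero,
# performing estimation and variable selection at the same time.
model = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"chosen alpha: {model.alpha_:.3f}")
print(f"selected predictors: {selected.tolist()}")
```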
4

Identifying historical financial crisis: Bayesian stochastic search variable selection in logistic regression

Ho, Chi-San 2009 August 1900 (has links)
This work investigates the factors that contribute to financial crises. We first study the performance of the Dow Jones index by grouping the daily adjusted closing values into two-month windows and finding several critical quantiles in each window. We then identify severe downturns in these quantiles and find that the 5th quantile is the best at identifying financial crises. We match these quantiles with historical financial crises and give a basic explanation of them. Next, we introduce all exogenous factors that could be related to the crises and apply a rapid Bayesian variable selection technique, Stochastic Search Variable Selection (SSVS), using a Bayesian logistic regression model. Finally, we analyze the results of SSVS, concluding that the dummy variable we created for disastrous hurricanes, the crude oil price and the gold price (GOLD) should be included in the model.
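As a rough illustration of the first step described above (not the author's code), the sketch below splits a daily closing-price series into consecutive blocks of roughly 42 trading days (about two months), computes the 5th percentile of daily returns in each block, and flags blocks whose 5th percentile falls below a chosen cutoff as candidate crisis windows. The simulated series, the block length and the cutoff are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# simulated daily adjusted closes standing in for the Dow Jones index
log_returns = 0.0003 + 0.01 * rng.standard_normal(5000)
closes = 100 * np.exp(np.cumsum(log_returns))

BLOCK = 42           # roughly two months of trading days
THRESHOLD = -0.025   # flag a window if its 5th-percentile daily return falls below this

daily_ret = np.diff(closes) / closes[:-1]
flags = []
for start in range(0, len(daily_ret) - BLOCK + 1, BLOCK):
    window = daily_ret[start:start + BLOCK]
    q05 = np.percentile(window, 5)               # 5th quantile of the window's returns
    flags.append((start, q05, q05 < THRESHOLD))

crisis_windows = [(s, q) for s, q, bad in flags if bad]
print(f"{len(crisis_windows)} of {len(flags)} two-month windows flagged as severe downturns")
```

The flagged windows would then be matched against known crisis dates before the SSVS step selects which exogenous factors to keep in the logistic regression.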
5

Nonlinear systems identification using the NARMAX method

Mao, Ke Zhi January 1998 (has links)
No description available.
6

Quantifying the stability of feature selection

Nogueira, Sarah January 2018 (has links)
Feature Selection is central to modern data science, from exploratory data analysis to predictive model-building. The "stability" of a feature selection algorithm refers to the robustness of its feature preferences, with respect to data sampling and to its stochastic nature. An algorithm is "unstable" if a small change in data leads to large changes in the chosen feature subset. Whilst the idea is simple, quantifying this has proven more challenging; we note numerous proposals in the literature, each with different motivation and justification. We present a rigorous statistical and axiomatic treatment for this issue. In particular, with this work we consolidate the literature and provide (1) a deeper understanding of existing work based on a small set of properties, and (2) a clearly justified statistical approach with several novel benefits. This approach serves to identify a stability measure obeying all desirable properties, and (for the first time in the literature) allowing confidence intervals and hypothesis tests on the stability of an approach, enabling rigorous comparison of feature selection algorithms.
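A minimal sketch of a frequency-based stability estimate in the spirit of this work follows. The normalisation used here (comparing the per-feature selection variance against the variance expected for random subsets of the same average size) is written from memory of the associated publications and should be treated as an assumption, not as the thesis's definitive formula.

```python
import numpy as np

def stability(Z):
    """Z: binary matrix with rows = repeated runs of a feature selector and
    columns = features; Z[i, f] = 1 if run i selected feature f.
    Assumes the average subset size is strictly between 0 and d."""
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p_f = Z.mean(axis=0)                          # selection frequency of each feature
    s2_f = M / (M - 1) * p_f * (1 - p_f)          # unbiased per-feature selection variance
    k_bar = Z.sum(axis=1).mean()                  # average subset size across runs
    expected = (k_bar / d) * (1 - k_bar / d)      # variance for random subsets of size k_bar
    return 1.0 - s2_f.mean() / expected

# Example: 10 runs over 8 features; the selector mostly agrees with itself,
# so the estimate should be close to 1 (perfectly stable).
rng = np.random.default_rng(3)
Z = (rng.random((10, 8)) < [0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]).astype(int)
print(f"estimated stability: {stability(Z):.3f}")
```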
7

Variable selection in principal component analysis : using measures of multivariate association.

Sithole, Moses M. January 1992 (has links)
This thesis is concerned with the problem of selection of important variables in Principal Component Analysis (PCA) in such a way that the selected subsets of variables retain, as much as possible, the overall multivariate structure of the complete data. Throughout the thesis, the criteria used in order to meet this requirement are collectively referred to as measures of Multivariate Association (MVA). Most of the currently available selection methods may lead to inappropriate subsets, while Krzanowski's (1987) M2-Procrustes criterion successfully identifies structure-bearing variables, particularly when groups are present in the data. Our major objective, however, is to utilize the idea of multivariate association to select subsets of the original variables which preserve any (unknown) multivariate structure that may be present in the data. The first part of the thesis is devoted to a study of the choice of the number of components (say, k) to be used in the variable selection process. Various methods that exist in the literature for choosing k are described, and comparative studies on these methods are reviewed. Currently available methods based exclusively on the eigenvalues of the covariance or correlation matrices, and those based on cross-validation, are unsatisfactory. Hence, we propose a new technique for choosing k based on the bootstrap methodology. A full comparative study of this new technique and the cross-validatory choice of k proposed by Eastment and Krzanowski (1982) is then carried out using data simulated from Monte Carlo experiments. The remainder of the thesis focuses on variable selection in PCA using measures of MVA. Various existing selection methods are described, and comparative studies on these methods available in the literature are reviewed. New methods for selecting variables, based on measures of MVA, are then proposed and compared among themselves as well as with the M2-Procrustes criterion. This comparison is based on Monte Carlo simulation, and the behaviour of the selection methods is assessed in terms of the performance of the selected variables. In summary, the Monte Carlo results suggest that the proposed bootstrap technique for choosing k generally performs better than the cross-validatory technique of Eastment and Krzanowski (1982). Similarly, the Monte Carlo comparison of the variable selection methods shows that the proposed methods are comparable with or better than Krzanowski's (1987) M2-Procrustes criterion. These conclusions are mainly based on data simulated by means of Monte Carlo experiments. However, these techniques for choosing k and the various variable selection techniques are also evaluated on some real data sets. Some comments on alternative approaches and suggestions for possible extensions conclude the thesis.
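One simple bootstrap-flavoured heuristic for choosing the number of components k is sketched below; it is not necessarily the criterion proposed in the thesis, only an illustration of the general idea: resample the observations, recompute the eigenvalues of the correlation matrix, and retain components whose lower bootstrap bound stays above the average eigenvalue (1 for a correlation matrix). The synthetic data and all settings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic data: 100 observations on 10 variables driven by 2 strong underlying factors
n, p = 100, 10
latent = rng.standard_normal((n, 2))
loadings = rng.standard_normal((2, p))
X = latent @ loadings + 0.7 * rng.standard_normal((n, p))

def corr_eigenvalues(data):
    """Eigenvalues of the sample correlation matrix, largest first."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

B = 500
boot_eigs = np.empty((B, p))
for b in range(B):
    idx = rng.integers(0, n, size=n)            # resample observations with replacement
    boot_eigs[b] = corr_eigenvalues(X[idx])

lower = np.percentile(boot_eigs, 5, axis=0)     # lower 5% bootstrap bound per eigenvalue
k = int(np.sum(lower > 1.0))                    # keep components clearly above the average eigenvalue
print(f"chosen number of components k = {k}")
```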
8

Coefficient of intrinsic dependence: a new measure of association

Liu, Li-yu Daisy 29 August 2005 (has links)
To detect dependence among variables is an essential task in many scientific investigations. In this study we propose a new measure of association, the coefficient of intrinsic dependence (CID), which takes values in [0,1] and faithfully reflects the full range of dependence for two random variables. The CID is free of distributional and functional assumptions. It can be easily implemented and extended to multivariate situations. Traditionally, the correlation coefficient is the preferred measure of association. However, its effectiveness is considerably compromised when the random variables are not normally distributed. Besides, the interpretation of the correlation coefficient is difficult when the data are categorical. By contrast, the CID is free of these problems. In our simulation studies, we find that the ability of the CID to differentiate different levels of dependence remains robust across different data types (categorical or continuous) and model features (linear or curvilinear). Also, the CID is particularly effective when the dependence is strong, making it a powerful tool for variable selection. As an illustration, the CID is applied to variable selection in two aspects: classification and prediction. The analysis of actual data from a study of breast cancer gene expression is included. For the classification problem, we identify a pair of genes that best classifies a patient's prognosis signature, and for the prediction problem, we identify a pair of genes that best relates to the expression of a specific gene.
9

An Investigation of Artificial Immune Systems and Variable Selection Techniques for Credit Scoring.

Leung Kan Hing, Kevin, kleung19@yahoo.com January 2009 (has links)
Most lending institutions are aware of the importance of having a well-performing credit scoring model or scorecard and know that, in order to remain competitive in the credit industry, it is necessary to continuously improve their scorecards. This is because better scorecards result in substantial monetary savings that can be stated in terms of millions of dollars. Thus, there has been increasing interest in the application of new classifiers in credit scoring from both practitioners and researchers in the last few decades. Most of the recent work in this field has focused on the use of new and innovative techniques to classify applicants as either 'credit-worthy' or 'non-credit-worthy', with the aim of improving scorecard performance. In this thesis, we investigate the suitability of intelligent systems techniques for credit scoring. In particular, intelligent systems that use immunological metaphors are examined and used to build a learning and evolutionary classification algorithm. Our model, named Simple Artificial Immune System (SAIS), is based on the concepts of the natural immune system. The model uses applicants' credit details to classify them as either 'credit-worthy' or 'non-credit-worthy'. As part of the model development, we also investigate several techniques for selecting variables from the applicants' credit details. Variable selection is important as choosing the best set of variables can have a significant effect on the performance of scorecards. Interestingly, our results demonstrate that the traditional stepwise regression variable selection technique seems to perform better than many of the more recent techniques. A further contribution offered by this thesis is a detailed description of the scorecard development process. A detailed explanation of this process is not readily available in the literature, and our description of the process is based on our own experiences and discussions with industry credit risk practitioners. We evaluate our model using both publicly available datasets and a very large set of real-world consumer credit scoring data obtained from a leading Australian bank. The evaluation results reveal that SAIS is a competitive classifier and is appropriate for developing scorecards which require a class decision as an outcome. Another conclusion, one confirmed by the existing literature, is that even though more sophisticated scorecard development techniques, including SAIS, perform well compared to the traditional statistical methods, their performances are not statistically significantly different from those of the statistical methods. As with other intelligent systems techniques, SAIS is not explicitly designed to develop practical scorecards, which require the generation of a score that represents the degree of confidence that an applicant will belong to a particular group. However, it is comparable to other intelligent systems techniques which are outperformed by statistical techniques for generating practical scorecards. Our final remark on this research is that even though SAIS does not seem to be quite suitable for developing practical scorecards, we still believe that there is room for improvement and that the natural immune system of the body has a number of avenues yet to be explored which could assist with the development of practical scorecards.
10

Combining Variable Selection with Dimensionality Reduction

Wolf, Lior, Bileschi, Stanley 30 March 2005 (has links)
This paper bridges the gap between variable selection methods (e.g., Pearson coefficients, KS test) and dimensionality reduction algorithms (e.g., PCA, LDA). Variable selection algorithms encounter difficulties dealing with highly correlated data, since many features are similar in quality. Dimensionality reduction algorithms tend to combine all variables and cannot select a subset of significant variables. Our approach combines both methodologies by applying variable selection followed by dimensionality reduction. This combination makes sense only when using the same utility function in both stages, which we do. The resulting algorithm benefits from complex features as variable selection algorithms do, and at the same time enjoys the benefits of dimensionality reduction.
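A minimal sketch of the two-stage idea (not the authors' exact algorithm): score each feature with a class-separation criterion, keep the top-scoring ones, then run a dimensionality reduction driven by the same kind of criterion (here LDA) on the reduced set. The per-feature Fisher-style score, the synthetic data and the cut-off of 15 features are assumptions for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)

# two-class toy data: 300 samples, 50 features, only the first 10 carry class signal
n, p, informative = 300, 50, 10
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, p))
X[:, :informative] += y[:, None] * 1.5

# Stage 1 -- variable selection: per-feature Fisher-style score
# (between-class separation over within-class spread); keep the top 15 features.
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
v0, v1 = X[y == 0].var(axis=0), X[y == 1].var(axis=0)
fisher_score = (m0 - m1) ** 2 / (v0 + v1 + 1e-12)
top_k = np.argsort(fisher_score)[::-1][:15]

# Stage 2 -- dimensionality reduction on the selected variables only,
# using a criterion of the same class-separation flavour (LDA).
lda = LinearDiscriminantAnalysis(n_components=1)
Z = lda.fit_transform(X[:, top_k], y)
print(f"selected features: {np.sort(top_k).tolist()}")
print(f"reduced representation shape: {Z.shape}")
```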
