51

Multi-objective ROC learning for classification

Clark, Andrew Robert James January 2011 (has links)
Receiver operating characteristic (ROC) curves are widely used for evaluating classifier performance, having been applied to e.g. signal detection, medical diagnostics and safety critical systems. They allow examination of the trade-offs between true and false positive rates as misclassification costs are varied. Examination of the resulting graphs and calculation of the area under the ROC curve (AUC) allows assessment of how well a classifier is able to separate two classes and allows selection of an operating point with full knowledge of the available trade-offs.

In this thesis a multi-objective evolutionary algorithm (MOEA) is used to find classifiers whose ROC graph locations are Pareto optimal. The Relevance Vector Machine (RVM) is a state-of-the-art classifier that produces sparse Bayesian models, but is unfortunately prone to overfitting. Using the MOEA, hyper-parameters for RVM classifiers are set, optimising them not only in terms of true and false positive rates but also a novel measure of RVM complexity, thus encouraging sparseness, and producing approximations to the Pareto front. Several methods for regularising the RVM during the MOEA training process are examined and their performance evaluated on a number of benchmark datasets demonstrating they possess the capability to avoid overfitting whilst producing performance equivalent to that of the maximum likelihood trained RVM.

A common task in bioinformatics is to identify genes associated with various genetic conditions by finding those genes useful for classifying a condition against a baseline. Typically, datasets contain large numbers of gene expressions measured in relatively few subjects. As a result of the high dimensionality and sparsity of examples, it can be very easy to find classifiers with near perfect training accuracies but which have poor generalisation capability. Additionally, depending on the condition and treatment involved, evaluation over a range of costs will often be desirable. An MOEA is used to identify genes for classification by simultaneously maximising the area under the ROC curve whilst minimising model complexity. This method is illustrated on a number of well-studied datasets and applied to a recent bioinformatics database resulting from the current InChianti population study.

Many classifiers produce “hard”, non-probabilistic classifications and are trained to find a single set of parameters, whose values are inevitably uncertain due to limited available training data. In a Bayesian framework it is possible to ameliorate the effects of this parameter uncertainty by averaging over classifiers weighted by their posterior probability. Unfortunately, the required posterior probability is not readily computed for hard classifiers. In this thesis an Approximate Bayesian Computation Markov Chain Monte Carlo algorithm is used to sample model parameters for a hard classifier using the AUC as a measure of performance. The ability to produce ROC curves close to the Bayes optimal ROC curve is demonstrated on a synthetic dataset. Due to the large numbers of sampled parametrisations, averaging over them when rapid classification is needed may be impractical and thus methods for producing sparse weightings are investigated.
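As a minimal illustration of the ROC concepts the abstract builds on (not the thesis's MOEA/RVM approach), the sketch below computes an ROC curve and its AUC with scikit-learn and then picks an operating point for assumed misclassification costs; the data, classifier and cost values are all hypothetical.

```python
# Sketch: ROC curve, AUC, and a cost-sensitive operating point (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))

# Choose the threshold minimising expected cost for assumed (hypothetical)
# false-positive and false-negative costs and the class prior in the test set.
c_fp, c_fn = 1.0, 5.0                      # assumed costs, not from the thesis
p_pos = y_te.mean()
expected_cost = c_fp * fpr * (1 - p_pos) + c_fn * (1 - tpr) * p_pos
best = np.argmin(expected_cost)
print("operating point: FPR=%.3f TPR=%.3f threshold=%.3f"
      % (fpr[best], tpr[best], thresholds[best]))
```

The cost-weighted selection step mirrors the idea of choosing an operating point with full knowledge of the available trade-offs.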
52

Model calibration methods for mechanical systems with local nonlinearities

Chen, Yousheng January 2016 (has links)
Most modern product development relies on computational models. With increasing demands to reduce product development lead time, it becomes more important to improve the accuracy and efficiency of simulations. In addition, to improve product performance, many products are designed to be lighter and more flexible, and thus more prone to nonlinear behaviour. Linear finite element (FE) models, which still form the basis of numerical models used to represent mechanical structures, may not be able to predict structural behaviour with the necessary accuracy when nonlinear effects are significant. Nonlinearities are often localized to joints or boundary conditions. Including nonlinear behaviour in FE models introduces more sources of uncertainty, and it is often necessary to calibrate the models using experimental data.

This research work presents a model calibration method suitable for mechanical systems with structural nonlinearities. The methodology covers pre-test planning, parameterization, simulation methods, vibrational testing and optimization. The selection of parameters for calibration requires physical insight together with analyses of the structure; the latter can be achieved through simulation. Traditional simulation methods may be computationally expensive when dealing with nonlinear systems; therefore an efficient fixed-step state-space based simulation method was developed. To assess the accuracy of different simulation methods, the bias errors of the proposed method and of other widespread simulation methods were studied and compared; the proposed method performs well in comparison.

To obtain precise estimates of the parameters, the test data should be informative about the chosen parameters and the parameters should be identifiable. Test data informativeness and parameter identifiability are coupled, and both can be assessed with the Fisher information matrix (FIM). To optimize the informativeness of the test data, a FIM-based pre-test planning method was developed and a multi-sinusoidal excitation was designed. The steady-state responses at the side harmonics were shown to contain valuable information for calibrating FE models of mechanical systems with structural nonlinearities.

In this work, model calibration was performed by minimizing the difference between predicted and measured multi-harmonic frequency response functions using an efficient optimization routine. The steady-state responses were calculated using the extended multi-harmonic balance method. Once the parameters were calibrated, k-fold cross-validation was used to quantify parameter uncertainty. The proposed model calibration method was validated on two test rigs, one with a geometrical nonlinearity and one with a clearance type of nonlinearity. To obtain high-quality data efficiently, the amplitude of the forcing harmonics was controlled at each frequency step by an off-line force feedback algorithm. The applied force was then measured and used in the numerical simulations of the responses. The validation results show that the predictions from the calibrated models agree well with the experimental results.

In summary, the presented methodology addresses both theoretical and experimental aspects, covering pre-test planning, simulation, testing, calibration and validation. As such, this research work offers a complete framework and contributes to more effective and efficient analyses of mechanical systems with structural nonlinearities.
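As a rough sketch of the calibration idea described above (minimizing the difference between predicted and measured responses), the example below fits the stiffness parameters of a single-degree-of-freedom Duffing oscillator using a one-harmonic balance approximation and nonlinear least squares. It is not the thesis's extended multi-harmonic balance method or its test rigs; all parameter names and values are illustrative assumptions.

```python
# Sketch: calibrate linear and cubic stiffness of a Duffing oscillator by fitting
# single-harmonic steady-state amplitudes to "measured" data (illustrative only).
import numpy as np
from scipy.optimize import brentq, least_squares

m, c, F = 1.0, 0.05, 0.2                   # assumed known mass, damping, force
omegas = np.linspace(0.5, 2.0, 40)         # excitation frequencies (rad/s)

def amplitude(omega, k, k3):
    # Single-harmonic balance: residual in A for x(t) ~ A*cos(omega*t + phi)
    f = lambda A: ((k + 0.75 * k3 * A**2 - m * omega**2)**2
                   + (c * omega)**2) * A**2 - F**2
    return brentq(f, 1e-9, 10.0)           # a positive root of the balance equation

def predicted(params):
    k, k3 = params
    return np.array([amplitude(w, k, k3) for w in omegas])

# Synthetic "measurement" generated with the true parameters plus noise
rng = np.random.default_rng(0)
measured = predicted([1.0, 0.4]) * (1 + 0.02 * rng.standard_normal(omegas.size))

# Calibrate k and k3 by minimising the amplitude residuals
res = least_squares(lambda p: predicted(p) - measured, x0=[0.8, 0.1])
print("calibrated k, k3:", res.x)
```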
53

Neighborhood-oriented feature selection and classification of Duke’s stages on colorectal cancer using high-density genomic data.

Peng, Liang January 1900 (has links)
Master of Science / Department of Statistics / Haiyan Wang / The selection of relevant genes for classifying disease phenotypes with gene expression data has been extensively studied. Previously, most relevant gene selection was conducted on individual genes with limited sample sizes. Modern technology makes it possible to obtain microarray data with higher resolution of the chromosomes. Considering gene sets on an entire block of a chromosome, rather than individual genes, could help reveal important connections between relevant genes and the disease phenotypes. In this report, we consider feature selection and classification while taking into account the spatial location of probe sets in classifying Duke’s stages B and C using DNA copy number data or gene expression data from colorectal cancers. A novel method is presented for feature selection in this report. A chromosome was first partitioned into blocks after the probe sets were aligned along their chromosome locations. Then a test of interaction between Duke’s stage and probe sets was conducted on each block of probe sets to select significant blocks. For each significant block, a new multiple comparison procedure was carried out to identify truly relevant probe sets while preserving the neighborhood location information of the probe sets. Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) classification using the selected final probe sets was conducted for all samples. The Leave-One-Out Cross-Validation (LOOCV) estimate of accuracy is reported as an evaluation of the selected features. We applied the method to two large data sets, each containing more than 50,000 features. Excellent classification accuracy was achieved by the proposed procedure along with SVM or KNN for both data sets, even though classification of prognosis stages (Duke’s stages B and C) is much more difficult than classification of normal versus tumor types.
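A minimal sketch of the classification and evaluation stage is shown below. The report's block-wise interaction test is not reproduced; a generic univariate filter is used as a stand-in for feature selection, and SVM and KNN accuracies are estimated by LOOCV on synthetic high-dimensional data.

```python
# Sketch: feature selection + SVM/KNN with leave-one-out cross-validation,
# using a generic filter as an assumed stand-in for the block-wise procedure.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic high-dimensional data: many probes, few samples
X, y = make_classification(n_samples=60, n_features=5000, n_informative=20,
                           random_state=0)

for name, clf in [("SVM", SVC(kernel="linear")),
                  ("KNN", KNeighborsClassifier(n_neighbors=3))]:
    # Selection happens inside the pipeline so LOOCV does not leak test information
    model = make_pipeline(SelectKBest(f_classif, k=50), clf)
    acc = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()
    print(name, "LOOCV accuracy: %.3f" % acc)
```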
54

Relação hipsométrica de eucalipto clonal no sul do Tocantins (Hypsometric relationship of clonal eucalyptus in southern Tocantins)

Schmitt, Thaís 14 September 2017 (has links)
This work was structured in two chapters, using 11 rectangular, permanent plots of 348 m² each from a clonal plantation of Eucalyptus camaldulensis and Eucalyptus urophylla in the southern region of the state of Tocantins. The first chapter aimed to determine the best way of fitting hypsometric models, analyzing the accuracy of the best model and applying it to a different forest situation. The data were divided into a fitting set and an application set, with three diameter classes and three dominant-height classes. Initially, the adjusted coefficient of determination in percentage (R²aj), the standard error of the estimate in percentage (Syx%), and residual graphical analysis were determined. A model identity test was then performed, followed by a completely randomized design (DIC) in a split-plot scheme, together with the Dunnett test. At the end of the analysis, to evaluate the stability of the models in a validation test, the following criteria were used: coefficient of determination of prediction (R²), sum of squares of the relative residuals (SQRR), root mean square error (RQEM) and mean percentage error (EMP). It was concluded that the best approach was to fit the models by class, with the regional model being the most appropriate to use. The second chapter addresses the evaluation of hypsometric models using the cross-validation technique and the comparison of the results with those obtained in chapter 1, aiming to obtain the best model for the region under different selection criteria. Initially the precision criteria were applied: adjusted coefficient of determination, standard error of the estimate and residual graphical analysis. Then the stability criteria were applied by performing cross-validation between the two data sets, namely: mean absolute error, root mean square error and sum of squares of the mean error. The selected models were submitted to a new analysis using the data sets of chapter 1, applying the same precision and stability criteria used previously, resulting in the comparison between the chapters. It was concluded that the best local model was model 14 (Chapman-Richards), the best regional model was parabolic model 3, and, in comparison with the models selected in chapter 1, the most suitable model for the plantation was the regional parabolic model 3 from chapter 2.
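As an illustration of the kind of hypsometric model fitting described (not the thesis data or its full model comparison), the sketch below fits a Chapman-Richards type height-diameter model to simulated data and reports the adjusted R² and Syx% statistics mentioned in the abstract; all parameter values are hypothetical.

```python
# Sketch: fit a Chapman-Richards height-diameter model and report R2aj and Syx%.
import numpy as np
from scipy.optimize import curve_fit

def chapman_richards(d, a, b, c):
    # Total height (m) as a function of DBH (cm); 1.3 m is breast height
    return 1.3 + a * (1.0 - np.exp(-b * d)) ** c

rng = np.random.default_rng(1)
dbh = rng.uniform(5, 25, 200)                          # hypothetical diameters (cm)
height = chapman_richards(dbh, 28.0, 0.08, 1.2) + rng.normal(0, 1.0, dbh.size)

params, _ = curve_fit(chapman_richards, dbh, height,
                      p0=[25.0, 0.1, 1.0], bounds=(0, [100.0, 1.0, 5.0]))
pred = chapman_richards(dbh, *params)

n, p = dbh.size, len(params)
sse = np.sum((height - pred) ** 2)
sst = np.sum((height - height.mean()) ** 2)
r2 = 1 - sse / sst
r2aj = 1 - (1 - r2) * (n - 1) / (n - p)                # adjusted R²
syx_pct = np.sqrt(sse / (n - p)) / height.mean() * 100 # Syx%
print("a, b, c =", params)
print("R2aj = %.3f, Syx%% = %.2f" % (r2aj, syx_pct))
```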
55

Serial Testing for Detection of Multilocus Genetic Interactions

Al-Khaledi, Zaid T. 01 January 2019 (has links)
A method to detect relationships between disease susceptibility and multilocus genetic interactions is the Multifactor-Dimensionality Reduction (MDR) technique pioneered by Ritchie et al. (2001). Since its introduction, many extensions have been pursued to deal with non-binary outcomes and/or to account for multiple interactions simultaneously. Studying the effects of multilocus genetic interactions on continuous traits (blood pressure, weight, etc.) is one case that MDR does not handle. Culverhouse et al. (2004) and Gui et al. (2013) proposed two different methods to analyze such a case. In their research, Gui et al. (2013) introduced the Quantitative Multifactor-Dimensionality Reduction (QMDR), which uses the overall average of the response variable to classify individuals into risk groups. This classification mechanism may not be efficient under some circumstances, especially when the overall mean is close to some multilocus means. To address such difficulties, we propose a new algorithm, the Ordered Combinatorial Quantitative Multifactor-Dimensionality Reduction (OQMDR), which uses a series of tests, based on the ascending order of the multilocus means, to identify the best interactions of different orders with risk patterns that minimize the prediction error. Ten-fold cross-validation is used to choose among the resulting models. Standard permutation tests are used to assess the significance of the selected model. The assessment procedure is also modified by utilizing the Generalized Extreme-Value distribution to improve the efficiency of the evaluation process. We present results from a simulation study to illustrate the performance of the algorithm. The proposed algorithm is also applied to a genetic data set associated with Alzheimer's Disease.
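The sketch below illustrates the QMDR-style cell labelling that OQMDR refines: for a single two-SNP combination, each multilocus genotype cell is labelled high- or low-risk by comparing its mean trait value to the overall mean. The simulated genotypes and trait effect are assumptions, and the ordered testing, cross-validation and permutation steps of OQMDR are not reproduced here.

```python
# Sketch: QMDR-style high/low risk labelling of multilocus genotype cells
# for one two-SNP combination (simulated data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 500
snp1 = rng.integers(0, 3, n)                 # genotypes coded 0/1/2
snp2 = rng.integers(0, 3, n)
trait = rng.normal(0, 1, n) + 0.8 * ((snp1 == 2) & (snp2 == 0))  # assumed effect

overall_mean = trait.mean()
labels = {}
for g1 in range(3):
    for g2 in range(3):
        cell = (snp1 == g1) & (snp2 == g2)
        if cell.any():
            # QMDR rule: compare the cell mean to the overall trait mean
            labels[(g1, g2)] = "high" if trait[cell].mean() > overall_mean else "low"
print(labels)
```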
56

New Non-Parametric Methods for Income Distributions

Luo, Shan 26 April 2013 (has links)
Low income proportion (LIP), Lorenz curve (LC) and generalized Lorenz curve (GLC) are important indexes for describing the inequality of an income distribution. They have been widely used by governments around the world for measuring social stability. The accuracy of estimating these indexes is therefore essential for quantifying a country's economic condition. Established statistical inference methods for these indexes are based on an asymptotic normal distribution, which may perform poorly when the real income data are skewed or contain outliers. Recent applications of nonparametric methods, though, allow researchers to utilize techniques without imposing a parametric distribution assumption on the data. For example, existing research proposes plug-in empirical likelihood (EL)-based inference for LIP, LC and GLC. However, this method becomes computationally intensive and mathematically complex because of the presence of nonlinear constraints in the underlying optimization problem. Meanwhile, the limiting distribution of the log empirical likelihood ratio is a scaled Chi-square distribution, and the estimation of the scaling constant affects the overall performance of the plug-in EL method. To improve the efficiency of the existing inference methods, this dissertation first proposes kernel estimators for LIP, LC and GLC, respectively. The cross-validation method is then proposed to choose the bandwidth for the kernel estimators. These kernel estimators are proved to be asymptotically normal. The smoothed jackknife empirical likelihood (SJEL) for LIP, LC and GLC is defined, and the log jackknife empirical likelihood ratio statistics are proved to follow the standard Chi-square distribution. Extensive simulation studies are conducted to evaluate the kernel estimators in terms of mean squared error and asymptotic relative efficiency. Next, SJEL-based confidence intervals and smoothed bootstrap-based confidence intervals are proposed. The coverage probability and interval length of the proposed confidence intervals are calculated and compared with those of the normal-approximation-based intervals. The proposed kernel estimators are found to be competitive, and the proposed inference methods are observed to have better finite-sample performance. All inference methods are illustrated through real examples.
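As a small illustration of the kernel-smoothing idea (not the dissertation's SJEL machinery), the sketch below computes a kernel-smoothed estimate of the low income proportion, here taken as the share of incomes below half the median. The rule-of-thumb bandwidth is an assumption; the dissertation selects the bandwidth by cross-validation.

```python
# Sketch: kernel-smoothed low income proportion (LIP) on simulated incomes.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=0.6, size=2000)   # simulated incomes

t = 0.5 * np.median(income)                               # low-income threshold
h = 1.06 * income.std(ddof=1) * income.size ** (-1 / 5)   # rule-of-thumb bandwidth

# Empirical estimator uses an indicator; the kernel estimator replaces it with a
# Gaussian-CDF kernel, smoothing the step function.
lip_empirical = np.mean(income <= t)
lip_kernel = np.mean(norm.cdf((t - income) / h))
print("empirical LIP: %.4f  kernel LIP: %.4f" % (lip_empirical, lip_kernel))
```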
57

Ensembles of Artificial Neural Networks: Analysis and Development of Design Methods

Torres Sospedra, Joaquín 30 September 2011 (has links)
This thesis is focused on the analysis and development of ensembles of neural networks. An ensemble is a system in which a set of heterogeneous artificial neural networks is generated in order to outperform single-network classifiers. This thesis differs from other work on ensembles of neural networks [1, 2, 3, 4, 5, 6, 7] and is organized as follows. First, a comparison of ensemble methods is introduced in order to provide a ranked list of the best ensemble methods in the literature. This comparison is split into two studies, which correspond to two chapters of the thesis. Another important issue for ensembles of neural networks is how to combine the information provided by the individual networks in the ensemble. The literature offers several alternatives for obtaining an accurate combination of the information provided by the heterogeneous set of networks; for this reason, a comparison of combiners is also included in this thesis. Furthermore, ensembles of neural networks are only one kind of Multiple Classifier System (MCS) based on neural networks. Other alternatives for generating neural-network-based MCS, quite different from ensembles, also exist; the most important are Stacked Generalization and Mixture of Experts. These two systems are also analysed in this thesis and new alternatives are proposed. One result of the comparative research is a deep understanding of the field of ensembles, so that new ensemble methods and combiners can be designed after analysing the results of the comparisons. Concretely, two new ensemble methods, a new ensemble methodology called Cross-Validated Boosting, and two reordering algorithms are proposed in this thesis; the best overall results are obtained by the proposed ensemble methods. Finally, all the experiments have been carried out on a common experimental setup: each experiment was repeated ten times on nineteen different datasets from the UCI repository in order to validate the results, and the procedure applied to set specific parameters is similar across all experiments. The main contributions are: 1) an experimental setup that can be applied in further comparisons; 2) a guide to selecting the most appropriate methods for building and combining ensembles and multiple classifier systems; 3) new methods for building ensembles and other multiple classifier systems.
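The sketch below illustrates the basic ensemble-plus-combiner structure discussed above: several networks trained on bootstrap resamples are combined by averaging their class-probability outputs. It is a generic illustration, not one of the ensemble methods or combiners proposed in the thesis; the dataset and network sizes are arbitrary.

```python
# Sketch: ensemble of neural networks with an output-averaging combiner.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nets = []
for seed in range(9):
    # Bagging-style diversity: each network is trained on a bootstrap resample
    Xb, yb = resample(X_tr, y_tr, random_state=seed)
    net = make_pipeline(StandardScaler(),
                        MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                                      random_state=seed))
    nets.append(net.fit(Xb, yb))

# Combiner: average the class-probability outputs of the individual networks
proba = np.mean([net.predict_proba(X_te) for net in nets], axis=0)
print("ensemble test accuracy: %.3f" % np.mean(proba.argmax(axis=1) == y_te))
```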
58

Non-Destructive VIS/NIR Reflectance Spectrometry for Red Wine Grape Analysis

Fadock, Michael 04 August 2011 (has links)
A novel non-destructive method of grape berry analysis is presented that uses reflected light to predict berry composition. The reflectance spectrum was collected using a diode array spectrometer (350 to 850 nm) over the 2009 and 2010 growing seasons. Partial least squares regression (PLS) and support vector machine regression (SVMR) generated calibrations between reflected light and composition for five berry components: total soluble solids (°Brix), titratable acidity (TA), pH, total phenols, and anthocyanins. Standard methods of analysis for the components were employed and characterized for error. Decomposition of the reflectance data was performed by principal component analysis (PCA) and independent component analysis (ICA). Regression models were constructed using 10×10-fold cross-validated PLS and SVM models subject to smoothing, differentiation, and normalization pretreatments. All generated models were validated on the alternate season using two model selection strategies: minimum root mean squared error of prediction (RMSEP) and the "oneSE" heuristic. PCA/ICA decomposition revealed consistent features in the long visible wavelengths and the NIR region, and these features were consistent across seasons. 2009 was generally more variable, possibly due to cold weather effects. RMSEP and R² statistics indicate that the PLS °Brix, pH, and TA models predict well for 2009 and 2010; SVM was marginally better. The R² values of the PLS °Brix, pH, and TA models for 2009 and 2010 were 0.84, 0.58 and 0.56, and 0.89, 0.81 and 0.58, respectively. The 2010 °Brix models were suitable for rough screening. Optimal pretreatments were SG smoothing and relative normalization. Anthocyanins were well predicted in 2009 (R² 0.65) but not in 2010 (R² 0.15). Phenols were not well predicted in either year (R² 0.15-0.25). Validation demonstrated that the °Brix, pH, and TA models from 2009 transferred to 2010 with fair results (R² 0.70, 0.72, 0.31). Models generated using 2010 reflectance data could not predict the 2009 data. It is hypothesized that weather events present in 2009 and not in 2010 allowed a forward calibration transfer and prevented the reverse transfer. Heuristic selection was superior to minimum RMSEP for transfer, indicating some overfitting in the minimum-RMSEP models. The results demonstrate a reflectance-composition relationship in the VIS-NIR region for °Brix, pH, and TA that warrants additional study and the development of further calibrations.
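As a rough sketch of the modelling pipeline described (on synthetic spectra, not the thesis data), the example below applies Savitzky-Golay smoothing to reflectance spectra and evaluates a cross-validated PLS calibration against a berry component such as °Brix; the number of latent variables and all data values are assumptions.

```python
# Sketch: SG-smoothed spectra -> cross-validated PLS regression (illustrative only).
import numpy as np
from scipy.signal import savgol_filter
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 120, 500                 # assumed spectral grid
spectra = rng.normal(size=(n_samples, n_wavelengths)).cumsum(axis=1)  # toy spectra
brix = spectra[:, 200] * 0.05 + rng.normal(0, 0.2, n_samples)         # toy target

# Savitzky-Golay smoothing pretreatment along the wavelength axis
X = savgol_filter(spectra, window_length=15, polyorder=2, axis=1)

pls = PLSRegression(n_components=8)                 # assumed number of latent variables
r2 = cross_val_score(pls, X, brix, cv=10, scoring="r2")
print("10-fold CV R²: %.2f ± %.2f" % (r2.mean(), r2.std()))
```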
59

Clustering, Classification, and Factor Analysis in High Dimensional Data Analysis

Wang, Yanhong 17 December 2013 (has links)
Clustering, classification, and factor analysis are three popular data mining techniques. In this dissertation, we investigate these methods in high dimensional data analysis. Since there are many more features than samples and most of the features are non-informative in high dimensional data, dimension reduction is necessary before clustering or classification can be performed. In the first part of this dissertation, we revisit an existing clustering procedure, optimal discriminant clustering (ODC; Zhang and Dai, 2009), and propose to use cross-validation to select its tuning parameter. We then develop a variation of ODC, sparse optimal discriminant clustering (SODC), for high dimensional data by adding a group-lasso type of penalty to ODC. We also demonstrate that both ODC and SODC can be used as dimension reduction tools for data visualization in cluster analysis. In the second part, three existing sparse principal component analysis (SPCA) methods, Lasso-PCA (L-PCA), Alternative Lasso PCA (AL-PCA), and sparse principal component analysis by choice of norm (SPCABP), are applied to a real data set from the International HapMap Project for AIM selection from genome-wide SNP data; their classification accuracy is compared, and it is demonstrated that SPCABP outperforms the other two SPCA methods. Third, we propose a novel method called sparse factor analysis by projection (SFABP), based on SPCABP, and propose to use cross-validation for the selection of the tuning parameter and the number of factors. Our simulation studies show that SFABP performs better than unpenalized factor analysis when applied to classification problems.
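As a generic illustration (SPCABP and the proposed SFABP are not available in standard libraries), the sketch below uses scikit-learn's SparsePCA as a stand-in: sparse components are extracted from high-dimensional synthetic data and passed to a classifier, with cross-validation of the kind the dissertation uses to tune the sparsity parameter and the number of components.

```python
# Sketch: sparse PCA as a dimension reduction step before classification.
from sklearn.datasets import make_classification
from sklearn.decomposition import SparsePCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=500, n_informative=20,
                           random_state=0)

model = Pipeline([
    ("spca", SparsePCA(n_components=5, alpha=1.0, random_state=0)),  # assumed settings
    ("clf", LogisticRegression(max_iter=1000)),
])
# Cross-validation evaluates the pipeline; the same machinery could be looped over
# alpha and n_components to choose them, as the dissertation proposes.
print("5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())
```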
60

Optimal Active Learning: experimental factors and membership query learning

Yu-hui Yeh Unknown Date (has links)
The field of Machine Learning is concerned with the development of algorithms, models and techniques that solve challenging computational problems by learning from data representative of the problem (e.g. given a set of medical images previously classified by a human expert, build a model to predict unseen images as either benign or malignant). Many important real-world problems have been formulated as supervised learning problems, where the assumption is that a data set is available containing the correct output (e.g. class label or target value) for each given data point. In many application domains, obtaining the correct outputs (labels) for data points is a costly and time-consuming task. This has motivated the development of Machine Learning techniques that attempt to minimize the number of labeled data points while maintaining good generalization performance on a given problem. Active Learning is one such class of techniques and is the focus of this thesis.

Active Learning algorithms select or generate unlabeled data points to be labeled and use these points for learning. If successful, an Active Learning algorithm should be able to produce learning performance (e.g. test set error) comparable to that of an equivalent supervised learner while using fewer labeled data points. Theoretical, algorithmic and experimental Active Learning research has been conducted and a number of successful applications have been demonstrated. However, the scope of many of the experimental studies on Active Learning has been relatively small and there are very few large-scale experimental evaluations of Active Learning techniques. A significant amount of performance variability exists across Active Learning experimental results in the literature. Furthermore, the implementation details and effects of experimental factors have not been closely examined in empirical Active Learning research, creating some doubt over the strength and generality of conclusions that can be drawn from such results.

The Active Learning model/system used in this thesis is the Optimal Active Learning (OAL) algorithm framework with Gaussian Processes for regression problems (although most of the research questions are of general interest in many other Active Learning scenarios). Experimental and implementation details of the Active Learning system used are described in detail, using a number of regression problems and datasets of different types. It is shown that the experimental results of the system are subject to significant variability across problem datasets. The hypothesis that experimental factors can account for this variability is then investigated. The results show the impact of sampling and of the sizes of the datasets used when generating experimental results. Furthermore, preliminary experimental results expose performance variability across various real-world regression problems. The results suggest that these experimental factors can, to a large extent, account for the variability observed in experimental results. A novel resampling technique for Optimal Active Learning, called '3-Sets Cross-Validation', is proposed as a practical solution to reduce experimental performance variability, and further results confirm the usefulness of the technique.

The thesis then proposes an extension to the Optimal Active Learning framework to perform learning via membership queries, using a novel algorithm named MQOAL. The MQOAL algorithm employs the Metropolis-Hastings Markov chain Monte Carlo (MCMC) method to sample data points for query selection.
Experimental results show that MQOAL provides comparable performance to the pool-based OAL learner, using a very generic, simple MCMC technique, and is robust to experimental factors related to the MCMC implementation. The possibility of making queries in batches is also explored experimentally, with results showing that while some performance degradation does occur, it is minimal for learning in small batch sizes, which is likely to be valuable in some real-world problem domains.
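The sketch below shows a generic pool-based active learning loop with a Gaussian Process regressor, where the query criterion is simply maximum predictive variance. This is an illustrative stand-in for the pool-based setting discussed above, not the OAL or MQOAL algorithms; the toy target function and kernel settings are assumptions.

```python
# Sketch: pool-based active learning with a GP, querying the most uncertain point.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
def f(x):                                          # toy target function (assumed)
    return np.sin(3 * x).ravel()

pool = rng.uniform(-2, 2, size=(200, 1))           # unlabeled pool
labeled_idx = list(rng.choice(200, size=3, replace=False))

for step in range(15):
    X_l = pool[labeled_idx]
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X_l, f(X_l))
    mean, std = gp.predict(pool, return_std=True)
    candidates = [i for i in range(len(pool)) if i not in labeled_idx]
    query = max(candidates, key=lambda i: std[i])  # most uncertain pool point
    labeled_idx.append(query)                      # "oracle" labels it via f()

# Refit on all queried points and report test error on a dense grid
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(
    pool[labeled_idx], f(pool[labeled_idx]))
test = np.linspace(-2, 2, 400).reshape(-1, 1)
rmse = np.sqrt(np.mean((gp.predict(test) - f(test)) ** 2))
print("RMSE after %d labels: %.4f" % (len(labeled_idx), rmse))
```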
