31

Stability Selection of the Number of Clusters

Reizer, Gabriella 18 April 2011 (has links)
Selecting the number of clusters is one of the greatest challenges in clustering analysis. In this thesis, we propose a variety of stability selection criteria based on cross-validation for determining the number of clusters. Clustering stability measures the agreement of clusterings obtained by applying the same clustering algorithm on multiple independent and identically distributed samples. We propose to measure the clustering stability by the correlation between two clustering functions. These criteria are motivated by the concept of clustering instability proposed by Wang (2010), which is based on a form of clustering distance. In addition, the effectiveness and robustness of the proposed methods are numerically demonstrated on a variety of simulated and real-world samples.
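To make the criterion concrete, here is a minimal Python sketch (illustrative, not the thesis code) of cross-validation-based stability selection: the data are split repeatedly, the same algorithm (k-means here, an assumption) is fit on each half, and agreement is scored as a correlation between the co-clustering indicators of the two clustering functions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def stability_score(X, k, n_splits=20, seed=0):
    """Average agreement of k-means clusterings fit on disjoint halves."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(len(X))
        a, b = idx[: len(X) // 2], idx[len(X) // 2 :]
        km_a = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[a])
        km_b = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[b])
        # Apply both clustering functions to the full sample ...
        la, lb = km_a.predict(X), km_b.predict(X)
        # ... and correlate their co-clustering indicators (same-cluster
        # membership for every pair of points), a correlation-style
        # stability measure.
        ca = (la[:, None] == la[None, :]).ravel()
        cb = (lb[:, None] == lb[None, :]).ravel()
        scores.append(np.corrcoef(ca, cb)[0, 1])
    return float(np.mean(scores))

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
# Select the number of clusters with the highest average stability.
print(max(range(2, 8), key=lambda k: stability_score(X, k)))
```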
32

Spatio-temporal prediction modeling of clusters of influenza cases

Qiu, Weiyu Unknown Date
No description available.
33

Bayesian Analysis of Spatial Point Patterns

Leininger, Thomas Jeffrey January 2014 (has links)
We explore the posterior inference available for Bayesian spatial point process models. In the literature, discussion of such models is usually focused on model fitting and rejecting complete spatial randomness, with model diagnostics and posterior inference often left as an afterthought. Posterior predictive point patterns are shown to be useful in performing model diagnostics and model selection, as well as providing a wide array of posterior model summaries. We prescribe Bayesian residuals and methods for cross-validation and model selection for Poisson processes, log-Gaussian Cox processes, Gibbs processes, and cluster processes. These novel approaches are demonstrated using existing datasets and simulation studies. / Dissertation
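As an illustration of the posterior predictive idea, the following sketch (an assumed setup, not from the dissertation) checks a homogeneous Poisson process fit by drawing the intensity from a conjugate Gamma posterior and comparing predictive point counts with the observed count.

```python
import numpy as np

rng = np.random.default_rng(0)
observed_n = 120                 # observed point count in a unit-area window
a0, b0 = 1.0, 0.01               # Gamma(shape a0, rate b0) prior on intensity

# Conjugate update for a homogeneous Poisson process observed on |W| = 1:
post_shape, post_rate = a0 + observed_n, b0 + 1.0

# Posterior predictive replicates of the point count.
lam = rng.gamma(post_shape, 1.0 / post_rate, size=2000)  # intensity draws
reps = rng.poisson(lam)                                  # predictive counts

# One-sided posterior predictive p-value for the count statistic.
print(f"predictive p-value: {np.mean(reps >= observed_n):.3f}")
```

A p-value far from the middle of (0, 1) would flag a mismatch between the fitted process and the observed pattern; the same recipe extends to richer summaries such as Ripley's K.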
34

Exploiting diversity for efficient machine learning

Geras, Krzysztof Jerzy January 2018 (has links)
A common practice for solving machine learning problems is currently to consider each problem in isolation, starting from scratch every time a new learning problem is encountered or a new model is proposed. This is a perfectly feasible solution when the problems are sufficiently easy or when, for hard problems, a large amount of resources, both training data and computation, is available. Although this naive approach has been the main focus of machine learning research for a few decades and has had a lot of success, it becomes infeasible if the problem is too hard in proportion to the available resources. When a complex model is used in this naive approach, large data sets must be collected (if that is possible at all) to avoid overfitting, which in turn requires large computational resources, first during training to process the data and then at test time to execute the complex model. An alternative to treating each learning problem independently is to leverage related data sets and the computation encapsulated in previously trained models. By doing so we can decrease the amount of data necessary to reach a satisfactory level of performance and, consequently, improve the achievable accuracy and decrease training time. Our attack on this problem is to exploit diversity: in the structure of the data set, in the features learnt, and in the inductive biases of different neural network architectures. In the setting of learning from multiple sources we introduce multiple-source cross-validation, which gives an unbiased estimator of the test error when the data set is composed of data coming from multiple sources and the data at test time come from a new, unseen source. We also propose new estimators of the variance of standard k-fold cross-validation and of multiple-source cross-validation, which have lower bias than previously known ones. To improve unsupervised learning we introduce scheduled denoising autoencoders, which learn a more diverse set of features than the standard denoising autoencoder. This is thanks to their training procedure, which starts with a high level of noise, when the network is learning coarse features, and then lowers the noise gradually, allowing the network to learn more local features. A connection between this training procedure and curriculum learning is also drawn. We develop the idea of learning a diverse representation further by explicitly incorporating the goal of obtaining a diverse representation into the training objective. The proposed model, the composite denoising autoencoder, learns multiple subsets of features focused on modelling variations in the data set at different levels of granularity. Finally, we introduce the idea of model blending, a variant of model compression in which the two models, the teacher and the student, are both strong models but differ in their inductive biases. As an example, we train convolutional networks using the guidance of bidirectional long short-term memory (LSTM) networks. This makes it possible to train a convolutional neural network that is more accurate than the LSTM network at no extra cost at test time.
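A minimal sketch of multiple-source cross-validation as described above, assuming scikit-learn and synthetic data: each source is held out in turn, so the test condition matches prediction on an unseen source.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss

# Hold out one source at a time, train on the remaining sources, and
# average the held-out errors. Data and model below are placeholders.

def multiple_source_cv(X, y, sources, make_model):
    errors = []
    for s in np.unique(sources):
        held_out = sources == s
        model = make_model().fit(X[~held_out], y[~held_out])
        errors.append(zero_one_loss(y[held_out], model.predict(X[held_out])))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
sources = np.repeat(np.arange(6), 100)           # six data sources
y = (X[:, 0] + 0.1 * sources + rng.normal(scale=0.5, size=600) > 0).astype(int)
print(multiple_source_cv(X, y, sources, lambda: LogisticRegression()))
```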
35

Aeroacústica de motores aeronáuticos: uma abordagem por meta-modelo / Aeroengine aeroacoustics: a meta-model approach

Rafael Gigena Cuenca 20 June 2017 (has links)
Since the last decade, the member countries of ICAO, through their aeronautical authorities, have been gradually tightening the restrictions on external aircraft noise levels, especially in the vicinity of airports. Because of that, new aero-engines need quieter designs, so noise prediction techniques for aero-engines are becoming ever more important. Semi-analytical techniques have evolved considerably since the 1970s, but semi-empirical techniques still have their bases pegged to techniques and data from the 1970s, such as those developed in the ANOPP project. An aeroacoustic fan rig for investigating a rotor/stator assembly was built at the Aeronautical Engineering Department of the São Carlos School of Engineering, allowing the development of a methodology capable of producing a semi-empirical technique based on new data and methods. The rig can vary the rotation speed and the rotor/stator spacing and can control the mass flow rate, resulting in a set of 71 configurations tested. To measure the noise, a wall-mounted microphone antenna with 14 sensors was used. The broadband noise spectrum is modelled as pink noise and the tonal noise with an exponential behaviour, resulting in 5 parameters: the broadband noise level, linear decay and form factor, and the level of the first tone and the exponential decay of its harmonics. A Kriging surface regression is used to approximate the 5 parameters from the experimental variables, and the study showed that tip Mach number (Mach Tip) and RSS are the main variables defining the noise, as also used in the ANOPP project. A prediction model is thus defined for the rotor/stator assembly studied on the rig, allowing the spectrum to be predicted at operating conditions not tested. Analysis of the model yielded a tool for interpreting the results. Three cross-validation techniques were applied to the model: leave-one-out, Monte Carlo, and repeated k-folds. This analysis shows that the model has an average error in the total spectral noise level of 2.35 dB, with a standard deviation of 0.91.
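The meta-model step can be sketched with scikit-learn's Gaussian process regressor (Kriging); the data, kernel, and variable ranges below are illustrative assumptions, and leave-one-out cross-validation mirrors one of the validation schemes used.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Map experiment variables (e.g., tip Mach number, rotor/stator spacing)
# to one spectral parameter (e.g., broadband noise level). Synthetic data
# stand in for the 71 measured configurations.
rng = np.random.default_rng(0)
X = rng.uniform([0.3, 1.0], [0.7, 3.0], size=(71, 2))   # [Mach tip, RSS]
y = 80 + 40 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=1.0, size=71)

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
scores = cross_val_score(gp, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")
print(f"LOO mean absolute error: {-scores.mean():.2f} dB")
```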
36

Seleção e análise de associação genômica em dados simulados e da qualidade da carne de ovinos da raça Santa Inês / Genomic selection and association analysis in simulated data and meat quality of Santa Inês sheep breed

Simone Fernanda Nedel Pértile 19 August 2015 (has links)
Information from thousands of genetic markers has been included in animal breeding programs, allowing animals to be selected on this information and genomic regions associated with traits of economic interest to be identified. Because of the high cost associated with this technology and with data collection, simulated data are very important for the study of new methodologies. The objective of this work was to evaluate the efficiency of the ssGBLUP method, using weights for the genetic markers and genotype and phenotype information, with or without pedigree information, for genomic selection and genome-wide association, considering different heritability coefficients, the presence of a polygenic effect, and different numbers of QTL (quantitative trait loci) and selection pressures. Additionally, meat quality data from Santa Inês sheep were compared with the standards described for this breed. The studied population was obtained by data simulation and comprised 8,150 animals, of which 5,850 were genotyped. The simulated data were analysed using the ssGBLUP method with relationship matrices with or without pedigree information, using weights for the genetic markers obtained at each iteration. The meat quality traits studied were: rib-eye area, subcutaneous fat thickness, color, pH at slaughter and after 24 hours of carcass cooling, cooking losses, and shear force. The higher the heritability coefficient, the better the genomic selection and association results. The type of relationship matrix used had no influence on the identification of regions associated with the traits of interest. For traits with and without a polygenic effect at the same heritability coefficient, there were no differences in genomic selection, but QTL identification was better for traits without a polygenic effect. The greater the selection pressure, the more accurate the predictions of genomic breeding values. The meat quality data obtained from Santa Inês sheep are within the standards described for this breed, and several genomic regions associated with the studied traits were identified.
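The marker-reweighting idea can be illustrated with a simplified weighted SNP-BLUP, the marker-effect counterpart of weighted ssGBLUP; this sketch omits the pedigree information of the full method and uses synthetic data and an assumed shrinkage parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 1000
Z = rng.binomial(2, 0.5, size=(n, m)).astype(float)          # SNP genotypes
Z -= Z.mean(axis=0)                                          # centered
true = np.zeros(m)
true[rng.choice(m, 10, replace=False)] = rng.normal(size=10) # 10 QTL
y = Z @ true + rng.normal(scale=1.0, size=n)                 # phenotypes

w = np.ones(m)        # marker weights, updated each iteration
lam = 100.0           # assumed ridge/shrinkage parameter
for _ in range(3):
    # Weighted SNP-BLUP: beta = (Z'Z + lam * D^{-1})^{-1} Z'y, D = diag(w)
    beta = np.linalg.solve(Z.T @ Z + lam * np.diag(1.0 / w), Z.T @ y)
    w = beta**2 + 1e-8            # reweight markers by squared effect
    w *= m / w.sum()              # keep weights on a constant scale

# The largest estimated effects flag candidate QTL positions.
print(np.sort(np.argsort(-np.abs(beta))[:10]))
```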
37

An investigation of feature weighting algorithms and validation techniques using blind analysis for analogy-based estimation

Sigweni, Boyce B. January 2016 (has links)
Context: Software effort estimation is a very important component of the software development life cycle. It underpins activities such as planning, maintenance and bidding. Therefore, it has triggered much research over the past four decades, including many machine learning approaches. One popular approach, which has the benefit of accessible reasoning, is analogy-based estimation. Machine learning, including analogy, is known to benefit significantly from feature selection/weighting. Unfortunately, feature weighting search is an NP-hard problem and therefore computationally very demanding, if not intractable. Objective: One objective of this research is therefore to develop an efficient and effective feature weighting algorithm for estimation by analogy. However, a major challenge for the effort estimation research community is that experimental results tend to be contradictory and also lack reliability. This has been paralleled by a recent awareness of how bias can impact research results, which is a contributory reason why software effort estimation is still an open problem. Consequently, the second objective is to investigate research methods that might lead to more reliable results, focusing on blinding methods to reduce researcher bias. Method: In order to build on the most promising feature weighting algorithms, I conduct a systematic literature review. From this I develop a novel and efficient feature weighting algorithm. This is experimentally evaluated, comparing three feature weighting approaches with a naive benchmark using two industrial data sets. Using these experiments, I explore blind analysis as a technique to reduce bias. Results: The systematic literature review identified 19 relevant primary studies. Results from the meta-analysis of selected studies using a one-sample sign test (p = 0.0003) show a positive effect of feature weighting in general compared with ordinary analogy-based estimation (ABE); that is, feature weighting is a worthwhile technique for improving ABE. Nevertheless, the results remain imperfect, so there is still much scope for improvement. My experience shows that blinding can be a relatively straightforward procedure. I also highlight various statistical analysis decisions that ought not to be guided by the hunt for statistical significance, and show that results can be inverted merely through a seemingly inconsequential statistical nicety. After analysing results from 483 software projects from two separate industrial data sets, I conclude that the proposed technique improves accuracy over standard feature subset selection (FSS) and traditional case-based reasoning (CBR) when using pseudo time-series validation. Interestingly, there is no strong evidence of superior performance for the new technique when traditional validation techniques (jackknifing) are used, although the new technique is more efficient. Conclusion: There are two main findings: (i) Feature weighting techniques are promising for software effort estimation, but they need to be tailored to the target case for their potential to be adequately exploited. Although research findings show that allowing weights to differ in different parts of the instance space ('local' regions) may improve effort estimation results, the majority of studies in software effort estimation (SEE) do not take this into consideration; taking it into account represents an improvement over methods that do not. (ii) Whilst there are minor challenges and some limits to the degree of blinding possible, blind analysis is a very practical and easy-to-implement method that supports more objective analysis of experimental results. Therefore I argue that blind analysis should be the norm for analysing software engineering experiments.
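A minimal sketch of feature-weighted analogy-based estimation: the effort for a new project is the mean effort of its k nearest past projects under a weighted distance. The weights, features, and data are illustrative assumptions, not the thesis's algorithm.

```python
import numpy as np

def abe_estimate(X_hist, effort_hist, x_new, weights, k=3):
    """Analogy-based estimate: mean effort of the k nearest analogies
    under a feature-weighted Euclidean distance."""
    d = np.sqrt((((X_hist - x_new) ** 2) * weights).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return effort_hist[nearest].mean()

rng = np.random.default_rng(0)
X_hist = rng.uniform(size=(50, 4))            # 4 normalized project features
effort = 100 + 400 * X_hist[:, 0] + rng.normal(scale=20, size=50)
weights = np.array([1.0, 0.2, 0.2, 0.0])      # assumed feature weights
print(abe_estimate(X_hist, effort, rng.uniform(size=4), weights))
```

A feature weighting search would tune the weight vector (e.g., by minimising a cross-validated error over the historical projects), which is where the NP-hardness noted above comes in.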
38

Simulation and Application of Binary Logic Regression Models

Heredia Rico, Jobany J 01 April 2016 (has links)
Logic regression (LR) is a methodology for identifying logic combinations of binary predictors, in the form of intersections (and), unions (or) and negations (not), that are linearly associated with an outcome variable. Logic regression uses the predictors as inputs and enables us to identify important logic combinations of independent variables using a computationally efficient tree-based stochastic search algorithm, unlike classical regression models, which only consider pre-determined conventional interactions (the “and” rules). In this thesis, we focused on LR with a binary outcome in a logistic regression framework. Simulation studies were conducted to examine the performance of LR under the assumptions of independent and correlated observations, respectively, for various characteristics of the data sets and LR search parameters. We found that the proportion of times that LR selected the correct logic rule was usually low when the signal and/or the prevalence of the true logic rule was relatively low. The method performed satisfactorily under easy learning conditions such as high signal, simple logic rules and/or small numbers of predictors. Given the simulation characteristics and correlation structures tested, we found some, though not significant, differences in performance when LR was applied to dependent observations compared with the independent case. In addition to the simulation studies, an advanced application method was proposed that integrates LR and resampling methods in order to enhance LR performance. The proposed method was illustrated using two simulated data sets as well as a data set from a real-life situation, and showed some evidence of being effective in discerning the correct logic rule, even under unfavorable learning conditions.
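The search idea can be illustrated in miniature: the sketch below exhaustively scores and/or combinations (with optional negations) of predictor pairs against a binary outcome, whereas real logic regression uses a stochastic tree-based search over much larger rule spaces. The data and true rule are synthetic.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.4, size=(1000, 6)).astype(bool)    # binary predictors
y = (X[:, 0] & ~X[:, 3]) ^ (rng.random(1000) < 0.1)      # noisy true rule

def score(rule):
    """Absolute correlation between a candidate logic rule and the outcome."""
    return abs(np.corrcoef(rule, y)[0, 1])

best = None
for i, j in itertools.combinations(range(6), 2):
    for si, sj in itertools.product([False, True], repeat=2):  # negations
        a = ~X[:, i] if si else X[:, i]
        b = ~X[:, j] if sj else X[:, j]
        for op, rule in [("and", a & b), ("or", a | b)]:
            sc = score(rule)
            if best is None or sc > best[0]:
                best = (sc, i, si, j, sj, op)

print(best)   # expected to recover "X0 and not X3" under mild noise
```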
39

Využití bootstrapu a křížové validace v odhadu predikční chyby regresních modelů / Utilizing Bootstrap and Cross-validation for prediction error estimation in regression models

Lepša, Ondřej January 2014 (has links)
Finding a model that predicts well is one of the main goals of regression analysis. However, to evaluate a model's predictive ability, it is common practice to use criteria that either do not serve this purpose or are insufficiently reliable. As an alternative, there are relatively new methods that use repeated resampling to estimate an appropriate loss function: the prediction error. Cross-validation and the bootstrap belong to this category. This thesis describes how to utilize these methods in order to select a regression model that best predicts new values of the response variable.
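A minimal sketch of the two estimators on a linear model, using scikit-learn (an assumption; the thesis is not tied to any library): 10-fold cross-validation and the out-of-bag bootstrap both estimate the mean squared prediction error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

# 10-fold cross-validation estimate of the prediction error.
cv_err = []
for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    m = LinearRegression().fit(X[tr], y[tr])
    cv_err.append(mean_squared_error(y[te], m.predict(X[te])))

# Bootstrap estimate: train on a resample, test on out-of-bag observations.
boot_err = []
for _ in range(200):
    idx = rng.integers(0, len(X), len(X))
    oob = np.setdiff1d(np.arange(len(X)), idx)
    m = LinearRegression().fit(X[idx], y[idx])
    boot_err.append(mean_squared_error(y[oob], m.predict(X[oob])))

print(f"CV: {np.mean(cv_err):.3f}  bootstrap (OOB): {np.mean(boot_err):.3f}")
```

Either estimate can then be compared across candidate models, with the model minimising the estimated prediction error selected.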
40

Open and closed loop model identification and validation

Guidi, Figuroa Hernan 03 July 2009 (has links)
Closed-loop system identification and validation are important components of dynamic system modelling. In this dissertation, a comprehensive literature survey is compiled on system identification, with a specific focus on closed-loop system identification and on issues of identification experiment design and model validation. This is followed by simulated experiments on known linear and non-linear systems and by experiments on a pilot-scale distillation column. The aim of these experiments is to study several sensitivities between identification experiment variables and the consequent accuracy of identified models and discrimination capacity of validation sets under open- and closed-loop conditions. The identified model structure was limited to an ARX structure and the parameter estimation method to the prediction error method. The identification and validation experiments provided the following findings regarding the effects of different feedback conditions:
- Models obtained from open-loop experiments produced the most accurate responses when approximating the linear system. When approximating the non-linear system, models obtained from closed-loop experiments produced the most accurate responses.
- Validation sets obtained from open-loop experiments were most effective in discriminating between models approximating the linear system, while the same may be said of validation sets obtained from closed-loop experiments for the non-linear system.
These findings were mostly attributed to the condition that open-loop experiments produce more informative data than closed-loop experiments when no constraints are imposed on system outputs; when output constraints are imposed, closed-loop experiments produce the more informative data of the two. In identifying the non-linear system and the distillation column, it was established that defining a clear output range, and consequently a region of dynamics to be identified, is very important when identifying linear approximations of non-linear systems. Thus, since closed-loop experiments produce more informative data under output constraints, the closed-loop experiments were more effective on the non-linear systems. Assessment of other identification experiment variables revealed the following:
- Pseudo-random binary signals were the most persistently exciting signals, being the most consistent in producing models with accurate responses.
- Dither signals with frequency characteristics based on the system's dominant dynamics produced models with more accurate responses.
- Setpoint changes were found to be very important in maximising the generation of informative data in closed-loop experiments.
Based on the literature surveyed and the results obtained from the identification and validation experiments, it is recommended that, when identifying linear models approximating a linear system and validating such models, open-loop experiments be used to produce data for identification and cross-validation. When identifying linear approximations of a non-linear system, defining a clear output range and region of dynamics is essential and should be coupled with closed-loop experiments to generate data for identification and cross-validation. / Dissertation (MEng)--University of Pretoria, 2009. / Chemical Engineering / unrestricted
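As an illustration of the ARX/prediction-error setup, the sketch below identifies a known second-order discrete-time system excited by a PRBS-like dither using least squares; the system coefficients and noise level are assumptions for the example, and a true PRBS would come from a shift-register sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
u = np.where(rng.random(N) < 0.5, -1.0, 1.0)      # crude PRBS-like dither
y = np.zeros(N)
for t in range(2, N):                              # "true" system plus noise
    y[t] = (1.2 * y[t-1] - 0.4 * y[t-2]
            + 0.5 * u[t-1] + 0.2 * u[t-2]
            + rng.normal(scale=0.05))

# ARX fit: y[t] = a1*y[t-1] + a2*y[t-2] + b1*u[t-1] + b2*u[t-2]
# Build the regressor matrix for t = 2 .. N-1 and solve least squares,
# which is the prediction error estimate for this model structure.
Phi = np.column_stack([y[1:-1], y[:-2], u[1:-1], u[:-2]])
theta, *_ = np.linalg.lstsq(Phi, y[2:], rcond=None)
print(theta)   # estimates of [a1, a2, b1, b2]
```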
