About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations. Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Supervised Learning Techniques : A comparison of the Random Forest and the Support Vector Machine

Arnroth, Lukas, Fiddler Dennis, Jonni January 2016 (has links)
This thesis examines the performance of the support vector machine and the random forest in the context of binary classification. The two techniques are compared and the better-performing one is used to construct a final parsimonious model. The data set consists of 33 observations with 89 biomarkers as features and no known dependent variable; the dependent variable is generated through k-means clustering, with a predefined final solution of two clusters. Training is performed using five-fold cross-validation repeated twenty times. The training process reveals that the best-performing versions of the models are a linear support vector machine and a random forest with six randomly selected features at each split. On the test set, the optimally tuned random forest outperforms the linear-kernel support vector machine: the former classifies all observations correctly whilst the latter misclassifies one. Hence, a parsimonious random forest model using the top five features is constructed, which performs equally well on the test set as the original random forest model using all features.
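The pipeline the abstract describes — k-means to generate a binary label, then repeated five-fold cross-validation to compare a linear SVM against a random forest — can be sketched as follows. The data here are synthetic stand-ins with the thesis' dimensions (33 observations, 89 features); the cluster offset is invented for illustration, not taken from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the thesis data: 33 observations, 89 biomarkers.
X = rng.normal(size=(33, 89))
X[:16] += 1.5  # give k-means some structure to find (assumed, for illustration)

# No known dependent variable: generate labels via k-means with k = 2.
y = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Five-fold cross-validation repeated twenty times, as in the thesis.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
for name, model in [("linear SVM", SVC(kernel="linear")),
                    ("random forest", RandomForestClassifier(
                        n_estimators=500, max_features=6, random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```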
22

Tuning Parameter Selection in L1 Regularized Logistic Regression

Shi, Shujing 05 December 2012 (has links)
Variable selection is an important topic in regression analysis and aims to select the best subset of predictors. The least absolute shrinkage and selection operator (Lasso) was introduced by Tibshirani in 1996. It can serve as a variable selection tool because it shrinks some coefficients to exactly zero through a constraint on the sum of the absolute values of the regression coefficients. For logistic regression, the Lasso modifies traditional maximum likelihood estimation by adding the L1 norm of the parameters to the negative log-likelihood function, turning a maximization problem into a minimization one. To solve this problem, we first need a value for the weight on the L1 norm, called the tuning parameter. Since the tuning parameter affects both coefficient estimation and variable selection, we want to find its optimal value so as to obtain the most accurate coefficient estimates and the best subset of predictors in the L1-regularized regression model. Two popular methods select the tuning parameter value that yields the best subset of predictors: the Bayesian information criterion (BIC) and cross-validation (CV). The objective of this paper is to evaluate and compare these two methods, through simulation studies, in terms of coefficient estimation accuracy and variable selection.
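A minimal sketch of the comparison the abstract sets up: fit an L1-penalized logistic regression over a grid of tuning parameters and pick the value by BIC and by cross-validation. The data-generating model and the grid below are assumptions for illustration, not the paper's simulation design (note that scikit-learn parameterizes the penalty as C = 1/lambda).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.array([2.0, -2.0, 1.5] + [0.0] * (p - 3))  # sparse truth (assumed)
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta))).astype(int)

# Candidate tuning parameters (C = 1/lambda in scikit-learn).
Cs = np.logspace(-2, 2, 20)
bic, cv_acc = [], []
for C in Cs:
    m = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    prob = np.clip(m.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    k = np.count_nonzero(m.coef_)             # number of selected predictors
    bic.append(-2 * loglik + k * np.log(n))   # BIC criterion
    cv_acc.append(cross_val_score(m, X, y, cv=5).mean())

print("BIC picks C =", Cs[int(np.argmin(bic))])
print("CV  picks C =", Cs[int(np.argmax(cv_acc))])
```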
23

Image-Based Non-Contact Conductivity Prediction for Inkjet Printed Electrodes and Follow-Up Work of Toner Usage Prediction for Laser Electro-Photographic Printers

Yang Yan (6861362) 16 August 2019 (has links)
This thesis comprises two parts. The main part addresses conductivity prediction for inkjet-printed silver electrodes; the second covers follow-up work on toner usage prediction for laser electro-photographic printers.

For the conductivity prediction part: electronic devices made with inkjet printing and flexible thin films have recently attracted great attention due to their potential applications in sensor manufacturing. Because most thickness- or resistance-measuring devices can damage the surface of a printed electrode, or even the whole electrode, an imaging system is a valuable tool for monitoring the quality of inkjet-printed electrodes. Thus, a non-contact, image-based approach to estimating the sheet resistance of inkjet-printed electrodes is developed. The approach has two stages. First, strip-shaped electrodes are systematically printed with various printing parameters, and sheet resistance measurements as well as images of the electrodes are acquired. Then, based on the experimental data, a fitting model is constructed and used to predict the sheet resistance of inkjet-printed silver electrodes.

For the toner usage prediction part: with the widespread use of laser electro-photographic printers in both industry and households, estimating toner usage is important for ensuring the full utilization of each cartridge. The follow-up work focuses on testing and improving the feasibility, reliability, and adaptability of the Black Box Model (BBM) based two-stage strategy for estimating toner usage. Compared with previous methods, the training process for the first stage requires less time and disk storage while maintaining high accuracy. For the second stage, experiments are performed on various models of printers, with cyan (C), magenta (M), yellow (Y), and black (K) color cartridges.
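The two-stage idea — print, measure, fit, then predict from images alone — might be sketched as below. The feature (mean ink coverage) and the inverse relation to sheet resistance are hypothetical placeholders for illustration, not the thesis' actual image features or fitting model.

```python
import numpy as np

# Hypothetical stand-in data: the thesis' actual image features and units are
# not given here, so we fit on synthetic (darkness, sheet resistance) pairs.
rng = np.random.default_rng(2)
darkness = rng.uniform(0.2, 0.9, 40)                  # mean ink coverage per image
sheet_res = 5.0 / darkness + rng.normal(0, 0.3, 40)   # assumed inverse relation

# Stage 2: fit a model on measured data, then predict unseen electrodes
# from their images alone (linear in 1/darkness, an assumption).
coeffs = np.polyfit(1.0 / darkness, sheet_res, 1)
predict = lambda d: np.polyval(coeffs, 1.0 / d)
print("predicted sheet resistance at darkness 0.5:", predict(0.5))
```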
24

Predicting rifle shooting accuracy from context and sensor data : A study of how to perform data mining and knowledge discovery in the target shooting domain / Prediktering av skytteträffsäkerhet baserat på kontext och sensordata.

Pettersson, Max, Jansson, Viktor January 2019 (has links)
The purpose of this thesis is to develop an interpretable model that predicts which factors impacted a shooter's results. Experiment is our chosen research method. Our three independent variables are weapon movement, trigger pull force, and heart rate; our dependent variable is shooting accuracy. A random forest regression model is trained with the experiment data to predict shooting accuracy and to show correlations between the independent and dependent variables. Our method shows that increases in weapon movement, trigger pull force, and heart rate all decrease the predicted accuracy score. Weapon movement impacted shooting results the most, at 53.61%, while trigger pull force and heart rate impacted shooting results at 22.20% and 24.18% respectively. We have also shown that LIME can be a viable method to explain how the measured factors impacted shooting results. The results of this thesis lay the groundwork for better target shooting training tools built on explainable prediction models with sensors.
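As a sketch of the modeling step, the following trains a random forest regressor on synthetic stand-in data shaped like the thesis' three factors and scores each factor's influence. Permutation importance is used here as a simple stand-in for LIME (which would require the `lime` package); the data and coefficients are invented for illustration, not measured values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
n = 300
# Hypothetical stand-in for the sensor data: three normalized features.
weapon_movement = rng.uniform(0, 1, n)
trigger_force = rng.uniform(0, 1, n)
heart_rate = rng.uniform(0, 1, n)
X = np.column_stack([weapon_movement, trigger_force, heart_rate])
# Accuracy decreases as all three factors increase, as the thesis found;
# the coefficients are assumptions for the sketch.
accuracy = 10 - 5 * weapon_movement - 2 * trigger_force - 2 * heart_rate \
           + rng.normal(0, 0.3, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, accuracy)
imp = permutation_importance(model, X, accuracy, n_repeats=10, random_state=0)
for name, v in zip(["weapon movement", "trigger force", "heart rate"],
                   imp.importances_mean):
    print(f"{name}: importance {v:.3f}")
```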
25

Time Series Forecasting of House Prices: An evaluation of a Support Vector Machine and a Recurrent Neural Network with LSTM cells

Rostami, Jako, Hansson, Fredrik January 2019 (has links)
In this thesis, we examine the performance of different forecasting methods. We use data on monthly house prices from the greater Stockholm area and the municipality of Uppsala between 2005 and early 2019 as the time series to be forecast. First, we compare the performance of two machine learning methods, the Long Short-Term Memory and Support Vector Machine methods. Their forecasts are compared, and the model with the lowest forecasting error measured by three metrics is chosen for comparison with a classic seasonal ARIMA model. We find that the Long Short-Term Memory method is the better-performing machine learning method for a twelve-month forecast, but that it still does not forecast as well as the ARIMA model over the same forecast period.
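The model-selection step — scoring competing forecasts against the held-out series with several error metrics — can be sketched as follows; the twelve-month series and the forecast errors are illustrative numbers, not the thesis' data.

```python
import numpy as np

# Illustrative 12-month actuals and two competing forecasts.
actual = np.array([100, 102, 105, 103, 108, 110, 112, 111, 115, 118, 117, 120.0])
lstm_fc = actual + np.array([1, -2, 1, 2, -1, 1, -2, 1, 1, -1, 2, -1.0])
svm_fc = actual + np.array([3, -4, 2, 4, -3, 3, -2, 4, 2, -3, 4, -2.0])

def metrics(y, yhat):
    """Three common forecast-error metrics."""
    err = y - yhat
    return {"MAE": np.mean(np.abs(err)),
            "RMSE": np.sqrt(np.mean(err ** 2)),
            "MAPE": 100 * np.mean(np.abs(err / y))}

for name, fc in [("LSTM", lstm_fc), ("SVM", svm_fc)]:
    print(name, {k: round(v, 3) for k, v in metrics(actual, fc).items()})
```

The model with the lower scores across the metrics would then go forward to the ARIMA comparison.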
26

Aeroacústica de motores aeronáuticos: uma abordagem por meta-modelo / Aeroengine aeroacoustics: a meta-model approach

Cuenca, Rafael Gigena 20 June 2017 (has links)
Since the last decade, the aeronautical authorities of ICAO member countries have gradually tightened restrictions on external aircraft noise levels, especially in the vicinity of airports. New aero-engines therefore need quieter designs, making engine noise prediction techniques increasingly important. While semi-analytical techniques have evolved considerably over recent decades, semi-empirical techniques are still grounded in methods and data from the 1970s, such as those developed in the ANOPP project. An aeroacoustic fan rig for investigating a rotor/stator assembly was built at the Aeronautical Engineering Department of the São Carlos School of Engineering, enabling the development of a methodology for deriving a semi-empirical technique from new data and methods. The rig can vary rotation speed and rotor/stator spacing and control the mass flow rate, resulting in 71 tested configurations. Noise was measured with a 14-microphone wall antenna. The broadband noise spectrum is modeled as pink noise and the tonal noise with exponential behavior, resulting in five parameters: broadband level, linear decay, and shape factor, plus the level of the first tone and the exponential decay of its harmonics. A Kriging surface regression approximates the five parameters as functions of the experimental variables, and the study shows that tip Mach number and RSS (rotor/stator spacing) are the main variables defining the noise, as also used in the ANOPP project. A prediction model is thus defined for the rotor/stator assembly studied on the rig, allowing the spectrum to be predicted at untested operating conditions; analysis of the model yields a tool for interpreting the results. Three cross-validation techniques are applied to the model: leave-one-out, Monte Carlo, and repeated k-folds. The analysis shows that the model predicts the total spectrum noise level with a mean error of 2.35 dB and a standard deviation of 0.91 dB.
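The Kriging step can be sketched with a Gaussian process regression (Kriging's usual machine-learning form): approximate one of the spectrum parameters as a function of tip Mach number and RSS. The response surface below is an invented stand-in for the rig data, not the thesis' measurements.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)
# Hypothetical stand-in for the 71 rig configurations: tip Mach number and
# rotor/stator spacing (RSS) as inputs, broadband noise level as the target.
mach_tip = rng.uniform(0.3, 0.8, 71)
rss = rng.uniform(1.0, 3.0, 71)
X = np.column_stack([mach_tip, rss])
level_db = (50 * np.log10(mach_tip) + 90 - 3 * np.log10(rss)
            + rng.normal(0, 0.5, 71))  # assumed response surface + noise

gp = GaussianProcessRegressor(kernel=RBF([0.1, 0.5]) + WhiteKernel(0.25),
                              normalize_y=True).fit(X, level_db)
# Predict the spectrum parameter at an untested operating condition.
pred, std = gp.predict(np.array([[0.6, 2.0]]), return_std=True)
print(f"predicted level: {pred[0]:.1f} dB +/- {std[0]:.1f}")
```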
27

Seleção e análise de associação genômica em dados simulados e da qualidade da carne de ovinos da raça Santa Inês / Genomic selection and association analysis in simulated data and meat quality of Santa Inês sheep breed

Pértile, Simone Fernanda Nedel 19 August 2015 (has links)
Information from thousands of genetic markers has been included in animal breeding programs, allowing animals to be selected on this information and genomic regions associated with traits of economic interest to be identified. Because of the high cost of this technology and of data collection, simulated data are very important for studying new methodologies. The objective of this work was to evaluate the efficiency of the ssGBLUP method, using weights for the genetic markers and genotype and phenotype information, with or without pedigree information, for genomic selection and genome-wide association, considering different heritability coefficients, the presence of a polygenic effect, different numbers of QTL (quantitative trait loci), and different selection pressures. Additionally, meat quality data from Santa Inês sheep were compared with the standards described for the breed. The simulated population comprised 8,150 animals, of which 5,850 were genotyped. The simulated data were analyzed with ssGBLUP using relationship matrices with or without pedigree information, with marker weights obtained at each iteration. The meat quality traits studied were: rib eye area, subcutaneous fat thickness, color, pH at slaughter and after 24 hours of carcass chilling, cooking losses, and shear force. The higher the heritability coefficient, the better the genomic selection and association results. The type of relationship matrix did not affect the identification of regions associated with the traits of interest. For traits with and without a polygenic effect at the same heritability coefficient, genomic selection did not differ, but QTL identification was better for traits without a polygenic effect. The greater the selection pressure, the more accurate the predictions of genomic breeding values. The meat quality data from Santa Inês sheep fall within the standards described for this breed, and several genomic regions associated with the studied traits were identified.
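The genomic prediction idea can be sketched with SNP-BLUP, the ridge-regression counterpart of GBLUP (the full weighted ssGBLUP with pedigree information is beyond a short example). Genotypes, QTL effects, and the ridge penalty below are simulated assumptions, not the thesis' simulation design.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n, p, n_qtl = 400, 1000, 20
# Hypothetical stand-in: genotypes coded 0/1/2 at p SNPs; a handful of QTL.
geno = rng.integers(0, 3, size=(n, p)).astype(float)
qtl = rng.choice(p, n_qtl, replace=False)
effects = np.zeros(p)
effects[qtl] = rng.normal(0, 0.5, n_qtl)
pheno = geno @ effects + rng.normal(0, 1.0, n)

# SNP-BLUP (equivalent to GBLUP) as ridge regression on the markers.
train, test = slice(0, 300), slice(300, 400)
model = Ridge(alpha=50.0).fit(geno[train], pheno[train])
gebv = model.predict(geno[test])  # genomic estimated breeding values
acc = np.corrcoef(gebv, pheno[test])[0, 1]
print(f"prediction accuracy (correlation): {acc:.2f}")
```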
28

An Empirical Study of Mortality Projection Methods for the Taiwan Area and Related Annuity Issues (台灣地區死亡率推估的實證方法之研究與相關年金問題之探討)

曾奕翔 Unknown Date (has links)
In the Taiwan area, mortality rates at all ages have decreased since the end of World War II, and life expectancy has increased from 62 in the 1950s to 75 in 2000, an increase of 21%. The mortality improvement of the elderly (people aged 65 and over) is especially significant, resulting in rapid population aging in Taiwan; for example, the proportion of the elderly increased from 6.14% in 1990 to 8.52% in 2000. On one hand, a prolonged life span for an individual means a longer retirement and thus a larger retirement fund. On the other hand, longer lives require the government to provide a more thorough social welfare system for the elderly. Therefore, a reliable mortality projection is essential to both personal financial planning and social welfare planning. In this study, we have two main objectives. First, we explore some frequently used models, such as the Lee-Carter, multivariate regression, and principal component methods. We use the data from 1950 to 1995 as pilot data and 1996 to 2000 as test data to judge which method has the smallest prediction error; in addition, based on computer simulation, we evaluate the performance of the estimation methods for the Lee-Carter model. The second objective is to explore the effect of mortality improvement on the pure premium of annuity insurance. In particular, we calculate the pure premium of the annuity under the best model found in the first part, and compare it with those under the 1989 TSO and other life tables. We found that the pure premiums under current life tables are underestimated, which may cause the insolvency of insurance companies.
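The Lee-Carter model mentioned among the candidate methods decomposes log mortality as log m(x,t) = a_x + b_x k_t and is classically fitted with an SVD. A sketch on a synthetic mortality surface (the table dimensions and trend are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical age-by-year log-mortality table (20 ages, 30 years).
ages, years = 20, 30
a_true = np.linspace(-6, -1, ages)          # age pattern
b_true = np.full(ages, 1.0 / ages)          # age sensitivities, sum to 1
k_true = np.linspace(5, -5, years)          # declining mortality index
log_m = (a_true[:, None] + np.outer(b_true, k_true)
         + rng.normal(0, 0.01, (ages, years)))

# Lee-Carter fit: a_x from row means, (b_x, k_t) from the leading SVD factor.
a_hat = log_m.mean(axis=1)
U, s, Vt = np.linalg.svd(log_m - a_hat[:, None], full_matrices=False)
b_hat = U[:, 0] / U[:, 0].sum()             # normalize so sum(b_x) = 1
k_hat = s[0] * Vt[0] * U[:, 0].sum()        # rescale so b_x * k_t is unchanged
print("fitted k_t trend (first, last):", k_hat[0].round(2), k_hat[-1].round(2))
```

Forecasting then extrapolates k_t (typically as a random walk with drift) and recombines with a_x and b_x.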
29

Probabilistic inequalities for the cross-validation estimator in statistical learning, and statistical models applied to economics and finance

Cornec, Matthieu 04 June 2009 (has links) (PDF)
The initial objective of the first part of this thesis is to shed theoretical light on a practice widely used by practitioners for the audit (risk assessment) of predictive methods (predictors): cross-validation. The second part belongs mainly to the theory of stochastic processes, and its contribution essentially concerns applications to economic and financial data. Chapter 1 deals with the classical case of predictors of finite Vapnik-Chervonenkis dimension (VC dimension hereafter) obtained by empirical risk minimization. Chapter 2 considers a broader class of predictors than that of Chapter 1: stable estimators. In this setting, we show that cross-validation methods are still consistent. In Chapter 3, we exhibit an important special case, subagging, where cross-validation makes it possible to construct narrower confidence intervals than the traditional methodology based on empirical risk minimization under the finite-VC-dimension assumption. Chapter 4 proposes a monthly proxy of the growth rate of French Gross Domestic Product, which is officially available only at quarterly frequency. Chapter 5 describes the methodology for building a monthly composite indicator from business surveys in the French services sector; the composite indicator is published monthly by Insee in the Informations Rapides. Chapter 6 describes a semi-parametric model of spot electricity prices on wholesale markets, with applications to risk management in electricity generation.
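The subagging scheme of Chapter 3 — averaging predictors fitted on subsamples drawn without replacement — can be sketched as follows; the regression problem, base learner, and subsample fraction are illustrative choices, not the thesis' setting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
n = 400
X = rng.uniform(-2, 2, (n, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.3, n)

def subagg_predict(X_tr, y_tr, X_te, n_estimators=50, frac=0.5, seed=0):
    """Subagging: average base predictors fitted on subsamples drawn
    without replacement."""
    r = np.random.default_rng(seed)
    preds = []
    for _ in range(n_estimators):
        idx = r.choice(len(y_tr), int(frac * len(y_tr)), replace=False)
        tree = DecisionTreeRegressor().fit(X_tr[idx], y_tr[idx])
        preds.append(tree.predict(X_te))
    return np.mean(preds, axis=0)

X_te = np.linspace(-2, 2, 100)[:, None]
y_hat = subagg_predict(X, y, X_te)
mse = np.mean((y_hat - np.sin(3 * X_te[:, 0])) ** 2)
print(f"subagging MSE vs truth: {mse:.3f}")
```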
30

Algorithms for a Partially Regularized Least Squares Problem

Skoglund, Ingegerd January 2007 (has links)
<p>Vid analys av vattenprover tagna från t.ex. ett vattendrag betäms halten av olika ämnen. Dessa halter är ofta beroende av vattenföringen. Det är av intresse att ta reda på om observerade förändringar i halterna beror på naturliga variationer eller är orsakade av andra faktorer. För att undersöka detta har föreslagits en statistisk tidsseriemodell som innehåller okända parametrar. Modellen anpassas till uppmätta data vilket leder till ett underbestämt ekvationssystem. I avhandlingen studeras bl.a. olika sätt att säkerställa en unik och rimlig lösning. Grundidén är att införa vissa tilläggsvillkor på de sökta parametrarna. I den studerade modellen kan man t.ex. kräva att vissa parametrar inte varierar kraftigt med tiden men tillåter årstidsvariationer. Det görs genom att dessa parametrar i modellen regulariseras.</p><p>Detta ger upphov till ett minsta kvadratproblem med en eller två regulariseringsparametrar. I och med att inte alla ingående parametrar regulariseras får vi dessutom ett partiellt regulariserat minsta kvadratproblem. I allmänhet känner man inte värden på regulariseringsparametrarna utan problemet kan behöva lösas med flera olika värden på dessa för att få en rimlig lösning. I avhandlingen studeras hur detta problem kan lösas numeriskt med i huvudsak två olika metoder, en iterativ och en direkt metod. Dessutom studeras några sätt att bestämma lämpliga värden på regulariseringsparametrarna.</p><p>I en iterativ lösningsmetod förbättras stegvis en given begynnelseapproximation tills ett lämpligt valt stoppkriterium blir uppfyllt. Vi använder här konjugerade gradientmetoden med speciellt konstruerade prekonditionerare. Antalet iterationer som krävs för att lösa problemet utan prekonditionering och med prekonditionering jämförs både teoretiskt och praktiskt. Metoden undersöks här endast med samma värde på de två regulariseringsparametrarna.</p><p>I den direkta metoden används QR-faktorisering för att lösa minsta kvadratproblemet. 
Idén är att först utföra de beräkningar som kan göras oberoende av regulariseringsparametrarna samtidigt som hänsyn tas till problemets speciella struktur.</p><p>För att bestämma värden på regulariseringsparametrarna generaliseras Reinsch’s etod till fallet med två parametrar. Även generaliserad korsvalidering och en mindre beräkningstung Monte Carlo-metod undersöks.</p> / <p>Statistical analysis of data from rivers deals with time series which are dependent, e.g., on climatic and seasonal factors. For example, it is a well-known fact that the load of substances in rivers can be strongly dependent on the runoff. It is of interest to find out whether observed changes in riverine loads are due only to natural variation or caused by other factors. Semi-parametric models have been proposed for estimation of time-varying linear relationships between runoff and riverine loads of substances. The aim of this work is to study some numerical methods for solving the linear least squares problem which arises.</p><p>The model gives a linear system of the form <em>A</em><em>1x1</em><em> + A</em><em>2x2</em><em> + n = b</em><em>1</em>. The vector <em>n</em> consists of identically distributed random variables all with mean zero. The unknowns, <em>x,</em> are split into two groups, <em>x</em><em>1</em><em> </em>and <em>x</em><em>2</em><em>.</em> In this model, usually there are more unknowns than observations and the resulting linear system is most often consistent having an infinite number of solutions. Hence some constraint on the parameter vector x is needed. One possibility is to avoid rapid variation in, e.g., the parameters<em> x</em><em>2</em><em>.</em> This can be accomplished by regularizing using a matrix <em>A</em><em>3</em>, which is a discretization of some norm. The problem is formulated</p><p>as a partially regularized least squares problem with one or two regularization parameters. The parameter <em>x</em><em>2</em> has here a two-dimensional structure. 
By using two different regularization parameters it is possible to regularize separately in each dimension.</p><p>We first study (for the case of one parameter only) the conjugate gradient method for solution of the problem. To improve rate of convergence blockpreconditioners of Schur complement type are suggested, analyzed and tested. Also a direct solution method based on QR decomposition is studied. The idea is to first perform operations independent of the values of the regularization parameters. Here we utilize the special block-structure of the problem. We further discuss the choice of regularization parameters and generalize in particular Reinsch’s method to the case with two parameters. Finally the cross-validation technique is treated. Here also a Monte Carlo method is used by which an approximation to the generalized cross-validation function can be computed efficiently.</p>
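The Monte Carlo approximation of the generalized cross-validation function can be sketched for a ridge-type problem: estimate the trace of the influence matrix with random ±1 probes (Hutchinson's estimator) rather than computing it exactly. The small dense problem below is a stand-in; the thesis' structured problem would avoid forming the influence matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 40
A = rng.normal(size=(n, p))          # stand-in design matrix
x_true = np.zeros(p); x_true[:5] = 1.0
b = A @ x_true + rng.normal(0, 0.5, n)

def gcv(lam, n_probes=0):
    """GCV score for ridge regularization; the trace of the influence matrix
    H = A (A'A + lam I)^-1 A' is computed exactly (n_probes=0) or by
    Hutchinson's Monte Carlo estimator."""
    M = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T)
    H = A @ M
    resid = b - H @ b
    if n_probes == 0:
        tr_H = np.trace(H)
    else:  # E[z' H z] = tr(H) for random +-1 probe vectors z
        z = rng.choice([-1.0, 1.0], size=(n, n_probes))
        tr_H = np.mean(np.sum(z * (H @ z), axis=0))
    return (resid @ resid / n) / ((1 - tr_H / n) ** 2)

lams = np.logspace(-2, 3, 30)
exact = [gcv(l) for l in lams]
mc = [gcv(l, n_probes=20) for l in lams]
print("exact GCV minimizer:", lams[int(np.argmin(exact))])
print("Monte Carlo GCV minimizer:", lams[int(np.argmin(mc))])
```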
