211

High-dimensional classification and attribute-based forecasting

Lo, Shin-Lian 27 August 2010 (has links)
This thesis consists of two parts. The first part focuses on high-dimensional classification problems in microarray experiments; the second part deals with forecasting problems involving a large number of categories in the predictors. Classification problems in microarray experiments refer to discriminating subjects with different biological phenotypes or known tumor subtypes, as well as predicting clinical outcomes or prognostic stages of subjects. One important characteristic of microarray data is that the number of genes is much larger than the sample size. The penalized logistic regression method is known for performing variable selection and classification simultaneously. However, its performance declines as the number of variables increases. With this concern, in the first study we propose a new classification approach that applies penalized logistic regression iteratively with a controlled gene-subset size to maintain variable-selection consistency and classification accuracy. The second study is motivated by a modern microarray experiment that includes two layers of replicates. This new experimental setting makes most existing classification methods, including penalized logistic regression, inappropriate to apply directly, because the assumption of independent observations is violated. To solve this problem, we propose a new classification method that incorporates random effects into penalized logistic regression, so that the heterogeneity among experimental subjects and the correlations from repeated measurements are taken into account. An efficient hybrid algorithm is introduced to tackle the computational challenges in estimation and integration. Applications to a breast cancer study show that the proposed classification method obtains smaller models with higher prediction accuracy than a method that assumes independent observations.
The second part of this thesis develops a new forecasting approach for large-scale datasets with a large number of predictor categories and with structured predictors. The new approach goes beyond conventional tree-based methods by incorporating a general linear model and hierarchical splits, making the trees more comprehensive, efficient, and interpretable. In an empirical study in the air cargo industry and a simulation study covering several settings, the new approach achieved higher forecasting accuracy and computational efficiency than existing tree-based methods.
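The penalized logistic regression with variable selection described above can be sketched with a minimal L1-penalized fit. The proximal-gradient (ISTA) solver and all parameter values below are illustrative assumptions, not the thesis's algorithm; the point is only how the L1 penalty zeroes out coefficients and thereby selects variables while fitting the classifier.

```python
import numpy as np

def l1_logistic(X, y, lam=0.2, lr=0.1, n_iter=2000):
    """L1-penalized logistic regression fitted by proximal gradient
    descent (ISTA). The soft-thresholding step zeroes out weak
    coefficients, so the fit performs variable selection as well as
    classification. Solver and parameters are illustrative choices."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w)))   # logistic link
        grad = X.T @ (prob - y) / n             # gradient of the log-loss
        w -= lr * grad
        # proximal step for the L1 penalty: soft-thresholding
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w
```

On simulated data where only the first two of twenty "genes" are informative, the fit recovers their signs and leaves most noise coefficients at exactly zero.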
212

Improving process monitoring and modeling of batch-type plasma etching tools

Lu, Bo 01 September 2015 (has links)
Manufacturing equipment in semiconductor factories (fabs) provides abundant data and opportunities for data-driven process monitoring and modeling. In particular, virtual metrology (VM) is an active area of research. Traditional monitoring techniques based on univariate statistical process control charts do not provide immediate feedback on quality excursions, hindering fab-wide advanced process control initiatives. VM models, or inferential sensors, aim to bridge this gap by predicting quality measurements instantaneously from tool fault detection and classification (FDC) sensor measurements. Existing research on inferential sensors and VM has focused on comparing regression algorithms to demonstrate their feasibility in various applications. However, two important areas, data pretreatment and post-deployment model maintenance, are usually neglected in these discussions. Since industrial data is often of poor quality, and semiconductor processes undergo drifts and periodic disturbances, these two issues are roadblocks to wider adoption of inferential sensors and VM models. In data pretreatment, batch data collected from FDC systems usually contain inconsistent trajectories of varying durations, while most analysis techniques require the data from all batches to have the same duration and similar trajectory patterns. These inconsistencies, if unresolved, propagate into the developed model, complicate interpretation of the modeling results, and degrade model performance. To address this issue, a Constrained selective Derivative Dynamic Time Warping (CsDTW) method was developed to perform automatic alignment of trajectories. CsDTW is designed to preserve the key features that characterize each batch and can be solved efficiently in polynomial time. Variable selection after trajectory alignment is another topic that requires improvement.
To this end, the proposed Moving Window Variable Importance in Projection (MW-VIP) method yields a more robust set of variables with demonstrably stronger long-term correlation with the predicted output. In model maintenance, model adaptation has been the standard solution for dealing with drifting processes. However, most case studies preprocess the model-update data offline, implicitly assuming that the adaptation data is free of faults and outliers, which is often untrue in practical implementations. To address this, a moving-window scheme using Total Projection to Latent Structures (T-PLS) decomposition screens incoming updates, separating harmless process noise from the outliers that negatively affect the model; the integrated approach was demonstrated to be more robust. In addition, model adaptation is inefficient when there are multiplicities in the process, which can arise from process nonlinearity, switches in product grade, or different operating conditions. A growing-structure multiple-model system using local PLS and PCA models is proposed to improve model performance around process conditions with multiplicity. The use of local PLS and PCA models allows the method to handle a much larger set of inputs and to overcome several challenges of mixture-model systems. Fault detection sensitivity is also improved by using the multivariate monitoring statistics of these local PLS/PCA models. The proposed methods are tested on two plasma etch data sets provided by Texas Instruments. In addition, a proof of concept using virtual metrology in a controller performance assessment application was also tested.
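The trajectory-alignment problem that CsDTW addresses rests on dynamic time warping. A plain DTW distance, a simplified stand-in for CsDTW (no constraints, selectivity, or derivative features), can be computed with the classic dynamic program:

```python
import math

def dtw_distance(a, b):
    """Plain dynamic-time-warping distance between two 1-D series.
    A simplified stand-in for the thesis's CsDTW: D[i][j] is the
    cheapest cost of aligning a[:i] with b[:j], allowing repeats."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the best of: match, stretch a, or stretch b
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]
```

Two batch trajectories with the same shape but different durations align at zero cost, which is exactly why DTW-style methods suit FDC batch data of varying lengths.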
213

Exploring the Boundaries of Gene Regulatory Network Inference

Tjärnberg, Andreas January 2015 (has links)
To understand how the components of a complex system like the biological cell interact and regulate each other, we need to collect data on how the components respond to system perturbations. Such data can then be used to solve the inverse problem of inferring a network that describes how the pieces influence each other. The work in this thesis deals with modelling the cell regulatory system, often represented as a network, with tools and concepts derived from systems biology. The first investigation focuses on network sparsity and the algorithmic biases introduced by penalised network inference procedures. Many contemporary network inference methods rely on a sparsity parameter, such as the L1 penalty term used in the LASSO. However, a poor choice of the sparsity parameter can give highly incorrect network estimates. To avoid such poor choices, we devised a method to optimise the sparsity parameter so as to maximise the accuracy of the inferred network. We showed that it is effective on in silico data sets with a reasonable level of informativeness, and demonstrated that accurate prediction of network sparsity is key to elucidating the correct network parameters. The second investigation focuses on how knowledge from association networks can be transferred to regulatory network inference procedures. The quality of expression data is often inadequate for reliable gene regulatory network inference, so we constructed an algorithm to incorporate prior knowledge and demonstrated that it increases the accuracy of network inference when data quality is low. The third investigation aimed to understand the influence of system and data properties on network inference accuracy. L1 regularisation methods commonly produce poor network estimates when the data used for inference is ill-conditioned, even when the signal-to-noise ratio is so high that all links in the network can be proven to exist at the given significance level.
In this study we elucidated some general principles for the conditions under which we expect strongly degraded accuracy, which allowed us to estimate expected accuracy from the conditions of simulated data and thereby predict the performance of inference algorithms on biological data. Finally, we built a software package, GeneSPIDER, for solving problems encountered during the previous investigations. The package supports highly controllable network and data generation as well as data analysis and exploration in the context of network inference. (At the time of the doctoral defense, Paper 4 was an unpublished manuscript.)
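The role of the LASSO sparsity parameter described above can be illustrated with a minimal coordinate-descent LASSO: each gene's regulators are regressed on perturbation data, and the penalty `lam` directly controls how many links survive. This is an illustrative sketch, not the thesis's optimisation method.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal coordinate-descent LASSO for the objective
    (1/2n)||y - Xw||^2 + lam * ||w||_1. In network inference, X holds
    perturbation responses and the nonzeros of w are inferred links;
    larger lam prunes more links. Sketch only."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ w + X[:, j] * w[j]          # residual excluding j
            rho = X[:, j] @ r
            # closed-form coordinate update with soft-thresholding
            w[j] = np.sign(rho) * max(abs(rho) - lam * n, 0.0) / col_sq[j]
    return w
```

Scanning `lam` from loose to tight shows the trade-off the thesis optimises: a loose penalty recovers the true link, while an overly tight one empties the network.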
214

Some Topics in ROC Curves Analysis

Huang, Xin 07 May 2011 (has links)
The receiver operating characteristic (ROC) curve is a popular tool for evaluating continuous diagnostic tests. The traditional definition of ROC curves implicitly incorporates the idea of "hard" thresholding, which also makes the empirical curves step functions. The first topic introduces a novel definition of soft ROC curves, which incorporates the idea of "soft" thresholding. The softness of a soft ROC curve is controlled by a regularization parameter that can be selected by a cross-validation procedure, and a byproduct is that the corresponding empirical curves are smooth. The second topic concerns combining several diagnostic tests to achieve better diagnostic accuracy. We consider the optimal linear combination that maximizes the area under the ROC curve (AUC); the combination's coefficients can be estimated via a non-parametric procedure. However, for estimating the AUC associated with the estimated coefficients, the apparent estimation by re-substitution is too optimistic. To adjust for this upward bias, several methods are proposed. Among them the cross-validation approach is especially advocated, and an approximate cross-validation is developed to reduce the computational cost. These methods can also be applied to variable selection, to pick out important diagnostic tests. However, best-subset variable selection is not practical when the number of diagnostic tests is large. The third topic therefore develops a LASSO-type procedure for variable selection. To solve the non-convex maximization problem in the proposed procedure, an efficient algorithm is developed based on soft ROC curves, difference-of-convex programming, and the coordinate descent algorithm.
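The hard-versus-soft thresholding contrast above can be made concrete: the empirical AUC is the Mann-Whitney statistic over hard comparisons, while a "soft" true-positive rate replaces the indicator 1{x > t} with a sigmoid. The smoothing parameter `h` below plays the role of the regularization parameter; this is a sketch of the idea, not the thesis's exact formulation.

```python
import math

def empirical_auc(neg, pos):
    """Empirical AUC: fraction of (negative, positive) score pairs
    ranked correctly, ties counted as half (Mann-Whitney form)."""
    wins = sum((p > n) + 0.5 * (p == n) for n in neg for p in pos)
    return wins / (len(neg) * len(pos))

def soft_tpr(pos, t, h):
    """'Soft' true-positive rate at threshold t: the hard indicator
    1{x > t} is replaced by a sigmoid of bandwidth h, so the
    resulting ROC curve is smooth in t (h is the smoothing knob)."""
    return sum(1.0 / (1.0 + math.exp(-(x - t) / h)) for x in pos) / len(pos)
```

As `h` shrinks toward zero the soft rate approaches the hard step function, recovering the traditional empirical ROC curve.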
215

Modelling and forecasting economic time series with single hidden-layer feedforward autoregressive artificial neural networks

Rech, Gianluigi, January 1900 (has links)
Diss. Stockholm : Handelshögskolan, 2002.
216

Curve clustering and variable selection in mixed effects functional models. Applications to molecular biology

Giacofci, Joyce 22 October 2013 (has links)
More and more scientific studies collect large amounts of data consisting of sets of curves recorded on individuals. Such data can be seen as an extension of longitudinal data to high dimension and are often modeled as functional data in a mixed-effects framework. The first part focuses on unsupervised clustering of these curves in the presence of inter-individual variability. To this end, we develop a new procedure based on a wavelet representation of the model, for both fixed and random effects. Our approach follows two steps: a dimension-reduction step based on wavelet thresholding techniques is performed first, and a clustering step is then applied to the selected coefficients, with an EM algorithm used for maximum-likelihood estimation of the parameters. The properties of the overall procedure are validated by an extensive simulation study, and we illustrate the method on high-throughput molecular data (omics data) such as microarray CGH and mass spectrometry data. The procedure is implemented in the R package "curvclust", available on the CRAN website. The second part concentrates on estimation and dimension-reduction issues in the mixed-effects functional framework, with two distinct approaches. The first approach deals with parameter estimation in a nonparametric setting; we demonstrate that the functional fixed-effects estimator based on wavelet thresholding achieves the expected rate of convergence toward the true function. The second approach is dedicated to the selection of both fixed and random effects. We propose a method based on a penalized likelihood criterion with SCAD penalties for the estimation and selection of fixed effects and random-effects variances. In this variable-selection context we prove that the penalized estimators enjoy the oracle property when the signal size diverges with the sample size. A simulation study is carried out to assess the behaviour of the two proposed approaches.
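The wavelet-thresholding dimension-reduction step above can be sketched with the simplest wavelet, the Haar basis: decompose each curve, then zero the small detail coefficients. The basis choice and the hard-threshold rule are illustrative assumptions; the thesis's procedure works in a richer mixed-effects setting.

```python
def haar_transform(x):
    """Full Haar wavelet decomposition of a signal whose length is a
    power of two; returns [approximation, detail coefficients...].
    Each pass replaces pairs by their average and half-difference."""
    out = list(x)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        dif = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = avg + dif
        n = half
    return out

def hard_threshold(coeffs, t):
    """Dimension reduction: keep the approximation coefficient and
    any detail coefficient above threshold t; zero the rest."""
    return coeffs[:1] + [c if abs(c) > t else 0.0 for c in coeffs[1:]]
```

A flat curve compresses to a single approximation coefficient, while thresholding discards only the detail coefficients too small to matter for the clustering step.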
217

Applications of digital images and multivariate analysis for classification and determination of quality parameters in cotton lint

Gonçalves, Maria Ivanda Silva 31 August 2015 (has links)
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). / In recent years, commercial cotton lints have been developed with better quality, presenting different characteristics but similar coloring. This can be a problem because identification of these samples is largely performed by visual inspection, a subjective and error-prone method. Another option for classifying the samples is the HVI (High Volume Instruments) system, which determines physical quality parameters; however, this apparatus has a high cost compared with digital imaging, and it requires adequate infrastructure and a trained analyst. This work proposes a novel analytical method based on digital images and multivariate analysis for (1) classification of naturally colored cotton lint according to cultivar type and (2) simultaneous determination of the degree of yellowness (+b), reflectance (Rd), and wax content (WAX). Digital images of the cotton lints were acquired with a webcam, and histograms were obtained containing the color-level distributions in the standard RGB (red-green-blue) system, grayscale, and the HSV (hue-saturation-value) system. For classification of the samples, models based on partial least squares discriminant analysis (PLS-DA) and linear discriminant analysis (LDA) with variable selection by the successive projections algorithm (SPA) or stepwise selection (SW) were evaluated. For determination of the +b, Rd, and WAX parameters, PLS and multiple linear regression (MLR) models with variable selection by SPA were developed and compared. The best classification results were obtained with the LDA/SW model, with a correct classification rate of 96% for the test set using the HSV combination. As for the calibration methods, satisfactory prediction results were obtained for both models (PLS and MLR-SPA), with RMSEP values near the repeatability of the reference method. Furthermore, no systematic error was observed, and there were no significant differences between predicted and reference values according to a paired t-test at 95% confidence. The method is simple and low cost, uses no reagents, does not destroy the sample, and performs analyses in short time intervals.
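The image-to-feature step above (webcam image to channel histograms fed to LDA/PLS-DA) can be sketched as follows. The bin count and the flat pixel-list input are assumptions for illustration; the study's actual acquisition and preprocessing details are not specified here.

```python
def rgb_histograms(pixels, bins=8):
    """Turn a list of (r, g, b) pixels (values 0-255) into three
    concatenated per-channel histograms -- the kind of feature
    vector a classifier like LDA or PLS-DA consumes. The bin count
    is an illustrative assumption."""
    width = 256 // bins
    hist = [0] * (3 * bins)
    for px in pixels:
        for c in range(3):                      # 0=R, 1=G, 2=B
            hist[c * bins + min(px[c] // width, bins - 1)] += 1
    return hist
```

Each cotton-lint image thus becomes one fixed-length row of a data matrix, regardless of image size, which is what makes the subsequent variable selection (SPA or stepwise) over histogram bins possible.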
218

The successive projections algorithm applied to variable selection in PLS regression

Gomes, Adriano de Araújo 08 March 2012 (has links)
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). / Spectroscopic techniques combined with multivariate calibration have allowed the development of methods for determining analytes (or other properties) in complex matrices. In this context, determinations using models based on PLS (Partial Least Squares) regression, well established and consolidated in the literature, stand out. In spite of the efficiency of PLS models obtained from the full spectrum, several papers in the literature show that variable selection may improve the predictive ability of PLS models. In the present work, an algorithm was developed in Matlab that employs SPA (the Successive Projections Algorithm), originally proposed for MLR (Multiple Linear Regression), to improve the predictive ability of interval PLS models. The proposed algorithm, termed iSPA-PLS, was evaluated in three case studies: (i) simultaneous determination of three artificial colorants by UV-VIS spectrometry, (ii) quantification of protein content in wheat by NIR spectrometry, and (iii) quality determination of beer extract samples, also by NIR spectrometry. The performance of iSPA-PLS was compared with the following well-established algorithms and methods: GA-PLS, PLS-Jack-Knife, iPLS, and siPLS. In all applications, iSPA-PLS presented advantages over the algorithms used for comparison, notably smaller prediction errors and the selection of fewer PLS factors.
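The core of SPA, as used in both this and the following records, is a sequence of orthogonal projections: starting from one spectral variable, repeatedly pick the variable with the largest norm after projecting out everything already selected, which minimises collinearity among the chosen set. The sketch below is the basic projection loop only, without iSPA-PLS's interval machinery.

```python
import numpy as np

def spa_select(X, k, start=0):
    """Successive Projections Algorithm sketch: select k columns of
    X, each chosen as the column of largest norm in the orthogonal
    complement of the previously selected columns. Assumes generic
    (non-degenerate) data so projected norms stay nonzero."""
    Xp = X.astype(float).copy()
    selected = [start]
    for _ in range(k - 1):
        v = Xp[:, selected[-1]]
        # project every column onto the orthogonal complement of v
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        Xp[:, selected] = 0.0                  # never reselect a column
        selected.append(int(np.argmax((Xp ** 2).sum(axis=0))))
    return selected
```

A column that duplicates an already-selected one projects to zero and is skipped, which is exactly the collinearity-avoiding behaviour that makes SPA useful before MLR or PLS.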
219

Simultaneous determination of Cu, Pb, Cd, Ni, Co and Zn in fuel ethanol by adsorptive stripping voltammetry and multivariate calibration

Nascimento, Danielle Silva do 06 September 2013 (has links)
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES). / This study discusses the use of multivariate calibration techniques in the development of a methodology for the simultaneous determination of Cu, Pb, Cd, Ni, Co, and Zn at trace level using differential pulse adsorptive stripping voltammetry (DPAdSV). A hanging mercury drop electrode (HMDE) was employed as the working electrode. The calibration set was assembled using a Brereton design, with 25 mixtures analyzed in replicate. The linear ranges were selected from univariate models and verified using lack-of-fit and regression-significance tests by analysis of variance (ANOVA). The studied ranges were Cu (0.30-3 μg L-1), Pb (1-10 μg L-1), Cd (0.5-5 μg L-1), Ni (0.3-3 μg L-1), Co (0.09-0.5 μg L-1), and Zn (0.6-6 μg L-1). The voltammograms were preprocessed with the AsLS (asymmetric least squares) and icoshift (interval-correlation-shifting) algorithms to perform baseline correction and peak alignment, respectively. The following multivariate calibration algorithms were evaluated: partial least squares regression (PLS) and multiple linear regression with prior variable selection by the successive projections algorithm (SPA-MLR). For validation of the calibration models, 10 mixtures with random concentrations of each analyte were used, yielding RMSEV values between 0.03 and 0.86 μg L-1. As an application of the developed method to real samples, hydrated ethyl alcohol fuel (HEAF) was chosen as the target matrix; determining inorganic contaminants in ethanol fuel is important to ensure product quality and to control the pollution caused by toxic metals released when the fuel is burned. The validated models were satisfactorily tested on commercial HEAF samples from different gas stations in João Pessoa, Brazil.
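The AsLS baseline-correction step named above can be sketched with a dense-matrix version of asymmetric least squares: a smoothness-penalized fit is alternated with asymmetric weights so peaks are mostly ignored and only the baseline is tracked. Parameter values are illustrative; real voltammograms would use sparse solvers for speed.

```python
import numpy as np

def asls_baseline(y, lam=1e4, p=0.01, n_iter=10):
    """Asymmetric least squares (AsLS) baseline estimate. Solves
    (W + lam * D'D) z = W y, where D is the second-difference
    operator (smoothness penalty), then reweights: points above the
    current baseline (peaks) get tiny weight p. Dense linear algebra
    for clarity; parameters are illustrative assumptions."""
    n = len(y)
    D = np.diff(np.eye(n), 2, axis=0)          # second-difference operator
    w = np.ones(n)
    for _ in range(n_iter):
        z = np.linalg.solve(np.diag(w) + lam * D.T @ D, w * y)
        w = np.where(y > z, p, 1 - p)          # asymmetric reweighting
    return z
```

Subtracting the returned baseline from each voltammogram leaves the stripping peaks on a flat background, ready for the icoshift alignment and calibration steps.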
220

A new criterion for variable selection using the Successive Projections Algorithm

Soares, Sófacles Figueiredo Carreiro 22 September 2010 (has links)
Made available in DSpace on 2015-05-14T13:21:51Z (GMT). No. of bitstreams: 1 arquivototal.pdf: 2432134 bytes, checksum: aeda44e0d999a92b980354a5ea66ce01 (MD5) Previous issue date: 2010-09-22 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / This study proposes a modification in the Successive Projections Algorithm (SPA), that makes models of Multiple Linear Regression (MLR) more robust in terms of interference. In SPA, subsets of variables are compared based on their root mean square errors for the validation set. By taking into account the statistical prediction error obtained for the calibration set, and dividing by the statistical prediction error obtained for the prediction set, SPA can be improved. Also taken into account is the leverage associated with each sample. Three case studies involving; simulated analytic determinations, food colorants (UV-VIS spectrometry), and ethanol in gasoline (NIR spectrometry) are discussed. The results were evaluated using the root mean square error for an independent prediction set (Root Mean Square Error of Prediction - RMSEP), graphs of the variables, and the statistical tests t and F. The MLR models obtained by the selection using the new function were called SPE-SPA-MLR. When an interferent was present in the prediction spectra, almost all of the models performed better than both SPA-MLR and PLS. The models when compared to SPA-MLR showed that the change promoted better models in all cases giving smaller RMSEPs and variable numbers. The SPE-SPA-MLR was not better in some cases, than PLS models. The variables selected by SPA-SPE-MLR when observed in the spectra were detected in regions where interference was the at its smallest, revealing great potential. The modifications presented here make a useful tool for the basic formulation of the SPA. 
/ Este trabalho propõe uma modificação no Algoritmo das Projeções Sucessivas (Sucessive Projection Algorithm - SPA), com objetivo de aumentar a robustez a interferentes nos modelos de Regressão Linear Múltipla (Multiple Linear Regression - MLR) construídos. Na formulação original do SPA, subconjuntos de variáveis são comparados entre si com base na raiz do erro quadrático médio obtido em um conjunto de validação. De acordo com o critério aqui proposto, a comparação é feita também levando em conta o erro estatístico de previsão (Statistical Prediction Error SPE) obtido para o conjunto de calibração dividido pelo erro estatístico de previsão obtido para o conjunto de previsão. Tal métrica leva em conta a leverage associada a cada amostra. Três estudos de caso envolvendo a determinação de analitos simulados, corantes alimentícios por espectrometria UV-VIS e álcool em gasolinas por espectrometria NIR são discutidos. Os resultados são avaliados em termos da raiz do erro quadrático médio em um conjunto de previsão independente (Root Mean Square Error of Prediction - RMSEP), dos gráficos das variáveis selecionadas e através do testes estatísticos t e F. Os modelos MLR obtidos a partir da seleção usando a nova função custo foram chamados aqui de SPA-SPE-MLR. Estes modelos foram comparados com o SPA-MLR e PLS. Os desempenhos de previsão do SPA-SPEMLR apresentados foram melhores em quase todos os modelos construídos quando algum interferente estava presente nos espectros de previsão. Estes modelos quando comparados ao SPA-MLR, revelou que a mudança promoveu melhorias em todos os casos fornecendo RMSEPs e números de variáveis menores. O SPA-SPE-MLR só não foi melhor que alguns modelos PLS. As variáveis selecionadas pelo SPA-SPE-MLR quando observadas nos espectros se mostraram em regiões onde a ação do interferente foi à menor possível revelando o grande potencial que tal mudança provocou. 
Desta forma a modificação aqui apresentada pode ser considerada como uma ferramenta útil para a formulação básica do SPA.
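The leverage quantity underlying the modified criterion is standard: the diagonal of the hat matrix. The `spe_cost` ratio below is a hypothetical sketch of the criterion's structure (calibration-set error over prediction-set error, leverage-weighted); the thesis's exact weighting may differ.

```python
import numpy as np

def leverages(X):
    """Leverage of each sample: the diagonal of the hat matrix
    H = X (X'X)^-1 X'. High-leverage samples carry more influence
    on the fit, which the modified SPA criterion accounts for."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H).copy()

def spe_cost(res_cal, h_cal, res_val, h_val):
    """Hypothetical sketch of the proposed criterion: a leverage-
    weighted error on the calibration set divided by the same
    quantity on the prediction set. Illustrates the structure only;
    the thesis's exact formula may differ."""
    spe = lambda r, h: np.sqrt(np.mean(r ** 2 * (1.0 + h)))
    return spe(res_cal, h_cal) / spe(res_val, h_val)
```

A useful sanity check on `leverages` is that the values lie in [0, 1] and sum to the number of model parameters (the trace of a projection matrix equals its rank).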
