151 |
Dados hiperespectrais para predição do teor foliar de nitrogênio em cana-de-açúcar / Hyperspectral data to predict sugarcane leaf nitrogen content
Martins, Juliano Araújo 17 February 2016
Remote sensing, particularly with spectral sensors operating in the visible and infrared regions, is widely discussed in the literature as a way to improve the management of nitrogen fertilization in crops. This work sought to establish the relationship between variations in leaf nitrogen content (LNC) and the spectral response of the sugarcane leaf, using a hyperspectral sensor, with assessments in three experimental areas of São Paulo state, Brazil, covering different soils and varieties. Each experiment was laid out in randomized blocks with split plots and four replications, with nitrogen applied at rates of 0, 50, 100 and 150 kg per hectare. Spectral analysis was performed in the laboratory on the "+1" leaf; 10 leaves were collected per subplot and subsequently submitted to chemical analysis to determine LNC. A significant correlation was observed between LNC and variations in the sugarcane spectral response, with the green region and the transition between red and near-infrared (the "red-edge") being the most consistent and stable across the study areas and crop seasons evaluated. Principal component analysis reinforced these results: the scores of the components that correlated significantly with LNC had higher loadings in those spectral regions. Vegetation indices previously described in the literature were also calculated from the spectral curves and submitted to simple regression analysis to predict LNC; the models were calibrated with data from the 2012/13 crop season and validated with data from the 2013/14 season. Indices that combine green and/or red-edge wavelengths with near-infrared wavelengths performed well in validation; the five most stable were BNi (500, 705 and 750 nm), GNDVI (550 and 780 nm), NDRE (790 and 720 nm), RI-1dB (735 and 720 nm) and VOGa (740 and 720 nm). The variety SP 81 3250 was grown in all three experimental areas, which allowed site-specific models to be compared with a general model for the same variety grown under different soil conditions. Although the general model has statistically significant parameters, its sensitivity in predicting LNC is substantially lower than that of the site-calibrated models. Stepwise multiple linear regression (SMLR) was also used and generated models that estimated LNC with good precision using 5 to 6 wavelengths, even when calibrated per experimental area regardless of variety. The study concludes that specific wavelengths are associated with variation in sugarcane LNC, namely in the green region (near 550 nm) and in the red-edge region (680 to 720 nm). Despite the low correlation between the near-infrared region and LNC, vegetation indices calculated from these wavelengths, or their inclusion in linear models, were important for improving prediction accuracy.
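The abstract names the wavelengths behind each index but not the formulas. As a minimal sketch, assuming the usual normalized-difference form for GNDVI and NDRE (the abstract does not confirm this), two of the indices could be computed from leaf reflectance as follows. The function names and the reflectance values are hypothetical and are not data from the thesis.

```python
def normalized_difference(r_a, r_b):
    """Return (r_a - r_b) / (r_a + r_b) for two reflectance values."""
    total = r_a + r_b
    if total == 0:
        raise ValueError("reflectances sum to zero")
    return (r_a - r_b) / total


def gndvi(reflectance):
    """GNDVI from the 780 nm (near-infrared) and 550 nm (green) bands."""
    return normalized_difference(reflectance[780], reflectance[550])


def ndre(reflectance):
    """NDRE from the 790 nm (near-infrared) and 720 nm (red-edge) bands."""
    return normalized_difference(reflectance[790], reflectance[720])


# Hypothetical reflectance values keyed by wavelength in nm (illustration only).
leaf = {550: 0.12, 720: 0.30, 780: 0.48, 790: 0.50}

print(round(gndvi(leaf), 3))
print(round(ndre(leaf), 3))
```

An index value of this kind would then serve as the single predictor in the simple regression that the abstract describes.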
|
152 |
Uso de polinômios fracionários nos modelos mistos
Garcia, Edijane Paredes January 2019
Advisor: Luzia Aparecida Trinca / Abstract: The class of regression models incorporating fractional polynomials (FPs), proposed by Royston & Altman (1994), has been studied extensively. Using FPs in mixed models is an attractive way to account for the dependence among within-subject measurements when the relationship between the response variable and continuous covariates is non-linear. FPs are attractive here because they offer a variety of non-linear functional forms for the continuous covariates in the mean response, notably the family of conventional polynomials and some asymmetric curves with asymptotes. Several authors have investigated the incorporation of FPs into the structure of mixed models. However, there are no published works on the following issues: modeling of the fixed and random parts (especially in the presence of several continuous and categorical covariates); the influence of FPs on the structure of the random effects; the choice of an adequate structure for the covariance matrix of the errors; or, a point of central importance to model selection, diagnostic analysis of the fitted models. In our view, a highly relevant contribution is the investigation and proposal of strategies for fitting fractional polynomial models with mixed effects that cover the points above, with the goals of filling these gaps and alerting users to the great potential of mixed models, now even mor... (Complete abstract: see electronic access below) / Doctorate
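For readers unfamiliar with fractional polynomials, the following is a minimal sketch of how FP terms are built for a positive covariate, using the conventional power set from Royston & Altman (1994), in which power 0 denotes the natural logarithm. The function names are hypothetical, and the sketch covers only the construction of the transformed covariates, not the mixed-model fitting strategies studied in the thesis.

```python
import math

# Conventional candidate powers; power 0 stands for log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Single fractional polynomial term x**p, with x**0 defined as log(x)."""
    if x <= 0:
        raise ValueError("fractional polynomials require x > 0")
    return math.log(x) if p == 0 else x ** p


def fp2_terms(x, p1, p2):
    """Pair of terms for a second-degree FP.

    When the two powers are equal, the second term is the first term
    multiplied by log(x), following the usual repeated-power convention.
    """
    first = fp_term(x, p1)
    if p1 == p2:
        return first, first * math.log(x)
    return first, fp_term(x, p2)


print(fp_term(4.0, 0.5))
print(fp2_terms(2.0, 2, 2))
```

In a mixed model, terms like these would replace the raw covariate in the fixed part and possibly in the random part, with the powers chosen by comparing fits over the candidate set.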
|
153 |
Seleção bayesiana de variáveis em modelos multiníveis da teoria de resposta ao item com aplicações em genômica / Bayesian variable selection for multilevel item response theory models with applications in genomics
Tiago de Miranda Fragoso 12 September 2014
Investigations into the genetic basis of complex diseases in genomics draw on several types of information. Several symptoms are measured to obtain a diagnosis, individuals may not be independent because of kinship or a shared environment, and their genetic makeup is measured through a very large number of genetic markers. In this work, a multilevel item response theory (IRT) model is proposed that integrates all these sources of information and characterizes complex diseases through a latent variable. Furthermore, the large number of molecular markers induces a variable selection problem, for which procedures based on stochastic search variable selection and the Bayesian LASSO are proposed. Parameter estimation and variable selection are conducted under a Bayesian framework, in which a Markov chain Monte Carlo algorithm is derived and implemented to obtain samples from the posterior distribution of the parameters. The estimation procedure is validated through a series of simulation studies in which parameter recovery, variable selection and the properties of the point estimates are evaluated in scenarios similar to the real dataset. The procedure showed adequate recovery of the structural parameters and was able to correctly identify a large number of the covariates even in high-dimensional settings, although it produced considerably biased estimates of the latent variables associated with the latent trait and the random effect. The proposed methods were then applied to data collected in the 'Corações de Baependi' family-based association study, where the multilevel model was able to characterize the metabolic syndrome, a set of symptoms associated with elevated risk of heart failure and diabetes. The multilevel model and the variable selection recovered known characteristics of the disease and selected one associated molecular marker.
|
154 |
PARTICIONAMENTO DE CONJUNTO DE DADOS E SELEÇÃO DE VARIÁVEIS EM PROBLEMAS DE CALIBRAÇÃO MULTIVARIADA
Alves, André Luiz 22 September 2017
The objective of this work is to compare a proposed algorithm based on the RANdom SAmple Consensus (RANSAC) method, used for selection of samples, selection of variables, and simultaneous selection of samples and variables, with the Successive Projections Algorithm (SPA) on a chemical data set in the context of multivariate calibration. The proposed method is based on RANSAC and Multiple Linear Regression (MLR). The predictive capacity of the models is measured using the Root Mean Square Error of Prediction (RMSEP). The results indicate that the Successive Projections Algorithm improves the predictive capacity of RANSAC. It is concluded that the SPA positively influences the RANSAC algorithm for selection of samples, for selection of variables, and for simultaneous selection of samples and variables.
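The abstract uses RMSEP as its measure of predictive capacity without stating the formula. A minimal sketch, assuming the standard definition (the square root of the mean squared difference between reference and predicted values on a prediction set), is shown below. The function name and the sample values are hypothetical and are not results from the thesis.

```python
import math


def rmsep(y_reference, y_predicted):
    """Root Mean Square Error of Prediction over a prediction (test) set."""
    if len(y_reference) != len(y_predicted) or not y_reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((r - p) ** 2 for r, p in zip(y_reference, y_predicted))
    return math.sqrt(squared_error / len(y_reference))


# Hypothetical reference values and model predictions (illustration only).
reference = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
print(round(rmsep(reference, predicted), 4))
```

A lower RMSEP on samples held out from calibration indicates better predictive capacity, which is how competing sample and variable selections would be ranked.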
|
155 |
Seleção de variáveis no desenvolvimento, classificação e predição de produtos / Selection of variables for the development, classification, and prediction of products
Rossini, Karina January 2011
This dissertation presents propositions for variable selection in data from descriptive sensory evaluations and near-infrared (NIR) spectrum analyses, based on multivariate analysis methods and aimed at the food and chemical industries. It has five objectives: (i) review the main multivariate analysis techniques, how they are commonly organized, and how they can contribute to variable selection; (ii) identify and structure multivariate analysis techniques into a method that reduces the number of variables needed to characterize, classify, and predict products; (iii) reduce the list of variables/attributes analyzed in sensory panels by selecting those that are relevant and non-redundant, so that the time to collect panel data and the fatigue imposed on panelists are reduced; (iv) validate the proposed method using real data; and (v) compare different sensory analysis approaches used in new product development. The proposed methods were evaluated through case studies with real data, and they vary with the characteristics of the datasets analyzed (highly multicollinear or not, with or without a dependent (response) variable). The methods performed well, significantly reducing the number of variables and yielding goodness-of-fit or accuracy measures that are satisfactory when compared with those obtained by retaining all variables or with other methods in the literature. We conclude that the proposed methods are suitable for selecting variables in descriptive sensory and NIR data.
|
156 |
Canonical Variable Selection for Ecological Modeling of Fecal Indicators
Gilfillan, Dennis; Hall, Kimberlee; Joyner, Timothy Andrew; Scheuerman, Phillip 20 September 2018
More than 270,000 km of rivers and streams are impaired due to fecal pathogens, creating an economic and public health burden. Fecal indicator organisms such as Escherichia coli are used to determine if surface waters are pathogen impaired, but they fail to identify human health risks, provide source information, or have unique fate and transport processes. Statistical and machine learning models can be used to overcome some of these weaknesses, including identifying ecological mechanisms influencing fecal pollution. In this study, canonical correlation analysis (CCorA) was performed to select parameters for the machine learning model, Maxent, to identify how chemical and microbial parameters can predict E. coli impairment and F+-somatic bacteriophage detections. Models were validated using a bootstrapping cross-validation. Three suites of models were developed; initial models using all parameters, models using parameters identified in CCorA, and optimized models after further sensitivity analysis. Canonical correlation analysis reduced the number of parameters needed to achieve the same degree of accuracy in the initial E. coli model (84.7%), and sensitivity analysis improved accuracy to 86.1%. Bacteriophage model accuracies were 79.2, 70.8, and 69.4% for the initial, CCorA, and optimized models, respectively; this suggests complex ecological interactions of bacteriophages are not captured by CCorA. Results indicate distinct ecological drivers of impairment depending on the fecal indicator organism used. Escherichia coli impairment is driven by increased hardness and microbial activity, whereas bacteriophage detection is inhibited by high levels of coliforms in sediment. Both indicators were influenced by organic pollution and phosphorus limitation.
|
157 |
THE FAMILY OF CONDITIONAL PENALIZED METHODS WITH THEIR APPLICATION IN SUFFICIENT VARIABLE SELECTION
Xie, Jin 01 January 2018
When scientists know in advance that some features (variables) are important in modeling a data set, these features should be kept in the model. How can this prior information be used to find other important features effectively? This dissertation provides a solution that uses such prior information. We propose the Conditional Adaptive Lasso (CAL) estimates to exploit this knowledge. By choosing a meaningful conditioning set, namely the prior information, CAL shows better performance in both variable selection and model estimation. We also propose the Sufficient Conditional Adaptive Lasso Variable Screening (SCAL-VS) and Conditioning Set Sufficient Conditional Adaptive Lasso Variable Screening (CS-SCAL-VS) algorithms based on CAL. Their asymptotic and oracle properties are proved. Simulations, especially for large p, small n problems, are performed and compared with existing methods. We further extend the linear model setup to generalized linear models (GLM). Instead of least squares, we consider the likelihood function with an L1 penalty, that is, penalized likelihood methods. We propose the Generalized Conditional Adaptive Lasso (GCAL) for generalized linear models. We then extend the method to any penalty term that satisfies certain regularity conditions, namely the Conditionally Penalized Estimate (CPE). Asymptotic and oracle properties are shown. Four corresponding sufficient variable screening algorithms are proposed. Simulation examples are used to evaluate our method against existing methods. GCAL is also evaluated on a real data set on leukemia.
|
158 |
High-dimensional inference of ordinal data with medical applications
Jiao, Feiran 01 May 2016
Ordinal response variables, whose outcomes comprise a few categorical values that admit a natural ordering, abound in scientific and quantitative analyses. Their values are often represented by non-negative integers, for instance pain score (0-10) or disease severity (0-4) in medical research. Ordinal variables differ from rational variables in that their values delineate qualitative rather than quantitative differences. In this thesis, we develop new statistical methods for variable selection in a high-dimensional cumulative link regression model with an ordinal response. Our study is partly motivated by the need to explore the association structure between disease phenotype and high-dimensional medical covariates.
The cumulative link regression model specifies that the ordinal response of interest results from an order-preserving quantization of a latent continuous variable that bears a linear regression relationship with a set of covariates. Commonly used error distributions in the latent regression include the normal distribution, the logistic distribution, the Cauchy distribution and the standard Gumbel distribution (minimum). The cumulative link model with normal (logistic, Gumbel) errors is also known as the ordered probit (logit, complementary log-log) model. While the likelihood function has a closed form for the aforementioned error distributions, its strong nonlinearity can cause direct optimization of the likelihood to fail. To mitigate this problem and to facilitate extension to penalized likelihood estimation, we propose specific minorization-maximization (MM) algorithms for maximum likelihood estimation of a cumulative link model for each of the preceding 4 error distributions.
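As a minimal sketch of the model just described, the category probabilities of a cumulative link model can be computed from a linear predictor and a set of increasing cutpoints: the latent variable equals the linear predictor plus an error, and the response is k when the latent variable falls between cutpoints k-1 and k. The logistic error distribution (the ordered logit case) is used here; the function names, the cutpoints and the predictor value are hypothetical, and the sketch does not implement the MM algorithms developed in the thesis.

```python
import math


def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(eta, cutpoints, cdf=logistic_cdf):
    """Return P(Y = k) for k = 0..K given linear predictor eta.

    P(Y <= k) = cdf(cutpoints[k] - eta), so each category probability is
    the difference between consecutive cumulative probabilities.
    """
    if any(b <= a for a, b in zip(cutpoints, cutpoints[1:])):
        raise ValueError("cutpoints must be strictly increasing")
    cumulative = [cdf(c - eta) for c in cutpoints] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical example: three ordered categories, linear predictor of zero.
probs = category_probabilities(0.0, [-1.0, 1.0])
print([round(p, 4) for p in probs])
```

Passing a different `cdf` (for example a normal CDF) would give the ordered probit variant; the log-likelihood is the sum of the logs of the probabilities of the observed categories.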
Penalized ordinal regression models play a role when variable selection needs to be performed. In some applications, covariates can be grouped in some meaningful way, but some groups may be mixed in that they contain both relevant and irrelevant variables, i.e., variables whose coefficients are non-zero and zero, respectively. Thus, it is pertinent to develop a consistent method for simultaneously selecting relevant groups and the relevant variables within each selected group, which constitutes the so-called bi-level selection problem. We propose a penalized maximum likelihood approach with a composite bridge penalty to solve the bi-level selection problem in a cumulative link model. An MM algorithm, specific to each of the 4 error distributions, was developed to implement the proposed method. The proposed approach is shown to enjoy a number of desirable theoretical properties, including bi-level selection consistency and oracle properties, under suitable regularity conditions. Simulations demonstrate that the proposed method has good empirical performance. We illustrate the proposed methods with several real medical applications.
|
159 |
Investigation of multivariate prediction methods for the analysis of biomarker data
Hennerdal, Aron January 2006
The paper describes predictive modelling of biomarker data from patients suffering from multiple sclerosis. Improvements to the multivariate analyses of the data are investigated with the goal of increasing the ability to assign samples to the correct subgroups from the data alone.
The effects of different preliminary scalings of the data are investigated, and combinations of multivariate modelling methods and variable selection methods are evaluated. Attempts are made to merge the predictive capabilities of the method combinations through voting procedures. A technique for improving the results of PLS modelling, called bagging, is evaluated.
Of the multivariate analysis methods tried, the best are found to be partial least squares (PLS) and support vector machines (SVM). It is concluded that scaling has little effect on prediction performance for most methods. The method combinations have interesting properties: the default variable selections of the multivariate methods are not always the best. Bagging improves performance, but at a high cost. No reasons are found for drastically changing the workflows of the biomarker data analysis, but slight improvements are possible. Further research is needed.
|
160 |
New results in detection, estimation, and model selection
Ni, Xuelei 08 December 2005
This thesis contains two parts: the detectability of convex sets and the study of regression models.
In the first part of this dissertation, we investigate the detectability of an inhomogeneous convex region in a Gaussian random field. The first proposed detection method relies on checking a constructed statistic on each convex set within an n × n image, which is proven to be inapplicable. We then consider using h(v)-parallelograms as surrogates, which leads to a multiscale strategy. We prove that 2/9 is the minimum proportion of the maximally embedded h(v)-parallelogram in a convex set. This constant indicates the effectiveness of the multiscale detection method.
In the second part, we study robustness, optimality, and computation for regression models. First, for robustness, we analyze M-estimators in a regression model where the residuals follow an unknown but stochastically bounded distribution. An asymptotic minimax M-estimator (RSBN) is derived. Simulations demonstrate its robustness and advantages. Second, for optimality, the analysis of least angle regression inspired us to consider the conditions under which a vector is the solution of two optimization problems. One of these problems can be solved by certain stepwise algorithms; the other corresponds to the objective function in many existing subset selection criteria (including Cp, AIC, BIC, MDL, RIC, etc.) and is proven to be NP-hard. Several conditions are derived that tell us when a vector is the common optimizer. Finally, extending this idea of finding conditions to exhaustive subset selection in regression, we improve the widely used leaps-and-bounds algorithm (Furnival and Wilson). The proposed method further reduces the number of subsets that must be considered in the exhaustive subset search by taking into account not only the residuals but also the model matrix and the current coefficients.
|