1 |
New perspectives in cross-validation
Zhou, Wenda, January 2020
Appealing due to its universality, cross-validation is a ubiquitous tool for model tuning and selection. At its core, cross-validation splits the data (potentially several times) and alternately uses some of the data for fitting a model and the rest for testing it. This produces a reliable estimate of the risk, although many questions remain concerning how best to compare such estimates across different models. Despite its widespread use, many theoretical problems remain unanswered for cross-validation, particularly in high-dimensional regimes where bias issues are non-negligible. We first provide an asymptotic analysis of the cross-validated risk in relation to the train-test split risk for a large class of estimators under stability conditions. This asymptotic analysis is expressed in the form of a central limit theorem, and allows us to characterize the speed-up of the cross-validation procedure for general parametric M-estimators. In particular, we show that when the loss used for fitting differs from that used for evaluation, k-fold cross-validation may reduce the variance by a factor smaller (or greater) than k. We then turn our attention to the high-dimensional regime (where the number of parameters is comparable to the number of observations). In such a regime, k-fold cross-validation is asymptotically biased, and hence increasing the number of folds is of interest. We study the extreme case of leave-one-out cross-validation, and show that, for generalized linear models under smoothness conditions, it is a consistent estimate of the risk at the optimal rate. Given the large computational requirements of leave-one-out cross-validation, we finally consider the problem of obtaining a fast approximate version of the leave-one-out cross-validation (ALO) estimator. We propose a general strategy for deriving formulas for such ALO estimators for penalized generalized linear models, and apply it to many common estimators such as the LASSO, SVM, and nuclear norm minimization. The performance of such approximations is evaluated on simulated and real datasets.
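The splitting procedure and the idea of avoiding n refits can be illustrated with a minimal sketch. It is not the thesis's ALO method: it uses ridge regression without an intercept, the classical special case in which the leave-one-out residual equals the ordinary residual divided by one minus the leverage, so a single fit reproduces leave-one-out exactly. The function names, the penalty value and the simulated data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients minimizing ||y - X b||^2 + lam * ||b||^2 (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_risk(X, y, lam, k, seed=0):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    sq_err = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False  # hold this fold out of the fit
        beta = ridge_fit(X[train], y[train], lam)
        sq_err[test_idx] = (y[test_idx] - X[test_idx] @ beta) ** 2
    return sq_err.mean()

def loo_shortcut(X, y, lam):
    """Leave-one-out risk from a single fit, using the leverages H_ii."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    return np.mean((resid / (1.0 - np.diag(H))) ** 2)

# Simulated data (illustrative only, not from the thesis).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(size=60)

print("5-fold risk:    ", kfold_risk(X, y, lam=1.0, k=5))
print("LOO, n refits:  ", kfold_risk(X, y, lam=1.0, k=60))
print("LOO, one fit:   ", loo_shortcut(X, y, lam=1.0))
```

Setting k equal to the number of observations turns k-fold into leave-one-out, which is why the last two printed values should agree up to rounding. For estimators such as the LASSO or SVM no exact identity of this kind holds, which is where approximate formulas become relevant.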
|
2 |
Exploração de metodos de seleção de variaveis pela tecnica de regressão logistica para analise de dados epidemiologicos / Exploration of variable selection methods by logistic regression techniques for epidemiologic data analysis
Silva, Cleide Aparecida Moreira, 23 February 2006
Advisor: Djalma de Carvalho Moreira Filho / Dissertation (master's) - Universidade Estadual de Campinas, Faculdade de Ciencias Medicas
Issue date: 2006
Abstract: This work discusses the application of two distinct methods for selecting predictor variables and models in multiple logistic regression analysis: a hierarchical model and a model selected by the stepwise criterion. In an unmatched case-control study conducted to identify risk factors for neonatal death in Campinas, São Paulo, socio-economic, maternal morbidity and health care variables were analyzed. The study included 117 cases and 234 controls, and the supplementary data were obtained from household interviews. Multiple logistic regression with the hierarchical model identified the following risk factors for neonatal death: family income, the mother's place of birth, number of household residents, vaginal bleeding, early delivery due to health problems, number of guidance items received during prenatal care, the choice of delivery hospital, time elapsed between hospital admission and delivery, gestational age, low birth weight and Apgar score at the fifth minute. The differences found in the stepwise model were: income was associated with the choice of delivery hospital, and vaginal bleeding was associated with early delivery due to health problems and with the mother's place of birth. The mother's place of birth was selected only by the hierarchical model and was associated with early delivery due to health problems. An interaction between the number of guidance items received and early delivery due to health problems was also included in the model selected by the stepwise procedure. Hierarchical modeling allowed mutually associated (collinear) variables to remain in the final model. The relationships among variables were explored when the stepwise procedure was used. Whichever process is chosen for selecting variables or models, some points remain important: an exhaustive review of the literature on the event under study, a careful univariate analysis, and an evaluation of the relationships among the independent variables.
Master's / Collective Health (Saude Coletiva)
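To make the stepwise side of this comparison concrete, here is a minimal sketch of forward stepwise selection for logistic regression. The abstract does not state the entry criterion, so the use of AIC here is an assumption, as are the function names and the simulated data (351 synthetic subjects with predictors x0 to x3, of which only x0 and x1 affect the outcome). Nothing below reproduces the study's variables or results.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Newton-Raphson fit of a logistic model; returns (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-12, 1 - 1e-12)
    return beta, float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def forward_stepwise(X, y, names):
    """Add one predictor at a time while AIC keeps decreasing (assumed criterion)."""
    n = len(y)
    intercept = np.ones((n, 1))
    selected = []
    _, ll = fit_logistic(intercept, y)
    best_aic = 2 * 1 - 2 * ll  # intercept-only model
    while len(selected) < X.shape[1]:
        scores = []
        for j in range(X.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            _, ll = fit_logistic(np.hstack([intercept, X[:, cols]]), y)
            scores.append((2 * (len(cols) + 1) - 2 * ll, j))
        aic, j = min(scores)
        if aic >= best_aic:  # no candidate improves the model
            break
        best_aic = aic
        selected.append(j)
    return [names[j] for j in selected], best_aic

# Synthetic data: only x0 and x1 truly influence the outcome.
rng = np.random.default_rng(1)
n = 351
X = rng.normal(size=(n, 4))
eta = -0.7 + 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

selected, aic = forward_stepwise(X, y, ["x0", "x1", "x2", "x3"])
print("selected:", selected, "AIC:", round(aic, 1))
```

A purely data-driven search like this can admit a noise variable by chance or drop one of two correlated predictors, which is one reason the abstract stresses literature review and careful univariate analysis regardless of the selection method.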
|
3 |
Partition-based Model Representation Learning
Hsu, Yayun, January 2020
Modern machine learning draws on both classical statistics and modern computation. On the one hand, the field has become rich and fast-growing; on the other hand, the differing conventions of its schools have become harder and harder to communicate across over time. Often the problem is not who is absolutely right or wrong, but from which angle one should approach the problem. This is the moment when we feel there should be a unifying machine learning framework that can hold the different schools under the same umbrella. We propose one such framework and call it "representation learning".
A representation describes the data and is almost identical to a statistical model. Philosophically, however, we would like to distinguish it from classical statistical modeling in that (1) representations are interpretable to the scientist, (2) representations convey the pre-existing subjective view that the scientist has of his or her data before seeing it (in other words, representations may not align with the true data-generating process), and (3) representations are task-oriented.
To build such a representation, we propose to use partition-based models. Partition-based models are easy to interpret and useful for uncovering interactions between variables. The major challenge lies in the computation, since the number of partitions can grow exponentially with the number of variables. To solve this problem, we need a model/representation selection method over different partition models. We propose to use the I-Score with the backward dropping algorithm to achieve this goal.
In this work, we explore the connection between the I-Score variable selection methodology and other existing methods, and extend the idea to develop other objective functions that can be used in other applications. We apply our ideas to three datasets: a genome-wide association study (GWAS), the New York City Vision Zero data, and the MNIST handwritten digit database.
In these applications, we show that the interpretability of the representations can be useful in practice and gives practitioners much more intuition when explaining their results. We also show a novel way to look at causal inference problems from the viewpoint of partition-based models.
We hope this work serves as an initiative for people to start approaching problems from a different angle and to take interpretability into consideration when building a model, so that the model can more easily be used to communicate with people from other fields.
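As a rough illustration of partition-based selection, the sketch below scores a subset of discrete variables by how much the response mean varies across the cells they define, then greedily drops variables. The score used, the sum over cells of n_j^2 (mean_j - overall mean)^2 divided by n, is one commonly cited form of the I-Score; the exact normalization and the details of the thesis's backward dropping procedure are assumptions here, and the data are simulated.

```python
import numpy as np

def i_score(X, y, subset):
    """Score of the partition induced by the discrete columns listed in `subset`."""
    if not subset:
        return 0.0
    # Each distinct combination of values is one partition cell.
    _, cell = np.unique(X[:, subset], axis=0, return_inverse=True)
    cell = cell.ravel()
    n_j = np.bincount(cell)
    mean_j = np.bincount(cell, weights=y) / n_j
    return float(np.sum(n_j ** 2 * (mean_j - y.mean()) ** 2) / len(y))

def backward_dropping(X, y, start):
    """Repeatedly drop the variable whose removal leaves the highest score;
    return the best subset seen along the way."""
    current = list(start)
    best_subset, best_score = list(current), i_score(X, y, current)
    while len(current) > 1:
        trials = [(i_score(X, y, [v for v in current if v != d]), d) for d in current]
        score, dropped = max(trials)
        current.remove(dropped)
        if score > best_score:
            best_subset, best_score = list(current), score
    return sorted(best_subset), best_score

# Simulated binary data: y depends only on the interaction (XOR) of columns 0 and 1.
rng = np.random.default_rng(0)
n = 400
X = rng.integers(0, 2, size=(n, 6))
flip = rng.random(n) < 0.1  # 10% label noise
y = ((X[:, 0] ^ X[:, 1]) ^ flip).astype(float)

subset, score = backward_dropping(X, y, range(6))
print("retained variables:", subset, "score:", round(score, 2))
```

The example is built so that neither column 0 nor column 1 is informative on its own; only the cells formed by both reveal the signal. That is the kind of interaction a partition view is meant to surface, and it also shows why the number of cells, and hence the cost, grows quickly as variables are added.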
|
4 |
On Modeling Spatial Time-to-Event Data with Missing Censoring Type
Lu, Diane, January 2024
Time-to-event data, a common occurrence in medical research, is also pertinent in the ecological context, exemplified by leaf desiccation studies using innovative optical vulnerability techniques. Such data can unveil valuable insights into the influence of various factors on the event of interest. Leveraging both spatial and temporal information, spatial survival modeling can unravel the intricate spatiotemporal dynamics governing event occurrences. Existing spatial survival models often assume the availability of the censoring type for censored cases. Various approaches have been employed to address scenarios where a "subset" of cases lacks a known "censoring indicator" (i.e., whether they are right-censored or uncensored). This uncertainty in the subset pertains to missing information regarding the censoring status. However, our study specifically centers on situations where the missing information extends to "all" censored cases, rendering them devoid of a known censoring "type" indicator (i.e., whether they are right-censored or left-censored).
This challenge arose from leaf hydraulic data, specifically embolism data, where an embolism event is observed only if a leaf vein transitions from water-filled to air-filled during the observation period. Although it is known that all veins eventually embolize when the entire plant dries up, the critical information of whether a censored leaf vein embolized before or after the observation period is absent. In other words, the censoring type indicator is missing.
To address this challenge, we developed a Gibbs sampler for a Bayesian spatial survival model, aiming to recover the missing censoring type indicator. This model incorporates the essential embolism formation mechanism theory, accounting for dynamic patterns observed in the embolism data. The model assumes spatial smoothness between connected leaf veins and incorporates vein thickness information. Our Gibbs sampler effectively infers the missing censoring type indicator, as demonstrated on both simulated and real-world embolism data. In applying our model to real data, we not only confirm patterns aligning with existing phytological literature but also unveil novel insights previously unexplored due to limitations in available statistical tools.
Additionally, our results suggest the potential for building hierarchical models with species-level parameters focusing solely on the temporal component. Overall, our study illustrates that the proposed Gibbs sampler for the spatial survival model successfully addresses the challenge of missing censoring type indicators, offering valuable insights into the underlying spatiotemporal dynamics.
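The core data-augmentation step behind recovering a missing censoring type can be sketched in isolation. If an event was not seen during an observation window, it happened either before the window (left-censored) or after it (right-censored), and the conditional probability of each follows from the event-time distribution. The sketch below assumes a Weibull distribution with known parameters and ignores the spatial smoothness and vein thickness components described above; all names and numbers are illustrative and are not taken from the thesis.

```python
import math
import random

def weibull_cdf(t, shape, scale):
    """P(T <= t) for a Weibull event time."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def prob_left_censored(start, end, shape, scale):
    """P(event before the window | event not observed inside [start, end])."""
    before = weibull_cdf(start, shape, scale)
    after = 1.0 - weibull_cdf(end, shape, scale)
    return before / (before + after)

def sample_censoring_type(start, end, shape, scale, rng):
    """One draw of the missing indicator, as a Gibbs step would make it."""
    p_left = prob_left_censored(start, end, shape, scale)
    return "left" if rng.random() < p_left else "right"

rng = random.Random(0)
p = prob_left_censored(start=1.0, end=5.0, shape=1.0, scale=2.0)
draws = [sample_censoring_type(1.0, 5.0, 1.0, 2.0, rng) for _ in range(1000)]
print("P(left-censored):", round(p, 3))
print("share of 'left' draws:", draws.count("left") / len(draws))
```

In a full sampler this draw would alternate with updates of the model parameters given the imputed indicators, so that the indicator probabilities and the parameter estimates inform each other across iterations.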
|
5 |
Desenvolvimento de funções de pedotransferência e sua utilização em modelo agro-hidrológico / Development of pedotransfer functions and their application in agrohydrological models
Barros, Alexandre Hugo Cezar, 25 August 2010
Pedotransfer functions (PTFs) were developed to estimate the parameters (α, n, θr and θs) of the Van Genuchten (1980) model, which is used to describe soil water retention curves. Data for 786 retention curves were obtained from several sources, mainly studies carried out in the Northeastern region of Brazil by universities, Embrapa and Codevasf. The data were divided into two groups: 85% for PTF development and 15%, treated as independent data, for testing and validation. Besides general PTFs for all soils, specific PTFs were developed for the classes Ultisols, Ferralsols, Entisols and Planosols (Argissolos, Latossolos, Neossolos and Planossolos). The PTFs were developed by multiple regression, using the stepwise procedure (forward and backward) to select the best predictors. Two PTFs were developed: a) including all predictors (soil density and the contents of sand, silt, clay and organic matter) and b) including only the contents of sand, silt and clay. The statistical performance of each PTF was evaluated with the coefficient of determination (R²), the Willmott index (d) and the confidence index (IC). The root mean squared error (RMSE) was used to evaluate predictions of soil water content at specific matric potentials. The functional evaluation of the parametric PTFs examined their performance within the SWAP (Soil-Water-Atmosphere-Plant) model. The parameters α, n, θr and θs estimated by the PTFs were introduced into SWAP to assess whether pedotransfer functions can be used to describe soil physical-hydraulic attributes and to predict crop yield; this was judged by comparing the model's productivity estimates with observed yields.
The PTFs showed low predictive capacity for the parameters α and n; for θr and θs the fits were better. At specific matric potentials (-10, -33 and -1500 kPa) the predictive performance of the PTFs was better, which allows their use in simulation models that require only approximate values of field capacity, permanent wilting point and available water content. The PTFs specific to soil classes performed better than the general PTF, but the difference was small, showing that grouping soils to develop PTFs per class brings little advantage. Soil water content was estimated better by the PTFs developed with particle-size fractions, organic matter and soil density. The results of the crop yield simulations using PTFs are similar to those obtained with more traditional methods. Moreover, when the PTFs are applied to longer data series, the errors are reduced because of the inherent spatiotemporal variability of productivity.
Keywords: Pedotransfer; Model; Simulation; SWAP; Corn (Zea mays L.); Cowpea (Vigna unguiculata (L.) Walp.); Sorghum (Sorghum bicolor (L.) Moench)
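For readers unfamiliar with the retention model and the evaluation statistics named above, a minimal sketch follows. It uses the standard Van Genuchten form θ(h) = θr + (θs − θr) / [1 + (α|h|)^n]^m with the common constraint m = 1 − 1/n, which is assumed rather than confirmed for this thesis. The parameter values are invented for illustration, with α expressed per kPa so that potentials can be given in kPa; they are not results from the study.

```python
import numpy as np

def van_genuchten(h, theta_r, theta_s, alpha, n):
    """Volumetric water content at matric potential h (m = 1 - 1/n assumed)."""
    m = 1.0 - 1.0 / n
    return theta_r + (theta_s - theta_r) / (1.0 + (alpha * np.abs(h)) ** n) ** m

def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def willmott_d(obs, pred):
    """Willmott index of agreement: 1 means perfect agreement."""
    obar = obs.mean()
    num = np.sum((pred - obs) ** 2)
    den = np.sum((np.abs(pred - obar) + np.abs(obs - obar)) ** 2)
    return float(1.0 - num / den)

# Matric potentials in kPa: saturation, field capacity range, wilting point.
h = np.array([0.0, -10.0, -33.0, -1500.0])

# Hypothetical "reference" curve and a curve from slightly different parameters,
# standing in for measured values and PTF-based estimates respectively.
obs = van_genuchten(h, theta_r=0.05, theta_s=0.45, alpha=0.10, n=1.5)
pred = van_genuchten(h, theta_r=0.06, theta_s=0.43, alpha=0.12, n=1.4)

print("water content:", np.round(obs, 3))
print("RMSE:", round(rmse(obs, pred), 4))
print("Willmott d:", round(willmott_d(obs, pred), 4))
```

Evaluating the curve at only a few potentials, as in the last lines, mirrors the point made in the abstract: errors in individual parameters can partly cancel, so water content at fixed potentials may be predicted better than the parameters themselves.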
|