221 |
Bayesian Nonparametric Models for Multi-Stage Sample Surveys
Yin, Jiani 27 April 2016
It is a standard practice in small area estimation (SAE) to use a model-based approach to borrow information from neighboring areas or from areas with similar characteristics. However, survey data tend to have gaps, ties and outliers, and parametric models may be problematic because statistical inference is sensitive to parametric assumptions. We propose nonparametric hierarchical Bayesian models for multi-stage finite population sampling to robustify the inference and allow for heterogeneity, outliers, skewness, etc. Bayesian predictive inference for SAE is studied by embedding a parametric model in a nonparametric model. The Dirichlet process (DP) has attractive properties such as clustering that permits borrowing information. We exemplify by considering in detail two-stage and three-stage hierarchical Bayesian models with DPs at various stages. The computational difficulties of the predictive inference when the population size is much larger than the sample size can be overcome by the stick-breaking algorithm and approximate methods. Moreover, the model comparison is conducted by computing log pseudo marginal likelihood and Bayes factors. We illustrate the methodology using body mass index (BMI) data from the National Health and Nutrition Examination Survey and simulated data. We conclude that a nonparametric model should be used unless there is a strong belief in the specific parametric form of a model.
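As a rough illustration of the stick-breaking construction mentioned in the abstract, the sketch below draws a truncated Dirichlet process realization and shows the clustering (ties) it induces. It is not the thesis's algorithm: the truncation level, the concentration parameter `alpha`, and the normal base measure with BMI-like mean and spread are assumptions chosen only for the example.

```python
import random


def stick_breaking_weights(alpha, num_atoms, rng):
    """Truncated stick-breaking weights for a DP with concentration alpha."""
    weights = []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(num_atoms - 1):
        v = rng.betavariate(1.0, alpha)  # V_h ~ Beta(1, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last atom absorbs the leftover mass
    return weights


def sample_truncated_dp(alpha, num_atoms, base_sampler, rng):
    """Return (atoms, weights) of a truncated draw G ~ DP(alpha, G0)."""
    weights = stick_breaking_weights(alpha, num_atoms, rng)
    atoms = [base_sampler(rng) for _ in range(num_atoms)]
    return atoms, weights


rng = random.Random(0)
# Hypothetical base measure G0 = N(25, 4^2), loosely BMI-like (an assumption).
atoms, weights = sample_truncated_dp(
    alpha=2.0,
    num_atoms=50,
    base_sampler=lambda r: r.gauss(25.0, 4.0),
    rng=rng,
)
# Draws from G repeat atoms, which is the clustering that permits borrowing.
draws = rng.choices(atoms, weights=weights, k=200)
print(len(set(draws)), "distinct values among", len(draws), "draws")
```

Because the realization is discrete, the 200 draws share a small number of distinct values; a larger `alpha` spreads the mass over more atoms.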
|
222 |
Nonparametric density estimation for univariate and bivariate distributions with applications in discriminant analysis for the bivariate case
Haug, Mark January 2010
Typescript (photocopy). / Digitized by Kansas Correctional Industries / Department: Statistics.
|
223 |
Modern k-nearest neighbour methods in entropy estimation, independence testing and classification
Berrett, Thomas Benjamin January 2017
Nearest neighbour methods are a classical approach in nonparametric statistics. The k-nearest neighbour classifier can be traced back to the seminal work of Fix and Hodges (1951), and nearest neighbour methods also enjoy popularity in many other problems, including density estimation and regression. In this thesis we study their use in three different situations, providing new theoretical results on the performance of commonly used nearest neighbour methods and proposing new procedures that are shown to outperform these existing methods in certain settings. The first problem we discuss is that of entropy estimation. Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this chapter, we seek entropy estimators that are efficient and achieve the local asymptotic minimax lower bound with respect to squared error loss. To this end, we study weighted averages of the estimators originally proposed by Kozachenko and Leonenko (1987), based on the k-nearest neighbour distances of a sample. A careful choice of weights enables us to obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness, while the original unweighted estimator is typically only efficient in up to three dimensions. A related topic of study is the estimation of the mutual information between two random vectors, and its application to testing for independence. We propose tests for the two different situations of the marginal distributions being known or unknown and analyse their performance. Finally, we study the classical k-nearest neighbour classifier of Fix and Hodges (1951) and provide a new asymptotic expansion for its excess risk. We also show that, in certain situations, a new modification of the classifier that allows k to vary with the location of the test point can provide improvements. This has applications to the field of semi-supervised learning, where, in addition to labelled training data, we also have access to a large sample of unlabelled data.
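For readers unfamiliar with the Kozachenko and Leonenko estimator referred to above, here is a minimal one-dimensional sketch of the original unweighted version, written as the average of log(rho_i * V_1 * (n - 1)) - psi(k), where rho_i is the distance from point i to its k-th nearest neighbour and V_1 = 2. The weighted estimators proposed in the thesis are not reproduced here; the function names, the choice k = 3, and the simulated samples are assumptions for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(k):
    """Digamma at a positive integer: psi(k) = -gamma + sum_{j=1}^{k-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))


def kl_entropy_1d(sample, k=3):
    """Unweighted Kozachenko-Leonenko entropy estimate for 1-D data (in nats)."""
    n = len(sample)
    total = 0.0
    for i, x in enumerate(sample):
        # Distances from x to every other point; brute force is fine for small n.
        dists = sorted(abs(x - y) for j, y in enumerate(sample) if j != i)
        rho = dists[k - 1]  # k-th nearest neighbour distance
        # V_1 = 2 is the length of the unit ball in one dimension.
        total += math.log(2.0 * rho * (n - 1)) - digamma_int(k)
    return total / n


rng = random.Random(1)
uniform_sample = [rng.random() for _ in range(500)]
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]

h_uniform = kl_entropy_1d(uniform_sample)  # true entropy of U(0, 1) is 0
h_normal = kl_entropy_1d(normal_sample)    # true entropy is 0.5 * log(2 * pi * e)
print(h_uniform, h_normal, 0.5 * math.log(2 * math.pi * math.e))
```

The estimates should land close to the true entropies for these sample sizes; the sketch assumes continuous data with no exact ties, since a zero neighbour distance would make the logarithm undefined.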
|
224 |
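A avaliação do impacto de um treinamento utilizando Propensity Score Matching : uma abordagem não-paramétrica e semiparamétrica / Evaluating the impact of a training program using propensity score matching: a nonparametric and semiparametric approach
Silveira, Luiz Felipe de Vasconcellos January 2015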
The goal of this thesis is to evaluate the impact of a job training program for workers using propensity score matching with two types of approach: one nonparametric and the other semiparametric. For nonparametric estimation we used a method proposed by Li, Racine and Wooldridge (2009), and for semiparametric estimation we used the Generalized Additive Model proposed by Hastie and Tibshirani (1990). The results indicate that both methods provide estimates as good as or better than those obtained by parametric estimation.
|
225 |
Interaction-Based Learning for High-Dimensional Data with Continuous Predictors
Huang, Chien-Hsun January 2014
High-dimensional data, such as gene expression data from microarray experiments, may contain a substantial amount of useful information to be explored. However, the relevant variables and their joint interactions are usually diluted by noise from a large number of non-informative variables. Consequently, variable selection plays a pivotal role in learning from high-dimensional problems. Popular traditional feature selection methods, such as Pearson's correlation between response and predictors, stepwise linear regression and the LASSO, are linear. These methods are effective in identifying linear marginal effects but are limited in detecting non-linear or higher-order interaction effects. It is well known that epistasis (gene-gene interaction) may play an important role in gene expression, where unknown functional forms are difficult to identify. In this thesis, we first propose a novel nonparametric measure for screening and feature selection based on information from nearest neighborhoods. The method is inspired by Lo and Zheng's earlier work (2002) on detecting interactions for discrete predictors. We apply a backward elimination algorithm based on this measure, which leads to the identification of many influential clusters of variables. These identified groups of variables can capture both marginal and interactive effects. Second, each identified cluster has the potential to perform predictions and classifications more accurately. We also study procedures for combining these groups of individual classifiers to form a final predictor. Simulation and real data analysis show that the proposed measure is capable of identifying important variable sets and patterns, including higher-order interaction sets. The proposed procedure outperforms existing methods on three different microarray datasets. Moreover, the nonparametric measure is quite flexible and can be easily extended and applied to other areas of high-dimensional data analysis.
|
226 |
Estimação de cópulas via ondaletas / Copula estimation through wavelets
Silva, Francyelle de Lima e 03 October 2014
Copulas have become an important tool for describing and analysing the dependence structure between random variables and stochastic processes. Recently, some nonparametric estimation procedures have appeared, using kernels and wavelets. In this context, knowing that a copula function can be expanded in a wavelet basis, we propose a nonparametric wavelet estimator of the copula function for independent data and for time series under an alpha-mixing condition. The main feature of this estimator is that it estimates the copula function directly, without any assumption about the data distribution and without prior fitting of ARMA-GARCH models, as is done in parametric copula estimation. Convergence rates for the proposed estimator were computed in both cases, showing its consistency. Some simulation studies were carried out, as well as applications to real data sets.
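To make concrete what "estimating the copula function directly" can mean, the sketch below computes the classical rank-based empirical copula. This is a swapped-in, simpler nonparametric estimator, not the wavelet estimator proposed in the thesis; the helper names and the simulated data are assumptions for illustration.

```python
import random


def pseudo_observations(values):
    """Map a sample to (0, 1) via scaled ranks: rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return [r / (n + 1) for r in ranks]


def empirical_copula(xs, ys, u, v):
    """Fraction of pairs whose pseudo-observations fall in [0, u] x [0, v]."""
    us = pseudo_observations(xs)
    vs = pseudo_observations(ys)
    hits = sum(1 for a, b in zip(us, vs) if a <= u and b <= v)
    return hits / len(xs)


rng = random.Random(0)
n = 1000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
independent = [rng.gauss(0.0, 1.0) for _ in range(n)]
comonotone = [2.0 * x + 1.0 for x in xs]  # increasing function of xs

c_indep = empirical_copula(xs, independent, 0.5, 0.5)  # near u * v = 0.25
c_como = empirical_copula(xs, comonotone, 0.5, 0.5)    # near min(u, v) = 0.5
print(c_indep, c_como)
```

Only ranks enter the estimate, so no assumption about the marginal distributions is needed, which is the property the abstract emphasises for its own estimator.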
|
227 |
Aplicação do CAPM (Capital Asset Pricing Model) condicional por meio de métodos não-paramétricos para a economia brasileira: um estudo empírico do período 2002-2009 / Application of conditional CAPM (Capital Asset Pricing Model) using nonparametrics methods for the Brazilian economy: an empirical study from 2002-2009
Galeno, Marcela Monteiro 04 October 2010
This dissertation analyzes whether the variation in returns of sector portfolios, formed by shares in the theoretical portfolio of the São Paulo Stock Exchange Index (Ibovespa) for the first four months of 2010, can be explained by the nonparametric conditional Capital Asset Pricing Model (CAPM) proposed by Wang (2002), and also by four information variables available to investors: (i) percentage variation of the Brazilian industrial production level; (ii) percentage variation of the broad money supply M4; (iii) percentage variation of inflation as represented by the Índice de Preços ao Consumidor Amplo (IPCA); and (iv) percentage variation in the real-dollar exchange rate, obtained from the PTAX dollar quotation. The study covered the shares listed on the São Paulo Stock Exchange from January 2002 to December 2009. We used the test methodology developed by Wang (2002) and replicated for the Mexican context by Castillo-Spíndola (2006). Monthly excess returns were used for the shares, the portfolios and the market premium. In addition, to estimate the influence of the information variables, their monthly percentage variations were calculated for the period from January 2002 to November 2009. To validate the application of the nonparametric conditional CAPM to the Brazilian stock market, the model's various parameters were estimated and its statistical validity was tested for each information variable by evaluating the p-value. The observed results indicate that the nonparametric conditional model is relevant in explaining the returns of the portfolios in the sample considered for two of the four variables tested, M4 and the PTAX dollar.
|
228 |
Estimação de funções do redshift de galáxias com base em dados fotométricos / Galaxies redshift function estimation using photometric data
Ferreira, Gretta Rossi 18 September 2017
In a substantial number of astronomy problems, we are interested in estimating the value of g(z), for various functions g, of some unknown quantity z ∈ ℝ based on covariates x ∈ ℝ^d. This is done using a sample (X1, Z1), ..., (Xn, Zn). The two approaches usually used to solve this problem consist of (1) estimating the regression of Z on x and plugging it into g, or (2) estimating the conditional density f(z | x) and plugging it into ∫ g(z) f(z | x) dz. Unfortunately, few studies present quantitative comparisons of these two approaches. Moreover, few conditional density estimation methods have had their performance compared on these problems. In view of this, the objective of this work is to present several comparisons of techniques for estimating functions of an unknown quantity. In particular, we highlight nonparametric methods. In addition to estimators (1) and (2), we also propose a new approach that consists of directly estimating the regression function of g(Z) on x. These approaches were tested on different functions using the DEEP2 and Sheldon 2012 datasets. For almost all the functions tested, estimator (1) obtained the worst results, except when random forests were used. In several cases, the proposed new approach gave better results, as did estimator (2). In particular, we found that random forest methods generally led to good results.
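A toy simulation can suggest why plugging a regression estimate into g, as in approach (1), may differ from regressing g(Z) on x directly: for a nonlinear g, g(E[Z | x]) and E[g(Z) | x] are generally different quantities. The sketch below is an assumption-laden illustration, not the thesis's experiment: the data-generating model Z = x + noise, the choice g(z) = z^2, and the use of a simple k-nearest-neighbour average as the regression estimator are all hypothetical.

```python
import random


def knn_regress(xs, ys, x0, k):
    """Average of the responses of the k sample points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def g(z):
    """Hypothetical target function of the unknown quantity."""
    return z * z


rng = random.Random(0)
n = 2000
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
zs = [x + rng.gauss(0.0, 1.0) for x in xs]  # Z | x ~ N(x, 1), an assumption

x0, k = 0.0, 200
# Approach (1): estimate E[Z | x0], then apply g to the estimate.
plug_in = g(knn_regress(xs, zs, x0, k))
# Direct approach: regress g(Z) on x and evaluate at x0.
direct = knn_regress(xs, [g(z) for z in zs], x0, k)

# Under this model E[g(Z) | x = 0] = 0^2 + 1 = 1, while g(E[Z | x = 0]) = 0.
print("plug-in:", plug_in, "direct:", direct)
```

In this setup the direct estimate lands near 1 while the plug-in estimate stays near 0, so the gap reflects the conditional variance of Z rather than estimation noise.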
|
229 |
Semiparametric latent variable models with Bayesian p-splines. / CUHK electronic theses & dissertations collection
January 2010
In medical, behavioral, and social-psychological sciences, latent variable models are useful in handling variables that cannot be directly measured by a single observed variable, but instead are assessed through a number of observed variables. Traditional latent variable models are usually based on parametric assumptions on both the relations between outcome and explanatory latent variables, and the error distributions. In this thesis, semiparametric models with Bayesian P-splines are developed to relax these rigid assumptions. / In the second part of the thesis, a latent variable model is proposed to relax the first assumption, in which unknown additive functions of latent variables in the structural equation are modeled by Bayesian P-splines. The estimation of the nonparametric functions is based on a powerful Markov chain Monte Carlo (MCMC) algorithm with a block update scheme. A simulation study shows that the proposed method can handle a much wider range of situations than traditional models. The proposed semiparametric latent variable model is applied to a study on osteoporosis prevention and control. Some interesting functional relations, which may be overlooked by traditional parametric latent variable models, are revealed. / In the third part of the thesis, a transformation model is developed to relax the second assumption, which usually assumes the normality of observed variables and random errors. In our proposed model, the nonnormal response variables are transformed to normal by unknown functions modeled with Bayesian P-splines. This semiparametric transformation model is shown to be applicable to a wide range of statistical analyses. The model is applied to a study on the intervention treatment of polydrug use, in which the traditional model assumption is violated because many observed variables exhibit serious departure from normality. / In the fourth part of the thesis, the methodology developed in the third part is further extended to a varying coefficient model with latent variables. The varying coefficient model is a class of flexible semiparametric models in which the effects of covariates are modeled dynamically by unspecified smooth functions. A transformation varying coefficient model can handle arbitrarily distributed dynamic data. A simulation study shows that our proposed method performs well in the analysis of this complex model. / In the last part of the thesis, we propose a finite mixture of varying coefficient models to analyze dynamic data with heterogeneity. A simulation study demonstrates that our proposed method can explore the possible existence of different groups in dynamic data, where in each group the dynamic influences of covariates on the response variables have different patterns. The proposed method is applied to a longitudinal study concerning the effectiveness of heroin treatment. Distinct patterns of heroin use and treatment effect in different patient groups are identified. / Lu, Zhaohua. / Adviser: Xin-Yuan Song. / Source: Dissertation Abstracts International, Volume: 72-04, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 119-130). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
|
230 |
Medidas de dependência local para séries temporais / Local dependence measures for time series
Latif, Sumaia Abdel 25 February 2008
Unlike global association measures (for example, Pearson's linear correlation coefficient, Spearman's rho and Kendall's tau), local dependence measures describe the behaviour of dependence locally in different regions. In this thesis, the local dependence measures for random variables proposed by Bairamov et al. (2003), Bjerve and Doksum (1993) and Sibuya (1960) are studied in the context of bivariate and univariate stationary stochastic processes, in the latter case by studying the behaviour of local dependence across the lags of the time series. For the first two measures, we discuss their properties and study their estimators and the consistency of those estimators. For the Sibuya measure, in addition to discussing its properties, we propose three estimators for random variables and two for time series, and verify their consistency. The behaviour of the three local measures and their respective estimators was evaluated through simulations and applications to real data (where a comparison was drawn with the copula and the copula density).
|