1. Sequential methodology and applications in sports rating. Taylor, Benjamin (January 2011)
Sequential methods aim to update beliefs about a set of parameters as new blocks of data arrive in sequence. Early research in this area was motivated by the case where the blocks of data arise in time from observing an underlying dynamical system, but an important modern application is the analysis of large datasets. This thesis considers both the design and the application of sequential methods. A new adaptive sequential Monte Carlo (SMC) methodology is presented. By incorporating adaptive Markov chain Monte Carlo (MCMC) moves into the SMC update, it is possible to exploit the heuristic, computational and theoretical advantages of SMC to gain sampling efficiency. The new method is tested on the problem of Bayesian mixture analysis and found to outperform an adaptive MCMC algorithm in five of the six situations considered. Theoretical justification of the method, guidelines for implementation and a condition for convergence are provided. When the dimensionality of the parameter space is high, methods such as the adaptive SMC sampler do not work well. In such cases, sequential data analysis can proceed with statistical models that are amenable to exact or approximate filtering recursions. The two applications considered here are the rating of sports teams and of individual players. A new method for rating and selecting teams for the NCAA basketball tournament is developed. The selection of teams matters to universities in the United States, as admittance brings academic as well as sports-related financial benefits. Currently the selection is made by a panel of expert voters. The new method largely agrees with these experts, but in the seasons considered a small number of cases are highlighted where a team was evidently treated unjustly. Also considered is the rating of professional basketball players. A new method is developed that measures a player's offensive and defensive ability and provides a means of combining this information into an overall rating. The method uses data from multiple seasons to estimate player abilities in a single season more accurately. Injustice in the assignment of NBA awards in the 2009 season is uncovered, but the research also highlights one possible reason for this: the commonly cited box-score statistics contain little information on defensive ability.
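As an illustration of the central idea, the sketch below embeds adaptive random-walk MCMC moves, whose scale is tuned from the current particle population, inside a tempered SMC sampler for a toy one-dimensional mixture target. The target, the tempering schedule and all tuning constants are illustrative assumptions, not the algorithm developed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: two-component Gaussian mixture "likelihood" with a diffuse prior.
def log_prior(x):
    return -0.5 * (x / 10.0) ** 2

def log_like(x):
    return np.logaddexp(-0.5 * (x + 3.0) ** 2, -0.5 * (x - 3.0) ** 2)

def smc_adaptive(n_particles=2000, betas=np.linspace(0.0, 1.0, 21), n_moves=5):
    x = rng.normal(0.0, 10.0, n_particles)          # sample from the prior
    logw = np.zeros(n_particles)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # Reweight: incremental weight is the likelihood raised to the temperature step.
        logw += (b - b_prev) * log_like(x)
        w = np.exp(logw - logw.max()); w /= w.sum()
        # Resample when the effective sample size degenerates.
        if 1.0 / np.sum(w ** 2) < n_particles / 2:
            idx = rng.choice(n_particles, n_particles, p=w)
            x, logw = x[idx], np.zeros(n_particles)
            w = np.full(n_particles, 1.0 / n_particles)
        # Adaptive MCMC move: tune the random-walk scale from the particle spread.
        scale = 2.38 * np.sqrt(np.average((x - np.average(x, weights=w)) ** 2, weights=w))
        for _ in range(n_moves):
            prop = x + rng.normal(0.0, scale, n_particles)
            log_acc = (log_prior(prop) + b * log_like(prop)
                       - log_prior(x) - b * log_like(x))
            accept = np.log(rng.uniform(size=n_particles)) < log_acc
            x = np.where(accept, prop, x)
    return x

samples = smc_adaptive()
print(samples.mean(), samples.std())  # mean near 0; spread reflects modes near +/-3
```

The key design choice this illustrates is that the weighted particle population itself supplies the information (here a simple variance estimate) used to tune the MCMC kernel at each temperature.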

2. Summary statistics and sequential methods for approximate Bayesian computation. Prangle, Dennis (January 2011)
Many modern statistical applications involve inference for complex stochastic models, where it is easy to simulate from the models but impossible to calculate likelihoods. Approximate Bayesian computation (ABC) is a method of inference for such models. It replaces calculation of the likelihood with a step that simulates artificial data for different parameter values and compares summary statistics of the simulated data to summary statistics of the observed data. This thesis looks at two related methodological issues for ABC. Firstly, a method is proposed to construct appropriate summary statistics for ABC in a semi-automatic manner. The aim is to produce summary statistics that make inference about the parameters of interest as accurate as possible. Theoretical results show that, in a certain sense, the optimal summary statistics are the posterior means of the parameters. While these cannot be calculated analytically, an extra stage of simulation is used to estimate how the posterior means vary as a function of the data, and these estimates are then used as summary statistics within ABC. Empirical results show that this is a robust method for choosing summary statistics, and that it can result in substantially more accurate ABC analyses than previous approaches in the literature. Secondly, ABC inference for multiple independent data sets is considered. If there are many such data sets, it is hard to choose summary statistics that capture the available information and are appropriate for general ABC methods. An alternative sequential ABC approach is proposed in which simulated and observed data are compared for each data set separately and the comparisons are combined to give overall results. Several algorithms are proposed and their theoretical properties studied, showing that exploiting ideas from the semi-automatic ABC theory produces consistent parameter estimation. Implementation details are discussed, and several simulation examples illustrate these, together with applications to substantive inference problems.
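The semi-automatic construction can be sketched for a toy model as follows: a pilot set of simulations is used to regress the parameter on features of the simulated data, and the fitted regression, which estimates the posterior mean, supplies the summary statistic for a standard rejection-ABC run. The model, the features and the acceptance tolerance below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: data are 20 iid draws from N(theta, 1); theta has a N(0, 5^2) prior.
def simulate(theta, n=20):
    return rng.normal(theta[:, None], 1.0, size=(len(theta), n))

# Observed data with "true" theta = 2.
y_obs = rng.normal(2.0, 1.0, size=20)

# Pilot stage: regress theta on features of the simulated data; the fitted
# values estimate E(theta | data) and are used as the summary statistic.
theta_pilot = rng.normal(0.0, 5.0, 5000)
x_pilot = simulate(theta_pilot)
features = np.column_stack([np.ones(len(theta_pilot)),
                            x_pilot.mean(axis=1),
                            np.median(x_pilot, axis=1)])
beta, *_ = np.linalg.lstsq(features, theta_pilot, rcond=None)

def summary(x):
    f = np.array([1.0, x.mean(), np.median(x)])
    return f @ beta

# Main ABC stage: rejection sampling on the distance between summaries.
theta_prop = rng.normal(0.0, 5.0, 20000)
x_prop = simulate(theta_prop)
s_prop = np.column_stack([np.ones(len(theta_prop)),
                          x_prop.mean(axis=1),
                          np.median(x_prop, axis=1)]) @ beta
dist = np.abs(s_prop - summary(y_obs))
accepted = theta_prop[dist < np.quantile(dist, 0.01)]  # keep the closest 1%

print(accepted.mean(), accepted.std())  # approximate posterior mean and sd
```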

3. Contributions to inference without likelihoods. Jesus, J. (January 2012)
This thesis is concerned with statistical inference in situations where one is unwilling or unable to formulate a likelihood function. The theory of estimating functions (EFs) provides an alternative inference framework in such settings. The research was motivated by problems arising in the application of a class of stochastic models for rainfall based on point processes. These models are often used by hydrologists to produce synthetic rainfall sequences for risk assessment purposes, notably in the UKCP09 climate change projections for the UK. In the absence of a likelihood function, the models are usually fitted by minimizing some measure of disagreement between theoretical properties and their observed counterparts. In general situations of this type, two "subjective" decisions are required: which properties to use, and how to weight their contributions to the objective function. The choice of weights can be formalised by defining a minimum variance criterion for the estimator. This is equivalent to the Generalized Method of Moments estimator, which is widely used in econometrics. The first contribution of this thesis is to translate the problem into an EF framework, which is much more familiar to statisticians. Simulations show that the theory has poor finite sample performance for point process rainfall models. This is associated with inaccurate estimation of the covariance matrix of the observed properties. A two-stage approach is developed to overcome this problem. The second main contribution is to apply EF theory to the Whittle likelihood, which is based on the periodogram of the data. A problem here is that the covariance matrix of the estimators depends on fourth-order properties which are often intractable. An EF approach provides a feasible alternative in practical applications. After establishing the conditions under which EF theory can be applied to Whittle estimation, simulations are once again used to explore the finite sample performance.
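A minimal sketch of this style of fitting, using a simple gamma model in place of a point-process rainfall model: observed properties are matched to their theoretical counterparts, with the weight matrix taken as the inverse of a bootstrap estimate of the covariance of the observed properties (the minimum variance choice). All modelling choices here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.gamma(shape=3.0, scale=2.0, size=500)   # stand-in for observed data

# Observed "properties" (here simply mean and variance) and their covariance,
# estimated by bootstrap; in the rainfall setting these would be properties of
# the aggregated rain process at several time scales.
def props(x):
    return np.array([x.mean(), x.var(ddof=1)])

obs = props(y)
boot = np.array([props(rng.choice(y, len(y))) for _ in range(500)])
W = np.linalg.inv(np.cov(boot.T))                # minimum-variance weighting

# Theoretical properties under the model (gamma with shape a and scale s).
def model_props(params):
    a, s = np.exp(params)                        # enforce positivity
    return np.array([a * s, a * s ** 2])

def objective(params):
    d = model_props(params) - obs
    return d @ W @ d

fit = minimize(objective, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print(np.exp(fit.x))   # estimated (shape, scale), roughly (3, 2)
```

The two-stage idea discussed in the abstract enters at the point where `W` is estimated: a poor estimate of the covariance of the observed properties translates directly into a poor objective function.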

4. Sample size for multivariable prognostic models. Jinks, R. C. (January 2012)
Prognosis is one of the central principles of medical practice; useful prognostic models are vital if clinicians wish to predict patient outcomes with any success. However, prognostic studies are often performed retrospectively, which can result in poorly validated models that do not become valuable clinical tools. One obstacle to planning prospective studies is the lack of sample size calculations for developing or validating multivariable models. The often-used 5 or 10 events per variable (EPV) rule (Peduzzi and Concato, 1995) can result in small sample sizes which may lead to overfitting and optimism. This thesis investigates the issue of sample size in prognostic modelling, and develops calculations and recommendations which may improve prognostic study design. In order to develop multivariable prediction models, their prognostic value must be measurable and comparable. This thesis focuses on time-to-event data analysed with the Cox proportional hazards model, for which there are many proposed measures of prognostic ability. A measure of discrimination, the D statistic (Royston and Sauerbrei, 2004), is chosen for use in this work, as it has an appealing interpretation and a direct relationship with a measure of explained variation. Real datasets are used to investigate how estimates of D vary with the number of events. Seeking a better alternative to EPV rules, two sample size calculations are developed and tested for use where a target value of D is specified: one based on significance testing and one on confidence interval width. The calculations are illustrated using real datasets; in general the sample sizes required are quite large. Finally, the usability of the new calculations is considered. To use the sample size calculations, researchers must specify a target value of D, which can be difficult if no previous study is available. To aid this, published D values from prognostic studies are collated into a 'library', which can be used to obtain plausible values of D for the calculations. To expand the library further, an empirical conversion is developed to transform values of the more widely used C-index (Harrell et al., 1984) to D.
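For illustration only, the sketch below contrasts an EPV rule with a generic confidence-interval-width calculation for an approximately normal estimator whose standard error is assumed to shrink like a constant divided by the square root of the number of events; the D-specific quantities are derived in the thesis and are not reproduced here.

```python
import math
from scipy.stats import norm

def events_for_ci_width(c, half_width, conf=0.95):
    """Events needed so that z * c / sqrt(events) <= half_width (generic sketch)."""
    z = norm.ppf(0.5 + conf / 2)
    return math.ceil((z * c / half_width) ** 2)

def events_per_variable(n_vars, epv=10):
    """The simple EPV rule that the thesis argues can give sample sizes that are too small."""
    return n_vars * epv

# Example: suppose a pilot study suggests c ~ 2.5 on the D scale (an assumption,
# for illustration only) and a 95% CI no wider than +/- 0.2 is wanted.
print(events_for_ci_width(c=2.5, half_width=0.2))   # about 601 events
print(events_per_variable(10))                       # 100 events under a 10-EPV rule
```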

5. A methodological investigation of non-sampling error: interviewer variability and non-response. Wiggins, Richard D. (January 1990)
Two principal sources of error in data collected from structured interviews with respondents are the method of observation itself and the failure to obtain responses from selected individuals. Methodological strategies are developed to investigate practical ways of handling these errors for data appraisal. In part one, the differential impact of individual interviewers on the responses obtained in two separate epidemiological studies is examined. Interviewer effect is measured, and its impact on the interpretation of individual responses, scale scores and modelling is shown. The analysis demonstrates that four objectives can be achieved with slight modification of survey design. First, estimates of precision for the survey results can be improved by including the component due to interviewer variability. Secondly, items with high sensitivity to interviewer effect can be identified. Thirdly, the pattern of distortion for different types of item can be discovered; replicate analyses indicate that deviations between interviewers are not always consistent over time. Fourthly, by means of variance component modelling, the effect of interviewers on the interpretation of linear models can be evaluated. These models are used to show how interviewer characteristics may account for variation in the responses. Part two establishes an evaluative framework for the systematic review of interviewer call-back strategies in terms of non-response bias and the costs of data collection. Use of an 'efficiency index', based on the product of mean square error and cost for items in a survey of occupational mobility, provides a retrospective evaluation. The empirical evidence has important practical consequences for fieldwork: the possibility of alternative call-back norms and the relative efficacy of appointment versus non-appointment calls are shown. The methodology develops from a review of adjustment procedures for non-response bias and models for survey costing. Logically, the methodologies for the three empirical investigations could be combined into an appraisal for a single survey; only lack of resources prevented such an outcome.
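The interviewer-variability component can be illustrated with a standard one-way variance component (ANOVA) calculation on simulated survey responses, giving the intra-interviewer correlation and the resulting inflation of the variance of a survey estimate; the data and interviewer workloads below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated survey item: k interviewers each conduct m interviews; responses
# share an interviewer-level random effect (all values are illustrative).
k, m = 20, 30
interviewer_effect = rng.normal(0.0, 0.4, size=(k, 1))
y = 5.0 + interviewer_effect + rng.normal(0.0, 1.0, size=(k, m))

# One-way ANOVA estimators of the variance components (balanced design).
grand_mean = y.mean()
msb = m * np.sum((y.mean(axis=1) - grand_mean) ** 2) / (k - 1)          # between interviewers
msw = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2) / (k * (m - 1))  # within interviewers
sigma2_b = max((msb - msw) / m, 0.0)
sigma2_e = msw

# Intra-interviewer correlation and the inflation of the variance of the mean.
rho = sigma2_b / (sigma2_b + sigma2_e)
design_effect = 1 + (m - 1) * rho
print(f"rho = {rho:.3f}, design effect = {design_effect:.2f}")
```

The design effect shows why ignoring the interviewer component understates the standard errors of survey estimates, which is the first of the four objectives listed above.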

6. Robust estimation for structural time series models. Kwan, Tan Hwee (January 1990)
This thesis aims to develop robust methods of estimation in order to draw valid inference from contaminated time series. We concentrate on additive and innovation outliers in structural time series models, using a state space representation. The parameters of interest are the state, the hyperparameters and the coefficients of explanatory variables. Three main contributions evolve from the research. Firstly, a filter named the approximate Gaussian sum filter is proposed to cope with noisy disturbances in both the transition and measurement equations. Secondly, the Kalman filter is robustified by carrying over the M-estimation of scale for i.i.d. observations to time-dependent data. Thirdly, robust regression techniques are implemented to modify the generalised least squares transformation procedure to deal with explanatory variables in time series models. All the above procedures are tested against standard non-robust estimation methods for time series by means of simulations. Two real examples are also included.
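A generic illustration of the robustification idea, not the specific filters developed in the thesis: a Kalman filter for a local level model in which the measurement update is downweighted through a Huber function of the standardised innovation, limiting the influence of additive outliers.

```python
import numpy as np

def huber_weight(u, c=1.345):
    """Downweight standardised innovations beyond the Huber threshold c."""
    a = abs(u)
    return 1.0 if a <= c else c / a

def robust_local_level_filter(y, q=0.1, r=1.0, c=1.345):
    """Kalman filter for a local level model with a Huber-weighted update.

    A sketch of robustifying the measurement update against additive outliers;
    the state and observation variances q and r are assumed known here.
    """
    a, p = y[0], r            # crude initialisation from the first observation
    states = []
    for obs in y:
        p = p + q                         # prediction step (random-walk state)
        v = obs - a                       # innovation
        f = p + r                         # innovation variance
        w = huber_weight(v / np.sqrt(f), c)
        k = w * p / f                     # downweighted Kalman gain
        a = a + k * v
        p = (1 - k) * p
        states.append(a)
    return np.array(states)

# Example: a random walk observed with noise and two gross additive outliers.
rng = np.random.default_rng(4)
x = np.cumsum(rng.normal(0, np.sqrt(0.1), 200))
y = x + rng.normal(0, 1, 200)
y[[50, 120]] += 15.0
print(np.abs(robust_local_level_filter(y) - x).mean())   # mean absolute state error
```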

7. Contributions to strong approximations in time series with applications in nonparametric statistics and functional limit theorems. da Silveira Filho, Getulio Borges (January 1991)
This thesis is concerned with applications in probability and statistics of approximation theorems for weakly dependent random vectors. The basic approach is to approximate partial sums of weakly dependent random vectors by corresponding partial sums of independent ones. In Chapter 2 we apply this general idea to obtain an almost sure invariance principle for partial sums of R^d-valued absolutely regular processes. In Chapter 3 we apply the results of Chapter 2 to obtain functional limit theorems for non-stationary fractionally differenced processes. Chapter 4 deals with applications of approximation theorems to nonparametric estimation of density and regression functions under weakly dependent samples. We consider L1-consistency of kernel and histogram density estimates. Universal consistency of partition estimates of the regression function is also studied. Finally, in Chapter 5 we consider necessary conditions for L1-consistency of kernel density estimates under weakly dependent samples, as an application of a Poisson approximation theorem for sums of uniform mixing Bernoulli random variables.

8. Methods for handling missing data for observational studies with repeated measurements. Kalaycioglu, O. (January 2015)
Missing data are common in longitudinal observational studies, where data on both outcome and explanatory variables are collected repeatedly at several time points. The research in this thesis is motivated by a repeated measurements observational study with incomplete outcome and explanatory variables. When the missing values in the explanatory variables are related to the observed values of the outcome, it has been recommended to use multiple imputation (MI) techniques to alleviate both the bias and the loss of efficiency in the parameter estimates. In this thesis, MI techniques are reviewed, extended where necessary and compared, using simulation studies, with respect to the bias and efficiency of the regression coefficient estimates, in order to suggest the most appropriate MI method when explanatory variables are missing at random (MAR) in repeated measurements studies. Multivariate normal imputation (MVNI) produced the least bias in most situations, is theoretically well justified and allows a flexible correlation structure for the repeated measurements in the imputation model. Bayesian MI is efficient and may be preferable for imputing categorical variables with extreme prevalences. Imputation by chained equations (ICE) approaches were sensitive to the correlation between the repeated measurements of the incomplete variables. A complete missing data analysis requires a sensitivity analysis that investigates departures from the MAR mechanism. Models for handling data missing not at random (MNAR) in both outcome and explanatory variables are not well developed and can be complicated, especially when there are several missingness patterns. In this thesis, selection modelling and pattern mixture modelling frameworks are extended to accommodate an MNAR mechanism on time-varying outcome and explanatory variables, with mixed types of missingness pattern, using a fully Bayesian estimation technique. The investigations suggest that, when the true form of the missingness mechanism is specified and the variables that cause missingness are used in the missingness models, the parameter estimates are less biased than under standard MAR methods. The bias can be reduced further if the true values of the missingness parameters are incorporated into the missingness models through informative priors.
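As a small illustration of the ICE approach on wide-format repeated measurements, the sketch below uses scikit-learn's IterativeImputer as a stand-in chained-equations engine; the data-generating mechanism, the MAR mechanism and the analysis model are illustrative assumptions, not those used in the thesis.

```python
import numpy as np
# IterativeImputer is scikit-learn's chained-equations-style imputer; multiple
# imputations are obtained by re-running it with sample_posterior=True.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)

# Toy repeated-measurements data in wide format: one row per subject, columns
# are the outcome and a covariate at three visits (names are illustrative).
n = 300
subject_effect = rng.normal(0, 1, (n, 1))
x = subject_effect + rng.normal(0, 1, (n, 3))          # covariate at visits 1-3
y = 1.0 + 0.5 * x + subject_effect + rng.normal(0, 1, (n, 3))
data = np.hstack([y, x])

# Make the covariate MAR: more likely missing when the same-visit outcome is high.
miss = rng.uniform(size=(n, 3)) < 1 / (1 + np.exp(2.0 - y))
data_mis = data.copy()
data_mis[:, 3:][miss] = np.nan

# Chained-equations imputation; drawing from the predictive distribution and
# repeating gives M completed data sets for Rubin's combining rules.
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=m).fit_transform(data_mis)
    for m in range(5)
]
estimates = [np.polyfit(d[:, 3], d[:, 0], 1)[0] for d in imputations]  # slope of y1 on x1
print(np.mean(estimates))   # pooled point estimate across the imputations
```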

9. Quantification of prediction uncertainty for principal components regression and partial least squares regression. Zhang, Y. (January 2014)
Principal components regression (PCR) and partial least squares regression (PLS) are widely used for multivariate calibration in chemometrics, econometrics, the social sciences and elsewhere, serving as alternatives to ordinary least squares regression when explanatory variables are collinear or when there are hundreds of explanatory variables and a relatively small sample size. Both PCR and PLS tackle these problems by constructing lower-dimensional factors from the explanatory variables. This extra step of factor construction means that the standard prediction uncertainty theory of ordinary least squares regression is not directly applicable to the two reduced-dimension methods. In this thesis, we start by reviewing ordinary least squares prediction uncertainty theory, and then investigate how it performs when extended to PCR and PLS, with a view to finding potentially better approaches. The first main contribution of the thesis is to clarify the quantification of prediction uncertainty for PLS. We rephrase existing methods with consistent mathematical notation in the hope of giving clear guidance to practitioners. The second main contribution is to develop a new linearisation method for PLS. After establishing the theory, simulation and real data studies are used to understand the new method and compare it with several commonly used alternatives. From the studies of simulations and a real dataset, we investigate the properties of simple approaches based on ordinary least squares theory, approaches using resampling of the data, and local linearisation approaches, including a classical method and our new improved method. In practice it is advisable, for both PCR and PLS, to use the ordinary least squares type prediction variance with the regression error variance estimated from the tuning set.
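The recommendation in the final sentence can be sketched for PCR as follows: fit ordinary least squares on the retained principal component scores, estimate the error variance from a tuning set rather than the training residuals, and plug it into the usual OLS prediction variance formula in the score space. The data and the number of retained components below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy collinear data: 40 correlated predictors, 60 training and 40 tuning samples.
n_train, n_tune, p, k = 60, 40, 40, 3
latent = rng.normal(size=(n_train + n_tune, k))
X = latent @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(n_train + n_tune, p))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.5 * rng.normal(size=n_train + n_tune)
Xtr, Xtu, ytr, ytu = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# PCR: project onto the leading k principal components, then ordinary least squares.
xm, ym = Xtr.mean(axis=0), ytr.mean()
Xc = Xtr - xm
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:k].T                              # loadings of the retained components
T = Xc @ V                                # training scores
beta = np.linalg.solve(T.T @ T, T.T @ (ytr - ym))

def predict(Xnew):
    return ym + (Xnew - xm) @ V @ beta

# Error variance estimated from the tuning set rather than the training residuals.
sigma2 = np.mean((ytu - predict(Xtu)) ** 2)

# OLS-type prediction variance in the score space for a new observation.
TtT_inv = np.linalg.inv(T.T @ T)
t_new = (Xtu[0] - xm) @ V
pred_var = sigma2 * (1.0 + t_new @ TtT_inv @ t_new)
print(predict(Xtu[:1])[0], np.sqrt(pred_var))   # prediction and its standard error
```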

10. Exploratory studies for Gaussian process structural equation models. Chiu, Y. D. (January 2014)
Latent variable models (LVMs) are widely used in many scientific fields because latent variables are both ubiquitous and convenient to work with. Conventional LVMs, however, have limitations because they model relationships between covariates and latent variables, or among latent variables, in a parametric fashion. A more flexible modelling framework is therefore needed, especially when there is no prior knowledge of sensible parametric forms. This thesis proposes a new non-parametric LVM to meet this need. We define a model structure with particular features, including a multi-layered structure consisting of non-parametric Gaussian process regression and parametric factor analysis. Connections to existing popular LVM approaches, such as structural equation models and latent curve models, are also discussed. The model structure is subsequently extended to observed binary responses and to longitudinal applications. Model identifiability is then examined through parameter constraints and algebraic manipulations. The proposed model, despite its convenient applicability, carries a computational burden when analysing large data sets, because of the need to invert a large covariance matrix. To address this issue, a sparse approximation method using a small number M of selected inputs (inducing inputs) is adopted. The associated computational cost can be reduced to O(M²NQ²) (or O(M²NT²)), where N and Q are the numbers of data points and latent variables (or time points T), respectively. Inference within this framework requires a series of algorithmic developments in a Bayesian paradigm. Algorithms using Markov chain Monte Carlo sampling and Expectation Maximisation optimisation, with a stochastic variant, are presented. A hybrid estimation procedure with a two-step implementation is proposed as well, which can further reduce the computational cost. Furthermore, a greedy selection scheme for the inducing inputs is provided to improve predictive performance. Empirical studies of the modelling framework are conducted across various experiments. Interest lies in inference, including parameter estimation and recovery of the distribution of the latent variables, and in assessing and comparing predictive performance against two baseline techniques. Discussion and suggestions for improvement are provided based on the results.
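The cost reduction from inducing inputs can be illustrated with a plain sparse Gaussian process regression (a subset-of-regressors/DTC-style predictive mean) on toy one-dimensional data: only M x M systems are solved, in contrast with the O(N³) cost of a full GP. The kernel, inducing-point grid and noise level are illustrative assumptions, and the sketch omits the factor-analysis layer of the proposed model.

```python
import numpy as np

rng = np.random.default_rng(7)

def rbf(a, b, lengthscale=1.0, variance=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

# Toy 1-D regression data.
N, M = 2000, 30
x = np.sort(rng.uniform(-5, 5, N))
y = np.sin(x) + 0.3 * rng.normal(size=N)
z = np.linspace(-5, 5, M)                 # inducing inputs (here a simple grid)
noise = 0.3 ** 2

# Sparse (subset-of-regressors / DTC-style) predictive mean: only M x M systems
# are solved, so the dominant cost is O(M^2 N) rather than O(N^3).
Kmm = rbf(z, z) + 1e-8 * np.eye(M)
Kmn = rbf(z, x)
A = Kmm + Kmn @ Kmn.T / noise             # M x M system matrix
xs = np.linspace(-5, 5, 200)
Ksm = rbf(xs, z)
mean = Ksm @ np.linalg.solve(A, Kmn @ y) / noise
print(np.max(np.abs(mean - np.sin(xs))))  # close to the true function
```

A greedy selection scheme, as mentioned in the abstract, would replace the fixed grid `z` by points chosen one at a time to maximise some criterion of predictive quality.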