31 |
An investigation of umpire performance using PITCHf/x data via longitudinal analysis / Juarez, Christopher
Master of Science / Department of Statistics / Abigail Jager / Baseball has long provided statisticians with a playground for analysis. In this report we discuss the history of Major League Baseball (MLB) umpires, MLB data collection, and the use of technology in sports officiating. We use PITCHf/x data to answer three questions: (1) Has the proportion of incorrect calls made by a major league umpire decreased over time? (2) Does the proportion of incorrect calls differ between umpires hired before the implementation of technology in evaluating umpire performance and those hired after? (3) Does the rate of change in the proportion of incorrect calls differ between these two groups?
PITCHf/x is a publicly available database that records characteristics of every pitch thrown in any of the 30 MLB parks. MLB began using camera technology in umpire evaluations in 2002, but the data were not publicly available until 2007. Data were collected at the pitch level, and the proportion of incorrect calls was calculated for each umpire for the first, second, and final third of each season from 2008 through 2011. We collected data from retrosheet.org, which provides game summary information, and determined the year of each umpire's MLB debut to distinguish pre- and post-technology hires in our analysis.
We addressed our questions of interest using longitudinal data analysis with a random coefficients model. We investigated the choice of covariance structure for the random coefficients model using Akaike's Information Criterion and the Bayesian Information Criterion, and we compared the random coefficients model to a fixed slopes model and a general linear model.
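A minimal sketch of how such a random coefficients model could be fit, assuming the data have been aggregated to one row per umpire per third-of-season; the column names, file name, and use of statsmodels are illustrative, not taken from the report:

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per umpire per third-of-season: proportion of incorrect calls,
# a numeric time index (period), and an indicator for post-technology hires.
df = pd.read_csv("umpire_calls.csv")  # hypothetical file

# Random intercept and slope in time for each umpire; ML (not REML) so that
# AIC/BIC can be compared across candidate covariance structures.
model = smf.mixedlm("prop_incorrect ~ period * post_tech",
                    data=df, groups="umpire", re_formula="~period")
fit = model.fit(reml=False)
print(fit.summary())
print("AIC:", fit.aic, "BIC:", fit.bic)
```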
|
32 |
Statistical methods for diagnostic testing: an illustration using a new method for cancer detection / Sun, Xin
Master of Science / Department of Statistics / Gary Gadbury / This report illustrates how to use two statistical methods to investigate the performance of a new technique for detecting breast cancer and lung cancer at early stages: logistic regression, and classification and regression trees (CART). The technique is found to be effective in detecting breast cancer and lung cancer, with both sensitivity and specificity close to 0.9, but its ability to predict the actual stage of cancer is low. Including the age variable improves the ability of logistic regression to predict the presence of breast cancer for the samples used in this report, but because the sample sizes are small, we cannot conclude that including age helps the prediction of breast cancer in general. Including the age variable does not improve prediction of the presence of lung cancer. When the age variable is excluded, CART and logistic regression give very similar results.
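As a hedged illustration of the two methods compared here, the sketch below fits a logistic regression and a CART-style classification tree and reports sensitivity and specificity; the predictors, labels, and tree depth are invented placeholders and do not reproduce the report's variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # placeholder predictors
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # placeholder labels

for name, clf in [("logistic", LogisticRegression()),
                  ("CART", DecisionTreeClassifier(max_depth=3))]:
    pred = clf.fit(X, y).predict(X)
    tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
    print(f"{name}: sensitivity={tp / (tp + fn):.2f}, "
          f"specificity={tn / (tn + fp):.2f}")
```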
|
33 |
Estimating Non-homogeneous Intensity Matrices in Continuous Time Multi-state Markov Models / Lebovic, Gerald (31 August 2011)
Multi-state Markov (MSM) models can be used to characterize the behaviour of categorical outcomes measured repeatedly over time. Kalbfleisch and Lawless (1985) and Gentleman et al. (1994) examine the MSM model under the assumption of time-homogeneous transition intensities. In the context of non-homogeneous intensities, current methods use piecewise constant approximations, which are less than ideal. We propose a local likelihood method, based on Tibshirani and Hastie (1987) and Loader (1996), to estimate the transition intensities as continuous functions of time. In particular, the local EM algorithm suggested by Betensky et al. (1999) is employed to estimate the non-homogeneous intensities in the presence of missing data.
A simulation comparing the piecewise constant method with the local EM method is conducted using two different sets of underlying intensities. In addition, model assessment tools such as bandwidth selection, grid size selection, and bootstrapped percentile intervals are examined. Lastly, the method is applied to an HIV data set to examine the intensities with regard to depression scores. Although computationally intensive, the method appears viable for estimating non-homogeneous intensities and outperforms existing methods.
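For context, a small sketch of the piecewise constant baseline that the local EM method improves on: within each piece the intensity matrix Q is held constant, so panel-data transition probabilities follow from the matrix exponential. The three-state Q below (third state absorbing) is illustrative only:

```python
import numpy as np
from scipy.linalg import expm

Q = np.array([[-0.30,  0.25, 0.05],
              [ 0.10, -0.25, 0.15],
              [ 0.00,  0.00, 0.00]])   # absorbing third state

def transition_probs(Q, dt):
    """P(dt) = exp(Q * dt) for a time-homogeneous piece."""
    return expm(Q * dt)

def loglik(Q, transitions):
    """Panel-data log-likelihood, assuming Q is constant within each piece."""
    ll = 0.0
    for (i, j, dt) in transitions:     # from-state, to-state, elapsed time
        ll += np.log(transition_probs(Q, dt)[i, j])
    return ll

print(loglik(Q, [(0, 1, 0.5), (1, 2, 1.0)]))
```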
|
35 |
Training Recurrent Neural Networks / Sutskever, Ilya (13 August 2013)
Recurrent Neural Networks (RNNs) are powerful sequence models that were believed to be difficult to train, and as a result they were rarely used in machine learning applications. This thesis presents methods that overcome the difficulty of training RNNs, and applications of RNNs to challenging problems.
We first describe a new probabilistic sequence model that combines Restricted Boltzmann Machines and RNNs. The new model is more powerful than similar models while being less difficult to train.
Next, we present a new variant of the Hessian-free (HF) optimizer and show that it can train RNNs on tasks that have extreme long-range temporal dependencies, which were previously considered to be impossibly hard. We then apply HF to character-level language modelling and get excellent results.
We also apply HF to optimal control and obtain RNN control laws that can successfully operate under conditions of delayed feedback and unknown disturbances.
Finally, we describe a random parameter initialization scheme that allows gradient descent with momentum to train RNNs on problems with long-term dependencies. This directly contradicts widespread beliefs about the inability of first-order methods to do so, and suggests that previous attempts at training RNNs failed partly due to flaws in the random initialization.
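One plausible reading of such an initialization scheme, sketched in NumPy under assumed constants (sparse recurrent connectivity, spectral radius near 1.1, so early gradients neither vanish nor explode); the thesis's exact recipe and values are not reproduced here:

```python
import numpy as np

def init_recurrent(n_hidden, spectral_radius=1.1, sparsity=0.85, seed=0):
    """Sparse random recurrent weights rescaled to a target spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, n_hidden))
    W[rng.random(W.shape) < sparsity] = 0.0      # keep ~15% of connections
    radius = np.abs(np.linalg.eigvals(W)).max()  # largest |eigenvalue|
    return W * (spectral_radius / radius)

W_hh = init_recurrent(100)
print(np.abs(np.linalg.eigvals(W_hh)).max())     # ~1.1 by construction
```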
|
36 |
Stochastic Mortality Modelling / Liu, Xiaoming (28 July 2008)
For life insurance and annuity products whose payoffs depend on future mortality rates, there is a risk that realized mortality rates will differ from the anticipated rates accounted for in pricing and reserving calculations. This is termed mortality risk. Since mortality risk is difficult to diversify and has significant financial impacts on insurance policies and pension plans, it is now well accepted that stochastic approaches should be adopted to model mortality risk and to evaluate mortality-linked securities.

The objective of this thesis is to propose the use of a time-changed Markov process to describe stochastic mortality dynamics for pricing and risk management purposes. Analytical and empirical properties of these dynamics are investigated using a matrix-analytic methodology, and applications of the proposed model to the evaluation of fair values for mortality-linked securities are explored.

More specifically, we consider a finite-state Markov process with one absorbing state. This Markov process is related to an underlying aging mechanism, and the survival time is viewed as the time until absorption. The resulting distribution for the survival time is a so-called phase-type distribution. This approach differs from traditional curve-fitting mortality models in that the survival probabilities are linked to an underlying Markov aging process. The theories of Markov processes and phase-type distributions therefore provide a flexible and tractable framework for modelling mortality dynamics, and the time change allows us to incorporate the uncertainties embedded in future mortality evolution.

The proposed model is applied to price the EIB/BNP Longevity Bonds and other mortality derivatives under the assumption that interest rates and mortality rates are independent. A calibration method is suggested so that the model can utilize both market price information on the relevant mortality risk and the latest mortality projections. The proposed model is also fitted to various types of population mortality data for empirical study; the fitting results show that the model captures stylized mortality patterns very well.
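To make the phase-type construction concrete, the sketch below evaluates the survival function S(t) = alpha exp(Tt) 1, where T is the sub-intensity matrix over the transient (aging) states and alpha the initial distribution; the four-state chain and its rates are invented for illustration, not the thesis's fitted model:

```python
import numpy as np
from scipy.linalg import expm

# Sub-intensity matrix over transient states (absorbing death state omitted);
# the negative row sums give each state's exit rate into absorption.
T = np.array([[-0.02,  0.02,  0.00,  0.00],
              [ 0.00, -0.05,  0.04,  0.00],
              [ 0.00,  0.00, -0.12,  0.10],
              [ 0.00,  0.00,  0.00, -0.30]])
alpha = np.array([1.0, 0.0, 0.0, 0.0])   # everyone starts in the first state

def survival(t):
    """P(lifetime > t) for a phase-type distribution (alpha, T)."""
    return alpha @ expm(T * t) @ np.ones(len(alpha))

for age in (10, 30, 50):
    print(age, round(survival(age), 4))
```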
|
38 |
New methods for analysis of epidemiological data using capture-recapture methods / Huakau, John Tupou (January 2002)
Capture-recapture methods have their origins in animal abundance estimation, where they were used to estimate the unknown size of an animal population under study. In the late 1940s, and again in the late 1960s and early 1970s, these same capture-recapture methods were modified and applied to epidemiological list data. Since then, through their continued use, particularly in the 1990s, these methods have become popular for estimating the completeness of disease registries and the unknown total size of human disease populations. In this thesis we investigate new methods for the analysis of epidemiological list data using capture-recapture methods. In particular, we compare two standard methods for estimating the unknown total population size, and examine new methods that incorporate list mismatch errors and model-selection uncertainty into the estimation of the unknown total population size and its associated confidence interval. We study the use of modified tag loss methods from animal abundance estimation to allow for list mismatch errors in epidemiological list data. We also explore the use of a weighted average method, bootstrap methods, and a Bayesian model averaging method for incorporating model-selection uncertainty into the estimate of the unknown total population size and its associated confidence interval. In addition, we use two previously unanalysed diabetes studies to illustrate the methods examined, and a well-known spina bifida study for simulation purposes. This thesis finds that ignoring list mismatch errors leads to biased estimates of the unknown total population size, and that the list mismatch methods considered here provide a useful adjustment, one that approximately agrees with the results obtained using a complex matching algorithm. As for model-selection uncertainty, we find that confidence intervals which incorporate it are wider and more appropriate than those which do not. Hence we recommend the use of tag loss methods to adjust for list mismatch errors, and of methods that incorporate model-selection uncertainty into both point and interval estimates of the unknown total population size.
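As a point of reference for the simplest two-list case of the population-size estimation discussed above, the sketch below implements Chapman's nearly unbiased variant of the Lincoln-Petersen estimator on invented counts; the thesis's diabetes and spina bifida data are not reproduced:

```python
def chapman(n1, n2, m):
    """n1, n2: cases on each list; m: cases matched on both lists."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# e.g. 320 cases on a registry list, 280 on a hospital list, 190 on both
print(round(chapman(320, 280, 190)))   # estimated total population size
```

List mismatch errors enter here through m: undermatching inflates the estimate, which is what the tag loss adjustment is meant to correct.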
|