1

Nonparametric clustering for spatio-temporal data

Venkatasubramaniam, Ashwini Kolumam January 2019 (has links)
Clustering algorithms attempt the identification of distinct subgroups within heterogeneous data and are commonly utilised as an exploratory tool. The definition of a cluster is dependent on the relevant dataset and associated constraints; clustering methods seek to determine homogeneous subgroups that each correspond to a distinct set of characteristics. This thesis focuses on the development of spatial clustering algorithms and the methods are motivated by the complexities posed by spatio-temporal data. The examples in this thesis primarily come from spatial structures described in the context of traffic modelling and are based on occupancy observations recorded over time for an urban road network. Levels of occupancy indicate the extent of traffic congestion and the goal is to identify distinct regions of traffic congestion in the urban road network. Spatial clustering for spatio-temporal data is an increasingly important research problem and the challenges posed by such research problems often demand the development of bespoke clustering methods. Many existing clustering algorithms, with a focus on accommodating the underlying spatial structure, do not generate clusters that adequately represent differences in the temporal pattern across the network. This thesis is primarily concerned with developing nonparametric clustering algorithms that seek to identify spatially contiguous clusters and retain underlying temporal patterns. Broadly, this thesis introduces two clustering algorithms that are capable of accommodating the spatial and temporal dependencies inherent to the dataset. The first is a functional distributional clustering algorithm that is implemented within an agglomerative hierarchical clustering framework as a two-stage process. The method is based on a measure of distance that utilises estimated cumulative distribution functions over the data, and this unique distance is both functional and distributional. This notion of distance uses differences in densities, rather than the raw recorded observations, to identify distinct clusters in the graph. However, distinct characteristics may not necessarily be identifiable by a density-based distance measure as defined within the agglomerative hierarchical clustering framework. In this thesis, we also introduce a formal Bayesian clustering approach that enables the researcher to determine spatially contiguous clusters in a data-driven manner. This framework varies from the set of assumptions introduced by the functional distributional clustering algorithm. This flexible Bayesian model employs a binary dependent Chinese restaurant process (binDCRP) to place a prior over the geographical constraints posed by a graph-based network. The binDCRP is a special case of the distance dependent Chinese restaurant process first introduced by Blei and Frazier (2011), modified to account for data that pose spatial constraints. The binDCRP seeks to cluster data such that adjacent or neighbouring regions in a spatial structure are more likely to belong to the same cluster. Because the binDCRP tends to introduce a large number of singleton clusters within the spatial structure, we modify it to enable the researcher to restrict the number of clusters in the graph. It is also reasonable to assume that individual junctions within a cluster are spatially correlated with adjacent junctions, due to the nature of traffic and the spread of congestion.
In order to fully account for spatial correlation within a cluster structure, the model utilises a form of conditional auto-regressive (CAR) model. The model also accounts for temporal dependencies using a first-order auto-regressive (AR(1)) model. In this mean-based flexible Bayesian model, the data are assumed to follow a Gaussian distribution and we utilise Kronecker product identities within the definition of the spatio-temporal precision matrix to improve computational efficiency. The model uses a Metropolis-within-Gibbs sampler to fully explore all possible partition structures within the network and to infer the relevant parameters of the spatio-temporal precision matrix. The flexible Bayesian method is also applicable to map-based spatial structures and we describe the model in this context as well. The developed Bayesian model is applied to a simulated spatio-temporal dataset composed of three distinct known clusters. The differences between the clusters are reflected by distinct mean values over time associated with spatial regions. The nature of this mean-based comparison differs from the functional distributional clustering approach, which seeks to identify differences across the distribution. We demonstrate the ability of the Bayesian model to restrict the number of clusters using a simulated data structure with distinctly defined clusters. The sampler is also able to explore potential cluster structures in an efficient manner, and this is demonstrated using a simulated spatio-temporal data structure. The performance of this model is illustrated by an application to a dataset over an urban road network that presents traffic as a process varying continuously across space and time. We also apply this model to an areal unit dataset composed of property prices over a period of time for the county of Avon in England.
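As a rough illustration of the Kronecker-product idea mentioned above (an illustrative sketch, not the thesis code), the snippet below builds a proper CAR precision for a toy adjacency matrix and an AR(1) precision over time, then checks the two standard identities that let a separable spatio-temporal precision be handled without ever forming the full Kronecker product. All parameter values and function names are assumptions made for the example.

```python
import numpy as np

def car_precision(adjacency, tau=1.0, alpha=0.9):
    """Precision matrix of a proper CAR model: tau * (D - alpha * W)."""
    degree = np.diag(adjacency.sum(axis=1))
    return tau * (degree - alpha * adjacency)

def ar1_precision(T, rho=0.8, sigma2=1.0):
    """Tridiagonal precision matrix of a stationary AR(1) process."""
    Q = np.zeros((T, T))
    for t in range(T):
        Q[t, t] = 1.0 + rho**2 if 0 < t < T - 1 else 1.0
        if t > 0:
            Q[t, t - 1] = Q[t - 1, t] = -rho
    return Q / sigma2

# Toy example: 4 spatial units on a chain, 5 time points
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
Qs, Qt = car_precision(W), ar1_precision(5)
S, T = Qs.shape[0], Qt.shape[0]

# Identity 1: log|Qs kron Qt| = T*log|Qs| + S*log|Qt|
logdet_full = np.linalg.slogdet(np.kron(Qs, Qt))[1]
logdet_kron = T * np.linalg.slogdet(Qs)[1] + S * np.linalg.slogdet(Qt)[1]
assert np.isclose(logdet_full, logdet_kron)

# Identity 2: (Qs kron Qt) vec(X) = vec(Qt X Qs'), with column-major vec
X = np.random.default_rng(0).normal(size=(T, S))
assert np.allclose(np.kron(Qs, Qt) @ X.flatten(order="F"),
                   (Qt @ X @ Qs.T).flatten(order="F"))
```

These identities are what keep likelihood evaluations cheap: only the small spatial and temporal factors are ever decomposed, never the full spatio-temporal precision.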
2

Bayesian nonparametric inference in mechanistic models of complex biological systems

Noè, Umberto January 2019 (has links)
Parameter estimation in expensive computational models is a problem that commonly arises in science and engineering. With the increase in computational power, modellers started developing simulators of real-life phenomena that are computationally intensive to evaluate. This, however, makes inference prohibitive due to the unit cost of a single function evaluation. This thesis focuses on computational models of biological and biomechanical processes such as the left-ventricular dynamics or the human pulmonary blood circulatory system. In the former model a single forward simulation takes on the order of 11 minutes of CPU time, while the latter takes approximately 23 seconds on our machines. Markov chain Monte Carlo methods or likelihood maximization using iterative algorithms would take days or weeks to provide a result. This makes them unsuitable for clinical decision support systems, where a decision must be taken in a reasonable time frame. I discuss how to accelerate the inference by using the concept of emulation, i.e. by replacing a computationally expensive function with a statistical approximation based on a finite set of expensive training runs. The emulation target could be either the output-domain, representing the standard approach in the emulation literature, or the loss-domain, which is an alternative and different perspective. Then, I demonstrate how this approach can be used to estimate the parameters of expensive simulators. First I apply loss-emulation to a nonstandard variant of the Lotka-Volterra model of prey-predator interactions, in order to assess whether the approach is approximately unbiased. Next, I present a comprehensive comparison between output-emulation and loss-emulation on a computational model of left ventricular dynamics, with the goal of inferring the constitutive law relating the myocardial stretch to its strain. This is especially relevant for assessing cardiac function post myocardial infarction. The results show how it is possible to estimate the stress-strain curve in just 15 minutes, compared to the one week required by the current best literature method. This means a reduction in the computational costs of 3 orders of magnitude. Next, I review Bayesian optimization (BO), an algorithm to optimize a computationally expensive function by adaptively improving the emulator. This method is especially useful in scenarios where the simulator is not considered to be a "stable release". For example, the simulator could still be undergoing further development, bug fixing, and improvements. I develop a new framework based on BO to estimate the parameters of a partial differential equation (PDE) model of the human pulmonary blood circulation. The parameters, being related to the vessel structure and stiffness, represent important indicators of pulmonary hypertension risk, which need to be estimated as they can only be measured with invasive experiments. The results using simulated data show how it is possible to estimate a patient's vessel properties in a time frame suitable for clinical applications. I demonstrate a limitation of standard improvement-based acquisition functions for Bayesian optimization. The expected improvement (EI) policy recommends query points where the improvement is on average high. However, it does not account for the variance of the improvement random variable. I define a new acquisition function, called ScaledEI, which recommends query points where the improvement over the incumbent minimum is expected to be high, with high confidence.
This new BO algorithm is compared to acquisition functions from the literature on a large set of benchmark functions for global optimization, where it turns out to be a powerful default choice for Bayesian optimization. ScaledEI is then compared to standard non-Bayesian optimization solvers, to confirm that the policy still leads to a reduction in the number of forward simulations required to reach a given tolerance level on the function value. Finally, the new algorithm is applied to the problem of estimating the PDE parameters of the pulmonary circulation model previously discussed.
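To make the acquisition-function discussion concrete, the sketch below gives the standard expected improvement for minimisation, together with one natural way of dividing it by the standard deviation of the improvement so that points whose improvement is highly uncertain are penalised. This is an illustrative sketch of the general idea only; the exact form of ScaledEI in the thesis may differ.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Standard EI for minimisation, given a Gaussian predictive distribution
    with mean mu and sd sigma at a candidate point, and incumbent minimum f_best."""
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def scaled_ei(mu, sigma, f_best):
    """EI divided by the standard deviation of the improvement, penalising
    candidates whose improvement is highly uncertain (illustrative form only)."""
    z = (f_best - mu) / sigma
    ei = expected_improvement(mu, sigma, f_best)
    second_moment = sigma**2 * ((z**2 + 1.0) * norm.cdf(z) + z * norm.pdf(z))
    var_improvement = np.maximum(second_moment - ei**2, 1e-12)
    return ei / np.sqrt(var_improvement)

# Two candidates with the same posterior mean but different uncertainty
print(expected_improvement(0.5, 0.1, 1.0), scaled_ei(0.5, 0.1, 1.0))
print(expected_improvement(0.5, 2.0, 1.0), scaled_ei(0.5, 2.0, 1.0))
```

In the printed comparison, the noisier candidate receives the larger EI but the much smaller scaled value, which is the behaviour the abstract motivates.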
3

Spatio-temporal models for the analysis and optimisation of groundwater quality monitoring networks

McLean, Marnie Isla January 2018 (has links)
Commonly, groundwater quality data are modelled using temporally independent spatial models. However, primarily due to cost constraints, data of this type can be sparse, resulting in some sampling events only recording a few observations. With data of this nature, spatial models struggle to capture the true underlying state of the groundwater, and building models with such small spatial datasets can result in unreliable predictions. This highlights the need for spatio-temporal models which 'borrow strength' from earlier sampling events and which allow interpolation of groundwater concentrations between sampling points. To compare the relative merits of analysing groundwater quality data using spatial versus spatio-temporal statistical models, a comparison study is presented using data from a hypothetical contaminant plume along with a real-life dataset. In this study, the estimation accuracy of spatial p-spline and kriging models is compared with that of spatio-temporal p-spline models. The results show that spatio-temporal methods can increase prediction efficiency markedly, so that, in comparison with repeated spatial analyses, the same level of performance can be achieved with smaller sample sizes. For the comparison study, differing levels of variability over space and time in the spatio-temporal p-spline model were controlled using different numbers of basis functions rather than separate smoothing parameters, due to the computational expense of their optimisation. However, deciding on the number of basis functions for each dimension is subjective, because space and time are measured on different scales, and thus methodology is developed to efficiently tune two smoothing parameters. The proposed methodology exploits lower-resolution models to determine starting points for the optimisation procedure, allowing each parameter to be tuned separately. Working with spatio-temporal models can, however, pose problems of its own. Because many monitoring well networks have a sporadic layout, a consequence of built-up urban areas and transport infrastructure, ballooning can occur in the predictions of these models. 'Ballooning' is a term used to describe the event where high or low predictions are made in regions with little data support. To determine when this has occurred, a measure is developed to highlight when ballooning may be present in the model's predictions. In addition to the measure, a penalty based on the idea that the total contaminant mass should not change significantly over time is proposed, to try to prevent ballooning from happening in the first place. However, the preliminary results presented here indicate that further work is needed to make this effective. It is shown that, by adopting a spatio-temporal modelling framework, a smoother, clearer and more accurate prediction through time can be achieved, compared to spatial modelling of individual time steps, whilst using fewer samples. This was shown using existing sampling schemes where the choice of sampling locations was made by someone with little knowledge or experience of sampling design. Sampling designs on fixed monitoring well networks are then explored and optimised through the minimisation of two objective functions: the variance of the predicted plume mass (VM) and the integrated prediction variance (IV). Sampling design optimisations, using spatial and spatio-temporal p-spline models, are carried out for a range of numbers of wells and at various future sampling time points.
The effects of well-specific sampling frequency are also investigated, and it is found that both objective functions tend to propose wells for the next sampling design which have not been sampled recently. Often, an existing monitoring well network will need to be changed, either by adding new wells or by down-scaling and removing wells. The decision to add wells to the network comes at a financial expense, so it is of paramount importance that wells are added in areas where the gain in knowledge of the region is maximised. The decision to remove a well from the network is equally important and involves a trade-off between costs saved and information lost. The design objective functions suggest a well should be added in an area where the distance to the nearest neighbouring wells is greatest. Finally, consideration is given to optimal sampling designs when the recorded data are assumed to have multiplicative error, a common assumption for groundwater quality data. When modelling with this type of data, the response is normally log-transformed prior to modelling and the predictions are then transformed back onto the original scale for interpretation. Assuming a log-transformed response, the objective functions initially presented can be used if computation of the objective function is also on the log scale. However, if the desired scale of interpretation of the objective functions is the original scale but modelling was performed on the log scale, the resulting objective function values cannot simply be exponentiated to give an interpretation on the original scale. Modelling on the log scale while interpreting the objective function on the original scale can be achieved by adopting a lognormal distribution for the predicted response and numerically integrating its variance to compute the IV objective function. The results indicate that the designs do differ depending on the scale on which the objective function is interpreted. When interpreting on the original scale, the objective function favours sampling from wells where higher values were previously estimated. Unfortunately, computation of the VM objective function under a lognormal assumption has not been achieved so far.
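To make the log-scale versus original-scale point concrete, the following sketch shows how the variance of a lognormally distributed prediction is obtained from its log-scale mean and variance, and how an IV-type objective could then be approximated by summing back-transformed prediction variances over a grid. This is an illustrative sketch, not the thesis code; the function names and the simple grid-sum quadrature are assumptions.

```python
import numpy as np

def lognormal_prediction_variance(mu_log, var_log):
    """Variance on the original scale of a prediction that is Gaussian on the
    log scale with mean mu_log and variance var_log."""
    return (np.exp(var_log) - 1.0) * np.exp(2.0 * mu_log + var_log)

def integrated_variance(mu_log_grid, var_log_grid, cell_area):
    """IV-type objective: back-transformed prediction variances summed over a
    regular grid of prediction locations, weighted by the cell area."""
    return np.sum(lognormal_prediction_variance(mu_log_grid, var_log_grid)) * cell_area

# Toy usage on a 20 x 20 grid of predictions
mu = np.zeros((20, 20))          # log-scale predicted means
var = np.full((20, 20), 0.25)    # log-scale prediction variances
print(integrated_variance(mu, var, cell_area=1.0))
```

The nonlinear dependence of the variance on both log-scale moments is the reason the log-scale objective values cannot simply be exponentiated.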
4

Multilevel structural equation models for the interrelationships between multiple dimensions of childhood socioeconomic circumstances, partnership stability and midlife health

Zhu, Yajing January 2018 (has links)
Recent studies have contributed to understanding of the mechanisms behind the association between childhood circumstances and later life. It has been hypothesized that experiences in childhood operate through influencing trajectories of life events and functional changes in health-related behaviours that can mediate the effects of childhood socioeconomic circumstances (SECs) on later health. Using data from the 1958 British birth cohort, we propose a multilevel structural equation modelling (SEM) approach to investigate the mediating effects of partnership stability, an example of life events in adulthood. Childhood circumstances are abstract concepts with multiple dimensions, each measured by a number of indicators over four childhood waves (at ages 0, 7, 11 and 16). Latent class models are fitted to each set of these indicators and the derived categorical latent variables characterise the patterns of change in four dimensions of childhood SECs. To relate these latent variables to a distal outcome, we first extend the 3-step maximum likelihood (ML) method to handle multiple, associated categorical latent variables and investigate sensitivity of the proposed estimation approach to departures from model assumptions. We then extend the 3-step ML approach to estimate models with multiple outcomes of mixed types and at different levels in a hierarchical data structure. The final multilevel SEM is comprised of latent class models and a joint regression model that relates these categorical latent variables to partnership transitions in adulthood and midlife health, while allowing for informative dropout. Most likely class memberships are treated as imperfect measurements of the latent classes. Life events (e.g. partnership transitions), distal outcomes (e.g. midlife health) and dropout indicators are viewed as items of one or more individual-level latent variables. To account for endogeneity and indirect associations, the effects of childhood SECs on partnership transitions for ages 16-50 and distal health at age 50 are jointly modelled by allowing for a residual association across equations due to shared but differential influences of time-invariant unobservables on each response. Finally, sensitivity analyses are performed to investigate the extent to which the specifications of the dropout model influence the estimated effects of childhood SECs on midlife health.
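As background to the latent class component described above, here is a minimal sketch of fitting a latent class model to binary indicators with the EM algorithm. The posterior class-membership probabilities produced in the E-step are the quantities that a 3-step approach subsequently relates to distal outcomes while correcting for classification error; that correction step is not shown. This is an illustrative sketch, not the thesis model: the number of classes, the variable names and the purely binary indicators are assumptions.

```python
import numpy as np

def fit_lca(Y, n_classes=3, n_iter=200, seed=0):
    """EM algorithm for a latent class model with binary indicators.
    Y: (n_subjects, n_items) array of 0/1 responses.
    Returns class prevalences, item-response probabilities and posteriors."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    pi = np.full(n_classes, 1.0 / n_classes)            # class prevalences
    theta = rng.uniform(0.3, 0.7, size=(n_classes, p))  # P(item = 1 | class)
    for _ in range(n_iter):
        # E-step: posterior probability of each class for each subject
        log_lik = (Y[:, None, :] * np.log(theta)[None] +
                   (1 - Y[:, None, :]) * np.log(1 - theta)[None]).sum(axis=2)
        log_post = np.log(pi)[None] + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update prevalences and item-response probabilities
        pi = post.mean(axis=0)
        theta = np.clip((post.T @ Y) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, post
```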
5

On the running maximum of Brownian motion and associated lookback options

Ho, Tak Yui January 2018 (has links)
The running maximum of Brownian motion appears often in mathematical finance. In derivatives pricing, it is used in modelling derivatives with lookback or barrier-hitting features. For path-dependent derivatives, valuation and risk management rely on Monte Carlo simulation. However, discretization schemes are often biased in estimating the running maximum and the barrier hitting time. For example, it is hard to know whether the underlying asset has crossed the barrier between two discrete time points when the simulated asset prices are on one side of the barrier but very close to it. We apply several martingale methods, such as optional stopping and change of measure (also known as importance sampling, including exponential tilting), to simulate the stopping times, and in some cases the positions, of the running maximum of Brownian motion. This results in more accurate and computationally cheaper Monte Carlo simulations. In the linear deterministic barrier case, closed-form distribution functions are obtained from integral transforms. The stopping time and position can hence be simulated exactly and efficiently by the acceptance-rejection method. Examples in derivative pricing are constructed by using the stopping time as a trigger event. A differential equation method is developed in parallel to solve for the Laplace transform and has the potential to be extended to other barriers. In the compound Poisson barrier case, we can reduce the variance and bias of the crossing probabilities simulated by different importance sampling methods. We have also addressed the problem of heavy skewness when applying importance sampling.
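The discretisation issue described above is commonly handled with known Brownian-bridge results, and the running maximum itself can be drawn exactly given the endpoint of the path. The sketch below is an illustrative example for a unit-variance Brownian motion, not the thesis's method; the function names and parameters are assumptions.

```python
import numpy as np

def bridge_crossing_prob(x, y, barrier, dt):
    """Probability that a unit-variance Brownian path crosses an upper barrier
    between two consecutive grid points, given simulated endpoints x and y
    that are both below the barrier (Brownian-bridge maximum result)."""
    return np.exp(-2.0 * (barrier - x) * (barrier - y) / dt)

def sample_running_max(w_T, T, rng):
    """Exact draw of max_{0 <= t <= T} W_t given W_T = w_T, by inverting
    P(M >= m | W_T = w_T) = exp(-2 m (m - w_T) / T) for m >= max(w_T, 0)."""
    u = rng.uniform()
    return 0.5 * (w_T + np.sqrt(w_T**2 - 2.0 * T * np.log(u)))

rng = np.random.default_rng(1)
print(bridge_crossing_prob(x=0.95, y=0.97, barrier=1.0, dt=0.01))
print(sample_running_max(w_T=0.3, T=1.0, rng=rng))
```

In a barrier-option simulation, the first function is used to decide, by a Bernoulli draw, whether the barrier was breached between grid points even though both simulated prices lie below it, which removes most of the discretisation bias.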
6

Eigenvalue-regularized covariance matrix estimators for high-dimensional data

Feng, Huang January 2018 (has links)
Covariance regularization is important when the dimension p of a covariance matrix is close to or even larger than the sample size n. This thesis concerns estimating large covariance matrices in both low- and high-frequency settings. First, we introduce an integration covariance matrix estimator which is a linear combination of a rotation-equivariant estimator and a regularized covariance matrix estimator that assumes a specific structure for the true covariance Σ_0, under the practical scenario where one is not 100% certain of which regularization method to use. We estimate the weights in the linear combination and show that they asymptotically converge to the true underlying weights. To generalize, we can put two regularized estimators into the linear combination, each assuming a specific structure for Σ_0. Our estimated weights can then be shown to converge to the true weights too, and if one regularized estimator converges to Σ_0 in the spectral norm, the corresponding weight tends to 1 and the others tend to 0 asymptotically. We demonstrate the performance of our estimator when compared to other state-of-the-art estimators through extensive simulation studies and a real data analysis. Next, in a high-frequency setting with non-synchronous trading and contamination by microstructure noise, we propose a Nonparametrically Eigenvalue-Regularized Integrated coVariance matrix Estimator (NERIVE) which does not assume specific structures for the underlying integrated covariance matrix. We show that NERIVE is positive definite in probability, with extreme eigenvalues shrunk nonlinearly under the high-dimensional framework p/n → c > 0. We also prove that, in portfolio allocation, the minimum variance optimal weight vector constructed using NERIVE has maximum exposure and actual risk upper bounds of order p^{-1/2}. The practical performance of NERIVE is illustrated by comparing it to the usual two-scale realized covariance matrix as well as some other nonparametric alternatives using different simulation settings and a real data set. Last, another nonlinear shrinkage estimator of the large integrated covariance matrix in the high-frequency setting is explored, which shrinks the extreme eigenvalues of a realized covariance matrix back to an acceptable level, and enjoys a certain asymptotic efficiency when the number of assets is of the same order as the number of data points. Novel maximum exposure and actual risk bounds are derived when our estimator is used in constructing the minimum-variance portfolio. In simulations and a real data analysis, our estimator performs favourably in comparison with other methods.
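As a toy illustration of combining a rotation-equivariant ingredient with a structured target, the sketch below picks the combination weight by matching an independent validation sample covariance in Frobenius norm. This is an illustrative sketch under simplified assumptions, not the weight-estimation procedure developed in the thesis; the diagonal target, the data split and all names are assumptions.

```python
import numpy as np

def choose_weight(candidate, target, validation_cov):
    """Grid search for w in w*candidate + (1-w)*target minimising the
    Frobenius distance to an independent validation sample covariance."""
    grid = np.linspace(0.0, 1.0, 101)
    losses = [np.linalg.norm(w * candidate + (1 - w) * target - validation_cov, "fro")
              for w in grid]
    return grid[int(np.argmin(losses))]

# Toy usage: p = 50 variables, n = 60 observations split into two halves
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
S1 = np.cov(X[:30], rowvar=False)   # rotation-equivariant ingredient (sample covariance)
target = np.diag(np.diag(S1))       # structured ingredient (diagonal target)
S2 = np.cov(X[30:], rowvar=False)   # held-out sample covariance
w = choose_weight(S1, target, S2)
sigma_hat = w * S1 + (1 - w) * target
```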
7

Ha Jin and the location of representation

Wong, Sin-i, Elaine. January 2005 (has links)
Thesis (M. A.)--University of Hong Kong, 2005. / Title proper from title frame. Also available in printed format.
8

Statistical models for the evolution of facial curves

Mariñas del Collado, Irene January 2017 (has links)
This thesis presents statistical models for the study of the evolution of shape. Particularly, it focuses on the evolution of facial curves. Evolution can be modelled viewing time as a linear, continuous variable, i.e., one curve that is gradually changing in a particular situation. Alternatively, it can play the role of evolutionary time, where branching points in the evolution can occur: ancestors diverging into multiple daughters. Two applications are studied: the evolution of the shape of the lips during the performance of an emotion (linear evolution) and the evolution of nose shape within and between ethnic groups (phylogenetic evolution). The facial images available are in the form of three-dimensional point clouds which characterize each facial surface. Each face is represented by around 100,000 points. Anatomical curves are studied to provide a rich characterization of the full anatomical surface. The curves define the boundaries of morphological features of interest, using information of the facial surface curvature. Methods for the identification of facial three-dimensional curves are studied, and an algorithm to track four-dimensional curves (three spatial dimensions plus time) proposed. The physical characterisation of facial expression involves a set of human facial movements. This thesis considers the shape of the lips as a unique facial feature to characterise emotions. Different approaches are proposed to model the lip shape and its change during the performance of an emotion. A first analysis of the evolving curves is performed using techniques of Procrustes analysis and a model based on B-splines. The thesis then moves to Gaussian Process (GP) models as an alternative approach. Models for k-dimensional curves and k-dimensional evolving curves are proposed. One direct application of the GP models is to study the grouping of different expressions of emotions in a space defined in terms of correlation parameters. To model the evolution of facial curves over many generations, the GP model for evolving k-dimensional curves is extended, using the phylogenetic covariance function, to allow for branching points in the evolution. A case study is conducted on data specially collected from different ethnic groups, where the phylogenetic model is applied to points on two curves defining the shape of the nose.
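As a simple, generic illustration of representing a digitised curve with Gaussian processes, each coordinate can be modelled as an independent GP over a curve parameter such as normalised arc length. This is an illustrative sketch only; the thesis develops richer models for k-dimensional and evolving curves, including a phylogenetic covariance function, and the kernel, parameters and toy curve here are assumptions.

```python
import numpy as np

def rbf_kernel(s, t, variance=1.0, lengthscale=0.2):
    """Squared-exponential covariance between curve parameters s and t."""
    d = s[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior_mean(s_train, y_train, s_test, noise=1e-3):
    """Posterior mean of a GP fitted to one coordinate of a curve sampled
    at parameters s_train (e.g. normalised arc length)."""
    K = rbf_kernel(s_train, s_train) + noise * np.eye(len(s_train))
    K_star = rbf_kernel(s_test, s_train)
    return K_star @ np.linalg.solve(K, y_train)

# Smooth the x, y, z coordinates of a toy 3D curve independently
s = np.linspace(0.0, 1.0, 50)
curve = np.column_stack([np.cos(np.pi * s), np.sin(np.pi * s), 0.1 * s])
s_new = np.linspace(0.0, 1.0, 200)
smoothed = np.column_stack([gp_posterior_mean(s, curve[:, j], s_new) for j in range(3)])
```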
9

Excursions of risk processes with inverse Gaussian processes and their applications in insurance

Liu, Shiju January 2017 (has links)
A Parisian excursion of a Lévy process is defined as an excursion of the process below or above a pre-defined barrier that continuously exceeds a certain time length. In this thesis, we study classical and Parisian types of ruin problems, as well as Parisian excursions of collective risk processes that generalize the classical Cramér-Lundberg risk model. We consider claim sizes that follow mixed exponential distributions, and the main focus is a claim arrival process converging to an inverse Gaussian process. Under this convergence, there are infinitely many and arbitrarily small claim sizes over any finite time interval. The results are obtained through the Gerber-Shiu penalty function employed in an infinitesimal generator and by inverting the corresponding Laplace transform applied to the generator. In Chapter 3, the classical collective risk process under the Cramér-Lundberg risk model framework is introduced, and probabilities of ruin with claim sizes following an exponential distribution and a combination of exponential distributions are also studied. In Chapter 4, we focus on a surplus process with the total claim process converging to an inverse Gaussian process. The classical probability of ruin and the joint distribution of ruin time, overshoot and initial capital are given. This joint distribution can provide probabilities of ruin for different initial capitals over any finite time horizon. In Chapter 5, the classical ruin problem is extended to the Parisian type of ruin, which requires that an excursion of the surplus process continuously below zero reaches a predetermined time length. The joint law of the first excursion above zero and the first excursion below zero is studied. Based on this result, the Laplace transform of the Parisian ruin time and formulae for the probability of Parisian ruin with different initial capitals are obtained. Considering the asymptotic properties of the claim arrival process, we also propose an approximation of the probability of Parisian ruin as the initial capital converges to infinity. In Chapter 6, we generalize the surplus process to two cases, with the total claim process still following an inverse Gaussian process. The first generalization is the case of variable premium income, in which the insurance company invests previous surplus and collects interest. The probability of survival and numerical results are given. The second generalization is the case in which capital inflow is also modelled by a stochastic process, i.e. a compound Poisson process. An explicit formula for the probability of ruin is provided.
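For orientation, the classical Cramér-Lundberg setting of Chapter 3 admits a closed-form ruin probability when claim sizes are exponential, which a crude Monte Carlo simulation can check. The sketch below is an illustration of that baseline case only, not of the inverse Gaussian limit studied in the thesis; all parameter values are assumptions.

```python
import numpy as np

def ruin_prob_exponential(u, lam, beta, c):
    """Infinite-horizon ruin probability in the Cramer-Lundberg model with
    Poisson(lam) claim arrivals, Exp(beta) claim sizes and premium rate c
    (requires the net profit condition c > lam / beta)."""
    return (lam / (c * beta)) * np.exp(-(beta - lam / c) * u)

def simulate_ruin(u, lam, beta, c, horizon=200.0, n_paths=5000, seed=0):
    """Crude Monte Carlo estimate of the probability of ruin before `horizon`.
    The surplus only needs to be checked at claim arrival times."""
    rng = np.random.default_rng(seed)
    ruined = 0
    for _ in range(n_paths):
        t, surplus = 0.0, u
        while True:
            wait = rng.exponential(1.0 / lam)
            t += wait
            if t > horizon:
                break
            surplus += c * wait - rng.exponential(1.0 / beta)
            if surplus < 0.0:
                ruined += 1
                break
    return ruined / n_paths

# With u = 10, lam = 1, mean claim size 1 (beta = 1) and premium rate c = 1.5,
# the analytic value is roughly 0.024; the finite-horizon estimate approximates it.
print(ruin_prob_exponential(10.0, 1.0, 1.0, 1.5), simulate_ruin(10.0, 1.0, 1.0, 1.5))
```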
10

Hierarchical hidden Markov models with applications to BiSulfite-sequencing data

Ghosh, Tusharkanti January 2018 (has links)
DNA methylation is an epigenetic modification with significant roles in various biological processes such as gene expression and cellular proliferation. Aberrant DNA methylation patterns, compared to those of normal cells, have been associated with a large number of human malignancies and potential cancer symptoms. In DNA methylation studies, an important objective is to detect differences between two groups under distinct biological conditions, e.g. between cancer/ageing and normal cells. BiSulfite sequencing (BS-seq) is currently the gold standard for experimentally measuring genome-wide DNA methylation. Recent developments in BS-seq technologies have enabled DNA methylation profiles at single base-pair resolution with more accurate genome coverage. The main objective of my thesis is to identify differential patterns of DNA methylation between proliferating and senescent cells. For efficient detection of differential methylation patterns, this thesis adopts a Bayesian latent variable modelling approach. One such class of models is the hidden Markov model (HMM), which can detect underlying latent (hidden) structures. In this thesis, I propose a family of Bayesian hierarchical HMMs for identifying differentially methylated cytosines (DMCs) and differentially methylated regions (DMRs) from BS-seq data, which act as important indicators for a better understanding of cancer and other related diseases. I introduce HMMmethState, a model-based hierarchical Bayesian technique for identifying DMCs from BS-seq data. My novel HMMmethState method implements hierarchical HMMs to account for spatial dependence among the CpG sites over genomic positions of BS-seq methylation data. In particular, this thesis is concerned with developing hierarchical HMMs for the differential methylation analysis of BS-seq data, within a Bayesian framework. In these models, aberrant DNA methylation is driven by two latent states, a differentially methylated state and a similarly methylated state, which can be interpreted as the methylation status of CpG sites and which evolve over genomic positions as a first-order Markov chain. I first design a (homogeneous) discrete-index hierarchical HMM in which the methylated counts, given the methylation status of CpG sites, follow a Beta-Binomial emission distribution specific to the methylation state. However, this model does not incorporate the genomic positional variation among the CpG sites, so I develop a (non-homogeneous) continuous-index hierarchical HMM, in which the transition probabilities between methylation states depend on the genomic positions of the CpG sites. The Beta-Binomial emission model, however, does not take into account the correlation between the methylated counts of the proliferating and senescent cells, which has been observed in the BS-seq data analysis. So, I develop a hierarchical Normal-logit Binomial emission model that induces correlation between the methylated counts of the proliferating and senescent cells. Furthermore, to perform parameter estimation for my models, I implement efficient Markov chain Monte Carlo (MCMC) algorithms. In this thesis, I provide an extensive study of model comparison and adequacy for all the models using Bayesian model checking. In addition, I also show the performance of all the models using Receiver Operating Characteristic (ROC) curves. I illustrate the models by fitting them to a large BS-seq dataset and apply model selection criteria to the dataset to select the best model.
In addition, I compare the performance of my methods with that of existing methods for detecting DMCs. I demonstrate how the HMMmethState-based algorithms outperform the existing methods in simulation studies in terms of ROC curves. I present the DMRs obtained by applying the proposed HMMmethState to the BS-seq datasets. The results of the hierarchical HMMs show that these methods can be implemented in unconditioned settings to identify DMCs in high-throughput BS-seq data. The predicted DMCs can also help in understanding the phenotypic changes associated with human ageing.
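As a minimal illustration of the emission and transition structure described above, the snippet below computes the likelihood of methylated counts along a chromosome under a two-state HMM with Beta-Binomial emissions via the forward recursion. This is an illustrative sketch, not HMMmethState itself: the thesis uses hierarchical priors, non-homogeneous transitions and MCMC rather than a plain forward pass, and all parameter values and names here are assumptions.

```python
import numpy as np
from scipy.stats import betabinom

def forward_loglik(meth_counts, totals, trans, state_params, init=(0.5, 0.5)):
    """Log-likelihood of methylated counts along a chromosome under a
    two-state HMM with Beta-Binomial emissions (forward recursion in log space).
    trans: 2x2 transition matrix; state_params: (alpha, beta) pair per state."""
    log_alpha = np.log(init) + np.array(
        [betabinom.logpmf(meth_counts[0], totals[0], *state_params[k]) for k in range(2)])
    for x, n in zip(meth_counts[1:], totals[1:]):
        emit = np.array([betabinom.logpmf(x, n, *state_params[k]) for k in range(2)])
        log_alpha = emit + np.logaddexp(log_alpha[0] + np.log(trans[0]),
                                        log_alpha[1] + np.log(trans[1]))
    return np.logaddexp(log_alpha[0], log_alpha[1])

# Toy data: methylated counts out of total reads at 6 consecutive CpG sites
meth = np.array([3, 4, 18, 19, 17, 2])
tot = np.array([20, 20, 20, 20, 20, 20])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
params = [(2.0, 8.0), (8.0, 2.0)]   # two states with different Beta-Binomial parameters
print(forward_loglik(meth, tot, trans, params))
```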
