11

To p, or not to p? : quantifying inferential decision errors to assess whether significance truly is significant

Abdey, James Spencer January 2009 (has links)
Empirical testing is centred on p-values. These summary statistics are used to assess the plausibility of a null hypothesis, and therein lies a flaw in their interpretation. Central to this research is accounting for the behaviour of p-values, through density functions, under the alternative hypothesis, H1. These densities are determined by a combination of the sample size and parametric specification of H1. Here, several new contributions are presented to reflect p-value behaviour. By considering the likelihood of both hypotheses in parallel, it is possible to optimise the decision-making process. A framework for simultaneously testing the null and alternative hypotheses is outlined for various testing scenarios. To facilitate efficient empirical conclusions, a new set of critical value tables is presented requiring only the conventional p-value, hence avoiding the need for additional computation in order to apply this joint testing in practice. Simple and composite forms of H1 are considered. Recognising the conflict between different schools of thought with respect to hypothesis testing, a unified approach at consolidating the advantages of each is offered. Again, exploiting p-value distributions under various forms of H1, a revised conditioning statistic for conditional frequentist testing is developed from which original p-value curves and surfaces are produced to further ease decision making. Finally, attention turns to multiple hypothesis testing. Estimation of multiple testing error rates is discussed and a new estimator for the proportion of true null hypotheses, when simultaneously testing several independent hypotheses, is presented. Under certain conditions it is shown that this estimator is superior to an established estimator.
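The density of a p-value under H1 has a simple closed form in the one-sided Z-test case, which conveys the idea behind weighing both hypotheses in parallel. The sketch below is illustrative only, not the thesis's code: the effect size delta and sample size n are arbitrary, and the density shown also acts as the likelihood ratio of H1 to H0 because the p-value is uniform under H0.

```python
import numpy as np
from scipy.stats import norm

def pvalue_density_under_h1(p, delta, n):
    """Density of a one-sided Z-test p-value when the true standardised
    effect is delta and the sample size is n.  Under H0 the p-value is
    Uniform(0, 1), so this density is also the likelihood ratio H1 : H0."""
    z = norm.ppf(1 - p)          # test statistic implied by the p-value
    mu = delta * np.sqrt(n)      # non-centrality under H1
    return norm.pdf(z - mu) / norm.pdf(z)

# How strongly does p = 0.03 favour H1 over H0 for this illustrative alternative?
p_obs = 0.03
print(pvalue_density_under_h1(p_obs, delta=0.5, n=25))
```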
12

Sparse modelling and estimation for nonstationary time series and high-dimensional data

Cho, Haeran January 2010 (has links)
Sparse modelling has attracted great attention as an efficient way of handling statistical problems in high dimensions. This thesis considers sparse modelling and estimation in a selection of problems such as breakpoint detection in nonstationary time series, nonparametric regression using piecewise constant functions and variable selection in high-dimensional linear regression. We first propose a method for detecting breakpoints in the second-order structure of piecewise stationary time series, assuming that those structural breakpoints are sufficiently scattered over time. Our choice of time series model is the locally stationary wavelet process (Nason et al., 2000), under which the entire second-order structure of a time series is described by wavelet-based local periodogram sequences. As the initial stage of breakpoint detection, we apply a binary segmentation procedure to wavelet periodogram sequences at each scale separately, which is followed by within-scale and across-scales post-processing steps. We show that the combined methodology achieves consistent estimation of the breakpoints in terms of their total number and locations, and investigate its practical performance using both simulated and real data. Next, we study the problem of nonparametric regression by means of piecewise constant functions, which are known to be flexible in approximating a wide range of function spaces. Among many approaches developed for this purpose, we focus on comparing two well-performing techniques, the taut string (Davies & Kovac, 2001) and the Unbalanced Haar (Fryzlewicz, 2007) methods. While the multiscale nature of the latter is easily observed, it is not so obvious that the former can also be interpreted as multiscale. We provide a unified, multiscale representation for both methods, which offers an insight into the relationship between them as well as suggesting some lessons that both methods can learn from each other. Lastly, one of the most widely-studied applications of sparse modelling and estimation is considered, variable selection in high-dimensional linear regression. High dimensionality of the data brings in many complications including (possibly spurious) non-negligible correlations among the variables, which may result in marginal correlation being unreliable as a measure of association between the variables and the response. We propose a new way of measuring the contribution of each variable to the response, which adaptively takes into account high correlations among the variables. A key ingredient of the proposed tilting procedure is hard-thresholding the sample correlation of the design matrix, which enables a data-driven switch between the use of marginal correlation and tilted correlation for each variable. We study the conditions under which this measure can discriminate between relevant and irrelevant variables, and thus be used as a tool for variable selection. In order to exploit these theoretical properties of tilted correlation, we construct an iterative variable screening algorithm and examine its practical performance in a comparative simulation study.
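As a rough illustration of the first stage described above, the following sketch implements generic binary segmentation with CUSUM statistics on a single sequence. It is a simplified stand-in rather than the thesis's procedure: it targets a plain mean-shift model instead of wavelet periodogram sequences, and the threshold and minimum segment length are left to the user.

```python
import numpy as np

def cusum_stats(x):
    """|CUSUM| statistic for every candidate breakpoint of the sequence x."""
    n = len(x)
    total = x.sum()
    left = np.cumsum(x)[:-1]
    stats = np.zeros(n - 1)
    for b in range(1, n):
        nl, nr = b, n - b
        stats[b - 1] = abs(np.sqrt(nr / (n * nl)) * left[b - 1]
                           - np.sqrt(nl / (n * nr)) * (total - left[b - 1]))
    return stats

def binary_segmentation(x, threshold, start=0, min_len=5):
    """Recursively split x wherever the maximal CUSUM statistic exceeds
    threshold; returns the estimated breakpoint locations."""
    if len(x) < 2 * min_len:
        return []
    stats = cusum_stats(x)
    b = int(np.argmax(stats)) + 1
    if stats[b - 1] < threshold:
        return []
    return (binary_segmentation(x[:b], threshold, start, min_len)
            + [start + b]
            + binary_segmentation(x[b:], threshold, start + b, min_len))

# Illustrative mean-shift example with two true breakpoints.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 80), rng.normal(-1, 1, 120)])
print(binary_segmentation(x, threshold=3.0))
```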
13

Multiple imputation for missing data and statistical disclosure control for mixed-mode data using a sequence of generalised linear models

Lee, Min Cherng January 2014 (has links)
Multiple imputation is a commonly used approach to deal with missing data and to protect confidentiality of public use data sets. The basic idea is to replace the missing or sensitive values with multiply imputed values, and we then release the multiply imputed data sets to the public. Users can analyze the multiply imputed data sets and obtain valid inferences by using simple combining rules, which take the uncertainty due to the presence of missing values and synthetic values into account. It is crucial that imputations are drawn from the posterior predictive distribution to preserve relationships present in the data and allow valid conclusions to be made from any analysis. In data sets with different types of variables, e.g. some categorical and some continuous variables, multivariate imputation by chained equations (MICE; Van Buuren, 2011) is a commonly used multiple imputation method. However, imputations from such an approach are not necessarily drawn from a proper posterior predictive distribution. We propose a method, called the factored regression model (FRM), to multiply impute missing values in such data sets by modelling the joint distribution of the variables in the data through a sequence of generalised linear models. We use data augmentation methods to connect the categorical and continuous variables, and this allows us to draw imputations from a proper posterior distribution. We compare the performance of our method with MICE using simulation studies and a breastfeeding data set. We also extend our modelling strategies to incorporate different informative priors for the FRM to explore robust regression modelling and the sparse relationships between the predictors. We then apply our model to protect confidentiality of the Current Population Survey (CPS) data by generating multiply imputed, partially synthetic data sets. These data sets comprise a mix of original data and synthetic data, where values chosen for synthesis are based on an approach that considers unique and sensitive units in the survey. Valid inference can then be made using the combining rules described by Reiter (2003). An extension to the modelling strategy is also introduced to deal with the presence of spikes at zero in some of the continuous variables in the CPS data.
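For readers unfamiliar with the combining rules mentioned above, here is a minimal sketch of the two variants: Rubin's rules for missing-data imputation and the Reiter (2003) rules for partially synthetic data. Degrees-of-freedom corrections for interval estimation are omitted, and the function names are illustrative, not taken from the thesis.

```python
import numpy as np

def combine_missing_data(estimates, variances):
    """Rubin's rules for m multiply imputed data sets (missing data).
    estimates, variances: length-m arrays of the point estimate and its
    estimated variance computed on each completed data set."""
    m = len(estimates)
    q_bar = np.mean(estimates)        # combined point estimate
    u_bar = np.mean(variances)        # within-imputation variance
    b = np.var(estimates, ddof=1)     # between-imputation variance
    t = u_bar + (1 + 1 / m) * b       # total variance
    return q_bar, t

def combine_partially_synthetic(estimates, variances):
    """Combining rules of Reiter (2003) for partially synthetic data,
    where the between-imputation component enters as b / m."""
    m = len(estimates)
    q_bar = np.mean(estimates)
    u_bar = np.mean(variances)
    b = np.var(estimates, ddof=1)
    t = u_bar + b / m
    return q_bar, t

# Example: point estimates and variances from m = 5 completed data sets.
est, var = [2.1, 1.9, 2.3, 2.0, 2.2], [0.10, 0.12, 0.11, 0.09, 0.10]
print(combine_missing_data(est, var), combine_partially_synthetic(est, var))
```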
14

The design of cross-over studies subject to dropout

Low, Janice Lorraine January 1995 (has links)
A cross-over study is a comparative experiment in which subjects receive a sequence of two or more treatments, one in each of a series of successive time periods, and the response of each subject is measured at the end of every period. A common problem, particularly in medicine, is that subjects fail to complete a study through dropping out during the later stages of the trial for reasons unrelated to the treatments received. Current practice is to select a design for a study on the basis of its performance under the assumption that no subjects drop out, using a criterion such as A-optimality. This is an unrealistic assumption for many medical applications. This thesis investigates how studies should be designed when it is unrealistic to assume that subjects will not drop out. A method of assessing cross-over designs is presented which judges how accurately all the pairwise treatment comparisons are estimated under the assumption that each subject has a fixed probability of dropping out during the final period, independent of treatment received and of the other subjects. The method of design assessment is computationally intensive even for studies involving a relatively small number of subjects. Ways of reducing the amount of computation required are presented through establishing the link between implemented designs and a colouring problem in combinatorial theory. The reductions achieved make feasible investigations of currently used designs for cross-over studies. The results of investigations are presented for designs for the cases of particular practical importance, namely four-treatment, four-period and three-treatment, three-period studies, in which a simple carry-over model is assumed for the observations. Designs which are more robust to final-period dropout than the currently favoured designs are identified.
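The assessment criterion described above can be approximated with a short simulation. The sketch below is a simplified illustration, not the thesis's method: it fits a subject + period + treatment model with no carry-over term, treats dropout as loss of the final-period observation only, and averages the A-criterion (mean variance of pairwise treatment contrasts, up to sigma squared) over Monte Carlo dropout patterns. The Williams square and the dropout probability are illustrative choices.

```python
import itertools
import numpy as np

def design_matrix(sequences):
    """Fixed-effects model y = subject + period + treatment + error for a
    cross-over design given as one treatment sequence (string) per subject."""
    n_subj, n_per = len(sequences), len(sequences[0])
    trts = sorted({t for s in sequences for t in s})
    rows = []
    for i, seq in enumerate(sequences):
        for j, t in enumerate(seq):
            rows.append(np.concatenate([np.eye(n_subj)[i],
                                        np.eye(n_per)[j],
                                        np.eye(len(trts))[trts.index(t)]]))
    return np.array(rows), n_per, len(trts)

def a_value(X, n_trt):
    """Average variance of all pairwise treatment contrasts (up to sigma^2);
    the pseudo-inverse handles the over-parameterised dummy coding."""
    cov = np.linalg.pinv(X.T @ X)[-n_trt:, -n_trt:]
    pairs = itertools.combinations(range(n_trt), 2)
    return np.mean([cov[i, i] + cov[j, j] - 2 * cov[i, j] for i, j in pairs])

def expected_a_value(sequences, p_drop, n_sim=2000, seed=1):
    """Monte Carlo estimate of the expected A-criterion when each subject
    independently loses the final-period observation with probability p_drop."""
    rng = np.random.default_rng(seed)
    X, n_per, n_trt = design_matrix(sequences)
    n_subj = len(sequences)
    vals = []
    for _ in range(n_sim):
        drop = rng.random(n_subj) < p_drop
        keep = np.array([not (drop[i] and j == n_per - 1)
                         for i in range(n_subj) for j in range(n_per)])
        vals.append(a_value(X[keep], n_trt))
    return float(np.mean(vals))

# Illustrative four-treatment, four-period Williams square, two subjects per sequence.
williams = ["ABDC", "BCAD", "CDBA", "DACB"]
print(expected_a_value(williams * 2, p_drop=0.1))
```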
15

Stochastic and robust models for optimal decision making in energy

Gourtani, Arash Mostajeran January 2014 (has links)
No description available.
16

Optimal and efficient experimental design for nonparametric regression with application to functional data

Fisher, Verity January 2012 (has links)
Functional data is ubiquitous in modern science, technology and medicine. An example, which motivates the work in this thesis, is an experiment in tribology to investigate wear in automotive transmission. The research in this thesis provides methods for the design of experiments when the response is assumed to be a realisation of a smooth function. In the course of the research, two areas were investigated: designs for local linear smoothers and designs for discriminating between two functional linear models. Designs that are optimal for minimising the prediction variance of a smooth function were found across an interval using two kernel smoothing methods: local linear regression and Gasser and Muller estimation. The values of the locality parameter and run size were shown to affect the optimal design. Optimal designs for best prediction using local linear regression were applied to the tribology experiment. A compound optimality criterion is proposed which is a weighted average of the integrated prediction variance and the inverse of the trace of the smoothing matrix using the Gasser and Muller estimator. The complexity of the model to be fitted was shown to influence the selection of optimal design points. The robustness of these optimal designs to misspecification of the kernel function for the compound criterion was also critically assessed. A criterion and method for finding T-optimal designs was developed for discriminating between two competing functional linear models. It was proved that the choice of optimal design is independent of the parameter values when discriminating between two nested functional linear models that differ by only one term. The performance of T-optimal designs was evaluated in simulation studies which calculated the power of the test for assessing the fit of one model using data generated from the competing model.
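To make the integrated-prediction-variance criterion concrete, the sketch below evaluates it for a local linear smoother with a Gaussian kernel and runs a crude random search over candidate designs on [0, 1]. It is an illustration under simplifying assumptions of our own (fixed bandwidth, equal error variance, random rather than exchange-type search), not the optimisation used in the thesis, which also covers the Gasser and Muller estimator and a compound criterion.

```python
import numpy as np

def smoother_weights(x_design, x0, h):
    """Equivalent-kernel weights of a local linear fit at x0,
    using a Gaussian kernel with bandwidth h."""
    d = x_design - x0
    w = np.exp(-0.5 * (d / h) ** 2)
    X = np.column_stack([np.ones_like(d), d])
    A = X.T @ (w[:, None] * X)
    return np.linalg.solve(A, X.T * w)[0]   # first row gives the fit at x0

def integrated_pred_variance(x_design, h, grid):
    """Prediction variance (up to sigma^2) averaged over a grid of points."""
    return np.mean([np.sum(smoother_weights(x_design, x0, h) ** 2)
                    for x0 in grid])

def random_search_design(n_runs, h, n_iter=2000, seed=0):
    """Crude random search for an n_runs-point design on [0, 1] minimising
    the integrated prediction variance of the local linear smoother."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0, 1, 101)
    best, best_val = None, np.inf
    for _ in range(n_iter):
        cand = np.sort(rng.random(n_runs))
        val = integrated_pred_variance(cand, h, grid)
        if val < best_val:
            best, best_val = cand, val
    return best, best_val

design, value = random_search_design(n_runs=8, h=0.15)
print(np.round(design, 3), value)
```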
17

Statistical inference for ordinary differential equations using gradient matching

Macdonald, Benn January 2017 (has links)
A central objective of current systems biology research is explaining the interactions amongst components in biopathways. A standard approach is to view a biopathway as a network of biochemical reactions, which is modelled as a system of ordinary differential equations (ODEs). Conventional inference methods typically rely on searching the space of parameter values and, at each candidate, numerically solving the ODEs and comparing the output with that observed. After choosing an appropriate noise model, the form of the likelihood is defined, and a measure of similarity between the data signals and the signals described by the current set of ODE parameters can be calculated. This process is repeated, as part of either an iterative optimisation scheme or a sampling procedure, in order to estimate the parameters. However, the computational costs involved with repeatedly numerically solving the ODEs are usually high. Several authors have adopted approaches based on gradient matching, aiming to reduce this computational complexity. These approaches are based on the following two-step procedure. In the first step, interpolation is used to smooth the time series data, in order to avoid modelling noisy observations; in the second step, the kinetic parameters of the ODEs are either optimised or sampled, whilst minimising some metric measuring the difference between the slopes of the tangents to the interpolants and the parameter-dependent time derivatives from the ODEs. In this fashion, the ODEs never have to be numerically integrated, and the problem of inferring the typically unknown initial conditions of the system is removed, as they are not required for matching gradients. A downside to this two-step scheme is that the results of parameter inference are critically dependent on the quality of the initial interpolant. Alternatively, the ODEs can be allowed to regularise the interpolant, which has been demonstrated to significantly improve parameter inference accuracy and robustness with respect to noise. This thesis extends and develops methods of gradient matching for parameter inference and model selection in ODE systems in a systems biology context.
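The two-step procedure described above is easy to sketch. The code below is an illustrative stand-in rather than the thesis's method: it smooths each state with a spline, differentiates the spline, and minimises the squared mismatch between the spline gradients and the ODE right-hand side, using the Lotka-Volterra equations purely as an example system (numerical integration appears only to simulate the synthetic data).

```python
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.optimize import minimize
from scipy.integrate import solve_ivp

def ode_rhs(x, theta):
    """Example ODE system (Lotka-Volterra), standing in for a biopathway model."""
    a, b, c, d = theta
    return np.array([a * x[0] - b * x[0] * x[1],
                     c * x[0] * x[1] - d * x[1]])

def gradient_matching_objective(theta, t, y_smooth, dy_smooth):
    """Squared difference between interpolant slopes and ODE time derivatives."""
    return sum(np.sum((dy_smooth[:, i] - ode_rhs(y_smooth[:, i], theta)) ** 2)
               for i in range(len(t)))

def fit_by_gradient_matching(t, y_obs, theta0, smoothing=1.0):
    """Two-step gradient matching: (1) smooth each observed state with a spline,
    (2) optimise the ODE parameters against the spline gradients.
    The ODEs are never numerically integrated during the fit."""
    splines = [UnivariateSpline(t, y_obs[k], s=smoothing) for k in range(y_obs.shape[0])]
    y_smooth = np.array([s(t) for s in splines])
    dy_smooth = np.array([s.derivative()(t) for s in splines])
    res = minimize(gradient_matching_objective, theta0,
                   args=(t, y_smooth, dy_smooth), method="Nelder-Mead")
    return res.x

# Simulate noisy data from known parameters, then recover them by gradient matching.
t = np.linspace(0, 10, 50)
true_theta = np.array([1.0, 0.3, 0.2, 1.0])
sol = solve_ivp(lambda s, x: ode_rhs(x, true_theta), (0, 10), [5.0, 3.0], t_eval=t)
y_obs = sol.y + np.random.default_rng(0).normal(0, 0.1, sol.y.shape)
print(fit_by_gradient_matching(t, y_obs, theta0=np.array([0.5, 0.5, 0.5, 0.5])))
```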
18

Dynamic DNA and human disease : mathematical modelling and statistical inference for myotonic dystrophy type 1 and Huntington disease

Higham, Catherine F. January 2013 (has links)
Several human genetic diseases, including myotonic dystrophy type 1 (DM1) and Huntington disease (HD), are associated with inheriting an abnormally large unstable DNA simple sequence tandem repeat. These sequences mutate, by changing the number of repeats, many times during the lifetime of those affected, with a bias towards expansion. High repeat numbers are associated with early onset and disease severity. The presence of somatic instability compromises attempts to measure intergenerational repeat dynamics and infer genotype-phenotype relationships. Modelling the progression of repeat length throughout the lifetime of individuals has potential for improving prognostic information as well as providing a deeper understanding of the underlying biological process. Dr Fernando Morales, Dr Anneli Cooper and others from the Monckton lab have characterised more than 25,000 de novo somatic mutations from a large cohort of DM1 patients using single-molecule polymerase chain reaction (SM-PCR). This rich dataset enables us to fully quantify levels of somatic instability across a representative DM1 population for the first time. We establish the relationship between inherited or progenitor allele length, age at sampling and levels of somatic instability using linear regression analysis. We show that the estimated progenitor allele length genotype is significantly better than modal repeat length (the current clinical standard) at predicting age of onset and this novel genotype is the major modifier of the age of onset phenotype. Further we show that somatic variation (adjusted for estimated progenitor allele length and age at sampling) is also a modifier of the age of onset phenotype. Several families form the large cohort, and we find that the level of somatic instability is highly heritable, implying a role for individual-specific trans-acting genetic modifiers. We develop new mathematical models, the main focus of this thesis, by modifying a previously proposed stochastic birth process to incorporate possible contraction. A Bayesian likelihood approach is used as the basis for inference and parameter estimation. We use model comparison analysis to reveal, for the first time, that the expansion bias observed in the distributions of repeat lengths is likely to be the cumulative effect of many expansion and contraction events. We predict that mutation events can occur as frequently as every other day, which matches the timing of regular cell activities such as DNA repair and transcription, but not DNA replication. Mutation rates estimated under the models described above are lower than expected among individuals with inherited repeat lengths less than 100 CTGs, suggesting that these rates may be suppressed at the lower end of the disease causing range. We propose that a length-specific effect may be operating within this range and test this hypothesis by introducing such an effect into the model. To calibrate this extended model, we use blood DNA data from DM1 individuals with small alleles (inherited repeat lengths less than 100 CTGs) and buccal DNA from HD individuals who almost always have inherited repeat lengths less than 100 CAGs. These datasets comprise single DNA molecules sized using SM-PCR. We find statistical support for a general length-specific effect which suppresses mutational rates among the smaller alleles and gives rise to a distinctive pattern in the repeat length distributions. 
In a novel application of this new model, fitted to a large cohort of DM1 individuals, we also show that this distinctive pattern may help identify individuals whose effective repeat length, with regards to somatic instability, is less than their actual repeat length. A plausible explanation for this distinction is that the expanded repeat tract is compromised by interruptions or other unusual features. For these individuals, we estimate the effective repeat length of their expanded repeat tracts and contribute to the on-going discussion about the effect of interruptions on phenotype. The interpretation of the levels of somatic instability in many of the affected tissues in the triplet repeat diseases is hindered by complex cell compositions. We extend our model to two cell populations whose repeat lengths have different rates of mutation (fast and slow). Swami et al. have recently characterised repeat length distributions in end-stage HD brain. Applying our model, we infer for each frontal cortex HD dataset the likely relative weight of these cell populations and their corresponding contribution towards somatic variation. By comparison with data from laser-captured single cells, we conclude that the neuronal repeat lengths most likely mutate at a higher rate than glial repeat lengths, explaining the characteristic skewed distributions observed in mixed-cell tissue from the brain. We confirm that individual-specific mutation rates in neurons are, in addition to the inherited repeat length, a modifier of age of onset. Our results support a model of disease progression where individuals with the same inherited repeat length may reach age of onset as much as 30 years earlier because of greater somatic expansions underpinned by higher mutational rates. Therapies aimed at reducing somatic expansions would therefore have considerable benefits with regard to extending the age of onset. Currently, clinical diagnosis of DM1 is based on a measure of repeat length from blood cells, but variance in modal length accounts for only 20-40% of the variance in age of onset and, therefore, is not an accurate predictive tool. We show that in principle progenitor allele length improves the inverse correlation with age of onset over the traditional modal length measure. We make use of second blood samples that are now available from 40 DM1 individuals. We show that inherited repeat length and the mutation rates underlying repeat length instability in blood, inferred from samples at two time points rather than one, are better predictors of age of onset than the traditional modal length measure. Our results are a step towards providing better prognostic information for DM1 individuals and their families. They should also lead to better predictions for drug/therapy response, which is emerging as key to successful clinical trials. Microsatellites are another type of tandem repeat found in the genome with high levels of intergenerational and somatic mutation. Differences between individuals make microsatellites very useful biomarkers and they have many applications in forensics and medicine. As well as a general application to other expanded repeat diseases, the mathematical models developed here could be used to better understand instability at other mutational hotspots such as microsatellites.
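A forward simulation helps convey the kind of birth process with contraction described above. The sketch below is illustrative only: the length-proportional expansion and contraction rates (lam, nu) are assumptions chosen for the example rather than estimates from the thesis, and the Bayesian likelihood and model-comparison machinery are not shown.

```python
import numpy as np

def simulate_repeat_length(n0, years, lam, nu, rng):
    """Gillespie simulation of a single repeat tract whose length n gains one
    repeat at rate lam * n and loses one at rate nu * n (per year).
    Length-proportional rates with lam > nu produce an expansion bias."""
    t, n = 0.0, n0
    while True:
        total_rate = (lam + nu) * n
        if total_rate == 0:
            return n
        t += rng.exponential(1.0 / total_rate)   # waiting time to next mutation
        if t > years:
            return n
        if rng.random() < lam / (lam + nu):
            n += 1                   # expansion event
        else:
            n = max(n - 1, 0)        # contraction event

# Repeat-length distribution across a sample of cells after 40 years,
# starting from an inherited length of 200 repeats (illustrative rates).
rng = np.random.default_rng(0)
lengths = [simulate_repeat_length(200, 40, lam=0.02, nu=0.015, rng=rng)
           for _ in range(500)]
print(np.mean(lengths), np.percentile(lengths, [5, 95]))
```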
19

Topics on statistical design and analysis of cDNA microarray experiment

Zhu, Ximin January 2009 (has links)
A microarray is a powerful tool for surveying the expression levels of many thousands of genes simultaneously. It belongs to the new genomics technologies which have important applications in the biological, agricultural and pharmaceutical sciences. In this thesis, we focus on the dual-channel cDNA microarray, which is one of the most popular microarray technologies, and discuss three different topics: optimal experimental design; dye effect normalization; and estimation of the proportion of true null hypotheses, the local false discovery rate (lFDR) and the positive false discovery rate (pFDR). The first topic consists of four subtopics, each of which addresses an independent and practical problem of cDNA microarray experimental design. In the first subtopic, we propose an optimization strategy based on the simulated annealing method to find optimal or near-optimal designs with both biological and technical replicates. In the second subtopic, we discuss how to apply the Q-criterion to the factorial design of microarray experiments. In the third subtopic, we suggest an optimal way of pooling samples, which is actually a replication scheme to minimize the variance of the experiment under the constraint of fixing the total cost at a certain level. In the fourth subtopic, we indicate that the criterion for distant pair design is not proper and propose an alternative criterion instead. The second topic of this thesis is dye effect normalization. For cDNA microarray technology, each array compares two samples, which are usually labelled with different dyes, Cy3 and Cy5. The technology assumes that, for a given gene (spot) on the array, if the Cy3-labelled sample has k times as much of a transcript as the Cy5-labelled sample, then the Cy3 signal should be k times as high as the Cy5 signal, and vice versa. This important assumption requires that the dyes should have the same properties. However, the reality is that the Cy3 and Cy5 dyes have slightly different properties, and the relative efficiency of the dyes varies across the intensity range in a "banana-shaped" way. In order to remove the dye effect, we propose a novel dye effect normalization method based on modelling dye response functions and the dye effect curve. Real and simulated microarray data sets are used to evaluate the method, and the results show that its performance is satisfactory. The focus of the third topic is the estimation of the proportion of true null hypotheses, the lFDR and the pFDR. In a typical microarray experiment, a large number of gene expression values are measured. In order to find differentially expressed genes, these variables are usually screened simultaneously by a statistical test. Since this is a case of multiple hypothesis testing, some kind of adjustment should be made to the p-values resulting from the statistical test. Many multiple testing error rates, such as the FDR, lFDR and pFDR, have been proposed to address this issue. A key related problem is the estimation of the proportion of true null hypotheses (i.e. non-expressed genes). To model the distribution of the p-values, we propose three kinds of finite mixture with an unknown number of components (the first component corresponds to differentially expressed genes and the remaining components correspond to non-differentially expressed ones). We apply a new MCMC method called the allocation sampler to estimate the proportion of true nulls (i.e. the mixture weight of the first component). The method also provides a framework for estimating the lFDR and pFDR.
Two real microarray data studies plus a small simulation study are used to assess our method. We show that the performance of the proposed method is satisfactory.
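The idea of estimating the proportion of true nulls and the lFDR from a mixture fitted to p-values can be conveyed with a much simpler model than those used in the thesis. The sketch below fits a two-component uniform-plus-Beta(a, 1) mixture by EM rather than the finite mixtures with an unknown number of components fitted by the allocation sampler; it is a stand-in to show how pi0 and the local FDR fall out of such a fit.

```python
import numpy as np

def fit_pvalue_mixture(p, n_iter=200):
    """EM fit of f(p) = pi0 * Uniform(0,1) + (1 - pi0) * Beta(a, 1).
    Returns the estimated pi0, a and the local FDR for each p-value."""
    p = np.asarray(p, dtype=float)
    pi0, a = 0.8, 0.5                        # starting values
    for _ in range(n_iter):
        f1 = a * p ** (a - 1)                # alternative density Beta(a, 1)
        resp = pi0 / (pi0 + (1 - pi0) * f1)  # P(null | p) under the current fit
        pi0 = resp.mean()                    # update mixture weight of the null
        w = 1 - resp
        a = -w.sum() / np.sum(w * np.log(p)) # weighted MLE for the Beta shape
    lfdr = pi0 / (pi0 + (1 - pi0) * a * p ** (a - 1))
    return pi0, a, lfdr

# Simulated p-values: 90% true nulls, 10% drawn from a Beta(0.2, 1) alternative.
rng = np.random.default_rng(1)
p = np.concatenate([rng.random(9000), rng.beta(0.2, 1, 1000)])
pi0_hat, a_hat, lfdr = fit_pvalue_mixture(p)
print(round(pi0_hat, 3), round(a_hat, 3))
```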
20

Informative censoring in transplantation statistics

Staplin, Natalie January 2012 (has links)
Observations are informatively censored when there is dependence between the time to the event of interest and the time to censoring. When considering the time to death of patients on the waiting list for a transplant, particularly a liver transplant, patients that are removed for transplantation are potentially informatively censored, as generally the most ill patients are transplanted. If this censoring is assumed to be non-informative then any inferences may be misleading. The existing methods in the literature that account for informative censoring are applied to data to assess their suitability for the liver transplantation setting. As the amount of dependence between the time-to-failure and time-to-censoring variables cannot be identified from the observed data, estimators that give bounds on the marginal survival function for a given range of dependence values are considered. However, the bounds are too wide to be of use in practice. Sensitivity analyses are also reviewed, as these allow us to assess how inferences are affected by assuming differing amounts of dependence and whether methods that account for informative censoring are necessary. Of the other methods considered, IPCW estimators were found to be the most useful in practice. Sensitivity analyses for parametric models are less computationally intensive than those for Cox models, although they are not suitable for all sets of data. Therefore, we develop a sensitivity analysis for piecewise exponential models that is still quick to apply. These models are flexible enough to be suitable for a wide range of baseline hazards. The sensitivity analysis suggests that, for the liver transplantation setting, the inferences about time to failure are sensitive to informative censoring. A simulation study is carried out that shows that the sensitivity analysis is accurate in many situations, although not when there is a large proportion of censoring in the data set. Finally, a method to calculate the survival benefit of liver transplantation is adapted to make it more suitable for UK data. This method calculates the expected change in post-transplant mortality relative to waiting-list mortality. It uses IPCW methods to account for the informative censoring encountered when estimating waiting-list mortality, to ensure the estimated survival benefit is as accurate as possible.
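As background for the piecewise exponential models mentioned above, the following sketch fits a piecewise constant hazard by maximum likelihood (events divided by person-time within each interval) and evaluates the implied survival function. It is a basic illustration on made-up data; the sensitivity analysis for informative censoring and the IPCW weighting developed in the thesis are not included.

```python
import numpy as np

def piecewise_exponential_fit(times, events, cuts):
    """MLE of a piecewise constant hazard: within each interval defined by
    `cuts`, the hazard estimate is (number of events) / (person-time at risk).
    times: follow-up times; events: 1 = failure observed, 0 = censored."""
    times, events = np.asarray(times, float), np.asarray(events, int)
    edges = np.concatenate([[0.0], np.asarray(cuts, float), [np.inf]])
    hazards = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        exposure = np.clip(np.minimum(times, hi) - lo, 0, None).sum()
        d = np.sum(events[(times > lo) & (times <= hi)])
        hazards.append(d / exposure if exposure > 0 else 0.0)
    return edges, np.array(hazards)

def survival_at(t, edges, hazards):
    """S(t) = exp(-cumulative hazard) under the fitted piecewise model."""
    width = np.clip(np.minimum(t, edges[1:]) - edges[:-1], 0, None)
    return np.exp(-np.sum(width * hazards))

# Illustrative waiting-list data: follow-up time (days) and failure indicator.
times = [30, 45, 60, 90, 120, 150, 200, 250, 300, 400]
events = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
edges, hazards = piecewise_exponential_fit(times, events, cuts=[100, 250])
print(hazards, survival_at(180, edges, hazards))
```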
