1

Measuring the Stability of Results from Supervised Statistical Learning

Philipp, Michel, Rusch, Thomas, Hornik, Kurt, Strobl, Carolin 17 January 2017 (has links) (PDF)
Stability is a major requirement for drawing reliable conclusions when interpreting results from supervised statistical learning. In this paper, we present a general framework for assessing and comparing the stability of results that can be used in real-world statistical learning applications or in benchmark studies. We use the framework to show that stability is a property of both the algorithm and the data-generating process. In particular, we demonstrate that unstable algorithms (such as recursive partitioning) can produce stable results when the functional form of the relationship between the predictors and the response matches the algorithm. Typical uses of the framework in practice would be to compare the stability of results generated by different candidate algorithms for a data set at hand, or to assess the stability of algorithms in a benchmark study. Code to perform the stability analyses is provided in the form of an R package. / Series: Research Report Series / Department of Statistics and Mathematics
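The accompanying package is not named in this abstract, so the following is only a minimal R sketch of the underlying idea: refit a learner on resampled versions of the same data and score how similar the resulting predictions are. The stand-in learner (rpart), the agreement measure, and all settings are illustrative assumptions, not the authors' implementation.

```r
## Minimal sketch (assumptions as noted above): stability of a learner's results,
## measured as agreement between predictions from refits on resampled data.
library(rpart)

set.seed(1)
n <- 500
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- factor(ifelse(dat$x1 + rnorm(n, sd = 0.3) > 0.5, "A", "B"))

agreement <- replicate(100, {
  ## two independent bootstrap resamples from the same data set
  s1 <- dat[sample(n, replace = TRUE), ]
  s2 <- dat[sample(n, replace = TRUE), ]
  f1 <- rpart(y ~ ., data = s1)
  f2 <- rpart(y ~ ., data = s2)
  ## agreement of the two sets of predictions on the original observations
  mean(predict(f1, dat, type = "class") == predict(f2, dat, type = "class"))
})
summary(agreement)   # distribution of pairwise prediction agreement
```

Here the simulated relationship is well matched to a single split, so even an "unstable" tree learner tends to produce stable results, which is the kind of effect the framework is designed to quantify.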
2

Prospects and Challenges in R Package Development

Theußl, Stefan, Ligges, Uwe, Hornik, Kurt January 2010 (has links) (PDF)
R, a software package for statistical computing and graphics, has evolved into the lingua franca of (computational) statistics. One of the cornerstones of R's success is the decentralized and modularized way of creating software using a multi-tiered development model: The R Development Core Team provides the "base system", which delivers basic statistical functionality, and many other developers contribute code in the form of extensions in a standardized format via so-called packages. In order to be accessible to a broader audience, packages are made available via standardized source code repositories. To support such a loosely coupled development model, repositories should be able to verify that the provided packages meet certain formal quality criteria and "work", both as the base R system evolves and in combination with other packages (interoperability). However, established quality assurance systems and collaborative infrastructures typically face several challenges, some of which we will discuss in this paper. / Series: Research Report Series / Department of Statistics and Mathematics
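For a concrete picture of what repository-side checking involves, a hedged sketch using the check_packages_in_dir() utility from the tools package that ships with base R is given below; the directory path is a placeholder, and this illustrates the general mechanism rather than the infrastructure discussed in the paper.

```r
## Sketch: run the standard package checking machinery (the same machinery used
## by `R CMD check`) over all source packages found in a local repository directory.
library(tools)

res <- check_packages_in_dir("path/to/repo/src/contrib",   # placeholder path
                             check_args = "--as-cran")     # stricter CRAN-style checks
summary(res)   # overview of which packages produced errors, warnings, or notes
```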
3

Bayesian design and analysis of cluster randomized trials

Xiao, Shan 07 August 2017 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Cluster randomization is frequently used in clinical trials for convenience of interventional implementation and for reducing the risk of contamination. The operational convenience of cluster randomized trials, however, is gained at the expense of reduced analytical power. Compared to individually randomized studies, cluster randomized trials often have a much-reduced power. In this dissertation, I consider ways of enhancing analytical power with historical trial data. Specifically, I introduce a hierarchical Bayesian model that is designed to incorporate available information from previous trials of the same or similar interventions. Operationally, the amount of information gained from the previous trials is determined by a Kullback-Leibler divergence measure that quantifies the similarity, or lack thereof, between the historical and current trial data. More weight is given to the historical data if they more closely resemble the current trial data. Along this line, I examine the Type I error rates and analytical power associated with the proposed method, in comparison with the existing methods without utilizing the ancillary historical information. Similarly, to design a cluster randomized trial, one could estimate the power by simulating trial data and comparing them with the historical data from the published studies. Data analytical and power simulation methods are developed for more general situations of cluster randomized trials, with multiple arms and multiple types of data following the exponential family of distributions. An R package is developed for practical use of the methods in data analysis and trial design.
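The abstract does not give the exact form of the weighting, so the sketch below is only a hypothetical illustration of how a divergence-based borrowing weight could be computed: the closed-form Kullback-Leibler divergence between normal approximations of the historical and current cluster-level outcomes is mapped to a weight, with the exponential mapping being an assumption made here purely for illustration.

```r
## Hypothetical illustration: down-weight historical trial data according to a
## Kullback-Leibler divergence between normal approximations of the historical
## and current cluster-level outcome distributions.
kl_normal <- function(mu1, sd1, mu2, sd2) {
  ## KL( N(mu1, sd1^2) || N(mu2, sd2^2) ), closed form for two normals
  log(sd2 / sd1) + (sd1^2 + (mu1 - mu2)^2) / (2 * sd2^2) - 0.5
}

hist_means <- c(1.8, 2.1, 1.9, 2.4)   # cluster-level means, historical trial (made up)
curr_means <- c(2.0, 2.2, 1.7, 2.5)   # cluster-level means, current trial (made up)

kl <- kl_normal(mean(hist_means), sd(hist_means),
                mean(curr_means), sd(curr_means))

## map divergence to a borrowing weight in (0, 1]: more similar data, more weight
w <- exp(-kl)
w
```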
4

Bivariate Generalization of the Time-to-Event Conditional Reassessment Method with a Novel Adaptive Randomization Method

Yan, Donglin 01 January 2018 (has links)
Phase I clinical trials in oncology aim to evaluate the toxicity risk of new therapies and identify a safe but also effective dose for future studies. Traditional Phase I trials of chemotherapies focus on estimating the maximum tolerated dose (MTD). The rationale for finding the MTD is that better therapeutic effects are expected at higher dose levels as long as the risk of severe toxicity is acceptable. With the advent of a new generation of cancer treatments such as the molecularly targeted agents (MTAs) and immunotherapies, higher dose levels no longer guarantee increased therapeutic effects, and the focus has shifted to estimating the optimal biological dose (OBD). The OBD is the dose level with the highest biological activity and acceptable toxicity. The search for the OBD requires joint evaluation of toxicity and efficacy. Although several seamless phase I/II designs have been published in recent years, there is no consensus regarding an optimal design, and further improvement is needed for some designs to be widely used in practice. In this dissertation, we propose a modification to an existing seamless phase I/II design by Wages and Tait (2015) for locating the OBD based on binary outcomes, and extend it to time-to-event (TITE) endpoints. While the original design showed promising results, we hypothesized that performance could be improved by replacing the original adaptive randomization stage with a different randomization strategy. We propose to calculate dose-assigning probabilities by averaging all candidate models that fit the observed data reasonably well, as opposed to the original design, which bases all calculations on the single best-fit model. We propose three different strategies to select and average among candidate models, and use simulations to compare the proposed strategies to the original design. Under most scenarios, one of the proposed strategies allocates more patients to the optimal dose while improving accuracy in selecting the final optimal dose, without increasing the overall risk of toxicity. We further extend this design to TITE endpoints to address the potential issue of delayed outcomes. The original design is most appropriate when both toxicity and efficacy outcomes can be observed shortly after treatment, but delayed outcomes are common, especially for efficacy endpoints. The motivating example for this TITE extension is a Phase I/II study evaluating optimal dosing of all-trans retinoic acid (ATRA) in combination with a fixed dose of daratumumab in the treatment of relapsed or refractory multiple myeloma. The toxicity endpoint is observed within one cycle of therapy (i.e., 4 weeks), while the efficacy endpoint is assessed after 8 weeks of treatment. The difference in endpoint observation windows causes logistical challenges in conducting the trial, since it is not acceptable in practice to wait until both outcomes for each participant have been observed before sequentially assigning the dose of a newly eligible participant. The result would be a delay in treatment for patients and an undesirably long trial duration. To address this issue, we generalize the time-to-event continual reassessment method (TITE-CRM) to bivariate outcomes with a potentially non-monotonic dose-efficacy relationship. Simulation studies show that the proposed TITE design maintains a similar probability of selecting the correct OBD compared to the binary-outcome original design, but the number of patients treated at the OBD decreases as the rate of enrollment increases.
We also develop an R package for the proposed methods and document the R functions used in this research. The functions in this R package assist with implementation of the proposed randomization strategy and design. The input and output formats of these functions follow the formatting of existing R packages such as "dfcrm" and "pocrm" to allow direct comparison of results. Input parameters include efficacy skeletons, prior distributions of any model parameters, escalation restrictions, the design method, and observed data. Output includes the recommended dose level for the next patient, the MTD, estimated model parameters, and estimated probabilities for each set of skeletons. Simulation functions are included in this R package so that the proposed methods can be used to design a trial based on certain parameters and to assess performance. Parameters of these scenarios include the total sample size, the true dose-toxicity relationship, the true dose-efficacy relationship, the patient recruitment rate, and delays in toxicity and efficacy responses.
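The package's own function names are not listed in this abstract; since the interface is described as following packages such as dfcrm and pocrm, a minimal call to dfcrm's TITE-CRM fitting function is sketched below as a point of reference, with all patient data invented for illustration.

```r
## Reference point for the interface style (dfcrm's TITE-CRM); the skeleton,
## target, and patient data below are invented for illustration only.
library(dfcrm)

prior  <- c(0.05, 0.10, 0.20, 0.30, 0.50)  # skeleton of prior toxicity probabilities
target <- 0.25                             # target toxicity probability

tox    <- c(0, 0, 1, 0, 0, 1)              # toxicity indicators for 6 patients
level  <- c(1, 1, 2, 2, 3, 3)              # dose levels assigned to those patients
follow <- c(6, 6, 5, 4, 3, 1)              # weeks of follow-up observed so far
obswin <- 6                                # full toxicity observation window (weeks)

fit <- titecrm(prior, target, tox, level, followup = follow, obswin = obswin)
fit$mtd   # dose level recommended for the next patient
```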
5

EXAMINING THE CONFIRMATORY TETRAD ANALYSIS (CTA) AS A SOLUTION OF THE INADEQUACY OF TRADITIONAL STRUCTURAL EQUATION MODELING (SEM) FIT INDICES

Liu, Hangcheng 01 January 2018 (has links)
Structural Equation Modeling (SEM) is a framework of statistical methods that allows us to represent complex relationships between variables. SEM is widely used in economics, genetics, and the behavioral sciences (e.g., psychology, psychobiology, sociology, and medicine). Model complexity is defined as a model's ability to fit different data patterns, and it plays an important role in model selection when applying SEM. As in linear regression, the number of free model parameters is typically used in traditional SEM model fit indices as a measure of model complexity. However, using only the number of free model parameters to indicate SEM model complexity is crude, since other contributing factors, such as the type of constraint or the functional form, are ignored. To solve this problem, a special technique, Confirmatory Tetrad Analysis (CTA), is examined. A tetrad refers to the difference in the products of certain covariances (or correlations) among four random variables. A structural equation model often implies that some tetrads should be zero. These model-implied zero tetrads are called vanishing tetrads. In CTA, the goodness of fit can be determined by testing the null hypothesis that the model-implied vanishing tetrads are equal to zero. CTA can help improve model selection because different functional forms may affect the number of model-implied vanishing tetrads (t), and models that are not nested according to the traditional likelihood ratio test may be nested in terms of tetrads. In this dissertation, an R package was created to perform CTA, a two-step method was developed to determine SEM model complexity using simulated data, and it is demonstrated how the number of vanishing tetrads can be helpful for indicating SEM model complexity in some situations.
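For concreteness, the tetrad referred to above has the standard form from the tetrad literature (stated here for clarity; it is not specific to the dissertation's package):

```latex
\tau_{ghij} \;=\; \sigma_{gh}\,\sigma_{ij} \;-\; \sigma_{gi}\,\sigma_{hj},
\qquad \text{for example} \qquad
\tau_{1234} \;=\; \sigma_{12}\,\sigma_{34} \;-\; \sigma_{13}\,\sigma_{24}.
```

Under a one-factor model with loadings \lambda_i and uncorrelated errors, \sigma_{ij} = \lambda_i \lambda_j for i \neq j, so \tau_{1234} = \lambda_1\lambda_2\lambda_3\lambda_4 - \lambda_1\lambda_3\lambda_2\lambda_4 = 0; such model-implied zero tetrads are exactly the vanishing tetrads tested in CTA.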
6

Extreme Value Mixture Modelling with Simulation Study and Applications in Finance and Insurance

Hu, Yang January 2013 (has links)
Extreme value theory has been used to develop models for describing the distribution of rare events. Models based on extreme value theory can be used to asymptotically approximate the behavior of the tail(s) of the distribution function. An important challenge in the application of such extreme value models is the choice of a threshold, beyond which the asymptotically justified extreme value models can provide good extrapolation. One approach for determining the threshold is to fit all the available data with an extreme value mixture model. This thesis reviews most of the existing extreme value mixture models in the literature and implements them in a package for the statistical programming language R, to make them more readily usable by practitioners, as they are not commonly available in any software. There are many different forms of extreme value mixture models in the literature (e.g. parametric, semi-parametric and non-parametric), which provide an automated approach for estimating the threshold and taking into account the uncertainties associated with threshold selection. However, it is not clear how the proportion above the threshold, or tail fraction, should be treated, as there is no consistency in the existing model derivations. This thesis develops some new models by adapting existing ones in the literature and placing them all within a more generalised framework that takes into account how the tail fraction is defined in the model. Various new models are proposed by extending some of the existing parametric-form mixture models to have a continuous density at the threshold, which has the advantage of using fewer model parameters and being more physically plausible. The generalised framework within which all the mixture models are placed can be used to demonstrate the importance of the specification of the tail fraction. An R package called evmix has been created to enable these mixture models to be more easily applied and further developed. For every mixture model, the density, distribution, quantile, random number generation, likelihood and fitting functions are provided (Bayesian inference via MCMC is also implemented for the non-parametric extreme value mixture models). A simulation study investigates the performance of the various extreme value mixture models under different population distributions with a representative variety of lower and upper tail behaviors. The results show that the kernel density estimator based non-parametric mixture model is able to provide good tail estimation in general, whilst the parametric and semi-parametric mixture models can give a reasonable fit if the distribution below the threshold is correctly specified. Somewhat surprisingly, it is found that including a constraint of continuity at the threshold does not substantially improve the model fit in the upper tail. The hybrid Pareto model performs poorly as it does not include the tail fraction term. The relevant mixture models are applied to insurance and financial applications, which highlights the practical usefulness of these models.
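As a brief illustration of the interface described above, the sketch below fits the normal-bulk plus GPD-upper-tail mixture from evmix to simulated data and reads off the estimated threshold; the component names of the returned fit follow the package's documented conventions but should be treated as assumptions here.

```r
## Sketch: fit a normal bulk + GPD upper tail extreme value mixture model with
## evmix and extract the estimated threshold and GPD tail parameters.
library(evmix)

set.seed(1)
x <- rnorm(1000)            # simulated data with a normal bulk

fit <- fnormgpd(x)          # maximum likelihood fit of the normal + GPD mixture
fit$u                       # estimated threshold
fit$sigmau                  # estimated GPD scale above the threshold
fit$xi                      # estimated GPD shape

## evaluate the fitted mixture density on a grid using the estimated parameters
d <- dnormgpd(seq(-4, 6, by = 0.1), nmean = fit$nmean, nsd = fit$nsd,
              u = fit$u, sigmau = fit$sigmau, xi = fit$xi)
```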
7

A PREDICTIVE PROBABILITY INTERIM DESIGN FOR PHASE II CLINICAL TRIALS WITH CONTINUOUS ENDPOINTS

Liu, Meng 01 January 2017 (has links)
Phase II clinical trials aim to screen out ineffective therapies and identify effective ones to move forward to randomized phase III trials. Single-arm studies remain the most utilized design in phase II oncology trials, especially in scenarios where a randomized design is simply not practical. Due to concerns regarding excessive toxicity or ineffective new treatment strategies, interim analyses are typically incorporated in the trial, and the choice of statistical methods mainly depends on the type of primary endpoint. For oncology trials, the most common primary objectives in phase II trials include the tumor response rate (a binary endpoint) and progression-free survival (a time-to-event endpoint). Interim strategies are well developed for both endpoints in single-arm phase II trials. The advent of molecular targeted therapies, often with lower toxicity profiles than traditional cytotoxic treatments, has shifted the drug development paradigm toward establishing evidence of biological activity, target modulation and pharmacodynamic effects of these therapies in early phase trials. As such, these trials need to address simultaneous evaluation of safety as well as proof of concept of biological marker activity or changes in continuous tumor size instead of binary response rates. In this dissertation, we extend a predictive probability design for binary outcomes in the single-arm clinical trial setting and develop two interim designs for continuous endpoints, such as continuous tumor shrinkage or change in a biomarker over time. The two-stage design mainly focuses on futility stopping strategies, while also allowing early stopping for efficacy. Both optimal and minimax versions are presented for this two-stage design. The multi-stage design has the flexibility of stopping the trial early either for futility or for efficacy. Due to the intensive computation and searching strategy we adopt, only the minimax version is presented for the multi-stage design. The multi-stage design allows for up to 40 interim looks, with continuous monitoring possible for large and moderate effect sizes, requiring an overall sample size of less than 40. The stopping boundaries for both designs are based on predictive probability with a normal likelihood and its conjugate prior distributions, while the design itself satisfies the pre-specified type I and type II error rate constraints. Simulation results show that, when compared with designs for binary endpoints, both designs preserve the statistical properties well across different effect sizes with a reduced sample size. We also develop an R package, PPSC, detailed in chapter four, so that both designs can be freely accessible for use in future phase II clinical trials with the collaborative efforts of biostatisticians. Clinical investigators and biostatisticians have the flexibility to specify the parameters of the hypothesis testing framework, the search ranges of the predictive probability boundaries, the number of interim looks, and whether continuous monitoring is preferred.
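The exact boundary calculations belong to the dissertation's PPSC package; the sketch below only illustrates the general predictive probability idea for a continuous endpoint with a normal likelihood (known sigma) and a conjugate normal prior. The priors, thresholds, and sample sizes are invented for illustration and are not the package's defaults.

```r
## Illustration (not the PPSC implementation): predictive probability that a
## single-arm trial with a continuous endpoint ends in "success", assuming a
## normal likelihood with known sigma and a conjugate normal prior on the mean.
pred_prob <- function(x, N, sigma, mu0, tau0, delta, theta_T, nsim = 5000) {
  n <- length(x)                                  # patients observed at interim
  post_prec <- 1 / tau0^2 + n / sigma^2           # interim posterior precision
  post_mean <- (mu0 / tau0^2 + sum(x) / sigma^2) / post_prec
  post_sd   <- sqrt(1 / post_prec)

  m <- N - n                                      # patients still to be enrolled
  success <- replicate(nsim, {
    mu_draw     <- rnorm(1, post_mean, post_sd)        # draw from interim posterior
    xbar_future <- rnorm(1, mu_draw, sigma / sqrt(m))  # mean of the future data
    ## posterior after combining interim and simulated future data
    fin_prec <- 1 / tau0^2 + N / sigma^2
    fin_mean <- (mu0 / tau0^2 + (sum(x) + m * xbar_future) / sigma^2) / fin_prec
    ## final-analysis success: P(mu > delta | all data) exceeds theta_T
    (1 - pnorm(delta, fin_mean, sqrt(1 / fin_prec))) > theta_T
  })
  mean(success)                                   # predictive probability of success
}

set.seed(1)
interim <- rnorm(15, mean = 0.4, sd = 1)          # hypothetical interim data
pred_prob(interim, N = 40, sigma = 1, mu0 = 0, tau0 = 2, delta = 0, theta_T = 0.9)
```

In a design of this type, the trial would stop early for futility when this predictive probability falls below a low boundary and, optionally, for efficacy when it exceeds a high one, with the boundaries calibrated to the pre-specified type I and type II error rates.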
8

Semiparametric Regression Under Left-Truncated and Interval-Censored Competing Risks Data and Missing Cause of Failure

Park, Jun 04 1900 (has links)
Indiana University-Purdue University Indianapolis (IUPUI) / Observational studies and clinical trials with time-to-event data frequently involve multiple event types, known as competing risks. The cumulative incidence function (CIF) is a particularly useful parameter as it explicitly quantifies clinical prognosis. Common issues in competing risks data analysis on the CIF include interval censoring, missing event types, and left truncation. Interval censoring occurs when the event time is not observed but is only known to lie between two observation times, such as clinic visits. Left truncation, also known as delayed entry, is the phenomenon where certain participants enter the study after the onset of the disease under study. Individuals with an event prior to their potential study entry time are not included in the analysis, which can induce selection bias. In order to address unmet needs in appropriate methods and software for competing risks data analysis, this thesis focuses on the following developments in methods and their application. First, we develop a convenient and flexible tool, the R package intccr, that performs semiparametric regression analysis on the CIF for interval-censored competing risks data. Second, we adopt the augmented inverse probability weighting method to deal with both interval censoring and missing event types. We show that the resulting estimates are consistent and doubly robust. We illustrate this method using data from the East-African International Epidemiology Databases to Evaluate AIDS (IeDEA EA), where a significant portion of the event types is missing. Last, we develop an estimation method for semiparametric analysis on the CIF for competing risks data subject to both interval censoring and left truncation. This method is applied to the Indianapolis-Ibadan Dementia Project to identify prognostic factors of dementia in older adults. Overall, the methods developed here are incorporated in the R package intccr. / 2021-05-06
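For reference, the cumulative incidence function mentioned above is, for cause k, event time T, event type C, and covariates Z (a standard definition, stated here only for clarity):

```latex
F_k(t \mid Z) \;=\; \Pr\left(T \le t,\; C = k \mid Z\right), \qquad k = 1, \dots, K,
```

so it gives the probability of failing from cause k by time t in the presence of the competing causes.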
9

Advances on Dimension Reduction for Univariate and Multivariate Time Series

Mahappu Kankanamge, Tharindu Priyan De Alwis 01 August 2022 (has links) (PDF)
Advances in modern technologies have led to an abundance of high-dimensional time series data in many fields, including finance, economics, health, engineering, and meteorology, among others. This causes the “curse of dimensionality” problem in both univariate and multivariate time series data. The main objective of time series analysis is to make inferences about the conditional distributions. There are some methods in the literature to estimate the conditional mean and conditional variance functions in time series, but most of them are inefficient, computationally intensive, or suffer from overparameterization. We propose some dimension reduction techniques to address the curse of dimensionality in high-dimensional time series data.

For high-dimensional matrix-valued time series data, there are a limited number of methods in the literature that can preserve the matrix structure and reduce the number of parameters significantly (Samadi, 2014; Chen et al., 2021). However, those models cannot distinguish between relevant and irrelevant information and still suffer from overparameterization. We propose a novel dimension reduction technique for matrix-variate time series data called the "envelope matrix autoregressive model" (EMAR), which offers substantial dimension reduction and links the mean function and the covariance matrix of the model by using the minimal reducing subspace of the covariance matrix. The proposed model can identify and remove irrelevant information and can achieve substantial efficiency gains by significantly reducing the total number of parameters. We derive the asymptotic properties of the proposed maximum likelihood estimators of the EMAR model. Extensive simulation studies and a real data analysis are conducted to corroborate our theoretical results and to illustrate the finite sample performance of the proposed EMAR model.

For univariate time series, we propose sufficient dimension reduction (SDR) methods based on integral transformation approaches that can preserve sufficient information about the response. In particular, we use the Fourier and Convolution transformation methods (FM and CM) to perform sufficient dimension reduction in univariate time series and estimate the time series central subspace (TS-CS), the time series central mean subspace (TS-CMS), and the time series central variance subspace (TS-CVS). Using the FM and CM procedures and some distributional assumptions, we derive candidate matrices that can fully recover the TS-CS, TS-CMS, and TS-CVS, and propose explicit estimates of the candidate matrices. The asymptotic properties of the proposed estimators are established under both normality and non-normality assumptions. Moreover, we develop data-driven methods to estimate the dimension of the time series central subspaces as well as the lag order. Our simulation results and real data analyses reveal that the proposed methods are not only significantly more efficient and accurate but also offer substantial computational efficiency compared to the existing methods in the literature. Moreover, we develop an R package entitled “sdrt” to easily run the FM and CM procedures for estimating sufficient dimension reduction subspaces in univariate time series.
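As a point of reference on the central subspace terminology (this is the standard sufficient dimension reduction formulation, stated here for clarity rather than taken from the dissertation), write the vector of past values as X_{t-1} = (y_{t-1}, ..., y_{t-p})'. The time series central subspace is then the span of the fewest columns of a matrix B such that

```latex
y_t \;\perp\!\!\!\perp\; \mathbf{X}_{t-1} \;\big|\; B^\top \mathbf{X}_{t-1},
```

that is, the lower-dimensional projection B'X_{t-1} carries all the information the past contains about the conditional distribution of y_t; the TS-CMS and TS-CVS are defined analogously for the conditional mean and conditional variance.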
10

Methods and software to enhance statistical analysis in large scale problems in breeding and quantitative genetics

Pook, Torsten 27 June 2019 (has links)
No description available.
