11

Essays on Financial Economics

Liu, Yan January 2014 (has links)
In this thesis, I develop two sets of methods to help understand two distinct but also related issues in financial economics.

First, representative agent models have been successfully applied to explain asset market phenomena. They are often simple to work with and appeal to intuition by permitting a direct link between the agent's optimization behavior and asset market dynamics. However, their particular modeling choices sometimes yield undesirable or even counterintuitive consequences. Several diagnostic tools have been developed in the asset pricing literature to detect these unwanted consequences. I contribute to this literature by developing a new continuum of nonparametric asset pricing bounds to diagnose representative agent models. Chapter 1 lays down the theoretical framework and discusses its relevance to existing approaches. Empirically, it uses bounds implied by index option returns to study a well-known class of representative agent models: the rare disaster models. Chapter 2 builds on the insights of Chapter 1 to study dynamic models. It uses model-implied conditional variables to sharpen asset pricing bounds, allowing a more powerful diagnosis of dynamic models.

While the first two chapters focus on the diagnosis of a particular model, Chapters 3 and 4 study the joint inference of a group of models or risk factors. Drawing on multiple hypothesis testing in the statistics literature, Chapter 3 shows that many of the risk factors documented by the academic literature are likely to be false. It also proposes a new statistical framework to study multiple hypothesis testing under test correlation and hidden tests. Chapter 4 further studies the statistical properties of this framework through simulations. / Dissertation
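The thesis's continuum of nonparametric bounds is not reproduced here, but the classic Hansen–Jagannathan volatility bound is a related, well-known diagnostic of the same kind: from return data alone it places a lower bound on the volatility of any stochastic discount factor that prices the assets, a hurdle a candidate representative agent model must clear. A minimal sketch, assuming gross returns priced at one and using simulated data only:

```python
import numpy as np

def hj_bound(returns, m_mean):
    """Hansen-Jagannathan lower bound on the standard deviation of a
    stochastic discount factor m with E[m] = m_mean, given gross returns.

    returns: (T, N) array of gross asset returns.
    """
    mu = returns.mean(axis=0)              # E[R]
    sigma = np.cov(returns, rowvar=False)  # covariance of returns
    # Pricing errors if m were constant at m_mean: E[m R] should equal 1.
    err = np.ones_like(mu) - m_mean * mu
    # Minimum SDF variance consistent with pricing all assets.
    min_var = err @ np.linalg.solve(sigma, err)
    return np.sqrt(min_var)

# Usage with simulated returns (illustrative only).
rng = np.random.default_rng(0)
R = 1.0 + rng.normal(0.006, 0.04, size=(600, 5))   # hypothetical gross returns
for m_bar in (0.97, 0.98, 0.99):
    print(m_bar, hj_bound(R, m_bar))   # sigma(m) of a valid model must exceed this
```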
12

Multiple testing problems in classical clinical trial and adaptive designs

Deng, Xuan 07 November 2018 (has links)
Multiplicity issues arise prevalently in a variety of situations in clinical trials, and statistical methods for multiple testing have gradually gained importance with the increasing number of complex clinical trial designs. In general, two types of multiple testing can be performed (Dmitrienko et al., 2009): union-intersection testing (UIT) and intersection-union testing (IUT). The UIT is of interest in this dissertation; thus, the familywise error rate (FWER) is required to be controlled in the strong sense. A number of methods have been developed for controlling the FWER, including single-step and stepwise procedures. In single-step approaches, such as the simple Bonferroni method, the rejection decision for a hypothesis does not depend on the decision for any other hypothesis. Single-step approaches can be improved in terms of power through stepwise approaches, while still controlling the desired error rate. These procedures can also be improved further through a parametric approach. In the first project, we develop a new and powerful single-step progressive parametric multiple (SPPM) testing procedure for correlated normal test statistics. Through simulation studies, we demonstrate that SPPM improves power substantially when the correlation is moderate and/or the magnitudes of the effect sizes are similar.
Group sequential designs (GSDs) are clinical trials that allow interim looks with the possibility of early termination due to efficacy, harm or futility, which can reduce the overall costs and timelines for the development of a new drug. However, repeated looks at the data also raise multiplicity issues and can inflate the type I error rate. Proper treatment of this error inflation has been discussed widely (Pocock, 1977; O'Brien and Fleming, 1979; Wang and Tsiatis, 1987; Lan and DeMets, 1983). Most of the literature on GSDs focuses on a single endpoint; GSDs with multiple endpoints, however, have also received considerable attention. The main focus of our second project is a GSD with multiple primary endpoints, in which the trial is designed to evaluate whether at least one of the endpoints is statistically significant. In this study design, multiplicity issues arise from both the repeated interim analyses and the multiple endpoints, so appropriate adjustments must be made to control the type I error rate. Our second purpose here is to show that the combination of multiple endpoints and repeated interim analyses can lead to a more powerful design. Using the multivariate normal distribution, we propose a method that allows simultaneous consideration of the interim analyses and all clinical endpoints. The new approach is derived from the closure principle and therefore controls the type I error rate in the strong sense. We evaluate the power under different scenarios and show that the method compares favorably to other methods when the correlation among endpoints is non-zero.
Within the group sequential design framework, another interesting topic is the multiple-arm multiple-stage (MAMS) design, in which multiple arms are involved in the trial from the beginning, with flexibility regarding treatment selection or stopping decisions at the interim analyses. One of the major hurdles of MAMS designs is the computational cost, which grows with the number of arms and interim looks. Various designs have been proposed to overcome this difficulty (Thall et al., 1988; Schaid et al., 1990; Follmann et al., 1994; Stallard and Todd, 2003; Stallard and Friede, 2008; Magirr et al., 2012; Wason et al., 2017) while still controlling the FWER against the potential inflation from multiple-arm comparisons and multiple interim tests. Here, we consider a more flexible drop-the-loser design that allows safety information to inform treatment selection without a pre-specified arm-dropping mechanism, while still retaining reasonably high power. Two different types of stopping boundaries are proposed for such a design. The sample size is also adjustable if the winning arm is dropped due to safety considerations.
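As background to the single-step versus stepwise distinction, the sketch below contrasts the simple Bonferroni correction with Holm's step-down procedure, which rejects at least as many hypotheses while still controlling the FWER in the strong sense. It is a generic illustration, not the SPPM procedure proposed in the dissertation.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Single-step Bonferroni: reject H_i iff p_i <= alpha / m."""
    p = np.asarray(pvals)
    return p <= alpha / p.size

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value
    to alpha / (m - k), stopping at the first failure."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    reject = np.zeros(p.size, dtype=bool)
    for k, idx in enumerate(order):
        if p[idx] <= alpha / (p.size - k):
            reject[idx] = True
        else:
            break
    return reject

pvals = [0.001, 0.012, 0.014, 0.20, 0.35]
print(bonferroni(pvals))  # [ True False False False False]
print(holm(pvals))        # [ True  True  True False False]
```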
13

Simultaneous Inference for High Dimensional and Correlated Data

Polin, Afroza 22 August 2019 (has links)
No description available.
14

Multiple testing & optimization-based approaches with applications to genome-wide association studies

Posner, Daniel Charles 07 December 2019 (has links)
Many phenotypic traits are heritable, but the exact genetic causes are difficult to determine. A common approach for disentangling the different genetic factors is to conduct a "genome-wide association study" (GWAS), where each single nucleotide variant (SNV) is tested for association with a trait of interest. Many SNVs for complex traits have been found by GWAS, but to date they explain only a fraction of heritability of complex traits. In this dissertation, we propose novel optimization-based and multiple testing procedures for variant set tests. In the second chapter, we propose a novel variant set test, convex-optimized SKAT (cSKAT), that leverages multiple SNV annotations. The test generalizes SKAT to convex combinations of SKAT statistics constructed from functional genomic annotations. We differ from previous approaches by optimizing kernel weights with a multiple kernel learning algorithm. In cSKAT, the contribution of each variant to the overall statistic is a product of annotation values and kernel weights for annotation classes. We demonstrate the utility of our biologically-informed SNV weights in a rare-variant analysis of fasting glucose in the FHS. In the third chapter, we propose a sequential testing procedure for GWAS that joins tests of single SNVs and groups of SNVs (SNV-sets) with common biological function. The proposed procedure differs from previous procedures by testing genes and sliding 4kb intergenic windows rather than chromosomes or the whole genome. We also sharpen an existing tree-based multiple testing correction by incorporating correlation between SNVs, which is present in any SNV-set containing contiguous regions (such as genes). In the fourth chapter, we present a sequential testing procedure for SNV-sets that incorporates correlation between test statistics of the SNV-sets. At each step of the procedure, the multiplicity correction is the number of remaining independent tests, making no assumption about the null distribution of tests. We provide an estimator for the number of remaining independent tests based on previous work in single-SNV GWAS and demonstrate the estimator is valid for sequential procedures. We implement the proposed method for GWAS by sequentially testing chromosomes, genes, 4kb windows, and SNVs.
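A rough sketch of the combination idea described above, under the usual SKAT form of the variance-component score statistic: one statistic is built per annotation class and the statistics are merged with a convex set of kernel weights. The multiple-kernel-learning step that cSKAT uses to choose those weights, and the p-value computation, are omitted; all data and weights below are simulated placeholders.

```python
import numpy as np

def skat_statistic(resid, G, w):
    """SKAT-type variance-component score statistic for one kernel.

    resid: (n,) residuals y - mu_hat from the null (covariate-only) model.
    G:     (n, p) genotype matrix for the variant set.
    w:     (p,) per-variant weights derived from one annotation class.
    """
    Gw = G * w              # weight each variant column
    s = Gw.T @ resid        # per-variant score contributions
    return float(s @ s)     # equals resid' (Gw Gw') resid

def combined_statistic(resid, G, annot_weights, kernel_weights):
    """Convex combination of annotation-specific SKAT statistics.
    kernel_weights are assumed non-negative and summing to one; in cSKAT
    they would come from a multiple-kernel-learning step, not shown here."""
    stats = [skat_statistic(resid, G, w) for w in annot_weights]
    return float(np.dot(kernel_weights, stats))

# Toy usage with simulated data (illustrative only).
rng = np.random.default_rng(1)
n, p = 200, 10
G = rng.binomial(2, 0.05, size=(n, p)).astype(float)      # rare-variant genotypes
resid = rng.normal(size=n)                                  # null-model residuals
annot_weights = [rng.uniform(0, 1, p) for _ in range(3)]    # 3 annotation classes
print(combined_statistic(resid, G, annot_weights, [0.5, 0.3, 0.2]))
```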
15

Multiple Testing in Grouped Dependent Data

Clements, Nicolle January 2013 (has links)
This dissertation is focused on multiple testing procedures for data that are naturally grouped or possess a spatial structure. We propose a 'Two-Stage' procedure to control the False Discovery Rate (FDR) in situations where one-sided hypothesis testing is appropriate, such as astronomical source detection. Similarly, we propose a 'Three-Stage' procedure to control the mixed directional False Discovery Rate (mdFDR) in situations where two-sided hypothesis testing is appropriate, such as vegetation monitoring in remote sensing NDVI data. The Two- and Three-Stage procedures have provable FDR/mdFDR control under certain dependence conditions. We also present adaptive versions, which are examined through simulation studies. The 'stages' refer to testing hypotheses both group-wise and individually, motivated by the belief that the dependencies among the p-values associated with spatially oriented hypotheses occur more locally than globally. Thus, these staged procedures test hypotheses in groups that incorporate the local, unknown dependencies of neighboring p-values. If a group is found significant, the individual p-values within that group are investigated further. For the vegetation monitoring data, we extend the investigation by providing spatio-temporal models and forecasts for regions where significant change was detected through the multiple testing procedure. / Statistics
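The sketch below conveys the general group-then-individual idea behind such staged procedures: group-level p-values (combined here with Simes' method) are screened with a Benjamini-Hochberg step, and individual hypotheses are examined only inside the groups that survive. It illustrates the principle only; the dissertation's Two-Stage and Three-Stage procedures use their own thresholds and carry the stated FDR/mdFDR guarantees.

```python
import numpy as np

def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k
    is the largest index with p_(k) <= k * alpha / m."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, p.size + 1) / p.size
    below = p[order] <= thresh
    reject = np.zeros(p.size, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

def simes(pvals):
    """Simes combination of the p-values within one group."""
    p = np.sort(np.asarray(pvals))
    return float(np.min(p * p.size / np.arange(1, p.size + 1)))

def two_stage(groups, alpha=0.05):
    """Stage 1: BH on group-level Simes p-values.
    Stage 2: BH within each selected group only."""
    group_p = np.array([simes(g) for g in groups])
    selected = bh_reject(group_p, alpha)
    return {i: bh_reject(groups[i], alpha)
            for i in np.nonzero(selected)[0]}

# Toy usage: 3 spatial groups of p-values (illustrative only).
groups = [np.array([0.001, 0.03, 0.20]),
          np.array([0.40, 0.55, 0.70]),
          np.array([0.002, 0.004, 0.60])]
print(two_stage(groups))
```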
16

Dissecting genetic interactions in complex traits

Hemani, Gibran January 2012 (has links)
Of central importance in the dissection of the components that govern complex traits is understanding the architecture of natural genetic variation. Genetic interaction, or epistasis, constitutes one aspect of this, but epistatic analysis has been largely avoided in genome-wide association studies because of statistical and computational difficulties. This thesis explores both issues in the context of two-locus interactions. Initially, through simulation and deterministic calculations, it was demonstrated that not only can epistasis maintain deleterious mutations at intermediate frequencies when under selection, but that it may also have a role in the maintenance of additive variance. Based on the epistatic patterns that are evolutionarily persistent, and the frequencies at which they are maintained, it was shown that exhaustive two-dimensional search strategies are the most powerful approaches for uncovering both additive variance and the other genetic variance components that are co-precipitated. However, while these simulations demonstrate encouraging statistical benefits, two-dimensional searches are often computationally prohibitive, particularly with the marker densities and sample sizes that are typical of genome-wide association studies. To address this issue, different software implementations were developed to parallelise the two-dimensional triangular search grid across various types of high-performance computing hardware. Of these, particularly effective was the use of the massively multi-core architecture of consumer-level graphics cards. While the performance will continue to improve as hardware improves, at the time of testing the speed was 2-3 orders of magnitude faster than CPU-based software solutions in current use. Not only does this software enable epistatic scans to be performed routinely at minimal cost, but it is now feasible to empirically explore the false discovery rates introduced by the high dimensionality of multiple testing. Through permutation analysis it was shown that the significance threshold for epistatic searches is a function of both marker density and population sample size, and that, because of the correlation structure that exists between tests, the threshold estimates currently used are overly stringent. Although the relaxed threshold estimates constitute an improvement in the power of two-dimensional searches, detection is still most likely limited to relatively large genetic effects. Through direct calculation it was shown that, in contrast to the additive case where the decay of estimated genetic variance is proportional to falling linkage disequilibrium between causal variants and observed markers, for epistasis this decay is exponential. One way to rescue poorly captured causal variants is to parameterise association tests using haplotypes rather than single markers. A novel statistical method that uses a regularised parameter selection procedure on two-locus haplotypes was developed, and through extensive simulations it is shown to deliver a substantial gain in power over single-marker based tests. Ultimately, this thesis seeks to demonstrate that many of the obstacles in epistatic analysis can be ameliorated, and that, with the current abundance of genomic data gathered by the scientific community, direct search may be a viable method to qualify the importance of epistasis.
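To make the computational argument concrete, the sketch below runs a naive exhaustive two-locus scan: for every SNP pair it fits a linear model with and without an interaction term and records an F-test p-value for the interaction. The number of fits grows as m(m-1)/2 with the number of markers m, which is the cost the GPU implementations described in the thesis parallelise; this is a plain CPU illustration, not the thesis's software.

```python
import itertools
import numpy as np
from scipy import stats

def interaction_pvalue(y, g1, g2):
    """F-test comparing y ~ g1 + g2 against y ~ g1 + g2 + g1*g2."""
    n = y.size
    X0 = np.column_stack([np.ones(n), g1, g2])
    X1 = np.column_stack([X0, g1 * g2])
    rss0 = np.sum((y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]) ** 2)
    rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
    f = (rss0 - rss1) / (rss1 / (n - X1.shape[1]))
    return stats.f.sf(f, 1, n - X1.shape[1])

def epistasis_scan(y, genotypes):
    """Exhaustive scan over all m*(m-1)/2 SNP pairs."""
    m = genotypes.shape[1]
    return {(i, j): interaction_pvalue(y, genotypes[:, i], genotypes[:, j])
            for i, j in itertools.combinations(range(m), 2)}

# Toy usage (illustrative only): 500 individuals, 20 SNPs.
rng = np.random.default_rng(2)
G = rng.binomial(2, 0.3, size=(500, 20)).astype(float)
y = 0.5 * G[:, 3] * G[:, 7] + rng.normal(size=500)   # planted interaction
pvals = epistasis_scan(y, G)
print(min(pvals, key=pvals.get))   # expected to be (3, 7)
```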
17

Multiple Change-Point Detection: A Selective Overview

Niu, Yue S., Hao, Ning, Zhang, Heping 11 1900 (has links)
Very long and noisy sequence data arise in fields ranging from the biological sciences to the social sciences, including high-throughput data in genomics and stock prices in econometrics. Often such data are collected in order to identify and understand shifts in trends, for example, from a bull market to a bear market in finance or from a normal number of chromosome copies to an excessive number of chromosome copies in genetics. Thus, identifying multiple change points in a long, possibly very long, sequence is an important problem. In this article, we review both classical and new multiple change-point detection strategies. Considering the long history and the extensive literature on change-point detection, we provide an in-depth discussion of a normal mean change-point model from the perspectives of regression analysis, hypothesis testing, consistency and inference. In particular, we present a strategy to gather and aggregate local information for change-point detection that has become the cornerstone of several emerging methods because of its attractive computational and theoretical properties.
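As a concrete instance of gathering and aggregating local information, the sketch below computes, at each position, the difference between the means of a window to the left and a window to the right, and keeps local maxima of the absolute difference that exceed a threshold. It captures the spirit of local screening for the normal mean change-point model; the window size and threshold are arbitrary choices here, not values recommended in the review.

```python
import numpy as np

def local_diagnostic(x, h):
    """D(t) = mean(x[t:t+h]) - mean(x[t-h:t]) for each admissible position t."""
    csum = np.concatenate([[0.0], np.cumsum(x)])
    t = np.arange(h, x.size - h + 1)
    left = (csum[t] - csum[t - h]) / h
    right = (csum[t + h] - csum[t]) / h
    return t, right - left

def screen_changepoints(x, h=20, threshold=0.5):
    """Keep positions where |D(t)| is a local maximum above the threshold."""
    t, d = local_diagnostic(x, h)
    ad = np.abs(d)
    return [t[i] for i in range(1, ad.size - 1)
            if ad[i] >= threshold and ad[i] >= ad[i - 1] and ad[i] >= ad[i + 1]]

# Toy usage: piecewise-constant mean plus noise (illustrative only).
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1, 300),
                    rng.normal(1.5, 1, 200),
                    rng.normal(0.2, 1, 300)])
print(screen_changepoints(x, h=30, threshold=0.8))   # expected near 300 and 500
```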
18

Unbiased Recursive Partitioning: A Conditional Inference Framework

Hothorn, Torsten, Hornik, Kurt, Zeileis, Achim January 2004 (has links) (PDF)
Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of the exhaustive search procedures usually applied to fit such models have been known for a long time: overfitting and a selection bias towards covariates with many possible splits or missing values. While pruning procedures are able to solve the overfitting problem, the variable selection bias still seriously affects the interpretability of tree-structured regression models. For some special cases, unbiased procedures have been suggested; however, they lack a common theoretical foundation. We propose a unified framework for recursive partitioning which embeds tree-structured regression models into a well-defined theory of conditional inference procedures. Stopping criteria based on multiple test procedures are implemented, and it is shown that the predictive performance of the resulting trees is as good as the performance of established exhaustive search procedures. It turns out that the partitions, and therefore the models, induced by the two approaches are structurally different, indicating the need for an unbiased variable selection. The methodology presented here is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored and multivariate response variables and arbitrary measurement scales of the covariates. Data from studies on animal abundance, glaucoma classification, node-positive breast cancer and mammography experience are re-analyzed. / Series: Research Report Series / Department of Statistics and Mathematics
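A minimal sketch of the stopping idea in this framework: at each node, test the association between every covariate and the response with a permutation test, adjust across covariates with Bonferroni, and refuse to split when the adjusted global null cannot be rejected. The split-point search and the full conditional inference machinery of the paper (implemented in the R packages party/partykit) are omitted; the data below are simulated placeholders.

```python
import numpy as np

def perm_pvalue(x, y, n_perm=999, rng=None):
    """Permutation p-value for the absolute covariance between x and y."""
    rng = rng or np.random.default_rng()
    obs = abs(np.cov(x, y)[0, 1])
    perms = [abs(np.cov(rng.permutation(x), y)[0, 1]) for _ in range(n_perm)]
    return (1 + sum(p >= obs for p in perms)) / (n_perm + 1)

def select_split_variable(X, y, alpha=0.05, rng=None):
    """Return the index of the most significant covariate, or None if the
    Bonferroni-adjusted global null cannot be rejected (stop splitting)."""
    pvals = np.array([perm_pvalue(X[:, j], y, rng=rng) for j in range(X.shape[1])])
    j = int(np.argmin(pvals))
    return j if pvals[j] * X.shape[1] <= alpha else None

# Toy usage (illustrative only): only the first covariate is informative.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(size=200)
print(select_split_variable(X, y, rng=rng))   # expected 0
```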
19

Evaluation of statistical methods, modeling, and multiple testing in RNA-seq studies

Choi, Seung Hoan 12 August 2016 (has links)
Recent next-generation sequencing methods provide counts of RNA molecules in the form of short reads, yielding discrete, often highly non-normally distributed gene expression measurements. Due to this feature of RNA sequencing (RNA-seq) data, appropriate statistical inference methods are required. Although Negative Binomial (NB) regression has been generally accepted for the analysis of RNA-seq data, its appropriateness for genetic studies has not been exhaustively evaluated. Additionally, adjusting for covariates that have an unknown relationship with the expression of a gene has not been extensively evaluated in RNA-seq studies using the NB framework. Finally, the dependence structures in RNA-seq data may violate the assumptions of some multiple testing correction methods. In this dissertation, we suggest an alternative regression method, evaluate the effect of covariates, and compare various multiple testing correction methods. We conduct simulation studies and apply these methods to a real data set. First, we suggest Firth's logistic regression for detecting differentially expressed genes in RNA-seq data. We also recommend a data-adaptive method that estimates a recalibrated distribution of test statistics. Firth's logistic regression exhibits an appropriately controlled Type I error rate under the data-adaptive method and shows power comparable to NB regression in simulation studies. Next, we evaluate the effect of disease-associated covariates when the relationship between the covariate and gene expression is unknown. Although the power of NB and Firth's logistic regression decreases as disease-associated covariates are added to a model, Type I error rates are well controlled in Firth's logistic regression if the relationship between a covariate and disease is not strong. Finally, we compare multiple testing correction methods that control the family-wise error rate or the false discovery rate. The evaluation reveals that an understanding of study designs, of RNA-seq data, and of the consequences of applying specific regression and multiple testing correction methods is essential for controlling family-wise error rates or false discovery rates. We believe our statistical investigations will enrich gene expression studies and influence related statistical methods.
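For readers unfamiliar with it, the sketch below implements the core of Firth's bias-reduced logistic regression: each Newton step uses the score adjusted by the leverages h_i, which keeps the estimates finite even under complete separation, a situation that can arise when a rare genotype or a low-count gene is compared across groups. It is a bare-bones illustration without the step-halving safeguards of production implementations, and it is not the dissertation's code.

```python
import numpy as np

def firth_logistic(X, y, n_iter=50, tol=1e-8):
    """Firth's bias-reduced logistic regression via modified Newton steps.

    X: (n, p) design matrix including an intercept column; y: (n,) 0/1.
    The Firth adjustment adds h_i * (0.5 - pi_i) to each observation's
    score contribution, where h_i is the i-th leverage under the weights
    w_i = pi_i * (1 - pi_i).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        XtWX_inv = np.linalg.inv(X.T @ (X * w[:, None]))
        A = X * np.sqrt(w)[:, None]
        h = np.einsum("ij,jk,ik->i", A, XtWX_inv, A)   # leverages of W^1/2 X
        score = X.T @ (y - pi + h * (0.5 - pi))        # Firth-adjusted score
        step = XtWX_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy usage with complete separation, where ordinary maximum likelihood
# diverges but Firth's estimate stays finite (illustrative only).
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = (x > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(firth_logistic(X, y))
```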
20

Statistical Feature Selection : With Applications in Life Science

Nilsson, Roland January 2007 (has links)
The sequencing of the human genome has changed life science research in many ways. Novel measurement technologies such as microarray expression analysis, genome-wide SNP typing and mass spectrometry are now producing experimental data of extremely high dimensions. While these techniques provide unprecedented opportunities for exploratory data analysis, the increase in dimensionality also introduces many difficulties. A key problem is to discover the most relevant variables, or features, among the tens of thousands of parallel measurements in a particular experiment. This is referred to as feature selection. For feature selection to be principled, one needs to decide exactly what it means for a feature to be "relevant". This thesis considers relevance from a statistical viewpoint, as a measure of statistical dependence on a given target variable. The target variable might be continuous, such as a patient's blood glucose level, or categorical, such as "smoker" vs. "non-smoker". Several forms of relevance are examined and related to each other to form a coherent theory. Each form of relevance then defines a different feature selection problem. The predictive features are those that allow an accurate predictive model, for example for disease diagnosis. I prove that finding predictive features is a tractable problem, in that consistent estimates can be computed in polynomial time. This is a substantial improvement upon current theory. However, I also demonstrate that selecting features to optimize prediction accuracy does not control feature error rates. This is a severe drawback in life science, where the selected features per se are important, for example as candidate drug targets. To address this problem, I propose a statistical method which to my knowledge is the first to achieve error control. Moreover, I show that in high dimensions, feature sets can be impossible to replicate in independent experiments even with controlled error rates. This finding may explain the lack of agreement among genome-wide association studies and molecular signatures of disease. The most predictive features may not always be the most relevant ones from a biological perspective, since the predictive power of a given feature may depend on measurement noise rather than biological properties. I therefore consider a wider definition of relevance that avoids this problem. The resulting feature selection problem is shown to be asymptotically intractable in the general case; however, I derive a set of simplifying assumptions which admit an intuitive, consistent polynomial-time algorithm. Moreover, I present a method that controls error rates also for this problem. This algorithm is evaluated on microarray data from case studies in diabetes and cancer. In some cases however, I find that these statistical relevance concepts are insufficient to prioritize among candidate features in a biologically reasonable manner. Therefore, effective feature selection for life science requires both a careful definition of relevance and a principled integration of existing biological knowledge. / The sequencing of the human genome in the early 2000s, together with the later sequencing projects for various model organisms, has enabled revolutionary new biological measurement methods that cover entire genomes. Microarrays, mass spectrometry and SNP typing are examples of such measurement methods. These methods generate very high-dimensional data. A central problem in modern biological research is therefore to identify the relevant variables among these thousands of measurements. This is called feature selection. To study feature selection systematically, an exact definition of the concept of "relevance" is necessary. In this thesis, relevance is treated from a statistical point of view: "relevance" means a statistical dependence on a target variable; this can be continuous, for example a blood pressure measurement on a patient, or discrete, for example an indicator variable such as "smoker" or "non-smoker". Different forms of relevance are treated and a coherent theory is presented. Each definition of relevance then gives rise to a specific feature selection problem. Predictive features are those that can be used to construct prediction models, which is important, for example, in clinical diagnostic systems. It is proved that a consistent estimate of such features can be computed in polynomial time, so that feature selection is feasible within reasonable computation time. This is a breakthrough compared with earlier research. However, it is also shown that methods for optimizing prediction models often yield high proportions of irrelevant features, which is very problematic in biological research. A new feature selection method is therefore presented for which the relevance of the selected features is statistically guaranteed. In this context it is also shown that feature selection methods are not reproducible in the usual sense in high dimensions, even when relevance is statistically guaranteed. This partly explains why genome-wide genetic association studies have so far been difficult to reproduce. The case where all relevant features are sought is also treated. This problem is proved to require exponential computation time in the general case; however, a method is presented that solves the problem in polynomial time under certain statistical assumptions that can be considered reasonable for biological data. Here too the problem of false positives is taken into account, and a statistical method that guarantees relevance is presented. This method is applied to case studies in type 2 diabetes and cancer. In some cases, however, the set of relevant features is very large, and statistical treatment of a single data type is then insufficient. In such situations it is important to make use of different data sources as well as existing biological knowledge in order to single out the most important findings.