11

Statistical methods for deep sequencing data

Shen, Shihao 01 December 2012 (has links)
Ultra-deep RNA sequencing has become a powerful approach for genome-wide analysis of pre-mRNA alternative splicing. We develop MATS (Multivariate Analysis of Transcript Splicing), a Bayesian statistical framework for flexible hypothesis testing of differential alternative splicing patterns on RNA-Seq data. MATS uses a multivariate uniform prior to model the between-sample correlation in exon splicing patterns, and a Markov chain Monte Carlo (MCMC) method coupled with a simulation-based adaptive sampling procedure to calculate the P value and false discovery rate (FDR) of differential alternative splicing. Importantly, the MATS approach is applicable to almost any type of null hypotheses of interest, providing the flexibility to identify differential alternative splicing events that match a given user-defined pattern. We evaluated the performance of MATS using simulated and real RNA-Seq data sets. In the RNA-Seq analysis of alternative splicing events regulated by the epithelial-specific splicing factor ESRP1, we obtained a high RT-PCR validation rate of 86% for differential alternative splicing events with a MATS FDR of < 10%. Additionally, over the full list of RT-PCR tested exons, the MATS FDR estimates matched well with the experimental validation rate. Our results demonstrate that MATS is an effective and flexible approach for detecting differential alternative splicing from RNA-Seq data.
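Hypothetical illustration only: MATS reports a P value and FDR per splicing event via MCMC, and the generic sketch below shows one standard way to turn posterior probabilities of "no differential splicing" into a Bayesian FDR estimate. The posterior values and the 10% cutoff are invented for the example; this is not the MATS algorithm itself.

```python
# Generic Bayesian FDR from posterior null probabilities (not the MATS algorithm).
import numpy as np

def bayesian_fdr(post_null):
    """Estimated FDR of each rejection set = mean posterior null probability
    among the events rejected at that cutoff."""
    order = np.argsort(post_null)                      # strongest events first
    running_fdr = np.cumsum(post_null[order]) / np.arange(1, post_null.size + 1)
    fdr = np.empty_like(running_fdr)
    fdr[order] = running_fdr                           # map back to original order
    return fdr

rng = np.random.default_rng(0)
post_null = rng.uniform(0.0, 1.0, 500)                 # toy P(no differential splicing)
fdr = bayesian_fdr(post_null)
print((fdr < 0.10).sum(), "events called at estimated FDR < 10%")
```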
12

Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray Analysis

Manandhr-Shrestha, Nabin K. 29 March 2010 (has links)
The aim of the present study is to identify the differentially expressed genes between two different conditions and to use them to predict the class of new samples from microarray data. Microarray data analysis poses many challenges to statisticians because of its high dimensionality and small sample size, the so-called "small n, large p" problem. Microarray data are commonly assumed to follow a normal distribution with equal variances in the two conditions, but this assumption does not hold in general. Because the number of replications is very small, the sample variance estimates are unreliable for testing, so a Bayesian approach is used to approximate the variances in the two conditions. Because the number of genes to be tested is large and the test is repeated thousands of times, there is a multiplicity problem: applying a hypothesis test gene by gene across several thousand genes gives a high chance of selecting false genes as differentially expressed, even when the significance level is set very small. To control the false positive rate we apply a False Discovery Rate (FDR) correction, in which the p-value for each gene is compared with its corresponding threshold; a gene is declared differentially expressed if its p-value is less than that threshold. We develop a new method of selecting informative genes based on a Bayesian version of the Behrens-Fisher distribution, which allows unequal variances in the two conditions. Since the equal-variance assumption fails in most situations, and equal variances are a special case of unequal variances, we address the problem of finding differentially expressed genes in the unequal-variance setting. The developed method recovers the truly expressed genes in simulated data, and we compare it with recent methods such as Fox and Dimmic's t-test and Tusher and Tibshirani's SAM method, among others. The next step of this research is to check whether the genes selected by the proposed Behrens-Fisher method are useful for classifying samples. Combining the Behrens-Fisher gene selection with other statistical learning methods yields better classification results, owing to its ability to select genes using both the prior and the data. With microarray data, the small sample size and large number of variables mean that the sample covariance matrix is unreliable, in the sense that it is not positive definite and not invertible; the Bayesian version of the Behrens-Fisher distribution is derived to remedy this deficiency. The efficiency of the established method is demonstrated by applying it to three real microarray data sets and computing misclassification error rates on the corresponding test sets, and the results are compared with popular methods from the literature such as the Nearest Shrunken Centroid and Support Vector Machines.
We also study the classification performance of different classifiers before and after accounting for the correlation between genes; performance, measured by misclassification rates and the confusion matrix, improves significantly once the correlation is taken into account. A further problem in the multiple testing of a large number of hypotheses is the correlation among the test statistics. If there were no correlation, it would not affect the shape of the normalized histogram of the test statistics; as shown by Efron, the degree of correlation among the test statistics either widens or narrows the tails of that histogram. The usual rejection region obtained from the significance level is therefore not sufficient and should be redefined according to the degree of correlation. The effect of correlation on selecting an appropriate rejection region is also studied.
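The FDR correction described above, comparing each ordered p-value with its own threshold, is the Benjamini-Hochberg step-up rule. A minimal sketch follows, assuming simulated p-values rather than real microarray data.

```python
# Benjamini-Hochberg step-up rule: each ordered p-value p_(i) is compared with
# its threshold (i/m)*q, and every gene up to the largest i with
# p_(i) <= (i/m)*q is declared differentially expressed.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(pvals[order] <= q * np.arange(1, m + 1) / m)[0]
    rejected = np.zeros(m, dtype=bool)
    if passed.size:
        rejected[order[: passed[-1] + 1]] = True   # step-up: reject all up to the last pass
    return rejected

rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(0, 1e-3, 50),   # "differentially expressed" genes
                        rng.uniform(0, 1, 950)])    # null genes
print(benjamini_hochberg(pvals, q=0.05).sum(), "genes declared significant at FDR 5%")
```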
13

An Application of Armitage Trend Test to Genome-wide Association Studies

Scott, Nigel A 17 July 2009 (has links)
Genome-wide association (GWA) studies have become a widely used method for analyzing genetic data and are useful for detecting associations that may exist between particular alleles and diseases of interest. This thesis investigates the dataset provided in Problem 1 of the Genetic Analysis Workshop 16 (GAW 16), which consists of GWA data from the North American Rheumatoid Arthritis Consortium (NARAC). The thesis attempts to determine a set of single nucleotide polymorphisms (SNPs) that are significantly associated with rheumatoid arthritis. It also addresses whether the one-sided alternative hypothesis, that the minor allele is positively associated with the disease, or the two-sided alternative hypothesis, that the genotypes at a locus are associated with the disease, is more appropriate, or, put another way, whether examining both alternative hypotheses yields more information.
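For concreteness, the sketch below shows the Cochran-Armitage trend test in its score-test form: with an additive genotype score x in {0, 1, 2} and a 0/1 case indicator y, the trend chi-square equals N * corr(x, y)^2 on one degree of freedom. The genotypes and phenotypes are simulated, not the NARAC / GAW 16 data, and the one-sided p-value corresponds to the alternative that the minor allele is positively associated with disease.

```python
# Cochran-Armitage trend test (score-test form) on simulated data.
import numpy as np
from scipy import stats

def armitage_trend_test(genotype, case):
    x = np.asarray(genotype, dtype=float)      # copies of the minor allele
    y = np.asarray(case, dtype=float)          # 1 = case, 0 = control
    r = np.corrcoef(x, y)[0, 1]
    chi2 = len(x) * r ** 2
    p_two_sided = stats.chi2.sf(chi2, df=1)
    z = np.sign(r) * np.sqrt(chi2)
    p_one_sided = stats.norm.sf(z)             # minor allele increases risk
    return chi2, p_two_sided, p_one_sided

rng = np.random.default_rng(2)
geno = rng.binomial(2, 0.3, size=2000)                    # toy SNP, MAF 0.3
risk = 1.0 / (1.0 + np.exp(-(0.4 * geno - 0.3)))          # mild additive effect
case = rng.binomial(1, risk)
chi2, p2, p1 = armitage_trend_test(geno, case)
print(f"chi2 = {chi2:.2f}, two-sided p = {p2:.3g}, one-sided p = {p1:.3g}")
```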
14

Controlling IER, EER, and FDR In Replicated Regular Two-Level Factorial Designs

Akinlawon, Oludotun J Unknown Date
No description available.
15

The performance of multiple hypothesis testing procedures in the presence of dependence

Clarke, Sandra Jane January 2010 (has links)
Hypothesis testing is foundational to the discipline of statistics. Procedures exist which control for individual Type I error rates and more global or family-wise error rates for a series of hypothesis tests. However, the ability of scientists to produce very large data sets with increasing ease has led to a rapid rise in the number of statistical tests performed, often with small sample sizes. This is seen particularly in the area of biotechnology and the analysis of microarray data. This thesis considers this high-dimensional context with particular focus on the effects of dependence on existing multiple hypothesis testing procedures.

While dependence is often ignored, there are many existing techniques employed currently to deal with this context but these are typically highly conservative or require difficult estimation of large correlation matrices. This thesis demonstrates that, in this high-dimensional context when the distribution of the test statistics is light-tailed, dependence is not as much of a concern as in the classical contexts. This is achieved with the use of a moving average model. One important implication of this is that, when this is satisfied, procedures designed for independent test statistics can be used confidently on dependent test statistics.

This is not the case however for heavy-tailed distributions, where we expect an asymptotic Poisson cluster process of false discoveries. In these cases, we estimate the parameters of this process along with the tail-weight from the observed exceedences and attempt to adjust procedures. We consider both conservative error rates such as the family-wise error rate and more popular methods such as the false discovery rate. We are able to demonstrate that, in the context of DNA microarrays, it is rare to find heavy-tailed distributions because most test statistics are averages.
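A toy sketch of this setting, under stated assumptions: light-tailed (normal) test statistics with local dependence generated by a moving average, and the BH procedure applied as if the statistics were independent. The window length, effect size, and signal proportion are illustrative choices, not those used in the thesis.

```python
# Moving-average-dependent z-statistics with the BH procedure applied as if independent.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m, m1, q = 5000, 250, 10                          # tests, true signals, MA window
eps = rng.standard_normal(m + q)
noise = np.convolve(eps, np.ones(q + 1), "valid") / np.sqrt(q + 1)  # MA-dependent N(0,1)
mu = np.zeros(m)
mu[:m1] = 3.0                                     # shifted means = true alternatives
pvals = 2 * stats.norm.sf(np.abs(mu + noise))

order = np.argsort(pvals)
passed = np.nonzero(pvals[order] <= 0.05 * np.arange(1, m + 1) / m)[0]
rejected = order[: passed[-1] + 1] if passed.size else np.array([], dtype=int)
fdp = (rejected >= m1).sum() / max(rejected.size, 1)
print(f"{rejected.size} rejections, observed false discovery proportion = {fdp:.3f}")
```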
16

Imputation strategies and genome-wide association with sequence data for milk production traits in Gyr cattle

Nascimento, Guilherme Batista do [UNESP] 22 February 2018 (has links)
Implementing next-generation sequencing (NGS) data in animal breeding programs represents the latest step in the use of genotypic data in genomic association models, since every polymorphism is considered in the associations between phenotypic records and sequence data. As with any new technology, variant discovery still represents a computational and cost challenge for large-scale implementation. Facing these challenges, this work sought ways to exploit the benefits of using NGS data in genomic predictions and to overcome the inherent limitations of the process. Phenotypic and genotypic (Illumina Bovine HD BeadChip) records of 2,279 Gyr animals (Bos taurus indicus) were made available by Embrapa Gado de Leite (MG) and used for genome-wide association analyses. In addition, sequence data from 53 animals of the 1000 Bulls Project formed the imputation reference population. To assess imputation efficiency, different scenarios were tested for imputation accuracy through leave-one-out analysis using only the sequence data, reaching accuracies of up to 84% in the scenario with all 51 animals available after quality control. The influence of low-frequency variants on imputation accuracy in different regions of the genome was also examined. After choosing the best structure for the imputation reference population and applying quality control to the NGS and genomic data, it was possible to impute the 2,237 genotyped animals that passed quality control up to sequence level and to perform genome-wide association analyses for milk yield (PL305), fat content (PG305), protein content (PP305) and total solids (PS305), measured at 305 days, in dairy Gyr animals. Deregressed breeding values (dEBV) were used as the response variable in a multiple regression model. Regions of 1 Mb containing 100 or more variants with a False Discovery Rate (FDR) below 0.05 were considered significant and submitted to enrichment analysis using MeSH (Medical Subject Headings) terms. The three significant regions (FDR < 0.05) for PS305 were observed on chromosomes 11, 12 and 28, and the only significant region for PG305 was on chromosome 6. These regions contain variants associated with metabolic pathways of milk production that are absent from commercial genotyping panels and may represent candidate genes for selection.
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); convênio Capes/Embrapa (edital 15/2014).
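A hedged sketch of the windowing rule described above: per-variant p-values are BH-adjusted, and 1 Mb windows holding at least 100 variants with adjusted values below 0.05 are flagged. The column names, the injected cluster on chromosome 6, the toy scan, and the use of statsmodels' multipletests are assumptions for illustration, not the dairy Gyr analysis pipeline.

```python
# Flagging 1 Mb windows with >= 100 variants at FDR < 0.05 in a toy scan.
import numpy as np
import pandas as pd
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(4)
scan = pd.DataFrame({
    "chrom": rng.integers(1, 30, 50_000),               # Bos taurus autosomes
    "pos": rng.integers(0, 100_000_000, 50_000),
    "pval": rng.uniform(0, 1, 50_000),
})
qtl = pd.DataFrame({                                     # hypothetical associated region
    "chrom": 6,
    "pos": rng.integers(88_000_000, 89_000_000, 2_000),
    "pval": rng.uniform(0, 1e-8, 2_000),
})
scan = pd.concat([scan, qtl], ignore_index=True)

scan["fdr"] = multipletests(scan["pval"], method="fdr_bh")[1]   # BH-adjusted p-values
scan["window_mb"] = scan["pos"] // 1_000_000                    # 1 Mb bins

hits = (scan[scan["fdr"] < 0.05]
        .groupby(["chrom", "window_mb"])
        .size()
        .reset_index(name="n_significant"))
print(hits[hits["n_significant"] >= 100])
```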
17

Search for cosmic sources of high energy neutrinos with the AMANDA-II detector

Labare, Mathieu 26 January 2010 (has links)
AMANDA-II is a neutrino telescope comprising a three-dimensional array of optical sensors deployed in the South Pole glacier. Its detection principle rests on the Cherenkov radiation emitted by charged secondary particles produced by the interaction of a high-energy neutrino (> 100 GeV) with the matter surrounding the detector.

This work is based on data recorded by the AMANDA-II detector between 2000 and 2006 in order to search for cosmic sources of neutrinos. A potential signal must be extracted from the overwhelming background of muons and neutrinos originating from the interaction of primary cosmic rays within the atmosphere. The observation is limited to the northern hemisphere so as to be free of the atmospheric muon background, which is stopped by the Earth. However, atmospheric neutrinos constitute an irreducible background that makes up most of the 6,100 events selected for this analysis. It is nevertheless possible to identify a point source of cosmic neutrinos by looking for a local excess standing out from the isotropic background of atmospheric neutrinos, coupled with a selection based on energy, whose spectrum differs between cosmic and atmospheric neutrinos.

An original statistical approach has been developed in order to optimize the detection of point sources whilst controlling the false discovery rate, and hence the confidence level, of an observation. This method is based solely on knowledge of the background hypothesis, without any assumption on the neutrino production model of the sought sources. Moreover, the method naturally accounts for the trial factor inherent in multiple testing. The procedure was applied to the final sample of events collected by AMANDA-II. / Doctorat en Sciences
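As a generic toy illustration only, not the original procedure developed in the thesis: when only the background hypothesis is known, each sky bin can be assigned a Poisson p-value from the expected atmospheric-neutrino count, and a Benjamini-Hochberg step over all bins then bounds the false discovery rate, absorbing the trial factor. The bin count, background rate, and injected source below are invented for the example.

```python
# FDR-controlled sky scan over a known Poisson background (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_bins = 10_000
background = np.full(n_bins, 0.61)                 # ~6100 background events in total
counts = rng.poisson(background)
counts[1234] += 12                                 # a hypothetical point source

pvals = stats.poisson.sf(counts - 1, background)   # P(X >= observed | background)

order = np.argsort(pvals)
passed = np.nonzero(pvals[order] <= 0.01 * np.arange(1, n_bins + 1) / n_bins)[0]
flagged = order[: passed[-1] + 1].tolist() if passed.size else []
print("bins flagged at FDR 1%:", flagged)
```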
18

FURTHER CONTRIBUTIONS TO MULTIPLE TESTING METHODOLOGIES FOR CONTROLLING THE FALSE DISCOVERY RATE UNDER DEPENDENCE

Zhang, Shiyu, 0000-0001-8921-2453 12 1900 (has links)
This thesis presents innovative approaches for controlling the false discovery rate (FDR) in both high-dimensional and finite-sample settings, addressing challenges arising from various dependency structures in the data. The first project introduces novel multiple testing methods for matrix-valued data, motivated by an electroencephalography (EEG) experiment, where the inherent complex row-column cross-dependency is modeled with a matrix normal distribution. We propose two methods designed for structured matrix-valued data that approximate, with statistical accuracy, the true false discovery proportion (FDP) capturing the underlying cross-dependency. The second project focuses on simultaneous testing of multivariate normal means under diverse covariance matrix structures; by adjusting p-values using a BH-type step-up procedure tailored to the known correlation matrix, we achieve robust finite-sample FDR control. Both projects demonstrate superior performance through extensive numerical studies and real-data applications, significantly advancing the field of multiple testing under dependence. The third project presents exploratory simulation results for methods built on the paired-p-values framework that control the FDR within the multivariate normal means testing framework. / Statistics
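A small sketch of the matrix normal model invoked for the EEG-motivated project: a matrix observation with row covariance U and column covariance V can be drawn as M + A Z B^T, where U = A A^T, V = B B^T, and Z has i.i.d. standard normal entries. The dimensions and covariance structures are arbitrary illustrations, not the thesis's EEG design or its testing procedures.

```python
# Sampling from a matrix normal distribution via Cholesky factors of U and V.
import numpy as np

def sample_matrix_normal(mean, row_cov, col_cov, rng):
    a = np.linalg.cholesky(row_cov)        # U = A A^T
    b = np.linalg.cholesky(col_cov)        # V = B B^T
    z = rng.standard_normal(mean.shape)
    return mean + a @ z @ b.T

rng = np.random.default_rng(6)
rows, cols = 8, 5                          # e.g., channels x time points
u = 0.5 ** np.abs(np.subtract.outer(np.arange(rows), np.arange(rows)))  # AR(1)-like rows
v = 0.3 ** np.abs(np.subtract.outer(np.arange(cols), np.arange(cols)))  # AR(1)-like columns
y = sample_matrix_normal(np.zeros((rows, cols)), u, v, rng)
print(y.shape)
```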
19

New Results on the False Discovery Rate

Liu, Fang January 2010 (has links)
The false discovery rate (FDR) introduced by Benjamini and Hochberg (1995) is perhaps the most standard error-controlling measure used in a wide variety of applications involving multiple hypothesis testing. There are two approaches to controlling the FDR: the fixed error rate approach of Benjamini and Hochberg (BH, 1995), where a rejection region is determined so that the FDR stays below a fixed level, and the estimation-based approach of Storey (2002), where the FDR is estimated for a fixed rejection region before it is controlled. In this proposal, we concentrate on both approaches and propose new, improved versions of some FDR-controlling procedures available in the literature. A number of adaptive procedures have been put forward in the literature, each attempting to improve the BH method by incorporating into it an estimate of the number of true null hypotheses. Among these, the method of Benjamini, Krieger and Yekutieli (2006), the BKY method, has received much attention recently. In this proposal, a variant of the BKY method is proposed by considering a different estimate of the number of true null hypotheses, which often outperforms the BKY method in terms of FDR control and power. Storey's (2002) estimation-based approach to controlling the FDR was developed from a class of conservatively biased point estimates of the FDR under a mixture model for the underlying p-values and a fixed rejection threshold for each null hypothesis. An alternative class of point estimates of the FDR with uniformly smaller conservative bias is proposed under the same setup. Numerical evidence is provided to show that the mean squared error (MSE) is also often smaller for this new class of estimates. Compared to Storey's (2002), the present class provides a more powerful estimation-based approach to controlling the FDR. / Statistics
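A minimal sketch of Storey's (2002) estimation-based approach, assuming a toy two-group mixture of simulated p-values: the proportion of true nulls pi0 is estimated conservatively from the p-values above a tuning point lambda, and the FDR of a fixed rejection threshold t is then estimated from that. The choice lambda = 0.5 and the simulated mixture are illustrative; the improved estimators proposed in the thesis are not reproduced here.

```python
# Storey-type point estimate of the FDR for a fixed rejection threshold t.
import numpy as np

def storey_fdr(pvals, t, lam=0.5):
    m = len(pvals)
    pi0_hat = (pvals > lam).sum() / ((1.0 - lam) * m)   # conservative pi0 estimate
    n_rejected = max((pvals <= t).sum(), 1)
    return min(pi0_hat * m * t / n_rejected, 1.0)

rng = np.random.default_rng(7)
pvals = np.concatenate([rng.beta(0.2, 5.0, 200),        # non-null p-values, near zero
                        rng.uniform(0.0, 1.0, 1800)])   # true nulls
print("estimated FDR at t = 0.01:", round(storey_fdr(pvals, t=0.01), 3))
```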
20

ROBUST ESTIMATION OF THE PARAMETERS OF g-and-h DISTRIBUTIONS, WITH APPLICATIONS TO OUTLIER DETECTION

Xu, Yihuan January 2014 (has links)
The g-and-h distributional family is generated from a relatively simple transformation of the standard normal. By changing the skewness and elongation parameters g and h, this family can approximate a broad spectrum of commonly used distributional shapes, such as the normal, lognormal, Weibull and exponential. Consequently, it is easy to use in simulation studies and has been applied in multiple areas, including risk management, stock return analysis and missing data imputation. The currently available methods to estimate the g-and-h parameters include the letter-value-based method (LV), numerical maximum likelihood estimation (NMLE), and moment methods. Although these methods work well when no outliers or contamination are present, they are not resistant to even a moderate amount of contaminated observations or outliers, and NMLE is computationally time-consuming when the sample size is large. In this dissertation a quantile-based least squares (QLS) estimation method is proposed to fit the g-and-h parameters, and its basic properties are derived; the QLS method is then extended to a robust version (rQLS). Simulation studies compare the performance of the QLS and rQLS methods with the LV and NMLE methods on random samples with and without outliers. In samples without outliers, QLS and rQLS estimates are comparable to LV and NMLE in terms of bias and standard error; when a moderate amount of contaminated observations or outliers is present, rQLS performs better than the non-robust methods. The flexibility of the g-and-h distribution and the robustness of the rQLS method make it a useful tool in various fields. The boxplot (BP) method has been used for multiple-outlier detection by controlling the some-outside rate, the probability that one or more observations in an outlier-free sample fall into the outlier region. The BP method is distribution dependent: usually the random sample is assumed to be normally distributed, but this assumption may not be valid in many applications. The robustly estimated g-and-h distribution provides an alternative approach without distributional assumptions. Simulation studies indicate that the BP method based on the robustly estimated g-and-h distribution identifies a reasonable number of true outliers while controlling the number of false outliers and the some-outside rate, compared with the normal assumption when that assumption is not valid. Another application of the robust g-and-h distribution is as an empirical null distribution in the false discovery rate method (denoted the BH method hereafter). The performance of the BH method depends on the accuracy of the null distribution, and theoretical null distributions are often not valid when many thousands, or even millions, of hypothesis tests are performed simultaneously. An empirical null distribution approach is therefore introduced that uses a distribution estimated from the data; this is recommended as a substitute for the currently used empirical null methods of fitting a normal distribution or another member of the exponential family. As with the BP outlier detection method, the robustly estimated g-and-h distribution can be used as an empirical null distribution without any distributional assumptions. Several real microarray data sets are used as illustrations.
The QLS and rQLS methods are useful tools to estimate g-and-h parameters, especially rQLS because it noticeably reduces the effect of outliers on the estimates. The robustly estimated g-and-h distributions have multiple applications where distributional assumptions are required, such as boxplot outlier detection or BH methods. / Statistics
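For reference, a brief sketch of the g-and-h transformation itself: a standard normal Z is mapped to ((exp(g*Z) - 1)/g) * exp(h*Z^2/2), with g controlling skewness and h controlling tail elongation (and to Z * exp(h*Z^2/2) when g = 0). The parameter values below are illustrative, and the QLS/rQLS estimators are not reproduced; the sample-versus-theoretical quantile comparison only hints at the relationship that quantile-based fitting exploits.

```python
# The g-and-h quantile function and a toy quantile comparison.
import numpy as np
from scipy import stats

def g_and_h_quantile(p, g=0.5, h=0.1, loc=0.0, scale=1.0):
    z = stats.norm.ppf(p)
    if g == 0:
        core = z * np.exp(h * z ** 2 / 2)
    else:
        core = (np.expm1(g * z) / g) * np.exp(h * z ** 2 / 2)
    return loc + scale * core

rng = np.random.default_rng(8)
z = rng.standard_normal(100_000)
y = g_and_h_quantile(stats.norm.cdf(z))                # equivalent to transforming Z directly

probs = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print("sample quantiles:     ", np.quantile(y, probs).round(3))
print("theoretical quantiles:", g_and_h_quantile(probs).round(3))
```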
