1. COHORTFINDER: A DATA-DRIVEN, OPEN-SOURCE TOOL FOR PARTITIONING PATHOLOGY AND IMAGING COHORTS TO YIELD ROBUST MACHINE LEARNING MODELS
Fan, Fan (26 May 2023)
No description available.
2. Exploration, quantification, and mitigation of systematic error in high-throughput approaches to gene-expression profiling: implications for data reproducibility
Kitchen, Robert Raymond (January 2011)
Technological and methodological advances in the fields of medical and life-sciences have, over the last 25 years, revolutionised the way in which cellular activity is measured at the molecular level. Three such advances have provided a means of accurately and rapidly quantifying mRNA, from the development of quantitative Polymerase Chain Reaction (qPCR), to DNA microarrays, and second-generation RNA-sequencing (RNA-seq). Despite consistent improvements in measurement precision and sample throughput, the data generated continue to be affected by high levels of variability due to the use of biologically distinct experimental subjects, practical restrictions necessitating the use of small sample sizes, and technical noise introduced during frequently complex sample preparation and analysis procedures. A series of experiments were performed during this project to profile sources of technical noise in each of these three techniques, with the aim of using the information to produce more accurate and more reliable results. The mechanisms for the introduction of confounding noise in these experiments are highly unpredictable. The variance structure of a qPCR experiment, for example, depends on the particular tissue-type and gene under assessment, while expression data obtained by microarray can be greatly influenced by the day on which each array was processed and scanned. RNA-seq, on the other hand, produces data that appear very consistent in terms of differences between technical replicates; however, there exist large differences when results are compared against those reported by microarray, which require careful interpretation. It is demonstrated in this thesis that by quantifying some of the major sources of noise in an experiment and utilising compensation mechanisms, either pre- or post-hoc, researchers are better equipped to perform experiments that are more robust, more accurate, and more consistent.
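The sketch below illustrates the kind of noise quantification this abstract describes: a one-way random-effects ANOVA that partitions qPCR-style measurements into biological (between-subject) and technical (within-replicate) variance components. The data layout, function name, and example values are hypothetical illustrations, not code or numbers from the thesis.

```python
import numpy as np

def variance_components(measurements):
    """Partition variance into biological (between-subject) and technical
    (within-subject) components via one-way random-effects ANOVA.

    measurements: 2-D array, shape (n_subjects, n_tech_replicates),
    e.g. qPCR Cq values for one gene measured in triplicate per subject.
    """
    n_subj, n_rep = measurements.shape
    subj_means = measurements.mean(axis=1)
    grand_mean = measurements.mean()

    # Mean squares from the classical one-way ANOVA decomposition.
    ms_between = n_rep * np.sum((subj_means - grand_mean) ** 2) / (n_subj - 1)
    ms_within = np.sum((measurements - subj_means[:, None]) ** 2) / (n_subj * (n_rep - 1))

    var_technical = ms_within
    # Method-of-moments estimate, truncated at zero.
    var_biological = max((ms_between - ms_within) / n_rep, 0.0)
    return var_biological, var_technical

# Hypothetical example: 8 subjects, 3 technical replicates each.
rng = np.random.default_rng(0)
true_bio, true_tech = 1.0, 0.25
data = (rng.normal(0, np.sqrt(true_bio), size=(8, 1))
        + rng.normal(0, np.sqrt(true_tech), size=(8, 3)))
print(variance_components(data))
```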
3. Detecting and Correcting Batch Effects in High-Throughput Genomic Experiments
Reese, Sarah (19 April 2013)
Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal components analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. We present an extension of principal components analysis to quantify the existence of batch effects, called guided PCA (gPCA), and describe a test statistic that uses gPCA to test whether a batch effect exists. We apply the proposed test statistic to simulated data and to two copy number variation case studies: the first consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second consisted of 703 samples from a family blood pressure study that used the Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in both case studies. We further compare existing batch effect correction methods and apply gPCA to test their effectiveness. We conclude that gPCA provides an effective test for the presence of batch effects in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well.
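A minimal sketch of the gPCA idea as summarized above, assuming the statistic is the ratio of the variance of the data projected on the first batch-guided loading to the variance on the first ordinary PCA loading, with significance assessed by permuting batch labels. Function names and implementation details are ours, not the author's.

```python
import numpy as np

def gpca_delta(Y, batch):
    """Guided-PCA batch-effect statistic (a sketch in the spirit of gPCA):
    variance of Y projected on the first right singular vector of X'Y
    (X = batch indicator matrix), relative to the variance on the first
    ordinary PCA loading. Values near 1 suggest batch dominates.

    Y: (n_samples, n_features) array; columns are centered here.
    batch: length-n array of batch labels.
    """
    Y = Y - Y.mean(axis=0)
    levels, idx = np.unique(batch, return_inverse=True)
    X = np.eye(len(levels))[idx]              # n x n_batches indicator

    v_unguided = np.linalg.svd(Y, full_matrices=False)[2][0]
    v_guided = np.linalg.svd(X.T @ Y, full_matrices=False)[2][0]
    return np.var(Y @ v_guided) / np.var(Y @ v_unguided)

def gpca_pvalue(Y, batch, n_perm=999, seed=0):
    """Permutation p-value: shuffle batch labels to build the null."""
    rng = np.random.default_rng(seed)
    observed = gpca_delta(Y, batch)
    perms = np.array([gpca_delta(Y, rng.permutation(batch))
                      for _ in range(n_perm)])
    return observed, (1 + np.sum(perms >= observed)) / (1 + n_perm)
```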
4. Adversarial Deep Neural Networks Effectively Remove Nonlinear Batch Effects from Gene-Expression Data
Dayton, Jonathan Bryan (01 July 2019)
Gene-expression profiling enables researchers to quantify transcription levels in cells, thus providing insight into functional mechanisms of diseases and other biological processes. However, because of the high dimensionality of these data and the sensitivity of measuring equipment, expression data often contain unwanted confounding effects that can skew analysis. For example, collecting data in multiple runs causes nontrivial differences in the data (known as batch effects), known covariates that are not of interest to the study may have strong effects, and there may be large systemic effects when integrating multiple expression datasets. Additionally, many of these confounding effects represent higher-order interactions that may not be removable using existing techniques that identify linear patterns. We created Confounded to remove these effects from expression data. Confounded is an adversarial variational autoencoder that removes confounding effects while minimizing the amount of change to the input data. We tested the model on artificially constructed data and commonly used gene-expression datasets and compared it against other common batch adjustment algorithms. We also applied the model to remove cancer-type-specific signal from a pan-cancer expression dataset. Our software is publicly available at https://github.com/jdayton3/Confounded.
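As a rough illustration of the adversarial scheme described above, here is a sketch simplified to a plain (non-variational) autoencoder; the architecture, layer sizes, and loss weighting are assumptions for illustration, not the published Confounded model.

```python
import torch
import torch.nn as nn

# An autoencoder reconstructs expression while a discriminator tries to
# predict batch from the reconstruction; the autoencoder is penalized when
# the discriminator succeeds, pushing batch signal out of its output.

n_genes, n_batches, latent = 1000, 2, 64  # hypothetical dimensions

autoencoder = nn.Sequential(
    nn.Linear(n_genes, latent), nn.ReLU(),
    nn.Linear(latent, n_genes), nn.Sigmoid(),  # assumes expression scaled to [0, 1]
)
discriminator = nn.Sequential(
    nn.Linear(n_genes, 64), nn.ReLU(),
    nn.Linear(64, n_batches),
)

opt_ae = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
mse, xent = nn.MSELoss(), nn.CrossEntropyLoss()
lam = 1.0  # weight of the adversarial term (hypothetical value)

def train_step(x, batch_labels):
    # 1) Train the discriminator to recognize batch in the adjusted data.
    opt_d.zero_grad()
    d_loss = xent(discriminator(autoencoder(x).detach()), batch_labels)
    d_loss.backward()
    opt_d.step()

    # 2) Train the autoencoder to reconstruct x while fooling the discriminator.
    opt_ae.zero_grad()
    x_hat = autoencoder(x)
    ae_loss = mse(x_hat, x) - lam * xent(discriminator(x_hat), batch_labels)
    ae_loss.backward()
    opt_ae.step()

# Usage (hypothetical shapes):
#   x = torch.rand(32, n_genes)
#   labels = torch.randint(0, n_batches, (32,))
#   train_step(x, labels)
```

The detach() in step 1 keeps discriminator gradients from flowing into the autoencoder, while the subtracted cross-entropy in step 2 rewards reconstructions from which batch cannot be predicted.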
5. Statistical methods for analyzing sequencing data with applications in modern biomedical analysis and personalized medicine
Manimaran, Solaiappan (13 March 2017)
There has been tremendous advancement in sequencing technologies; the rate at which sequencing data can be generated has increased multifold while the cost of sequencing continues to fall. Sequencing data provide novel insights into the ecological environment of microbes as well as human health and disease status, but challenge investigators with a variety of computational issues. This thesis focuses on three common problems in the analysis of high-throughput data. The goals of the first project are to (1) develop a statistical framework and a complete software pipeline for metagenomics that identifies microbes to the strain level, thereby facilitating personalized drug treatment targeting the strain; and (2) estimate the relative content of microbes in a sample as accurately and as quickly as possible.
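Read-level ambiguity makes strain-level quantification a mixture estimation problem. Below is a generic expectation-maximization sketch in that spirit; it is not PathoScope's actual model, which is Bayesian and additionally penalizes non-unique alignments.

```python
import numpy as np

def em_read_mixture(L, n_iter=100, tol=1e-8):
    """EM for relative abundances of candidate genomes given ambiguous
    read alignments (a sketch, not PathoScope's code).

    L: (n_reads, n_genomes) alignment likelihoods; L[i, j] is the
       likelihood of read i given genome j (zero if unaligned).
       Assumes every read aligns to at least one genome.
    Returns the estimated genome proportions pi.
    """
    n_reads, n_genomes = L.shape
    pi = np.full(n_genomes, 1.0 / n_genomes)
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each genome.
        post = L * pi
        post /= post.sum(axis=1, keepdims=True)
        # M-step: proportions are the average posterior assignment.
        pi_new = post.mean(axis=0)
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi
```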
The second project focuses on the analysis of the microbiome variation across multiple samples. Studying the variation of microbiomes under different conditions within an organism or environment is the key to diagnosing diseases and providing personalized treatments. The goals are to (1) identify various statistical diversity measures; (2) develop confidence regions for the relative abundance estimates; (3) perform multi-dimensional and differential expression analysis; and (4) develop a complete pipeline for multi-sample microbiome analysis.
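As one concrete example of a diversity measure paired with a confidence region, here is a hedged sketch computing the Shannon index with a percentile-bootstrap interval; the multinomial resampling scheme is a common choice, not necessarily the one adopted in PathoStat.

```python
import numpy as np

def shannon_diversity(counts):
    """Shannon index H = -sum p_i log p_i from raw taxon counts."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def bootstrap_ci(counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for H: resample reads with replacement
    according to the observed relative abundances."""
    rng = np.random.default_rng(seed)
    n_reads = int(counts.sum())
    p = counts / counts.sum()
    stats = [shannon_diversity(rng.multinomial(n_reads, p))
             for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return shannon_diversity(counts), (lo, hi)

# Hypothetical sample: read counts for five taxa.
print(bootstrap_ci(np.array([500, 300, 120, 60, 20])))
```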
The third project is focused on batch effect analysis. When analyzing high-dimensional data, non-biological experimental variation or “batch effects” confound the true associations between the conditions of interest and the outcome variable. Batch effects can persist even after normalization; hence, unless they are identified and corrected, downstream analyses will likely be error-prone and may lead to false-positive results. The goals are to (1) analyze the effect of correlation in batch-adjusted data and develop new techniques to account for that correlation in a two-step hypothesis-testing approach; and (2) develop a software pipeline to identify whether batch effects are present in the data and to adjust for them in a suitable way.
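A toy version of the two-step approach, assuming the simplest possible adjustment (per-batch mean-centering) followed by per-feature t-tests; note that step 1 induces within-batch correlation that naive step-2 tests ignore, which is exactly the issue the thesis's first goal addresses. All names here are illustrative, not BatchQC's API.

```python
import numpy as np
from scipy import stats

def mean_center_by_batch(Y, batch):
    """Step 1 (naive adjustment): subtract each feature's per-batch mean.
    Removes additive batch shifts but leaves the adjusted values
    correlated within a batch."""
    Y_adj = Y.astype(float).copy()
    for b in np.unique(batch):
        Y_adj[batch == b] -= Y_adj[batch == b].mean(axis=0)
    return Y_adj

def two_step_test(Y, batch, condition):
    """Step 2: per-feature two-sample t-test on the batch-adjusted data.
    Y: (n_samples, n_features); batch, condition: length-n label arrays."""
    Y_adj = mean_center_by_batch(Y, batch)
    g1, g2 = Y_adj[condition == 0], Y_adj[condition == 1]
    return stats.ttest_ind(g1, g2, axis=0).pvalue
```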
In summary, we developed software pipelines called PathoScope, PathoStat and BatchQC as part of these projects and validated our techniques using simulation and real data sets.