In recent years, the RNA-sequencing (RNA-seq) method, which measures the transcriptome by counting short sequencing reads obtained by high-throughput sequencing, is replacing the microarray technology as the major platform in gene expression studies. The large amount of discrete data in RNA-seq experiments calls for effective analysis methods. In this dissertation, a new method to detect differentially expressed genes based on quasi-likelihood theory is developed in experiments with a completely randomized design with two experimental conditions. The proposed method estimates the variance function empirically and consequently it has similar sensitivities and FDRs across distributions with different variance functions. In a simulation study, the method is shown to have similar sensitivities and FDRs across the data with three different types of variance functions compared with some other popular methods. This method is applied to a real dataset with two experimental conditions along with some competing methods.
The new method is then extended to more complex designs such as an experiment with multiple experimental conditions, an experiment with block design and an experiment with factorial design. The same advantages for the new method have been found in simulation studies. This method and some competing methods are applied to three real datasets with complex designs.
The new method is also applied to analyze reads per kilobase per million mapped reads (RPKM) data. In the simulation, the method is compared with the Linear Models for Microarray Data (LIMMA) originally developed for microarray analysis (Smyth, 2004) and the question of normalization is also examined. It is shown that the new method and the LIMMA method have similar performance. Further normalization is required for the proper analysis of the RPKM data and the best such normalization is the scaling method. Analyzing raw count data properly has better performance than analyzing the RPKM data. Different normalization and statistical methods are applied to a real dataset with varied gene length across samples. / Thesis / Doctor of Philosophy (PhD)
Identifer | oai:union.ndltd.org:mcmaster.ca/oai:macsphere.mcmaster.ca:11375/20257 |
Date | January 2016 |
Creators | Gu, Chu-Shu |
Contributors | Canty, Angelo, Clinical Epidemiology/Clinical Epidemiology & Biostatistics |
Source Sets | McMaster University |
Language | English |
Detected Language | English |
Type | Thesis |
Page generated in 0.0017 seconds