1 |
Automation of comparative genomic promoter analysis of DNA microarray datasets
Karanam, Suresh Kumar 01 December 2003
No description available.
|
2 |
Data mining algorithms for genomic analysis
Ao, Sio-iong., 區小勇. January 2007
published_or_final_version / abstract / Mathematics / Doctoral / Doctor of Philosophy
|
3 |
Understanding the pathogenic fungus Penicillium marneffei: a computational genomics perspective
Cai, J., James., 蔡莖. January 2006
published_or_final_version / abstract / Microbiology / Doctoral / Doctor of Philosophy
|
4 |
START: a parallel signal track analytical research tool for flexible and efficient analysis of genomic data
Zhu, Xinjie, 朱信杰 January 2015
Signal Track Analytical Research Tool (START) is a parallel system for analyzing large-scale genomic data. Currently, genomic data analyses are usually performed using custom scripts developed by individual research groups, and/or by the integrated use of multiple existing tools (such as BEDTools and Galaxy). The goals of START are 1) to provide a single tool that supports a wide spectrum of genomic data analyses commonly done by analysts; and 2) to greatly simplify these analysis tasks by means of a simple declarative language (STQL), with which users need only specify what they want to do rather than the detailed computational steps of how the analysis should be performed.
START consists of four major components: 1) A declarative language called Signal Track Query Language (STQL), a SQL-like language we specifically designed to suit the needs of analyzing genomic signal tracks. 2) A STQL processing system built on top of a large-scale distributed architecture. The system is based on Hadoop distributed storage and the MapReduce Big Data processing framework, and it processes each user query using multiple machines in parallel. 3) A simple and user-friendly web site that helps users construct and execute queries, upload/download compressed data files in various formats, manage stored data, queries and analysis results, and share queries with other users. It also provides a complete help system, a detailed specification of STQL, and a large number of sample queries so that users can learn STQL and try START easily. Private files and queries are not accessible by other users. 4) A repository of public data popularly used for large-scale genomic data analysis, including data from ENCODE and Roadmap Epigenomics, that users can use in their analyses. / published_or_final_version / Computer Science / Doctoral / Doctor of Philosophy
|
5 |
Exploring microbial community structures and functions of activated sludge by high-throughput sequencing
Ye, Lin, 叶林 January 2012
The major objectives of this study were to investigate the diversities and abundances of nitrifiers and to apply high-throughput sequencing technologies to analyze the overall microbial community structures and functions in wastewater treatment bioreactors. Specifically, this study was conducted (1) to investigate the diversities and abundances of ammonia-oxidizing archaea (AOA), ammonia-oxidizing bacteria (AOB) and nitrite-oxidizing bacteria (NOB) in bioreactors, (2) to explore the bacterial communities in bioreactors using 454 pyrosequencing, and (3) to analyze the metagenomes of activated sludge using Illumina sequencing.
A lab-scale nitrification bioreactor was operated for 342 days under low dissolved oxygen (DO, 0.15~0.5 mg/L) and high nitrogen loading (0.26~0.52 kg-N/(m³·d)). T-RFLP and cloning analysis showed that there was only one dominant species each of AOA, AOB and NOB in the bioreactor. The amoA gene of the dominant AOA had a similarity of 89.3% with that of the isolated AOA species Nitrosopumilus maritimus SCM1. The AOB species detected in the bioreactor belonged to the genus Nitrosomonas. The abundance of AOB was more than 40 times larger than that of AOA. The percentage of NOB in total bacteria increased from undetectable to 30% when DO changed from 0.15 to 0.5 mg/L. Compared with traditional methods, pyrosequencing analysis of the bacteria in this bioreactor provided unprecedented information. A total of 494 bacterial OTUs were obtained at a 3% distance cutoff.
Furthermore, 454 pyrosequencing was applied to investigate the bacterial communities of activated sludge samples from 14 wastewater treatment plants (WWTPs) in Asia (mainland China, Hong Kong, and Singapore) and North America (Canada and the United States). The results revealed large numbers of OTUs in activated sludge, i.e. 1183~3567 OTUs per sludge sample at a 3% distance cutoff. Clear geographical differences among these samples were observed. The AOB amoA genes in different WWTPs were found to be quite diverse, while the 16S rRNA genes were relatively conserved.
To explore microbial community structures and functions in the abovementioned lab-scale bioreactor and a full-scale bioreactor, over six gigabases of metagenomic sequence data and 150,000 paired-end reads of PCR amplicons were generated from the activated sludge in the two bioreactors on the Illumina HiSeq2000 platform. Three kinds of sequences (16S rRNA amplicons, 16S rRNA gene tags and predicted genes) were used to conduct taxonomic assignment, and their applicability and reliability were compared. In particular, based on 16S rRNA and amoA gene sequences, AOB were found to be more abundant than AOA in the two bioreactors. Furthermore, the analysis of the metabolic profiles and pathways indicated that the overall pathways in the two bioreactors were quite similar. However, the abundances of some specific genes in the two bioreactors were different.
In addition, 454 pyrosequencing was also used to detect potentially pathogenic bacteria in environmental samples. The most abundant potentially pathogenic bacteria in the WWTPs were found to be affiliated with Aeromonas and Clostridium. Aeromonas veronii, Aeromonas hydrophila and Clostridium perfringens were the species most similar to the potentially pathogenic bacteria found in this study. Overall, sequences closely related to known pathogenic bacteria accounted for about 0.16% of the total sequences. Additionally, a Java application (BAND) was developed for graphical visualization of microbial abundance data. / published_or_final_version / Civil Engineering / Doctoral / Doctor of Philosophy
|
6 |
Statistical Methods for Integrated Cancer Genomic Data Using a Joint Latent Variable Model
Drill, Esther January 2018
Inspired by The Cancer Genome Atlas (TCGA), we explore multimodal genomic datasets with integrative methods using a joint latent variable approach. We use iCluster+, an existing clustering method for integrative data, to identify potential subtypes within TCGA sarcoma and mesothelioma tumors, and across a large cohort of 33 different TCGA cancer datasets. For classification, motivated by the need to improve the prediction of platinum resistance in high grade serous ovarian cancer (HGSOC) treatment, we propose a novel integrative method, iClassify, to perform classification using a joint latent variable model. iClassify provides effective data integration and classification while handling heterogeneous data types, and it provides a natural framework to incorporate covariate risk factors and examine genomic driver by covariate risk factor interaction. Feature selection is performed through a thresholding parameter that combines both latent variable and feature coefficients. We demonstrate increased accuracy in classification over methods that assume a homogeneous data type, such as linear discriminant analysis and penalized logistic regression, as well as improved feature selection. We apply iClassify to a TCGA cohort of HGSOC patients with three types of genomic data and platinum response data. This methodology has broad applications beyond predicting treatment outcomes and disease progression in cancer, including predicting prognosis and diagnosis in other diseases with major public health implications.
|
7 |
Prediction and analysis of the methylation status of CpG islands in human genome
Zheng, Hao 27 March 2012
DNA methylation is a major epigenetic modification crucial to normal organismal development and to the onset and progression of complex diseases such as cancer. Computational predictions for DNA methylation profiling serve multiple purposes. First, accurate predictions can contribute valuable information for speeding up genome-wide DNA methylation profiling, so that experimental resources can be focused on a few selected regions while computational procedures are applied to the bulk of the genome. Second, computational predictions can extract functional features and construct useful models of DNA methylation based on existing data, and can therefore be used as an initial step toward quantitative identification of critical factors or pathways controlling DNA methylation patterns. Third, computational prediction of DNA methylation can provide benchmark data to calibrate DNA methylation profiling equipment and to consolidate profiling results from different equipment or techniques.
This thesis is based on our study of the computational analysis of the DNA methylation patterns of the human genome. In particular, we have established computational models (1) to predict the methylation patterns of CpG islands in normal conditions, and (2) to detect CpG islands that are unmethylated in normal conditions but aberrantly methylated in cancer conditions. When evaluated using the CD4 lymphocyte data of the Human Epigenome Project (HEP) data set based on bisulfite sequencing, our computational models for predicting the methylation status of CpG islands in normal conditions achieve an accuracy of 93-94%, a specificity of 94%, and a sensitivity of 92-93%. When evaluated using the aberrant methylation data from the MethCancerDB database of aberrantly methylated genes in cancer, our models for detecting CpG islands that are unmethylated in normal conditions but aberrantly methylated in colon or prostate cancer achieve an accuracy of 92-93%, a specificity of 98-99%, and a sensitivity of 92-93%.
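For readers less familiar with the three figures reported above, the following is a minimal sketch of how accuracy, specificity, and sensitivity are conventionally derived from a confusion matrix. It assumes that "methylated" is treated as the positive class, which the abstract does not state; the function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow, not taken from the thesis.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn: positive (assumed: methylated) islands predicted correctly/incorrectly.
    tn/fp: negative (assumed: unmethylated) islands predicted correctly/incorrectly.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total        # fraction of all islands classified correctly
    sensitivity = tp / (tp + fn)        # fraction of positives recovered
    specificity = tn / (tn + fp)        # fraction of negatives recovered
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (not data from the thesis).
acc, sens, spec = classification_metrics(tp=92, tn=94, fp=6, fn=8)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

With balanced classes, as in this toy example, accuracy is simply the mean of sensitivity and specificity; with unbalanced classes the three numbers can diverge considerably.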
|
8 |
Algorithm Optimizations in Genomic Analysis Using Entropic Dissection
Danks, Jacob R. 08 1900
In recent years, the collection of genomic data has skyrocketed, and databases of genomic data are growing at a faster rate than ever before. Although many computational methods have been developed to interpret these data, they tend to struggle to process the ever-increasing file sizes being produced, and they fail to take advantage of advances in multi-core processors through parallel processing. In some instances, loss of accuracy has been a necessary trade-off to allow faster computation. This thesis discusses one such algorithm and how it was changed to allow larger input file sizes and reduce the time required to achieve a result without sacrificing accuracy. An information-entropy-based algorithm was used as a basis to demonstrate these techniques. The algorithm efficiently dissects the distinctive patterns underlying genomic data, requiring no a priori knowledge, and is thus applicable in a variety of biological research applications. This research describes how parallel processing and object-oriented programming techniques were used to process larger files in less time and achieve a more accurate result from the algorithm. Through object-oriented techniques, the maximum allowable input file size was increased from 200 MB to 2000 MB. Parallel processing techniques allowed the program to finish processing data in less than half the time of the sequential version. The accuracy of the algorithm was improved by reducing data loss throughout the algorithm. Finally, adding user-friendly options enabled the program to use requests more effectively and to further customize the logic used within the algorithm.
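The abstract does not describe the algorithm itself, so the following is only a generic sketch of the two ingredients it names: an information-entropy measure and chunked, pooled processing. It computes the Shannon entropy of each non-overlapping window of a sequence using a worker pool. The function names, window size, and toy sequence are all assumptions made for illustration and are not the thesis's method.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def shannon_entropy(seq):
    """Shannon entropy (bits per symbol) of the symbol distribution in seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def windowed_entropy(seq, window, workers=4):
    """Entropy of each non-overlapping window, computed by a pool of workers."""
    chunks = [seq[i:i + window] for i in range(0, len(seq), window)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results line up with positions.
        return list(pool.map(shannon_entropy, chunks))


# Toy sequence (hypothetical): one uniform window, two maximally mixed ones.
entropies = windowed_entropy("AAAAACGTACGT", window=4)
print(entropies)
```

A thread pool is used here only to keep the sketch portable; in CPython, CPU-bound work of this kind would normally need a process pool (for example `concurrent.futures.ProcessPoolExecutor`) to benefit from multiple cores.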
|
9 |
On Identifying Rare Variants for Complex Human Traits
Fan, Ruixue January 2015
This thesis focuses on developing novel statistical tests for rare variant association analysis that incorporate both marginal effects and interaction effects among rare variants. Compared with common variants, rare variants have lower minor allele frequencies (typically less than 5%), and hence traditional association tests designed for common variants lose power when applied to rare variants. There is therefore a pressing need for new analytical tools to tackle the problem of rare variant association with complex human traits. Several collapsing methods have been proposed that aggregate information from rare variants in a region and test them together. They can be divided into burden tests and non-burden tests according to their aggregation strategies. They are all variations of regression-based methods, which assume that the phenotype is associated with the genotype via a (linear) regression model. Most of these methods consider only the marginal effects of rare variants and fail to take into account gene-gene and gene-environment interaction effects, which are ubiquitous and of utmost importance in biological systems. In this thesis, we propose a summation of partition approach (SPA), a nonparametric strategy for rare variant association analysis. Extensive simulation studies show that SPA is powerful in detecting not only marginal effects but also gene-gene interaction effects of rare variants. Moreover, extensions of SPA are able to detect gene-environment interactions and other interactions existing in complicated biological systems. We also obtain the asymptotic behavior of the marginal SPA score, which guarantees the power of the proposed method. Inspired by the idea of stepwise variable selection, a significance-based backward dropping algorithm (SDA) is proposed to locate truly influential rare variants in a genetic region that has been identified as significant. Unlike traditional backward dropping approaches, which remove the least significant variables first, SDA introduces the idea of eliminating the most significant variable at each round. The removed variables are collected and their effects are evaluated by an influence ratio score, the relative p-value change. Our simulation studies show that SDA is powerful in detecting causal variables and has a lower false discovery rate than LASSO. We also demonstrate our method using the dataset provided by Genetic Analysis Workshop (GAW) 17, and the results support the superiority of SDA over LASSO. The general partition-retention framework can also be applied to detect gene-environment interaction effects for common variants. We demonstrate this method using the dataset from Genetic Analysis Workshop (GAW) 18. Our nonparametric approach is able to identify many more possible influential gene-environment pairs than traditional linear regression models. We propose in this thesis a "SPA-SDA" two-step approach for rare variant association analysis at genomic scale: first identify significant regions of moderate size using SPA, and then apply SDA to the identified regions to pinpoint truly influential variables. This approach is computationally efficient for genomic data and has the capacity to detect gene-gene and gene-environment interactions.
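As background for the collapsing methods the abstract contrasts itself with, the sketch below shows the basic idea of a burden-style aggregation: compute each variant's minor allele frequency (MAF), keep the variants below the 5% threshold mentioned above, and sum their minor-allele counts per individual. This is a generic illustration and not SPA or SDA; the function names, the 0/1/2 genotype coding, and the toy data are assumptions.

```python
def minor_allele_frequency(genotypes):
    """MAF of one variant, given per-individual minor-allele counts (0, 1, or 2)."""
    return sum(genotypes) / (2 * len(genotypes))


def burden_scores(genotype_matrix, maf_threshold=0.05):
    """Per-individual count of minor alleles across rare variants in a region.

    genotype_matrix[v][i] is the minor-allele count of variant v in individual i.
    Variants with MAF of 0 or at/above the threshold are excluded.
    """
    rare = [row for row in genotype_matrix
            if 0 < minor_allele_frequency(row) < maf_threshold]
    n_individuals = len(genotype_matrix[0])
    return [sum(row[i] for row in rare) for i in range(n_individuals)]


# Toy region with 20 individuals (hypothetical data).
variant_a = [1] + [0] * 19          # MAF = 1/40 = 0.025, rare
variant_b = [1] * 10 + [0] * 10     # MAF = 10/40 = 0.25, common, excluded
variant_c = [0, 1] + [0] * 18       # MAF = 0.025, rare
scores = burden_scores([variant_a, variant_b, variant_c])
print(scores[:3])
```

A burden test would then regress the phenotype on these per-individual scores. Because the scores simply add variants together, they carry no information about interactions among variants, which is the limitation the thesis sets out to address.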
|
10 |
Genomic protein functionality classification algorithms in frequency domain.
January 2004
Tak-Chung Lau. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. / Includes bibliographical references (leaves 190-198). / Abstracts in English and Chinese. /
Abstract / Acknowledgement /
Chapter 1 --- Introduction / Chapter 1.1 --- Background Information / Chapter 1.2 --- Importance of the Problem / Chapter 1.3 --- Problem Definition and Proposed Algorithm Outline / Chapter 1.4 --- Simple Illustration / Chapter 1.5 --- Outline of the Thesis /
Chapter 2 --- Survey / Chapter 2.1 --- Introduction / Chapter 2.2 --- Dynamic Programming (DP) / Chapter 2.2.1 --- Introduction / Chapter 2.2.2 --- Algorithm / Chapter 2.2.3 --- Example / Chapter 2.2.4 --- Complexity Analysis / Chapter 2.2.5 --- Summary / Chapter 2.3 --- General Alignment Tools / Chapter 2.4 --- K-Nearest Neighbor (KNN) / Chapter 2.4.1 --- Value of K / Chapter 2.4.2 --- Example / Chapter 2.4.3 --- Variations in KNN / Chapter 2.4.4 --- Summary / Chapter 2.5 --- Decision Tree / Chapter 2.5.1 --- General Information of Decision Tree / Chapter 2.5.2 --- Classification in Decision Tree / Chapter 2.5.3 --- Disadvantages in Decision Tree / Chapter 2.5.4 --- Comparison on Different Types of Trees / Chapter 2.5.5 --- Conclusion / Chapter 2.6 --- Hidden Markov Model (HMM) / Chapter 2.6.1 --- Markov Process / Chapter 2.6.2 --- Hidden Markov Model / Chapter 2.6.3 --- General Framework in HMM / Chapter 2.6.4 --- Example / Chapter 2.6.5 --- Drawbacks in HMM / Chapter 2.7 --- Chapter Summary /
Chapter 3 --- Related Work / Chapter 3.1 --- Resonant Recognition Model (RRM) / Chapter 3.1.1 --- Introduction / Chapter 3.1.2 --- Encoding Stage / Chapter 3.1.3 --- Transformation Stage / Chapter 3.1.4 --- Evaluation Stage / Chapter 3.1.5 --- Important Conclusion in RRM / Chapter 3.1.6 --- Summary / Chapter 3.2 --- Motivation / Chapter 3.2.1 --- Example / Chapter 3.3 --- Chapter Summary /
Chapter 4 --- Group Classification / Chapter 4.1 --- Introduction / Chapter 4.2 --- Design / Chapter 4.2.1 --- Data Preprocessing / Chapter 4.2.2 --- Encoding Stage / Chapter 4.2.3 --- Transformation Stage / Chapter 4.2.4 --- Evaluation Stage / Chapter 4.2.5 --- Classification / Chapter 4.2.6 --- Summary / Chapter 4.3 --- Experimental Settings / Chapter 4.3.1 --- "Statistics from Database of Secondary Structure in Proteins (DSSP) [27], [54]" / Chapter 4.3.2 --- Parameters Used / Chapter 4.3.3 --- Experimental Procedure / Chapter 4.4 --- Experimental Results / Chapter 4.4.1 --- Reference Group - Neurotoxin / Chapter 4.4.2 --- Reference Group - Biotin / Chapter 4.4.3 --- Average Results of all the Groups / Chapter 4.4.4 --- Conclusion in Experimental Results / Chapter 4.5 --- Discussion / Chapter 4.5.1 --- Discussion on the Experimental Results / Chapter 4.5.2 --- Complexity Analysis / Chapter 4.5.3 --- Other Discussion / Chapter 4.6 --- Chapter Summary /
Chapter 5 --- Individual Classification / Chapter 5.1 --- Design / Chapter 5.1.1 --- Group Profile Generation / Chapter 5.1.2 --- Preparation of Each Testing Examples / Chapter 5.2 --- Design with Clustering / Chapter 5.2.1 --- Motivation / Chapter 5.2.2 --- Data Exception / Chapter 5.2.3 --- Clustering Technique / Chapter 5.2.4 --- Classification / Chapter 5.3 --- Hybridization of Our Approach and Sequence Alignment / Chapter 5.3.1 --- AlignRemove and AlignChange / Chapter 5.3.2 --- Classification / Chapter 5.4 --- Experimental Settings / Chapter 5.4.1 --- Parameters Used / Chapter 5.4.2 --- Choosing of Protein Functional Groups / Chapter 5.5 --- Experimental Results / Chapter 5.5.1 --- Experimental Results Setup / Chapter 5.5.2 --- Receiver Operating Characteristics (ROC) Curves / Chapter 5.5.3 --- Interpretation of Comparison Results / Chapter 5.5.4 --- Area under the Curve / Chapter 5.5.5 --- Classification with KNN / Chapter 5.5.6 --- Three Types of KNN / Chapter 5.5.7 --- Results in Three Types of KNN / Chapter 5.6 --- Complexity Analysis / Chapter 5.6.1 --- Complexity in Individual Classification / Chapter 5.6.2 --- Complexity in Individual Clustering Classification / Chapter 5.6.3 --- Complexity of Individual Classification in DP / Chapter 5.6.4 --- Conclusion / Chapter 5.7 --- Discussion / Chapter 5.7.1 --- Domain Expert Opinions / Chapter 5.7.2 --- Choosing the Threshold / Chapter 5.7.3 --- Statistical Support in an Individual Protein / Chapter 5.7.4 --- Discussion on Clustering / Chapter 5.7.5 --- Poor Performance in Hybridization / Chapter 5.8 --- Chapter Summary /
Chapter 6 --- Application / Chapter 6.1 --- Introduction / Chapter 6.1.1 --- Construct the Correlation Graph / Chapter 6.1.2 --- Minimum Spanning Tree (MST) / Chapter 6.2 --- Application in Group Classification / Chapter 6.2.1 --- Groups with Weak Relationship / Chapter 6.2.2 --- Groups with Strong Relationship / Chapter 6.3 --- Application in Individual Classification / Chapter 6.4 --- Chapter Summary /
Chapter 7 --- Discussion on Other Analysis / Chapter 7.1 --- Distanced MLN Encoding Scheme / Chapter 7.2 --- Unique Encoding Method / Chapter 7.3 --- Protein with Multiple Functions? / Chapter 7.4 --- Discussion on Sequence Similarity / Chapter 7.5 --- Functional Blocks in Proteins / Chapter 7.6 --- Issues in DSSP / Chapter 7.7 --- Flexible Encoding / Chapter 7.8 --- Advantages over Dynamic Programming / Chapter 7.9 --- Novel Research Direction /
Chapter 8 --- Future Works / Chapter 8.1 --- Improvement in Encoding Scheme / Chapter 8.2 --- Analysis on Primary Protein Sequences / Chapter 8.3 --- In Between Spectrum Scaling / Chapter 8.4 --- Improvement in Hybridization / Chapter 8.5 --- Fuzzy Threshold Boundaries / Chapter 8.6 --- Optimal Parameters Setting / Chapter 8.7 --- Generalization Tool /
Chapter 9 --- Conclusion / Bibliography /
Chapter A --- Fourier Transform / Chapter A.1 --- Introduction / Chapter A.2 --- Example / Chapter A.3 --- Physical Meaning of Fourier Transform
|