Global ETD Search

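1	Bioinformatics analyses for next-generation sequencing of plasma DNA. January 2012 (has links) In 1997, Dennis et al. demonstrated that fetal DNA is present in the maternal circulation of pregnant women, opening the door to noninvasive prenatal diagnosis. Early applications included sex determination and rhesus blood group typing. With the emergence and development of next-generation sequencing, more sophisticated analyses and applications of cell-free DNA in peripheral blood followed. For example, at twelve weeks of gestation, next-generation sequencing of maternal plasma DNA can predict whether the fetus has trisomy 21 with an accuracy of 98%. The first part of this thesis describes how to construct a genome-wide fetal genetic map from maternal plasma DNA. This is highly challenging because, at 12 weeks of gestation, the fetus contributes only a small share of plasma DNA, mostly around 10%, and most fetal DNA in plasma is shorter than 200 bp. Existing algorithms and programs are not suited to constructing a fetal genetic map from maternal plasma DNA. In this study, candidate fetal genetic maps are first constructed bioinformatically from the maternal and paternal genotypes, and the maternal plasma DNA sequencing data are then aligned to these candidate maps. Where the mother is homozygous, determining paternal inheritance requires only qualitative detection of whether paternal-specific sequences are present in maternal plasma. Where the mother is heterozygous, determining maternal inheritance requires quantitative analysis. I developed the relative haplotype dosage analysis, which statistically compares the relative dosage of the two maternal haplotypes in maternal plasma; the significantly over-represented haplotype is the one most likely inherited by the fetus. Relative haplotype dosage analysis uses the sequencing data more efficiently, reduces the fluctuation of the sequencing data, and is more stable and robust than single-locus analysis. / With the emergence of target-enrichment sequencing, sequencing costs have fallen sharply. The first part calculates the fetal DNA concentration in maternal plasma from the combination of maternal and paternal genotypes at polymorphic loci together with the sequencing data. Its limitation is that it requires the parental genotypes at polymorphic loci and cannot infer the fetal DNA concentration directly from the sequencing data. In the second part of this thesis, I developed a binomial mixture model that directly estimates the fetal DNA concentration in maternal plasma. The optimal estimate of the fetal DNA concentration is obtained when the likelihood of the mixture model is maximized. Because target-enrichment sequencing provides high-coverage data, informative loci where the mother is homozygous and the fetus is heterozygous can be identified directly from the probabilistic model. / Besides DNA-level analysis of maternal plasma, epigenetic analysis should not be overlooked in advancing noninvasive prenatal diagnosis. In the third part of this thesis, I developed Methyl-Pipe, software dedicated to genome-wide methylation analysis. Analysing methylation sequencing data is more complex than ordinary genome sequencing analysis. In a bisulfite sequencing library, unmethylated cytosines are converted to uracil and ultimately appear as thymine in the PCR products, whereas methylated cytosines remain unchanged. Therefore, to align bisulfite-treated reads to the reference genome, all cytosines in the Watson and Crick strands of the reference genome are first converted to thymines, and the cytosines in the reads are likewise converted to thymines. The converted reads are then aligned to the reference genome. Finally, genome-wide methylation levels and specific methylation patterns are inferred from the cytosine and thymine counts in the aligned reads. Methyl-Pipe can identify genomic regions with significantly different methylation levels, and can therefore be used to identify potential fetal-specific methylation sites for noninvasive prenatal diagnosis. / The presence of fetal DNA in the cell-free plasma of pregnant women was first described in 1997. The initial clinical applications of this phenomenon focused on the detection of paternally inherited traits such as sex and rhesus D blood group status. The development of massively parallel sequencing technologies has allowed more sophisticated analyses of circulating cell-free DNA in maternal plasma. For example, through the determination of the proportional representation of chromosome 21 sequences in maternal plasma, noninvasive prenatal diagnosis of fetal Down syndrome can be achieved with an accuracy of >98%. 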
In the first part of my thesis, I have developed bioinformatics algorithms to perform genome-wide construction of the fetal genetic map from the massively parallel sequencing data of the maternal plasma DNA sample of a pregnant woman. Constructing the fetal genetic map from maternal plasma sequencing data is very challenging because fetal DNA constitutes only approximately 10% of the maternal plasma DNA. Moreover, as the fetal DNA in maternal plasma exists as short fragments of less than 200 bp, existing bioinformatics techniques for genome construction are not applicable for this purpose. For the construction of the genome-wide fetal genetic map, I have used the genomes of the father and the mother as scaffolds and calculated the fractional fetal DNA concentration. First, I looked at the paternal-specific sequences in maternal plasma to determine which portions of the father’s genome had been passed on to the fetus. For the determination of the maternal inheritance, I have developed the Relative Haplotype Dosage (RHDO) approach. This method is based on the principle that the portion of the maternal genome inherited by the fetus would be present at a slightly higher concentration in the maternal plasma. The use of haplotype information makes more effective use of the sequencing data. Thus, the maternal inheritance can be determined with a much lower sequencing depth than by looking at individual loci in the genome. This algorithm makes it feasible to use genome-wide scanning to diagnose fetal genetic disorders prenatally in a noninvasive way. / With the emergence of targeted massively parallel sequencing, the sequencing cost per base is falling dramatically. 
Even though the first part of the thesis has already developed a method to estimate fractional fetal DNA concentration using parental genotype information, it cannot deduce the fractional fetal DNA concentration directly from sequencing data without prior knowledge of genotype information. In the second part of this thesis, I propose a statistical mixture-model-based method, FetalQuant, which uses maximum likelihood to estimate the fractional fetal DNA concentration directly from targeted massively parallel sequencing of maternal plasma DNA. This method estimates the fetal DNA concentration without the genotype information that existing methods require, and without loss of accuracy. Furthermore, by using Bayes’ rule, this method can distinguish the informative SNPs where the mother is homozygous and the fetus is heterozygous, which has the potential to detect dominantly inherited disorders. / Besides the genetic analysis at the DNA level, epigenetic markers are also valuable for the development of noninvasive diagnosis. In the third part of this thesis, I have also developed a bioinformatics algorithm to efficiently analyze genome-wide DNA methylation status based on the massively parallel sequencing of bisulfite-converted DNA. DNA methylation is one of the most important mechanisms for regulating gene expression. The study of DNA methylation for different genes is important for understanding different physiological and pathological processes. Currently, the most popular method for analyzing DNA methylation status is bisulfite sequencing. This method is based on the fact that unmethylated cytosine residues are chemically converted to uracil on bisulfite treatment whereas methylated cytosines remain unchanged. The converted uracil and unconverted cytosine can then be discriminated on sequencing. 
With the emergence of massively parallel sequencing platforms, it is possible to perform this bisulfite sequencing analysis on a genome-wide scale. However, the bioinformatics analysis of genome-wide bisulfite sequencing data is much more complicated than analyzing the data from individual loci. Thus, I have developed Methyl-Pipe, a bioinformatics program for analyzing the genome-wide DNA methylation status of DNA samples based on massively parallel sequencing. In the first step of this algorithm, an in-silico converted reference genome is produced by converting all the cytosine residues to thymine residues. Then, the sequenced reads of bisulfite-converted DNA are aligned to this modified reference sequence. Finally, post-processing of the alignments removes non-unique and low-quality mappings and characterizes the methylation pattern in a genome-wide manner. Making use of this new program, potential fetal-specific hypomethylated regions which can be used as blood biomarkers can be identified in a genome-wide manner. / Detailed summary in vernacular field only. / Jiang, Peiyong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 100-105). / Abstracts also in Chinese. 
/ Chapter SECTION I : --- BACKGROUND --- p.1 / Chapter CHAPTER 1: --- Circulating nucleic acids and Next-generation sequencing --- p.2 / Chapter 1.1 --- Circulating nucleic acids --- p.2 / Chapter 1.2 --- Next-generation sequencing --- p.3 / Chapter 1.3 --- Bioinformatics analyses --- p.9 / Chapter 1.4 --- Applications of the NGS --- p.11 / Chapter 1.5 --- Aims of this thesis --- p.12 / Chapter SECTION II : --- Mathematically decoding fetal genome in maternal plasma --- p.14 / Chapter CHAPTER 2: --- Characterizing the maternal and fetal genome in plasma at single base resolution --- p.15 / Chapter 2.1 --- Introduction --- p.15 / Chapter 2.2 --- SNP categories and principle --- p.17 / Chapter 2.3 --- Clinical cases and SNP genotyping --- p.20 / Chapter 2.4 --- Sequencing depth and fractional fetal DNA concentration determination --- p.24 / Chapter 2.5 --- Filtering of genotyping errors for maternal genotypes --- p.26 / Chapter 2.6 --- Constructing fetal genetic map in maternal plasma --- p.27 / Chapter 2.7 --- Sequencing error estimation --- p.36 / Chapter 2.8 --- Paternal-inherited alleles --- p.38 / Chapter 2.9 --- Maternally-derived alleles by RHDO analysis --- p.39 / Chapter 2.10 --- Recombination breakpoint simulation and detection --- p.49 / Chapter 2.11 --- Prenatal diagnosis of β-thalassaemia --- p.51 / Chapter 2.12 --- Discussion --- p.53 / Chapter SECTION III : --- Statistical model for fractional fetal DNA concentration estimation --- p.56 / Chapter CHAPTER 3: --- FetalQuant: deducing the fractional fetal DNA concentration from massively parallel sequencing of maternal plasma DNA --- p.57 / Chapter 3.1 --- Introduction --- p.57 / Chapter 3.2 --- Methods --- p.60 / Chapter 3.2.1 --- Maternal-fetal genotype combinations --- p.60 / Chapter 3.2.2 --- Binomial mixture model and likelihood --- p.64 / Chapter 3.2.3 --- Fractional fetal DNA concentration fitting --- p.66 / Chapter 3.3 --- Results --- p.71 / Chapter 3.3.1 --- Datasets --- p.71 / Chapter 3.3.2 --- 
Evaluation of FetalQuant algorithm --- p.75 / Chapter 3.3.3 --- Simulation --- p.78 / Chapter 3.3.4 --- Sequencing depth and the number of SNPs required by FetalQuant --- p.81 / Chapter 3.5 --- Discussion --- p.85 / Chapter SECTION IV : --- NGS-based data analysis pipeline development --- p.88 / Chapter CHAPTER 4: --- Methyl-Pipe: Methyl-Seq bioinformatics analysis pipeline --- p.89 / Chapter 4.1 --- Introduction --- p.89 / Chapter 4.2 --- Methods --- p.89 / Chapter 4.2.1 --- Overview of Methyl-Pipe --- p.90 / Chapter 4.3 --- Results and discussion --- p.96 / Chapter SECTION V : --- CONCLUDING REMARKS --- p.97 / Chapter CHAPTER 5: --- Conclusion and future perspectives --- p.98 / Chapter 5.1 --- Conclusion --- p.98 / Chapter 5.2 --- Future perspectives --- p.99 / Reference --- p.100 DNA--Analysis--Data processing Nucleotide sequence--Data processing Bioinformatics Sequence Analysis, DNA--methods Computational Biology--methods
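The three Methyl-Pipe steps described in the abstract of record 1 (in-silico conversion of the reference, alignment of converted reads, methylation calling from cytosine and thymine counts) can be sketched as follows. This is a minimal illustration and not the thesis software: the function names are hypothetical, only the forward (Watson) strand is handled, and an exact string search stands in for a real short-read aligner.

```python
from typing import List, Optional, Tuple


def c_to_t(seq: str) -> str:
    """In-silico bisulfite conversion: replace every cytosine with thymine."""
    return seq.upper().replace("C", "T")


def call_methylation(reference: str, read: str, offset: int) -> List[Tuple[int, bool]]:
    """Compare the ORIGINAL read with the ORIGINAL reference at an aligned offset.

    At each reference cytosine, a C in the read means the base resisted
    conversion (methylated); a T means it was converted (unmethylated).
    Any other base is treated as a sequencing error and skipped.
    """
    calls = []
    for i, base in enumerate(read.upper()):
        pos = offset + i
        if reference[pos].upper() != "C":
            continue
        if base == "C":
            calls.append((pos, True))
        elif base == "T":
            calls.append((pos, False))
    return calls


def methylation_level(calls: List[Tuple[int, bool]]) -> Optional[float]:
    """Fraction of covered cytosines called methylated, or None if none covered."""
    if not calls:
        return None
    return sum(1 for _, methylated in calls if methylated) / len(calls)


if __name__ == "__main__":
    reference = "ACGTCCGA"   # toy reference (assumption, not real data)
    read = "ATGTCTGA"        # toy bisulfite-treated read
    # Step 1 and 2: convert both sequences, then align in the converted space.
    # str.find is a stand-in for a real aligner and only finds exact matches.
    offset = c_to_t(reference).find(c_to_t(read))
    # Step 3: call methylation from the unconverted sequences.
    calls = call_methylation(reference, read, offset)
    print(calls, methylation_level(calls))
```

Aligning in the converted space is what lets a read with an arbitrary mix of converted and unconverted cytosines still match the reference; the unconverted sequences are consulted again only at the calling step.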
2	Algorithms in protein functionality analysis. January 2002 (has links) Leung Ka-Kit. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. / Includes bibliographical references (leaves 129-131). / Abstracts in English and Chinese. / Abstract --- p.1 / Chapter CHAPTER 1. --- introduction --- p.14 / Chapter 1.1 --- Preamble --- p.14 / Chapter 1.2 --- Biological background --- p.14 / Chapter CHAPTER 2. --- previous related work --- p.18 / Chapter 2.1 --- Protein functionality analysis --- p.18 / Chapter 2.1.1 --- Analysis from primary structure --- p.18 / Chapter 2.1.2 --- Analysis from tertiary structure --- p.20 / Chapter 2.2 --- Secondary structure prediction --- p.21 / Chapter 2.3 --- Motivation - Challenges from protein complexity --- p.22 / Chapter CHAPTER 3. --- mathematical representations for protein properties and sequence alignment --- p.24 / Chapter 3.1 --- Secondary structure sequence model --- p.24 / Chapter 3.2 --- Substitution matrix --- p.26 / Chapter 3.3 --- Gap --- p.26 / Chapter 3.4 --- Similarity measurement --- p.27 / Chapter 3.5 --- Geometric Model for Protein --- p.28 / Chapter CHAPTER 4. --- overall system design --- p.30 / Chapter 4.1 --- System architecture and design --- p.30 / Chapter 4.2 --- System environment --- p.32 / Chapter 4.3 --- Experimental data --- p.32 / Chapter CHAPTER 5. 
--- adaptive dynamic programming (adp)- general global alignment consideration --- p.35 / Chapter 5.1 --- t-triangles cutting --- p.35 / Chapter 5.1.1 --- Theoretical time and memory requirements of ADP with z-triangles cutting --- p.43 / Chapter 5.1.1.1 --- Study of parameters affecting h in case 1 --- p.44 / Chapter 5.1.1.2 --- Study of parameters affecting h in case 2 --- p.45 / Chapter 5.1.2 --- Experimental results of ADP with z-triangles cutting --- p.46 / Chapter 5.2 --- Constructing the path matrix by expansion --- p.51 / Chapter 5.2.1 --- Time and memory requirements of EXPAND --- p.57 / Chapter 5.2.2 --- Experimental results and discussions --- p.58 / Chapter CHAPTER 6. --- adp - global alignment of sequences with consecutive repeated characters --- p.65 / Chapter 6.1 --- Estimation of similarity upper bound (Ba) --- p.65 / Chapter 6.1.1 --- Sequence composition (SC) consideration --- p.65 / Chapter 6.1.2 --- Implementation of SC --- p.67 / Chapter 6.1.3 --- Experimental results --- p.69 / Chapter 6.1.4 --- Overall trend of change of structures (OTCS) --- p.74 / Chapter 6.1.5 --- Uninformed search --- p.76 / Chapter 6.2 --- Short-cut --- p.80 / Chapter 6.2.1 --- Time and memory requirements --- p.86 / Chapter 6.2.2 --- Experimental results and discussions --- p.86 / Chapter CHAPTER 7. 
--- ga based topology discovery --- p.87 / Chapter 7.1 --- Chromosome encoding --- p.87 / Chapter 7.2 --- Non-sequential order penalty --- p.88 / Chapter 7.3 --- Fitness function --- p.88 / Chapter 7.4 --- Genetic operators --- p.88 / Chapter 7.4.1 --- Hop operator --- p.89 / Chapter 7.4.2 --- Inverse operator --- p.89 / Chapter 7.4.3 --- Shift operator --- p.90 / Chapter 7.4.4 --- Selection pressure --- p.90 / Chapter 7.5 --- Selection of progeny --- p.91 / Chapter 7.6 --- Implementation --- p.91 / Chapter 7.6.1 --- Size of population and generation --- p.91 / Chapter 7.6.2 --- Parallelization --- p.91 / Chapter 7.6.3 --- Crowding Handling --- p.92 / Chapter 7.6.4 --- Selection of progeny --- p.92 / Chapter 7.7 --- Results of alignment with GA exploration on topological order --- p.93 / Chapter CHAPTER 8. --- FILTERING OF FALSE POSITIVES --- p.103 / Chapter 8.1 --- Alignment Segments to Gap Ratio (ASGR) --- p.103 / Chapter 8.2 --- Tolerance --- p.104 / Chapter 8.3 --- Overall trend of change of structures (OTCS) --- p.104 / Chapter 8.4 --- Results and discussions --- p.105 / Chapter CHAPTER 9. --- SECONDARY STRUCTURE PREDICTION --- p.111 / Chapter 9.1 --- 3-STATE SECONDARY STRUCTURE PREDICTION IMPROVEMENT --- p.111 / Chapter 9.2 --- 8-state secondary structure prediction --- p.117 / Chapter 9.3 --- Iterative Subordinate Voting (ISV) --- p.117 / Chapter 9.4 --- ISV Results and discussion --- p.119 / Chapter CHAPTER 10. --- CONCLUSIONS --- p.123 / Chapter 10.1 --- Contributions --- p.123 / Chapter 10.2 --- Future Work --- p.126 / Chapter 10.2.1 --- Using database indexing --- p.126 / Chapter 10.2.2 --- 3-state secondary structure prediction improvement --- p.127 / appendix --- p.128 / Chapter --- Interpretation on the dp-filter results --- p.128 Proteins--Conformation Bioinformatics Dynamic programming Genetic algorithms Proteins--Analysis Protein Conformation Sequence Alignment Computational Biology--methods Protein--analysis
3	Clues of identification of protein-protein interaction sites. January 2005 (has links) Leung Ka-Kit. / Thesis submitted in: November 2004. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 67-71). / Abstracts in English and Chinese. / Abstract / Chapter CHAPTER 1. --- INTRODUCTION --- p.1 / Chapter 1.1 --- Background of protein structures --- p.1 / Chapter 1.2 --- Background of protein-protein interaction (PPI) --- p.4 / Chapter 1.2.1 --- Quaternary structure and protein complex --- p.4 / Chapter 1.2.2 --- Previous related work --- p.4 / Chapter 1.2.3 --- The kinetic and thermodynamic formalism --- p.6 / Chapter CHAPTER 2. --- MATERIALS AND METHODS --- p.10 / Chapter 2.1 --- Amino acid composition representative power modeling --- p.10 / Chapter 2.1.1 --- Propensity level modeling --- p.10 / Chapter 2.1.2 --- Polar atoms visualization --- p.17 / Chapter 2.2 --- Rigid structure representative power modeling --- p.17 / Chapter 2.3 --- Electrostatic potential modeling --- p.17 / Chapter 2.3.1 --- Charge residence --- p.17 / Chapter 2.3.2 --- Minimum Ribbon (MR) --- p.19 / Chapter 2.4 --- Examination of interface --- p.23 / Chapter 2.5 --- Identification procedures of a binding site --- p.24 / Chapter 2.6 --- System requirements --- p.24 / Chapter CHAPTER 3. --- RESULTS AND DISCUSSIONS --- p.24 / Chapter 3.1 --- Polar atoms --- p.25 / Chapter 3.2 --- Minimum Ribbon (MR) --- p.27 / Chapter 3.3 --- "Charge complementarity, propensity level and rigid structure orientation" --- p.31 / Chapter 3.4 --- Identification of interacting site --- p.36 / Chapter CHAPTER 4. --- CONCLUSIONS --- p.64 / System requirements --- p.65 / Basic operation --- p.65 / Limitation --- p.66 Protein-protein interactions Proteins--Structure--Mathematical Models Protein Binding--physiology Protein Conformation Binding sites Computational Biology--methods
4	A computational framework for protein-DNA binding discovery. January 2010 (has links) Wong, Ka Chun. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 109-121). / Abstracts in English and Chinese. / Abstract --- p.ii / Acknowledgements --- p.iv / List of Figures --- p.ix / List of Tables --- p.xi / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Motivation --- p.1 / Chapter 1.2 --- Objective --- p.2 / Chapter 1.3 --- Methodology --- p.2 / Chapter 1.4 --- Bioinformatics --- p.2 / Chapter 1.5 --- Computational Methods --- p.3 / Chapter 1.5.1 --- Evolutionary Algorithms --- p.3 / Chapter 1.5.2 --- Data Mining for TF-TFBS bindings --- p.4 / Chapter 2 --- Background --- p.5 / Chapter 2.1 --- Gene Transcription --- p.5 / Chapter 2.1.1 --- Protein-DNA Binding --- p.6 / Chapter 2.1.2 --- Existing Methods --- p.6 / Chapter 2.1.3 --- Related Databases --- p.8 / Chapter 2.1.3.1 --- TRANSFAC - Experimentally Determined Database --- p.8 / Chapter 2.1.3.2 --- cisRED - Computational Determined Database --- p.9 / Chapter 2.1.3.3 --- ORegAnno - Community Driven Database --- p.10 / Chapter 2.2 --- Evolutionary Algorithms --- p.13 / Chapter 2.2.1 --- Representation --- p.15 / Chapter 2.2.2 --- Parent Selection --- p.16 / Chapter 2.2.3 --- Crossover Operators --- p.17 / Chapter 2.2.4 --- Mutation Operators --- p.18 / Chapter 2.2.5 --- Survival Selection --- p.19 / Chapter 2.2.6 --- Termination Condition --- p.19 / Chapter 2.2.7 --- Discussion --- p.19 / Chapter 2.2.8 --- Examples --- p.19 / Chapter 2.2.8.1 --- Genetic Algorithm --- p.20 / Chapter 2.2.8.2 --- Genetic Programming --- p.21 / Chapter 2.2.8.3 --- Differential Evolution --- p.21 / Chapter 2.2.8.4 --- Evolution Strategy --- p.22 / Chapter 2.2.8.5 --- Swarm Intelligence --- p.23 / Chapter 2.3 --- Association Rule Mining --- p.24 / Chapter 2.3.1 --- Objective --- p.24 / Chapter 2.3.2 --- Apriori Algorithm --- p.24 / Chapter 2.3.3 --- Partition Algorithm --- p.25 / 
Chapter 2.3.4 --- DHP --- p.25 / Chapter 2.3.5 --- Sampling --- p.25 / Chapter 2.3.6 --- Frequent Pattern Tree --- p.26 / Chapter 3 --- Discovering Protein-DNA Binding Sequence Patterns Using Association Rule Mining --- p.27 / Chapter 3.1 --- Materials and Methods --- p.28 / Chapter 3.1.1 --- Association Rule Mining and Apriori Algorithm --- p.29 / Chapter 3.1.2 --- Discovering associated TF-TFBS sequence patterns --- p.29 / Chapter 3.1.3 --- Data Preparation --- p.31 / Chapter 3.2 --- Results and Analysis --- p.34 / Chapter 3.2.1 --- Rules Discovered --- p.34 / Chapter 3.2.2 --- Quantitative Analysis --- p.36 / Chapter 3.2.3 --- Annotation Analysis --- p.37 / Chapter 3.2.4 --- Empirical Analysis --- p.37 / Chapter 3.2.5 --- Experimental Analysis --- p.38 / Chapter 3.3 --- Verifications --- p.41 / Chapter 3.3.1 --- Verification by PDB --- p.41 / Chapter 3.3.2 --- Verification by Homology Modeling --- p.45 / Chapter 3.3.3 --- Verification by Random Analysis --- p.45 / Chapter 3.4 --- Discussion --- p.49 / Chapter 4 --- Designing Evolutionary Algorithms for Multimodal Optimization --- p.50 / Chapter 4.1 --- Introduction --- p.50 / Chapter 4.2 --- Problem Definition --- p.51 / Chapter 4.2.1 --- Minimization --- p.51 / Chapter 4.2.2 --- Maximization --- p.51 / Chapter 4.3 --- An Evolutionary Algorithm with Species-specific Explosion for Multimodal Optimization --- p.52 / Chapter 4.3.1 --- Background --- p.52 / Chapter 4.3.1.1 --- Species Conserving Genetic Algorithm --- p.52 / Chapter 4.3.2 --- Evolutionary Algorithm with Species-specific Explosion --- p.53 / Chapter 4.3.2.1 --- Species Identification --- p.53 / Chapter 4.3.2.2 --- Species Seed Delta Evaluation --- p.55 / Chapter 4.3.2.3 --- Stage Switching Condition --- p.56 / Chapter 4.3.2.4 --- Species-specific Explosion --- p.57 / Chapter 4.3.2.5 --- Calculate Explosion Weights --- p.59 / Chapter 4.3.3 --- Experiments --- p.59 / Chapter 4.3.3.1 --- Performance measurement --- p.60 / Chapter 4.3.3.2 --- 
Parameter settings --- p.61 / Chapter 4.3.3.3 --- Results --- p.61 / Chapter 4.3.4 --- Conclusion --- p.62 / Chapter 4.4 --- A Crowding Genetic Algorithm with Spatial Locality for Multimodal Optimization --- p.64 / Chapter 4.4.1 --- Background --- p.64 / Chapter 4.4.1.1 --- Crowding Genetic Algorithm --- p.64 / Chapter 4.4.1.2 --- Locality of Reference --- p.64 / Chapter 4.4.2 --- Crowding Genetic Algorithm with Spatial Locality --- p.65 / Chapter 4.4.2.1 --- Motivation --- p.65 / Chapter 4.4.2.2 --- Offspring generation with spatial locality --- p.65 / Chapter 4.4.3 --- Experiments --- p.67 / Chapter 4.4.3.1 --- Performance measurements --- p.67 / Chapter 4.4.3.2 --- Parameter setting --- p.68 / Chapter 4.4.3.3 --- Results --- p.68 / Chapter 4.4.4 --- Conclusion --- p.68 / Chapter 5 --- Generalizing Protein-DNA Binding Sequence Representations and Learning using an Evolutionary Algorithm for Multimodal Optimization --- p.70 / Chapter 5.1 --- Introduction and Background --- p.70 / Chapter 5.2 --- Problem Definition --- p.72 / Chapter 5.3 --- Crowding Genetic Algorithm with Spatial Locality --- p.72 / Chapter 5.3.1 --- Representation --- p.72 / Chapter 5.3.2 --- Crossover Operators --- p.73 / Chapter 5.3.3 --- Mutation Operators --- p.73 / Chapter 5.3.4 --- Fitness Function --- p.74 / Chapter 5.3.5 --- Distance Metric --- p.76 / Chapter 5.4 --- Experiments --- p.77 / Chapter 5.4.1 --- Parameter Setting --- p.77 / Chapter 5.4.2 --- Search Space Estimation --- p.78 / Chapter 5.4.3 --- Experimental Procedure --- p.78 / Chapter 5.4.4 --- Results and Analysis --- p.79 / Chapter 5.4.4.1 --- Generalization Analysis --- p.79 / Chapter 5.4.4.2 --- Verification By PDB --- p.86 / Chapter 5.5 --- Conclusion --- p.87 / Chapter 6 --- Predicting Protein Structures on a Lattice Model using an Evolutionary Algorithm for Multimodal Optimization --- p.88 / Chapter 6.1 --- Introduction --- p.88 / Chapter 6.2 --- Problem Definition --- p.89 / Chapter 6.3 --- Representation --- 
p.90 / Chapter 6.4 --- Related Works --- p.91 / Chapter 6.5 --- Crowding Genetic Algorithm with Spatial Locality --- p.92 / Chapter 6.5.1 --- Motivation --- p.92 / Chapter 6.5.2 --- Customization --- p.92 / Chapter 6.5.2.1 --- Distance metrics --- p.92 / Chapter 6.5.2.2 --- Handling infeasible conformations --- p.93 / Chapter 6.6 --- Experiments --- p.94 / Chapter 6.6.1 --- Performance Metrics --- p.94 / Chapter 6.6.2 --- Parameter Settings --- p.94 / Chapter 6.6.3 --- Results --- p.94 / Chapter 6.7 --- Conclusion --- p.95 / Chapter 7 --- Conclusion and Future Work --- p.97 / Chapter 7.1 --- Thesis Contribution --- p.97 / Chapter 7.2 --- Future Work --- p.98 / Chapter A --- Appendix --- p.99 / Chapter A.1 --- Problem Definition in Chapter 3 --- p.107 / Bibliography --- p.109 / Author's Publications --- p.122 DNA-binding proteins--Analysis Computer algorithms Computational biology--Methodology DNA-Binding Proteins--analysis Algorithms Computational Biology--methods
5	Generalized pattern matching applied to genetic analysis. / 通用性模式匹配在基因序列分析中的應用 / CUHK electronic theses & dissertations collection / Digital dissertation consortium / Tong yong xing mo shi pi pei zai ji yin xu lie fen xi zhong de ying yong January 2011 (has links) The approximate pattern matching problem is, given a reference sequence T, a pattern (query) Q, and a maximum allowed error e, to find all substrings of the reference whose edit distance to the pattern is at most e. Though it is a well-studied problem in computer science, it has seen a resurgence in bioinformatics in recent years, largely due to the emergence of next-generation high-throughput sequencing technologies. This thesis contributes a novel generalized pattern matching framework and applies it to pattern matching problems in general and alternative splicing (AS) detection in particular. AS detection maps a large number of next-generation sequencing short reads to a reference human genome, which is the first and an important step in preparing the sequenced data for further biological analysis. The four parts of my research are as follows. / In the first part of my research work, we propose a novel deterministic pattern matching algorithm which applies Agrep, a well-known bit-parallel matching algorithm, to a truncated suffix array. Due to the linear cost of Agrep, the cost of our approach is linear in the number of characters processed in the truncated suffix array. We analyze the matching cost theoretically, and obtain empirical costs from experiments. We carry out experiments using both synthetic and real DNA sequence data (queries) and search them in Chromosome X of a reference human genome. The experimental results show that our approach achieves a speed-up of several orders of magnitude over the standard Agrep algorithm. / In the second part, we define a novel generalized pattern (query) and a framework of generalized pattern matching, for which we propose a heuristic matching algorithm. Simply speaking, a generalized pattern is $Q_1 G_1 Q_2 \dots Q_{c-1} G_{c-1} Q_c$, which consists of several substrings $Q_i$ and gaps $G_i$, each gap occurring between two substrings. The prototypes of the generalized pattern come from several real biological problems that can all be modeled as generalized pattern matching problems. Based on the well-known seeding-and-extending heuristic, we propose a dual-seeding strategy, with which we solve the matching problem effectively and efficiently. We also develop a specialized matching tool called Gpattern-match. 
We carry out experiments using 10,000 generalized patterns and search them in a reference human genome (hg18). Over 98.74% of them can be recovered from the reference. It takes 1-2 seconds on average to recover a pattern, and peak memory is a little more than 1 GB. / In the third part, a natural extension of the second part, we model a real biological problem, alternative splicing detection, as a generalized pattern matching problem, and solve it using a proposed bi-directional seeding-and-extending algorithm. Unlike the other tools, which all depend on third-party tools, our mapping tool, ABMapper, is not only stand-alone but also performs unbiased alignments. We carry out experiments using 427,786 real next-generation sequencing short reads (queries) and align them back to a reference human genome (hg18). ABMapper achieves 98.92% accuracy and a 98.17% recall rate, much better than the other state-of-the-art tools: SpliceMap achieves 94.28% accuracy and a 78.13% recall rate, while TopHat achieves 88.99% accuracy and a 76.33% recall rate. When the seed length is set to 12 in ABMapper, the whole searching and alignment process takes about 20 minutes, and peak memory is a little more than 2 GB. / In the fourth part, we focus on the seeding strategies for alternative splicing detection. We review the history of seeding-and-extending (SAE), and assess both theoretically and empirically the seeding strategies adopted in existing splicing detection tools, including Bowtie's heuristic and ABMapper's exact seedings, against our novel complementary quad-seeding strategy and the corresponding splice detection tool, CS4splice, which can handle inexact seeding (with errors) and all three error types: mismatch (substitution), insertion, and deletion. We carry out experiments using short reads (queries) of length 105 bp, drawn from several data sets with various levels of errors, and align them back to a reference human genome (hg18). On average, CS4splice can align 88.44% (recall rate) of 427,786 short reads perfectly back to the reference, while the other existing tools achieve much smaller recall rates: SpliceMap 48.72%, MapSplice 58.41%, and ABMapper 51.39%. The accuracies of CS4splice are also the highest or very close to the highest in all the experiments carried out. However, because of the complementary quad-seeding it uses, CS4splice takes about twice (or more) the computational resources of the other alternative splicing detection tools, which we consider an acceptable and worthwhile trade-off. 
/ Ni, Bing. / Adviser: Kwong-Sak Leung. / Source: Dissertation Abstracts International, Volume: 73-06, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 151-161). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. Ann Arbor, MI : ProQuest Information and Learning Company, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. 
Combinatorial analysis Computational biology Computer algorithms DNA--Analysis--Data processing Genetics--Methodology Matching theory Proteins--Analysis--Data processing Computational Biology--methods Sequence Analysis, DNA Sequence Analysis, Protein
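The approximate pattern matching problem stated at the start of record 5 (find every substring of T within edit distance e of Q) can be illustrated with a short sketch. Note the substitution: the thesis applies Agrep, a bit-parallel algorithm, to a truncated suffix array, whereas the sketch below uses the plain textbook dynamic-programming recurrence (often attributed to Sellers), which is slower but easier to read. The function name and the toy sequences are assumptions for illustration only.

```python
from typing import List, Tuple


def approx_match_ends(text: str, pattern: str, max_err: int) -> List[Tuple[int, int]]:
    """Return (end_index, distance) for every position in `text` where some
    substring ending there is within `max_err` edits of `pattern`.

    Edits are substitution, insertion, and deletion (unit cost each).
    Runs in O(len(text) * len(pattern)) time and O(len(pattern)) memory.
    """
    m = len(pattern)
    # Column for the empty text prefix: matching the first i pattern
    # characters against nothing costs i deletions.
    prev = list(range(m + 1))
    ends = []
    for j, text_char in enumerate(text):
        # Row 0 is always 0: a match is allowed to start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (0 if pattern[i - 1] == text_char else 1)
            skip_text_char = prev[i] + 1      # extra character in the text
            skip_pattern_char = cur[i - 1] + 1  # pattern character missing from text
            cur[i] = min(substitution, skip_text_char, skip_pattern_char)
        if cur[m] <= max_err:
            ends.append((j, cur[m]))
        prev = cur
    return ends


if __name__ == "__main__":
    # Toy reference and query (assumptions, not data from the thesis).
    print(approx_match_ends("GATTACA", "TTA", 0))
    print(approx_match_ends("GATTACA", "TTA", 1))
```

The function reports end positions only; recovering the start of each match would need a traceback or a second pass over the reversed strings.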
6	Computational biology approaches in drug repurposing and gene essentiality screening Philips, Santosh 20 June 2016 (has links) Indiana University-Purdue University Indianapolis (IUPUI) / The rapid innovations in biotechnology have led to an exponential growth of data and electronically accessible scientific literature. Within this enormous body of scientific data, knowledge can be exploited and novel discoveries can be made. In my dissertation, I have focused on discovering novel molecular mechanisms and therapeutics for complex diseases from big data. It is evident today that complex diseases have many contributing factors, including genetic and environmental effects. The discovery of these factors is challenging and critical in personalized medicine. The increasing cost and time to develop new drugs pose a new challenge in effectively treating complex diseases. In this dissertation, we demonstrate the use of existing data and literature as a potential resource for discovering novel therapies and repositioning existing drugs. The key to identifying novel knowledge is in integrating information from decades of research across the different scientific disciplines to uncover interactions that are not explicitly stated. This puts critical information at the fingertips of researchers and clinicians, who can take advantage of this newly acquired knowledge to make informed decisions. This dissertation utilizes computational biology methods to identify and integrate existing scientific data and literature resources in the discovery of novel molecular targets and drugs that can be repurposed. In chapter 1 of my dissertation, I extensively sifted through scientific literature and identified a novel interaction between Vitamin A and CYP19A1 that could lead to a potential increase in the production of estrogens. 
Further, in chapter 2, by exploring a microarray dataset from an estradiol gene sensitivity study, I was able to identify a potential novel anti-estrogenic indication for the commonly used urinary analgesic phenazopyridine. Both discoveries were experimentally validated in the laboratory. In chapter 3 of my dissertation, through the use of a manually curated corpus and machine learning algorithms, I identified and extracted genes that are essential for cell survival. These results show that novel knowledge with potential clinical applications can be discovered from existing data and literature by integrating information across various scientific disciplines. Drug repurposing Gene essentiality Literature mining Machine learning Biology -- Data processing Computational biology -- Methods Epidemiology -- Statistical methods Personalized medicine Genetic disorders -- Molecular diagnosis
7	Pattern discovery for deciphering gene regulation based on evolutionary computation. / CUHK electronic theses & dissertations collection January 2010 On TFBS motif discovery, three novel GA-based algorithms are developed, namely GALF-P with a focus on optimization, GALF-G for modeling, and GASMEN for spaced motifs. Novel memetic operators, namely local filtering and probabilistic refinement, are introduced to significantly improve the effectiveness (e.g. 73% better than MEME) and efficiency (e.g. a 4.49-fold speedup) of the search. The GA-based algorithms have been extensively tested on comprehensive synthetic, real and benchmark datasets, and have shown outstanding performance compared with state-of-the-art approaches. Our algorithms also "evolve" to handle increasingly relaxed cases, namely from fixed motif widths to fully flexible widths, from single motifs to multiple motifs with overlapping control, from stringent assumptions about motif instances to very relaxed ones, and from contiguous motifs to generic spaced motifs with arbitrary spacers. / TF-TFBS associated sequence pattern (rule) discovery is further investigated to better decipher protein-DNA interactions in regulation. For the first time, we generalize previous exact TF-TFBS rules to approximate ones using a progressive approach. A customized algorithm is developed, outperforming MEME by over 73%. Compared with the exact rules, the approximate TF-TFBS rules yield significantly more verified rules and better verification ratios. Detailed analysis of PDB cases and conservation verification on NCBI protein records illustrate that the approximate rules reveal flexible and specific protein-DNA interactions with much greater generalization capability. 
/ The comprehensive pattern discovery algorithms developed will be further verified, improved and extended to further decipher transcriptional regulation, such as inferring whole gene regulatory networks by applying the discovered TFBS and TF-TFBS patterns and incorporating expression data. / Bindings between Transcription Factors (TFs) and Transcription Factor Binding Sites (TFBSs) are fundamental protein-DNA interactions in transcriptional regulation. TFs and TFBSs are conserved and form patterns (motifs) because of their important roles in controlling gene expression and ultimately affecting function and appearance. Pattern discovery is thus important for deciphering gene regulation, which has a tremendous impact on the understanding of life, bio-engineering and therapeutic applications. This thesis contributes to pattern discovery involving TFBS motifs and TF-TFBS associated sequence patterns based on Evolutionary Computation (EC), especially Genetic Algorithms (GAs), which are promising for bioinformatics problems with huge and noisy search spaces. / Chan, Tak Ming. / Advisers: Kwong-Sak Leung; Kin-Hong Lee. / Source: Dissertation Abstracts International, Volume: 73-03, Section: B, page: . / Thesis (Ph.D.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 147-153). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese. Computational biology; DNA-binding proteins; Evolutionary computation; Genetic algorithms; Genetic regulation--Mathematical models; Transcription factors; Computational Biology--methods; DNA-Binding Proteins; Gene Expression Regulation; Neural Networks (Computer); Transcription Factors
8	Computational protein design: assessment and applications Li, Zhixiu January 2015 Indiana University-Purdue University Indianapolis (IUPUI) / Computational protein design aims to design amino acid sequences that fold into a target structure and perform a desired function. Many computational design methods have been developed and successfully applied during the past two decades. However, the success rate of protein design remains too low for it to be a useful tool for biochemists who are not experts in computational biology. In this dissertation, we first developed novel computational assessment techniques to evaluate several state-of-the-art computational design methods. We found that two new scoring functions, one from RosettaDesign and one from OSCAR-design, made significant progress on several important measures. We also developed the first machine-learning technique, called SPIN, that predicts a sequence profile compatible with a given structure using a novel nonlocal energy-based feature. The accuracy of the predicted sequences is comparable to that of RosettaDesign in terms of sequence identity to wild-type sequences. In the last two application chapters, we designed self-inhibitory peptides of Escherichia coli methionine aminopeptidase (EcMetAP) and designed barstar de novo. Several peptides were confirmed to inhibit EcMetAP, with 50% inhibitory concentrations in the micromolar range. Meanwhile, the assessment of the designed barstar sequences indicates an improvement of OSCAR-design over RosettaDesign. Computational protein design; Energy function; Machine learning; Self-inhibitory peptide; Sequence profile; Inhibitor; Protein engineering; Protein engineering -- Methods; Proteins -- Conformation; Protein folding; Computational biology; Computational biology -- Methods; Machine learning -- Technique

Search results