Global ETD Search

1	Quantifying recent variation and relatedness in human populations
Gusev, Alexander. January 2012.
Advances in the genetic analysis of humans have revealed a surprising abundance of local relatedness between purportedly unrelated individuals. Where common mutations classically inform us of ancient relationships, such segments of pairwise identical-by-descent (IBD) sharing from a common ancestor are the observable traces of recent inter-mating. Combining these two distinct sources of information can help disentangle the complex genetic structure and flux in human populations. When considered together with a heritable trait, the segments can also be used to interrogate unascertained rare variation and help in locating trait-affecting loci. This work presents methods for comprehensive analysis of population-wide IBD and explores applications to disease and the understanding of recent genetic variation. We propose several strategies for efficient detection of IBD segments in population genotype data. Our novel seed-based algorithm, GERMLINE, can reduce the computational burden of finding pairwise segments from quadratic to nearly linear time in a general population. We demonstrate that this approach is several orders of magnitude faster than the available all-pairs methods while maintaining higher accuracy. Next, we extend the GERMLINE technique to process cohorts of unlimited size by adaptively adjusting the search mechanism to meet resource restrictions. We confirm its effectiveness with an analysis of 50,000 individuals, where contemporary methods can only process a few thousand. One drawback of these two algorithms is their dependence on phased haplotype data as input, a constraint that becomes more difficult to satisfy with large populations. We propose a solution to this problem with an algorithm that analyzes genotype data directly by exploring all potential haplotypes and scoring each putative segment based on linkage disequilibrium.
This solution significantly outperforms available methods when applied to full sequence data and is computationally efficient enough to analyze thousands of sequenced genomes where current methods can only determine haplotypes for several hundred. Secondly, we outline two algorithms for analyzing available IBD segments to increase our understanding of rare variation and complex disease. Motivated by whole-genome sequencing, we present the INFOSTIP algorithm, which uses IBD segments to optimize the selection of individuals for complete population ascertainment. In simulations, we show that INFOSTIP selection can significantly increase variant inference accuracy over random sampling and posit inference of 60% of an isolated population from 1% optimally selected individuals. Seeking to move beyond pairwise IBD segment analysis, we describe the DASH algorithm, which groups shared segments into IBD "clusters" that are likely to be commonly co-inherited and uses them as proxies for untyped variation. In simulated disease studies, we show this reference-free approach to be much more powerful for detecting rare causal variants than either traditional single-marker analysis or imputation from a general reference panel. Applying the DASH algorithm to disease traits from different populations, we identify multiple novel loci of association. Together, these novel techniques integrate the power of population and disease genetics.
Human genetics--Data processing Population Human population genetics Population genetics Genetics--Data processing Genetics Computer science
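The seed-based idea mentioned in the abstract above can be illustrated roughly as follows. This is a minimal sketch, not the GERMLINE implementation: it assumes phased haplotypes supplied as equal-length strings of 0/1 alleles, and the names `find_seed_matches` and `window` are hypothetical. Each window is hashed once, so only samples that land in the same bucket are ever paired.

```python
from collections import defaultdict
from itertools import combinations


def find_seed_matches(haplotypes, window):
    """Return {(id_a, id_b): [window_start, ...]} for exact window matches.

    haplotypes: dict mapping a sample id to a string of alleles, e.g. "0110".
    window: markers per non-overlapping seed window (an assumed parameter).
    """
    length = min(len(h) for h in haplotypes.values())
    matches = defaultdict(list)
    for start in range(0, length - window + 1, window):
        # Bucket samples by the exact allele string in this window.
        buckets = defaultdict(list)
        for sample_id, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(sample_id)
        # Only samples sharing a bucket become candidate IBD pairs.
        for ids in buckets.values():
            for a, b in combinations(sorted(ids), 2):
                matches[(a, b)].append(start)
    return dict(matches)


# Toy input (made up for illustration): three haplotypes, eight markers each.
example = {
    "s1": "01100110",
    "s2": "01101111",
    "s3": "11110110",
}
seeds = find_seed_matches(example, window=4)
```

Here `seeds` pairs s1 with s2 on the first window and s1 with s3 on the second. A practical method would then extend each seed while tolerating mismatches; that step is omitted from this sketch.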
2	The development and application of informatics-based systems for the analysis of the human transcriptome.
Kelso, Janet. January 2003.
Although the sequence of the human genome is now complete, it has become clear that the elucidation of the transcriptome is more complicated than previously expected. There is mounting evidence for unexpected and previously underestimated phenomena such as alternative splicing in the transcriptome. As a result, the identification of novel transcripts arising from the genome continues. Furthermore, as the volume of transcript data grows, it is becoming increasingly difficult to integrate expression information that comes from different sources, is stored in disparate locations, and is described using differing terminologies. Determining the function of translated transcripts also remains a complex task. Information about the expression profile (the location and timing of transcript expression) provides evidence that can be used in understanding the role of the expressed transcript in the organ or tissue under study, or in the developmental pathways or disease phenotype observed.

In this dissertation I present novel computational approaches with direct biological applications to two distinct but increasingly important areas of gene expression research. The first addresses the detection and characterisation of alternatively spliced transcripts. The second is the construction of a hierarchical controlled vocabulary for gene expression data and the annotation of expression libraries with controlled terms from the hierarchies. In the final chapter, the biological questions that can be approached and the discoveries that can be made using these systems are illustrated, with a view to demonstrating how the application of informatics can both enable and accelerate biological insight into the human transcriptome.
Gene expression, Data processing Genetics, Data processing Genomes, Data processing.
3	Database construction and computational analysis of bacterial small regulatory RNAs. / CUHK electronic theses & dissertations collection
January 2013
Li, Lei. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 85-91). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts also in Chinese.
Bacterial genetics--Data processing RNA--Data processing Genetic regulation
4	Bayesian approach for two model-selection-related bioinformatics problems. / CUHK electronic theses & dissertations collection
January 2013
The Bayesian approach is a powerful framework for inferring the parameters and structures of complicated probabilistic models from data. It is widely applied in many areas and is also well suited to bioinformatics problems, which are usually highly complex. In this thesis, new Bayesian models and computing methods are introduced to solve two bioinformatics problems, both of which are related to model selection. / The first problem concerns repeat pattern recognition. Tandem repeats occur frequently in DNA sequences and are important for studying genome evolution and human disease. This thesis focuses on the case in which an unknown number of tandem repeat segments of the same pattern are dispersed throughout a sequence. A probabilistic generative model is introduced for the tandem repeats. Markov chain Monte Carlo algorithms are used to explore the posterior distribution in order to infer both the pattern matrix of the tandem repeats and the locations of the repeat segments. Furthermore, reversible jump Markov chain Monte Carlo (RJMCMC) algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments.
/ The second part of this thesis addresses the conformational transitions of biomolecules. Because the function of a biomolecule is inherently related to its variable conformations, which can be grouped into a set of metastable or long-lived states, conformational transitions are important in biological processes. The 3D structural changes are generally obtained from molecular dynamics computer simulations. Based on the conformational transitions at the microstate level from molecular dynamics simulation, a Bayesian approach is developed to cluster the microstates into an uncertain number of metastable states, which gives rise to the model selection problem. / With these two problems, this thesis shows that the Bayesian approach to bioinformatics problems has advantages in accounting for the inherent uncertainty in biological data, handling noisy or missing data, and dealing with the model selection problem. / Detailed summary in vernacular field only. / Liang, Tong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2013. / Includes bibliographical references (leaves 120-130). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstracts also in Chinese.
/ Abstract / Acknowledgement / Chapter 1 --- Introduction / Chapter 1.1 --- Motivation / Chapter 1.2 --- Statistical Background / Chapter 1.3 --- Tandem Repeats / Chapter 1.4 --- Conformational Space / Chapter 1.5 --- Outlines / Chapter 2 --- Preliminaries / Chapter 2.1 --- Bayesian Inference / Chapter 2.2 --- Markov chain Monte Carlo / Chapter 2.2.1 --- Gibbs sampling / Chapter 2.2.2 --- Metropolis-Hastings algorithm / Chapter 2.2.3 --- Reversible Jump MCMC / Chapter 3 --- Detection of Dispersed Short Tandem Repeats Using Reversible Jump MCMC / Chapter 3.1 --- Background / Chapter 3.2 --- Generative Model / Chapter 3.3 --- Statistical inference / Chapter 3.3.1 --- Likelihood / Chapter 3.3.2 --- Prior Distributions / Chapter 3.3.3 --- Sampling from Posterior Distribution via RJMCMC / Chapter 3.3.4 --- Extra MCMC moves for better mixing / Chapter 3.3.5 --- The complete algorithm / Chapter 3.4 --- Experiments / Chapter 3.4.1 --- Evaluation and comparison of the two RJMCMC versions using synthetic data / Chapter 3.4.2 --- Comparison with existing methods using synthetic data / Chapter 3.4.3 --- Sensitivity to Priors / Chapter 3.4.4 --- Real data experiment / Chapter 3.5 --- Discussion / Chapter 4 --- A Probabilistic Clustering Algorithm for Conformational Changes of Biomolecules / Chapter 4.1 --- Introduction / Chapter 4.1.1 --- Molecular dynamic simulation / Chapter 4.1.2 --- Hierarchical Conformational Space / Chapter 4.1.3 --- Clustering Algorithms / Chapter 4.2 --- Generative Model / Chapter 4.2.1 --- Model 1: Vanilla Model / Chapter 4.2.2 --- Model 2: Zero-Inflated Model / Chapter 4.2.3 --- Model 3: Constrained Model / Chapter 4.2.4 --- Model 4: Constrained and Zero-Inflated Model / Chapter 4.3 --- Statistical Inference for Vanilla Model / Chapter 4.3.1 --- Priors / Chapter 4.3.2 --- Posterior distribution / Chapter 4.3.3 --- Collapsed Gibbs for Vanilla Model with a Fixed Number of Clusters / Chapter 4.3.4 --- Inference on the Number of Clusters / Chapter 4.3.5 --- Synthetic Data Study / Chapter 4.4 --- Statistical Inference for Zero-Inflated Model / Chapter 4.4.1 --- Method 1 / Chapter 4.4.2 --- Method 2 / Chapter 4.4.3 --- Synthetic Data Study / Chapter 4.5 --- Statistical Inference for Constrained Model / Chapter 4.5.1 --- Priors / Chapter 4.5.2 --- Posterior Distribution / Chapter 4.5.3 --- Collapsed Posterior Distribution / Chapter 4.5.4 --- Updating for Cluster Labels K / Chapter 4.5.5 --- Updating for Constrained Λ from Truncated Distribution / Chapter 4.5.6 --- Updating the Number of Clusters / Chapter 4.5.7 --- Uniform Background Parameters on Λ / Chapter 4.6 --- Real Data Experiments / Chapter 4.7 --- Discussion / Chapter 5 --- Conclusion and Future Work / Chapter A --- Appendix / Chapter A.1 --- Post-processing for indel treatment / Chapter A.2 --- Consistency Score / Chapter A.3 --- A Proof for Collapsed Posterior distribution in Constrained Model in Chapter 4 / Chapter A.4 --- Estimated Transition Matrices for Alanine Dipeptide by Chodera et al. (2006) / Bibliography
Bioinformatics--Statistical methods Genetics--Data processing Bayesian statistical decision theory
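For readers unfamiliar with the sampling machinery named in the record above, the following is a minimal sketch of a fixed-dimension random-walk Metropolis-Hastings sampler. It is not the thesis's method and it is not the reversible jump variant; the target (a standard normal), the step size, and the function names are all assumptions chosen only for illustration.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    samples = []
    x = x0
    log_p = log_target(x)
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # Symmetric proposal, so the acceptance probability is
        # min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


def log_standard_normal(x):
    """Unnormalized log density of N(0, 1); the assumed toy target."""
    return -0.5 * x * x


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = metropolis_hastings(log_standard_normal, x0=0.0,
                            n_samples=20000, step=1.0, rng=rng)
mean = sum(draws) / len(draws)
```

With enough draws the sample mean should sit close to zero. Reversible jump MCMC additionally proposes moves between models of different dimension, which this sketch does not attempt.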
6	Full Bayesian Boolean network inference based on Markov chain Monte Carlo algorithms.
January 2012
In bioinformatics, gene regulatory network inference is attracting intensive attention. Various network models have been used to describe gene regulatory relationships, including deterministic Boolean networks, probabilistic Boolean networks, and Bayesian networks. This dissertation focuses on data-based Boolean network reconstruction. Many methods have been proposed to infer this discrete network structure. For example, the REVEAL algorithm and the Best-Fit Extension method are popular and perform well for networks with a limited total number of nodes. However, existing methods do not fully account for the ubiquitous noise across the network or for structural uncertainty, which makes them unsatisfactory in real applications. In this dissertation, we use a full Bayesian approach to explore the space of probabilistic Boolean networks. To compare the relative fitness of networks to the input data, we design novel Markov chain Monte Carlo algorithms that jump among constrained networks according to the joint posterior probability. To facilitate the transdimensional moves, high proposal probabilities are assigned to the more likely subnetwork models, as judged by chi-square tests in the preprocessing step.
Although faced with the same difficulty of searching in a huge structure space as other methods, our algorithm is expected to reconstruct the Boolean network in a more accurate and comprehensive manner at an acceptable computing cost. / Detailed summary in vernacular field only. / Han, Shengtong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 94-105). / Abstract also in Chinese. / Chapter 1 --- Introduction / Chapter 2 --- Technical Background / Chapter 2.1 --- Classical Boolean Network / Chapter 2.1.1 --- Definition / Chapter 2.1.2 --- Dynamic Properties / Chapter 2.2 --- Probabilistic Boolean Network / Chapter 2.2.1 --- Definition / Chapter 2.2.2 --- Dynamic Properties / Chapter 3 --- Bayesian Framework for Boolean Network Modeling / Chapter 3.1 --- Introduction / Chapter 3.2 --- Network Modeling / Chapter 3.2.1 --- Subnetwork Modeling / Chapter 3.2.2 --- Full Network Modeling / Chapter 3.2.3 --- Prior & Posterior Distributions / Chapter 4 --- Network Inference-MCMC / Chapter 4.1 --- Introduction / Chapter 4.2 --- Proposal Subnetwork Construction / Chapter 4.3 --- Network Structure Updating / Chapter 4.3.1 --- Individual Network Updating Moves / Chapter 4.3.2 --- Overall Network Updating Procedure / Chapter 4.3.3 --- The Core Metropolis-Hastings Algorithm / Chapter 4.4 --- Convergence Diagnostic / Chapter 4.5 --- Model Selection / Chapter 4.5.1 --- AIC, BIC / Chapter 4.5.2 --- Bayes Factor / Chapter 4.5.3 --- Reversible Jump MCMC / Chapter 4.5.4 --- Bayesian Model Averaging / Chapter 4.6 --- Computational Consideration / Chapter 5 --- Numerical Studies / Chapter 5.1 --- Simulation Studies / Chapter 5.1.1 --- Simulation for Synthetic Network Models with Small Number of Nodes / Chapter 5.1.2 --- Simulation for Synthetic Network Models with Large Number of Nodes / Chapter 5.2 --- Comparison with Other Methods / Chapter 5.2.1 --- Comparison Results / Chapter 5.2.2 --- Discussion / Chapter 6 --- Real Data Analysis / Chapter 6.1 --- A Real Cell Cycle Network / Chapter 6.2 --- Inference Result / Chapter 6.3 --- Discussion / Chapter 7 --- Summary and Discussion / Bibliography / Chapter A --- Data Pre-processing / Chapter A.1 --- Data Discretization / Chapter B --- Truth Tables for Commonly Used Basic Logic Functions / Chapter C --- All Distribution Tables for Gene Pairs and Gene Triplets / Chapter C.1 --- Distribution Assumptions for Input Gene Pairs / Chapter C.2 --- Distribution Assumptions for Gene Triplets / Chapter D --- Pseudo Code of the Algorithm / Chapter D.1 --- Case 1: In-degree=1 / Chapter D.2 --- Case 2: In-degree=2 / Chapter D.3 --- Case 3: In-degree=0
Genetics--Data processing Systems biology Bioinformatics Markov processes Monte Carlo method
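As a small illustration of the deterministic Boolean network model that the record above takes as its starting point, the sketch below updates a three-gene network synchronously. The genes `A`, `B`, `C` and their update rules are invented for the example and do not come from the thesis; the inference problem it describes is the reverse one of recovering such rules from noisy data.

```python
def step(state, rules):
    """Synchronously update every gene from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


# Hypothetical update rules; each gene reads at most two regulators.
rules = {
    "A": lambda s: s["C"],                 # A copies C (in-degree 1)
    "B": lambda s: s["A"] and not s["C"],  # B = A AND NOT C (in-degree 2)
    "C": lambda s: not s["B"],             # C = NOT B (in-degree 1)
}

state = {"A": True, "B": False, "C": False}
trajectory = [state]
for _ in range(4):
    state = step(state, rules)
    trajectory.append(state)
```

From this starting state the toy network alternates between two states, a simple example of the attractor cycles studied under "dynamic properties" of Boolean networks.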
7	Walking tree methods for biological string matching
Hsu, Tai C. 20 June 2003.
Graduation date: 2004
Matching theory Genetics -- Data processing Heuristic programming
8	Visualization, implementation, and application of the Walking Tree heuristics for biological string matching
Cavener, Jeffrey Douglas. 11 August 1997.
Biologists need tools to see the structural relationships encoded in biological sequences (strings). The Walking Tree heuristics calculate some of these relationships. I have designed and implemented graphic presentations that allow the biologist (user) to see these relations. This thesis contains background information on the biological sequences and some background on the Walking Tree heuristics. I demonstrate my methods by showing a visual matching of mitochondrial genomes. I also show matchings based on amino acids and on hydrophobicity, and how the parameters of the visualization can be varied to produce more useful pictures. I implemented a parallel version of the Walking Tree heuristic and used it to produce a phylogenetic tree for picornaviruses. I also implemented several user interfaces. These programs are available on my WWW page, which allows a user to produce a picture of a matching by giving the sequences in GenBank format and making a few mouse clicks. / Graduation date: 1998
Genetics -- Data processing Matching theory Heuristic programming
9	Computational algorithm development for epigenomic analysis
Wang, Jianrong. 03 July 2012.
Multiple computational algorithms were developed for analyzing ChIP-seq datasets of histone modifications. For basic ChIP-seq data processing, the problems of ambiguous short sequence read mapping and broad peak calling of diffuse ChIP-seq signals were solved by novel statistical methods. Their performance was systematically evaluated against existing approaches, and their utility for finding meaningful biological information was demonstrated by applications to real datasets. For biological-question-driven data mining, several important topics were selected for algorithm development, including hypothesis-driven insulator prediction, unbiased chromatin boundary element discovery, and combinatorial histone modification signature inference. The integrative computational pipeline for insulator prediction not only produced a list of putative insulators but also recovered specific associated chromatin and functional features. Selected predictions have been experimentally validated. The unbiased chromatin boundary element prediction algorithm was feature-free and was able to discover novel types of boundary elements. The predictions found a set of chromatin features and provided the first report of tRNA-derived boundary elements in the human genome. The combinatorial chromatin signature algorithm employed chromatin profile alignments for unsupervised inference of histone modification patterns. The signatures were associated with various regulatory elements and functional activities. Both the computational advantages and the biological discoveries are discussed.
ChIP-seq Histone modifications Bioinformatics Epigenetics Insulators Algorithms Bioinformatics Genetics Data processing Epigenesis
10	Stress-inducible protein 1: a bioinformatic analysis of the human, mouse and yeast STI1 gene structure
Aken, Bronwen Louise. January 2005.
Stress-inducible protein 1 (Sti1) is a 60 kDa eukaryotic protein that is important under stress and non-stress conditions. Human Sti1 is also known as the Hsp70/Hsp90 organising protein (Hop), which coordinates the functional cooperation of heat shock protein 70 (Hsp70) and heat shock protein 90 (Hsp90) during the folding of various transcription factors and kinases, including certain oncogenic proteins and prion proteins. Limited studies have been conducted on the STI1 gene structure. Thus, the aim of this study was to develop a comprehensive description of the human STI1 (hSTI1), mouse STI1 (mSTI1), and yeast STI1 (ySTI1) genes, using a bioinformatic approach. Genes encoded near the STI1 loci were identified for the three organisms using National Center for Biotechnology Information (NCBI) MapViewer and the Saccharomyces Genome Database. Exon/intron boundaries were predicted using Hidden Markov model gene prediction software (HMMGene) and Genscan, and by alignment of the mRNA sequence with the genomic DNA sequence. Transcription factor binding sites (TFBS) were predicted by scanning the region 1000 base pairs (bp) upstream of the STI1 orthologues’ transcription start site (TSS) with Alibaba, Transcription element search software (TESS) and Transcription factor search (TFSearch). The promoter region was defined by comparing the number, type and position of TFBS across the orthologous STI1 genes. Additional putative TFBS were identified for ySTI1 by searching with software that aligns nucleic acid conserved elements (AlignACE) for over-represented motifs in the region upstream of the TSS of genes thought to be co-regulated with ySTI1. This study showed that hSTI1 and mSTI1 occur in a region of synteny with a number of genes of related function.
Both hSTI1 and mSTI1 comprised 14 putative exons, while ySTI1 was encoded on a single exon. Human and mouse STI1 shared a perfectly conserved 55 bp region spanning their predicted TSS, although their TATA boxes were not conserved. A putative CpG island was identified in the region from -500 to +100 bp relative to the hSTI1 and mSTI1 TSS. This region overlapped with a region of high TFBS density, suggesting that the core promoter region was located approximately 100 to 200 bp upstream of the TSS. Several conserved clusters of TFBS were also identified upstream of this promoter region, including binding sites for stimulatory protein 1 (Sp1), heat shock factor (HSF), nuclear factor kappa B (NF-kappaB), and the CCAAT/enhancer binding protein (C/EBP). Microarray data suggested that ySTI1 was co-regulated with several heat shock proteins and substrates of the Hsp70/Hsp90 heterocomplex, and several putative regulatory elements were identified in the upstream region of these co-regulated genes, including a motif for HSF binding. The results of this research suggest several avenues of future experimental work, including the confirmation of the proposed core promoter, upstream regulatory elements, and CpG island, and the investigation of the co-regulation of mammalian STI1 with its surrounding genes. These results could also be used to inform STI1 gene knockout experiments in mice, to assess the biological importance of mammalian STI1.
Molecular chaperones Proteins -- Analysis Heat shock proteins Bioinformatics Genetics -- Data processing
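The abstract above reports a putative CpG island but does not say how it was called. As a hedged illustration only, the sketch below computes two quantities commonly used for that purpose: GC content and the observed/expected CpG ratio, taken here as (number of CG dinucleotides × sequence length) / (number of C × number of G). The function name `cpg_stats` and the toy sequence are assumptions, not taken from the thesis.

```python
def cpg_stats(seq):
    """Return (gc_content, cpg_observed_over_expected) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Observed/expected ratio; defined as 0.0 when C or G is absent.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


gc, ratio = cpg_stats("CGCGATAT")
```

In practice such statistics are evaluated in sliding windows across a region (for example, the -500 to +100 bp interval mentioned above) and compared against chosen thresholds; the window size and thresholds are left out here because the source does not specify them.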
