241 |
The making and breaking of SAS-6: structural insights and inhibitor search for N-terminal domain dimerisation. Busch, Julia Maria Christiane. January 2017
SAS-6 is the structural core of the forming centriole, a cylindrical protein complex that is an essential component of the centrosome. Oligomerisation of SAS-6 is crucial for successful centriole duplication and is achieved through two dimerisation domains in the SAS-6 protein: a long C-terminal coiled-coil domain and a globular N-terminal dimerisation domain. As core components of the centrosome, centrioles help facilitate various cellular functions. They are involved in the anchoring of flagella and cilia to the membrane and in coordinating the spindle apparatus during chromosome segregation. A deeper insight into the molecular mechanisms at play in the centriole duplication process would have implications for our understanding of fundamental cell division processes and a number of related diseases. Here the involvement of an unstudied loop region in the C. elegans SAS-6 N-terminal domain dimerisation is described. Combining structural biology, biophysical and computational techniques, the molecular interactions of this loop, which contribute to the oligomerisation of SAS-6 at the N-terminal dimer interface, were explored. Furthermore, the screening and testing of small molecule inhibitors of the SAS-6 N-terminal domain dimerisation is described, targeting a hydrophobic pocket in the domain. Two candidate compounds are presented as a result of the screens and next steps towards structure-based compound design are suggested, based on computational analysis. The search for inhibitory compounds includes the set-up of an in-house virtual screening pipeline, as well as in vitro screening efforts and a new crystallographic structure of the H. sapiens SAS-6 N-terminal domain. 
By investigating the making and breaking of the SAS-6 N-terminal domain dimerisation, light is shed on so far neglected details of this essential protein-protein interaction, and advances towards a SAS-6 oligomerisation inhibitor are described. Such an inhibitor could ultimately be used for new approaches in cell cycle research and might open up new avenues for medical research by binding a disease-relevant target.
|
242 |
Bioinformatics analyses for next-generation sequencing of plasma DNA. January 2012
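In 1997, Dennis et al. demonstrated that fetal DNA is present in the maternal circulation, opening the door to noninvasive prenatal diagnosis. Initial applications included sex determination and identification of the Rhesus blood group system. With the emergence and development of second-generation sequencing, more sophisticated analyses and applications of circulating cell-free DNA have arisen. For example, at twelve weeks of pregnancy, second-generation sequencing of maternal plasma DNA can predict whether the fetus has trisomy of chromosome 21 with an accuracy of 98%. The first part of this thesis describes how to construct a genome-wide fetal genetic map from maternal plasma DNA. This research is highly challenging because, after 12 weeks of pregnancy, the fetal contribution to plasma DNA is small, mostly around 10%, and most fetal DNA in plasma is shorter than 200 bp. Current algorithms and programs are not suitable for constructing the fetal genetic map from maternal plasma DNA. In this study, based on the genotypes of the mother and the father, bioinformatics methods are first used to construct the possible fetal genetic map, and the sequencing information from maternal plasma DNA is then aligned to this candidate map. Where the mother is homozygous, determining the paternal inheritance only requires qualitative detection of whether the paternal-specific segments are present in maternal plasma. Where the mother is heterozygous, determining the maternal inheritance requires quantitative analysis. I developed the relative haplotype dosage analysis scheme, which statistically assesses the relative dosage of the two maternal haplotypes in maternal plasma; the haplotype that is significantly over-represented is the one most likely to have been inherited by the fetus. Relative haplotype dosage analysis uses the sequencing information more efficiently, reduces fluctuation in the sequencing data, and is more stable and robust than single-locus analysis. / With the advent of targeted enrichment sequencing, sequencing costs have fallen sharply. In the first part, the combination of maternal and paternal genotypes at polymorphic sites, together with the sequencing information, was used to calculate the fetal DNA concentration in maternal plasma. The limitation of that method is that it requires the parental genotypes at polymorphic sites and cannot infer the fetal DNA concentration directly from the sequencing information. In the second part of this thesis, I developed a mixture model based on the binomial distribution to predict the fetal DNA concentration in maternal plasma directly. The fetal DNA concentration is optimally estimated when the likelihood of the mixture model reaches its maximum. Because targeted enrichment sequencing provides high-coverage sequencing information, it becomes possible to identify, directly from the probabilistic model, the informative sites at which the mother is homozygous and the fetus is heterozygous. / Besides analyses of maternal plasma DNA that advance noninvasive prenatal diagnosis, epigenetic analysis should not be overlooked. In the third part of this thesis, I developed the Methyl-Pipe software, designed specifically for genome-wide methylation analysis. Analysis of methylation sequencing data is more complex than ordinary genome sequencing analysis. In bisulfite sequencing libraries, unmethylated cytosines are converted to uracil and ultimately appear as thymine in the PCR products, whereas methylated cytosines remain unchanged. Therefore, to align bisulfite-treated reads to the reference genome, the cytosines in the Watson and Crick strands of the reference genome are first all converted to thymine, and the cytosines in the sequencing reads are likewise converted to thymine. The converted reads are then aligned to the reference genome. Finally, genome-wide methylation levels and specific methylation patterns are inferred from the cytosine and thymine content of the reads aligned to the genome. Methyl-Pipe can be used to identify genomic regions with significantly different methylation levels, and can therefore be used to identify potential fetal-specific methylation sites for noninvasive prenatal diagnosis. / The presence of fetal DNA in the cell-free plasma of pregnant women was first described in 1997. The initial clinical applications of this phenomenon focused on the detection of paternally inherited traits such as sex and rhesus D blood group status. The development of massively parallel sequencing technologies has allowed more sophisticated analyses on circulating cell-free DNA in maternal plasma. For example, through the determination of the proportional representation of chromosome 21 sequences in maternal plasma, noninvasive prenatal diagnosis of fetal Down syndrome can be achieved with an accuracy of >98%. 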
In the first part of my thesis, I have developed bioinformatics algorithms to perform genome-wide construction of the fetal genetic map from the massively parallel sequencing data of the maternal plasma DNA sample of a pregnant woman. The construction of the fetal genetic map from maternal plasma sequencing data is very challenging because fetal DNA constitutes only approximately 10% of the maternal plasma DNA. Moreover, as the fetal DNA in maternal plasma exists as short fragments of less than 200 bp, existing bioinformatics techniques for genome construction are not applicable for this purpose. For the construction of the genome-wide fetal genetic map, I have used the genomes of the father and the mother as scaffolds and calculated the fractional fetal DNA concentration. First, I looked at the paternal-specific sequences in maternal plasma to determine which portions of the father’s genome had been passed on to the fetus. For the determination of the maternal inheritance, I have developed the Relative Haplotype Dosage (RHDO) approach. This method is based on the principle that the portion of the maternal genome inherited by the fetus would be present at a slightly higher concentration in the maternal plasma. The use of haplotype information makes more efficient use of the sequencing data. Thus, the maternal inheritance can be determined with a much lower sequencing depth than by looking at individual loci in the genome. This algorithm makes it feasible to use genome-wide scanning to diagnose fetal genetic disorders prenatally in a noninvasive way. / With the emergence of targeted massively parallel sequencing, the sequencing cost per base is falling dramatically. 
Even though the first part of the thesis developed a method to estimate the fractional fetal DNA concentration using parental genotype information, that method cannot deduce the fractional fetal DNA concentration directly from sequencing data without prior knowledge of genotype information. In the second part of this thesis, I propose a statistical mixture-model-based method, FetalQuant, which uses maximum likelihood to estimate the fractional fetal DNA concentration directly from targeted massively parallel sequencing of maternal plasma DNA. This method improves on existing methods by obviating the need for genotype information without loss of accuracy. Furthermore, by using Bayes’ rule, this method can distinguish the informative SNPs where the mother is homozygous and the fetus is heterozygous, which has the potential to detect dominantly inherited disorders. / Besides genetic analysis at the DNA level, epigenetic markers are also valuable for the development of noninvasive diagnosis. In the third part of this thesis, I have also developed a bioinformatics algorithm to efficiently analyze genome-wide DNA methylation status based on the massively parallel sequencing of bisulfite-converted DNA. DNA methylation is one of the most important mechanisms for regulating gene expression. The study of DNA methylation for different genes is important for the understanding of different physiological and pathological processes. Currently, the most popular method for analyzing DNA methylation status is bisulfite sequencing. The principle of this method is that unmethylated cytosine residues are chemically converted to uracil on bisulfite treatment whereas methylated cytosines remain unchanged. The converted uracil and unconverted cytosine can then be discriminated on sequencing. 
With the emergence of massively parallel sequencing platforms, it is possible to perform this bisulfite sequencing analysis on a genome-wide scale. However, the bioinformatics analysis of genome-wide bisulfite sequencing data is much more complicated than analyzing the data from individual loci. Thus, I have developed Methyl-Pipe, a bioinformatics program for analyzing the genome-wide methylation status of DNA samples based on massively parallel sequencing. In the first step of this algorithm, an in-silico converted reference genome is produced by converting all the cytosine residues to thymine residues. Then, the sequenced reads of bisulfite-converted DNA are aligned to this modified reference sequence. Finally, post-processing of the alignments removes non-unique and low-quality mappings and characterizes the methylation pattern in a genome-wide manner. Using this new program, potential fetal-specific hypomethylated regions that can serve as blood biomarkers can be identified in a genome-wide manner. / Detailed summary in vernacular field only. / Jiang, Peiyong. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2012. / Includes bibliographical references (leaves 100-105). / Abstracts also in Chinese. 
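The Methyl-Pipe steps summarised in the abstract above (in-silico C-to-T conversion of the reference and the reads, alignment in the converted space, removal of non-unique mappings, methylation calling from the remaining C and T bases) can be illustrated with a toy Python sketch. This is not the Methyl-Pipe implementation: the function names `c_to_t` and `call_methylation` are hypothetical, alignment is reduced to exact substring matching, and only the Watson strand is handled, whereas the abstract describes converting both Watson and Crick strands.

```python
def c_to_t(seq: str) -> str:
    """In-silico bisulfite conversion: every cytosine becomes thymine."""
    return seq.upper().replace("C", "T")


def call_methylation(reference: str, read: str):
    """Map a bisulfite read to the reference in C-to-T space (exact match,
    Watson strand only) and classify each reference cytosine it covers.

    Returns (offset, calls), where calls maps a reference position to True
    (methylated: the read still shows C) or False (unmethylated: the read
    shows T). Returns None for unmapped or non-uniquely mapped reads.
    """
    ref_ct, read_ct = c_to_t(reference), c_to_t(read)
    offsets = [i for i in range(len(ref_ct) - len(read_ct) + 1)
               if ref_ct.startswith(read_ct, i)]
    if len(offsets) != 1:  # discard unmapped and non-unique mappings
        return None
    offset = offsets[0]
    calls = {}
    for j, base in enumerate(read.upper()):
        if reference[offset + j].upper() == "C":
            calls[offset + j] = (base == "C")
    return offset, calls


# Hypothetical toy data: the read covers reference positions 4..12;
# the cytosine at position 6 was methylated, those at 11 and 12 were not.
reference = "AACGTTCGGATCCA"
read = "TTCGGATTT"
result = call_methylation(reference, read)
print(result)
```

A real pipeline would replace the substring search with a short-read aligner and add quality filtering; the point here is only that converting both reference and read removes the C/T mismatch that bisulfite treatment introduces, while the original read still carries the methylation signal.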
Section I --- Background / Chapter 1 --- Circulating nucleic acids and Next-generation sequencing / Chapter 1.1 --- Circulating nucleic acids / Chapter 1.2 --- Next-generation sequencing / Chapter 1.3 --- Bioinformatics analyses / Chapter 1.4 --- Applications of the NGS / Chapter 1.5 --- Aims of this thesis / Section II --- Mathematically decoding fetal genome in maternal plasma / Chapter 2 --- Characterizing the maternal and fetal genome in plasma at single base resolution / Chapter 2.1 --- Introduction / Chapter 2.2 --- SNP categories and principle / Chapter 2.3 --- Clinical cases and SNP genotyping / Chapter 2.4 --- Sequencing depth and fractional fetal DNA concentration determination / Chapter 2.5 --- Filtering of genotyping errors for maternal genotypes / Chapter 2.6 --- Constructing fetal genetic map in maternal plasma / Chapter 2.7 --- Sequencing error estimation / Chapter 2.8 --- Paternal-inherited alleles / Chapter 2.9 --- Maternally-derived alleles by RHDO analysis / Chapter 2.10 --- Recombination breakpoint simulation and detection / Chapter 2.11 --- Prenatal diagnosis of β-thalassaemia / Chapter 2.12 --- Discussion / Section III --- Statistical model for fractional fetal DNA concentration estimation / Chapter 3 --- FetalQuant: deducing the fractional fetal DNA concentration from massively parallel sequencing of maternal plasma DNA / Chapter 3.1 --- Introduction / Chapter 3.2 --- Methods / Chapter 3.2.1 --- Maternal-fetal genotype combinations / Chapter 3.2.2 --- Binomial mixture model and likelihood / Chapter 3.2.3 --- Fractional fetal DNA concentration fitting / Chapter 3.3 --- Results / Chapter 3.3.1 --- Datasets / Chapter 3.3.2 --- Evaluation of FetalQuant algorithm / Chapter 3.3.3 --- Simulation / Chapter 3.3.4 --- Sequencing depth and the number of SNPs required by FetalQuant / Chapter 3.5 --- Discussion / Section IV --- NGS-based data analysis pipeline development / Chapter 4 --- Methyl-Pipe: Methyl-Seq bioinformatics analysis pipeline / Chapter 4.1 --- Introduction / Chapter 4.2 --- Methods / Chapter 4.2.1 --- Overview of Methyl-Pipe / Chapter 4.3 --- Results and discussion / Section V --- Concluding remarks / Chapter 5 --- Conclusion and future perspectives / Chapter 5.1 --- Conclusion / Chapter 5.2 --- Future perspectives / Reference
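The idea behind FetalQuant in this record, a binomial mixture over maternal-fetal genotype combinations whose likelihood is maximised over the fractional fetal DNA concentration, can be sketched as follows. Everything specific here is an assumption made for illustration and is not taken from the thesis: the five genotype classes and their expected allele fractions, the equal mixture weights, the fixed error rate `err`, the grid search, and the simulated counts.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of k B-allele reads out of n under allele fraction p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_b_fractions(f: float, err: float = 0.001):
    """Expected B-allele fraction in plasma for each (mother, fetus)
    genotype class, given fetal fraction f. err is an assumed error rate."""
    return [
        err,          # mother AA, fetus AA
        f / 2,        # mother AA, fetus AB (informative site)
        0.5 - f / 2,  # mother AB, fetus AA
        0.5,          # mother AB, fetus AB
        0.5 + f / 2,  # mother AB, fetus BB
    ]


def log_likelihood(f: float, counts) -> float:
    """Mixture log-likelihood with equal class weights (an assumption)."""
    fractions = expected_b_fractions(f)
    weight = 1.0 / len(fractions)
    return sum(
        math.log(sum(weight * binom_pmf(b, depth, p) for p in fractions))
        for b, depth in counts
    )


def estimate_fetal_fraction(counts) -> float:
    """Grid search for the fetal fraction with the highest likelihood."""
    grid = [i / 1000 for i in range(10, 500)]
    return max(grid, key=lambda f: log_likelihood(f, counts))


# Hypothetical (B-allele reads, depth) pairs laid out for a fetal fraction
# of 0.10 at 200x depth; no parental genotypes are supplied to the model.
counts = ([(10, 200)] * 20 + [(0, 200)] * 20
          + [(90, 200)] * 5 + [(100, 200)] * 5 + [(110, 200)] * 5)
estimate = estimate_fetal_fraction(counts)
print(f"estimated fetal fraction: {estimate:.3f}")
```

On this toy input the estimate should land close to the simulated value of 0.10. The sketch fixes the class weights for brevity; a fuller mixture model would normally fit them together with the fetal fraction.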
|
243 |
Identificação e avaliação da expressão de marcadores moleculares envolvidos na tumorigênese de pulmão [Identification and evaluation of the expression of molecular markers involved in lung tumorigenesis]. Henrique, Tiago. 05 July 2010
Submitted by Fabíola Silva (fabiola.silva@famerp.br) on 2016-06-28T18:01:18Z
No. of bitstreams: 1
tiagohenrique_dissert.pdf: 1562127 bytes, checksum: 0d0b4da7bf071396a3c701c40e27e3ed (MD5) / Made available in DSpace on 2016-06-28T18:01:18Z (GMT).
Previous issue date: 2010-07-05 / Introduction: Lung cancer is the most common malignancy in humans. The average
5-year survival rate is one of the lowest among aggressive cancers and has shown no
significant improvement in recent years. When detected early, lung cancer has a
good prognosis, but most patients present with metastatic disease at the time of
diagnosis, which significantly reduces survival rates. Despite all the recent
advances in cancer treatment, the prognosis of these patients has improved only
minimally. Objectives: The present study aimed to investigate the molecular profile
of non-small cell lung cancer as well as new tumor markers relevant to the diagnosis and
prognosis of this disease. Methods: Total RNA from frozen surgical tissues was
extracted using TRIZOL reagent, and the RNeasy FFPE kit was used for RNA extraction
from formalin-fixed, paraffin-embedded tissue. Aiming to identify differentially
expressed genes involved in lung cancer, we analyzed combined data from normal
and tumor SAGE (Serial Analysis of Gene Expression) libraries available in the
public domain. Proteome profiles were also analyzed in adenocarcinoma, squamous
cell carcinoma and normal surgical margin samples using two-dimensional
electrophoresis and mass spectrometry. Results: The statistical analysis of SAGE
data identified a subset of tags differentially expressed between normal surgical
margin and adenocarcinoma libraries. Three genes displaying differential regulation
in SAGE or proteomic analysis, two up-regulated (COL3A1, CTSB) and one down-regulated
(ITGB1) in neoplastic cells, were selected for real-time polymerase chain reaction
(PCR) experiments using the same set of samples. Consistent with the statistical results,
quantitative PCR confirmed the upregulation of COL3A1 and CTSB in carcinomas
when compared to tumor-free tissues. Conclusion: RNA from frozen and archived
samples is appropriate for amplification experiments by real-time PCR, although with
lower efficiency for the latter. Therefore, improved methods of RNA
extraction from archived tissues are suitable for real-time quantitative RT-PCR, and
may be used for gene expression profiling of paraffin-embedded tissues from cancer
patients. To the best of our knowledge, this is the first study reporting SAGE data
analysis in lung cancer. The statistical approach as well as the proteomic evaluation
were effective in identifying differentially expressed genes and proteins reportedly
involved in cancer development and may be useful to disclose new tumor markers
relevant to lung tumorigenesis. / Introduction: Lung cancer is the most common human neoplasm. Its
5-year survival rates are among the lowest for aggressive tumors and
have not changed appreciably in recent years. When
detected at early stages, lung cancer has a good prognosis, but
most patients present with metastatic disease at the time of diagnosis,
which reduces survival significantly. Despite all the progress made in
cancer treatment in recent years, the prognosis of these patients remains
unfavorable. Objectives: The present study aimed to investigate the molecular
profile of non-small cell lung cancer, as well as new
tumor markers relevant to the diagnosis and prognosis of this disease.
Methods: Total RNA from frozen surgical specimens was extracted by the
Trizol method, and the RNeasy FFPE kit was used to extract RNA from
formalin-fixed, paraffin-embedded tissues. In order to identify
differentially expressed genes involved in lung cancer, combined data
from public SAGE (Serial Analysis of Gene Expression) libraries
were analyzed. The protein profile was also evaluated in samples of
adenocarcinoma, squamous cell carcinoma and normal surgical margins,
using two-dimensional electrophoresis and mass spectrometry. Results: The
statistical analysis of the SAGE data identified a set of tags
differentially expressed between the adenocarcinoma and surgical margin
libraries. Three genes with altered expression in the SAGE and
proteomic analyses, two with elevated levels (COL3A1, CTSB) and one with a reduced level
(ITGB1) in neoplastic cells, were selected for real-time PCR
(polymerase chain reaction) experiments on the same set of samples.
Consistent with the statistical results, quantitative PCR confirmed the
elevated expression of COL3A1 and CTSB in carcinomas when compared with
tumor-free tissue. Conclusion: RNA from frozen and archived samples is
suitable for amplification by real-time PCR, although it is of lower
quality in the latter. Therefore, methods optimized for archived tissues
allow quantitative PCR analyses and can be used to evaluate the
transcriptional profile of paraffin-embedded specimens from patients
with cancer. This is apparently the first study describing the analysis of
SAGE data in lung cancer. The statistical and proteomic approaches
were effective in identifying differentially expressed genes and proteins
involved in cancer development and may reveal new markers
relevant to lung tumorigenesis.
|
244 |
Feature selection and classification problem in bioinformatics. January 2010
Lau, Siu Him. / "November 2009." / Thesis (M.Phil.)--Chinese University of Hong Kong, 2010. / Includes bibliographical references (leaves 46). / Abstracts in English and Chinese. / Chapter 1 --- Introduction / Chapter 2 --- Support Vector Machine / Chapter 2.1 --- Two-Class Support Vector Machine / Chapter 2.2 --- Kernel Tricks / Chapter 2.3 --- Weighted Support Vector Machine / Chapter 2.4 --- Parameter Selection in Support Vector Machine / Chapter 3 --- Feature Selection Methods / Chapter 3.1 --- Principle Component Analysis / Chapter 3.1.1 --- Maximizing Variance / Chapter 3.1.2 --- Relation with Singular Value Decomposition / Chapter 3.1.3 --- Feature Selection by Singular Value Decomposition / Chapter 3.1.4 --- Disadvantage of Unsupervised Learning / Chapter 3.2 --- Linear Discriminant Analysis / Chapter 3.2.1 --- Between-class Distance and Within-class Variance / Chapter 3.2.2 --- Generalized Eigenvalue Problem / Chapter 3.2.3 --- Feature Selection by Linear Discriminant Analysis / Chapter 4 --- Application on a Real Problem / Chapter 4.1 --- Problem and Goals / Chapter 4.2 --- Diabetes Data Set / Chapter 4.3 --- Data Processing / Chapter 4.3.1 --- Pre-processing for Categorical Data / Chapter 4.3.2 --- Handling Uneven Data Set / Chapter 5 --- Results on Simulated and Real Data / Chapter 5.1 --- Evaluation / Chapter 5.1.1 --- Training and Testing / Chapter 5.1.2 --- Evaluation Method / Chapter 5.2 --- Classification Procedure / Chapter 5.3 --- Performance on Simulated Data / Chapter 5.4 --- Results on a Real Data Set / Chapter 5.4.1 --- Features Selection / Chapter 5.4.2 --- Performance on the Real Data Set / Chapter 5.4.3 --- Analysis on Risk Factors / Chapter 6 --- Conclusion / Bibliography
|
245 |
Development of bioinformatics platforms for methylome and transcriptome data analysis. January 2014
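High-throughput massively parallel sequencing, also known as next-generation sequencing (NGS), has greatly accelerated biological and medical research. As sequencing throughput and complexity continue to rise, bioinformatics has become increasingly important for analysing the large volumes of data and mining the information they contain. During my doctoral studies (and in this thesis), I worked mainly on developing bioinformatics algorithms in two areas: DNA methylation data analysis and the identification of long intergenic non-coding RNAs (lincRNAs). NGS is now widely used in both areas, and effective data-processing methods are urgently needed to analyse the resulting data. / DNA methylation is an important epigenetic modification that mainly serves to regulate gene expression. At present, whole-genome bisulfite sequencing (BS-seq) is one of the most accurate experimental methods for studying DNA methylation, and a key feature of the technique is its single-base resolution. To analyse the large amount of sequencing data produced by BS-seq, I co-developed and extensively optimised the Methy-Pipe software. Methy-Pipe integrates read alignment and methylation-level analysis in an all-in-one DNA methylation data analysis tool. In addition, building on Methy-Pipe, I developed a new algorithm for detecting differentially methylated regions (DMRs), which can be used for large-scale discovery of DNA methylation markers. Methy-Pipe is widely used in our laboratory's DNA methylation research projects, including plasma-based noninvasive prenatal diagnosis (NIPD) and cancer detection. / Long intergenic non-coding RNAs (lincRNAs) are important regulators that function in many biological processes, such as post-transcriptional regulation, RNA splicing and cellular ageing. LincRNA expression is highly tissue-specific, so a large proportion of lincRNAs have yet to be discovered. Recently, whole-transcriptome sequencing (RNA-seq) combined with de novo assembly has provided the most powerful approach for identifying new lincRNAs and building a complete transcriptome catalogue. However, efficiently and accurately identifying genuine novel lincRNAs from large RNA-seq datasets remains highly challenging. To this end, I developed two bioinformatics tools: 1) iSeeRNA, for distinguishing lincRNAs from protein-coding RNAs (mRNAs); and 2) sebnif, for in-depth data filtering to obtain a high-quality lincRNA list. Both tools have been used in several biological systems and have performed well. / In summary, I developed a number of bioinformatics methods that help researchers make better use of NGS to uncover the biology behind large volumes of sequencing data, particularly in studies of DNA methylation and the transcriptome. / High-throughput massively parallel sequencing technologies, or Next-Generation Sequencing (NGS) technologies, have greatly accelerated biological and medical research. With the ever-growing throughput and complexity of the NGS technologies, bioinformatics methods and tools are urgently needed for analyzing the large amount of data and discovering the meaningful information behind it. In this thesis, I mainly worked on developing bioinformatics algorithms for two research fields: DNA methylation data analysis and large intergenic noncoding RNA discovery, where the NGS technologies are heavily employed and novel bioinformatics algorithms are much needed. / DNA methylation is one of the important epigenetic modifications that control the transcriptional regulation of genes. Whole genome bisulfite sequencing (BS-seq) is one of the most precise methodologies for DNA methylation study, allowing whole-methylome research at single-base resolution. 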
To analyze the large amount of data generated by BS-seq experiments, I have co-developed and optimized Methy-Pipe, an integrated bioinformatics pipeline which can perform both sequencing read alignment and methylation state decoding. Furthermore, I’ve developed a novel algorithm for Differentially Methylated Regions (DMR) mining, which can be used for large-scale methylation marker discovery. Methy-Pipe has been routinely used in our laboratory for methylomic studies, including non-invasive prenatal diagnosis and early cancer detection in human plasma. / Large intergenic noncoding RNAs, or lincRNAs, are a very important novel family of gene regulators in many biological processes, such as post-transcriptional regulation, splicing and aging. Due to the highly tissue-specific expression pattern of the lincRNAs, a large proportion is still undiscovered. The development of Whole Transcriptome Shotgun Sequencing, also known as RNA-seq, combined with de novo or ab initio assembly, promises large-scale discovery of novel lincRNAs and hence the building of a complete transcriptome catalog. However, efficiently and accurately identifying novel lincRNAs from the large transcriptome data still remains a bioinformatics challenge. To fill this gap, I have developed two bioinformatics tools: I) iSeeRNA for distinguishing lincRNAs from mRNAs and II) sebnif for comprehensive filtering towards high-quality lincRNA screening. These tools have been used in various biological systems and showed satisfactory performance. / In summary, I have developed several bioinformatics algorithms which help researchers take advantage of the strengths of the NGS technologies (in methylome and transcriptome studies) and explore the biological nature behind the large amount of data. / Detailed summary in vernacular field only. / Sun, Kun. / Thesis (Ph.D.)--Chinese University of Hong Kong, 2014. / Includes bibliographical references (leaves 118-126). / Abstracts also in Chinese.
|
246 |
Quantifying the Structure of Misfolded Proteins Using Graph Theory. Witt, Walter G. 01 May 2017
The structure of a protein molecule is highly correlated with its function. Some diseases, such as cystic fibrosis, result from a change in the structure of a protein that interferes with or inhibits its function. Often these changes in structure are caused by misfolding of the protein molecule. To assist computational biologists, there is a database of proteins together with their misfolded versions, called decoys, that can be used to test the accuracy of protein structure prediction algorithms. In our work we use a nested graph model to quantify a selected set of proteins that have two single-misfold decoys. The graph-theoretic model used is a three-tiered nested graph. Measures based on the vertex weights are calculated, and we compare the quantification of the proteins with that of their decoys. Our method is able to separate the misfolded proteins from the correctly folded proteins.
|
247 |
Systematic analysis of enhancer and promoter interactions. He, Bing. 01 December 2015
Transcriptional enhancers represent the primary basis for differential gene expression. These elements regulate cell type specificity, development, and evolution, with many human diseases resulting from altered enhancer activity. To date, a key gap in our knowledge is how enhancers select specific promoters for activation.
To fill this gap, in this thesis, I first developed an Integrated Method for Predicting Enhancer Targets (IM-PET). Leveraging abundant “omics” data, I devised and characterized multiple genomic features for distinguishing true enhancer-promoter (EP) pairs from non-interacting pairs. I integrated these features into a probabilistic predictor for EP interactions. Multiple validation experiments demonstrated a significant improvement over existing state-of-the-art approaches. Systematic analyses of EP interactions across twelve human cell types reveal global features of EP interactions.
Second, we used a well-established viral infection model to map the dynamic changes of enhancers and super-enhancers during CD8+ T cell responses. Our analysis illustrated the complexity and dynamics of the underlying EP interactome during cell differentiation. Taking advantage of the predicted EP interactions, we constructed stage-specific transcriptional regulatory networks, which are critical for understanding the regulatory mechanisms during CD8+ T cell differentiation.
Third, recent progress in mapping technologies for chromatin interactions has led to a rapid increase in this type of interaction data. However, there is a lack of a comprehensive depository for chromatin interactions identified by all major technologies. To address this problem, we have developed the 4DGenome database through comprehensive literature curation of experimentally derived interactions. We envision a wide range of investigations will benefit from this carefully curated database.
|
248 |
NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION. Liu, Xinan. 01 January 2018
Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed for analyzing the large volumes of data in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomics sequence classification.
A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Based on MapSplice3, we further develop a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. A huge amount of RNA-seq data has been shared through public datasets, providing invaluable resources for researchers to test hypotheses by reusing existing datasets. To meet the need for efficient querying of large-scale sequencing data, a novel method called SeqOthello has been developed. It is able to efficiently query sequence k-mers against large-scale datasets and thereby determine whether a given sequence is present. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms, so efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures.
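The k-mer query idea mentioned for SeqOthello can be illustrated with a small Python sketch. It is a plain dictionary-based index written for illustration only, not the Othello hashing structure the tools are named after; the function names, the sample names and the `min_fraction` threshold are all hypothetical.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(samples: dict, k: int) -> dict:
    """Map every k-mer to the set of samples whose sequences contain it."""
    index = defaultdict(set)
    for name, sequences in samples.items():
        for sequence in sequences:
            for kmer in kmers(sequence, k):
                index[kmer].add(name)
    return index


def query(index: dict, seq: str, k: int, min_fraction: float = 0.8) -> list:
    """Report samples containing at least min_fraction of the query k-mers."""
    query_kmers = kmers(seq, k)
    if not query_kmers:  # query shorter than k
        return []
    hits = defaultdict(int)
    for kmer in query_kmers:
        for name in index.get(kmer, ()):
            hits[name] += 1
    return sorted(name for name, count in hits.items()
                  if count / len(query_kmers) >= min_fraction)


# Hypothetical toy datasets.
samples = {"sampleA": ["ACGTACGTGA"], "sampleB": ["TTTTGGGGCC"]}
index = build_index(samples, k=4)
found = query(index, "GTACGTG", k=4)
print(found)
```

A hash set per k-mer keeps the sketch readable but stores every k-mer explicitly; the appeal of a compact hashing scheme in this setting is that it avoids that memory cost at the scale of public RNA-seq archives.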
|
249 |
Hybrid Kinetic Monte Carlo Models of Cellular Processes in Interactive Dynamic Microenvironments. Timothy James Sego (7041083). 16 October 2019
Living tissue consists primarily of cells and extracellular matrix. Cells perform functions, communicate, respire and remodel extracellular matrix. Likewise, diffusive chemical conditions and extracellular matrix exert their own effects on cellular and intracellular processes, depending on the consistency of the matrix and the phenotype of the cell. These interactions produce the emergent phenomena of tissue function, repair and morphology. Computational modeling seeks to quantify these processes for the purposes of fundamental study and predictive capability in various applications, including wound healing, tumor vascularization and biofabrication of living tissue. Hybrid kinetic Monte Carlo models are well known to be capable of predicting observed behaviors like cell sorting and spheroid fusion due to differential adhesion and energy minimization. However, no hybrid model provides a sufficiently formal treatment of full cell, chemical and matrix interactivity in a dynamic environment, including heterogeneous matrix conditions, advecting materials, and intracellular processes. In this work, hybrid kinetic Monte Carlo models are developed to describe full interactivity of cells, soluble signals and insoluble signals in a complex, dynamic microenvironment at the cellular level. Intracellular chemical dynamics and their effects on the cellular state are modeled as stochastic processes, and cells perform metabolic and matrix remodeling activities. Computational models of select in vivo and in vitro phenomena are developed and simulated, showing the ability to simulate new phenomena concerning cell viability, growth dynamics, highly heterogeneous cellular distributions, and complex tissue structures resulting from phenomena like intercellular signaling, matrix remodeling, and cell polarity.
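As a generic illustration of treating intracellular chemical dynamics as a stochastic process, the sketch below runs a kinetic Monte Carlo (Gillespie-style) simulation of a single species that is produced at a constant rate and degraded in proportion to its copy number. It is not the hybrid model developed in the thesis; the reaction scheme, the function name and the rate constants are assumptions chosen only to show the event-selection and waiting-time steps.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, rng):
    """Simulate production (rate k_prod) and degradation (rate k_deg * n)
    of one species; returns the list of (time, copy number) events."""
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while True:
        rate_prod = k_prod
        rate_deg = k_deg * n
        total = rate_prod + rate_deg
        if total == 0:  # nothing can happen any more
            break
        t += rng.expovariate(total)  # exponential waiting time to next event
        if t > t_end:
            break
        # Choose the event with probability proportional to its rate.
        if rng.random() * total < rate_prod:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory


# Hypothetical rate constants; a seeded generator keeps runs repeatable.
rng = random.Random(0)
trajectory = gillespie_birth_death(k_prod=2.0, k_deg=0.1, n0=5,
                                   t_end=50.0, rng=rng)
print(len(trajectory), trajectory[-1])
```

With these rates the copy number fluctuates around k_prod / k_deg, which is the kind of intrinsic noise a deterministic rate equation would not show.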
|
250 |
Detection of artefacts in FFPE-sample sequence data. Swenson, Hugo. January 2019
Next generation sequencing is increasingly used as a diagnostic tool in the clinical setting. This is driven by the vast increase in molecular targeted therapy, which requires detailed information on which genetic variants are present in patient samples. In the hospital setting, most cancer diagnostics are based on Formalin Fixed Paraffin Embedded (FFPE) samples. The FFPE routine is very beneficial for logistical purposes and for some histopathological analyses, but creates problems for molecular diagnostics based on DNA. These problems derive from sample immersion in formalin, which results in DNA fragmentation, interstrand DNA crosslinking and sequence artefacts due to hydrolytic deamination. Distinguishing such artefacts from true somatic variants can be challenging, thus affecting both research and clinical analyses. In order to distinguish FFPE artefacts from true variants in next generation sequencing data from FFPE samples, I developed the novel program FUSAC (FFPE tissue UMI based Sequence Artefact Classifier) for the facility Clinical Genomics in Uppsala. FUSAC utilizes Unique Molecular Identifiers (UMIs) to identify and group sequencing reads based on their molecule of origin. By using UMIs to collapse duplicate paired reads into consensus reads, FFPE artefacts are classified through comparative analysis of the positive and negative strand sequences. My findings indicate that FUSAC can successfully distinguish UMI-tagged next generation sequencing reads carrying FFPE artefacts from sequencing reads carrying true variants. FUSAC thus presents a novel approach in bioinformatic pipelines for studying FFPE artefacts.
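The classification principle described above (group reads by UMI, collapse duplicates into one consensus per strand, then compare the two strands) can be sketched in Python as follows. This is an illustration of the idea and not FUSAC itself: the read representation as `(umi, strand, sequence)` tuples, the majority-vote consensus, the assumption that all sequences are already aligned to the same reference window in reference orientation, and the two labels are hypothetical simplifications.

```python
from collections import Counter, defaultdict


def consensus(sequences):
    """Majority base at each position of equal-length, pre-aligned reads."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*sequences))


def classify(reads, reference):
    """Label each non-reference position per UMI family.

    A mismatch supported by both strand consensuses is called a 'variant';
    a mismatch seen on only one strand is called an 'artefact', as expected
    for damage affecting a single strand of the original molecule.
    """
    families = defaultdict(lambda: {"+": [], "-": []})
    for umi, strand, sequence in reads:
        families[umi][strand].append(sequence)

    calls = {}
    for umi, by_strand in families.items():
        if not by_strand["+"] or not by_strand["-"]:
            continue  # both strands are needed for the comparison
        plus = consensus(by_strand["+"])
        minus = consensus(by_strand["-"])
        for pos, ref_base in enumerate(reference):
            if plus[pos] == ref_base and minus[pos] == ref_base:
                continue
            calls[(umi, pos)] = "variant" if plus[pos] == minus[pos] else "artefact"
    return calls


# Hypothetical toy reads over a 5-base reference window.
reference = "ACGTC"
reads = [
    ("AAT", "+", "ACGTT"), ("AAT", "+", "ACGTT"), ("AAT", "+", "ACGTC"),
    ("AAT", "-", "ACGTC"), ("AAT", "-", "ACGTC"),
    ("GGC", "+", "ATGTC"), ("GGC", "+", "ATGTC"),
    ("GGC", "-", "ATGTC"), ("GGC", "-", "ATGTC"),
]
calls = classify(reads, reference)
print(calls)
```

In the toy data, the C-to-T change at position 4 of family AAT appears on the plus strand only, the pattern expected from deamination of one strand, while the change at position 1 of family GGC is supported by both strands.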
|