Global ETD Search

41	The genease activity of mung bean nuclease: fact or fiction? Kula, Nothemba January 2004 (has links) The action of Mung Bean Nuclease (MBN) on DNA makes it possible to clone intact gene fragments from genes of the malaria parasite, Plasmodium. This "genease" activity has provided a foundation for further investigation of the coding elements of the Plasmodium genome. MBN has been reported to cleave genomic DNA of Plasmodium preferentially at positions before and after genes, but not within gene coding regions. This mechanism has overcome the difficulty of obtaining genes with low expression levels, because the enzyme yields gene sequences from genomic DNA rather than mRNA. However, as potentially useful as MBN may be, the evidence for its genease activity comes from analysis of a limited number of genes. It is not clear whether this mechanism is specific to certain genes or species of Plasmodium or whether it is a general cleavage mechanism for Plasmodium DNA. Some projects (Nomura et al., 2001; van Lin, Janse, and Waters, 2000) have also identified MBN-generated fragments that contain fragments of genes with both introns and exons, rather than the intact genes expected from MBN digestion of genomic DNA, which raises concerns about the efficiency of the MBN mechanism in generating complete genes. Using a large-scale, whole-genome mapping approach, 7242 MBN-generated genome survey sequences (GSSs) were mapped to determine their position relative to coding sequences within the complete genome sequence of the human malaria parasite Plasmodium falciparum and the incomplete genome of the rodent malaria parasite Plasmodium berghei. The location of MBN cleavage sites was determined with respect to coding regions in orthologous genes, non-coding/intergenic regions and exon-intron boundaries in these two species of Plasmodium. The survey shows that for P. falciparum 79% of GSSs had at least one terminal mapping within an ortholog coding sequence, and 85% of GSSs that overlapped coding sequence boundaries mapped within 50 bp of the start or end of the gene. Similarly, despite the partial nature of the P. berghei genome sequence information, 73% of P. berghei GSSs had at least one terminal mapping within an ortholog coding sequence, and 37% of these mapped within 0-50 bp of the start or end of the gene. This indicates that a larger percentage of cleavage sites in both P. falciparum and P. berghei were found proximal to coding regions. Furthermore, 86% of P. falciparum GSSs had at least one terminal mapping within a coding exon, and 85% of GSSs that overlapped exon-intron boundaries mapped within 50 bp of the exon start or end site. The fact that 11% of GSSs mapped completely to intronic regions suggests that some introns contain sites sensitive to cleavage, and also indicates that MBN cleavage of Plasmodium DNA does not always yield complete exons. Finally, the results presented here were obtained from analysis of several thousand Plasmodium genes that have different coding sequences and lie in different locations on individual chromosomes/contigs in two different species of Plasmodium. It therefore appears that the MBN mechanism is neither species specific nor limited to specific genes. Exon Genome survey sequence Mung Bean Nuclease Nuclease cleavage site Plasmodium falciparum Plasmodium berghei Sequence Alignment.
42	Core column prediction for protein multiple sequence alignments DeBlasio, Dan, Kececioglu, John 19 April 2017 (has links) Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy. 
Multiple sequence alignment Core blocks Alignment accuracy Accuracy estimation Parameter advising Machine learning Regression
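The coreness definition in entry 42 (the fraction of a column's substitutions that fall in core columns of the reference alignment) can be illustrated with a minimal sketch. This is not the authors' predictor, which estimates coreness without a reference; it only computes the quantity being predicted when a reference is available. The data layout (columns as lists of `(sequence_id, residue_index)` pairs) and both function names are assumptions made for illustration.

```python
from itertools import combinations

def core_pairs_from_reference(core_columns):
    """Collect every residue pair aligned in a core column of the reference.

    core_columns: list of columns, each a list of (sequence_id, residue_index)
    tuples for the non-gap entries (hypothetical layout).
    """
    pairs = set()
    for column in core_columns:
        pairs.update(frozenset(p) for p in combinations(column, 2))
    return pairs

def column_coreness(column, core_pairs):
    """Fraction of the column's substitutions (aligned residue pairs)
    that also appear as pairs in core columns of the reference."""
    pairs = [frozenset(p) for p in combinations(column, 2)]
    if not pairs:
        return 0.0
    return sum(p in core_pairs for p in pairs) / len(pairs)

# Toy data: one core reference column aligning residue 0 of sequences a, b, c.
reference_core = [[("a", 0), ("b", 0), ("c", 0)]]
core = core_pairs_from_reference(reference_core)

# The computed column misplaces sequence c (residue 1 instead of 0).
computed_column = [("a", 0), ("b", 0), ("c", 1)]
print(column_coreness(computed_column, core))
```

In the toy data only one of the three residue pairs in the computed column is a core pair of the reference, so the column's coreness is one third.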
43	PARSES: A Pipeline for Analysis of RNA-Sequencing Exogenous Sequences Coco, Joseph 20 May 2011 (has links) RNA-Sequencing (RNA-Seq) has become one of the most widely used techniques to interrogate the transcriptome of an organism since the advent of next generation sequencing technologies [1]. A plethora of tools have been developed to analyze and visualize the transcriptome data from RNA-Seq experiments, solving the problem of mapping reads back to the host organism's genome [2] [3]. This allows for analysis of most reads produced by the experiments, but these tools typically discard reads that do not match well with the reference genome. This additional information could reveal important insight into the experiment and possible contributing factors to the condition under consideration. We introduce PARSES, a pipeline constructed from existing sequence analysis tools, which allows the user to interrogate RNA-Sequencing experiments for possible biological contamination or the presence of exogenous sequences that may shed light on other factors influencing an organism's condition. exogenous agents RNA-Seq contamination sequence alignment cancer etiology sequence assembly taxonomical classification cancer treatment
44	Identificação e caracterização de grupos de indivíduos segundo padrões de seqüências de atividades multidimensionais. / Identification and characterization of groups of individuals according to patterns of multidimensional activity sequences. Dalmaso, Ricardo Curvello 30 April 2009 (has links) The main objective of the dissertation is to identify homogeneous groups of individuals with regard to the daily activity/travel sequences they perform on a weekday. Activities are characterized by multiple attributes, which makes the sequences multidimensional. In this study, the nature of the activity (travel purpose) and the starting period of engagement in the activity (ending time of a trip), both divided into categories, were the dimensions used to characterize activities. Access mode to the activity (travel mode) was also considered as a third dimension, but the results led to the decision not to include it in the final analysis. Alternative categorizations of the activity nature dimension were also studied, resulting in further disaggregation than was adopted in previous analyses of the same data. The study used data from the 1997 Origin-Destination household survey of the São Paulo Metropolitan Area. The analysis considered all individuals aged 12 or over who made two or more trips on the survey day, with a trip sequence starting and ending at home and without internal inconsistencies, resulting in a sample of 49,616 individuals. A sequence alignment method, OT-MDSAM (uni-dimensional Optimum Trajectories-based MultiDimensional Sequence Alignment Method), was used to calculate distances between pairs of activity/travel sequences; the distance is based on the effort needed to make two sequences identical. These distances were then fed into an agglomerative hierarchical clustering procedure using Ward's method to create groups of activity/travel patterns. The groups were analyzed according to the characteristics of the activity/travel sequences they contain and the sociodemographic and economic characteristics of the individuals who performed these patterns. The data were then used to develop a decision tree model using CHAID (Chi-square Automatic Interaction Detector), with the group of activity/travel sequences as the response variable and the characteristics of individuals and their families as independent variables. The results indicate that the groups formed through this procedure are fairly homogeneous with regard to the activity patterns they represent and can be clearly associated with the characteristics of the individuals who perform these patterns. Activity sequences Multidimensional sequence alignment Transport planning Travel demand
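The idea in entry 44 of measuring dissimilarity as the effort needed to make two activity sequences identical can be sketched with a plain edit distance applied to each attribute dimension and summed. This is a deliberately simplified stand-in, not OT-MDSAM itself, which accounts for operations shared across dimensions; the function names, the unit costs, and the toy activity categories are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions (each with cost 1) needed to turn sequence a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

def activity_distance(s, t, n_dims=2):
    """Naive multidimensional distance: sum of per-dimension edit distances.

    s, t: lists of activities, each a tuple of n_dims categorical attributes,
    here (purpose, period). Summing per dimension is an assumption made for
    illustration; it ignores the shared operations OT-MDSAM is built around.
    """
    return sum(
        edit_distance([act[d] for act in s], [act[d] for act in t])
        for d in range(n_dims)
    )

day_1 = [("home", "am"), ("work", "am"), ("home", "pm")]
day_2 = [("home", "am"), ("school", "am"), ("shop", "pm"), ("home", "pm")]
print(activity_distance(day_1, day_2))
```

A matrix of such pairwise distances is the kind of input the hierarchical clustering step described in the abstract would consume.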
45	Aplicação de estratégias híbridas em algoritmos de alinhamento múltiplo de sequências para ambientes de computação paralela e distribuída. / Application of hybrid strategies in multiple sequence alignments for parallel and distributed computing environments. Zafalon, Geraldo Francisco Donegá 11 November 2014 (has links) Bioinformatics has developed rapidly in recent years. The need to process large sets of sequences, whether of nucleotides or amino acids, has stimulated the development of many algorithmic techniques to make this problem tractable. Multiple sequence alignment algorithms have played a central role, because they make aligning sets of more than two sequences computationally feasible. However, with the rapid growth of both the number and the length of sequences in a set, using these algorithms without additional optimization strategies has become impractical. High-performance computing has therefore emerged as one of the resources to be used, through the parallelization of various strategies for execution on large computational systems. Moreover, as sequence sets continue to expand, other optimization strategies have been combined with parallel multiple sequence alignment algorithms. The development of multiple sequence alignment tools based on hybrid approaches currently stands out as the best-accepted solution. This work presents a hybrid strategy for progressive multiple sequence alignment algorithms, which are widely used in Bioinformatics. The approach combines parallelization and partitioning of the sequence sets in the score matrix construction stage with optimization of the phylogenetic tree construction and multiple alignment stages through ant colony and parallel simulated annealing algorithms, respectively. Bioinformatics Multiple sequence alignment Optimization algorithms Parallel processing
46	Aplicação de algoritmos genéricos multi-objetivo para alinhamento de seqüências biológicas. / Multi-objective genetic algorithms applied to protein sequence alignment. Ticona, Waldo Gonzalo Cancino 26 February 2003 (has links) Biological sequence alignment is a basic operation in Bioinformatics, since it serves as a basis for other processes, e.g., determining the three-dimensional structure of proteins. Because of the large amount of data in the sequences, mathematical and computational techniques are used to perform this task. Traditionally, the Biological Sequence Alignment Problem is formulated as a single-objective optimization problem, in which the alignment with the greatest similarity under a scoring scheme is sought. Multi-Objective Optimization addresses optimization problems with several criteria to be met. For this kind of problem there is a set of solutions that represent a trade-off between the objectives. Evolutionary Algorithms, inspired by Darwin's theory of evolution, have been applied successfully in this context; they work with a population of solutions that evolve until a convergence or stopping criterion is reached. This work formulates the Biological Sequence Alignment Problem as a Multi-Objective Optimization Problem in order to find a set of solutions that represent a trade-off between the extent and the quality of the solutions. Several models of Evolutionary Algorithms for Multi-Objective Optimization were applied, and the performance of each was evaluated using performance metrics found in the literature. Evolutionary algorithms Multi-objective optimization Sequence alignment
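The "set of solutions that represent a trade-off between the objectives" mentioned in entry 46 is commonly called the non-dominated (Pareto) set. The sketch below filters candidate alignments scored on two objectives, both assumed to be maximized; the pairing of alignment length with a quality score, the function names, and the sample values are hypothetical and not taken from the thesis.

```python
def dominates(p, q):
    """True if p is at least as good as q on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return (all(a >= b for a, b in zip(p, q))
            and any(a > b for a, b in zip(p, q)))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates: (alignment length, quality score).
candidates = [(120, 0.55), (90, 0.80), (100, 0.70), (80, 0.75), (120, 0.50)]
print(pareto_front(candidates))
```

A multi-objective evolutionary algorithm would typically apply a filter of this kind to its population in each generation; the quadratic scan here is chosen for clarity rather than speed.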
47	Marker extractions in DNA sequences using sub-sequence segmentation tree. January 2005 (has links) Hung Wah Johnson. / Thesis submitted in: August 2004. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 116-121). / Abstracts in English and Chinese. / Abstract / Acknowledgement / Chapter 1 --- Introduction / Chapter 1.1 --- Motivation / Chapter 1.2 --- Problem Statement / Chapter 1.3 --- Outline of the thesis / Chapter 2 --- Background / Chapter 2.1 --- Biological Background / Chapter 2.2 --- Sequence Alignments / Chapter 2.2.1 --- Pairwise Sequences Alignment / Chapter 2.2.2 --- Multiple Sequences Alignment / Chapter 2.3 --- Neighbor Joining Tree / Chapter 2.4 --- Marker Extractions / Chapter 2.5 --- Neural Network / Chapter 2.6 --- Conclusion / Chapter 3 --- Related Work / Chapter 3.1 --- FASTA / Chapter 3.2 --- Suffix Tree / Chapter 4 --- Sub-Sequence Segmentation Tree / Chapter 4.1 --- Introduction / Chapter 4.2 --- Problem Statement / Chapter 4.3 --- Design / Chapter 4.4 --- Time and space complexity analysis / Chapter 4.4.1 --- Performance Evaluation / Chapter 4.5 --- Summary / Chapter 5 --- Applications: Global Sequences Alignment / Chapter 5.1 --- Introduction / Chapter 5.2 --- Problem Statement / Chapter 5.3 --- Pairwise Alignment / Chapter 5.3.1 --- Algorithm / Chapter 5.3.2 --- Time and Space Complexity Analysis / Chapter 5.4 --- Multiple Sequences Alignment / Chapter 5.4.1 --- The Clustalw Algorithm / Chapter 5.4.2 --- MSA Using SSST / Chapter 5.4.3 --- Time and Space Complexity Analysis / Chapter 5.5 --- Experiments / Chapter 5.5.1 --- Experiment Setting / Chapter 5.5.2 --- Experimental Results / Chapter 5.6 --- Summary / Chapter 6 --- Applications: Marker Extractions / Chapter 6.1 --- Introduction / Chapter 6.2 --- Problem Statement / Chapter 6.3 --- The Multiple Sequence Alignment Approach / Chapter 6.3.1 --- Design / Chapter 6.4 --- Reference Sequence Alignment Approach / Chapter 6.4.1 --- Design / Chapter 6.5 --- Time and Space Complexity Analysis / Chapter 6.6 --- Experiments / Chapter 6.7 --- Summary / Chapter 7 --- HBV Application Framework / Chapter 7.1 --- Motivations / Chapter 7.2 --- The Procedure Flow of the Application / Chapter 7.2.1 --- Markers Extractions / Chapter 7.2.2 --- Rules Training and Prediction / Chapter 7.3 --- Results / Chapter 7.3.1 --- Clustering / Chapter 7.3.2 --- Classification / Chapter 7.4 --- Summary / Chapter 8 --- Conclusions / Chapter 8.1 --- Contributions / Chapter 8.2 --- Future Works / Chapter 8.2.1 --- HMM Learning / Chapter 8.2.2 --- Splice Sites Learning / Chapter 8.2.3 --- Faster Algorithm for Multiple Sequences Alignment / Bibliography Nucleotide sequence--Methodology Hepatitis B virus Sequence Analysis, DNA--methods Sequence Alignment--methods
48	Aplicação de estratégias híbridas em algoritmos de alinhamento múltiplo de sequências para ambientes de computação paralela e distribuída. / Application of hybrid strategies in multiple sequence alignments for parallel and distributed computing environments. Geraldo Francisco Donegá Zafalon 11 November 2014 (has links) Bioinformatics has developed rapidly in recent years. The need to process large sets of sequences, whether of nucleotides or amino acids, has stimulated the development of many algorithmic techniques to make this problem tractable. Multiple sequence alignment algorithms have played a central role, because they make aligning sets of more than two sequences computationally feasible. However, with the rapid growth of both the number and the length of sequences in a set, using these algorithms without additional optimization strategies has become impractical. High-performance computing has therefore emerged as one of the resources to be used, through the parallelization of various strategies for execution on large computational systems. Moreover, as sequence sets continue to expand, other optimization strategies have been combined with parallel multiple sequence alignment algorithms. The development of multiple sequence alignment tools based on hybrid approaches currently stands out as the best-accepted solution. This work presents a hybrid strategy for progressive multiple sequence alignment algorithms, which are widely used in Bioinformatics. The approach combines parallelization and partitioning of the sequence sets in the score matrix construction stage with optimization of the phylogenetic tree construction and multiple alignment stages through ant colony and parallel simulated annealing algorithms, respectively. Bioinformatics Multiple sequence alignment Optimization algorithms Parallel processing
49	Sequências de DNA: uma nova abordagem para o alinhamento ótimo / DNA sequences: a new approach to optimal alignment Ioste, Aline Rodrigheri 04 March 2016 (has links) The objective of this study is to understand in depth the techniques currently used for optimal alignment of DNA sequences, with a focus on their strengths and limitations, and to analyze the feasibility of creating a new logical approach able to guarantee optimal results, taking into account existing problems in optimal alignment: (i) the numerous possible alignments of two sequences, (ii) the large space and memory requirements on the machines, (iii) the processing time needed to compute the optimal results and (iv) their exponential growth. The study made it possible to begin creating a new logical approach to global optimal alignment, showing promising results of higher scores with fewer calculations; mastery of these new techniques may lead to the search for optimal global alignments of biological sequences in large databases. DNA Global optimal alignment Sequence alignment Algorithms CNPQ::OUTROS
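For context on the memory concern raised in entry 49, the classic dynamic-programming approach to optimal global alignment (Needleman-Wunsch) can compute the optimal score while keeping only two rows of the score table. The sketch below shows that standard technique, not the new approach proposed in the thesis; the scoring parameters are arbitrary assumptions, and recovering the alignment itself (rather than just its score) would need additional bookkeeping.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences a and b under a linear
    gap penalty, using O(len(b)) memory (two rows of the DP table)."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against b[:j]
    for i, x in enumerate(a, 1):
        curr = [i * gap]                         # aligning a[:i] against ""
        for j, y in enumerate(b, 1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag,                # align x with y
                            prev[j] + gap,       # x against a gap
                            curr[j - 1] + gap))  # y against a gap
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))
```

Time remains proportional to the product of the two sequence lengths, which is the cost that motivates looking for approaches needing fewer calculations.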
50	Development of bioinformatics platforms for methylome and transcriptome data analysis. January 2014 (has links) High-throughput massively parallel sequencing technologies, or Next-Generation Sequencing (NGS) technologies, have greatly accelerated biological and medical research. With the ever-growing throughput and complexity of NGS technologies, bioinformatics methods and tools are urgently needed for analyzing the large amounts of data and discovering the meaningful information behind them. In this thesis, I mainly worked on developing bioinformatics algorithms for two research fields, DNA methylation data analysis and large intergenic noncoding RNA discovery, where NGS technologies are used extensively and novel bioinformatics algorithms are much needed. / DNA methylation is an important epigenetic modification that regulates gene expression. Whole genome bisulfite sequencing (BS-seq) is one of the most precise methodologies for studying DNA methylation, allowing whole-methylome research at single-base resolution. To analyze the large amount of data generated by BS-seq experiments, I have co-developed and optimized Methy-Pipe, an integrated bioinformatics pipeline that performs both sequencing read alignment and methylation state decoding. Furthermore, I have developed a novel algorithm for mining Differentially Methylated Regions (DMRs), which can be used for large-scale methylation marker discovery. Methy-Pipe has been routinely used in our laboratory for methylomic studies, including non-invasive prenatal diagnosis and early cancer detection in human plasma. / Large intergenic noncoding RNAs, or lincRNAs, are an important novel family of gene regulators in many biological processes, such as post-transcriptional regulation, splicing and aging. Because lincRNA expression is highly tissue specific, a large proportion of lincRNAs is still undiscovered. The development of Whole Transcriptome Shotgun Sequencing, also known as RNA-seq, combined with de novo or ab initio assembly, promises large-scale discovery of novel lincRNAs and hence the construction of a complete transcriptome catalog. However, efficiently and accurately identifying novel lincRNAs from large transcriptome data sets remains a bioinformatics challenge. To fill this gap, I have developed two bioinformatics tools: I) iSeeRNA, for distinguishing lincRNAs from mRNAs, and II) sebnif, for comprehensive filtering towards high-quality lincRNA screening. Both tools have been used in various biological systems and showed satisfactory performance. / In summary, I have developed several bioinformatics algorithms that help researchers take advantage of the strengths of NGS technologies in methylome and transcriptome studies and explore the biological nature behind the large amounts of data. / Detailed summary in vernacular field only. / Sun, Kun. / Thesis (Ph.D.) Chinese University of Hong Kong, 2014. / Includes bibliographical references (leaves 118-126). / Abstracts also in Chinese. DNA--Methylation Nucleotide sequence Sequence alignment (Bioinformatics) Gene Expression Profiling Computational Biology
