Global ETD Search

61	Sequências de DNA: uma nova abordagem para o alinhamento ótimo Ioste, Aline Rodrigheri 04 March 2016 (has links) Made available in DSpace on 2016-04-29T14:23:42Z (GMT). No. of bitstreams: 1 Aline Rodrigheri Ioste.pdf: 3892568 bytes, checksum: d4b25a166ea46de0a3b7edfbfeab6923 (MD5) Previous issue date: 2016-03-04 / The objective of this study is to deeply understand the techniques currently used in optimal alignment of DNA sequences, focused on the strengths and limitations of these methods. Analyzing the feasibility of creating a new logical approach able to ensure optimal results , taking into account existing problems in optimal alignment as: (i ) the numerous alignment possibilities of two sequences , ( ii ) the great need for space and memory the machines, ( ii ) processing time to compute the optimal data and (iv ) exponential growth. This study allowed the beginning of the creation of a new logical approach to the global optimum alignment, showing promising results in higher scores with less need for calculations where the mastery of these new techniques can lead to use search of excellent results in the global alignment optimal in large data bases / O objetivo deste estudo é entender profundamente as técnicas utilizadas atualmente no alinhamento ótimo de sequências de DNA e analisar a viabilidade da criação de uma nova abordagem lógica capaz de garantir o resultado ótimo, levando em consideração os problemas existentes no alinhamento ótimo como: (i) as inúmeras possibilidades de alinhamento de duas sequências, (ii) a grande necessidade de espaço e memória das máquinas, (ii) o tempo de processamento para computar os dados ótimos e (iv) seu crescimento exponencial. O presente estudo permitiu o início da criação de uma nova abordagem lógica para o alinhamento ótimo global, demonstrando resultados promissores de maiores pontuações com menos necessidades de cálculos, onde o domínio destas novas técnicas pode conduzir à utilização da busca de resultados ótimos no alinhamento global de sequências biológicas em grandes bases de dados DNA Alinhamento ótimo global Alinhamento de sequências Algoritmos Global alignment Ooptimal Sequence alignment Algorithm CNPQ::OUTROS
62	Development of bioinformatics platforms for methylome and transcriptome data analysis. January 2014 (has links) 高通量大規模並行測序技術，又称為二代測序（NGS），極大的加速了生物和醫學研究的進程。隨著測序通量和複雜度的不斷提高，在分析大量的資料以挖掘其中的資訊的過程中，生物訊息學變得越發重要。在我的博士研究生期間（及本論文中），我主要從事於以下兩個領域的生物訊息學演算法的開發：DNA甲基化資料分析和基因間區長鏈非編碼蛋白RNA（lincRNA）的鑒定。目前二代測序技術在這兩個領域的研究中有著廣泛的應用，同時急需有效的資料處理方法來分析對應的資料。 / DNA甲基化是一種重要的表觀遺傳修飾，主要用來調控基因的表達。目前，全基因組重亞硫酸鹽測序（BS-seq）是最準確的研究DNA甲基化的實驗方法之一，該技術的一大特點就是可以精確到單個堿基的解析度。為了分析BS-seq產生的大量測序數據，我參與開發並深度優化了Methy-Pipe軟體。Methy-Pipe集成了測序序列比對和甲基化程度分析，是一個一體化的DNA甲基化資料分析工具。另外，在Methy-Pipe的基礎上，我又開發了一個新的用於檢測DNA甲基化差異區域（DMR）的演算法，可以用於大範圍的尋找DNA甲基化標記。Methy-Pipe在我們實驗室的DNA甲基化研究項目中得到廣泛的應用，其中包括基於血漿的無創產前診斷（NIPD）和癌症的檢測。 / 基因間區長鏈非編碼蛋白RNA（lincRNA）是一種重要的調節子，其在很多生物學過程中發揮作用，例如轉錄後調控，RNA的剪接，細胞老化等。lincRNA的表達具有很強的組織特異性，因此很大一部分lincRNA還沒有被發現。最近，全轉錄組測序技術（RNA-seq）結合基因從頭組裝，為新的lincRNA鑒定以及構建完整的轉錄組列表提供了最有力的方法。然而，有效並準確的從大量的RNA-seq測序數據中鑒定出真實的新的lincRNA仍然具有很大的挑戰性。為此，我開發了兩個生物訊息學工具：1）iSeeRNA，用於區分lincRNA和編碼蛋白RNA（mRNA）；2）sebnif，用於深層次資料篩選以得到高品質的lincRNA列表。這兩個工具已經在多個生物學系統中使用並表現出很好的效果。 / 總的來說，我開發了一些生物訊息學方法，這些方法可以幫助研究人員更好的利用二代測序技術來挖掘大量的測序數據背後的生物學本質，尤其是DNA甲基化和轉錄組的研究。 / High-throughput massive parallel sequencing technologies, or Next-Generation Sequencing (NGS) technologies, have greatly accelerated biological and medical research. With the ever-growing throughput and complexity of the NGS technologies, bioinformatics methods and tools are urgently needed for analyzing the large amount of data and discovering the meaningful information behind. In this thesis, I mainly worked on developing bioinformatics algorithms for two research fields: DNA methylation data analysis and large intergenic noncoding RNA discovery, where the NGS technologies are in-depth employed and novel bioinformatics algorithms are highly needed. / DNA methylation is one of the important epigenetic modifications to control the transcriptional regulations of the genes. Whole genome bisulfite sequencing (BS-seq) is one of the most precise methodologies for DNA methylation study which allows us to perform whole methylome research at single-base resolution. To analyze the large amount of data generated by BS-seq experiments, I have co-developed and optimized Methy-Pipe, an integrated bioinformatics pipeline which can perform both sequencing read alignment and methylation state decoding. Furthermore, I’ve developed a novel algorithm for Differentially Methylated Regions (DMR) mining, which can be used for large scale methylation marker discovery. Methy-Pipehas been routinely used in our laboratory for methylomic studies, including non-invasive prenatal diagnosis and early cancer detections in human plasma. / Large intergenic noncoding RNAs, or lincRNAs, is avery important novel family of gene regulators in many biological processes, such as post-transcriptional regulation, splicing and aging. Due to high tissue-specific expression pattern of the lincRNAs, a large proportion is still undiscovered. The development of Whole Transcriptome Shotgun Sequencing, also known as RNA-seq, combined with de novo or ab initio assembly, promises quantity discovery of novel lincRNAs hence building the complete transcriptome catalog. However, to efficiently and accurately identify the novel lincRNAs from the large transcriptome data stillremains a bioinformatics challenge.To fill this gap, I have developed two bioinformatics tools: I) iSeeRNAfor distinguishing lincRNAs from mRNAs and II) sebnif for comprehensive filtering towards high quality lincRNA screening which has been used in various biological systems and showed satisfactory performance. / In summary, I have developed several bioinformatics algorithms which help the researchers to take advantage of the strength of the NGS technologies(methylome and transcriptome studies) and explore the biological nature behind the large amount of data. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Detailed summary in vernacular field only. / Sun, Kun. / Thesis (Ph.D.) Chinese University of Hong Kong, 2014. / Includes bibliographical references (leaves 118-126). / Abstracts also in Chinese. DNA--Methylation Nucleotide sequence Sequence alignment (Bioinformatics) Gene Expression Profiling Computational Biology
63	Origem de genes recentes, uma abordagem com PSSMs deterioradas e arquiteturas de domínio proteico / Origin of recent genes, an approach with deteriorated PSSMs and protein domain architectures Souza, Diego Trindade de 06 October 2016 (has links) A origem dos novos genes é um processo importante para a evolução dos organismos, pois ela fornece fontes singulares para a inovação evolutiva. As abordagens que mostram como esses novos genes surgem e adquirem novas funções no curso da evolução são de fundamental importância, por exemplo, elas podem ajudar a correlacionar mutações com alterações metabólicas, fisiológicas e/ou morfológicas, indicando quais mutações podem ser importantes funcionalmente. Recentemente, uma nova abordagem, nomeada de filoestratigrafia, foi desenvolvida para estabelecer origem evolutiva dos genes. Neste método a emergência de novas sequências de um nó filogenético particular em uma linhagem ancestral-descente é inferida geralmente utilizando algoritmos de similaridade. No presente trabalho, nós fizemos uma pesquisa filoestratigráfica de dois bancos de dados de domínios proteicos, CATH e Pfam, para todas as entradas humanas descrevemos a origem dos domínios e arquiteturas humanas. Também realizamos uma nova abordagem para refinar os resultados por Male-PSI-BLAST, em um estudo de caso dos domínios príons e ADHs. A análise das duas bases de dados mostrou que existiram três períodos importantes de aparecimento de domínios proteicos humanos: a origem do organismo celular, Eucarioto e Euteleostomi, nos quais há um elevado número de surgimento de novos genes na linhagem ancestral-descente de humanos. Quando analisamos o aparecimento de arquiteturas, elas são evidentemente mais recentes que o aparecimento de domínios, embora, em seu conteúdo, possa haver domínios muito antigos misturados com domínios novos. Não notamos nenhuma tendência de acréscimo de novos domínios para arquiteturas que compreendem domínios antigos ou recentes. Para medir o grau de versatilidade de domínio, nós utilizamos a frequência ponderada de bigrama, uma combinação específica de dois domínios adjacentes. O teste de correlação de Spearman mostrou que existe uma baixa correlação negativa entre a idade de domínios e índices de versatilidade. Em um estudo de caso, demonstramos que é possível caracterizar descontinuidades evolutivas nos resultados de Male-PSI-BLAST entre domínios que surgiram a partir de outros. Pela primeira vez, a origem de todos os domínios e arquiteturas proteicas presentes nas bases de dados estudadas foi descrita, fornecendo um cenário evolutivo que pode ser mais explorado a partir das abordagens aqui desenvolvidas. / The origin of new genes is an important process for the evolution of organisms because they provide singular sources for evolutionary innovation. The approaches that show how these new genes arise and acquire new functions in the course of evolution are of fundamental importance: they can help to correlate mutations with metabolic, physiological, and morphological changes, indicating which mutations are likely to be functionally important. Recently, a new approach, named phylostratigraphy, was developed to establish the evolutionary origin of the genes. In this method the emergence of novel sequences at a particular phylogenetic node in a descendent-ancestor lineage is infer usually by using the similarity search algorithm. In the present work, we did a phylostratigraphical search of two protein domain databases, CATH and Pfam, for all human entries and depicted the origin of human domains and architectures. We also conducted a new approach to refine results by Male-PSI-BLAST in a case study of prions and ADH\'s domains. The analysis of two databases showed that there are three important periods of appearance of human gene domains: the origin of cellular organism, Eukaryote, and Euteleostomi appear to be important for production of new genes at the ancestor-descendent lineages that lead to the human species. However, when we analyze the appearance of architectures, they are by far more recent than the appearance of domains, although they might contain very ancient domains mixed with recent ones. We did not notice a bias of addition of new domains to architectures comprising either ancient or recent domains. To measure the degree of domain versatility, we used the weighted bigram frequency, where bigram is defined as a specific combination of two adjacent domains. The Spearman correlation test showed that there is a low negative correlation between the age of domains and versatility indexes. In the study of case, we have demonstrated that it is possible to characterize evolutionary ruptures in results of Male-PSI- BLAST between domains that emerged from others. For the first time the origin of all protein domains and architectures was depicted, providing an evolutionary scenario that can be further explored. Alinhamento de sequências Detecção de novos genes Filoestratigrafia New genes detection Phylostratigraphy Sequence alignment
64	Using event sequence alignment to automatically segment web users for prediction and recommendation / Alignement de séquences d'évènements pour la segmentation automatique d'internautes, la prédiction et la recommandation Luu, Vinh Trung 16 December 2016 (has links) Une masse de données importante est collectée chaque jour par les gestionnaires de site internet sur les visiteurs qui accèdent à leurs services. La collecte de ces données a pour objectif de mieux comprendre les usages et d'acquérir des connaissances sur le comportement des visiteurs. A partir de ces connaissances, les gestionnaires de site peuvent décider de modifier leur site ou proposer aux visiteurs du contenu personnalisé. Cependant, le volume de données collectés ainsi que la complexité de représentation des interactions entre le visiteur et le site internet nécessitent le développement de nouveaux outils de fouille de données. Dans cette thèse, nous avons exploré l’utilisation des méthodes d’alignement de séquences pour l'extraction de connaissances sur l'utilisation de site Web (web mining). Ces méthodes sont la base du regroupement automatique d’internautes en segments, ce qui permet de découvrir des groupes de comportements similaires. De plus, nous avons également étudié comment ces groupes pouvaient servir à effectuer de la prédiction et la recommandation de pages. Ces thèmes sont particulièrement importants avec le développement très rapide du commerce en ligne qui produit un grand volume de données (big data) qu’il est impossible de traiter manuellement. / This thesis explored the application of sequence alignment in web usage mining, including user clustering and web prediction and recommendation.This topic was chosen as the online business has rapidly developed and gathered a huge volume of information and the use of sequence alignment in the field is still limited. In this context, researchers are required to build up models that rely on sequence alignment methods and to empirically assess their relevance in user behavioral mining. This thesis presents a novel methodological point of view in the area and show applicable approaches in our quest to improve previous related work. Web usage behavior analysis has been central in a large number of investigations in order to maintain the relation between users and web services. Useful information extraction has been addressed by web content providers to understand users’ need, so that their content can be correspondingly adapted. One of the promising approaches to reach this target is pattern discovery using clustering, which groups users who show similar behavioral characteristics. Our research goal is to perform users clustering, in real time, based on their session similarity. Fouille du web Regroupement Alignement de séquence Web mining Clustering Sequence alignment 005.7
65	Bioinformatic Analysis of Mutation and Selection in the Vertebrate Non-coding Genome Brandström, Mikael January 2007 (has links) <p>The majority of the vertebrate genome sequence is not coding for proteins. In recent years, the evolution of this noncoding fraction of the genome has gained interest. These studies have been greatly facilitated by the availability of full genome sequences. The aim of this thesis is to study evolution of the noncoding vertebrate genome through bioinformatic analysis of large-scale genomic datasets.</p><p>In a first analysis we addressed the use of conservation of sequence between highly diverged genomes to infer function. We provided evidence for a turnover of the patterns of negative selection. Hence, measures of constraint based on comparisons of diverged genomes might underestimate the functional proportion of the genome.</p><p>In the following analyses we focused on length variation as found in small-scale insertion and deletion (indel) polymorphisms and microsatellites. For indels in chicken, replication slippage is a likely mutation mechanism, as a large proportion of the indels are parts of tandem-duplicates. Using a set of microsatellite polymorphisms in chicken, where we avoid ascertainment bias, we showed that polymorphism is positively correlated with microsatellite length and AT-content. Furthermore, interruptions in the microsatellite sequence decrease the levels of polymorphism.</p><p>We also analysed the association between microsatellite polymorphism and recombination in the human genome. Here we found increased levels of microsatellite polymorphism in human recombination hotspots and also similar increases in the frequencies of single nucleotide polymorphisms (SNPs) and indels. This points towards natural selection shaping the levels of variation. Alternatively, recombination is mutagenic for all three kinds of polymorphisms. </p><p>Finally, I present the program ILAPlot. It is a tool for visualisation, exploration and data extraction based on BLAST.</p><p>Our combined results highlight the intricate connections between evolutionary phenomena. It also emphasises the importance of length variability in genome evolution, as well as the gradual difference between indels and microsatellites.</p> Biology non-coding genome selection mutation chicken human indel microsatellite sequence alignment comparative genomics bioinformatics vertebrate Biologi
66	Detection and characterization of 3D-signature phosphorylation site motifs and their contribution towards improved phosphorylation site prediction in proteins Durek, Pawel, Schudoma, Christian, Weckwerth, Wolfram, Selbig, Joachim, Walther, Dirk January 2009 (has links) Background: Phosphorylation of proteins plays a crucial role in the regulation and activation of metabolic and signaling pathways and constitutes an important target for pharmaceutical intervention. Central to the phosphorylation process is the recognition of specific target sites by protein kinases followed by the covalent attachment of phosphate groups to the amino acids serine, threonine, or tyrosine. The experimental identification as well as computational prediction of phosphorylation sites (P-sites) has proved to be a challenging problem. Computational methods have focused primarily on extracting predictive features from the local, one-dimensional sequence information surrounding phosphorylation sites. Results: We characterized the spatial context of phosphorylation sites and assessed its usability for improved phosphorylation site predictions. We identified 750 non-redundant, experimentally verified sites with three-dimensional (3D) structural information available in the protein data bank (PDB) and grouped them according to their respective kinase family. We studied the spatial distribution of amino acids around phosphorserines, phosphothreonines, and phosphotyrosines to extract signature 3D-profiles. Characteristic spatial distributions of amino acid residue types around phosphorylation sites were indeed discernable, especially when kinase-family-specific target sites were analyzed. To test the added value of using spatial information for the computational prediction of phosphorylation sites, Support Vector Machines were applied using both sequence as well as structural information. When compared to sequence-only based prediction methods, a small but consistent performance improvement was obtained when the prediction was informed by 3D-context information. Conclusion: While local one-dimensional amino acid sequence information was observed to harbor most of the discriminatory power, spatial context information was identified as relevant for the recognition of kinases and their cognate target sites and can be used for an improved prediction of phosphorylation sites. A web-based service (Phos3D) implementing the developed structurebased P-site prediction method has been made available at http://phos3d.mpimp-golm.mpg.de. Support vector machines Microarray data Docking interactions Signal-transduction Sequence alignment Life sciences
67	Bioinformatic Analysis of Mutation and Selection in the Vertebrate Non-coding Genome Brandström, Mikael January 2007 (has links) The majority of the vertebrate genome sequence is not coding for proteins. In recent years, the evolution of this noncoding fraction of the genome has gained interest. These studies have been greatly facilitated by the availability of full genome sequences. The aim of this thesis is to study evolution of the noncoding vertebrate genome through bioinformatic analysis of large-scale genomic datasets. In a first analysis we addressed the use of conservation of sequence between highly diverged genomes to infer function. We provided evidence for a turnover of the patterns of negative selection. Hence, measures of constraint based on comparisons of diverged genomes might underestimate the functional proportion of the genome. In the following analyses we focused on length variation as found in small-scale insertion and deletion (indel) polymorphisms and microsatellites. For indels in chicken, replication slippage is a likely mutation mechanism, as a large proportion of the indels are parts of tandem-duplicates. Using a set of microsatellite polymorphisms in chicken, where we avoid ascertainment bias, we showed that polymorphism is positively correlated with microsatellite length and AT-content. Furthermore, interruptions in the microsatellite sequence decrease the levels of polymorphism. We also analysed the association between microsatellite polymorphism and recombination in the human genome. Here we found increased levels of microsatellite polymorphism in human recombination hotspots and also similar increases in the frequencies of single nucleotide polymorphisms (SNPs) and indels. This points towards natural selection shaping the levels of variation. Alternatively, recombination is mutagenic for all three kinds of polymorphisms. Finally, I present the program ILAPlot. It is a tool for visualisation, exploration and data extraction based on BLAST. Our combined results highlight the intricate connections between evolutionary phenomena. It also emphasises the importance of length variability in genome evolution, as well as the gradual difference between indels and microsatellites. Biology non-coding genome selection mutation chicken human indel microsatellite sequence alignment comparative genomics bioinformatics vertebrate Biologi
68	Game theoretic and machine learning techniques for balancing games Long, Jeffrey Richard 29 August 2006 Game balance is the problem of determining the fairness of actions or sets of actions in competitive, multiplayer games. This problem primarily arises in the context of designing board and video games. Traditionally, balance has been achieved through large amounts of play-testing and trial-and-error on the part of the designers. In this thesis, it is our intent to lay down the beginnings of a framework for a formal and analytical solution to this problem, combining techniques from game theory and machine learning. We first develop a set of game-theoretic definitions for different forms of balance, and then introduce the concept of a strategic abstraction. We show how machine classification techniques can be used to identify high-level player strategy in games, using the two principal methods of sequence alignment and Naive Bayes classification. Bioinformatics sequence alignment, when combined with a 3-nearest neighbor classification approach, can, with only 3 exemplars of each strategy, correctly identify the strategy used in 55\% of cases using all data, and 77\% of cases on data that experts indicated actually had a strategic class. Naive Bayes classification achieves similar results, with 65\% accuracy on all data and 75\% accuracy on data rated to have an actual class. We then show how these game theoretic and machine learning techniques can be combined to automatically build matrices that can be used to analyze game balance properties. games game balance machine learning sequence alignment game theory naive bayes
69	Scrambling analysis of ciliates Liu, Jing 10 September 2009 Ciliates are a class of organisms which undergo a genetic process called gene descrambling after mating. In order to better understand the problem, a literature review of past works has been presented in this thesis. This includes a brief summary of both the relevant biology and bioinformatics literature. Then, a formal definition of scrambling systems is developed which attempts to model the problem of sequence alignment between scrambled and descrambled genes. With this system, sequences can be classified into relevant functional segments. It also provides a framework whereby we can compare various ciliate sequence alignment algorithms. After that, a new method of predicting the various functional segments is studied. This method shows better coverage, and usually a better labelling score with certain parameters. Then we discuss several recent hypotheses as to how ciliates naturally descramble genes. An algorithm suite is developed to test these hypotheses. With the tests, we are able to computationally check which factors are potentially the most important. According to the current results with 247 pointer sequences of 13 micronuclear genes, examining repeats which are the same distance together with either the sequence or the size, as the real pointers, is almost always enough information to guide descrambling. Indeed, the real pointer sequence is the unique repeat 92.7% and 94.3% of the time within the 247 pointers, from the left and right respectively, using only the pointer distance and the pointer sequence information. theoretical computer science ciliate scrambling system scrambling analysis sequence alignment bioinformatics
70	Efficient Characterization of Short Anelloviruses Fragments Found in Metagenomic Samples Al-Absi, Thabit January 2012 (has links) Some viral metagenomic serum samples contain a huge amount of Anellovirus, which is a genetically diverse family with a few conserved regions making it hard to efficiently characterize. Multiple sequence alignment of the Anelloviruses found in the sample must be constructed to get a clear picture of Anellovirus diversity and to identify stable regions. Using available multiple sequence alignment software directly on these fragments results in an MSA of a very poor quality due to their diversity, misaligned regions and low-quality regions present in the sequence. An efficient MSA must be constructed in order to characterize these Anellovirus present in the samples. Pairwise alignment is used to align one fragment to the database sequences at a time. The fragments are then aligned to the database sequences using the start and end position from the pairwise alignment results. The algorithm will also exclude non-aligned portions of the fragments, as these are very hard to handle properly and are often products of misassembly or chimeric sequenced fragments. Other tools to aid further analysis were developed, such as finding a non-overlapping window that contains the most fragments, find consensus of the alignment and extract any regions from the MSA for further analysis. An MSA was constructed with a high percent of correctly aligned bases compared to an MSA constructed using MSA softwares. The minimal number of genomes found in the sampled sequence was found as well as a distribution of the fragments along the database sequence. Moreover, highly conserved region and the window containing most fragments were extracted from the MSA and phylogenetic trees were constructed for these regions. Anellovirus TTV torque teno virus MSA construction multiple sequence alignment construction anellovirus characterisation

Search results