101

Genomic variation detection using dynamic programming methods

Zhao, Mengyao January 2014 (has links)
Thesis advisor: Gabor T. Marth / Background: Due to the rapid development and application of next-generation sequencing (NGS) techniques, large amounts of NGS data have become available for genome-related biological research, such as population genetics, evolutionary research, and genome-wide association studies. A crucial step in these studies is the detection of genomic variation between different species and individuals. Current approaches for the detection of genomic variation can be classified into alignment-based and assembly-based variation detection. Due to the limitations of current NGS read lengths, alignment-based variation detection remains the mainstream approach. The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Though various fast Smith-Waterman implementations have been developed, they are either designed as monolithic protein database searching tools that do not return detailed alignments, or they are embedded into other tools. These issues make reusing these efficient Smith-Waterman implementations impractical. After the alignment step in the traditional variation detection pipeline, the subsequent variation detection using pileup data and a Bayesian model also faces great challenges, especially in low-complexity genomic regions. Sequencing errors and misalignment still strongly affect variation detection (especially INDEL detection). The accuracy of genomic variation detection still needs to be improved, especially for low-complexity genomic regions and low-quality sequencing data. Results: To facilitate easy integration of the fast Single-Instruction-Multiple-Data (SIMD) Smith-Waterman algorithm into third-party software, we wrote a C/C++ library that extends Farrar's Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost in efficiency. This improvement makes the fast SIMD Smith-Waterman genuinely useful in genomic applications. SSW is available both as a C/C++ software library and as a stand-alone alignment tool at: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library. The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TANGRAM, and the read-overlap graph generation program RZMBLR. The speed of each of these tools is improved significantly by replacing its ordinary or banded Smith-Waterman module with the SSW library. To improve the accuracy of genomic variation detection, especially in low-complexity genomic regions and on low-quality sequencing data, we developed PHV, a genomic variation detection tool based on the profile hidden Markov model (PHMM). PHV also demonstrates a novel PHMM application in the genomic research field. The banded PHMM algorithms used in PHV make it a very fast HMM-based whole-genome variation detection tool. The comparison of PHV to GATK, Samtools, and Freebayes for detecting variation from both simulated and real data shows that PHV has good potential for dealing with sequencing errors and misalignments. PHV also successfully detects a 49 bp deletion that is completely misaligned by the mapping tool and missed by GATK and Samtools. Conclusion: The work in this thesis contributes to methodology development for genomic variation detection, and the two novel algorithms presented here should also inform future work in NGS data analysis. / Thesis (PhD) — Boston College, 2014. / Submitted to: Boston College. Graduate School of Arts and Sciences. / Discipline: Biology.
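For readers who want to see the recurrence that SSW accelerates with SIMD instructions, here is a minimal pure-Python sketch of classic Smith-Waterman local alignment with a linear gap penalty. The match/mismatch/gap values are illustrative assumptions, not the SSW library's defaults, and the sketch omits the affine gaps, striped vectorisation, and traceback that the library provides.

```python
# Minimal Smith-Waterman local alignment (linear gap penalty, score only).
# Illustrative scores; SSW adds SIMD striping, affine gaps and traceback.
def smith_waterman(a, b, match=2, mismatch=-2, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_end = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_end = H[i][j], (i, j)
    return best, best_end  # optimal local score and its end coordinates

print(smith_waterman("ACACACTA", "AGCACACA"))
```

Recovering the alignment itself (rather than only the score) requires a traceback from best_end through the maximising moves; returning that information on top of Farrar's score-only striped implementation is precisely what the SSW library adds.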
102

Algorithms in protein functionality analysis.

January 2002 (has links)
Leung Ka-Kit. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. / Includes bibliographical references (leaves 129-131). / Abstracts in English and Chinese. / Abstract --- p.1 / Chapter CHAPTER 1. --- introduction --- p.14 / Chapter 1.1 --- Preamble --- p.14 / Chapter 1.2 --- Biological background --- p.14 / Chapter CHAPTER 2. --- previous related work --- p.18 / Chapter 2.1 --- Protein functionality analysis --- p.18 / Chapter 2.1.1 --- Analysis from primary structure --- p.18 / Chapter 2.1.2 --- Analysis from tertiary structure --- p.20 / Chapter 2.2 --- Secondary structure prediction --- p.21 / Chapter 2.3 --- Motivation - Challenges from protein complexity --- p.22 / Chapter CHAPTER 3. --- mathematical representations for protein properties and sequence alignment --- p.24 / Chapter 3.1 --- Secondary structure sequence model --- p.24 / Chapter 3.2 --- Substitution matrix --- p.26 / Chapter 3.3 --- Gap --- p.26 / Chapter 3.4 --- Similarity measurement --- p.27 / Chapter 3.5 --- Geometric Model for Protein --- p.28 / Chapter CHAPTER 4. --- overall system design --- p.30 / Chapter 4.1 --- System architecture and design --- p.30 / Chapter 4.2 --- System environment --- p.32 / Chapter 4.3 --- Experimental data --- p.32 / Chapter CHAPTER 5. --- adaptive dynamic programming (adp)- general global alignment consideration --- p.35 / Chapter 5.1 --- t-triangles cutting --- p.35 / Chapter 5.1.1 --- Theoretical time and memory requirements of ADP with z-triangles cutting --- p.43 / Chapter 5.1.1.1 --- Study of parameters affecting h in case 1 --- p.44 / Chapter 5.1.1.2 --- Study of parameters affecting h in case 2 --- p.45 / Chapter 5.1.2 --- Experimental results of ADP with z-triangles cutting --- p.46 / Chapter 5.2 --- Constructing the path matrix by expansion --- p.51 / Chapter 5.2.1 --- Time and memory requirements of EXPAND --- p.57 / Chapter 5.2.2 --- Experimental results and discussions --- p.58 / Chapter CHAPTER 6. --- adp - global alignment of sequences with consecutive repeated characters --- p.65 / Chapter 6.1 --- Estimation of similarity upper bound (Ba) --- p.65 / Chapter 6.1.1 --- Sequence composition (SC) consideration --- p.65 / Chapter 6.1.2 --- Implementation of SC --- p.67 / Chapter 6.1.3 --- Experimental results --- p.69 / Chapter 6.1.4 --- Overall trend of change of structures (OTCS) --- p.74 / Chapter 6.1.5 --- Uninformed search --- p.76 / Chapter 6.2 --- Short-cut --- p.80 / Chapter 6.2.1 --- Time and memory requirements --- p.86 / Chapter 6.2.2 --- Experimental results and discussions --- p.86 / Chapter CHAPTER 7. --- ga based topology discovery --- p.87 / Chapter 7.1 --- Chromosome encoding --- p.87 / Chapter 7.2 --- Non-sequential order penalty --- p.88 / Chapter 7.3 --- Fitness function --- p.88 / Chapter 7.4 --- Genetic operators --- p.88 / Chapter 7.4.1 --- Hop operator --- p.89 / Chapter 7.4.2 --- Inverse operator --- p.89 / Chapter 7.4.3 --- Shift operator --- p.90 / Chapter 7.4.4 --- Selection pressure --- p.90 / Chapter 7.5 --- Selection of progeny --- p.91 / Chapter 7.6 --- Implementation --- p.91 / Chapter 7.6.1 --- Size of population and generation --- p.91 / Chapter 7.6.2 --- Parallelization --- p.91 / Chapter 7.6.3 --- Crowding Handling --- p.92 / Chapter 7.6.4 --- Selection of progeny --- p.92 / Chapter 7.7 --- Results of alignment with GA exploration on topological order --- p.93 / Chapter CHAPTER 8. 
--- FILTERING OF FALSE POSITIVES --- p.103 / Chapter 8.1 --- Alignment Segments to Gap Ratio (ASGR) --- p.103 / Chapter 8.2 --- Tolerance --- p.104 / Chapter 8.3 --- Overall trend of change of structures (OTCS) --- p.104 / Chapter 8.4 --- Results and discussions --- p.105 / Chapter CHAPTER 9. --- SECONDARY STRUCTURE PREDICTION --- p.111 / Chapter 9.1 --- 3-state secondary structure prediction improvement --- p.111 / Chapter 9.2 --- 8-state secondary structure prediction --- p.117 / Chapter 9.3 --- Iterative Subordinate Voting (ISV) --- p.117 / Chapter 9.4 --- ISV Results and discussion --- p.119 / Chapter CHAPTER 10. --- CONCLUSIONS --- p.123 / Chapter 10.1 --- Contributions --- p.123 / Chapter 10.2 --- Future Work --- p.126 / Chapter 10.2.1 --- Using database indexing --- p.126 / Chapter 10.2.2 --- 3-state secondary structure prediction improvement --- p.127 / Appendix --- p.128 / Chapter • --- Interpretation on the dp-filter results --- p.128
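As a rough illustration of the chromosome-level operators listed in Chapter 7, the sketch below mutates a permutation-encoded topological order with an inversion-style and a shift-style move. The operator definitions are hypothetical simplifications; the thesis's own hop/inverse/shift operators and selection scheme may differ in detail.

```python
import random

# Hypothetical mutation operators on a permutation-encoded topology
# (a list of segment indices); simplified stand-ins for the thesis's
# hop/inverse/shift operators.
def inverse(chrom, i, j):
    """Reverse the gene order between positions i and j (inclusive)."""
    c = chrom[:]
    c[i:j + 1] = reversed(c[i:j + 1])
    return c

def shift(chrom, i, j):
    """Remove the gene at position i and reinsert it at position j."""
    c = chrom[:]
    c.insert(j, c.pop(i))
    return c

def mutate(chrom, rng=random):
    i, j = sorted(rng.sample(range(len(chrom)), 2))
    return inverse(chrom, i, j) if rng.random() < 0.5 else shift(chrom, i, j)

print(mutate([0, 1, 2, 3, 4, 5]))
```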
103

Alinhamento de seqüências com rearranjos / Sequences alignment with rearrangements

Vellozo, Augusto Fernandes 18 April 2007 (has links)
Sequence comparison by alignment algorithms is one of the most fundamental tasks in bioinformatics; such alignments model evolutionary changes in biological sequences through mutations such as insertion, deletion, and substitution of symbols. This work generalizes alignment algorithms so that other mutations, known as rearrangements, are also considered; specifically, we consider inversions, tandem duplications, and duplications by transposition. Alignment with inversions has no known polynomial algorithm, and a simplification of the problem that considers only non-overlapping inversions was proposed by Schöniger and Waterman in 1992. In 2003, two independent works proposed O(n^4)-time algorithms to align two sequences with non-overlapping inversions. We developed two algorithms that solve this same problem: one in O(n^3 log n) time and another, under some conditions on the scoring system, in O(n^3) time, both in O(n^2) memory. In 1997, Benson proposed an alignment model that recognizes tandem duplications in addition to insertions, deletions, and substitutions. He proposed two exact algorithms to align two sequences with tandem duplications: one in O(n^5) time and O(n^2) memory, and another in O(n^4) time and O(n^3) memory. We propose an algorithm to align two sequences with tandem duplications in O(n^3) time and O(n^2) memory. We also propose an algorithm to align two sequences with transposons (a more general type of duplication than tandem duplication) in O(n^3) time and O(n^2) memory.
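To make the rearrangement vocabulary concrete, the hypothetical helpers below show what an inversion (a segment replaced by its reverse complement) and a tandem duplication (a segment repeated in place) do to a DNA string. They only illustrate the mutation events that the alignment algorithms above must score inside their dynamic programming; they are not those algorithms.

```python
# Illustrative rearrangement operations on a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def invert(seq, i, j):
    """Replace seq[i:j] by its reverse complement (an inversion)."""
    return seq[:i] + seq[i:j].translate(COMPLEMENT)[::-1] + seq[j:]

def tandem_duplicate(seq, i, j):
    """Repeat seq[i:j] immediately after itself (a tandem duplication)."""
    return seq[:i] + seq[i:j] * 2 + seq[j:]

s = "ACGTTGCA"
print(invert(s, 2, 6))
print(tandem_duplicate(s, 2, 6))
```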
104

Efficient Homology Search for Genomic Sequence Databases

Cameron, Michael, mcam@mc-mc.net January 2006 (has links)
Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year and novel, faster methods for sequence comparison are required. In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage. Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times. We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast. Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance, however existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly-similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results. 
Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no significant effect on search accuracy. Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research.
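The byte-packed representation mentioned above is easy to illustrate: with two bits per base, four bases share one byte, so a single byte comparison checks four aligned positions at once. The sketch below is a simplified, hypothetical illustration of that principle, not the blast code itself, and it ignores ambiguity codes and sequence-boundary handling.

```python
# 2-bit encode DNA so four bases share one byte; comparing bytes then
# compares four aligned bases at a time (simplified illustration only).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    assert len(seq) % 4 == 0, "pad to a multiple of four bases first"
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for base in seq[i:i + 4]:
            b = (b << 2) | CODE[base]
        out.append(b)
    return bytes(out)

def identical_blocks(p, q):
    """Count byte positions at which all four packed bases match."""
    return sum(x == y for x, y in zip(p, q))

a, b = pack("ACGTACGT"), pack("ACGTTCGT")
print(identical_blocks(a, b))  # the first block of four bases matches, the second does not
```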
105

Multiple Biological Sequence Alignment: Scoring Functions, Algorithms, and Evaluations

Nguyen, Ken D 14 December 2011 (has links)
Aligning multiple biological sequences such as protein sequences or DNA/RNA sequences is a fundamental task in bioinformatics and sequence analysis. These alignments may contain invaluable information that scientists need to predict the sequences' structures, determine the evolutionary relationships between them, or discover drug-like compounds that can bind to the sequences. Unfortunately, multiple sequence alignment (MSA) is NP-complete. In addition, the lack of a reliable scoring method makes it very hard to align the sequences reliably and to evaluate the alignment outcomes. In this dissertation, we have designed a new scoring method for use in multiple sequence alignment. Our scoring method encapsulates stereo-chemical properties of sequence residues and their substitution probabilities into a tree-structured scoring scheme. This new technique provides a reliable scoring scheme with low computational complexity. In addition to the new scoring scheme, we have designed an overlapping sequence clustering algorithm for use in our three new multiple sequence alignment algorithms. One of our alignment algorithms uses a dynamically weighted guidance tree to perform multiple sequence alignment in a progressive fashion. The use of the dynamically weighted tree allows errors in the early alignment stages to be corrected in subsequent stages. The other two algorithms utilize sequence knowledge-bases and sequence consistency to produce biologically meaningful sequence alignments. To improve the speed of multiple sequence alignment, we have developed a parallel algorithm that can be deployed on reconfigurable computer models. Analytically, our parallel algorithm is the fastest progressive multiple sequence alignment algorithm.
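For context, the baseline that most MSA scoring schemes (including the tree-structured scheme proposed here) are measured against is the column-wise sum-of-pairs score. The sketch below uses toy match/mismatch/gap values in place of a real substitution matrix or the stereo-chemical weighting described in the abstract.

```python
# Column-wise sum-of-pairs score of an MSA (toy scoring values).
def sp_score(alignment, match=2, mismatch=-1, gap=-2):
    total = 0
    for col in zip(*alignment):          # iterate over alignment columns
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                x, y = col[i], col[j]
                if x == "-" and y == "-":
                    continue             # gap-gap pairs are conventionally not scored
                elif x == "-" or y == "-":
                    total += gap
                else:
                    total += match if x == y else mismatch
    return total

msa = ["AC-GT", "ACAGT", "GC-GT"]
print(sp_score(msa))
```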
106

Combinatorial optimization and application to DNA sequence analysis

Gupta, Kapil 25 August 2008 (has links)
With recent and continuing advances in bioinformatics, the volume of sequence data has increased tremendously. Along with this increase, there is a growing need to develop efficient algorithms to process such data in order to make useful and important discoveries. Careful analysis of genomic data will benefit science and society in numerous ways, including the understanding of protein sequence functions, early detection of diseases, and finding evolutionary relationships that exist among various organisms. Most sequence analysis problems arising from computational genomics and evolutionary biology fall into the class of NP-complete problems. Advances in exact and approximate algorithms to address these problems are critical. In this thesis, we investigate a novel graph theoretical model that deals with fundamental evolutionary problems. The model allows incorporation of the evolutionary operations "insertion", "deletion", and "substitution", and various parameters such as relative distances and weights. By varying appropriate parameters and weights within the model, several important combinatorial problems can be represented, including the weighted supersequence, weighted superstring, and weighted longest common sequence problems. Consequently, our model provides a general computational framework for solving a wide variety of important and difficult biological sequencing problems, including the multiple sequence alignment problem, and the problem of finding an evolutionary ancestor of multiple sequences. In this thesis, we develop large scale combinatorial optimization techniques to solve our graph theoretical model. In particular, we formulate the problem as two distinct but related models: constrained network flow problem and weighted node packing problem. The integer programming models are solved in a branch and bound setting using simultaneous column and row generation. The methodology developed will also be useful to solve large scale integer programming problems arising in other areas such as transportation and logistics.
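One of the special cases the abstract names, the weighted longest common (sub)sequence, has a compact dynamic-programming form. The sketch below uses arbitrary symbol weights and is meant only to show the kind of sub-problem that the constrained network-flow and node-packing formulations generalise, not the integer-programming machinery itself.

```python
# Weighted longest common subsequence: a matched symbol contributes its
# weight instead of 1 (toy weights; unknown symbols default to 1).
def weighted_lcs(a, b, weight):
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])
            if a[i - 1] == b[j - 1]:
                best = max(best, dp[i - 1][j - 1] + weight.get(a[i - 1], 1))
            dp[i][j] = best
    return dp[n][m]

print(weighted_lcs("GATTACA", "GCATGCT", {"G": 3, "A": 1, "T": 2, "C": 1}))
```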
107

Structural bioinformatics studies and tool development related to drug discovery

Hatherley, Rowan January 2016 (has links)
This thesis is divided into two distinct sections which can be combined under the broad umbrella of structural bioinformatics studies related to drug discovery. The first section involves the establishment of an online South African natural products database. Natural products (NPs) are chemical entities synthesised in nature and are unrivalled in their structural complexity, chemical diversity, and biological specificity, which has long made them crucial to the drug discovery process. South Africa is rich in both plant and marine biodiversity and a great deal of research has gone into isolating compounds from organisms found in this country. However, there is no official database containing this information, making it difficult to access for research purposes. This information was extracted manually from literature to create a database of South African natural products. In order to make the information accessible to the general research community, a website, named “SANCDB”, was built to enable compounds to be quickly and easily searched for and downloaded in a number of different chemical formats. The content of the database was assessed and compared to other established natural product databases. Currently, SANCDB is the only database of natural products in Africa with an online interface. The second section of the thesis was aimed at performing structural characterisation of proteins with the potential to be targeted for antimalarial drug therapy. It looked specifically at 1) the interactions between an exported heat shock protein (Hsp) from Plasmodium falciparum (P. falciparum), PfHsp70-x, and various host and exported parasite J proteins, and 2) the interface between PfHsp90 and the heat shock organising protein (PfHop). The PfHsp70-x:J protein study provided additional insight into how these two proteins potentially interact. Analysis of the PfHsp90:PfHop interaction also provided a structural insight into the interface between these two proteins and identified residues that could be targeted due to their contribution to the stability of the Hsp90:Hop binding complex and differences between parasite and human proteins. These studies inspired the development of a homology modelling tool, which can be used to assist researchers with homology modelling, while providing them with step-by-step control over the entire process. This thesis presents the establishment of a South African NP database and the development of a homology modelling tool, inspired by protein structural studies. When combined, these two applications have the potential to contribute greatly towards in silico drug discovery research.
109

Development of a Python Pipeline for the Analysis of Campylobacter

Zetterberg, Elvira, Andersson, Evelina, Nilsson, Alma, Qvarnlöf, Moa, Olivero, Corinne, Sulyaeva, Julia January 2022 (has links)
Statens veterinärmedicinska anstalt, SVA, is a government agency that works for better animal and human health with a primary focus on infectious animal diseases. One of their projects involves tracking the spread of Campylobacter infection in broilers and the occurrence of antimicrobial resistance in these bacteria. A pipeline was developed to contribute to making the analysis of Whole Genome Sequencing (WGS) data from Campylobacter more effective. This was done by changing the currently used pipeline’s programming language from Perl to Python and adding the possibility to run multiple analyses in parallel. With parallelization, the time for running multiple analyses was reduced compared to running them sequentially, even if it was not as fast in practice as in theory. It also did not work as well when running parallel analyses of different strains compared to identical strains. Furthermore, different attributes of the pipeline were changed or added to improve the pipeline and a database comparison was performed in order to suggest the best ones for future use. VFDB, CARD, and MEGARes were suggested as appropriate databases to use in future WGS analysis of Campylobacter. Due to a lack of resources and technical difficulties, some of the requested attributes for the pipeline could not be implemented, such as the tool Pilon and the inclusion of MLST and cgMLST analysis. Nonetheless, the pipeline is well structured, has most of the requested tools, and is easy to run. With some minor improvements, the pipeline will be a useful tool for SVA and their project.
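A hedged sketch of the parallelisation pattern described above follows; the sample names and the command are placeholders, not SVA's actual pipeline, and a thread pool is used because the heavy work happens in external processes anyway.

```python
# Run several per-sample analyses concurrently instead of sequentially.
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyse(sample):
    # Placeholder command; a real pipeline would invoke its assembly,
    # typing and resistance-detection tools for this sample here.
    result = subprocess.run(["echo", f"analysing {sample}"],
                            capture_output=True, text=True)
    return result.stdout.strip()

samples = ["sample_01", "sample_02", "sample_03", "sample_04"]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(analyse, s): s for s in samples}
    for fut in as_completed(futures):
        print(futures[fut], "->", fut.result())
```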
110

[pt] CLUSTERIZAÇÃO DE POÇOS DE PETRÓLEO UTILIZANDO ALINHAMENTO DE SEQUÊNCIAS BASEADAS EM LITOLOGIA / [en] OIL WELL CLUSTERING USING LITHOLOGY-BASED SEQUENCE ALIGNMENT

WALDIR JOSE PEREIRA JUNIOR 25 November 2021 (has links)
[en] The construction of an oil well requires extensive, advance planning. Among the various objectives of this planning is determining which materials and equipment must be purchased to carry out the stages of the well's construction. Such acquisitions often involve long contracts and, later, long lead times that can reach years. Because this planning is carried out under great uncertainty, several techniques, using different types of data, have been proposed to correlate wells and so anticipate the material and equipment requirements of a new well. One such type of data is the lithological profile, which records the rock segments present along the length of the well, collected through sensors and other means during drilling. Artificial lithological profiles can also be generated for wells not yet drilled, using seismic data. This work proposes a new methodology for clustering oil wells. The distance measure is based on the degree of similarity between wells, obtained by applying a sequence alignment algorithm to sequences generated exclusively from the lithological profiles of those wells. In this way, it is possible to find wells related to a given well. To validate the methodology, clustering experiments were carried out involving data from 120 wells from the southeastern Brazilian coast.
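A minimal sketch of the distance computation described above, using only Python's standard library: each well's lithology log is treated as a string of facies codes, an alignment-style similarity ratio is computed for every pair of wells, and 1 - similarity becomes the clustering distance. The facies codes and the use of difflib are illustrative assumptions; the thesis's own alignment scoring may differ.

```python
# Pairwise well distances from lithology strings; the resulting matrix can
# be fed to any hierarchical or k-medoids clustering routine.
from difflib import SequenceMatcher

wells = {
    "W1": "SSShhLLSS",   # hypothetical facies codes: S = sandstone, h = shale, L = limestone
    "W2": "SShhhLLSS",
    "W3": "LLLLhhSSS",
}

def similarity(a, b):
    # Ratio of matching blocks found by difflib's longest-matching-block alignment.
    return SequenceMatcher(None, a, b).ratio()

names = sorted(wells)
distance = {(x, y): 1.0 - similarity(wells[x], wells[y]) for x in names for y in names}

for x in names:
    print(x, [round(distance[(x, y)], 2) for y in names])
```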
