1 |
Filtered spaced-word matches: a novel approach to fast and accurate sequence comparison
Leimeister, Chris-Andre, 12 December 2018
No description available.
|
2 |
Probabilistic Modeling for Whole Metagenome Profiling
Burks, David, 05 1900
To address the shortcomings of existing Markov model implementations in handling large amounts of metagenomic data with comparable or better classification accuracy, we developed a new algorithm based on a pseudo-count supplemented standard Markov model (SMM), which leverages the power of higher-order models to classify reads more robustly at different taxonomic levels. Assessment on simulated metagenomic datasets demonstrated that, overall, SMM was more accurate than the interpolated methods in classifying reads to their respective taxa at all ranks. Higher-order SMMs (9th order or greater) also outperformed BLAST alignments in assigning taxonomic labels to metagenomic reads at different taxonomic ranks (genus and higher) on tests that masked the genome models of the species from which the reads originated. Similar results were obtained by masking at other taxonomic ranks, in order to simulate plausible scenarios in which the source of a read is not represented at a given taxonomic level in the genome database. The performance gap became more pronounced at higher taxonomic levels. To eliminate contamination in datasets and to further improve our alignment-free approach, we developed a new framework based on a genome segmentation and clustering algorithm. This framework allowed removal of adapter sequences and contaminant DNA, as well as generation of clusters of similar segments, which were then used to sample representative read fragments to constitute training datasets. The parameters of a logistic regression model were learnt from these training datasets using a Bayesian optimization procedure, which allowed us to establish thresholds for classifying metagenomic reads by SMM. This led to the development of a Python-based frontend that combines our SMM algorithm with the logistic regression optimization, named POSMM (Python Optimized Standard Markov Model). POSMM provides a much-needed alternative to metagenome profiling programs. Because the algorithm builds the genome models on the fly, it obviates the need to build a database; it complements alignment-based classification and can be used in concert with alignment-based classifiers to raise the bar in metagenome profiling.
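A minimal sketch of the core idea, pseudo-count supplemented higher-order Markov scoring of reads, follows. The order, pseudo-count scheme, and the helper names train_smm and score_read are illustrative assumptions, not POSMM's actual implementation, which additionally applies logistic-regression thresholds.

from collections import defaultdict
import math

def train_smm(genome_seq, k=8, pseudo=1.0):
    # Count (k-mer context -> next base) transitions along the genome.
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(len(genome_seq) - k):
        ctx, nxt = genome_seq[i:i + k], genome_seq[i + k]
        counts[ctx][nxt] += 1.0
    # Turn counts into log-probabilities, adding a pseudo-count so that
    # transitions never seen in the genome still get non-zero probability.
    model = {}
    for ctx, nxt_counts in counts.items():
        total = sum(nxt_counts.values()) + 4 * pseudo
        model[ctx] = {b: math.log((nxt_counts.get(b, 0.0) + pseudo) / total) for b in "ACGT"}
    fallback = math.log(0.25)  # unseen context: uniform over A, C, G, T
    return model, fallback

def score_read(read, model, fallback, k=8):
    # Log-likelihood of the read under the k-th order model.
    return sum(model.get(read[i:i + k], {}).get(read[i + k], fallback)
               for i in range(len(read) - k))

# A read is assigned to whichever genome model gives it the highest log-likelihood:
# best = max(models, key=lambda name: score_read(read, *models[name]))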
|
3 |
Attraction Based Models of Collective Motion
Strömbom, Daniel, January 2013
Animal groups often exhibit highly coordinated collective motion in a variety of situations: bird flocks, schools of fish, a flock of sheep being herded by a dog, and highly efficient traffic on an ant trail. Although these phenomena can be observed every day all over the world, our knowledge of what rules the individuals in such groups use is very limited. Questions of this type have been studied using so-called self-propelled particle (SPP) models, most of which assume that collective motion arises from individuals aligning with their neighbors. Here we introduce and analyze an SPP model based on attraction alone. We find that it produces all the typical groups seen in alignment-based models and some novel ones, in particular a group that exhibits collective motion coupled with non-trivial internal dynamics. Groups with this property are rarely seen in SPP models, and we show that such groups persist even when a repulsion term is added to the attraction-only model. These findings suggest that an interplay between attraction and repulsion may be the main driving force in real flocks and that the alignment rule may be superfluous. We then proceed to model two different experiments using the SPP-model approach. The first is a shepherding algorithm constructed primarily to model experiments in which a sheepdog herds a flock of sheep. We find that, in addition to modelling the specific experimental situation well, the algorithm has properties that may make it useful in more general shepherding situations. The second is a traffic model for leaf-cutting ant bridges. Based on earlier experiments, a set of traffic rules for ants on a very narrow bridge had been suggested. We show that these rules are sufficient to produce the observed traffic dynamics on the narrow bridge, and that when extended to a wider bridge by replacing 'Stop' with 'Turn', the new rules are sufficient to produce several key characteristics of the dynamics on the wide bridge, in particular three-lane formation.
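As a hedged illustration of an attraction-only SPP update, the sketch below steers each particle toward the centre of mass of the others and blends that direction with its previous heading. The parameters and the global (rather than local) attraction rule are simplifying assumptions and do not reproduce the thesis's exact model.

import numpy as np

def step(pos, heading, blend=0.5, speed=0.05):
    # One update of an attraction-only SPP model (illustrative sketch).
    new_heading = np.empty_like(heading)
    for i in range(len(pos)):
        others = np.delete(pos, i, axis=0)
        to_com = others.mean(axis=0) - pos[i]            # attraction toward the centre of mass
        to_com /= np.linalg.norm(to_com) + 1e-12
        h = (1 - blend) * heading[i] + blend * to_com    # blend old heading with attraction
        new_heading[i] = h / (np.linalg.norm(h) + 1e-12)
    return pos + speed * new_heading, new_heading

# Example: 50 particles with random initial positions and headings.
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 5.0, size=(50, 2))
heading = rng.normal(size=(50, 2))
heading /= np.linalg.norm(heading, axis=1, keepdims=True)
for _ in range(1000):
    pos, heading = step(pos, heading)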
|
4 |
Metody rychlého srovnání a identifikace sekvencí v metagenomických datech / Methods for fast sequence comparison and identification in metagenomic data
Kupková, Kristýna, January 2016
The aim of this thesis is to develop a method for identifying organisms in metagenomic data. Until now, methods based on aligning sequences against a reference database have been sufficient for this purpose, but with the development of sequencing technologies the amount of data is growing rapidly, and these methods are becoming unsuitable because of their computational cost. This thesis describes a new technique that allows metagenomic data to be classified without alignment. The method converts sequenced reads into genomic signals in the form of phase representations, from which feature vectors are then extracted. These features are the three Hjorth descriptors. They are then passed to maximum-likelihood estimation of a Gaussian mixture model, which allows the fragments to be reliably grouped according to the organism they belong to.
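The Hjorth descriptors themselves are standard signal-processing quantities; the sketch below computes them from a numeric signal derived from a read. The cumulative phase mapping used here is an illustrative assumption, not necessarily the exact phase representation used in the thesis.

import numpy as np

def hjorth_descriptors(signal):
    # Standard Hjorth parameters: activity, mobility, complexity.
    d1 = np.diff(signal)
    d2 = np.diff(d1)
    var0, var1, var2 = np.var(signal), np.var(d1), np.var(d2)
    activity = var0
    mobility = np.sqrt(var1 / var0)
    complexity = np.sqrt(var2 / var1) / mobility
    return activity, mobility, complexity

# Hypothetical mapping of bases to phase angles, accumulated along the read.
phase = {"A": 0.0, "C": np.pi / 2, "G": np.pi, "T": 3 * np.pi / 2}
read = "ACGTGCATTACGGTAC"
signal = np.cumsum([phase[b] for b in read])
features = hjorth_descriptors(signal)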
|
5 |
Analysis of the subsequence composition of biosequences
Cunial, Fabio, 07 May 2012
Measuring the amount of information and of shared information in biological strings, as well as relating information to structure, function and evolution, are fundamental computational problems in the post-genomic era. Classical analyses of the information content of biosequences are grounded in Shannon's statistical telecommunication theory, while the recent focus is on suitable specializations of the notions introduced by Kolmogorov, Chaitin and Solomonoff, based on data compression and compositional redundancy. Symmetrically, classical estimates of mutual information based on string editing are currently being supplanted by compositional methods hinged on the distribution of controlled substructures.
Current compositional analyses and comparisons of biological strings are almost exclusively limited to short sequences of contiguous solid characters. Comparatively little is known about longer and sparser components, both from the point of view of their effectiveness in measuring information and in separating biological strings from random strings, and from the point of view of their ability to classify and to reconstruct phylogenies. Yet, sparse structures are suspected to grasp long-range correlations and, at short range, they are known to encode signatures and motifs that characterize molecular families.
In this thesis, we introduce and study compositional measures based on the repertoire of distinct subsequences of any length, but constrained to occur with a predefined maximum gap between consecutive symbols. Such measures highlight previously unknown laws that relate subsequence abundance to string length and to the allowed gap, across a range of structurally and functionally diverse polypeptides. Measures on subsequences are capable of separating only a few amino acid strings from their random permutations, but they reveal that the random permutations themselves amass along previously undetected, linear loci. This is perhaps the first time that the vocabulary of all distinct subsequences of a set of structurally and functionally diverse polypeptides has been systematically counted and analyzed.
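To make the object of study concrete, the sketch below enumerates the distinct subsequences of a short string in which consecutive chosen positions skip at most g characters. It is a naive, exponential enumeration intended only to illustrate the definition; it is not the counting machinery used in the thesis.

def distinct_gapped_subsequences(s, g):
    # All distinct subsequences of s whose consecutive characters come from
    # positions at most g+1 apart (i.e., at most g characters skipped).
    seen = set()

    def extend(prefix, last_pos):
        for nxt in range(last_pos + 1, min(last_pos + g + 2, len(s))):
            word = prefix + s[nxt]
            seen.add(word)
            extend(word, nxt)

    for start in range(len(s)):
        seen.add(s[start])
        extend(s[start], start)
    return len(seen)

# Example on a short peptide fragment, allowing at most one skipped residue.
print(distinct_gapped_subsequences("MKVLAT", 1))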
Another objective of this thesis is measuring the quality of phylogenies based on the composition of sparse structures. Specifically, we use a set of repetitive gapped patterns, called motifs, whose length and sparsity have never been considered before. We find that extremely sparse motifs in mitochondrial proteomes support phylogenies of comparable quality to state-of-the-art string-based algorithms. Moving from maximal motifs -- motifs that cannot be made more specific without losing support -- to a set of generators with decreasing size and redundancy, generally degrades classification, suggesting that redundancy itself is a key factor for the efficient reconstruction of phylogenies. This is perhaps the first time in which the composition of all motifs of a proteome is systematically used in phylogeny reconstruction on a large scale.
Extracting all maximal motifs, or even their compact generators, is infeasible for entire genomes. In the last part of this thesis, we study the robustness of measures of similarity built around the dictionary of LZW -- the variant of the LZ78 compression algorithm proposed by Welch -- and of some of its recently introduced gapped variants. These algorithms use a very small vocabulary, they run in time linear in the input strings, and they can be made even faster than LZ77 in practice. We find that dissimilarity measures based on maximal strings in the dictionary of LZW support phylogenies that are comparable to state-of-the-art methods on test proteomes. Introducing a controlled proportion of gaps does not degrade classification, and makes it possible to discard up to 20% of each input proteome during comparison.
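A minimal sketch of the building block, the LZW parsing dictionary, together with a simple set-based dissimilarity over the resulting phrases, is shown below. The Jaccard-style measure and the function names are assumptions made for illustration; they are not the maximal-string measure used in the thesis.

def lzw_phrases(s):
    # Phrases added to the LZW dictionary while parsing s; the single
    # characters of s are assumed to be pre-loaded, as in standard LZW.
    dictionary = set(s)
    phrases = set()
    w = ""
    for c in s:
        if w + c in dictionary:
            w += c
        else:
            dictionary.add(w + c)
            phrases.add(w + c)
            w = c
    return phrases

def lzw_dissimilarity(a, b):
    # Illustrative dissimilarity: one minus the Jaccard index of the phrase sets.
    pa, pb = lzw_phrases(a), lzw_phrases(b)
    return 1.0 - len(pa & pb) / len(pa | pb)

print(lzw_dissimilarity("MKVLATMKVLAT", "MKVIATMKVIAT"))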
|
6 |
Comparative Analysis of Genomic Similarity Tools in Species Identification
Nerella, Chandra Sekhar, 14 January 2025
This study presents the development and evaluation of an automated pipeline for genome comparison, leveraging four bioinformatics tools: alignment-based methods (pyANI, FastANI) and k-mer-based methods (Sourmash, BinDash 2.0). The analysis focuses on high-quality genomic datasets characterized by 100% completeness, ensuring consistency and accuracy in the comparison process. The pipeline processes genomes under uniform conditions, recording key performance metrics such as execution time and rank correlations.
Initial comparisons were conducted on a subset of five genomes, generating 10 unique pairwise comparisons to establish baseline performance. This preliminary analysis identified k = 10 as the optimal k-mer size for Sourmash and BinDash, significantly improving their comparability with alignment-based methods.
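For context, the quantity that k-mer-based tools such as Sourmash and BinDash estimate is essentially the Jaccard similarity of k-mer sets; the exhaustive sketch below computes that quantity directly (the real tools work from small MinHash sketches rather than the full sets).

def kmer_set(seq, k=10):
    # All k-mers of a sequence; k = 10 was the size found to work best here.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_jaccard(seq_a, seq_b, k=10):
    # Jaccard similarity of the two k-mer sets (exhaustive, for illustration).
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    return len(a & b) / len(a | b)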
For the expanded dataset of 175 genomes, encompassing (175 choose 2) = 15,225 unique comparisons, pyANI and FastANI demonstrated high similarity values, often exceeding 90% for closely related genomes. Rank correlations, calculated using Spearman's ρ and Kendall's τ, highlighted strong agreement between pyANI and FastANI (ρ = 0.9630, τ = 0.8625) due to their shared alignment-based methodology. Similarly, Sourmash and BinDash, both employing k-mer-based approaches, exhibited moderate-to-strong rank correlations (ρ = 0.6967, τ = 0.5290). In contrast, the rank correlations between alignment-based and k-mer-based tools were lower, underscoring methodological differences in genome similarity calculations.
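Rank correlations of this kind can be computed directly with SciPy; the sketch below does so for a handful of made-up per-pair similarity values (the numbers are illustrative only, not data from the study).

from scipy.stats import spearmanr, kendalltau

# Hypothetical similarity values reported by two tools for the same genome pairs.
tool_a = [97.2, 91.5, 84.3, 78.9, 99.1]
tool_b = [96.8, 92.0, 83.1, 79.5, 98.7]

rho, _ = spearmanr(tool_a, tool_b)
tau, _ = kendalltau(tool_a, tool_b)
print(f"Spearman rho = {rho:.4f}, Kendall tau = {tau:.4f}")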
Execution times revealed significant contrasts between the tools. Alignment-based methods required substantial computation time, with pyANI taking an average of 1.97 seconds per comparison and FastANI averaging 0.81 seconds per comparison. Conversely, k-mer-based methods demonstrated exceptional computational efficiency, with Sourmash completing comparisons in 2.1 milliseconds and BinDash in just 0.25 milliseconds per comparison, reflecting a difference of nearly three orders of magnitude between the two categories. These results underscore the trade-offs between computational cost and methodological approaches in genome similarity estimation.
This study provides valuable insights into the relative strengths and weaknesses of genome comparison tools, offering a comprehensive framework for selecting appropriate methods for diverse genomic research applications. The findings emphasize the importance of parameter optimization for k-mer-based tools and highlight the scalability of these methods for large-scale genomic analyses. / Master of Science / This study explores the strengths and weaknesses of different tools used to compare genomes, the complete sets of DNA in living organisms. Comparing genomes allows scientists to understand how different species are related, uncover shared traits, and identify what makes each species unique. The tools we examined fall into two main categories: detailed tools (called alignment-based methods) and faster, more approximate tools (called k-mer-based methods). The detailed tools, such as pyANI and FastANI, compare DNA sequences piece by piece, providing very accurate results. In contrast, the faster tools, such as Sourmash and BinDash, look for patterns in smaller sections of DNA, which makes them much quicker but sometimes less precise.
To start, we tested these tools on a small group of genomes to see how they performed. By adjusting a setting in the faster tools, we found that their results became more similar to the detailed tools, improving their reliability. Encouraged by these findings, we expanded the comparison to a much larger dataset of 175 genomes. For this larger dataset, the detailed tools provided highly accurate results but required much more time and computational power. On the other hand, the faster tools completed the comparisons in a fraction of the time, making them ideal for larger datasets where quick results are needed.
We also compared how the tools ranked genome similarities and found that tools using similar methods, like pyANI and FastANI, had very consistent rankings. Likewise, the faster tools, Sourmash and BinDash, also agreed with each other. However, the rankings between the two types of tools (detailed versus faster) were less consistent, reflecting their different approaches to genome comparison.
This research provides a practical guide for scientists choosing tools to compare genomes. If accuracy and detail are most important, alignment-based tools are the best choice, though they take more time and computational resources. If speed is critical, such as when working with very large datasets, k-mer-based tools offer an excellent alternative. By understanding the strengths and trade-offs of each method, researchers can make informed decisions to suit their specific needs, whether focusing on small, detailed studies or large-scale genome analyses.
|
7 |
Numerické metody pro klasifikaci metagenomických dat / Numerical methods for classification of metagenomic data
Vaněčková, Tereza, January 2016
This thesis deals with metagenomics and numerical methods for the classification of metagenomic data. A review of alignment-free methods based on nucleotide word frequency is provided, as these appear effective for processing the metagenomic sequence reads produced by next-generation sequencing technologies. To evaluate these methods, selected features based on k-mer analysis were tested on a simulated dataset of metagenomic sequence reads. The data in the original feature space were then subjected to hierarchical clustering, and PCA-processed data were clustered with the K-means algorithm. The analysis was performed for different nucleotide word lengths and evaluated in terms of classification accuracy.
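A compact sketch of this kind of pipeline (k-mer frequency vectors, PCA, then K-means) is shown below; the toy reads, the word length k = 3, and the two-cluster setting are placeholders, not the thesis's actual data or parameters.

import numpy as np
from itertools import product
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def kmer_frequencies(read, k=3):
    # Relative frequencies of all 4**k nucleotide words in a read.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    v = np.zeros(len(kmers))
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:
            v[j] += 1
    return v / max(v.sum(), 1.0)

reads = ["ACGTACGTGGCAACGT", "TTGACCATGCAATTGA", "GGGCATCGTACAGGGC", "ACGTTGCAACGTACGT"]
X = np.array([kmer_frequencies(r) for r in reads])
X_pca = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_pca)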
|
8 |
Modelling and comparing protein interaction networks using subgraph counts
Chegancas Rito, Tiago Miguel, January 2012
The astonishing progress of molecular biology, engineering and computer science has resulted in mature technologies capable of examining multiple cellular components at a genome-wide scale. Protein-protein interactions are one example of such growing data. These data are often organised as networks, with proteins as nodes and interactions as edges. Albeit still incomplete, there is now a substantial amount of data available, and there is a need for biologically meaningful methods to analyse and interpret these interactions. In this thesis we focus on how to compare protein interaction networks (PINs) and on the relationship between network architecture and the biological characteristics of proteins. The underlying theme throughout the dissertation is the use of small subgraphs – small interaction patterns between 2-5 proteins. We start by examining two popular scores that are used to compare PINs and network models. When comparing networks of the same model type we find that the typical scores are highly unstable and depend on the number of nodes and edges in the networks. This is unsatisfactory, and we propose a method based on non-parametric statistics to make more meaningful comparisons. We also employ principal component analysis to judge model fit according to subgraph counts. From these analyses we show that no current model fits the PINs; this may well reflect our lack of knowledge on the evolution of protein interactions. Thus, we use explanatory variables such as protein age and protein structural class to find patterns in the interactions and subgraphs we observe. We discover that the yeast PIN is highly heterogeneous and therefore no single model is likely to fit the network. Instead, we focus on ego-networks containing an initial protein plus its interacting partners and their interaction partners. In the final chapter we propose a new, alignment-free method for network comparison based on such ego-networks. The method compares subgraph counts in neighbourhoods within PINs in an averaging, many-to-many fashion. It clusters networks of the same model type and is able to successfully reconstruct species phylogenies solely based on PIN data, providing exciting new directions for future research.
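A hedged sketch of the ego-network comparison idea follows, using NetworkX. It counts only three easily computed patterns (edges, two-paths, and triangles) per ego-network and averages pairwise distances between the normalised count vectors of two networks; the thesis itself uses the full repertoire of connected subgraphs on 2-5 proteins and a more carefully defined comparison, so the function names and distance here are illustrative assumptions.

import networkx as nx
import numpy as np

def ego_subgraph_counts(G, node, radius=2):
    # Counts of a few small patterns in the node's 2-step ego-network.
    ego = nx.ego_graph(G, node, radius=radius)
    n_edges = ego.number_of_edges()
    n_two_paths = sum(d * (d - 1) // 2 for _, d in ego.degree())
    n_triangles = sum(nx.triangles(ego).values()) // 3
    return np.array([n_edges, n_two_paths, n_triangles], dtype=float)

def compare_networks(G1, G2):
    # Average many-to-many distance between normalised ego-network count vectors.
    norm = lambda v: v / (np.linalg.norm(v) + 1e-12)
    V1 = [norm(ego_subgraph_counts(G1, v)) for v in G1]
    V2 = [norm(ego_subgraph_counts(G2, v)) for v in G2]
    return float(np.mean([[np.linalg.norm(a - b) for b in V2] for a in V1]))

# Example on two small random graphs standing in for PINs.
G1 = nx.erdos_renyi_graph(30, 0.15, seed=1)
G2 = nx.erdos_renyi_graph(30, 0.25, seed=2)
print(compare_networks(G1, G2))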
|