Global ETD Search

1	Accurate genome relative abundance estimation for closely related species in a metagenomic sample
Sohn, Michael; An, Lingling; Pookhao, Naruekamol; Li, Qike (January 2014)
BACKGROUND: Metagenomics has great potential to discover previously unattainable information about microbial communities. An important prerequisite for such discoveries is to accurately estimate the composition of microbial communities. Most prevalent homology-based approaches rely solely on the results of an alignment tool such as BLAST, limiting their estimation accuracy to high ranks of the taxonomy tree.
RESULTS: We developed a new homology-based approach called Taxonomic Analysis by Elimination and Correction (TAEC), which utilizes the similarity in the genomic sequence in addition to the result of an alignment tool. The proposed method is comprehensively tested on various simulated benchmark datasets spanning diverse complexities of microbial structure. Compared with other available methods designed for estimating taxonomic composition at a relatively low taxonomic rank, TAEC demonstrates greater accuracy in quantification of genomes in a given microbial sample. We also applied TAEC to two real metagenomic datasets, an oral cavity dataset and a Crohn's disease dataset. Our results, while agreeing with previous findings at higher ranks of the taxonomy tree, provide accurate estimation of taxonomic compositions at the species/strain level, narrowing down which species/strains need more attention in the study of the oral cavity and Crohn's disease.
CONCLUSIONS: By taking into account the similarity in the genomic sequence, TAEC outperforms other available tools in estimating taxonomic composition at a very low rank, especially when closely related species/strains exist in a metagenomic sample.
Keywords: Metagenomics; Alignment similarity; Genomic similarity; Closely related species
2	Comparative Analysis of Genomic Similarity Tools in Species Identification
Nerella, Chandra Sekhar (14 January 2025)
This study presents the development and evaluation of an automated pipeline for genome comparison, leveraging four bioinformatics tools: alignment-based methods (pyANI, FastANI) and k-mer-based methods (Sourmash, BinDash 2.0). The analysis focuses on high-quality genomic datasets characterized by 100% completeness, ensuring consistency and accuracy in the comparison process. The pipeline processes genomes under uniform conditions, recording key performance metrics such as execution time and rank correlations.
Initial comparisons were conducted on a subset of five genomes, generating 10 unique pairwise comparisons to establish baseline performance. This preliminary analysis identified k = 10 as the optimal k-mer size for Sourmash and BinDash, significantly improving their comparability with alignment-based methods.
For the expanded dataset of 175 genomes, encompassing C(175, 2) = 15,225 unique comparisons, pyANI and FastANI demonstrated high similarity values, often exceeding 90% for closely related genomes. Rank correlations, calculated using Spearman's ρ and Kendall's τ, highlighted strong agreement between pyANI and FastANI (ρ = 0.9630, τ = 0.8625) due to their shared alignment-based methodology. Similarly, Sourmash and BinDash, both employing k-mer-based approaches, exhibited moderate-to-strong rank correlations (ρ = 0.6967, τ = 0.5290). In contrast, the rank correlations between alignment-based and k-mer-based tools were lower, underscoring methodological differences in genome similarity calculations.
Execution times revealed significant contrasts between the tools. Alignment-based methods required substantial computation time, with pyANI taking an average of 1.97 seconds per comparison and FastANI averaging 0.81 seconds per comparison. Conversely, k-mer-based methods demonstrated exceptional computational efficiency, with Sourmash completing comparisons in 2.1 milliseconds and BinDash in just 0.25 milliseconds per comparison, reflecting a difference of roughly three orders of magnitude between the two categories. These results underscore the trade-offs between computational cost and methodological approaches in genome similarity estimation.
This study provides valuable insights into the relative strengths and weaknesses of genome comparison tools, offering a comprehensive framework for selecting appropriate methods for diverse genomic research applications. The findings emphasize the importance of parameter optimization for k-mer-based tools and highlight the scalability of these methods for large-scale genomic analyses.
/ Master of Science /
This study explores the strengths and weaknesses of different tools used to compare genomes, the complete sets of DNA in living organisms. Comparing genomes allows scientists to understand how different species are related, uncover shared traits, and identify what makes each species unique. The tools we examined fall into two main categories: detailed tools (called alignment-based methods) and faster, more approximate tools (called k-mer-based methods). The detailed tools, such as pyANI and FastANI, compare DNA sequences piece by piece, providing very accurate results. In contrast, the faster tools, such as Sourmash and BinDash, look for patterns in smaller sections of DNA, which makes them much quicker but sometimes less precise.
To start, we tested these tools on a small group of genomes to see how they performed. By adjusting a setting in the faster tools, we found that their results became more similar to those of the detailed tools, improving their reliability. Encouraged by these findings, we expanded the comparison to a much larger dataset of 175 genomes. For this larger dataset, the detailed tools provided highly accurate results but required much more time and computational power. On the other hand, the faster tools completed the comparisons in a fraction of the time, making them ideal for larger datasets where quick results are needed.
We also compared how the tools ranked genome similarities and found that tools using similar methods, like pyANI and FastANI, had very consistent rankings. Likewise, the faster tools, Sourmash and BinDash, also agreed with each other. However, the rankings between the two types of tools (detailed versus faster) were less consistent, reflecting their different approaches to genome comparison.
This research provides a practical guide for scientists choosing tools to compare genomes. If accuracy and detail are most important, alignment-based tools are the best choice, though they take more time and computational resources. If speed is critical, such as when working with very large datasets, k-mer-based tools offer an excellent alternative. By understanding the strengths and trade-offs of each method, researchers can make informed decisions to suit their specific needs, whether focusing on small, detailed studies or large-scale genome analyses.
Keywords: Genome Comparison; Average Nucleotide Identity (ANI); Alignment-Based Tools; Alignment-Free Tools; pyANI; FastANI; Sourmash; BinDash; Jaccard Index; Genomic Similarity; k-mer Optimization; Bioinformatics; Computational Biology
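
The quantities named in the second abstract (unique pairwise comparisons, a k-mer Jaccard index, and Spearman/Kendall rank correlations) can be illustrated with a minimal Python sketch. This is not the thesis pipeline: the function names and toy sequences are hypothetical, the Jaccard index is computed on exact k-mer sets (tools such as Sourmash and BinDash estimate it from compact sketches instead), and the rank-correlation helpers assume there are no tied values.

```python
from itertools import combinations
from math import comb


def kmer_set(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ranks(values):
    """Assign ranks 1..n in ascending order (assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs (no ties)."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        score += (prod > 0) - (prod < 0)
    return score / comb(len(x), 2)


# Number of unique unordered pairs among n genomes is C(n, 2).
print(comb(5, 2), comb(175, 2))  # 10 15225

# Toy sequences (made up for illustration only, not real genomes).
genomes = {
    "g1": "ACGTACGTGACCA",
    "g2": "ACGTACGTGACCT",
    "g3": "TTGACCAGGTACA",
}
k = 4  # hypothetical k-mer size chosen for the toy data
for a, b in combinations(genomes, 2):
    score = jaccard(kmer_set(genomes[a], k), kmer_set(genomes[b], k))
    print(a, b, round(score, 3))
```

Given two lists of similarity scores for the same genome pairs, one per tool, `spearman_rho` and `kendall_tau` measure how closely the tools agree on the ordering of pairs, which is the kind of comparison the abstract reports.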
