611 |
Deep learning prediction of Quantmap clusters
Parakkal Sreenivasan, Akshai January 2021
The hypothesis that similar chemicals exert similar biological activities has been widely adopted in the field of drug discovery and development. Quantitative Structure-Activity Relationship (QSAR) models have been used ubiquitously in drug discovery to understand the function of chemicals in biological systems. A common QSAR modeling method calculates similarity scores between chemicals to assess their biological function. However, because some chemicals can be structurally similar yet have different biological activities, or conversely can be structurally different yet have similar biological functions, various methods have instead been developed to quantify chemical similarity at the functional level. Quantmap is one such method, which utilizes biological databases to quantify the biological similarity between chemicals. Quantmap uses quantitative molecular network topology analysis to cluster chemical substances based on their bioactivities. This method by itself, unfortunately, cannot assign new chemicals (those which may not yet have biological data) to the derived clusters. Because biological data are lacking for many chemicals, deep learning models were explored in this project for their ability to correctly assign unknown chemicals to Quantmap clusters. The deep learning methods explored included both convolutional and recurrent neural networks. Transfer learning/pretraining-based approaches and data augmentation methods were also investigated. The best-performing model among those considered was the Seq2seq model (a recurrent neural network containing two joint networks, a perceiver and an interpreter network) without pretraining but with data augmentation.
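The structural similarity scoring that the abstract contrasts with Quantmap can be illustrated with a Tanimoto coefficient on binary fingerprints. This is only a minimal sketch: the fingerprints below are made-up sets of "on" bit positions, and the code is not the Quantmap method or anything taken from the thesis.

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not fp_a and not fp_b:
        return 1.0  # convention chosen for this sketch: two empty fingerprints match
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


# Hypothetical fingerprints (bit indices are invented for illustration).
chem_a = {1, 4, 7, 9}
chem_b = {1, 4, 8, 9}
chem_c = {2, 3, 5}

print(tanimoto(chem_a, chem_b))  # 3 shared bits out of 5 distinct bits
print(tanimoto(chem_a, chem_c))  # no shared bits
```

A high score here only says that two structures share substructure bits; as the abstract notes, it does not guarantee similar biological activity.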
|
612 |
Expression of Selected Cadherins in Adult Zebrafish Visual System and Regenerating Retina, and Microarray Analysis of Gene Expression in Protocadherin-17 Morphants
Marlowe, Alicja 28 July 2022
No description available.
|
613 |
Precision of Positional Information Along the Developing Cochlea Radial Axis: Linear BMP Activity Helps Set the Stage
Matthew J Thompson (10751937) 10 October 2022
Developing embryos rely on morphogenetic signals to inform cells about where they are in space, and cells respond to their positions through the appropriate expression of fate-determining genes. Computational and theoretical analyses are powerful tools that have been shown to enhance and inform experimental work in developmental biology. In the study of positional information, mechanistic ordinary and partial differential equations can test and suggest hypotheses for morphogen network evolution. Information-theoretic interpretations of these profiles have also proven valuable for making predictions.
These approaches are reviewed and used together here to investigate the morphogenetic signals instructing pattern formation during the earliest phase of development in the cochlea. When the transcription factors SOX2 and pSMAD1/5/9 (two crucial carriers of positional information) are quantified here for the first time, new observations, questions, and hypotheses emerge that were previously out of reach. Perhaps most intriguing is the identification of a linear pSMAD1/5/9 profile over a supermajority of the radial axis.
This linear profile is shown to ‘set the stage’ by creating a 1:1 map between position and signal concentration. Feasible mechanisms for maintaining this profile are simulated; the simulations propose the existence of a yet-unidentified BMP sink on the medial edge and suggest a role for Follistatin interaction with BMP, about which there are currently doubts. This likewise sets the stage for new experimental and simulation work to home in on the network dynamics implemented by the cochlea to turn a diffusive morphogen system into a linear signal. While BMP sets the stage of the radial axis, adding SOX2, with its steep profile that reduces positional error, more precisely assigns cells their places for this opening act. The transition into subsequent phases where cell fates are assigned depends on the precision encoded in this first phase in order to create the cellular pattern required to enable the sense of hearing.
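The link between profile steepness and positional error can be made concrete with the usual first-order error-propagation estimate, sigma_x = sigma_c / |dc/dx|. The sketch below assumes that relation (the abstract does not state the formula the thesis uses) and uses invented noise and slope values in arbitrary units.

```python
def positional_error(sigma_c: float, slope: float) -> float:
    """First-order estimate of positional error: sigma_x = sigma_c / |dc/dx|."""
    if slope == 0:
        raise ValueError("a flat profile carries no positional information")
    return sigma_c / abs(slope)


# Hypothetical values: concentration noise of 0.05 (arbitrary units) and
# position normalized to the range 0..1 along the radial axis.
noise = 0.05
shallow = positional_error(noise, slope=1.0)  # gentle linear gradient
steep = positional_error(noise, slope=4.0)    # steeper gradient, same noise

print(shallow, steep)
```

For a linear profile the slope is constant, so under constant noise this estimate gives the same positional error everywhere along the axis, and a steeper profile lowers it.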
|
614 |
Inference of Gene Regulatory Networks with integration of prior knowledge
Maresi, Emiliano 17 June 2024
Gene regulatory networks (GRNs) are crucial for understanding complex biological processes and disease mechanisms, particularly in cancer. However, GRN inference remains challenging due to the intricate nature of gene interactions and limitations of existing methods. Traditionally, prior knowledge in GRN inference simplifies the problem by reducing the search space, but its full potential is unrealized. This research aims to develop a method that uses prior knowledge to guide the GRN inference process, enhancing accuracy and biological plausibility of the resulting networks. We extended the Fused Sparse Structural Equation Models (FSSEM) framework to create the Fused Lasso Adaptive Prior (FLAP) method. FSSEM incorporates gene expression data and genetic variants in the form of expression quantitative trait loci (eQTLs) perturbations. FLAP enhances FSSEM by integrating prior knowledge of gene-gene interactions into the initial network estimate, guiding the selection of relevant gene interactions in the final inferred network. We evaluated FLAP using synthetic data to assess the impact of incorrect prior knowledge and real lung cancer data, using prior knowledge from various gene network databases (GIANT, TissueNexus, STRING, ENCODE, hTFtarget). Our findings demonstrate that integrating prior knowledge improves the accuracy of inferred networks, with FLAP showing tolerance for incorrect prior knowledge. Using real lung cancer data, functional enrichment analysis and literature validation confirmed the biological plausibility of the networks inferred by FLAP. Different sources of prior knowledge impacted the results, with GIANT providing the most biologically relevant networks, while other sources showed less consistent performance.
FLAP improves GRN inference by effectively integrating prior knowledge, demonstrating robustness against incorrect prior knowledge. The method’s application to lung cancer data indicates that high-quality prior knowledge sources enhance the biological relevance of inferred networks. Future research should focus on improving the quality and integration of prior knowledge, possibly by developing consensus methods that combine multiple sources. This approach has potential applications in cancer research and drug sensitivity studies, offering a more accurate understanding of gene regulatory mechanisms and potential therapeutic targets.
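One generic way prior knowledge can steer a sparse network estimate is to penalize prior-supported edges less during lasso-style shrinkage. The sketch below illustrates only that general idea; it is not the FLAP or FSSEM algorithm, and the edge names, `lam`, and `prior_discount` are hypothetical.

```python
def soft_threshold(value: float, penalty: float) -> float:
    """Lasso proximal step: shrink toward zero, and set small values to zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def shrink_edges(estimates, prior_edges, lam=0.3, prior_discount=0.5):
    """Shrink each edge weight; edges found in the prior get a reduced penalty."""
    shrunk = {}
    for edge, weight in estimates.items():
        penalty = lam * (prior_discount if edge in prior_edges else 1.0)
        shrunk[edge] = soft_threshold(weight, penalty)
    return shrunk


# Hypothetical initial edge estimates (regulator, target) -> weight.
estimates = {("g1", "g2"): 0.25, ("g3", "g4"): 0.25, ("g5", "g6"): -0.9}
prior = {("g1", "g2")}  # only this edge is supported by the (made-up) prior

result = shrink_edges(estimates, prior)
print(result)
```

In this toy example the two weak edges start with the same weight, but only the one backed by the prior survives shrinkage, while the strong edge survives without any prior support.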
|
615 |
Clonal reconstruction from co-occurrence of vector integration sites accurately quantifies expanding clones in vivo
Wagner, Sebastian, Baldow, Christoph, Calabria, Andrea, Rudilosso, Laura, Gallina, Pierangela, Montini, Eugenio, Cesana, Daniela, Glauche, Ingmar 19 April 2024
High transduction rates of viral vectors in gene therapies (GT) and experimental hematopoiesis ensure a high frequency of gene delivery, although multiple integration events can occur in the same cell. Therefore, tracing of integration sites (IS) leads to mis-quantification of the true clonal spectrum and limits safety considerations in GT. Hence, we use correlations between repeated measurements of IS abundances to estimate their mutual similarity and identify clusters of co-occurring IS, for which we assume a clonal origin. We evaluate the performance, robustness and specificity of our methodology using clonal simulations. The reconstruction methods, implemented and provided as an R package, are further applied to experimental clonal mixes and preclinical models of hematopoietic GT. Our results demonstrate that clonal reconstruction from IS data makes it possible to overcome systematic biases in clonal quantification, an essential prerequisite for assessing the safety and long-term efficacy of GT involving integrative vectors.
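The idea of grouping integration sites by correlated abundance across repeated measurements can be sketched as follows. This is an illustrative simplification, not the authors' R package: it assumes Pearson correlation, a fixed hypothetical threshold of 0.9, non-constant abundance profiles, and connected components as clusters; the abundance values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cluster_integration_sites(abundances, threshold=0.9):
    """Link IS pairs whose profiles correlate above the threshold and
    return the connected components as candidate clones."""
    parent = {name: name for name in abundances}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for a, b in combinations(abundances, 2):
        if pearson(abundances[a], abundances[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in abundances:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical IS abundances over four repeated measurements.
abundances = {
    "IS1": [10, 20, 40, 80],
    "IS2": [11, 19, 42, 79],
    "IS3": [50, 30, 20, 5],
    "IS4": [48, 33, 18, 6],
}
clones = cluster_integration_sites(abundances)
print(clones)
```

Each resulting group would then be counted as one clone rather than as several independent integration sites.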
|
616 |
Understanding Isoform Expression and Alternative Splicing Biology through Single-Cell RNAseq
Arzalluz Luque, Ángeles 27 April 2024
[ES] La introducción de la secuenciación de ARN a nivel de célula única (scRNA-seq) en el ámbito de la transcriptómica ha redefinido nuestro entendimiento de la diversidad celular, arrojando luz sobre los mecanismos subyacentes a la heterogeneidad tisular. No obstante, al inicio de esta tesis, las limitaciones de esta tecnología obstaculizaban su aplicación en el estudio de procesos complejos, entre ellos el splicing alternativo. A pesar de ello, los patrones de splicing a nivel celular planteaban incógnitas que esta tecnología tenía el potencial de resolver: ¿es posible observar, a nivel celular, la misma diversidad de isoformas que se detecta mediante RNA-seq a nivel de tejido? ¿Qué función desempeñan las isoformas alternativas en la constitución de la identidad celular?
El objetivo de esta tesis es desbloquear el potencial del scRNA-seq para el análisis de isoformas, abordando sus dificultades técnicas y analíticas mediante el desarrollo de nuevas metodologías computacionales. Para lograrlo, se trazó una hoja de ruta con tres objetivos. Primero, se establecieron cuatro requisitos para el estudio de las isoformas mediante scRNA-seq, llevando a cabo una revisión de la literatura existente para evaluar su cumplimiento. Tras completar este marco con simulaciones computacionales, se identificaron las debilidades y fortalezas de los métodos de scRNA-seq y las herramientas computacionales disponibles. Durante la segunda etapa de la investigación, estos conocimientos se utilizaron para diseñar un protocolo óptimo de procesamiento de datos de scRNA-seq. En concreto, se integraron datos de lecturas largas a nivel de tejido con datos de scRNA-seq para garantizar una identificación adecuada de las isoformas así como su cuantificación a nivel celular. Este proceso permitió ampliar las estrategias computacionales disponibles para la reconstrucción de transcriptomas a partir de lecturas largas, mejoras que fueron implementadas en SQANTI3, software de referencia en transcriptómica. Por último, los datos procesados se utilizaron para desarrollar un nuevo método de análisis de co-expresión de isoformas a fin de desentrañar redes de regulación del splicing alternativo implicadas en la constitución de la identidad celular.
Dada la elevada variabilidad de los datos de scRNA-seq, este método se basa en la utilización de una estrategia de correlación basada en percentiles que atenúa el ruido técnico y permite la identificación de grupos de isoformas co-expresadas. Una vez configurada la red de co-expresión, se introdujo una nueva estrategia de análisis para la detección de patrones de co-utilización de isoformas que suceden de forma independiente a la expresión a nivel de gen, denominada co-Differential Isoform Usage. Este enfoque facilita la identificación de una capa de regulación de la identidad celular atribuible únicamente a mecanismos post-transcripcionales. Para una interpretación biológica más profunda, se aplicó una estrategia de anotación computacional de motivos y dominios funcionales en las isoformas definidas con lecturas largas, revelando las propiedades biológicas de las isoformas involucradas en la red de co-expresión. Estas investigaciones culminan en el lanzamiento de acorde, un paquete de R que encapsula las diferentes metodologías desarrolladas en esta tesis, potenciando la reproducibilidad de sus resultados y proporcionando una nueva herramienta para explorar la biología de las isoformas alternativas a nivel de célula única.
En resumen, esta tesis describe una serie de esfuerzos destinados a desbloquear el potencial de los datos de scRNA-seq para avanzar en la comprensión del splicing alternativo. Desde un contexto de escasez de herramientas y conocimiento previo, se han desarrollado soluciones de análisis innovadoras que permiten la aplicación de scRNA-seq al estudio de las isoformas alternativas, proporcionando recursos innovadores para profundizar en la regulación post-transcripcional y la función celular. / [CA] La introducció de la seqüenciació d'ARN a escala de cèl·lula única (scRNA-seq) en l'àmbit de la transcriptòmica ha redefinit el nostre enteniment de la diversitat cel·lular, projectant llum sobre els mecanismes subjacents a l'heterogeneïtat tissular. Malgrat les limitacions inicials d'aquesta tecnologia, especialment en el context de processos complexos com l'splicing alternatiu, els patrons d'splicing a escala cel·lular plantejaven incògnites amb potencial de resolució: és possible observar, a escala cel·lular, la mateixa diversitat d'isoformes que es detecta mitjançant RNA-seq en teixits? Quina funció tenen les isoformes alternatives en la constitució de la identitat cel·lular?
L'objectiu d'aquesta tesi és desbloquejar el potencial del scRNA-seq per a l'anàlisi d'isoformes alternatives, abordant les seues dificultats tècniques i analítiques amb noves metodologies computacionals. Per a això, es va traçar una ruta amb tres objectius. Primerament, es van establir quatre requisits per a l'estudi de les isoformes mitjançant scRNA-seq, amb una revisió de la literatura existent per avaluar-ne el compliment. Després de completar aquest marc amb simulacions computacionals, es van identificar les debilitats i fortaleses dels mètodes de scRNA-seq i de les eines computacionals disponibles. Durant la segona etapa de la investigació, aquests coneixements es van utilitzar per dissenyar un protocol òptim de processament de dades de scRNA-seq. En concret, es van integrar dades de lectures llargues a escala de teixit amb dades de scRNA-seq per a garantir una identificació adequada de les isoformes així com la seua quantificació a escala cel·lular. Aquest procés va permetre ampliar les estratègies computacionals disponibles per a la reconstrucció de transcriptomes a partir de lectures llargues, millores que van ser implementades en SQANTI3, un programari de referència en transcriptòmica. Finalment, les dades processades es van fer servir per a desenvolupar un nou mètode d'anàlisi de coexpressió d'isoformes amb l'objectiu de desentranyar xarxes de regulació de l'splicing alternatiu implicades en la constitució de la identitat cel·lular.
Donada l'elevada variabilitat de les dades de scRNA-seq, aquest mètode es basa en la utilització d'una estratègia de correlació basada en percentils que minimitza el soroll tècnic i permet la identificació de grups d'isoformes coexpressades. Un cop configurada la xarxa de coexpressió, es va introduir una nova estratègia d'anàlisi per a la detecció de patrons de co-utilització d'isoformes que succeeixen de forma independent a l'expressió del seu gen, denominada co-Differential Isoform Usage. Aquest enfocament facilita la identificació d'una capa de regulació de la identitat cel·lular atribuïble únicament a mecanismes post-transcripcionals. Per a una interpretació biològica més profunda, es va aplicar una estratègia d'anotació computacional de motius i dominis funcionals en les isoformes definides amb lectures llargues, revelant les propietats biològiques de les isoformes involucrades en la xarxa de coexpressió. Aquestes investigacions culminen en el llançament d'acorde, un paquet de R que encapsula les diferents metodologies desenvolupades en aquesta tesi, potenciant la reproducibilitat dels seus resultats i proporcionant una nova eina per a explorar la biologia de les isoformes alternatives a escala de cèl·lula única.
En resum, aquesta tesi descriu una sèrie d'esforços destinats a desbloquejar el potencial de les dades de scRNA-seq per a avançar en la comprensió de l'splicing alternatiu. Des d'un context de manca d'eines i coneixement previ, s'han desenvolupat solucions d'anàlisi innovadores que permeten l'aplicació de scRNA-seq a l'estudi de les isoformes alternatives, proporcionant recursos innovadors per a aprofundir en la regulació post-transcripcional i la funció cel·lular. / [EN] In the world of transcriptomics, the emergence of single-cell RNA sequencing (scRNA-seq) ignited a revolution in our understanding of cellular diversity, unraveling novel mechanisms in tissue heterogeneity, development and disease. However, when this thesis began, using scRNA-seq to understand Alternative Splicing (AS) was a challenging frontier due to the inherent limitations of the technology. In spite of this research gap, pertinent questions persisted regarding cell-level AS patterns, particularly concerning the recapitulation of isoform diversity observed in bulk RNA-seq data at the cellular level and the roles played by cell and cell type-specific isoforms.
The work conducted in the present thesis aims to harness the potential of scRNA-seq for alternative isoform analysis, outlining technical and analytical challenges and designing computational methods to overcome them. To achieve this, we established a roadmap with three main aims. First, we set requirements for studying isoforms using scRNA-seq and conducted an extensive review of existing research, interrogating whether these requirements were met. Combining this acquired knowledge with several computational simulations allowed us to delineate the strengths and pitfalls of available data generation methods and computational tools. During the second research stage, this insight was used to design a suitable data processing pipeline, in which we jointly employed bulk long-read and short-read scRNA-seq sequenced from full-length cDNAs to ensure adequate isoform reconstruction as well as sensitive cell-level isoform quantification. Additionally, we refined available transcriptome curation strategies, introducing them as innovative modules in the transcriptome quality control software SQANTI3. Lastly, we harnessed single-cell isoform expression data and the rich biological diversity inherent in scRNA-seq, encompassing various cell types, in the design of a novel isoform co-expression analysis method. Percentile correlations effectively mitigated single-cell noise, unveiling clusters of co-expressed isoforms and exposing a layer of regulation in cellular identity that operated independently of gene expression. We additionally introduced co-Differential Isoform Usage (coDIU) analysis, enhancing our ability to interpret isoform cluster networks. This endeavour, combined with the computational annotation of functional sites and domains in the long read-defined isoform models, unearthed a distinctive functional signature in coDIU genes. 
This research effort materialized in the release of acorde, an R package that encapsulates all analysis functionalities developed throughout this thesis, providing a reproducible means for the scientific community to further explore the depths of alternative isoform biology within single-cell transcriptomics.
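The percentile-correlation idea mentioned above can be sketched roughly as follows: summarize each isoform's expression within every cell type by a few quantiles, then correlate the resulting summary vectors instead of the noisy per-cell values. This is an assumption-laden illustration and not the acorde API; the function names, the choice of quartiles, and the expression values are all hypothetical.

```python
from math import sqrt
from statistics import quantiles


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def percentile_profile(expr_by_cell_type, n=4):
    """Concatenate per-cell-type quantile cut points into one summary vector."""
    profile = []
    for cell_type in sorted(expr_by_cell_type):
        profile.extend(quantiles(expr_by_cell_type[cell_type], n=n))
    return profile


# Hypothetical per-cell expression for three isoforms in two cell types.
# Zeros mimic dropout events in single-cell data.
iso_a = {"neuron": [5, 6, 0, 7, 6, 5], "glia": [0, 1, 0, 0, 1, 0]}
iso_b = {"neuron": [9, 0, 11, 10, 12, 10], "glia": [1, 0, 0, 2, 0, 1]}
iso_c = {"neuron": [0, 1, 0, 0, 1, 0], "glia": [6, 7, 5, 0, 6, 7]}

r_ab = pearson(percentile_profile(iso_a), percentile_profile(iso_b))
r_ac = pearson(percentile_profile(iso_a), percentile_profile(iso_c))
print(r_ab, r_ac)
```

Isoforms A and B drop out in different cells, yet their quantile summaries still agree across cell types, whereas isoform C shows the opposite cell-type pattern.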
This thesis describes a complex journey aimed at unlocking the potential of scRNA-seq data for investigating AS and isoforms: from a landscape marked by the scarcity of tools and guidelines, towards the development of novel analysis solutions and the acquisition of valuable biological insight. In a swiftly evolving field, our methodological contributions constitute a significant leap forward in the application of scRNA-seq to the study of alternative isoform expression, providing innovative resources for delving deeper into the intricacies of post-transcriptional regulation and cellular function through the lens of single-cell transcriptomics. / The research project was funded by the BIO2015-71658 and BES-2016-076994 grants awarded by the Spanish Ministry of Science and Innovation / Arzalluz Luque, Á. (2024). Understanding Isoform Expression and Alternative Splicing Biology through Single-Cell RNAseq [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/203888
|
617 |
Investigating the impact of dose banding and oral formulations of paracetamol in pediatrics: A pharmacokinetic simulation-based safety assessment study / Formulerings- och doseringeringseffekter på paracetamol i barn: en farmakokinetisk simuleringsstudie
Rosenqvist, Julia January 2024
Paracetamol är ett vanligt använt läkemedel med analgesisk och antipyretisk effekt. Läkemedlet finns tillgängligt i ett flertal beredningsformer och doseringsstyrkor för användning både receptfritt och i sjukhusvården. Syftet med detta projekt var att undersöka påverkan av alternativ, off-label, dosering av paracetamol i pediatrisk vård, med hjälp av fysiologiskt baserad farmakokinetisk (PBPK) modellering. Modellen utvecklades först för en vuxen population genom integrering av in vitro, in vivo och in silico data för paracetamol. Efter detta extrapolerades concentrationskurvor till en pediatrisk population med hjälp av ontogeni-information. Modellen validerades i både vuxna och barn, och var tillförlitlig för både peroral och intravenös dosering. Efter valideringen utfördes simuleringar för nio olika åldersgrupper baserat på rekommenderade doseringsprotokoll i Sverige. Simuleringarna visade att perorala tablettdoseringen var jämförbar med formulering i lösningsform, med snarlika maximumkoncentrationer och area-under-kurvan (AUC) för exponering. Hastigheten av magtömning influerade maximumkoncentrationer men inte AUC. Ytterligare testades modellens förmåga att prediktera plasmakoncentrationer i blodet efter överdosering med paracetamol. Dessa prediktioner fungerade bättre när läkemedelsmetaboliserande enzymer lämnades oförändrade, eller ökade något i aktivitet. Slutligen, den utvecklade PBPK-modellen kan användas för att säkert undersöka olika doseringsprotokoll och för design av pediatriska kliniska studier. / Paracetamol, a widely used analgesic and antipyretic drug, can be found in various formulations and doses for both home and hospital use. The aim of this study was to investigate the impact of off-label dosing of paracetamol in pediatric clinical practice using physiologically based pharmacokinetic (PBPK) modeling. 
The model was initially developed for adults by integrating relevant in vitro, in vivo and in silico data for paracetamol, after which it was extrapolated to pediatrics by adding ontogeny information. The model was successfully validated in both adult and pediatric populations, and it was accurate for both oral and intravenous administration routes. After validation, simulations were conducted across nine different age groups following the recommended doses in Sweden. These simulations showed that tablet dosing is comparable to solution dosing, resulting in nearly identical maximum concentrations and area under the curve (AUC) values. Furthermore, gastric emptying time, which reflects the fed state of individuals, significantly influenced the maximum concentration, with longer gastric emptying times resulting in lower and delayed peak concentrations. However, gastric emptying time had no effect on the AUC values. Lastly, the model’s performance on overdose data was evaluated; it performed better when the activity of drug-metabolizing enzymes was left unchanged or only slightly increased. The developed PBPK model can be used as a safe and effective way to explore dose banding and to design clinical trials in pediatrics.
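The observation that slower gastric emptying lowers and delays the peak without changing AUC can be reproduced qualitatively with a textbook one-compartment model with first-order absorption. This is a deliberate simplification, not the PBPK model from the thesis: slower gastric emptying is represented only as a smaller absorption rate constant `ka`, and all parameter values are hypothetical rather than calibrated to paracetamol.

```python
from math import exp, log


def concentration(t, dose, ka, ke, volume, f=1.0):
    """Plasma concentration at time t for first-order absorption and elimination."""
    return f * dose * ka / (volume * (ka - ke)) * (exp(-ke * t) - exp(-ka * t))


def summarize(dose, ka, ke, volume, f=1.0):
    """Return (tmax, cmax, auc) from the closed-form one-compartment solution."""
    tmax = log(ka / ke) / (ka - ke)    # time of peak concentration
    cmax = concentration(tmax, dose, ka, ke, volume, f)
    auc = f * dose / (volume * ke)     # equals F * dose / clearance; independent of ka
    return tmax, cmax, auc


# Hypothetical parameters: dose in mg, volume in L, rate constants in 1/h.
fast = summarize(dose=500, ka=3.0, ke=0.3, volume=40)  # fast gastric emptying
slow = summarize(dose=500, ka=1.0, ke=0.3, volume=40)  # slow gastric emptying

print("fast tmax/cmax/auc:", fast)
print("slow tmax/cmax/auc:", slow)
```

Because AUC depends only on bioavailability, dose, and clearance in this model, changing the absorption rate moves the peak but leaves total exposure unchanged, which matches the qualitative pattern reported above.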
|
618 |
Protein-drug binding affinity prediction with machine learning : Assessing the impact of features from molecular dynamic simulations
Guttormsson, Guðmundur Andri, Le Gallo, Léa January 2024
Developing new medicines is generally a long and costly process, and one major factor is estimating protein-drug binding affinity. Leveraging machine learning in this field is a promising approach, as it can streamline the prediction process and reduce the need for expensive experimental methods. Machine learning methods have already enabled significant advances in predicting protein-drug binding affinity, yet there remains room for improvement. The primary challenge is the quality of the data used for these machine learning models. In this work, two ensemble machine learning models, Random Forest and Extreme Gradient Boosting Machine, were tested and compared using a recent database of protein-ligand complex features calculated from molecular dynamics simulations. Additional features were also extracted from the PDB database through PLIP (Protein-Ligand Interaction Profiler), aiming to improve the predictions further. The results indicate that while the features from the PDB database provided strong predictive power, including features from molecular dynamics simulations did not improve the models’ performance.
|
619 |
Single Cell Analysis of Oncogenic Signalling in Intestinal Organoids
Sell, Thomas Sebastian 03 December 2024
Darmkrebs ist ein weit verbreitetes Leiden, das durch erworbene, überwiegend sequenzielle Mutationen in einer kleinen Zahl Onkogene verursacht wird. Diese beeinflussen die zelluläre Signalübertragung. Um Auftreten und Progression kolorektaler Karzinome zu verstehen, müssen daher Transkriptome, Phosphoproteome und phänotypische Proteinmarker individueller Zellen berücksichtigt werden. Die Einzelzell-RNS-Sequenzierung bietet die erste dieser Fähigkeiten, während Techniken zur Messung von Proteinen auf Einzelzellebene noch wenig verbreitet sind. Hauptziel meiner Forschung war es, Massenzytometrie für Darmkrebsmodelle anzupassen und so Studien intrazellulärer Signalnetzwerke zu ermöglichen. Auf Basis etablierter Protokolle habe ich eine Methodologie entwickelt, mit der sich epitheliale Zellkulturen untersuchen lassen. Außerdem habe ich Best Practices für die Analyse von Massenzytometriedaten aufgestellt. So konnte ich zeigen, dass die Aktivierung des ERK-Signalwegs in KRAS-aktivierten Dünndarmorganoiden transgener Mäuse graduell unterschiedlich ist. Bei Aktivierung von BRAF, dem direkten Aktivierungsziel von KRAS, geht diese Eigenschaft verloren. Ich habe Statistik angewendet, um Zelltypeigenschaften aus dem Massenzytometrie-Datensatz zu extrahieren. So konnte ich zeigen, dass onkogenes KRAS zwar nicht den ERK-Signalweg signifikant aktiviert, aber die gesamte Zellpopulation in Richtung eines Stammzellphänotyps verschiebt. Mithilfe einer Serie von sequenziell CRISPR-mutierten humanen Kolonorganoiden untersuchte ich anschließend, wie sich Signalaktivität und Zellphänotypen in Abhängigkeit einzelner Onkogene der Darmkrebsentwicklung verändern. Auf Basis des intestinalen Kryptenmarkers EphB2 definierte ich dafür eine Pseudo-Differenzierungsachse. Meine Ergebnisse zeigen, dass sich die gesamte Zellpopulation während der Darmkrebsentstehung zwar in Richtung Stammzelligkeit verschiebt, diese Progression jedoch nicht durch jedes Onkogen gleichermaßen vorangetrieben wird. 
/ Colorectal cancer (CRC) is a widespread disease caused by acquired, predominantly sequential, mutations in a limited set of oncogenes. These in turn influence cellular signalling and give tumour cells clonal advantages over their surrounding stroma. To understand the emergence and progression of CRC, it is therefore crucial to assess transcriptomes, phospho-proteomes, and phenotype protein markers of individual cells. Single-cell RNA sequencing already provides the first of these capabilities, while single-cell (phospho-)proteomic techniques are not yet widely established. The primary goal of my research was to adapt mass cytometry (MC) for 2D and 3D CRC model systems and to characterise cell signalling as well as phenotype changes in intestinal 3D organoids during CRC progression. Based on established MC protocols, I devised a methodology suited for measuring epithelial cell cultures, along with best practices for data analysis. With this set of tools I could show that ERK signalling is graded in KRAS-activated mouse small intestinal organoids, but not when the KRAS downstream target BRAF carries an activating mutation. I used principal component analysis (PCA) and k-means clustering to extract cell-type information from the MC dataset and could show that while oncogenic KRAS does not significantly change downstream ERK signalling, it globally shifts cells towards a crypt-like phenotype. Using a series of sequentially CRISPR-mutated human colon organoids, I then investigated how signalling and cell phenotypes change in response to each newly acquired oncogene of the canonical CRC progression. Based on the intestinal crypt-to-villus marker EphB2, I defined a pseudo-differentiation axis. My findings showed that, although cells generally shift towards a stem-like cell phenotype during CRC progression, this shift is not continuous.
|
620 |
Comparative Analysis of Genomic Similarity Tools in Species Identification
Nerella, Chandra Sekhar 14 January 2025
This study presents the development and evaluation of an automated pipeline for genome comparison, leveraging four bioinformatics tools: alignment-based methods (pyANI, FastANI) and k-mer-based methods (Sourmash, BinDash 2.0). The analysis focuses on high-quality genomic datasets characterized by 100% completeness, ensuring consistency and accuracy in the comparison process. The pipeline processes genomes under uniform conditions, recording key performance metrics such as execution time and rank correlations.
Initial comparisons were conducted on a subset of five genomes, generating 10 unique pairwise comparisons to establish baseline performance. This preliminary analysis identified k = 10 as the optimal k-mer size for Sourmash and BinDash, significantly improving their comparability with alignment-based methods.
For the expanded dataset of 175 genomes, encompassing C(175, 2) = 15,225 unique comparisons, pyANI and FastANI demonstrated high similarity values, often exceeding 90% for closely related genomes. Rank correlations, calculated using Spearman's ρ and Kendall's τ, highlighted strong agreement between pyANI and FastANI (ρ = 0.9630, τ = 0.8625) due to their shared alignment-based methodology. Similarly, Sourmash and BinDash, both employing k-mer-based approaches, exhibited moderate-to-strong rank correlations (ρ = 0.6967, τ = 0.5290). In contrast, the rank correlations between alignment-based and k-mer-based tools were lower, underscoring methodological differences in genome similarity calculations.
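The pair count and the two rank statistics named above can be checked with a few lines of standard-library Python. The sketch below confirms that 175 genomes give 15,225 unordered pairs and computes Spearman's ρ and Kendall's τ on invented similarity scores; the scores are hypothetical, ties are assumed absent, and the τ shown is the simple tau-a variant, which may differ from whatever implementation the study used.

```python
from itertools import combinations
from math import comb

# Number of unordered genome pairs for 175 genomes.
n_pairs = comb(175, 2)


def ranks(values):
    """Ascending ranks starting at 1; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (valid without ties)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / comb(len(x), 2)


# Hypothetical similarity scores from two tools for the same five genome pairs.
tool_a = [0.99, 0.95, 0.80, 0.78, 0.60]
tool_b = [0.97, 0.98, 0.70, 0.75, 0.50]

rho = spearman_rho(tool_a, tool_b)
tau = kendall_tau(tool_a, tool_b)
print(n_pairs, rho, tau)
```

Both statistics compare only the ordering of the pairs, which is why tools reporting similarity on different scales can still be compared this way.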
Execution times revealed significant contrasts between the tools. Alignment-based methods required substantial computation time, with pyANI taking an average of 1.97 seconds per comparison and FastANI averaging 0.81 seconds per comparison. Conversely, k-mer-based methods demonstrated exceptional computational efficiency, with Sourmash completing comparisons in 2.1 milliseconds and BinDash in just 0.25 milliseconds per comparison, reflecting a difference of nearly three orders of magnitude between the two categories. These results underscore the trade-offs between computational cost and methodological approaches in genome similarity estimation.
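To put the per-comparison timings in perspective, the following back-of-the-envelope calculation scales them to all 15,225 comparisons. It assumes the reported averages apply uniformly and that comparisons run sequentially, which the abstract does not state; the variable names are illustrative.

```python
from math import log10

PAIRS = 15_225  # unique pairwise comparisons among 175 genomes

# Average seconds per comparison, as reported in the abstract.
seconds_per_comparison = {
    "pyANI": 1.97,
    "FastANI": 0.81,
    "Sourmash": 2.1e-3,
    "BinDash": 0.25e-3,
}

# Projected total runtime per tool under the sequential-execution assumption.
total_seconds = {tool: sec * PAIRS for tool, sec in seconds_per_comparison.items()}
for tool, total in total_seconds.items():
    print(f"{tool}: {total:.1f} s ({total / 3600:.2f} h)")

# Gap between the slowest alignment-based and slowest k-mer-based tool.
ratio = seconds_per_comparison["pyANI"] / seconds_per_comparison["Sourmash"]
print(f"pyANI / Sourmash: {ratio:.0f}x, about 10^{log10(ratio):.2f}")
```

Under these assumptions the alignment-based tools need hours for the full dataset, while the k-mer-based tools finish in seconds.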
This study provides valuable insights into the relative strengths and weaknesses of genome comparison tools, offering a comprehensive framework for selecting appropriate methods for diverse genomic research applications. The findings emphasize the importance of parameter optimization for k-mer-based tools and highlight the scalability of these methods for large-scale genomic analyses. / Master of Science / This study explores the strengths and weaknesses of different tools used to compare genomes, which are the complete set of DNA in living organisms. Comparing genomes allows scientists to understand how different species are related, uncover shared traits, and identify what makes each species unique. The tools we examined fall into two main categories: detailed tools (called alignment-based methods) and faster, more approximate tools (called k-mer-based methods). The detailed tools, such as pyANI and FastANI, compare DNA sequences piece by piece, providing very accurate results. In contrast, the faster tools, such as Sourmash and BinDash, look for patterns in smaller sections of DNA, which makes them much quicker but sometimes less precise.
To start, we tested these tools on a small group of genomes to see how they performed. By adjusting a setting in the faster tools, we found that their results became more similar to the detailed tools, improving their reliability. Encouraged by these findings, we expanded the comparison to a much larger dataset of 175 genomes. For this larger dataset, the detailed tools provided highly accurate results but required much more time and computational power. On the other hand, the faster tools completed the comparisons in a fraction of the time, making them ideal for larger datasets where quick results are needed.
We also compared how the tools ranked genome similarities and found that tools using similar methods, like pyANI and FastANI, had very consistent rankings. Likewise, the faster tools, Sourmash and BinDash, also agreed with each other. However, the rankings between the two types of tools (detailed versus faster) were less consistent, reflecting their different approaches to genome comparison.
This research provides a practical guide for scientists choosing tools to compare genomes. If accuracy and detail are most important, alignment-based tools are the best choice, though they take more time and computational resources. If speed is critical, such as when working with very large datasets, k-mer-based tools offer an excellent alternative. By understanding the strengths and trade-offs of each method, researchers can make informed decisions to suit their specific needs, whether focusing on small, detailed studies or large-scale genome analyses.
|