Global ETD Search

1	BioBridge: Bringing Data Exploration to Biologists Boyd, Joseph 01 May 2014 Since the completion of the Human Genome Project in 2003, biologists have become exceptionally good at producing data. Indeed, biological data has grown at a sustained exponential rate, putting effective and thorough analysis beyond the reach of many biologists. This thesis presents BioBridge, an interactive visualization tool developed to bring intuitive data exploration to biologists. BioBridge is designed to work on omics-style tabular data in general and thus has broad applicability. This work describes the design and evaluation of BioBridge's primary visualization, the Entity View, as well as the accompanying user interface. The Entity View visualization arranges glyphs representing biological entities (e.g. genes, proteins, metabolites) along with related text mining results to provide biological context. Throughout development the goal has been to maximize accessibility and usability for biologists who are not computationally inclined. BioBridge was evaluated in three informal case studies, one of a metabolome dataset and two of microarray datasets. BioBridge is a proof of concept that there is an underexploited niche in the data analysis ecosystem for tools that prioritize accessibility and usability. The case studies, while anecdotal, are very encouraging and indicate that BioBridge is well suited to the task of data exploration. With further development, BioBridge could become more flexible and usable as additional use case datasets are explored and more feedback is gathered. visualization omics data information visualization interactive visualization exploratory visualization biology
2	Dissecting the multi-functional role of heterogeneous nuclear ribonucleoprotein H1 in methamphetamine addiction traits Ruan, Qiu T. 24 March 2021 Both genetic and environmental factors influence susceptibility to substance use disorders. However, the genetic basis of these disorders is largely unknown. We previously identified Hnrnph1 (heterogeneous nuclear ribonucleoprotein H1) as a quantitative trait gene for reduced methamphetamine (MA) stimulant sensitivity. Mutation (heterozygous deletion of a small region in the first coding exon) in Hnrnph1 also decreased MA reinforcement, reward, and dopamine release. 5’UTR genetic variants in Hnrnph1 support reduced 5’UTR usage and hnRNP H protein expression as a molecular mechanism underlying the reduced MA-induced psychostimulant response. Interestingly, Hnrnph1 mutant mice show a two-fold increase in hnRNP H protein in the striatal synaptosome with no change in whole tissue level. Proteome profiling of the synaptosome identified an increase in mitochondrial complex I and V proteins that rapidly decreased with MA in Hnrnph1 mutants. In contrast, the much lower level of basal mitochondrial proteins in the wild-type mice showed a rapid, MA-induced increase. Altered mitochondrial proteins associated with the Hnrnph1 mutation may contribute to reductions in MA behaviors. hnRNP H1 is an abundant RNA-binding protein in the brain, involved in all aspects of post-transcriptional regulation. We examined both baseline and MA-induced changes in hnRNP H-RNA interactions to identify targets of hnRNP H that could comprise the neurobiological mechanisms of cellular adaptations occurring following MA exposure. hnRNP H post-transcriptionally regulates a set of mRNA transcripts in the striatum involved in psychostimulant-induced synaptic plasticity. MA treatment induced opposite changes in binding of hnRNP H to these mRNA transcripts in Hnrnph1 mutants versus wild-types. 
RNA-binding, transcriptome, and spliceome analyses triangulated on hnRNP H binding to the 3’UTR of Cacna2d2, an upregulation of Cacna2d2 transcript, and decreased 3’UTR usage of Cacna2d2 in response to MA in the Hnrnph1 mutants. Cacna2d2 codes for a presynaptic, voltage-gated calcium channel subunit that could plausibly regulate MA-induced dopamine release and behavior. The multi-omics datasets point to a dysregulation of mitochondrial function and interrelated calcium signaling as potential mechanisms underlying MA-induced dopamine release and behavior in Hnrnph1 mutants. Neurobiology Addiction Genetics Methamphetamine Omics data RNA binding protein Splicing
3	Tissue-dependent analysis of common and rare genetic variants for Alzheimer's disease using multi-omics data Patel, Devanshi 21 January 2021 Alzheimer’s disease (AD) is a complex neurodegenerative disease characterized by progressive memory loss and caused by a combination of genetic, environmental, and lifestyle factors. AD susceptibility is highly heritable at 58-79%, but only about one third of the AD genetic component is accounted for by common variants discovered through genome-wide association studies (GWAS). Rare variants may contribute to some of the unexplained heritability of AD and have been demonstrated to contribute to large gene expression changes across tissues, but conventional analytical approaches pose challenges because of low statistical power even for large sample sizes. Recent studies have demonstrated by expression quantitative trait locus (eQTL) analysis that changes in gene expression could play a key role in the pathogenesis of AD. However, regulation of gene expression has been shown to be context-specific (e.g., tissue and cell-types), motivating a context-dependent approach to achieve more precise and statistically significant associations. To address these issues, I applied a strategy to identify new AD risk or protective rare variants by examining mutations occurring only in cases or only in controls, observing that different mutations in the same gene or variable dose of a mutation may result in distinct dementias. I also evaluated the impact of rare variation on expression at the gene and gene pathway levels in blood and brain tissue, further strengthening the rare variant findings with functional evidence and finding evidence for a large immune and inflammatory component to AD. 
Lastly, I identified cell-type specific eQTLs in blood and brain tissue to explain underlying genetic associations of common variants in AD, and also discovered additional evidence for the role of myeloid cells in AD risk and potential novel blood and brain AD biomarkers. Collectively, these findings further explain the genetic basis of AD risk and provide insight about mechanisms leading to this disorder. / 2022-01-21T00:00:00Z Bioinformatics Alzheimer's disease Bioinformatics Genetics Genomics Multi-omics data Statistics
4	Deep Learning for Enhancing Precision Medicine Oh, Min 07 June 2021 Most medical treatments have been developed aiming at the best-on-average efficacy for large populations, resulting in treatments successful for some patients but not for others. This creates a need for precision medicine that tailors medical treatment to individual patients. Omics data holds comprehensive genetic information on individual variability at the molecular level and hence the potential to be translated into personalized therapy. However, attempts to transform omics data-driven insights into clinically actionable models for individual patients have been limited. Meanwhile, advances in deep learning, one of the most promising branches of artificial intelligence, have produced unprecedented performance in various fields. Although several deep learning-based methods have been proposed to predict individual phenotypes, they have not established the state of the practice, due to instability of selected or learned features derived from extremely high dimensional data with low sample sizes, which often results in overfitted models with high variance. To overcome the limitations of omics data, recent advances in deep learning models, including representation learning models, generative models, and interpretable models, can be considered. The goal of the proposed work is to develop deep learning models that can overcome the limitations of omics data to enhance the prediction of personalized medical decisions. To achieve this, three key challenges should be addressed: 1) effectively reducing dimensions of omics data, 2) systematically augmenting omics data, and 3) improving the interpretability of omics data. / Doctor of Philosophy / Most medical treatments have been developed aiming at the best-on-average efficacy for large populations, resulting in treatments successful for some patients but not for others. 
This creates a need for precision medicine that tailors medical treatment to individual patients. Biological data such as DNA sequences and snapshots of genetic activities hold comprehensive information on individual variability and hence the potential to accelerate personalized therapy. However, attempts to transform data-driven insights into clinical models for individual patients have been limited. Meanwhile, advances in deep learning, one of the most promising branches of artificial intelligence, have produced unprecedented performance in various fields. Although several deep learning-based methods have been proposed to predict individual treatment or outcome, they have not established the state of the practice, due to the complexity and limited availability of biological data, which often result in overfitted models that may work on training data but not on test data or unseen data. To overcome the limitations of biological data, recent advances in deep learning models, including representation learning models, generative models, and interpretable models, can be considered. The goal of the proposed work is to develop deep learning models that can overcome the limitations of omics data to enhance the prediction of personalized medical decisions. To achieve this, three key challenges should be addressed: 1) effectively reducing the complexity of biological data, 2) generating realistic biological data, and 3) improving the interpretability of biological data. Deep learning (Machine learning) Precision Medicine Omics data
5	Telomere analysis based on high-throughput multi-omics data Nersisyan, Lilit 20 September 2017 Telomeres are repeated sequences at the ends of eukaryotic chromosomes that play a prominent role in normal aging and disease development. They are dynamic structures that normally shorten over the lifespan of a cell, but can be elongated in cells with high proliferative capacity. Telomere elongation in stem cells is an advantageous mechanism that allows them to maintain the regenerative capacity of tissues; however, it also allows cancer cells to survive, thus leading to the development of malignancies. Numerous studies have been conducted to explore the role of telomeres in health and disease. However, the majority of these studies have focused on consequences of extreme shortening of telomeres that lead to telomere dysfunction, replicative arrest or chromosomal instability. Very few studies have addressed the regulatory roles of telomeres, and the association of genomic, transcriptomic and epigenomic characteristics of a cell with telomere length dynamics. The scarcity of such studies is partly due to the low-throughput nature of experimental approaches for telomere length measurement and the fact that they do not easily integrate with currently available high-throughput data. In this thesis, we have attempted to build algorithms, in silico pipelines and software packages to utilize high-throughput omics data for telomere biology research. First, we have developed a software package, Computel, to compute telomere length from whole-genome next-generation sequencing data. We show that it can be used to integrate telomere length dynamics into systems biology research. Using Computel, we have studied the association of telomere length with genomic variations in a healthy human population, as well as with transcriptomic and epigenomic features of lung cancers. 
Another aim of our study was to develop in silico models to assess the activity of telomere maintenance mechanisms (TMMs) based on gene expression data. There are two main TMMs: one based on the catalytic activity of the ribonucleoprotein complex telomerase, and the other based on recombination events between telomeric sequences. Which type of TMM is activated in a cancer cell determines the aggressiveness of the tumor and the outcome of the disease. Investigation of TMMs is valuable not only for basic research, but also for applied medicine, since many anticancer therapies attempt to inhibit the TMM in cancer cells to stop their growth. Therefore, studying the activation mechanisms and regulators of TMMs is of paramount importance for understanding cancer pathomechanisms and for treatment. Many studies have addressed this topic; however, many aspects of TMM activation and realization remain elusive. Additionally, current data-mining pipelines and functional annotation approaches for phenotype-associated genes are not adapted for identification of TMMs. To overcome these limitations, we have constructed pathway networks for the two TMMs based on the literature, and have developed a methodology for assessing TMM pathway activities from gene expression data. We have described the accuracy of our TMM-based approach on a set of cancer samples with experimentally validated TMMs. We have also applied it to explore TMM activity states in lung adenocarcinoma cell lines. In summary, recent developments in high-throughput technologies allow for production of data on multiple levels of cellular organization, from genomic and transcriptomic to epigenomic. This has allowed for rapid development of various directions in molecular and cellular biology. In contrast, telomere research, although at the heart of stem cell and cancer studies, is still conducted with low-throughput experimental approaches. 
Here, we have attempted to utilize the huge amount of currently accumulated multi-omics data to foster telomere research and to bring it to systems biology scale. info:eu-repo/classification/ddc/000 ddc:000
6	Characterizing vaginal microbiome regulation of progesterone receptor expression via secondary analysis of host and microbiome multi-omics data Nina Marie Render (18370176) 16 April 2024 The vaginal microbiome and female sex hormones are both involved in the development and progression of gynecological pathologies. The individual mechanisms by which the vaginal microbiome and female sex hormones each contribute to disease progression are known. However, the mechanisms by which the vaginal microbiome regulates female sex hormones, such as progesterone, are not well understood. This study seeks to understand how the vaginal microbiome regulates progesterone receptor (PGR) expression via secondary analysis of host and vaginal microbiome multi-omics data from the Partners PrEP cohort. This dataset consists of cervicovaginal samples from women enrolled in the Partners PrEP study. Partial Least Squares Regression (PLSR) models were created for each biological data type (microbial composition, metabolomics, metaproteomics) to assess how these factors regulate PGR expression. Significant factors were identified through variable importance in projection (VIP) and correlation analysis. Partial correlation analysis and follow-up PLSR models incorporating clinical and demographic variables were performed to assess the robustness of the vaginal microbiome-PGR associations. The PLSR models indicated that lower PGR expression was associated with G. vaginalis, and higher PGR expression was associated with Lactobacillus species. Cytosine, guanine, and tyrosine were among the metabolites significantly associated with higher PGR expression and experimentally determined to be produced by Lactobacillus species. Conversely, citrulline and succinate were associated with lower PGR expression and experimentally determined to be produced by G. vaginalis. 
The models indicated that bacterial metabolic pathways involved in glucose metabolism, such as glucagon signaling and starch and sugar metabolism, may regulate PGR expression. Demographic phenotypes from the dataset were also considered and did not significantly alter the association between the biological explanatory variables and PGR expression. The results indicate that guanine, cytosine, succinate, starch and sucrose metabolism, and glycolysis/gluconeogenesis may be regulators of PGR abundance and function. The models suggest vaginal microbiome factors could play a role in gynecological conditions where progesterone signaling is suppressed. Future experimental work is needed to validate the results of these models and support their use as predictive tools to understand the role of the vaginal microbiome. Computational physiology vaginal microbiome systems biology omics data progesterone receptor expression
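The abstract above relies on variable importance in projection (VIP) without defining it. For orientation only, one widely used definition for a PLSR model with p predictors and A components is sketched below. The symbols are introduced here as assumptions; the thesis may use a different variant or cutoff.

```latex
\mathrm{VIP}_j
  = \sqrt{\frac{p \sum_{a=1}^{A} \mathrm{SS}_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^{2}}
               {\sum_{a=1}^{A} \mathrm{SS}_a}},
\qquad
\mathrm{SS}_a = q_a^{2}\, \mathbf{t}_a^{\top} \mathbf{t}_a
```

Here w_{ja} is the weight of predictor j on component a, t_a is the score vector of component a, and q_a is the corresponding response loading, so SS_a is the response variance explained by that component. Under this definition the squared VIP values average to 1 across predictors, which is why VIP > 1 is a common rule of thumb for flagging influential variables.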
7	Multi-omics Data Integration for Identifying Disease Specific Biological Pathways Lu, Yingzhou 05 June 2018 Pathway analysis is an important task for gaining novel insights into the molecular architecture of many complex diseases. With the advancement of new sequencing technologies, a large amount of quantitative gene expression data has been continuously acquired. Emerging omics data sets, such as proteomics, have facilitated the investigation of disease-relevant pathways. Although much work has previously been done to explore single-omics data, little work has been reported using multi-omics data integration, mainly due to methodological and technological limitations. While a single omics data type can provide useful information about the underlying biological processes, multi-omics data integration would give a much more comprehensive view of the cause-effect processes responsible for diseases and their subtypes. This project investigates the combination of miRNAseq, proteomics, and RNAseq data on seven types of muscular dystrophies and a control group. These unique multi-omics data sets provide us with the opportunity to identify the most relevant disease-specific biological pathways. We first perform a t-test and an OVEPUG test separately to identify the differentially expressed genes in the protein and mRNA data sets. In multi-omics data sets, miRNAs also play a significant role in muscle development by regulating their target genes in the mRNA data set. To exploit the relationship between miRNA and gene expression, we consult the commonly used library TargetScan to collect all miRNA-mRNA and miRNA-protein co-expression pairs. Next, by conducting statistical analysis such as Pearson's correlation coefficient or a t-test, we measure the biologically expected correlation of each gene with its upstream miRNAs and identify those showing negative correlation among the aforementioned miRNA-mRNA and miRNA-protein pairs. 
Furthermore, we identify and assess the most relevant disease-specific pathways by inputting the differentially expressed genes and the negatively correlated genes into the gene-set libraries separately, and further characterize these prioritized marker subsets using IPA (Ingenuity Pathway Analysis) or KEGG. We will then use Fisher's method to combine the p-values derived from the separate gene sets into a joint significance test assessing common pathway relevance. In conclusion, we will find all negatively correlated miRNA-mRNA and miRNA-protein pairs and identify several pathophysiological pathways related to muscular dystrophies by gene set enrichment analysis. This novel multi-omics data integration study and subsequent pathway identification will shed new light on pathophysiological processes in muscular dystrophies and improve our understanding of the molecular pathophysiology of muscle disorders, which may in the long term support disease prevention and treatment. / Master of Science / Identification of biological pathways plays a central role in understanding both human health and disease. A biological pathway is a series of information processing steps via interactions among molecules in a cell that partially determines the phenotype of a cell. Specifically, identifying disease-specific pathways will guide focused studies on complex diseases, thus potentially improving the prevention and treatment of diseases. To identify disease-specific pathways, it is crucial to develop computational methods and statistical tests that can integrate multi-omics (multiple omes such as genome, proteome, etc.) data. Compared to single-omics data, multi-omics data will help gain a more comprehensive understanding of the molecular architecture of disease processes. In this thesis, we propose a novel data analytics pipeline for multi-omics data integration. 
We test and apply our method on real proteomics data sets from muscular dystrophy subtypes, and identify several biologically plausible pathways related to muscular dystrophies. Biological Pathways Multi-omics Data Integration Muscular Dystrophy Statistical significance test Gene set enrichment analysis
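Two steps named in the abstract above, filtering miRNA-target pairs by negative Pearson correlation and combining p-values with Fisher's method, can be illustrated with a minimal sketch. This is not the thesis pipeline: the function names, the input layout (dictionaries of expression profiles keyed by identifier), and the zero correlation threshold are all assumptions made for illustration, and the candidate pairs are assumed to have been collected beforehand (for example from TargetScan). Fisher's method as written also assumes independent, strictly positive p-values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def negatively_correlated_pairs(mirna_expr, target_expr, pairs, threshold=0.0):
    """Keep candidate (miRNA, target) pairs whose profiles have r < threshold.

    mirna_expr and target_expr are hypothetical dicts mapping an identifier
    to a list of expression values measured on the same samples.
    """
    kept = []
    for mirna, target in pairs:
        r = pearson_r(mirna_expr[mirna], target_expr[target])
        if r < threshold:
            kept.append((mirna, target, r))
    return kept

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) is chi-square with 2k df under H0.

    Returns (statistic, combined p-value). Assumes independent p-values > 0.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Closed-form chi-square survival function for even degrees of freedom 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, tail = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        tail += term
    return statistic, math.exp(-half) * tail
```

A call such as `fisher_combined_p([0.04, 0.20, 0.01])` returns the test statistic together with the joint p-value. The closed-form tail is used only to keep the sketch free of third-party dependencies; a statistics library would normally supply the chi-square survival function.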
8	Réponse du grain de blé à la nutrition azotée et soufrée : étude intégrative des mécanismes moléculaires mis en jeu au cours du développement du grain par des analyses -omiques / Wheat grain response to nitrogen and sulfur supply : integrative study of molecular mechanisms involved during the grain development using -omics analyses Bonnot, Titouan 09 December 2016 (has links) L’augmentation des rendements est un enjeu majeur chez les céréales. Dans cet objectif, il est nécessaire de maintenir la qualité du grain de blé, qui est principalement déterminée par sa teneur et sa composition en protéines de réserve. En effet, une forte relation négative existe entre le rendement et la teneur en protéines. Par ailleurs, la qualité du grain est fortement influencée par la disponibilité en azote et en soufre dans le sol. La limitation des apports d’intrants azotés à la culture et la carence en soufre récemment observée dans les sols représentent ainsi des difficultés supplémentaires pour maitriser cette qualité. Une meilleure connaissance des mécanismes moléculaires impliqués dans le contrôle du développement du grain et la mise en place de ses réserves protéiques en réponse à la nutrition azotée et soufrée est donc primordiale. L’objectif de cette thèse a ainsi été d’apporter de nouveaux éléments à la compréhension de ces processus de régulation, aujourd’hui peu connus. Pour cela, les approches -omiques sont apparues comme une stratégie de choix pour identifier les acteurs moléculaires mis en jeu. Le protéome nucléaire a été une cible importante dans les travaux menés. L’étude de ces protéines nucléaires a révélé certains régulateurs transcriptionnels qui pourraient être impliqués dans le contrôle de la mise en place des réserves du grain. Dans une approche combinant des données de protéomique, transcriptomique et métabolomique, une vision intégrative de la réponse du grain à la nutrition azotée et soufrée a été obtenue. 
L’importance d’un apport de soufre dans le contrôle de la balance azote/soufre du grain, déterminante pour la composition du grain en protéines de réserve, a été clairement vérifiée. Parmi les changements observés au niveau du métabolisme cellulaire, certains des gènes affectés par la modification de cette balance pourraient orchestrer l’ajustement de la composition du grain face à des situations de carences nutritionnelles. Ces nouvelles connaissances devraient permettre de mieux maitriser la qualité du grain de blé dans un contexte d’agriculture durable. / Improving the yield potential of cereals is a major challenge. In this context, wheat grain quality has to be maintained. Grain quality is mainly determined by the content and composition of storage proteins, but there is a strong negative correlation between yield and grain protein concentration. In addition, grain quality is strongly influenced by the availability of nitrogen and sulfur in soils. The limitation of nitrogen inputs and the sulfur deficiency recently observed in soils make this quality even harder to control. Understanding the molecular mechanisms that control grain development and the accumulation of storage proteins in response to nitrogen and sulfur supply is therefore essential. The objective of this thesis was to improve the understanding of these regulatory mechanisms. For this purpose, -omics approaches were the strategy of choice for identifying the molecular actors involved in these processes. In our studies, the nuclear proteome was an important target. Among these proteins, we identified some transcriptional regulators likely to be involved in controlling the accumulation of grain storage compounds. Using an approach combining proteomic, transcriptomic and metabolomic data, we obtained an integrative view of the grain response to nitrogen and sulfur supply. 
In addition, our studies clearly confirmed the major influence of sulfur supply on the nitrogen/sulfur balance that determines the grain storage protein composition. Among the changes observed in cell metabolism, some genes were affected by the modification of this balance and could therefore coordinate the adjustment of grain composition in response to nutritional deficiencies. These new results should help maintain wheat grain quality in a context of sustainable agriculture. Blé Grain Protéines de réserve Azote Soufre Protéines nucléaires Données omiques Réseaux biologiques Wheat Grain Storage proteins Nitrogen Sulfur Nuclear protein -omics data Biological network
9	Approche intégrative du développement musculaire afin de décrire le processus de maturation en lien avec la survie néonatale / Integrative approach of muscular development to describe the maturation process related to the neonatal survival Voillet, Valentin 29 September 2016 (has links) Depuis plusieurs années, des projets d'intégration de données omiques se sont développés, notamment avec objectif de participer à la description fine de caractères complexes d'intérêt socio-économique. Dans ce contexte, l'objectif de cette thèse est de combiner différentes données omiques hétérogènes afin de mieux décrire et comprendre le dernier tiers de gestation chez le porc, période influençant la mortinatalité porcine. Durant cette thèse, nous avons identifié les bases moléculaires et cellulaires sous-jacentes de la fin de gestation, en particulier au niveau du muscle squelettique. Ce tissu est en effet déterminant à la naissance car impliqué dans l'efficacité de plusieurs fonctions physiologiques comme la thermorégulation et la capacité à se déplacer. Au niveau du plan expérimental, les tissus analysés proviennent de foetus prélevés à 90 et 110 jours de gestation (naissance à 114 jours), issus de deux lignées extrêmes pour la mortalité à la naissance, Large White et Meishan, et des deux croisements réciproques. Au travers l'application de plusieurs études statistiques et computationnelles (analyses multidimensionnelles, inférence de réseaux, clustering et intégration de données), nous avons montré l'existence de mécanismes biologiques régulant la maturité musculaire chez les porcelets, mais également chez d'autres espèces d'intérêt agronomique (bovin et mouton). Quelques gènes et protéines ont été identifiées comme étant fortement liées à la mise en place du métabolisme énergétique musculaire durant le dernier tiers de gestation. Les porcelets ayant une immaturité du métabolisme musculaire seraient sujets à un plus fort risque de mortalité à la naissance. 
Un second volet de cette thèse concerne l'imputation de données manquantes (tout un groupe de variables pour un individu) dans les méthodes d'analyses multidimensionnelles, comme l'analyse factorielle multiple (AFM) (ou multiple factor analysis (MFA)). Dans notre contexte, l'AFM fut particulièrement intéressante pour l'intégration de données d'un ensemble d'individus sur différents tissus (deux ou plus). Afin de conserver ces individus manquants pour tout un groupe de variables, nous avons développé une méthode, appelée MI-MFA (multiple imputation - MFA), permettant l'estimation des composantes de l'AFM pour ces individus manquants. / In recent years, omics data integration projects have been developed, notably to contribute to the detailed description of complex traits of socio-economic interest. In this context, the aim of the thesis is to combine different heterogeneous omics data to better describe and understand the last third of gestation in pigs, a period that influences piglet mortality at birth. In the thesis, we better defined the molecular and cellular basis underlying the end of gestation, with a focus on skeletal muscle. This tissue is critical at birth because it is involved in the efficiency of several physiological functions, such as thermoregulation and motor functions. In the experimental design, tissues were collected at two gestational time points (90 or 110 days of gestation) from four fetal genotypes. These genotypes consisted of two extreme breeds for mortality at birth (Meishan and Large White) and two reciprocal crosses. Through statistical and computational analyses (descriptive analyses, network inference, clustering and biological data integration), we highlighted some biological mechanisms regulating the maturation process in pigs, but also in other livestock species (cattle and sheep). Some genes and proteins were identified as being highly involved in muscle energy metabolism. 
Piglets with immature muscle metabolism would be at higher risk of mortality at birth. A second aspect of the thesis concerns the imputation of missing data (an entire group of variables missing for an individual) in multidimensional analysis methods such as multiple factor analysis (MFA). In our context, MFA was particularly useful for integrating data from the same individuals across different tissues (two or more). To retain individuals that are missing an entire group of variables, we developed a method called MI-MFA (multiple imputation - MFA), which estimates the MFA components for these individuals. Intégration de données omiques Réseaux biologiques Analyses multidimensionnelles Porc Maturité Mortalité néonatale Omics data integration Biological networks Multidimensional analysis Pigs Maturity Neonatal mortality
10	Deep Learning Strategies for Overcoming Diagnosis Challenges with Limited Annotations Amor del Amor, María Rocío del 27 November 2023 Thesis by compendium / [ES] In recent years, deep learning (DL) has become one of the main areas of artificial intelligence (AI), driven mainly by advances in processing capacity. DL-based algorithms have achieved striking results in understanding and manipulating various types of data, including images, speech signals, and text. The digital revolution in the healthcare sector has enabled the generation of new databases, which has facilitated the implementation of DL models under the supervised learning paradigm. Incorporating these methods promises to improve and automate the detection and diagnosis of diseases, making it possible to forecast their progression and to apply clinical interventions more effectively. One of the main limitations of applying supervised DL algorithms is the need for large databases annotated by experts, which is a significant barrier in the medical field. To overcome this problem, a new field is opening up around unsupervised or weakly supervised learning strategies that use the available unannotated or weakly annotated data. These approaches make the most of existing data and overcome the limitations of depending on precise annotations. To show that weakly supervised learning can offer optimal solutions, this thesis has focused on developing different paradigms that allow models to be trained with databases that are weakly annotated or annotated by non-expert physicians. 
To this end, two data modalities widely used in the literature to study various types of cancer and inflammatory diseases have been employed: omics data and histological images. In the study on omics data, methods based on deep clustering have been developed to cope with the high dimensionality inherent to this type of data, yielding a predictive model that needs no annotations. Compared with other clustering methods in the literature, the proposed method achieved better results. Regarding the histological imaging studies, this thesis has addressed the detection of several diseases, including skin cancer (spitzoid melanoma and spindle cell neoplasms) and ulcerative colitis. In this context, the multiple instance learning (MIL) paradigm has been used as the baseline in all the frameworks developed, in order to cope with the large size of histological images. In addition, various learning methodologies have been implemented, adapted to the specific problems addressed. For the detection of spitzoid melanoma, an inductive learning approach that requires a smaller volume of annotations has been used. To address the diagnosis of ulcerative colitis, which involves identifying neutrophils as biomarkers, a constraint learning approach has been used. With this method, the annotation cost has been significantly reduced while substantial improvements in the results have been achieved. Finally, given the limited number of experts in the field of spindle cell neoplasms, a novel annotation protocol for non-expert annotations has been designed and validated. In this context, deep learning models that work with the uncertainty associated with such annotations have been developed. 
In conclusion, this thesis has developed cutting-edge techniques to address the challenge posed by the medical sector's need for precise annotations. Starting from weakly annotated or non-expert annotated data, novel paradigms and methodologies based on deep learning have been proposed to address disease detection and diagnosis using omics data and histological images. / [CA] In recent years, deep learning (DL) has become one of the main areas of artificial intelligence (AI), driven mainly by advances in processing power. DL-based algorithms have achieved remarkable results in understanding and manipulating many types of data, including images, speech signals, and text. The digital revolution in the healthcare sector has enabled the creation of new databases, which has facilitated the implementation of DL models under the supervised learning paradigm. Incorporating these methods promises to improve and automate the detection and diagnosis of diseases, making it possible to predict their progression and to apply clinical interventions more effectively. One of the main limitations of supervised DL algorithms is the need for large databases annotated by experts, which is a major barrier in the medical field. To overcome this problem, a new line of work is emerging around unsupervised and weakly supervised learning strategies that use the available unannotated or weakly annotated data. These approaches make the most of existing data and overcome the limitations of relying on precise annotations. 
To show that weakly supervised learning can offer optimal solutions, this thesis has focused on developing different paradigms for training models with databases that are weakly annotated or annotated by non-expert physicians. To this end, two data modalities widely used in the literature to study various types of cancer and inflammatory diseases have been employed: omics data and histological images. In the study on omics data, methods based on deep clustering have been developed to cope with the high dimensionality inherent to this type of data, yielding a predictive model that needs no annotations. Compared with other clustering methods in the literature, the proposed method achieved better results. Regarding the histological imaging studies, this thesis has addressed the detection of several diseases, including skin cancer (spitzoid melanoma and spindle cell neoplasms) and ulcerative colitis. In this context, the multiple instance learning (MIL) paradigm has been used as the baseline in all the frameworks developed, in order to cope with the large size of histological images. In addition, various learning methodologies have been implemented, adapted to the specific problems addressed. For the detection of spitzoid melanoma, an inductive learning approach that requires a smaller volume of annotations has been used. To address the diagnosis of ulcerative colitis, which involves identifying neutrophils as biomarkers, a constraint learning approach has been used. With this method, the annotation cost has been significantly reduced while substantial improvements in the results have been achieved. 
Finally, given the limited number of experts in the field of spindle cell neoplasms, a new annotation protocol for non-expert annotations has been designed and validated. In this context, deep learning models that work with the uncertainty associated with these annotations have been developed. In conclusion, this thesis has developed cutting-edge techniques to address the challenge posed by the medical sector's need for precise annotations. Starting from weakly annotated or non-expert annotated data, new paradigms and methodologies based on deep learning have been proposed to address disease detection and diagnosis using omics data and histological images. These innovations can improve effectiveness and automation in the early detection and monitoring of diseases. / [EN] In recent years, deep learning (DL) has become one of the main areas of artificial intelligence (AI), driven mainly by the advancement in processing power. DL-based algorithms have achieved amazing results in understanding and manipulating various types of data, including images, speech signals and text. The digital revolution in the healthcare sector has enabled the generation of new databases, facilitating the implementation of DL models under the supervised learning paradigm. Incorporating these methods promises to improve and automate the detection and diagnosis of diseases, allowing the prediction of their evolution and facilitating the application of clinical interventions with higher efficacy. One of the main limitations in the application of supervised DL algorithms is the need for large databases annotated by experts, which is a major barrier in the medical field. To overcome this problem, a new field of developing unsupervised or weakly supervised learning strategies using the available unannotated or weakly annotated data is opening up. 
These approaches make the best use of existing data and overcome the limitations of reliance on precise annotations. To demonstrate that weakly supervised learning can offer optimal solutions, this thesis has focused on developing different paradigms that allow training models with weakly annotated or non-expert annotated databases. In this regard, two data modalities widely used in the literature to study various types of cancer and inflammatory diseases have been used: omics data and histological images. In the study on omics data, methods based on deep clustering have been developed to deal with the high dimensionality inherent to this type of data, yielding a predictive model without requiring annotations. The proposed method outperforms other clustering methods from the literature. Regarding histological imaging studies, the detection of different diseases has been addressed in this thesis, including skin cancer (spitzoid melanoma and spindle cell neoplasms) and ulcerative colitis. In this context, the multiple instance learning (MIL) paradigm has been employed as the baseline in all developed frameworks to deal with the large size of histological images. Furthermore, diverse learning methodologies have been implemented, tailored to the specific problems being addressed. For the detection of spitzoid melanoma, an inductive learning approach has been used, which requires a smaller volume of annotations. To address the diagnosis of ulcerative colitis, which involves the identification of neutrophils as biomarkers, a constraint learning approach has been used. With this method, the annotation cost has been significantly reduced while achieving substantial improvements in the obtained results. Finally, considering the limited number of experts in the field of spindle cell neoplasms, a novel annotation protocol for non-experts has been designed and validated. 
In this context, deep learning models that work with the uncertainty associated with such annotations have been developed. In conclusion, this thesis has developed cutting-edge techniques to address the medical sector's challenge of precise data annotation. Using weakly annotated or non-expert annotated data, novel paradigms and methodologies based on deep learning have been proposed to tackle disease detection and diagnosis in omics data and histological images. These innovations can improve effectiveness and automation in early disease detection and monitoring. / The work of Rocío del Amor to carry out this research and to elaborate this dissertation has been supported by the Spanish Ministry of Universities under the FPU grant FPU20/05263. / Amor Del Amor, MRD. (2023). Deep Learning Strategies for Overcoming Diagnosis Challenges with Limited Annotations [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/200227 / Compendio Omics data Digital pathology Aprendizaje profundo Multiple instance learning (MIL) Deep learning Patología digital Datos ómicos Weakly supervised learning ESTADISTICA E INVESTIGACION OPERATIVA TEORÍA DE LA SEÑAL Y COMUNICACIONES
