• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 118
  • 61
  • 21
  • 20
  • 2
  • 2
  • 2
  • 1
  • 1
  • Tagged with
  • 267
  • 267
  • 69
  • 67
  • 59
  • 58
  • 52
  • 39
  • 36
  • 32
  • 31
  • 30
  • 30
  • 29
  • 28
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
241

Statistical Methods for Multivariate Functional Data Clustering, Recurrent Event Prediction, and Accelerated Degradation Data Analysis

Jin, Zhongnan 12 September 2019 (has links)
In this dissertation, we introduce three projects in machine learning and reliability applications after the general introductions in Chapter 1. The first project concentrates on the multivariate sensory data, the second project is related to the bivariate recurrent process, and the third project introduces thermal index (TI) estimation in accelerated destructive degradation test (ADDT) data, in which an R package is developed. All three projects are related to and can be used to solve certain reliability problems. Specifically, in Chapter 2, we introduce a clustering method for multivariate functional data. In order to cluster the customized events extracted from multivariate functional data, we apply the functional principal component analysis (FPCA), and use a model based clustering method on a transformed matrix. A penalty term is imposed on the likelihood so that variable selection is performed automatically. In Chapter 3, we propose a covariate-adjusted model to predict next event in a bivariate recurrent event system. Inspired by geyser eruptions in Yellowstone National Park, we consider two event types and model their event gap time relationship. External systematic conditions are taken account into the model with covariates. The proposed covariate adjusted recurrent process (CARP) model is applied to the Yellowstone National Park geyser data. In Chapter 4, we compare estimation methods for TI. In ADDT, TI is an important index indicating the reliability of materials, when the accelerating variable is temperature. Three methods are introduced in TI estimations, which are least-squares method, parametric model and semi-parametric model. An R package is implemented for all three methods. Applications of R functions are introduced in Chapter 5 with publicly available ADDT datasets. Chapter 6 includes conclusions and areas for future works. / Doctor of Philosophy / This dissertation focuses on three projects that are all related to machine learning and reliability. Specifically, in the first project, we propose a clustering method designated for events extracted from multivariate sensory data. When the customized event is corresponding to reliability issues, such as aging procedures, clustering results can help us learn different event characteristics by examining events belonging to the same group. Applications include diving behavior segmentation based on vehicle sensory data, where multiple sensors are measuring vehicle conditions simultaneously and events are defined as vehicle stoppages. In our project, we also proposed to conduct sensor selection by three different penalizations including individual, variable and group. Our method can be applied for multi-dimensional sensory data clustering, when optimal sensor design is also an objective. The second project introduces a covariate-adjusted model accommodated to a bivariate recurrent event process system. In such systems, events can occur repeatedly and event occurrences for each type can affect each other with certain dependence. Events in the system can be mechanical failures which is related to reliability, while next event time and type predictions are usually of interest. Precise predictions on the next event time and type can essentially prevent serious safety and economy consequences following the upcoming event. We propose two CARP models with marginal behaviors as well as the dependence structure characterized in the bivariate system. We innovate to incorporate external information to the model so that model results are enhanced. The proposed model is evaluated in simulation studies, while geyser data from Yellowstone National Park is applied. In the third project, we comprehensively discuss three estimation methods for thermal index. They are the least-square method, parametric model and semi-parametric model. When temperature is the accelerating variable, thermal index indicates the temperature at which our materials can hold up to a certain time. In reality, estimating the thermal index precisely can prolong lifetime of certain product by choosing the right usage temperature. Methods evaluations are conducted by simulation study, while applications are applied to public available datasets.
242

Sélection de variables pour la classification non supervisée en grande dimension / Variable selection in model-based clustering for high-dimensional data

Meynet, Caroline 09 November 2012 (has links)
Il existe des situations de modélisation statistique pour lesquelles le problème classique de classification non supervisée (c'est-à-dire sans information a priori sur la nature ou le nombre de classes à constituer) se double d'un problème d'identification des variables réellement pertinentes pour déterminer la classification. Cette problématique est d'autant plus essentielle que les données dites de grande dimension, comportant bien plus de variables que d'observations, se multiplient ces dernières années : données d'expression de gènes, classification de courbes... Nous proposons une procédure de sélection de variables pour la classification non supervisée adaptée aux problèmes de grande dimension. Nous envisageons une approche par modèles de mélange gaussien, ce qui nous permet de reformuler le problème de sélection des variables et du choix du nombre de classes en un problème global de sélection de modèle. Nous exploitons les propriétés de sélection de variables de la régularisation l1 pour construire efficacement, à partir des données, une collection de modèles qui reste de taille raisonnable même en grande dimension. Nous nous démarquons des procédures classiques de sélection de variables par régularisation l1 en ce qui concerne l'estimation des paramètres : dans chaque modèle, au lieu de considérer l'estimateur Lasso, nous calculons l'estimateur du maximum de vraisemblance. Ensuite, nous sélectionnons l'un des ces estimateurs du maximum de vraisemblance par un critère pénalisé non asymptotique basé sur l'heuristique de pente introduite par Birgé et Massart. D'un point de vue théorique, nous établissons un théorème de sélection de modèle pour l'estimation d'une densité par maximum de vraisemblance pour une collection aléatoire de modèles. Nous l'appliquons dans notre contexte pour trouver une forme de pénalité minimale pour notre critère pénalisé. D'un point de vue pratique, des simulations sont effectuées pour valider notre procédure, en particulier dans le cadre de la classification non supervisée de courbes. L'idée clé de notre procédure est de n'utiliser la régularisation l1 que pour constituer une collection restreinte de modèles et non pas aussi pour estimer les paramètres des modèles. Cette étape d'estimation est réalisée par maximum de vraisemblance. Cette procédure hybride nous est inspirée par une étude théorique menée dans une première partie dans laquelle nous établissons des inégalités oracle l1 pour le Lasso dans les cadres de régression gaussienne et de mélange de régressions gaussiennes, qui se démarquent des inégalités oracle l0 traditionnellement établies par leur absence totale d'hypothèse. / This thesis deals with variable selection for clustering. This problem has become all the more challenging since the recent increase in high-dimensional data where the number of variables can largely exceeds the number of observations (DNA analysis, functional data clustering...). We propose a variable selection procedure for clustering suited to high-dimensional contexts. We consider clustering based on finite Gaussian mixture models in order to recast both the variable selection and the choice of the number of clusters into a global model selection problem. We use the variable selection property of l1-regularization to build a data-driven model collection in a efficient way. Our procedure differs from classical procedures using l1-regularization as regards the estimation of the mixture parameters: in each model of the collection, rather than considering the Lasso estimator, we calculate the maximum likelihood estimator. Then, we select one of these maximum likelihood estimators by a non-asymptotic penalized criterion. From a theoretical viewpoint, we establish a model selection theorem for maximum likelihood estimators in a density estimation framework with a random model collection. We apply it in our context to determine a convenient penalty shape for our criterion. From a practical viewpoint, we carry out simulations to validate our procedure, for instance in the functional data clustering framework. The basic idea of our procedure, which consists in variable selection by l1-regularization but estimation by maximum likelihood estimators, comes from theoretical results we establish in the first part of this thesis: we provide l1-oracle inequalities for the Lasso in the regression framework, which are valid with no assumption at all contrary to the usual l0-oracle inequalities in the literature, thus suggesting a gap between l1-regularization and l0-regularization.
243

Modèles de mélange pour la régression en grande dimension, application aux données fonctionnelles / High-dimensional mixture regression models, application to functional data

Devijver, Emilie 02 July 2015 (has links)
Les modèles de mélange pour la régression sont utilisés pour modéliser la relation entre la réponse et les prédicteurs, pour des données issues de différentes sous-populations. Dans cette thèse, on étudie des prédicteurs de grande dimension et une réponse de grande dimension. Tout d’abord, on obtient une inégalité oracle ℓ1 satisfaite par l’estimateur du Lasso. On s’intéresse à cet estimateur pour ses propriétés de régularisation ℓ1. On propose aussi deux procédures pour pallier ce problème de classification en grande dimension. La première procédure utilise l’estimateur du maximum de vraisemblance pour estimer la densité conditionnelle inconnue, en se restreignant aux variables actives sélectionnées par un estimateur de type Lasso. La seconde procédure considère la sélection de variables et la réduction de rang pour diminuer la dimension. Pour chaque procédure, on obtient une inégalité oracle, qui explicite la pénalité nécessaire pour sélectionner un modèle proche de l’oracle. On étend ces procédures au cas des données fonctionnelles, où les prédicteurs et la réponse peuvent être des fonctions. Dans ce but, on utilise une approche par ondelettes. Pour chaque procédure, on fournit des algorithmes, et on applique et évalue nos méthodes sur des simulations et des données réelles. En particulier, on illustre la première méthode par des données de consommation électrique. / Finite mixture regression models are useful for modeling the relationship between a response and predictors, arising from different subpopulations. In this thesis, we focus on high-dimensional predictors and a high-dimensional response. First of all, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator. We focus on this estimator for its ℓ1-regularization properties rather than for the variable selection procedure. We also propose two procedures to deal with this issue. The first procedure leads to estimate the unknown conditional mixture density by a maximum likelihood estimator, restricted to the relevant variables selected by an ℓ1-penalized maximum likelihood estimator. The second procedure considers jointly predictor selection and rank reduction for obtaining lower-dimensional approximations of parameters matrices. For each procedure, we get an oracle inequality, which derives the penalty shape of the criterion, depending on the complexity of the random model collection. We extend these procedures to the functional case, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms, apply and evaluate our methods both on simulations and real datasets. In particular, we illustrate the first procedure on an electricity load consumption dataset.
244

Emprego de técnicas de análise exploratória de dados utilizados em Química Medicinal / Use of different techniques for exploratory data analysis in Medicinal Chemistry

Gertrudes, Jadson Castro 10 September 2013 (has links)
Pesquisas na área de Química Medicinal têm direcionado esforços na busca por métodos que acelerem o processo de descoberta de novos medicamentos. Dentre as diversas etapas relacionadas ao longo do processo de descoberta de substâncias bioativas está a análise das relações entre a estrutura química e a atividade biológica de compostos. Neste processo, os pesquisadores da área de Química Medicinal analisam conjuntos de dados que são caracterizados pela alta dimensionalidade e baixo número de observações. Dentro desse contexto, o presente trabalho apresenta uma abordagem computacional que visa contribuir para a análise de dados químicos e, consequentemente, a descoberta de novos medicamentos para o tratamento de doenças crônicas. As abordagens de análise exploratória de dados, utilizadas neste trabalho, combinam técnicas de redução de dimensionalidade e de agrupamento para detecção de estruturas naturais que reflitam a atividade biológica dos compostos analisados. Dentre as diversas técnicas existentes para a redução de dimensionalidade, são discutidas o escore de Fisher, a análise de componentes principais e a análise de componentes principais esparsas. Quanto aos algoritmos de aprendizado, são avaliados o k-médias, fuzzy c-médias e modelo de misturas ICA aperfeiçoado. No desenvolvimento deste trabalho foram utilizados quatro conjuntos de dados, contendo informações de substâncias bioativas, sendo que dois conjuntos foram relacionados ao tratamento da diabetes mellitus e da síndrome metabólica, o terceiro conjunto relacionado a doenças cardiovasculares e o último conjunto apresenta substâncias que podem ser utilizadas no tratamento do câncer. Nos experimentos realizados, os resultados alcançados sugerem a utilização das técnicas de redução de dimensionalidade juntamente com os algoritmos não supervisionados para a tarefa de agrupamento dos dados químicos, uma vez que nesses experimentos foi possível descrever níveis de atividade biológica dos compostos estudados. Portanto, é possível concluir que as técnicas de redução de dimensionalidade e de agrupamento podem possivelmente ser utilizadas como guias no processo de descoberta e desenvolvimento de novos compostos na área de Química Medicinal. / Researches in Medicinal Chemistry\'s area have focused on the search of methods that accelerate the process of drug discovery. Among several steps related to the process of discovery of bioactive substances there is the analysis of the relationships between chemical structure and biological activity of compounds. In this process, researchers of medicinal chemistry analyze data sets that are characterized by high dimensionality and small number of observations. Within this context, this work presents a computational approach that aims to contribute to the analysis of chemical data and, consequently, the discovery of new drugs for the treatment of chronic diseases. Approaches used in exploratory data analysis, employed in this work, combine techniques of dimensionality reduction and clustering for detecting natural structures that reflect the biological activity of the analyzed compounds. Among several existing techniques for dimensionality reduction, we have focused the Fisher\'s score, principal component analysis and sparse principal component analysis. For the clustering procedure, this study evaluated k-means, fuzzy c-means and enhanced ICA mixture model. In order to perform experiments, we used four data sets, containing information of bioactive substances. Two sets are related to the treatment of diabetes mellitus and metabolic syndrome, the third set is related to cardiovascular disease and the latter set has substances that can be used in cancer treatment. In the experiments, the obtained results suggest the use of dimensionality reduction techniques along with clustering algorithms for the task of clustering chemical data, since from these experiments, it was possible to describe different levels of biological activity of the studied compounds. Therefore, we conclude that the techniques of dimensionality reduction and clustering can be used as guides in the process of discovery and development of new compounds in the field of Medicinal Chemistry
245

Análise e comparação de alguns métodos alternativos de seleção de variáveis preditoras no modelo de regressão linear / Analysis and comparison of some alternative methods of selection of predictor variables in linear regression models.

Marques, Matheus Augustus Pumputis 04 June 2018 (has links)
Neste trabalho estudam-se alguns novos métodos de seleção de variáveis no contexto da regressão linear que surgiram nos últimos 15 anos, especificamente o LARS - Least Angle Regression, o NAMS - Noise Addition Model Selection, a Razão de Falsa Seleção - RFS (FSR em inglês), o LASSO Bayesiano e o Spike-and-Slab LASSO. A metodologia foi a análise e comparação dos métodos estudados e aplicações. Após esse estudo, realizam-se aplicações em bases de dados reais e um estudo de simulação, em que todos os métodos se mostraram promissores, com os métodos Bayesianos apresentando os melhores resultados. / In this work, some new variable selection methods that have appeared in the last 15 years in the context of linear regression are studied, specifically the LARS - Least Angle Regression, the NAMS - Noise Addition Model Selection, the False Selection Rate - FSR, the Bayesian LASSO and the Spike-and-Slab LASSO. The methodology was the analysis and comparison of the studied methods. After this study, applications to real data bases are made, as well as a simulation study, in which all methods are shown to be promising, with the Bayesian methods showing the best results.
246

Développement de méthodes spatio-temporelles pour la prévision à court terme de la production photovoltaïque / Development of spatio-temporal methods for short term forecasting of photovoltaïc production

Agoua, Xwégnon 20 December 2017 (has links)
L’évolution du contexte énergétique mondial et la lutte contre le changement climatique ont conduit à l’accroissement des capacités de production d’énergie renouvelable. Les énergies renouvelables sont caractérisées par une forte variabilité due à leur dépendance aux conditions météorologiques. La maîtrise de cette variabilité constitue un enjeu important pour les opérateurs du système électrique, mais aussi pour l’atteinte des objectifs européens de réduction des émissions de gaz à effet de serre, d’amélioration de l’efficacité énergétique et de l’augmentation de la part des énergies renouvelables. Dans le cas du photovoltaïque(PV), la maîtrise de la variabilité de la production passe par la mise en place d’outils qui permettent de prévoir la production future des centrales. Ces prévisions contribuent entre autres à l’augmentation du niveau de pénétration du PV,à l’intégration optimale dans le réseau électrique, à l’amélioration de la gestion des centrales PV et à la participation aux marchés de l’électricité. L’objectif de cette thèse est de contribuer à l’amélioration de la prédictibilité à court-terme (moins de 6 heures) de la production PV. Dans un premier temps, nous analysons la variabilité spatio-temporelle de la production PV et proposons une méthode de réduction de la non-stationnarité des séries de production. Nous proposons ensuite un modèle spatio-temporel de prévision déterministe qui exploite les corrélations spatio-temporelles entre les centrales réparties sur une région. Les centrales sont utilisées comme un réseau de capteurs qui permettent d’anticiper les sources de variabilité. Nous proposons aussi une méthode automatique de sélection des variables qui permet de résoudre les problèmes de dimension et de parcimonie du modèle spatio-temporel. Un modèle spatio-temporel probabiliste a aussi été développé aux fins de produire des prévisions performantes non seulement du niveau moyen de la production future mais de toute sa distribution. Enfin nous proposons, un modèle qui exploite les observations d’images satellites pour améliorer la prévision court-terme de la production et une comparaison de l’apport de différentes sources de données sur les performances de prévision. / The evolution of the global energy context and the challenges of climate change have led to anincrease in the production capacity of renewable energy. Renewable energies are characterized byhigh variability due to their dependence on meteorological conditions. Controlling this variabilityis an important challenge for the operators of the electricity systems, but also for achieving the Europeanobjectives of reducing greenhouse gas emissions, improving energy efficiency and increasing the share of renewable energies in EU energy consumption. In the case of photovoltaics (PV), the control of the variability of the production requires to predict with minimum errors the future production of the power stations. These forecasts contribute to increasing the level of PV penetration and optimal integration in the power grid, improving PV plant management and participating in electricity markets. The objective of this thesis is to contribute to the improvement of the short-term predictability (less than 6 hours) of PV production. First, we analyze the spatio-temporal variability of PV production and propose a method to reduce the nonstationarity of the production series. We then propose a deterministic prediction model that exploits the spatio-temporal correlations between the power plants of a spatial grid. The power stationsare used as a network of sensors to anticipate sources of variability. We also propose an automaticmethod for selecting variables to solve the dimensionality and sparsity problems of the space-time model. A probabilistic spatio-temporal model has also been developed to produce efficient forecasts not only of the average level of future production but of its entire distribution. Finally, we propose a model that exploits observations of satellite images to improve short-term forecasting of PV production.
247

Contrôle des fausses découvertes lors de la sélection de variables en grande dimension / Control of false discoveries in high-dimensional variable selection

Bécu, Jean-Michel 10 March 2016 (has links)
Dans le cadre de la régression, de nombreuses études s’intéressent au problème dit de la grande dimension, où le nombre de variables explicatives mesurées sur chaque échantillon est beaucoup plus grand que le nombre d’échantillons. Si la sélection de variables est une question classique, les méthodes usuelles ne s’appliquent pas dans le cadre de la grande dimension. Ainsi, dans ce manuscrit, nous présentons la transposition de tests statistiques classiques à la grande dimension. Ces tests sont construits sur des estimateurs des coefficients de régression produits par des approches de régressions linéaires pénalisées, applicables dans le cadre de la grande dimension. L’objectif principal des tests que nous proposons consiste à contrôler le taux de fausses découvertes. La première contribution de ce manuscrit répond à un problème de quantification de l’incertitude sur les coefficients de régression réalisée sur la base de la régression Ridge, qui pénalise les coefficients de régression par leur norme l2, dans le cadre de la grande dimension. Nous y proposons un test statistique basé sur le rééchantillonage. La seconde contribution porte sur une approche de sélection en deux étapes : une première étape de criblage des variables, basée sur la régression parcimonieuse Lasso précède l’étape de sélection proprement dite, où la pertinence des variables pré-sélectionnées est testée. Les tests sont construits sur l’estimateur de la régression Ridge adaptive, dont la pénalité est construite à partir des coefficients de régression du Lasso. Une dernière contribution consiste à transposer cette approche à la sélection de groupes de variables. / In the regression framework, many studies are focused on the high-dimensional problem where the number of measured explanatory variables is very large compared to the sample size. If variable selection is a classical question, usual methods are not applicable in the high-dimensional case. So, in this manuscript, we develop the transposition of statistical tests to the high dimension. These tests operate on estimates of regression coefficients obtained by penalized linear regression, which is applicable in high-dimension. The main objective of these tests is the false discovery control. The first contribution of this manuscript provides a quantification of the uncertainty for regression coefficients estimated by ridge regression in high dimension. The Ridge regression penalizes the coefficients on their l2 norm. To do this, we devise a statistical test based on permutations. The second contribution is based on a two-step selection approach. A first step is dedicated to the screening of variables, based on parsimonious regression Lasso. The second step consists in cleaning the resulting set by testing the relevance of pre-selected variables. These tests are made on adaptive-ridge estimates, where the penalty is constructed on Lasso estimates learned during the screening step. A last contribution consists to the transposition of this approach to group-variables selection.
248

Redes Neurais Aplicadas à InferÃncia dos Sinais de Controle de Dosagem de Coagulantes em uma ETA por FiltraÃÃo RÃpida / Artificial Neural Networks applied to the inference of dosage control signals of coagulants in a water treatment plant by direct filtrationâ,

Leonaldo da Silva Gomes 28 February 2012 (has links)
Considerando a importÃncia do controle da coagulaÃÃo quÃmica para o processo de tratamento de Ãgua por filtraÃÃo rÃpida, esta dissertaÃÃo propÃe a aplicaÃÃo de redes neurais artificiais para inferÃncia dos sinais de controle de dosagem de coagulantes principal e auxiliar, no processo de coagulaÃÃo quÃmica em uma estaÃÃo de tratamento de Ãgua por filtraÃÃo rÃpida. Para tanto, foi feito uma anÃlise comparativa da aplicaÃÃo de modelos baseados em redes neurais do tipo: alimentada adiante focada atrasada no tempo (FTLFN); alimentada adiante atrasada no tempo distribuÃda (DTLFN); recorrente de Elman (ERN) e auto-regressiva nÃo-linear com entradas exÃgenas (NARX). Da anÃlise comparativa, o modelo baseado em redes NARX apresentou melhores resultados, evidenciando o potencial do modelo para uso em casos reais, o que contribuirà para a viabilizaÃÃo de projetos desta natureza em estaÃÃes de tratamento de Ãgua de pequeno porte. / Considering the importance of the chemical coagulation control for the water treatment by direct filtration, this work proposes the application of artificial neural networks for inference of dosage control signals of principal and auxiliary coagulant, in the chemical coagulation process in a water treatment plant by direct filtration. To that end, was made a comparative analysis of the application of models based on neural networks, such as: Focused Time Lagged Feedforward Network (FTLFN); Distributed Time Lagged Feedforward Network (DTLFN); Elman Recurrent Network (ERN) and Non-linear Autoregressive with exogenous inputs (NARX). From the comparative analysis, the model based on NARX networks showed better results, demonstrating the potential of the model for use in real cases, which will contribute to the viability of projects of this nature in small size water treatment plants.
249

Análise e comparação de alguns métodos alternativos de seleção de variáveis preditoras no modelo de regressão linear / Analysis and comparison of some alternative methods of selection of predictor variables in linear regression models.

Matheus Augustus Pumputis Marques 04 June 2018 (has links)
Neste trabalho estudam-se alguns novos métodos de seleção de variáveis no contexto da regressão linear que surgiram nos últimos 15 anos, especificamente o LARS - Least Angle Regression, o NAMS - Noise Addition Model Selection, a Razão de Falsa Seleção - RFS (FSR em inglês), o LASSO Bayesiano e o Spike-and-Slab LASSO. A metodologia foi a análise e comparação dos métodos estudados e aplicações. Após esse estudo, realizam-se aplicações em bases de dados reais e um estudo de simulação, em que todos os métodos se mostraram promissores, com os métodos Bayesianos apresentando os melhores resultados. / In this work, some new variable selection methods that have appeared in the last 15 years in the context of linear regression are studied, specifically the LARS - Least Angle Regression, the NAMS - Noise Addition Model Selection, the False Selection Rate - FSR, the Bayesian LASSO and the Spike-and-Slab LASSO. The methodology was the analysis and comparison of the studied methods. After this study, applications to real data bases are made, as well as a simulation study, in which all methods are shown to be promising, with the Bayesian methods showing the best results.
250

Emprego de técnicas de análise exploratória de dados utilizados em Química Medicinal / Use of different techniques for exploratory data analysis in Medicinal Chemistry

Jadson Castro Gertrudes 10 September 2013 (has links)
Pesquisas na área de Química Medicinal têm direcionado esforços na busca por métodos que acelerem o processo de descoberta de novos medicamentos. Dentre as diversas etapas relacionadas ao longo do processo de descoberta de substâncias bioativas está a análise das relações entre a estrutura química e a atividade biológica de compostos. Neste processo, os pesquisadores da área de Química Medicinal analisam conjuntos de dados que são caracterizados pela alta dimensionalidade e baixo número de observações. Dentro desse contexto, o presente trabalho apresenta uma abordagem computacional que visa contribuir para a análise de dados químicos e, consequentemente, a descoberta de novos medicamentos para o tratamento de doenças crônicas. As abordagens de análise exploratória de dados, utilizadas neste trabalho, combinam técnicas de redução de dimensionalidade e de agrupamento para detecção de estruturas naturais que reflitam a atividade biológica dos compostos analisados. Dentre as diversas técnicas existentes para a redução de dimensionalidade, são discutidas o escore de Fisher, a análise de componentes principais e a análise de componentes principais esparsas. Quanto aos algoritmos de aprendizado, são avaliados o k-médias, fuzzy c-médias e modelo de misturas ICA aperfeiçoado. No desenvolvimento deste trabalho foram utilizados quatro conjuntos de dados, contendo informações de substâncias bioativas, sendo que dois conjuntos foram relacionados ao tratamento da diabetes mellitus e da síndrome metabólica, o terceiro conjunto relacionado a doenças cardiovasculares e o último conjunto apresenta substâncias que podem ser utilizadas no tratamento do câncer. Nos experimentos realizados, os resultados alcançados sugerem a utilização das técnicas de redução de dimensionalidade juntamente com os algoritmos não supervisionados para a tarefa de agrupamento dos dados químicos, uma vez que nesses experimentos foi possível descrever níveis de atividade biológica dos compostos estudados. Portanto, é possível concluir que as técnicas de redução de dimensionalidade e de agrupamento podem possivelmente ser utilizadas como guias no processo de descoberta e desenvolvimento de novos compostos na área de Química Medicinal. / Researches in Medicinal Chemistry\'s area have focused on the search of methods that accelerate the process of drug discovery. Among several steps related to the process of discovery of bioactive substances there is the analysis of the relationships between chemical structure and biological activity of compounds. In this process, researchers of medicinal chemistry analyze data sets that are characterized by high dimensionality and small number of observations. Within this context, this work presents a computational approach that aims to contribute to the analysis of chemical data and, consequently, the discovery of new drugs for the treatment of chronic diseases. Approaches used in exploratory data analysis, employed in this work, combine techniques of dimensionality reduction and clustering for detecting natural structures that reflect the biological activity of the analyzed compounds. Among several existing techniques for dimensionality reduction, we have focused the Fisher\'s score, principal component analysis and sparse principal component analysis. For the clustering procedure, this study evaluated k-means, fuzzy c-means and enhanced ICA mixture model. In order to perform experiments, we used four data sets, containing information of bioactive substances. Two sets are related to the treatment of diabetes mellitus and metabolic syndrome, the third set is related to cardiovascular disease and the latter set has substances that can be used in cancer treatment. In the experiments, the obtained results suggest the use of dimensionality reduction techniques along with clustering algorithms for the task of clustering chemical data, since from these experiments, it was possible to describe different levels of biological activity of the studied compounds. Therefore, we conclude that the techniques of dimensionality reduction and clustering can be used as guides in the process of discovery and development of new compounds in the field of Medicinal Chemistry

Page generated in 0.104 seconds