1 |
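Modelado predictivo de sistemas complejos para informática molecular: desarrollo de métodos de selección y aprendizaje de características en presencia de incertidumbre / Predictive modeling of complex systems for molecular informatics: development of feature selection and feature learning methods in the presence of uncertainty
Cravero, Fiorella 13 March 2020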
En la actualidad existe una necesidad creciente de guiar el descubrimiento in silico de nuevos polímeros industriales mediante enfoques de Aprendizaje Maquinal supervisado que identifiquen correlaciones estructura-propiedad a partir de la información contenida en bases de datos de materiales, donde cada uno de estos está caracterizado mediante Descriptores Moleculares (DMs). Estas correlaciones se conocen como Modelos de Relación Cuantitativa Estructura-Actividad/Propiedad (QSAR/QSPR, por las siglas en inglés de Quantitative Structure-Activity/Property Relationship) y pueden ser empleadas para predecir propiedades de interés previo a la etapa de síntesis química, contribuyendo de este modo a acelerar el diseño de nuevos materiales y reducir sus costos de desarrollo.
El modelado QSAR/QSPR ya ha sido ampliamente empleado en Informática Molecular para el Diseño Racional de Fármacos asistido por computadoras. Sin embargo, los materiales poliméricos son significativamente más complejos que las moléculas pequeñas como las drogas, dado que están integrados por colecciones de macromoléculas compuestas por miles de cadenas que, a su vez, se forman por la unión de cientos de miles de Unidades Repetitivas Estructurales (UREs). Estas cadenas poseen diferentes pesos moleculares (o largos de cadena) y, a su vez, aparecen con distintas frecuencias dentro de cada material. Este fenómeno, conocido como polidispersión, es la principal razón de que muchas aproximaciones informáticas desarrolladas para el diseño racional de fármacos no sean directamente aplicables, ni lo suficientemente efectivas, en el ámbito de la Informática de Polímeros.
El objetivo general de esta tesis es contribuir con soluciones para distintas cuestiones relativas a la representación computacional y algoritmia que surgen durante el modelado QSPR de propiedades de polímeros polidispersos de alto peso molecular, con especial énfasis en el tratamiento del problema de selección de descriptores moleculares. Las variaciones en la frecuencia de las cadenas de diferentes largos hacen que la descripción de la estructura de un material polimérico contenga incertidumbre, en contraste con lo que sucede en la caracterización estructural típica de una molécula pequeña. No obstante esto, debido a la complejidad de modelar esta incertidumbre, la mayoría de los estudios QSAR/QSPR han utilizado hasta ahora modelos moleculares simples y univaluados, es decir, calculan los descriptores moleculares para una única instancia de peso, de entre todas las posibles cadenas que conforman un material. En particular, la casi totalidad de estos estudios usan descriptores calculados sobre una única URE, sin tener en cuenta la polidispersión. En tal sentido, esta tesis propone investigar distintas alternativas de selección y aprendizaje de características para modelado QSPR con incertidumbre, que exploren la efectividad de otras representaciones computacionales más realistas para los materiales poliméricos.
En primer lugar, se presenta una metodología híbrida que emplea tanto algoritmos de Selección de Características como de Aprendizaje de Características, a fin de evaluar la máxima capacidad predictiva que se puede alcanzar con la tradicional representación univaluada URE. En segundo lugar, se proponen nuevas representaciones univaluadas, basadas en pesos moleculares promedios, denominadas como modelos moleculares Mn y Mw, cuyas capacidades para inferir modelos QSPR son contrastadas con el modelo molecular URE.
La siguiente alternativa propuesta estudia una representación computacional trivaluada, basada en la integración de los modelos moleculares univaluados URE, Mn y Mw en una única base de datos, la cual permite capturar parcialmente el fenómeno de la polidispersión. Esta caracterización computacional logra mejorar la generalizabilidad de los modelos QSPR obtenidos durante el proceso de aprendizaje supervisado, en comparación con los inferidos mediante enfoques de representación univaluados. Sin embargo, esta nueva representación sigue sin contemplar las frecuencias de aparición de los distintos largos de cadena dentro de un material.
Por último, como contribución final de esta tesis se propone una representación computacional multivaluada, basada en el perfil polidisperso real de un material, donde cada descriptor queda caracterizado por una distribución probabilística discreta. En este contexto, las técnicas de selección de características empleadas para representaciones univaluadas ya no resultan aplicables, y surge la necesidad de contar con algoritmos que permitan operar sobre este nuevo modelo molecular. Como consecuencia de esto, se presenta el diseño e implementación de un algoritmo para selección de características multivaluadas. Este nuevo método, FS4RVDD (como sigla de su nombre en inglés Feature Selection for Random Variables with Discrete Distribution), logra un desempeño prometedor en todos los escenarios experimentales ensayados en estas investigaciones. / Nowadays, there is an increasing need to guide the in silico discovery of new industrial polymers through supervised Machine Learning approaches that identify structure-property correlations from the information contained in materials databases, where each material is characterized by Molecular Descriptors (MDs). These correlations are known as Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) models. They can be used to predict properties of interest before the chemical synthesis stage, helping to accelerate the design of new materials and reduce their development costs.
QSAR/QSPR modeling is widely used in Molecular Informatics for Computer-Aided Drug Design. However, polymeric materials are significantly more complex than small molecules such as drugs, since they are collections of macromolecules that consist of a large number of structural repetitive units (SRUs) linked together in thousands of chain-like structures. These chains have different molecular weights (or lengths) and, in turn, they appear with different frequencies within each material. This phenomenon, known as polydispersity, is the main reason why many approaches developed for rational drug design are neither directly applicable nor sufficiently effective in the field of Polymer Informatics.
The main objective of this thesis is to contribute solutions to various issues of computational representation and algorithm development that arise during the QSPR modeling of properties of high-molecular-weight polydisperse polymers, with special emphasis on the Feature Selection problem. Because chains of different lengths occur with varying frequencies, the structural description of a polymeric material contains uncertainty, in contrast with the typical structural characterization of a small molecule. However, because modeling this uncertainty is complex, most QSAR/QSPR studies to date have used simple, univalued molecular models; that is, they calculate the molecular descriptors for a single weight instance among all the possible chains that make up a material. In particular, nearly all of these studies use descriptors calculated on a single SRU, disregarding polydispersity. In this context, the present thesis investigates different Feature Selection and Feature Learning alternatives for QSPR modeling with uncertainty, exploring the effectiveness of more realistic computational representations of polymeric materials.
First, a hybrid methodology that combines Feature Selection and Feature Learning algorithms is presented in order to evaluate the maximum predictive capability that the traditional univalued representation (URE, the Spanish acronym for SRU) can achieve. Then, new univalued representations based on average molecular weights are proposed, called the Mn and Mw molecular models, and their ability to infer QSPR models is compared with that of the URE molecular model.
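As a rough illustration of the average molecular weights behind the Mn and Mw models, the sketch below computes both from a toy chain distribution. It assumes that Mn and Mw denote the standard number-average and weight-average molecular weights; the function names and the sample values are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def number_average_mw(weights: Sequence[float], counts: Sequence[float]) -> float:
    """Mn = sum(N_i * M_i) / sum(N_i), with N_i chains of weight M_i."""
    total_chains = sum(counts)
    return sum(n * m for m, n in zip(weights, counts)) / total_chains


def weight_average_mw(weights: Sequence[float], counts: Sequence[float]) -> float:
    """Mw = sum(N_i * M_i^2) / sum(N_i * M_i)."""
    total_mass = sum(n * m for m, n in zip(weights, counts))
    return sum(n * m * m for m, n in zip(weights, counts)) / total_mass


# Hypothetical toy material: three chain weights (g/mol) and their relative counts.
chain_weights = [10_000.0, 20_000.0, 30_000.0]
chain_counts = [1.0, 2.0, 1.0]

mn = number_average_mw(chain_weights, chain_counts)
mw = weight_average_mw(chain_weights, chain_counts)
dispersity = mw / mn  # equals 1.0 only when every chain has the same weight

print(f"Mn = {mn:.1f}, Mw = {mw:.1f}, Mw/Mn = {dispersity:.3f}")
```

Under this assumption, a univalued Mn or Mw model would compute descriptors for a single chain whose weight equals the corresponding average, which is why neither one carries the shape of the distribution.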
The next proposed alternative is a trivalued computational representation, based on integrating the URE, Mn, and Mw univalued molecular models into a single database, which partially captures the polydispersity inherent to polymers. This computational characterization improves the generalizability of the QSPR models obtained during supervised learning, compared with those inferred from univalued representations. However, the trivalued representation still does not account for the frequencies with which the different chain lengths occur within a material.
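One plausible way to picture the trivalued database is as a join of three descriptor tables on the polymer identifier, with each descriptor name tagged by the molecular model it came from. The sketch below is only an assumption about the layout; the table structure, the `build_trivalued` helper, the descriptor names and all values are hypothetical.

```python
from typing import Dict

# polymer id -> {descriptor name: value}
DescriptorTable = Dict[str, Dict[str, float]]


def build_trivalued(ure: DescriptorTable, mn: DescriptorTable,
                    mw: DescriptorTable) -> DescriptorTable:
    """Join three univalued tables on polymer id, suffixing each descriptor
    with the molecular model it was computed on."""
    common_ids = sorted(set(ure) & set(mn) & set(mw))
    merged: DescriptorTable = {}
    for polymer_id in common_ids:
        row: Dict[str, float] = {}
        for model_name, table in (("URE", ure), ("Mn", mn), ("Mw", mw)):
            for descriptor, value in table[polymer_id].items():
                row[f"{descriptor}_{model_name}"] = value
        merged[polymer_id] = row
    return merged


# Hypothetical descriptor values for two polymers; "P2" lacks an Mw entry.
ure_table = {"P1": {"nAtoms": 12.0}, "P2": {"nAtoms": 20.0}}
mn_table = {"P1": {"nAtoms": 2400.0}, "P2": {"nAtoms": 5000.0}}
mw_table = {"P1": {"nAtoms": 3100.0}}

trivalued = build_trivalued(ure_table, mn_table, mw_table)
print(trivalued)
```

Keeping only polymers present in all three tables is a design choice made here for simplicity; an actual pipeline might instead impute or flag the missing entries.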
Finally, this thesis contributes a multivalued computational representation based on the actual polydisperse profile of a material, in which each descriptor is characterized by a discrete probability distribution. In this setting, the Feature Selection techniques used for univalued representations are no longer applicable, and algorithms are needed that can operate on this new multivalued molecular model. To meet this need, the design and implementation of an algorithm for selecting multivalued features are presented. This new method, Feature Selection for Random Variables with Discrete Distribution (FS4RVDD), achieves promising performance in all the experimental scenarios tested in this research.
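To make the idea of a multivalued descriptor concrete, the sketch below stores one descriptor as a discrete distribution built from chain frequencies and summarizes it by its mean and variance. This is not FS4RVDD, whose details are not given in the abstract; the representation as (value, probability) pairs, the helper names and the numbers are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

# A multivalued descriptor: (descriptor value, probability) pairs.
Distribution = List[Tuple[float, float]]


def to_distribution(values: Sequence[float],
                    frequencies: Sequence[float]) -> Distribution:
    """Normalize chain frequencies into probabilities that sum to 1."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive number")
    return [(v, f / total) for v, f in zip(values, frequencies)]


def expectation(dist: Distribution) -> float:
    """Probability-weighted mean of the descriptor."""
    return sum(v * p for v, p in dist)


def variance(dist: Distribution) -> float:
    """Probability-weighted variance of the descriptor."""
    mu = expectation(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist)


# Hypothetical descriptor evaluated on three chain lengths of one material.
descriptor_values = [1.0, 2.0, 3.0]
chain_frequencies = [1.0, 2.0, 1.0]

dist = to_distribution(descriptor_values, chain_frequencies)
print(dist, expectation(dist), variance(dist))
```

A univalued model would keep only one point of this distribution, which suggests why selection methods that expect a single number per descriptor do not carry over directly.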
|
2 |
Modelagem do coeficiente de sorção do solo de poluentes orgânicos persistentes no meio ambiente / Modeling of soil sorption coefficient from persistent organic pollutants in the environment
Olguín, Carlos José Maria 17 February 2017
Submitted by Edineia Teixeira (edineia.teixeira@unioeste.br) on 2017-09-04T17:30:26Z
No. of bitstreams: 2
Carlos_Olguin2017.pdf: 2821259 bytes, checksum: 4f44c019ceff1c4613be9b0b525a188e (MD5)
license_rdf: 0 bytes, checksum: d41d8cd98f00b204e9800998ecf8427e (MD5)
Made available in DSpace on 2017-09-04T17:30:26Z (GMT).
Previous issue date: 2017-02-17
The soil sorption coefficient normalized for organic carbon content (Koc) is a physicochemical parameter used in environmental risk assessments and in determining the final fate of chemicals released into the environment. Several models to predict this parameter have been proposed based on the relationship between LogKoc and LogP. The difficulty and cost of obtaining experimental LogP values have led to the development of algorithms to calculate them. Thus, in the first paper of this thesis, several free algorithms for calculating LogP were considered, and it was concluded that the best QSPR models for predicting the soil sorption coefficient of nonionic organic compounds were obtained using the ALOGPs, KOWWIN and XLOGP3 algorithms. This study demonstrated the importance and usefulness of the statistical equivalence test used, since it allowed us to state that the models obtained from the considered algorithms are statistically equivalent. Thus, when LogP values cannot be obtained from one of the algorithms, values obtained from another can be used. It was also observed that the models presented in this study have statistical quality and predictive capacity comparable to those of more complex models recently published in the area. In addition, validating the predictions of a QSPR model on a data set that was not used to generate the model is a well-accepted practice in the area. In this context, some studies have explored the impact of different training-set sizes on the predictive capacity of the generated QSPR models, without reaching conclusive results.
Thus, the second paper showed that statistically equivalent QSPR models can be developed from not-so-large training sets, and that these models have predictive capacity similar to that of models built from a larger training set. To this end, models were generated using LogP values calculated with the ALOGPs algorithm for the full training set and for subsets of it (i.e., halves, quarters and eighths). This study, like the previous one, confirmed the importance of using the statistical equivalence test, since it was verified that, following the adopted procedures, the models obtained with subsets of the training set are statistically equivalent. / O coeficiente de sorção do solo normalizado para o conteúdo de carbono orgânico (Koc) é um parâmetro físico-químico utilizado em avaliações de risco ambiental e na determinação do destino final das substâncias químicas lançadas na natureza. Vários modelos para prever este parâmetro foram propostos com base na relação entre LogKoc e LogP. A dificuldade e o custo para a obtenção de valores experimentais de LogP levaram ao desenvolvimento de algoritmos para calculá-los. Assim, no primeiro artigo desta tese foram considerados diversos algoritmos gratuitos para cálculo de LogP, e concluiu-se que os melhores modelos QSPR para predizer o coeficiente de sorção do solo de compostos orgânicos não iónicos foram obtidos usando os algoritmos ALOGPs, KOWWIN e XLOGP3. Neste estudo, foram demonstradas a importância e a utilidade do teste de equivalência estatística utilizado, dados que nos permitiram afirmar que os modelos obtidos dos algoritmos considerados são estatisticamente equivalentes. Assim, na impossibilidade de obterem-se valores de LogP a partir de um dos algoritmos, valores obtidos por outro podem ser usados. Verificou-se ainda que os modelos apresentados neste estudo possuem qualidade estatística e capacidade de predição compatíveis à de modelos mais complexos, publicados recentemente na área.
Adicionalmente, a necessidade de se realizar a validação da predição de um modelo QSPR a partir de um conjunto de dados que não foi utilizado na geração do modelo é uma prática bem aceita na área. Nesse contexto, alguns trabalhos exploraram o impacto que diversos tamanhos de conjuntos de treinamento teriam na capacidade de predição dos modelos QSPR gerados, não chegando a resultados conclusivos. Assim, no segundo artigo desta tese, foi mostrado que, a partir de conjuntos de treinamento não tão grandes, modelos QSPR estatisticamente equivalentes podem ser desenvolvidos e que tais modelos têm capacidade de predição similar daqueles criados a partir de um conjunto de treinamento maior. Para isto, modelos foram gerados considerando valores de LogP do conjunto de treinamento total, calculados com o algoritmo ALOGPs e também com subconjuntos do mesmo (i.e., metades, quartos e oitavos). Este estudo, assim como o anterior, confirmou a importância do uso do teste de equivalência estatística utilizado nesta tese já que foi verificado que, seguindo os procedimentos adotados, os modelos obtidos com subconjuntos do conjunto de treinamento são estatisticamente equivalentes.
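The training-set-size experiment described in this abstract can be pictured with a minimal sketch: fit a linear LogKoc versus LogP model on the full training set and on its first half, quarter and eighth, then compare errors on a held-out test set. The data below are synthetic, generated from an arbitrary slope, intercept and noise level; they are not the thesis data. The sketch also stops short of the statistical equivalence test, which the abstract does not specify.

```python
import random
from typing import Dict, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(slope: float, intercept: float,
         x: Sequence[float], y: Sequence[float]) -> float:
    """Root-mean-square prediction error on (x, y)."""
    errors = [(slope * xi + intercept - yi) ** 2 for xi, yi in zip(x, y)]
    return (sum(errors) / len(errors)) ** 0.5


# Synthetic data only: slope, intercept and noise are arbitrary assumptions.
rng = random.Random(0)
TRUE_SLOPE, TRUE_INTERCEPT, NOISE_SD = 0.8, 0.5, 0.3
log_p = [rng.uniform(0.0, 7.0) for _ in range(160)]
log_koc = [TRUE_SLOPE * x + TRUE_INTERCEPT + rng.gauss(0.0, NOISE_SD) for x in log_p]

train_x, train_y = log_p[:128], log_koc[:128]
test_x, test_y = log_p[128:], log_koc[128:]

# Fit on the full training set, then on its first half, quarter and eighth.
results: Dict[int, Tuple[float, float, float]] = {}
for divisor in (1, 2, 4, 8):
    size = len(train_x) // divisor
    slope, intercept = fit_line(train_x[:size], train_y[:size])
    results[size] = (slope, intercept, rmse(slope, intercept, test_x, test_y))

for size, (slope, intercept, error) in results.items():
    print(f"n={size:3d}  slope={slope:.3f}  intercept={intercept:.3f}  test RMSE={error:.3f}")
```

Evaluating every model on the same held-out test set keeps the comparison fair; a formal equivalence test would then be applied to the fitted models rather than judging the printed numbers by eye.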
|