1

Selecting the best model for predicting a term deposit product take-up in banking

Hlongwane, Rivalani Willie 19 February 2019 (has links)
In this study, we use data mining techniques to build predictive models on data collected by a Portuguese bank during a term savings product campaign conducted between May 2008 and November 2010. The data are imbalanced, with an observed take-up rate of 11.27%. Ling et al. (1998) showed that predictive models built on imbalanced data tend to yield low sensitivity and high specificity, i.e., a low true positive rate and a high true negative rate; our study confirms this finding. We therefore use three sampling techniques, namely under-sampling, over-sampling, and the Synthetic Minority Over-sampling Technique (SMOTE), to balance the data, yielding three additional datasets for modelling. On these datasets we build four predictive models: random forest, multivariate adaptive regression splines, neural network, and support vector machine, and we compare them on their ability to identify customers likely to take up a term savings product. As part of the model building process, we investigate parameter permutations for each modelling technique to tune the models; we find that this helps to build robust models. We assess predictive performance using the receiver operating characteristic (ROC) curve, confusion matrix, Gini coefficient, kappa, sensitivity, specificity, and lift and gains charts. A multivariate adaptive regression splines model built on over-sampled data is found to be the best model for predicting term savings product take-up.
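The class-balancing step described above can be illustrated with a minimal random over-sampling sketch. The data, the 11% take-up rate, and the helper function are synthetic stand-ins, not the campaign dataset or the thesis's code; under-sampling and SMOTE are not shown.

```python
import random

def oversample(features, labels, minority=1, seed=42):
    """Randomly duplicate minority-class rows until both classes
    have equal counts (simple random over-sampling)."""
    rng = random.Random(seed)
    minority_rows = [(x, y) for x, y in zip(features, labels) if y == minority]
    majority_rows = [(x, y) for x, y in zip(features, labels) if y != minority]
    balanced = majority_rows + minority_rows
    # keep sampling minority rows (with replacement) until balanced
    while sum(1 for _, y in balanced if y == minority) < len(majority_rows):
        balanced.append(rng.choice(minority_rows))
    rng.shuffle(balanced)
    xs, ys = zip(*balanced)
    return list(xs), list(ys)

# toy data with an ~11% take-up rate, similar to the campaign data
X = [[i] for i in range(100)]
y = [1 if i < 11 else 0 for i in range(100)]
Xb, yb = oversample(X, y)
print(sum(yb), len(yb))  # 89 178 -- minority now matches the 89 majority rows
```

The balanced dataset can then be fed to any of the four model families compared in the study.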
2

Methods and Metrics to Measure and Predict the Social Impact of Engineered Products

Stevenson, Phillip Douglas 01 August 2018 (has links)
More than ever before, engineers are creating products for developing countries. One purpose of these products is to improve the consumer's quality of life. Currently, there is no established method for measuring the social impact of these types of products. As a result, engineers have used their own metrics to assess their product's impact, if they assess it at all. Common metrics include units sold and revenue, which measure the financial success of a product without recognizing its social successes or failures. In this thesis I introduce a potential metric, the Product Impact Metric (PIM), which quantifies the impact a product has on impoverished individuals, especially those living in developing countries. It measures social impact broadly in five dimensions: health, education, standard of living, employment quality, and security. By measuring impact multidimensionally, the PIM captures both direct impacts (related to the product's main functions) and indirect impacts (not related to the product's main functions), thereby revealing more about the product's total impact than other metrics do. These indirect impacts can have a larger influence on the consumer than the direct impacts, and they are often left unmeasured. The PIM is calculated from 18 simple field measurements of the consumer. It can be used to predict social impact (using personas that represent real individuals) or to measure social impact (using specific data from products introduced into the market). Despite its challenges, measuring the social impact of a program or policy is common practice in the social sciences. This measurement is made through social impact indicators, which are used to measure, predict, and improve potential social impacts. While there are clear benefits to predicting the social impact of a product, it is unclear how engineers should select social impact indicators and build predictive models.
This thesis introduces a method for selecting social impact indicators and creating predictive social impact models that can help engineers predict and improve the social impact of their products. First, the engineer identifies the product's users, objectives, and requirements. Then, the social impact categories related to the product are determined. From each of these categories, the engineer selects several social impact indicators. Finally, models are created for each indicator to predict how the product will change it. The impact categories and indicators can be translated into product requirements and performance measures usable in product development processes. This method of predicting social impact is demonstrated on the proposed expansion of the U.S.-Mexico border wall.
3

Learning predictive models from graph data using pattern mining

Karunaratne, Thashmee M. January 2014 (has links)
Learning from graphs has become a popular research area due to the ubiquity of graph data representing web pages, molecules, social networks, protein interaction networks, etc. However, standard graph learning approaches are often challenged by the computational cost of the learning process, due to the richness of the representation. Attempts to improve their efficiency are often associated with the risk of degrading the performance of the predictive models, creating trade-offs between the efficiency and effectiveness of the learning. Such a situation is analogous to an optimization problem with two objectives, efficiency and effectiveness, where improving one objective without making the other worse off is a better solution, called a Pareto improvement. This thesis investigates how to improve the efficiency and effectiveness of learning from graph data using pattern mining methods. Two objectives are set: one concerns how to improve the efficiency of pattern mining without reducing the predictive performance of the learned models, and the other concerns how to improve predictive performance without increasing the complexity of pattern mining. The research method mainly follows a design science approach, including the development and evaluation of artifacts. The contributions of this thesis include a data representation language that can be characterized as a form in between sequences and itemsets, where the graph information is embedded within items. Several studies, each of which looks for Pareto improvements in efficiency and effectiveness, are conducted using sets of small graphs. Summarizing the findings, some of the proposed methods, namely maximal frequent itemset mining and constraint-based itemset mining, result in dramatically increased efficiency of learning without decreasing the predictive performance of the resulting models.
It is also shown that additional background knowledge can be used to enhance the performance of the predictive models, without increasing the complexity of the graphs.
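The maximal frequent itemset mining behind the reported efficiency gains can be sketched as a toy Apriori-style enumeration. The transactions and support threshold below are hypothetical illustrations, not the thesis's representation language or implementation.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets whose absolute support meets min_support."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= min_support:
                frequent[cand] = count
                found = True
        if not found:          # Apriori: no frequent set of this size,
            break              # so no larger one can be frequent either
    return frequent

def maximal(frequent):
    """Keep only itemsets with no frequent proper superset."""
    keys = list(frequent)
    return [s for s in keys if not any(set(s) < set(t) for t in keys)]

# toy "graph fingerprints": each transaction's items embed graph information
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
freq = frequent_itemsets(T, min_support=2)
print(maximal(freq))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Reporting only maximal itemsets shrinks the feature set handed to the learner, which is the efficiency lever the thesis exploits.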
4

Interprétation du potentiel redox et évaluation de la mobilité des oxyanions contaminants (As, Sb,Cr) au cours de cycles redox successifs / Redox potential and mobility of contaminant oxyanions (As, Sb, Cr) in argillaceous rock subjected to oxic and anoxic cycles

Markelova, Ekaterina 14 December 2016 (has links)
This thesis demonstrates that a systematic experimental approach of increasing complexity allows the meaning of the redox potential (EH) to be reassessed, and provides an updated interpretation of its value in complex assemblages of mineral matrices, microbial consortia, nutrients, and contaminants under dynamic, redox-oscillating conditions. To study the usefulness of EH measurements in water-saturated environmental systems, a full redox cascade from +500 to -350 mV (pH ~7.4) was reproduced in the laboratory. The experiments revealed that the conventional Pt redox electrode responds to physical, chemical, and microbial processes to different extents depending on oxygenation and on the presence of a redox buffer. EH measurements in argillaceous matrices depleted in redox buffers, such as the electroactive Fe3+/Fe2+ couple, are thus shown to have limited usefulness. In such environments, the abundant redox-sensitive but non-electroactive couples, such as O2/H2O, CrO42-/Cr(OH)3, NO3-/NO2-/NH4+, Sb(OH)6-/Sb2O3, and HAsO42-/H3AsO3, do not affect the measured EH. To quantify the effect of oxidizing perturbations on the mobility of oxyanions in the argillaceous matrix, I performed batch experiments under controlled redox oscillations. Successive cycles of oxic and anoxic conditions were imposed on argillaceous suspensions amended with a mixture of oxidized As(V), Sb(V), Cr(VI), and N(V). Oxyanion mobility was investigated under sterile conditions, with the addition of labile organic carbon (ethanol), and with the addition of a soil microbial inoculum. Speciation analyses revealed irreversible reduction reactions with and without ethanol additions. Freshly reduced As(III), Sb(III), Cr(III), and N(III) were not re-oxidized during subsequent oxic periods, demonstrating non-oscillating behavior.
Microbially induced reduction transformations decreased aqueous concentrations of Sb and Cr via precipitation and removed N via volatilization, while preserving As in solution. Depending on the microbial diversity, altered by the addition of soil inoculum, two types of contaminant interplay are characterized: inhibitory and non-inhibitory reductions. These data, representative of the saturated subsurface environment (subsoil, > 20 m), are further compared to oxyanion mobility in the near-surface environment (topsoil, < 0.15 m). The key differences between the topsoil and subsoil systems lie in the fraction of Fe-, Mn-, and Al-oxyhydroxide minerals, the microbial diversity, pCO2, and the range of EH values developed during redox cycles. For example, the EH range of over 900 mV (from +500 to -300 mV) in the topsoil suspension contrasts with the EH range of 100 mV (from +350 to +250 mV) in the subsoil suspension. Furthermore, in the topsoil suspension, strong redox cycling of Fe and Mn coincides with the oscillating mobility of As and Sb. This correlation suggests a crucial role for oxyhydroxide minerals, acting not only as major sorbents but also as catalysts for the oxidation reactions that ultimately control the reversibility of contaminant sequestration. The argillaceous matrix, depleted in oxyhydroxide minerals, is therefore shown to be a suitable environment for contaminant retention, as it can withstand periodic redox oscillations without releasing contaminants back to the aqueous phase on the experimental time scale.
5

Evaluation of Productivity, Consumption, and Uncontrolled Total Particulate Matter Emission Factors of Recyclable Abrasives

Sangameswaran, Sivaramakrishnan 22 May 2006 (has links)
Dry abrasive blasting is a surface preparation operation commonly used by many process industries to clean metallic surfaces and achieve surface finishes suitable for subsequent adhesion. The abrasives used in this process can be recyclable or expendable. This study was undertaken to evaluate the performance of three recyclable abrasives (garnet, barshot, and steel grit/shot) in terms of productivity (area cleaned per unit time), consumption (amount of abrasive used per unit area cleaned), and uncontrolled total particulate matter (TPM) emission factors (mass of pollutant emitted per unit area cleaned and per unit mass of abrasive consumed). Though there have been various past attempts to evaluate the performance of these abrasives, there has not been a streamlined approach to evaluating these parameters over the commonly used range of process conditions, or to identifying and modelling the influences of key process variables on these performance parameters. The first step in this study was to evaluate the performance of the three abrasives in blasting painted steel panels under enclosed blasting conditions, using USEPA-recommended protocols. The second step was to model the influences of blast pressure and abrasive feed rate, the two most critical parameters, on productivity, consumption, and emission factors. Two- and three-dimensional models were obtained using multiple linear regression to express productivity, consumption, and TPM emission factors in terms of blast pressure and abrasive feed rate. Barshot was found to have the highest productivity overall, and steel grit/shot demonstrated the lowest emission potential at almost all of the tested pressure and feed rate conditions. The data will help fill gaps in the literature currently available on dry abrasive blasting performance, and the models obtained will help industry, the research community, and regulatory agencies make accurate estimates of the performance parameters.
Estimating productivity and consumption will help industries identify best management practices by optimizing process conditions to achieve high productivity and low consumption rates. Determining emission factors will help reduce emissions to the atmosphere by choosing process conditions corresponding to minimum emissions. Once optimized, the performance parameters can reduce material, labor, energy, emission, and disposal costs, lower resource utilization, and hence reduce the overall life cycle costs of the dry abrasive blasting process. The developed models will help industries make environmentally preferable purchases, thereby promoting source reduction. PM emissions estimated using the models presented here will aid studies of the health risks associated with inhalation of atmospheric PM.
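The two-variable multiple linear regression described above can be sketched via the normal equations. The pressures, feed rates, and productivity values below are synthetic illustrations, not the study's measurements.

```python
def fit_plane(pressures, feeds, responses):
    """Least-squares fit of y = b0 + b1*P + b2*F via the normal
    equations, solved by Gauss-Jordan elimination."""
    rows = [[1.0, p, f] for p, f in zip(pressures, feeds)]
    # build X^T X and X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, responses)) for i in range(3)]
    # solve the 3x3 system with partial pivoting
    a = [xtx[i] + [xty[i]] for i in range(3)]
    for c in range(3):
        pivot = max(range(c, 3), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(3):
            if r != c:
                m = a[r][c] / a[c][c]
                a[r] = [v - m * w for v, w in zip(a[r], a[c])]
    return [a[i][3] / a[i][i] for i in range(3)]

# synthetic data: productivity rising with blast pressure (P) and feed rate (F)
P = [80, 90, 100, 110, 80, 100]
F = [300, 300, 400, 400, 350, 350]
y = [2 + 0.05 * p + 0.01 * f for p, f in zip(P, F)]
b0, b1, b2 = fit_plane(P, F, y)  # recovers the coefficients 2, 0.05, 0.01
```

The same fit, repeated for consumption and TPM emission factors, gives the family of two-dimensional models the study reports.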
6

Účetní závěrka a finanční analýza společnosti AB JET spol. s r.o. / Statement of Balances and Financial Analysis of the Company AB JET s.r.o.

Kroupová, Michaela January 2009 (has links)
This thesis is dedicated to financial statements ("statement of balances") and financial analysis. The first part describes financial statements in the Czech Republic: their legal regulation, the basic requirements for disclosure, and the elements of the financial statements. The second part describes elementary methods of financial analysis, such as the analysis of absolute indicators, the analysis of differential indicators, and the analysis of systems of indicators. In the third part, this theoretical knowledge of financial analysis is applied to assess the financial health of the company AB JET s.r.o.
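The elementary indicator analyses mentioned can be illustrated with a short sketch: one differential indicator (net working capital) and two ratio indicators. All figures are hypothetical, not taken from AB JET's actual statements.

```python
def financial_snapshot(current_assets, current_liabilities, equity, net_income):
    """Compute one differential indicator and two ratio indicators
    used in elementary financial analysis."""
    return {
        # differential indicator: liquidity cushion in absolute terms
        "net_working_capital": current_assets - current_liabilities,
        # ratio indicators
        "current_ratio": current_assets / current_liabilities,
        "return_on_equity": net_income / equity,
    }

snap = financial_snapshot(current_assets=5_200, current_liabilities=2_600,
                          equity=8_000, net_income=1_200)
print(snap)  # {'net_working_capital': 2600, 'current_ratio': 2.0, 'return_on_equity': 0.15}
```

Tracking such indicators across several accounting periods is what the thesis's third part does to judge the company's financial health.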
7

Modelos preditivos de conforto térmico: quantificação de relações entre variáveis microclimáticas e de sensação térmica para avaliação e projeto de espaços abertos / Thermal comfort predictive models: quantification of relationships between microclimatic and thermal sensation variables for outdoor spaces assessment and design

Monteiro, Leonardo Marques 22 August 2008 (has links)
The subject of this research is the relationship between urban microclimatic variables and thermal sensation variables. The hypothesis is that outdoor thermal comfort prediction requires models with calibration and validation specific to a given population adapted to certain climatic conditions. The objective is to propose a method to quantify the correlations between urban microclimatic variables (temperature, humidity, air velocity, and thermal radiation) and subjective variables (thermal sensation perception and preference), mediated by individual variables (clothing insulation and metabolic rate), allowing prediction of the adequacy of an outdoor thermal environment for a population adapted to a given climatic condition (in this case, the city of Sao Paulo). The method is inductive and experimental (field surveys of microclimatic, individual, and subjective variables), supported by a comparative deductive computational method (predictive simulation). The field research and predictive simulation results support two propositions: (a) the calibration of interpretative indexes for existing predictive models, by means of an iterative method; and (b) a new predictive model, by means of numeric and analytic methods. The final products of the research are: (I) a procedure for the empirical estimation of microclimatic, individual, and subjective variables; (II) a comparative chart of predictive models; (III) model calibrations for the case under study; (IV) a calibration method to be applied in other cases; (V) a new predictive model based on the case under study; (VI) a predictive modelling method to be applied in other cases; and (VII) a critical analysis and synthesis of the case under study and of the developed methods.
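The core calibration idea, quantifying how thermal sensation votes vary with a microclimatic variable, can be sketched as a one-variable least-squares fit. The data below are illustrative (a 7-point sensation scale against air temperature); the thesis's models use four microclimatic variables plus individual variables, fitted to real field data.

```python
def calibrate_linear(temps, votes):
    """Ordinary least squares of thermal sensation vote on air
    temperature: vote = a + b * T."""
    n = len(temps)
    mt = sum(temps) / n
    mv = sum(votes) / n
    b = (sum((t - mt) * (v - mv) for t, v in zip(temps, votes))
         / sum((t - mt) ** 2 for t in temps))
    a = mv - b * mt
    return a, b

temps = [18, 22, 26, 30, 34]     # air temperature, degrees C
votes = [-2, -1, 0, 1, 2]        # 7-point thermal sensation scale
a, b = calibrate_linear(temps, votes)
predicted = a + b * 26           # neutral sensation at 26 C in this toy data
```

Iterating such fits over index cut-points, population subgroups, and additional variables is the flavor of the calibration procedure the thesis proposes.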
8

Otimização de parâmetros de interação do modelo UNIFAC-VISCO de misturas de interesse para a indústria de óleos essenciais / Optimization of interaction parameters for UNIFAC-VISCO model of mixtures interesting to essential oil industries

Pinto, Camila Nardi 27 February 2015 (has links)
Determining the physical properties of essential oils is fundamental for their application in the food industry and for equipment design. The large number of variables involved in the deterpenation process, such as temperature, pressure, and composition, makes the use of predictive viscosity models necessary. This study aimed to obtain parameters for the UNIFAC-VISCO predictive viscosity model using the gradient descent optimization method, based on viscosity data of model systems representing the phases that can form in liquid-liquid extraction deterpenation processes of bergamot, lemon, and mint essential oils, using aqueous ethanol as the solvent in different compositions at 25 °C. The work was divided into two configurations: in the first, the interaction parameters previously reported in the literature were kept fixed; in the second, all interaction parameters were adjusted. The model and the gradient descent method were implemented in MATLAB®. The optimization algorithm was run 10 times for each configuration, starting from different matrices of initial interaction parameters obtained by the Monte Carlo method. The results were compared with the study by Florido et al. (2014), which used a genetic algorithm as the optimization method.
The first configuration yielded a mean relative deviation (DMR) of 1.366 and the second a DMR of 1.042. The gradient descent method outperformed the genetic algorithm for the first configuration (DMR 1.70), while for the second configuration the genetic algorithm obtained the better result (DMR 0.68). The predictive ability of the UNIFAC-VISCO model was evaluated on a eucalyptus essential oil system using the determined parameters, yielding DMRs of 17.191 and 3.711 for the first and second configurations, respectively. These DMR values were higher than those found by Florido et al. (2014) (3.56 and 1.83 for the first and second configurations, respectively). The parameters contributing most to the DMR are CH-CH3 and OH-H2O for the first and second configurations, respectively. The parameters involving the C group did not influence the DMR and may be excluded from future analyses.
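The gradient descent fitting of interaction parameters can be sketched with a toy one-parameter viscosity mixing rule. The mixing rule, data, and learning rate are illustrative assumptions, not the UNIFAC-VISCO equations; the mean squared relative deviation is used as a smooth surrogate objective, with the DMR reported afterwards.

```python
import math

def relative_errors(param, data):
    """Relative deviations of a toy one-parameter log-linear mixing rule
    (a stand-in for the actual UNIFAC-VISCO group contributions)."""
    errs = []
    for x1, eta1, eta2, eta_exp in data:
        ln_eta = (x1 * math.log(eta1) + (1 - x1) * math.log(eta2)
                  + param * x1 * (1 - x1))
        errs.append((math.exp(ln_eta) - eta_exp) / eta_exp)
    return errs

def dmr(param, data):
    """Mean relative deviation (%), the fit statistic used in the study."""
    errs = relative_errors(param, data)
    return 100.0 * sum(abs(e) for e in errs) / len(errs)

def gradient_descent(data, param=0.0, lr=5.0, steps=2000, h=1e-6):
    """Minimize the mean squared relative deviation by descending a
    central finite-difference gradient."""
    def obj(p):
        errs = relative_errors(p, data)
        return sum(e * e for e in errs) / len(errs)
    for _ in range(steps):
        grad = (obj(param + h) - obj(param - h)) / (2 * h)
        param -= lr * grad
    return param

# synthetic "measurements" generated with interaction parameter 0.8
TRUE = 0.8
data = [(x1, 1.2, 18.0,
         math.exp(x1 * math.log(1.2) + (1 - x1) * math.log(18.0)
                  + TRUE * x1 * (1 - x1)))
        for x1 in (0.2, 0.4, 0.6, 0.8)]
fitted = gradient_descent(data)  # recovers approximately 0.8
```

The thesis's setting differs in scale (a matrix of group-interaction parameters, multiple restarts from Monte Carlo initializations), but the descent loop has the same shape.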
9

Estratégias para tratamento de variáveis com dados faltantes durante o desenvolvimento de modelos preditivos / Strategies for treatment of variables with missing data during the development of predictive models

Assunção, Fernando 09 May 2012 (has links)
Predictive models have been increasingly used by the market to assist companies with risk mitigation, portfolio growth, customer retention, and fraud prevention, among other goals. During model development, however, it is common for some of the predictive variables to have unfilled (missing) values, making it necessary to adopt a procedure to treat these variables. Given this scenario, the aim of this study is to discuss methodologies for dealing with missing data in predictive models, encouraging the use of some that are already known in academia but not yet used by the market. This work describes seven methodologies, all of which were submitted to an empirical application using a Credit Score dataset. Seven models were developed on this dataset (one for each methodology described), and their results were evaluated and compared through performance measures widely used by the market (KS, Gini, ROC curve, and approval curve). In this application, the best-performing techniques were the one that treats missing data as a separate category (a technique already used by the market) and the methodology that groups missing data into the conceptually most similar category. The worst-performing methodology was the one that simply discards the variable with missing data, another procedure commonly seen in the market.
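The best-performing strategy, treating missing values as a category of their own, can be sketched as follows. The categorical variable is hypothetical; a real Credit Score dataset would have many such predictors.

```python
def encode_with_missing_category(values):
    """One-hot encode a categorical predictor, treating missing (None)
    as its own category rather than imputing or dropping it."""
    categories = sorted({v for v in values if v is not None})
    categories.append("MISSING")
    rows = []
    for v in values:
        label = "MISSING" if v is None else v
        rows.append([1 if label == c else 0 for c in categories])
    return categories, rows

# hypothetical housing-status predictor with missing entries
values = ["rent", "own", None, "own", None]
cats, X = encode_with_missing_category(values)
print(cats)  # ['own', 'rent', 'MISSING']
```

Because "missingness" gets its own coefficient in the downstream model, any predictive signal in the fact that a value is absent is retained, which is why this strategy performed well in the study.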
10

Comparação da performance de algoritmos de machine learning para a análise preditiva em saúde pública e medicina / Comparison of machine learning algorithms performance in predictive analyzes in public health and medicine

Santos, Hellen Geremias dos 28 September 2018 (has links)
Predictive models estimate the risk of health-related events or conditions and can serve as auxiliary tools in decision-making by health managers and professionals. Machine learning (ML) algorithms, in turn, have the potential to identify complex, non-linear relationships in the data, with positive consequences for the predictive performance of these models. This research aimed to apply supervised ML techniques and compare their performance in classification and regression problems for predicting outcomes of interest to public health and medicine. The results and discussion are organized into three scientific articles. The first presents a tutorial on the use of ML in health research, using as an example the prediction of death within 5 years (outcome frequency 15%; n=395) among elderly participants of the "Saúde, Bem-estar e Envelhecimento" study (n=2,677), based on variables describing their demographic, socioeconomic, and health profiles. In the learning step, five algorithms were applied: logistic regression with and without penalization, neural networks, gradient boosted trees, and random forest, with hyperparameters optimized by 10-fold cross-validation (CV). All models achieved an area under the receiver operating characteristic (ROC) curve (AUC) above 0.70. For those with the highest AUC (neural networks and logistic regression with and without penalization), measures of the quality of the predicted probabilities were evaluated and revealed poor calibration. The second article aimed to predict the risk of quality-adjusted life within 30 days (outcome frequency 44.7%; n=347) in cancer patients admitted to an intensive care unit (ICU) (n=777), using characteristics obtained at ICU admission.
Six algorithms (logistic regression with and without penalization, neural networks, a single decision tree, gradient boosted trees, and random forest) were used together with nested CV to estimate hyperparameters and assess predictive performance. All algorithms except the single decision tree showed satisfactory discrimination (AUC > 0.80) and calibration. In the third article, socioeconomic and demographic characteristics were used to predict life expectancy at birth in Brazilian municipalities with more than 10,000 inhabitants (n=3,052). Nested CV and the Super Learner (SL) algorithm were used to fit the predictive model, with performance assessed by the mean squared error (MSE). The SL performed satisfactorily (MSE=0.17), and its vector of predicted values was used to identify overachievers (municipalities with life expectancy above the predicted value) and underachievers (municipalities with life expectancy below it), whose health characteristics were then compared, revealing better primary care indicators among overachievers and better secondary care indicators among underachievers. Techniques for building and evaluating predictive models are constantly evolving, and there is little theoretical justification for preferring one algorithm over another. In this thesis, no substantial differences were observed in the predictive performance of the algorithms applied to the classification and regression problems analyzed. It is hoped that greater data availability will encourage the use of more flexible ML algorithms in future health research.
Machine learning (ML) algorithms have the potential to identify complex, non-linear relationships in data, with positive implications for the predictive performance of these models. The present research aimed to apply supervised ML techniques and compare their performance in classification and regression problems, predicting outcomes of interest to public health and medicine. Results and discussion are organized into three articles. The first presents a tutorial on the use of ML in health research, using as an example the prediction of death within 5 years (outcome frequency = 15%; n = 395) among elderly participants of the "Saúde, Bem-estar e Envelhecimento" study (n = 2,677), based on variables describing their demographic, socioeconomic, and health characteristics. In the learning step, five algorithms were applied: logistic regression with and without regularization, neural networks, gradient boosted trees, and random forest, whose hyperparameters were optimized by 10-fold cross-validation (CV). The area under the receiver operating characteristic (AUROC) curve was greater than 0.70 for all models. For those with the highest AUROC (neural networks and logistic regression with and without regularization), the quality of the predicted probabilities was assessed and showed poor calibration. The second article aimed to predict the risk of quality-adjusted survival of up to 30 days (outcome frequency = 44.7%; n = 347) in cancer patients admitted to an Intensive Care Unit (ICU) (n = 777), using patient characteristics obtained at ICU admission. Six algorithms (logistic regression with and without regularization, neural networks, basic decision trees, gradient boosted trees, and random forest) were used with nested CV to estimate hyperparameter values and evaluate predictive performance. All algorithms, with the exception of basic decision trees, presented acceptable discrimination (AUROC > 0.80) and calibration. 
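The tuning-and-evaluation scheme described above (an inner search over hyperparameters, an outer cross-validation loop scoring by AUROC) can be sketched with scikit-learn. This is an illustrative sketch only: the synthetic dataset, the random forest, and the parameter grid are placeholders, not the data or settings used in the thesis.

```python
# Sketch of nested cross-validation: an inner grid search tunes
# hyperparameters; an outer 10-fold loop scores the tuned model by AUROC
# on folds the inner search never saw. All settings are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic imbalanced binary outcome (~15% positives, echoing article 1).
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.85], random_state=0)

inner = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, 5, None]},  # hypothetical grid
    cv=5, scoring="roc_auc",
)
# Outer 10-fold CV: each fold refits the entire inner search from scratch,
# so the reported AUROC is not biased by the hyperparameter selection.
auroc = cross_val_score(inner, X, y, cv=10, scoring="roc_auc")
print(f"mean AUROC: {auroc.mean():.2f}")
```

The key design point is that hyperparameter selection happens entirely inside each outer training fold; a single (non-nested) CV reused for both tuning and evaluation would overstate performance.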
For the third article, socioeconomic and demographic characteristics were used to predict life expectancy at birth for Brazilian municipalities with more than 10,000 inhabitants (n = 3,052). Nested CV and the Super Learner (SL) algorithm were used to fit the predictive model, and performance was evaluated with the mean squared error (MSE). The SL performed well (MSE = 0.17), and its vector of predicted values was used to identify underachievers and overachievers (i.e., municipalities with worse and better observed life expectancy than predicted, respectively). Comparing health characteristics between the two groups revealed that overachievers performed better on primary health care indicators, while underachievers fared better on secondary health care indicators. Techniques for constructing and evaluating predictive models are constantly evolving, and there is scarce theoretical justification for preferring one algorithm over another. In this thesis, no substantial differences were observed in the predictive performance of the algorithms applied to the classification and regression problems analyzed. It is expected that increased data availability will encourage the use of more flexible ML algorithms in future health research.
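The overachiever/underachiever classification in the third article rests on the sign of the residual between observed and predicted life expectancy. A minimal sketch, using made-up figures for five hypothetical municipalities rather than the municipal data or SL predictions from the thesis:

```python
import numpy as np

# Hypothetical observed and model-predicted life expectancies (years);
# residual = observed - predicted. A positive residual marks an
# overachiever, a negative one an underachiever.
observed  = np.array([74.2, 71.8, 76.5, 69.9, 73.0])
predicted = np.array([73.0, 72.5, 75.0, 71.2, 73.0])
residual = observed - predicted

overachievers  = np.where(residual > 0)[0]   # better than predicted
underachievers = np.where(residual < 0)[0]   # worse than predicted
mse = float(np.mean(residual ** 2))          # evaluation metric used above

print(list(overachievers), list(underachievers), round(mse, 2))
```

Municipalities with a residual of exactly zero fall in neither group; in practice a tolerance band around zero could be used so that near-perfect predictions are not over-interpreted.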
