11 |
Detecting Fraud in Affiliate Marketing: Comparative Analysis of Supervised Machine Learning Algorithms. Ahlqvist, Oskar. January 2023 (has links)
Affiliate marketing has become a rapidly growing part of the digital marketing sector. However, fraud in affiliate marketing poses a serious threat to the trust and financial stability of the involved parties. This thesis investigates the performance of three supervised machine learning algorithms - random forest, logistic regression, and support vector machine - in detecting fraud in affiliate marketing. The objective is to answer the following main research question through two sub-questions: How much can Random Forest, Logistic Regression, and Support Vector Machine contribute to the detection of fraud in affiliate marketing? 1. How can the models be compared in an experiment? 2. How can they be optimized and applied within an affiliate marketing framework? To answer these questions, a dataset of transaction logs is analyzed in collaboration with an affiliate network company. The machine learning experiment employs k-fold cross-validation and the Area Under the ROC Curve (AUC-ROC) performance metric to evaluate how well the classifiers distinguish fraudulent from non-fraudulent transactions. The results indicate that the random forest classifier performs best among the models, achieving the highest mean AUC of 0.7172. Furthermore, feature importance analysis demonstrates that each feature category had a different impact on the performance of the models. The models were found to compute feature importance differently, meaning that some features had greater influence on specific models. By fine-tuning and optimizing the hyperparameters of each model, it is possible to enhance their performance. Despite certain limitations, such as time constraints, data availability, and security restrictions, this study highlights the potential of supervised machine learning algorithms.
Random forest in particular showed how it could be used to improve fraud detection capabilities in affiliate marketing. The insights contribute to closing the knowledge gap in comparing the effectiveness of various classification methods and their practical applications for fraud detection.
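The evaluation protocol the abstract describes - k-fold cross-validation scored with AUC-ROC - can be sketched in pure Python. This is an illustrative toy, not the thesis's code; the labels and scores below are hypothetical, and the AUC is computed with the rank-based (Mann-Whitney) formulation:

```python
import random

def auc_roc(labels, scores):
    """Rank-based AUC: probability a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k, seed=0):
    """Shuffle indices and yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Toy check: a scorer that ranks every fraudulent transaction above every
# legitimate one attains a perfect AUC of 1.0.
labels = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.8, 0.9]
print(auc_roc(labels, scores))  # 1.0
```

In practice each fold would train a classifier on `train` and score `test`, and the per-fold AUCs would be averaged, which is how a mean AUC such as 0.7172 arises.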
|
12 |
[pt] SEGMENTAÇÃO E O MODELO RFM NO VAREJO BRASILEIRO: UMA ANÁLISE COM BASES DE DADOS TRANSACIONAIS DO VAREJO DE VESTUÁRIO / [en] THE RFM MODEL: THE IMPACT OF DATA SCIENCE ON MODEL APPLICABILITY DEVELOPMENT, STRATEGIES AND APPLICATIONS IN THE BRAZILIAN RETAIL MARKET. Ana Clara Aragao Fernandes. 21 November 2022 (has links)
[en] The Covid-19 pandemic has changed consumer behavior in retail worldwide.
This work presents a longitudinal analysis of consumer behavior between 2018 and
2021, thus making it possible to compare consumer behavior before and after the
covid-19 pandemic in a Brazilian retail store. To perform this analysis, the RFM
model is applied using artificial intelligence methods to analyze large volumes of
transactional data in order to classify customers according to their consumption
behaviors. For the case presented, 5 distinct and very useful consumer segments
were identified for the company's CRM management.
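The RFM model scores each customer on Recency (days since last purchase), Frequency (number of purchases), and Monetary value (total spend); segments are then formed on top of these scores. A minimal pure-Python sketch of the aggregation step, with a hypothetical transaction log (customer, date, amount):

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount).
transactions = [
    ("ana",  date(2021, 11, 2), 120.0),
    ("ana",  date(2021, 12, 20), 80.0),
    ("bia",  date(2021, 6, 15), 300.0),
    ("caio", date(2021, 12, 28), 40.0),
    ("caio", date(2021, 12, 30), 55.0),
    ("caio", date(2021, 11, 11), 70.0),
]
today = date(2022, 1, 1)

def rfm(transactions, today):
    """Aggregate Recency (days since last purchase), Frequency, Monetary per customer."""
    out = {}
    for cust, day, amount in transactions:
        last, f, m = out.get(cust, (None, 0, 0.0))
        last = day if last is None else max(last, day)
        out[cust] = (last, f + 1, m + amount)
    return {c: ((today - last).days, f, m) for c, (last, f, m) in out.items()}

print(rfm(transactions, today))
# e.g. "caio" -> (2, 3, 165.0): bought 2 days ago, 3 purchases, 165.0 spent
```

In the thesis these per-customer vectors feed a clustering step (an AI-based segmentation), which for the case studied yielded 5 distinct consumer segments.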
|
13 |
Semi-Supervised Classification Using Gaussian Processes. Patel, Amrish. 01 1900 (has links)
Gaussian Processes (GPs) are promising Bayesian methods for classification and regression problems. They have also been used for semi-supervised classification tasks. In this thesis, we propose new algorithms for solving the semi-supervised binary classification problem using GP regression (GPR) models. The algorithms are closely related to semi-supervised classification based on support vector regression (SVR) and maximum margin clustering. The proposed algorithms are simple and easy to implement. Also, the hyper-parameters are estimated without resorting to the expensive cross-validation technique. The algorithm based on the sparse GPR model gives a sparse solution directly, unlike the SVR-based algorithm. Use of the sparse GPR model helps make the proposed algorithm scalable. The results of experiments on synthetic and real-world datasets demonstrate the efficacy of the proposed sparse GP-based algorithm for semi-supervised classification.
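The core building block here is the GPR predictive mean, k_*^T (K + sigma^2 I)^{-1} y, whose sign can serve as the class decision when labels are encoded as +1/-1. The following is a self-contained illustrative sketch (not the thesis's sparse algorithm) with hypothetical 1-D training points, using an RBF kernel and a hand-rolled Gaussian-elimination solver:

```python
import math

def rbf(x, z, ell=1.0):
    """Squared-exponential (RBF) kernel for scalar inputs."""
    return math.exp(-(x - z) ** 2 / (2 * ell ** 2))

def gpr_predict(X, y, x_star, noise=1e-6, ell=1.0):
    """GPR predictive mean for 1-D inputs: k_*^T (K + sigma^2 I)^{-1} y,
    solving the linear system by Gaussian elimination (no external libraries)."""
    n = len(X)
    K = [[rbf(X[i], X[j], ell) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = list(y)
    for col in range(n):  # forward elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(K[r][col]))
        K[col], K[piv] = K[piv], K[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = K[r][col] / K[col][col]
            for c in range(col, n):
                K[r][c] -= f * K[col][c]
            b[r] -= f * b[col]
    alpha = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(K[r][c] * alpha[c] for c in range(r + 1, n))
        alpha[r] = (b[r] - s) / K[r][r]
    k_star = [rbf(x_star, xi, ell) for xi in X]
    return sum(ks * a for ks, a in zip(k_star, alpha))

# Labeled points near -2 and +2 with labels -1/+1; the sign of the
# predictive mean acts as the binary class decision.
X, y = [-2.0, -1.5, 1.5, 2.0], [-1.0, -1.0, 1.0, 1.0]
print(gpr_predict(X, y, 1.8))   # positive -> class +1
print(gpr_predict(X, y, -1.8))  # negative -> class -1
```

The semi-supervised variants in the thesis additionally exploit unlabeled points and sparse GPR approximations; this sketch only shows the supervised GPR-as-classifier core.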
|
14 |
Aplicação de espectroscopia no infravermelho próximo e análise multivariada para identificação e quantificação de hidrocarbonetos totais do petróleo em solo / Nespeca, Maurílio Gustavo. January 2018 (has links)
Advisor: Rodrigo Sequinel / Co-advisor: Danilo Luiz Flumignan / Committee: Edilene Cristina Ferreira; Erica Regina Filletti Nascimento; Mária Cristina Breitkreitz; Heron Dominguez Torres da Silva / Abstract: According to the Environmental Company of the State of São Paulo (CETESB), gas stations are responsible for the environmental contamination of 72% of the 5942 contaminated areas registered in the state of São Paulo. Contamination of soils and groundwater by fossil fuels causes immense environmental damage due to their high toxicity, carcinogenic properties and high permanence in the soil. Environmental monitoring in areas of potential contamination risk, such as gas stations, is carried out through the analysis of total petroleum hydrocarbons (TPH), among other individual compounds. These analyses are performed by gas chromatographic methods which require sample preparation steps, such as extraction with halogenated solvents, purification, and often preconcentration. The high cost and time demanded for the quantification of TPH in soil are barriers to the growth of monitoring programs and remediation processes for contaminated areas. Therefore, this work aimed at the development of faster and lower-cost analytical methods for the identification and quantification of TPH in soil through near-infrared spectroscopy (NIR). Three NIR methods were developed: (i) without sample preparation; (ii) after hexane extraction; and (iii) after extraction with ethanol. The classification models were developed by the partial least squares discriminant analysis (PLS-DA) method and the calibration models by the partial least squares (PLS) method. In the development of the models... (Complete abstract: click electronic access below) / Doctorate
|
15 |
Sélection automatisée d'informations crédibles sur la santé en ligne [Automated selection of credible online health information]. Bayani, Azadeh. 01 1900 (has links)
Introduction: Online content is a significant and primary source for many users seeking health-related information. To prevent misinformation, it is crucial to automate the assessment of
reliability of sources and fact-checking of information.
Objective: This study aimed to automate the identification of the credibility of online information
sources. For this, two complementary quality elements were automated: (1) The reliability
assessment of health-related information, considering the HONcode criteria, and (2) The fact-checking of the information, using PubMed articles as a source of truth.
Methods: In this study, we analyzed 538 English webpages from 43 websites. In the first phase of
credibility assessment of the information, we classified the HONcode criteria into two levels: the
“web page level” (authority, complementarity, justifiability, and attribution) and the “website
level” (confidentiality, transparency, financial disclosure, and advertising policy). For the web
page level, we manually annotated 200 pages and applied three machine learning (ML) models:
Random Forest (RF), Support Vector Machines (SVM), and the BERT Transformer. For the website-level criteria, we identified bags of words and used a rule-based model. In a second
phase of fact-checking, the contents of the web pages were categorized into three themes
(semiology, epidemiology, and management) with BERT. Finally, automated extraction of PubMed queries based on MeSH terms made it
possible to automatically extract and compare the 20 most relevant articles with the content of the
web pages.
Results: For the web page level, the BERT model obtained the best area under the curve (AUC) of
96%, 98% and 100% for neutral sentences, justifiability and attribution respectively. SVM showed
a better performance for complementarity classification (AUC of 98%). Finally, SVM and BERT
obtained an AUC of 98% for the authority criterion. For the website level, the rule-based model
retrieved web pages with an accuracy of 97% for privacy, 82% for transparency, and 51% for financial
disclosure and advertising policy. Finally, for fact-checking, on average, 23% of sentences were
automatically checked by the model for each web page.
Conclusion: This study emphasized the significance of Transformer models and the use of PubMed as a key reference for two critical tasks: assessing source reliability and verifying information accuracy. Ultimately, our research could support the development of an automated approach for evaluating the credibility of health websites.
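The website-level criteria were checked with a rule-based model over bags of words. A minimal pure-Python sketch of that idea; the keyword bags below are hypothetical placeholders, not the study's actual word lists:

```python
# Hypothetical keyword bags per HONcode-style criterion; the study's
# actual bags of words are not reproduced here.
CRITERIA_BAGS = {
    "privacy": {"privacy", "confidential", "personal data", "data protection"},
    "transparency": {"contact", "about us", "email", "address"},
    "financial disclosure": {"funding", "sponsor", "advertising policy", "financed"},
}

def check_criteria(page_text, bags=CRITERIA_BAGS):
    """Rule-based pass: a criterion is satisfied if any of its keywords appears."""
    text = page_text.lower()
    return {criterion: any(kw in text for kw in kws) for criterion, kws in bags.items()}

page = """Our Privacy policy explains how personal data is handled.
Contact: editor@example.org. This site is financed by reader donations."""
print(check_criteria(page))
# {'privacy': True, 'transparency': True, 'financial disclosure': True}
```

Accuracy per criterion (97% for privacy, 82% for transparency, 51% for financial disclosure in the study) then reduces to comparing these boolean decisions against manual annotations.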
|
16 |
Aplicação de espectroscopia no infravermelho próximo e análise multivariada para identificação e quantificação de hidrocarbonetos totais do petróleo em solo / Application of near-infrared spectroscopy and multivariate analysis for identification and quantification of total petroleum hydrocarbons in soil. Nespeca, Maurílio Gustavo. 31 August 2018 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / According to the Environmental Company of the State of São Paulo (CETESB), the gas stations are responsible for the environmental contamination of 72% of the 5942 contaminated areas registered in the state of São Paulo.
Contamination of soils and groundwater by fossil fuels causes immense environmental damage due to their high toxicity, carcinogenic properties and high permanence in the soil. Environmental monitoring in areas of potential contamination risk, such as gas stations, is carried out through the analysis of total petroleum hydrocarbons (TPH), among other individual compounds. These analyses are performed by gas chromatographic methods which require some sample preparation steps, such as extraction with halogenated solvents, purification, and often preconcentration. The high cost and time demanded for the quantification of TPH in the soil become barriers for the growth of the monitoring program and remediation processes of the contaminated areas. Therefore, this work aimed at the development of faster and lower-cost analytical methods for the identification and quantification of TPH in soil through near-infrared spectroscopy (NIR). Three NIR methods were developed: (i) without sample preparation; (ii) after hexane extraction; and (iii) after extraction with ethanol. The classification models were developed by the partial least squares discriminant analysis (PLS-DA) method and the calibration models by the partial least squares (PLS) method. In the development of the models, we evaluated nine different preprocessing methods and variable selection by the genetic algorithm (GA). The models were developed using soil samples fortified with contaminants (gasoline, diesel and lubricant oil) and validated with samples of contaminated soil acquired in the environmental monitoring of gas stations. The PLS-DA model provided 100% correct classifications without sample preparation, while the prediction of the concentration of the analytes was possible by PLS models after the ethanol extraction.
As secondary objectives of this work, we developed quantification methods for the different fractions of TPH and for benzene, toluene, ethylbenzene, and xylenes (BTEX) by ultrafast gas chromatography with flame ionization detection (UFGC-FID). In addition to the UFGC-FID methods, the TPH extraction by sonication was optimized through experimental design, evaluating different solvents, sonication time, agitation and solvent volume. The UFGC-FID methods provided analyses 3 to 17 times faster than the chromatographic method according to EPA 8015. At the end of this work, the developed methods and the EPA 8015 method were compared according to analytical, environmental and financial aspects. In general, the three methods presented the same accuracy; the EPA 8015 method was the most precise; the UFGC-FID method presented the lowest initial investment and the shortest time for financial return; and the NIR method after ethanol extraction was the most sensitive and fast, the most favorable to green chemistry, and presented the lowest cost per analysis.
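Spectral preprocessing is central to NIR chemometrics. One of the commonly evaluated transforms is Standard Normal Variate (SNV), which removes additive offsets and multiplicative scatter effects from each spectrum. A minimal pure-Python sketch, assuming two hypothetical spectra of the same sample that differ only by an affine scatter distortion (this illustrates the class of preprocessing compared in the thesis, not its specific choice):

```python
def snv(spectrum):
    """Standard Normal Variate: center each spectrum and scale to unit
    standard deviation, correcting additive and multiplicative scatter."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = (sum((x - mean) ** 2 for x in spectrum) / n) ** 0.5
    return [(x - mean) / sd for x in spectrum]

# Two hypothetical spectra of the same sample: b = 2*a + 0.05 (gain + offset).
a = [0.10, 0.30, 0.20, 0.40]
b = [0.25, 0.65, 0.45, 0.85]
same = all(abs(x - y) < 1e-9 for x, y in zip(snv(a), snv(b)))
print(same)  # True: SNV makes the two spectra coincide
```

After such preprocessing, the corrected spectra feed the PLS-DA classification and PLS calibration models.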
|
17 |
L'utilité des médias sociaux pour la surveillance épidémiologique : une étude de cas de Twitter pour la surveillance de la maladie de Lyme [The usefulness of social media for epidemiological surveillance: a case study of Twitter for Lyme disease surveillance]. Laison, Elda Kokoe Elolo. 12 1900 (has links)
Lyme disease is the most common tick-borne disease in the Northern Hemisphere. The surveillance system for human cases of Lyme disease has several flaws which make the surveillance incomplete. Nowadays, with the extensive use of the internet and social networks, researchers propose the use of data from social networks as a surveillance tool; this approach is called infodemiology. This approach has been successfully tested in several studies.
The aim of this thesis is to build a database of self-reported tweets, classified and labeled as potential Lyme cases or not using transformer-based classifier models such as BERTweet, DistilBERT, and ALBERT.
A total of 20,000 English tweets related to Lyme disease, without geographical restriction, from 2010 to 2022 were collected with the Twitter API. These data were then cleaned and manually classified by binary classification as potential Lyme cases or not, using the symptoms of Lyme disease as keywords. Emojis were also converted into words and integrated. Using classification models based on BERT transformers, the labeling of data as disease-related or non-disease-related was evaluated first without, and then with, emojis.
Transformer-based classification models performed better than conventional classification models; in particular, the BERTweet model outperformed all evaluated models with an average F1 score of 89.3%, precision of 97%, accuracy of 90%, and recall of 82.6%. Also, the incorporation of emojis in our database improved the performance of all models by at least 5%, with BERTweet once again performing best, showing an increase in all evaluated metrics. Tweets in English are mostly from the United States; to counteract this predominance, future work should collect tweets in all languages related to Lyme disease, especially because the European countries where Lyme disease is emerging are not English-speaking countries.
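The reported metrics (F1, precision, recall) derive from the confusion-matrix counts of the binary classifier. A minimal pure-Python sketch with hypothetical toy labels, showing how such scores are computed from predictions:

```python
def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics, positive class = 1 (a potential Lyme case)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical gold labels vs. model predictions for eight tweets
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)  # 0.75 0.75 0.75
```

A high precision with lower recall, as BERTweet shows (97% vs. 82.6%), means the model rarely flags non-cases but misses some true cases.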
|
18 |
Legislative Language for Success. Gundala, Sanjana. 01 June 2022 (links) (PDF)
Legislative committee meetings are an integral part of the lawmaking process for local and state bills. The testimony presented during these meetings is a large factor in the outcome of the proposed bill. This research uses Natural Language Processing and Machine Learning techniques to analyze testimonies from California legislative committee meetings from 2015-2016 in order to identify what aspects of a testimony make it successful. A testimony is considered successful if the alignment of the testimony matches the bill outcome (alignment is "For" and the bill passes, or alignment is "Against" and the bill fails). The process of finding what makes a testimony successful was accomplished through data filtration, feature extraction, implementation of classification models, and feature analysis. Several features were extracted and tested to find those that had the greatest impact on the bill outcome. The features chosen provided information on the sentence complexity and types of words used (adjectives, verbs, nouns) for each testimony. Additionally, all the testimonies were analyzed to find common phrases used within successful testimonies. Two types of classification models were implemented: ones that used the manually extracted features as input and ones that used their own feature extraction process. The results from the classification models and feature analysis show that certain aspects within a testimony, such as sentence complexity and using specific phrases, significantly impact the bill outcome. The most successful models, Support Vector Machine and Multinomial Naive Bayes, achieved accuracies of 91.79% and 91.22% respectively.
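The manually extracted features include measures of sentence complexity. A minimal pure-Python sketch of what such surface features could look like; the specific features below (average sentence length, share of long words) are illustrative assumptions, not the thesis's exact feature set:

```python
def complexity_features(testimony):
    """Toy surface features of testimony style: average sentence length in
    words and the share of 'long' words (7+ letters)."""
    sentences = [s.strip() for s in
                 testimony.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = [w.strip(",;:()\"'") for s in sentences for w in s.split()]
    avg_sentence_len = len(words) / len(sentences)
    long_word_share = sum(len(w) >= 7 for w in words) / len(words)
    return avg_sentence_len, long_word_share

testimony = "I support this bill. It protects vulnerable communities across California."
avg_len, long_share = complexity_features(testimony)
print(avg_len, long_share)  # 5.0 0.5
```

Feature vectors of this kind, alongside word-type counts and common phrases, would then be fed to classifiers such as SVM or Multinomial Naive Bayes.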
|
19 |
[en] PREDICTING DRUG SENSITIVITY OF CANCER CELLS BASED ON GENOMIC DATA / [pt] PREVENDO A EFICÁCIA DE DROGAS A PARTIR DE CÉLULAS CANCEROSAS BASEADO EM DADOS GENÔMICOS. Sofia Pontes de Miranda. 22 April 2021 (has links)
[en] Accurately predicting drug responses for a given sample based on molecular features may help to optimize drug-development pipelines and explain mechanisms behind treatment responses. In this dissertation, two case studies were generated, each applying different genomic data to predict drug response. Case study 1 evaluated DNA methylation profile data as one type of molecular feature that is known to drive tumorigenesis and modulate treatment responses. Using genome-wide DNA methylation profiles from 987 cell lines in the Genomics of Drug Sensitivity in Cancer (GDSC) database, we used machine-learning algorithms to evaluate the potential to predict cytotoxic responses for eight anti-cancer drugs. We compared the performance of five classification algorithms and four regression algorithms representing diverse methodologies, including tree-, probability-, kernel-, ensemble- and distance-based approaches. By applying artificial subsampling in varying degrees, this research aims to understand whether training based on relatively extreme outcomes would yield improved performance.
When using classification or regression algorithms to predict discrete or continuous responses, respectively, we consistently observed excellent predictive performance when the training and test sets consisted of cell-line data. Classification algorithms performed best when we trained the models using cell lines with relatively extreme drug-response values, attaining area-under-the-receiver-operating-characteristic-curve values as high as 0.97. The regression algorithms performed best when we trained the models using the full range of drug-response values, although this depended on the performance metrics we used. Case study 2 evaluated RNA-seq data as one of the most popular molecular data used to study drug efficacy. By applying a semi-supervised learning approach, this research aimed to understand the impact of combining labeled and unlabeled data to improve model prediction. Using genome-wide RNA-seq labeled data from an average of 125 AML tumor samples in the Beat AML database (varying by drug type) and 151 unlabeled AML tumor samples in The Cancer Genome Atlas (TCGA) database, we used a semi-supervised model structure to predict cytotoxic responses for four anti-cancer drugs. Semi-supervised models were generated, while assessing several parameter combinations and were compared against supervised classification algorithms.
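The "artificial subsampling" toward relatively extreme drug-response values can be sketched as keeping only the tails of the response distribution. A minimal pure-Python illustration with hypothetical (cell line, response) pairs; the fraction kept per tail is an assumed parameter:

```python
def subsample_extremes(samples, frac=0.25):
    """Keep only samples whose drug-response values lie in the lowest and
    highest `frac` tails, mimicking training on relatively extreme outcomes."""
    ranked = sorted(samples, key=lambda s: s[1])
    k = max(1, int(len(ranked) * frac))
    return ranked[:k] + ranked[-k:]

# Hypothetical (cell_line, drug_response) pairs; lower = more sensitive
samples = [("c1", 0.05), ("c2", 0.40), ("c3", 0.55), ("c4", 0.60),
           ("c5", 0.95), ("c6", 0.10), ("c7", 0.50), ("c8", 0.90)]
print(subsample_extremes(samples, frac=0.25))
# Keeps the two most sensitive and the two most resistant cell lines
```

Classifiers trained on such extreme subsets showed the strongest discrimination in case study 1, while regressors benefited from the full response range.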
|