  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Machine learning for detecting fraud in an API

Sánchez Espunyes, Anna January 2022 (has links)
An Application Programming Interface (API) provides developers with a high-level framework that abstracts the underlying implementation of services. Using an API reduces the time developers spend on implementation, and it encourages collaboration and innovation from third-party developers. Making an API public carries a risk: developers might use it inappropriately. Most APIs have a policy that states which behaviors are considered fraudulent. Detecting applications that fraudulently use an API is a challenging problem: it is infeasible to review every application that makes requests. API providers therefore aim to implement an automatic tool that accurately singles out suspicious applications among all requesting applications. In this thesis, we study the possibility of using machine learning techniques to detect fraud in Web APIs. We experiment with supervised learning methods (random forests and gradient boosting), clustering methods such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and ensemble methods that combine the predictions of supervised learning methods and clustering methods. The available dataset contains data gathered when a developer creates an application and data collected once the application starts making HTTP requests. We derive a meaningful representation of the most important textual fields of the dataset using Sentence-BERT (S-BERT), and we experiment with the predictive importance of the S-BERT embeddings. The method that achieves the best performance on the test set is an ensemble that combines the results of the gradient boosting classifier and DBSCAN. This method performs better when using the S-BERT embeddings of the applications' textual data, achieving an F1-score of 0.9896.
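The ensemble described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual implementation: the function name, the merging rule (flag an application if either the classifier predicts fraud or DBSCAN marks it as a noise point), and the DBSCAN parameters are assumptions, and in the thesis the input features would also include S-BERT embeddings of the applications' textual fields.

```python
# Hypothetical sketch of an ensemble combining a supervised classifier
# with DBSCAN outlier detection, in the spirit of the abstract above.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

def ensemble_fraud_flags(X_train, y_train, X_test, eps=0.5, min_samples=5):
    """Flag an application as suspicious (1) if the gradient boosting
    classifier predicts fraud OR DBSCAN labels it as noise (-1)."""
    # Supervised component: learn fraud labels from past applications.
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    supervised = clf.predict(X_test)

    # Unsupervised component: standardize features, then treat DBSCAN
    # noise points (cluster label -1) as potential fraud.
    scaler = StandardScaler().fit(X_train)
    noise = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        scaler.transform(X_test)) == -1

    return np.logical_or(supervised == 1, noise).astype(int)
```

The OR-combination trades precision for recall; how the thesis actually weighs the two predictors is not stated in the abstract.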
2

Semantic Topic Modeling and Trend Analysis

Mann, Jasleen Kaur January 2021 (has links)
This thesis focuses on finding an end-to-end unsupervised solution to a two-step problem: extracting semantically meaningful topics from a large temporal text corpus, and analyzing how those topics trend over time. To achieve this, the focus is on the latest developments in Natural Language Processing (NLP) related to pre-trained language models like Google's Bidirectional Encoder Representations from Transformers (BERT) and other BERT-based models. These transformer-based pre-trained language models provide word and sentence embeddings based on the context of the words. The results are then compared with traditional machine learning techniques for topic modeling, to evaluate whether the quality of topic models has improved and how dependent the techniques are on manually defined model hyperparameters and data preprocessing. These topic models provide a good mechanism for summarizing and organizing a large text corpus and give an overview of how the topics evolve with time. In the context of research publications or scientific journals, such analysis of the corpus can give an overview of research/scientific interest areas and how these interests have evolved over the years. The dataset used for this thesis consists of research articles and papers from a journal, namely the Journal of Cleaner Production, which held more than 24,000 research articles at the time of working on this project. We started with implementing Latent Dirichlet Allocation (LDA) topic modeling. In the next step, we implemented LDA along with document clustering to get topics within these clusters. This gave us an idea of the dataset and also gave us a benchmark. After having some base results, we explored transformer-based contextual word and sentence embeddings to evaluate if this leads to more meaningful, contextual, and semantic topics. For document clustering, we have used K-means clustering.
In this thesis, we also discuss methods to optimally visualize the topics and the trend changes of these topics over the years. Finally, we conclude with a method for leveraging contextual embeddings using BERT and Sentence-BERT to solve this problem and achieve semantically meaningful topics. We also discuss the results from traditional machine learning techniques and their limitations.
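The document-clustering side of the pipeline can be sketched as K-means over document vectors. TF-IDF vectors stand in here for the contextual BERT/S-BERT embeddings the thesis concludes with, since those require a pre-trained model; the function name and parameters are illustrative assumptions.

```python
# Hypothetical sketch of document clustering with K-means. In the thesis
# the document representation would be BERT/S-BERT embeddings; TF-IDF is
# used here only to keep the sketch self-contained.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(docs, n_clusters=2):
    """Assign each document to one of n_clusters K-means clusters."""
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return km.fit_predict(X)
```

Topics can then be extracted per cluster (e.g. by running LDA within each cluster, as the abstract describes).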
