Global ETD Search

71	Forecasting checking account balance : Using supervised machine learning Dannelind, Martin January 2022 The introduction of open banking has made it possible for companies to build the next generation of applications based on transactional data, enabling economic forecasts that private individuals can use to make responsible financial decisions. This project investigated forecasting account balances using supervised learning. Seven different regression models were run on transactional data from 377 anonymised checking accounts split into subgroups. The results showed that multivariate XGBoost optimised with feature selection was the best-performing forecasting model and that the subgroup with recurring income transactions was the easiest to forecast. Based on these results, a viable option for forecasting account balances is to split the transactional data into subgroups and forecast them separately, which minimises the errors caused by certain random, infrequent and large types of transactions. Time series forecasting account balance forecasting economic prediction Python GRU LSTM RNN XGBoost prophet checking account
72	Employee Turnover Prediction - A Comparative Study of Supervised Machine Learning Models Kovvuri, Suvoj Reddy, Dommeti, Lydia Sri Divya January 2022 Background: In every organization, employees are an essential resource. For several reasons, employees are neglected by their organizations, which leads to employee turnover. Employee turnover causes considerable losses to the organization. Using machine learning algorithms and the data in hand, a prediction of an employee's future in an organization is made. Objectives: The aim of this thesis is to conduct a comparison study utilizing supervised machine learning algorithms such as Logistic Regression, Naive Bayes Classifier, Random Forest Classifier, and XGBoost to predict an employee's future in a company. The models are assessed using evaluation metrics in order to discover the most efficient model for the data in hand. Methods: The quantitative research approach is used in this thesis, and data is analyzed using statistical analysis. The labeled data set comes from Kaggle and includes information on employees at a company. The data set is used to train the algorithms. The created models are evaluated on the test set using evaluation measures including Accuracy, Precision, Recall, F1 Score, and ROC curve to determine which model performs best at predicting employee turnover. Results: Among the studied features in the data set, no feature has a significant impact on turnover. The XGBoost classifier had the best mean accuracy at 85.3%, followed by the Random Forest classifier at 83%, both ahead of the other two algorithms. The XGBoost classifier had the best precision at 0.88, followed by the Random Forest classifier at 0.82. Both the Random Forest classifier and the XGBoost classifier showed a Recall score of 0.69. The XGBoost classifier had the highest F1 Score at 0.77, followed by the Random Forest classifier at 0.75. 
In the ROC curve, the XGBoost classifier had a higher area under the curve (AUC) at 0.88. Conclusions: Among the four machine learning algorithms studied, Logistic Regression, Naive Bayes Classifier, Random Forest Classifier, and XGBoost, the XGBoost classifier performed best, with good scores across the tested performance metrics. No feature was found to have a major effect on employee turnover. Machine Learning Employee Turnover Prediction Supervised Learning Models Logistic Regression Naive Bayes Classifier Random Forest Classifier XGBoost Computer Sciences Datavetenskap (datalogi)
73	Restaurant Daily Revenue Prediction : Utilizing Synthetic Time Series Data for Improved Model Performance Jarlöv, Stella, Svensson Dahl, Anton January 2023 This study aims to enhance the accuracy of a demand forecasting model, XGBoost, by incorporating synthetic multivariate restaurant time series data during the training process. The research addresses the limited availability of training data by generating synthetic data using TimeGAN, a generative adversarial deep neural network tailored for time series data. A one-year daily time series dataset, comprising numerical and categorical features based on a real restaurant's sales history, supplemented by relevant external data, serves as the original data. TimeGAN learns from this dataset to create synthetic data that closely resembles the original data in terms of temporal and distributional dynamics. Statistical and visual analyses demonstrate a strong similarity between the synthetic and original data. To evaluate the usefulness of the synthetic data, an experiment is conducted where varying lengths of synthetic data are iteratively combined with the one-year real dataset. Each iteration involves retraining the XGBoost model and assessing its accuracy for a one-week forecast using the Root Mean Square Error (RMSE). The results indicate that incorporating 6 years of synthetic data improves the model's performance by 65%. The hyperparameter configurations suggest that deeper tree structures benefit the XGBoost model when synthetic data is added. Furthermore, the model exhibits improved feature selection with an increased amount of training data. This study demonstrates that incorporating synthetic data closely resembling the original data can effectively enhance the accuracy of predictive models, particularly when training data is limited. 
demand forecasting data augmentation time series data machine learning restaurant industry generative adversarial networks TimeGAN XGBoost Computer and Information Sciences Data- och informationsvetenskap
74	Neonatal Sepsis Detection Using Decision Tree Ensemble Methods: Random Forest and XGBoost Al-Bardaji, Marwan, Danho, Nahir January 2022 Neonatal sepsis is a potentially fatal medical condition due to an infection and is attributed to about 200 000 annual deaths globally. With healthcare systems that are facing constant challenges, there exists a potential for introducing machine learning models as a diagnostic tool that can be automated within existing workflows and would not entail more work for healthcare personnel. The Herlenius Research Team at Karolinska Institutet has collected neonatal sepsis data that has been used for the development of many machine learning models across several papers. However, none have tried to study decision tree ensemble methods. In this paper, random forest and XGBoost models are developed and evaluated in order to assess their feasibility for clinical practice. The data contained 24 features of vital parameters that are easily collected through a patient monitoring system. The validation and evaluation procedure needed special consideration due to the data being grouped based on patient level and being imbalanced. The proposed methods developed in this paper have the potential to be generalized to other similar applications. Finally, using the measure receiver-operating-characteristic area-under-curve (ROC AUC), both models achieved around ROC AUC = 0.84. Such results suggest that the random forest and XGBoost models are potentially feasible for clinical practice. Another gained insight was that both models seemed to perform better with simpler models, suggesting that future work could create a more explainable model. / Neonatal sepsis is a potentially fatal medical condition resulting from an infection and is reported to cause 200 000 deaths annually worldwide. 
With healthcare systems constantly facing challenges, there is potential for machine learning models as diagnostic tools automated within existing workflows without entailing more work for healthcare staff. The Herlenius research team at Karolinska Institutet has collected neonatal sepsis data that has been used to develop many machine learning models across several studies. However, none has investigated decision tree ensemble methods. The aim of this study is to develop and evaluate random forest and XGBoost models in order to assess their potential in clinical practice. The data contains 24 features of vital parameters that are easily collected through patient monitoring systems. The validation and evaluation procedure required special consideration because the data was grouped at patient level and was imbalanced. The proposed method has the potential to be generalized to other similar applications. Finally, using the receiver-operating-characteristic area-under-curve (ROC AUC) measure, we were able to show that both models achieved ROC AUC = 0.84. Such results suggest that both the random forest and XGBoost models could potentially be used in clinical practice. Another insight was that both models seemed to perform better with simpler models, which suggests that future work could be to create a more explainable machine learning model. / Kandidatexjobb i elektroteknik 2022, KTH, Stockholm Machine Learning Sepsis Neonatal Sepsis Random Forest XGBoost Imbalanced Data Binary Classification Cross-Validation Hyperparameter Tuning Elektroteknik och elektronik
75	Geochemical investigation of the co-evolution of life and environment in the Neoproterozoic Era Kang, Junyao 19 February 2024 The co-evolution of life and the environment stands as a cornerstone in Earth's 4.5-billion-year history. Environmental fluctuations have wielded substantial influence over biological evolution, while life forms have, in turn, reshaped Earth's surface and climate. This dissertation centers on a critical period in Earth's history, the Neoproterozoic Era, when profound environmental shifts potentially catalyzed pivotal eukaryotic evolutionary events. By delving deeper into Neoproterozoic paleoenvironments, I aim at a clearer understanding of life-environment co-evolution in this crucial era. The first chapter focuses on an important juncture: the transition from prokaryote to eukaryote dominance in marine ecosystems during the Tonian Period (1000 Ma to 720 Ma). To assess whether the availability of nitrate, an important macro-nutrient, played a critical role in this evolutionary event, nitrogen isotope compositions (δ<sup>15</sup>N) of marine carbonates from the early Tonian (ca. 1000 Ma to ca. 800 Ma) Huaibei Group in North China were measured. The data indicate nitrate limitation in early Neoproterozoic oceans. Further, a compilation of Proterozoic sedimentary δ<sup>15</sup>N data, together with box model simulations, suggests a ~50% increase in marine nitrate availability at ~800 Ma. Limited nitrate availability in early Neoproterozoic oceans may have delayed the ecological rise of eukaryotes until ~800 Ma when increased nitrate supply, together with other environmental and ecological factors, may have contributed to the transition from prokaryote-dominant to eukaryote-dominant marine ecosystems. Recognizing the spatial and temporal variations in Neoproterozoic oceanic environments, the second chapter lays the groundwork for a robust stratigraphic framework for the early Tonian Period. 
Employing the dynamic time warping algorithm, I constructed a global stratigraphic framework for the early Tonian Period using δ<sup>13</sup>C<sub>carb</sub> data from the North China, São Francisco, and Congo cratons. This exercise confirms the generally narrow range of δ<sup>13</sup>C<sub>carb</sub> fluctuations in the early Tonian, but also confirms the presence of a negative δ<sup>13</sup>C<sub>carb</sub> excursion of notable magnitude (~9 ‰) at ca. 920 Ma in multiple records, suggesting that it was global in scope. This negative excursion, known as the Majiatun excursion, is likely the oldest negative excursion in the Neoproterozoic Era and marks the onset of the dynamic Neoproterozoic carbon cycle. Shifting focus to the late Neoproterozoic, the third chapter delves into the origins of Neoproterozoic superheavy pyrite, whose bulk-sample δ<sup>34</sup>S values are greater than those of contemporaneous seawater sulfate and whose origins remain controversial. Two supervised machine learning algorithms were trained on a large LA-ICP-MS pyrite trace element database to distinguish pyrite of different origins. The analysis validates that two models built on the co-behavior of 12 trace elements (Co, Ni, Cu, Zn, As, Mo, Ag, Sb, Te, Au, Tl, and Pb) can be used to accurately predict pyrite origins. This novel approach was then used to identify the origins of pyrite from two Neoproterozoic sedimentary successions in South China. The first set of samples contains isotopically superheavy pyrite from the Cryogenian Tiesi'ao and Datangpo formations. The second set of samples contains pyritic rims from the Ediacaran Doushantuo Formation; these pyrite rims are associated with fossiliferous chert nodules and do not have superheavy sulfur isotopes. For the superheavy pyrite, the models consistently show high confidence levels in identifying its genesis type, and three out of four samples were inferred to be of sedimentary origins. 
For the pyritic nodule rims, the models suggest that early diagenetic pyrite was subsequently altered by hydrothermal fluids and therefore shows mixed signals. The third chapter highlights the importance of pyrite trace elements in deciphering and distinguishing the origins of pyrite in sedimentary strata. / Doctor of Philosophy / Understanding how life and the environment have shaped our planet's story over 4.5 billion years is like piecing together an intricate puzzle. On the one hand, changes in the environment kickstarted big shifts in how life evolved. On the other hand, living creatures have also left their mark on Earth's landscapes and climate. This dissertation focuses on unraveling the mysterious Neoproterozoic Era (1 billion to 538 million years ago), a time when Earth saw some of its most dramatic changes. A significant aspect of my investigation delves into the evolutionary dynamics within ancient marine ecosystems. Specifically, I'm exploring a critical juncture when organisms with more complex cellular structures, known as eukaryotes, became ecologically more important than prokaryotic life forms in many aspects of Earth systems. By examining ancient rock formations from China, I have found evidence suggesting that nitrate, a vital nutrient, was scarce in the Neoproterozoic oceans. However, around 800 million years ago, there appears to have been a significant surge in nitrate availability. This surge potentially catalyzed a pivotal phase in evolution, possibly driving the shift from prokaryote to eukaryote dominance in these ancient waters. Second, there is a challenge to delineate a robust timeline for the early Neoproterozoic Era. Imagine trying to piece together a story from a time when there were no calendars or clear dates. Employing advanced statistical methods and comparing chemical signals preserved in carbonate rocks from disparate global locations, I endeavor to craft a coherent timeline for this crucial period. 
Within this timeline, a noteworthy anomaly in the carbon cycle emerged around 920 million years ago known as the Majiatun excursion. This anomaly represents a significant shift in the Neoproterozoic carbon cycle. Furthermore, my investigation plunges into the geochemistry of sulfur, an important element in shaping ancient marine environments. Certain sedimentary rocks harbor anomalous sulfur isotope signatures in the mineral pyrite (also known as fool's gold), hinting at dramatic environmental transformations during the late Neoproterozoic. Employing advanced analytical techniques and machine learning methodologies, I seek to discern the origins and implications of these anomalous sulfur isotope signals found in pyrite, unraveling their significance in reconstructing the environmental dynamics of ancient oceans. Neoproterozoic Nitrogen Isotope Iron Speciation Redox Tonian Nitrate Eukaryotes Carbon Isotope North China Craton Chemostratigraphy Sulfur Isotope Pyrite Trace Element LA-ICP-MS Machine Learning Random Forest XGBoost
76	Viewership forecast on a Twitch broadcast : Using machine learning to predict viewers on sponsored Twitch streams Malm, Jonas, Friberg, Martin January 2022 Today, the video game industry is larger than the sports and film industries combined, and the largest streaming platform Twitch with an average of 2.8 million concurrent viewers offers the possibility for gaming and non-gaming brands to market their products. Estimating streamers' viewership is central in these marketing campaigns, but no large-scale studies have been conducted to predict viewership previously. This paper evaluates three different machine learning algorithms with regard to the three different error metrics MAE, MAPE and RMSE; and presents novel features for predicting viewership. Different models are chosen through recursive feature elimination using k-fold cross-validation with respect to both MAE and MAPE separately. The models are evaluated on an independent test set and show promising results, on par with manual expert predictions. None of the models can be said to be significantly better than another. XGBoost optimized for MAPE obtained the lowest MAE error score of 282.54 and lowest MAPE error score of 41.36% on the test set, in comparison to expert predictions with 288.06 MAE and 83.05% MAPE. Furthermore, the study illustrates the importance of past viewership and streamer variety to predict future viewership. twitch viewership prediction regression machine learning XGBoost streaming distance metrics feature selection cross-validation feature engineering pre-processing Computer and Information Sciences Data- och informationsvetenskap
77	Prognostic Stratification in Patients with Left Heart Disease : A Machine Learning Approach / Prognostisk stratifiering hos patienter med vänstersidig hjärtsvikt : En maskininlärningsmetod Saleh, Mariam January 2024 Left heart disease often results in left heart failure and right ventricular dysfunction, which is challenging to diagnose with traditional diagnostic approaches. To address this, a novel empirical 4-point right ventricular dysfunction score was created at Sahlgrenska University Hospital to overcome the limitations of single variables for diagnosing right ventricular dysfunction. In this study, we used machine learning, more specifically XGBoost coupled with interactive machine learning, to develop four different models for predicting death or receiving a left ventricular assist device in patients with left heart disease (n=486). Features were selected from the dataset using recursive feature elimination with the default number of features. The initial model with 29 features, called the baseline model, served as the foundation of the three additional models, each adjusted based on feedback from a clinician. The first step of feedback involved removing features due to high correlation, creating a modified model with 12 features. The second step was to use 12 well-known characteristics of left and right ventricular dysfunction, creating an empirical model, and to adjust the prediction threshold from 50% to 60%. The third step was to reduce the number of features to 5 based on empirical grounds. The models were compared to the right ventricular dysfunction score using the metrics area under the curve, F1 score, positive likelihood ratio, and negative likelihood ratio. The predictive efficacy of the machine learning models was superior to that of the right ventricular dysfunction score. The results also indicated that the models neither improved nor deteriorated when the number of features was reduced. 
However, insufficient accuracy indicates that none of the machine learning models are clinically viable. These results show the potential of machine learning in enhancing prognostic stratification in patients with left heart disease, although further refinement is necessary for clinical use. / Left heart disease often results in left heart failure and right ventricular dysfunction, which is challenging to diagnose with traditional diagnostic methods. To overcome the limitations of single variables for diagnosing right ventricular dysfunction, a 4-point right ventricular dysfunction score was created at Sahlgrenska University Hospital. In this study, an XGBoost algorithm combined with interactive machine learning was used to develop four different prediction models for predicting mortality or the risk of receiving a left ventricular assist device in patients with left heart failure (n=486). Features were selected from the dataset using recursive feature elimination with a default number of features. The initial model with 29 features was called the baseline model and served as the foundation for the three additional models, which were adjusted based on the clinician's feedback. The first step involved removing features with high mutual correlation, and we created a modified model with 12 features. In the second step, for the empirical model, we used 12 known characteristics of left and right ventricular dysfunction, and for both models the prediction threshold was adjusted from 50% to 60%. In a third step, we created a simplified model with 5 features on clinical grounds. The models were compared with the 4-point right ventricular dysfunction score using the metrics area under the curve, F1 score, positive likelihood ratio, and negative likelihood ratio. This revealed that the machine learning models had better predictive ability than the 4-point right ventricular dysfunction score. 
In addition, the results showed that the models neither deteriorated nor improved when features were removed or when new models were created on clinical grounds. However, the machine learning models had insufficient accuracy for clinical use. Left heart disease right ventricular dysfunction XGBoost RVD score machine learning interactive machine learning modAL recursive feature elimination multiple imputation by chained equations Medical Engineering Medicinteknik
78	Efficient Resource Management : A Comparison of Predictive Scaling Algorithms in Cloud-Based Applications Dahl, Johanna, Strömbäck, Elsa January 2024 This study aims to explore predictive scaling algorithms used to predict and manage workloads in a containerized system. The goal is to identify which predictive scaling approach delivers the most effective results, contributing to research on cloud elasticity and resource management. This potentially leads to reduced infrastructure costs while maintaining efficient performance, enabling a more sustainable cloud-computing technology. The work involved the development and comparison of three different autoscaling algorithms with an interchangeable prediction component. For the predictive part, three different time-series analysis methods were used: XGBoost, ARIMA, and Prophet. A simulation system with the necessary modules was developed, as well as a designated target service to experience the load. Each algorithm's scaling accuracy was evaluated by comparing its suggested number of instances to the optimal number, with each instance representing a simulated CPU core. The results showed varying efficiency: XGBoost and Prophet excelled with richer datasets, while ARIMA performed better with limited data. Although XGBoost and Prophet maintained 100% uptime, this could lead to resource wastage, whereas ARIMA's lower uptime percentage possibly suggested a more resource-efficient, though less reliable, approach. Further analysis, particularly experimental investigation, is required to deepen the understanding of these predictors' influence on resource allocation. forecasting cloud computing time series data machine learning containerization Prophet ARIMA XGBoost Engineering and Technology Teknik och teknologier Computer and Information Sciences Data- och informationsvetenskap
79	Comparative Analysis of Machine Learning Algorithms for Cryptocurrency Price Prediction Kurtagic, Leila January 2024 As the cryptocurrency markets continuously grow, so does the need for reliable analytical tools for price prediction. This study conducted a comparative analysis of machine learning (ML) algorithms for cryptocurrency price prediction. Through a literature review, three common and reliable ML algorithms for cryptocurrency price prediction were identified: Long Short-Term Memory (LSTM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). Utilizing the Bitcoin All Time History dataset from TradingView, the study assessed both the individual performance of each algorithm and the potential of ensemble methods to enhance predictive accuracy. The results reveal that the LSTM algorithm outperformed RF and XGBoost in terms of predictive accuracy according to the metrics Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Additionally, two ensemble approaches were tested: Ensemble 1, which enhanced the LSTM model with the combined predictions from RF and XGBoost, and Ensemble 2, which integrated predictions from all three models. Ensemble 2 demonstrated the highest predictive performance among all models, highlighting the advantages of using ensemble approaches for more robust predictions. Machine Learning Cryptocurrency Price Prediction LSTM (Long Short-Term Memory) Random Forest XGBoost (eXtreme Gradient Boosting) Ensemble Methods Feature Importance Financial Analytics Computer and Information Sciences Data- och informationsvetenskap
80	Money Laundering Detection using Tree Boosting and Graph Learning Algorithms / Detektion av Penningtvätt med hjälp av Trädalgoritmer och Grafinlärningsalgoritmer Frumerie, Rickard January 2021 In this master's thesis we focused on using machine learning methods for detecting money laundering in financial transaction networks, in order to demonstrate that they can be used as a complement to or instead of the more commonly used rule-based systems. The graph learning method graph convolutional networks (GCN) has been a hot topic in the field since it was shown to scale well with data size back in 2018. However, typical GCN models cannot use edge features, which is why this thesis combines the GCN model with a node and edge neural network (NENN) in order to solve this problem. This new method is compared with an already established machine learning method for financial transactions, namely the tree boosting method (XGBoost). Because of confidentiality concerns for financial transaction data, the machine learning algorithms are tested on two carefully constructed, synthetically generated data sets, produced by agent-based simulations to resemble real financial data. The results showed the viability and superiority of the new implementation of the GCN model, which is preferable for connectively structured data, meaning that a transaction or account is analyzed in the context of its financial environment. On the other hand, the XGBoost method showed better results when examining transactions independently. Hence it was able to find fraudulent and non-fraudulent patterns more accurately from the transactional features themselves. / In this thesis we focus on the use of machine learning methods to detect money laundering in financial transaction networks, with the goal of demonstrating that they can be used as a complement to or instead of the more commonly used rule-based systems. 
The graph learning method graph convolutional networks (GCN) has been a hot topic in the field since the method was shown in 2018 to work well for large amounts of data. However, an ordinary GCN model cannot use edge information, which is why this thesis combines the GCN model with node and edge neural networks (NENN) to detect money laundering more effectively. This new method will be compared with an already established machine learning method for financial transactions, namely tree boosting (XGBoost). For confidentiality reasons concerning financial transaction data, the machine learning algorithms were tested on two carefully constructed, synthetically generated data sets which, produced by agent-based simulations, resemble real financial data. The results showed the applicability and superiority of the new implementation of the GCN model, which is preferable for relationally structured data, that is, when transactions and accounts are analyzed in the context of their financial environment. On the other hand, XGBoost shows better results when examining transactions individually, since this method can more precisely identify fraudulent and non-fraudulent patterns from the transactional features. Tree boosting XGBoost graph convolutional networks (GCN) node and edge neural networks (NENN) exploratory data analysis (EDA) anti money laundering (AML) financial graph networks. Trädalgoritmer XGBoost convolutions grafnätverk (GCN) nod och kant neurala nätverk (NENN) utforskande dataanalys penningtvättsbekämpning (AML) finansiella grafnätverk. Probability Theory and Statistics Sannolikhetsteori och statistik
