81 |
Comparative Analysis of Machine Learning Algorithms for Cryptocurrency Price PredictionKurtagic, Leila January 2024 (has links)
As the cryptocurrency markets continuously grow, so does the need for reliable analytical tools for price prediction. This study conducted a comparative analysis of machine learning (ML) algorithms for cryptocurrency price prediction. Through a literature review, three common and reliable ML algorithms for cryptocurrency price prediction were identified: Long Short-Term Memory (LSTM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). Utilizing the Bitcoin All Time History dataset from TradingView, the study assessed both the individual performance of each algorithm and the potential of ensemble methods to enhance predictive accuracy. The results reveal that the LSTM algorithm outperformed RF and XGBoost in terms of predictive accuracy according to the metrics Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Additionally, two ensemble approaches were tested: Ensemble 1, which enhanced the LSTM model with the combined predictions from RF and XGBoost, and Ensemble 2, which integrated predictions from all three models. Ensemble 2 demonstrated the highest predictive performance among all models, highlighting the advantages of using ensemble approaches for more robust predictions.
|
82 |
Money Laundering Detection using Tree Boosting and Graph Learning Algorithms / Detektion av Penningtvätt med hjälp av Trädalgoritmer och GrafinlärningsalgoritmerFrumerie, Rickard January 2021 (has links)
In this masters thesis we focused on using machine learning methods for detecting money laundering in financial transaction networks, in order to demonstrate that it can be used as a complement or instead of the more commonly used rule based systems. The graph learning method graph convolutional networks (GCN) has been a hot topic in the field since they were shown to scale well with data size back in 2018. However the typical GCN models cannot use edge features, which is why this thesis combines the GCN model with a node and edge neural network (NENN) in order to solve this problem. This new method will be compared towards an already established machine learning method for financial transactions, namely the tree boosting method (XGBoost). Because of confidentiality concerns for financial transactions data, the machine learning algorithms will be tested on two carefully constructed synthetically generated data sets, which from agent based simulations resembles real financial data. The results showed the viability and superiority of the new implementation of the GCN model with it being a preferable method for connectivly structured data, meaning that a transaction or account is analyzed in the context of its financial environment. On the other hand the XGBoost method showed better results when examining transactions independently. Hence it was more accurately able to find fraudulent and non fraudulent patterns from the transactional features themselves. / I detta examensarbete fokuserar vi på användandet av maskininlärningsmetoder för att detektera penningtvätt i finansiella transaktionsnätverk, med målet att demonstrera att dess kan användas som ett komplement till eller i stället för de mer vanligt använda regelbaserade systemen. Grafinlärningsmetoden \textit{graph convolutional networks} (GCN) som har varit ett hett ämne inom området sedan metoden under 2018 visades fungera bra för stora datamängder. Däremot kan inte en vanlig GCN-modell använda kantinformation, vilket är varför denna avhandling kombinerar GCN-modellen med \textit{node and edge neural networks} (NENN) för att mer effektivt detektera penningtvätt. Denna nya metod kommer att jämföras med en redan etablerad maskininlärningsmetod för finansiella transaktioner, nämligen \textit{tree boosting} (XGBoost). På grund av sekretessanledningar för finansiella transaktionsdata var maskininlärningsalgoritmerna testade på två noggrant konstruerade syntetiskt genererade datamängder som från agentbaserade simuleringar liknar riktiga finansiella data. Resultaten visade på applikationsmöjligheter och överlägsenhet för den nya implementationen av GCN-modellen vilken är att föredra för relationsstrukturerade data, det vill säga när transaktioner och konton analyseras i kontexten av deras finansiella omgivning. Å andra sidan visar XGBoost bättre resultat på att examinera transaktioner individuellt eftersom denna metod mer precist kan identifiera bedrägliga och icke-bedrägliga mönster från de transnationella funktionerna.
|
83 |
Loan Default Prediction using Supervised Machine Learning Algorithms / Fallissemangprediktion med hjälp av övervakade maskininlärningsalgoritmerGranström, Daria, Abrahamsson, Johan January 2019 (has links)
It is essential for a bank to estimate the credit risk it carries and the magnitude of exposure it has in case of non-performing customers. Estimation of this kind of risk has been done by statistical methods through decades and with respect to recent development in the field of machine learning, there has been an interest in investigating if machine learning techniques can perform better quantification of the risk. The aim of this thesis is to examine which method from a chosen set of machine learning techniques exhibits the best performance in default prediction with regards to chosen model evaluation parameters. The investigated techniques were Logistic Regression, Random Forest, Decision Tree, AdaBoost, XGBoost, Artificial Neural Network and Support Vector Machine. An oversampling technique called SMOTE was implemented in order to treat the imbalance between classes for the response variable. The results showed that XGBoost without implementation of SMOTE obtained the best result with respect to the chosen model evaluation metric. / Det är nödvändigt för en bank att ha en bra uppskattning på hur stor risk den bär med avseende på kunders fallissemang. Olika statistiska metoder har använts för att estimera denna risk, men med den nuvarande utvecklingen inom maskininlärningsområdet har det väckt ett intesse att utforska om maskininlärningsmetoder kan förbättra kvaliteten på riskuppskattningen. Syftet med denna avhandling är att undersöka vilken metod av de implementerade maskininlärningsmetoderna presterar bäst för modellering av fallissemangprediktion med avseende på valda modelvaldieringsparametrar. De implementerade metoderna var Logistisk Regression, Random Forest, Decision Tree, AdaBoost, XGBoost, Artificiella neurala nätverk och Stödvektormaskin. En översamplingsteknik, SMOTE, användes för att behandla obalansen i klassfördelningen för svarsvariabeln. Resultatet blev följande: XGBoost utan implementering av SMOTE visade bäst resultat med avseende på den valda metriken.
|
84 |
[en] A SUPERVISED LEARNING APPROACH TO PREDICT HOUSEHOLD AID DEMAND FOR RECURRENT CLIME-RELATED DISASTERS IN PERU / [pt] UMA ABORDAGEM DE APRENDIZADO SUPERVISIONADO PARA PREVER A DEMANDA DE AJUDA FAMILIAR PARA DESASTRES CLIMÁTICOS RECORRENTES NO PERURENATO JOSE QUILICHE ALTAMIRANO 21 November 2023 (has links)
[pt] Esta dissertação apresenta uma abordagem baseada em dados para
o problema de predição de desastres recorrentes em países em
desenvolvimento. Métodos de aprendizado de máquina supervisionado são
usados para treinar classificadores que visam prever se uma família seria
afetada por ameaças climáticas recorrentes (um classificador é treinado
para cada perigo natural). A abordagem desenvolvida é válida para perigos
naturais recorrentes que afetam um país e permite que os gerentes de risco
de desastres direcionem suas operações com mais conhecimento. Além
disso, a avaliação preditiva permite que os gerentes entendam os
impulsionadores dessas previsões, levando à formulação proativa de
políticas e planejamento de operações para mitigar riscos e preparar
comunidades para desastres recorrentes.
A metodologia proposta foi aplicada ao estudo de caso do Peru, onde
foram treinados classificadores para ondas de frio, inundações e
deslizamentos de terra. No caso das ondas de frio, o classificador tem
73,82% de precisão. A pesquisa descobriu que famílias pobres em áreas
rurais são vulneráveis a desastres relacionados a ondas de frio e precisam
de intervenção humanitária proativa. Famílias vulneráveis têm
infraestrutura urbana precária, incluindo trilhas, caminhos, postes de
iluminação e redes de água e drenagem. O papel do seguro saúde, estado
de saúde e educação é menor. Domicílios com membros doentes levam a
maiores probabilidades de serem afetados por ondas de frio. Maior
realização educacional do chefe da família está associada a uma menor
probabilidade de ser afetado por ondas de frio. No caso das inundações, o classificador tem 82.57% de precisão.
Certas condições urbanas podem tornar as famílias rurais mais suscetíveis
a inundações, como acesso à água potável, postes de iluminação e redes
de drenagem. Possuir um computador ou laptop diminui a probabilidade de
ser afetado por inundações, enquanto possuir uma bicicleta e ser chefiado
por indivíduos casados aumenta. Inundações são mais comuns em
assentamentos urbanos menos desenvolvidos do que em famílias rurais
isoladas.
No caso dos deslizamentos de terra, o classificador tem 88.85% de
precisão, e é segue uma lógica diferente do das inundações. A importância
na previsão é mais uniformemente distribuída entre as características
consideradas no aprendizado do classificador. Assim, o impacto de um
recurso individual na previsão é pequeno. A riqueza a longo prazo parece
ser mais crítica: a probabilidade de ser afetado por um deslizamento é
menor para famílias com certos aparelhos e materiais domésticos de
construção. Comunidades rurais são mais afetadas por deslizamentos,
especialmente aquelas localizadas em altitudes mais elevadas e maiores
distâncias das cidades e mercados. O impacto marginal médio da altitude
é não linear.
Os classificadores fornecem um método inteligente baseado em
dados que economiza recursos garantindo precisão. Além disso, a
pesquisa fornece diretrizes para abordar a eficiência na distribuição da
ajuda, como formulações de localização da instalação e roteamento de
veículos.
Os resultados da pesquisa têm várias implicações gerenciais, então
os autores convocam à ação gestores de risco de desastres e outros
interessados relevantes. Desastres recorrentes desafiam toda a
humanidade. / [en] This dissertation presents a data-driven approach to the problem of predicting recurrent disasters in developing countries. Supervised machine learning methods are used to train classifiers that aim to predict whether a household would be affected by recurrent climate threats (one classifier is trained for each natural hazard). The approach developed is valid for recurrent natural hazards affecting a country and allows disaster risk managers to target their operations with more knowledge. In addition, predictive assessment allows managers to understand the drivers of these predictions, leading to proactive policy formulation and operations planning to mitigate risks and prepare communities for recurring disasters. The proposed methodology was applied to the case study of Peru, where classifiers were trained for cold waves, floods, and landslides. In the case of cold waves, the classifier was 73.82% accurate. The research found that low-income families in rural areas are vulnerable to cold wave related disasters and need proactive humanitarian intervention. Vulnerable families have poor urban infrastructure, including footpaths, roads, lampposts, and water and drainage networks. The role of health insurance, health status, and education is minor. Households with sick members are more likely to be affected by cold waves. Higher educational attainment of the head of the household is associated with a lower probability of being affected by cold snaps.In the case of flooding, the classifier is 82.57% accurate. Certain urban conditions, such as access to drinking water, lampposts, and drainage networks, can make rural households more susceptible to flooding. Owning a computer or laptop decreases the likelihood of being affected by flooding while owning a bicycle and being headed by married individuals increases it. Flooding is more common in less developed urban settlements than isolated rural families.In the case of landslides, the classifier is 88.85% accurate and follows a different logic than that of floods. The importance of the prediction is more evenly distributed among the features considered when learning the classifier. Thus, the impact of an individual feature on the prediction is small. Long-term wealth is more critical: the probability of being affected by a landslide is lower for families with specific appliances and household building materials. Rural communities are more affected by landslides, especially those located at higher altitudes and greater distances from cities and markets. The average marginal impact of altitude is non-linear.The classifiers provide an intelligent data-driven method that saves resources by ensuring accuracy. In addition, the research provides guidelines for addressing efficiency in aid distribution, such as facility location formulations and vehicle routing.The research results have several managerial implications, so the authors call for action from disaster risk managers and other relevant stakeholders. Recurrent disasters challenge all of humanity.
|
85 |
Estimation of Voltage Drop in Power Circuits using Machine Learning Algorithms : Investigating potential applications of machine learning methods in power circuits design / Uppskattning av spänningsfall i kraftkretsar med hjälp av maskininlärningsalgoritmer : Undersöka potentiella tillämpningar av maskininlärningsmetoder i kraftkretsdesignKoutlis, Dimitrios January 2023 (has links)
Accurate estimation of voltage drop (IR drop), in Application-Specific Integrated Circuits (ASICs) is a critical challenge, which impacts their performance and power consumption. As technology advances and die sizes shrink, predicting IR drop fast and accurate becomes increasingly challenging. This thesis focuses on exploring the application of Machine Learning (ML) algorithms, including Extreme Gradient Boosting (XGBoost), Convolutional Neural Network (CNN) and Graph Neural Network (GNN), to address this problem. Traditional methods of estimating IR drop using commercial tools are time consuming, especially for complex designs with millions of transistors. To overcome that, ML algorithms are investigated for their ability to provide fast and accurate IR drop estimation. This thesis utilizes electrical, timing and physical features of the ASIC design as input to train the ML models. The scalability of the selected features allows for their effective application across various ASIC designs with very few adjustments. Experimental results demonstrate the advantages of ML models over commercial tools, offering significant improvements in prediction speed. Notably, GNNs, such as Graph Convolutional Network (GCN) models showed promising performance with low prediction errors in voltage drop estimation. The incorporation of graph-structures models opens new fields of research for accurate IR drop prediction. The conclusions drawn emphasize the effectiveness of ML algorithms in accurately estimating IR drop, thereby optimizing ASIC design efficiency. The application of ML models enables faster predictions and noticeably reducing calculation time. This contributes to enhancing energy efficiency and minimizing environmental impact through optimised power circuits. Future work can focus on exploring the scalability of the models by training on a smaller portion of the circuit and extrapolating predictions to the entire design seems promising for more efficient and accurate IR drop estimation in complex ASIC designs. These advantages present new opportunities in the field and extend the capabilities of ML algorithms in the task of IR drop prediction. / Noggrann uppskattning av spänningsfallet (IR-fall), i ASIC är en kritisk utmaning som påverkar deras prestanda och strömförbrukning. När tekniken går framåt och formstorlekarna krymper, blir det allt svårare att förutsäga IR-fall snabbt och exakt. Denna avhandling fokuserar på att utforska tillämpningen av ML-algoritmer, inklusive XGBoost, CNN och GNN, för att lösa detta problem. Traditionella metoder för att uppskatta IR-fall med kommersiella verktyg är tidskrävande, särskilt för komplexa konstruktioner med miljontals transistorer. För att övervinna det undersöks ML-algoritmer för deras förmåga att ge snabb och exakt IR-falluppskattning. Denna avhandling använder elektriska, timing och fysiska egenskaper hos ASIC-designen som input för att träna ML-modellerna. Skalbarheten hos de valda funktionerna möjliggör deras effektiva tillämpning över olika ASIC-designer med mycket få justeringar. Experimentella resultat visar fördelarna med ML-modeller jämfört med kommersiella verktyg, och erbjuder betydande förbättringar i förutsägelsehastighet. Noterbart är att GNNs, såsom GCN-modeller, visade lovande prestanda med låga prediktionsfel vid uppskattning av spänningsfall. Införandet av grafstrukturmodeller öppnar nya forskningsfält för exakt IRfallförutsägelse. De slutsatser som dras betonar effektiviteten hos MLalgoritmer för att noggrant uppskatta IR-fall, och därigenom optimera ASICdesigneffektiviteten. Tillämpningen av ML-modeller möjliggör snabbare förutsägelser och märkbart minskad beräkningstid. Detta bidrar till att förbättra energieffektiviteten och minimera miljöpåverkan genom optimerade kraftkretsar. Framtida arbete kan fokusera på att utforska skalbarheten hos modellerna genom att träna på en mindre del av kretsen och att extrapolera förutsägelser till hela designen verkar lovande för mer effektiv och exakt IR-falluppskattning i komplexa ASIC-designer. Dessa fördelar ger nya möjligheter inom området och utökar kapaciteten hos ML-algoritmer i uppgiften att förutsäga IR-fall.
|
86 |
Automatic Patent ClassificationYehe, Nala January 2020 (has links)
Patents have a great research value and it is also beneficial to the community of industrial, commercial, legal and policymaking. Effective analysis of patent literature can reveal important technical details and relationships, and it can also explain business trends, propose novel industrial solutions, and make crucial investment decisions. Therefore, we should carefully analyze patent documents and use the value of patents. Generally, patent analysts need to have a certain degree of expertise in various research fields, including information retrieval, data processing, text mining, field-specific technology, and business intelligence. In real life, it is difficult to find and nurture such an analyst in a relatively short period of time, enabling him or her to meet the requirement of multiple disciplines. Patent classification is also crucial in processing patent applications because it will empower people with the ability to manage and maintain patent texts better and more flexible. In recent years, the number of patents worldwide has increased dramatically, which makes it very important to design an automatic patent classification system. This system can replace the time-consuming manual classification, thus providing patent analysis managers with an effective method of managing patent texts. This paper designs a patent classification system based on data mining methods and machine learning techniques and use KNIME software to conduct a comparative analysis. This paper will research by using different machine learning methods and different parts of a patent. The purpose of this thesis is to use text data processing methods and machine learning techniques to classify patents automatically. It mainly includes two parts, the first is data preprocessing and the second is the application of machine learning techniques. The research questions include: Which part of a patent as input data performs best in relation to automatic classification? And which of the implemented machine learning algorithms performs best regarding the classification of IPC keywords? This thesis will use design science research as a method to research and analyze this topic. It will use the KNIME platform to apply the machine learning techniques, which include decision tree, XGBoost linear, XGBoost tree, SVM, and random forest. The implementation part includes collection data, preprocessing data, feature word extraction, and applying classification techniques. The patent document consists of many parts such as description, abstract, and claims. In this thesis, we will feed separately these three group input data to our models. Then, we will compare the performance of those three different parts. Based on the results obtained from these three experiments and making the comparison, we suggest using the description part data in the classification system because it shows the best performance in English patent text classification. The abstract can be as the auxiliary standard for classification. However, the classification based on the claims part proposed by some scholars has not achieved good performance in our research. Besides, the BoW and TFIDF methods can be used together to extract efficiently the features words in our research. In addition, we found that the SVM and XGBoost techniques have better performance in the automatic patent classification system in our research.
|
87 |
Using supervised learning methods to predict the stop duration of heavy vehicles.Oldenkamp, Emiel January 2020 (has links)
In this thesis project, we attempt to predict the stop duration of heavy vehicles using data based on GPS positions collected in a previous project. All of the training and prediction is done in AWS SageMaker, and we explore possibilities with Linear Learner, K-Nearest Neighbors and XGBoost, all of which are explained in this paper. Although we were not able to construct a production-grade model within the time frame of the thesis, we were able to show that the potential for such a model does exist given more time, and propose some suggestions for the paths one can take to improve on the endpoint of this project.
|
88 |
Ensemble Classifier Design and Performance Evaluation for Intrusion Detection Using UNSW-NB15 DatasetZoghi, Zeinab 30 November 2020 (has links)
No description available.
|
89 |
Knowledge Discovery and Data Mining Using Demographic and Clinical Data to Diagnose Heart Disease. / Knowledge Discovery och Data mining med hjälp av demografiska och kliniska data för att diagnostisera hjärtsjukdomar.Fernandez Sanchez, Javier January 2018 (has links)
Cardiovascular disease (CVD) is the leading cause of morbidity, mortality, premature death and reduced quality of life for the citizens of the EU. It has been reported that CVD represents a major economic load on health care sys- tems in terms of hospitalizations, rehabilitation services, physician visits and medication. Data Mining techniques with clinical data has become an interesting tool to prevent, diagnose or treat CVD. In this thesis, Knowledge Dis- covery and Data Mining (KDD) was employed to analyse clinical and demographic data, which could be used to diagnose coronary artery disease (CAD). The exploratory data analysis (EDA) showed that female patients at an el- derly age with a higher level of cholesterol, maximum achieved heart rate and ST-depression are more prone to be diagnosed with heart disease. Furthermore, patients with atypical angina are more likely to be at an elderly age with a slightly higher level of cholesterol and maximum achieved heart rate than asymptotic chest pain patients. More- over, patients with exercise induced angina contained lower values of maximum achieved heart rate than those who do not experience it. We could verify that patients who experience exercise induced angina and asymptomatic chest pain are more likely to be diagnosed with heart disease. On the other hand, Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Tree, Bagging and Boosting methods were evaluated by adopting a stratified 10 fold cross-validation approach. The learning models provided an average of 78-83% F-score and a mean AUC of 85-88%. Among all the models, the highest score is given by Radial Basis Function Kernel Support Vector Machines (RBF-SVM), achieving 82.5% ± 4.7% of F-score and an AUC of 87.6% ± 5.8%. Our research con- firmed that data mining techniques can support physicians in their interpretations of heart disease diagnosis in addition to clinical and demographic characteristics of patients.
|
90 |
Railway curve squeal: Statistical analysis of train speed impact on squeal noiseAsplund, Ruben January 2024 (has links)
No description available.
|
Page generated in 0.0476 seconds