111 |
[pt] ALGORITMOS DE APROXIMAÇÃO PARA ÁRVORES DE DECISÃO / [en] APPROXIMATION ALGORITHMS FOR DECISION TREES
ALINE MEDEIROS SAETTLER (13 December 2021)
Decision tree construction is a central problem in several areas of computer science, for example database theory and computational learning. It can be viewed as the problem of evaluating a discrete function, where checking the value of each variable of the function incurs a cost and the points where the function is defined are associated with a probability distribution. The goal is to evaluate the function while minimizing the cost spent (in the worst case or in expectation). In this thesis, we present four contributions related to this problem. The first is an algorithm that achieves an O(log(n)) approximation with respect to both the expected and the worst-case costs. The second is a procedure that combines two trees, one with worst-case cost W and another with expected cost E, and produces a tree with worst-case cost at most (1 + p)W and expected cost at most (1/(1 - e^{-p}))E, where p is a given parameter. We also prove that this is a sharp characterization of the best attainable trade-off, showing that there are infinitely many instances for which we cannot obtain a decision tree with both worst-case cost smaller than (1 + p)OPTW(I) and expected cost smaller than (1/(1 - e^{-p}))OPTE(I), where OPTW(I) (resp. OPTE(I)) denotes the cost of the decision tree that minimizes the worst-case cost (resp. expected cost) for an instance I of the problem. The third contribution is an O(log(n)) approximation algorithm for minimizing the worst-case cost in a variant of the problem where the cost of reading a variable depends on its value. Our final contribution is a randomized rounding algorithm that, given an instance of the problem (with an additional integer k > 0) and a parameter 0 < e < 1/2, builds an oblivious decision tree with cost at most (3/(1 - 2e))ln(n)OPT(I) that makes at most (k/e) errors, where OPT(I) denotes the minimum cost among all oblivious decision trees for instance I that make at most k classification errors.
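To get a feel for the trade-off stated in the abstract, the two multiplicative factors can be tabulated for a few values of p. This is only a numerical illustration of the bounds (1 + p) and 1/(1 - e^{-p}); the function name `tradeoff_factors` is made up for this sketch and does not come from the thesis.

```python
import math


def tradeoff_factors(p: float) -> tuple[float, float]:
    """Return (worst-case factor, expected-cost factor) for a parameter p > 0.

    The factors are the bounds quoted in the abstract:
    worst-case cost <= (1 + p) * W and expected cost <= E / (1 - e^{-p}).
    """
    if p <= 0:
        raise ValueError("p must be positive")
    worst_factor = 1.0 + p
    expected_factor = 1.0 / (1.0 - math.exp(-p))
    return worst_factor, expected_factor


if __name__ == "__main__":
    # Larger p loosens the worst-case bound and tightens the expected-cost bound.
    for p in (0.5, 1.0, 2.0):
        worst, expected = tradeoff_factors(p)
        print(f"p={p}: worst-case factor {worst:.3f}, expected factor {expected:.3f}")
```

As p grows, the worst-case factor grows linearly while the expected-cost factor approaches 1, which is the tension the sharpness result describes.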
|
112 |
Data driven driving evaluation : A supervised machine learning approach for classification of high frequency triaxial acceleration
Lundberg, Henrik (January 2024)
The ability to navigate a continuously changing business landscape has been a success factor for Scania in staying competitive. Digitalization has enabled data to be collected from various sources, and the ability to embrace the possibilities that come with it and turn them into an advantage is crucial for Scania to drive the changing industry. Today, Scania is good at collecting and analyzing data, but there is room for improvement in using the data for data-driven decision-making. This study investigates the possibility of learning more about users' driving behavior through data-driven driving evaluation. This is done with a machine learning approach in which a CNN-GRU neural network with an XGBoost classifier is created to classify triaxial acceleration data as normal or aggressive driving behavior. The findings show that this model architecture has a classification accuracy of 87.80%, and the result is discussed with respect to method implementation, quality of data, hyperparameter tuning, and future studies.
|
113 |
A Comprehensive Experimental and Computational Investigation on Estimation of Scour Depth at Bridge Abutment: Emerging Ensemble Intelligent Systems
Pandey, M., Karbasi, M., Jamei, M., Malik, A., Pu, Jaan H. (12 October 2024)
Several bridges have failed because of scouring and erosion around bridge elements. Hence, precise prediction of abutment scour is necessary for the safe design of bridges. In this research, experimental and computational investigations were conducted based on 45 flume experiments carried out at NIT Warangal, India. Three innovative ensemble-based data intelligence paradigms, namely categorical boosting (CatBoost) in conjunction with extra tree regression (ETR) and K-nearest neighbor (KNN), are used to accurately predict the scour depth around the bridge abutment. A total of 308 series of laboratory data (263 datasets from a wide range of existing abutment scour depth datasets and 45 flume datasets) in various sediment and hydraulic conditions were used to develop the models. Four dimensionless variables were used to calculate scour depth: the approach densimetric Froude number (Fd50), the ratio of upstream depth (y) to abutment transverse length (y/L), the ratio of abutment transverse length to sediment mean diameter (L/d50), and the ratio of mean velocity to critical velocity (V/Vcr). The gradient boosting decision tree (GBDT) method was used to select the features with higher importance. Based on the feature selection results, two combinations of input variables were used: Comb1 (all variables as model input) and Comb2 (all variables except Fd50). The CatBoost model with Comb1 data input (RMSE = 0.1784, R = 0.9685, MAPE = 10.4724) provided better accuracy than the other machine learning models.
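The abstract reports RMSE, R, and MAPE without defining them. A minimal sketch of how such metrics are commonly computed is given below; it assumes that R is the Pearson correlation coefficient and that MAPE is expressed in percent, neither of which the abstract states. The function names are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def pearson_r(observed, predicted):
    """Pearson correlation coefficient (assumed meaning of R in the abstract)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observed values must be nonzero."""
    n = len(observed)
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / n


if __name__ == "__main__":
    # Made-up scour depths, purely for illustration (not data from the paper).
    obs = [1.0, 2.0, 4.0]
    pred = [1.1, 1.8, 4.4]
    print(rmse(obs, pred), pearson_r(obs, pred), mape(obs, pred))
```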
|
114 |
Tratamento de imprecisão na geração de árvores de decisão
Lopes, Mariana Vieira Ribeiro (03 March 2016)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / Inductive Decision Trees (DT) are mechanisms based on the symbolic paradigm of machine learning whose main characteristics are easy interpretability and low computational cost. Although widely used, DTs are limited to representing problems whose variables are discrete or continuous. For some problems, however, the variables are not well represented in this way. To address this, Fuzzy Decision Trees (FDT) were developed, adding to the interpretability of Inductive Decision Trees the ability to deal with fuzzy variables, which adequately represent imprecise knowledge. This text presents the work developed during the master's degree, whose main result is a new algorithm for fuzzy decision tree induction. Its fuzzification of continuous attributes is performed during the induction of the tree and is inspired by C4.5's partitioning method for continuous attributes. The proposed algorithm was tested on 20 datasets from the UCI repository (LICHMAN, 2013) and compared with three other algorithms that approach the classification problem with different techniques: C4.5, which induces an Inductive Decision Tree; FURIA, which induces a Rule-based Fuzzy System but does not follow a tree structure; and FuzzyDT, which induces a Fuzzy Decision Tree in which the continuous attributes are fuzzified before the tree is induced. The experimental results are presented and discussed in Chapter 4.
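For readers unfamiliar with the C4.5-style handling of continuous attributes that the abstract mentions, the following is a simplified sketch of a binary threshold search: candidate cut points are taken between consecutive distinct sorted values, and the one with the highest information gain is kept. It is not the dissertation's fuzzification algorithm, and it omits C4.5 details such as the gain ratio; the names `entropy` and `best_threshold` are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split, or (None, 0.0)."""
    pairs = sorted(zip(values, labels))
    sorted_labels = [label for _, label in pairs]
    base = entropy(sorted_labels)
    best_t, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cut point between equal values
        left, right = sorted_labels[:i], sorted_labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            best_t, best_gain = (low + high) / 2.0, gain
    return best_t, best_gain


if __name__ == "__main__":
    # Toy attribute values and class labels, for illustration only.
    print(best_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))
```

A fuzzy variant would replace the crisp cut point with overlapping membership functions around it; how the dissertation does this is not described in the abstract.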
|
115 |
Carbon Intensity Estimation of Publicly Traded Companies / Uppskattning av koldioxidintensitet hos börsnoterade bolag
Ribberheim, Olle (January 2021)
The purpose of this master's thesis is to develop a model to estimate the carbon intensity, i.e., the carbon emissions relative to economic activity, of publicly traded companies that do not report their carbon emissions. Using statistical and machine learning models, the core of the thesis is to develop and compare different methods and models with regard to accuracy, robustness, and explanatory value when estimating carbon intensity. Both discrete variables, such as the region and sector the company operates in, and continuous variables, such as revenue and capital expenditures, are used in the estimation. Six methods were compared: two statistically derived methods and four machine learning methods. The thesis consists of three parts: data preparation, model implementation, and model comparison. The comparison indicates that the boosted decision tree is both the most accurate and the most robust model. Lastly, the strengths and weaknesses of the methodology are discussed, as well as the suitability and legitimacy of the boosted decision tree for estimating carbon intensity.
|
116 |
Automatic Analysis of Peer Feedback using Machine Learning and Explainable Artificial Intelligence / Automatisk analys av Peer feedback med hjälp av maskininlärning och förklarig artificiell Intelligence
Huang, Kevin (January 2023)
Peer assessment is a process in which learners evaluate and provide feedback on one another's performance, which is critical to the student learning process. Earlier research has shown that it can improve student learning outcomes in various settings, including engineering education, where collaborative teaching and learning activities are common. Peer assessment activities in computer-supported collaborative learning (CSCL) settings are becoming more and more common. When digital technologies are used for these activities, much student data (e.g., peer feedback text entries) is generated automatically. These large data sets can be analyzed (e.g., through computational methods) and used to improve our understanding of how students regulate their learning in CSCL settings, in order to improve their conditions for learning by, for example, providing in-time feedback. Yet there is currently a need to automate the coding of these large volumes of student text data, since it is a very time- and resource-consuming task. In this regard, recent developments in machine learning could prove beneficial. To understand how the affordances of machine learning technologies can be harnessed to classify student text data, this thesis examines the application of five models to a data set containing peer feedback from 231 students in a large technical university course. The models evaluated on the dataset are the traditional models Multi Layer Perceptron (MLP) and Decision Tree and the transformer-based models BERT, RoBERTa, and DistilBERT. Cohen's κ, accuracy, and F1-score were used as metrics to evaluate each model's performance. The data was preprocessed by removing stopwords, and it was then examined whether removing them improved the performance of the models. The results showed that preprocessing improved the performance of the Decision Tree only, while performance decreased for all other models. RoBERTa was the best-performing model on the dataset on all metrics used. Explainable artificial intelligence (XAI) was applied to RoBERTa, as the best-performing model, and it was found that the words considered stopwords made a difference to the prediction.
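Of the three metrics named in the abstract, Cohen's κ is the least familiar: it measures agreement between two label sequences corrected for the agreement expected by chance. The sketch below shows the standard formula in plain Python; the function name and the choice to return NaN in the degenerate case are assumptions of this example, not details from the thesis.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy human codes versus model predictions, for illustration only.
    human = ["praise", "praise", "critique", "critique"]
    model = ["praise", "critique", "critique", "critique"]
    print(cohens_kappa(human, model))
```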
|
117 |
Consensus Algorithms in Blockchain : A survey to create decision trees for blockchain applications / Konsensusalgoritmer i Blockchain : En undersökning för att skapa beslutsträd för blockchain-applikationer
Zhu, Xinlin (January 2023)
Blockchain is a decentralized database that is distributed across a computer network. To enable a smooth decision-making process without any central authority, different blockchain applications use their own consensus algorithms. The problem is that a new blockchain application has limited aid in deciding which algorithm it should implement. Selecting a consensus algorithm is crucial because reaching consensus is the fundamental issue of a decentralized system. Different algorithms are designed with their own advantages and limitations, making it complex to navigate a list of consensus algorithms. This thesis attempts to contribute to solving this problem by surveying the consensus algorithms used by 15 existing cryptocurrencies in their blockchain applications and then producing a decision tree as an aid for algorithm selection. The top 5 algorithms from each of the categories Proof of Work (PoW), Proof of Stake (PoS), and Hybrid Proof of Work + Proof of Stake (PoW + PoS) are selected. The research method is qualitative. The study shows that different consensus algorithms often share some properties, but they are usually built to solve the issues of another algorithm, which means they also have their own distinctive advantages. The decision tree therefore reveals how these algorithms are logically connected and the key properties that blockchain consensus algorithms possess. Based on the results of this thesis, further research can include more algorithms to make the decision tree more comprehensive. These algorithms can also be implemented in similar network setups to test their claimed properties. The decision tree can be sent to industry for further feedback.
|
118 |
Categorization of Swedish e-mails using Supervised Machine Learning / Kategorisering av svenska e-postmeddelanden med användning av övervakad maskininlärning
Mann, Anna, Höft, Olivia (January 2021)
Society today is becoming more digitalized, and a common way of communicating is to send e-mails. Currently, the company Auranest has a filtering method for categorizing e-mails, but the method is a few years old. The filter identifies e-mails that are valuable for jobseekers, in which employers may be making contact. The company wants to know whether the categorization can be performed with a different method and improved. The degree project aims to investigate whether the categorization can be performed with higher accuracy using machine learning. Three supervised machine learning algorithms, Naïve Bayes, Support Vector Machine (SVM), and Decision Tree, have been examined, and the algorithm with the highest results has been compared with Auranest's existing filter. Accuracy, precision, recall, and F1 score have been used to determine which machine learning algorithm achieved the highest results, both among the algorithms and in comparison with Auranest's filter. The results showed that the supervised machine learning algorithm SVM achieved the best results in all metrics. The comparison between Auranest's existing filter and SVM showed that SVM performed better in all calculated metrics, with an accuracy of 99.5% for SVM and 93.03% for Auranest's filter. The comparative results showed that accuracy was the only metric with similar results. For the other metrics, there was a noticeable difference.
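The four metrics used in the comparison can all be derived from the counts of true/false positives and negatives. The sketch below computes them for a binary labeling; the function name, the convention of returning 0.0 when a denominator is zero, and the toy labels are assumptions made for illustration and are not taken from the thesis.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels as a dict."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    # Convention chosen here: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # 1 = valuable e-mail, 0 = other; made-up labels, not data from the thesis.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    print(binary_metrics(y_true, y_pred))
```

On imbalanced data, accuracy can stay high while precision and recall on the minority class are noticeably lower, which is one reason to report all four.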
|
119 |
Predicting the threshold grade for university admission through Machine Learning Classification Models / Förutspå tröskelvärdet för universitetsantagningsbetyg genom klassificeringsmodeller inom maskininlärning
Almawed, Anas, Victorin, Anton (January 2023)
Higher education is very important these days, which can create very high admission thresholds for popular programs at certain universities. To know what grade will be needed for admission to a program, a student can look at the thresholds from previous years and use them to guess the threshold for the coming year. We explored whether it was possible to generate accurate predictions of the future threshold. We did this using well-established machine learning classification models and admission data going back 14 years, covering all applicants to the Computer Science and Engineering program at KTH Royal Institute of Technology. We found that the models are good at correctly classifying data from the past, but are not able to predict future thresholds in a meaningful way. The models could not make accurate future predictions based solely on the grades of past applicants.
|
120 |
Loss Given Default Estimation with Machine Learning Ensemble Methods / Estimering av förlust vid fallissemang med ensembelmetoder inom maskininlärning
Velka, Elina (January 2020)
This thesis evaluates the performance of three machine learning methods in predicting the Loss Given Default (LGD). LGD can be seen as the opposite of the recovery rate, i.e., the share of an outstanding loan that the loan issuer would not be able to recover if the customer defaulted. The methods investigated are decision trees, random forest, and boosted methods. All of the methods performed well in predicting the cases where the loan is not recovered, LGD = 1 (100%), or the loan is fully recovered, LGD = 0 (0%). When the models were evaluated on a dataset from which the observations with LGD = 1 had been removed, a significant decrease in performance was observed. The random forest model built on an unbalanced training dataset performed better on the test dataset that included observations with LGD = 1, while the random forest model built on a balanced training dataset performed better on the test set from which the observations with LGD = 1 had been removed. The boosted models evaluated in this study gave less accurate predictions than the other methods. Overall, random forest models performed slightly better than decision tree models, although the computational time (the cost) was considerably longer when running the random forest models. Decision tree models are therefore suggested for predicting the Loss Given Default.
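The relationship between LGD and the recovery rate described in the abstract can be written as LGD = 1 - recovery rate. The sketch below computes it from an exposure and a recovered amount and filters out fully lost loans, mirroring the evaluation subset mentioned in the abstract. The function name, the clamping of the recovery rate to [0, 1], and the sample figures are assumptions of this example.

```python
def loss_given_default(exposure: float, recovered: float) -> float:
    """Return LGD = 1 - recovery rate for a defaulted loan.

    `exposure` is the outstanding amount at default and `recovered` the amount
    eventually collected. The recovery rate is clamped to [0, 1], which is an
    assumption of this sketch rather than something the thesis specifies.
    """
    if exposure <= 0:
        raise ValueError("exposure must be positive")
    recovery_rate = min(max(recovered / exposure, 0.0), 1.0)
    return 1.0 - recovery_rate


if __name__ == "__main__":
    # Hypothetical (exposure, recovered) pairs, not data from the thesis.
    loans = [(1000.0, 0.0), (1000.0, 1000.0), (1000.0, 250.0)]
    lgds = [loss_given_default(e, r) for e, r in loans]
    # Subset without total losses, as in the second evaluation setting.
    without_total_loss = [lgd for lgd in lgds if lgd < 1.0]
    print(lgds, without_total_loss)
```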
|