Global ETD Search

131	Predicting the threshold grade for university admission through Machine Learning Classification Models / Förutspå tröskelvärdet för universitetsantagningsbetyg genom klassificeringsmodeller inom maskininlärning Almawed, Anas, Victorin, Anton January 2023 Higher education is highly valued today, which can push admission thresholds very high for popular programs at certain universities. To find out what grade will be needed for admission to a program, a student can look at the thresholds from previous years and guess from them what the threshold will be in coming years. We explored whether well-established machine learning classification models could generate accurate predictions of the future threshold. We trained the models on admission data going back 14 years and covering all applicants to the Computer Science and Engineering program at KTH Royal Institute of Technology. We found that the models are good at correctly classifying data from past years but cannot predict future thresholds in a meaningful way. Grades of past applicants alone were not enough for accurate predictions of future thresholds.
Admission data; Data Classification; Machine Learning; Logistic Regression; Support Vector Machine; Decision Tree Classifier; Random Forest; Computer and Information Sciences
132	Loss Given Default Estimation with Machine Learning Ensemble Methods / Estimering av förlust vid fallissemang med ensembelmetoder inom maskininlärning Velka, Elina January 2020 This thesis evaluates three machine learning methods for predicting Loss Given Default (LGD). LGD can be seen as the opposite of the recovery rate, i.e. the share of an outstanding loan that the lender would not recover if the customer defaulted. The methods investigated are decision trees, random forests and boosted methods. All of them performed well in predicting the cases where the loan is not recovered at all, LGD = 1 (100%), or is fully recovered, LGD = 0 (0%). When the models were evaluated on a dataset from which the observations with LGD = 1 had been removed, a significant decrease in performance was observed. The random forest model built on an unbalanced training dataset performed better on the test dataset that included observations with LGD = 1, while the random forest model built on a balanced training dataset performed better on the test set from which those observations had been removed. The boosted models gave the least accurate predictions of the three methods. Overall, random forest models performed slightly better than decision tree models, but their computation time (the cost) was considerably longer. Decision tree models are therefore suggested for predicting Loss Given Default.
Loss Given Default; Non-Performing Loans; Internal Ratings Based Approach; Machine Learning; Decision Tree; Random Forest; Boosted Method; Mathematics
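The definition of LGD used in entry 132 (the opposite of the recovery rate) can be written out as a minimal sketch. The function name and the example amounts are hypothetical and are not taken from the thesis; the sketch also assumes the recovered amount lies between zero and the outstanding amount.

```python
def loss_given_default(outstanding: float, recovered: float) -> float:
    """Share of the outstanding loan that is NOT recovered after default.

    Assumption (not from the thesis): 0 <= recovered <= outstanding.
    """
    if outstanding <= 0:
        raise ValueError("outstanding amount must be positive")
    if not 0 <= recovered <= outstanding:
        raise ValueError("recovered amount must lie in [0, outstanding]")
    recovery_rate = recovered / outstanding
    return 1.0 - recovery_rate


# Hypothetical loans: nothing recovered, fully recovered, partly recovered.
nothing_recovered = loss_given_default(1000.0, 0.0)     # LGD = 1 (100%)
fully_recovered = loss_given_default(1000.0, 1000.0)    # LGD = 0 (0%)
partly_recovered = loss_given_default(1000.0, 250.0)    # LGD = 0.75
```

The two boundary cases correspond to the LGD = 1 and LGD = 0 observations that the abstract says all three methods predicted well.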
133	Automated dust storm detection using satellite images : development of a computer system for the detection of dust storms from MODIS satellite images and the creation of a new dust storm database El-Ossta, Esam Elmehde Amar January 2013 Dust storms are a natural hazard whose frequency has increased in recent years over the Sahara, Australia, the Arabian Desert, Turkmenistan and northern China, and they have worsened during the last decade. Dust storms increase air pollution, harm human health, lower temperatures, damage communication facilities, and reduce visibility, which delays both road and air traffic; they affect urban and rural areas alike. It is therefore important to understand the causes, movement and radiative effects of dust storms. Monitoring and forecasting of dust storms is increasing in order to help governments reduce the negative impact of these storms. Satellite remote sensing is the most common method, but its use over sandy ground is still limited because dust and sandy ground share similar characteristics. In addition, approaches based on true-colour images, estimates of aerosol optical thickness (AOT), or algorithms such as the Deep Blue algorithm have limitations for identifying dust storms. Many researchers have studied the detection of dust storms during daytime in different regions of the world, including China, Australia, America and North Africa, using a variety of satellite data, but fewer studies have focused on detecting dust storms at night. The key aims of the present study are to use data from the Moderate Resolution Imaging Spectroradiometers (MODIS) on the Terra and Aqua satellites to develop a more effective automated method for detecting dust storms during both day and night, and to generate a MODIS dust storm database. 551.55
134	An Integrative Approach for Examining the Determinants of Abnormal Returns: The Cases of Internet Security Breach and Ecommerce Initiative Andoh-Baidoo, Francis Kofi 01 January 2006 Researchers in various business disciplines use the event study methodology to assess the market value of firms through capital market reaction to news in the public media about the firm's activities. Capital market reaction is assessed based on cumulative abnormal return (the sum of abnormal returns over the event window). In this study, the event study methodology is used to assess the impact that two important information technology activities, Internet security breach and ecommerce initiative, have on the market value of firms. While prior research on the relationship between these business activities and cumulative abnormal return relied on regression analysis, this study uses both decision tree induction and regression. For the Internet security breach study, we use negative cumulative abnormal return as a surrogate for damage to the breached firm. In contrast to what has been reported in the research literature, our results suggest that the relationship between cumulative abnormal return and the independent variables in both the Internet security breach and ecommerce initiative studies is complex, often involving conditional interactions between the independent variables. We report that incomplete contract theory is unable to explain the relationship between cumulative abnormal return and the organizational variables effectively, whereas other ecommerce theories support the findings from our analysis. We show that both attack and firm characteristics are determinants of damage to breached firms. Our results reveal that decision tree induction provides insight beyond that provided by regression models. We illustrate that there is value in using data mining techniques to study the market value of ecommerce initiatives and Internet security breaches, that this approach is applicable in other domains, and that decision tree induction can enhance the event study methodology. We demonstrate that decision tree induction can be used for both theory building and theory testing. Specifically, we employ decision tree induction to test and enhance ecommerce theories and to develop a theoretical model for cumulative abnormal return and ecommerce. We also present theoretical models for Internet security breach and damage to the breached firm. These models can be used by decision makers when formulating and implementing Internet security and ecommerce investment strategies.
cumulative abnormal return; investor perception; decision tree induction; ecommerce; Internet security; market value; event study methodology; Business; Management Information Systems
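Entry 134 defines cumulative abnormal return as the sum of abnormal returns over the event window. The sketch below illustrates that sum under one loudly stated assumption: abnormal return is taken as firm return minus market return (a market-adjusted model), which the abstract does not specify. The function name and the return figures are hypothetical.

```python
def cumulative_abnormal_return(firm_returns, market_returns):
    """Sum of abnormal returns over the event window.

    Assumption: abnormal return = firm return - market return
    (market-adjusted model). The abstract does not say which
    expected-return model the dissertation actually uses.
    """
    if len(firm_returns) != len(market_returns):
        raise ValueError("both series must cover the same event window")
    return sum(f - m for f, m in zip(firm_returns, market_returns))


# Hypothetical three-day event window around a breach announcement.
firm = [-0.020, -0.015, 0.005]
market = [0.001, -0.002, 0.003]
car = cumulative_abnormal_return(firm, market)  # roughly -0.032
```

A negative value such as this one is what the abstract treats as a surrogate for damage to the breached firm.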
135	Adaptive Similarity of XML Data Jílková, Eva January 2014 In the present work we explore the application of XML schema mapping in the conceptual modeling of XML schemas. We expand upon previous efforts to map XML schemas to a PIM schema via a decision tree. In this thesis a more versatile method is implemented: the decision tree is trained on a large set of user-annotated mapping decision samples. Several variations of training that could improve the mapping results are proposed. The approach is evaluated in a wide range of experiments that show the advantages and disadvantages of the proposed variations. The work also contains a survey of different approaches to schema mapping and a description of the schema used in this work.
136	Artificial intelligence and Machine learning : a diabetic readmission study Forsman, Robin, Jönsson, Jimmy January 2019 The maturing of artificial intelligence provides great opportunities for healthcare, but also comes with new challenges. For artificial intelligence to be adequate, a comprehensive analysis of the data is necessary, along with testing the data with multiple algorithms to determine which one is appropriate to use. In this study, data has been gathered on patients who were either readmitted or not readmitted to hospital within 30 days of being admitted. The data has then been analyzed and run through different algorithms to determine the most appropriate one.
Artificial intelligence; Machine learning; Logistic regression; K-nearest neighbor; Boosted decision tree; Artificial neural network; Computer Sciences
137	Uma adaptação do método Binary Relevance utilizando árvores de decisão para problemas de classificação multirrótulo aplicado à genômica funcional / An Adaptation of Binary Relevance for Multi-Label Classification applied to Functional Genomics Tanaka, Erica Akemi 30 August 2013 Many classification problems described in the machine learning and data mining literature concern settings in which each example belongs to a single class. However, many classification problems, especially in bioinformatics, associate each example with more than one class; these are known as multi-label classification problems. The basic principle of multi-label classification is similar to that of traditional single-label classification; the difference is the number of labels to be predicted, which is two or more. In bioinformatics, many problems involve a large number of labels that can be associated with each example. Traditional classification algorithms cannot cope with multi-label examples, since they were designed to predict a single label. A simple solution is the method known as Binary Relevance. However, studies have shown that this approach is not a good solution to the multi-label classification problem because each class is treated individually, ignoring possible relations between classes. The objective of this research was therefore to propose a new adaptation of Binary Relevance that takes relations between labels into account to reduce this disadvantage, and that also considers the interpretability of the generated model, not just its performance. The experimental results show that the new method can generate trees that relate correlated labels and that its performance is comparable to that of other methods, with good results on the F-measure.
Decision Tree; Functional Genomics; Machine Learning; Multi-Label Classification
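The baseline Binary Relevance method that entry 137 starts from can be sketched as a data transformation: one multi-label dataset becomes one independent yes/no dataset per label. This is only the standard baseline, not the adaptation proposed in the thesis; the function name, feature vectors and label names below are hypothetical.

```python
def binary_relevance_split(examples, label_sets):
    """Turn one multi-label dataset into one binary dataset per label.

    Each binary dataset pairs every example with True/False depending on
    whether the example carries that label. Any single-label classifier
    (for instance a decision tree) could then be trained on each one.
    """
    all_labels = sorted(set().union(*label_sets))
    return {
        label: [(x, label in labels) for x, labels in zip(examples, label_sets)]
        for label in all_labels
    }


# Hypothetical feature vectors and functional labels.
examples = [(0.2, 1.0), (0.9, 0.1), (0.5, 0.5)]
label_sets = [{"kinase"}, {"kinase", "transport"}, {"transport"}]
per_label = binary_relevance_split(examples, label_sets)
```

Because each per-label dataset is built and learned separately, nothing in this baseline can express that two labels tend to occur together, which is the drawback the abstract sets out to reduce.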
138	Application of Stochastic Decision Models to Solid Waste Management Wright, William Ervin 08 1900 This research applies stochastic decision tree analytical techniques to a decision of the type a small community may face when choosing a solid waste disposal system from among several alternatives. Specifically targeted are situations in which a community finds itself (1) lying at or near the boundary of a central planning area, (2) in a position to exercise one of several disposal options, and (3) with access to the database on solid waste that has been systematically developed by a central planning agency. The options available may or may not be optimal in terms of total cost, either to the community or to adjacent communities that participate in centrally coordinated or jointly organized activities. The study suggests that stochastic simulation models, drawing upon a database developed by central planning agencies in cases where local data are inadequate or unavailable, can be useful in evaluating disposal alternatives at the community level. Further, the decision tree can be usefully employed to communicate the results of the analysis. Some important areas for further research on the small-community disposal system selection problem are noted. solid waste disposal systems Refuse and refuse disposal.
139	Algoritmo para indução de árvores de classificação para dados desbalanceados / Algorithm for induction of classification trees for unbalanced data Cláudio Frizzarini 21 November 2013 Data mining techniques, and machine learning methods in particular, have become very popular in recent years. Many decision support information systems and business intelligence tools have incorporated such techniques and use them intensively. Top-Down Induction of Decision Trees (TDIDT) algorithms are among the most popular tools for supervised learning. One of their advantages over other methods is that, once built and validated, a decision tree is usually easy for a domain specialist to interpret, without prior knowledge of the induction algorithm. On the other hand, many typical classification problems involve unbalanced data (heterogeneous class prevalence). In such cases, algorithms based on global error minimization tend to induce classifiers with low error rates on the high-prevalence classes but high error rates on the low-prevalence classes. This can be critical when low-prevalence classes represent rare or important events, such as the presence of a severe disease (in medical diagnosis) or default on a loan (in credit analysis). To address this problem, several TDIDT algorithms require the calibration of ad hoc parameters or, in the absence of such parameters, the use of data balancing techniques. Both approaches make data mining tools more complex for less experienced users, and they are not always available. In this work, we propose a new TDIDT algorithm for problems involving unbalanced data. This algorithm, currently named DDBT (Dynamic Discriminant Bounds Tree), uses a node partition criterion that is not based on absolute class frequencies but instead compares the prevalence of each class in the current node with its prevalence in the original training sample, seeking subsets that discriminate the classes better than the original dataset does. To label terminal nodes, the algorithm assigns the class with the maximum ratio between its relative prevalence in the node and its prevalence in the original training sample. These characteristics give the algorithm more flexibility in handling unbalanced datasets, yielding a better balance among the error rates across classes.
Classification tree; Data mining; Decision Tree; Supervised learning; Unbalanced data
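The terminal-node labeling rule described for DDBT (assign the class whose prevalence in the node is highest relative to its prevalence in the original training sample) can be illustrated with a minimal sketch. This covers only the labeling rule as the abstract states it, not the node partition criterion, whose details the abstract does not give. The function name and the class counts are hypothetical, and the sketch assumes every class seen in the node also occurs in the training sample.

```python
from collections import Counter


def label_by_relative_prevalence(node_labels, training_labels):
    """Pick the class with the largest (node prevalence / training prevalence).

    Assumption: every class present in the node also appears in the
    training sample, so the denominator is never zero.
    """
    node_counts = Counter(node_labels)
    train_counts = Counter(training_labels)
    n_node, n_train = len(node_labels), len(training_labels)

    def ratio(cls):
        return (node_counts[cls] / n_node) / (train_counts[cls] / n_train)

    return max(node_counts, key=ratio)


# Hypothetical credit data: 10% of training loans default.
training = ["ok"] * 90 + ["default"] * 10
node = ["ok"] * 6 + ["default"] * 4

relative_label = label_by_relative_prevalence(node, training)
majority_label = Counter(node).most_common(1)[0][0]
```

In this example the majority class in the node is "ok" (6 of 10), but "default" is four times as prevalent in the node as in the training sample, so the relative rule labels the node "default". That is how such a rule can lower the error rate on a minority class.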
