1 |
Optimización del clasificador “naive bayes” usando árbol de decisión C4.5 / Optimization of the Naive Bayes classifier using the C4.5 decision tree. Alarcón Jaimes, Carlos. January 2015
The Naive Bayes classifier is one of the most effective classification models, owing to its simplicity, robustness to noise, short processing time, and high predictive power. It makes a strong assumption that the predictor variables are independent given the class, which generally does not hold. Much research seeks to improve the classifier's predictive power by relaxing this independence assumption, for example by choosing a subset of variables that are independent or approximately independent.
This work presents a method that seeks to optimize the Naive Bayes classifier using the C4.5 decision tree. The method selects a subset of the variables in the data set using the induced C4.5 decision tree and then applies the Naive Bayes classifier to the selected variables. Running the C4.5 decision tree first removes redundant and/or irrelevant variables from the data set and keeps those that are most informative for classification, thereby improving the classifier's predictive power. The method is illustrated on three data sets from the UCI Machine Learning Repository of the University of California, Irvine, and one data set from the National Household Survey of Peru's National Institute of Statistics and Informatics (Encuesta Nacional de Hogares, ENAHO – INEI), and is implemented in WEKA.
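The two-stage idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's WEKA implementation: the tree is a simplified gain-ratio tree for categorical attributes (no pruning, no continuous splits, so not full C4.5), the Naive Bayes uses Laplace smoothing, and every function name and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attr, divided by the split information."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return 0.0 if split_info == 0 else (entropy(labels) - remainder) / split_info

def tree_attributes(rows, labels, attrs):
    """Return the attributes a greedy gain-ratio tree would split on."""
    if len(set(labels)) <= 1 or not attrs:
        return set()
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    if gain_ratio(rows, labels, best) <= 0:
        return set()
    used = {best}
    rest = [a for a in attrs if a != best]
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        used |= tree_attributes([rows[i] for i in idx], [labels[i] for i in idx], rest)
    return used

def train_nb(rows, labels, attrs):
    """Count class and per-attribute value frequencies for the chosen attributes."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attr, class) -> Counter of values
    values = defaultdict(set)            # attr -> distinct values seen
    for row, y in zip(rows, labels):
        for a in attrs:
            value_counts[(a, y)][row[a]] += 1
            values[a].add(row[a])
    return class_counts, value_counts, values

def predict_nb(model, row, attrs):
    """Pick the class with the highest Laplace-smoothed log posterior."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    def log_posterior(y):
        lp = math.log(class_counts[y] / total)
        for a in attrs:
            lp += math.log((value_counts[(a, y)][row[a]] + 1)
                           / (class_counts[y] + len(values[a])))
        return lp
    return max(class_counts, key=log_posterior)

# Toy data: "b" duplicates "a" (redundant) and "c" is unrelated to the class.
rows = [
    {"a": "x", "b": "p", "c": "1"},
    {"a": "x", "b": "p", "c": "2"},
    {"a": "y", "b": "q", "c": "1"},
    {"a": "y", "b": "q", "c": "2"},
]
labels = ["yes", "yes", "no", "no"]

selected = tree_attributes(rows, labels, ["a", "b", "c"])
model = train_nb(rows, labels, sorted(selected))
prediction = predict_nb(model, {"a": "y", "b": "p", "c": "1"}, sorted(selected))
print(selected, prediction)
```

On this toy data the tree needs only one of the two duplicated attributes and never uses the unrelated one, so Naive Bayes is trained on a smaller, less correlated attribute set.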
|
2 |
Data Mining with Decision Trees in the Gene Logic Database: A Breast Cancer Study. Rahpeymai, Neda. January 2002
Data mining approaches have been increasingly used in recent years to find patterns and regularities in large databases. In this study, the C4.5 decision tree approach was used to mine the Gene Logic database, which contains biological data. The aim was to identify the genes and risk factors most relevant to breast cancer, so as to separate healthy patients from breast cancer patients in the data sets used. Four different tests were performed for this purpose, and cross validation was performed for each of them to evaluate how well the decision trees classify ‘new’ samples. In the first test, the expression of 108 breast-related genes, shown in appendix A, for 75 patients was used as input to the C4.5 algorithm. This test resulted in a decision tree containing only four genes, considered to be the most relevant for correctly classifying patients. Cross validation indicates an average accuracy of 89% in classifying ‘new’ samples. In the second test, risk factor data was used as input. The cross validation result shows an average accuracy of 87% in classifying ‘new’ samples. In the third test, gene expression data and risk factor data were combined into one input. The cross validation procedure for this approach again indicates an average accuracy of 87% in classifying ‘new’ samples. In the final test, the C4.5 algorithm was used to indicate possible signalling pathways involving the four genes identified by the decision tree based on gene expression data alone. In some cases, the C4.5 algorithm found trees suggesting pathways that are supported by the breast cancer literature. Since not all pathways involving the four putative breast cancer genes are known yet, the other suggested pathways should be analyzed further to increase their credibility.
In summary, this study demonstrates the application of decision tree approaches for the identification of genes and risk factors relevant for the classification of breast cancer patients.
|
3 |
Prediction of financial product acquisition for Peruvian savings and credit associations. Vargas, Emmanuel Roque; Cadillo Montesinos, Ricardo; Mauricio, David. 30 September 2020
The full text of this work is not available in the Repositorio Académico UPC owing to restrictions imposed by the publisher. / Savings and credit cooperatives in Peru play an important part in the economy, with deposits and assets of more than 2,890,191,000 in 2019. However, they do not invest in predictive technologies to identify customers with a higher probability of purchasing a financial product, which makes their marketing campaigns unproductive. In this work, a model based on machine learning is proposed to identify the clients who are most likely to acquire a financial product from Peruvian savings and credit cooperatives. The model was implemented using IBM SPSS Modeler for predictive analysis, and tests were performed on 40,000 records on 10,000 clients, obtaining 91.25% accuracy on data not used in training. / Peer reviewed
|
4 |
Empirical investigation of decision tree extraction from neural networks. Rangwala, Maimuna H. 08 September 2006
No description available.
|
5 |
Implementation and Experimentation with C4.5 Decision Trees. Beck, Jason. 01 January 2007
C4.5 is a decision tree learning algorithm developed by Ross Quinlan as a successor to his earlier algorithm ID3. It is one of the most popular algorithms for solving classification problems, which are of interest in a variety of disciplines. C4.5 is a supervised learning algorithm that uses a set of training patterns to build a decision tree. The algorithm analyzes the individual attributes of the patterns to partition the pattern data. The popularity of C4.5 stems from the fact that it can handle both continuous and categorical attributes and can deal with missing attribute values, while still producing answers that are easy to interpret. This thesis has two objectives. The first is to implement C4.5 in C++ within a generic architecture that allows additional modules to be added. The second is to use this generic architecture to implement an innovative post-induction phase that adjusts splits to minimize the error of the C4.5 tree. The C4.5 code and the post-induction phase will be compiled into a MEX DLL for use as functions within MATLAB. Experimentation is performed using MATLAB to verify the advantages of this post-induction phase.
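The abstract above notes that C4.5 can split on continuous attributes. A minimal Python sketch of how such a split is commonly chosen follows: sort the values, try a threshold between each pair of adjacent distinct values, and keep the one with the highest information gain. This is a simplification, not the thesis's C++ code: it uses midpoints as thresholds and ignores missing values, and the function names and the toy humidity data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain, gain ratio) for the binary split
    value <= threshold, or None if all values are equal."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = (base
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        # Split information: entropy of the left/right partition sizes.
        split_info = entropy(["L"] * len(left) + ["R"] * len(right))
        candidate = ((lo + hi) / 2, gain, gain / split_info)
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best

humidity = [65, 70, 75, 80, 85, 90]
play = ["yes", "yes", "yes", "no", "no", "no"]
threshold, gain, ratio = best_threshold(humidity, play)
print(threshold, gain, ratio)
```

The gain ratio returned alongside the threshold is what a C4.5-style learner would use to compare this attribute with the other candidate attributes at the same node.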
|
6 |
以資料採礦技術分析大台北地區保單貸款 / Analyzing policy loans in the greater Taipei area using data mining techniques. 李珮榕. Unknown Date
Abstract
This study applies data mining to life insurance policy data from an insurance company in the greater Taipei area as a knowledge discovery exercise. Three widely used data mining models (neural networks, CART, and C4.5) are used to examine policy loan behavior. Samples were constructed by varying the sampling method, the sample size, and the ratio of policies with loans to policies without loans, and the effect of these sample structures on each model is discussed. A further set of samples was created to examine whether transforming the continuous variables affects each model.
The results are summarized as follows:
1. The ratio of policies with loans in the sample has the greatest influence on the models. The influence of sample size and sampling method varies with that ratio, and the sample structure that suits each model is not the same.
2. Transforming the continuous variables affects the neural network model most and improves its results significantly. CART results are the same before and after the transformation, and the effect on C4.5 varies across sample structures but is always small.
3. The variables that best explain whether a policy carries a loan differ among the three models. In the sensitivity analysis of the neural network model they are underwriting (health) class, the insured's occupation class, insurance type, and region; in the CART model they are premium payment mode, policy year, policy cash value, payment method, and face amount; in the C4.5 model they are the assumed interest rate of the main policy, annualized premium, policy year, and premium payment mode. For the CART and C4.5 models, the rules with the highest accuracy were selected to guide the insurer's decisions.
|
7 |
資料採礦技術在保險公司客戶保單貸款行為研究的應用 / Application of data mining techniques to the study of insurance customers' policy loan behavior. 邱蔚群 (Lilian Chiu). Unknown Date
Abstract
Insurance data have usually been analyzed with traditional statistical methods, which may leave much of the valuable information in an insurer's large database undiscovered.
This research applies data mining techniques to policy loan data for policyholders in Kaohsiung city and county, taken from an insurance company's database, to study how customers borrow against their policies and to give the insurer a reference for promoting policy loans in the future.
From the cleansed data, we draw samples of different sizes and different ratios of policies with loans to policies without loans, using different sampling methods. After transforming the continuous variables, we build decision tree and neural network models and use analysis of variance (ANOVA) to examine how these four factors influence predictive performance. We then select the best combination of sample size, loan ratio, sampling method, and model.
Finally, the C4.5 decision tree built from this best combination is converted into rules, and the rules with the highest accuracy are examined as a reference for the insurer in targeting its customers.
|
8 |
Extração de conhecimento de redes neurais artificiais. / Knowledge extraction from artificial neural networks. Martineli, Edmar. 20 August 1999
This work describes experiments carried out with Artificial Neural Networks and symbolic learning algorithms. Two algorithms for knowledge extraction from Artificial Neural Networks are also investigated. The experiments are performed on three data sets with the aim of comparing the performance obtained. The data sets used in this work are: Brazilian bank bankruptcy data, tic-tac-toe data, and credit analysis data. Three techniques are applied to the data to improve performance: partitioning by the smallest class, adding noise to the examples of the smallest class, and selecting the most relevant attributes. Besides the analysis of the performance obtained, the work also analyzes how difficult it is to understand the knowledge extracted by each method on each data set.
|
9 |
Applying the Wrapper Approach for Auto Discovery of Under-Sampling and Over-Sampling Percentages on Skewed Datasets. Joshi, Ajay D. 03 November 2004
Machine learning applications are plagued by the imbalance observed among the class sizes in many real world datasets. A dataset is said to be skewed or imbalanced when its classes are very unequally represented. A naïve classifier learned from such a skewed dataset is biased towards the majority classes, which make up most of the samples in the dataset. As a result, accuracy on the minority classes suffers. In many real world applications, such as network intrusion detection or cancer detection from mammography images, the events of interest are very rare and the cost of not detecting them is very high. Hence it is very important to improve accuracy on the minority classes. It has been proposed previously that under-sampling of the majority classes can reduce the bias of the learned classifier, and that over-sampling of the minority classes, especially with SMOTE (Synthetic Minority Over-sampling TEchnique), can boost classifier accuracy on minority classes. But the question remains of how much under-sampling and over-sampling should be done for a particular induction learning algorithm and dataset. We present a wrapper approach that searches for the under-sampling and over-sampling (i.e. SMOTE) percentages for a particular learning algorithm on a given skewed dataset. We compare the results obtained by classifiers built on wrapper-selected under-sampled and SMOTEd datasets with those obtained by classifiers built on the original datasets, and show a statistically significant improvement in accuracy on the minority classes. This demonstrates the efficacy of the wrapper approach in searching for the under-sampling and over-sampling percentages. Further, it provides an automated method for selecting the number of synthetic examples to be created.
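The two resampling steps whose percentages the wrapper searches over can be sketched as follows. This is a simplified illustration, not the thesis's code: the SMOTE variant below picks base samples at random rather than looping over every minority sample, the wrapper's search and classifier evaluation are omitted, and the function names, the meaning given to `percent`, and the toy points are all assumptions. It needs Python 3.8+ for `math.dist` and at least two minority samples.

```python
import math
import random

def undersample(majority, percent, rng):
    """Keep a random `percent` of the majority-class samples."""
    keep = max(1, round(len(majority) * percent / 100))
    return rng.sample(majority, keep)

def smote(minority, percent, k, rng):
    """Create len(minority) * percent / 100 synthetic points, each interpolated
    between a minority sample and one of its k nearest minority neighbours."""
    n_new = round(len(minority) * percent / 100)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(float(i), float(i % 5)) for i in range(3, 23)]

reduced = undersample(majority, 50, rng)   # keep half of the majority class
synthetic = smote(minority, 200, 2, rng)   # add twice as many synthetic points
print(len(reduced), len(synthetic))
```

A wrapper in the sense of the abstract would call these two functions for each candidate pair of percentages, train the chosen learner on the resampled data, and keep the pair that gives the best cross-validated performance on the minority classes.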
|
10 |
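非補償型意思決定方略を表現するためのデータマイニング手法の適用に関する分析 / Analysis of the application of data mining methods to represent non-compensatory decision-making strategies. 山本, 俊行, YAMAMOTO, Toshiyuki. 07 1900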
No description available.
|