About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations (NDLTD). Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
1

Methodology for Generating High-Confidence Cost-Sensitive Rules for Classification

Bakshi, Arjun 21 October 2013
No description available.
2

The Effectiveness of a Random Forests Model in Detecting Network-Based Buffer Overflow Attacks

Julock, Gregory Alan 01 January 2013
Buffer overflows are a common type of network intrusion attack that continues to plague the networked community. Unfortunately, this type of attack is not well detected with current data mining algorithms. This research investigated the use of Random Forests, an ensemble technique that creates multiple decision trees and then votes for the best tree, and compared its effectiveness in detecting buffer overflows against other data mining methods such as CART and Naïve Bayes. Random Forests was used for variable reduction, cost-sensitive classification was applied, and each method's detection performance was compared and reported along with receiver operating characteristics. The experiment showed that Random Forests outperformed CART and Naïve Bayes in classification performance. Using a technique to identify the variables most important to buffer overflow detection, Random Forests was also able to improve upon its classification performance.
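As a hedged illustration of the comparison this abstract describes, the sketch below trains a cost-weighted Random Forest and a Naïve Bayes baseline on synthetic imbalanced data standing in for network traffic, compares ROC AUC, and then refits on the most important variables. The 20:1 class weight, the synthetic data, and all parameters are assumptions for illustration, not the thesis's actual setup.

```python
# Sketch: cost-weighted Random Forest vs. Naive Bayes on imbalanced data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Rare positive class (~2%) stands in for buffer-overflow connections.
X, y = make_classification(n_samples=20_000, n_features=30, weights=[0.98, 0.02],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight encodes the higher cost of missing an attack (assumed 20:1).
rf = RandomForestClassifier(n_estimators=200, class_weight={0: 1, 1: 20},
                            random_state=0).fit(X_tr, y_tr)
nb = GaussianNB().fit(X_tr, y_tr)

for name, model in [("random forest", rf), ("naive bayes", nb)]:
    print(name, "ROC AUC:", round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))

# Variable reduction as described: refit on the most important features only.
top = np.argsort(rf.feature_importances_)[-10:]
rf_small = RandomForestClassifier(n_estimators=200, class_weight={0: 1, 1: 20},
                                  random_state=0).fit(X_tr[:, top], y_tr)
print("reduced-variable RF ROC AUC:",
      round(roc_auc_score(y_te, rf_small.predict_proba(X_te[:, top])[:, 1]), 3))
```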
3

Active learning in cost-sensitive environments

Liu, Alexander Yun-chung 21 June 2010
Active learning techniques aim to reduce the amount of labeled data required for a supervised learner to achieve a certain level of performance. This can be very useful in domains where unlabeled data is easy to obtain but labeling it is costly. In this dissertation, I introduce methods for creating computationally efficient active learning techniques that handle different misclassification costs, different evaluation metrics, and different label acquisition costs. This is accomplished in part by developing techniques from utility-based data mining that are typically not studied in conjunction with active learning. I first address supervised learning problems where labeled data may be scarce, especially for one particular class. I revisit claims about resampling, a particularly popular approach to handling imbalanced data, and cost-sensitive learning. The presented research shows that while resampling and cost-sensitive learning can be equivalent in some cases, the two approaches are not identical. This work motivates a need for active learners that can handle different misclassification costs. After presenting a cost-sensitive active learning algorithm, I show that it can be combined with a proposed framework for analyzing evaluation metrics to create an active learning approach that can optimize any evaluation metric expressible as a function of terms in a confusion matrix. Finally, I address methods for active learning under different utility costs incurred when labeling different types of points, particularly when label acquisition costs are spatially driven.
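A minimal sketch of one way cost-sensitive active learning can work, assuming a simple expected-cost query rule: the unlabeled pool is ranked by the expected misclassification cost of the model's current prediction rather than by raw uncertainty. The cost matrix and data are invented; this is not the dissertation's exact algorithm.

```python
# Sketch: query the pool point whose predicted label carries the highest
# expected misclassification cost under the model's own class probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

COST = np.array([[0.0, 1.0],    # COST[true, predicted]; assumed values:
                 [10.0, 0.0]])  # missing the rare class is 10x worse

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=1)
idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
labeled = list(idx0[:45]) + list(idx1[:5])         # small stratified seed set
pool = [i for i in range(len(y)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                                # 20 label-acquisition rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])           # P(class | x) over the pool
    pred = proba.argmax(axis=1)
    # Expected cost of the current prediction: sum_t P(t|x) * COST[t, pred].
    exp_cost = (proba * COST[:, pred].T).sum(axis=1)
    labeled.append(pool.pop(int(np.argmax(exp_cost))))

print("labels used:", len(labeled))
```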
4

Cost-Sensitive Boosting for Classification of Imbalanced Data

Sun, Yanmin 11 May 2007
The classification of data with imbalanced class distributions significantly limits the performance attainable by most well-developed classification systems, which assume relatively balanced class distributions. The problem is especially crucial in application domains of great importance in machine learning and data mining, such as medical diagnosis, fraud detection, and network intrusion. This thesis explores meta-techniques applicable to most classifier learning algorithms, with the aim of advancing the classification of imbalanced data. Boosting is a powerful meta-technique that learns an ensemble of weak models with a promise of improved classification accuracy; AdaBoost is widely regarded as the most successful boosting algorithm. This thesis starts by applying AdaBoost to an associative classifier for both learning-time reduction and accuracy improvement. However, the promise of accuracy improvement means little in the context of the class imbalance problem, where accuracy is a less meaningful measure. The insight gained from a comprehensive analysis of AdaBoost's boosting strategy leads to the investigation of cost-sensitive boosting algorithms, developed by introducing cost items into the learning framework of AdaBoost. The cost items denote the uneven identification importance among classes, so that the boosting strategies can intentionally bias learning toward classes of higher identification importance and eventually improve identification performance on them. For a given application domain, cost values for the different types of samples are usually unavailable. To set effective cost values, empirical methods are used for two-class applications and heuristic search with a genetic algorithm is employed for multi-class applications. This thesis also covers the implementation of the proposed cost-sensitive boosting algorithms and ends with a discussion of experimental results on the classification of real-world imbalanced data. Compared with existing algorithms, the new algorithms presented in this thesis achieve better results on the stated learning objectives.
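The "cost items inside AdaBoost" idea can be sketched as follows. This follows the general shape of an AdaC2-style update, in which a per-sample cost factor multiplies the whole weight update so that later rounds concentrate on the expensive class; the costs and data are invented, and this is not the thesis's exact formulation.

```python
# Sketch: cost-sensitive boosting with a per-sample cost item c_i that
# multiplies the AdaBoost weight update (AdaC2-style), biasing learning
# toward the minority class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=2_000, weights=[0.9, 0.1], random_state=2)
y = np.where(y01 == 1, 1, -1)                  # boosting labels in {-1, +1}
c = np.where(y == 1, 5.0, 1.0)                 # assumed cost items: minority 5x

w = np.ones(len(y)) / len(y)
learners, alphas = [], []
for t in range(50):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    h = stump.predict(X)
    # Learner weight from the cost-weighted correct/incorrect mass.
    good = np.sum(c * w * (h == y))
    bad = np.sum(c * w * (h != y))
    if bad == 0:
        learners.append(stump); alphas.append(1.0); break
    alpha = 0.5 * np.log(good / bad)
    # AdaC2-style update: the cost item multiplies the whole update.
    w = c * w * np.exp(-alpha * y * h)
    w /= w.sum()
    learners.append(stump); alphas.append(alpha)

F = sum(a * m.predict(X) for a, m in zip(alphas, learners))
print(f"minority-class recall on training data: {np.mean(np.sign(F)[y == 1] == 1):.2f}")
```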
5

Enhancing Gene Expression Signatures in Cancer Prediction Models: Understanding and Managing Classification Complexity

Kamath, Vidya P. 29 July 2010
Cancer can develop through a series of genetic events in combination with external influential factors that alter the progression of the disease. Gene expression studies are designed to provide an enhanced understanding of the progression of cancer and to develop clinically relevant biomarkers of disease, prognosis, and response to treatment. One of the main aims of microarray gene expression analyses is to develop signatures that are highly predictive of specific biological states, such as the molecular stage of cancer. This dissertation analyzes the classification complexity inherent in gene expression studies, proposing both techniques for measuring complexity and algorithms for reducing it. Classifier algorithms that generate predictive signatures of cancer models must generalize to independent datasets for successful translation to clinical practice, and the predictive performance of classifier models is shown to depend on the inherent complexity of the gene expression data. Three specific quantitative measures of classification complexity are proposed, and one measure (f) is shown to correlate highly (R² = 0.82) with classifier accuracy in experimental data. Three quantization methods are proposed to enhance contrast in gene expression data and reduce classification complexity. Accuracy for cancer prognosis prediction is shown to improve with quantization in the two datasets studied, from 67% to 90% in lung cancer and from 56% to 68% in colorectal cancer, with a corresponding reduction in classification complexity. A random-subspace-based multivariable feature selection approach using cost-sensitive analysis is proposed to model the underlying heterogeneous cancer biology and to address complexity due to multiple molecular pathways and the unbalanced distribution of samples into classes. The technique is shown to be more accurate than the univariate t-test method, improving classifier accuracy from 56% to 68% for colorectal cancer prognosis prediction. Finally, a published gene expression signature for predicting the radiosensitivity of tumor cells is augmented with clinical indicators to model the data and represent the underlying biology more closely. Statistical tests and experiments indicate that the improvement in model fit results from modeling the underlying biology rather than from statistical over-fitting, thereby accommodating classification complexity through the use of additional variables.
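As a rough sketch of the quantization idea, with quantile binning standing in for the thesis's three quantization methods and synthetic data standing in for microarrays (the complexity measure f is not reproduced):

```python
# Sketch: discretize each gene's expression into a few contrast-enhancing
# levels before classification, and compare cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import LinearSVC

# High-dimensional, noisy stand-in for a microarray matrix (samples x genes).
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           flip_y=0.1, random_state=3)

raw_acc = cross_val_score(LinearSVC(max_iter=5000), X, y, cv=5).mean()

# Quantize each gene to 3 ordinal levels (low / medium / high).
Xq = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile").fit_transform(X)
q_acc = cross_val_score(LinearSVC(max_iter=5000), Xq, y, cv=5).mean()

print(f"continuous: {raw_acc:.2f}  quantized: {q_acc:.2f}")
```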
6

Cost-Sensitive Classification Methods for the Detection of Smuggled Nuclear Material in Cargo Containers

Webster, Jennifer B 16 December 2013
Classification problems arise in many different parts of life, from sorting machine parts to diagnosing a disease. Humans make these classifications using vast amounts of data, filtering observations for useful information, and then making a decision based on a subjective level of cost/risk of classifying objects incorrectly. This study investigates the translation of the human decision process into a mathematical problem in the context of a border security problem: how does one find special nuclear material being smuggled inside large cargo crates while balancing the cost of invasively searching suspect containers against the risk of allowing radioactive material to escape detection? This may be phrased as a classification problem in which one classifies cargo containers into two categories: those containing a smuggled source and those containing only innocuous cargo. The task presents numerous challenges, e.g., the stochastic nature of radiation and the low signal-to-noise ratio caused by background radiation and cargo shielding. In the course of this work, the analysis is broken into three major sections: the development of an optimal decision rule, the choice of the most useful measurements or features, and the sensitivity of the developed algorithms to physical variations. This includes an examination of how accounting for the cost/risk of a decision affects the formulation of the classification problem. Ultimately, a support vector machine (SVM) framework with F-score feature selection is developed to provide nearly optimal classification given a constraint on the reliability of detection. In particular, it can decrease the fraction of false positives by an order of magnitude over current methods. The proposed method also takes the relationship between measurements into account, whereas current methods treat detectors independently of one another.
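A sketch of the F-score-plus-SVM pipeline under stated assumptions: the F-score here is the standard Chen-and-Lin-style discriminant ratio, the data are synthetic, and the 50:1 class weight stands in for the thesis's detection-reliability constraint. Detector physics and the real feature set are not modeled.

```python
# Sketch: rank features by F-score, keep the top few, train a class-weighted
# SVM, and report false positives vs. missed sources.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def f_score(X, y):
    """Chen & Lin-style F-score of each feature for a binary problem."""
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(0) - X.mean(0)) ** 2 + (neg.mean(0) - X.mean(0)) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / (den + 1e-12)

X, y = make_classification(n_samples=3_000, n_features=40, weights=[0.97, 0.03],
                           random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

top = np.argsort(f_score(X_tr, y_tr))[-8:]       # 8 most separating features
# class_weight biases the margin toward never letting a "source" slip through.
svm = SVC(kernel="rbf", class_weight={0: 1, 1: 50}).fit(X_tr[:, top], y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, svm.predict(X_te[:, top])).ravel()
print(f"false positives: {fp}, missed sources: {fn}")
```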
7

Intelligent Adaptation of Ensemble Size in Data Streams Using Online Bagging

Olorunnimbe, Muhammed January 2015
In this era of the Internet of Things and Big Data, a proliferation of connected devices continuously produces massive amounts of fast-evolving streaming data. There is a need to study the relationships in such streams for analytic applications such as network intrusion detection, fraud detection, and financial forecasting, among others. In this setting, it is crucial to create data mining algorithms that seamlessly adapt to temporal changes in data characteristics, called concept drifts, that occur in data streams. The resulting models should not only be highly accurate and able to swiftly adapt to changes; the data mining techniques should also be fast, scalable, and efficient in terms of resource allocation. It then becomes important to consider issues such as storage needs and memory utilization, especially when the aim is to build personalized, near-instant models in a Big Data setting. This research focuses on mining data streams with concept drift using an online bagging method, with attention to memory utilization, and takes an adaptive approach to resource allocation during the mining process. Specifically, we consider metalearning, in which the models of multiple classifiers are combined into an ensemble; this approach has been very successful at building accurate models against data streams, but little work has been done to explore the interplay between accuracy, efficiency, and utility. We introduce an adaptive metalearning algorithm that takes advantage of the memory utilization cost of concept drift to vary the ensemble size during the data mining process, aiming to minimize memory usage while maintaining highly accurate models with high utility. We evaluated our method against a number of benchmarking datasets and compared our results against the state of the art. Return on Investment (ROI) was used to evaluate the gain in accuracy relative to the time and memory invested; we aimed to achieve high ROI without compromising accuracy, and our experimental results indicate that we achieved this goal.
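Online bagging itself (Oza and Russell's Poisson(1) scheme) plus a toy size-adaptation policy can be sketched as below. The accuracy-driven grow/shrink rule is a placeholder for the thesis's ROI-driven adaptation, not a reproduction of it, and the stream is synthetic.

```python
# Sketch: Oza-Russell online bagging (each member sees Poisson(1) copies of
# every instance) with a toy ensemble-size policy trading memory for accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=3_000, random_state=5)
classes = np.unique(y)
rng = np.random.default_rng(5)

ensemble = [GaussianNB() for _ in range(10)]
tested = correct = 0
for i, (x, label) in enumerate(zip(X, y)):
    x = x.reshape(1, -1)
    if i > 100:                                    # prequential: test, then train
        votes = [m.predict(x)[0] for m in ensemble]
        tested += 1
        correct += max(set(votes), key=votes.count) == label
    for m in ensemble:                             # Poisson(1) replicates per member
        for _ in range(rng.poisson(1.0)):
            m.partial_fit(x, [label], classes=classes)
    if tested and i % 500 == 0:                    # toy adaptation policy
        acc = correct / tested
        if acc > 0.9 and len(ensemble) > 4:        # accurate: save memory
            ensemble.pop()
        elif acc < 0.8 and len(ensemble) < 25:     # struggling: add capacity
            fresh = GaussianNB()
            fresh.partial_fit(x, [label], classes=classes)
            ensemble.append(fresh)

print(f"final ensemble size: {len(ensemble)}, prequential accuracy: {correct / tested:.3f}")
```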
8

Neural Networks for Predictive Maintenance on Highly Imbalanced Industrial Data

Montilla Tabares, Oscar January 2023
Preventive maintenance plays a vital role in optimizing industrial operations. However, detecting equipment that needs such maintenance from available data can be particularly challenging because of the class imbalance prevalent in real-world applications: datasets gathered from equipment sensors consist mostly of records from well-functioning machines, making it difficult to identify those on the brink of failure, which are the main focus of preventive maintenance efforts. In this study, we employ neural network algorithms to address class imbalance and cost sensitivity in industrial preventive maintenance scenarios. Our investigation centers on the "APS Failure in the Scania Trucks Data Set," a binary classification problem exhibiting significant class imbalance and cost sensitivity, a common occurrence across many fields. Inspired by image detection techniques, we bring the Focal loss into traditional neural networks, combined with Cost-Sensitive Learning and Threshold Calculation, to enhance classification accuracy. The study's novelty lies in adapting image detection techniques to tackle class imbalance within a binary classification task. The proposed method improves on the given optimization problem, matching or surpassing existing machine learning and deep learning techniques while maintaining computational efficiency. Our results indicate that class imbalance can be addressed without conventional sampling techniques, which typically cost either additional computation (oversampling) or critical information (undersampling). In conclusion, the proposed method presents a promising approach for industrial datasets heavily affected by class imbalance and cost sensitivity. It contributes to preventive maintenance solutions capable of enhancing the efficiency and productivity of industrial operations by detecting machines in need of attention, a discovery process we term predictive maintenance. The artifact produced in this study showcases the use of Focal Loss, Cost-Sensitive Learning, and Threshold Calculation to create reliable and effective predictive maintenance solutions for real-world applications. The thesis thereby contributes to the body of knowledge on binary classification in machine learning, and the findings have implications beyond industrial classification tasks, extending to fields such as medical or cybersecurity classification problems. The artifact (code) is at: https://shorturl.at/lsNSY
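The three named ingredients compose naturally. Below is a small sketch assuming NumPy, with the binary focal loss as defined by Lin et al. (2017) and the APS challenge's published unit costs (10 per false alarm, 500 per missed failure) used for threshold selection; the network architecture and the validation scores are fabricated for the demonstration, not taken from the thesis's artifact.

```python
# Sketch: focal loss + class weighting + cost-minimizing decision threshold.
import numpy as np

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy
    examples; alpha re-weights the rare positive class (assumed value)."""
    p_t = np.where(y == 1, p, 1 - p)
    a_t = np.where(y == 1, alpha, 1 - alpha)
    return -(a_t * (1 - p_t) ** gamma * np.log(np.clip(p_t, 1e-7, 1.0))).mean()

def best_threshold(p, y, c_fp=10.0, c_fn=500.0):
    """Pick the decision threshold minimizing total cost on a validation set
    (10/500 are the APS challenge's published unit costs)."""
    grid = np.linspace(0.01, 0.99, 99)
    cost = [c_fp * np.sum((p >= t) & (y == 0)) + c_fn * np.sum((p < t) & (y == 1))
            for t in grid]
    return grid[int(np.argmin(cost))]

# Tiny demonstration with fabricated validation scores.
rng = np.random.default_rng(6)
y = (rng.random(5_000) < 0.02).astype(int)              # ~2% failures
p = np.clip(0.7 * y + 0.1 * rng.random(5_000), 0, 1)    # imperfect scores
print("focal loss:", round(focal_loss(p, y), 4))
print("cost-minimizing threshold:", best_threshold(p, y))
```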
9

Card fraud detection: a classifier based on association rules and logistic regression

Oliveira, Paulo Henrique Maestrello Assad 11 December 2015
Credit and debit cards are heavily used means of payment, a fact that attracts the interest of fraudsters. The card industry views fraud as an operating cost that is passed on to consumers and to society at large. The high volume of transactions and the need to combat fraud open space for the application of Machine Learning techniques, among them classifiers. Rule-based classifiers are widely used in this domain; in practice, however, they depend heavily on domain experts, professionals who detect the patterns of fraudulent transactions, turn them into rules, and implement those rules in the classification systems.
Recognizing this scenario, this work proposes an architecture based on association rules and logistic regression, techniques studied in Machine Learning, to mine rules from the data and produce, as a result, sets of rules for detecting fraudulent transactions, which are then made available to domain experts. These professionals gain computer assistance in discovering and generating the rules that underpin the classifier, reducing the chance that fraudulent patterns go unrecognized and making the generation and maintenance of rules more efficient. To test the proposal, the experimental part of the work used about 7.7 million real card transactions provided by a company in the card market. Since the classifier can make errors (false positives and false negatives), cost-sensitive analysis was applied so that most of these errors incur a lower cost. After extensive analysis of the database, 141 features were combined, using the FP-Growth algorithm, to generate 38,003 rules that, after filtering and selection, were grouped into five rule sets, the largest containing 1,285 rules. Each of the five sets was submitted to logistic regression modeling so that its rules were validated and weighted by statistical criteria. At the end of the process, the models' goodness-of-fit metrics revealed well-fitted rule sets, and the classifiers' performance indicators showed, in general, very good classification power (AUC between 0.788 and 0.820). In conclusion, the combined application of the statistical techniques, namely cost-sensitive analysis, association rules, and logistic regression, proved conceptually and theoretically cohesive and coherent, and the experiment and its results demonstrated the technical and practical feasibility of the proposal.
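The final stage of the proposed pipeline, turning mined rules into features that logistic regression validates and weights, might look like the following sketch. The rules, feature names, and class weights here are hypothetical stand-ins for the FP-Growth output and cost analysis described above, on synthetic transactions rather than the real dataset.

```python
# Sketch: candidate fraud rules become 0/1 features; a class-weighted logistic
# regression validates and weights them.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 50_000
df = pd.DataFrame({                      # hypothetical boolean transaction traits
    "night_txn": rng.random(n) < 0.2,
    "foreign_merchant": rng.random(n) < 0.1,
    "amount_high": rng.random(n) < 0.15,
    "card_not_present": rng.random(n) < 0.3,
})
# Synthetic fraud label correlated with two of the patterns.
risky = (df.night_txn & df.foreign_merchant) | (df.amount_high & df.card_not_present)
y = (risky & (rng.random(n) < 0.6)) | (rng.random(n) < 0.002)

# Each mined rule is a conjunction of conditions; matching becomes a 0/1 feature.
rules = [("night_txn", "foreign_merchant"),
         ("amount_high", "card_not_present"),
         ("night_txn", "amount_high")]
R = np.column_stack([df[list(r)].all(axis=1).astype(int) for r in rules])

# class_weight plays the cost-sensitive role: false negatives cost far more.
clf = LogisticRegression(class_weight={False: 1, True: 50}).fit(R, y)
print("AUC (on training data, for brevity):",
      round(roc_auc_score(y, clf.predict_proba(R)[:, 1]), 3))
for r, coef in zip(rules, clf.coef_[0]):
    print(" & ".join(r), "weight:", round(coef, 2))
```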
