1.
Large-Scale Web Page Classification. Marath, Sathi, 09 November 2010.
Web page classification is the process of assigning predefined categories to web pages.
Empirical evaluations of classifiers such as Support Vector Machines (SVMs), k-Nearest
Neighbor (k-NN), and Naïve Bayes (NB) have shown that these algorithms are effective
in classifying small segments of web directories. The effectiveness of these algorithms,
however, has not been thoroughly investigated on large-scale web page classification of
such popular web directories as Yahoo! and LookSmart. Such web directories have
hundreds of thousands of categories, deep hierarchies, spindle-shaped category and document
distributions over the hierarchies, and skewed category distribution over the documents.
These statistical properties indicate class imbalance and rarity within the dataset.
In hierarchical datasets similar to web directories, expanding the content of each category
using the web pages of the child categories helps to decrease the degree of rarity. This
process, however, results in a localized overabundance of positive instances, especially
in the upper-level categories of the hierarchy. The class imbalance, rarity and the
localized overabundance of positive instances make applying classification algorithms to
web directories very difficult and the problem has not been thoroughly studied. To our
knowledge, the largest previous classification effort on web taxonomies covered 246,279
categories of the Yahoo! directory using hierarchical SVMs, achieving a Macro-F1 of
only 12%.
We designed a unified framework for the content-based classification of imbalanced
hierarchical datasets. The complete Yahoo! web directory of 639,671 categories and
4,140,629 web pages is used to set up the experiments. In a hierarchical dataset, the prior
probability distribution of the subcategories indicates the presence or absence of class
imbalance, rarity and the overabundance of positive instances within the dataset. Based
on the prior probability distribution and associated machine learning issues, we
partitioned the subcategories of the Yahoo! web directory into five mutually exclusive
groups. The effectiveness of different data-level, algorithmic and architectural solutions
to the associated machine learning issues is explored. The best-performing
classification technologies for each prior probability distribution were then
identified and integrated into the Yahoo! web directory classification model. The
methodology is evaluated using a DMOZ subset of 17,217 categories and 130,594 web
pages, and we statistically verified that the methodology works equally well on large
and small datasets.
The average classifier performance in terms of macro-averaged F1-Measure achieved in
this research is 81.02% for the Yahoo! web directory and 84.85% for the DMOZ subset.
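For reference, macro-averaged F1 weights every category equally regardless of size, which is what makes it so demanding on directories with hundreds of thousands of rare categories. A minimal sketch of the standard computation (the labels below are hypothetical, not drawn from the thesis data):

```python
from sklearn.metrics import f1_score

# Hypothetical predictions over four categories; in the thesis setting each
# label would be a node of the Yahoo! or DMOZ hierarchy.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 2]

# Macro-F1: compute F1 per category, then take the unweighted mean, so a
# rare category counts exactly as much as a huge one.
print(f1_score(y_true, y_pred, average="macro"))
```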
2.
Data Mining Techniques to Identify Financial Restatements. Dutta, Ila, 27 March 2018.
Data mining is a multi-disciplinary field of science and technology widely used in developing predictive models and data visualization in various domains. Although there are numerous data mining algorithms and techniques across multiple fields, there appears to be no consensus on the suitability of a particular model, or on the ways to address data preprocessing issues. Moreover, the effectiveness of data mining techniques depends on the evolving nature of data. In this study, we focus on the suitability and robustness of various data mining models for analyzing real financial data to identify financial restatements. From a data mining perspective, financial restatements are interesting to study for the following reasons: (i) the restatement data is highly imbalanced, which requires adequate attention in model building; (ii) many financial and non-financial attributes may affect restatement predictive models, which requires careful implementation of data mining techniques to develop parsimonious models; and (iii) the class imbalance issue becomes more complex in a dataset that includes both intentional and unintentional restatement instances.

Most previous studies focus on fraudulent (or intentional) restatements, and the literature has largely ignored unintentional restatements. Intentional (i.e. fraudulent) restatement instances are rare and likely to have more distinct features compared to non-restatement cases. Unintentional cases, however, are comparatively more prevalent and likely to have fewer distinct features separating them from non-restatement cases. A dataset containing unintentional restatement cases is therefore likely to have more class overlapping issues that may impact the effectiveness of predictive models.

In this study, we developed predictive models based on all restatement cases (both intentional and unintentional) using a real, comprehensive and novel dataset which includes 116 attributes and approximately 1,000 restatement and 19,517 non-restatement instances over the period 2009 to 2014. To the best of our knowledge, no other study has developed predictive models for financial restatements using post-financial-crisis events. To avoid redundant attributes, we use three feature selection techniques: correlation-based feature subset selection (CfsSubsetEval), information gain attribute evaluation (InfoGainEval) and stepwise forward selection (FwSelect), and generate three datasets with reduced attributes. Our restatement dataset is highly skewed and biased towards the non-restatement (majority) class. We applied various algorithms (e.g. random undersampling (RUS), cluster-based undersampling (CUS) (Sobhani et al., 2014), random oversampling (ROS), the synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002), adaptive synthetic sampling (ADASYN) (He et al., 2008), and Tomek links with SMOTE) to address class imbalance in the financial restatement dataset. We perform classification with six classifiers, decision tree (DT), artificial neural network (ANN), naïve Bayes (NB), random forest (RF), Bayesian belief network (BBN) and support vector machine (SVM), using 10-fold cross-validation, and test the efficiency of the various predictive models using minority class recall, minority class F-measure and G-mean.
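As a point of reference for the resampling step described above, the sketch below pairs "Tomek links with SMOTE" with a random forest and reports minority recall and G-mean; the data is synthetic and the imbalanced-learn API is our choice of illustration, not necessarily the tooling used in the thesis.

```python
from imblearn.combine import SMOTETomek
from imblearn.metrics import geometric_mean_score
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the restatement data: ~5% minority class.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'Tomek links with SMOTE': SMOTE oversamples the minority class, then
# Tomek links are removed to clean up the class boundary.
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
y_hat = clf.predict(X_te)

print("minority recall:", recall_score(y_te, y_hat))
print("G-mean:", geometric_mean_score(y_te, y_hat))
```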
We also experiment with different ensemble methods (bagging and boosting) on the base classifiers and employ other meta-learning algorithms (stacking and cost-sensitive learning) to improve model performance. While applying the cluster-based undersampling technique, we find that various classifiers (e.g. SVM, BBN) show a high success rate in terms of minority class recall; for example, the SVM classifier shows a minority recall of 96%, which is quite encouraging. However, the ability of these classifiers to detect majority class instances is dismal. We find that some variants of synthetic oversampling, such as 'Tomek Link + SMOTE' and 'ADASYN', show promising results in terms of both minority recall and G-mean. Using the InfoGainEval feature selection method, the RF classifier shows minority recall values of 92.6% for 'Tomek Link + SMOTE' and 88.9% for 'ADASYN'. The corresponding G-mean values are 95.2% and 94.2%, which shows that the RF classifier is quite effective at predicting both minority and majority classes. We find further improvement for the RF classifier when a cost-sensitive learning algorithm is combined with the 'Tomek Link + SMOTE' oversampling technique. Subsequently, we develop decision rules to detect restatement firms based on a subset of important attributes.

To the best of our knowledge, only Kim et al. (2016) have performed a comparable data mining study, using pre-financial-crisis restatement data. Kim et al. (2016) employed a matching-sample-based undersampling technique and used logistic regression, SVM and BBN classifiers to develop financial restatement predictive models; their highest reported G-mean is 70%. Our results with clustering-based undersampling are similar to the performance measures reported by Kim et al. (2016), but our synthetic-oversampling-based results show better predictive ability: the RF classifier achieves a very high minority class recall (97.4%) and a very high G-mean (95.3%) with cost-sensitive learning. Still, we recognize that Kim et al. (2016) use a different restatement dataset (with pre-crisis restatement cases), so a direct comparison of results may not be fully justified.

Our study contributes to the data mining literature by (i) presenting predictive models for financial restatements with a comprehensive dataset, (ii) focusing on various data mining techniques and presenting a comparative analysis, and (iii) addressing the class imbalance issue by identifying the most effective technique. To the best of our knowledge, we used the most comprehensive dataset to date to develop predictive models for identifying financial restatements.
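Cost-sensitive learning, the meta-learning step that produced the best RF results above, reweights misclassification costs instead of resampling the data. A minimal sketch, assuming costs are expressed as class weights; the 20:1 ratio is an illustrative assumption, not the thesis's actual cost matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same kind of synthetic stand-in as before: ~5% minority class.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           random_state=0)

# Cost-sensitive random forest: misclassifying a rare restatement costs
# 20x more than misclassifying a non-restatement (illustrative ratio).
clf = RandomForestClassifier(class_weight={0: 1, 1: 20}, random_state=0)
print(cross_val_score(clf, X, y, cv=10, scoring="recall").mean())
```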
3.
On the Application of Multi-Class Classification in Physical Therapy Recommendation. Zhang, Jing, Unknown Date.
No description available.
4.
Active Learning for One-class Classification. Barnabé-Lortie, Vincent, January 2015.
Active learning is a common solution for reducing labeling costs and maximizing the impact of human labeling efforts in binary and multi-class classification settings. However, when we are faced with extreme levels of class imbalance, a situation in which it is not safe to assume that we have a representative sample of the minority class, it has been shown effective to replace the binary classifiers with one-class classifiers. In such a setting, traditional active learning methods, and many previously proposed in the literature for one-class classifiers, prove to be inappropriate, as they rely on assumptions about the data that no longer hold.
In this thesis, we propose a novel approach to active learning designed for one-class classification. The proposed method does not rely on many of the inappropriate assumptions of its predecessors and leads to more robust classification performance. The gist of this method consists of labeling, in priority, the instances considered to fit the learned class the least by previous iterations of a one-class classification model.
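A minimal sketch of that query strategy, assuming a one-class SVM as the underlying model (the abstract does not commit to a particular one-class learner):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_labeled = rng.normal(0, 1, size=(50, 2))   # instances labeled so far
X_pool = rng.normal(0, 2, size=(500, 2))     # unlabeled pool

# Fit the one-class model on the currently labeled instances.
occ = OneClassSVM(gamma="scale", nu=0.1).fit(X_labeled)

# Query priority: the instances that fit the learned class the least,
# i.e. those with the lowest decision scores.
scores = occ.decision_function(X_pool)
query_idx = np.argsort(scores)[:10]  # 10 least-fitting instances to label next
print(query_idx)
```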
Throughout the thesis, we provide evidence for the merits of our method, then deepen our understanding of these merits by exploring the properties of the method that allow it to outperform the alternatives.
5.
Beyond the Boundaries of SMOTE: A Framework for Manifold-based Synthetic Oversampling. Bellinger, Colin, January 2016.
Within machine learning, the problem of class imbalance refers to the scenario in which one or more classes is significantly outnumbered by the others. In the most extreme case, the minority class is not only significantly outnumbered by the majority class, but is also considered to be rare, or absolutely imbalanced. Class imbalance appears in a wide variety of important domains, ranging from oil spill and fraud detection to text classification and medical diagnosis. Given this, it has been deemed one of the ten most important research areas in data mining, and for more than a decade now the machine learning community has been coming together in an attempt to unequivocally solve the problem.
The fundamental challenge in the induction of a classifier from imbalanced training data is in managing the prediction bias. The current state-of-the-art methods deal with this by readjusting misclassification costs or by applying resampling methods. In cases of absolute imbalance, these methods are insufficient; rather, it has been observed that we need more training examples. The nature of class imbalance, however, dictates that additional examples cannot be acquired, and thus, synthetic oversampling becomes the natural choice.
We recognize the importance of selecting algorithms with assumptions and biases that are appropriate for the properties of the target data, and argue that this is of absolute importance when it comes to developing synthetic oversampling methods because a large generative leap must be made from a relatively small training set. In particular, our research into gamma-ray spectral classification has demonstrated the benefits of incorporating prior knowledge of conformance to the manifold assumption into the synthetic oversampling algorithms.
We empirically demonstrate the negative impact of the manifold property on the state-of-the-art methods, and propose a framework for manifold-based synthetic oversampling. We algorithmically present the generic form of the framework and demonstrate formalizations of it with PCA and the denoising autoencoder. Through use of the helix and Swiss roll datasets, which are standards in the manifold learning community, we visualize and qualitatively analyze the benefits of our proposed framework. Moreover, we unequivocally show the framework to be superior on three real-world gamma-ray spectral datasets and on sixteen benchmark UCI datasets in general. Specifically, our results demonstrate that the framework for manifold-based synthetic oversampling produces higher area-under-the-ROC-curve results than the current state of the art and degrades less on data that conforms to the manifold assumption.
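A minimal sketch of the PCA formalization, under the assumption that synthetic minority examples are generated in the reduced space and mapped back through the inverse transform; the denoising-autoencoder variant would replace PCA with encode/decode steps:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy minority class lying near a low-dimensional manifold.
t = rng.uniform(0, 3, size=60)
X_min = np.column_stack([t, np.sin(t), 0.05 * rng.normal(size=60)])

# 1. Learn a low-dimensional representation of the minority class.
pca = PCA(n_components=2).fit(X_min)
Z = pca.transform(X_min)

# 2. Generate new points in the reduced space, here by jittering real
#    embeddings (SMOTE-style interpolation in Z would also work).
Z_new = Z[rng.integers(len(Z), size=100)] + 0.1 * rng.normal(size=(100, 2))

# 3. Map the synthetic points back onto (near) the data manifold.
X_syn = pca.inverse_transform(Z_new)
print(X_syn.shape)  # (100, 3) synthetic minority examples
```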
6.
A Novel Data Imbalance Methodology Using a Class Ordered Synthetic Oversampling Technique. Pahren, Laura, 23 August 2022.
No description available.
7.
Head Tail Open: Open Tailed Classification of Imbalanced Document Data. Joshi, Chetan, 23 April 2024.
Deep learning models for scanned document image classification and form understanding have made significant progress in the last few years. High accuracy can be achieved by a model with the help of copious amounts of labelled training data for closed-world classification. However, very little work has been done on fine-grained, head-tailed (class-imbalanced, with some classes having many data points and some having few) open-world classification for documents. Our proposed method achieves better classification results than the baseline on the head-tail-novel/open dataset. Our techniques include separating the head and tail classes, as sketched below, and transferring knowledge from the head data to the tail data. This transfer of knowledge also improves the capability of recognizing a novel category by 15% compared to the baseline.
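A minimal sketch of the head/tail separation step, assuming a simple per-class frequency threshold (the abstract does not state the actual split criterion):

```python
from collections import Counter

def split_head_tail(labels, threshold=100):
    """Partition classes into head (frequent) and tail (rare) by count.

    The threshold of 100 documents per class is an illustrative
    assumption, not a value taken from the thesis.
    """
    counts = Counter(labels)
    head = {c for c, n in counts.items() if n >= threshold}
    tail = set(counts) - head
    return head, tail

# Example: three head classes, one tail class.
labels = ["invoice"] * 500 + ["memo"] * 300 + ["letter"] * 150 + ["deed"] * 12
print(split_head_tail(labels))
```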
8.
Use of machine learning in bankruptcy prediction with highly imbalanced datasets: The impact of sampling methods. Mahembe, Wonder, January 2024.
Since Altman's 1968 discriminant analysis model for corporate bankruptcy prediction, there have been numerous studies applying statistical and machine learning (ML) models to predicting bankruptcy in various contexts. ML models have proven highly accurate in bankruptcy prediction up to three years before the event, more so than statistical models. A major limitation of ML models is their inability to handle highly imbalanced datasets, which has resulted in the development of a plethora of oversampling and undersampling methods for addressing class imbalance. However, current research on the impact of different sampling methods on the predictive performance of ML models is fragmented, inconsistent, and limited. This thesis investigated whether the choice of sampling method led to significant differences in the performance of five predictive algorithms: logistic regression, multiple discriminant analysis (MDA), random forests, Extreme Gradient Boosting (XGBoost), and support vector machines (SVM). Four oversampling methods (random oversampling (ROWR), the synthetic minority oversampling technique (SMOTE), oversampling based on propensity scores (OBPS), and oversampling based on weighted nearest neighbour (WNN)) and three undersampling methods (random undersampling (RU), undersampling based on clustering from nearest neighbour (CFNN), and undersampling based on clustering from Gaussian mixture methods (GMM)) were tested. The dataset comprised non-listed Swedish restaurant businesses (1998–2021) obtained from the business registry of Sweden: 10,696 companies with 335 bankrupt instances.

Results, assessed through 10-fold cross-validated AUC scores, reveal that oversampling methods generally outperformed undersampling methods. SMOTE performed best with four of the five algorithms, while WNN performed best with the random forest model. Results of Wilcoxon's signed-rank test showed that some differences between oversampling and undersampling were statistically significant, but differences within each group were not. Further, results showed that while XGBoost had the highest AUC score of all predictive algorithms, it was also the most sensitive to the choice of sampling method, while MDA was the least sensitive.

Overall, it was concluded that the choice of sampling method can significantly impact the performance of different algorithms, and thus users should consider both the algorithm's sensitivity and the comparative performance of the sampling methods. The thesis's results challenge some prior findings and suggest avenues for further exploration, highlighting the importance of selecting appropriate sampling methods when working with highly imbalanced datasets.
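A minimal sketch of the evaluation protocol described above: compute per-fold AUC scores for two sampling strategies and compare them with Wilcoxon's signed-rank test. The data is synthetic and the pipeline components are stand-ins for the thesis's actual setup:

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import Pipeline
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the bankruptcy data: ~3% minority class.
X, y = make_classification(n_samples=4000, n_features=15, weights=[0.97],
                           random_state=0)

def fold_aucs(sampler):
    # The sampler is applied only to each training fold, never to test data.
    pipe = Pipeline([("sample", sampler),
                     ("clf", LogisticRegression(max_iter=1000))])
    return cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")

auc_smote = fold_aucs(SMOTE(random_state=0))
auc_ros = fold_aucs(RandomOverSampler(random_state=0))

# Paired, non-parametric comparison of the two sampling methods.
stat, p = wilcoxon(auc_smote, auc_ros)
print(f"SMOTE mean AUC {auc_smote.mean():.3f}, "
      f"ROS mean AUC {auc_ros.mean():.3f}, p={p:.3f}")
```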
9.
Gaussian Process Multiclass Classification: Evaluation of Binarization Techniques and Likelihood Functions. Ringdahl, Benjamin, January 2019.
In binary Gaussian process classification, the prior class membership probabilities are obtained by transforming a Gaussian process to the unit interval, typically with either the logistic likelihood function or the cumulative Gaussian likelihood function. Multiclass classification problems can be handled by any binary classifier by means of so-called binarization techniques, which reduce the multiclass problem to a number of binary problems. Besides introducing the mathematics and methods behind Gaussian process classification, we compare the binarization techniques one-against-all and one-against-one in the context of Gaussian process classification, and we also compare the performance of the logistic likelihood and the cumulative Gaussian likelihood. This is done by means of two experiments: a general experiment in which the methods are tested on several publicly available datasets, and a more specific experiment in which the methods are compared with respect to class imbalance and class overlap on several artificially generated datasets. The results indicate no significant difference between the choices of binarization technique and likelihood function for typical datasets, although the one-against-one technique showed slightly more consistent performance. However, the second experiment revealed differences in how the methods react to varying degrees of class imbalance and class overlap. Most notably, the logistic likelihood was a dominant factor, and the one-against-one technique performed better than one-against-all.
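A minimal sketch of the one-against-all versus one-against-one comparison using scikit-learn's Gaussian process classifier; note that this implementation fixes the logistic likelihood (via a Laplace approximation), so the likelihood comparison itself is not reproduced here:

```python
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

for scheme in ("one_vs_rest", "one_vs_one"):
    # Each binary subproblem gets its own latent GP with an RBF kernel.
    gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), multi_class=scheme)
    acc = cross_val_score(gpc, X, y, cv=5).mean()
    print(f"{scheme}: {acc:.3f}")
```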
10.
"Novas abordagens em aprendizado de máquina para a geração de regras, classes desbalanceadas e ordenação de casos" / "New approaches in machine learning for rule generation, class imbalance and rankings"Prati, Ronaldo Cristiano 07 July 2006 (has links)
Machine learning algorithms are often the most appropriate algorithms for a great variety of data mining applications. However, most machine learning research to date has dealt mainly with the well-circumscribed problem of finding a model (generally a classifier) given a single, small and relatively clean dataset in attribute-value form, where the attributes have previously been chosen to facilitate learning. Furthermore, the end goal is simple and well defined, such as accurate classifiers in the classification problem. Data mining opens up new directions for machine learning research and lends new urgency to others; with data mining, machine learning is now removing each of these constraints. Machine learning's many valuable contributions to data mining are thus reciprocated by the latter's invigorating effect on it. In this thesis, we explore this interaction by proposing new solutions to some problems arising from the application of machine learning algorithms to data mining applications. More specifically, we contribute to the following problems.

New approaches to rule learning. We propose two new methods for rule learning. In the first, we propose a new method for finding exceptions to general rules. The second is a rule selection algorithm, called Roccer, based on ROC analysis: rules come from an external, larger set of rules, and the algorithm performs a selection step based on the current convex hull in the ROC graph.

Proportion of examples among classes. We investigated several aspects related to this issue. First, we carried out a series of experiments on artificial datasets in order to verify our hypothesis that overlapping among classes is a complicating factor in highly skewed datasets. We also carried out a broad experimental analysis with several methods (some of them proposed by us) that artificially balance skewed datasets; our experiments show that, in general, over-sampling methods perform better than under-sampling methods. Finally, we investigated the relationship between class imbalance and small disjuncts, as well as the influence of the proportion of examples among classes on the process of labelling unlabelled cases in the semi-supervised learning algorithm Co-training.

New method for combining rankings. We propose a new method, called BordaRank, to construct ensembles of rankings based on Borda count voting, which can be applied to any binary ranking problem in which several rankings are available. Experimental results show an improvement over the individual rankings, as well as performance comparable to more sophisticated algorithms that use numeric predictions, rather than rankings, to build ensembles for the binary ranking problem.
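A minimal sketch of Borda count rank fusion of the kind BordaRank builds on; the scoring and tie handling here are standard Borda count, not necessarily the thesis's exact variant:

```python
def borda_fuse(rankings):
    """Fuse rankings by Borda count: the item ranked r-th in a list of n
    items receives n - r points; totals are sorted to give the ensemble
    ranking."""
    items = rankings[0]
    n = len(items)
    scores = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            scores[item] += n - position  # top rank earns the most points
    return sorted(scores, key=scores.get, reverse=True)

# Three hypothetical rankers ordering the same four cases, best first.
r1 = ["a", "b", "c", "d"]
r2 = ["b", "a", "d", "c"]
r3 = ["a", "c", "b", "d"]
print(borda_fuse([r1, r2, r3]))  # ['a', 'b', 'c', 'd']
```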