1 |
CHRONIC PAIN A study on patients with chronic pain : What characteristics/variables lie behind the fact that a patient does not respond well to treatment?
Lindvall, Agnes, Chilaika, Ana January 2015
The primary purpose of this study was to find out which variables distinguish patients who respond well to treatment of chronic pain from those who do not. We used logistic regression to predict group membership based on self-reported health surveys, i.e., whether different answers in the surveys can predict whether a patient is “responsive” or “unresponsive”. By bootstrapping 176 samples and aggregating the results from 176 logistic regressions based on the sub-samples, we calculated an averaged model. The variables anxiety and physical health were significant in 76% and 70% of the models respectively, while depression was significant in 30% of the models. Gender was significant in 15% of the models and health status in 0.006%. The averaged model correctly classified the most unresponsive patients at cut-off value 0.5. As the cut-off value was increased, the number of correctly classified unresponsive patients decreased while the number of correctly classified responsive patients increased, as did the number of unresponsive patients classified as responsive. We concluded that the model did not discriminate enough between the two groups. We were also interested in finding out how the variables anxiety, depression, health status, willingness to participate in activities as well as engagement in activities, mental and physical health relate to one another. The results from confirmatory factor analysis showed that a patient’s health status is highly related to their physical health and activity engagement, while pain willingness and engagement in activity were least related. Furthermore, the analysis showed that mental health is highly related to anxiety and health status, indicating that mental health is indeed important to reflect upon when considering the health status of a patient.
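The effect of the cut-off value described above can be illustrated with a small sketch. The probabilities and labels below are invented for illustration and are not the study's data; the sketch also assumes that the model outputs the probability of a patient being unresponsive (label 1), which matches the direction of the reported trade-off but is not stated in the abstract. The function name `confusion_at_cutoff` is hypothetical.

```python
def confusion_at_cutoff(probs, labels, cutoff):
    """Count outcomes when p >= cutoff is classified as unresponsive (label 1).

    Assumption: probs are predicted probabilities of being unresponsive.
    """
    counts = {
        "correct_unresponsive": 0,
        "correct_responsive": 0,
        "unresponsive_as_responsive": 0,
        "responsive_as_unresponsive": 0,
    }
    for p, y in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if y == 1 and predicted == 1:
            counts["correct_unresponsive"] += 1
        elif y == 0 and predicted == 0:
            counts["correct_responsive"] += 1
        elif y == 1 and predicted == 0:
            counts["unresponsive_as_responsive"] += 1
        else:
            counts["responsive_as_unresponsive"] += 1
    return counts


# Invented example data: four unresponsive (1) and four responsive (0) patients.
probs = [0.9, 0.7, 0.55, 0.4, 0.6, 0.45, 0.3, 0.2]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for cutoff in (0.5, 0.65, 0.8):
    print(cutoff, confusion_at_cutoff(probs, labels, cutoff))
```

Raising the cut-off in this toy example moves patients from the "unresponsive" prediction to the "responsive" one, so correct unresponsive classifications fall while correct responsive classifications rise, the same pattern the abstract reports.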
|
2 |
FUZZY CLASSIFIERS FOR IMBALANCED DATA SETS
VISA, SOFIA 08 October 2007
No description available.
|
3 |
A Segmentation and Re-balancing Approach for Classification of Imbalanced Data
Gong, Rongsheng 19 April 2011
No description available.
|
4 |
Improving Text Classification Using Graph-based Methods
Karajeh, Ola Abdel-Raheem Mohammed 05 June 2024
Text classification is a fundamental natural language processing task. However, in real-world applications, class distributions are usually skewed, e.g., due to inherent class imbalance. In addition, the task difficulty changes based on the underlying language. When rich morphological structure and high ambiguity are exhibited, natural language understanding can become challenging. For example, Arabic, ranked the fifth most widely used language, has a rich morphological structure and high ambiguity that result from Arabic orthography. Thus, Arabic natural language processing is challenging. Several studies employ Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), but Graph Convolutional Networks (GCNs) have not yet been investigated for the task. Sequence-based models can successfully capture semantics in local consecutive text sequences. On the other hand, graph-based models can preserve global co-occurrences that capture non-consecutive and long-distance semantics. A text representation approach that combines local and global information can enhance performance in practical class imbalance text classification scenarios. Yet, multi-view graph-based text representations have received limited attention.
In this research, first we introduce Multi-view Minority Class Text Graph Convolutional Network (MMCT-GCN), a transductive multi-view text classification model that captures textual graph representations for the minority class alongside sequence-based text representations. Experimental results show that MMCT-GCN obtains consistent improvements over baselines. Second, we develop an Arabic Bidirectional Encoder Representations from Transformers (BERT) Graph Convolutional Network (AraBERT-GCN), a hybrid model that combines the large-scale pre-trained models that encode the local context and semantics alongside graph-based features that are capable of extracting the global word co-occurrences in non-consecutive extended semantics by only one or two hops. Experimental results show that AraBERT-GCN outperforms the state-of-the-art (SOTA) on our Arabic text datasets. Finally, we propose an Arabic Multidimensional Edge Graph Convolutional Network (AraMEGraph) designed for text classification that encapsulates richer and context-aware representations of word and phrase relationships, thus mitigating the impact of the complexity and ambiguity of the Arabic language. / Doctor of Philosophy / The text classification task is an important step in understanding natural language. However, this task has many challenges, such as uneven data distributions and language difficulty. For example, Arabic is the fifth most spoken language. It has many different word forms and meanings, which can make things harder to understand. Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) are widely utilized for text classification. However, another kind of network called graph convolutional network (GCN) has yet to be explored for this task. Graph-based models keep track of how words are connected, even if they are not right next to each other in a sentence. This helps with better understanding the meaning of words. 
On the other hand, sequence-based models do well in understanding the meaning of words that are right next to each other. Mixing both types of information in text understanding can work better, especially when dealing with unevenly distributed data. In this research, we introduce a new text classification method called Multi-view Minority Class Text Graph Convolutional Network (MMCT-GCN). This model looks at text from different angles and combines information from graphs and sequence-based models. Our experiments show that this model performs better than other ones proposed in the literature. Additionally, we propose an Arabic BERT Graph Convolutional Network (AraBERT-GCN). It combines pre-trained models that understand words in context and graph features that look at how words relate to each other globally. This helps AraBERT-GCN do better than other models when working with Arabic text. Finally, we develop a special network called Arabic Multidimensional Edge Graph Convolutional Network (AraMEGraph) for Arabic text. It is designed to better understand Arabic and classify text more accurately. We do this by adding special edge features with multiple dimensions to help the network learn the relationships between words and phrases.
|
5 |
Advanced Text Analytics and Machine Learning Approach for Document Classification
Anne, Chaitanya 19 May 2017
Text classification is used in information extraction and retrieval, and is considered an important step in managing the vast and growing number of records held in digital form. This thesis addresses the problem of classifying patent documents into fifteen categories or classes, some of which overlap for practical reasons. To develop the classification model using machine learning techniques, useful features were extracted from the given documents. The features are used to classify patent documents as well as to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management by developing a set of automated tools that can help NASA manage and market its portfolio of intellectual properties (IP) and enable easier discovery of relevant IP by users. We identified an array of applicable methods: k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms, Random Forest and J48. The major research steps in this work consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding synthetic data wherever appropriate, which resulted in a superior SVM-classifier-based model.
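The abstract does not say how the synthetic data were generated. One common approach is to interpolate between a minority-class sample and one of its nearest minority-class neighbours, which is the idea behind SMOTE. The sketch below illustrates that idea only; it is an assumption, not the thesis's method, and the function name `synthesize_minority`, its parameters and the example points are all hypothetical.

```python
import random


def synthesize_minority(samples, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative SMOTE-style sketch; requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # k nearest neighbours of base (excluding base itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Invented two-dimensional minority-class feature vectors.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = synthesize_minority(minority, n_new=3)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies, which is why this kind of oversampling is often preferred to simple duplication.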
|
6 |
Detection of unusual fish trajectories from underwater videos
Beyan, Çigdem January 2015
Fish behaviour analysis is a fundamental research area in marine ecology as it is helpful for detecting environmental changes by observing unusual fish patterns or new fish behaviours. The traditional way of analysing fish behaviour is by visual inspection using human observers, which is very time consuming and also limits the amount of data that can be processed. Therefore, there is a need for automatic algorithms to identify fish behaviours by using computer vision and machine learning techniques. The aim of this thesis is to help marine biologists with their work. We focus on behaviour understanding and analysis of detected and tracked fish with unusual behaviour detection approaches. Normal fish trajectories exhibit frequently observed behaviours while unusual trajectories are outliers or rare trajectories. This thesis proposes 3 approaches to detecting unusual trajectories: i) a filtering mechanism for normal fish trajectories, ii) an unusual fish trajectory classification method using clustered and labelled data and iii) an unusual fish trajectory classification approach using a clustering based hierarchical decomposition. The rule based trajectory filtering mechanism is proposed to remove normal fish trajectories which potentially helps to increase the accuracy of the unusual fish behaviour detection system. The aim is to reject normal fish trajectories as much as possible while not rejecting unusual fish trajectories. The results show that this method successfully filters out normal trajectories with a low false negative rate. This method is useful to assist building a ground truth data set from a very large fish trajectory repository, especially when the amount of normal fish trajectories greatly dominates the unusual fish trajectories. Moreover, it successfully distinguishes true fish trajectories from false fish trajectories which result from errors by the fish detection and tracking algorithms. 
A key contribution of this thesis is the proposed flat classifier, which uses an outlier detection method based on cluster cardinalities and a distance function to detect unusual fish trajectories. Clustered and labelled data are used to select feature sets which perform best on a training set. To describe fish trajectories, 10 groups of trajectory descriptions are proposed which were not previously used for fish behaviour analysis. The proposed flat classifier improved the performance of unusual fish detection compared to the filtering approach. The performance of the flat classifier is further improved by integrating it into a hierarchical decomposition. This hierarchical decomposition method selects more specific features for different trajectory clusters, which is useful considering the trajectory variety. Significantly improved results were obtained using this hierarchical decomposition in comparison to the flat classifier. This hierarchical framework is also applied to classification of more general imbalanced data sets, which is a key current topic in machine learning. The experiments showed that the proposed hierarchical decomposition method is significantly better than state-of-the-art classification methods, other outlier detection methods and unusual trajectory detection methods. Furthermore, it is successful at classifying imbalanced data sets even when the majority and minority classes are internally varied and the classes overlap, as is frequently seen in real-world applications. Finally, we explored the benefits of active learning in the context of the hierarchical decomposition method, where active learning query strategies choose the most informative training data. A substantial performance gain is possible by using less labelled training data compared to learning from larger labelled data sets. Additionally, active learning with feature selection is investigated.
The results show that feature selection has a positive effect on the performance of active learning. However, we show that random selection can be as effective as popular active learning query strategies in combination with active learning and feature selection, especially for imbalanced set classification.
|
7 |
Predicting the Unobserved : A statistical analysis of missing data techniques for binary classification
Säfström, Stella January 2019
The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differs, given an imbalanced data set with MCAR (missing completely at random) missing data. The performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one with a simulated data set and one application using data from the Swedish population registries. The simulated data set is constructed to have the same class imbalance of 1:5. The missing values are handled using three different techniques: complete case analysis, predictive mean matching and mean imputation. The thesis concludes that logistic regression and random forest are on average equally accurate, with some instances of random forest outperforming logistic regression. Logistic regression consistently outperforms random forest with regard to sensitivity. This implies that logistic regression may be the best option for studies where the goal is to accurately predict outcomes in the minority class. None of the missing data techniques stood out in terms of performance.
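Two of the missing-data techniques named above, and the two performance measures, can be sketched in a few lines. This is a minimal illustration with invented data, not the thesis's code; missing values are represented as `None`, the helper names are hypothetical, and predictive mean matching is omitted because it needs a fitted regression model.

```python
def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing value."""
    return [r for r in rows if None not in r]


def mean_impute(rows):
    """Mean imputation: replace each missing value with its column mean.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def accuracy_and_sensitivity(y_true, y_pred, positive=1):
    """Accuracy over all cases; sensitivity over the positive class only."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return correct / len(y_true), true_pos / actual_pos


# Invented data with two missing entries.
rows = [[1.0, 2.0], [None, 4.0], [3.0, None], [5.0, 6.0]]
print(complete_cases(rows))
print(mean_impute(rows))

# A 1:5 imbalance where a classifier always predicts the majority class (0).
y_true = [1, 0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0]
print(accuracy_and_sensitivity(y_true, y_pred))
```

The last call shows why sensitivity is reported alongside accuracy for imbalanced data: a classifier that never predicts the minority class still scores high on accuracy while its sensitivity is zero.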
|
8 |
Técnicas para o problema de dados desbalanceados em classificação hierárquica / Techniques for the problem of imbalanced data in hierarchical classification
Barella, Victor Hugo 24 July 2015
Recent advances in science and technology have enabled data to grow in both quantity and availability. Along with this explosion of generated information comes the need to analyze data to discover new and useful knowledge. Thus, areas that aim to extract knowledge and useful information from large datasets, such as Machine Learning (ML) and Data Mining (DM), have become great opportunities for the advancement of research. However, some limitations may reduce the accuracy of traditional algorithms in these areas, for example imbalance among the class samples of a dataset. To mitigate this problem, several alternatives have been the target of research in recent years, such as techniques for artificially balancing data, algorithm modification and new approaches for imbalanced data. An area little explored from the perspective of data imbalance is hierarchical classification, in which the classes are organized into hierarchies, commonly in the form of a tree or DAG (Directed Acyclic Graph). The goal of this work was to investigate the limitations of, and ways to minimize, the effects of imbalanced data in hierarchical classification problems. The experimental results show the need to take into account the characteristics of the hierarchical classes when deciding whether to apply techniques for imbalanced data in hierarchical classification.
|
9 |
Cost-Sensitive Boosting for Classification of Imbalanced Data
Sun, Yanmin 11 May 2007
The classification of data with imbalanced class distributions
poses a significant obstacle to the performance attainable by most
well-developed classification systems, which assume relatively
balanced class distributions. This problem is especially crucial
in many application domains, such as medical diagnosis, fraud
detection, network intrusion, etc., which are of great importance
in machine learning and data mining.
This thesis explores meta-techniques which are applicable to most
classifier learning algorithms, with the aim to advance the
classification of imbalanced data. Boosting is a powerful
meta-technique to learn an ensemble of weak models with a promise
of improving the classification accuracy. AdaBoost is widely regarded
as the most successful boosting algorithm. This thesis starts with
applying AdaBoost to an associative classifier for both learning
time reduction and accuracy improvement. However, the promise of
accuracy improvement is trivial in the context of the class
imbalance problem, where accuracy is less meaningful. The insight
gained from a comprehensive analysis on the boosting strategy of
AdaBoost leads to the investigation of cost-sensitive boosting
algorithms, which are developed by introducing cost items into the
learning framework of AdaBoost. The cost items are used to denote
the uneven identification importance among classes, such that the
boosting strategies can intentionally bias the learning towards
classes associated with higher identification importance and
eventually improve the identification performance on them. Given
an application domain, cost values with respect to different types
of samples are usually unavailable for applying the proposed
cost-sensitive boosting algorithms. To set up the effective cost
values, empirical methods are used for bi-class applications and
heuristic searching of the Genetic Algorithm is employed for
multi-class applications.
This thesis also covers the implementation of the proposed
cost-sensitive boosting algorithms. It ends with a discussion on
the experimental results of classification of real-world
imbalanced data. Compared with existing algorithms, the new
algorithms this thesis presents are superior in achieving better
measurements regarding the learning objectives.
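
The idea of introducing cost items into AdaBoost's learning framework can be sketched as one round of sample reweighting. The abstract does not give the update rules, so the version below, which multiplies each sample's cost into both the learner weight and the weight update, is an assumption made for illustration and may differ from the algorithms the thesis proposes. The function name, cost values and labels are hypothetical.

```python
import math


def cost_sensitive_update(weights, costs, y_true, y_pred):
    """One boosting round with per-sample costs folded into the update.

    Illustrative sketch: requires at least one correct and one wrong
    prediction so that both cost-weighted sums are positive.
    """
    samples = list(zip(weights, costs, y_true, y_pred))
    correct = sum(c * w for w, c, t, p in samples if t == p)
    wrong = sum(c * w for w, c, t, p in samples if t != p)
    # Weight of this weak learner in the final ensemble.
    alpha = 0.5 * math.log(correct / wrong)
    # Misclassified samples are scaled up, correct ones scaled down,
    # and every sample is additionally scaled by its cost.
    raw = [c * w * math.exp(alpha if t != p else -alpha)
           for w, c, t, p in samples]
    total = sum(raw)
    return alpha, [r / total for r in raw]


# Invented example: four majority samples (-1, cost 1) and two minority
# samples (+1, cost 2); the weak learner errs on samples 3 and 5.
weights = [1 / 6] * 6
costs = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
y_true = [-1, -1, -1, -1, 1, 1]
y_pred = [-1, -1, -1, 1, 1, -1]

alpha, new_weights = cost_sensitive_update(weights, costs, y_true, y_pred)
print(alpha, new_weights)
```

In this toy round the misclassified minority sample ends up with a larger weight than the misclassified majority sample, so the next weak learner is biased towards the class with higher identification importance, as the abstract describes.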
|