1 |
A Comparison of Rule Extraction Techniques with Emphasis on Heuristics for Imbalanced Datasets
Singh, Manjeet, 22 September 2010
No description available.
|
2 |
Development of Artificial Intelligence-based In-Silico Toxicity Models. Data Quality Analysis and Model Performance Enhancement through Data Generation.
Malazizi, Ladan, January 2008
Toxic compounds, such as pesticides, are routinely tested against a range of aquatic,
avian and mammalian species as part of the registration process. The need to
reduce dependence on animal testing has led to increasing interest in alternative
methods such as in silico modelling. QSAR (Quantitative Structure-Activity
Relationship)-based models are already in use for predicting physicochemical
properties, environmental fate, eco-toxicological effects, and specific biological
endpoints for a wide range of chemicals. Data plays an important role in modelling
QSARs and in analysing the results of toxicity testing. This research
addresses a number of issues in predictive toxicology. One issue is data
quality. Although a large amount of toxicity data is available from online sources, this
data may contain unreliable samples and may be of low quality. Its
presentation may also be inconsistent across sources, which makes
access, interpretation and comparison of the information difficult. To address this
issue we started with a detailed investigation and experimental work on the DEMETRA
datasets, which were produced by the EC-funded DEMETRA project.
Based on the investigation, experiments and results obtained, the
author identified a number of data quality criteria to support
data evaluation in the toxicology domain. An algorithm is also proposed to assess
data quality before modelling. Another issue considered in the thesis is missing
values in toxicology datasets. The Least Squares Method for a paired dataset
and Serial Correlation for a single-version dataset provide solutions for these
two situations. A procedural algorithm using these two methods is
proposed to handle missing values. The final issue addressed
in this thesis is the modelling of multi-class datasets with a severely
imbalanced class distribution. Imbalanced data affects the
performance of classifiers during classification. We show that, as
long as we understand how class members are distributed in dimensional space within
each cluster, we can reshape the distribution and provide more domain knowledge to
the classifier.
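The abstract gives no details of the Least Squares procedure, so the following is only a minimal sketch of one common reading: for a paired dataset (two versions of the same measurement), fit a straight line on the complete pairs and use it to fill a missing value. The function names and the toy data are hypothetical and do not come from the thesis.

```python
def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_paired(pairs):
    """Fill missing second values (None) from a line fitted to the complete pairs."""
    complete = [(x, y) for x, y in pairs if y is not None]
    slope, intercept = fit_least_squares(
        [x for x, _ in complete], [y for _, y in complete]
    )
    return [(x, y if y is not None else slope * x + intercept) for x, y in pairs]


# Hypothetical toy data: the third pair is missing its second value.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, None), (4.0, 8.0)]
filled = impute_paired(pairs)
```

In this toy case the complete pairs lie exactly on y = 2x, so the missing value is filled with approximately 6.0. Real toxicity data would need checks for outliers and for too few complete pairs.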
|
3 |
Development of artificial intelligence-based in-silico toxicity models : data quality analysis and model performance enhancement through data generation
Malazizi, Ladan, January 2008
|
4 |
Credit Scoring using Machine Learning Approaches
Chitambira, Bornvalue, January 2022
This project explores machine learning approaches used in credit scoring. We consider consumer credit scoring rather than corporate credit scoring. Our focus is on methods currently used in practice by banks, such as logistic regression and decision trees, and we compare their performance against machine learning approaches such as support vector machines (SVM), neural networks and random forests. In our models we address important issues such as dataset imbalance, model overfitting and calibration of model probabilities. The six machine learning methods we study are support vector machines, logistic regression, k-nearest neighbours, artificial neural networks, decision trees and random forests. We implement these models in Python and analyse their performance on a credit dataset with 30,000 observations from Taiwan, extracted from the University of California Irvine (UCI) machine learning repository.
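The abstract does not say how dataset imbalance was handled. One widely used option is to weight each class inversely to its frequency, w_c = n / (k * n_c), which is the same heuristic scikit-learn applies for class_weight="balanced". The sketch below computes these weights in plain Python; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n / (k * n_c)}, so rarer classes get larger weights."""
    counts = Counter(labels)
    n = len(labels)   # total number of samples
    k = len(counts)   # number of distinct classes
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical toy labels: 8 non-defaults (0) and 2 defaults (1).
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
```

With these toy labels the majority class receives weight 0.625 and the minority class 2.5, so each class contributes equally to a weighted loss.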
|
5 |
Instance Segmentation of Multiclass Litter and Imbalanced Dataset Handling : A Deep Learning Model Comparison / Instanssegmentering av kategoriserat skräp samt hantering av obalanserat dataset
Sievert, Rolf, January 2021
Instance segmentation has great potential for reducing littering by autonomously detecting and segmenting different categories of litter. With this information, litter could, for example, be geotagged to aid litter pickers or to give precise locational information to unmanned vehicles for autonomous litter collection. Land-based litter instance segmentation is a relatively unexplored field, and this study compares the instance segmentation models Mask R-CNN and DetectoRS on the multiclass litter dataset Trash Annotations in Context (TACO), using the Common Objects in Context precision and recall scores. TACO is an imbalanced dataset, so imbalanced data handling is addressed by using a second-order relation iterative stratified split and, additionally, by oversampling when training Mask R-CNN. Mask R-CNN achieved a segmentation mAP of 0.127 without oversampling and 0.163 with oversampling. DetectoRS achieved a segmentation mAP of 0.167 and most noticeably improves the segmentation mAP of small objects, by a factor of at least 2, which is important within the litter domain since small objects such as cigarettes are overrepresented. In contrast, oversampling with Mask R-CNN does not seem to improve the general precision of small and medium objects, but only improves the detection of large objects. It is concluded that DetectoRS improves results compared to Mask R-CNN, as does oversampling. However, using a dataset that cannot have all-class representation in the train, validation, and test splits, together with an iterative stratification that does not guarantee all-class representation, makes it hard for future work to make exact comparisons with this study. Results over all categories are therefore approximate, since 12 categories are missing from the test set, 4 of which were impossible to split into train, validation, and test sets.
Further image collection and annotation to mitigate the imbalance would most noticeably improve results, since results depend on class-averaged values. Applying oversampling with DetectoRS would also help improve results. There is also the option of combining the two datasets TACO and MJU-Waste to enable training on more categories.
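The abstract does not describe how oversampling was implemented. As a minimal sketch only, the code below shows plain random oversampling for a single-label dataset: minority-class samples are duplicated at random until every class matches the largest one. TACO images can contain several categories at once, so the procedure in the study is likely more involved; the function name and the toy data here are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes have equal counts."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies (with replacement) from the class's own samples.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: four "bottle" images and one "cigarette" image.
samples = ["img_a", "img_b", "img_c", "img_d", "img_e"]
labels = ["bottle", "bottle", "bottle", "bottle", "cigarette"]
new_samples, new_labels = random_oversample(samples, labels)
class_counts = Counter(new_labels)
```

Oversampling should be applied to the training split only, so that duplicated images do not leak into validation or test data.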
|