1 |
A Comparative Review of SMOTE and ADASYN in Imbalanced Data Classification
Brandt, Jakob; Lanzén, Emil (January 2021)
In this thesis, the performance of two over-sampling techniques, SMOTE and ADASYN, is compared. The comparison is done on three imbalanced data sets using three different classification models and evaluation metrics, while varying the way the data is pre-processed. The results show that both SMOTE and ADASYN improve the performance of the classifiers in most cases. It is also found that SVM in conjunction with SMOTE performs better than with ADASYN as the degree of class imbalance increases. Furthermore, both SMOTE and ADASYN increase the relative performance of the Random forest as the degree of class imbalance grows. However, no pre-processing method consistently outperforms the other in its contribution to better performance as the degree of class imbalance varies.
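As a rough illustration of the kind of over-sampling the abstract compares, the sketch below shows the interpolation step at the core of SMOTE: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours. ADASYN follows the same idea but generates more points around minority samples that have many majority-class neighbours; that weighting is not shown. The function name `smote_sketch`, its parameters, and the toy data are hypothetical and are not taken from the thesis.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features (illustrative values only).
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0)]
new_points = smote_sketch(minority_points, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority points, it always falls inside the bounding box of the minority class.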
|
2 |
Optimising Machine Learning Models for Imbalanced Swedish Text Financial Datasets: A Study on Receipt Classification: Exploring Balancing Methods, Naive Bayes Algorithms, and Performance Tradeoffs
Hu, Li Ang; Ma, Long (January 2023)
This thesis investigates imbalanced Swedish text financial datasets, specifically receipt classification using machine learning models. The study explores the effectiveness of under-sampling and over-sampling methods for Naive Bayes algorithms in a controlled experiment conducted in collaboration with Fortnox. The balancing methods are compared on accuracy, Matthews correlation coefficient (MCC), F1 score, precision, and recall. The findings contribute to Swedish text classification by providing insights into balancing methods. The thesis examines balancing methods and parameter tuning of machine learning models for imbalanced datasets. Multinomial Naive Bayes (MultiNB) algorithms in natural language processing (NLP) are studied, with potential application in image classification for assessing the deformation of thin industrial components. Experiments show that balancing methods significantly affect MCC and recall, with a tradeoff between recall, MCC, and accuracy. Smaller alpha values generally improve accuracy. Applying Tomek links (an algorithm developed by Ivan Tomek in 1976) first and the Synthetic Minority Oversampling Technique (SMOTE) second, a combination referred to as TomekSMOTE, yields promising accuracy improvements. Due to time constraints, training with the reverse order, over-sampling with SMOTE followed by cleaning with Tomek links (SMOTETomek), is incomplete. The thesis finds that the best MCC is achieved when $\alpha$ is 0.01 on imbalanced datasets.
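To make the accuracy versus MCC distinction in the abstract concrete, here is a minimal sketch of both metrics computed from binary confusion-matrix counts. The function names and the counts are illustrative assumptions and do not come from the thesis; returning 0 when the MCC denominator is zero is a common convention, adopted here as an assumption.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical imbalanced test set: 5 positives, 95 negatives.
# A classifier that always predicts "negative":
always_negative = dict(tp=0, tn=95, fp=0, fn=5)
# A classifier that finds some positives at the cost of a few false alarms:
some_positives = dict(tp=3, tn=90, fp=5, fn=2)

acc_trivial = accuracy(**always_negative)
mcc_trivial = mcc(**always_negative)
acc_useful = accuracy(**some_positives)
mcc_useful = mcc(**some_positives)
```

On these made-up counts the always-negative classifier has the higher accuracy but an MCC of zero, which is why MCC is often reported alongside accuracy on imbalanced data.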
|
3 |
Computational Methods for Inferring Transcription Factor Binding Sites
Morozov, Vyacheslav (11 October 2012)
Position weight matrices (PWMs) have become a tool of choice for the identification of transcription factor binding sites in DNA sequences. PWMs are compiled from experimentally verified and aligned binding sequences and are then used to computationally discover novel putative binding sites for a given protein. DNA-binding proteins often show degeneracy in their binding requirements, and the overall binding specificity of many proteins is unknown and remains an active area of research. Although PWMs are more reliable predictors than consensus string matching, they generally result in a high number of false positive hits. A previous study introduced a novel method for PWM training that uses the known motifs to sample additional putative binding sites from a proximal promoter area. The core idea was further developed, implemented and tested in this thesis in a large-scale application. Improved mono- and dinucleotide PWMs were computed for Drosophila melanogaster. The Matthews correlation coefficient was used as an optimization criterion in the PWM refinement algorithm. The new PWMs account for non-uniform background nucleotide distributions on the promoters and consider a larger number of new binding sites during the refinement steps. The optimization included the PWM motif length, the position on the promoter, the threshold value and the binding site location. The predictions obtained with the mono- and dinucleotide PWM versions were compared with those of the initial matrices and of conventional tools. The optimized PWMs predicted new binding sites with better accuracy than conventional PWMs.
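For readers unfamiliar with PWMs, the sketch below shows a generic mononucleotide log-odds PWM built from aligned sites against a non-uniform background, and a scan that reports windows scoring at or above a threshold. It is not the refinement algorithm of the thesis; the function names, the pseudocount, the background frequencies, the toy sites and the threshold are all assumptions chosen for illustration.

```python
import math


def build_pwm(sites, background, pseudocount=1.0):
    """Build a log-odds PWM (one dict per position) from aligned binding sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = {}
        for base in "ACGT":
            count = sum(1 for site in sites if site[pos] == base)
            freq = (count + pseudocount) / (len(sites) + 4 * pseudocount)
            # Log-odds of the motif frequency against the background frequency.
            column[base] = math.log2(freq / background[base])
        pwm.append(column)
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Toy aligned sites and a non-uniform background (illustrative values only).
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
pwm = build_pwm(sites, background)
hits = scan("GGCTATAATGCC", pwm, threshold=5.0)
```

Lowering the threshold admits more windows, including false positives, which is the tradeoff that a criterion such as the Matthews correlation coefficient can be used to balance.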
|
5 |
Computational Methods for Inferring Transcription Factor Binding Sites
Morozov, Vyacheslav (January 2012)
Position weight matrices (PWMs) have become a tool of choice for the identification of transcription factor binding sites in DNA sequences. PWMs are compiled from experimentally verified and aligned binding sequences and are then used to computationally discover novel putative binding sites for a given protein. DNA-binding proteins often show degeneracy in their binding requirements, and the overall binding specificity of many proteins is unknown and remains an active area of research. Although PWMs are more reliable predictors than consensus string matching, they generally result in a high number of false positive hits. A previous study introduced a novel method for PWM training that uses the known motifs to sample additional putative binding sites from a proximal promoter area. The core idea was further developed, implemented and tested in this thesis in a large-scale application. Improved mono- and dinucleotide PWMs were computed for Drosophila melanogaster. The Matthews correlation coefficient was used as an optimization criterion in the PWM refinement algorithm. The new PWMs account for non-uniform background nucleotide distributions on the promoters and consider a larger number of new binding sites during the refinement steps. The optimization included the PWM motif length, the position on the promoter, the threshold value and the binding site location. The predictions obtained with the mono- and dinucleotide PWM versions were compared with those of the initial matrices and of conventional tools. The optimized PWMs predicted new binding sites with better accuracy than conventional PWMs.
|
6 |
Moderní řečové příznaky používané při diagnóze chorob / State of the art speech features used during the Parkinson disease diagnosis
Bílý, Ondřej (January 2011)
This work deals with the diagnosis of Parkinson's disease through analysis of the speech signal. It begins by describing how the speech signal is produced, followed by speech signal analysis, including pre-processing and subsequent feature extraction. It then describes Parkinson's disease and the changes it causes in the speech signal, and presents the speech features used for diagnosing the disease (FCR, VSA, VOT, etc.). Another part of the work deals with feature selection and reduction using learning algorithms (SVM, ANN, k-NN) and with their subsequent evaluation. The last part of the thesis describes a program for computing the features, describes the feature selection, and concludes with an evaluation of all results.
|