Machine learning has increasingly shifted towards data-centric approaches, recognizing that data quality is crucial to the effectiveness of the resulting models. One significant threat to data quality is the presence of outliers. This study therefore investigates the impact of outlier detection algorithms on the performance of machine learning models applied to quantitative structure-activity relationship (QSAR) datasets. Using Isolation Forest (IF), Local Outlier Factor (LOF), and One-Class Support Vector Machine (OCSVM), the aim was to explore and evaluate these methods, identify potential outliers, and assess their influence on model predictions. The study incorporated both synthetic data and real-world datasets, including datasets obtained from a pharmaceutical company and benchmark datasets. The methodology involved preprocessing the data, applying the outlier detection algorithms, and evaluating the models using traditional metrics such as Mean Squared Error (MSE) together with conformal prediction for uncertainty estimation. The results showed no major improvement in model performance from any of the algorithms, and removing too much data led to a decline in performance. While OCSVM performed inconsistently across datasets, LOF showed promise as a method worth further investigation. The study also highlighted several challenges, including high dimensionality, the need for hyperparameter tuning, and the limitations of current outlier detection methods. It underscores the complexity of outlier detection in QSAR data and suggests directions for future research to improve model robustness and accuracy.
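The workflow described in the abstract (detect outliers, remove them, refit, compare MSE, then estimate uncertainty with conformal prediction) can be illustrated with a minimal sketch. This is not the thesis code. It assumes scikit-learn and NumPy, uses synthetic regression data in place of QSAR descriptors, a Ridge regressor as a stand-in model, and illustrative settings (5% contamination, `n_neighbors=20`, `nu=0.05`, 90% target coverage) that are assumptions rather than values taken from the study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic data standing in for molecular descriptors (X) and activities (y).
X, y = make_regression(n_samples=600, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumption: corrupt 5% of the training rows so there is something to detect.
n_bad = int(0.05 * len(X_train))
bad = rng.choice(len(X_train), size=n_bad, replace=False)
X_train[bad] += rng.normal(0.0, 8.0, size=(n_bad, X_train.shape[1]))

# The three detectors named in the abstract; hyperparameters are illustrative.
detectors = {
    "IF": IsolationForest(contamination=0.05, random_state=0),
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "OCSVM": OneClassSVM(nu=0.05, gamma="scale"),
}


def fit_and_score(X_tr, y_tr):
    """Fit the stand-in model and return its MSE on the untouched test set."""
    model = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return mean_squared_error(y_test, model.predict(X_test))


results = {
    "baseline": {"kept": len(X_train), "mse": fit_and_score(X_train, y_train)}
}
for name, detector in detectors.items():
    # fit_predict labels inliers as +1 and outliers as -1.
    keep = detector.fit_predict(X_train) == 1
    results[name] = {
        "kept": int(keep.sum()),
        "mse": fit_and_score(X_train[keep], y_train[keep]),
    }

for name, r in results.items():
    print(f"{name:9s} kept={r['kept']:4d}  test MSE={r['mse']:.1f}")

# Split conformal prediction: calibrate absolute residuals on held-out data.
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X_train, y_train, test_size=0.3, random_state=0
)
model = Ridge(alpha=1.0).fit(X_fit, y_fit)
scores = np.sort(np.abs(y_cal - model.predict(X_cal)))
alpha = 0.1  # target miscoverage, i.e. 90% prediction intervals
k = int(np.ceil((len(scores) + 1) * (1 - alpha)))  # conformal rank
# If k exceeded the calibration size the interval would be unbounded;
# the clamp below only keeps this sketch simple.
q = scores[min(k, len(scores)) - 1]
covered = np.abs(y_test - model.predict(X_test)) <= q
print(f"interval half-width={q:.1f}  empirical coverage={covered.mean():.2f}")
```

Comparing each detector's test MSE against the baseline row mirrors the kind of comparison the abstract reports, and the `kept` column shows how much training data each detector discards.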
Identifier | oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:uu-532770 |
Date | January 2024 |
Creators | Sawas, Hala |
Publisher | Uppsala universitet, Institutionen för farmaceutisk biovetenskap |
Source Sets | DiVA Archive at Uppsala University |
Language | English |
Detected Language | English |
Type | Student thesis, info:eu-repo/semantics/bachelorThesis, text |
Format | application/pdf |
Rights | info:eu-repo/semantics/openAccess |