ADDRESSING DATA IMBALANCE IN BREAST CANCER PREDICTION USING SUPERVISED MACHINE LEARNING

<p>Every 12 minutes, 12 women in the US are diagnosed with breast cancer and 1 dies of it. Globally, a woman dies of breast cancer every 46 seconds, which amounts to more than 1,800 deaths every day. These figures make the prediction of breast cancer very important. To that end, supervised machine learning (ML) methods are used to predict breast cancer likelihood. However, because real-world data are imbalanced, with a very low proportion of positive cases, the prediction accuracy of ML models for positive cancer cases has been limited. This study took two steps to address the issue. First, four supervised ML models, Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), were applied in WEKA, the industry-standard software, to the Breast Cancer Surveillance Consortium (BCSC) dataset to assess the impact of data imbalance on breast cancer prediction. Second, balanced (24,558 cases, 12,279 for each class, positive and negative) and unbalanced (99,000 cases for the negative class) training datasets and a non-overlapping testing dataset (11,000 cases) were manually built from the same dataset, and a decision support system was developed for two ML models, NB and LR, to tackle the class imbalance issue in breast cancer prediction. Overall, the results indicate that MLP performed best on positive breast cancer prediction, with 0.959 sensitivity and 0.907 positive predictive value (PPV), and that all ML models predicted better on the balanced dataset than on the unbalanced one. Furthermore, the proposed method improved the sensitivity of positive cancer case prediction from 0.687 to 0.936 with the NB model and from 0.358 to 0.8306 with the LR model. The improvement demonstrated that the approach provided higher-confidence ML-based predictions and filtered out weaker ones, and that the technique could efficiently address the class imbalance issue in breast cancer likelihood prediction and be used in clinical practice.</p>
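
The balanced-versus-unbalanced comparison and the sensitivity and PPV figures quoted in the abstract can be illustrated with a minimal Python sketch. It assumes random undersampling of the majority class, which is only one possible way to obtain equal class counts; the abstract says the datasets were built manually and does not describe the procedure. The `cancer` label key and both function names are hypothetical, and the thesis itself used WEKA rather than Python.

```python
import random


def undersample_balanced(rows, label_key="cancer", seed=0):
    """Randomly drop majority-class rows so both classes have equal counts.

    Assumption: labels are 1 (positive) and 0 (negative), stored under the
    hypothetical key `label_key`. This is not the thesis's own procedure.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


def sensitivity_and_ppv(y_true, y_pred):
    """Return (sensitivity, PPV) for binary labels where 1 is positive.

    sensitivity = TP / (TP + FN): share of true cancer cases that were found.
    PPV         = TP / (TP + FP): share of positive predictions that were right.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    # Toy data only: 3 positive and 10 negative rows, not BCSC records.
    toy_rows = [{"cancer": 1} for _ in range(3)] + [{"cancer": 0} for _ in range(10)]
    balanced = undersample_balanced(toy_rows)
    print(len(balanced))
    print(sensitivity_and_ppv([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

On heavily imbalanced data a model can reach high overall accuracy while missing most positive cases, which is why the abstract reports sensitivity and PPV for the positive class rather than accuracy alone.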

10.25394/pgs.20393361.v1

Deep learning

Neural networks

machine learning-based

breast cancer prediction models

breast cancer prediction

Supervised Machine Learning

data imbalance problem

Identifier	oai:union.ndltd.org:purdue.edu/oai:figshare.com:article/20393361
Date	28 July 2022
Creators	Shuning Yin (13169550)
Source Sets	Purdue University
Detected Language	English
Type	Text, Thesis
Rights	CC BY 4.0
Relation	https://figshare.com/articles/thesis/ADDRESSING_DATA_IMBALANCE_IN_BREAST_CANCER_PREDICTION_USING_SUPERVISED_MACHINE_LEARNING/20393361
