
ADDRESSING DATA IMBALANCE IN BREAST CANCER PREDICTION USING SUPERVISED MACHINE LEARNING

Shuning Yin (13169550) 28 July 2022
<p>In the US, 12 women are diagnosed with breast cancer every 12 minutes, and 1 dies of it. Globally, a woman loses her life to breast cancer every 46 seconds, which amounts to more than 1,800 deaths every day. These figures make breast cancer prediction very important. To this end, supervised machine learning (ML) methods are used to predict breast cancer likelihood. However, because real-world data are imbalanced, with a very low proportion of positive cases, the prediction accuracy of ML models for positive cancer cases has been limited. This study addressed the issue in two steps. First, four supervised ML models, Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), were applied in WEKA, the industry-standard software, to the Breast Cancer Surveillance Consortium (BCSC) dataset to assess the impact of data imbalance on breast cancer prediction. Second, balanced (24,558 cases, 12,279 in each of the positive and negative classes) and unbalanced (99,000 cases for the negative class) training datasets and a non-overlapping testing dataset (11,000 cases) were manually built from the same dataset, and a decision support system was developed for two ML models, NB and LR, to tackle the class imbalance issue in breast cancer prediction. Overall, the results indicate that MLP performed best on positive breast cancer prediction, with 0.959 sensitivity and 0.907 positive predictive value (PPV), and that all ML models predicted better on the balanced dataset than on the unbalanced one. Furthermore, the proposed method improved the sensitivity of positive cancer case prediction from 0.687 to 0.936 with the NB model and from 0.358 to 0.8306 with the LR model. This improvement demonstrates that the approach yields higher-confidence ML-based predictions and filters out weaker ones, and that the technique could efficiently address the class imbalance issue in breast cancer likelihood prediction and be used in clinical practice.</p>
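The two quantities the abstract reports, sensitivity and PPV, and the idea of a class-balanced training set can be illustrated with a minimal Python sketch. It is not the thesis's WEKA workflow: the abstract says only that the balanced set was "manually built", so the random undersampling here is an assumption, and the names `undersample`, `sensitivity_and_ppv`, and the `label` key are hypothetical.

```python
import random


def undersample(rows, label_key="label", seed=0):
    """Return a class-balanced copy of rows (assumed technique: random undersampling).

    Each row is a dict whose label_key entry is 1 (positive) or 0 (negative).
    The larger class is randomly reduced to the size of the smaller one.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


def sensitivity_and_ppv(y_true, y_pred):
    """Compute sensitivity TP / (TP + FN) and PPV TP / (TP + FP) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    # Toy data only: 5 positive and 95 negative rows, not BCSC records.
    rows = [{"label": 1}] * 5 + [{"label": 0}] * 95
    balanced = undersample(rows)
    print(len(balanced))  # equal numbers of positive and negative rows

    # Toy labels and predictions, purely illustrative.
    print(sensitivity_and_ppv([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

Sensitivity matters most in this setting because a classifier trained on heavily imbalanced data can reach high overall accuracy while missing most positive cases, which is the failure the abstract's 0.358 baseline for LR reflects.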
