It has been a while that computers have replaced our record books, and medical records are no exception. Electronic Health Records (EHR) are digital version of a patient's medical records. EHRs are available to authorized users, and they contain the medical records of the patient, which should help doctors understand a patient's condition quickly. In recent years, Deep Learning models have proved their value and have become state-of-the-art in computer vision, natural language processing, speech and other areas. The private nature of EHR data has prevented public access to EHR datasets. There are many obstacles to create a deep learning model with EHR data. Because EHR data are primarily consisting of huge sparse matrices, these challenges are mostly unique to this field. Due to this, research in this area is limited, and we can improve existing research substantially. In this study, we focus on high-performance synthetic data generation in EHR datasets. Artificial data generation can help reduce privacy leakage for dataset owners as it is proven that de-identification methods are prone to re-identification attacks. We propose a novel approach we call Improved Correlation Capturing Wasserstein Generative Adversarial Network (SCorGAN) to create EHR data. This work, leverages Deep Convolutional Neural Networks to extract and understand spatial dependencies in EHR data. To improve our model's performance, we focus on our Deep Convolutional AutoEncoder to better map our real EHR data to our latent space where we train the Generator. To assess our model's performance, we demonstrate that our generative model can create excellent data that are statistically close to the input dataset. Additionally, we evaluate our synthetic dataset against the original data using our previous work that focused on GAN Performance Evaluation. This work is publicly available at https://github.com/mohibeyki/SCorGAN / Master of Science / Artificial Intelligence (AI) systems have improved greatly in recent years. They are being used to understand all kinds of data. A practical use case for AI systems is to leverage their power to identify illnesses and find correlations between different conditions. To train AI and Machine Learning systems, we need to feed them huge datasets, and in the training process, we need to guide them so that they learn different features in our data. The more data an intelligent system has seen, the better it performs. However, health records are private, and we cannot share real people's health records with the public, whether they are a researcher or not. This study provides a novel approach to synthetic data generation that others can use with intelligent systems. Then these systems can work with actual health records can give us accurate feedback on people's health conditions. We then show that our synthetic dataset is a good substitute for real datasets to train intelligent systems. Lastly, we present an intelligent system that we have trained using synthetic datasets to identify illnesses in a real dataset with high accuracy and precision.
Identifer | oai:union.ndltd.org:VTETD/oai:vtechworks.lib.vt.edu:10919/104642 |
Date | 13 August 2021 |
Creators | Beyki, Mohammad Reza |
Contributors | Computer Science, Fox, Edward A., Huang, Jia-Bin, Eldardiry, Hoda |
Publisher | Virginia Tech |
Source Sets | Virginia Tech Theses and Dissertation |
Detected Language | English |
Type | Thesis |
Format | ETD, application/pdf |
Rights | In Copyright, http://rightsstatements.org/vocab/InC/1.0/ |
Page generated in 0.0017 seconds