1 |
Synthetic Data Generation for the Financial Industry Using Generative Adversarial Networks / Generering av Syntetisk Data för Finansbranchen med Generativa Motstridande Nätverk
Ljung, Mikael, January 2021
Following the introduction of data protection laws and regulations such as GDPR and PIPEDA, interest in technologies that protect data privacy has increased. A promising research trajectory in this area is found in Generative Adversarial Networks (GANs), an architecture trained to produce data that reflects the statistical properties of its underlying dataset without compromising the integrity of the data subjects. Despite the technology's young age, prior research has made significant progress in generating so-called synthetic data, and current models can generate high-quality images. Due to the architecture's success with images, it has been adapted to new domains, and this study examines its potential to synthesize financial tabular data. The study investigates a state-of-the-art tabular GAN, called CTGAN, together with two proposed ideas to enhance its generative ability. The results indicate that a modified training dynamic and a novel early stopping strategy improve the architecture's capacity to synthesize data. The generated data presents realistic features with clear influences from its underlying dataset, and the conclusions drawn from subsequent analyses are similar to those based on the original data. Thus, the conclusion is that GANs have great potential to generate tabular data that can be considered a substitute for sensitive data, which could enable organizations to adopt more generous data sharing policies. /
With stricter rules on how data must be handled under GDPR and PIPEDA, interest in anonymization methods for censoring sensitive data has been renewed. A promising technique in this area is Generative Adversarial Networks, an architecture that aims to generate data reflecting the statistical properties of its underlying dataset without compromising the privacy of the data subjects. Despite the young age of the research field, great progress has been made in the generation of so-called synthetic data, and there are now models that can generate highly realistic images. As a step forward in the research, the architecture has been adapted to new domains, and this study aims to examine its ability to synthesize financial tabular data. The study examines a prominent model in the field, CTGAN, together with two proposed ideas intended to improve its generative ability. The results indicate that a modified training dynamic and a new optimization strategy improve the architecture's ability to generate synthetic data. The generated data is in turn of high quality, with clear influences from its underlying dataset, and the results of subsequent analyses across the two data sources are comparable. The conclusion is thus that GANs have great potential to generate tabular data that can be regarded as a substitute for sensitive data, which enables more generous data sharing policies within organizations.
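The abstract credits part of the improvement to an early stopping strategy but does not describe it. As a point of reference only, the following is a minimal sketch of generic patience-based early stopping, not the thesis's novel strategy. The class name `EarlyStopper`, the helper `run_with_early_stopping`, and the idea of scoring each epoch with some held-out fidelity metric (lower is better) are all assumptions made for illustration.

```python
from typing import Iterable, Tuple


class EarlyStopper:
    """Generic patience-based early stopping (hypothetical helper).

    Assumes a per-epoch score where lower is better, for example a
    distance between real and synthetic column statistics.
    """

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch: int, score: float) -> bool:
        """Record a score and return True when training should stop."""
        if score < self.best_score - self.min_delta:
            # Improvement: remember this epoch and reset the counter.
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_with_early_stopping(scores: Iterable[float], patience: int = 3) -> Tuple[int, int]:
    """Replay a sequence of epoch scores; return (best_epoch, last_epoch_run)."""
    stopper = EarlyStopper(patience=patience)
    last_epoch = -1
    for epoch, score in enumerate(scores):
        last_epoch = epoch
        if stopper.update(epoch, score):
            break
    return stopper.best_epoch, last_epoch
```

In a GAN setting the generator and discriminator losses are often poor indicators of sample quality, which is why this sketch assumes an external score rather than the training loss. The generator checkpoint saved at `best_epoch` would be the one kept.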
|
2 |
Generation of Synthetic Clinical Trial Subject Data Using Generative Adversarial Networks
Lindell, Linus, January 2024
The development of new solutions incorporating artificial intelligence (AI) within the medical field is an area of great interest. However, access to comprehensive and diverse datasets is restricted due to the sensitive nature of the data. A potential solution is to generate synthetic datasets based on real medical data. Synthetic data could protect the integrity of the subjects while preserving the inherent information necessary for training AI models, and it could be generated in greater quantity than is otherwise available. This thesis project aims to generate reliable clinical trial subject data using a generative adversarial network (GAN). The main dataset is a mock clinical trial dataset consisting of multiple subject visits; an additional dataset containing authentic medical data is also used to give better insight into the model's ability to learn underlying relationships. The thesis also investigates training strategies for simulating the temporal dimension and the missing values in the data. The GAN model used is an altered version of the Conditional Tabular GAN (CTGAN), made to be compatible with the preprocessed clinical trial mock data, and multiple model architectures and numbers of training epochs are examined. The results show great potential for GAN models on clinical trial datasets, especially for real-life data. One model, trained on the authentic dataset, generates near-perfect synthetic data with respect to column distributions and correlation between columns. The results also show that classification models trained on synthetic data and tested on real data have the potential to match the performance of classification models trained on real data. While the synthetic data replicates the missing values, no definitive conclusion can be drawn regarding the temporal characteristics due to the sparsity of the mock dataset and the lack of real correlations in it.
Although the results are promising, further experiments on authentic datasets with less sparsity are required.
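The "train on synthetic, test on real" comparison mentioned in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's evaluation code: the nearest-centroid classifier, the function names (`fit_centroids`, `predict`, `accuracy`, `tstr_vs_trtr`), and the row layout are all hypothetical and chosen only to keep the example dependency-free.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (numeric features, class label)


def fit_centroids(rows: List[Row]) -> Dict[int, List[float]]:
    """Compute one mean feature vector per class label."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids: Dict[int, List[float]], features: Sequence[float]) -> int:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def squared_distance(label: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))

    return min(centroids, key=squared_distance)


def accuracy(centroids: Dict[int, List[float]], test_rows: List[Row]) -> float:
    """Fraction of test rows classified correctly."""
    hits = sum(predict(centroids, features) == label for features, label in test_rows)
    return hits / len(test_rows)


def tstr_vs_trtr(
    real_train: List[Row], synthetic_train: List[Row], real_test: List[Row]
) -> Tuple[float, float]:
    """Return (train-real/test-real accuracy, train-synthetic/test-real accuracy)."""
    trtr = accuracy(fit_centroids(real_train), real_test)
    tstr = accuracy(fit_centroids(synthetic_train), real_test)
    return trtr, tstr
```

Both models are scored on the same held-out real rows, so a small gap between the two accuracies suggests the synthetic data preserves the relationships the classifier relies on. Any classifier could stand in for the nearest-centroid model used here.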
|