1 |
Privacy-Preserving Synthetic Medical Data Generation with Deep Learning
Torfi, Amirsina 26 August 2020
Deep learning models have demonstrated good performance in various domains such as Computer Vision and Natural Language Processing. However, the use of data-driven methods in healthcare raises privacy concerns, which limits collaborative research. A remedy to this problem is to generate and employ synthetic data. Existing methods for artificial data generation suffer from different limitations, such as being bound to particular use cases. Furthermore, their generalizability to real-world problems is disputed, given the uncertainties in defining and measuring key realistic characteristics. Hence, there is a need to establish insightful metrics (and to measure the validity of synthetic data), as well as quantitative criteria regarding privacy restrictions. We propose the use of Generative Adversarial Networks to help satisfy requirements for realistic characteristics and acceptable values of privacy metrics simultaneously. The present study makes several unique contributions to synthetic data generation in the healthcare domain. First, we propose a novel domain-agnostic metric to evaluate the quality of synthetic data. Second, by utilizing 1-D Convolutional Neural Networks, we devise a new approach to capturing the correlation between adjacent diagnosis records. Third, we employ Convolutional Autoencoders for creating a robust and compact feature space to handle the mixture of discrete and continuous data. Finally, we devise a privacy-preserving framework that enforces Rényi differential privacy as a new notion of differential privacy. / Doctor of Philosophy / Computer programs have been widely used for clinical diagnosis but are often designed with assumptions limiting their scalability and interoperability.
The recent proliferation of abundant health data, significant increases in computer processing power, and the superior performance of data-driven methods are enabling a paradigm shift in healthcare technology. This involves the adoption of artificial intelligence methods, such as deep learning, to improve healthcare knowledge and practice. Despite the success of deep learning in many different domains, privacy challenges in the healthcare field make collaborative research difficult, as working with data-driven methods may jeopardize patients' privacy. To overcome these challenges, researchers propose to generate and utilize realistic synthetic data that can be used instead of real private data. Existing methods for artificial data generation are limited by being bound to special use cases. Furthermore, their generalizability to real-world problems is questionable. There is a need to establish valid synthetic data that overcomes privacy restrictions and functions as a real-world analog for training deep learning models in healthcare. We propose the use of Generative Adversarial Networks to simultaneously overcome the realism and privacy challenges associated with healthcare data.
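As a rough illustration of why a 1-D convolution can capture correlation between adjacent records, the sketch below slides a small kernel over a numeric sequence. The record encoding, the kernel weights, and the function name `conv1d_valid` are hypothetical and are not taken from the thesis; a real model would learn the weights.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` over `sequence` (no padding) and return the dot products.

    Each output value depends only on len(kernel) neighbouring inputs, which is
    how a 1-D convolution mixes information from adjacent records.
    """
    k = len(kernel)
    if k == 0 or k > len(sequence):
        raise ValueError("kernel must be non-empty and no longer than the sequence")
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]


# Hypothetical encoded diagnosis sequence: one numeric value per adjacent slot.
record = [1.0, 2.0, 3.0, 4.0]
# Hypothetical fixed kernel; in a trained network these weights are learned.
kernel = [1.0, 0.0, -1.0]
print(conv1d_valid(record, kernel))  # [-2.0, -2.0]
```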
|
2 |
Semantic Segmentation with Carla Simulator
Malec, Stanislaw January 2021
Autonomous vehicles perform semantic segmentation to orient themselves, but training neural networks for semantic segmentation requires large amounts of labeled data. A hand-labeled real-life dataset requires considerable effort to create, so we instead turn to virtual simulators, where the segmentation labels are known, to generate large datasets virtually for free. This work investigates how effective synthetic datasets are in driving scenarios by collecting a dataset from a simulator and testing it against a real-life hand-labeled dataset. We show that mixing synthetic and real-life data gets a model up and running faster than traditional dataset collection methods while achieving close to baseline performance.
|
3 |
GAN-Based Approaches for Generating Structured Data in the Medical Domain
Abedi, Masoud, Hempel, Lars, Sadeghi, Sina, Kirsten, Toralf 03 November 2023
Modern machine and deep learning methods require large datasets to achieve reliable and robust results. This requirement is often difficult to meet in the medical field, due to data sharing limitations imposed by privacy regulations or the presence of a small number of patients (e.g., rare diseases). To address this data scarcity and to improve the situation, novel generative models such as Generative Adversarial Networks (GANs) have been widely used to generate synthetic data that mimic real data by representing features that reflect health-related information without reference to real patients. In this paper, we consider several GAN models to generate synthetic data used for training binary (malignant/benign) classifiers, and compare their performances in terms of classification accuracy with cases where only real data are considered. We aim to investigate how synthetic data can improve classification accuracy, especially when a small amount of data is available. To this end, we have developed and implemented an evaluation framework where binary classifiers are trained on extended datasets containing both real and synthetic data. The results show improved accuracy for classifiers trained with generated data from more advanced GAN models, even when limited amounts of original data are available.
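The evaluation idea described in this abstract (train a binary classifier on real data only, then on real plus synthetic data, and compare accuracy on held-out real data) can be sketched as follows. This is a minimal sketch under loud assumptions: the toy two-feature data, the nearest-centroid classifier, and the per-class Gaussian sampler that stands in for a trained GAN are all hypothetical and are not the paper's models or datasets.

```python
import random
import statistics


def make_real_data(rng, n_per_class):
    """Toy 'real' data: class 0 centred at (0, 0), class 1 centred at (2, 2)."""
    return [
        ([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label)
        for label, center in ((0, 0.0), (1, 2.0))
        for _ in range(n_per_class)
    ]


def fit_gaussian_sampler(data):
    """Stand-in for a trained GAN (assumption): per-class feature mean and std."""
    params = {}
    for label in sorted({y for _, y in data}):
        columns = list(zip(*[x for x, y in data if y == label]))
        params[label] = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return params


def sample_synthetic(params, rng, n_per_class):
    """Draw synthetic rows from the fitted per-class Gaussians."""
    return [
        ([rng.gauss(mean, std) for mean, std in feats], label)
        for label, feats in params.items()
        for _ in range(n_per_class)
    ]


def fit_centroids(data):
    """Nearest-centroid classifier: one mean vector per class."""
    return {
        label: [statistics.mean(c) for c in zip(*[x for x, y in data if y == label])]
        for label in sorted({y for _, y in data})
    }


def accuracy(centroids, data):
    """Fraction of rows whose nearest centroid matches the true label."""
    def predict(x):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )
    return sum(predict(x) == y for x, y in data) / len(data)


def compare(seed=0, n_train=5, n_synth=50, n_test=200):
    """Accuracy on held-out real data: real-only vs. real + synthetic training."""
    rng = random.Random(seed)
    train = make_real_data(rng, n_train)
    test = make_real_data(rng, n_test)
    synthetic = sample_synthetic(fit_gaussian_sampler(train), rng, n_synth)
    return {
        "real_only": accuracy(fit_centroids(train), test),
        "real_plus_synthetic": accuracy(fit_centroids(train + synthetic), test),
    }


print(compare())
```

Whether the extended training set helps depends on the generator and on the amount of real data; this toy setup makes no claim about the direction of the effect.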
|
4 |
Bayesian Variable Selection with Shrinkage Priors and Generative Adversarial Networks for Fraud Detection
Issoufou Anaroua, Amina 01 January 2024
This research paper focuses on fraud detection in the financial industry using Generative Adversarial Networks (GANs) in conjunction with univariate and multivariate Bayesian Models with Shrinkage Priors (BMSP). The problem addressed is the need for accurate and advanced fraud detection techniques, given the increasing sophistication of fraudulent activities. The methodology applies BMSP for variable selection and uses GANs to generate synthetic fraud samples; fraud detection is then performed on the augmented dataset. Experimental results demonstrate the effectiveness of the BMSP GAN approach in detecting fraud, with improved performance compared to other methods. The conclusions highlight the potential of GANs and BMSP for enhancing fraud detection capabilities and suggest future research directions for further improvements in the field.
|
5 |
IMPROVING THE UTILITY OF DIFFERENTIALLY PRIVATE ALGORITHMS USING DATA CHARACTERISTICS
Farzad Zafarani (11837222) 10 January 2025
As data continues to grow rapidly in volume and complexity, there is an increasing need to extract meaningful insights from it. These datasets often contain sensitive individual information, making privacy protection crucial. Differential privacy has become the de facto standard for protecting individuals' privacy. Many datasets also have known constraints and structures. Can these known constraints or structures be leveraged to design mechanisms with better utility?
The focus of this thesis is to demonstrate that by leveraging the inherent structures and constraints within datasets, it may be possible to design differential privacy mechanisms that offer better utility (i.e., more accurate results) while maintaining the required level of privacy. This involves exploring advanced techniques and modifications to the basic mechanisms that take advantage of dataset-specific properties, such as sparsity, distributional assumptions, or other contextual information. This approach aims to minimize the noise added, thereby improving the utility of differentially private outputs.
In many scenarios, datasets contain constraints. In this thesis, we show that generating differentially private synthetic data while preserving constraints increases utility across several metrics, including marginal queries, classification task accuracy, and clustering. Smooth sensitivity is a data-dependent sensitivity metric that allows for more precise noise addition based on the actual data distribution, rather than worst-case scenarios. It addresses the limitations of local sensitivity by ensuring robust privacy guarantees, even in the presence of outliers or small changes in the data.
We have developed a differentially private Naive Bayes model using smooth sensitivity. By using data-dependent sensitivity measures like smooth sensitivity and incorporating known data constraints, we can reduce the amount of noise added, resulting in a more accurate model.
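To make the link between sensitivity and added noise concrete, here is a minimal sketch of the standard Laplace mechanism for a counting query, where the noise scale is sensitivity divided by ε. It uses the worst-case (global) sensitivity of 1 for a count; it does not implement smooth sensitivity or the thesis's mechanisms, and the helper names are hypothetical. The point is only that any valid tighter sensitivity bound reduces the noise scale proportionally.

```python
import random


def noise_scale(sensitivity, epsilon):
    """Laplace scale b = sensitivity / epsilon (smaller sensitivity, less noise)."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Counting query with Laplace noise; a count changes by at most 1 per person."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(noise_scale(1.0, epsilon), rng)


rng = random.Random(0)
ages = [23, 35, 41, 52, 67, 29, 48]  # hypothetical data
print(private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng))
print(noise_scale(1.0, 0.1), noise_scale(0.5, 0.1))  # 10.0 5.0
```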
|
6 |
Generation of Synthetic Data with Generative Adversarial Networks
Garcia Torres, Douglas January 2018
The aim of synthetic data generation is to provide data that is not real for cases where the use of real data is somehow limited: for example, when there is a need for larger volumes of data, when the data is sensitive to use, or simply when it is hard to get access to the real data. Traditional methods of synthetic data generation use techniques that do not intend to replicate important statistical properties of the original data. Properties such as the distribution, the patterns, or the correlation between variables are often omitted. Moreover, most of the existing tools and approaches require a great deal of user-defined rules and do not make use of advanced techniques like Machine Learning or Deep Learning. While Machine Learning is an innovative area of Artificial Intelligence and Computer Science that uses statistical techniques to give computers the ability to learn from data, Deep Learning is a closely related field based on learning data representations, which may prove useful for the task of synthetic data generation. This thesis focuses on one of the most interesting and promising innovations of recent years in the Machine Learning community: Generative Adversarial Networks. An approach for generating discrete, continuous, or text synthetic data with Generative Adversarial Networks is proposed, tested, evaluated, and compared with a baseline approach. The results prove the feasibility and show the advantages and disadvantages of using this framework. Despite its high demand for computational resources, a Generative Adversarial Networks framework is capable of generating quality synthetic data that preserves the statistical properties of a given dataset.
|
7 |
Energy-Efficient Private Forecasting on Health Data using SNNs / Energieffektiv privat prognos om hälsodata med hjälp av SNNs
Di Matteo, Davide January 2022
Health monitoring devices, such as Fitbit, are gaining popularity both as wellness tools and as a source of information for healthcare decisions. Predicting such wellness goals accurately is critical for users to make informed lifestyle choices. The core objective of this thesis is to design and implement a forecasting system that takes energy consumption and privacy into account. The research is modelled as a time-series forecasting problem that makes use of Spiking Neural Networks (SNNs) due to their proven energy-saving capabilities. Thanks to a design that closely mimics natural neural networks (such as the brain), SNNs have the potential to significantly outperform classic Artificial Neural Networks in terms of energy consumption and robustness. To test our hypotheses, previous research by Sonia et al. [1] in the same domain and with the same dataset is used as our starting point, where a private forecasting system using Long Short-Term Memory (LSTM) is designed and implemented. Their study also implements and evaluates a clustering federated learning approach, which fits the highly distributed data well. The results obtained in their research act as a baseline against which we compare our results in terms of accuracy, training time, model size, and estimated energy consumption. Our experiments show that Spiking Neural Networks trade off accuracy (2.19x, 1.19x, 4.13x, and 1.16x greater Root Mean Square Error (RMSE) for macronutrients, calories burned, resting heart rate, and active minutes, respectively) for a smaller model (19% fewer parameters and 77% lighter in memory) and 43% faster training. Our model is estimated to consume 3.36 μJ per inference, which is much lighter than traditional Artificial Neural Networks (ANNs) [2]. The data recorded by health monitoring devices is highly distributed in the real world. Moreover, with such sensitive recorded information, there are many possible implications to consider.
For these reasons, we apply the clustering federated learning implementation [1] to our use case. However, adopting such techniques can be challenging, since it is difficult to learn from irregular data sequences. We use a two-step streaming clustering approach to classify customers based on their eating and exercise habits. Training different models for each group of users has been shown to be useful, particularly in terms of training time; however, this is strongly dependent on the cluster size. Our experiments conclude that error and training time decrease if the clusters contain enough data to train the models. Finally, this study addresses the issue of data privacy by using state-of-the-art differential privacy. We apply ε-differential privacy to both our baseline model (trained on the whole dataset) and our federated learning based approach. With a differential privacy of ε = 0.1, our experiments report an increase in the measured average error (RMSE) of only 25%. Specifically, +23.13%, +25.71%, +29.87%, and +21.57% for macronutrients (grams), calories burned (kCal), resting heart rate (beats per minute, bpm), and active minutes (minutes), respectively.
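The error comparisons quoted in this abstract are RMSE ratios (for example "2.19x greater") and relative increases in percent. The sketch below shows how such figures are conventionally computed; the function names and sample numbers are hypothetical and do not reproduce the thesis's data.

```python
import math


def rmse(predictions, targets):
    """Root Mean Square Error between two equally long sequences."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_ratio(baseline_rmse, new_rmse):
    """How many times larger the new error is than the baseline error."""
    return new_rmse / baseline_rmse


def relative_increase_percent(baseline_rmse, new_rmse):
    """Relative change of the error in percent (positive means worse)."""
    return 100.0 * (new_rmse - baseline_rmse) / baseline_rmse


# Hypothetical errors of a baseline model and of a model with added privacy noise.
baseline = rmse([0.0, 0.0], [3.0, 4.0])
private = 1.25 * baseline
print(rmse_ratio(baseline, private))
print(relative_increase_percent(2.0, 2.5))  # 25.0
```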
|
8 |
Material Artefact Generation
Rončka, Martin January 2019
It is not always easy to obtain a sufficiently large, high-quality dataset of images with distinct artefacts, whether because the data source cannot supply enough of them or because annotations are difficult to create. This holds, for example, in radiology and in mechanical engineering. To use modern, established machine learning methods for classification, segmentation, and defect detection, the dataset must be sufficiently large and balanced. With small datasets we face problems such as overfitting and weak data, which cause misclassification at the expense of under-represented classes. This thesis explores the use of generative networks to extend and balance a dataset with newly generated images. Using Conditional Generative Adversarial Networks (CGAN) and a heuristic annotation generator, we are able to generate a large number of new images of parts with defects. A dataset of screw threads was used for the generation experiments. Two further datasets were also used: ceramics and MRI scans (BraTS). On these two datasets, we evaluate the influence of the generated data on training and its benefit for improving classification and segmentation.
|
9 |
Porovnání přístupů ke generování umělých dat / Comparison of Approaches to Synthetic Data Generation
Šejvlová, Ludmila January 2017
The diploma thesis deals with synthetic data and selected approaches to their generation, together with a practical data generation task. The goal of the thesis is to describe the selected approaches to data generation, capture their key advantages and disadvantages, and compare the individual approaches to each other. The practical part of the thesis describes the generation of synthetic data for teaching knowledge discovery in databases. The thesis includes a basic description of synthetic data and thoroughly explains the process of their generation. The approaches selected for further examination are random data generation, the statistical approach, data generation languages, and the ReverseMiner tool. The thesis also describes the practical usage of synthetic data and the suitability of each approach for certain purposes. Within this thesis, the educational dataset Hotel SD was created using the ReverseMiner tool. The data contain relations discoverable with SD (set-difference) GUHA-procedures.
|
10 |
Complex Vehicle Modeling: A Data Driven Approach
Schoen, Alexander C. 12 1900
Indiana University-Purdue University Indianapolis (IUPUI) / This thesis proposes an artificial neural network (NN) model to predict fuel consumption in heavy vehicles. The model uses predictors derived from vehicle speed, mass, and road grade. These variables are readily available from telematics devices that are becoming an integral part of connected vehicles. The model predictors are aggregated over a fixed distance traveled (i.e., a window) instead of a fixed time interval. It was found that 1 km windows are most appropriate for the vocations studied in this thesis. Two vocations were studied: refuse and delivery trucks.
The proposed NN model was compared to two traditional models. The first is a parametric model similar to one found in the literature. The second is a linear regression model that uses the same features developed for the NN model.
The confidence levels of the models built with these three methods were calculated in order to evaluate the models' variances. It was found that the NN models produce lower point-wise error. However, the stability of the NN models is not as high as that of the regression models. In order to reduce the variance of the NN models, an ensemble based on the average of 5-fold models was created.
Finally, the confidence level of each model is analyzed in order to understand how much error is expected from each model. The mean training error was used to correct the ensemble predictions for five K-fold models. The ensemble K-fold model predictions are more reliable than those of the single NN and have a lower confidence interval than both the parametric and regression models.
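Two ideas from this abstract lend themselves to a short sketch: aggregating telematics samples over fixed distance windows rather than fixed time intervals, and averaging the predictions of several fold models with a mean-training-error correction. The sample layout `(distance_step_m, speed_mps, fuel_l)`, the function names, and the sign convention for the correction (error = prediction minus actual) are assumptions made for illustration, not details from the thesis.

```python
def aggregate_by_distance(samples, window_m=1000.0):
    """Group samples into windows of fixed distance travelled.

    `samples` is an iterable of (distance_step_m, speed_mps, fuel_l) tuples
    (hypothetical layout). A sample belongs to the window in which it starts.
    The last window may be shorter than `window_m`.
    """
    windows = {}
    travelled_m = 0.0
    for step_m, speed_mps, fuel_l in samples:
        index = int(travelled_m // window_m)
        w = windows.setdefault(index, {"fuel_l": 0.0, "speed_sum": 0.0, "n": 0})
        w["fuel_l"] += fuel_l
        w["speed_sum"] += speed_mps
        w["n"] += 1
        travelled_m += step_m
    return [
        {"window": i, "fuel_l": w["fuel_l"], "mean_speed_mps": w["speed_sum"] / w["n"]}
        for i, w in sorted(windows.items())
    ]


def ensemble_predict(models, features, mean_training_error=0.0):
    """Average the fold models' predictions, then subtract the mean training error.

    Assumes error is defined as prediction minus actual (an assumption here).
    """
    if not models:
        raise ValueError("at least one model is required")
    mean_prediction = sum(m(features) for m in models) / len(models)
    return mean_prediction - mean_training_error


# Four hypothetical samples of 500 m each give two 1 km windows.
samples = [(500.0, 10.0, 0.1)] * 4
print(aggregate_by_distance(samples))

# Two hypothetical "fold models" standing in for trained networks.
fold_models = [lambda x: x + 1.0, lambda x: x + 3.0]
print(ensemble_predict(fold_models, 2.0, mean_training_error=0.5))  # 3.5
```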
|