1 |
Privacy-Preserving Synthetic Medical Data Generation with Deep Learning
Torfi, Amirsina (26 August 2020)
Deep learning models have demonstrated good performance in various domains such as Computer Vision and Natural Language Processing. However, the use of data-driven methods in healthcare raises privacy concerns, which limits collaborative research. A remedy to this problem is to generate and employ synthetic data. Existing methods for artificial data generation suffer from different limitations, such as being bound to particular use cases. Furthermore, their generalizability to real-world problems is questionable given the uncertainties in defining and measuring key realistic characteristics. Hence, there is a need to establish insightful metrics (and to measure the validity of synthetic data), as well as quantitative criteria regarding privacy restrictions. We propose the use of Generative Adversarial Networks to help satisfy requirements for realistic characteristics and acceptable values of privacy metrics simultaneously. The present study makes several unique contributions to synthetic data generation in the healthcare domain. First, we propose a novel domain-agnostic metric to evaluate the quality of synthetic data. Second, by utilizing 1-D Convolutional Neural Networks, we devise a new approach to capturing the correlation between adjacent diagnosis records. Third, we employ Convolutional Autoencoders to create a robust and compact feature space that handles the mixture of discrete and continuous data. Finally, we devise a privacy-preserving framework that enforces Rényi differential privacy as a new notion of differential privacy. / Doctor of Philosophy / Computer programs have been widely used for clinical diagnosis but are often designed with assumptions that limit their scalability and interoperability.
The recent proliferation of abundant health data, significant increases in computer processing power, and the superior performance of data-driven methods are enabling a paradigm shift in healthcare technology. This involves the adoption of artificial intelligence methods, such as deep learning, to improve healthcare knowledge and practice. Despite the success of deep learning in many different domains, privacy challenges make collaborative research in healthcare difficult, as working with data-driven methods may jeopardize patients' privacy. To overcome these challenges, researchers propose to generate and utilize realistic synthetic data that can be used instead of real private data. Existing methods for artificial data generation are limited by being bound to special use cases. Furthermore, their generalizability to real-world problems is questionable. There is a need to establish valid synthetic data that overcomes privacy restrictions and functions as a real-world analog for training deep learning models on healthcare data. We propose the use of Generative Adversarial Networks to simultaneously overcome the realism and privacy challenges associated with healthcare data.
|
2 |
Semantic Segmentation with Carla Simulator
Malec, Stanislaw (January 2021)
Autonomous vehicles perform semantic segmentation to orient themselves, but training neural networks for semantic segmentation requires large amounts of labeled data. A hand-labeled real-life dataset requires considerable effort to create, so we instead turn to virtual simulators, where the segmentation labels are known, to generate large datasets virtually for free. This work investigates how effective synthetic datasets are in driving scenarios by collecting a dataset from a simulator and testing it against a real-life hand-labeled dataset. We show that mixing synthetic and real-life data gets a model up and running faster than traditional dataset collection methods while achieving close to baseline performance.
|
3 |
GAN-Based Approaches for Generating Structured Data in the Medical Domain
Abedi, Masoud; Hempel, Lars; Sadeghi, Sina; Kirsten, Toralf (03 November 2023)
Modern machine and deep learning methods require large datasets to achieve reliable
and robust results. This requirement is often difficult to meet in the medical field, due to data
sharing limitations imposed by privacy regulations or the presence of a small number of patients (e.g.,
rare diseases). To address this data scarcity and to improve the situation, novel generative models
such as Generative Adversarial Networks (GANs) have been widely used to generate synthetic
data that mimic real data by representing features that reflect health-related information without
reference to real patients. In this paper, we consider several GAN models to generate synthetic data
used for training binary (malignant/benign) classifiers, and compare their performances in terms
of classification accuracy with cases where only real data are considered. We aim to investigate
how synthetic data can improve classification accuracy, especially when a small amount of data is
available. To this end, we have developed and implemented an evaluation framework where binary
classifiers are trained on extended datasets containing both real and synthetic data. The results show
improved accuracy for classifiers trained with generated data from more advanced GAN models,
even when limited amounts of original data are available.
|
4 |
Bayesian Variable Selection with Shrinkage Priors and Generative Adversarial Networks for Fraud Detection
Issoufou Anaroua, Amina (01 January 2024)
This research paper focuses on fraud detection in the financial industry using Generative Adversarial Networks (GANs) in conjunction with Uni and Multi Variate Bayesian Model with Shrinkage Priors (BMSP). The problem addressed is the need for accurate and advanced fraud detection techniques due to the increasing sophistication of fraudulent activities. The methodology involves the implementation of GANs and the application of BMSP for variable selection to generate synthetic fraud samples for fraud detection using the augmented dataset. Experimental results demonstrate the effectiveness of the BMSP GAN approach in detecting fraud with improved performance compared to other methods. The conclusions drawn highlight the potential of GANs and BMSP for enhancing fraud detection capabilities and suggest future research directions for further improvements in the field.
|
5 |
Generation of Synthetic Data with Generative Adversarial Networks
Garcia Torres, Douglas (January 2018)
The aim of synthetic data generation is to provide data that is not real for cases where the use of real data is somehow limited: for example, when there is a need for larger volumes of data, when the data is sensitive to use, or simply when it is hard to get access to the real data. Traditional methods of synthetic data generation use techniques that do not intend to replicate important statistical properties of the original data. Properties such as the distribution, the patterns, or the correlation between variables are often omitted. Moreover, most of the existing tools and approaches require a great deal of user-defined rules and do not make use of advanced techniques like Machine Learning or Deep Learning. While Machine Learning is an innovative area of Artificial Intelligence and Computer Science that uses statistical techniques to give computers the ability to learn from data, Deep Learning is a closely related field based on learning data representations, which may prove useful for the task of synthetic data generation. This thesis focuses on one of the most interesting and promising innovations of recent years in the Machine Learning community: Generative Adversarial Networks. An approach for generating discrete, continuous, or text synthetic data with Generative Adversarial Networks is proposed, tested, evaluated, and compared with a baseline approach. The results prove the feasibility of this framework and show its advantages and disadvantages. Despite its high demand for computational resources, a Generative Adversarial Networks framework is capable of generating quality synthetic data that preserves the statistical properties of a given dataset.
|
6 |
Energy-Efficient Private Forecasting on Health Data using SNNs / Energieffektiv privat prognos om hälsodata med hjälp av SNNs
Di Matteo, Davide (January 2022)
Health monitoring devices, such as Fitbit, are gaining popularity both as wellness tools and as a source of information for healthcare decisions. Predicting such wellness goals accurately is critical for users to make informed lifestyle choices. The core objective of this thesis is to design and implement such a system while taking energy consumption and privacy into account. This research is modelled as a time-series forecasting problem that makes use of Spiking Neural Networks (SNNs) due to their proven energy-saving capabilities. Thanks to a design that closely mimics natural neural networks (such as the brain), SNNs have the potential to significantly outperform classic Artificial Neural Networks in terms of energy consumption and robustness. To test our hypotheses, a previous study by Sonia et al. [1] in the same domain and with the same dataset is used as our starting point, in which a private forecasting system using Long Short-Term Memory (LSTM) is designed and implemented. Their study also implements and evaluates a clustering federated learning approach, which fits the highly distributed data well. The results obtained in their research act as a baseline against which we compare our results in terms of accuracy, training time, model size, and estimated energy consumed. Our experiments show that Spiking Neural Networks trade off accuracy (2.19x, 1.19x, 4.13x, and 1.16x greater Root Mean Square Error (RMSE) for macronutrients, calories burned, resting heart rate, and active minutes, respectively) for a smaller model (19% fewer parameters and 77% lighter in memory) and 43% faster training. Our model is estimated to consume 3.36 μJ per inference, which is much less than traditional Artificial Neural Networks (ANNs) [2]. The data recorded by health monitoring devices is widely distributed in the real world. Moreover, with such sensitive recorded information, there are many possible implications to consider.
For these reasons, we apply the clustering federated learning implementation [1] to our use case. However, it can be challenging to adopt such techniques, since it can be difficult to learn from irregular data sequences. We use a two-step streaming clustering approach to classify customers based on their eating and exercise habits. It has been shown that training different models for each group of users is useful, particularly in terms of training time; however, this is strongly dependent on the cluster size. Our experiments conclude that there is a decrease in error and training time if the clusters contain enough data to train the models. Finally, this study addresses the issue of data privacy by using state-of-the-art differential privacy. We apply ε-differential privacy to both our baseline model (trained on the whole dataset) and our federated-learning-based approach. With a privacy budget of ε = 0.1, our experiments report an increase in the measured average error (RMSE) of only 25%. Specifically, +23.13%, +25.71%, +29.87%, and +21.57% for macronutrients (grams), calories burned (kCal), resting heart rate (beats per minute, bpm), and active minutes (minutes), respectively.
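As an illustration of the two quantities this abstract reports, the sketch below computes an RMSE and releases a mean under ε-differential privacy. The abstract does not say which privacy mechanism the thesis uses; the Laplace mechanism shown here is only the simplest textbook instance of ε-differential privacy, and the function names, clipping bounds, and sample values are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def private_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with the Laplace mechanism.

    Assumption: neighbouring datasets differ by replacing one record and the
    number of records is public, so the mean of n values clipped to
    [lower, upper] has sensitivity (upper - lower) / n.
    """
    clipped = np.clip(np.asarray(values, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    # Laplace noise with scale sensitivity / epsilon gives epsilon-DP;
    # a smaller epsilon means more noise and stronger privacy.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    resting_hr = [58, 62, 71, 66, 60]  # made-up beats-per-minute readings
    print(private_mean(resting_hr, lower=40, upper=120, epsilon=0.1, rng=rng))
    print(rmse([60, 70, 80], [61, 68, 83]))
```

With ε as small as 0.1 and only a handful of records, the released value is dominated by noise; the privacy cost shrinks as the number of records grows, because the sensitivity of the mean falls as 1/n.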
|
7 |
Material Artefact Generation
Rončka, Martin (January 2019)
Obtaining a sufficiently large, high-quality dataset of images with clearly visible artefacts is not always easy, whether because the data source provides too little material or because annotations are laborious to create. This holds, for example, in radiology and in mechanical engineering. To use modern, established machine learning methods for classification, segmentation, and defect detection, the dataset must be sufficiently large and balanced. With small datasets we face problems such as overfitting and weak data, which cause misclassification to the detriment of under-represented classes. This thesis explores the use of generative networks to extend and balance a dataset with newly generated images. Using Conditional Generative Adversarial Networks (CGAN) and a heuristic annotation generator, we are able to generate a large number of new images of components with defects. A dataset of screw threads was used for the generation experiments. Two further datasets were also used: ceramics and MRI scans (BraTS). On these two datasets, we evaluate the influence of the generated data on training and its contribution to improving classification and segmentation.
|
8 |
Porovnání přístupů ke generování umělých dat / Comparison of Approaches to Synthetic Data Generation
Šejvlová, Ludmila (January 2017)
This diploma thesis deals with synthetic data and selected approaches to its generation, together with a practical data generation task. The goal of the thesis is to describe the selected approaches to data generation, capture their key advantages and disadvantages, and compare the individual approaches to each other. The practical part of the thesis describes the generation of synthetic data for teaching knowledge discovery in databases. The thesis includes a basic description of synthetic data and thoroughly explains the process of its generation. The approaches selected for further examination are random data generation, the statistical approach, data generation languages, and the ReverseMiner tool. The thesis also describes the practical usage of synthetic data and the suitability of each approach for certain purposes. Within this thesis, the educational dataset Hotel SD was created using the ReverseMiner tool. The data contain relations discoverable with SD (set-difference) GUHA procedures.
|
9 |
Complex Vehicle Modeling: A Data Driven Approach
Schoen, Alexander C. (12 1900)
Indiana University-Purdue University Indianapolis (IUPUI) / This thesis proposes an artificial neural network (NN) model to predict fuel consumption in heavy vehicles. The model uses predictors derived from vehicle speed, mass, and road grade. These variables are readily available from telematics devices that are becoming an integral part of connected vehicles. The model predictors are aggregated over a fixed distance traveled (i.e., a window) instead of a fixed time interval. A 1 km window was found to be most appropriate for the vocations studied in this thesis. Two vocations were studied: refuse trucks and delivery trucks.
The proposed NN model was compared to two traditional models. The first is a parametric model similar to one found in the literature. The second is a linear regression model that uses the same features developed for the NN model.
The confidence levels of the models built with these three methods were calculated in order to evaluate the models' variances. It was found that the NN models produce lower point-wise error. However, the NN models are not as stable as the regression models. To reduce the variance of the NN models, an ensemble based on the average of 5-fold models was created.
Finally, the confidence level of each model is analyzed in order to understand how much error is expected from each model. The mean training error was used to correct the ensemble predictions for five K-fold models. The ensemble K-fold model predictions are more reliable than those of the single NN and have a lower confidence interval than both the parametric and regression models.
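The two data-handling ideas in this abstract, aggregating predictors over fixed-distance windows and averaging the predictions of K fold models, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the choice of speed as the only aggregated signal, and the uniform sampling interval `dt_s` are all hypothetical.

```python
import numpy as np

def aggregate_by_distance(speed_mps, dt_s, window_m=1000.0):
    """Mean speed per fixed-distance window (default 1 km) instead of per fixed-time window.

    Assumes samples are taken every `dt_s` seconds; each sample is assigned to
    the window in which it starts.
    """
    speed = np.asarray(speed_mps, dtype=float)
    step = speed * dt_s                        # metres covered during each sample
    start = np.cumsum(step) - step             # metres travelled before each sample
    window_id = (start // window_m).astype(int)
    n_windows = int(window_id.max()) + 1
    sums = np.bincount(window_id, weights=speed, minlength=n_windows)
    counts = np.bincount(window_id, minlength=n_windows)
    filled = counts > 0                        # skip windows jumped over by one large step
    return sums[filled] / counts[filled]

def ensemble_predict(fold_predictions):
    """Average the predictions of K fold models (rows = models, columns = samples)."""
    return np.mean(np.asarray(fold_predictions, dtype=float), axis=0)

if __name__ == "__main__":
    # 100 s at 10 m/s covers the first kilometre, 50 s at 20 m/s covers the second.
    speeds = [10.0] * 100 + [20.0] * 50
    print(aggregate_by_distance(speeds, dt_s=1.0))
    print(ensemble_predict([[1.0, 2.0], [3.0, 4.0]]))
```

A fixed-distance window keeps the number of windows per route constant regardless of how long the vehicle idles, which is one plausible reason to prefer it over fixed-time windows for stop-and-go vocations.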
|
10 |
[en] AN APPROACH BASED ON INTERACTIVE MACHINE LEARNING AND NATURAL INTERACTION TO SUPPORT PHYSICAL REHABILITATION / [pt] UMA ABORDAGEM BASEADA NO APRENDIZADO DE MÁQUINA INTERATIVO E INTERAÇÃO NATURAL PARA APOIO À REABILITAÇÃO FÍSICA
JESSICA MARGARITA PALOMARES PECHO (10 August 2021)
[en] Physiotherapy aims to improve the physical functionality of people, seeking
to mitigate the disabilities caused by any injury, disorder or disease. In
this context, several computational technologies have been developed in order
to support the rehabilitation process, such as the end-user adaptable technologies.
These technologies allow the physiotherapist to adapt applications and
create activities with personalized characteristics according to the preferences
and needs of each patient. This thesis proposes a low-cost approach based on
interactive machine learning (iML) that aims to help physiotherapists to create
personalized activities for their patients easily and without the need for
software coding, from just a few examples in RGB video (captured by a digital
video camera). To this end, we take advantage of pose estimation based on deep
learning to track, in real time, the key joints of the human body from image
data. This data is processed as time series using the Dynamic Time Warping
algorithm in conjunction with the K-Nearest Neighbors algorithm to create a
machine learning model. Additionally, we use an anomaly detection algorithm
in order to automatically assess movements. The architecture of our approach
has two modules: one for the physiotherapist to present personalized examples
from which the system creates a model to recognize these movements; another
for the patient to perform personalized movements while the system evaluates
the patient. We assessed the usability of our system with physiotherapists
from five rehabilitation clinics. In addition, experts have clinically evaluated
our machine learning model. The results indicate that our approach contributes
to automatically assessing patients' movements without direct monitoring by
the physiotherapist, in addition to reducing the specialist's time required to
train an adaptable system.
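The recognition step described in this abstract, Dynamic Time Warping combined with K-Nearest Neighbors over time series of joint positions, can be sketched as below. This is a minimal sketch under stated assumptions rather than the thesis's system: the function names are hypothetical, the sequences are toy one-dimensional signals standing in for pose-estimation output, and the anomaly detection step is not shown.

```python
from collections import Counter

import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two sequences of feature vectors.

    Each sequence is a list of frames; a frame may be a scalar or a vector of
    joint coordinates. Frames are compared with the Euclidean norm.
    """
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignments.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def knn_classify(query, examples, labels, k=1):
    """Label `query` by majority vote among its k DTW-nearest recorded examples."""
    distances = [dtw_distance(query, example) for example in examples]
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    # Toy 1-D "joint angle" recordings supplied by a physiotherapist (made up).
    examples = [[0, 1, 2, 3], [3, 2, 1, 0]]
    labels = ["raise", "lower"]
    # A slower repetition of the same movement still matches after warping.
    print(knn_classify([0, 0, 1, 2, 3], examples, labels))
```

DTW tolerates differences in execution speed between the recorded example and the patient's repetition, which is why a few examples per movement can be enough for a nearest-neighbor classifier.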
|