31 |
Synthetic Data Generation and Training Pipeline for General Object Detection Using Domain Randomization
Arnestrand, Hampus; Mark, Casper (January 2024)
The development of high-performing object detection models requires extensive and varied datasets with accurately annotated images, a process that is traditionally labor-intensive and prone to errors. To address these challenges, this report explores the generation of synthetic data using domain randomization techniques to train object detection models. We present a pipeline that integrates synthetic data creation in Unity, and the training of YOLOv8 object detection models. Our approach uses the Unity Perception package to produce diverse and precisely annotated datasets, overcoming the domain gap typically associated with synthetic data. The pipeline was evaluated through a series of experiments, analyzing the impact of various parameters such as background textures, and training arguments on model performance. The results demonstrate that models trained with our synthetic data can achieve high accuracy and generalize well to real-world scenarios, offering a scalable and efficient alternative to manual data annotation.
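The randomization described above can be sketched as scene-parameter sampling. The parameter names and value ranges below are illustrative assumptions, not the thesis's actual Unity Perception randomizers:

```python
import random

def sample_scene_parameters(rng: random.Random) -> dict:
    """Sample one randomized scene configuration.

    Hypothetical parameters; in the thesis this role is played by
    randomizers in the Unity Perception package.
    """
    return {
        "background_texture": rng.choice(["noise", "gradient", "photo", "solid"]),
        "light_intensity": rng.uniform(0.2, 2.0),
        "object_yaw_deg": rng.uniform(0.0, 360.0),
        "camera_distance_m": rng.uniform(0.5, 3.0),
    }

# Each rendered training image gets its own sampled configuration.
rng = random.Random(42)
dataset_configs = [sample_scene_parameters(rng) for _ in range(1000)]
```

Sampling every parameter independently per image is what forces the detector to treat background and lighting as nuisance variables, which is the core idea behind closing the domain gap.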
|
32 |
Applying Simulation to the Problem of Detecting Financial Fraud
Lopez-Rojas, Edgar Alonso (January 2016)
This thesis introduces a financial simulation model covering two related financial domains: mobile payments and retail store systems. The problem we address in these domains is different types of fraud. We limit ourselves to isolated cases of relatively straightforward fraud; the ultimate aim, however, is to introduce our approach to using computer simulation for fraud detection and its applications in financial domains. Fraud is an important problem that impacts the whole economy. Currently, there is a lack of public research into the detection of fraud, one important reason being the lack of transaction data, which is often sensitive. To address this problem we present a mobile money Payment Simulator (PaySim) and a Retail Store Simulator (RetSim), which allow us to generate synthetic transactional data containing both normal customer behaviour and fraudulent behaviour. These are Multi-Agent-Based Simulations (MABS) calibrated using real data from financial transactions. We developed agents that represent the clients and merchants in PaySim and the customers and salesmen in RetSim. The normal behaviour was based on behaviour observed in data from the field and is codified in the agents as rules of transaction and interaction between clients and merchants, or customers and salesmen. Some of these agents were intentionally designed to act fraudulently, based on observed patterns of real fraud. We introduced known signatures of fraud into our model and simulations to test and evaluate our fraud detection methods. The resulting behaviour of the agents generates a synthetic log of all transactions produced by the simulation. This synthetic data can be used to further advance fraud detection research without leaking sensitive information about the underlying data or breaking any non-disclosure agreements.
Using statistics and social network analysis (SNA) on real data, we calibrated the relations between our agents and generated realistic synthetic data sets that were verified against the domain and validated statistically against the original source. We then used the simulation tools to model common fraud scenarios to ascertain exactly how effective detection techniques are, starting with the simplest form of statistical threshold detection, which is perhaps the most common in use. The preliminary results show that threshold detection is effective enough at keeping fraud losses at a set level, which suggests there is little economic room for improved fraud detection techniques. We also implemented other applications of the simulator tools, such as setting up a triage model and measuring the cost of fraud. This proved to be an important aid for managers who aim to prioritise fraud detection and want to know how much they should invest in it to keep losses below a desired limit under different experimented and expected fraud scenarios.
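As a rough illustration of the simplest detection method evaluated above, a fixed statistical threshold over transaction amounts might look like this. The field names are assumed for the example, not PaySim's actual schema:

```python
def flag_by_threshold(transactions, threshold):
    """Flag transactions whose amount exceeds a fixed threshold --
    the simplest form of statistical threshold detection."""
    return [t for t in transactions if t["amount"] > threshold]

def residual_fraud_loss(transactions, threshold):
    """Fraud losses remaining when flagged transactions are blocked:
    fraudulent amounts at or below the threshold slip through."""
    return sum(t["amount"] for t in transactions
               if t["is_fraud"] and t["amount"] <= threshold)
```

The second function captures the thesis's observation directly: the threshold caps the size of any single undetected fraudulent transaction, so total losses stay at a level the operator can set.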
|
33 |
Venture Capital Investment under Private Information
Narayanan, Meyyappan (January 2011)
Many venture capitalists (VCs) use the “VC method” of valuation where they use judgment to estimate a probability of successful exit while determining the ownership share to demand in exchange for investing in a venture. However, prior models are not aligned with the “VC method” because they do not consider private information about entrepreneurial characteristics, the primary drivers of the above probability, and consequently do not model judgment. The three main chapters of this thesis—one theoretical, one simulation, and one empirical study—examine the venture capital deal process in sync with the “VC method.”
Chapter 2 is theoretical and develops a principal-agent model of the venture capital deal process incorporating double-sided moral hazard and one-sided private information. The VC is never fully informed about the entrepreneur’s disutility of effort in spite of due diligence checks, so he or she adopts a belief about the latter’s performance in the funded venture to determine the offer. This study suggests that there exists a critical point in the VC’s belief (and correspondingly in the VC’s ownership share) that maximizes the total return to the two parties. It also uncovers optimal revision strategies for the VC to adopt if the offer is rejected, showing that the VC should develop a strong advisory capacity and minimize time constraints to facilitate investment.
Chapter 3 simulates venture capital deals as per the theoretical model and confirms the existence of critical points in the VC’s belief and ownership share that maximize the returns to the two parties and their total return. In particular, the VC’s return (in excess of his or her return from an alternate investment) peaks at a moderate ownership share for the VC. Since the entrepreneur’s private information would preclude the VC from knowing these critical points a priori, the VC should demand a moderate ownership share to stay close to such a peak. Using data from the simulations, we also generate predictions about the properties of the venture capital deal space, notably: (a) teamwork is crucial to financing; and (b) if the VC is highly confident about the entrepreneur’s performance, it works to the latter’s advantage. Chapter 4 reports the results from our survey of eight seasoned VCs affiliated with seven firms operating in Canada, the USA, and the UK, where our findings received a high degree of support.
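The "VC method" arithmetic referred to above can be sketched in a few lines. This is an illustrative variant that folds the judged success probability into the demanded share, not the thesis's exact model:

```python
def vc_ownership_share(investment, exit_value, p_success, target_multiple):
    """Ownership share a VC might demand under a simple 'VC method'.

    required_at_exit: what the investment must be worth at a successful
    exit to achieve the target multiple. Dividing by the
    probability-weighted exit value folds in the VC's judgment about
    the venture's chance of success (all parameters illustrative).
    """
    required_at_exit = investment * target_multiple
    return required_at_exit / (p_success * exit_value)
```

For example, a $2M investment targeting a 10x return against a $50M exit judged 50% likely implies demanding an 80% share; a higher judged probability of success lowers the demanded share, which is why the belief drives the offer.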
|
35 |
Generating Synthetic Schematics with Generative Adversarial Networks
Daley Jr, John (January 2020)
This study investigates synthetic schematic generation using conditional generative adversarial networks; specifically, the Pix2Pix algorithm was implemented for the experimental phase of the study. With the increase in deep neural networks’ capabilities and availability, there is a demand for large annotated datasets. This, in combination with increased privacy concerns, has led to the use of synthetic data generation. The synthetic images were evaluated using a survey: the generated blueprint images passed as genuine 40% of the time. This study confirms the ability of generative neural networks to produce synthetic blueprint images.
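The survey evaluation above reduces to a pass-rate computation. A sketch, assuming each response records whether the shown image was synthetic and whether the respondent judged it genuine:

```python
def fool_rate(responses):
    """Fraction of synthetic images judged genuine by respondents.

    Each response is a (is_synthetic, judged_genuine) pair; the
    response format is an assumption for illustration.
    """
    synthetic_judgements = [judged for is_syn, judged in responses if is_syn]
    return sum(synthetic_judgements) / len(synthetic_judgements)
```

With two of five synthetic blueprints judged genuine, the rate is 0.4, matching the 40% figure reported above.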
|
36 |
Synthetic Data Generation Using Transformer Networks / Text Generation with Transformer Networks: Creating Text from a Synthetic Tabular Dataset
Campos, Pedro (January 2021)
One of the areas propelled by the advancements in deep learning is Natural Language Processing. These continuous advancements allowed the emergence of new language models such as the Transformer [1], a deep learning model based on attention mechanisms that takes a sequence of symbols as input and outputs another sequence, attending to the input during generation. This model is often used in translation, text summarization and text generation, outperforming previously used methods such as Recurrent Neural Networks and Generative Adversarial Networks. The problem statement provided by the company Syndata for this thesis is related to this new architecture: given a tabular dataset, create a model based on the Transformer that can generate text fields considering the underlying context from the rest of the accompanying fields. In an attempt to accomplish this, Syndata had previously implemented a recurrent model; nevertheless, they are confident that a Transformer could perform better at this task. Their goal is to improve the existing solution with the implementation of a model based on the Transformer architecture. The implemented model should then be compared to the previous recurrent model and is expected to outperform it. Since there are not many published research articles in which Transformers are used for synthetic tabular data generation, this problem is fairly original. Four different models were implemented: a model based on the GPT architecture [2], an LSTM [3], a Bidirectional LSTM with an encoder-decoder structure, and the Transformer. The first two are autoregressive models and the latter two are sequence-to-sequence models with an encoder-decoder architecture.
We evaluated each of them on three different aspects: the distribution similarity between the real and generated datasets, how well each model was able to condition name generation on the information contained in the accompanying fields, and how much real data the model compromised after generation, which addresses a privacy-related issue. We found that encoder-decoder models such as the Transformer and the Bidirectional LSTM seem to perform better for this type of synthetic data generation, where the output (or the field to be predicted) has to be conditioned by the rest of the accompanying fields. They outperformed the GPT and RNN models in the aspects that matter most to Syndata: keeping customer data private and correctly conditioning the output on the information contained in the accompanying fields.
|
37 |
Learning from 3D generated synthetic data for unsupervised anomaly detection
Fröjdholm, Hampus (January 2021)
Modern machine learning methods, utilising neural networks, require a lot of training data. Data gathering and preparation have thus become a major bottleneck in the machine learning pipeline, and researchers often use large public datasets to conduct their research (such as the ImageNet [1] or MNIST [2] datasets). As these methods begin to be used in industry, the challenges become apparent. In factories, the objects being produced are often unique and may even involve trade secrets and patents that need to be protected. Additionally, manufacturing may not have started yet, making real data collection impossible. In both cases a public dataset is unlikely to be applicable. One possible solution, investigated in this thesis, is synthetic data generation. Synthetic data generation using physically based rendering was tested for unsupervised anomaly detection on a 3D printed block. A small image dataset of the block was gathered as control, and a data generation model was created using its CAD model, a resource most often available in industrial settings. The data generation model used randomisation to reduce the domain shift between the real and synthetic data. To test the data, autoencoder models were trained on the real and synthetic data separately and in combination. The material of the block, a white painted surface, proved challenging to reconstruct, and no significant difference between the synthetic and real data could be observed. The model trained on real data outperformed the models trained on the synthetic and the combined data. However, the synthetic data combined with the real data showed promise in reducing some of the bias intentionally introduced in the real dataset. Future research could focus on creating synthetic data for a problem where a good anomaly detection model already exists, with the goal of transferring parts of the synthetic data generation model (such as the materials) to a new problem.
This would be of interest in industries that produce many different but similar objects, and could reduce the time needed when starting a new machine learning project.
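Autoencoder-based anomaly detection of the kind described above hinges on reconstruction error: the model learns to reconstruct normal samples well, so poorly reconstructed inputs are flagged. A minimal scoring sketch (the thesis's models operate on images; plain vectors stand in here):

```python
def anomaly_score(original, reconstruction):
    """Mean squared reconstruction error between an input and the
    autoencoder's reconstruction; higher means more anomalous."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / len(original)

def is_anomalous(original, reconstruction, threshold):
    """Flag the input when its reconstruction error exceeds a
    threshold calibrated on normal (defect-free) samples."""
    return anomaly_score(original, reconstruction) > threshold
```

The threshold is typically chosen from the error distribution of held-out normal samples, which is why this setup needs no labeled anomalies: it is unsupervised.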
|
38 |
Deep Learning for 3D Perception: Computer Vision and Tactile Sensing
Garcia-Garcia, Alberto (23 October 2019)
The care of dependent people (for reasons of aging, accidents, disabilities or illnesses) is one of the top-priority lines of research for the European countries, as stated in the Horizon 2020 goals. In order to minimize the cost and the intrusiveness of the therapies for care and rehabilitation, it is desirable that such care is administered at the patient’s home. The natural solution for this environment is an indoor mobile robotic platform. Such a robotic platform for home care needs to solve, to a certain extent, a set of problems that lie at the intersection of multiple disciplines, e.g., computer vision, machine learning, and robotics. At that crossroads, one of the most notable challenges (and the one we will focus on) is scene understanding: the robot needs to understand the unstructured and dynamic environment in which it navigates and the objects with which it can interact. To achieve full scene understanding, various tasks must be accomplished. In this thesis we will focus on three of them: object class recognition, semantic segmentation, and grasp stability prediction. The first refers to the process of categorizing an object into a set of classes (e.g., chair, bed, or pillow); the second goes one level beyond object categorization and aims to provide a per-pixel dense labeling of each object in an image; the third consists of determining whether an object that has been grasped by a robotic hand is in a stable configuration or will fall. This thesis presents contributions towards solving those three tasks using deep learning as the main tool for such recognition, segmentation, and prediction problems. All those solutions share one core observation: they all rely on three-dimensional data inputs to leverage that additional dimension and its spatial arrangement.
The four main contributions of this thesis are: first, we show a set of architectures and data representations for 3D object classification using point clouds; second, we carry out an extensive review of the state of the art of semantic segmentation datasets and methods; third, we introduce a novel synthetic and large-scale photorealistic dataset for solving various robotic and vision problems together; finally, we propose a novel method and representation to deal with tactile sensors and learn to predict grasp stability.
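Point clouds are unordered sets of 3D coordinates, so architectures that exploit spatial arrangement often first map them onto a regular grid. A minimal voxelization sketch, illustrative rather than the thesis's actual data representation:

```python
import math

def voxelize(points, voxel_size):
    """Map 3D points to the set of occupied voxel indices -- one common
    way to give an unordered point cloud the regular spatial structure
    that 3D convolutional networks expect."""
    return {
        tuple(math.floor(c / voxel_size) for c in p)
        for p in points
    }
```

The resulting occupancy set can be turned into a dense binary grid for 3D convolutions; the trade-off is resolution versus memory, since the grid grows cubically with resolution.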
|
39 |
Using Synthetic Data to Model Mobile User Interface Interactions
Jalal, Laoa (January 2023)
Usability testing within user interface (UI) design is a central part of assuring high-quality UI design that provides good user experiences across multiple user groups. The process of usability testing often requires extensive collection of user feedback, preferably across multiple user groups, to ensure an unbiased observation of the potential design flaws within the UI design. Attaining feedback from certain user groups has proven challenging, due to factors such as medical conditions that limit users’ possibilities to participate in the usability test. An absence of these hard-to-access groups can lead to designs that fail to consider their unique needs and preferences, which may result in a worse user experience for these individuals. In this thesis, we try to address the current gaps within data collection for usability tests by investigating whether the Generative Adversarial Network (GAN) framework can be used to generate high-quality synthetic user interactions of a particular UI gesture across multiple user groups. A collection of UI interactions from two user groups, namely the elderly and the young population, was conducted, where the UI interaction in focus was the drag-and-drop operation. The datasets, comprising both user groups, were trained on separate GANs, both using the doppelGANger architecture, and the generated synthetic data was evaluated based on its diversity, how well temporal correlations are preserved, and its performance compared to the real data when used in a classification task. The experiment results show that both GANs produce high-quality synthetic resemblances of the drag-and-drop operation, where the synthetic samples show both diversity and uniqueness when compared to the actual dataset. The synthetic datasets across both user groups also preserve statistical properties of the original dataset, such as the per-sample length distribution and the temporal correlations within the sequences.
Furthermore, the synthetic dataset shows, on average, similar performance across precision, recall and F1 scores compared to the actual dataset when used to train a classifier to distinguish between the elderly and younger populations’ drag-and-drop sequences. Further research regarding the use of multiple UI gestures, the use of a single GAN to generate UI interactions across multiple user groups, and a comparative study of different GAN architectures would provide valuable insight into unexplored potentials and possible limitations within this particular problem domain.
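The precision, recall and F1 scores used in the comparison above are computed from raw label pairs; a standard sketch for the binary case:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary classification metrics from parallel label lists.

    precision: of everything predicted positive, how much was right;
    recall: of everything actually positive, how much was found;
    F1: their harmonic mean.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Comparing these scores for a classifier trained on synthetic versus real sequences is a common "train-synthetic, test-real" check of whether the generated data carries the same discriminative signal as the original.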
|
40 |
Synthetic Data Generation for 6D Object Pose and Grasping Estimation
Martínez González, Pablo (16 March 2023)
Teaching a robot how to behave so that it becomes completely autonomous is not a simple task. When robotic systems become truly intelligent, interactions with them will feel natural and easy, but nothing could be further from the truth today. Making a robot understand and assimilate its surroundings is a huge task that the computer vision field tries to address, and deep learning techniques are bringing us closer, but at the cost of data. Synthetic data generation is the process of generating artificial data to train machine learning models. This data is generated using computer algorithms and simulations, and is designed to resemble real-world data as closely as possible. The use of synthetic data has become increasingly popular in recent years, particularly in the field of deep learning, due to the shortage of high-quality annotated real-world data and the high cost of collecting it. For that reason, in this thesis we address the task of facilitating the generation of synthetic data by creating a framework that leverages advances in modern rendering engines. In this context, the generated synthetic data can be used to train models for tasks such as 6D object pose estimation or grasp estimation. 6D object pose estimation refers to the problem of determining the position and orientation of an object in 3D space, while grasp estimation involves predicting the position and orientation of a robotic hand or gripper that can be used to pick up and manipulate the object. These are important tasks in robotics and computer vision, as they enable robots to perform complex manipulation and grasping tasks. In this work we propose a way of extracting grasping information from hand-object interactions in virtual reality, so that synthetic data can also boost research in that area. Finally, we use this synthetically generated data to test the proposal of applying 6D object pose estimation architectures to grasping-region estimation.
This idea is based on both problems sharing several underlying concepts, such as object detection and orientation. / This thesis has been funded by the Spanish Ministry of Education [FPU17/00166]
|