Global ETD Search

31	GAN-Based Approaches for Generating Structured Data in the Medical Domain Abedi, Masoud, Hempel, Lars, Sadeghi, Sina, Kirsten, Toralf 03 November 2023 Modern machine and deep learning methods require large datasets to achieve reliable and robust results. This requirement is often difficult to meet in the medical field, due to data sharing limitations imposed by privacy regulations or the small number of patients available (e.g., rare diseases). To address this data scarcity, generative models such as Generative Adversarial Networks (GANs) have been widely used to generate synthetic data that mimic real data by representing features that reflect health-related information without reference to real patients. In this paper, we consider several GAN models to generate synthetic data used for training binary (malignant/benign) classifiers, and compare their performance in terms of classification accuracy with cases where only real data are considered. We aim to investigate how synthetic data can improve classification accuracy, especially when only a small amount of data is available. To this end, we have developed and implemented an evaluation framework in which binary classifiers are trained on extended datasets containing both real and synthetic data. The results show improved accuracy for classifiers trained with generated data from more advanced GAN models, even when limited amounts of original data are available. ddc:610
32	Generative Adversarial Networks for Vehicle Trajectory Generation / Generativa Motståndarnätverk för Generering av Fordonsbana Bajarunas, Kristupas January 2022 Deep learning models rely heavily on an abundance of data, and their performance is directly affected by data availability. In mobility pattern modeling, problems such as next-location prediction or flow prediction are commonly solved using deep learning approaches. Despite advances in modeling techniques, complications arise when the acquisition of mobility data is limited by geographic factors and data protection laws. Generating high-quality synthetic data is one way to work around data scarcity. Trajectory generation is concerned with generating trajectories that reproduce the spatial and temporal characteristics of the underlying original mobility patterns. The task of this project was to evaluate the capability of Generative Adversarial Networks (GANs) to generate synthetic vehicle trajectory data. We extend the methodology of previous research on trajectory generation by introducing conditional trajectory duration labels and a model pretraining mechanism. The evaluation of generated trajectories consisted of a two-fold analysis: a qualitative analysis by visual inspection of generated trajectories, and a quantitative analysis by calculating the statistical distance between synthetic and original data distributions. The results indicate that extending the previous GAN methodology allows the novel model to generate trajectories statistically closer to the original data distribution. Nevertheless, a statistical base model has the best generative performance and is the only model to generate visually plausible results. We attribute the superior performance of the statistical base model to the highly predictable nature of vehicle trajectories, which must follow the road network and tend to follow minimum-distance routes. 
This research considered only one type of GAN-based model, and further research should explore other architectural alternatives to fully understand the potential of GAN-based models. Data Generation Generative Adversarial Networks Vehicle Trajectories Computer and Information Sciences
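The quantitative evaluation described in entry 32 compares synthetic and original data distributions by a statistical distance. The abstract does not say which distance was used, so the following is only a minimal sketch that assumes the Jensen-Shannon divergence over histograms of trajectory durations. The function names, the bin width, and the sample durations are hypothetical illustrations, not taken from the thesis.

```python
import math
from collections import Counter


def histogram(values, bin_width):
    """Turn raw values into a normalized histogram {bin index: probability}."""
    counts = Counter(int(v // bin_width) for v in values)
    total = len(values)
    return {b: c / total for b, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions with
    disjoint support.
    """
    def kl(a, m):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(pa * math.log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up trajectory durations in minutes (assumption, for illustration only).
real = [12.0, 15.5, 31.0, 33.2, 48.9]
synthetic = [11.0, 17.0, 29.5, 52.0, 58.0]
print(js_divergence(histogram(real, 10), histogram(synthetic, 10)))
```

A lower value means the synthetic durations are distributed more like the real ones. The same comparison could be repeated for other trajectory attributes such as length or speed.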
33	The Importance of Data in RF Machine Learning Clark IV, William Henry 17 November 2022 While the toolset known as Machine Learning (ML) is not new, several of the tools within it have seen revitalization with improved hardware and have been applied across several domains in the last two decades. Deep Neural Network (DNN) applications have contributed to significant research on Radio Frequency (RF) problems over the last decade, spurred by results in image and audio processing. Machine Learning (ML), and Deep Learning (DL) specifically, are driven by access to relevant data during the training phase of the application, because the learned feature sets are derived from vast amounts of similar data. Despite this critical reliance on data, the literature provides insufficient answers on how to quantify the training data needs of an application in order to achieve a desired performance. This dissertation first aims to create a practical definition that bounds the problem space of Radio Frequency Machine Learning (RFML), which we take to mean the application of Machine Learning (ML) as close to the sampled baseband signal directly after digitization as is possible, while allowing for preprocessing when reasonably defined and justified. After constraining the problem to the Radio Frequency Machine Learning (RFML) domain space, the kinds of Machine Learning (ML) that have been applied, as well as the techniques that have shown benefits, are reviewed from the literature. With the problem space defined and the trends in the literature examined, the next goal is to provide a better understanding of the concept of data quality through quantification. This quantification helps explain how the quality of data affects Machine Learning (ML) systems with regard to final performance, how it drives the quantity of data observations required within that space, and how its impacts can be generalized and contrasted. 
With the understanding of how data quality and quantity can affect the performance of a system in the Radio Frequency Machine Learning (RFML) space, data generation techniques and their realizations, from conceptual through real-time hardware implementations, are examined. Consequently, the results of this dissertation provide a foundation for estimating the investment required to realize a performance goal within a Deep Learning (DL) framework, as well as a rough order of magnitude for common goals within the Radio Frequency Machine Learning (RFML) problem space. / Doctor of Philosophy / Machine Learning (ML) is a powerful toolset capable of solving difficult problems across many domains. A fundamental part of this toolset is the representative data used to train a system. Unlike the domains of image or audio processing, for which datasets are constantly being developed thanks to usage agreements with entities such as Facebook, Google, and Amazon, the field of Machine Learning (ML) within the Radio Frequency (RF) domain, or Radio Frequency Machine Learning (RFML), does not have access to such crowdsourced means of creating labeled datasets. Therefore, data within the Radio Frequency Machine Learning (RFML) problem space must be intentionally cultivated to address the target problem. This dissertation explains the problem space of Radio Frequency Machine Learning (RFML) and then quantifies the effect of quality on data used during the training of Radio Frequency Machine Learning (RFML) systems. Taking this one step further, the work then provides a means of estimating the data quantity needed to achieve high levels of performance with the current Deep Learning (DL) approach to the problem, which in turn can be used as guidance to refine the approach when the real-world data quantity requirements exceed practical acquisition levels. 
Finally, the problem of data generation is examined, providing context for the difficulties associated with procuring high-quality data for problems in the Radio Frequency Machine Learning (RFML) space. machine learning rfml radio frequency machine learning data generation data collection software defined radio
34	An automatic test data generation from UML state diagram using genetic algorithm. Doungsa-ard, Chartchai, Dahal, Keshav P., Hossain, M. Alamgir, Suwannasart, T. January 2007 Software testing is part of the software development process. However, it is the first part that software developers skip when there is limited time to complete the project. Software developers often finish building their software close to the delivery date, so they usually don't have enough time to create effective test cases for their programs. Creating test cases manually is a great deal of work for software developers under time pressure. A tool that automatically generates test cases and test data can help software developers create test cases from software designs/models at an early stage of software development (before coding). Heuristic techniques can be applied to create quality test data. In this paper, a GA-based test data generation technique is proposed to generate test data from a UML state diagram, so that test data can be generated before coding. The paper details the GA implementation that generates sequences of triggers for a UML state diagram as test cases. The proposed algorithm is demonstrated manually on the example of a vending machine. Test data generation Gray-box testing Artificial intelligence Genetic algorithm Software development
35	Test data generation from UML state machine diagrams using GAs Doungsa-ard, Chartchai, Dahal, Keshav P., Hossain, M. Alamgir, Suwannasart, T. January 2008 Automatic test data generation helps testers validate software against user requirements more easily. Test data can be generated from many sources, for example the experience of testers, the source program, or the software specification. Selecting a proper test data set is a decision-making task. Testers have to decide which test data to use, and a heuristic technique is needed to solve this problem automatically. In this paper, we propose a framework for generating test data from software specifications. The selected specification is the Unified Modeling Language (UML) state machine diagram. A UML state machine diagram describes a system in terms of states, which can change when an action occurs in the system. The generated test data is a sequence of these actions. These action sequences tell testers how they should test the system. The quality of the generated test data is measured by the number of transitions fired by the test data: the more transitions the test data fires, the better its quality. The number of covered transitions is also used as feedback for a heuristic search for a better test set. Genetic algorithms (GAs) are used to search for the best test data. Our experimental results show that the proposed GA-based approach can work well for generating test data for some types of UML state machine diagrams. Test data generation UML state machine diagram Genetic algorithms Software validation
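Entries 34 and 35 describe test data as a sequence of triggers, scored by the number of transitions the sequence fires, with a genetic algorithm searching for better sequences. The sketch below illustrates that idea under stated assumptions: the vending-machine states and triggers, the population size, the truncation selection, and the one-point crossover are hypothetical choices made for illustration and are not the operators reported in the papers.

```python
import random

# Hypothetical vending-machine state machine: (state, trigger) -> next state.
TRANSITIONS = {
    ("idle", "coin"): "paid",
    ("paid", "coin"): "paid",
    ("paid", "select"): "dispensing",
    ("paid", "refund"): "idle",
    ("dispensing", "take"): "idle",
}
TRIGGERS = ["coin", "select", "refund", "take"]


def coverage(sequence, start="idle"):
    """Fitness: number of distinct transitions fired by a trigger sequence."""
    state, fired = start, set()
    for trigger in sequence:
        key = (state, trigger)
        if key in TRANSITIONS:  # triggers with no matching transition are ignored
            fired.add(key)
            state = TRANSITIONS[key]
    return len(fired)


def evolve(length=8, pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Search for a trigger sequence with high transition coverage."""
    rng = random.Random(seed)
    population = [[rng.choice(TRIGGERS) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=coverage, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: occasionally replace a trigger with a random one.
            child = [rng.choice(TRIGGERS) if rng.random() < mutation_rate else t
                     for t in child]
            children.append(child)
        population = parents + children
    return max(population, key=coverage)


best = evolve()
print(best, coverage(best), "of", len(TRANSITIONS), "transitions fired")
```

Because the best half of each generation is carried over unchanged, the best coverage never decreases from one generation to the next.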
36	Multivariate Time Series Data Generation using Generative Adversarial Networks : Generating Realistic Sensor Time Series Data of Vehicles with an Abnormal Behaviour using TimeGAN Nord, Sofia January 2021 Large datasets are a crucial requirement for achieving high performance, accuracy, and generalisation in any machine learning task, such as prediction or anomaly detection. However, it is not uncommon for datasets to be small or imbalanced, since gathering data can be difficult, time-consuming, and expensive. These difficulties are present when collecting vehicle sensor time series data, in particular when the vehicle behaves abnormally, and may hinder the automotive industry in its development. Synthetic data generation has attracted growing interest among researchers in several fields as a way to handle the difficulties of data gathering. Among the methods explored for generating data, generative adversarial networks (GANs) have become a popular approach due to their wide application domain and successful performance. This thesis focuses on generating multivariate time series data that are similar to vehicle sensor readings of the air pressures in the brake system of vehicles with abnormal behaviour, meaning there is a leakage somewhere in the system. A novel GAN architecture called TimeGAN was trained to generate such data and was then evaluated using both qualitative and quantitative evaluation metrics. Two versions of this model were tested and compared. The results showed that both models learnt the distribution and the underlying information within the features of the real data. The goal of the thesis was achieved, and the work can become a foundation for future work in this field. Time Series Data Generation Generative Adversarial Network Deep Neural Network Data Augmentation Synthetic Data Generation Computer and Information Sciences
37	Search-based software engineering : a search-based approach for testing from extended finite state machine (EFSM) models Kalaji, Abdul Salam January 2010 The extended finite state machine (EFSM) is a powerful modelling approach that has been applied to represent a wide range of systems. Despite its popularity, testing from an EFSM is a substantial problem for two main reasons: path feasibility and path test case generation. The path feasibility problem concerns generating transition paths through an EFSM that are feasible and satisfy a given test criterion. In an EFSM, guards and assignments in a path's transitions may cause some selected paths to be infeasible. The problem of path test case generation is to find a sequence of inputs that can exercise the transitions in a given feasible path. However, the guards and assignments of the transitions in a given path can make such data difficult to produce, narrowing the acceptable inputs down to a possibly tiny range. While search-based approaches have proven efficient in automating aspects of testing, they have received little attention when testing from EFSMs. This thesis proposes an integrated search-based approach to automatically test from an EFSM. The proposed approach generates paths through an EFSM that are potentially feasible and satisfy a test criterion. Then, it generates test cases that can exercise the generated feasible paths. The approach is evaluated by using it to test from five EFSM case studies. The experimental results demonstrate the value of the proposed approach. 005.3
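The two problems named in entry 37 can be made concrete with a toy model: guards and assignments decide whether a path can be executed at all, and finding inputs for a feasible path means satisfying every guard along it. The sketch below is an illustration under stated assumptions, not the thesis's method. The three-transition EFSM, the `credit` variable, and the input range are hypothetical, and plain random search stands in for the fitness-guided search the thesis proposes.

```python
import random

# Hypothetical EFSM: each transition has a guard and an update over the
# context (a dict of variables) and takes one integer input parameter.
EFSM = {
    "t1": {"src": "s0", "dst": "s1",
           "guard": lambda ctx, x: x > 0,
           "update": lambda ctx, x: {**ctx, "credit": x}},
    "t2": {"src": "s1", "dst": "s2",
           "guard": lambda ctx, x: ctx["credit"] + x == 10,
           "update": lambda ctx, x: {**ctx, "credit": 0}},
    "t3": {"src": "s1", "dst": "s2",
           "guard": lambda ctx, x: ctx["credit"] < 0,
           "update": lambda ctx, x: ctx},
}


def run_path(path, inputs, start="s0"):
    """Return how many transitions of the path execute before a guard fails."""
    state, ctx = start, {"credit": 0}
    for done, (name, x) in enumerate(zip(path, inputs)):
        t = EFSM[name]
        if t["src"] != state or not t["guard"](ctx, x):
            return done
        ctx, state = t["update"](ctx, x), t["dst"]
    return len(path)


def find_inputs(path, tries=5000, low=-20, high=20, seed=1):
    """Random search for inputs that drive the whole path; None if none is found."""
    rng = random.Random(seed)
    for _ in range(tries):
        inputs = [rng.randint(low, high) for _ in path]
        if run_path(path, inputs) == len(path):
            return inputs
    return None


print(find_inputs(["t1", "t2"]))  # feasible: needs x1 > 0 and x1 + x2 == 10
print(find_inputs(["t1", "t3"]))  # infeasible: t1 makes credit positive, t3 needs it negative
```

A `None` result only means the search found nothing within its budget; it does not prove that a path is infeasible. The narrow equality guard on `t2` shows why acceptable inputs can shrink to a tiny range.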
38	Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics DING, ZEJIN 07 May 2011 In this dissertation, the problem of learning from highly imbalanced data is studied. Imbalanced data learning is both important and challenging in many real applications. Dealing with a minority class normally requires new concepts, observations and solutions in order to fully understand the underlying complicated models. We try to systematically review and solve this special learning task in this dissertation. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL utilizes general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. 
Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is very robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem: protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results confirm our new contributions to this particular data learning problem. Machine learning Classification Imbalanced data learning Diversified ensemble Active learning Artificial data generation Bioinformatics Protein methylation Computer Sciences
39	Knowledge management infrastructure and knowledge sharing: The case of a large fast moving consumer goods distribution centre in the Western Cape George, Chadrick Hendrik January 2014 Magister Commercii - MCom / The aim of this study is to understand how knowledge is created, shared and used within a fast moving consumer goods (FMCG) distribution centre (DC) in the Western Cape (WC). It also aims to understand knowledge sharing (KS) between individuals in the organisation. A literature review was conducted in order to answer the research questions; it covered the background of knowledge management (KM) and KS and their current status, with particular reference to SA's private sector. The study found that technological KM infrastructure, cultural KM infrastructure and organisational KM infrastructure are important enablers of KS. A conceptual model was developed around these concepts. In order to answer the research questions, the study identified an FMCG DC in the WC where KS is practiced. Knowledge Knowledge sharing Knowledge management Knowledge management maturity Knowledge transfer Data generation People management Organisational knowledge Tacit knowledge Organisational capabilities
40	Génération automatique de tests unitaires avec Praspel, un langage de spécification pour PHP / The art of contract-based testing in PHP with Praspel Enderlin, Ivan 16 July 2014 The work presented in this thesis concerns the validation of PHP programs through a new specification language and its accompanying tools. The work follows three axes: specification language, automatic test data generation and automatic unit test generation. The first contribution is Praspel, a new specification language for PHP based on Design by Contract. Praspel specifies data with realistic domains, which are new structures for validating and generating data. From a contract written in Praspel, we can perform Contract-based Testing, i.e., use contracts to automatically generate unit tests. The second contribution concerns test data generation. For booleans, integers and floating-point numbers, uniform random generation is used. For arrays, a dedicated constraint solver has been implemented and used. For strings, a grammar description language, along with an LL(⋆) compiler compiler and several data generation algorithms, is used. Finally, object generation is supported. The third contribution defines coverage criteria over contracts, which provide test objectives. All these contributions have been implemented and evaluated in tools distributed to the PHP community. Praspel Contract-based Testing Data generation Automatic unit test generation PHP 004.75
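Entry 40 describes realistic domains as structures that can both validate and generate data, with uniform random generation for booleans, integers and floating-point numbers. The sketch below illustrates that dual role in Python rather than PHP. It is not Praspel syntax or the Praspel API: the class names, the `accepts` and `sample` methods, and the example contract are all hypothetical.

```python
import random


class Boolean:
    """Hypothetical domain of booleans."""

    def accepts(self, value):
        return isinstance(value, bool)

    def sample(self, rng):
        return rng.random() < 0.5


class BoundInteger:
    """Hypothetical domain of integers in [lower, upper]."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def accepts(self, value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and self.lower <= value <= self.upper)

    def sample(self, rng):
        return rng.randint(self.lower, self.upper)  # uniform over the range


class BoundFloat:
    """Hypothetical domain of floats in [lower, upper]."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def accepts(self, value):
        return isinstance(value, float) and self.lower <= value <= self.upper

    def sample(self, rng):
        return rng.uniform(self.lower, self.upper)


def generate_arguments(contract, rng):
    """Draw one value per parameter from its domain."""
    return {name: domain.sample(rng) for name, domain in contract.items()}


def satisfies(contract, arguments):
    """Check every argument against the domain of its parameter."""
    return all(contract[name].accepts(value) for name, value in arguments.items())


# Hypothetical precondition for a function discount(price, quantity, member).
contract = {
    "price": BoundFloat(0.0, 1000.0),
    "quantity": BoundInteger(1, 50),
    "member": Boolean(),
}
rng = random.Random(42)
for _ in range(3):
    arguments = generate_arguments(contract, rng)
    assert satisfies(contract, arguments)
    print(arguments)
```

Using one object for both checking and sampling means that generated test inputs satisfy the precondition by construction, and the same contract can then validate values at run time.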
