1

A Framework for Data Quality for Synthetic Information

Gupta, Ragini 24 July 2014 (has links)
Data quality has been an area of increasing interest for researchers in recent years due to the rapid emergence of 'big data' processes and applications. In this work, the data quality problem is viewed from the standpoint of synthetic information. The structure and complexity of synthetic data motivate a data quality framework specific to it. This thesis presents such a framework, along with implementation details and the results of applying the testing framework to a large synthetic dataset. A formal conceptual framework was designed for assessing the data quality of synthetic information. It comprises analytical methods and software for assessing data quality for synthetic information, and includes dimensions of data quality that check the inherent properties of the data as well as evaluate it in the context of its use. The framework is implemented as software and is designed with scalability, generality, integrability, and modularity in mind. A data abstraction layer has been introduced between the synthetic data and the tests. This abstraction layer has multiple benefits over having the tests access the data directly: it decouples the tests from the data so that the details of storage and implementation are hidden from the user. We have implemented data quality measures for several quality dimensions: accuracy and precision, reliability, completeness, consistency, and validity. The particular tests and quality measures implemented span a range from low-level syntactic checks to high-level semantic quality measures. In each case, in addition to the results of the quality measure itself, we also present results on the computational performance (scalability) of the measure. / Master of Science
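As a rough illustration of the abstraction-layer idea, the sketch below (with hypothetical names such as `SyntheticDataSource` and illustrative checks, not the thesis's actual API) runs completeness and validity measures without the tests knowing how the data are stored:

```python
import csv
from typing import Dict, Iterable, List


class SyntheticDataSource:
    """Hypothetical abstraction layer: tests see records as dictionaries and
    never touch files or databases directly (names here are illustrative)."""

    def __init__(self, path: str):
        self.path = path

    def rows(self) -> Iterable[Dict[str, str]]:
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)


def completeness(source: SyntheticDataSource, field: str) -> float:
    """Completeness dimension: fraction of records with a non-empty `field`."""
    total = filled = 0
    for row in source.rows():
        total += 1
        filled += bool((row.get(field) or "").strip())
    return filled / total if total else 0.0


def validity(source: SyntheticDataSource, field: str, allowed: List[str]) -> float:
    """Validity dimension: fraction of records whose `field` is in an allowed domain."""
    total = ok = 0
    for row in source.rows():
        total += 1
        ok += row.get(field) in allowed
    return ok / total if total else 0.0


# Usage, assuming a synthetic population file with 'age' and 'sex' columns:
# src = SyntheticDataSource("synthetic_population.csv")
# print(completeness(src, "age"), validity(src, "sex", ["M", "F"]))
```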
2

Automatické generování umělých XML dokumentů / Automatic Generation of Synthetic XML Documents

Betík, Roman January 2015 (has links)
The aim of this thesis is to research the current possibilities and limitations of automatic generation of synthetic XML documents. The first part of the work discusses the properties of the most widely used XML data generators and compares them to each other. The next part of the thesis proposes an algorithm for XML data generation that focuses on a subset of the main XML data characteristics (number of elements, number of attributes, fan-out, mixed content, etc.). The main goal of the algorithm is to generate XML documents from a set of settings that are easy to understand. The last part of the work compares the proposed solution with existing ones, focusing on how easily XML documents can be generated, what structures can be created, and how the properties of similar XML data produced by different generators compare.
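A minimal sketch of settings-driven XML generation along these lines (the setting names and values are illustrative, not the thesis's configuration format) might look like:

```python
import random
import xml.etree.ElementTree as ET

# Illustrative settings; the thesis's generator exposes a richer, user-friendly
# set of characteristics (mixed content, element distributions, etc.).
settings = {"max_depth": 3, "fan_out": 4, "max_attributes": 2}


def generate_element(name: str, depth: int) -> ET.Element:
    elem = ET.Element(name)
    # attach a random number of attributes up to the configured maximum
    for i in range(random.randint(0, settings["max_attributes"])):
        elem.set(f"attr{i}", str(random.randint(0, 99)))
    if depth < settings["max_depth"]:
        # recurse to children, bounded by the configured fan-out
        for i in range(random.randint(1, settings["fan_out"])):
            elem.append(generate_element(f"child{i}", depth + 1))
    else:
        elem.text = "leaf value"
    return elem


root = generate_element("root", 0)
print(ET.tostring(root, encoding="unicode")[:200])
```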
3

Automatické generování umělých XML dokumentů / Automatic Generation of Synthetic XML Documents

Betík, Roman January 2013 (has links)
The aim of this thesis is to research the current possibilities and limitations of automatic generation of synthetic XML documents. The first part of the work discusses the properties of the most widely used XML data generators and compares them to each other. The next part of the thesis proposes an algorithm for XML data generation that focuses on a subset of the main XML data characteristics (number of elements, number of attributes, fan-out, mixed content, etc.). The main goal of the algorithm is to generate XML documents from a set of settings that are easy to understand. The last part of the work compares the proposed solution with existing ones, focusing on how easily XML documents can be generated, what structures can be created, and how the properties of similar XML data produced by different generators compare.
4

Modelling and simulation of dynamic contrast-enhanced MRI of abdominal tumours

Banerji, Anita January 2012 (has links)
Dynamic contrast-enhanced (DCE) time series analysis techniques are hard to fully validate quantitatively as ground truth microvascular parameters are difficult to obtain from patient data. This thesis presents a software application for generating synthetic image data from known ground truth tracer kinetic model parameters. As an object oriented design has been employed to maximise flexibility and extensibility, the application can be extended to include different vascular input functions, tracer kinetic models and imaging modalities. Data sets can be generated for different anatomical and motion descriptions as well as different ground truth parameters. The application has been used to generate a synthetic DCE-MRI time series of a liver tumour with non-linear motion of the abdominal organs due to breathing. The utility of the synthetic data has been demonstrated in several applications: the development of an Akaike model selection technique for assessing the spatially varying characteristics of liver tumours; the assessment of the robustness of model fitting and model selection to noise, partial volume effects and breathing motion in liver tumours; and the evaluation of the benefit of using model-driven registration to compensate for breathing motion. When applied to synthetic data with appropriate noise levels, the Akaike model selection technique can distinguish between the single-input extended Kety model for tumour and the dual-input Materne model for liver, and is robust to motion. A significant difference between median Akaike probability value in tumour and liver regions is also seen in 5/6 acquired data sets, with the extended Kety model selected for tumour. Knowledge of the ground truth distribution for the synthetic data was used to demonstrate that, whilst median Ktrans does not change significantly due to breathing motion, model-driven registration restored the structure of the Ktrans histogram and so could be beneficial to tumour heterogeneity assessments.
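The model-selection step can be illustrated with a simplified sketch that compares two toy enhancement-curve models by AIC and Akaike weights; the thesis itself fits the extended Kety and dual-input Materne models with measured vascular input functions, which are not reproduced here:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

# Toy single-exponential vs exponential-plus-linear enhancement models,
# standing in for the actual tracer kinetic models.
def model_a(t, k1, k2):
    return k1 * (1 - np.exp(-k2 * t))

def model_b(t, k1, k2, k3):
    return k1 * (1 - np.exp(-k2 * t)) + k3 * t

def aic(y, y_fit, n_params):
    rss = np.sum((y - y_fit) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * n_params

t = np.linspace(0.1, 5.0, 50)
y = model_a(t, 1.2, 0.8) + rng.normal(0, 0.02, t.size)  # synthetic noisy curve

scores = {}
for name, model, p0 in [("A", model_a, [1.0, 1.0]), ("B", model_b, [1.0, 1.0, 0.0])]:
    popt, _ = curve_fit(model, t, y, p0=p0)
    scores[name] = aic(y, model(t, *popt), len(popt))

# Akaike weights: relative support for each model given the data.
delta = {k: v - min(scores.values()) for k, v in scores.items()}
w = {k: np.exp(-d / 2.0) for k, d in delta.items()}
print({k: v / sum(w.values()) for k, v in w.items()})
```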
5

Privacy-Preserving Synthetic Medical Data Generation with Deep Learning

Torfi, Amirsina 26 August 2020 (has links)
Deep learning models demonstrated good performance in various domains such as ComputerVision and Natural Language Processing. However, the utilization of data-driven methods in healthcare raises privacy concerns, which creates limitations for collaborative research. A remedy to this problem is to generate and employ synthetic data to address privacy concerns. Existing methods for artificial data generation suffer from different limitations, such as being bound to particular use cases. Furthermore, their generalizability to real-world problems is controversial regarding the uncertainties in defining and measuring key realistic characteristics. Hence, there is a need to establish insightful metrics (and to measure the validity of synthetic data), as well as quantitative criteria regarding privacy restrictions. We propose the use of Generative Adversarial Networks to help satisfy requirements for realistic characteristics and acceptable values of privacy metrics, simultaneously. The present study makes several unique contributions to synthetic data generation in the healthcare domain. First, we propose a novel domain-agnostic metric to evaluate the quality of synthetic data. Second, by utilizing 1-D Convolutional Neural Networks, we devise a new approach to capturing the correlation between adjacent diagnosis records. Third, we employ ConvolutionalAutoencoders for creating a robust and compact feature space to handle the mixture of discrete and continuous data. Finally, we devise a privacy-preserving framework that enforcesRényi differential privacy as a new notion of differential privacy. / Doctor of Philosophy / Computers programs have been widely used for clinical diagnosis but are often designed with assumptions limiting their scalability and interoperability. The recent proliferation of abundant health data, significant increases in computer processing power, and superior performance of data-driven methods enable a trending paradigm shift in healthcare technology. This involves the adoption of artificial intelligence methods, such as deep learning, to improve healthcare knowledge and practice. Despite the success in using deep learning in many different domains, in the healthcare field, privacy challenges make collaborative research difficult, as working with data-driven methods may jeopardize patients' privacy. To overcome these challenges, researchers propose to generate and utilize realistic synthetic data that can be used instead of real private data. Existing methods for artificial data generation are limited by being bound to special use cases. Furthermore, their generalizability to real-world problems is questionable. There is a need to establish valid synthetic data that overcomes privacy restrictions and functions as a real-world analog for healthcare deep learning data training. We propose the use of Generative Adversarial Networks to simultaneously overcome the realism and privacy challenges associated with healthcare data.
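For orientation, a minimal adversarial-training sketch on tabular records is shown below; it omits the thesis's 1-D convolutional networks, autoencoder feature space, and Rényi differential privacy mechanism, and the layer sizes and data are arbitrary placeholders:

```python
import torch
import torch.nn as nn

n_features, noise_dim, batch = 20, 16, 128

# Generator maps noise to synthetic records; discriminator scores realism.
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

real = torch.randn(batch, n_features)  # stand-in for a batch of real records

for step in range(200):
    # Discriminator update: real records -> 1, generated records -> 0
    fake = G(torch.randn(batch, noise_dim)).detach()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator update: push D to label generated records as real
    fake = G(torch.randn(batch, noise_dim))
    loss_g = bce(D(fake), torch.ones(batch, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```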
6

Geophone Array Optimization for Monitoring Geologic Carbon Sequestration using Double-Difference Tomography

Fahrman, Benjamin Paul 13 January 2012 (has links)
Analysis of synthetic data was performed to determine the most cost-effective tomographic monitoring system for a geologic carbon sequestration injection site. Artificial velocity models were created that accounted for the expected velocity decrease due to the existence of a CO₂ plume after underground injection into a depleted petroleum reservoir. Seismic events were created to represent induced seismicity from injection, and five different geophone arrays were created to monitor this artificial seismicity. Double-difference tomographic inversion was performed on 125 synthetic data sets: five stages of CO₂ plume growth, five seismic event regions, and five geophone arrays. Each resulting velocity model from tomoDD—the double-difference tomography program used for inversion—was compared quantitatively to its respective synthetic velocity model to determine an accuracy value. The quantitative results were examined in an attempt to determine a relationship between cost and accuracy in monitoring, verification, and accounting applications using double-difference tomography. While all scenarios resulted in little error, no such relationship could be found. The lack of a relationship between cost and error is most likely due to error inherent to the travel time calculation algorithm used. / Master of Science
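The accuracy scoring described here can be sketched as a simple grid-to-grid comparison; the "recovered" model below is a random stand-in for tomoDD output rather than an actual inversion, and the grid sizes and velocities are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground-truth velocity model with a low-velocity CO2 plume region.
truth = np.full((20, 20, 10), 4000.0)              # background P-wave velocity, m/s
truth[8:12, 8:12, 4:6] -= 200.0                    # velocity decrease from the plume

recovered = truth + rng.normal(0.0, 50.0, truth.shape)  # stand-in for tomoDD output

# Quantitative accuracy values comparable across geophone-array scenarios.
rms_error = np.sqrt(np.mean((recovered - truth) ** 2))
plume_bias = np.mean(recovered[8:12, 8:12, 4:6] - truth[8:12, 8:12, 4:6])
print(f"RMS error: {rms_error:.1f} m/s, mean plume bias: {plume_bias:.1f} m/s")
```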
7

Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies

Paiva, Thais Viana January 2014 (has links)
This thesis presents methods for multiple imputation that can be applied to missing data and data with confidential variables. Imputation is useful for missing data because it results in a data set that can be analyzed with complete data statistical methods. The missing data are filled in by values generated from a model fit to the observed data. The model specification will depend on the observed data pattern and the missing data mechanism. For example, when the reason why the data is missing is related to the outcome of interest, that is nonignorable missingness, we need to alter the model fit to the observed data to generate the imputed values from a different distribution. Imputation is also used for generating synthetic values for data sets with disclosure restrictions. Since the synthetic values are not actual observations, they can be released for statistical analysis. The interest is in fitting a model that approximates well the relationships in the original data, keeping the utility of the synthetic data, while preserving the confidentiality of the original data. We consider applications of these methods to data from social sciences and epidemiology.
The first method is for imputation of multivariate continuous data with nonignorable missingness. Regular imputation methods have been used to deal with nonresponse in several types of survey data. However, in some of these studies, the assumption of missing at random is not valid since the probability of missing depends on the response variable. We propose an imputation method for multivariate data sets when there is nonignorable missingness. We fit a truncated Dirichlet process mixture of multivariate normals to the observed data under a Bayesian framework to provide flexibility. With the posterior samples from the mixture model, an analyst can alter the estimated distribution to obtain imputed data under different scenarios. To facilitate that, I developed an R application that allows the user to alter the values of the mixture parameters and visualize the imputation results automatically. I demonstrate this process of sensitivity analysis with an application to the Colombian Annual Manufacturing Survey. I also include a simulation study to show that the correct complete data distribution can be recovered if the true missing data mechanism is known, thus validating that the method can be meaningfully interpreted to do sensitivity analysis.
The second method uses the imputation techniques for nonignorable missingness to implement a procedure for adaptive design in surveys. Specifically, I develop a procedure that agencies can use to evaluate whether or not it is effective to stop data collection. This decision is based on utility measures to compare the data collected so far with potential follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios considered by the analyst. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.
The third method is for imputation of confidential data sets with spatial locations using disease mapping models. We consider data that include fine geographic information, such as census tract or street block identifiers. This type of data can be difficult to release as public use files, since fine geography provides information that ill-intentioned data users can use to identify individuals. We propose to release data with simulated geographies, so as to enable spatial analyses while reducing disclosure risks. We fit disease mapping models that predict areal-level counts from attributes in the file, and sample new locations based on the estimated models. I illustrate this approach using data on causes of death in North Carolina, including evaluations of the disclosure risks and analytic validity that can result from releasing synthetic geographies. / Dissertation
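A simplified stand-in for the first method is sketched below, using scikit-learn's Dirichlet-process mixture on a toy data set; the hypothetical `shift` parameter crudely mimics altering the estimated distribution under a nonignorable scenario, and none of this reproduces the thesis's actual Bayesian model or R application:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)

# Toy data: the third variable is missing for roughly 20% of rows.
X = rng.multivariate_normal([0, 0, 0],
                            [[1.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.0]], 500)
missing = rng.random(500) < 0.2
X_obs = X.copy()
X_obs[missing, 2] = np.nan

# Fit a (truncated) Dirichlet-process mixture of normals to the complete cases.
gm = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type="dirichlet_process",
    covariance_type="full", random_state=0,
).fit(X_obs[~missing])

def impute_row(x, shift=0.0):
    """Sample missing entries from the conditional normal of the most
    responsible component; `shift` mimics a nonignorable-missingness scenario."""
    o, m = ~np.isnan(x), np.isnan(x)
    logp = [np.log(w) + multivariate_normal.logpdf(x[o], mu[o], S[np.ix_(o, o)])
            for w, mu, S in zip(gm.weights_, gm.means_, gm.covariances_)]
    k = int(np.argmax(logp))
    mu, S = gm.means_[k], gm.covariances_[k]
    cond_mu = mu[m] + S[np.ix_(m, o)] @ np.linalg.solve(S[np.ix_(o, o)], x[o] - mu[o])
    cond_S = S[np.ix_(m, m)] - S[np.ix_(m, o)] @ np.linalg.solve(S[np.ix_(o, o)], S[np.ix_(o, m)])
    out = x.copy()
    out[m] = rng.multivariate_normal(cond_mu + shift, cond_S)
    return out

imputed = np.array([impute_row(x, shift=0.5) for x in X_obs[missing]])
print(imputed[:3])
```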
8

Automatické generování umělých XML dokumentů / Automatic Generation of Synthetic XML Documents

Betík, Roman January 2015 (has links)
The aim of this thesis is to research the current possibilities and limitations of automatic generation of synthetic XML and JSON documents used in the area of Big Data. The first part of the work discusses the properties of the most widely used XML, Big Data, and JSON data generators and compares them. The next part of the thesis proposes an algorithm for generating semistructured data. The main focus of the algorithm is on parallel execution of the generation process while preserving the ability to control the contents of the generated documents. The data generator can also use samples of real data when generating synthetic data and is capable of automatically creating simple references between JSON documents. The last part of the thesis presents the results of experiments in which the data generator was used to test the MongoDB database, describes its added value, and compares it to other solutions.
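A toy sketch of parallel JSON document generation with simple id-based references is shown below; the document schema and settings are illustrative and do not reflect the generator's actual configuration or its use of real data samples:

```python
import json
import random
from multiprocessing import Pool

# Hypothetical document schema for illustration only.
def make_document(doc_id: int) -> str:
    doc = {
        "_id": doc_id,
        "name": f"item-{doc_id}",
        "value": random.randint(0, 1000),
        # simple cross-document reference to a lower-numbered document id
        "ref": random.randrange(doc_id) if doc_id else None,
    }
    return json.dumps(doc)

if __name__ == "__main__":
    # generate documents in parallel worker processes
    with Pool(processes=4) as pool:
        docs = pool.map(make_document, range(10_000))
    print(docs[0])
    print(docs[-1])
```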
9

Exploring the Behaviour of the Hidden Markov Model on CpG Island Prediction

2013 April 1900 (has links)
DNA can be represented abstractly as a language with only four nucleotides represented by the letters A, C, G, and T, yet the arrangement of those four letters plays a major role in determining the development of an organism. Understanding the significance of certain arrangements of nucleotides can unlock the secrets of how the genome achieves its essential functionality. Regions of DNA particularly enriched with cytosine (C nucleotides) and guanine (G nucleotides), especially the CpG di-nucleotide, are frequently associated with biological function related to gene expression, and concentrations of CpGs referred to as "CpG islands" are known to collocate with regions upstream from gene coding sequences within the promoter region. The pattern of occurrence of these nucleotides, relative to adenine (A nucleotides) and thymine (T nucleotides), lends itself to analysis by machine-learning techniques such as Hidden Markov Models (HMMs) to predict the areas of greater enrichment. HMMs have been applied to CpG island prediction before, but often without an awareness of how the outcomes are affected by the manner in which the HMM is applied. Two main findings of this study are: 1. The outcome of an HMM is highly sensitive to the setting of the initial probability estimates. 2. Without the appropriate software techniques, HMMs cannot be applied effectively to large data such as whole eukaryotic chromosomes. Both of these factors are rarely considered by users of HMMs, but are critical to a successful application of HMMs to large DNA sequences. In fact, these shortcomings were discovered through a close examination of published results of CpG island prediction using HMMs, and without being addressed, can lead to an incorrect implementation and application of HMM theory. A first-order HMM is developed and its performance compared to two other historical methods, the Takai and Jones method and the UCSC method from the University of California Santa Cruz. The HMM is then extended to a second order to acknowledge that pairs of nucleotides define CpG islands rather than single nucleotides alone, and the second-order HMM is evaluated in comparison to the other methods. The UCSC method is found to be based on properties that are not related to CpG islands, and thus is not a fair comparison to the other methods. Of the other methods, the first-order HMM method and the Takai and Jones method are comparable in the tests conducted, but the second-order HMM method demonstrates superior predictive capabilities. However, these results are valid only when taking into consideration the highly sensitive outcomes based on initial estimates, and finding a suitable set of estimates that provide the most appropriate results. The first-order HMM is applied to the problem of producing synthetic data that simulates the characteristics of a DNA sequence, including the specified presence of CpG islands, based on the model parameters of a trained HMM. HMM analysis is applied to the synthetic data to explore its fidelity in generating data with similar characteristics, as well as to validate the predictive ability of an HMM. Although this test fails to meet expectations, a second test using a second-order HMM to produce simulated DNA data using frequency distributions of CpG island profiles exhibits highly accurate predictions of the pre-specified CpG islands, confirming that when the synthetic data are appropriately structured, an HMM can be an accurate predictive tool.
One outcome of this thesis is a set of software components (CpGID 2.0 and TrackMap) capable of efficient and accurate application of an HMM to genomic sequences, together with visualization that allows quantitative CpG island results to be viewed in conjunction with other genomic data. CpGID 2.0 is an adaptation of a previously published software component that has been extensively revised, and TrackMap is a companion product that works with the results produced by the CpGID 2.0 program. Executing these components allows one to monitor output aspects of the computational model such as the number and size of the predicted CpG islands, including their CG content percentage and level of CpG frequency. These outcomes can then be related to the input values used to parameterize the HMM.
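A toy two-state Viterbi decoder over a DNA string, with hand-picked rather than trained parameters, illustrates the kind of first-order HMM decoding the thesis builds on (the probabilities below are assumptions for illustration, and the thesis shows how sensitive real results are to such initial estimates):

```python
import numpy as np

states = ["island", "background"]
symbols = {"A": 0, "C": 1, "G": 2, "T": 3}

log_start = np.log([0.5, 0.5])
log_trans = np.log([[0.99, 0.01],      # island -> island / background
                    [0.01, 0.99]])     # background -> island / background
log_emit = np.log([[0.15, 0.35, 0.35, 0.15],   # island state is CG-rich
                   [0.30, 0.20, 0.20, 0.30]])  # background state is AT-rich

def viterbi(seq: str) -> list:
    """Most likely state path for a DNA string under the toy two-state HMM."""
    obs = [symbols[c] for c in seq]
    n, k = len(obs), len(states)
    dp = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    dp[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n):
        for j in range(k):
            scores = dp[t - 1] + log_trans[:, j]
            back[t, j] = np.argmax(scores)
            dp[t, j] = scores[back[t, j]] + log_emit[j, obs[t]]
    path = [int(np.argmax(dp[-1]))]
    for t in range(n - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [states[s] for s in reversed(path)]

print(viterbi("ATATATCGCGCGCGCGATATAT"))
```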
10

An Inverse Finite Element Approach for Identifying Forces in Biological Tissues

Cranston, Graham January 2009 (has links)
For centuries physicians, scientists, engineers, mathematicians, and many others have been asking: 'what are the forces that drive tissues in an embryo to their final geometric forms?' At the tissue and whole embryo level, a multitude of very different morphogenetic processes, such as gastrulation and neurulation, are involved. However, at the cellular level, virtually all of these processes are evidently driven by a relatively small number of internal structures all of whose forces can be resolved into equivalent interfacial tensions γ. Measuring the cell-level forces that drive specific morphogenetic events remains one of the great unsolved problems of biomechanics. Here I present a novel approach that allows these forces to be estimated from time lapse images. In this approach, the motions of all visible triple junctions formed between trios of cells adjacent to each other in epithelia (2D cell sheets) are tracked in time-lapse images. An existing cell-based Finite Element (FE) model is then used to calculate the viscous forces needed to deform each cell in the observed way. A recursive least squares technique with variable forgetting factors is then used to estimate the interfacial tensions that would have to be present along each cell-cell interface to provide those forces, along with the attendant pressures in each cell. The algorithm is tested extensively using synthetic data from an FE model. Emphasis is placed on features likely to be encountered in data from live tissues during morphogenesis and wound healing. Those features include algorithm stability and tracking despite input noise, interfacial tensions that could change slowly or suddenly, and complications from imaging small regions of a larger epithelial tissue (the frayed boundary problem). Although the basic algorithm is highly sensitive to input noise due to the ill-conditioned nature of the system of equations that must be solved to obtain the interfacial tensions, methods are introduced to improve the resulting force and pressure estimates. The final algorithm returns very good estimates for interfacial tensions and intracellular pressures when used with synthetic data, and it holds great promise for calculating the forces that remodel live tissue.
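The estimation step can be illustrated with a basic recursive least squares routine using a constant forgetting factor; the thesis's variant uses variable forgetting factors and observation rows built from the cell-based FE force balance, neither of which is shown here, and the data below are purely illustrative:

```python
import numpy as np

def rls(H, y, lam=0.98, delta=1e3):
    """Estimate x in y_t = h_t . x + noise, discounting old data by `lam`."""
    n = H.shape[1]
    x = np.zeros(n)
    P = delta * np.eye(n)                     # large initial covariance = weak prior
    for h_t, y_t in zip(H, y):
        k = P @ h_t / (lam + h_t @ P @ h_t)   # gain vector
        x = x + k * (y_t - h_t @ x)           # correct estimate with the new residual
        P = (P - np.outer(k, h_t) @ P) / lam  # discounted covariance update
    return x

rng = np.random.default_rng(2)
true_x = np.array([1.0, -0.5, 2.0])           # e.g. three interfacial tensions
H = rng.normal(size=(500, 3))                 # illustrative observation rows
y = H @ true_x + rng.normal(0, 0.05, 500)
print(rls(H, y))                              # estimate should be close to true_x
```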
