  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.

Security of genetic databases

Giggins, Helen January 2009 (has links)
Research Doctorate - Doctor of Philosophy (PhD) / The rapid pace of growth in the field of human genetics has left researchers with many new challenges in the area of security and privacy. To encourage participation and foster trust towards research, it is important to ensure that genetic databases are adequately protected. This task is a particularly challenging one for statistical agencies due to the high prevalence of categorical data contained within statistical genetic databases. The absence of natural ordering makes the application of traditional Statistical Disclosure Control (SDC) methods less straightforward, which is why we have proposed a new noise addition technique for categorical values. The main contributions of the thesis are as follows. We provide a comprehensive analysis of the trust relationships that occur between the different stakeholders in a genetic data warehouse system. We also provide a quantifiable model of trust that allows the database manager to granulate the level of protection based on the amount of trust that exists between the stakeholders. To the best of our knowledge, this is the first time that trust has been applied in the SDC context. We propose a privacy protection framework for genetic databases which is designed to deal with the fact that genetic data warehouses typically contain a high proportion of categorical data. The framework includes the use of a clustering technique which allows for the easier application of traditional noise addition techniques for categorical values. Another important contribution of this thesis is a new similarity measure for categorical values, which aims to capture not only the direct similarity between values, but also some sense of transitive similarity. This novel measure also has possible applications in providing a way of ordering categorical values, so that more traditional SDC methods can be more easily applied to them. 
Our analysis of experimental results also points to a numerical attribute phenomenon, whereby similarity is typically high between numerical values that are close together and decreases as the absolute difference between values increases. However, some numerical attributes do not behave in a strictly 'numerical' way: values that are close together numerically do not always appear very similar. We also provide a novel noise addition technique for categorical values, which employs our similarity measure to partition the values in the data set. Our method, VICUS, then perturbs the original microdata file so that each value is more likely to be changed to another value in the same partition than to one from a different partition. The technique helps to ensure that the perturbed microdata file retains data quality while also preserving the privacy of individual records.
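The partition-then-perturb idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual VICUS algorithm: the partitions are hand-picked stand-ins for clusters that the thesis's similarity measure would produce, and `perturb_categorical`, `p_within`, and the blood-type values are hypothetical names and data chosen for the example.

```python
import random

def perturb_categorical(values, partitions, p_within=0.8, seed=0):
    """Perturb each categorical value, preferring replacements drawn
    from the same partition (a cluster of mutually similar values)."""
    rng = random.Random(seed)
    # Map each value to the partition (set of similar values) it belongs to.
    part_of = {v: p for p in partitions for v in p}
    domain = [v for p in partitions for v in p]
    out = []
    for v in values:
        same = list(part_of[v])                          # similar candidates
        other = [u for u in domain if u not in part_of[v]]
        if other and rng.random() > p_within:
            out.append(rng.choice(other))   # occasional cross-partition change
        else:
            out.append(rng.choice(same))    # usually stay within the partition
    return out

# Hypothetical blood-type attribute clustered into two similarity partitions.
partitions = [{"A", "AB"}, {"B", "O"}]
print(perturb_categorical(["A", "O", "AB", "B", "A"], partitions))
```

Raising `p_within` trades privacy for utility: perturbed values stay closer (in the similarity sense) to the originals, which is the data-quality property the abstract describes.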

Population-Based Ant Colony Optimization for Multivariate Microaggregation

Askut, Ann Ahu 01 January 2013 (has links)
Numerous organizations collect and distribute non-aggregate personal data for a variety of purposes, including demographic and public health research. In these situations, the data distributor is responsible for protecting the anonymity and personal information of individuals. Microaggregation is one of the most commonly used statistical disclosure control methods. In microaggregation, the set of original records is first partitioned into several groups of similar records, each containing at least k records, and every record is then replaced by the mean value (centroid) of its group. The confidentiality of records is protected by ensuring that each group has at least k records, so that each record is indistinguishable from at least k-1 other records in the microaggregated dataset. The goal of this process is to keep within-group homogeneity high and information loss low, where information loss is the sum of squared deviations between the actual records and the group centroids. Several heuristics have been proposed for the NP-hard minimum-information-loss microaggregation problem. Among the most promising is the multivariate Hansen-Mukherjee (MHM) algorithm, which uses a shortest-path algorithm to identify the best partition consistent with a specified ordering of records. Developing improved heuristics for ordering multivariate points for microaggregation remains an open research challenge. This dissertation adapts a version of the population-based ant colony optimization algorithm (PACO) to order records, and the MHM algorithm is then applied iteratively to improve the quality of the grouping. Results of computational experiments using benchmark test problems indicate that the PACO/MHM-based microaggregation algorithm yields information loss comparable to or lower than that obtained by extant methods.
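The grouping-and-centroid step can be illustrated with a small sketch. This is not the MHM shortest-path algorithm itself: it simply cuts an already-ordered list of records into consecutive groups of at least k, which is the kind of partition MHM searches over, and reports the resulting information loss. `microaggregate` and its parameters are names invented for the example.

```python
import numpy as np

def microaggregate(records, order, k=3):
    """Partition records (taken in the given order) into consecutive groups
    of size >= k, replace each record by its group centroid, and report the
    information loss (sum of squared deviations from the centroids)."""
    X = np.asarray(records, dtype=float)[list(order)]
    n = len(X)
    # Cut every k records; fold a short tail (< k) into the last group.
    starts = list(range(0, n, k))
    if n - starts[-1] < k and len(starts) > 1:
        starts.pop()
    bounds = starts + [n]
    out = np.empty_like(X)
    loss = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        centroid = X[a:b].mean(axis=0)
        out[a:b] = centroid                       # k-anonymous replacement
        loss += float(((X[a:b] - centroid) ** 2).sum())
    return out, loss
```

Because every record is replaced by a centroid shared with at least k-1 others, the output satisfies the indistinguishability property described above; the ordering heuristic (the PACO contribution of the dissertation) is what determines how low the loss can go.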

A Heuristic Evolutionary Method for the Complementary Cell Suppression Problem

Herrington, Hira B. 04 February 2015 (has links)
Cell suppression is a common method for disclosure avoidance used to protect sensitive information in two-dimensional tables where row and column totals are published along with non-sensitive data. In tables with only positive cell values, cell suppression has been demonstrated to be NP-hard, so finding more efficient methods for producing low-cost solutions is an area of active research. Genetic algorithms (GAs) have been shown to be effective at finding good solutions to the cell suppression problem; however, they have the shortcoming that they tend to produce a large proportion of infeasible solutions. The primary goal of this research was to develop a GA that produces low-cost solutions with fewer infeasible solutions created at each generation than previous methods, without introducing excessive CPU runtime costs. This involved designing selection and replacement operations that maintain genetic diversity during the evolution process. The GA's performance was tested using tables containing 10,000 and 100,000 cells. The primary criteria for evaluating the effectiveness of the GA were the total cost of the complementary suppressions and the CPU runtime. Experimental results indicate that the GA-based method developed in this dissertation produced better-quality solutions than those produced by extant heuristics; because existing heuristics are very effective, the GA-based method was able to surpass them only modestly. Existing evolutionary methods have also been used to improve upon the quality of solutions produced by heuristics. Experimental results show that the GA-based method developed in this dissertation is computationally more efficient than GA-based methods proposed in the literature. 
This is attributed to the fact that the specialized genetic operators designed in this study produce fewer infeasible solutions. The results of these experiments suggest the need for continued research into non-probabilistic methods for seeding the initial populations; selection and replacement strategies that factor in genetic diversity at the level of the circuits protecting sensitive cells; solution-preserving crossover and mutation operators; and the use of cost-benefit ratios to determine program termination.
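A GA for this problem needs a fitness function that prices a candidate suppression pattern and penalizes infeasible ones. The sketch below uses a deliberately simplified feasibility test (every row and column touching a suppressed cell must contain at least two suppressions, so no single cell can be recovered by subtraction from the published totals); real complementary suppression additionally requires interval protection, typically verified with linear programming over the circuits protecting sensitive cells. `suppression_cost` is a hypothetical name, not from the dissertation.

```python
def suppression_cost(table, suppressed):
    """Fitness of a candidate suppression pattern for a 2-D table with
    published row/column totals. Infeasible patterns (some row or column
    has exactly one suppression, so its value is recoverable by
    subtraction) get infinite cost, mirroring a GA's penalty; otherwise
    the cost is the total value of the suppressed cells."""
    rows, cols = {}, {}
    for (r, c) in suppressed:
        rows[r] = rows.get(r, 0) + 1
        cols[c] = cols.get(c, 0) + 1
    if any(n < 2 for n in rows.values()) or any(n < 2 for n in cols.values()):
        return float("inf")
    return sum(table[r][c] for (r, c) in suppressed)

table = [[5, 3, 2],
         [1, 4, 6],
         [7, 2, 9]]
# Suppressing only the sensitive cell (0, 0) is infeasible; adding the
# complementary cells (0, 1), (1, 0), (1, 1) makes the pattern feasible.
print(suppression_cost(table, {(0, 0)}))
print(suppression_cost(table, {(0, 0), (0, 1), (1, 0), (1, 1)}))
```

Genetic operators that keep offspring inside the feasible region (as the dissertation's specialized operators aim to do) avoid wasting evaluations on patterns that this function would reject outright.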

Syntetisering av tabulär data: En systematisk litteraturstudie om verktyg för att skapa syntetiska dataset [Synthesis of tabular data: a systematic literature review of tools for creating synthetic datasets]

Allergren, Erik, Hildebrand, Clara January 2023 (has links)
In recent years, demand for large amounts of data to train machine learning algorithms has increased. The algorithms can be used to address societal questions and challenges large and small. One way to meet the demand is to generate synthetic data that preserves the statistical values and characteristics of real data. Synthetic data makes it possible to obtain large amounts of data, and it also minimizes the risk of disclosing personal information, so that data can be made available for research without revealing identities. The overall aim of this study was to examine and compile which tools for synthesizing tabular data are described in scientific publications in English. The study was conducted by following the eight steps of a systematic literature review, with clearly defined criteria for which articles to include or exclude. The primary requirements were that the described tools exist as code or programs, not only in theory, and that they are general and applicable to different tabular datasets; a tool could not work only for, or be tailored to, a specific dataset or situation. 
The tools described in the articles remaining after the search, and thus represented in the results, are (a) Synthpop, a tool developed in a project for the UK Longitudinal Studies to handle sensitive data and personal information; (b) Gretel, a commercial and open-source tool created to meet the growing need for training data; (c) UniformGAN, a new variant of GAN (Generative Adversarial Network) that generates synthetic tabular datasets while ensuring privacy; and (d) Synthia, an open-source Python package made to generate synthetic data with one or more variables, i.e. univariate and multivariate data. The results showed that the tools use different methods and models to produce synthetic data and differ in their degree of accessibility. Gretel stands out most among the tools, as it is more commercial, offers more services, and makes it possible to generate synthetic data without strong programming skills.
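The core idea the surveyed tools share (fit a model to real data and sample from it, so that summary statistics survive while no real record is released) can be sketched for numeric tabular data with a simple multivariate normal fit. This is an illustration only, not the method used by Synthpop, Gretel, UniformGAN, or Synthia; `synthesize` is a hypothetical name.

```python
import numpy as np

def synthesize(real, n_samples, seed=0):
    """Fit a multivariate normal to the real data's column means and
    covariance matrix, then draw synthetic records from it. Means and
    covariances are approximately preserved; no real record is copied."""
    real = np.asarray(real, dtype=float)
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)
```

With enough samples, the synthetic columns track the real means and correlations, which is the "preserves statistical values and characteristics" property the abstract describes; the surveyed tools use richer models (e.g. GANs in UniformGAN) to capture non-Gaussian and categorical structure.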
