81

Privacy preserving software engineering for data driven development

Tongay, Karan Naresh 14 December 2020 (has links)
The exponential rise in the generation of data has introduced many new areas of research, including data science, data engineering, machine learning, and artificial intelligence, to name a few. It has become important for any industry or organization to precisely understand and analyze data in order to extract value from it. The value of data can only be realized when it is put into practice in the real world, and the most common way to do this in the technology industry is through software engineering. This brings into the picture the area of privacy-oriented software engineering, and with it the rise of data protection regulations such as GDPR (General Data Protection Regulation), PDPA (Personal Data Protection Act), etc. Many organizations, governments and companies that have accumulated huge amounts of data over time may conveniently use the data to increase business value, but at the same time the privacy aspects associated with the sensitivity of data, especially people's personal information, can easily be circumvented while designing a software engineering model for these types of applications. Even before the software engineering phase of any data processing application, there can oftentimes be one or many data sharing agreements or privacy policies in place. Every organization may have its own way of maintaining data privacy practices for data-driven development. There is a need to generalize or categorize these approaches into tactics that can be referred to by other practitioners who are trying to integrate data privacy practices into their development. This qualitative study provides an understanding of various approaches and tactics that are being practised within the industry for privacy-preserving data science in software engineering, and discusses a tool for data usage monitoring to identify unethical data access. 
Finally, we studied strategies for secure data publishing and conducted experiments using sample data to demonstrate how these techniques can be helpful for securing private data before publishing. / Graduate
82

Measuring the Utility of Synthetic Data : An Empirical Evaluation of Population Fidelity Measures as Indicators of Synthetic Data Utility in Classification Tasks / Mätning av Användbarheten hos Syntetiska Data : En Empirisk Utvärdering av Population Fidelity mätvärden som Indikatorer på Syntetiska Datas Användbarhet i Klassifikationsuppgifter

Florean, Alexander January 2024 (has links)
In the era of data-driven decision-making and innovation, synthetic data serves as a promising tool that bridges the need for vast datasets in machine learning (ML) and the necessity of data privacy. By simulating real-world data while preserving privacy, synthetic data generators have become more prevalent instruments in AI and ML development. A key challenge with synthetic data lies in accurately estimating its utility. For this purpose, Population Fidelity (PF) measures, a category of metrics that evaluates how well the synthetic data mimics the general distribution of the original data, have been shown to be good candidates. In this setting, we aim to answer: "How well are different population fidelity measures able to indicate the utility of synthetic data for machine learning based classification models?" We designed a reusable six-step experiment framework to examine the correlation between nine PF measures and the performance of four ML classification models over five datasets. The six-step approach includes data preparation, training, testing on original and synthetic datasets, and PF measures computation. The study reveals non-linear relationships between the PF measures and synthetic data utility. The general analysis, meaning the monotonic relationship between the PF measure and performance over all models, yielded at most moderate correlations, where the Cluster measure showed the strongest correlation. In the more granular model-specific analysis, Random Forest showed strong correlations with three PF measures. The findings show that no PF measure correlates consistently highly enough across all models to be considered a universal estimator of model performance. This highlights the importance of context-aware application of PF measures and sets the stage for future research to expand the scope, including support for a wider range of data types and the integration of privacy evaluations in synthetic data assessment. 
Ultimately, this study contributes to the effective and reliable use of synthetic data, particularly in sensitive fields where data quality is vital. / In the era of data-driven decision-making and innovation, synthetic data serves as a promising tool that bridges the need for extensive datasets in machine learning (ML) and the necessity of data privacy. By simulating real data while preserving privacy, synthetic data generators have become increasingly common tools in AI and ML development. An important challenge with synthetic data is accurately estimating its utility. For this purpose, measures in the Population Fidelity (PF) category have proven to be good candidates; these are metrics that evaluate how well the synthetic data mimics the general distribution of the original data. With this in mind, we aim to answer the following: How well can different population fidelity measures indicate the utility of synthetic data for machine learning based classification models? To answer the question, we designed a reusable six-step experiment framework to examine the correlation between nine PF measures and the performance of four ML classification models on five datasets. The six-step approach includes data preparation, training, testing on both original and synthetic datasets, and computation of PF measures. The study reveals the presence of non-linear relationships between the PF measures and the utility of synthetic data. The general analysis, that is, the monotonic relationship between the PF measure and performance over all models, showed at most moderate correlations, with the Cluster measure showing the strongest correlation. In the more detailed, model-specific analysis, Random Forest showed strong correlations with three PF measures. The results show that no PF measure shows consistently high correlation across all models to be considered a universal indicator of model performance. 
This underlines the importance of context-aware application of PF measures and paves the way for future research to expand the scope, including support for a broader range of data types and the integration of privacy evaluations in the assessment of synthetic data. Thus, this study contributes to the effective and reliable use of synthetic data, particularly in sensitive areas where data quality is crucial.
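
The abstract above describes analysing the monotonic relationship between a PF measure and model performance. One common way to quantify such a relationship is a rank correlation. The following is a minimal sketch, assuming a Spearman coefficient (the abstract does not name the coefficient used); the function names and the numbers in `pf_scores` and `accuracy` are hypothetical illustrations, not results from the thesis.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: one PF score per synthetic dataset and the accuracy
# of a classifier trained on that dataset. Not taken from the thesis.
pf_scores = [0.10, 0.25, 0.40, 0.55, 0.80]
accuracy = [0.62, 0.70, 0.69, 0.81, 0.90]

rho = spearman(pf_scores, accuracy)
print(f"Spearman rho = {rho:.2f}")
```

Because the coefficient works on ranks, it reaches 1.0 for any strictly increasing relationship, linear or not, which is why a rank-based measure suits the non-linear relationships the abstract reports.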
