Global ETD Search

51	Local differentially private mechanisms for text privacy protection Mo, Fengran 08 1900
In Natural Language Processing (NLP) applications, training an effective model often requires a massive amount of data. However, real-world text data are scattered across different institutions and user devices. Sharing them directly with an NLP service provider carries serious privacy risks, because text often contains sensitive information. A typical way to protect privacy is to privatize the raw text directly and use Differential Privacy (DP) to protect it at a quantifiable level. Protecting the intermediate computation results with a randomized text privatization mechanism is another available solution. However, existing text privatization mechanisms fail to achieve a good privacy-utility trade-off because of the intrinsic difficulty of text privacy protection. Their main limitations are: (1) mechanisms that privatize text under the dχ-privacy notion are not applicable to all similarity metrics because of its strict requirements; (2) they privatize every token in the text equally, giving each the same, excessively large output set, which results in over-protection; (3) current methods can guarantee privacy for only one of the training and inference steps, not both, because they lack DP composition and DP amplification techniques. This poor utility-privacy trade-off impedes the adoption of current text privatization mechanisms in real-world applications.
In this thesis, we propose two methods, from different perspectives, that cover both the training and inference stages while requiring no trust in the security of the server. The first is a Customized differentially private Text privatization mechanism (CusText) that assigns each input token a customized output set to provide adaptive privacy protection at the token level. It also overcomes the similarity-metric limitation caused by the dχ-privacy notion by adapting the mechanism to satisfy ϵ-DP. Furthermore, we provide two new text privatization strategies that boost the utility of the privatized text without compromising privacy. The second is a Gaussian-based local Differentially Private (GauDP) model that, based on an advanced privacy accounting framework, significantly reduces the calibrated noise added to the intermediate text representations and thus improves model accuracy. The model consists of an LDP layer, sub-sampling and up-sampling DP amplification algorithms for training and inference, and DP composition algorithms for noise calibration. This novel solution guarantees privacy for both training and inference data. To evaluate the proposed text privatization mechanisms, we conduct extensive experiments on several datasets of different types. The results show that our mechanisms achieve a better privacy-utility trade-off and better practical application value than existing methods. In addition, we carry out a series of analyses to explore the crucial factors for each component, which provide further insight into text protection and can inform further work on privacy-preserving NLP.
Natural language processing Differential privacy Text privacy protection Privacy-Preserving method
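The abstract of record 51 describes CusText as sampling a replacement for each token from a customized output set under ϵ-DP, without giving the sampling rule. A minimal sketch follows, assuming the exponential mechanism is used as the sampler and that similarity scores lie in [0, 1]; the function names, the toy vocabulary and the similarity values are hypothetical and are not taken from the thesis.

```python
import math
import random

def exponential_mechanism_probs(scores, epsilon, sensitivity=1.0):
    """Probabilities proportional to exp(epsilon * score / (2 * sensitivity))."""
    weights = [math.exp(epsilon * s / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def privatize_token(token, output_sets, similarity, epsilon, rng=random):
    """Replace `token` with a random member of its customized output set."""
    candidates = output_sets[token]
    # Scores are assumed to lie in [0, 1], so the score sensitivity is 1.
    scores = [similarity(token, c) for c in candidates]
    probs = exponential_mechanism_probs(scores, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical toy data: three tokens that share one small output set.
CITIES = ["paris", "london", "berlin"]
OUTPUT_SETS = {city: CITIES for city in CITIES}
PAIR_SIMILARITY = {
    frozenset(("paris", "london")): 0.6,
    frozenset(("paris", "berlin")): 0.5,
    frozenset(("london", "berlin")): 0.3,
}

def toy_similarity(a, b):
    """Symmetric similarity in [0, 1]; identical tokens score 1."""
    return 1.0 if a == b else PAIR_SIMILARITY[frozenset((a, b))]

if __name__ == "__main__":
    rng = random.Random(0)
    print([privatize_token("paris", OUTPUT_SETS, toy_similarity, 1.0, rng)
           for _ in range(5)])
```

Under these assumptions, any two input tokens that share an output set produce each output with probabilities that differ by a factor of at most exp(ϵ), and a smaller output set concentrates probability on closer substitutes.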
52	Variational AutoEncoders and Differential Privacy : balancing data synthesis and privacy constraints / Variational AutoEncoders och Differential Privacy : balans mellan datasyntes och integritetsbegränsningar Bremond, Baptiste January 2024
This thesis investigates how effectively Tabular Variational AutoEncoders (TVAEs) generate high-quality synthetic tabular data and assesses their compliance with differential privacy principles. The study shows that TVAEs are better than VAEs at generating synthetic data that faithfully reproduces the distribution of real data, as measured by the Synthetic Data Vault (SDV) metrics, but that good scores on these metrics do not guarantee that the synthetic data is fit for practical industrial applications. In particular, models trained on TVAE-generated data from the Creditcards dataset are ineffective. The author also explores various optimisation methods for TVAE, such as the Gumbel Max Trick, Drop Out (DO) and Batch Normalization, and points out that techniques frequently used to improve two-dimensional TVAE, such as Kullback–Leibler Warm-Up and β-Disentanglement, are not directly transferable to the one-dimensional context. Because of time constraints and already inconclusive results, differential privacy was not implemented for TVAE. The study nevertheless highlights that Differential Privacy - Stochastic Gradient Descent (DP-SGD) helps stabilise training, much as dropout does, and that there is an optimal equilibrium point between the constraints of differential privacy and the number of training epochs.
TVAE Differential privacy Tabular data Synthetic data DP-SGD Computer and Information Sciences
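Record 52 refers to DP-SGD without describing it. As a rough illustration of the standard recipe (clip each per-example gradient, add Gaussian noise to the sum, then average), here is a minimal sketch in plain Python. The function names and parameters are illustrative assumptions; this is not the thesis's training code and it does no privacy accounting.

```python
import math
import random

def clip_to_norm(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng=random):
    """One DP-SGD update: clip per example, add Gaussian noise, average, step."""
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    sigma = noise_multiplier * max_norm  # noise std applied to the summed gradient
    noisy_mean = [
        (sum(g[i] for g in clipped) + rng.gauss(0.0, sigma)) / batch_size
        for i in range(len(params))
    ]
    return [p - lr * g for p, g in zip(params, noisy_mean)]

if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.0, rng=random.Random(0)))
```

Clipping bounds the influence of any single example on the update, which is one plausible reason it behaves like a regulariser in training.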
53	Towards privacy-preserving and fairness-enhanced item ranking in recommender systems Sun, Jia Ao 07 1900
We present a novel privacy-preserving approach to enhancing item fairness in ranking systems. We employ post-processing techniques in a multi-stakeholder recommendation environment to balance fairness and privacy protection for both producers and consumers. Our method uses secure multi-party computation (MPC) servers and differential privacy (DP) to maintain user privacy while mitigating item unfairness without compromising utility. Users submit their data as secret shares to the MPC servers, and all calculations on this data remain encrypted. We evaluate our approach on real-world datasets, such as Amazon Digital Music, Book Crossing, and MovieLens-1M, and analyze the trade-offs between privacy, fairness, and utility. Our work encourages further exploration of the intersection of privacy and fairness in recommender systems, laying the groundwork for integrating other privacy-enhancing techniques to optimize runtime and scalability for real-world applications. We envision our approach as a stepping stone towards end-to-end privacy-preserving and fairness-promoting solutions in multi-stakeholder recommendation environments.
Privacy Fairness Ranking Secure multi-party computation Differential privacy
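Record 53 says that users submit their data as secret shares to MPC servers. The abstract does not name the sharing scheme, so the following is only a sketch of additive secret sharing over a prime field, one common choice. The modulus and the function names are assumptions made for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative prime modulus, not taken from the thesis

def share(value, n_servers, modulus=MODULUS):
    """Split `value` into n additive shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def reconstruct(shares, modulus=MODULUS):
    """Recombine all shares; any proper subset reveals nothing about the value."""
    return sum(shares) % modulus

def add_shares(a, b, modulus=MODULUS):
    """Each server adds its two shares locally, without seeing the inputs."""
    return [(x + y) % modulus for x, y in zip(a, b)]

if __name__ == "__main__":
    rating_a = share(5, 3)
    rating_b = share(7, 3)
    print(reconstruct(add_shares(rating_a, rating_b)))
```

Addition needs no communication between servers, which is why sums and counts over user data are cheap in this setting; multiplications and comparisons require additional protocol steps that are not shown here.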
54	Valuing Differential Privacy : Assessing the value of personal data anonymization solutions, specifically Differential Privacy-solutions, for companies in the mobility sector / Värdering av Differential Privacy : En värdering av anonymiseringsalgoritmer, specifikt Differential Privacy-lösningar, för bolag inom mobilitetssektorn Andersson, Axel, Borgernäs, Sebastian January 2022
This paper aims to determine the value of a product based on the mathematical concept of Differential Privacy (DP) by assessing the value of the business opportunities it enables and the value of the possible GDPR fines it prevents. To delimit the scope of the research, the analysis focuses on the value of personal data for companies in the mobility sector. Mobility is a cross-industrial sector consisting of companies in connectivity technology, transportation, and automotive. The method used to assess the final value of anonymizing personal data, such as consumer data, with a DP solution (that is, an implementation of the theory) combines quantitative and qualitative analysis. The quantitative analysis assesses the 'Cost of Risk' for mobility companies that are exposed to personal integrity regulation because of their data processing. A complementary qualitative assessment then determines the true financial impact of being fined for infringing privacy regulation through unlawful data processing. Lastly, the 'Opportunity Cost', that is, the cost of missed financial opportunities, is determined qualitatively for a case study company within Sweden's mobility ecosystem in order to conclude the overall value of a DP solution for a specific company. The final product of this research is a framework for assessing the total value of implementing differential privacy solutions, specifically for companies in the mobility sector.
differential privacy hedonic pricing method logistic regression GDPR privacy regulation valuation risk assessment valuing emerging technologies valuing non-market goods and services mobility Computer Sciences
55	Causal Inference in the Face of Assumption Violations Yuki Ohnishi (18423810) 26 April 2024
This dissertation advances the field of causal inference by developing methodologies that remain valid when standard assumptions are violated. Traditional causal inference methodologies hinge on a core set of assumptions, which are often violated in the complex landscape of modern experiments and observational studies. This dissertation proposes novel methodologies designed to address the challenges posed by single or multiple assumption violations. Applied to real-world datasets, these approaches uncover insights that were inaccessible with existing methods.

First, three sources of complication in causal inference that are increasingly of interest are interference among individuals, nonadherence of individuals to their assigned treatments, and unintended missing outcomes. Interference exists if the outcome of an individual depends not only on its own assigned treatment but also on the treatments assigned to other units. It commonly arises when limited controls are placed on the interactions of individuals with one another during an experiment. Treatment nonadherence frequently occurs in human subject experiments, as it can be unethical to force an individual to take their assigned treatment. Clinical trials, in particular, typically have subjects who do not adhere to their assigned treatments because of adverse side effects or intercurrent events. Missing values are also common in clinical studies. For example, some patients may drop out of the study because of the side effects of the treatment. Failing to account for these considerations will generally yield unstable and biased inferences on treatment effects even in randomized experiments, but existing methodologies cannot address all these challenges simultaneously. We propose a novel Bayesian methodology to fill this gap.

My subsequent research addresses one of the limitations of the first project: a set of assumptions about interference structures that may be too restrictive in some practical settings. We introduce the concept of the “degree of interference” (DoI), a latent variable capturing the interference structure. This concept makes it possible to handle arbitrary, unknown interference structures and so facilitates inference on causal estimands.

While randomized experiments offer a solid foundation for valid causal analysis, people are also interested in conducting causal inference with observational data, given the cost and difficulty of randomized experiments and the wide availability of observational data. Nonetheless, using observational data to infer causality requires additional assumptions. A central one is ignorability, which posits that the treatment is randomly assigned conditional on the variables (covariates) included in the dataset. While crucial, this assumption is often debatable, especially when treatments are assigned sequentially to optimize future outcomes. For instance, marketers typically adjust subsequent promotions based on responses to earlier ones and speculate on how customers might have reacted to alternative past promotions. This speculative behavior introduces latent confounders, which must be carefully addressed to prevent biased conclusions.

In the third project, we investigate these issues by studying sequences of promotional emails sent by a US retailer. We develop a novel Bayesian approach for causal inference from longitudinal observational data that accommodates noncompliance and latent sequential confounding.

Finally, we formulate the causal inference problem for privatized data. In the era of digital expansion, the secure handling of sensitive data poses an intricate challenge that significantly influences research, policy-making, and technological innovation. As the collection of sensitive data becomes more widespread across academic, governmental, and corporate sectors, balancing data accessibility against the safeguarding of private information requires sophisticated methods for analysis and reporting that include stringent privacy protections. Currently, the gold standard for maintaining this balance is differential privacy.

Local differential privacy is a differential privacy paradigm in which individuals first apply a privacy mechanism to their data (often by adding noise) before transmitting the result to a curator. The privacy noise introduces additional bias and variance into the curator's analyses. It is therefore of great importance for analysts to account for the privacy noise in order to obtain valid inference.

In this final project, we develop methodologies to infer causal effects from locally privatized data under randomized experiments. We present frequentist and Bayesian approaches and discuss the statistical properties of the estimators, such as consistency and optimality under various privacy scenarios.
Econometric and statistical methods Applied statistics Computational statistics Statistical data science Statistical theory Causal Inference Bayesian statistics Interference Noncompliance Missing not at random (MNAR) Bayesian Nonparametrics Differential privacy
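The last part of record 55 notes that local privacy noise biases a naive analysis and must be accounted for. A minimal sketch of that idea follows, assuming binary outcomes privatized with randomized response and a simple debiased difference in means. The mechanism, the estimator and all names are illustrative assumptions; they are not the dissertation's frequentist or Bayesian methods.

```python
import math
import random

def truth_probability(epsilon):
    """Probability of reporting the true bit under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

def randomized_response(bit, epsilon, rng=random):
    """Report the true bit with probability p, otherwise the flipped bit."""
    return bit if rng.random() < truth_probability(epsilon) else 1 - bit

def debiased_mean(reports, epsilon):
    """Invert E[report] = (2p - 1) * mean + (1 - p) to recover the mean."""
    p = truth_probability(epsilon)
    raw = sum(reports) / len(reports)
    return (raw - (1.0 - p)) / (2.0 * p - 1.0)

def ate_estimate(treated_reports, control_reports, epsilon):
    """Difference of debiased means between the two randomized arms."""
    return debiased_mean(treated_reports, epsilon) - debiased_mean(control_reports, epsilon)

if __name__ == "__main__":
    rng = random.Random(0)
    eps = 1.0
    treated = [randomized_response(int(rng.random() < 0.6), eps, rng) for _ in range(20000)]
    control = [randomized_response(int(rng.random() < 0.4), eps, rng) for _ in range(20000)]
    print(ate_estimate(treated, control, eps))
```

The naive difference of the noisy means shrinks the effect by a factor of 2p - 1; dividing by that factor removes the bias but inflates the variance, which is the trade-off the abstract alludes to.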
56	Real-time forecasting of dietary habits and user health using Federated Learning with privacy guarantees Horchidan, Sonia-Florina January 2020
Modern health self-monitoring devices and applications, such as Fitbit and MyFitnessPal, empower users to take concrete actions and set fitness and lifestyle goals based on their recorded trends and statistics. Predicting such trends helps users reach long-term targets, as they can adjust their diets and habits at any point to guarantee success. The design and implementation of such a system, which also respects user privacy, is the main objective of our work.
This application is modelled as a time-series forecasting problem. Given the historical data of users, we aim to predict their eating and lifestyle habits in real time. We apply the federated learning paradigm to our use case because of the highly distributed nature of our data and the privacy concerns raised by such sensitive recorded information. However, federated learning from heterogeneous sequences of data can be challenging, as even state-of-the-art machine learning techniques for time-series forecasting can encounter difficulties when learning from very irregular data sequences. Specifically, in the proposed healthcare scenario, the machine learning algorithms might fail to cater to users with unique dietary patterns.
In this work, we implement a two-step streaming clustering mechanism and group clients that exhibit similar eating and fitness behaviours. The experiments show that learning federatively in this context can achieve very high prediction accuracy: our predictions deviate from the ground truth by no more than 0.025% of the range of each feature. Training separate models for each group of users is shown to be beneficial, especially in terms of training time, but it depends heavily on the parameters used for the models and the training process. Our experiments conclude that the configuration used for the general federated model cannot be applied to the clusters of data. However, the prediction error can be decreased by more than 45%, provided the parameters are optimized for each case.
Lastly, this work tackles the problem of data privacy by applying state-of-the-art differential privacy techniques. Our empirical study shows that noising the gradients sent to the server is unsuitable for small datasets and cancels out the benefits obtained by the prior clustering of users. On the other hand, noising the training data achieves remarkable results, obtaining a differential privacy level corresponding to an epsilon value of 0.1 with an increase in the observed mean absolute error by a factor of only 0.21.
Federated Learning Time Series Forecasting Clustering Pattern Matching Real-time Data Processing Differential Privacy Data Privacy Computer and Information Sciences
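Record 56 reports that noising the training data worked better than noising gradients, but the abstract does not say which mechanism was used. The sketch below assumes one simple option, the Laplace mechanism applied to a single bounded feature value per record. The function names and the clipping bounds are hypothetical.

```python
import random

def laplace_noise(scale, rng=random):
    """Laplace(0, scale) sample: the difference of two Exp(1) draws is Laplace(0, 1)."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def privatize_feature(values, lower, upper, epsilon, rng=random):
    """Clip each value to [lower, upper], then add Laplace noise of scale range / epsilon."""
    scale = (upper - lower) / epsilon
    noisy = []
    for v in values:
        clipped = min(max(v, lower), upper)
        noisy.append(clipped + laplace_noise(scale, rng))
    return noisy

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical normalised feature values; 5.0 is clipped to 1.0 before noising.
    print(privatize_feature([0.2, 0.9, 5.0], 0.0, 1.0, epsilon=0.1, rng=rng))
```

With these assumptions, an epsilon of 0.1 gives a noise scale ten times the feature range, so individual noisy values are nearly uninformative and only aggregates over many records remain usable.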
57	Privacy and utility assessment within statistical data bases / Mesure de la vie privée et de l’utilité des données dans les bases de données statistiques Sondeck, Louis-Philippe 15 December 2017 (has links) Personal data promise significant improvements in almost every economic sector thanks to the knowledge that can be extracted from them. As evidence, some of the biggest companies in the world, Google, Amazon, Facebook and Apple (GAFA), rely on this resource to provide their services. However, although personal data can be very useful for improving and developing services, they can also, intentionally or not, harm the privacy of data respondents. Indeed, several studies report attacks carried out on company data even though the data had been anonymized. It is therefore necessary to define reliable methods for protecting respondents' privacy while ensuring the utility of the data for services. For this purpose, Europe has adopted a new regulation, the General Data Protection Regulation (EU, 2016), which aims to protect European citizens' personal data. However, the regulation addresses only one side of the problem: it focuses solely on privacy, whereas the goal is the best trade-off between privacy and utility. Indeed, privacy and utility usually pull in opposite directions: the greater the privacy, the less useful information remains in the data. One of the main approaches for addressing this trade-off is data anonymization. In the literature, anonymization refers either to anonymization mechanisms or to anonymization metrics. While mechanisms are used to anonymize data, metrics are needed to validate whether the trade-off between privacy and utility has been reached. However, existing metrics have several flaws, including a lack of accuracy and difficulty of implementation. Moreover, existing metrics assess either privacy or utility but not both at once, which makes the trade-off harder to evaluate. In this thesis, we propose a novel approach, called Discrimination Rate (DR), for assessing both privacy and utility. The DR is an information-theoretic metric that is practical and provides fine-grained measurements. It measures the capability of attributes to refine a set of respondents, with values between 0 and 1; the best refinement, which singles out individual respondents, yields a DR of 1. For example, an identifier has a DR equal to 1 because it completely refines a set of respondents. Using the DR, we precisely assess and compare anonymization mechanisms in terms of utility and privacy, whether different instantiations of the same mechanism or different mechanisms. Moreover, thanks to the DR, we provide formal definitions of identifiers (Personally Identifying Information), a point recognized as one of the main concerns of privacy regulations. The DR can therefore be used by both companies and regulators to tackle personal data protection issues.
Discrimination rate, K-anonymity, L-diversity, T-closeness, Anonymization, Privacy measurement, Utility measurement, Differential privacy
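Illustrative sketch for entry 57. The abstract describes the DR only qualitatively (an information-theoretic score in [0, 1] that equals 1 for an identifier). One formula consistent with that description, assumed here rather than quoted from the record, is 1 - H(X|Y)/H(X), where X is the respondent identity and Y the attribute. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(target, attribute):
    """Empirical H(target | attribute) from paired observations."""
    n = len(target)
    groups = {}
    for t, a in zip(target, attribute):
        groups.setdefault(a, []).append(t)
    # Weighted average of the entropy remaining inside each attribute group.
    return sum(len(g) / n * entropy(g) for g in groups.values())


def discrimination_rate(target, attribute):
    """Assumed DR-style score: 1 - H(target | attribute) / H(target)."""
    h = entropy(target)
    if h == 0:
        raise ValueError("target has zero entropy; the score is undefined")
    return 1.0 - conditional_entropy(target, attribute) / h


# Toy table: four respondents and three attributes (all values are made up).
respondents = ["r1", "r2", "r3", "r4"]
national_id = ["A", "B", "C", "D"]   # unique per respondent
gender = ["f", "f", "m", "m"]        # splits the set in two
country = ["x", "x", "x", "x"]       # carries no information

dr_identifier = discrimination_rate(respondents, national_id)
dr_gender = discrimination_rate(respondents, gender)
dr_country = discrimination_rate(respondents, country)
```

Under this assumed formula, an attribute that is unique per respondent removes all remaining uncertainty and scores 1, while an attribute with the same value for everyone scores 0, which matches the range and the identifier example given in the abstract.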
58	Towards Scalable Machine Learning with Privacy Protection Fay, Dominik January 2023 (has links) The increasing size and complexity of datasets have accelerated the development of machine learning models and exposed the need for more scalable solutions. This thesis explores three challenges associated with large-scale machine learning under data privacy constraints. As machine learning models grow larger and more complex, traditional privacy methods such as data anonymization are becoming insufficient. We therefore investigate alternative approaches, such as differential privacy. Our research addresses the following core areas in the context of scalable privacy-preserving machine learning. First, we examine the implications of data dimensionality for privacy in medical image analysis. We extend the classification algorithm Private Aggregation of Teacher Ensembles (PATE) to handle high-dimensional labels, and demonstrate that dimensionality reduction can be used to improve privacy. Second, we consider the impact of hyperparameter selection on privacy. Here, we propose a novel adaptive technique for hyperparameter selection in differentially private gradient-based optimization. Third, we investigate sampling-based solutions for scaling differentially private machine learning to datasets with a large number of records. We study the privacy-enhancing properties of importance sampling, highlighting that it can outperform uniform sub-sampling not only in terms of sample efficiency but also in terms of privacy. The three techniques developed in this thesis improve the scalability of machine learning while ensuring robust privacy protection, and aim to offer solutions for the effective and safe application of machine learning to large datasets. / QC 20231101
Machine Learning, Privacy, Differential Privacy, Dimensionality Reduction, Image Segmentation, Hyperparameter Selection, Adaptive Optimization, Privacy Amplification, Importance Sampling, Computer Sciences
59	Privacy-preserving Synthetic Data Generation for Healthcare Planning / Sekretessbevarande syntetisk generering av data för vårdplanering Yang, Ruizhi January 2021 (has links) Recently, a variety of machine learning techniques have been applied to different healthcare sectors, and the results appear promising. One such sector is healthcare planning, in which patient data is used to build statistical models that predict the load on different units of the healthcare system. This research is an attempt to design and implement a privacy-preserving synthetic data generation method adapted specifically to patients' health data and to healthcare planning. A Privacy-preserving Conditional Generative Adversarial Network (PPCGAN) is used to generate synthetic data of healthcare events, with carefully designed noise added to the gradients during training. Differential privacy is used to ensure that adversaries cannot recover the exact training samples from the trained model. In particular, the goal is to produce digital patients and model their journey through the healthcare system.
Synthetic data generation, Differential privacy, Generative network, GAN, Moments Accountant, Markov modeling, Electrical and Electronic Engineering
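Illustrative sketch for entry 59. The abstract states that noise is added to the gradients during training but does not say how the noise is designed. The code below shows only the generic clip-then-add-Gaussian-noise step commonly used in differentially private gradient training, as an assumption about the general family of techniques. It is not the PPCGAN procedure, and `privatize_gradients`, `clip_norm` and `noise_multiplier` are hypothetical names.

```python
import math
import random


def privatize_gradients(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to L2 norm `clip_norm`, sum them,
    add Gaussian noise with std `noise_multiplier * clip_norm` to each
    coordinate, and return the noisy average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


# Toy batch of two per-example gradients (made-up numbers).
batch = [[3.0, 4.0], [0.6, 0.8]]
rng = random.Random(0)

# With the noise switched off, only the clipping is visible.
clipped_only = privatize_gradients(batch, clip_norm=1.0, noise_multiplier=0.0, rng=rng)

# With noise switched on, each coordinate is perturbed.
noisy = privatize_gradients(batch, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single record can change the summed gradient, which is what lets the Gaussian noise scale be tied to a privacy guarantee. Tracking the cumulative privacy loss over many training steps (for example with the Moments Accountant listed in the keywords) is outside this sketch.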
60	Social Graph Anonymization / Anonymisation de graphes sociaux Nguyen, Huu-Hiep 04 November 2016 (has links) Privacy is a serious concern for users in their daily use of social networks. Social networks are a valuable data source for large-scale studies of social organization and evolution, and are usually published in anonymized form. This thesis addresses three privacy problems of social networks: graph anonymization, private community detection and private link exchange. First, we tackle graph anonymization via uncertainty semantics and differential privacy. For uncertainty semantics, we propose a general obfuscation model called Uncertain Adjacency Matrix (UAM) that keeps expected node degrees equal to those in the unanonymized graph. We analyze two recently proposed schemes and show how they fit into the model. We also present our scheme Maximum Variance (MaxVar) to fill the gap between them. Under differential privacy, the problem is very challenging because of the huge output space of noisy graphs. Many existing schemes for differentially private graph release are not consistent as the privacy budget increases and do not clarify the upper bound of the privacy budget. In this thesis, such a bound is provided. We introduce the new linear-complexity scheme Top-m-Filter (TmF) and improve the existing technique EdgeFlip. A thorough comparative evaluation on a wide range of graphs provides a panorama of the state of the art's performance and validates our proposed schemes. Second, we present the problem of community detection under differential privacy. We analyze the major challenges behind the problem and propose several schemes to tackle them from two perspectives: input perturbation (LouvainDP) and algorithm perturbation (ModDivisive).
Social networks, Uncertain Adjacency Matrix, Maximum Variance, Differential privacy, Top-m-Filter, Community detection, LouvainDP, ModDivisive, Private link exchange, $(\alpha, \beta)$-Exchange, 005.8
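Illustrative sketch for entry 60. The abstract names edge-perturbation schemes (EdgeFlip, Top-m-Filter) without describing them. The code below is plain randomized response applied to every potential edge, a textbook way to obtain epsilon-differential privacy with respect to a single edge. It is offered as an assumption about the general idea of input perturbation, not as the thesis's EdgeFlip variant or Top-m-Filter, and `flip_edges` is a hypothetical name.

```python
import math
import random
from itertools import combinations


def flip_edges(num_nodes, edges, epsilon, rng):
    """Randomized response over all node pairs of an undirected graph:
    each adjacency bit is flipped independently with probability
    1 / (1 + e^epsilon), so the keep/flip odds ratio is e^epsilon."""
    flip_prob = 1.0 / (1.0 + math.exp(epsilon))
    present = {frozenset(e) for e in edges}
    noisy = set()
    for u, v in combinations(range(num_nodes), 2):
        bit = frozenset((u, v)) in present
        if rng.random() < flip_prob:
            bit = not bit
        if bit:
            noisy.add((u, v))
    return noisy


# Toy graph with four nodes and two edges (made-up data).
original = [(0, 1), (2, 3)]

# A very large budget makes flips vanishingly unlikely.
almost_exact = flip_edges(4, original, epsilon=50.0, rng=random.Random(0))

# A small budget flips nearly half of all adjacency bits.
heavily_noised = flip_edges(4, original, epsilon=0.1, rng=random.Random(0))
```

This naive version visits every node pair, so its cost grows quadratically with the number of nodes and small budgets turn sparse graphs into dense ones. The abstract's emphasis on the huge output space of noisy graphs and on a linear-complexity scheme can be read against that limitation.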
