Global ETD Search

101	Secure Distributed MapReduce Protocols : How to have privacy-preserving cloud applications? / Protocoles distribués et sécurisés pour le paradigme MapReduce : Comment avoir des applications dans les nuages respectueuses de la vie privée ? Giraud, Matthieu 24 September 2019 (has links) À l’heure des réseaux sociaux et des objets connectés, de nombreuses et diverses données sont produites à chaque instant. L’analyse de ces données a donné lieu à une nouvelle science nommée "Big Data". Pour traiter du mieux possible ce flux incessant de données, de nouvelles méthodes de calcul ont vu le jour. Les travaux de cette thèse portent sur la cryptographie appliquée au traitement de grands volumes de données, avec comme finalité la protection des données des utilisateurs. En particulier, nous nous intéressons à la sécurisation d’algorithmes utilisant le paradigme de calcul distribué MapReduce pour réaliser un certain nombre de primitives (ou algorithmes) indispensables aux opérations de traitement de données, allant du calcul de métriques de graphes (e.g. PageRank) aux requêtes SQL (i.e. intersection d’ensembles, agrégation, jointure naturelle). Nous traitons dans la première partie de cette thèse de la multiplication de matrices. Nous décrivons d’abord une multiplication matricielle standard et sécurisée pour l’architecture MapReduce qui est basée sur l’utilisation du chiffrement additif de Paillier pour garantir la confidentialité des données. Les algorithmes proposés correspondent à une hypothèse spécifique de sécurité : collusion ou non des nœuds du cluster MapReduce, le modèle général de sécurité étant honnête mais curieux. L’objectif est de protéger la confidentialité de l’une et l’autre matrice, ainsi que le résultat final, et ce pour tous les participants (propriétaires des matrices, nœuds de calcul, utilisateur souhaitant calculer le résultat). 
D’autre part, nous exploitons également l’algorithme de multiplication de matrices de Strassen-Winograd, dont la complexité asymptotique est O(n^log2(7)) soit environ O(n^2.81) ce qui est une amélioration par rapport à la multiplication matricielle standard. Une nouvelle version de cet algorithme adaptée au paradigme MapReduce est proposée. L’hypothèse de sécurité adoptée ici est limitée à la non-collusion entre le cloud et l’utilisateur final. La version sécurisée utilise comme pour la multiplication standard l’algorithme de chiffrement Paillier. La seconde partie de cette thèse porte sur la protection des données lorsque des opérations d’algèbre relationnelle sont déléguées à un serveur public de cloud qui implémente à nouveau le paradigme MapReduce. En particulier, nous présentons une solution d’intersection sécurisée qui permet à un utilisateur du cloud d’obtenir l’intersection de n > 1 relations appartenant à n propriétaires de données. Dans cette solution, tous les propriétaires de données partagent une clé et un propriétaire de données sélectionné partage une clé avec chacune des clés restantes. Par conséquent, alors que ce propriétaire de données spécifique stocke n clés, les autres propriétaires n’en stockent que deux. Le chiffrement du tuple de relation réelle consiste à combiner l’utilisation d’un chiffrement asymétrique avec une fonction pseudo-aléatoire. Une fois que les données sont stockées dans le cloud, chaque réducteur (Reducer) se voit attribuer une relation particulière. S’il existe n éléments différents, des opérations XOR sont effectuées. La solution proposée reste donc très efficace. Par la suite, nous décrivons les variantes des opérations de regroupement et d’agrégation préservant la confidentialité en termes de performance et de sécurité. 
Les solutions proposées associent l’utilisation de fonctions pseudo-aléatoires à celle du chiffrement homomorphe pour les opérations COUNT, SUM et AVG et à un chiffrement préservant l’ordre pour les opérations MIN et MAX. Enfin, nous proposons les versions sécurisées de deux protocoles de jointure (cascade et hypercube) adaptées au paradigme MapReduce. Les solutions consistent à utiliser des fonctions pseudo-aléatoires pour effectuer des contrôles d’égalité et ainsi permettre les opérations de jointure lorsque des composants communs sont détectés.(...) / In the age of social networks and connected objects, large amounts of diverse data are produced at every moment. The analysis of these data has led to a new science called "Big Data". To best handle this constant flow of data, new computation methods have emerged. This thesis focuses on cryptography applied to the processing of large volumes of data, with the aim of protecting user data. In particular, we focus on securing algorithms that use the MapReduce distributed computing paradigm to perform a number of primitives (or algorithms) essential for data processing, ranging from the calculation of graph metrics (e.g. PageRank) to SQL queries (i.e. set intersection, aggregation, natural join). In the first part of this thesis, we discuss matrix multiplication. We first describe a standard and secure matrix multiplication for the MapReduce architecture that is based on Paillier’s additive encryption scheme to guarantee the confidentiality of the data. The proposed algorithms each correspond to a specific security hypothesis: whether or not the nodes of the MapReduce cluster collude, the general security model being honest-but-curious. The aim is to protect the confidentiality of both matrices, as well as the final result, from all participants (matrix owners, computation nodes, and the user wishing to compute the result). 
On the other hand, we also use the Strassen-Winograd matrix multiplication algorithm, whose asymptotic complexity is O(n^log2(7)), or about O(n^2.81), an improvement over standard matrix multiplication. A new version of this algorithm adapted to the MapReduce paradigm is proposed. The security assumption adopted here is limited to non-collusion between the cloud and the end user. As with standard multiplication, the secure version uses Paillier’s encryption scheme. The second part of this thesis focuses on data protection when relational algebra operations are delegated to a public cloud server that again implements the MapReduce paradigm. In particular, we present a secure intersection solution that allows a cloud user to obtain the intersection of n > 1 relations belonging to n data owners. In this solution, all data owners share a key and a selected data owner shares a key with each of the remaining data owners. Therefore, while this specific data owner stores n keys, the other owners store only two. Encrypting a tuple of the actual relation combines asymmetric encryption with a pseudo-random function. Once the data is stored in the cloud, each reducer is assigned a specific relation. If there are n different elements, XOR operations are performed. The proposed solution therefore remains very efficient. Next, we describe variants of grouping and aggregation operations that preserve confidentiality, in terms of performance and security. The proposed solutions combine pseudo-random functions with homomorphic encryption for COUNT, SUM and AVG operations, and with order-preserving encryption for MIN and MAX operations. Finally, we propose secure versions of two join protocols (cascade and hypercube) adapted to the MapReduce paradigm. The solutions use pseudo-random functions to perform equality checks and thus allow join operations when common components are detected. All the solutions described above are evaluated and their security proven. 
Big data Informatique en nuage Confidentialité MapReduce Sécurité Big Data Cloud computing Confidentiality MapReduce Security
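As a rough illustration of the additive property that entry 101 relies on, the sketch below implements a toy Paillier scheme and uses it to compute an encrypted dot product, the basic step of a matrix multiplication. It is not the thesis's MapReduce protocol: the function names, the tiny primes 293 and 433, and the choice of one encrypted row against one plaintext column are all assumptions made for readability, and keys this small offer no security.

```python
import math
import random

def keygen(p=293, q=433):
    """Toy key generation; p and q are tiny illustrative primes (assumption)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, lam, mu

def encrypt(n, m):
    """Enc(m) = (1 + m*n) * r^n mod n^2, with r random and coprime to n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(n, lam, mu, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

def encrypted_dot(n, enc_row, plain_col):
    """Multiplying ciphertexts adds plaintexts; raising to b scales by b."""
    n2 = n * n
    acc = encrypt(n, 0)
    for c, b in zip(enc_row, plain_col):
        acc = (acc * pow(c, b, n2)) % n2
    return acc

n, lam, mu = keygen()
row = [encrypt(n, a) for a in [1, 2, 3]]
result = decrypt(n, lam, mu, encrypted_dot(n, row, [4, 5, 6]))
print(result)  # 1*4 + 2*5 + 3*6 = 32
```

The party that evaluates `encrypted_dot` only ever sees ciphertexts of the row, which is the kind of guarantee an honest-but-curious computation node is meant to respect; results must stay below n for decryption to be correct.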
102	Personalising privacy contraints in Generalization-based Anonymization Models / Personnalisation de protection de la vie privée sur des modèles d'anonymisation basés sur des généralisations Michel, Axel 08 April 2019 (has links) Les bénéfices engendrés par les études statistiques sur les données personnelles des individus sont nombreux, que ce soit dans le médical, l'énergie ou la gestion du trafic urbain pour n'en citer que quelques-uns. Les initiatives publiques de smart-disclosure et d'ouverture des données rendent ces études statistiques indispensables pour les institutions et industries tout autour du globe. Cependant, ces calculs peuvent exposer les données personnelles des individus, portant ainsi atteinte à leur vie privée. Les individus sont alors de plus en plus réticents à participer à des études statistiques malgré les protections garanties par les instituts. Pour retrouver la confiance des individus, il devient nécessaire de proposer des solutions de user empowerment, c'est-à-dire permettre à chaque utilisateur de contrôler les paramètres de protection des données personnelles les concernant qui sont utilisées pour des calculs. Cette thèse développe donc un nouveau concept d'anonymisation personnalisé, basé sur la généralisation de données et sur le user empowerment. En premier lieu, ce manuscrit propose une nouvelle approche mettant en avant la personnalisation des protections de la vie privée par les individus, lors de calculs d'agrégation dans une base de données. De cette façon les individus peuvent fournir des données de précision variable, en fonction de leur perception du risque. De plus, nous utilisons une architecture décentralisée basée sur du matériel sécurisé assurant ainsi les garanties de respect de la vie privée tout au long des opérations d'agrégation. En deuxième lieu, ce manuscrit étudie la personnalisation des garanties d'anonymat lors de la publication de jeux de données anonymisés. 
Nous proposons l'adaptation d'heuristiques existantes ainsi qu'une nouvelle approche basée sur la programmation par contraintes. Des expérimentations ont été menées pour étudier l'impact d’une telle personnalisation sur la qualité des données. Les contraintes d’anonymat ont été construites et simulées de façon réaliste en se basant sur des résultats d'études sociologiques. / The benefits of performing Big Data computations over individuals’ microdata are manifold, in the medical, energy or transportation fields to cite only a few, and this interest is growing with the emergence of smart-disclosure initiatives around the world. However, these computations often expose microdata to privacy leakages, which explains the reluctance of individuals to participate in studies despite the privacy guarantees promised by statistical institutes. To regain individuals’ trust, it becomes essential to propose user empowerment solutions, that is to say, to allow individuals to control the privacy parameters used in computations over their microdata. This work proposes a novel concept of personalized anonymisation based on data generalization and user empowerment. Firstly, this manuscript proposes a novel approach to push personalized privacy guarantees in the processing of database queries so that individuals can disclose different amounts of information (i.e. data at different levels of accuracy) depending on their own perception of the risk. Moreover, we propose a decentralized computing infrastructure based on secure hardware enforcing these personalized privacy guarantees all along the query execution process. Secondly, this manuscript studies the personalization of anonymity guarantees when publishing data. We propose the adaptation of existing heuristics and a new approach based on constraint programming. Experiments have been conducted to show the impact of such personalization on data quality. 
Individuals’ privacy constraints have been built and simulated realistically, based on the results of sociological studies. Big Data Matériel sécurisé Data privacy and security Big Data Secure Hardware
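To make the idea in entry 102 of a per-individual anonymity constraint more concrete, here is a minimal sketch in which every record carries its own k and a single attribute (age) is generalized until each person is hidden among at least k records. The three generalization levels, the function names and the sample records are hypothetical; the thesis's heuristics and constraint-programming approach are not reproduced here.

```python
from collections import Counter

def generalize_age(age, level):
    """Level 0: exact age; level 1: 10-year band; level 2: fully suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def satisfies_personal_k(records, level):
    """records is a list of (age, k_i) pairs. Returns True when every
    individual shares its generalized value with at least k_i records,
    counting itself."""
    generalized = [generalize_age(age, level) for age, _ in records]
    sizes = Counter(generalized)
    return all(sizes[g] >= k for g, (_, k) in zip(generalized, records))

def minimal_level(records, max_level=2):
    """Smallest global generalization level that meets every personal k."""
    for level in range(max_level + 1):
        if satisfies_personal_k(records, level):
            return level
    return None

# Hypothetical data: (age, personal k requested by that individual).
people = [(23, 2), (27, 2), (31, 1), (38, 3), (35, 1)]
print(minimal_level(people))  # 1: the bands 20-29 and 30-39 suffice
```

Applying one level to the whole table is the simplest possible design; it over-generalizes individuals with low k, which is precisely the data-quality cost that a personalized scheme aims to reduce.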
103	The role of large research infrastructures in scientifics creativity : a user-level analysis in the cases of a biological database platform and a synchrotron / Le rôle des grandes infrastructures de recherche dans la créativité scientifique : analyse au niveau de l'utilisateur dans le cas d'une plate-forme de base de données biologique et d'un synchrotron Moratal Ferrando, Núria 28 February 2019 (has links) A l'origine de cette thèse il y a le constat d’une science en changement. Ce changement se caractérise par deux grandes tendances globales : la dépendance croissante à des grands équipements coûteux et partagés et la production de données de masse qui sont également très coûteuses à stocker et gérer. Dans les deux cas ces ressources sont financées par des programmes publics et proposés à la communauté scientifique selon un principe d’ouverture à des utilisateurs extérieurs sous forme de Infrastructures de recherche (IR). Plusieurs facteurs peuvent nous amener à penser que les IR sont des lieux favorables à la créativité. Cependant les moyens par lesquels les IR favorisent la créativité n’ont pas été étudiés. L’objectif de cette thèse est de répondre à cette question. La problématique se décline en deux sous-questions de recherche. D’abord nous nous demandons, comment les IR peuvent-elles contribuer à la créativité scientifique de leurs utilisateurs ? Puis nous nous interrogeons sur : comment mesurer cet impact ? / At the origin of this thesis there is the observation of a changing science. This change is characterized by two major global trends: the growing reliance on large expensive and shared equipment and the production of mass data which are also very expensive to store and manage. In both cases these resources are financed by public programs and proposed to the scientific community according to a principle of openness to external users in the form of Research Infrastructures (RIs). 
Several factors may lead us to believe that RIs are favourable places for creativity. However, the means by which RIs promote creativity have not been studied. The purpose of this thesis is to answer this question. The research problem is divided into two research sub-questions. First, we ask how RIs can contribute to the scientific creativity of their users. Then we ask how this impact can be measured. Créativité Créativité scientifique Infrastructures de recherche Big data Creativity Scientific creativity Research infrastructures Big data 338.9
104	HopsWorks : A project-based access control model for Hadoop Moré, Andre, Gebremeskel, Ermias January 2015 (has links) The growth in global data-gathering capacity is producing a vast amount of data, which is growing at an ever faster rate. Properly analyzed, this data can represent a great opportunity for businesses, but processing it is a resource-intensive task. Sharing can increase efficiency through reuse, but legal and ethical questions arise when data is shared. The purpose of this thesis is to gain an in-depth understanding of the different access control methods that can be used to facilitate sharing, and to choose one to implement on a platform that lets users analyze, share, and collaborate on datasets. The resulting platform uses project-based access control at the API level and fine-grained role-based access control on the file system to give the data owner full control over the shared data. / I dagsläget så genereras och samlas det in oerhört stora mängder data som växer i ett allt högre tempo för varje dag som går. Den korrekt analyserade datan skulle kunna erbjuda stora möjligheter för företag men problemet är att det är väldigt resurskrävande att bearbeta. Att göra det möjligt för organisationer att dela med sig utav datan skulle effektivisera det hela tack vare återanvändandet av data men det dyker då upp olika frågor kring lagliga samt etiska aspekter när man delar dessa data. Syftet med denna rapport är att få en djupare förståelse för dom olika åtkomstmetoder som kan användas vid delning av data för att sedan kunna välja den metod som man ansett vara mest lämplig att använda sig utav i en plattform. 
Plattformen kommer att användas av användare som vill skapa projekt där man vill analysera, dela och arbeta med DataSets, vidare kommer plattformens säkerhet att implementeras med en projekt-baserad åtkomstkontroll på API nivå och detaljerad rollbaserad åtkomstkontroll på filsystemet för att ge dataägaren full kontroll över den data som delas Hops HopsWorks Hadoop DataSets Big Data Distributed Computing Hops HopsWorks Hadoop DataSets Big Data Distributed Computing
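The project-based model summarized in entry 104 can be pictured with a small sketch: a user acts inside a project, the role held in that project decides what is allowed on the project's own datasets, and datasets shared in from other projects stay read-only. The role names, the `Project` fields and the read-only rule for shared datasets are assumptions chosen for illustration and are not taken from the HopsWorks implementation.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission table.
ROLE_ACTIONS = {
    "owner": {"read", "write", "share"},
    "member": {"read"},
}

@dataclass
class Project:
    name: str
    members: dict = field(default_factory=dict)   # user -> role in this project
    datasets: set = field(default_factory=set)    # datasets owned by the project
    shared_in: set = field(default_factory=set)   # datasets shared by other projects

def is_allowed(project, user, action, dataset):
    """Decide a request from the user's role in the project alone."""
    role = project.members.get(user)
    if role is None:
        return False                      # not a member: no access at all
    if dataset in project.datasets:
        return action in ROLE_ACTIONS[role]
    if dataset in project.shared_in:
        return action == "read"           # shared datasets are read-only here
    return False

genomics = Project(
    "genomics",
    members={"alice": "owner", "bob": "member"},
    datasets={"reads.csv"},
    shared_in={"reference.csv"},
)
print(is_allowed(genomics, "alice", "write", "reads.csv"))  # True
print(is_allowed(genomics, "bob", "write", "reads.csv"))    # False
```

Because every check is scoped to one project, the same user can be an owner in one project and a plain member in another without any global role.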
105	Hur har digitaliseringen och Big data påverkat revisionsbranschen? : - Hur ser framtiden ut? Ryman, Jonatan, Torbjörnsson, Felicia January 2021 (has links) Information och data är något som är värdefullt i de flesta organisationer. Idag finns det möjlighet att samla in och analysera stora mängder data, något som brukar benämnas Big data. Ifall materialet hanteras på ett korrekt sätt kan det leda till ett förbättrat beslutsfattande samt ökade insikter om kunden och marknaden. Bearbetning av Big data har möjliggjorts under de senaste åren genom ny teknik, men det är förknippat med höga kostnader vilket har medfört att små- och medelstora företag inte har tillräckligt med resurser för att implementera Big data- tekniker. Detta är något som i stor utsträckning även kan appliceras på revisionsbyråer. I studiens problemdiskussion diskuteras att trots de fördelar som Big data-tekniker och digitalisering innebär för revisionsbranschen, har utvecklingen varit långsam. Genom det utformades ett syfte, att undersöka hur digitaliseringen och Big data-tekniker påverkat den svenska revisionsbranschen. Syftet med denna studie är att undersöka, analysera och beskriva hur implementeringen av Big data har och kommer att påverka revisorns arbetsprocess. Syftet är också att undersöka vilka fördelar och nackdelar som implementeringen har medfört för revisionsbyråer och det skedde genom en empirisk undersökning. Den empiriska undersökningen är en fallstudie som består av sju stycken intervjuer med representanter från olika revisionsbyråer i Västra Sverige. Genom att använda sig av intervjuer har respondenterna haft möjlighet att berätta om sina egna erfarenheter utan störningsmoment. Det är en kvalitativ metod som har tillämpats och studien har utgångspunkt ur en deduktiv ansats. Det eftersom utgångspunkten kommer ifrån tidigare forskning samt teori som redan existerar om digitalisering, Big data och revisionsbranschen. 
Resultatet av studien visar att de olika respondenterna gav liknande svar, dels kring hur deras revisionsprocess är utformad, men även om den digitala utvecklingen och Big data. Det är tre stycken byråer som idag använder sig utav Big data i begränsad omfattning och en av respondenterna hade ingen kunskap om konceptet alls. Samtliga respondenter anser att digitaliseringen inom revisionsbranschen kommer att bli ännu viktigare och spela en större roll i framtiden. Automatisering och standardisering är processer som respondenterna tror kommer att bli mer omfattande i framtiden inom revisionen. I studiens slutsats framkommer det att digitaliseringen inom revisionsbranschen har utvecklats något långsamt jämfört med andra delar i samhället, men att det skett en stor utveckling under de senaste åren. De slutsatser som presenteras och som utgår ifrån studiens frågeställningar är att arbetsprocessen för revisorerna inte har påverkats av Big data än, då det fortfarande är ett relativt okänt begrepp. Framtidens revision kommer dock inte se ut som idag, utan att det kommer ske stora förändringar inom branschen kommande år. Arbetsprocesser kommer att bli mer effektiva samt att revisionen kommer att ha en högre kvalitet. / Information and data are of value in most organizations. Today, there is the opportunity to collect and analyze large amounts of data, which usually goes by the term Big data. If the material is handled correctly, it may result in improved decision making and increased insights about clients and the market. The processing of Big data has been made possible in recent years through new technology, but it is associated with high costs, with the result that small and medium-sized companies do not have the resources to implement Big data technologies. This means that larger and more established companies have an advantage. This also applies to a large extent to auditing firms. 
The study's problem discussion shows that, despite the many benefits that Big data technologies and digitalization create for the auditing industry, development is lagging behind other industries. Therefore, the purpose of the study was designed to examine how digitalization and Big data technologies have affected the Swedish auditing industry. The purpose of this study is to investigate, analyze and describe how the implementation of Big data has affected and will affect the auditors’ work process. The purpose is also to investigate which advantages and disadvantages the implementation has brought auditing firms, which is examined through an empirical study. The empirical study is a case study that consists of seven interviews with representatives from various auditing firms in Western Sweden. By using interviews, the respondents have had the opportunity to describe their own experiences. The study is based on a qualitative method. The study has a deductive approach, as the starting point comes from previous research and theory that already exists about digitalization, Big data and the auditing industry. The results of the study show that the respondents gave somewhat similar answers to the questions, both about how their auditing process is designed today and about digital development and Big data. Three of the interviewed firms use Big data today, but only to a limited extent, and one of the respondents had no idea what the concept of Big data represented. However, all respondents believe that digitalization in the auditing industry will become even more important and play a greater role in the near future. Automation and standardization are processes that the respondents believe will be more extensive in the future within auditing. The study's conclusion shows that digitalization in the auditing industry has been somewhat slow compared with other industries, but that there has been great development in recent years. 
The conclusions presented, which are based on the study's research questions, are that the work process for auditors has not yet been affected by Big data, as it is still a relatively unknown concept. However, auditing in the future will not look like it does today; there will be major changes in the industry in the upcoming years. Work processes will be more efficient than today and the audit will be of a higher quality. Big data digitization auditing and the audit process Big data digitalisering revision och revisionsprocessen Business Administration Företagsekonomi
106	BIG DATA-ANALYS INOM FOTBOLLSORGANISATIONER En studie om big data-analys och värdeskapande Flike, Felix, Gervard, Markus January 2019 (has links) Big data is a relatively new term, but the phenomenon has existed for a long time. It can be described in terms of five Vs: volume, veracity, variety, velocity and value. The analysis of Big Data has proven valuable to organizations in decision making, in generating measurable economic benefits and in improving operations. In the sports industry, it began to be used in earnest in the early 2000s by the baseball organization Oakland Athletics. The organization began recruiting players based on their statistics rather than on scouts' assessments of their ability, which brought great success. This led more organizations to follow suit, and it was not long before Big Data analysis was used in all major sports to gain advantages over competitors. In the Swedish context, the use of these tools is still relatively new, and many organizations may have moved too fast in implementing them. This study aims to examine football organizations' work with Big Data analysis linked to the organization's players, based on a case analysis. The results show that both organizations create value from their investments, which benefits them in working towards their strategic goals. The organizations do this in different ways. Which way is most effective in terms of value creation cannot be answered on the basis of this study. big data big data-analys data management värdeskapande resurshantering Engineering and Technology Teknik och teknologier
107	Accéler la préparation des données pour l'analyse du big data / Accelerating data preparation for big data analytics Tian, Yongchao 07 April 2017 (has links) Nous vivons dans un monde de big data, où les données sont générées en grand volume, grande vitesse et grande variété. Le big data apportent des valeurs et des avantages énormes, de sorte que l’analyse des données est devenue un facteur essentiel de succès commercial dans tous les secteurs. Cependant, si les données ne sont pas analysées assez rapidement, les bénéfices de big data seront limités ou même perdus. Malgré l’existence de nombreux systèmes modernes d’analyse de données à grande échelle, la préparation des données est le processus le plus long de l’analyse des données, n’a pas encore reçu suffisamment d’attention. Dans cette thèse, nous étudions le problème de la façon d’accélérer la préparation des données pour le big data d’analyse. En particulier, nous nous concentrons sur deux grandes étapes de préparation des données, le chargement des données et le nettoyage des données. Comme première contribution de cette thèse, nous concevons DiNoDB, un système SQL-on-Hadoop qui réalise l’exécution de requêtes à vitesse interactive sans nécessiter de chargement de données. Les applications modernes impliquent de lourds travaux de traitement par lots sur un grand volume de données et nécessitent en même temps des analyses interactives ad hoc efficaces sur les données temporaires générées dans les travaux de traitement par lots. Les solutions existantes ignorent largement la synergie entre ces deux aspects, nécessitant de charger l’ensemble des données temporaires pour obtenir des requêtes interactives. En revanche, DiNoDB évite la phase coûteuse de chargement et de transformation des données. L’innovation importante de DiNoDB est d’intégrer à la phase de traitement par lots la création de métadonnées que DiNoDB exploite pour accélérer les requêtes interactives. 
La deuxième contribution est un système de flux distribué de nettoyage de données, appelé Bleach. Les approches de nettoyage de données évolutives existantes s’appuient sur le traitement par lots pour améliorer la qualité des données, qui demandent beaucoup de temps. Nous ciblons le nettoyage des données de flux dans lequel les données sont nettoyées progressivement en temps réel. Bleach est le premier système de nettoyage qualitatif de données de flux, qui réalise à la fois la détection des violations en temps réel et la réparation des données sur un flux de données sale. Il s’appuie sur des structures de données efficaces, compactes et distribuées pour maintenir l’état nécessaire pour nettoyer les données et prend également en charge la dynamique des règles. Nous démontrons que les deux systèmes résultants, DiNoDB et Bleach, ont tous deux une excellente performance par rapport aux approches les plus avancées dans nos évaluations expérimentales, et peuvent aider les chercheurs à réduire considérablement leur temps consacré à la préparation des données. / We are living in a big data world, where data is being generated in high volume, at high velocity and in high variety. Big data brings enormous value and benefits, so that data analytics has become a critically important driver of business success across all sectors. However, if the data is not analyzed fast enough, the benefits of big data will be limited or even lost. Despite the existence of many modern large-scale data analysis systems, data preparation, which is the most time-consuming process in data analytics, has not yet received sufficient attention. In this thesis, we study the problem of how to accelerate data preparation for big data analytics. In particular, we focus on two major data preparation steps, data loading and data cleaning. As the first contribution of this thesis, we design DiNoDB, a SQL-on-Hadoop system which achieves interactive-speed query execution without requiring data loading. 
Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data generated in batch processing jobs. Existing solutions largely ignore the synergy between these two aspects, requiring the entire temporary dataset to be loaded before interactive queries can be run. In contrast, DiNoDB avoids the expensive data loading and transformation phase. The key innovation of DiNoDB is to piggyback the creation of metadata on the batch processing phase; DiNoDB then exploits this metadata to expedite interactive queries. The second contribution is a distributed stream data cleaning system, called Bleach. Existing scalable data cleaning approaches rely on batch processing to improve data quality, which is very time-consuming by nature. We target stream data cleaning, in which data is cleaned incrementally in real time. Bleach is the first qualitative stream data cleaning system, which achieves both real-time violation detection and data repair on a dirty data stream. It relies on efficient, compact and distributed data structures to maintain the state necessary to clean data, and also supports rule dynamics. We demonstrate that the two resulting systems, DiNoDB and Bleach, both achieve excellent performance compared to state-of-the-art approaches in our experimental evaluations, and can help data scientists significantly reduce the time they spend on data preparation. Big data Base de données Système distribué Nettoyage de données Big data Database Distributed system Data cleaning
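The piggybacking idea in entry 107 can be sketched as follows: while the batch job writes its output, it also records where each row starts, so that a later interactive query can seek straight to the row instead of loading the whole file into a database first. The byte-offset map, the CSV layout and the function names are assumptions made for this example; the abstract does not say which metadata DiNoDB actually builds.

```python
import io

def batch_write(rows, out):
    """Write rows as CSV lines and, in the same pass, record the byte
    offset at which each key's line starts (the piggybacked metadata)."""
    offsets = {}
    for key, value in rows:
        offsets[key] = out.tell()
        out.write(f"{key},{value}\n".encode("utf-8"))
    return offsets

def lookup(out, offsets, key):
    """Answer a point query by seeking to the recorded offset, reading
    one line and parsing only that line."""
    out.seek(offsets[key])
    line = out.readline().decode("utf-8").rstrip("\n")
    _, value = line.split(",", 1)
    return value

# An in-memory buffer stands in for a file produced by the batch job.
buf = io.BytesIO()
offsets = batch_write([("a", "1"), ("b", "22"), ("c", "333")], buf)
print(lookup(buf, offsets, "b"))  # 22
```

The extra cost during the batch phase is one dictionary insert per row, which is why producing metadata there is cheap compared with a separate load step.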
108	Méthodes parallèles pour le traitement des flux de données continus / Parallel and continuous join processing for data stream Song, Ge 28 September 2016 (has links) Nous vivons dans un monde où une grande quantité de données est généré en continu. Par exemple, quand on fait une recherche sur Google, quand on achète quelque chose sur Amazon, quand on clique en ‘Aimer’ sur Facebook, quand on upload une image sur Instagram, et quand un capteur est activé, etc., de nouvelles données vont être généré. Les données sont différentes d’une simple information numérique, mais viennent dans de nombreux format. Cependant, les données prisent isolément n’ont aucun sens. Mais quand ces données sont reliées ensemble on peut en extraire de nouvelles informations. De plus, les données sont sensibles au temps. La façon la plus précise et efficace de représenter les données est de les exprimer en tant que flux de données. Si les données les plus récentes ne sont pas traitées rapidement, les résultats obtenus ne sont pas aussi utiles. Ainsi, un système parallèle et distribué pour traiter de grandes quantités de flux de données en temps réel est un problème de recherche important. Il offre aussi de bonne perspective d’application. Dans cette thèse nous étudions l’opération de jointure sur des flux de données, de manière parallèle et continue. Nous séparons ce problème en deux catégories. La première est la jointure en parallèle et continue guidée par les données. La second est la jointure en parallèle et continue guidée par les requêtes. / We live in a world where a vast amount of data is being continuously generated. Data is coming in a variety of ways. For example, every time we do a search on Google, every time we purchase something on Amazon, every time we click a ‘like’ on Facebook, every time we upload an image on Instagram, every time a sensor is activated, etc., it will generate new data. 
Data is more than simple numerical information; it now comes in a variety of forms. However, isolated data is valueless. When this huge amount of data is connected, though, it becomes very valuable as a source of new insights. At the same time, data is time-sensitive. The most accurate and effective way of describing data is to express it as a data stream. If the latest data is not promptly processed, the opportunity to obtain the most useful results will be missed. A parallel and distributed system for processing large volumes of data streams in real time therefore has important research value and good application prospects. This thesis focuses on the study of parallel and continuous data stream joins. We divide this problem into two categories. The first is Data Driven Parallel and Continuous Join, and the second is Query Driven Parallel and Continuous Join. Big Data Flux de Données Calculation en Parallel Exploration de Données Big Data Data Stream Parallel Computing Data Mining
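For readers unfamiliar with continuous joins, the operation studied in entry 108, the sketch below shows a single-node symmetric join over a sliding time window: each arriving tuple is matched against the recent tuples of the opposite stream and then stored for future matches. It is a generic textbook-style illustration, not the data-driven or query-driven parallel methods of the thesis, and it assumes that timestamps arrive in non-decreasing order; the class and parameter names are hypothetical.

```python
from collections import deque

class WindowedStreamJoin:
    """Symmetric equi-join of two streams, "L" and "R", over a time window."""

    def __init__(self, window):
        self.window = window
        # Per-stream buffers of (timestamp, key, value), in arrival order.
        self.state = {"L": deque(), "R": deque()}

    def insert(self, side, timestamp, key, value):
        """Process one tuple and return the join results it produces,
        always as (left_value, right_value) pairs."""
        other = "R" if side == "L" else "L"
        # Evict tuples that have fallen out of the window on both sides.
        for buf in self.state.values():
            while buf and buf[0][0] <= timestamp - self.window:
                buf.popleft()
        # Probe the opposite stream (linear scan kept for brevity).
        matches = [
            (value, v) if side == "L" else (v, value)
            for (_, k, v) in self.state[other]
            if k == key
        ]
        self.state[side].append((timestamp, key, value))
        return matches

join = WindowedStreamJoin(window=10)
join.insert("L", 1, "x", "l1")
print(join.insert("R", 2, "x", "r1"))  # [('l1', 'r1')]
```

A parallel version would have to decide how tuples and window state are partitioned across workers, which is where data-driven and query-driven strategies differ.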
109	Big Data Analys och Revision : En studie om hur revisorer använder analysverktyg på stora datamängder i revisionsprocessen för att säkerställa finansiell information. Ryder, Joel, Arvidsson, Oliver January 2022 (has links) Digitization has made information and data increasingly valuable to most organizations, and in recent years the concept of Big Data has gained ground in the business world. Big Data analysis, i.e. analyzing Big Data, can lead to improved efficiency and quality in the audit. Large investments in Big Data analysis have been made in recent years. Despite these investments, the audit industry lags behind the business world in using Big Data analysis. Researchers argue that there are many uses for Big Data analysis, but few studies show in practice how auditors use it in their profession or which factors affect the use of new audit technologies. The purpose of the study is to gain a deeper understanding of how auditors use Big Data analysis in the audit process to assure financial information, and to explain the factors affecting auditors' use of new audit technologies. The empirical study uses a qualitative method and an inductive approach, based on eight semi-structured interviews with respondents from different accounting firms around Sweden. The interviewees believe that Big Data analysis is only at the beginning of its development and that it will be used in more areas in the future. The study shows that Big Data and Big Data analysis are hard to define in the audit environment. Auditors use Big Data analysis to complement traditional audit procedures rather than to replace them. The factors found to affect auditors' use of Big Data analysis (BDA) are legitimacy, ethics, laws and regulations, the auditor, and the party responsible for the financial information being assured. 
Audit Big Data Big Data analysis Legitimacy Learning Business Administration
110	Explaining the Big Data adoption decision in Small and Medium Sized Enterprises: Cape Town case studies Matross, Lonwabo 29 March 2023 (has links) (PDF) Problem Statement: Small and Medium-Sized Enterprises (SMEs) play an integral role in the economies of developed and developing countries. SMEs constantly search for innovative technologies that will not only reduce their overhead costs but also improve product development, customer relations and profitability. The literature shows that some SMEs around the world have adopted a fairly new technology, Big Data, to achieve higher levels of operational efficiency. It is therefore worth examining why some organizations in developing countries such as South Africa are not adopting this technology, in contrast to their counterparts in developed countries. A large portion of the available literature indicates a general lack of in-depth information about, and understanding of, Big Data amongst SMEs in developing countries such as South Africa. The main objective of this study is to explain the factors that SMEs consider during the Big Data adoption decision process. Purpose of the study: This research study aimed to identify the factors that South African SMEs consider important in their decision-making process when adopting Big Data. The researcher used the conceptual framework proposed by Frambach and Schillewaert to derive an updated and adapted conceptual framework that explains these factors. Research methodology: SMEs located in the Western Province of South Africa were chosen as the case studies. An interpretive research philosophy formed the basis of this research. The nature of the phenomenon being investigated made a qualitative research method and research design appropriate for this thesis. Owing to constraints such as limited time and financial resources, this was a cross-sectional study. 
The research strategy in this study was multiple in-depth case studies. The researcher collected data using two methods: primary research and secondary research. The primary research method enabled the researcher to obtain rich data to help answer the primary research questions, whilst the secondary research method drew on documents that supplemented the primary data. Data was analyzed using the NVivo software provided by the University of Cape Town. Key Findings: The findings suggest that the process influencing SMEs' decision to adopt Big Data follows three steps: (1) Awareness, (2) Consideration, and (3) Intention. This indicates that for SMEs to adopt Big Data there must be organizational readiness to go through the process. The study identified ensuring operational stability as SMEs' main intention in adopting Big Data, with improved operational efficiency as the supporting sub-theme. The study has raised awareness of the process, opportunities and challenges that SMEs, academic researchers, IT practitioners and government need to emphasize to improve the adoption of Big Data by SMEs. Value of the study: The study adds value to both academia and industry by providing more insight into the factors that SMEs consider in the Big Data adoption decision. Big Data Big Data technologies Innovation adoption SMEs Decision-making process Operational efficiency
