Global ETD Search

1	HopsWorks: A project-based access control model for Hadoop
Moré, Andre; Gebremeskel, Ermias. January 2015.
Global data-gathering capacity is growing, producing a vast amount of data at an ever faster rate. Properly analyzed, this data can represent a great opportunity for businesses, but processing it is resource-intensive. Sharing can increase efficiency through reuse, but legal and ethical questions arise when data is shared. The purpose of this thesis is to gain an in-depth understanding of the different access control methods that can be used to facilitate sharing, and to choose one to implement on a platform that lets users analyze, share, and collaborate on datasets. The resulting platform uses project-based access control at the API level and fine-grained role-based access control on the file system to give the data owner full control over the shared data. / Today, enormous amounts of data are generated and collected, growing at an ever higher pace with each passing day. Correctly analyzed, this data could offer great opportunities for companies, but it is very resource-intensive to process. Making it possible for organizations to share data would make the whole process more efficient thanks to data reuse, but questions then arise about the legal and ethical aspects of sharing this data. The purpose of this report is to gain a deeper understanding of the different access methods that can be used when sharing data, in order to then choose the method judged most suitable for use in a platform. The platform will be used by users who want to create projects in which to analyze, share, and work with DataSets. The platform's security will be implemented with project-based access control at the API level and detailed role-based access control on the file system, to give the data owner full control over the data that is shared.
Keywords: Hops, HopsWorks, Hadoop, DataSets, Big Data, Distributed Computing
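As an illustration of the two-layer idea in entry 1 (project membership checked first, then a role-based permission), here is a minimal sketch. It is not HopsWorks code: the role names, the permission sets, and the `Project` and `is_allowed` names are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical roles and permissions, chosen only for illustration.
ROLE_PERMISSIONS = {
    "data_owner": {"read", "write", "share"},
    "data_scientist": {"read"},
}


@dataclass
class Project:
    """A project groups datasets and the users allowed to work on them."""
    name: str
    members: dict = field(default_factory=dict)  # user name -> role name

    def add_member(self, user: str, role: str) -> None:
        if role not in ROLE_PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.members[user] = role


def is_allowed(project: Project, user: str, action: str) -> bool:
    """Deny by default: non-members and unlisted actions are rejected."""
    role = project.members.get(user)
    if role is None:
        return False
    return action in ROLE_PERMISSIONS[role]


demo = Project("genomics")
demo.add_member("alice", "data_owner")
demo.add_member("bob", "data_scientist")
print(is_allowed(demo, "alice", "share"))  # owner role includes "share"
print(is_allowed(demo, "bob", "write"))    # read-only role lacks "write"
print(is_allowed(demo, "carol", "read"))   # carol is not a project member
```

The deny-by-default check keeps the data owner in control: a user gains access only through an explicit membership entry in the project.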
2	Multitenant PrestoDB as a service
Yedurupak, Aruna Kumari. January 2017.
In recent years, there has been tremendous growth in the volume of data that is produced, stored, and queried by organizations. Organizations spend more money to investigate and obtain useful information or knowledge from terabytes and even petabytes of data. Large-scale data analysis is the key functionality provided by Big Data platforms. Previously, data platforms would get the information from unstructured data in the form of files, text, and videos. In recent times, the Hadoop stack has played a vital role in Big Data, becoming the de facto open-source software used to process and analyze Big Data. Hops is a Hadoop distribution developed by KTH and RISE SICS. Hops modifies the Hadoop stack by moving the metadata for YARN and HDFS to NDB, an open-source in-memory distributed database. HopsWorks is the user interface for Hops and provides support for multi-tenant users, as well as self-service, graphical access to frameworks such as Hadoop, Flink, Spark, Kafka, and Kibana. HopsWorks currently does not provide a SQL-on-Hadoop service, although work is ongoing to support Hive. Presto is one of the main SQL-on-Hadoop platforms, but Presto does not currently provide multi-tenancy support for users. This thesis investigates providing multi-tenancy support to Presto with the help of HopsWorks, addressing both the security problem and the self-service UI requirements of HopsWorks. Presto is a distributed SQL query engine that can run SQL queries against up to petabytes of data. As HopsWorks provides UI access to services, we decided to build our UI for Presto on an existing open-source UI for Presto, called Airpal, developed by Airbnb. The solution provided in the thesis is divided into two parts. The first maintains two separate applications (HopsWorks and Airpal) running in two JVMs, with a ProxyServlet to control traffic between them. The second, HopsWorks-Presto-service, leverages the HopsWorks access control (Data owner and Data scientist) and self-service security model. The evaluation takes a qualitative approach, comparing HopsWorks-Presto-service with standalone PrestoDB and with HopsWorks without Presto-service. / In recent years, there has been a considerable increase in the amount of data produced, stored, and used for analysis by various organizations. Organizations spend more money to investigate and extract information and insights from enormous data volumes of several terabytes or petabytes. Large-scale data analysis is a central functionality provided by Big Data platforms. In earlier approaches, data platforms retrieved unstructured data in the form of files, texts, and video clips. Today, the Hadoop stack plays a core role in Big Data and has become an important piece of open-source software used to process and analyze Big Data. Hops is a Hadoop distribution developed by KTH and RISE SICS. Hops changes the Hadoop stack by migrating the metadata for YARN and HDFS to NDB, an open-source in-memory distributed database. HopsWorks is a user interface for Hops and adds support for multiple users, with access to self-service and services such as Hadoop, Flink, Spark, Kafka, and Kibana. HopsWorks does not currently support any SQL-on-Hadoop service, although work is under way to integrate Hive. Presto is one of the most popular SQL-on-Hadoop platforms, but Presto does not currently support multiple users. This thesis investigates support for multiple users in Presto with the help of HopsWorks, with regard to both security problems and self-service in HopsWorks. Presto is a distributed SQL query engine that can run queries against up to petabytes of data. Since HopsWorks provides an interface for interacting with services, we decided to build an interface for Presto on the existing open-source interface for Presto named AirPal, developed by Airbnb. The solution developed for the thesis can be divided into two parts. The first manages two separate applications (HopsWorks and AirPal) that run on two Java virtual machines and uses a ProxyServlet to control traffic between them. The second, HopsWorks-Presto-service, provides HopsWorks access control (Data owner and Data scientist) and a self-service security model. The evaluation uses a qualitative approach to compare HopsWorks-Presto-service with a standalone PrestoDB and with HopsWorks without Presto-service.
Keywords: Hadoop, Presto, SQL, Multi-tenancy, Hops, HopsWorks, Airpal, Proxy servlet
Subject: Computer Sciences
3	Ablation Programming for Machine Learning
Sheikholeslami, Sina. January 2019.
As machine learning systems are used in an increasing number of applications, from analysis of satellite sensory data and health-care analytics to smart virtual assistants and self-driving cars, they are also becoming more and more complex. This means that more time and computing resources are needed to train the models, and the number of design choices and hyperparameters increases as well. Due to this complexity, it is usually hard to explain the effect of each design choice or component of the machine learning system on its performance. A simple approach to this problem is to perform an ablation study, a scientific examination of a machine learning system in order to gain insight into the effects of its building blocks on its overall performance. However, ablation studies are currently not part of standard machine learning practice. One of the key reasons is that performing an ablation study currently requires major modifications to the code as well as extra compute and time resources. On the other hand, experimentation with a machine learning system is an iterative process that consists of several trials. A popular approach is to run these trials in parallel on an Apache Spark cluster. Since Apache Spark follows the Bulk Synchronous Parallel model, parallel execution of trials includes several stages separated by barriers. This means that a new set of trials can be executed only after all trials from the previous stage have finished. As a result, a lot of time and computing resources are usually wasted on unpromising trials that could have been stopped soon after their start. We have attempted to address these challenges by introducing MAGGY, an open-source framework for asynchronous and parallel hyperparameter optimization and ablation studies with Apache Spark and TensorFlow. This framework allows for better resource utilization as well as ablation studies and hyperparameter optimization in a unified and extendable API. / As machine learning systems are used in a growing number of applications, from analysis of satellite sensor data and health care to smart virtual assistants and self-driving cars, they are also becoming more and more complex. This means that more time and computing resources are needed to train the models, and the number of design choices and hyperparameters will also increase. Because of this complexity, it is often hard to understand what effect each component and design choice in a machine learning system has on the end result. A simple method for gaining insight into how the different components of a machine learning system affect its performance is to carry out an ablation study. An ablation study is a scientific examination of a machine learning system to gain insight into the effect of each of its building blocks on its overall performance. In practice, however, ablation studies are not yet common in machine learning. One of the most important reasons is that carrying out an ablation study currently requires both major code changes and extra compute and time resources. We have tried to address these challenges by using a combination of distributed asynchronous computation and machine learning. We introduce maggy, an open-source framework for asynchronous and parallel hyperparameter optimization and ablation studies with PySpark and TensorFlow. This framework enables better resource utilization as well as ablation studies and hyperparameter optimization in a unified and extensible API.
Keywords: Distributed Machine Learning, Distributed Systems, Ablation Studies, Apache Spark, Keras, Hopsworks
Subject: Computer and Information Sciences
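The ablation study described in entry 3 can be sketched as a leave-one-component-out loop: evaluate the full system, then re-evaluate it with each component removed and record the drop in score. The sketch below is not the MAGGY API. The `ablation_study` function, the component names, and the toy scoring function (a stand-in for training and validating a model) are assumptions for illustration.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


def ablation_study(
    components: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], int],
) -> Tuple[int, Dict[str, int]]:
    """Return the baseline score and the score drop caused by removing
    each component in turn (leave-one-component-out)."""
    full = frozenset(components)
    baseline = evaluate(full)
    drops = {}
    for component in sorted(full):
        # Each ablation trial is independent of the others, so a real
        # system could run these evaluations in parallel.
        score = evaluate(full - {component})
        drops[component] = baseline - score
    return baseline, drops


# Hypothetical per-component contributions to a validation score.
TOY_CONTRIBUTION = {"conv_block": 30, "dropout": 5, "augmentation": 10}


def toy_evaluate(active: FrozenSet[str]) -> int:
    """Toy stand-in for 'train the model, then measure accuracy'."""
    return 50 + sum(TOY_CONTRIBUTION[name] for name in active)


baseline, drops = ablation_study(TOY_CONTRIBUTION, toy_evaluate)
print(baseline)
print(drops)
```

Because no trial depends on another trial's result, the loop body is the natural unit to hand to parallel workers, which is where avoiding stage barriers would pay off.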
4	Project-based Multi-tenant Container Registry For Hopsworks
Kashyap, Pradyumna Krishna. January 2020.
There has been substantial growth in the usage of data in the past decade, and cloud technologies and big data platforms have gained popularity as they help in processing such data on a large scale. Hopsworks is one such managed platform for scale-out data science. It is an open-source platform for the development and operation of Machine Learning models, available on-premise and as a managed platform in the cloud. As most of these platforms provide data science environments to collate the required libraries to work with, Hopsworks provides users with Anaconda environments. Hopsworks provides multi-tenancy, ensuring a secure model to manage sensitive data in the shared platform. Most of the Hopsworks features are built around projects; each project includes an Anaconda environment that provides users with a number of libraries capable of processing data. Each project creation triggers the creation of a base Anaconda environment, and each added library updates this environment. For an on-premise application, as data science teams are diverse and work towards building repeatable and scalable models, it becomes increasingly important to manage these environments in a central location locally. The purpose of the thesis is to provide secure storage for these Anaconda environments. As Hopsworks uses a Kubernetes cluster to serve models, these environments can be containerized and stored on a secure container registry on the Kubernetes cluster. The provided solution also aims to extend the multi-tenancy feature of Hopsworks onto the hosted local storage. The implementation comprises two parts. The first is to host a compatible open-source container registry to store the container images on a local Kubernetes cluster with fault tolerance and without a single point of failure. The second is to leverage the multi-tenancy feature in Hopsworks by storing the images on the self-sufficient secure registry with project-level isolation. / There has been significant growth in data usage over the past decade, and cloud technologies and big data platforms have gained popularity because they help process such data at large scale. Hopsworks is one such managed platform for scale-out data science. It is an open-source platform for developing and operating Machine Learning models, available on-premise and as a managed platform in the cloud. Since most of these platforms provide data science environments to gather the libraries needed for the work, Hopsworks gives users Anaconda environments. Hopsworks provides multi-tenancy, which ensures a secure model for handling sensitive data in the shared platform. Most Hopsworks features are built around projects; each project contains an Anaconda environment that gives users a number of libraries that can process data. Each project creation triggers the creation of a base Anaconda environment, and each added library updates this environment. For an on-premise application, since data science teams are diverse and work to build repeatable and scalable models, it becomes increasingly important to manage these environments in a central location locally. The purpose of the thesis is to provide secure storage for these Anaconda environments. Since Hopsworks uses a Kubernetes cluster to serve models, these environments can be containerized and stored in a secure container registry in the Kubernetes cluster. The provided solution also aims to extend the Hopsworks multi-tenancy feature to the locally hosted storage. The implementation consists of two parts. The first is to host a compatible open-source registry for storing the container images in a local Kubernetes cluster with fault tolerance and without a single point of failure. The second is to use the multi-tenancy feature in Hopsworks by storing the images in the self-sufficient secure registry with project-level isolation.
Keywords: Cloud, Big Data, Hopsworks, Data Science, On-premise, Multitenancy, Container Registry, Kubernetes
Subject: Computer and Information Sciences
5	Multi-Tenant Apache Kafka for Hops: Kafka Topic-Based Multi-Tenancy and ACL-Based Authorization for Hops
Dessalegn Muruts, Misganu. January 2016.
Apache Kafka is a distributed, high-throughput, and fault-tolerant publish/subscribe messaging system in the Hadoop ecosystem. It is used as a distributed data streaming and processing platform. Kafka topics are the units of message feeds in the Kafka cluster. A Kafka producer publishes messages into these topics, and a Kafka consumer subscribes to topics to pull those messages. With the increased usage of Kafka in the data infrastructure of many companies, there are many Kafka clients that publish messages to and consume messages from Kafka topics. Some of these client operations can be malicious. To mitigate this risk, clients must authenticate themselves, and their operations must be authorized before they can access a given topic. Kafka ships with a pluggable Authorizer interface for implementing access control list (ACL) based authorization of client operations. Kafka users can implement the interface differently to satisfy their security requirements. SimpleACLAuthorizer is the out-of-the-box implementation of the interface and uses Zookeeper for ACL storage. HopsWorks, based on Hops, a next-generation Hadoop distribution, provides support for project-based multi-tenancy, where projects are fully isolated at the level of the Hadoop Filesystem and YARN. In this project, we added Kafka topic-based multi-tenancy to Hops projects. A Kafka topic is created from inside a Hops project and persisted both in Zookeeper and in the NDBCluster. Persisting a topic in a database enabled topic sharing across projects. ACLs are added to Kafka topics and are persisted only in the database. Client access to Kafka topics is authorized based on these ACLs. ACLs are added, updated, listed, and removed from the HopsWorks WebUI. HopsACLAuthorizer, a Hops implementation of the Authorizer interface, authorizes Kafka client operations using the ACLs in the database. An Apache Avro schema registry for topics lets producers and consumers integrate better by exchanging messages in a pre-established format. The result of this project is the first Hadoop distribution that supports Kafka multi-tenancy.
Keywords: Hadoop, Kafka, Hops, HopsWorks, Multi-Tenancy, Kafka Topics, Schema Registry, Messaging Systems, ACL Authorization
Subject: Computer Sciences
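The ACL-based authorization in entry 5 can be pictured with a small sketch. This is not Kafka's Authorizer interface and not HopsACLAuthorizer: the `Acl` record, the `authorize` function, the principal and topic names, and the in-memory list standing in for the database are all assumptions. The sketch also assumes that a matching deny entry overrides any allow entry and that access is denied when no entry matches.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Acl:
    """One access control entry for a topic (hypothetical layout)."""
    principal: str   # user name, or "*" for any user
    topic: str
    operation: str   # e.g. "read" or "write"
    permission: str  # "allow" or "deny"


def authorize(acls: Iterable[Acl], principal: str, topic: str,
              operation: str) -> bool:
    """Return True only if an allow entry matches and no deny entry does."""
    matching = [
        acl for acl in acls
        if acl.topic == topic
        and acl.operation == operation
        and acl.principal in (principal, "*")
    ]
    if any(acl.permission == "deny" for acl in matching):
        return False  # assumption: deny takes precedence over allow
    return any(acl.permission == "allow" for acl in matching)


# In-memory stand-in for ACLs that would be loaded from a database.
ACLS = [
    Acl("*", "sensor-readings", "read", "allow"),
    Acl("mallory", "sensor-readings", "read", "deny"),
    Acl("alice", "sensor-readings", "write", "allow"),
]

print(authorize(ACLS, "bob", "sensor-readings", "read"))      # wildcard allow applies
print(authorize(ACLS, "mallory", "sensor-readings", "read"))  # explicit deny applies
print(authorize(ACLS, "bob", "sensor-readings", "write"))     # no matching entry
```

Keeping the entries in a queryable store rather than in code is what makes it practical to add, update, list, and remove ACLs from a web UI.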
