Global ETD Search

1	Implementierung und Evaluierung einer Verarbeitung von Datenströmen im Big Data Umfeld am Beispiel von Apache Flink Oelschlegel, Jan 17 May 2021 (has links) Die Verarbeitung von Datenströmen rückt zunehmend in den Fokus beim Aufbau moderner Big Data Infrastrukturen. Der Praxispartner dieser Master-Thesis, die integrationfactory GmbH & Co. KG, möchte zunehmend den Big Data Bereich ausbauen, um den Kunden auch in diesen Aspekten als Beratungshaus Unterstützung bieten zu können. Der Fokus wurde von Anfang an auf Apache Flink gelegt, einem aufstrebenden Stream-Processing-Framework. Das Ziel dieser Arbeit ist die Implementierung verschiedener typischer Anwendungsfälle des Unternehmens mithilfe von Flink und die anschließende Evaluierung dieser. Im Rahmen dessen wird am Anfang zunächst die zentrale Problemstellung festgehalten und daraus die Zielstellungen abgeleitet. Zum besseren Verständnis werden im Nachgang wichtige Grundbegriffe und Konzepte vermittelt. Es wird außerdem dem Framework ein eigenes Kapitel gewidmet, um den Leser einen umfangreichen aber dennoch kompakten Einblick in Flink zu geben. Dabei wurde auf verschiedene Quellen eingegangen, mitunter wurde auch ein direkter Kontakt mit aktiven Entwicklern des Frameworks aufgebaut. Dadurch konnten zunächst unklare Sachverhalte durch fehlende Informationen aus den Primärquellen im Nachgang geklärt und aufbereitet in das Kapitel hinzugefügt werden. Im Hauptteil der Arbeit wird eine Implementierung von definierten Anwendungsfällen vorgenommen. Dabei kommen die Datastream-API und FlinkSQL zum Einsatz, dessen Auswahl auch begründet wird. Die Ausführung der programmierten Jobs findet im firmeneigenen Big Data Labor statt, einer virtualisierten Umgebung zum Testen von Technologien. Als zentrales Problem dieser Master-Thesis sollen beide Schnittstellen auf die Eignung hinsichtlich der Anwendungsfälle evaluiert werden. Auf Basis des Wissens aus den Grundlagen-Kapiteln und der Erfahrungen aus der Entwicklung der Jobs werden Kriterien zur Bewertung mithilfe des Analytic Hierarchy Processes aufgestellt. Im Nachgang findet eine Auswertung statt und die Einordnung des Ergebnisses.:1. Einleitung 1.1. Motivation 1.2. Problemstellung 1.3. Zielsetzung 2. Grundlagen 2.1. Begriffsdefinitionen 2.1.1. Big Data 2.1.2. Bounded vs. unbounded Streams 2.1.3. Stream vs. Tabelle 2.2. Stateful Stream Processing 2.2.1. Historie 2.2.2. Anforderungen 2.2.3. Pattern-Arten 2.2.4. Funktionsweise zustandsbehafteter Datenstromverarbeitung 3. Apache Flink 3.1. Historie 3.2. Architektur 3.3. Zeitabhängige Verarbeitung 3.4. Datentypen und Serialisierung 3.5. State Management 3.6. Checkpoints und Recovery 3.7. Programmierschnittstellen 3.7.1. DataStream-API 3.7.2. FlinkSQL & Table-API 3.7.3. Integration mit Hive 3.8. Deployment und Betrieb 4. Implementierung 4.1. Entwicklungsumgebung 4.2. Serverumgebung 4.3. Konfiguration von Flink 4.4. Ausgangsdaten 4.5. Anwendungsfälle 4.6. Umsetzung in Flink-Jobs 4.6.1. DataStream-API 4.6.2. FlinkSQL 4.7. Betrachtung der Resultate 5. Evaluierung 5.1. Analytic Hierarchy Process 5.1.1. Ablauf und Methodik 5.1.2. Phase 1: Problemstellung 5.1.3. Phase 2: Struktur der Kriterien 5.1.4. Phase 3: Aufstellung der Vergleichsmatrizen 5.1.5. Phase 4: Bewertung der Alternativen 5.2. Auswertung des AHP 6. Fazit und Ausblick 6.1. Fazit 6.2. Ausblick
2	Efficient Data Stream Sampling on Apache Flink / Effektiv dataströmsampling med Apache Flink Vlachou-Konchylaki, Martha January 2016 (has links) Sampling is considered to be a core component of data analysis making it possibleto provide a synopsis of possibly large amounts of data by maintainingonly subsets or multisubsets of it. In the context of data streaming, an emergingprocessing paradigm where data is assumed to be unbounded, samplingoffers great potential since it can establish a representative bounded view ofinfinite data streams to any streaming operations. This further unlocks severalbenefits such as sustainable continuous execution on managed memory, trendsensitivity control and adaptive processing tailored to the operations that consumedata streams.The main aim of this thesis is to conduct an experimental study in order tocategorize existing sampling techniques over a selection of properties derivedfrom common streaming use cases. For that purpose we designed and implementeda testing framework that allows for configurable sampling policiesunder different processing scenarios along with a library of different samplersimplemented as operators. We build on Apache Flink, a distributed streamprocessing system to provide this testbed and all component implementationsof this study. Furthermore, we show in our experimental analysis that there isno optimal sampling technique for all operations. Instead, there are differentdemands across usage scenarios such as online aggregations and incrementalmachine learning. In principle, we show that each sampling policy trades offbias, sensitivity and concept drift adaptation, properties that can be potentiallypredefined by different operators.We believe that this study serves as the starting point towards automatedadaptive sampling selection for sustainable continuous analytics pipelines thatcan react to stream changes and thus offer the right data needed at each time,for any possible operation Sampling Streaming Apache Flink Distributed Systems Computer Sciences Datavetenskap (datalogi)
3	Scalable Data Integration for Linked Data Nentwig, Markus 06 August 2020 (has links) Linked Data describes an extensive set of structured but heterogeneous datasources where entities are connected by formal semantic descriptions. In thevision of the Semantic Web, these semantic links are extended towards theWorld Wide Web to provide as much machine-readable data as possible forsearch queries. The resulting connections allow an automatic evaluation to findnew insights into the data. Identifying these semantic connections betweentwo data sources with automatic approaches is called link discovery. We derivecommon requirements and a generic link discovery workflow based on similaritiesbetween entity properties and associated properties of ontology concepts. Mostof the existing link discovery approaches disregard the fact that in times ofBig Data, an increasing volume of data sources poses new demands on linkdiscovery. In particular, the problem of complex and time-consuming linkdetermination escalates with an increasing number of intersecting data sources.To overcome the restriction of pairwise linking of entities, holistic clusteringapproaches are needed to link equivalent entities of multiple data sources toconstruct integrated knowledge bases. In this context, the focus on efficiencyand scalability is essential. For example, reusing existing links or backgroundinformation can help to avoid redundant calculations. However, when dealingwith multiple data sources, additional data quality problems must also be dealtwith. This dissertation addresses these comprehensive challenges by designingholistic linking and clustering approaches that enable reuse of existing links.Unlike previous systems, we execute the complete data integration workflowvia a distributed processing system. At first, the LinkLion portal will beintroduced to provide existing links for new applications. These links act asa basis for a physical data integration process to create a unified representationfor equivalent entities from many data sources. We then propose a holisticclustering approach to form consolidated clusters for same real-world entitiesfrom many different sources. At the same time, we exploit the semantic typeof entities to improve the quality of the result. The process identifies errorsin existing links and can find numerous additional links. Additionally, theentity clustering has to react to the high dynamics of the data. In particular,this requires scalable approaches for continuously growing data sources withmany entities as well as additional new sources. Previous entity clusteringapproaches are mostly static, focusing on the one-time linking and clustering ofentities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of newentities and data sources. To cope with the ever-increasing number of LinkedData sources, efficient and scalable methods based on distributed processingsystems are required. Thus we propose distributed holistic approaches to linkmany data sources based on a clustering of entities that represent the samereal-world object. The implementation is realized on Apache Flink. In contrastto previous approaches, we utilize efficiency-enhancing optimizations for bothdistributed static and dynamic clustering. An extensive comparative evaluationof the proposed approaches with various distributed clustering strategies showshigh effectiveness for datasets from multiple domains as well as scalability on amulti-machine Apache Flink cluster. info:eu-repo/classification/ddc/000 ddc:000
4	Big Data Analytics Using Apache Flink for Cybercrime Forensics on X (formerly known as Twitter) / Big Data Analytics Using Apache Flink for Cybercrime Forensics on X (formerly known as Twitter) Kakkepalya Puttaswamy, Manjunath January 2023 (has links) The exponential growth of social media usage has led to massive data sharing, posing challenges for traditional systems in managing and analyzing such vast amounts of data. This surge in data exchange has also resulted in an increase in cyber threats from individuals and criminal groups. Traditional forensic methods, such as evidence collection and data backup, become impractical when dealing with petabytes or terabytes of data. To address this, Big Data Analytics has emerged as a powerful solution for handling and analyzing structured and unstructured data. This thesis explores the use of Apache Flink, an open-source tool by the Apache Software Foundation, to enhance cybercrime forensic research. Unlike batch processing engines like Apache Spark, Apache Flink offers real-time processing capabilities, making it well-suited for analyzing dynamic and time-sensitive data streams. The study compares Apache Flink's performance against Apache Spark in handling various workloads on a single node. The literature review reveals a growing interest in utilizing Big Data Analytics, including platforms like Apache Flink, for cybercrime detection and investigation, especially on social media platforms like X (formerly known as Twitter). Sentiment analysis is a vital technique, but challenges arise due to the unique nature of social data. X (formerly known as Twitter), as a valuable source for cybercrime forensics, enables the study of fraudulent, extremist, and other criminal activities. This research explores various data mining techniques and emphasizes the need for real-time analytics to combat cybercrime effectively. The methodology involves data collection from X, preprocessing to remove noise, and sentiment analysis to identify cybercrime-related tweets. The comparative analysis between Apache Flink and Apache Spark demonstrates Flink's efficiency in handling larger datasets and real-time processing. Parallelism and scalability are evaluated to optimize performance. The results indicate that Apache Flink outperforms Apache Spark regarding response time, making it a valuable tool for cybercrime forensics. Despite progress, challenges such as data privacy, accuracy improvement, and cross-platform analysis remain. Future research should focus on refining algorithms, enhancing scalability, and addressing these challenges to further advance cybercrime forensics using Big Data Analytics and platforms like Apache Flink. Apache Flink Apache Spark Big Data Twitter X Computer Sciences Datavetenskap (datalogi)
5	External Streaming State Abstractions and Benchmarking / Extern strömmande statliga abstraktioner och benchmarking Sree Kumar, Sruthi January 2021 (has links) Distributed data stream processing is a popular research area and is one of the promising paradigms for faster and efficient data management. Application state is a first-class citizen in nearly every stream processing system. Nowadays, stream processing is, by definition, stateful. For a stream processing application, the state is backing operations such as aggregations, joins, and windows. Apache Flink is one of the most accepted and widely used stream processing systems in the industry. One of the main reasons engineers choose Apache Flink to write and deploy continuous applications is its unique combination of flexibility and scalability for stateful programmability, and the firm guarantee that the system ensures. Apache Flink’s guarantees always make its states correct and consistent even when nodes fail or when the number of tasks changes. Flink state can scale up to its compute node’s hard disk boundaries using embedded databases to store and retrieve data. Nevertheless, in all existing state backends officially supported by Flink, the state is always available locally to compute tasks. Even though this makes deployment more convenient, it creates other challenges such as non-trivial state reconfiguration and failure recovery. At the same time, compute, and state are bound to be tightly coupled. This strategy also leads to over-provisioning and is counterintuitive on state intensive only workloads or compute-intensive only workloads. This thesis investigates an alternative state backend architecture, FlinkNDB, which can tackle these challenges. FlinkNDB decouples state and computes by using a distributed database to store the state. The thesis covers the challenges of existing state backends and design choices and the new state backend implementation. We have evaluated the implementation of FlinkNDB against existing state backends offered by Apache Flink. / Distribuerad dataströmsbehandling är ett populärt forskningsområde och är ett av de lovande paradigmen för snabbare och effektivare datahantering. Applicationstate är en förstklassig medborgare i nästan alla strömbehandlingssystem. Numera är strömbearbetning per definition statlig. För en strömbehandlingsapplikation backar staten operationer som aggregeringar, sammanfogningar och windows. Apache Flink är ett av de mest accepterade och mest använda strömbehandlingssystemen i branschen. En av de främsta anledningarna till att ingenjörer väljer ApacheFlink för att skriva och distribuera kontinuerliga applikationer är dess unika kombination av flexibilitet och skalbarhet för statlig programmerbarhet, och företaget garanterar att systemet säkerställer. Apache Flinks garantier gör alltid dess tillstånd korrekt och konsekvent även när noder misslyckas eller när antalet uppgifter ändras. Flink-tillstånd kan skala upp till dess beräkningsnods hårddiskgränser genom att använda inbäddade databaser för att lagra och hämta data. I allmänna tillståndsstöd som officiellt stöds av Flink är staten dock alltid tillgänglig lokalt för att beräkna uppgifter. Även om detta gör installationen bekvämare, skapar det andra utmaningar som icke-trivial tillståndskonfiguration och felåterställning. Samtidigt måste beräkning och tillstånd vara tätt kopplade. Den här strategin leder också till överanvändning och är kontraintuitiv för statligt intensiva endast arbetsbelastningar eller beräkningsintensiva endast arbetsbelastningar. Denna avhandling undersöker en alternativ statsbackendarkitektur, FlinkNDB, som kan hantera dessa utmaningar. FlinkNDB frikopplar tillstånd och beräknar med hjälp av en distribuerad databas för att lagra tillståndet. Avhandlingen täcker utmaningarna med befintliga statliga backends och designval och den nya implementeringen av statebackend. Vi har utvärderat genomförandet av FlinkNDBagainst befintliga statliga backends som erbjuds av Apache Flink. Apache Flink Distributed Systems NDB FlinkNDB State State Backends External State Stream Processing Systems Benchmarking Caching Apache Flink Distributed Systems NDB FlinkNDB State State Backends External State Stream Processing Systems Benchmarking Caching Computer and Information Sciences Data- och informationsvetenskap
6	FlinkNDB : Guaranteed Data Streaming Using External State Asif, Muhammad Haseeb January 2021 (has links) Apache Flink is a stream processing framework that provides a unified state management mechanism which, at its core, treats stream processing as a sequence of distributed transactions. Flink handles failures, re-scaling and reconfiguration seamlessly via a form of a two-phase commit protocol that periodically commits all past side effects consistently into the state backends. This involves invoking and combining checkpoints and, in time of need, redistributing the state to resume data pipelines. All the existing Flink state backend implementations, such as RocksDB, are embedded and coupled with the compute nodes. Therefore, recovery time is proportional to the state needed to be reconfigured and that can take from a few seconds to hours. If application logic is compute-heavy and Flink’s tasks are overloaded, scaling out compute pipeline means scaling out storage together with compute tasks and vice-versa because of the embedded state backends. It also introduces delays due to expensive state re-shuffle and moving large state on the wire. This thesis work proposes the decoupling of the state storage from compute to improve Flink’s scalability. It introduces the design and implementation of a new State backend, FlinkNDB, that decouples state storage from compute. Furthermore, we designed and implemented new techniques to perform snapshotting, and failure recovery to reduce the recovery time close to zero. / Apache Flink är ett strömbehandlingsramverk som tillhandahåller en enhetlig tillståndshanteringsmekanism som i sin kärna behandlar strömbehandling som en sekvens av distribuerade transaktioner. Flink hanterar fel, omskalning och omkonfigurering sömlöst via en form av ett tvåfas-engagemangsprotokoll som regelbundet begår alla tidigare biverkningar konsekvent i tillståndets backends. Detta innebär att man åberopar och kombinerar kontrollpunkter och vid behov omdistribuerar dess tillstånd för att återuppta dataledningar. Alla befintliga backendimplementeringar för Flink-tillstånd, som Rocks- DB, är inbäddade och kopplade till beräkningsnoderna. Därför är återhämtningstiden proportionell mot det tillstånd som behöver konfigureras om och det kan ta från några sekunder till timmar. Om applikationslogiken är beräkningstung och Flinks uppgifter är överbelastade, innebär utskalning av beräkningsrörledning att utskalning av lagring, tillsammans med beräkningsuppgifter och vice versa på grund av det inbäddade tillståndet i backend. Det introducerar också förseningar i förhållande till dyra tillståndsförflyttningar och flyttning av stora datamängder som upptar stora delar av bandbredden. Detta avhandlingsarbete föreslår frikoppling av tillståndslagring från beräkning för att förbättra Flinks skalbarhet. Den introducerar designen och implementeringen av ett nytt tillstånd i backend, FlinkNDB, som frikopplar tillståndslagring från beräkning. Avslutningsvis designade och implementerade vi nya tekniker för att utföra snapshotting och felåterställning för att minska återhämtningstiden till nära noll. Apache Flink NDB Flink State Backend RocksDB State Backend State management Large State Applications Apache Flink NDB Flink State Backend RocksDB State Backend State management Large State Applications Computer and Information Sciences Data- och informationsvetenskap
7	Investigating programming language support for fault-tolerance Demirkoparan, Ismail January 2023 (has links) Dataflow systems have become the norm for developing data-intensive computing applications. These systems provide transparent scalability and fault tolerance. For fault tolerance, many dataflow-system adopt a snapshotting approach which persists the state of an operator once it has received a snapshot marker on all its input channels. This approach requires channels to be blocked for potentially prolonged durations until all other input channels have received their markers to guarantee that no events from the future make it into the operator’s present state snapshot. Alignment can for this reason have a severe performance impact. In particular, for black-box user-defined operators, the system has no knowledge about how events from different channels affect the operator’s state. Thus, the system must conservatively assume that all events affect the same state and align all channels. In this thesis, we argue that alignment between two channels is unnecessary if messages from those channels are not written to the same output channel. We propose a snapshotting approach for the fault tolerance and call it partial approach. The partial approach does not require alignment when an operator’s input channels are independent. Two input channels are independent if their events do not affect the same state and are never written to the same output channel. We propose the use of static code analysis to identify such dependencies. To enable this analysis, we translate operators into finite state machines that make the operator’s state explicit. As a proof of concept, we extend the implementation of Arc-Lang, an existing dataflow language, so that applications written in it transparently execute with fault tolerance. We evaluate our approach by comparing it to a baseline eager approach that always requires alignment between the input channels. The conducted experiments’ results show that the partial approach performs about 47 % better than the eager approach when the streaming sources are producing data at different velocities. / Dataflödessystem har blivit normen för utveckling av dataintensiva datorapplikationer. Dessa system erbjuder transparent skalbarhet och felhantering. För felhantering adopterar många dataflödessystem en snapshot-approach som sparar en operatörs tillstånd när den har fått en snapshot-markör på alla sina ingångskanaler. Denna metod kräver att kanalerna blockeras under möjligen förlängda tidsperioder tills alla andra ingångskanaler har fått sina markörer, vilket görs för att garantera att inga händelser från framtiden når operatörens nuvarande tillstånd. Synkronisering mellan kanaler kan därför ha en allvarlig prestandapåverkan. Särskilt för black-box användardefinierade operatörer där systemet inte har kunskap om hur händelser från olika kanaler påverkar operatörens tillstånd. Systemet måste därför konservativt anta att alla händelser påverkar samma tillstånd och synkronisera alla kanaler. I denna avhandling argumenterar vi för att synkroniseringen mellan två kanaler inte är nödvändig om meddelanden från de kanalerna inte skrivs till samma utgångskanal. Vi föreslår en snapshot-approach för felhantering och kallar den för partial-approach. Partial-approach kräver inte justering när en operatörs ingångskanaler är oberoende. Två ingångskanaler är oberoende om deras händelser inte påverkar samma tillstånd och aldrig skrivs till samma utgångskanal. Vi föreslår användning av statisk kodanalys för att identifiera sådana beroenden. För att möjliggöra denna analys översätter vi operatörer till finite state machines som gör operatörens tillstånd explicit. För att bevisa konceptet utökar vi implementeringen av Arc-Lang, vilket är en existerande dataflödesspråk, så att program skrivna i den transparent körs med felhantering. Vi utvärderar vår approach genom att jämföra den med en baseline eager-approach som alltid kräver justering mellan ingångskanalerna. Resultaten från de genomförda experimenten visar att partial-approach presterar cirka 47 % bättre än eager-approach när sourcestreams producerar data i otakt. Dataflow Fault tolerance Data streaming Distributed systems Checkpointing Logging Lineage Lineage stash Arc-Lang Apache Flink Computer and Information Sciences Data- och informationsvetenskap
8	Comparison of Popular Data Processing Systems Nasr, Kamil January 2021 (has links) Data processing is generally defined as the collection and transformation of data to extract meaningful information. Data processing involves a multitude of processes such as validation, sorting summarization, aggregation to name a few. Many analytics engines exit today for largescale data processing, namely Apache Spark, Apache Flink and Apache Beam. Each one of these engines have their own advantages and drawbacks. In this thesis report, we used all three of these engines to process data from the Carbon Monoxide Daily Summary Dataset to determine the emission levels per area and unit of time. Then, we compared the performance of these 3 engines using different metrics. The results showed that Apache Beam, while offered greater convenience when writing programs, was slower than Apache Flink and Apache Spark. Spark Runner in Beam was the fastest runner and Apache Spark was the fastest data processing framework overall. / Databehandling definieras generellt som insamling och omvandling av data för att extrahera meningsfull information. Databehandling involverar en mängd processer som validering, sorteringssammanfattning, aggregering för att nämna några. Många analysmotorer lämnar idag för storskalig databehandling, nämligen Apache Spark, Apache Flink och Apache Beam. Var och en av dessa motorer har sina egna fördelar och nackdelar. I den här avhandlingsrapporten använde vi alla dessa tre motorer för att bearbeta data från kolmonoxidens dagliga sammanfattningsdataset för att bestämma utsläppsnivåerna per område och tidsenhet. Sedan jämförde vi prestandan hos dessa 3 motorer med olika mått. Resultaten visade att Apache Beam, även om det erbjuds större bekvämlighet när man skriver program, var långsammare än Apache Flink och Apache Spark. Spark Runner in Beam var den snabbaste löparen och Apache Spark var den snabbaste databehandlingsramen totalt. Apache Spark Apache Flink Apache Beam Spark Runner Flink Runner Direct Runner Big Data Analytics Data Processing Systems Benchmarking Kaggle Computer and Information Sciences Data- och informationsvetenskap
9	Improving Availability of Stateful Serverless Functions in Apache Flink / Förbättring av Tillgänglighet För Tillståndsbaserade Serverlösa Funktioner i Apache Flink Gustafson, Christopher January 2022 (has links) Serverless computing and Function-as-a-Service are rising in popularity due to their ease of use, provided scalability and cost-efficient billing model. One such platform is Apache Flink Stateful Functions. It allows application developers to run serverless functions with state that is persisted using the underlying stream processing engine Apache Flink. Stateful Functions use an embedded RocksDB state backend, where state is stored locally at each worker. One downside of this architecture is that state is lost if a worker fails. To recover, a recent snapshot of the state is fetched from a persistent file system. This can be a costly operation if the size of the state is large. In this thesis, we designed and developed a new decoupled state backend for Apache Flink Stateful Functions, with the goal of increasing availability while measuring potential performance trade-offs. It extends an existing decoupled state backend for Flink, FlinkNDB, to support the operations of Stateful Functions. FlinkNDB stores state in a separate highly available database, RonDB, instead of locally at the worker nodes. This allows for fast recovery as large state does not have to be transferred between nodes. Two new recovery methods were developed, eager and lazy recovery. The results show that lazy recovery can decrease recovery time by up to 60% compared to RocksDB when the state is large. Eager recovery did not provide any recovery time improvements. The measured performance was similar between RocksDB and FlinkNDB. Checkpointing times in FlinkNDB were however longer, which cause short periodic performance degradation. The evaluation of FlinkNDB suggests that decoupled state can be used to improve availability, but that there might be performance deficits included. The proposed solution could thus be a viable option for applications with high requirements of availability and lower performance requirements. / Serverlös datorberäkning och Function-as-a-Service (FaaS) ökar i popularitet på grund av dess enkelhet att använda, skalbarhet och kostnadseffektiva fakturerings-model. En sådan platform är Apache Flink Stateful Functions. Den tillåter applikationsutvecklare att köra serverlösa funktioner med varaktigt tillstånd genom den underliggande strömprocesseringsmotorn Apache Flink. Stateful Functions använder en inbyggd RocksDB tillståndslagring, där tillstånd lagras lokalt på arbetarnoderna. Ett problem med denna arkitektur är att tillstånd förloras om en arbetarnod krashar. För att återhämta sig behöver systemet hämta en tidigare sparad tillståndskopia från ett varaktivt filsystem, vilket kan bli kostsamt om tillståndet är stort. I denna uppsatts har vi designat och utvecklat en ny prototyp för att separat hantera tillstånd i Apache Flink Stateful Functions, med målet att öka tillgängligheten utan att förlora prestanda. Prototypen är en vidareutveckling av en existerande separat tillståndshantering för Flink, FlinkNDB, som utökades för att kunna hantera Stateful Functions. FlinkNDB sparar tillstånd i en separat högtillgänglig database, RonDB, istället för att spara tillstånd lokalt på arbetarnoderna. Detta möjliggör snabb återhämtning då inte stora mängder tillstånd behöver skickas mellan noder. Två återhämtningsmetoder utvecklades, ivrig och lat återhämtning. Resultaten visar att lat återhämtning kan sänka återhämtningstiden med upp till 60% jämfört med RocksDB då tillståndet är stort. Ivrig återhämtning visade inte några förbättringar i återhämtningstid. Prestandan var liknande mellan RocksDB och FlinkNDB. Tiden för checkpoints var däremot längre för FlinkNDB vilket orsakade korta periodiska prestandadegraderingar jämfört med RocksDB. Evalueringen av FlinkNDB föreslår att separat tillståndshantering kan öka tillgängligheten av Stateful Functions, men att detta kan innebära vissa prestanda degraderingar. Den föreslagna lösningen kan således vara ett bra alternativ när det finns höga krav på tillgänglighet, men lågra krav på prestanda. Function-as-a-Service Stateful Serverless Functions Apache Flink StateFun Availability RonDB RocksDB Function-as-a-Service ApacheFlink StateFun Tillgänglighet RonDB RocksDB Software Engineering Programvaruteknik
10	New authentication mechanism using certificates for big data analytic tools Velthuis, Paul January 2017 (has links) Companies analyse large amounts of sensitive data on clusters of machines, using a framework such as Apache Hadoop to handle inter-process communication, and big data analytic tools such as Apache Spark and Apache Flink to analyse the growing amounts of data. Big data analytic tools are mainly tested on performance and reliability. Security and authentication have not been enough considered and they lack behind. The goal of this research is to improve the authentication and security for data analytic tools.Currently, the aforementioned big data analytic tools are using Kerberos for authentication. Kerberos has difficulties in providing multi factor authentication. Attacks on Kerberos can abuse the authentication. To improve the authentication, an analysis of the authentication in Hadoop and the data analytic tools is performed. The research describes the characteristics to gain an overview of the security of Hadoop and the data analytic tools. One characteristic is that the usage of the transport layer security (TLS) for the security of data transportation. TLS usually establishes connections with certificates. Recently, certificates with a short time to live can be automatically handed out.This thesis develops new authentication mechanism using certificates for data analytic tools on clusters of machines, providing advantages over Kerberos. To evaluate the possibility to replace Kerberos, the mechanism is implemented in Spark. As a result, the new implementation provides several improvements. The certificates used for authentication are made valid with a short time to live and are thus less vulnerable to abuse. Further, the authentication mechanism solves new requirements coming from businesses, such as providing multi-factor authenticationand scalability.In this research a new authentication mechanism is developed, implemented and evaluated, giving better data protection by providing improved authentication. Cloud Access Management certificate on demand Apache Spark Apache Flink Kerberos transport security layer (TLS) Authentication Multi Factor Authentication Authentication for data analytic tools certificate based Spark authentication public key encryption distributed authentication short valid authentication Computer Sciences Datavetenskap (datalogi)

Search results