101 |
Geo-distributed multi-layer stream aggregation
Cannalire, Pietro January 2018
Standard processing architectures satisfy many applications by employing existing stream processing frameworks that manage distributed data processing. In some cases, geographically distributed data sources require the processing to be distributed further, over a large area, by means of a geographically distributed architecture. The issue addressed in this work is reducing the data movement that flows continuously across the network in a geo-distributed architecture, from streaming sources to the processing location and among processing entities within the same distributed cluster. Reducing data movement can be critical for lowering bandwidth costs, since links in the middle of the network can be costly to access and the cost grows with the amount of data exchanged. In this work we propose a different concept for deploying geographically distributed architectures, relying on Apache Spark Structured Streaming and Apache Kafka. The features an algorithm needs in order to run on a geo-distributed architecture are provided. The algorithms executed on this architecture apply windowing and data synopsis techniques to produce summaries of the input data and to address issues of the geographically distributed architecture. The computation of the average and the Misra-Gries algorithm are then implemented to test the designed architecture. This thesis contributes a new model for building geographically distributed architectures. The experimental results show that, for the algorithms running on top of the geo-distributed architecture, computation time is reduced on average by 70% compared to the distributed setup. Similarly, the amount of data exchanged across the network is reduced on average by 99% compared to the distributed setup.
/ Standard processing architectures are sufficient to meet the needs of many applications through the use of existing stream processing frameworks with support for distributed data processing. In specific cases, geographically distributed data sources may require the data processing to be spread over a large area by means of a geographically distributed architecture. The problem addressed in this work is the reduction of continuous data transfer in a network with a geo-distributed architecture. Reduced data transfer can be decisive for lowering bandwidth costs, since accessing links placed in the middle of a network can be expensive and becomes more so as data transfer increases. In this work we want to create a new concept for setting up geographically distributed architectures using Apache Spark Structured Streaming and Apache Kafka. The features and prerequisites an algorithm needs in order to run on a geographically distributed architecture are provided. The algorithms to be run on this architecture apply windowing and data synopsis techniques to produce a summary of the input data and to handle problems concerning the geographically distributed architecture. Computation of the average and the Misra-Gries algorithm are implemented to test the constructed architecture. This thesis contributes a new model for building geographically distributed architectures. Experimental results show that computation time is reduced by 70% on average for the algorithms running on top of the geo-distributed architecture compared to the distributed setup. Similarly, the amount of data exchanged over the network is reduced by 99% on average compared to the distributed setup.
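The abstract above names two summaries, the average and the Misra-Gries frequent-items algorithm, that are computed close to the sources so that only small synopses cross the network. The following is a minimal plain-Python sketch of that idea and not the thesis's Spark Structured Streaming and Kafka implementation. The function names (`misra_gries`, `merge_summaries`, `merge_averages`) and the parameter `k` are illustrative assumptions.

```python
from collections import Counter


def misra_gries(stream, k):
    """Summarize one window of a stream with at most k - 1 counters.

    Each returned count underestimates the true frequency by at most n / k,
    where n is the number of items in the window.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def merge_summaries(a, b, k):
    """Merge two Misra-Gries summaries (hypothetical upper-layer step)."""
    merged = Counter(a) + Counter(b)
    if len(merged) <= k - 1:
        return dict(merged)
    # Subtract the k-th largest count and keep only the positive remainders.
    kth = sorted(merged.values(), reverse=True)[k - 1]
    return {key: c - kth for key, c in merged.items() if c - kth > 0}


def merge_averages(partials):
    """Combine (sum, count) pairs from lower layers into a global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


window = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
local_summary = misra_gries(window, k=3)
global_summary = merge_summaries({"x": 4, "y": 2}, {"x": 1, "z": 3}, k=3)
global_average = merge_averages([(10.0, 4), (20.0, 6)])
```

In this sketch each site would forward only its counter dictionary or its (sum, count) pair, which is why the data sent upstream stays small however many raw events a window contains.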
|
102 |
Проектирование архитектуры и разработка модуля корпоративной платформы с использованием брокера сообщений для организации событийной интеграции между системами : магистерская диссертация / Architecture design and development of an enterprise platform module using a message broker to organize event-based integration between systems
Матвеева, Ю. А., Matveeva, Y. A. January 2022
The thesis examines approaches to building an integration platform for a chain of interacting services located in network segments with different accessibility. The main focus is on designing and developing an enterprise service bus that implements event-based integration by means of a message broker. The result is a finished enterprise platform module that can be adapted to similar systems. The master's thesis consists of an introduction, three chapters, and a conclusion, set out on 79 pages, together with a bibliography. The work contains 18 figures. The bibliography comprises 36 entries. / The paper considers approaches to creating an integration platform for a chain of interacting services located in network segments with different accessibility. The main attention is on the design and development of an enterprise service bus that implements event-based integration using a message broker. As a result, a ready-made solution for the corporate platform module is presented, which can be adapted to similar systems. The master's thesis consists of an introduction, three chapters and a conclusion set out on 79 pages, as well as a bibliographic list. There are 18 figures in the work. The bibliographic list consists of 36 titles.
|
103 |
RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File system
Bhat, Adithya January 2015
No description available.
|
104 |
Empirical Evaluation of Edge Computing for Smart Building Streaming IoT Applications
Ghaffar, Talha 13 March 2019
Smart buildings are one of the most important emerging applications of the Internet of Things (IoT). The astronomical growth in IoT devices, in the data generated by these devices, and in ubiquitous connectivity has given rise to a new computing paradigm, referred to as "Edge computing", which argues for data analysis to be performed at the "edge" of the IoT infrastructure, near the data source. The development of efficient Edge computing systems must be based on an advanced understanding of the performance benefits that Edge computing can offer. The goal of this work is to develop this understanding by examining the end-to-end latency and throughput characteristics of Smart building streaming IoT applications when deployed at the resource-constrained infrastructure Edge, and to compare them against the performance that can be achieved by utilizing the Cloud's data-center resources. This work also presents a real-time streaming application to detect and localize the footstep impacts generated by a building's occupant while walking. We characterize this application's performance for Edge and Cloud computing and utilize a hybrid scheme that (1) offers a maximum latency reduction of around 60% and 65% compared to Edge and Cloud respectively for similar throughput performance and (2) enables processing of higher ingestion rates by eliminating the network bottleneck. / Master of Science / Among the various emerging applications of the Internet of Things (IoT) are Smart buildings, which allow us to monitor and manipulate various operating parameters of a building by instrumenting it with sensor and actuator devices (Things). These devices operate continuously and generate unbounded streams of data that need to be processed at low latency. Until recently, this data has been processed by IoT applications deployed in the Cloud, at the cost of the high network latency of accessing the Cloud's resources.
However, the increasing availability of IoT devices, ubiquitous connectivity, and exponential growth in the volume of IoT data have given rise to a new computing paradigm, referred to as "Edge computing". Edge computing argues that IoT data should be analyzed near its source (at the network's Edge) in order to eliminate the high latency of accessing the Cloud for data processing. In order to develop efficient Edge computing systems, an in-depth understanding of the trade-offs involved in the Edge and Cloud computing paradigms is required. In this work, we seek to understand these trade-offs and the potential benefits of Edge computing. We examine the end-to-end latency and throughput characteristics of Smart building streaming IoT applications by deploying them at the resource-constrained Edge and compare them against the performance that can be achieved by Cloud deployment. We also present a real-time streaming application to detect and localize the footstep impacts generated by a building's occupant while walking. We characterize this application's performance for Edge and Cloud computing and utilize a hybrid scheme that (1) offers a maximum latency reduction of around 60% and 65% compared to Edge and Cloud respectively for similar throughput performance and (2) enables processing of higher ingestion rates by eliminating the network bottleneck.
|
105 |
A COMPARISON OF DATA INGESTION PLATFORMS IN REAL-TIME STREAM PROCESSING PIPELINES
Tallberg, Sebastian January 2020
In recent years there has been an increasing demand for real-time streaming applications that handle large volumes of data with low latency. Examples of such applications include real-time monitoring and analytics, electronic trading, advertising, fraud detection, and more. In a streaming pipeline the first step is ingesting the incoming data events, after which they can be sent off for processing. Choosing a tool that satisfies the application's requirements is an important technical decision. This thesis focuses entirely on the data ingestion part by evaluating three different platforms: Apache Kafka, Apache Pulsar and Redis Streams. The platforms are compared on both characteristics and performance. Architectural and design differences reveal that Kafka and Pulsar are better suited to use cases involving long-term persistent storage of events, whereas Redis is a potential solution when only short-term persistence is required. They all provide means for scalability and fault tolerance, ensuring high availability and reliable service. Two metrics, throughput and latency, were used to evaluate performance in a single-node cluster. Kafka proves to be the most consistent in throughput but performs the worst in latency. Pulsar manages high throughput with small message sizes but struggles with larger message sizes. Pulsar performs the best in overall average latency across all message sizes tested, followed by Redis. The tests also show Redis to be the most inconsistent in throughput across different message sizes.
|
106 |
Implementering av testplattform för end-to-end streaming telemetry i nätverk [Implementation of a test platform for end-to-end streaming telemetry in networks]
Erlandsson, Niklas January 2020
The aims of this study are to implement a test environment for streaming telemetry and to compare two alternatives for enabling real-time analysis of the collected data. The two alternatives are the Python libraries PyKafka and Confluent-Kafka-Python. The assessment criteria for the comparison were documentation, amount of code, and memory usage. The test environment for streaming telemetry uses a router running Cisco IOS XR software that sends data to a Cisco Pipeline collector, which in turn forwards the data to a Kafka cluster. The comparison of the Python libraries was carried out in the Python language. The results of the comparison showed that both libraries had well-written documentation and differed little in amount of code, but Confluent-Kafka-Python used less memory. The study shows that streaming telemetry with real-time analysis can work well as a complement to or a replacement for SNMP. The study recommends using Confluent-Kafka-Python for implementation in production environments with a large number of network devices, given its lower memory usage. / The goals of this study are to implement a test environment for streaming telemetry and to compare two alternatives for analysing the collected data in real time. The two alternatives are the Python libraries PyKafka and Confluent-Kafka-Python. The comparison focused mainly on three areas: documentation, amount of code, and memory usage. The test environment for streaming telemetry was set up with a router running IOS XR software that sends data to a Cisco Pipeline collector, which in turn sends data to a Kafka cluster. The comparison of the two libraries for interfacing with the cluster was made in the Python language. The results of the comparison showed that both libraries had well-written documentation and a negligible difference in amount of code. Memory usage was considerably lower with the Confluent-Kafka-Python library.
The study shows that streaming telemetry together with real-time analysis makes a good complement to or a replacement for SNMP. The study further recommends the use of Confluent-Kafka-Python in real-world implementations of streaming telemetry, particularly in large networks with a large number of devices.
|
107 |
A Performance Comparison of an Event-Driven Node.js Web Server and Multi-Threaded Web Servers / En prestandajämförelse mellan en händelsestyrd Node.js-webbserver och flertrådiga webbservrar
Vilhelmsson, Isak January 2021
The goal of this study is to conduct a performance comparison between Node.js, Apache, Internet Information Services (IIS) and Go web servers in terms of throughput and memory consumption in both Input/Output (I/O)-intensive and computation-intensive situations. The computation-intensive tests consisted of calculating Fibonacci numbers, while the I/O-intensive tests consisted of querying a database. JMeter was used to send the requests and collect client-side data, while Windows Performance Monitor was used to collect data on the resource use of the web servers on the server-side computer. The results showed that the Go web server had the highest throughput and lowest memory consumption in all of the tests, with an average increase in throughput of 26% and an average decrease in memory consumption of 66% compared to the web servers placing second in the tests. The IIS web server most often placed second behind Go. Contrary to previous studies, Node.js performed worse than Apache in the I/O-intensive tests. The results also showed that the Apache web server performed poorly in computation-intensive situations in terms of throughput. The conclusion is that the results indicate that the Go web server performs better than the Apache, IIS and Node.js web servers in both I/O-intensive and computation-intensive situations in terms of both throughput and memory consumption. / The goal of this study is to carry out a performance comparison between Node.js, Apache, IIS and Go web servers, measured in throughput and memory allocation, in both I/O-intensive and computation-intensive situations. The computation-intensive tests consisted of calculating Fibonacci numbers, while the I/O-intensive tests consisted of requesting data from a database. The program JMeter was used to issue the client computer's requests to the server computer and to collect data about the requests. The Windows program Performance Monitor was used to collect data on the web servers' resource usage on the server computer.
The results showed that the Go web server had the highest throughput and the lowest memory allocation in all tests, with throughput on average 26% higher and memory allocation on average 66% lower than the web servers that performed second best in the tests. The web server that most often performed second best was IIS. Contrary to the results of previous studies, Node.js performed worse than Apache in the I/O-intensive tests. Apache turned out to perform poorly in the computation-intensive tests. The study's conclusion is that the results indicate that the Go web server performs better than the Node.js, Apache and IIS web servers in both I/O-intensive and computation-intensive situations in terms of throughput and memory allocation.
|
108 |
Методология запуска Apache Spark в различных менеджерах контейнеров (Hadoop, Kubernetes) : магистерская диссертация / Methodology for running Apache Spark in various container managers (Hadoop, Kubernetes)
Краубаев, А. С., Kraubaev, A. S. January 2023
The aim of the work is to develop a methodology for running Apache Spark in the Hadoop and Kubernetes cluster environments, intended for students, developers, and data engineers who want to broaden their knowledge. The object of the research is the practice of applying the methodology for running Apache Spark in the Kubernetes and Hadoop cluster environments. Results of the work: practical use of containerization and the Kubernetes cluster environment to introduce the methodology for running Apache Spark. The final qualifying work was prepared in the Microsoft Word text editor and submitted in hard copy. / The goal of the work is to develop a methodology for running Apache Spark in the Hadoop and Kubernetes cluster environments, aimed at students, developers and data engineers who are interested in expanding their horizons. The object of research is the practice of applying the methodology for launching Apache Spark in the Kubernetes and Hadoop cluster environments. Results of the work: practice of using containerization and the Kubernetes cluster environment to introduce the methodology for launching Apache Spark. The final qualifying work was completed in the Microsoft Word text editor and provided in hard copy.
|
109 |
Resource-efficient and fast Point-in-Time joins for Apache Spark : Optimization of time travel operations for the creation of machine learning training datasets / Resurseffektiva och snabba Point-in-Time joins i Apache Spark : Optimering av tidsresningsoperationer för skapande av träningsdata för maskininlärningsmodeller
Pettersson, Axel January 2022
A common scenario in which modern machine learning models are trained is to use past data to make predictions about the future. When working with multiple structured and time-labeled datasets, it has become more common to use a join operator called the Point-in-Time join, or PIT join, to construct these datasets. The PIT join matches entries from the left dataset with entries of the right dataset, where the matched entry is the row whose recorded event time is closest to the left row's timestamp, out of all the right entries whose event time occurred before or at the same time as the left event time. This feature has long been part only of time series data processing tools but has recently received a new wave of attention due to the rising popularity of feature stores. To perform such an operation on a large amount of data, data engineers commonly turn to large-scale data processing tools such as Apache Spark. However, Spark does not have a native implementation of these joins, and there has been no clear consensus in the community on how this should be achieved. This, along with previous implementations of the PIT join, raises the question: "How to perform fast and resource-efficient Point-in-Time joins in Apache Spark?". To answer this question, three different algorithms for performing a PIT join in Spark have been developed and compared in terms of resource consumption and execution time. These algorithms were benchmarked on generated datasets with varying physical partitions and sorting structures. Furthermore, the scalability of the algorithms was tested by running them on Apache Spark clusters of varying sizes. The benchmark results showed that the best measurements were achieved by performing the join using Early Stop Sort-Merge Join, a modified version of the regular Sort-Merge Join native to Spark.
The best performing datasets were those sorted by timestamp and primary key, ascending or descending, using a suitable number of physical partitions. Using the new information gathered by this project, data engineers have been provided with general guidelines for optimizing their data processing pipelines to perform more resource-efficient and faster PIT joins. / A common scenario in machine learning is to train models on previously observed data in order to make predictions about the future. When working with several structured, time-labeled datasets, it has become more common to use a join operator called the Point-in-Time join, or PIT join, to construct these datasets. A PIT join matches rows from the left dataset with rows in the right dataset, where the matched row is the one whose recorded event time is closest to the left row's event time, among all rows in the right dataset whose event time occurred before or at the same time as the left event time. This functionality has long been part only of data processing tools for time-based data, but has recently gained popularity because of the growing interest in feature stores. To perform such an operation on large amounts of data, data engineers usually turn to large-scale data processing tools such as Apache Spark. Spark, however, has no built-in implementation of this join operation, and there is no clear consensus in the Spark community on how it should be achieved. This, together with the earlier implementations of PIT joins, raises the question: "What is the most efficient way to perform a PIT join in Apache Spark?". To answer this question, three different algorithms were developed and compared with respect to resource consumption and execution time. To compare the algorithms, they were executed on generated datasets with different physical partitionings and sorting structures.
In addition, the scalability of the algorithms was tested by running them on Spark clusters of varying size. The results showed that the best measurements were achieved by performing the operation with the Early Stop Sort-Merge Join algorithm, a modified version of the regular Sort-Merge Join built into Spark, with a dataset sorted by timestamp and primary key, either ascending or descending. Physical partitioning of the data could also give better results, but the optimal number of physical partitions can vary depending on the data itself. Using the new information gathered in this project, data engineers have been given general guidelines for optimizing their data processing pipelines in order to perform more resource-efficient and faster PIT joins.
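The matching rule of the PIT join described above can be illustrated on small in-memory data. The sketch below is plain Python using a per-key binary search. It is not the Early Stop Sort-Merge Join from the thesis and does not use Spark. The function name `pit_join` and the row layouts are assumptions made for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict


def pit_join(left, right):
    """Point-in-Time join on small in-memory data.

    left:  iterable of (key, timestamp) rows
    right: iterable of (key, event_time, value) rows
    For each left row, returns the value of the right row with the same key
    and the latest event_time that is <= the left timestamp, or None when no
    such row exists.
    """
    rows_by_key = defaultdict(list)
    for key, event_time, value in right:
        rows_by_key[key].append((event_time, value))

    # Sort each key's rows by event time so that binary search applies.
    times_by_key = {}
    for key, rows in rows_by_key.items():
        rows.sort(key=lambda row: row[0])
        times_by_key[key] = [event_time for event_time, _ in rows]

    result = []
    for key, timestamp in left:
        times = times_by_key.get(key, [])
        # Index of the first right row strictly later than the left timestamp.
        idx = bisect_right(times, timestamp)
        match = rows_by_key[key][idx - 1][1] if idx > 0 else None
        result.append((key, timestamp, match))
    return result


left_rows = [("u1", 10), ("u1", 3), ("u2", 7)]
right_rows = [("u1", 1, "a"), ("u1", 5, "b"), ("u1", 10, "c"), ("u2", 8, "d")]
joined = pit_join(left_rows, right_rows)
```

Here the left row ("u1", 10) matches the right row recorded at time 10, because equal timestamps are included, while ("u2", 7) has no match since its only candidate was recorded later.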
|
110 |
A greenhouse evaluation of plant species for use in revegetation of Black Mesa coal mine overburden material
Mitchell, Gregg F. January 1979
No description available.
|