  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Distributed Frameworks Towards Building an Open Data Architecture

Venumuddala, Ramu Reddy 05 1900 (has links)
Data is everywhere. Advances in digital technology and social media, together with the ease with which applications can interact with a wide variety of systems, are generating tremendous volumes of data. Because of this variety of services, data is no longer restricted to structured formats such as text; it also includes unstructured content such as social media posts, videos and images. Generated data is of little use unless it is stored and analyzed to derive value. Traditional database systems come with limitations on data format and schema, access rates and storage size. Hadoop is an Apache open-source distributed framework that reliably stores huge datasets in many formats on its file system, the Hadoop Distributed File System (HDFS), and processes the data stored on HDFS using the MapReduce programming model. This thesis is about building a data architecture using Hadoop and its related open-source distributed frameworks to support a data-flow pipeline on low-cost commodity hardware. The data-flow components are data sourcing, storage management on HDFS and a data access layer. The study also discusses a use case that exercises these architecture components. Sqoop, a framework for ingesting structured data from a database into Hadoop, and Flume are used to ingest semi-structured streaming Twitter JSON data onto HDFS for analysis. The data sourced with Sqoop and Flume is analyzed with Hive for SQL-like analytics, and at a higher level of the data access layer, Hadoop is compared with Spark, an in-memory computing system. Significant differences in query execution performance are observed between the Hadoop and Spark frameworks. This integration makes it possible to ingest huge volumes of streaming JSON data of varied structure and derive better value-based analytics using Hive and Spark.
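As a hedged illustration of the data access layer described above (not code from the thesis), the Java sketch below runs a SQL-like aggregation through Hive's JDBC interface; the HiveServer2 address and the `tweets` table name are assumptions made for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch of the data-access layer: run a SQL-like aggregation in Hive
// over streaming JSON data previously landed on HDFS by Flume.
// The host, port, database and table name are illustrative assumptions.
public class HiveAccessExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Count tweets per language; the same query could be issued in Spark for comparison
             ResultSet rs = stmt.executeQuery(
                 "SELECT lang, COUNT(*) AS n FROM tweets GROUP BY lang ORDER BY n DESC")) {
            while (rs.next()) {
                System.out.println(rs.getString("lang") + "\t" + rs.getLong("n"));
            }
        }
    }
}
```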
2

Escalonamento adaptativo para o Apache Hadoop / Adaptive scheduling for Apache Hadoop

Cassales, Guilherme Weigert 11 March 2016 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / Many alternatives have been employed to process, in a timely manner, all the data generated by current applications. One of these alternatives, Apache Hadoop, combines parallel and distributed processing with the MapReduce paradigm to provide an environment able to process huge data volumes using a simple programming model. However, Apache Hadoop was designed for dedicated and homogeneous clusters, a limitation that creates challenges for those who wish to use the framework in other circumstances. Acquiring a dedicated cluster is often impracticable because of its cost, and the purchase of replacement parts can threaten a cluster's homogeneity. In such cases, an option commonly taken by companies is to use the idle computing resources in their network, but the original Hadoop distribution shows serious performance problems under these conditions. This study therefore aimed to improve Hadoop's capacity to adapt to pervasive and shared environments, where the availability of resources varies during execution. To that end, context-awareness techniques were used to collect information about the available capacity of each worker node, and distributed-communication techniques were used to keep this information up to date in the scheduler. The joint use of both techniques aimed to minimize or eliminate the overload that would otherwise occur on shared nodes, resulting in a performance improvement of up to 50% on a shared cluster compared with the original distribution. The results indicate that a simple solution can positively impact scheduling, increasing the variety of environments in which Hadoop can be used.
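To illustrate the kind of context information such a collector could gather on a worker node (a sketch of the idea only, not the scheduler modification developed in the thesis), the following Java snippet samples CPU load and free memory through the JVM's platform MXBean:

```java
import java.lang.management.ManagementFactory;

// Illustrative context probe for a worker node: samples CPU load and free memory
// so a scheduler could adjust how many tasks the node should receive.
// The "spare slots" heuristic below is a hypothetical example, not the thesis's policy.
public class NodeCapacityProbe {
    public static void main(String[] args) {
        // The cast to com.sun.management.OperatingSystemMXBean works on HotSpot-based JVMs.
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();

        double cpuLoad = os.getSystemCpuLoad();            // 0.0..1.0, or < 0 if unavailable
        long freeMemBytes = os.getFreePhysicalMemorySize();
        int cores = Runtime.getRuntime().availableProcessors();

        // Simple heuristic: spare task slots proportional to idle CPU capacity.
        int spareSlots = cpuLoad < 0 ? cores : (int) Math.floor(cores * (1.0 - cpuLoad));

        System.out.printf("cpuLoad=%.2f freeMemMB=%d spareSlots=%d%n",
                cpuLoad, freeMemBytes / (1024 * 1024), spareSlots);
    }
}
```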
3

Big Data v technológiách IBM / Big Data in technologies from IBM

Šoltýs, Matej January 2014 (has links)
This diploma thesis presents Big Data technologies and their possible use cases and applications. The theoretical part first focuses on the definition of the term Big Data and then on Big Data technology, particularly the Hadoop framework. The principles of Hadoop, such as distributed storage and data processing, and its individual components are described, and the largest vendors of Big Data technologies are presented. This part of the thesis ends with possible use cases of Big Data technologies and several case studies. The practical part describes the implementation of a demonstration example of Big Data technologies and is divided into two chapters. The first chapter deals with the conceptual design of the example, the products used and the architecture of the solution. The second chapter then describes the implementation itself, from the preparation of the demo environment to the creation of the applications. The goals of this thesis are to describe and characterize Big Data, present the largest vendors and their Big Data products, describe possible use cases of Big Data technologies and, above all, implement a demonstration example in Big Data tools from IBM.
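As a minimal illustration of the distributed-storage principle mentioned above, the sketch below writes a small file to HDFS and reads it back through Hadoop's Java FileSystem API; the NameNode address and the path are assumptions made for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Minimal sketch of HDFS usage: write a small file and read it back through the
// FileSystem API. The fs.defaultFS value and the path are illustrative assumptions.
public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());   // prints: hello, HDFS
        }
        fs.close();
    }
}
```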
4

Fast Data Analysis Methods For Social Media Data

Nhlabano, Valentine Velaphi 07 August 2018 (has links)
The advent of Web 2.0 technologies, which support the creation and publishing of social media content in a collaborative and participatory way by all users in the form of user-generated content and social networks, has led to the creation of vast amounts of structured, semi-structured and unstructured data. The sudden rise of social media has led to its wide adoption by organisations of various sizes worldwide in order to take advantage of this new way of communicating and engaging with their stakeholders in ways that were unimaginable before. Data generated from social media is highly unstructured, which makes it challenging for most organisations, which are normally equipped to handle and analyse structured data from business transactions. The research reported in this dissertation investigates fast and efficient methods for retrieving, storing and analysing unstructured data from social media in order to make crucial and informed business decisions on time. Sentiment analysis was conducted on Twitter data, called tweets. Twitter, one of the most widely adopted social network services, provides an API (Application Programming Interface) that allows researchers and software developers to connect to and collect public data sets from the Twitter database. A Twitter application was created and used to collect streams of real-time public data via the Twitter source provided by Apache Flume and to store this data efficiently in the Hadoop Distributed File System (HDFS). Apache Flume is a distributed, reliable and available system used to efficiently collect, aggregate and move large amounts of log data from many different sources to a centralised data store such as HDFS. Apache Hadoop is an open-source software library that runs on low-cost commodity hardware and can store, manage and analyse large amounts of both structured and unstructured data quickly, reliably and flexibly at low cost. A lexicon-based sentiment analysis approach was taken, with the AFINN-111 lexicon used for scoring. The Twitter data was analysed from HDFS using a Java MapReduce implementation. MapReduce is a programming model, and an associated implementation, for processing and generating big data sets with a parallel, distributed algorithm on a cluster. The results demonstrate that it is fast, efficient and economical to use this approach to analyse unstructured social media data in real time. / Dissertation (MSc)--University of Pretoria, 2019. / National Research Foundation (NRF) - Scarce skills / Computer Science / MSc / Unrestricted
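A minimal sketch of the lexicon-based MapReduce step described above might look as follows; the tiny in-memory word list stands in for the full AFINN-111 lexicon, and the input is assumed to be plain tweet text rather than the raw JSON delivered by Flume.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a lexicon-based sentiment mapper in the spirit of the dissertation:
// each input line is assumed to hold the plain text of one tweet (real Flume
// output is JSON and would need parsing first). The tiny in-memory lexicon
// below is a stand-in for the full AFINN-111 word list.
public class SentimentMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> lexicon = new HashMap<>();

    @Override
    protected void setup(Context context) {
        // In a real job the AFINN-111 file would be loaded here, e.g. from the distributed cache.
        lexicon.put("good", 3);
        lexicon.put("great", 3);
        lexicon.put("bad", -3);
        lexicon.put("terrible", -3);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int score = 0;
        for (String token : value.toString().toLowerCase().split("\\W+")) {
            score += lexicon.getOrDefault(token, 0);
        }
        // Emit one score per tweet; a reducer could then aggregate per hashtag or per hour.
        context.write(new Text("tweet-" + key.get()), new IntWritable(score));
    }
}
```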
5

Sharing the love : a generic socket API for Hadoop Mapreduce

Yee, Adam J. 01 January 2011 (has links)
Hadoop is a popular software framework written in Java that performs data-intensive distributed computations on a cluster. It includes Hadoop MapReduce and the Hadoop Distributed File System (HDFS). HDFS has known scalability limitations due to its single NameNode, which holds the entire file system namespace in RAM on one computer. The NameNode can therefore only store a limited number of file names, depending on its RAM capacity. The solution for further scalability is to distribute the namespace, similar to how file data is divided into chunks and stored across cluster nodes. Hadoop has an abstract file system API, which is extended to integrate HDFS but has also been extended to integrate the S3, CloudStore, Ceph and PVFS file systems. Ceph and PVFS already distribute the namespace, while others, such as Lustre, are making the conversion. Google announced in 2009 that it had been implementing a distributed namespace for the Google File System to achieve greater scalability. The Generic Hadoop API is created from Hadoop's abstract file system API. It speaks a simple communication protocol that can integrate any file system which supports TCP sockets. By providing a file-system-agnostic API, future work with other file systems might provide ways of surpassing Hadoop's current scalability limitations. Furthermore, the new API eliminates the need to customize Hadoop's Java implementation and instead moves the implementation to the file system itself. Thus, developers wishing to integrate their new file system with Hadoop are not responsible for understanding the details of Hadoop's internal operation. The API is tested on a homogeneous, four-node cluster with OrangeFS. Initial OrangeFS I/O throughputs compared to HDFS are 67% of HDFS' write throughput and 74% of HDFS' read throughput. However, compared with an alternative method of integrating with OrangeFS (a POSIX kernel interface), write and read throughput are increased by 23% and 7%, respectively.
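The thesis's wire protocol is not reproduced here, but the hypothetical Java sketch below shows the general shape of a socket-based file-system call that such a generic API could delegate to; the command format, host and port are invented for the example.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the kind of socket call a file-system-agnostic Hadoop
// API could delegate to. The command format ("STAT <path>"), host and port are
// invented for illustration; the thesis's actual protocol is not reproduced here.
public class SocketFsClient {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 7000);
             PrintWriter out = new PrintWriter(
                 new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {

            // Ask the file-system daemon for metadata about a path, then print its reply.
            out.println("STAT /user/demo/input.txt");
            System.out.println("server replied: " + in.readLine());
        }
    }
}
```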
6

BigData řešení pro zpracování rozsáhlých dat ze síťových toků / BigData Approach to Management of Large Netflow Datasets

Melkes, Miloslav January 2014 (has links)
This master's thesis focuses on the distributed processing of big data from network communication. It begins by exploring network communication based on the TCP/IP model, with a focus on the data units at each layer that must be processed during analysis. For the actual processing of big data, the thesis describes the MapReduce programming model, the architecture of the Apache Hadoop technology, and its use for processing network flows on a computer cluster. The second part of the thesis deals with the design and subsequent implementation of an application for processing network flows from network communication; the main and problematic parts of the implementation are discussed. The thesis concludes with a comparison with available applications for network analysis and an evaluation based on a set of tests, which confirmed linear growth in speedup.
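As a hedged illustration of this kind of flow processing (not the thesis's implementation), the Java sketch below defines a MapReduce job that totals transferred bytes per source IP address, assuming flow records exported as simple CSV lines.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

// Illustrative MapReduce job in the spirit of the thesis (not its actual code):
// aggregate transferred bytes per source IP address from flow records that are
// assumed to have been exported as CSV lines of the form "srcIP,dstIP,bytes,...".
public class FlowBytesJob {

    public static class FlowBytesMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                try {
                    ctx.write(new Text(fields[0].trim()),
                              new LongWritable(Long.parseLong(fields[2].trim())));
                } catch (NumberFormatException ignored) {
                    // skip malformed lines (e.g. a header row)
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "bytes-per-source-ip");
        job.setJarByClass(FlowBytesJob.class);
        job.setMapperClass(FlowBytesMapper.class);
        job.setReducerClass(LongSumReducer.class);   // sum reducer shipped with Hadoop
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // exported flow records
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // aggregated totals
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```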
7

New Primitives for Tackling Graph Problems and Their Applications in Parallel Computing

Zhong, Peilin January 2021 (has links)
We study fundamental graph problems under parallel computing models. In particular, we consider two parallel computing models: the Parallel Random Access Machine (PRAM) and Massively Parallel Computation (MPC). The PRAM model is a classic model of parallel computation. The efficiency of a PRAM algorithm is measured by its parallel time and the number of processors needed to achieve that parallel time. The MPC model is an abstraction of modern massively parallel computing systems such as MapReduce, Hadoop and Spark. The MPC model captures well coarse-grained computation on large data: data is distributed to processors, each of which has a sublinear (in the input data) amount of local memory, and we alternate between rounds of computation and rounds of communication, where each machine can communicate an amount of data as large as the size of its memory. We usually desire fully scalable MPC algorithms, i.e., algorithms that can work for any local memory size. The efficiency of a fully scalable MPC algorithm is measured by its parallel time and the total space usage (the local memory size times the number of machines). Consider an n-vertex, m-edge undirected graph G (either weighted or unweighted) with diameter D (the largest diameter of its connected components), and let N = m + n denote the size of G. We present a series of efficient (randomized) parallel graph algorithms with theoretical guarantees. Several results are listed as follows: 1) Fully scalable MPC algorithms for graph connectivity and spanning forest using O(N) total space and O(log D · log log_{N/n} n) parallel time. 2) Fully scalable MPC algorithms for 2-edge and 2-vertex connectivity using O(N) total space, where the 2-edge connectivity algorithm needs O(log D · log log_{N/n} n) parallel time and the 2-vertex connectivity algorithm needs O(log D · log² log_{N/n} n + log D' · log log_{N/n} n) parallel time; here D' denotes the bi-diameter of G. 3) PRAM algorithms for graph connectivity and spanning forest using O(N) processors and O(log D · log log_{N/n} n) parallel time. 4) PRAM algorithms for (1 + ε)-approximate shortest path and (1 + ε)-approximate uncapacitated minimum cost flow using O(N) processors and poly(log n) parallel time. These algorithms are built on a series of new graph algorithmic primitives which may be of independent interest.
8

Big data - použití v bankovní sféře / Big data - application in banking

Uřídil, Martin January 2012 (has links)
There is a growing volume of global data, which offers new possibilities for those market participants who know how to take advantage of it. Data, information and knowledge are a new, highly regarded commodity, especially in the banking industry. Traditional data analytics is intended for processing data with a known structure and meaning, but how can we derive knowledge from data with no such structure? The thesis focuses on Big Data analytics and its use in the banking and financial industry. Its main goals are to define specific applications in this area and to describe the benefits for international and Czech banking institutions. The thesis is divided into four parts. The first part defines the Big Data trend and the second specifies activities and tools in banking. The purpose of the third part is to apply Big Data analytics to those activities and show its possible benefits. The last part focuses on the particularities of Czech banking and shows the actual situation regarding Big Data in Czech banks. The thesis gives a complex description of the possibilities of using Big Data analytics. I see my personal contribution in the detailed characterization of its application to real banking activities.
9

Scientific Workflows for Hadoop

Bux, Marc Nicolas 07 August 2018 (has links)
Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today's data-driven science. Over the last decades, scientific workflow management systems have emerged to facilitate the design, execution, and monitoring of such workflows. At the same time, the amounts of data generated in various areas of science have outpaced hardware advancements. Parallelization and distributed execution are generally proposed to deal with increasing amounts of data. However, the resources provided by distributed infrastructures are subject to heterogeneity, dynamic performance changes at runtime, and occasional failures. To leverage the scalability provided by these infrastructures despite the observed aspects of performance variability, workflow management systems have to progress: parallelization potential in scientific workflows has to be detected and exploited; simulation frameworks, which are commonly employed for the evaluation of scheduling mechanisms, have to consider the instability encountered on the infrastructures they emulate; adaptive scheduling mechanisms have to be employed to optimize resource utilization in the face of instability; and state-of-the-art systems for scalable distributed resource management and storage, such as Apache Hadoop, have to be supported. This dissertation presents novel solutions for these aspirations. First, we introduce DynamicCloudSim, a cloud computing simulation framework that is able to adequately model the various aspects of variability encountered in computational clouds. Secondly, we outline ERA, an adaptive scheduling policy that optimizes workflow makespan by exploiting heterogeneity, replicating bottlenecks in workflow execution, and adapting to changes in the underlying infrastructure. Finally, we present Hi-WAY, an execution engine that integrates ERA and enables the highly scalable execution of scientific workflows written in a number of languages on Hadoop.
10

Návrh a implementace testovacího systému na architektuře GRID / Design and Implementation of a Grid Testing System

Hubík, Filip January 2013 (has links)
This project addresses the parallelization of building and testing projects written in the Java programming language. It proposes software that uses continuous integration methods and the parallelization and distribution of computationally intensive tasks on a grid architecture. The proposed software helps accelerate the development of a software product and automate parts of it.
