1

An approach to choosing the right distributed file system: Microsoft DFS vs. Hadoop DFS

Musatoiu, Mihai January 2015
Context. An important goal of most IT groups is to manage server resources in such a way that their users are provided with fast, reliable and secure access to files. The modern needs of organizations mean that resources are often geographically distributed, calling for new design solutions if file systems are to remain highly available and efficient. This is where distributed file systems (DFSs) come into the picture. A distributed file system (DFS), as opposed to a "classical", local, file system, is accessible across some kind of network and allows clients to access files remotely as if they were stored locally. Objectives. This paper has the goal of comparatively analyzing two distributed file systems, Microsoft DFS (MSDFS) and Hadoop DFS (HDFS). The two systems come from different "worlds" (proprietary in the case of Microsoft DFS vs. open-source in the case of Hadoop DFS); the abundance of solutions and the variety of choices available today make such a comparison all the more relevant. Methods. The comparative analysis is done on a cluster of 4 computers running dual installations of Microsoft Windows Server 2012 R2 (the MSDFS environment) and Linux Ubuntu 14.04 (the HDFS environment). The comparison covers read and write operations on files and sets of files of increasing sizes, as well as a set of key usage scenarios. Results. Comparative results are produced for read and write operations on files of increasing size - 1 MB, 2 MB, 4 MB and so on up to 4096 MB - and on sets of small files (64 KB each) amounting to totals of 128 MB, 256 MB and so on up to 4096 MB. The results expose the behavior of the two DFSs under different types of stress: when the size of the transferred file increases, and when the same quantity of data is divided into tens of thousands of small files. The behavior in the key usage scenarios is observed and analyzed. Conclusions. HDFS performs better at writing large files, while MSDFS is better at writing many small files. On read operations, the two show similar performance, with a slight advantage for MSDFS. In the key usage scenarios, HDFS shows more flexibility, but MSDFS could be the better choice depending on the needs of the users (for example, most of its common functions can be configured through the graphical user interface).
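As a rough illustration of the write benchmark described above (a sketch only, using the standard Hadoop Java API; the cluster address and paths are assumptions, not details from the thesis):

    // Times HDFS writes of files of doubling size, 1 MB up to 4096 MB,
    // mirroring the file sizes used in the thesis's experiments.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteBenchmark {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed address
            FileSystem fs = FileSystem.get(conf);

            byte[] chunk = new byte[1024 * 1024]; // 1 MB write buffer
            for (int sizeMb = 1; sizeMb <= 4096; sizeMb *= 2) {
                Path path = new Path("/bench/file_" + sizeMb + "mb");
                long start = System.nanoTime();
                try (FSDataOutputStream out = fs.create(path, true)) {
                    for (int i = 0; i < sizeMb; i++) {
                        out.write(chunk); // stream the file in 1 MB chunks
                    }
                }
                double secs = (System.nanoTime() - start) / 1e9;
                System.out.printf("write %4d MB: %.2f s (%.1f MB/s)%n",
                        sizeMb, secs, sizeMb / secs);
            }
            fs.close();
        }
    }

A matching read benchmark would open the same paths with fs.open() and time a full sequential read.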
2

P2PHDFS: An Implementation of Statistic Multiplexed Computing Architecture in Hadoop File System

Pradeep, Aakash January 2012
The Peer to Peer Hadoop Distributed File System (P2PHDFS) is designed to store and process extremely large-scale data sets reliably. It is a first-attempt implementation of the Statistic Multiplexed Computing Architecture concept proposed by Dr. Shi, applied to the existing Hadoop Distributed File System (HDFS) to eliminate all single points of failure. Unlike HDFS, in P2PHDFS every node is designed to be equal and behaves as a file system server as well as a slave, which enables it to attain both higher performance and higher reliability as the infrastructure scales up. Due to its data-intensive nature, a full implementation of P2PHDFS must address the challenges of the CAP theorem. This MS project is intended only as a starting point and uses only sequential replication at this time. / Computer and Information Science
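As a loose sketch of the sequential replication mentioned above (all names here are hypothetical; this is not the P2PHDFS code), a block is persisted on each peer in turn, with each write completing before the next one starts:

    import java.util.List;

    public class SequentialReplication {
        // Hypothetical abstraction of a P2PHDFS-style peer that both
        // serves and stores data.
        interface Peer {
            void store(byte[] block);
        }

        // Sequential replication: the block is written to one peer at a
        // time, so each copy exists before the next transfer begins.
        static void replicate(byte[] block, List<Peer> chain) {
            for (Peer peer : chain) {
                peer.store(block);
            }
        }
    }

Pipelined or parallel replication strategies would be the natural extension once the CAP trade-offs noted above are addressed.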
3

Aplikace pro Big Data / Application for Big Data

Blaho, Matúš January 2018
This work deals with the description and analysis of the Big Data concept and its processing and use in decision support. The suggested processing is based on the MapReduce concept designed for Big Data processing. The theoretical part of this work is largely about the Hadoop system, which implements this concept; understanding it is key to properly designing the applications that run within it. The work also contains designs for specific Big Data processing applications. The implementation part of the thesis describes the administration of a Hadoop system, the implementation of MapReduce applications, and their testing over data sets.
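For readers unfamiliar with the MapReduce concept the thesis builds on, the canonical Hadoop word-count job is sketched below; it is a generic example of a MapReduce application, not one of the thesis's own:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE); // emit (word, 1) per token
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get(); // total per word
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // pre-aggregate map-side
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }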
4

ASHWHIN - Array Storage System on HadoopFS with HDF5 Interface

Khandrika, Ananth Viswa Sai Kalyan 04 September 2018
No description available.
5

RDMA-based Plugin Design and Profiler for Apache and Enterprise Hadoop Distributed File System

Bhat, Adithya January 2015
No description available.
6

Performance Characterization and Improvements of SQL-On-Hadoop Systems

Kulkarni, Kunal Vikas 28 December 2016
No description available.
7

Processing data sources with big data frameworks / Behandla datakällor med big data-ramverk

Nyström, Simon, Lönnegren, Joakim January 2016
Big data is a concept that is expanding rapidly. As more and more data is generated and garnered, there is an increasing need for efficient solutions that can process all this data in attempts to gain value from it. The purpose of this thesis is to find an efficient way to quickly process a large number of relatively small files. More specifically, the purpose is to test two frameworks that can be used for processing big data. The frameworks tested against each other are Apache NiFi and Apache Storm. A method is devised to, first, construct a data flow and, second, test the performance and scalability of the frameworks running this data flow. The results reveal that Apache Storm is faster than Apache NiFi at the type of task that was tested. As the number of nodes included in the tests went up, the performance did not always follow. This indicates that adding more nodes to a big data processing pipeline does not always result in a better-performing setup and that, sometimes, other measures must be taken to improve performance.
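As a rough sketch of the Storm side of such an experiment (the spout and bolt below are illustrative stand-ins, not the authors' code; a Storm 1.x-style API is assumed):

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class SmallFileTopology {
        // Emits one tuple per (simulated) small file.
        public static class FileSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private int n = 0;
            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
                this.collector = c;
            }
            public void nextTuple() {
                collector.emit(new Values("file-" + (n++))); // stand-in for file data
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("file"));
            }
        }

        // A trivial per-file "processing" step.
        public static class ProcessBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector c) {
                System.out.println("processed " + input.getStringByField("file"));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) { }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("files", new FileSpout(), 1);
            // Four parallel bolt instances; scaling out this parallelism is
            // the kind of knob varied in the thesis's scalability tests.
            builder.setBolt("process", new ProcessBolt(), 4).shuffleGrouping("files");

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("small-files", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let it run briefly, then tear down
            cluster.shutdown();
        }
    }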
8

Intermediate Results Materialization Selection and Format for Data-Intensive Flows

Munir, Rana Faisal, Nadal, Sergi, Romero, Oscar, Abelló, Alberto, Jovanovic, Petar, Thiele, Maik, Lehner, Wolfgang 14 June 2023
Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results shared among multiple flows brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, where they are studied as the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service level agreements, which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for the automatic, multi-objective selection of materialized intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for the selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state of the art, as well as an 18% improvement in disk access time compared to fixed-format solutions.
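As a toy illustration of the format-selection idea (an assumption for exposition only, not the paper's multi-objective algorithm), a chooser might pick a columnar layout when downstream operators project few columns and rarely fetch whole rows:

    enum Format { ROW_ORIENTED, COLUMNAR }

    final class FormatChooser {
        // totalColumns: columns in the intermediate result
        // avgColumnsRead: average columns touched by downstream operators
        // fullRowFetches: fraction of accesses that read entire rows
        static Format choose(int totalColumns, double avgColumnsRead,
                             double fullRowFetches) {
            double projectionRatio = avgColumnsRead / totalColumns;
            // Narrow projections and few whole-row reads favor columnar I/O.
            if (projectionRatio < 0.5 && fullRowFetches < 0.5) {
                return Format.COLUMNAR;
            }
            return Format.ROW_ORIENTED;
        }

        public static void main(String[] args) {
            // A 20-column result scanned mostly for 3 columns -> COLUMNAR
            System.out.println(choose(20, 3.0, 0.1));
        }
    }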
9

Design and Implementation of a QoS file transfer protocol over Hadoop distributed file system

Chen, Chih-yi 26 July 2010
Cloud computing is pervasive in our daily life. For instance, I usually use Google's Gmail to receive e-mail, Google Documents to edit documents online and Google Calendar to make my daily schedule. We can say that Google provides a "Platform as a Service (PaaS)", which delivers a computing platform as a service, with the platform sustaining many cloud applications such as those mentioned above. However, Google's cloud computing platform is private: we cannot trace its source code or build cloud applications on it. Fortunately, there is an open-source project supported by Apache named "Hadoop", which has a distributed file system very similar to the Google File System (GFS), called the "Hadoop distributed file system (HDFS)". In order to observe the properties of HDFS, we design and implement an HDFS-based FTP server system called the FTP-ON-HDFS system, that is, an FTP server whose storage is HDFS. The FTP-ON-HDFS system comprises a web console for the FTP administrator, a FreeRADIUS server and a MySQL database for user authentication, a NameNode daemon on its own machine, a SecondaryNameNode on its own machine, and five DataNode daemons on five different machines. Our FTP-ON-HDFS system can tune two QoS parameters: "data block size" and "data replication". We tuned these two parameters in our system and compared its performance with the Hadoop File System (FS) shell commands and a normal vsftpd. On the other hand, FUSE can mount HDFS from a remote cluster onto a local machine and use the local machine's permissions to manage HDFS, so we also compared the performance of FUSE with HDFS (FUSE-DFS) against our FTP-ON-HDFS system.
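As a minimal sketch of setting these two QoS parameters per file through the Hadoop Java API (assumed cluster address and path; not the thesis's code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class QosUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed address
            FileSystem fs = FileSystem.get(conf);

            short replication = 2;              // "data replication" parameter
            long blockSize = 32L * 1024 * 1024; // "data block size": 32 MB
            Path dst = new Path("/ftp/upload.bin");
            // create(path, overwrite, bufferSize, replication, blockSize)
            try (FSDataOutputStream out =
                     fs.create(dst, true, 4096, replication, blockSize)) {
                out.write(new byte[]{42}); // payload would come from the FTP client
            }
            fs.close();
        }
    }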
10

Big Data v technológiách IBM / Big Data in technologies from IBM

Šoltýs, Matej January 2014
This diploma thesis presents Big Data technologies and their possible use cases and applications. The theoretical part initially focuses on the definition of the term Big Data and afterwards on Big Data technology, particularly the Hadoop framework. The principles of Hadoop, such as distributed storage and data processing, and its individual components are described, and the largest vendors of Big Data technologies are presented. At the end of this part of the thesis, possible use cases of Big Data technologies and some case studies are described. The practical part describes the implementation of a demo example of Big Data technologies and is divided into two chapters. The first chapter deals with the conceptual design of the demo example, the products used and the architecture of the solution. The second chapter then describes the implementation of the demo example, from the preparation of the demo environment to the creation of the applications. The goals of this thesis are the description and characteristics of Big Data, the presentation of the largest vendors and their Big Data products, the description of possible use cases of Big Data technologies and, especially, the implementation of a demo example in Big Data tools from IBM.
