1. Design and Implementation of a QoS file transfer protocol over Hadoop distributed file system

Chen, Chih-yi, 26 July 2010
Cloud computing is pervasive in our daily life. For instance, I usually use Google's Gmail to receive e-mail, Google Docs to edit documents online, and Google Calendar to plan my daily schedule. We can say that Google provides a "Platform as a Service (PaaS)": it delivers a computing platform as a service, and that platform sustains many cloud applications such as those mentioned above. However, Google's cloud computing platform is private: we cannot trace its source code or build cloud applications on it. Fortunately, there is an open source project supported by Apache named "Hadoop", which includes a distributed file system closely resembling the Google File System (GFS), called the "Hadoop Distributed File System (HDFS)". In order to observe the properties of HDFS, we design and implement an HDFS-based FTP server system called FTP-ON-HDFS, that is, an FTP server whose storage is HDFS. The FTP-ON-HDFS system comprises a web console for the FTP administrator, a FreeRADIUS server and a MySQL database for user authentication, a NameNode daemon on its own machine, a SecondaryNameNode daemon on its own machine, and five DataNode daemons on five different machines. Our FTP-ON-HDFS system can tune two QoS parameters: "data block size" and "data replication". We tuned these two parameters in our system and compared its performance with the Hadoop File System (FS) shell commands and a normal vsftpd server. In addition, FUSE can mount HDFS from a remote cluster onto a local machine and use the local machine's permissions to manage HDFS, so we also compared the performance of FUSE-mounted HDFS (FUSE-DFS) against our FTP-ON-HDFS system.
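The two QoS parameters named in the abstract map directly onto per-file settings in the standard HDFS Java API. Below is a minimal sketch of how an FTP front end could apply them when writing an uploaded file into HDFS; the NameNode URI, destination path, and parameter values are illustrative, not taken from the thesis.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QosUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // illustrative NameNode URI

        FileSystem fs = FileSystem.get(conf);

        // The two QoS parameters the thesis tunes, applied per file:
        short replication = 3;               // "data replication" factor
        long blockSize = 128L * 1024 * 1024; // "data block size": 128 MB

        Path dest = new Path("/ftp/uploads/example.dat"); // hypothetical upload path
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        // FileSystem.create lets a client override both parameters per file.
        try (FSDataOutputStream out =
                 fs.create(dest, true /* overwrite */, bufferSize, replication, blockSize)) {
            out.write("uploaded file contents".getBytes("UTF-8")); // stand-in for FTP data
        }
        fs.close();
    }
}
```

Cluster-wide defaults for the same two knobs can also be set in hdfs-site.xml (dfs.replication and, in current Hadoop releases, dfs.blocksize), but a per-file override like the one above is what lets an FTP server expose them as tunable QoS settings.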
2. Implementation of the Hadoop MapReduce algorithm on virtualized shared storage systems

Nethula, Shravya, January 2016
Context: Hadoop is an open-source software framework developed for distributed storage and distributed processing of large data sets. Implementing the Hadoop MapReduce algorithm on virtualized shared storage, eliminating the Hadoop Distributed File System (HDFS) entirely, is a challenging task. In this study, the Hadoop MapReduce algorithm is implemented on the Compuverde software, which provides virtualized shared storage of data.

Objectives: This study identifies the effect of using virtualized shared storage with the Hadoop framework. The main objective is to design a method for implementing the Hadoop MapReduce algorithm on the Compuverde software, which handles virtualized shared storage of big data. Finally, the performance of the MapReduce algorithm on Compuverde shared storage (Compuverde File System, CVFS) is evaluated and compared with its performance on HDFS.

Methods: Initially, a literature study is conducted to identify the effect of implementing Hadoop on virtualized shared storage. The Compuverde software is analyzed in detail during this study, and the concepts of the MapReduce algorithm and the functioning of HDFS are scrutinized. The main research method adopted is the implementation of a method in which the Hadoop MapReduce algorithm is applied to the Compuverde software, bypassing HDFS. The final step is experimentation, in which the performance of the MapReduce algorithm on Compuverde shared storage (CVFS) is compared with its performance on HDFS. A sketch of the benchmark program appears after this abstract.

Results: The experiment is conducted in two scenarios, a CPU-bound scenario and an I/O-bound scenario. In the CPU-bound scenario, the average execution time of the WordCount program grows linearly with the size of the data set; this linear growth is observed for both file systems, HDFS and CVFS. The same holds in the I/O-bound scenario. When the average execution times are plotted, both file systems perform similarly in the CPU-bound scenario (multi-node environment). In the I/O-bound scenario (multi-node environment), HDFS slightly outperforms CVFS for a 1.0 GB data set, while the two file systems perform without much difference for 0.5 GB and 1.5 GB data sets.

Conclusions: The MapReduce algorithm can be run on live data present in virtualized shared storage systems without copying the data into HDFS. In a single-node environment, distributed storage systems perform better than shared storage systems. In a multi-node environment, HDFS and CVFS perform similarly in the CPU-bound scenario; on the other hand, HDFS performs slightly better than CVFS for the 1.0 GB data set in the I/O-bound scenario. Hence we can conclude that shared storage systems perform similarly to distributed storage systems in both CPU-bound and I/O-bound scenarios in a multi-node environment.
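The WordCount program used as the benchmark above is the canonical Hadoop MapReduce example; a self-contained version is sketched below. Note that which file system serves the job (HDFS or, in this study's setup, CVFS) is determined by the input and output URIs the job is launched with, not by the job code itself. This is the stock example, not the thesis's exact benchmark code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split each input line into tokens and emit (word, 1) pairs.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each word across all mappers.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] and args[1] are the input and output URIs; the file system
        // scheme in these URIs decides which storage backend serves the job.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```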
3. Towards an S3-based, DataNode-less implementation of HDFS

Caceres Gutierrez, Franco Jesus, January 2020
The relevance of data processing and analysis today cannot be overstated. The convergence of several technological advancements has fostered the proliferation of systems and infrastructure that together support the generation, transmission, and storage of nearly 15,000 exabytes of digital, analyzable data. The Hadoop Distributed File System (HDFS) is an open source system designed to leverage the storage capacity of thousands of servers, and it is the file system component of an entire ecosystem of tools for transforming and analyzing massive data sets. While HDFS is used by organizations of all sizes, smaller ones are not as well suited to organically growing their clusters to accommodate their ever-expanding data sets and processing needs. This is because larger clusters are concomitant with higher investment in servers, greater rates of failures to recover from, and the need to allocate more resources to maintenance and administration tasks. This poses a potential limitation down the road for organizations, and it might even deter some from venturing into the data world altogether. This thesis addresses the matter by presenting a novel implementation of HopsFS, an already improved version of HDFS, that requires no user-managed data servers. Instead, it relies on S3, a leading object storage service, for all of its user-data storage needs. We compared the performance of S3-based and regular clusters and found that such an architecture is not only feasible but also perfectly viable in terms of read and write throughput, in some cases even outperforming its original counterpart. Furthermore, our solution provides first-class elasticity, reliability, and availability, all while being remarkably more affordable.
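The thesis's HopsFS-on-S3 implementation itself is not reproduced here, but the core idea, replacing DataNode-hosted blocks with an object store, can be illustrated with Hadoop's stock s3a connector, which lets any Hadoop client read and write S3 through the same FileSystem API. This is a minimal sketch, not the thesis's architecture; it requires the hadoop-aws module on the classpath, and the bucket name and credentials are hypothetical.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3Write {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; in practice these normally come from the
        // environment or a credential provider chain, not hard-coded keys.
        conf.set("fs.s3a.access.key", "EXAMPLE_ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "EXAMPLE_SECRET_KEY");

        // "example-bucket" is a hypothetical bucket name.
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        // The write lands in the object store: no DataNode daemons are involved.
        try (FSDataOutputStream out = s3.create(new Path("/datasets/sample.txt"))) {
            out.write("hello from object storage".getBytes("UTF-8"));
        }
        s3.close();
    }
}
```

Because clients address the store through the generic FileSystem interface, swapping block servers for an object store is transparent to application code, which is what makes a DataNode-less design like the one described above practical.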
