  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

Using AFS as a distributed file system for computational and data grids in high energy physics

Jones, Michael Angus Scott January 2005 (has links)
The use of the distributed file system AFS as a solution to the “input/output sandbox” problem in grid computing is studied. A computational grid middleware, written primarily to accommodate the environment of the BaBar Computing Model, is presented, and the existing grid middleware and resources are summarized. A number of benchmarks (one written for this thesis) are used to test the performance of AFS over the wide area network and in the grid environment. The performance of AFS is also tested using a straightforward BaBar analysis code on real data. Secure web-based and command-line interfaces created to monitor job submission and the grid fabric are also presented.
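The benchmark code itself is not reproduced in the abstract; as a rough illustration of the kind of sequential read/write throughput test used to exercise a mounted AFS volume, a minimal sketch might look like the following (the AFS mount point and file size are assumptions, not values from the thesis):

```python
import os
import time

MOUNT_POINT = "/afs/example.org/user/scratch"  # hypothetical AFS mount point
FILE_SIZE_MB = 256                              # assumed benchmark size
CHUNK = b"\0" * (1024 * 1024)                   # 1 MiB write buffer

def write_throughput(path, size_mb):
    """Sequentially write size_mb mebibytes and return MiB/s."""
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())                    # force data out of the page cache
    return size_mb / (time.time() - start)

def read_throughput(path):
    """Sequentially read the file back and return MiB/s."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    start = time.time()
    with open(path, "rb") as f:
        while f.read(1024 * 1024):
            pass
    return size_mb / (time.time() - start)

if __name__ == "__main__":
    target = os.path.join(MOUNT_POINT, "afs_bench.dat")
    print(f"write: {write_throughput(target, FILE_SIZE_MB):.1f} MiB/s")
    print(f"read:  {read_throughput(target):.1f} MiB/s")
    os.remove(target)
```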
2

Performance Analysis of Relational Database over Distributed File Systems

Tsai, Ching-Tang 08 July 2011 (has links)
With the growth of the Internet, people use the network more and more frequently. Many PC applications have moved to network-based environments, such as text processing, calendars, and photo management, and users can even develop applications on the network. Google is a company providing web services; its most popular services, the search engine and Gmail, attract users with short response times and large amounts of storage, and it charges businesses to place their advertisements. Facebook, another popular social network, processes huge numbers of instant messages and social relationships between users. The power behind these services is the new technique of cloud computing, whose ability to sustain high-performance processing and short response times rests on two kernel components: distributed data storage and distributed data processing. Hadoop is a well-known open-source framework for building a cloud distributed file system and performing distributed data analysis. Hadoop is suited to batch, write-once-read-many applications, so only a few kinds of applications, such as pattern searching and log-file analysis, have so far been implemented on it. Almost all database applications, however, still use relational databases. To port them to a cloud platform, it becomes necessary to run a relational database over HDFS. We therefore test FUSE-DFS, an interface that mounts HDFS into a system so that it can be used like a local filesystem. If FUSE-DFS performance can satisfy users' applications, it becomes easier to persuade people to port their applications to the cloud platform with the least overhead.
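The abstract does not show how FUSE-DFS was exercised; a minimal sketch of the kind of comparison implied, timing database-page-sized random reads on a FUSE-mounted HDFS path against a local path, could look like this (both directory paths are assumptions, and the HDFS cluster must already be mounted through FUSE-DFS):

```python
import os
import random
import time

LOCAL_DIR = "/tmp/db_bench"           # ordinary local filesystem
FUSE_DIR = "/mnt/hdfs/db_bench"       # hypothetical FUSE-DFS mount of HDFS
FILE_SIZE = 64 * 1024 * 1024          # 64 MiB data file
READ_SIZE = 8 * 1024                  # 8 KiB, a typical database page size
N_READS = 2000

def prepare(path):
    """Create a data file to read from."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(os.urandom(FILE_SIZE))

def random_read_latency(path):
    """Average latency of small random reads, in milliseconds."""
    with open(path, "rb") as f:
        start = time.time()
        for _ in range(N_READS):
            f.seek(random.randrange(0, FILE_SIZE - READ_SIZE))
            f.read(READ_SIZE)
    return (time.time() - start) / N_READS * 1000

for base in (LOCAL_DIR, FUSE_DIR):
    data_file = os.path.join(base, "pages.dat")
    prepare(data_file)
    print(f"{base}: {random_read_latency(data_file):.2f} ms per {READ_SIZE // 1024} KiB read")
```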
3

Design and Implementation of Cloud Data Backup System with Load Balance Strategy

Tsai, Chia-ping 15 August 2012 (has links)
The rapid growth of bandwidth has driven the development of cloud storage, and more and more resources are being placed in it. In this thesis, we propose a new cloud storage system that consists of a single main server and multiple data servers. The main server controls system-wide activities such as data server management; it also periodically communicates with each data server and collects its state. Data servers store data on local disks as Windows files. To respond to a large number of data accesses, the server handling each request must be selected so that performance stays balanced. We propose a server selection algorithm that combines different parameters into performance metrics, which lets us balance multiple resources on the server side. We design the new cloud storage system and implement the algorithm. In the upload experiment, the difference between the maximum and minimum free space among servers under our algorithm is less than 5GB, whereas under the random mode the free-space difference grows over time, reaching a maximum of 30GB. In the mixed experiment, which adds the download mode, the difference under our algorithm stays below 10GB, while the random mode behaves much as in the first experiment. Finally, our algorithm obtains 10% and 3% speedups in upload throughput in the upload and mixed experiments respectively, and a 10% speedup in download throughput in the mixed experiment.
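The exact scoring formula is not given in the abstract; a toy sketch of a multi-resource server-selection rule of this general kind, with assumed weights and metrics, might look like:

```python
from dataclasses import dataclass

@dataclass
class DataServer:
    name: str
    free_space_gb: float   # remaining disk capacity
    cpu_load: float        # 0.0 (idle) .. 1.0 (saturated)
    active_transfers: int  # connections currently being served

# Assumed weights; the thesis derives its own metrics from server-side state.
WEIGHTS = {"free_space": 0.5, "cpu": 0.3, "transfers": 0.2}

def score(server: DataServer, max_free: float, max_transfers: int) -> float:
    """Higher score = better candidate for the next upload."""
    space_term = server.free_space_gb / max_free if max_free else 0.0
    cpu_term = 1.0 - server.cpu_load
    transfer_term = 1.0 - server.active_transfers / max_transfers
    return (WEIGHTS["free_space"] * space_term
            + WEIGHTS["cpu"] * cpu_term
            + WEIGHTS["transfers"] * transfer_term)

def select_server(servers: list[DataServer]) -> DataServer:
    """Pick the data server with the best combined score."""
    max_free = max(s.free_space_gb for s in servers)
    max_transfers = max(s.active_transfers for s in servers) or 1
    return max(servers, key=lambda s: score(s, max_free, max_transfers))

servers = [
    DataServer("ds1", free_space_gb=120, cpu_load=0.7, active_transfers=9),
    DataServer("ds2", free_space_gb=90, cpu_load=0.2, active_transfers=3),
    DataServer("ds3", free_space_gb=150, cpu_load=0.5, active_transfers=6),
]
print(select_server(servers).name)
```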
4

UnityFS: A File System for the Unity Block Store

Huang, Wei 27 November 2013 (has links)
A large number of personal cloud storage systems have emerged in recent years, such as Dropbox, iCloud, and Google Drive. A common limitation of these systems is that users have to trust the cloud provider not to be malicious. The Unity block store addresses this problem by providing a secure and durable cloud-based block device. However, the base Unity system has no concept of a file on top of its block device, so concurrent operations on different files can cause a false-sharing problem. In this thesis, we propose UnityFS, a file system built on top of the base Unity system. We design and implement a file system that maintains a mapping between files and groups of data blocks, so that the whole Unity system can support concurrent file operations on different files from multiple user devices in the personal cloud.
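UnityFS itself is not shown in the abstract; a toy sketch of the central idea, keeping a per-file mapping to its own group of block numbers so that operations on different files never share a block, might look like:

```python
class BlockGroupTable:
    """Toy mapping from file paths to disjoint groups of block numbers.

    An illustrative sketch of the idea described in the abstract,
    not the UnityFS data structure itself.
    """

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.next_block = 0
        self.groups: dict[str, list[int]] = {}   # path -> block numbers

    def allocate(self, path: str, size_bytes: int) -> list[int]:
        """Give a file its own group of blocks so writes to different
        files never touch the same block (avoiding false sharing)."""
        n_blocks = -(-size_bytes // self.block_size)  # ceiling division
        group = list(range(self.next_block, self.next_block + n_blocks))
        self.next_block += n_blocks
        self.groups[path] = group
        return group

    def blocks_for(self, path: str) -> list[int]:
        return self.groups[path]

table = BlockGroupTable()
table.allocate("/photos/cat.jpg", 10_000)
table.allocate("/notes/todo.txt", 300)
print(table.blocks_for("/photos/cat.jpg"))  # [0, 1, 2]
print(table.blocks_for("/notes/todo.txt"))  # [3]
```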
5

Design and Implementation of a QoS file transfer protocol over Hadoop distributed file system

Chen, Chih-yi 26 July 2010 (has links)
Cloud computing is pervasive in our daily life. For instance, I usually use Google's Gmail to receive e-mail, Google Document to edit documents online, and Google Calendar to make my daily schedule. We can say that Google provides a "Platform as a Service (PaaS)", which delivers a computing platform as a service, and this platform sustains many cloud applications such as those mentioned above. However, Google's cloud computing platform is private: we cannot trace its source code or build cloud applications on it. Fortunately, there is an open-source project supported by Apache named Hadoop, which includes a distributed file system very similar to the Google File System (GFS), called the Hadoop Distributed File System (HDFS). In order to observe the properties of HDFS, we design and implement an HDFS-based FTP server system called FTP-ON-HDFS, that is, an FTP server whose storage is HDFS. The system comprises a web console for the FTP administrator, a FreeRADIUS server and a MySQL database for user authentication, a NameNode daemon on its own machine, a SecondaryNameNode on its own machine, and five DataNode daemons on five different machines. FTP-ON-HDFS can tune two QoS parameters: data block size and data replication. We tuned these parameters in our system and compared its performance with the Hadoop File System (FS) shell commands and a normal vsftpd. On the other hand, FUSE can mount HDFS from a remote cluster onto a local machine and use the local machine's permissions to manage HDFS, so we also compared the performance of FUSE with HDFS (FUSE-DFS) against our FTP-ON-HDFS system.
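The two QoS parameters can be set per file when writing into HDFS. A hedged sketch using the third-party hdfs (hdfscli) Python client over WebHDFS is shown below; the NameNode address, user name, and the availability of the blocksize/replication keyword arguments in client.write are assumptions about the deployment and library version, not details taken from the thesis:

```python
import time
from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# Assumed cluster endpoint and user; adjust for the actual deployment.
client = InsecureClient("http://namenode.example.org:9870", user="hadoop")

payload = b"\0" * (256 * 1024 * 1024)  # 256 MiB test file

# Try a few (block size, replication) combinations and time the uploads,
# mirroring the kind of QoS tuning experiment described in the abstract.
for blocksize, replication in [(64 * 2**20, 1), (64 * 2**20, 3), (128 * 2**20, 3)]:
    path = f"/benchmarks/qos_bs{blocksize}_rep{replication}.dat"
    start = time.time()
    client.write(path, data=payload, overwrite=True,
                 blocksize=blocksize, replication=replication)
    elapsed = time.time() - start
    print(f"blocksize={blocksize >> 20} MiB, replication={replication}: "
          f"{len(payload) / 2**20 / elapsed:.1f} MiB/s")
```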
6

Implementation of the Hadoop MapReduce algorithm on virtualized shared storage systems

Nethula, Shravya January 2016 (has links)
Context: Hadoop is an open-source software framework developed for distributed storage and distributed processing of large data sets. Implementing the Hadoop MapReduce algorithm on virtualized shared storage, eliminating the Hadoop Distributed File System (HDFS), is a challenging task. In this study, the Hadoop MapReduce algorithm is implemented on the Compuverde software, which provides virtualized shared storage of data. Objectives: This study identifies the effect of using virtualized shared storage with the Hadoop framework. The main objective is to design a method for implementing the Hadoop MapReduce algorithm on the Compuverde software for virtualized shared storage of big data. Finally, the performance of the MapReduce algorithm on Compuverde shared storage (Compuverde File System - CVFS) is evaluated and compared to its performance on HDFS. Methods: Initially, a literature study is conducted to identify the effect of implementing Hadoop on virtualized shared storage; the Compuverde software, the MapReduce algorithm, and the functioning of HDFS are scrutinized in detail. The main research method adopted is the implementation of a method in which the Hadoop MapReduce algorithm is applied to the Compuverde software, eliminating HDFS. The next step is an experiment in which the performance of the MapReduce algorithm on Compuverde shared storage (CVFS) is compared with its performance on the Hadoop Distributed File System. Results: The experiment is conducted in two scenarios, a CPU-bound scenario and an I/O-bound scenario. In the CPU-bound scenario, the average execution time of the WordCount program grows linearly with the size of the data set for both file systems, HDFS and CVFS; the same holds in the I/O-bound scenario. When the average execution times are plotted, both file systems perform similarly in the CPU-bound scenario (multi-node environment). In the I/O-bound scenario (multi-node environment), HDFS slightly outperforms CVFS when the data set size is 1.0GB, and the two file systems perform without much difference when the data set size is 0.5GB or 1.5GB. Conclusions: The MapReduce algorithm can be applied to live data present in virtualized shared storage systems without copying the data into HDFS. In a single-node environment, distributed storage systems perform better than shared storage systems. In a multi-node environment, HDFS and CVFS perform similarly in the CPU-bound scenario, while HDFS performs slightly better than CVFS for the 1.0GB data set in the I/O-bound scenario. Hence we can conclude that distributed storage systems perform similarly to shared storage systems in both CPU-bound and I/O-bound scenarios in a multi-node environment.
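The benchmark program is the standard WordCount; the thesis does not reproduce its source, but a minimal Hadoop Streaming version in Python, used here purely as an illustration of the workload, could look like this:

```python
#!/usr/bin/env python3
"""Minimal WordCount for Hadoop Streaming: run with 'map' or 'reduce' as argv[1]."""
import sys

def mapper():
    # Emit one (word, 1) pair per token on standard output.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts can be accumulated per word.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

It would typically be launched through the hadoop-streaming jar bundled with Hadoop, passing this script as both the mapper ("wordcount.py map") and the reducer ("wordcount.py reduce"); the exact jar location and options depend on the installation.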
7

Towards an S3-based, DataNode-less implementation of HDFS

Caceres Gutierrez, Franco Jesus January 2020 (has links)
The relevance of data processing and analysis today cannot be overstated. The convergence of several technological advancements has fostered the proliferation of systems and infrastructure that together support the generation, transmission, and storage of nearly 15,000 exabytes of digital, analyzable data. The Hadoop Distributed File System (HDFS) is an open source system designed to leverage the storage capacity of thousands of servers, and is the file system component of an entire ecosystem of tools to transform and analyze massive data sets. While HDFS is used by organizations of all sizes, smaller ones are not as well-suited to organically grow their clusters to accommodate their ever-expanding data sets and processing needs. This is because larger clusters are concomitant with higher investment in servers, greater rates of failures to recover from, and the need to allocate more resources to maintenance and administration tasks. This poses a potential limitation down the road for organizations, and it might even deter some from venturing into the data world altogether. This thesis addresses this matter by presenting a novel implementation of HopsFS, an already improved version of HDFS, that requires no user-managed data servers. Instead, it relies on S3, a leading object storage service, for all its user-data storage needs. We compared the performance of both S3-based and regular clusters and found that such an architecture is not only feasible, but also perfectly viable in terms of read and write throughputs, in some cases even outperforming its original counterpart. Furthermore, our solution provides first-class elasticity, reliability, and availability, all while being remarkably more affordable.
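The HopsFS/S3 integration itself is not shown in the abstract; as an illustration of the kind of S3 read/write path that a DataNode-less block layer relies on, a small boto3 throughput probe might look like the following (the bucket name and object size are assumptions):

```python
import time
import boto3  # AWS SDK for Python

BUCKET = "hdfs-benchmark-example"          # hypothetical bucket name
PAYLOAD = b"\0" * (128 * 1024 * 1024)      # 128 MiB object

s3 = boto3.client("s3")                    # credentials come from the environment

def s3_write_throughput(key: str) -> float:
    """Upload one object and return MiB/s."""
    start = time.time()
    s3.put_object(Bucket=BUCKET, Key=key, Body=PAYLOAD)
    return len(PAYLOAD) / 2**20 / (time.time() - start)

def s3_read_throughput(key: str) -> float:
    """Download the object back and return MiB/s."""
    start = time.time()
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return len(body) / 2**20 / (time.time() - start)

key = "bench/block-000001"
print(f"write: {s3_write_throughput(key):.1f} MiB/s")
print(f"read:  {s3_read_throughput(key):.1f} MiB/s")
```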
8

Optimization of input/output operations to reduce the response time of distributed applications that handle large volumes of data

Ishii, Renato Porfirio 01 September 2010 (has links)
Current scientific applications produce ever larger volumes of data, and handling, processing, and analyzing such data require large-scale computing infrastructure such as clusters and grids. In this context, various studies have focused on improving the performance of these applications by optimizing data access. To achieve this goal, researchers have employed techniques of replication, migration, distribution, and parallelism of data. However, these common approaches do not use knowledge about the applications at hand to perform this optimization. This gap motivated the present thesis, which aims to apply the historical and predictive behavior of applications to optimize their read and write operations on distributed data. A new heuristic was initially proposed, based on information previously monitored from applications, to make decisions regarding replication, migration, and consistency of data. Its evaluation revealed that considering sets of historical events indeed helps to estimate the future behavior of an application and to optimize its access operations. The heuristic was therefore embedded into two optimization approaches. The first requires at least one previous execution of the application to compose its history. This requirement may limit real-world applications that present behavioral changes or need long periods of execution to complete their processing. To overcome this limitation, a second approach was proposed: it performs on-line predictions of the behavioral events of applications, removing the need for any prior execution, and adapts its estimates of future behavior to underlying changes. This behavior is modeled as time series; the predictive approach analyzes the properties of the series in order to classify their generating processes, and this classification indicates the models that best fit the applications' behavior, allowing predictions with greater accuracy. Both proposed approaches were implemented and evaluated using the OptorSim simulator, linked to the LHC/CERN project and widely adopted by the scientific community. Experiments confirmed that the two approaches reduce the response (or execution) time of applications that handle large volumes of distributed data by approximately 50%.
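Neither the heuristic nor the on-line predictor is specified in the abstract; a toy sketch of the general flavour of on-line behaviour prediction, using exponential smoothing over a sliding window of observed request sizes (the smoothing weight and prefetch threshold are assumptions), might look like:

```python
from collections import deque

class AccessPredictor:
    """Toy on-line predictor of the next I/O request size.

    Illustrates predicting future access behaviour from recent events;
    the thesis classifies the generating process of the time series and
    fits proper models, which this sketch does not attempt.
    """

    def __init__(self, window: int = 32, alpha: float = 0.3):
        self.history = deque(maxlen=window)   # recent request sizes (bytes)
        self.alpha = alpha                    # exponential-smoothing weight
        self.estimate = None

    def observe(self, request_bytes: int) -> None:
        self.history.append(request_bytes)
        if self.estimate is None:
            self.estimate = float(request_bytes)
        else:
            self.estimate = self.alpha * request_bytes + (1 - self.alpha) * self.estimate

    def predict_next(self) -> float:
        return self.estimate or 0.0

    def should_prefetch(self, threshold_bytes: int = 4 * 2**20) -> bool:
        """Assumed policy: replicate/prefetch when large reads are expected."""
        return self.predict_next() >= threshold_bytes

predictor = AccessPredictor()
for size in [1 * 2**20, 2 * 2**20, 6 * 2**20, 8 * 2**20]:  # simulated read sizes
    predictor.observe(size)
print(f"next request ~ {predictor.predict_next() / 2**20:.1f} MiB, "
      f"prefetch: {predictor.should_prefetch()}")
```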
9

DATA MINING: TRACKING SUSPICIOUS LOGGING ACTIVITY USING HADOOP

Sodhi, Bir Apaar Singh 01 March 2016 (has links)
In this modern, highly interconnected era, an organization's top priority is to protect itself from the major security breaches that occur frequently in communication environments, yet organizations often fail to do so. Every week there are new headlines about information being forged, funds being stolen, and credit card fraud. Hackers turn personal computers into "zombie machines" to steal confidential and financial information without disclosing their true identity; these identity thieves rob private data and defeat the very purpose of privacy. The purpose of this project is to identify suspicious user activity by analyzing a log file, which can later help an investigation agency such as the FBI track and monitor anonymous users who look for weaknesses to attack vulnerable parts of a system and gain access to it. The project also emphasizes the potential damage that malicious activity could do to the system. It uses the Hadoop framework to store and search log files of logging activity and then runs a MapReduce program to compute and analyze the results.
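The MapReduce code is not included in the abstract; a hedged Hadoop Streaming sketch in Python of the kind of analysis described, counting failed login attempts per source IP and reporting only those above a threshold (the log format and threshold are assumptions), could look like:

```python
#!/usr/bin/env python3
"""Hadoop Streaming sketch for flagging suspicious login activity.

Assumed, hypothetical log format: auth-style lines containing
"Failed password ... from <ip>". Run the script as both mapper and
reducer by passing 'map' or 'reduce' as the first argument.
"""
import re
import sys

FAILED_LOGIN = re.compile(r"Failed password .* from (\d+\.\d+\.\d+\.\d+)")
THRESHOLD = 20  # assumed cut-off for "suspicious" in this sketch

def mapper():
    # Emit (ip, 1) for every failed login line.
    for line in sys.stdin:
        match = FAILED_LOGIN.search(line)
        if match:
            print(f"{match.group(1)}\t1")

def reducer():
    # Keys arrive sorted; sum failures per IP and report heavy offenders.
    current_ip, failures = None, 0
    for line in sys.stdin:
        ip, value = line.rsplit("\t", 1)
        if ip != current_ip:
            if current_ip is not None and failures >= THRESHOLD:
                print(f"{current_ip}\t{failures}")
            current_ip, failures = ip, 0
        failures += int(value)
    if current_ip is not None and failures >= THRESHOLD:
        print(f"{current_ip}\t{failures}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```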
