Global ETD Search

1	CloudNotes: Annotation Management in Cloud-Based Platforms Lu, Yue 24 April 2014 (has links) We present an annotation management system for cloud-based platforms, which is called â€œCloudNotesâ€�. CloudNotes enables the annotation management feature in the scalable Hadoop and MapRedue platforms. In CloudNotes system, every piece of data may have one or more annotations associate with it, and these annotations will be propagated when the data is being transformed through the MapReduce jobs. Such an annotation management system is important for understanding the provenance and quality of data, especially in applications that deal with integration of scientific and biological data at unprecedented scale and complexity. We propose several extensions to the Hadoop platform that allow end-users to add and retrieve annotations seamlessly. Annotations in CloudNotes will be generated, propagated and managed in a distributed manner. We address several challenges that include attaching annotations to data at various granularities in Hadoop, annotating data in flat files with no known schema until query time, and creating and storing the annotations is a distributed fashion. We also present new storage mechanisms and novel indexing techniques that enable adding the annotations in small increments although Hadoopâ€™s file system is optimized for large batch processing. Annotation Hadoop
2	Evaluating Clustering Techniques over Big Data in Distributed Infrastructures Shetty, Kartik 25 April 2018 (has links) Clustering is defined as the process of grouping a set of objects in a way that objects in the same group are similar in some sense to each other than to those in other groups. It is used in many fields including machine learning, image recognition, pattern recognition and knowledge discovery. In this era of Big Data, we could leverage the computing power of distributed environment to achieve it over large dataset. It can be achieved through various algorithms, but in general they have high time complexities. We see that for large datasets the scalability and the parameters of the environment in which it is running become issues which needs to be addressed. Therefore it's brute force implementation is not scalable over large datasets even in a distributed environment, which calls the need for an approximation technique or optimization to make it scalable. We study three clustering techniques: CURE, DBSCAN and k-means over distributed environment like Hadoop. For each of these algorithms we understand their performance trade offs and bottlenecks and then propose enhancements or optimizations or an approximation technique to make it scalable in Hadoop. Finally we evaluate it's performance and suitability to datasets of different sizes and distributions. clustering hadoop
3	A Hadoop-based Cloud Computing for Network Flow Analysis and Packet Dissection Applications Wu, Shih-lin 26 July 2010 (has links) With the growing of Internet, people use network frequently. Many PC applications have moved to the network-based environment, such as text processing, calendar, photo management, and even user can develop applications on the network. Google is a company providing web services. Its popular services are search engine and gmail which attracts people by short response time and lots amount of data storage. It also charges businesses to place their own advertisements. Another hot social network is Facebook which is also a popular website. It processes huge instant messages and social relationships between users. The magic power of doing this depends on the new technique, Could Computing. Cloud computing has ability to keep high-performance processing and short response time, and its kernel components are distributed data storage and distributed data processing. Because of the new concept, there are fewer application, such as pattern searching and log file analysis, related to the cloud computing. Therefore, we use the technique to perform the packet analysis and packet dissection. The packet data are placed by distributed file system, and further process according to different requirements, which acts as IPS (Intrusion Protection System). hadoop packet dissection
4	Reducing Communication Overhead and Computation Costs in a Cloud Network by Early Combination of Partial Results Huang, Jun-neng 22 August 2011 (has links) This thesis describes a method of reducing communication overheads within the MapReduce infrastructure of a cloud computing environment. MapReduce is an framework for parallelizing the processing on massive data systems stored across a distributed computer network. One of the benefits of MapReduce is that the computation is usually performed on a computer (node) that holds the data file. Not only does this approach achieve parallelism, but it also benefits from a characteristic common to many applications: that the answer derived from a computation is often smaller than the size of the input file. Our new method benefits also from this feature. We delay the transmission of individual answers out a given node, so as to allow these answers to be combined locally, first. This combination has two advantages. First, it allows for a further reduction in the amount of data to ultimately transmit. And second, it allows for additional computation across files (such as a merge-sort). There is a limit to the benefit of delaying transmission, however, because the reducer stage of MapReduce cannot begin its work until the nodes transmit their answers. We therefore consider a mechanism to allow the user to adjust the amount of delay before data transmission out of each node. cloud computing MapReduce Hadoop
5	Enhancing Query Support in HBase via an extended Coprocessor Framework Vashishtha, Himanshu Unknown Date No description available. HBase, Coprocessors, Endpoints, Hadoop
6	An approach to choosing the right distributed file system : Microsoft DFS vs. Hadoop DFS Musatoiu, Mihai January 2015 (has links) Context. An important goal of most IT groups is to manage server resources in such a way that their users are provided with fast, reliable and secure access to files. The modern needs of organizations imply that resources are often distributed geographically, asking for new design solutions for the file systems to remain highly available and efficient. This is where distributed file systems (DFSs) come into the picture. A distributed file system (DFS), as opposed to a "classical", local, file system, is accessible across some kind of network and allows clients to access files remotely as if they were stored locally. Objectives. This paper has the goal of comparatively analyzing two distributed file systems, Microsoft DFS (MSDFS) and Hadoop DFS (HDFS). The two systems come from different "worlds" (proprietary - Microsoft DFS - vs. open-source - Hadoop DFS); the abundance of solutions and the variety of choices that exist today make such a comparison more relevant. Methods. The comparative analysis is done on a cluster of 4 computers running dual-installations of Microsoft Windows Server 2012 R2 (the MSDFS environment) and Linux Ubuntu 14.04 (the HDFS environment). The comparison is done on read and write operations on files and sets of files of increasing sizes, as well as on a set of key usage scenarios. Results. Comparative results are produced for reading and writing operations of files of increasing size - 1 MB, 2 MB, 4 MB and so on up to 4096 MB - and of sets of small files (64 KB each) amounting to totals of 128 MB, 256 MB and so on up to 4096 MB. The results expose the behavior of the two DFSs on different types of stressful activities (when the size of the transferred file increases, as well as when the quantity of data is divided into (tens of) thousands of many small files). The behavior in the case of key usage scenarios is observed and analyzed. Conclusions. HDFS performs better at writing large files, while MSDFS is better at writing many small files. At read operations, the two show similar performance, with a slight advantage for MSDFS. In the key usage scenarios, HDFS shows more flexibility, but MSDFS could be the better choice depending on the needs of the users (for example, most of the common functions can be configured through the graphical user interface). DFS MSDFS HDFS Microsoft Hadoop
7	Optimalizace platformy pro distribuované výpočty Hadoop / Optimization of the Hadoop Platform for Distributed Computation Čecho, Jaroslav January 2012 (has links) This thesis is focusing on possibilities of improving the Apache Hadoop framework by outsourcing some computation to a graphic card using the NVIDIA CUDA technology. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model called mapreduce. NVIDIA CUDA is a platform which allows one to use a graphic card for a general computation. This thesis contains description and experimental implementations of suitable computation inside te Hadoop framework that can benefit from being executed on a graphic card.
8	Distributed Text Mining in R Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt 16 March 2011 (has links) (PDF) R has recently gained explicit text mining support with the "tm" package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase of the amount of data to be analyzed leads to increasing computational workload. Fortunately, adequate parallel programming models like MapReduce and the corresponding open source implementation called Hadoop allow for processing data sets beyond what would fit into memory. In this paper we present the package "tm.plugin.dc" offering a seamless integration between "tm" and Hadoop. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size. / Series: Research Report Series / Department of Statistics and Mathematics
9	A tm Plug-In for Distributed Text Mining in R Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt 11 1900 (has links) (PDF) R has gained explicit text mining support with the tm package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed the higher the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning over several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of signifficant size. (authors' abstract)
10	Analysis of PageRank on Wikipedia Tadakamala, Anirudh January 1900 (has links) Master of Science / Department of Computing and Information Science / Daniel Andresen / With massive explosion of data in recent times and people depending more and more on search engines to get all kinds of information they want, it has becoming increasingly difficult for the search engines to produce most relevant data to the users. PageRank is one algorithm that has revolutionized the way search engines work. It was developed by Google`s Larry Page and Sergey Brin. It was developed by Google to rank websites and display them in order of ranking in its search engine results. PageRank is a link analysis algorithm that assigns a weight to each document in a corpus and measures the relative importance within the corpus. The purpose of my project is to extract all the English Wikipedia data using MediaWiki API and JWPL(Java Wikipedia Library), build PageRank Algorithm and analyze its performance on the this data set. Since the data set is too big to run in a single node Hadoop cluster, the analysis is done in a high computation cluster called Beocat, provided by Kansas State University, Computing and Information Sciences Department. Hadoop PageRank MapReduce Computer Science (0984)

Search results