  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
11

Utilização de metaheurísticas para balanceamento de carga em ambientes MapReduce / Metaheuristics approach for online load balancing in MapReduce

Pericini, Matheus Henrique Machado January 2017 (has links)
PERICINI, Matheus Henrique Machado. Utilização de metaheurísticas para balanceamento de carga em ambientes MapReduce. 2017. 71 f. Dissertação (Mestrado em Ciência da Computação)-Universidade Federal do Ceará, Fortaleza, 2017. / With the increase in the amount of data handled by large companies, new strategies became necessary for processing this data while preserving the relevance of the information it contains. One widely used strategy is based on a programming model called MapReduce, which applies divide and conquer to process the data on a cluster of machines. Hadoop is one of the most consolidated implementations of the MapReduce model, but even this strategy leaves room for improvement: the total runtime depends on every machine, so any overloaded machine delays delivery of the result. This overload is caused by a problem commonly called Data Skew, an unequal division of the data that arises from the size of the data, the way it is partitioned, or uneven processing costs. To address this problem, we propose MALiBU, an improvement to Hadoop's execution strategy that partitions the data among the machines using a metaheuristic: Simulated Annealing, Local Beam Search, or Stochastic Beam Search. Experimental results show performance improvements in Hadoop when a metaheuristic is used to distribute the data among the processing elements of the model, and show which of the three evaluated metaheuristics balances the load best.
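The partitioning idea lends itself to a small illustration: treat the key-to-reducer assignment as a combinatorial optimization problem and let a metaheuristic search for a low-skew assignment. The sketch below is not the thesis's MALiBU implementation; the key counts, cost function, and all names are assumptions, and simulated annealing (one of the three metaheuristics evaluated) stands in for the full approach.

```python
import math
import random

def skewed_key_counts(num_keys=200, seed=1):
    """Generate an illustrative skewed key-frequency map (a stand-in for real map output)."""
    rng = random.Random(seed)
    return {f"k{i}": int(rng.paretovariate(1.2) * 10) for i in range(num_keys)}

def max_load(assignment, counts, reducers):
    """Cost = load of the most loaded reducer (the quantity data skew inflates)."""
    loads = [0] * reducers
    for key, r in assignment.items():
        loads[r] += counts[key]
    return max(loads)

def anneal_partition(counts, reducers=8, steps=20000, t0=1000.0, cooling=0.999, seed=2):
    """Assign keys to reducers with simulated annealing to balance the load."""
    rng = random.Random(seed)
    keys = list(counts)
    assignment = {k: hash(k) % reducers for k in keys}   # default hash partitioner as the start
    best = dict(assignment)
    cost = best_cost = max_load(assignment, counts, reducers)
    t = t0
    for _ in range(steps):
        k, r = rng.choice(keys), rng.randrange(reducers)
        old = assignment[k]
        if old == r:
            continue
        assignment[k] = r
        new_cost = max_load(assignment, counts, reducers)
        # accept improvements always, worse moves with a temperature-dependent probability
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best_cost, best = cost, dict(assignment)
        else:
            assignment[k] = old  # reject the move
        t *= cooling
    return best, best_cost

if __name__ == "__main__":
    counts = skewed_key_counts()
    hash_cost = max_load({k: hash(k) % 8 for k in counts}, counts, 8)
    _, sa_cost = anneal_partition(counts)
    print("max reducer load: hash =", hash_cost, " annealed =", sa_cost)
```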
12

Optimalizace platformy pro distribuované výpočty Hadoop / Optimization of the Hadoop Platform for Distributed Computation

Čecho, Jaroslav January 2012 (has links)
This thesis focuses on possibilities for improving the Apache Hadoop framework by offloading some computation to a graphics card using the NVIDIA CUDA technology. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model called MapReduce. NVIDIA CUDA is a platform that allows one to use a graphics card for general computation. This thesis contains a description and experimental implementations of computations inside the Hadoop framework that can benefit from being executed on a graphics card.
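One architectural point behind such an approach is that Hadoop's map function is invoked once per record, whereas a GPU needs sizeable batches to amortize kernel-launch and transfer costs. The sketch below is not the thesis's code and invokes no real CUDA API; a plain-Python stand-in marks the place where a CUDA kernel would run.

```python
class BatchingMapper:
    """Record-at-a-time map interface that buffers input into batches,
    so each batch can be handed to an accelerator in a single call.
    process_batch is a plain-Python stand-in for a GPU kernel invocation."""

    def __init__(self, batch_size=4096):
        self.batch_size = batch_size
        self.buffer = []
        self.output = []          # (key, value) pairs the mapper emits

    def process_batch(self, records):
        # Stand-in for the offloaded, data-parallel per-record computation.
        return [(rec % 10, rec * rec) for rec in records]

    def map(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.output.extend(self.process_batch(self.buffer))
            self.buffer = []

if __name__ == "__main__":
    mapper = BatchingMapper(batch_size=3)
    for r in range(10):
        mapper.map(r)
    mapper.flush()                 # emit the final partial batch
    print(mapper.output)
```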
13

Regularized Markov Clustering in MPI and MapReduce

Varia, Siddharth 02 October 2013 (has links)
No description available.
14

MapReduce network enabled algorithms for classification based on association rules

Hammoud, Suhel January 2011 (has links)
There is growing evidence that integrating classification and association rule mining can produce more efficient and accurate classifiers than traditional techniques. This thesis introduces a new MapReduce-based association rule miner for extracting strong rules from large datasets. This miner is later used to develop a new large-scale classifier. A new MapReduce simulator was also developed to evaluate the scalability of the proposed algorithms on MapReduce clusters. The developed associative rule miner inherits MapReduce's scalability to huge datasets and to thousands of processing nodes. For finding frequent itemsets, it uses a hybrid approach combining miners that use counting methods on horizontal datasets and miners that use set intersections on vertical-format datasets. The new miner generates the same rules that are usually generated by Apriori-like algorithms because it uses the same confidence and support threshold definitions. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. This thesis also introduces a new MapReduce classifier based on MapReduce associative rule mining. This algorithm employs different approaches in its rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. The new classifier works on multi-class datasets and is able to produce multi-label predictions with probabilities for each predicted label. To evaluate the classifier, 20 different datasets from the UCI data collection were used. Results show that the proposed approach is an accurate and effective classification technique, highly competitive and scalable compared with other traditional and associative classification approaches. The MapReduce simulator measures the scalability of MapReduce-based applications easily and quickly and captures the behaviour of algorithms on cluster environments; this also allows optimizing the configuration of MapReduce clusters to obtain better execution times and hardware utilization.
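The support-counting core of such a miner maps naturally onto MapReduce. The following single-machine sketch is only an illustration of that map/reduce split, not the thesis's algorithm; the toy transactions, thresholds, and names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Toy transactions; in the thesis these would be records of a large distributed dataset.
TRANSACTIONS = [
    {"bread", "milk"}, {"bread", "butter", "milk"},
    {"bread", "butter"}, {"milk", "butter"}, {"bread", "milk", "butter"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 2, 0.6

def map_phase(transaction):
    """Map: emit (itemset, 1) for every 1- and 2-itemset in the transaction."""
    for item in transaction:
        yield frozenset([item]), 1
    for pair in combinations(sorted(transaction), 2):
        yield frozenset(pair), 1

def reduce_phase(emitted):
    """Reduce: sum counts per itemset (the shuffle groups identical keys together)."""
    counts = defaultdict(int)
    for itemset, one in emitted:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= MIN_SUPPORT}

def rules(frequent):
    """Derive strong rules A -> B from frequent 2-itemsets using the usual confidence definition."""
    for itemset, support in frequent.items():
        if len(itemset) != 2:
            continue
        for a in itemset:
            antecedent = frozenset([a])
            confidence = support / frequent[antecedent]
            if confidence >= MIN_CONFIDENCE:
                yield a, set(itemset - antecedent).pop(), support, round(confidence, 2)

emitted = [kv for t in TRANSACTIONS for kv in map_phase(t)]
frequent = reduce_phase(emitted)
for rule in rules(frequent):
    print(rule)
```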
15

Distributed Text Mining in R

Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt 16 March 2011 (has links) (PDF)
R has recently gained explicit text mining support with the "tm" package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) an increase in the amount of data to be analyzed leads to an increasing computational workload. Fortunately, adequate parallel programming models like MapReduce and the corresponding open source implementation called Hadoop allow for processing data sets beyond what would fit into memory. In this paper we present the package "tm.plugin.dc" offering a seamless integration between "tm" and Hadoop. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size. / Series: Research Report Series / Department of Statistics and Mathematics
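A typical corpus-level operation that benefits from this design, such as building a term-frequency table, follows the classic MapReduce word-count pattern. The sketch below is a single-machine Python illustration of that pattern under a made-up mini-corpus, not code from the R package.

```python
from collections import defaultdict
import re

# A toy "distributed corpus"; tm.plugin.dc keeps documents on a distributed
# file system and runs steps like this as Hadoop jobs (hypothetical data here).
CORPUS = {
    "doc1": "MapReduce lets us count terms in very large corpora",
    "doc2": "the tm package models corpora for text mining in R",
    "doc3": "Hadoop runs MapReduce jobs over a distributed file system",
}

def map_terms(doc_id, text):
    """Map: tokenize one document and emit (term, 1) pairs."""
    for token in re.findall(r"[a-z]+", text.lower()):
        yield token, 1

def reduce_counts(pairs):
    """Reduce: sum the counts for each term (identical keys arrive grouped)."""
    totals = defaultdict(int)
    for term, one in pairs:
        totals[term] += one
    return dict(totals)

pairs = [kv for doc_id, text in CORPUS.items() for kv in map_terms(doc_id, text)]
term_frequencies = reduce_counts(pairs)
print(sorted(term_frequencies.items(), key=lambda kv: -kv[1])[:5])
```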
16

A tm Plug-In for Distributed Text Mining in R

Theußl, Stefan, Feinerer, Ingo, Hornik, Kurt 11 1900 (has links) (PDF)
R has gained explicit text mining support with the tm package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed on a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data there is to analyze, the greater the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size. (authors' abstract)
17

Analysis of PageRank on Wikipedia

Tadakamala, Anirudh January 1900 (has links)
Master of Science / Department of Computing and Information Science / Daniel Andresen / With the massive explosion of data in recent times and people depending more and more on search engines to get all kinds of information they want, it has become increasingly difficult for search engines to produce the most relevant results for users. PageRank is one algorithm that has revolutionized the way search engines work. Developed by Google's Larry Page and Sergey Brin, it is used by Google to rank websites and display them in order of ranking in its search engine results. PageRank is a link analysis algorithm that assigns a weight to each document in a corpus and measures its relative importance within the corpus. The purpose of my project is to extract all the English Wikipedia data using the MediaWiki API and JWPL (Java Wikipedia Library), build the PageRank algorithm, and analyze its performance on this data set. Since the data set is too big to run on a single-node Hadoop cluster, the analysis is done in a high-performance computing cluster called Beocat, provided by the Kansas State University Computing and Information Sciences Department.
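For readers unfamiliar with the algorithm, a minimal non-distributed PageRank power iteration looks like the sketch below; a MapReduce formulation distributes exactly this per-link rank sharing across mappers and the summation across reducers. The toy graph and constants are illustrative, not the project's data.

```python
# Minimal PageRank power iteration on a toy link graph (illustrative only;
# the project runs the computation at Wikipedia scale on a cluster).
LINKS = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],                # D links out but receives no links
}
DAMPING, ITERATIONS = 0.85, 50

pages = list(LINKS)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(ITERATIONS):
    incoming = {p: 0.0 for p in pages}
    for page, outlinks in LINKS.items():
        share = rank[page] / len(outlinks)      # each page splits its rank over its outlinks
        for target in outlinks:
            incoming[target] += share
    # standard PageRank update with damping (teleportation) factor
    rank = {p: (1 - DAMPING) / len(pages) + DAMPING * incoming[p] for p in pages}

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```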
18

Distributed Storage and Processing of Image Data / Distribuerad lagring och bearbeting av bilddata

Dahlberg, Tobias January 2012 (has links)
Systems operating in a medical environment need to maintain high standards regarding availability and performance. Large amounts of images are stored and studied to determine what is wrong with a patient. This puts hard requirements on the storage of the images. In this thesis, ways of incorporating distributed storage into a medical system are explored. Products, inspired by the success of Google, Amazon and others, are experimented with and compared to the current storage solutions. Several “non-relational databases” (NoSQL) are investigated for storing medically relevant metadata of images, while a set of distributed file systems are considered for storing the actual images. Distributed processing of the stored data is investigated by using Hadoop MapReduce to generate a useful model of the images' metadata.
19

The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Khoshdel Nikkhoo, Hani 18 January 2011 (has links)
Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments is not fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the adhoc task of TREC 2009 web track.
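A compact illustration of shingling under a map/reduce split is given below. It is not the thesis's implementation; the documents, the hash-based sampling rule, and all names are placeholders, and the reduce step only surfaces candidate pairs rather than applying a similarity threshold.

```python
from collections import defaultdict
from itertools import combinations
import hashlib

DOCS = {   # toy documents; the real input would be a large web crawl
    "d1": "near duplicate documents can adversely affect the efficiency and "
          "effectiveness of search engines because they waste index space",
    "d2": "near duplicate documents can adversely affect the efficiency and "
          "effectiveness of search engines because they waste disk space",
    "d3": "an entirely different piece of text about distributed processing of corpora",
}
K = 4            # shingle length in words
SAMPLE_MOD = 2   # keep a shingle only if its hash % SAMPLE_MOD == 0 (a simple sampling rule)

def shingles(text):
    words = text.split()
    for i in range(len(words) - K + 1):
        yield " ".join(words[i:i + K])

def map_phase(doc_id, text):
    """Map: emit (sampled shingle hash, doc_id)."""
    for sh in shingles(text):
        h = int(hashlib.md5(sh.encode()).hexdigest(), 16)
        if h % SAMPLE_MOD == 0:
            yield h, doc_id

def reduce_phase(pairs):
    """Reduce: group doc ids per shingle, then count shared shingles per document pair."""
    by_shingle = defaultdict(set)
    for h, doc_id in pairs:
        by_shingle[h].add(doc_id)
    shared = defaultdict(int)
    for docs in by_shingle.values():
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)] += 1
    return dict(shared)

pairs = [kv for doc_id, text in DOCS.items() for kv in map_phase(doc_id, text)]
print(reduce_phase(pairs))   # pairs sharing many sampled shingles are near-duplicate candidates
```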
20

Application of MapReduce to Ranking SVM for Large-Scale Datasets

Hu, Su-Hsien 10 August 2010 (has links)
Nowadays, search engines increasingly rely on machine learning techniques to construct a model for ranking web pages, using past user queries and clicks as training data. There are several learning-to-rank methods for information retrieval, and among them ranking support vector machine (SVM) attracts a lot of attention in the information retrieval community. One difficulty with Ranking SVM is that the computational cost of constructing a ranking model is very high, due to the huge number of training data pairs when the training dataset is large. We adopt the MapReduce programming model to address this difficulty. MapReduce is a distributed computing framework introduced by Google and commonly adopted in cloud computing centers. It can easily handle large-scale datasets using a large number of computers. Moreover, it hides the messy details of parallelization, fault tolerance, data distribution, and load balancing from the programmer and allows him/her to focus only on the underlying problem to be solved. In this paper, we apply MapReduce to Ranking SVM for processing large-scale datasets. We specify the Map function to solve the dual subproblems involved in Ranking SVM and the Reduce function to aggregate all the outputs having the same intermediate key from the Map functions of distributed machines. Experimental results show efficiency improvements in Ranking SVM with our proposed approach.
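The map/reduce split described above can be sketched on toy data. Note that this illustration substitutes a simple pairwise hinge-loss subgradient step for the dual subproblem solver the paper actually specifies, and all data, parameters, and names are made up.

```python
import random

def make_pairs(n=600, dim=5, seed=0):
    """Toy preference pairs (xi, xj) where xi should rank above xj, generated
    from a hidden linear scorer; a stand-in for query/click training data."""
    rng = random.Random(seed)
    true_w = [rng.uniform(-1, 1) for _ in range(dim)]
    pairs = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in range(dim)]
        b = [rng.uniform(-1, 1) for _ in range(dim)]
        sa = sum(w * x for w, x in zip(true_w, a))
        sb = sum(w * x for w, x in zip(true_w, b))
        pairs.append((a, b) if sa > sb else (b, a))
    return pairs

def map_train(pairs, dim, epochs=20, lr=0.1, c=0.01):
    """Map: fit a linear ranking model on one partition of the pairs
    (a pairwise hinge-loss subgradient step stands in for the dual solver)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for xi, xj in pairs:
            diff = [a - b for a, b in zip(xi, xj)]
            margin = sum(wk * dk for wk, dk in zip(w, diff))
            if margin < 1:                       # violated ranking constraint
                w = [wk + lr * (dk - c * wk) for wk, dk in zip(w, diff)]
            else:
                w = [wk - lr * c * wk for wk in w]
    return ("model", w)                           # same intermediate key for every mapper

def reduce_average(keyed_models):
    """Reduce: average all partial weight vectors emitted under the same key."""
    vectors = [w for _, w in keyed_models]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

pairs = make_pairs()
partitions = [pairs[i::4] for i in range(4)]      # 4 "machines"
models = [map_train(p, dim=5) for p in partitions]
w = reduce_average(models)
accuracy = sum(sum(wk * (a - b) for wk, a, b in zip(w, xi, xj)) > 0 for xi, xj in pairs) / len(pairs)
print("pairwise ranking accuracy:", round(accuracy, 3))
```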
