Spelling suggestions: "subject:"mapereduce"" "subject:"hmapreduce""
121 |
Efficient Social Network Data Query Processing on MapReduceLiu, Liu 01 January 2013 (has links) (PDF)
Social network data analysis becomes increasingly important today. In order to improve the integration and reuse of their data, many social networks start to apply RDF to present the data. Accordingly, one common approach for social network data analysis is to employ SPARQL to query RDF data.
As the sizes of social networks expand rapidly, queries need to be executed in parallel such as using the MapReduce framework. However, the state-of-the-art translation from SPARQL queries to MapReduce jobs mainly follows a two layer rule, in which SPARQL is first translated to SQL join, is not efficient. In this thesis, we introduce two primitives to enable automatic translation from SPARQL to MapReduce, and to enable efficient execution of the SPARQL queries. We use multiple-join-with-filter to substitute traditional SQL multiple join when feasible, and merge different stages in the MapReduce query workflow. The evaluation on social network benchmarks shows that these two primitives can achieve up to 2x speedup in query running time compared with the original two layer scheme.
|
122 |
Improving Performance And Programmer Productivity For I/o-intensive High Performance Computing ApplicationsSehrish, Saba 01 January 2010 (has links)
Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. HPC applications are generating and processing large amount of data ranging from terabytes (TB) to petabytes (PB). This new trend of growth in data for HPC applications has imposed challenges as to what is an appropriate parallel programming framework to efficiently process large data sets. In this work, we study the applicability of two programming models (MPI/MPI-IO and MapReduce) to a variety of I/O-intensive HPC applications ranging from simulations to analytics. We identify several performance and programmer productivity related limitations of these existing programming models, if used for I/O-intensive applications. We propose new frameworks which will improve both performance and programmer productivity for the emerging I/O-intensive applications. Message Passing Interface (MPI) is widely used for writing HPC applications. MPI/MPI- IO allows a fine-grained control of assigning data and task distribution. At the programming frameworks level, various optimizations have been proposed to improve the performance of MPI/MPI-IO function calls. These performance optimizations are provided as various function options to the programmers. In order to write an efficient code, they are required to know the exact usage of the optimization functions, hence programmer productivity is limited. We propose an abstraction called Reduced Function Set Abstraction (RFSA) for MPI-IO to reduce the number of I/O functions and provide methods to automate the selection of appropriate I/O function for writing HPC simulation applications. The purpose of RFSA is to hide the performance optimization functions from the application developer, and relieve the application developer from deciding on a specific function. The proposed set of functions relies on a selection algorithm to decide among the most common optimizations provided by MPI-IO. Additionally, many application scientists are looking to integrate data-intensive computing into computational-intensive High Performance Computing facilities, particularly for data analytics. We have observed several scientific applications which must migrate their data from an HPC storage system to a data-intensive one. There is a gap between the data semantics of HPC storage and data-intensive system, hence, once migrated, the data must be further refined and reorganized. This reorganization must be performed before existing data-intensive tools such as MapReduce can be effectively used to analyze data. This reorganization requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application, in the form of excessive I/O operations. For every MapReduce application that must be run in order to complete the desired data analysis, a distributed read and write operation on the file system must be performed. Our contribution is to extend Map-Reduce to eliminate the multiple scans and also reduce the number of pre-processing MapReduce programs. We have added additional expressiveness to the MapReduce language in our novel framework called MapReduce with Access Patterns (MRAP), which allows users to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data pre-processing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system. We also provide a scheduling mechanism to further improve the performance of these applications. The main contributions of this thesis are, 1) We implement a selection algorithm for I/O functions like read/write, merge a set of functions for data types and file views and optimize the atomicity function by automating the locking mechanism in RFSA. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show an improved programmer productivity (35.7% on average). This approach incurs an overhead of 2-5% for one particular optimization, and shows performance improvement of 17% when a combination of different optimizations is required by an application. 2) We provide an augmented Map-Reduce system (MRAP), which consist of an API and corresponding optimizations i.e. data restructuring and scheduling. We have demonstrated up to 33% throughput improvement in one real application (read-mapping in bioinformatics), and up to 70% in an I/O kernel of another application (halo catalogs analytics). Our scheduling scheme shows performance improvement of 18% for an I/O kernel of another application (QCD analytics).
|
123 |
Time-sensitive Information Communication, Sensing, and Computing in Cyber-Physical SystemsLi, Xinfeng 08 September 2014 (has links)
No description available.
|
124 |
A Holistic Study on Electronic and Visual Signal Integration for Efficient SurveillanceLi, Gang 11 August 2017 (has links)
No description available.
|
125 |
Maximizing Parallelization Opportunities by Automatically Inferring Optimal Container Memory for Asymmetrical Map TasksShrimal, Shubhendra 18 July 2016 (has links)
No description available.
|
126 |
Designing and Modeling High-Performance MapReduce and DAG Execution Framework on Modern HPC SystemsRahman, Md Wasi-ur- January 2016 (has links)
No description available.
|
127 |
A resource management framework for cloud computingLi, Min 06 May 2014 (has links)
The cloud computing paradigm is realized through large scale distributed resource management and computation platforms such as MapReduce, Hadoop, Dryad, and Pregel. These platforms enable quick and efficient development of a large range of applications that can be sustained at scale in a fault-tolerant fashion. Two key technologies, namely resource virtualization and feature-rich enterprise storage, are further driving the wide-spread adoption of virtualized cloud environments. Many challenges arise when designing resource management techniques for both native and virtualized data centers. First, parameter tuning of MapReduce jobs for efficient resource utilization is a daunting and time consuming task. Second, while the MapReduce model is designed for and leverages information from native clusters to operate efficiently, the emergence of virtual cluster topology results in overlaying or hiding the actual network information. This leads to two resource selection and placement anomalies: (i) loss of data locality, and (ii) loss of job locality. Consequently, jobs may be placed physically far from their associated data or related jobs, which adversely affect the overall performance. Finally, the extant resource provisioning approach leads to significant wastage as enterprise cloud providers have to consider and provision for peak loads instead of average load (that is many times lower).
In this dissertation, we design and develop a resource management framework to address the above challenges. We first design an innovative resource scheduler, CAM, aimed at MapReduce applications running in virtualized cloud environments. CAM reconciles both data and VM resource allocation with a variety of competing constraints, such as storage utilization, changing CPU load and network link capacities based on a flow-network algorithm. Additionally, our platform exposes the typically hidden lower-level topology information to the MapReduce job scheduler, which enables it to make optimal task assignments. Second, we design an online performance tuning system, mrOnline, which monitors the MapReduce job execution, tunes the parameters based on collected statistics and provides fine-grained control over parameter configuration changes to the user. To this end, we employ a gray-box based smart hill-climbing algorithm that leverages MapReduce runtime statistics and effectively converge to a desirable configuration within a single iteration. Finally, we target enterprise applications in virtualized environment where typically a network attached centralized storage system is deployed. We design a new protocol to share primary data de-duplication information available at the storage server with the client. This enables better client-side cache utilization and reduces server-client network traffic, which leads to overall high performance. Based on the protocol, a workload aware VM management strategy is further introduced to decrease the load to the storage server and enhance the I/O efficiency for clients. / Ph. D.
|
128 |
Exploiting Heterogeneity in Distributed Software FrameworksKumaraswamy Ravindranathan, Krishnaraj 08 January 2016 (has links)
The objective of this thesis is to address the challenges faced in sustaining efficient, high-performance and scalable Distributed Software Frameworks (DSFs), such as MapReduce, Hadoop, Dryad, and Pregel, for supporting data-intensive scientific and enterprise applications on emerging heterogeneous compute, storage and network infrastructure. Large DSF deployments in the cloud continue to grow both in size and number, given DSFs are cost-effective and easy to deploy. DSFs are becoming heterogeneous with the use of advanced hardware technologies and due to regular upgrades to the system. For instance, low-cost, power-efficient clusters that employ traditional servers along with specialized resources such as FPGAs, GPUs, powerPC, MIPS and ARM based embedded devices, and high-end server-on-chip solutions will drive future DSFs infrastructure. Similarly, high-throughput DSF storage is trending towards hybrid and tiered approaches that use large in-memory buffers, SSDs, etc., in addition to disks. However, the schedulers and resource managers of these DSFs assume the underlying hardware to be similar or homogeneous. Another problem faced in evolving applications is that they are typically complex workflows comprising of different kernels. The kernels can be diverse, e.g., compute-intensive processing followed by data-intensive visualization and each kernel will have a different affinity towards different hardware. Because of the inability of the DSFs to understand heterogeneity of the underlying hardware architecture and applications, existing resource managers cannot ensure appropriate resource-application match for better performance and resource usage.
In this dissertation, we design, implement, and evaluate DerbyhatS, an application-characteristics-aware resource manager for DSFs, which predicts the performance of the application under different hardware configurations and dynamically manage compute and storage resources as per the application needs. We adopt a quantitative approach where we first study the detailed behavior of various Hadoop applications running on different hardware configurations and propose application-attuned dynamic system management in order to improve the resource-application match. We re-design the Hadoop Distributed File System (HDFS) into a multi-tiered storage system that seamlessly integrates heterogeneous storage technologies into the HDFS. We also propose data placement and retrieval policies to improve the utilization of the storage devices based on their characteristics such as I/O throughput and capacity. DerbyhatS workflow scheduler is an application-attuned workflow scheduler and is constituted by two components. phi-Sched coupled with epsilon-Sched manages the compute heterogeneity and DUX coupled with AptStore manages the storage substrate to exploit heterogeneity. DerbyhatS will help realize the full potential of the emerging infrastructure for DSFs, e.g., cloud data centers, by offering many advantages over the state of the art by ensuring application-attuned, dynamic heterogeneous resource management. / Ph. D.
|
129 |
Distributed Frameworks Towards Building an Open Data ArchitectureVenumuddala, Ramu Reddy 05 1900 (has links)
Data is everywhere. The current Technological advancements in Digital, Social media and the ease at which the availability of different application services to interact with variety of systems are causing to generate tremendous volumes of data. Due to such varied services, Data format is now not restricted to only structure type like text but can generate unstructured content like social media data, videos and images etc. The generated Data is of no use unless been stored and analyzed to derive some Value. Traditional Database systems comes with limitations on the type of data format schema, access rates and storage sizes etc. Hadoop is an Apache open source distributed framework that support storing huge datasets of different formatted data reliably on its file system named Hadoop File System (HDFS) and to process the data stored on HDFS using MapReduce programming model. This thesis study is about building a Data Architecture using Hadoop and its related open source distributed frameworks to support a Data flow pipeline on a low commodity hardware. The Data flow components are, sourcing data, storage management on HDFS and data access layer. This study also discuss about a use case to utilize the architecture components. Sqoop, a framework to ingest the structured data from database onto Hadoop and Flume is used to ingest the semi-structured Twitter streaming json data on to HDFS for analysis. The data sourced using Sqoop and Flume have been analyzed using Hive for SQL like analytics and at a higher level of data access layer, Hadoop has been compared with an in memory computing system using Spark. Significant differences in query execution performances have been analyzed when working with Hadoop and Spark frameworks. This integration helps for ingesting huge Volumes of streaming json Variety data to derive better Value based analytics using Hive and Spark.
|
130 |
Entrepôt de textes : de l'intégration à la modélisation multidimensionnelle de données textuelles / Text Warehouses : from the integration to the multidimensional modeling of textual dataAknouche, Rachid 26 April 2014 (has links)
Le travail présenté dans ce mémoire vise à proposer des solutions aux problèmes d'entreposage des données textuelles. L'intérêt porté à ce type de données est motivé par le fait qu'elles ne peuvent être intégrées et entreposées par l'application de simples techniques employées dans les systèmes décisionnels actuels. Pour aborder cette problématique, nous avons proposé une démarche pour la construction d'entrepôts de textes. Elle couvre les principales phases d'un processus classique d'entreposage des données et utilise de nouvelles méthodes adaptées aux données textuelles. Dans ces travaux de thèse, nous nous sommes focalisés sur les deux premières phases qui sont l'intégration des données textuelles et leur modélisation multidimensionnelle. Pour mettre en place une solution d'intégration de ce type de données, nous avons eu recours aux techniques de recherche d'information (RI) et du traitement automatique du langage naturel (TALN). Pour cela, nous avons conçu un processus d'ETL (Extract-Transform-Load) adapté aux données textuelles. Il s'agit d'un framework d'intégration, nommé ETL-Text, qui permet de déployer différentes tâches d'extraction, de filtrage et de transformation des données textuelles originelles sous une forme leur permettant d'être entreposées. Certaines de ces tâches sont réalisées dans une approche, baptisée RICSH (Recherche d'information contextuelle par segmentation thématique de documents), de prétraitement et de recherche de données textuelles. D'autre part, l'organisation des données textuelles à des fins d'analyse est effectuée selon TWM (Text Warehouse Modelling), un nouveau modèle multidimensionnel adapté à ce type de données. Celui-ci étend le modèle en constellation classique pour prendre en charge la représentation des textes dans un environnement multidimensionnel. Dans TWM, il est défini une dimension sémantique conçue pour structurer les thèmes des documents et pour hiérarchiser les concepts sémantiques. Pour cela, TWM est adossé à une source sémantique externe, Wikipédia, en l'occurrence, pour traiter la partie sémantique du modèle. De plus, nous avons développé WikiCat, un outil pour alimenter la dimension sémantique de TWM avec des descripteurs sémantiques issus de Wikipédia. Ces deux dernières contributions complètent le framework ETL-Text pour constituer le dispositif d'entreposage des données textuelles. Pour valider nos différentes contributions, nous avons réalisé, en plus des travaux d'implémentation, une étude expérimentale pour chacune de nos propositions. Face au phénomène des données massives, nous avons développé dans le cadre d'une étude de cas des algorithmes de parallélisation des traitements en utilisant le paradigme MapReduce que nous avons testés dans l'environnement Hadoop. / The work, presented in this thesis, aims to propose solutions to the problems of textual data warehousing. The interest in the textual data is motivated by the fact that they cannot be integrated and warehoused by using the traditional applications and the current techniques of decision-making systems. In order to overcome this problem, we proposed a text warehouses approach which covers the main phases of a data warehousing process adapted to textual data. We focused specifically on the integration of textual data and their multidimensional modeling. For the textual data integration, we used information retrieval (IR) techniques and automatic natural language processing (NLP). Thus, we proposed an integration framework, called ETL-Text which is an ETL (Extract- Transform- Load) process suitable for textual data. The ETL-Text performs the extracting, filtering and transforming tasks of the original textual data in a form allowing them to be warehoused. Some of these tasks are performed in our RICSH approach (Contextual information retrieval by topics segmentation of documents) for pretreatment and textual data search. On the other hand, the organization of textual data for the analysis is carried out by our proposed TWM (Text Warehouse Modelling). It is a new multidimensional model suitable for textual data. It extends the classical constellation model to support the representation of textual data in a multidimensional environment. TWM includes a semantic dimension defined for structuring documents and topics by organizing the semantic concepts into a hierarchy. Also, we depend on a Wikipedia, as an external semantic source, to achieve the semantic part of the model. Furthermore, we developed WikiCat, which is a tool permit to feed the TWM semantic dimension with semantics descriptors from Wikipedia. These last two contributions complement the ETL-Text framework to establish the text warehouse device. To validate the different contributions, we performed, besides the implementation works, an experimental study for each model. For the emergence of large data, we developed, as part of a case study, a parallel processing algorithms using the MapReduce paradigm tested in the Apache Hadoop environment.
|
Page generated in 0.5625 seconds