31

HADOOP-EDF: LARGE-SCALE DISTRIBUTED PROCESSING OF ELECTROPHYSIOLOGICAL SIGNAL DATA IN HADOOP MAPREDUCE

Wu, Yuanyuan 01 January 2019 (has links)
Rapidly growing volumes of electrophysiological signal data are being generated for clinical research on neurological disorders. The European Data Format (EDF) is a standard format for storing electrophysiological signals. However, the bottleneck of existing signal-analysis tools when handling large-scale datasets is that large EDF files must be loaded sequentially before any analysis can begin. To overcome this, we develop Hadoop-EDF, a distributed signal-processing tool that loads EDF data in parallel using Hadoop MapReduce. Hadoop-EDF uses a robust data-partition algorithm that makes EDF data processable in parallel. We evaluate Hadoop-EDF's scalability and performance using two datasets from the National Sleep Research Resource and running experiments on Amazon Web Services clusters. On a 20-node cluster, Hadoop-EDF runs 27 times and 47 times faster than sequential processing for 200 small files and 200 large files, respectively. The results demonstrate that Hadoop-EDF is well suited to processing large EDF files.
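A minimal sketch of the kind of mapper such a tool enables, assuming the partition step has already turned binary EDF records into text lines of the form "channel<TAB>sample"; the class name and record format are hypothetical illustrations, not the thesis's actual partition algorithm:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EdfChannelMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line carries one sample of one channel.
        String[] parts = line.toString().split("\t");
        if (parts.length == 2) {
            // Emit (channel, sample) so a reducer can aggregate per channel.
            context.write(new Text(parts[0]),
                          new DoubleWritable(Double.parseDouble(parts[1])));
        }
    }
}
```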
32

Design and implementation of a Hadoop-based secure cloud computing architecture

Cheng, Sheng-Lun 31 January 2011 (has links)
The goal of this research is to design and implement a secure Hadoop cluster. Cloud computing is a form of network computing in which most data is transmitted over the network. To develop a secure cloud architecture, we must first authenticate users and protect transmitted data against theft and falsification, so that even if someone steals the data, its content remains hard to read. We therefore focus on the following points: I. Authorization: First, we investigate the user-authorization problem in the Hadoop system and propose two solutions: SOCKS Authorization and Service Level Authorization. SOCKS Authorization is an external authorization mechanism for the Hadoop system that uses a username/password to identify users. Service Level Authorization is a new authorization mechanism in Hadoop 0.20; it ensures that clients connecting to a particular Hadoop service have the necessary, pre-configured permissions and are authorized to access the given service. II. Transmission Encryption: To keep important data, such as Block IDs, Job IDs, and usernames, from being exposed on untrusted networks, we examine Hadoop transmissions in practice and point out possible security problems. We then use IPsec to implement transmission encryption and packet verification for Hadoop. III. Architecture Design: Based on the implementation framework of Hadoop mentioned above, we propose a secure Hadoop cluster architecture to solve these security problems. In addition, we evaluate the performance of HDFS and MapReduce in this architecture.
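For the SOCKS approach, a Hadoop client can be routed through a proxy using standard Hadoop configuration keys. A sketch under the assumption of a SOCKS proxy at socks.example.org:1080 (hosts and ports are placeholders); the thesis's username/password check would happen at the proxy itself:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SocksHadoopClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.org:9000");
        // Route all Hadoop RPC sockets through a SOCKS proxy.
        conf.set("hadoop.rpc.socket.factory.class.default",
                 "org.apache.hadoop.net.SocksSocketFactory");
        conf.set("hadoop.socks.server", "socks.example.org:1080");
        FileSystem fs = FileSystem.get(conf);
        // Every RPC behind this call now traverses the proxy.
        System.out.println(fs.exists(new Path("/")));
    }
}
```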
33

IMPLEMENTATION OF A CLOUD SHELL FOR LIGHT-WEIGHT UNIX PROGRAMMABILITY SUPPORT IN A DISTRIBUTED CLOUD ENVIRONMENT

Wei, Tzu-Chieh 09 February 2012 (has links)
This thesis describes the implementation of a UNIX-style shell environment for cloud systems. The new scripting language, the cloud shell (CLSH), uses a syntax based on the familiar BASH shell of UNIX systems, which allows users to learn the new environment quickly. The difference, compared to BASH, is that CLSH gives the user easy access to the parallelism of the cloud. Indeed, the user does not need to refer to the cloud explicitly at all; the cloud becomes simply a virtual file system, and the user experience is quite similar to standard BASH programming. The cloud shell is built on Hadoop's HDFS file system. The difference, compared to HDFS, is that CLSH offers a full range of UNIX-style commands rather than a small subset of simple commands. Moreover, CLSH is a full-fledged scripting language that offers much more control over file management than HDFS does. To achieve comparable behavior within HDFS, the user must use either the Pig Latin tool or Java scripting. Not only are these alternatives harder to use than CLSH, they also run more slowly and cannot perform certain tasks that CLSH achieves easily. Moreover, the cloud shell environment simply provides the user with a better cloud interface; it does not preclude the use of Pig Latin or Java scripts.
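To ground the comparison, here is roughly what a one-line CLSH pipeline such as `ls /logs | grep .log` might replace when written against the standard HDFS FileSystem API; the /logs path is a made-up example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListLogs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Enumerate a directory and keep only *.log entries by hand --
        // the kind of boilerplate a shell pipeline hides from the user.
        for (FileStatus status : fs.listStatus(new Path("/logs"))) {
            if (status.getPath().getName().endsWith(".log")) {
                System.out.println(status.getPath());
            }
        }
    }
}
```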
34

Applying MapReduce Island-based Genetic Algorithm-Particle Swarm Optimization to the inference of large Gene Regulatory Network in Cloud Computing environment

Huang, Wei-Jhe 13 September 2012 (has links)
The construction of Gene Regulatory Networks (GRNs) is one of the most important issues in systems biology. To infer a large-scale GRN with a nonlinear mathematical model, researchers face a time-consuming computation because of the large number of network parameters involved. In recent years, cloud computing techniques have been widely used to solve large-scale problems. Among them, Hadoop is currently the best-known and most reliable cloud computing framework; it allows users to analyze large amounts of data in a distributed environment (i.e., with MapReduce) and supports data backup and recovery mechanisms. This study proposes an island-based GAPSO algorithm for inferring large-scale GRNs in the Hadoop cloud computing environment. GAPSO exploits the position and velocity update functions of Particle Swarm Optimization (PSO) and integrates the operations of a Genetic Algorithm (GA); this approach is often used to derive optimal solutions for nonlinear mathematical models. Several sets of experiments were conducted, with the number of network nodes varying from 50 to 125, executed in Hadoop distributed environments of 10, 20, and 26 computers, respectively. When inferring the network with 125 gene nodes on the largest Hadoop cluster (26 computers), the proposed framework performed up to 9.7 times faster than a stand-alone computer, meaning our approach reduces the computation time of a single experimental run by roughly 90%.
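As an illustration of the hybrid update the abstract describes, a single-particle GAPSO step might combine the usual PSO velocity/position equations with a GA-style mutation, sketched below; the inertia, acceleration, and mutation constants are placeholders, not the thesis's tuned values:

```java
import java.util.Random;

public class GapsoParticle {
    private static final Random RNG = new Random();
    final double[] position, velocity, personalBest;

    GapsoParticle(int dims) {
        position = new double[dims];
        velocity = new double[dims];
        personalBest = new double[dims];
    }

    void update(double[] globalBest, double w, double c1, double c2,
                double mutationRate) {
        for (int i = 0; i < position.length; i++) {
            // Standard PSO velocity and position update.
            velocity[i] = w * velocity[i]
                    + c1 * RNG.nextDouble() * (personalBest[i] - position[i])
                    + c2 * RNG.nextDouble() * (globalBest[i] - position[i]);
            position[i] += velocity[i];
            // GA-style mutation: occasionally perturb one dimension so the
            // hybrid can escape local optima that pure PSO may settle into.
            if (RNG.nextDouble() < mutationRate) {
                position[i] += RNG.nextGaussian() * 0.1;
            }
        }
    }
}
```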
35

Enabling Large-Scale Mining Software Repositories (MSR) Studies Using Web-Scale Platforms

Shang, Weiyi 31 May 2010 (has links)
The Mining Software Repositories (MSR) field analyzes software data to uncover knowledge and assist software development. Software projects and products continue to grow in size and complexity, and in-depth analysis of these large systems and their evolution is needed to better understand the characteristics of such large-scale systems and projects. However, classical software-analysis platforms (e.g., Prolog-like or SQL-like systems, or specialized programming scripts) face many challenges when performing large-scale MSR studies. Such platforms rarely scale out of the box; instead, they often require analysis-specific, one-time, ad hoc scaling tricks and designs that are not reusable for other types of analysis and are costly to maintain. We believe the web community has already faced many of the scaling challenges now facing the software engineering community as it copes with the enormous growth of web data. In this thesis, we report on our experience using MapReduce and Pig, two web-scale platforms, to perform large MSR studies. Through our case studies, we demonstrate the benefits and challenges of using web platforms to prepare (i.e., Extract, Transform, and Load, ETL) software data for further analysis. The results of our studies show that: 1) web-scale platforms provide an effective and efficient basis for large-scale MSR studies; and 2) many of the web community's guidelines for using web-scale platforms must be modified to achieve optimal performance in large-scale MSR studies. This thesis will help other software engineering researchers who want to scale their studies. / Thesis (Master, Computing) -- Queen's University, 2010-05-28
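A toy instance of the ETL preparation step described here: a Hadoop mapper that extracts the author field from one-line commit records and emits partial counts for a downstream reducer. The `author|date|message` record format is an assumption for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CommitAuthorMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split "author|date|message" and emit (author, 1).
        String[] fields = value.toString().split("\\|");
        if (fields.length >= 1 && !fields[0].isEmpty()) {
            context.write(new Text(fields[0]), ONE);
        }
    }
}
```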
36

Análisis y comparación entre el motor de bases de datos orientado a columnas Infobright y el framework de aplicaciones distribuidas Hadoop en escenarios de uso de bases de datos analíticas / Analysis and comparison between the column-oriented database engine Infobright and the distributed application framework Hadoop in analytical database usage scenarios

Silva Balocchi, Erika Fernanda January 2014 (has links)
Ingeniera Civil en Computación / Business Intelligence is the ability to transform data into information, and information into knowledge, so that business decision-making can be optimized. Due to the exponential growth in the amount of available data in recent years and its complexity, traditional database and business intelligence tools may fall short, posing numerous risks for companies. The objective of this thesis was to analyze the use of the distributed application framework Hadoop in comparison with the current solution at Penta Analytics, seeking to make better use of the infrastructure and to increase data availability as data volumes grow. The company currently uses an analytical database engine called Infobright, whose columnar structure allows queries to be executed efficiently, but only on a single server, limiting its data-handling capacity and the efficient use of all servers. For the comparison, two real data-processing cases were considered, OLAP queries and ETL, in addition to three standard query cases. Each case was run in three variants of data volume to evaluate performance as the data grew. The Hadoop solution was deployed on a cloud cluster with three servers (one master and two slaves). The data to be processed was stored in the cluster's file system, and the query sets were executed with two Hadoop tools: Hive and Impala. The results showed that Hive has higher execution times than Impala and Infobright, due to the overhead of launching map and reduce tasks; however, it is the only option that tolerates the failure of a node. Impala, in turn, shows the lowest latency, with a much shorter response time than Infobright, but also the highest memory usage. From the results it was observed that Hive behaves better in heavy ETL-style jobs where robustness matters more than speed, while Impala fits better for light queries where speed is paramount. It was concluded that combining different tools in a Hadoop environment can offer good performance, better machine utilization, and potential fault tolerance, although the learning curve involved must be taken into account.
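As a sketch of how such a benchmark can be driven from one harness, both engines speak the HiveServer2 JDBC protocol (Hive typically on port 10000, Impala on 21050); the hosts and query below are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryTimer {
    // Run one query over JDBC and return its wall-clock time in ms.
    static long timeQuery(String url, String sql) throws Exception {
        long start = System.nanoTime();
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) { /* drain results */ }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        String sql = "SELECT COUNT(*) FROM sales";
        System.out.println("Hive:   " + timeQuery(
                "jdbc:hive2://hive-host:10000/default", sql) + " ms");
        System.out.println("Impala: " + timeQuery(
                "jdbc:hive2://impala-host:21050/default;auth=noSasl", sql) + " ms");
    }
}
```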
37

A distributed approach to Frequent Itemset Mining at low support levels

Clark, Neal 22 December 2014 (has links)
Frequent Itemset Mining, the process of finding frequently co-occurring sets of items in a dataset, has been at the core of the field of data mining for the past 25 years. During this time, datasets have grown much faster than the algorithms' capacity to process them. Great progress has been made at optimizing this task on a single computer; however, despite years of research, very little progress has been made on parallelizing it. FP-Growth-based algorithms have proven notoriously difficult to parallelize, and Apriori has largely fallen out of favor with the research community. In this thesis we introduce a parallel, Apriori-based Frequent Itemset Mining algorithm capable of distributing computation across large commodity clusters. Our case study demonstrates that the algorithm can efficiently scale to hundreds of cores on a standard Hadoop MapReduce cluster and can improve execution times by at least an order of magnitude at the lowest support levels. / Graduate
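One common way the Apriori counting phase maps onto MapReduce, consistent with the distributed approach described here: each mapper tests candidate itemsets against its share of transactions and emits partial counts. In this sketch the candidates are hard-coded rather than broadcast via the distributed cache:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CandidateCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // In a real job these k-itemsets would come from the distributed cache.
    private static final List<Set<String>> CANDIDATES = Arrays.asList(
            new HashSet<>(Arrays.asList("bread", "milk")),
            new HashSet<>(Arrays.asList("bread", "butter")));

    @Override
    protected void map(LongWritable key, Text txn, Context context)
            throws IOException, InterruptedException {
        // One space-separated transaction per input line.
        Set<String> items =
                new HashSet<>(Arrays.asList(txn.toString().split(" ")));
        for (Set<String> candidate : CANDIDATES) {
            if (items.containsAll(candidate)) {
                context.write(new Text(candidate.toString()), ONE);
            }
        }
    }
}
```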
38

Datový sklad v prostředí Amazon Web Services / Data warehouse in the Amazon Web Services

Kuželka, Kryštof January 2015 (has links)
The primary objective of this work is to investigate the potential of using Hadoop and Amazon Redshift in the Amazon Web Services ("AWS") cloud to design and implement a data warehouse, whose efficiency is then tested. Contributions of this work include documenting the AWS cloud technologies in Czech and demonstrating the design and performance tests of the data warehouse and its ETL part. Another considerable benefit is the added value to the company for which the project was designed and which currently uses its output.
39

Aplikace pro Big Data / Application for Big Data

Blaho, Matúš January 2018 (has links)
This work deals with the description and analysis of the Big Data concept and its processing and use in decision support. The proposed processing is based on the MapReduce model designed for Big Data processing. The theoretical part of the work focuses largely on the Hadoop system, which implements this model; understanding it is key to properly designing applications that run within it. The work also contains designs for specific Big Data processing applications. The implementation part of the thesis describes Hadoop system administration, the implementation of MapReduce applications, and their testing on data sets.
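As a concrete example of the MapReduce application structure discussed, the canonical word-count job wires a mapper and reducer into a Hadoop driver; this is a generic illustration, not one of the thesis's own applications:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                ctx.write(new Text(tok.nextToken()), ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context ctx)
                throws IOException, InterruptedException {
            // Sum the partial counts for each word.
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```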
40

Zpracování a vizualizace senzorových dat ve vojenském prostředí / Processing and Visualization of Military Sensor Data

Boychuk, Maksym January 2016 (has links)
This thesis deals with creating, visualizing, and processing data in a military environment. The task is to design and implement a system that enables the creation, visualization, and processing of ESM data. The result of this work is the ESMBD application, which supports both a classical approach (a relational database) and Big Data technologies for data storage and manipulation. A comparison of data-processing speed between the classic approach (a PostgreSQL database) and Big Data technologies (the Cassandra database and Hadoop) was also carried out.
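A sketch of the Big Data storage path, assuming a Cassandra keyspace `esm` with a table `records(id, ts, payload)` and the DataStax 3.x Java driver; the schema and contact point are assumptions, not the ESMBD application's actual design:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import java.util.Date;
import java.util.UUID;

public class EsmStore {
    public static void main(String[] args) {
        // Cluster and Session are Closeable in the 3.x driver.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("esm")) {
            // Prepared statements avoid re-parsing the CQL per insert.
            session.execute(session
                    .prepare("INSERT INTO records (id, ts, payload) "
                           + "VALUES (?, ?, ?)")
                    .bind(UUID.randomUUID(), new Date(),
                          "sample-esm-payload"));
        }
    }
}
```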
