71

Bayesian-based Traffic State Estimation in Large-Scale Networks Using Big Data

Gu, Yiming 01 February 2017 (has links)
Traffic state estimation (TSE) aims to estimate the time-varying traffic characteristics (such as flow rate, flow speed, flow density, and occurrence of incidents) of all roads in traffic networks, given limited observations that are sparse in time and location. TSE is critical to transportation planning, operation, and infrastructure design. In this new era of "big data", massive volumes of sensing data from a variety of sources (such as cell phones, GPS, probe vehicles, and inductive loops) enable TSE in an efficient, timely, and accurate manner. This research develops a Bayesian-based theoretical framework, along with statistical inference algorithms, to (1) capture the complex flow patterns in urban traffic networks consisting of both highways and arterials; (2) incorporate heterogeneous data sources into the process of TSE; (3) enable both estimation and prediction of traffic states; and (4) demonstrate scalability to large-scale urban traffic networks. To achieve those goals, a hierarchical Bayesian probabilistic model is proposed to capture spatio-temporal traffic states. The propagation of traffic states is encapsulated through mesoscopic network flow models (namely the Link Queue Model) and equilibrated fundamental diagrams. Traffic states in the hierarchical Bayesian model are inferred using the Expectation-Maximization Extended Kalman Filter (EM-EKF). To better estimate and predict states, infrastructure supply is also estimated as part of the TSE process, by adopting a series of algorithms to translate Twitter data into traffic incident information. Finally, the proposed EM-EKF algorithm is implemented and examined on the road networks in Washington, DC. The results show that the proposed methods can handle large-scale traffic state estimation while achieving superior results compared to traditional temporal and spatial smoothing methods.
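For readers unfamiliar with the extended Kalman filter on which the EM-EKF algorithm builds, the following is a minimal sketch of one predict/update cycle in Python. The state-transition and observation functions and their Jacobians are placeholders supplied by the caller; the thesis's actual Link Queue Model dynamics and EM step are not reproduced here.

```python
import numpy as np

def ekf_step(x, P, z, f, F_jac, h, H_jac, Q, R):
    """One predict/update cycle of an extended Kalman filter (generic sketch).

    x, P : prior state mean and covariance (e.g. densities/speeds on all links)
    z    : current observation vector (sparse sensor readings)
    f, h : state-transition and observation functions
    F_jac, H_jac : their Jacobians evaluated at the current estimate
    Q, R : process and observation noise covariances
    """
    # Predict: propagate the state through the traffic-flow model
    x_pred = f(x)
    F = F_jac(x)
    P_pred = F @ P @ F.T + Q

    # Update: correct the prediction with the observed sensor data
    H = H_jac(x_pred)
    y = z - h(x_pred)                      # innovation
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```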
72

Scientific Workflows for Hadoop

Bux, Marc Nicolas 07 August 2018 (has links)
Scientific workflows provide a means to model, execute, and exchange the increasingly complex analysis pipelines necessary for today's data-driven science. Over the last decades, scientific workflow management systems have emerged to facilitate the design, execution, and monitoring of such workflows. At the same time, the amounts of data generated in various areas of science have outpaced hardware advancements. Parallelization and distributed execution are generally proposed to deal with increasing amounts of data. However, the resources provided by distributed infrastructures are subject to heterogeneity, dynamic performance changes at runtime, and occasional failures. To leverage the scalability provided by these infrastructures despite the observed aspects of performance variability, workflow management systems have to progress: Parallelization potentials in scientific workflows have to be detected and exploited. Simulation frameworks, which are commonly employed for the evaluation of scheduling mechanisms, have to consider the instability encountered on the infrastructures they emulate. Adaptive scheduling mechanisms have to be employed to optimize resource utilization in the face of instability. State-of-the-art systems for scalable distributed resource management and storage, such as Apache Hadoop, have to be supported. This dissertation presents novel solutions for these requirements. First, we introduce DynamicCloudSim, a cloud computing simulation framework that is able to adequately model the various aspects of variability encountered in computational clouds. Second, we outline ERA, an adaptive scheduling policy that optimizes workflow makespan by exploiting heterogeneity, replicating bottlenecks in workflow execution, and adapting to changes in the underlying infrastructure. Finally, we present Hi-WAY, an execution engine that integrates ERA and enables the highly scalable execution of scientific workflows written in a number of languages on Hadoop.
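To make the idea of adaptive, replication-based scheduling concrete, here is a toy greedy policy in Python. It only illustrates the general principle (send the likely bottleneck task to the fastest idle node and speculatively replicate it); it is not the ERA algorithm, and the Node type and runtime estimates are hypothetical.

```python
from collections import namedtuple

Node = namedtuple("Node", ["name", "speed"])   # speed: relative compute capacity

def assign_tasks(ready_tasks, idle_nodes, est_runtime):
    """Toy greedy policy (not ERA itself): place the longest outstanding task
    on the fastest idle node and replicate it on the next-fastest one."""
    if not ready_tasks or not idle_nodes:
        return []
    task = max(ready_tasks, key=lambda t: est_runtime[t])      # likely bottleneck
    nodes = sorted(idle_nodes, key=lambda n: n.speed, reverse=True)
    assignments = [(task, nodes[0])]
    if len(nodes) > 1:
        assignments.append((task, nodes[1]))   # replica hedges against slow or failed nodes
    return assignments

# Example: three ready tasks, two idle nodes of different speed
print(assign_tasks(["t1", "t2", "t3"],
                   [Node("n1", 1.0), Node("n2", 2.5)],
                   {"t1": 30, "t2": 120, "t3": 45}))
```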
73

An architecture for processing large volumes of data by integrating scientific workflow systems and the MapReduce paradigm

Zorrilla Coz, Rocío Milagros 13 September 2012 (has links)
With the exponential growth of computational power and of the data generated by scientific experiments and simulations, it is possible today to find simulations that generate terabytes of data and scientific experiments that gather petabytes of data. The type of processing required for this data is currently known as data-intensive computing. The MapReduce paradigm, included in the Hadoop framework, is an increasingly used parallelization technique for the execution of distributed applications. This framework is responsible for scheduling the execution of jobs in clusters, provides fault tolerance, and manages all necessary communication between machines. For many types of complex applications, Scientific Workflow Systems offer advanced functionalities that can be leveraged for the development, execution, and evaluation of scientific experiments in different computational environments. In the Query Evaluation Framework (QEF), workflow activities are represented as algebraic operators, and application-specific data types are encapsulated in a common tuple structure. QEF aims to automate computational processes and data management, supporting scientists so that they can concentrate on the scientific problem. Nowadays, several Scientific Workflow Systems provide components and task parallelization strategies in a distributed environment. However, scientific experiments tend to generate large amounts of data, which may limit execution scalability with respect to data locality. For instance, there could be delays in data transfer during task execution or failures when consolidating results. In this work, I present a proposal for the integration of QEF with Hadoop. The main objective is to manage the execution of a workflow in a data-locality-aware manner. In this proposal, Hadoop is responsible for the scheduling of tasks in a distributed environment, while the workflow activities and data sources are managed by QEF. The proposed environment is evaluated using a scientific workflow from the astronomy field as a case study. The deployment of the application in a virtualized environment is then described in detail. Finally, experiments that evaluate the impact of the proposed environment on the perceived performance of the application are presented, and future work is discussed.
74

Information retrieval in large XES files using indexing techniques

Aponte Báez, Yosvanys 19 January 2016 (has links)
No description available.
75

Assessing Apache Spark Streaming with Scientific Data

Dahal, Janak 06 August 2018 (has links)
Processing real-world data requires the ability to analyze data in real time. Data processing engines like Hadoop fall short when results are needed on the fly. Apache Spark's streaming library is an increasingly popular choice, as it can stream and analyze significant amounts of data. To showcase and assess Spark's capabilities, various metrics were designed and evaluated using data collected from the USGODAE data catalog. The latency of streaming in Apache Spark was measured and analyzed across varying numbers of nodes in the cluster. Scalability was monitored by adding and removing nodes in the middle of a streaming job. Fault tolerance was verified by stopping nodes in the middle of a job and making sure that the job was rescheduled and completed on the remaining nodes. A full-stack application was designed to automate data collection, data processing, and visualization of the results. The Google Maps API was used to visualize results by color-coding the world map with values from the various analytics.
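A minimal PySpark Streaming probe of the kind the abstract describes might look as follows. The socket source on localhost:9999 and the 5-second batch interval are assumptions for illustration; the actual study used USGODAE data and its own metrics.

```python
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-latency-probe")
ssc = StreamingContext(sc, 5)                       # 5-second micro-batches

# Hypothetical source: records arriving on a TCP socket, one per line
lines = ssc.socketTextStream("localhost", 9999)

def report(batch_time, rdd):
    # Force evaluation of the micro-batch and report how long processing took
    start = time.time()
    count = rdd.count()
    print("batch %s: %d records, processed in %.2fs" % (batch_time, count, time.time() - start))

lines.foreachRDD(report)
ssc.start()
ssc.awaitTermination()
```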
76

Scalable Scientific Computing Algorithms Using MapReduce

Xiang, Jingen January 2013 (has links)
Cloud computing systems, like MapReduce and Pregel, provide a scalable and fault tolerant environment for running computations at massive scale. However, these systems are designed primarily for data intensive computational tasks, while a large class of problems in scientific computing and business analytics are computationally intensive (i.e., they require a lot of CPU in addition to I/O). In this thesis, we investigate the use of cloud computing systems, in particular MapReduce, for computationally intensive problems, focusing on two classic problems that arise in scientific computing and also in analytics: maximum clique and matrix inversion. The key contribution that enables us to effectively use MapReduce to solve the maximum clique problem on dense graphs is a recursive partitioning method that partitions the graph into several subgraphs of similar size and running time complexity. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant. For the matrix inversion problem, we show that a recursive block LU decomposition allows us to effectively compute in parallel both the lower triangular (L) and upper triangular (U) matrices using MapReduce. After computing the L and U matrices, their inverses are computed using MapReduce. The inverse of the original matrix, which is the product of the inverses of the L and U matrices, is also obtained using MapReduce. Our technique is the first matrix inversion technique that uses MapReduce. We show experimentally that our technique has good scalability, and it is simpler and more fault tolerant than MPI implementations such as ScaLAPACK.
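The block structure that makes the LU-based inversion parallelizable can be illustrated with the standard identity for inverting a lower-triangular matrix by 2x2 blocking. The sketch below runs on a single machine with NumPy; in the thesis's setting each block inverse and block product would instead be carried out as MapReduce jobs.

```python
import numpy as np

def invert_lower(L):
    """Recursively invert a lower-triangular matrix by 2x2 blocking.
    Each block inverse can be computed independently, which is what makes
    the scheme amenable to MapReduce-style parallelism."""
    n = L.shape[0]
    if n == 1:
        return np.array([[1.0 / L[0, 0]]])
    k = n // 2
    L11, L21, L22 = L[:k, :k], L[k:, :k], L[k:, k:]
    inv11 = invert_lower(L11)
    inv22 = invert_lower(L22)
    # Block identity: [[L11, 0], [L21, L22]]^-1 = [[L11^-1, 0], [-L22^-1 L21 L11^-1, L22^-1]]
    bottom_left = -inv22 @ L21 @ inv11
    top = np.hstack([inv11, np.zeros((k, n - k))])
    bottom = np.hstack([bottom_left, inv22])
    return np.vstack([top, bottom])

# Quick check on a random lower-triangular matrix
L = np.tril(np.random.rand(6, 6)) + np.eye(6)
print(np.allclose(invert_lower(L) @ L, np.eye(6)))
```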
77

Programming Models and Runtimes for Heterogeneous Systems

Grossman, Max 16 September 2013 (has links)
With the plateauing of processor frequencies and increase in energy consumption in computing, application developers are seeking new sources of performance acceleration. Heterogeneous platforms with multiple processor architectures offer one possible avenue to address these challenges. However, modern heterogeneous programming models tend to be either so low-level as to severely hinder programmer productivity, or so high-level as to limit optimization opportunities. The novel systems presented in this thesis strike a better balance between abstraction and transparency, enabling programmers to be productive and produce high-performance applications on heterogeneous platforms. This thesis starts by summarizing the strengths, weaknesses, and features of existing heterogeneous programming models. It then introduces and evaluates four novel heterogeneous programming models and runtime systems: JCUDA, CnC-CUDA, DyGR, and HadoopCL. We'll conclude by positioning the key contributions of each piece in this thesis relative to the state-of-the-art, and outline possible directions for future work.
78

Enhancing Data Processing on Clouds with Hadoop/HBase

Zhang, Chen January 2011 (has links)
In the current information age, large amounts of data are being generated and accumulated rapidly in various industrial and scientific domains. This imposes important demands on data processing capabilities that can extract sensible and valuable information from the large amount of data in a timely manner. Hadoop, the open source implementation of Google's data processing framework (MapReduce, Google File System and BigTable), is becoming increasingly popular and being used to solve data processing problems in various application scenarios. However, being originally designed for handling very large data sets that can be divided easily into parts to be processed independently with limited inter-task communication, Hadoop lacks applicability to a wider range of use cases. As a result, many projects are under way to enhance Hadoop for different application needs, such as data warehouse applications, machine learning and data mining applications, etc. This thesis is one such research effort. The goal of the thesis research is to design novel tools and techniques to extend and enhance the large-scale data processing capability of Hadoop/HBase on clouds, and to evaluate their effectiveness in performance tests on prototype implementations. Two main research contributions are described. The first contribution is a light-weight computational workflow system called "CloudWF" for Hadoop. The second contribution is a client library called "HBaseSI" supporting transactional snapshot isolation (SI) in HBase, Hadoop's database component. CloudWF addresses the problem of automating the execution of scientific workflows composed of both MapReduce and legacy applications on clouds with Hadoop/HBase. CloudWF is the first computational workflow system built directly using Hadoop/HBase. It uses novel methods in handling workflow directed acyclic graph decomposition, storing and querying dependencies in HBase sparse tables, transparent file staging, and decentralized workflow execution management relying on the MapReduce framework for task scheduling and fault tolerance. HBaseSI addresses the problem of maintaining strong transactional data consistency in HBase tables. This is the first SI mechanism developed for HBase. HBaseSI uses novel methods in handling distributed transactional management autonomously by individual clients. These methods greatly simplify the design of HBaseSI and can be generalized to other column-oriented stores with architectures similar to that of HBase. As a result of the simplicity in design, HBaseSI adds low overhead to HBase performance and directly inherits many desirable properties of HBase. HBaseSI is non-intrusive to existing HBase installations and user data, and is designed to work with a large cloud in terms of data size and the number of nodes in the cloud.
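As background on what snapshot isolation guarantees, the following is a generic first-committer-wins commit check, not the HBaseSI client protocol; the transaction timestamps and write sets are simplified stand-ins for HBase rows.

```python
from collections import namedtuple

Txn = namedtuple("Txn", ["start_ts", "commit_ts", "write_set"])

def can_commit(txn, committed):
    """Generic snapshot-isolation check (first-committer-wins): a transaction may
    commit only if no transaction that committed after txn's snapshot was taken
    wrote to any of the same rows."""
    for other in committed:
        if other.commit_ts > txn.start_ts and other.write_set & txn.write_set:
            return False   # write-write conflict with a concurrently committed txn
    return True

# Example: t2 took its snapshot at ts=5; t1 committed at ts=7 and wrote the same row
t1 = Txn(start_ts=1, commit_ts=7, write_set={"row-42"})
t2 = Txn(start_ts=5, commit_ts=None, write_set={"row-42", "row-7"})
print(can_commit(t2, [t1]))   # False -> t2 must abort and retry
```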
79

Design and Implementation of a QoS file transfer protocol over Hadoop distributed file system

Chen, Chih-yi 26 July 2010 (has links)
Cloud computing is pervasive in our daily life. For instance, I usually use Google's GMail to receive e-mail, Google Documents to edit documents online, and Google Calendar to make my daily schedule. We can say that Google provides a "Platform as a Service (PaaS)", which delivers a computing platform as a service, and this platform sustains many cloud applications such as those mentioned above. However, Google's cloud computing platform is private: we cannot inspect its source code or build cloud applications directly on it. Fortunately, there is an open source project supported by Apache named "Hadoop", which includes a distributed file system very similar to the Google File System (GFS), called the "Hadoop Distributed File System (HDFS)". In order to observe the properties of HDFS, we design and implement an HDFS-based FTP server system called FTP-ON-HDFS, that is, an FTP server whose storage is HDFS. The FTP-ON-HDFS system comprises a web console for the FTP administrator, a FreeRADIUS server and a MySQL database for user authentication, a NameNode daemon and a SecondaryNameNode daemon each on its own machine, and five DataNode daemons on five different machines. Our FTP-ON-HDFS system can tune two QoS parameters: "data block size" and "data replication". We tuned "data block size" and "data replication" in our system and compared its performance with the Hadoop File System (FS) shell commands and a normal vsftpd server. In addition, FUSE can mount HDFS from a remote cluster onto a local machine and use the local machine's permissions to manage HDFS, so we also compared the performance of FUSE over HDFS (FUSE-DFS) with our FTP-ON-HDFS system.
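The two QoS parameters mentioned, block size and replication, can also be tuned from the standard Hadoop shell, as in the sketch below. The paths are placeholders, and the property name dfs.blocksize applies to newer Hadoop releases (older releases used dfs.block.size); this is independent of the FTP-ON-HDFS implementation itself.

```python
import subprocess

# Hypothetical local file and HDFS destination; both paths are placeholders.
local_file = "upload.bin"
hdfs_dir = "/ftp/incoming/"

# Write the file with a 128 MiB block size and a replication factor of 2.
# The -D generic options are honored by the Hadoop FsShell.
subprocess.run(
    ["hdfs", "dfs",
     "-D", "dfs.blocksize=134217728",
     "-D", "dfs.replication=2",
     "-put", local_file, hdfs_dir],
    check=True,
)

# The replication factor of an existing file can also be changed after the fact.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "2", hdfs_dir + "upload.bin"], check=True)
```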
80

The development of an intelligent, cloud-based remote monitoring management system

Cheng, Wen-Hao 25 October 2012 (has links)
In this thesis, a data collection application based on MapReduce programming is described. This application aims to collect temperature data streams continuously from a specified set of sensors. Instead of having one machine collect the temperature information of all the sensors, the sensors are divided into several subsets, each of which is handled as a Map task. In each Map task, the temperature data stream of the assigned sensors is collected continuously and stored in a predefined database. All the Map tasks can run simultaneously on several machines. This method can reduce the delay time and improve the efficiency of the data collection service, especially when a huge number of sensors is monitored remotely by a data center through the Internet. The collected sensor values can also be used to predict the next readings with methods such as linear regression and K-means, and these predictions can be used to raise system alarms. Experimental results show that the proposed method is effective for temperature data collection and for carbon reduction.
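The partitioning scheme described here, splitting the sensor set into subsets that are polled in parallel and feeding each sensor's history into a linear-regression predictor, can be sketched as follows. The read_temperature function and the use of a local process pool in place of Hadoop Map tasks are illustrative assumptions, not the thesis's implementation.

```python
import random
from multiprocessing import Pool

import numpy as np

def read_temperature(sensor_id):
    # Hypothetical stand-in for polling a real remote sensor over the network.
    return 20.0 + random.random()

def collect_subset(sensor_ids):
    # One "Map task": poll its assigned subset of sensors and return the readings.
    return [(sid, read_temperature(sid)) for sid in sensor_ids]

def predict_next(history):
    # Fit a linear trend to a sensor's recent readings and extrapolate one step ahead,
    # mirroring the linear-regression prediction mentioned in the abstract.
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, 1)
    return slope * len(history) + intercept

if __name__ == "__main__":
    sensors = list(range(100))
    subsets = [sensors[i::4] for i in range(4)]      # 4 parallel "Map tasks"
    with Pool(4) as pool:
        readings = pool.map(collect_subset, subsets)
    print(predict_next([21.0, 21.4, 21.9, 22.3]))    # extrapolated next reading
```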
