11

Maritime Transportation Optimization Using Evolutionary Algorithms in the Era of Big Data and Internet of Things

Cheraghchi, Fatemeh 19 July 2019 (has links)
With the maritime industry carrying nearly 90% of the volume of global trade, algorithms and solutions that provide quality of service in maritime transportation are of great importance to both academia and industry. This research investigates an optimization problem using evolutionary algorithms and big data analytics to address an important challenge in maritime disruption management, and illustrates how it can be integrated with information technologies and the Internet of Things. Accordingly, in this thesis we design, develop, and evaluate methods to improve decision support systems (DSSs) in maritime supply chain management. We pursue three research goals. First, the Vessel Schedule Recovery Problem (VSRP) is reformulated and a bi-objective optimization approach is proposed. We employ multi-objective evolutionary algorithms (MOEAs) to solve the optimization problem. The resulting Pareto front provides a valuable trade-off between the two objectives (minimizing delay and minimizing financial loss) for a stakeholder in the freight shipping company. We evaluate the problem in three domains, namely scalability analysis, vessel steaming policies, and voyage distance analysis, and statistically validate the significance of the performance differences. According to the experiments, problem complexity varies across scenarios, while NSGA-II performs better than the other MOEAs in all scenarios. In the second part, a new data-driven VSRP is proposed, which benefits from the available Automatic Identification System (AIS) data. In the new formulation, the trajectory between port calls is divided and encoded into adjacent geohashed regions. In each geohash, historical speed profiles are extracted from the AIS data. This results in a large-scale optimization problem called G-S-VSRP with three objectives (minimizing loss, minimizing delay, and maximizing compliance), where the compliance objective maximizes the compliance of the optimized speeds with the historical data. Assuming the historical speed profiles reflect trustworthy operational speeds based on other ships' experience, maximizing compliance with these profiles offers some degree of risk avoidance. Three MOEAs tackled the problem and provided the stakeholder with a Pareto front that reflects the trade-off among the three objectives. Geohash granularity and dimensionality-reduction techniques were evaluated and discussed for the model. G-S-VSRP is a large-scale optimization problem and suffers from the curse of dimensionality (problems become difficult to solve due to the exponential growth of the multi-dimensional solution space); however, due to a special characteristic of the problem instance, a large number of function evaluations in the MOEAs can still find a good set of solutions. Finally, when the compliance objective in G-S-VSRP is changed to minimization, the regular MOEAs perform poorly due to the curse of dimensionality. We focus on improving the performance of the large-scale G-S-VSRP through a novel distributed multi-objective cooperative coevolution algorithm (DMOCCA). The proposed DMOCCA improves the quality of the performance metrics compared to the regular MOEAs (NSGA-II, NSGA-III, and GDE3). Additionally, DMOCCA yields a speedup when running on a cluster.
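To make the bi-objective trade-off concrete, the following minimal Python sketch extracts the non-dominated (Pareto) set from a list of candidate recovery schedules; the delay and loss figures are invented for illustration and are not taken from the thesis.

```python
from typing import List, Tuple

def pareto_front(solutions: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the non-dominated solutions for two minimization objectives.

    A solution dominates another if it is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for cand in solutions:
        dominated = any(
            other[0] <= cand[0] and other[1] <= cand[1] and other != cand
            for other in solutions
        )
        if not dominated:
            front.append(cand)
    return front

# Hypothetical (delay in hours, financial loss in $k) pairs for recovered schedules.
candidates = [(12.0, 80.0), (10.0, 95.0), (15.0, 60.0), (11.0, 90.0), (16.0, 75.0)]
print(pareto_front(candidates))
```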
12

Insightful Performance Analysis of Many-Task Runtimes through Tool-Runtime Integration

Chaimov, Nicholas 06 September 2017 (has links)
Future supercomputers will require application developers to expose much more parallelism than current applications do. To assist application developers in structuring their applications so that this is possible, new programming models and libraries, the many-task runtimes, are emerging to allow the expression of orders of magnitude more parallelism than currently existing models. This dissertation describes the challenges that these emerging many-task runtimes will place on performance analysis, and proposes deep integration between runtimes and performance tools as a means of producing correct, insightful, and actionable performance results. I show how tool-runtime integration can be used to aid programmer understanding of performance characteristics and to provide online performance feedback to the runtime for Unified Parallel C (UPC), High Performance ParalleX (HPX), Apache Spark, the Open Community Runtime, and the OpenMP runtime.
13

Performance assessment of Apache Spark applications

AL Jorani, Salam January 2019 (has links)
This thesis addresses the challenges of large software and data-intensive systems. We discuss a Big Data software stack that consists of substantial Linux configuration, some Scala code, and a set of frameworks that work together to achieve smooth system performance. The thesis focuses on the Apache Spark framework and the challenge of measuring the lazy evaluation of Spark's transformation operations. Investigating these challenges is essential for performance engineers, as it increases their ability to study how the system behaves and to make decisions in early design iterations. We therefore carried out experiments and measurements to achieve this goal. After analyzing the results, we derived a formula that helps engineers predict the performance of the system in production.
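As a hedged illustration of the lazy evaluation the thesis sets out to measure, the PySpark sketch below shows that transformations return almost immediately because they only build the lineage graph, while the action is what triggers, and therefore should bound, the timed work; the dataset and operations are illustrative assumptions, not those of the thesis.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(10_000_000))

# Transformations only build the lineage graph; they return almost instantly.
t0 = time.time()
squares = data.map(lambda x: x * x).filter(lambda x: x % 3 == 0)
t1 = time.time()

# The action forces the whole pipeline to execute, so this is what must be timed.
count = squares.count()
t2 = time.time()

print(f"transformations: {t1 - t0:.4f}s, action: {t2 - t1:.4f}s, count: {count}")
spark.stop()
```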
14

GeoSparkSim: A Scalable Microscopic Road Network Traffic Simulator Based on Apache Spark

January 2019 (has links)
Researchers and practitioners have widely studied road network traffic data in areas such as urban planning, traffic prediction, and spatial-temporal databases. For instance, researchers use such data to evaluate the impact of road network changes. Unfortunately, collecting large-scale, high-quality urban traffic data requires tremendous effort, because participating vehicles must install Global Positioning System (GPS) receivers and administrators must continuously monitor these devices. Several urban traffic simulators try to generate such data with different features, but they suffer from two critical issues: (1) scalability: most offer only a single-machine solution, which is not adequate for producing large-scale data, and those that can generate traffic in parallel do not balance the load well among the machines in a cluster; (2) granularity: many simulators do not consider microscopic traffic situations, including traffic lights, lane changing, and car following. This work proposes GeoSparkSim, a scalable traffic simulator that extends Apache Spark to generate large-scale road network traffic datasets with microscopic traffic simulation. The proposed system integrates seamlessly with a Spark-based spatial data management system, GeoSpark, to deliver a holistic approach that allows data scientists to simulate, analyze, and visualize large-scale urban traffic data. To implement microscopic traffic models, GeoSparkSim employs a simulation-aware vehicle partitioning method to partition vehicles among different machines such that each machine has a balanced workload. The experimental analysis shows that GeoSparkSim can simulate the movements of 200 thousand cars over an extensive road network (250 thousand road junctions and 300 thousand road segments). / Masters Thesis, Computer Engineering, 2019
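As a generic illustration of the "car following" behaviour that microscopic simulators model, the Python sketch below implements the widely used Intelligent Driver Model (IDM) with assumed parameter values; it is not claimed to be the exact model GeoSparkSim uses.

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=33.3,     # desired speed (m/s), assumed
                     s0=2.0,      # minimum gap (m), assumed
                     T=1.5,       # desired time headway (s), assumed
                     a_max=1.0,   # maximum acceleration (m/s^2), assumed
                     b=1.5):      # comfortable deceleration (m/s^2), assumed
    """Intelligent Driver Model: acceleration of a follower given its leader."""
    dv = v - v_lead                                            # closing speed
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))  # desired dynamic gap
    return a_max * (1 - (v / v0) ** 4 - (s_star / max(gap, 0.1)) ** 2)

# One simulation step for a follower at 25 m/s, 30 m behind a leader doing 20 m/s.
dt = 0.5
v = 25.0
a = idm_acceleration(v, v_lead=20.0, gap=30.0)
v_next = max(0.0, v + a * dt)
print(f"acceleration={a:.2f} m/s^2, next speed={v_next:.2f} m/s")
```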
15

Distributed Local Outlier Factor with Locality-Sensitive Hashing

Zheng, Lining 08 November 2019 (has links)
Outlier detection remains an active research area due to its essential role in a wide range of applications, including intrusion detection, fraud detection in finance, and medical diagnosis. The Local Outlier Factor (LOF) has been one of the most influential outlier detection techniques over the past decades. LOF has distinctive advantages on skewed datasets with regions of various densities. However, the traditional centralized LOF faces new challenges in the era of big data and, due to its expensive computational overhead, no longer satisfies the rigid time constraints required by many modern applications. A few researchers have explored distributed solutions for LOF, but existing methods are limited by their grid-based data partitioning strategy, which falls short when applied to high-dimensional data. In this thesis, we study efficient distributed solutions for LOF. A baseline MapReduce solution for LOF implemented with Apache Spark, named MR-LOF, is introduced. We demonstrate its disadvantages in communication cost and execution time through complexity analysis and experimental evaluation. Then an approximate LOF method, named MR-LOF-LSH, is proposed; it relies on locality-sensitive hashing (LSH) to partition the data and enables fully distributed local computation. To further improve the approximate LOF, we introduce a process called cross-partition updating, in which the actual global k-nearest neighbors (k-NN) of the outlier candidates are found and the information about these neighbors is used to update the candidates' outlier scores. The experimental results show that MR-LOF achieves a speedup of up to 29 times over the centralized LOF. MR-LOF-LSH further reduces the execution time by a factor of up to 9.9 compared to MR-LOF. The results also highlight that MR-LOF-LSH scales well as the cluster size increases. Moreover, with a sufficient candidate size, MR-LOF-LSH detects, in most scenarios, over 90% of the top outliers with the highest LOF scores computed by the centralized LOF algorithm.
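For reference, the following self-contained Python sketch computes the centralized LOF scores that serve as the baseline being distributed; it uses brute-force k-NN on synthetic data and is not based on the thesis code.

```python
import numpy as np

def lof_scores(X, k=5):
    """Centralized Local Outlier Factor with brute-force k-NN.

    Scores close to 1 indicate inliers; noticeably larger values indicate outliers.
    """
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(d, np.inf)                                   # exclude self-distance
    knn_idx = np.argsort(d, axis=1)[:, :k]                        # k nearest neighbours
    k_dist = d[np.arange(n)[:, None], knn_idx][:, -1]             # distance to k-th neighbour
    # reachability distance of p w.r.t. neighbour o: max(k_dist(o), d(p, o))
    reach = np.maximum(k_dist[knn_idx], d[np.arange(n)[:, None], knn_idx])
    lrd = k / reach.sum(axis=1)                                   # local reachability density
    return lrd[knn_idx].mean(axis=1) / lrd                        # LOF = avg neighbour lrd / own lrd

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])   # one obvious outlier
print(lof_scores(X, k=5)[-1])                                     # clearly greater than 1
```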
16

Assessing Apache Spark Streaming with Scientific Data

Dahal, Janak 06 August 2018 (has links)
Processing real-world data requires the ability to analyze data in real time. Data processing engines like Hadoop fall short when results are needed on the fly. Apache Spark's streaming library is an increasingly popular choice, as it can stream and analyze a significant amount of data. To showcase and assess this ability of Spark, various metrics were designed and computed using data collected from the USGODAE data catalog. The latency of streaming in Apache Spark was measured and analyzed against different numbers of nodes in the cluster. Scalability was monitored by adding and removing nodes in the middle of a streaming job. Fault tolerance was verified by stopping nodes in the middle of a job and making sure that the job was rescheduled and completed on the remaining nodes. A full-stack application was designed to automate data collection, data processing, and visualization of the results. The Google Maps API was used to visualize results by color-coding the world map with values from the various analytics.
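A minimal PySpark Streaming sketch of the kind of per-batch latency measurement described above is shown below; the socket source, port, and word-count workload are placeholder assumptions rather than the USGODAE pipeline used in the thesis.

```python
import time
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-latency-demo")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# Hypothetical text source; the thesis streams USGODAE records instead.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

def report(batch_time, rdd):
    # Rough per-batch latency: wall-clock time spent materializing the batch result.
    start = time.time()
    n = rdd.count()
    print(f"batch {batch_time}: {n} records, processed in {time.time() - start:.2f}s")

counts.foreachRDD(report)

ssc.start()
ssc.awaitTermination()
```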
17

Correlação probabilística implementada em spark para big data em saúde / Probabilistic record linkage implemented in Spark for big data in health

Pita, Robespierre Dantas da Rocha 05 March 2015 (has links)
The application of probabilistic record-linkage techniques to health or socioeconomic records of a population has been common practice among epidemiologists as the basis for their non-experimental research. However, the growth in data volume typical of the Big Data scenario has created a shortage of computational tools capable of handling these huge repositories. This work describes a solution, implemented on the Spark cluster-processing framework, for the probabilistic linkage of records from large databases of the Brazilian public health system. The work is part of a project that aims to analyze the relationship between the Bolsa Família programme and the incidence of poverty-related diseases such as leprosy and tuberculosis. The results obtained show that this implementation provides quality competitive with other existing tools and approaches, as demonstrated by its superior execution-time metrics.
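As a rough sketch of probabilistic record linkage on Spark, the following PySpark snippet blocks records on a cheap key and scores candidate pairs with weighted field similarities; the toy records, field names, weights, and threshold are all assumptions for illustration, not the linkage rules used in the project.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("record-linkage-sketch").getOrCreate()

# Hypothetical toy records; the real inputs are Brazilian public-health and social databases.
a = spark.createDataFrame([("1", "MARIA SILVA", "1980-05-01", "BA")],
                          ["id_a", "name_a", "birth_a", "state_a"])
b = spark.createDataFrame([("9", "MARIA DA SILVA", "1980-05-01", "BA")],
                          ["id_b", "name_b", "birth_b", "state_b"])

# Blocking: compare only pairs sharing a cheap key (state + birth year) instead of a full cross join.
a = a.withColumn("block", F.concat_ws("_", "state_a", F.substring("birth_a", 1, 4)))
b = b.withColumn("block", F.concat_ws("_", "state_b", F.substring("birth_b", 1, 4)))

# Score candidate pairs with weighted field similarities; weights and threshold are arbitrary here.
pairs = a.join(b, on="block")
scored = pairs.withColumn(
    "name_sim",
    1 - F.levenshtein("name_a", "name_b") / F.greatest(F.length("name_a"), F.length("name_b"))
).withColumn(
    "score",
    0.7 * F.col("name_sim") + 0.3 * (F.col("birth_a") == F.col("birth_b")).cast("double")
)

scored.filter(F.col("score") > 0.8).select("id_a", "id_b", "score").show()
```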
18

Big Data analytics for the forest industry: A proof-of-concept built on cloud technologies

Sellén, David January 2016 (has links)
Large amounts of data in various forms are generated at a fast pace in today's society. This is commonly referred to as “Big Data”. Making use of Big Data has become increasingly important for both business and research. The forest industry generates large amounts of data during the different processes of forest harvesting. In Sweden, forest information is sent to SDC, the information hub for the Swedish forest industry. In 2014, SDC received reports on 75.5 million m3fub from harvester and forwarder machines. These machines use a global standard called StanForD 2010 for communication and to create reports about harvested stems. The arrival of scalable cloud technologies that combine Big Data with machine learning makes it interesting to develop an application to analyze the large amounts of data produced by the forest industry. In this study, a proof-of-concept has been implemented to analyze harvest production (HPR) reports from the StanForD 2010 standard. The system consists of a back-end and a front-end application and is built using cloud technologies such as Apache Spark and Hadoop. System tests have shown that the concept is able to successfully handle storage, processing, and machine learning on gigabytes of HPR files. It is capable of extracting information from raw HPR data into datasets and supports a machine-learning pipeline with pre-processing and K-Means clustering. The proof-of-concept has provided a code base for further development of a system that could be used to find valuable knowledge for the forest industry.
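A minimal sketch of a Spark ML pipeline with pre-processing and K-Means clustering, of the kind the proof-of-concept supports, might look as follows; the feature names and values are invented stand-ins for attributes extracted from HPR files.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("hpr-clustering-sketch").getOrCreate()

# Hypothetical per-stem features; the names and values are illustrative only.
stems = spark.createDataFrame(
    [(34.2, 420.0, 0.21), (28.7, 310.5, 0.18), (41.0, 515.3, 0.25), (30.1, 350.0, 0.19)],
    ["dbh_cm", "volume_dm3", "taper"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["dbh_cm", "volume_dm3", "taper"], outputCol="features_raw"),
    StandardScaler(inputCol="features_raw", outputCol="features"),
    KMeans(featuresCol="features", k=2, seed=42),
])

model = pipeline.fit(stems)
model.transform(stems).select("dbh_cm", "volume_dm3", "prediction").show()
```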
19

Extending the Growing Hierarchical Self Organizing Maps for a Large Mixed-Attribute Dataset Using Spark MapReduce

Malondkar, Ameya Mohan January 2015 (has links)
In this thesis work, we propose a Map-Reduce variant of the Growing Hierarchical Self Organizing Map (GHSOM), called MR-GHSOM, which is capable of handling mixed-attribute datasets of massive size. The Self Organizing Map (SOM) has proved to be a useful unsupervised data analysis algorithm. It projects high-dimensional data onto a lower-dimensional grid of neurons. However, the SOM has some limitations owing to its static structure and its inability to mirror the hierarchical relations in the data. The GHSOM overcomes these shortcomings of the SOM by providing a dynamic structure that adapts its shape according to the input data. It can grow dynamically in the size of the individual neuron layers to represent data at the desired granularity, as well as in depth to model the hierarchical relations in the data. However, training the GHSOM requires multiple passes over the input dataset, which makes it difficult to use for massive datasets. The proposed MR-GHSOM is implemented using the Apache Spark cluster computing engine and leverages the popular Map-Reduce programming model. This enables us to exploit the usefulness and dynamic capabilities of the GHSOM even for large datasets. Moreover, the conventional GHSOM algorithm can handle datasets with numeric attributes only, owing to its heavy reliance on Euclidean dissimilarity measures over the attribute vectors. The MR-GHSOM further extends the GHSOM to handle mixed-attribute (numeric and categorical) datasets by adopting the distance hierarchy approach to managing mixed-attribute data. The proposed MR-GHSOM is thus capable of handling massive datasets containing mixed attributes. To demonstrate the effectiveness of the MR-GHSOM in clustering mixed-attribute datasets, we present the results produced by the MR-GHSOM on some popular datasets. We further train our MR-GHSOM on a Census dataset containing mixed attributes and provide an analysis of the results.
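To illustrate how SOM training maps onto a map/reduce pattern, the numpy sketch below performs one batch epoch: the map step finds each sample's best-matching unit (BMU) and the reduce step aggregates neighbourhood-weighted sums per neuron. It shows only the standard batch SOM update, not the GHSOM growth logic or the distance-hierarchy handling of categorical attributes.

```python
import numpy as np

def som_batch_epoch(data, weights, grid, sigma=1.0):
    """One batch epoch of SOM training, written in map/reduce style."""
    # map phase: BMU index for every sample
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=-1)
    bmus = dists.argmin(axis=1)

    # neighbourhood kernel between each sample's BMU position and every neuron on the grid
    grid_d = np.linalg.norm(grid[bmus][:, None, :] - grid[None, :, :], axis=-1)
    h = np.exp(-(grid_d ** 2) / (2 * sigma ** 2))          # (n_samples, n_neurons)

    # reduce phase: per-neuron weighted sums, as a MapReduce job would aggregate them
    numerator = h.T @ data                                  # (n_neurons, dim)
    denominator = h.sum(axis=0)[:, None]
    return numerator / np.maximum(denominator, 1e-12)

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 3))
rows, cols = 4, 4
grid = np.array([[i, j] for i in range(rows) for j in range(cols)], dtype=float)
weights = rng.normal(size=(rows * cols, 3))
for _ in range(10):
    weights = som_batch_epoch(data, weights, grid, sigma=1.0)
```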
20

Zpracování síťové komunikace v prostředí Apache Spark / Network Traces Analysis Using Apache Spark

Béder, Michal January 2018 (has links)
The aim of this thesis is to show how to design and implement an application for network trace analysis using the Apache Spark distributed system. The implementation can be divided into three parts: loading data from distributed HDFS storage, analysis of the supported network protocols, and distributed data processing. The web-based notebook Apache Zeppelin is used as the data visualization tool. The resulting application is able to analyze individual packets as well as entire flows. It supports JSON and pcap as input data formats. The goal of the application is to enable Big Data processing. The greatest impact on its performance comes from the input data format and the allocation of the available cores.
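A small PySpark sketch of the flow-level analysis path (JSON input read from HDFS, then a distributed aggregation) is given below; the HDFS path and field names are assumptions, not the schema used by the application.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("flow-analysis-sketch").getOrCreate()

# Hypothetical flow records in line-delimited JSON; the field names are assumptions.
flows = spark.read.json("hdfs:///traces/flows.json")

# Aggregate traffic per source/destination pair and list the top talkers.
top_talkers = (
    flows.groupBy("src_ip", "dst_ip")
         .agg(F.sum("bytes").alias("total_bytes"), F.count("*").alias("packets"))
         .orderBy(F.desc("total_bytes"))
         .limit(10)
)
top_talkers.show(truncate=False)
```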
