Global ETD Search

41	A Shared-Memory Coupled Architecture to Leverage Big Data Frameworks in Prototyping and In-Situ Analytics for Data Intensive Scientific Workflows Lemon, Alexander Michael 01 July 2019 (has links) There is a pressing need for creative new data analysis methods whichcan sift through scientific simulation data and produce meaningfulresults. The types of analyses and the amount of data handled by currentmethods are still quite restricted, and new methods could providescientists with a large productivity boost. New methods could be simpleto develop in big data processing systems such as Apache Spark, which isdesigned to process many input files in parallel while treating themlogically as one large dataset. This distributed model, combined withthe large number of analysis libraries created for the platform, makesSpark ideal for processing simulation output.Unfortunately, the filesystem becomes a major bottleneck in any workflowthat uses Spark in such a fashion. Faster transports are notintrinsically supported by Spark, and its interface almost denies thepossibility of maintainable third-party extensions. By leveraging thesemantics of Scala and Spark's recent scheduler upgrades, we forceco-location of Spark executors with simulation processes and enable fastlocal inter-process communication through shared memory. This provides apath for bulk data transfer into the Java Virtual Machine, removing thecurrent Spark ingestion bottleneck.Besides showing that our system makes this transfer feasible, we alsodemonstrate a proof-of-concept system integrating traditional HPC codeswith bleeding-edge analytics libraries. This provides scientists withguidance on how to apply our libraries to gain a new and powerful toolfor developing new analysis techniques in large scientific simulationpipelines. Apache Spark Data-Intensive Science High-Performance Computing In-Situ Analytics Parameter Sweep State-Space Pruning Computer Sciences Physical Sciences and Mathematics
42	High Performance and Scalable Matching and Assembly of Biological Sequences Abu Doleh, Anas 21 December 2016 (has links) No description available. Computer Engineering Bioinformatics bioinformatics sequence similarity indexing graphical processing unit Apache Spark de Bruijn graph de novo assembly metagenomics
43	Auto-Tuning Apache Spark Parameters for Processing Large Datasets / Auto-Optimering av Apache Spark-parametrar för bearbetning av stora datamängder Zhou, Shidi January 2023 (has links) Apache Spark is a popular open-source distributed processing framework that enables efficient processing of large amounts of data. Apache Spark has a large number of configuration parameters that are strongly related to performance. Selecting an optimal configuration for Apache Spark application deployed in a cloud environment is a complex task. Making a poor choice may not only result in poor performance but also increases costs. Manually adjusting the Apache Spark configuration parameters can take a lot of time and may not lead to the best outcomes, particularly in a cloud environment where computing resources are allocated dynamically, and workloads can fluctuate significantly. The focus of this thesis project is the development of an auto-tuning approach for Apache Spark configuration parameters. Four machine learning models are formulated and evaluated to predict Apache Spark’s performance. Additionally, two models for Apache Spark configuration parameter search are created and evaluated to identify the most suitable parameters, resulting in the shortest execution time. The obtained results demonstrates that with the developed auto-tuning approach and adjusting Apache Spark configuration parameters, Apache Spark applications can achieve a shorter execution time than when using the default parameters. The developed auto-tuning approach gives an improved cluster utilization and shorter job execution time, with an average performance improvement of 49.98%, 53.84%, and 64.16% for the three different types of Apache Spark applications benchmarked. / Apache Spark är en populär öppen källkodslösning för distribuerad databehandling som möjliggör effektiv bearbetning av stora mängder data. Apache Spark har ett stort antal konfigurationsparametrar som starkt påverkar prestandan. Att välja en optimal konfiguration för en Apache Spark-applikation som distribueras i en molnmiljö är en komplex uppgift. Ett dåligt val kan inte bara leda till dålig prestanda utan också ökade kostnader. Manuell anpassning av Apache Spark-konfigurationsparametrar kan ta mycket tid och leda till suboptimala resultat, särskilt i en molnmiljö där beräkningsresurser tilldelas dynamiskt och arbetsbelastningen kan variera avsevärt. Fokus för detta examensprojekt är att utveckla en automatisk optimeringsmetod för konfigurationsparametrarna i Apache Spark. Fyra maskininlärningsmodeller formuleras och utvärderas för att förutsäga Apache Sparks prestanda. Dessutom skapas och utvärderas två modeller för att söka efter de mest lämpliga konfigurationsparametrarna för Apache Spark, vilket resulterar i kortast möjliga exekveringstid. De erhållna resultaten visar att den utvecklade automatiska optimeringsmetoden, med anpassning av Apache Sparks konfigurationsparameterar, bidrar till att Apache Spark-applikationer kan uppnå kortare exekveringstider än vid användning av standard-parametrar. Den utvecklade metoden för automatisk optimering bidrar till en förbättrad användning av klustret och kortare exekveringstider, med en genomsnittlig prestandaförbättring på 49,98%, 53,84% och 64,16% för de tre olika typerna av Apache Spark-applikationer som testades. Apache Spark Cloud Environment Spark Configuration Parameter Resource Utilization Ridge Regression Elastic Net Random Forest Deep Neural Network Bayesian Optimization Particle Swarm Optimization. Apache Spark Molnmiljö Apache Spark konfigurationsparameter Resursutnyttjande Ridge-regression Elastisk nät Slumpskog Djupt neuralt nätverk Bayesiansk optimering Partikelsvärmsoptimering. Computer and Information Sciences Data- och informationsvetenskap
44	以雲端平行運算建立期貨走勢預測模型-Logistic Regression之應用 / Prediction Model of Futures Trend by Cloud and Parallel Computing - Application of Logistic Regression 呂縩正, Lu, Tsai Cheng Unknown Date (has links) 在科技持續進步的時代，金融市場發展隨著社會的演進不斷地成長與活絡，金融商品也從原本單純的本國存放款、外幣投資衍生出票券、債券等利率投資工具；除此之外，隨著資本市場的擴張，股票、基金、期貨與選擇權等投資標的更是琳瑯滿目。而後產生了許多人使用資料探勘工具預測市場的買賣時機。如Baba N., Asakawa H. and Sato K.(1999)使用倒傳遞類神經網路來預測到股市未來的漲跌，而後又在2000年研究當中加入基因演算法來求得倒傳遞類神經網路的權重，得到最後的類神經網路模型。在做資料探勘的同時，我們得在希望預測目標(Target)上事先定義好一套固定規則，這會使得模型的彈性與可預測度降低，本研究希望能透過資料探勘工具增加預測目標規則的彈性，增加模型最後的預測準確度。本研究樣本區間選用2010年到2015年的台指期貨數據做為資料，並結合羅吉斯回歸與粒子群演算法建構更加有彈性的預測模型結果，最後發現在未來10分鐘，若漲幅超過0.1114%做為買進訊號的話，其建立出的模型可達到84%的預測準確度。羅吉斯回歸粒子群演算法最佳化預測台股期貨 Apache Spark
45	Performance Characterization and Optimization of In-Memory Data Analytics on a Scale-up Server Awan, Ahsan Javed January 2017 (has links) The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) for exhibiting superior scale-out performance on the commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers.Through empirical evaluation of representative benchmark workloads on a dual socket server, we have found that in-memory data analytics with Spark exhibit poor multi-core scalability beyond 12 cores due to thread level load imbalance and work-time inflation (the additional CPU time spent by threads in a multi-threaded computation beyond the CPU time required to perform the same work in a sequential computation). We have also found that workloads are bound by the latency of frequent data accesses to the memory. By enlarging input data size, application performance degrades significantly due to the substantial increase in wait time during I/O operations and garbage collection, despite 10% better instruction retirement rate (due to lower L1cache misses and higher core utilization).For data accesses, we have found that simultaneous multi-threading is effective in hiding the data latencies. We have also observed that (i) data locality on NUMA nodes can improve the performance by 10% on average,(ii) disabling next-line L1-D prefetchers can reduce the execution time by upto14%. For garbage collection impact, we match memory behavior with the garbage collector to improve the performance of applications between 1.6xto 3x and recommend using multiple small Spark executors that can provide up to 36% reduction in execution time over single large executor. Based on the characteristics of workloads, the thesis envisions near-memory and near storage hardware acceleration to improve the single-node performance of scale-out frameworks like Apache Spark. Using modeling techniques, it estimates the speed-up of 4x for Apache Spark on scale-up servers augmented with near-data accelerators. / <p>QC 20171121</p> Workload Characterization Big Data Analytics Multicore Performance Apache Spark Near Data Processing NUMA Hyperthreading Prefetchers Coherently attached accelerators Computer Systems Datorsystem
46	Webová aplikace pro grafické zadávání a spouštění Spark úloh / Web Application for Graphical Description and Execution of Spark Tasks Hmeľár, Jozef January 2018 (has links) This master's thesis deals with Big data processing in distributed system Apache Spark using tools, which allow remotely entry and execution of Spark tasks through web inter- face. Author describes the environment of Spark in the first part, in the next he focuses on the Apache Livy project, which offers REST API to run Spark tasks. Contemporary solutions that allow interactive data analysis are presented. Author further describes his own application design for interactive entry and launch of Spark tasks using graph repre- sentation of them. Author further describes the web part of the application as well as the server part of the application. In next section author presents the implementation of both parts and, last but not least, the demonstration of the result achieved on a typical task. The created application provides an intuitive interface for comfortable working with the Apache Spark environment, creating custom components, and also a number of other options that are standard in today's web applications.
47	Comparison of Popular Data Processing Systems Nasr, Kamil January 2021 (has links) Data processing is generally defined as the collection and transformation of data to extract meaningful information. Data processing involves a multitude of processes such as validation, sorting summarization, aggregation to name a few. Many analytics engines exit today for largescale data processing, namely Apache Spark, Apache Flink and Apache Beam. Each one of these engines have their own advantages and drawbacks. In this thesis report, we used all three of these engines to process data from the Carbon Monoxide Daily Summary Dataset to determine the emission levels per area and unit of time. Then, we compared the performance of these 3 engines using different metrics. The results showed that Apache Beam, while offered greater convenience when writing programs, was slower than Apache Flink and Apache Spark. Spark Runner in Beam was the fastest runner and Apache Spark was the fastest data processing framework overall. / Databehandling definieras generellt som insamling och omvandling av data för att extrahera meningsfull information. Databehandling involverar en mängd processer som validering, sorteringssammanfattning, aggregering för att nämna några. Många analysmotorer lämnar idag för storskalig databehandling, nämligen Apache Spark, Apache Flink och Apache Beam. Var och en av dessa motorer har sina egna fördelar och nackdelar. I den här avhandlingsrapporten använde vi alla dessa tre motorer för att bearbeta data från kolmonoxidens dagliga sammanfattningsdataset för att bestämma utsläppsnivåerna per område och tidsenhet. Sedan jämförde vi prestandan hos dessa 3 motorer med olika mått. Resultaten visade att Apache Beam, även om det erbjuds större bekvämlighet när man skriver program, var långsammare än Apache Flink och Apache Spark. Spark Runner in Beam var den snabbaste löparen och Apache Spark var den snabbaste databehandlingsramen totalt. Apache Spark Apache Flink Apache Beam Spark Runner Flink Runner Direct Runner Big Data Analytics Data Processing Systems Benchmarking Kaggle Computer and Information Sciences Data- och informationsvetenskap
48	Distributed Rule-Based Ontology Reasoning Mutharaju, Raghava 12 September 2016 (has links) No description available. Computer Science distributed reasoning ontology reasoning distributed OWL 2 EL reasoning Semantic Web scalable ontology reasoning OWL 2 EL reasoning MapReduce reasoning Apache Spark reasoning
49	Zpracování velkých dat z rozsáhlých IoT sítí / Big Data Processing from Large IoT Networks Benkő, Krisztián January 2019 (has links) The goal of this diploma thesis is to design and develop a system for collecting, processing and storing data from large IoT networks. The developed system introduces a complex solution able to process data from various IoT networks using Apache Hadoop ecosystem. The data are real-time processed and stored in a NoSQL database, but the data are also stored in the file system for a potential later processing. The system is optimized and tested using data from IQRF network. The data stored in the NoSQL database are visualized and the system periodically generates derived predictions. Users are connected to this system via an information system, which is able to automatically generate notifications when monitored values are out of range.
50	New authentication mechanism using certificates for big data analytic tools Velthuis, Paul January 2017 (has links) Companies analyse large amounts of sensitive data on clusters of machines, using a framework such as Apache Hadoop to handle inter-process communication, and big data analytic tools such as Apache Spark and Apache Flink to analyse the growing amounts of data. Big data analytic tools are mainly tested on performance and reliability. Security and authentication have not been enough considered and they lack behind. The goal of this research is to improve the authentication and security for data analytic tools.Currently, the aforementioned big data analytic tools are using Kerberos for authentication. Kerberos has difficulties in providing multi factor authentication. Attacks on Kerberos can abuse the authentication. To improve the authentication, an analysis of the authentication in Hadoop and the data analytic tools is performed. The research describes the characteristics to gain an overview of the security of Hadoop and the data analytic tools. One characteristic is that the usage of the transport layer security (TLS) for the security of data transportation. TLS usually establishes connections with certificates. Recently, certificates with a short time to live can be automatically handed out.This thesis develops new authentication mechanism using certificates for data analytic tools on clusters of machines, providing advantages over Kerberos. To evaluate the possibility to replace Kerberos, the mechanism is implemented in Spark. As a result, the new implementation provides several improvements. The certificates used for authentication are made valid with a short time to live and are thus less vulnerable to abuse. Further, the authentication mechanism solves new requirements coming from businesses, such as providing multi-factor authenticationand scalability.In this research a new authentication mechanism is developed, implemented and evaluated, giving better data protection by providing improved authentication. Cloud Access Management certificate on demand Apache Spark Apache Flink Kerberos transport security layer (TLS) Authentication Multi Factor Authentication Authentication for data analytic tools certificate based Spark authentication public key encryption distributed authentication short valid authentication Computer Sciences Datavetenskap (datalogi)

Search results