21. Výpočetní úlohy pro řešení paralelního zpracování dat / Computational tasks for solving parallel data processing. Rexa, Denis, January 2019.
The goal of this diploma thesis was to create four laboratory exercises for the subject "Parallel Data Processing", in which students try out the options and capabilities of Apache Spark as a parallel computing platform. The work also covers basic setup and use of Apache Kafka and the NoSQL database Apache Cassandra. The other two lab assignments focus on the Travelling Salesman Problem. The first of these was designed to demonstrate the difficulty of the task, confronting the student with an exponential increase in complexity. The second consists of an optimization algorithm that solves the problem on a cluster; this algorithm is subjected to performance measurements on clusters. The conclusion of the thesis contains recommendations for optimization as well as a comparison of runs with different numbers of computing devices.
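
The exponential blow-up the first lab demonstrates follows directly from brute-force enumeration of tours; a minimal illustrative sketch in Scala (not taken from the thesis):

    // Brute-force TSP: enumerates all (n-1)! tours starting from city 0.
    // Adding one city multiplies the number of tours, which is the blow-up
    // the lab is designed to make students experience.
    def tourLength(tour: IndexedSeq[Int], dist: Array[Array[Double]]): Double =
      (tour :+ tour.head).sliding(2).map { case Seq(a, b) => dist(a)(b) }.sum

    def bruteForceTsp(dist: Array[Array[Double]]): (IndexedSeq[Int], Double) =
      (1 until dist.length).permutations          // (n-1)! candidate tours
        .map { p => val t = 0 +: p; (t, tourLength(t, dist)) }
        .minBy(_._2)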

22. Analýza datových toků pro knihovny se složitými vzory interakcí / Data Lineage Analysis of Frameworks with Complex Interaction Patterns. Hýbl, Oskar, January 2020.
Manta Flow is a tool for analyzing data flow in enterprise environments. It features a Java scanner, a module that uses static analysis to determine the flows through Java applications. To analyze an application that uses some framework, the scanner requires a dedicated plugin. Although the Java scanner provides plugins for several frameworks, to be usable for real applications it is essential that the scanner support as many frameworks as possible, which requires the implementation of new plugins. Applications using Apache Spark, a framework for cluster computing, are increasingly popular. We therefore designed and implemented a Java scanner plugin that allows the scanner to analyze Spark applications. As Spark focuses on data processing, this presented several challenges not encountered in other frameworks. In particular, it was necessary to resolve the data schema in various scenarios and track schema changes throughout any operations invoked on the data. Of the multiple APIs Spark provides for data processing, we focused on the Spark SQL module, notably on Dataset, omitting the legacy RDD. We also implemented support for data access, covering JDBC and chosen file formats. The implementation has been thoroughly tested and is proven to work correctly as a part of Manta Flow, which features the plugin in...
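
To illustrate the kind of schema tracking the plugin must perform (an illustrative sketch using the standard Spark SQL API, not code from Manta Flow): each Dataset operation rewrites the schema, and a lineage analysis has to propagate those rewrites:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("schema-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders  = Seq((1, "EU", 10.0), (2, "US", 20.0)).toDF("id", "region", "amount")
    val withVat = orders.withColumn("amount_vat", col("amount") * 1.2) // schema gains a column
    val report  = withVat.select("region", "amount_vat")               // schema loses "id", "amount"
    report.printSchema() // root |-- region: string |-- amount_vat: double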

23. Implementierung von Software-Frameworks am Beispiel von Apache Spark in das DBpedia Extraction Framework / Implementation of software frameworks into the DBpedia Extraction Framework, using Apache Spark as an example. Bielinski, Robert, 28 August 2018.
The DBpedia project extracts RDF datasets from Wikipedia's semi-structured data twice a year. DBpedia is now to be moved to a release model that supports a release cycle with up to two complete DBpedia datasets per month. This is not possible at the current speed of the extraction process. An improvement is to be achieved through parallelization with Apache Spark. The focus of this work is the efficient local use of Apache Spark for the parallel processing of large, semi-structured datasets. An implementation of the Apache Spark-based extraction is presented that achieves a sufficient reduction in runtime. To this end, basic methods of component-based software development were applied, the benefit of Apache Spark for the Extraction Framework was analyzed, and an overview of the necessary changes to the Extraction Framework is presented.
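
A minimal Scala sketch of the parallelization pattern described, with hypothetical file paths and a placeholder extractor standing in for the Extraction Framework's per-page logic:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dbpedia-extraction").master("local[*]").getOrCreate()

    // Placeholder: the real Extraction Framework parses wiki markup and emits RDF triples.
    def extractTriples(pageSource: String): Iterator[String] = Iterator.empty

    val pages   = spark.sparkContext.textFile("wikipedia-dump.lines") // assumes one page per line
    val triples = pages.mapPartitions(_.flatMap(extractTriples))      // extractors run per partition
    triples.saveAsTextFile("dbpedia-triples")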

24. An experimental study of memory management in Rust programming for big data processing. Okazaki, Shinsaku, 10 December 2020.
Planning optimized memory management is critical for Big Data analysis tools to achieve faster runtimes and efficient use of computational resources. Modern Big Data analysis tools use application languages that abstract away memory management so that developers do not have to pay close attention to memory management strategies.

Many existing cloud-based data processing systems such as Hadoop, Spark, or Flink run on the Java Virtual Machine (JVM) and take full advantage of its features, including automated memory management with Garbage Collection (GC), which may introduce significant overhead. Dataflow-based systems like Spark allow programmers to define complex objects in a host language like Java to manipulate and transfer tremendous amounts of data.

System languages like C++ or Rust seem to be a better choice for developing Big Data processing systems because they do not rely on the JVM. With a system language, a developer has full control over memory management. We found the Rust programming language to be a good candidate due to its ability to produce memory-safe and fearlessly concurrent code through its concepts of memory ownership and borrowing. Rust offers many possible strategies for optimizing memory management for Big Data processing, including the selection of different variable types, the use of Reference Counting, and multithreading with Atomic Reference Counting.

In this thesis, we conducted an experimental study to assess how much these memory management strategies differ in overall runtime performance. Our experiments focus on complex object manipulation and common Big Data processing patterns under various memory management strategies. Our experimental results indicate a significant difference among these strategies with respect to data processing performance.

25. Dimensionality Reduction in Healthcare Data Analysis on Cloud Platform. Ray, Sujan, January 2020.
No description available.

26. Distributed graph decomposition algorithms on Apache Spark. Mandal, Aritra, 20 April 2018.
Indiana University-Purdue University Indianapolis (IUPUI)

Structural analysis and mining of large and complex graphs to describe the characteristics of a vertex or an edge have widespread use in graph clustering, classification, and modeling. There are various methods for the structural analysis of graphs, including the discovery of frequent subgraphs or network motifs, counting triangles or graphlets, spectral analysis of networks using eigenvectors of the graph Laplacian, and finding highly connected subgraphs such as cliques and quasi-cliques. Unfortunately, the algorithms for solving most of these tasks are quite costly, which makes them unscalable to large real-life networks.

Two very popular decompositions, the k-core and the k-truss of a graph, give useful insight about the graph's vertices and edges, respectively. These decompositions have been applied to reasoning about protein functions in protein-protein networks, fraud detection, and missing-link prediction.

k-core decomposition, with its linear time complexity, is scalable to large real-life networks as long as the input graph fits in main memory. k-truss, on the other hand, is computationally more intensive because its definition relies on triangles, and no linear-time algorithm is available for it.

In this paper, we propose distributed algorithms on Apache Spark for the k-truss and k-core decomposition of a graph. We also compare the performance of our algorithms with state-of-the-art MapReduce and parallel algorithms using openly available real-world network data. Our proposed algorithms show substantial performance improvements.
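
As a rough illustration of the k-core side (a sketch of the standard peeling approach expressed on Spark RDDs, not the thesis's algorithm): repeatedly remove vertices of degree below k until none remain; the surviving subgraph is the k-core. Collecting the low-degree set to the driver is an assumption that it stays small.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Iterative peeling: drop vertices with degree < k until a fixed point is reached.
    def kCoreEdges(sc: SparkContext, edges0: RDD[(Long, Long)], k: Int): RDD[(Long, Long)] = {
      var edges = edges0.cache()
      var changed = true
      while (changed) {
        val degrees = edges.flatMap { case (u, v) => Seq((u, 1L), (v, 1L)) }.reduceByKey(_ + _)
        val lowDeg  = degrees.filter(_._2 < k).keys.collect().toSet // assumption: fits on the driver
        changed = lowDeg.nonEmpty
        if (changed) {
          val bc = sc.broadcast(lowDeg)
          edges = edges.filter { case (u, v) => !bc.value(u) && !bc.value(v) }.cache()
        }
      }
      edges // edges of the k-core subgraph
    }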

27. Optimization of the Photovoltaic Time-series Analysis Process Through Hybrid Distributed Computing. Hwang, Suk Hyun, 01 June 2020.
No description available.

28. Mining Formal Concepts in Large Binary Datasets using Apache Spark. Rayabarapu, Varun Raj, 29 September 2021.
No description available.

29. Development of an Apache Spark-Based Framework for Processing and Analyzing Neuroscience Big Data: Application in Epilepsy Using EEG Signal Data. Zhang, Jianzhe, 07 September 2020.
No description available.

30. Ablation Programming for Machine Learning. Sheikholeslami, Sina, January 2019.
As machine learning systems are used in an increasing number of applications, from analysis of satellite sensor data and healthcare analytics to smart virtual assistants and self-driving cars, they are also becoming more and more complex. This means that more time and computing resources are needed to train the models, and the number of design choices and hyperparameters increases as well. Due to this complexity, it is usually hard to explain the effect of each design choice or component of the machine learning system on its performance.

A simple approach to this problem is to perform an ablation study, a scientific examination of a machine learning system intended to give insight into the effects of its building blocks on its overall performance. However, ablation studies are currently not part of standard machine learning practice. One of the key reasons is that performing an ablation study currently requires major modifications to the code as well as extra compute and time resources.

On the other hand, experimentation with a machine learning system is an iterative process that consists of several trials. A popular approach is to run these trials in parallel on an Apache Spark cluster. Since Apache Spark follows the Bulk Synchronous Parallel model, parallel execution of trials proceeds in stages separated by barriers: before a new set of trials can start, all trials from the previous stage must finish. As a result, a lot of time and computing resources are usually wasted on unpromising trials that could have been stopped soon after their start.

We address these challenges by introducing MAGGY, an open-source framework for asynchronous and parallel hyperparameter optimization and ablation studies with Apache Spark and TensorFlow. This framework allows for better resource utilization as well as ablation studies and hyperparameter optimization in a unified and extendable API.
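
The barrier problem described above can be seen in the naive synchronous pattern (an illustrative Scala sketch, not MAGGY's API): launching one batch of trials as a Spark stage means no trial in the next batch starts until the slowest trial in the current batch finishes.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sync-trials").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for training a model and returning its validation metric.
    def runTrial(learningRate: Double): Double = 0.0

    val configs = Seq(0.1, 0.03, 0.01, 0.003)
    // collect() is a stage barrier: all four trials must finish, even unpromising ones,
    // before the driver can schedule the next batch. This is the waste that MAGGY's
    // asynchronous execution is designed to avoid.
    val metrics = sc.parallelize(configs).map(runTrial).collect()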