1.
A Large Collection Learning Optimizer Framework. Chakravarty, Saurabh, 30 June 2017.
Content is generated on the web at an increasing rate. The type of content varies from text on a traditional webpage to text on social media portals (e.g., social network sites and microblogs). One such example of social media is the microblogging site Twitter. Twitter is known for its high level of activity during live events, natural disasters, and events of global importance. Challenges with the data in the Twitter universe include the limit of 140 characters on the text length. Because of this limitation, the vocabulary in the Twitter universe includes short abbreviations of sentences, emojis, hashtags, and other non-standard usage. Consequently, traditional text classification techniques are not very effective on tweets. Fortunately, text processing techniques like cleaning, lemmatizing, and removal of stop words and special characters give us clean text that can be further processed to derive richer semantic and syntactic word relationships using state-of-the-art feature representation techniques like Word2Vec. Machine learning techniques that use word features capturing semantic and context relationships can improve classification accuracy.
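As an illustration of this kind of pipeline, a minimal PySpark sketch might look as follows; the toy tweets, the cleaning regular expression, and the Word2Vec and Logistic Regression settings are assumptions rather than the thesis's actual configuration, and lemmatization is omitted.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, Word2Vec
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("tweet-classifier-sketch").getOrCreate()

    # Hypothetical labelled tweets (1.0 = relevant to the event, 0.0 = not relevant).
    tweets = spark.createDataFrame([
        ("Wildfire spreading fast near the highway #CAfire", 1.0),
        ("Great coffee and sunshine this morning", 0.0),
    ], ["text", "label"])

    # Clean: lower-case and strip URLs and special characters before tokenizing.
    cleaned = tweets.withColumn(
        "clean", F.lower(F.regexp_replace("text", r"http\S+|[^a-zA-Z#\s]", " ")))

    pipeline = Pipeline(stages=[
        RegexTokenizer(inputCol="clean", outputCol="tokens", pattern="\\s+"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # Spark ML's Word2Vec averages the word vectors of each tweet into one feature vector.
        Word2Vec(inputCol="filtered", outputCol="features", vectorSize=100, minCount=1),
        LogisticRegression(featuresCol="features", labelCol="label"),
    ])
    model = pipeline.fit(cleaned)
    model.transform(cleaned).select("text", "prediction").show(truncate=False)

Training Word2Vec inside the pipeline is the simplest arrangement; the embeddings could also be fitted separately on a larger unlabelled tweet collection and reused across events.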
Improving text classification results on Twitter data would pave the way to categorize tweets relative to human defined real world events. This would allow diverse stakeholder communities to interactively collect, organize, browse, visualize, analyze, summarize, and explore content and sources related to crises, disasters, human rights, inequality, population growth, resiliency, shootings, sustainability, violence, etc. Having the events classified into different categories would help us study causality and correlations among real world events.
To check the efficacy of our classifier, we would compare our experimental results with an Association Rules (AR) classifier. This classifier composes its rules around the most discriminating words in the training data. The hierarchy of rules, along with an ability to tune to a support threshold, makes it an effective classifier for scenarios where short text is involved.
Traditionally, developing classification systems for these purposes requires a great degree of human intervention. Constantly monitoring new events, and curating training and validation sets, is tedious and time intensive. Significant human capital is required for such annotation endeavors. Also, involved efforts are required to tune the classifier for best performance. Developing and tuning classifiers manually using human intervention would not be a viable option if we are to monitor events and trends in real-time. We want to build a framework that would require very little human intervention to build and choose the best among the available performing classification techniques in our system.
Another challenge with classification systems is related to their performance with unseen data. For the classification of tweets, we are continually faced with a situation where a given event contains a certain keyword that is closely related to it. If a classifier, built for a particular event, due to overfitting to what is a biased sample with limited generality, is faced with new tweets with different keywords, accuracy may be reduced. We propose building a system that will use very little training data in the initial iteration and will be augmented with automatically labelled training data from a collection that stores all the incoming tweets. A system that is trained on incoming tweets that are labelled using sophisticated techniques based on rich word vector representation would perform better than a system that is trained on only the initial set of tweets.
We also propose to use sophisticated deep learning techniques like Convolutional Neural Networks (CNN) that can capture the combination of the words using an n-gram feature representation. Such sophisticated feature representation could account for the instances when the words occur together.
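To illustrate the n-gram idea, the sketch below uses parallel 1-D convolutions with kernel sizes 2, 3, and 4 as learned 2-, 3-, and 4-gram detectors over word embeddings. Keras is an assumption here (the text does not prescribe a library), and the vocabulary size, sequence length, and random toy data are placeholders.

    import numpy as np
    from tensorflow.keras import layers, models

    # Toy tokenized tweets: integer word ids padded to length 30; all sizes are illustrative.
    vocab_size, seq_len, embed_dim = 5000, 30, 100
    x = np.random.randint(1, vocab_size, size=(64, seq_len))
    y = np.random.randint(0, 2, size=(64,))

    inp = layers.Input(shape=(seq_len,), dtype="int32")
    emb = layers.Embedding(vocab_size, embed_dim)(inp)
    # Kernel sizes 2, 3 and 4 make each convolution a learned 2-, 3- and 4-gram detector;
    # global max-pooling keeps the strongest response of each filter over the tweet.
    branches = [layers.GlobalMaxPooling1D()(layers.Conv1D(64, k, activation="relu")(emb))
                for k in (2, 3, 4)]
    merged = layers.Concatenate()(branches)
    out = layers.Dense(1, activation="sigmoid")(layers.Dropout(0.5)(merged))

    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x, y, epochs=1, batch_size=16)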
We divide our case studies into two phases: preliminary and final case studies. The preliminary case studies focus on selecting the best feature representation and classification methodology out of the AR and the Word2Vec based Logistic Regression classification techniques. The final case studies focus on developing the augmented semi-supervised training methodology and the framework to develop a large collection learning optimizer to generate a highly performant classifier.
For our preliminary case studies, we are able to achieve an F1 score of 0.96 that is based on Word2Vec and Logistic Regression. The AR classifier achieved an F1 score of 0.90 on the same data.
For our final case studies, we are able to show improvements of F1 score from 0.58 to 0.94 in certain cases based on our augmented training methodology. Overall, we see improvement in using the augmented training methodology on all datasets. / Master of Science / Content is generated on social media at a very fast pace. Social media content in the form of tweets that is generated by the microblog site Twitter is quite popular for understanding the events and trends that are prevalent at a given point of time across various geographies. Categorizing these tweets into their real-world event categories would be useful for researchers, students, academics and the government. Categorizing tweets to their real-world categories is a challenging task. Our framework involves building a classification system that can learn how to categorize tweets for a given category if it is provided with a few samples of the relevant and non-relevant tweets. The system retrieves additional tweets from an auxiliary data source to further learn what is relevant and irrelevant based on how similar a tweet is to a positive example. Categorizing the tweets in an automated way would be useful in analyzing and studying the events and trends for past and future real-world events.
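A minimal sketch of the similarity-based augmentation idea, with NumPy stand-ins for the averaged word vectors: unlabelled tweets whose cosine similarity to the centroid of the few positive seed examples clears a threshold are pseudo-labelled as relevant and added to the training set. The toy vectors, the 0.9 threshold, and the single-centroid scheme are illustrative assumptions, not the framework's exact labelling rule.

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy stand-ins for averaged Word2Vec tweet vectors (100-dimensional).
    seed_positive = rng.normal(1.0, 0.1, size=(5, 100))      # few labelled relevant tweets
    pool = [("t%d" % i, rng.normal(1.0 if i < 3 else 0.0, 0.1, size=100)) for i in range(10)]

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    centroid = seed_positive.mean(axis=0)
    # Pseudo-label unlabelled tweets whose similarity to the positive centroid clears the
    # threshold; these rows augment the training set for the next iteration.
    augmented = [(tid, 1.0) for tid, vec in pool if cosine(vec, centroid) >= 0.9]
    print(augmented)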
2.
Uma análise comparativa de ambientes para Big Data: Apache Spark e HPAT / A comparative analysis for Big Data environments: Apache Spark and HPAT. Carvalho, Rafael Aquino de, 16 April 2018.
This work compares the performance and stability of two Big Data processing frameworks: Apache Spark and the High Performance Analytics Toolkit (HPAT). The comparison was performed using two applications: the sum of the elements of a one-dimensional vector and the K-means clustering algorithm. The experiments were run in distributed and shared-memory environments with different numbers and configurations of virtual machines. The results show that HPAT outperforms Apache Spark in our case studies. We also analyze both frameworks in the presence of failures.
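For reference, the two benchmark workloads are straightforward to express on the Spark side; the sketch below is a PySpark approximation in which the data sizes, the value of k, and the random points are placeholders, and the HPAT counterparts are not shown.

    import random
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("benchmark-sketch").getOrCreate()
    sc = spark.sparkContext

    # Workload 1: sum of the elements of a one-dimensional vector.
    vector = sc.parallelize(range(10_000_000))
    print("vector sum:", vector.sum())

    # Workload 2: K-means clustering over random 2-D points (toy stand-in for the benchmark data).
    points = [(random.random(), random.random()) for _ in range(10_000)]
    df = spark.createDataFrame(points, ["x", "y"])
    features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
    model = KMeans(k=4, seed=42).fit(features)
    print("cluster centers:", model.clusterCenters())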
3.
Modelem řízený vývoj Spark úloh / Model Driven Development of Spark Tasks. Bútora, Matúš, January 2019.
The aim of this master's thesis is to describe the Apache Spark framework, its structure, and the way Spark works. The next goal is to present the topics of Model-Driven Development and Model-Driven Architecture and to define their advantages, disadvantages, and usage. The main part of the text is devoted to designing a model for creating tasks in the Apache Spark framework. The text describes an application that allows the user to create a graph based on the proposed modeling language and to generate source code from the created model.
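To illustrate the model-to-code idea in miniature (this is a toy analogue, not the thesis's application or modeling language), the sketch below turns a small declarative description of a Spark task graph into PySpark source code; the node types, property names, and paths are invented.

    # A toy "model" of a Spark task graph: each node names an operation and its properties.
    model = [
        {"op": "read_csv", "path": "/data/input.csv", "out": "df"},
        {"op": "filter", "in": "df", "condition": "amount > 100", "out": "big"},
        {"op": "write_parquet", "in": "big", "path": "/data/output"},
    ]

    def generate_spark_code(nodes):
        lines = [
            "from pyspark.sql import SparkSession",
            "spark = SparkSession.builder.appName('generated-task').getOrCreate()",
        ]
        for n in nodes:
            if n["op"] == "read_csv":
                lines.append(f"{n['out']} = spark.read.option('header', 'true').csv('{n['path']}')")
            elif n["op"] == "filter":
                lines.append(f"{n['out']} = {n['in']}.filter(\"{n['condition']}\")")
            elif n["op"] == "write_parquet":
                lines.append(f"{n['in']}.write.mode('overwrite').parquet('{n['path']}')")
        return "\n".join(lines)

    print(generate_spark_code(model))   # emits a runnable PySpark script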
4.
Profile, Monitor, and Introspect Spark Jobs Using OSU INAM. Kedia, Mansa, January 2020.
No description available.
5.
Enumerating k-cliques in a large network using Apache Spark. Dheekonda, Raja Sekhar Rao, January 2017.
Indiana University-Purdue University Indianapolis (IUPUI) / Network analysis is an important research task which explains the relationships among various entities in a given domain. Most existing approaches to network analysis compute global properties of a network, such as transitivity, diameter, and all-pair shortest paths. They also study various non-random properties of a network, such as graph densification with shrinking diameter, small diameter, and scale-freeness. Such approaches enable us to understand real-life networks through their global properties. However, discovering the local topological building blocks within a network is an equally important task; examples include clique enumeration, graphlet counting, and motif counting. In this work, my focus is to find an efficient solution to the k-clique enumeration problem. A clique is a small, connected, and complete induced subgraph of a large network. Enumerating cliques using sequential technologies is very time-consuming. Another promising direction is a solution that runs on distributed clusters of machines using the Hadoop MapReduce framework. However, such a solution suffers from a general limitation of the framework, as Hadoop's MapReduce performs substantial amounts of reading and writing to disk; thus, the running times of Hadoop-based approaches suffer enormously. To avoid these problems, we propose an efficient, scalable, and distributed solution, kc-spark, for enumerating cliques in real-life networks using the Apache Spark in-memory cluster computing framework. Experimental results show that kc-spark can enumerate k-cliques from very large real-life networks, whereas a single commodity machine cannot produce the desired result in a feasible amount of time. We also compared kc-spark with Hadoop MapReduce solutions and found the algorithm to be 80-100 percent faster in terms of running time. Compared with triangle enumeration on Hadoop MapReduce, kc-spark is 8-10 times faster with the same cluster setup. Furthermore, the overall performance of kc-spark is improved by using Spark's built-in caching and broadcast mechanisms.
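The abstract does not spell out kc-spark's algorithm, but a common Spark formulation of k-clique enumeration looks like the sketch below: adjacency sets are broadcast to the executors, and i-cliques are extended to (i+1)-cliques only with vertices larger than the current maximum, so each clique is emitted exactly once. The toy edge list is a placeholder, and broadcasting the full adjacency structure assumes it fits in executor memory.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kclique-sketch").getOrCreate()
    sc = spark.sparkContext

    # Toy edge list; in practice this would be loaded from HDFS or another store.
    edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (1, 4)]
    k = 3

    # Undirected adjacency sets, broadcast to every executor.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    adj_bc = sc.broadcast(adj)

    # Seed with 2-cliques (edges), ordered so each clique is generated once.
    cliques = sc.parallelize([tuple(sorted(e)) for e in edges]).distinct().cache()

    for _ in range(k - 2):
        def extend(clique):
            neighbors = adj_bc.value
            # Candidates must exceed the largest member and be adjacent to all members.
            common = set.intersection(*(neighbors[v] for v in clique))
            return [clique + (w,) for w in common if w > clique[-1]]
        cliques = cliques.flatMap(extend).cache()

    print(cliques.collect())   # all k-cliques, e.g. the four triangles of the toy graph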
6.
Методология запуска Apache Spark в различных менеджерах контейнеров (Hadoop, Kubernetes) / Methodology for running Apache Spark in various container managers (Hadoop, Kubernetes): master's thesis. Kraubaev, A. S., January 2023.
The goal of this work is to develop a methodology for students, developers, and data engineers who want to broaden their knowledge of running Apache Spark in Hadoop and Kubernetes cluster environments. The object of research is the practice of applying this methodology for launching Apache Spark in Kubernetes and Hadoop clusters. The result of the work is hands-on guidance on containerization and the Kubernetes cluster environment that introduces the methodology for launching Apache Spark. The thesis was prepared in the Microsoft Word text editor and submitted in hard copy.
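As a small illustration of the kind of setup such a methodology covers, a PySpark session can be pointed at a Kubernetes cluster roughly as follows; the API-server address, namespace, image name, and executor count are placeholders for a real cluster's values, and running on YARN would instead use .master("yarn").

    from pyspark.sql import SparkSession

    # Hypothetical API-server address and image name; replace with your cluster's values.
    spark = (
        SparkSession.builder
        .appName("spark-on-k8s-sketch")
        .master("k8s://https://kubernetes.example.com:6443")   # Kubernetes API server
        .config("spark.kubernetes.container.image", "example/spark-py:3.5.0")
        .config("spark.kubernetes.namespace", "spark-jobs")
        .config("spark.executor.instances", "2")               # executors run as pods
        .getOrCreate()
    )

    # A trivial job to confirm that executor pods come up and do work.
    print(spark.range(1_000_000).selectExpr("sum(id)").collect())
    spark.stop()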
7.
Resource-efficient and fast Point-in-Time joins for Apache Spark : Optimization of time travel operations for the creation of machine learning training datasets / Resurseffektiva och snabba Point-in-Time joins i Apache Spark : Optimering av tidsresningsoperationer för skapande av träningsdata för maskininlärningsmodeller. Pettersson, Axel, January 2022.
A scenario in which modern machine learning models are trained is to make use of past data to be able to make predictions about the future. When working with multiple structured and time-labeled datasets, it has become common practice to use a join operator called the Point-in-Time join, or PIT join, to construct these datasets. The PIT join matches each entry in the left dataset with the entry in the right dataset whose recorded event time is closest to the left row's timestamp, out of all right entries whose event time occurred before or at the same time as the left event time. This feature has long been available only in time-series data processing tools but has recently received a new wave of attention due to the rising popularity of feature stores. To perform such an operation on large amounts of data, data engineers commonly turn to large-scale data processing tools such as Apache Spark. However, Spark has no native implementation of this join, and there has been no clear consensus in the community on how it should be achieved. This, along with previous implementations of the PIT join, raises the question: "How to perform fast and resource-efficient Point-in-Time joins in Apache Spark?" To answer this question, three different algorithms for performing a PIT join in Spark were developed and compared in terms of resource consumption and execution time. These algorithms were benchmarked on generated datasets with varying physical partitioning and sorting structures. Furthermore, the scalability of the algorithms was tested by running them on Apache Spark clusters of varying sizes. The benchmarks showed that the best results were achieved with the Early Stop Sort-Merge Join, a modified version of the regular Sort-Merge Join native to Spark. The best-performing datasets were those sorted by timestamp and primary key, ascending or descending, with a suitable number of physical partitions; the optimal number of partitions can vary with the data itself. Using the information gathered in this project, data engineers are provided with general guidelines for optimizing their data processing pipelines to perform more resource-efficient and faster PIT joins.
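For orientation, a PIT join can be expressed in vanilla Spark SQL as a conditional join followed by a pick-latest window, as in the sketch below. This is the naive baseline formulation, not the Early Stop Sort-Merge Join developed in the thesis, and the toy frames and column names are assumptions.

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("pit-join-sketch").getOrCreate()

    # Hypothetical toy frames: left = label/event rows, right = feature rows with event times.
    left = spark.createDataFrame([(1, 100), (1, 205), (2, 150)], ["id", "ts"])
    right = spark.createDataFrame([(1, 90, "a"), (1, 200, "b"), (2, 160, "c")],
                                  ["id", "ts", "feature"])

    # For every left row, keep the single right row with the greatest right.ts <= left.ts
    # for the same id (left rows with no eligible right row keep nulls).
    joined = (
        left.alias("l")
        .join(right.alias("r"),
              (F.col("l.id") == F.col("r.id")) & (F.col("r.ts") <= F.col("l.ts")), "left")
        .withColumn("rank", F.row_number().over(
            Window.partitionBy(F.col("l.id"), F.col("l.ts")).orderBy(F.col("r.ts").desc())))
        .filter(F.col("rank") == 1)
        .select("l.id", "l.ts", F.col("r.ts").alias("feature_ts"), "r.feature")
    )
    joined.show()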
8.
分散式計算系統及巨量資料處理架構設計-基於YARN, Storm及Spark / Distributed computing system and big data real-time processing structure - based on YARN, Storm and Spark. Tseng, Po Wei (曾柏崴), Unknown Date.
With the coming of the era of big data, real-time data computation faces many challenges. For example, in futures market forecasting, to accurately forecast the market state we need to build prediction models from large data (hundreds of GB to tens of TB) within tens of milliseconds.
In this research, we introduce a real-time big data computing architecture that addresses three practical requirements: high-speed processing, massive data processing, and storage. In addition, several machine learning algorithms, such as the Support Vector Machine (SVM) and Logistic Regression (LR), are implemented as a strategy-simulation subsystem on top of the parallel distributed computing system. The architecture involves three main cloud computing techniques:
1. Apache YARN is used for integrated resource management, so that cluster resources are applied more efficiently.
2. To satisfy the high-speed processing requirement, Apache Storm processes the large real-time data stream and computes thousands of market-state values within tens of milliseconds for subsequent model building.
3. With Apache Spark, we establish a distributed computing architecture for model building. By using Spark RDDs (Resilient Distributed Datasets), this architecture shortens SVM and LR model-building time to within hundreds of milliseconds.
To meet these requirements, we design an n-tier distributed architecture that integrates the foregoing techniques. In this architecture, we use Apache Kafka as the messaging middleware to support asynchronous message-based communication between subsystems.
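A minimal sketch of the RDD-based model-building step described in item 3 above, using Spark's RDD-level MLlib API; the feature vectors, labels, and iteration counts are toy placeholders rather than the actual market-state features.

    from pyspark.sql import SparkSession
    from pyspark.mllib.classification import SVMWithSGD, LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    spark = SparkSession.builder.appName("rdd-model-building-sketch").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical market-state feature vectors with binary labels (e.g., up/down).
    data = sc.parallelize([
        LabeledPoint(1.0, [0.2, 1.3, 0.7]),
        LabeledPoint(0.0, [0.9, 0.1, 0.4]),
        LabeledPoint(1.0, [0.3, 1.1, 0.8]),
        LabeledPoint(0.0, [1.0, 0.2, 0.3]),
    ]).cache()   # keep the training RDD in memory across iterations

    svm_model = SVMWithSGD.train(data, iterations=50)
    lr_model = LogisticRegressionWithLBFGS.train(data, iterations=50)

    print(svm_model.predict([0.25, 1.2, 0.75]), lr_model.predict([0.25, 1.2, 0.75]))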
9.
A Real-time Log Correlation System for Security Information and Event Management. Dubuc, Clémence, January 2021.
The correlation of several events over a period of time is a necessity for a threat detection platform. In the case of multistep attacks (attacks characterized by a sequence of executed commands), it allows the different steps to be detected one by one and correlated to raise an alert. It also allows abnormal behaviors on the IT system to be detected, for example, multiple suspicious actions performed by the same account. The correlation of security events increases the security of the system and reduces the number of false positives. The correlation is driven by pre-existing correlation rules. The goal of this thesis is to evaluate the feasibility of using a correlation engine based on Apache Spark. The current correlation system needs to be replaced because it is not scalable, it cannot handle all the incoming data, and it cannot perform some types of correlation, such as aggregating events by attribute or counting cardinality. The novelty is the improvement of the performance and the correlation capacity of the system. Two systems for correlating events are proposed in this project. The first is based on Apache Spark Structured Streaming and analyzes the flow of security logs in real time. As its results are not satisfactory, a second system is implemented. It uses a more traditional approach by storing the logs in an Elasticsearch cluster and running correlation queries on it. In the end, both systems are able to correlate the logs of the platform. Nevertheless, the system based on Apache Spark uses too many resources per correlation rule, and it is too expensive to launch hundreds of correlation queries at the same time. For those reasons, the system based on Elasticsearch is preferred and is implemented in the workflow.
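To make the Structured Streaming approach concrete, a single correlation rule of the "many suspicious actions by the same account" kind could be sketched as below; the input path, schema, action names, window lengths, and threshold are illustrative assumptions rather than the thesis's actual rules.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("correlation-rule-sketch").getOrCreate()

    # Hypothetical log source: JSON security events with an account field and an event timestamp.
    events = (
        spark.readStream.format("json")
        .schema("account STRING, action STRING, ts TIMESTAMP")
        .load("/data/security-logs/")            # placeholder path
    )

    # Correlation rule: flag any account with more than 10 suspicious actions in a
    # 5-minute window (aggregation by attribute plus a count threshold).
    alerts = (
        events.filter(F.col("action").isin("failed_login", "privilege_escalation"))
        .withWatermark("ts", "10 minutes")
        .groupBy(F.window("ts", "5 minutes"), F.col("account"))
        .count()
        .filter(F.col("count") > 10)
    )

    query = alerts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()   # blocks; in production the alert sink would replace the console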