31

Matrix Multiplications on Apache Spark through GPUs / Matrismultiplikationer på Apache Spark med GPU

Safari, Arash January 2017 (has links)
In this report, we consider the distribution of large-scale matrix multiplications across a group of systems through Apache Spark, where each individual system uses graphics processing units (GPUs) to perform the matrix multiplication. The purpose of this thesis is to investigate whether the GPU's advantage in parallel work carries over to a distributed environment, and whether it scales noticeably better than a CPU implementation in that setting. The question was answered by benchmarking the different implementations under conditions where peak performance could be expected. Based on these benchmarks, it was concluded that GPUs do perform better as long as single-precision support is available in the distributed environment. When single-precision operations are not supported, GPUs perform much worse because of the low double-precision performance of most GPU devices.
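To make the distribution side of such an approach concrete, below is a minimal PySpark sketch (not the thesis's implementation) that multiplies two matrices block by block with Spark's BlockMatrix. The thesis additionally delegates each block product to a GPU, which is not shown here; the matrix dimension and block size are illustrative assumptions.

```python
# Minimal sketch: distribute a large matrix multiplication with Spark's
# BlockMatrix. Block products run on the CPU here; the thesis delegates them
# to GPUs. Sizes are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.appName("block-matmul-sketch").getOrCreate()
sc = spark.sparkContext

n = 1024      # assumed square matrix dimension
block = 256   # assumed block size per partition

# Build two dense n x n matrices as IndexedRowMatrix (row index + row values).
rows_a = sc.parallelize([IndexedRow(i, [float((i + j) % 7) for j in range(n)])
                         for i in range(n)])
rows_b = sc.parallelize([IndexedRow(i, [float((i * j) % 5) for j in range(n)])
                         for i in range(n)])

# Convert to BlockMatrix so the product is computed block-wise across executors.
A = IndexedRowMatrix(rows_a).toBlockMatrix(block, block)
B = IndexedRowMatrix(rows_b).toBlockMatrix(block, block)

C = A.multiply(B)                 # distributed block-wise multiplication
print(C.numRows(), C.numCols())   # 1024 1024
```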
32

Big Data Analytics Using Apache Flink for Cybercrime Forensics on X (formerly known as Twitter)

Kakkepalya Puttaswamy, Manjunath January 2023 (has links)
The exponential growth of social media usage has led to massive data sharing, posing challenges for traditional systems in managing and analyzing such vast amounts of data. This surge in data exchange has also resulted in an increase in cyber threats from individuals and criminal groups. Traditional forensic methods, such as evidence collection and data backup, become impractical when dealing with petabytes or terabytes of data. To address this, Big Data Analytics has emerged as a powerful solution for handling and analyzing structured and unstructured data. This thesis explores the use of Apache Flink, an open-source tool by the Apache Software Foundation, to enhance cybercrime forensic research. Unlike batch processing engines like Apache Spark, Apache Flink offers real-time processing capabilities, making it well-suited for analyzing dynamic and time-sensitive data streams. The study compares Apache Flink's performance against Apache Spark in handling various workloads on a single node. The literature review reveals a growing interest in utilizing Big Data Analytics, including platforms like Apache Flink, for cybercrime detection and investigation, especially on social media platforms like X (formerly known as Twitter). Sentiment analysis is a vital technique, but challenges arise due to the unique nature of social data. X (formerly known as Twitter), as a valuable source for cybercrime forensics, enables the study of fraudulent, extremist, and other criminal activities. This research explores various data mining techniques and emphasizes the need for real-time analytics to combat cybercrime effectively. The methodology involves data collection from X, preprocessing to remove noise, and sentiment analysis to identify cybercrime-related tweets. The comparative analysis between Apache Flink and Apache Spark demonstrates Flink's efficiency in handling larger datasets and real-time processing. Parallelism and scalability are evaluated to optimize performance. The results indicate that Apache Flink outperforms Apache Spark regarding response time, making it a valuable tool for cybercrime forensics. Despite progress, challenges such as data privacy, accuracy improvement, and cross-platform analysis remain. Future research should focus on refining algorithms, enhancing scalability, and addressing these challenges to further advance cybercrime forensics using Big Data Analytics and platforms like Apache Flink.
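As a rough illustration of the kind of pipeline described above (ingest tweets, strip noise, score relevance, keep cybercrime-related records), here is a minimal PyFlink DataStream sketch. The sample tweets and keyword lexicons are made-up assumptions; a real deployment would consume a Kafka topic fed by the X API rather than a static collection, and would use a proper sentiment model.

```python
# Minimal PyFlink sketch: preprocess tweets, apply a naive keyword-based
# scoring step, and keep only cybercrime-related records. Lexicons and sample
# data are illustrative assumptions.
import re
from pyflink.datastream import StreamExecutionEnvironment

CRIME_TERMS = {"phishing", "ransomware", "scam", "fraud"}   # assumed lexicon
NEGATIVE_TERMS = {"stolen", "attack", "leak", "breach"}     # assumed lexicon

def preprocess(text):
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+|#", " ", text)   # drop URLs, mentions, '#'
    return re.sub(r"[^a-z\s]", " ", text)

def score(text):
    tokens = set(preprocess(text).split())
    return {"text": text,
            "crime_hits": len(tokens & CRIME_TERMS),
            "negative_hits": len(tokens & NEGATIVE_TERMS)}

env = StreamExecutionEnvironment.get_execution_environment()
tweets = env.from_collection([
    "New #phishing scam targets bank customers https://t.co/x",
    "Lovely weather in Stockholm today",
    "Ransomware attack leaks customer data",
])

(tweets
    .map(score)
    .filter(lambda r: r["crime_hits"] > 0)
    .print())

env.execute("cybercrime-forensics-sketch")
```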
33

Predicting Closed Versus Open Questions Using Machine Learning for Improving Community Question Answering Websites

Makkena, Pradeep Kumar January 2017 (has links)
No description available.
34

Using Apache Spark's MLlib to Predict Closed Questions on Stack Overflow

Madeti, Preetham 07 June 2016 (has links)
No description available.
35

運用記憶體內運算於智慧型健保院所異常查核之研究 / A Research into In-Memory Computing Techniques for Intelligent Check of Health-Insurance Fraud

湯家哲, Tang, Jia Jhe Unknown Date (has links)
The financial condition of Taiwan's National Health Insurance (NHI) has been poor in recent years; the 2009 accounts showed a deficit of NT$58.2 billion. According to National Health Insurance Administration (NHIA) data, contracted medical institutions have to date violated NHI regulations 13,722 times, and most of the major violations involve fraud. To find illegal medical institutions, the NHIA draws random samples by computer and then has investigators review the sampled claims manually. However, this sampling approach does not reliably capture the violating institutions, so the reviews are not effective. Benford's Law, also called the First-Digit Law, states that smaller leading digits occur more frequently than larger ones; it is applied in accounting, finance, auditing, and economics. Yang (2012) applied Benford's Law indicators to Taiwan's NHI data and combined them with machine learning algorithms for fraud detection. Zaharia et al. (2012) proposed Apache Spark, a fault-tolerant in-memory cluster computing model that, on the same computing nodes and resources, can run more than 20 times faster than Hadoop MapReduce. To improve the medical claims review process, this study applies Benford's Law to NHI data published by the National Health Research Institutes, computing Benford's Law indicators and practical indicators, and then builds anomaly-check models with support vector machines and logistic regression. Because the NHI data set is large, Apache Spark was used as the computing environment, with Hadoop MapReduce as a benchmark for comparing computation time. The results show that the Spark programs ran about twice as fast as their MapReduce counterparts. For the classification models, both the support vector machine and logistic regression reached over 80% sensitivity on inpatient data; on outpatient data the accuracy of both models was lower, but logistic regression still achieved 75% sensitivity and 73% overall accuracy. In summary, Apache Spark reduced the time needed to process the large volume of NHI data, and the intelligent anomaly-check model can identify medical institutions suspected of fraud or abuse of the NHI for the subsequent manual investigation stage, thereby improving the effectiveness of NHI claims review.
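The leading-digit indicator at the core of this approach can be sketched in PySpark as below: compare the observed first-digit distribution of claim amounts against the expected Benford frequencies log10(1 + 1/d). The column names and the tiny sample are assumptions; the thesis derives such indicators per institution and feeds them, with other features, into SVM and logistic regression models, which is not shown here.

```python
# Sketch: per-institution deviation from Benford's expected first-digit
# frequencies, computed with Spark DataFrames. Data and column names are
# illustrative assumptions.
import math
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("benford-sketch").getOrCreate()

claims = spark.createDataFrame(
    [("H001", 1250.0), ("H001", 310.0), ("H001", 980.0),
     ("H002", 45.0),   ("H002", 47.0),  ("H002", 49.0)],
    ["institution_id", "claim_amount"])

# First significant digit of each claim amount (strip zeros and the decimal point).
first_digit = F.substring(
    F.regexp_replace(F.abs("claim_amount").cast("string"), "[0.]", ""), 1, 1)

observed = (claims
    .withColumn("digit", first_digit.cast("int"))
    .groupBy("institution_id", "digit").count()
    .withColumn("obs_freq",
        F.col("count") / F.sum("count").over(Window.partitionBy("institution_id"))))

# Expected Benford frequency for digits 1..9: log10(1 + 1/d).
benford = spark.createDataFrame(
    [(d, math.log10(1 + 1.0 / d)) for d in range(1, 10)], ["digit", "benford_freq"])

# Institutions with large deviations become candidates for manual review.
deviation = (observed.join(benford, "digit")
    .groupBy("institution_id")
    .agg(F.sum(F.abs(F.col("obs_freq") - F.col("benford_freq"))).alias("benford_dev")))
deviation.show()
```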
36

Geo-distributed multi-layer stream aggregation

Cannalire, Pietro January 2018 (has links)
Standard processing architectures satisfy many applications by employing existing stream processing frameworks that can manage distributed data processing. In some cases, however, geographically distributed data sources require the processing itself to be spread over a large area through a geographically distributed architecture. The issue addressed in this work is the reduction of the data movement that continuously flows across the network in a geo-distributed architecture, both from streaming sources to the processing location and among processing entities within the same distributed cluster. Reducing data movement can be critical for lowering bandwidth costs, since links in the middle of the network can be expensive to use and their cost grows as the amount of data exchanged increases. In this work we propose a different way to deploy geographically distributed architectures, relying on Apache Spark Structured Streaming and Apache Kafka. The features an algorithm needs in order to run on such a geo-distributed architecture are described. The algorithms executed on this architecture apply windowing and data-synopsis techniques to produce summaries of the input data and to address the issues of the geographically distributed setting. The computation of the average and the Misra-Gries algorithm are then implemented to test the designed architecture. This thesis contributes a new model for building geographically distributed architectures. The experimental results show that, for the algorithms running on top of the geo-distributed architecture, computation time is reduced on average by 70% compared to the distributed setup, and the amount of data exchanged across the network is reduced on average by 99%.
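The Misra-Gries summary mentioned above is the classic heavy-hitters synopsis: it keeps at most k-1 counters, so each site can forward a small summary of its local window instead of the raw stream. A compact reference implementation follows; the sample window and the value of k are illustrative assumptions, and the merging of per-site summaries used in the multi-layer setup is only described in the comment, not implemented.

```python
# Reference implementation of the Misra-Gries heavy-hitters summary.
def misra_gries(stream, k):
    """Approximate counts of items appearing more than len(stream)/k times."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Summaries from different sites can be merged by summing counters and
# re-applying the decrement step, which is what makes the synopsis suitable
# for geo-distributed, multi-layer aggregation.
window = ["a", "b", "a", "c", "a", "a", "b", "d", "a"]
print(misra_gries(window, k=3))   # 'a' survives as the dominant item
```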
37

Distributed multi-label learning on Apache Spark

Gonzalez Lopez, Jorge 01 January 2019 (has links)
This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest-neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing the execution times in all the scenarios. The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a multi-label feature selection criterion is discussed, and two distributed feature selection methods for multi-label problems are proposed: one that selects the feature subset maximizing the Euclidean norm of individual information measures, and one that selects the subset of features maximizing their geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to big data than the state-of-the-art methods they are compared against.
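The exact pair-wise variant mentioned above can be sketched with plain Spark RDD operations: broadcast the (small) test set, compute distances to every training instance in parallel, and keep the k nearest neighbours per test point. The synthetic data, the value of k, and the per-label majority vote are assumptions for illustration; the thesis's tree-based and locality-sensitive-hashing variants avoid this full pair-wise scan.

```python
# Sketch of a distributed exact multi-label kNN via pair-wise distances.
# Data, k, and the label aggregation are illustrative assumptions.
import heapq
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ml-knn-sketch").getOrCreate()
sc = spark.sparkContext

k = 3
rng = np.random.default_rng(0)
train = [(rng.standard_normal(5), rng.integers(0, 2, size=4)) for _ in range(1000)]
test = [rng.standard_normal(5) for _ in range(10)]
test_bc = sc.broadcast(test)

def local_neighbours(part):
    # For each test point, the k closest training points in this partition.
    instances = list(part)
    for t_idx, t in enumerate(test_bc.value):
        dists = [(float(np.linalg.norm(x - t)), tuple(y)) for x, y in instances]
        for d in heapq.nsmallest(k, dists):
            yield t_idx, d

neighbours = (sc.parallelize(train, 8)
    .mapPartitions(local_neighbours)
    .groupByKey()
    .mapValues(lambda ds: heapq.nsmallest(k, ds)))   # merge partition-local top-k

# Predict each label as the per-label majority vote among the k neighbours.
predictions = neighbours.mapValues(
    lambda ds: (np.mean([lab for _, lab in ds], axis=0) >= 0.5).astype(int))
print(predictions.collect()[:2])
```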
38

SparkBLAST : utilização da ferramenta Apache Spark para a execução do BLAST em ambiente distribuído e escalável / SparkBLAST: using Apache Spark to run BLAST in a distributed, scalable environment

Castro, Marcelo Rodrigo de 13 February 2017 (has links)
Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) / Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ) / With the evolution of next-generation sequencing devices, the cost of obtaining genomic data has dropped significantly, and the amount of genomic data to be processed has grown exponentially, outpacing the rate at which hardware and software evolution can increase computing power year after year. This growth raises the need for more efficient and scalable techniques based on parallel and distributed processing, on platforms such as clusters and cloud computing. BLAST is a widely used tool for genomic sequence alignment with native support for multicore parallel processing, but its scalability is limited to a single machine. Cloud computing, in turn, has emerged as an important technology for the rapid and elastic provisioning of large amounts of resources, and frameworks such as Apache Hadoop and Apache Spark support the execution of distributed applications, providing mechanisms for embedding external programs in large distributed jobs that run on clusters and cloud platforms. In this work, Spark is used to support a highly scalable and efficient parallelization of BLAST (Basic Local Alignment Search Tool), executing on dozens to hundreds of processing cores on cloud platforms. Experiments on Google Cloud and Microsoft Azure showed performance (speedup) and scalability similar to or better than comparable Hadoop-based efforts; in particular, the prototype demonstrated better performance and scalability than CloudBLAST, a Hadoop-based parallelization of BLAST.
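The basic idea of driving an external alignment tool from Spark can be sketched with RDD.pipe(): partition the query sequences across the cluster and stream each partition through a local BLAST process. This is a sketch of the pattern, not SparkBLAST itself; the FASTA records, database path, and blastn options are assumptions, and blastn must be installed with the database formatted on every worker for it to run.

```python
# Sketch: partition FASTA queries across the cluster and pipe each partition
# through a local blastn process. Paths and options are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkblast-sketch").getOrCreate()
sc = spark.sparkContext

queries = [
    ">q1\nACGTACGTAGCTAGCTAGGCTA",
    ">q2\nTTGACCGTAGGCTAACGGTTAC",
]

# blastn reads query sequences from stdin when -query is omitted;
# database path and output format are assumed for illustration.
blast_cmd = "blastn -db /data/blast/nt -outfmt 6"

# Each element of the RDD is written to blastn's stdin; its tabular output
# lines come back as the elements of the result RDD.
hits = sc.parallelize(queries, numSlices=2).pipe(blast_cmd)
for line in hits.take(5):
    print(line)
```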
39

Junções por similaridade com expressões complexas em ambientes distribuídos / Set similarity joins with complex expressions on distributed platforms

Oliveira, Diego Junior do Carmo 31 August 2018 (has links)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES / A recurrent problem that degrades the quality of information in databases is the presence of duplicates, i.e., multiple representations of the same real-world entity. Although computationally expensive, similarity operations are fundamental for identifying duplicates. Furthermore, real-world data is typically composed of several attributes, each representing a distinct type of information. Complex similarity expressions are important in this context because they make it possible to weigh the importance of each attribute in the similarity evaluation. However, because of the large volumes of data involved in Big Data applications, it has become crucial to perform these operations in parallel and distributed processing environments. To address this problem, which is highly relevant to organizations, this work proposes a novel processing strategy for identifying duplicates in textual data using similarity joins with complex expressions in a distributed environment.
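To illustrate what a "complex similarity expression" over several textual attributes can look like, here is a small plain-Python sketch: per-attribute token-level Jaccard similarities combined with weights and a global threshold. The records, weights, and threshold are assumptions, and the brute-force pairing shown does not reflect the distributed join and pruning strategy that the thesis actually proposes.

```python
# Sketch: a weighted multi-attribute set-similarity expression applied to
# candidate record pairs. Records, weights, and threshold are assumptions.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def complex_similarity(r, s, weights):
    # Weighted combination of per-attribute similarities.
    return sum(w * jaccard(r[attr], s[attr]) for attr, w in weights.items())

records = [
    {"id": 1, "name": "joao da silva",  "address": "rua goias 100 goiania"},
    {"id": 2, "name": "joao silva",     "address": "rua goias 100 goiânia"},
    {"id": 3, "name": "maria oliveira", "address": "av brasil 2000 sao paulo"},
]
weights = {"name": 0.6, "address": 0.4}   # assumed attribute weights
threshold = 0.55                          # assumed global threshold

duplicates = [(r["id"], s["id"]) for r, s in combinations(records, 2)
              if complex_similarity(r, s, weights) >= threshold]
print(duplicates)   # the first two records pair up as likely duplicates
```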
40

Jämförelser av MySQL och Apache Spark : För aggregering av smartmätardata i Big Data format för en webbapplikation / Comparisons between MySQL and Apache Spark : For aggregation of smartmeter data in Big Data format for a web application

Danielsson, Robin January 2020 (has links)
Smart electricity meters are an area that generates data at Big Data scale. Such data volumes are difficult to handle with traditional database solutions such as MySQL. Apache Spark is a framework that has emerged to address these difficulties by implementing the MapReduce model over clusters of machines. The question examined in this work is whether Apache Spark has advantages over MySQL on a single machine for handling large amounts of data in JSON format aggregated for web applications. The results of this work show that Apache Spark has lower aggregation times than MySQL toward a web application at roughly 6.7 GB of JSON data or more, for the more complex aggregation queries on a single machine. The results also show that MySQL is better suited than Apache Spark for simpler aggregation queries across all data set sizes in the experiment.
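A minimal sketch of the Spark side of this comparison is given below: load smart-meter readings stored as JSON and run an aggregation of the kind a web application might request. The file path and field names are assumptions.

```python
# Sketch: aggregate smart-meter JSON data with Spark for a web application.
# Path and schema (meter_id, timestamp, kwh) are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("smartmeter-agg-sketch").getOrCreate()

readings = spark.read.json("/data/smartmeters/*.json")   # assumed location

# Hourly average and peak consumption per meter -- the kind of more complex
# aggregation where Spark overtook MySQL at larger data volumes in the report.
hourly = (readings
    .withColumn("hour", F.date_trunc("hour", F.to_timestamp("timestamp")))
    .groupBy("meter_id", "hour")
    .agg(F.avg("kwh").alias("avg_kwh"), F.max("kwh").alias("peak_kwh")))

hourly.orderBy("meter_id", "hour").show(10)
```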
