211 |
Parallel Kafka Producer Applications : Their performance and its limitations Sundbom, Arvid January 2023 (has links)
"This paper examines multi-threaded Kafka producer applications, and how the performance of such applications is affected by how the number of producer instances relates to the number of executing threads. Specifically, the performance of such applications when using a single producer instance, shared among all threads, and when each thread is allotted a separate, private instance, is compared. This comparison is carried out for a number of different producer configurations and varying levels of computational work per message produced.Overall, the data indicates that utilizing private producer instances results in highe rperformance, in terms of data throughput, than sharing a single instance among the executing threads. The magnitude of this difference is affected, to some extent, by the configuration profiles used to create the producer instances, as well as the computational workload of the application hosting the producers. Specifically, configuring producers for reliability seems to increase the difference, and so does increasing the rate at which messages are to be produced.As a result of this, Brod, a wrapper library [56], based on an implementation of a client library for Apache Kafka [25], has been developed. The purpose of the library is to provide functionality which simplifies the development of multi-threadedKafka producer applications."
|
212 |
Mining Formal Concepts in Large Binary Datasets using Apache Spark Rayabarapu, Varun Raj 29 September 2021 (has links)
No description available.
|
213 |
Development of an Apache Spark-Based Framework for Processing and Analyzing Neuroscience Big Data: Application in Epilepsy Using EEG Signal Data Zhang, Jianzhe 07 September 2020 (has links)
No description available.
|
214 |
Ablation Programming for Machine Learning Sheikholeslami, Sina January 2019 (has links)
As machine learning systems are being used in an increasing number of applications, from analysis of satellite sensory data and health-care analytics to smart virtual assistants and self-driving cars, they are also becoming more and more complex. This means that more time and computing resources are needed to train the models, and the number of design choices and hyperparameters increases as well. Due to this complexity, it is usually hard to explain the effect of each design choice or component of the machine learning system on its performance. A simple approach for addressing this problem is to perform an ablation study, a scientific examination of a machine learning system in order to gain insight into the effects of its building blocks on its overall performance. However, ablation studies are currently not part of standard machine learning practice. One of the key reasons for this is that, currently, performing an ablation study requires major modifications to the code as well as extra compute and time resources. On the other hand, experimentation with a machine learning system is an iterative process that consists of several trials. A popular approach is to run these trials in parallel on an Apache Spark cluster. Since Apache Spark follows the Bulk Synchronous Parallel model, parallel execution of trials consists of several stages with barriers between them. This means that in order to execute a new set of trials, all trials from the previous stage must be finished. As a result, a lot of time and computing resources are usually wasted on unpromising trials that could have been stopped soon after their start. We have attempted to address these challenges by introducing Maggy, an open-source framework for asynchronous and parallel hyperparameter optimization and ablation studies with Apache Spark and TensorFlow. This framework allows for better resource utilization as well as ablation studies and hyperparameter optimization in a unified and extendable API.
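The stage-barrier behaviour described above can be sketched in a few lines of PySpark. This is an illustrative example only, with a hypothetical train_and_evaluate placeholder and hyperparameter grid; it shows the synchronous pattern that Maggy's asynchronous trials are designed to replace, not Maggy's own API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("grid-trials").getOrCreate()
sc = spark.sparkContext

# Hypothetical hyperparameter grid: each dict is one trial.
trials = [{"lr": lr, "dropout": d}
          for lr in (0.1, 0.01, 0.001)
          for d in (0.3, 0.5)]

def train_and_evaluate(cfg):
    # Placeholder for a real training run returning a metric.
    # In the synchronous model, even clearly unpromising
    # configurations run to completion before the barrier lifts.
    return {"config": cfg, "accuracy": 0.0}

# One Spark stage: ALL trials must finish before collect() returns.
# This is the Bulk Synchronous Parallel barrier the abstract refers to.
results = sc.parallelize(trials, len(trials)).map(train_and_evaluate).collect()
best = max(results, key=lambda r: r["accuracy"])
```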
|
215 |
Matrix Multiplications on Apache Spark through GPUs Safari, Arash January 2017 (has links)
In this report, we consider the distribution of large-scale matrix multiplications across a group of systems through Apache Spark, where each individual system utilizes Graphics Processing Units (GPUs) to perform the matrix multiplication. The purpose of this thesis is to research whether the GPU's advantage in performing parallel work can be applied to a distributed environment, and whether it scales noticeably better than a CPU implementation in such an environment. This question was resolved by benchmarking the different implementations under conditions where peak performance could be expected. Based on these benchmarks, it was concluded that GPUs do indeed perform better as long as single-precision support is available in the distributed environment. When single-precision operations are not supported, GPUs perform much worse due to the low double-precision performance of most GPU devices.
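For reference, distributed matrix multiplication of the kind benchmarked here can be expressed with Spark's built-in BlockMatrix API. The following is a CPU-side sketch with small illustrative dimensions; the thesis adds GPU acceleration on each worker, which is not shown here.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

spark = SparkSession.builder.appName("blockmatmul").getOrCreate()
sc = spark.sparkContext

n = 512  # illustrative size; real benchmarks use much larger matrices

def random_block_matrix(rows, cols, block=128):
    data = np.random.rand(rows, cols)
    indexed = sc.parallelize([IndexedRow(i, data[i]) for i in range(rows)])
    return IndexedRowMatrix(indexed).toBlockMatrix(block, block)

A = random_block_matrix(n, n)
B = random_block_matrix(n, n)

# Each block product is computed on a worker. Spark's vectors store
# double precision; a GPU backend would typically down-cast blocks to
# float32 for speed, which is the precision trade-off the abstract
# discusses.
C = A.multiply(B)
print(C.numRows(), C.numCols())
```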
|
216 |
The -go Morpheme and Reference Tracking in Jicarilla Apache Ferrin, Lee Shanideen 14 August 2023 (PDF)
Jicarilla Apache is a Southern Athabaskan language with a complex verbal structure, including a prefix template with positions for more than ten affixes. Little has been done to document or describe the language grammatically or typologically, but one of the morphemes that has been described in the literature is the suffix -go. The morpheme can be found in elicited speech as well as in narrations. This morpheme is one of the few verbal affixes that can appear after the verb stem, and it plays a role in many subordinate clause constructions. It has been described as a temporal marker, a feature of certain auxiliary verb constructions, a marker of habitual aspect, and a required part of causative constructions, among others. Such a wide variety of uses can make it difficult for language learners to know when this morpheme should be included. But there is one function that would account for all the previous descriptions and provide a simpler paradigm for understanding what triggers the presence of -go: namely, that of reference tracking. No reference-tracking function of -go has previously been described, yet many of the functions of -go reported in the literature can also be explained as the result of a system of reference tracking. This thesis argues that Jicarilla features a reference tracking system that combines foregrounding functions with the features of switch reference, according to the definition of foregrounding found in Simpson (2004) and the definitions of switch reference found in van Gijn (2016a) and Stirling (1993). This is demonstrated by reviewing all the examples of -go in the available literature, including Goddard (1911), Jung (2002), and Phone, Olson, Martinez, & Axelrod (2007).
|
217 |
Comparative Analysis of Load Balancing in Cloud Platforms for an Online Bookstore Web Application using Apache Benchmark Pothuganti, Srilekha, Samanth, Malepiti January 2023 (has links)
Background: Cloud computing has transformed the landscape of application deployment, offering on-demand access to compute resources, databases, and services via the internet. This thesis explores the development of an online bookstore web application, harnessing cloud infrastructure across AWS, Azure, and GCP. The front end utilises HTML, CSS, and JavaScript to create responsive web pages with an intuitive user interface. The back end is constructed using Node.js and Express for high-performance server-side logic and routing, while MongoDB, a distributed NoSQL database, stores the data. This cloud-native architecture facilitates easy scaling and ensures high availability.

Objectives: The main objectives of this thesis are to develop an intuitive online bookstore enabling users to add, exchange, and purchase books; deploy it across AWS, Azure, and GCP for scalability; implement load balancers for enhanced performance; and conduct load testing and benchmarking to compare the efficiency of these load balancers. By comparing load-balancer metrics across the platforms, the study aims to determine the best-performing cloud platform and load-balancing strategy to ensure the best user experience for the online bookstore.

Methods: The website is deployed on all three cloud platforms by creating instances separately on each platform, and a load balancer is then created for each of the services. Using each platform's monitoring tools, we obtain the resulting graphs for the metrics. We then increase and decrease the load in the Apache Benchmark tool, taking specific tasks from the website, and compare the visualisation of the results in an aggregate graph and summary reports. This is used to test the website's overall performance with metrics such as throughput, CPU utilisation, error percentage, and cost efficiency.

Results: The results are based on running the Apache Benchmark load-testing tool against the selected website on each of the cloud platforms, and the results for AWS, Azure, and GCP are shown in an aggregate graph. The graphs indicate which service is best for users, namely the one that places less load on the server and returns requested data in the shortest amount of time. We considered workloads of 10 and 50 requests and, based on the results, compared the metrics of throughput, CPU utilisation, error percentage, and cost efficiency to determine which cloud platform performs better.

Conclusions: According to the results for the 10-request and 50-request workloads, GCP achieves higher throughput and CPU utilisation than AWS and Azure, which proved less flexible and efficient for users. Thus, GCP outperforms the others in terms of load balancing.
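The kind of benchmark run described here can be reproduced with the ab command-line tool. Below is a minimal sketch, assuming ab is installed and using hypothetical endpoint URLs for the three deployments; the request and concurrency counts match the 10/50-request workloads mentioned in the abstract.

```python
import re
import subprocess

# Hypothetical load-balancer endpoints for the three deployments.
ENDPOINTS = {
    "aws":   "http://aws-lb.example.com/books",
    "azure": "http://azure-lb.example.com/books",
    "gcp":   "http://gcp-lb.example.com/books",
}

def benchmark(url, requests=50, concurrency=10):
    """Run Apache Benchmark (ab) and return throughput in requests/second."""
    out = subprocess.run(
        ["ab", "-n", str(requests), "-c", str(concurrency), url],
        capture_output=True, text=True, check=True,
    ).stdout
    # ab prints a line like: "Requests per second:    123.45 [#/sec] (mean)"
    match = re.search(r"Requests per second:\s+([\d.]+)", out)
    return float(match.group(1))

for name, url in ENDPOINTS.items():
    print(name, benchmark(url, requests=50), "req/s")
```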
|
218 |
Investigations of Free Text Indexing Using NLP : Comparisons of Search Algorithms and Models in Apache Solr / Investigating How Free Text Indexing Can Be Improved Using NLP Sundstedt, Alfred January 2023 (has links)
As Natural Language Processing (NLP) progresses and applications like OpenAI's gain considerable popularity in society, businesses encourage the integration of NLP into their systems, both to improve the user experience and to provide users with their requested information. For case management systems, a complicated task is to provide the user with relevant documents, since customers often have large databases containing similar information; as it stands, the user needs to match the requested topic perfectly. Imagine if there were a way to search by context via established NLP models like BERT, instead of formulating the perfect prompt. Imagine if the system understood its own content. This thesis investigates, from a user perspective, how a free text index can be improved using NLP, and implements such a solution. Using AI to assist a free text index, in this case Apache Solr, can make it easier for users to find the specific content they are looking for. It is interesting to see how search can be improved with the help of NLP models to present more relevant results for the user. NLP can improve user prompts, known as queries, and assist in indexing the information. The task is to conduct a practical investigation by configuring the free text database Apache Solr with and without NLP support. This is investigated by training the search models on the content, letting each model return its most relevant search results for a set of user queries, and evaluating the results. The investigated search models were a string-based model, an OpenNLP model, and BERT models segmented at paragraph and sentence level. A hybrid search model combining OpenNLP and BERT at paragraph level was the best solution overall.
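A hybrid setup of this kind can be sketched as BERT-based re-ranking on top of a Solr keyword search. The example below is a minimal illustration only, assuming the pysolr and sentence-transformers packages, a running Solr core named docs with a text field, and an assumed embedding model name; it is not the thesis's actual configuration.

```python
import pysolr
from sentence_transformers import SentenceTransformer, util

solr = pysolr.Solr("http://localhost:8983/solr/docs")  # assumed core
model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed model

def hybrid_search(query, rows=20, top_k=5):
    # Stage 1: string-based recall from Solr's inverted index.
    candidates = list(solr.search(query, rows=rows, fl="id,text"))

    # Stage 2: semantic re-ranking of the candidates with BERT sentence
    # embeddings, so results match on meaning rather than exact terms.
    q_emb = model.encode(query, convert_to_tensor=True)
    d_embs = model.encode([str(d["text"]) for d in candidates],
                          convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_embs)[0]
    ranked = sorted(zip(candidates, scores.tolist()),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Segmenting documents at paragraph versus sentence level, as the thesis does, amounts to choosing what unit of text gets its own embedding before this re-ranking step.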
|
219 |
Predicting Closed Versus Open Questions Using Machine Learning for Improving Community Question Answering Websites Makkena, Pradeep Kumar January 2017 (has links)
No description available.
|
220 |
Using Apache Spark's MLlib to Predict Closed Questions on Stack Overflow Madeti, Preetham 07 June 2016 (has links)
No description available.
|