61 |
Parallelism in Node.js applications : Data flow analysis of concurrent scripts. Jansson, Linda, January 2017 (has links)
To fully utilize multicore processors in Node.js applications, the applications must be programmed as multiple processes. Parallel execution can increase the throughput of data and hence lower data buffering for inter-process communication. Node.js’s asynchronous programming model and interface to the operating system make for convenient tools that are well suited for multiprocess programming. However, the run-time behavior of asynchronous processes results in non-deterministic processor load and data flow. That means the performance gain from increasing concurrency depends on both the application’s run-time state and the hardware’s capacity for parallel execution.

The objective of this thesis work is to explore the effects of increasing parallelism in Node.js applications by measuring the differences in the amount of data buffering when distributed processes run on a varying number of cores with a fixed rate of asynchronously arriving data. The goal is to simulate and examine the run-time behavior of three basic multiprocess Node.js application architectures in order to discuss and evaluate software parallelism techniques. The three architectures are: pipelined nodes for temporally dependent processing, a vector of nodes for data-parallel processing, and a grid of nodes for one-to-many branched processing.

To simulate and visualize the run-time behavior, a simulation environment using multiple Node.js processes is created. The simulation is agent-based, where the agent is an abstraction for a specific data flow within the application. The simulation models and visualizes all of the data flows within a distributed application where processes communicate asynchronously via messages through sockets.

The results show that performance can increase when distributing Node.js applications across multiple processes running in parallel on multicore hardware. There are, however, diminishing returns as the number of active processes equals or exceeds the number of cores. A good rule of thumb seems to be to distribute the decoupled logic across as many processes as there are cores. The interaction between asynchronous processes is on the whole made very simple with Node.js. Although running multiple instances of Node.js requires more memory, the distributed architecture has the potential to increase performance by nearly as many times as the number of cores in the processor.
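To make the "vector of nodes" architecture concrete, here is a minimal, hypothetical TypeScript sketch (not from the thesis): a primary Node.js process fans asynchronously arriving items out to one forked worker per core, following the one-process-per-core rule of thumb above. The message passing stands in for the socket-based communication the simulation models.

```ts
// Hypothetical "vector of nodes": one primary process, one worker per core.
import cluster from "node:cluster";
import os from "node:os";

const NUM_WORKERS = os.cpus().length; // rule of thumb: one process per core

if (cluster.isPrimary) {
  const workers = Array.from({ length: NUM_WORKERS }, () => cluster.fork());
  workers.forEach((w) => w.on("message", (r) => console.log("done:", r)));
  // Simulate a fixed rate of asynchronously arriving data, round-robin.
  let seq = 0;
  setInterval(() => {
    workers[seq % NUM_WORKERS].send({ seq, payload: Math.random() });
    seq++;
  }, 10);
} else {
  process.on("message", (msg: { seq: number; payload: number }) => {
    // Stand-in for the decoupled application logic.
    process.send!({ seq: msg.seq, value: msg.payload * 2 });
  });
}
```

The pipelined and grid architectures differ only in the wiring: chained sends between workers for a pipeline, one-to-many sends for a grid.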
|
62 |
Conception d'un modèle de composants logiciels avec ordonnancement de tâches pour les architectures parallèles multi-coeurs, application au code Gysela / Conception of a software component model with task scheduling for many-core based parallel architecture, application to the Gysela5D code. Richard, Jérôme, 06 December 2017 (has links)
Cette thèse vise à définir et à valider un modèle de programmation intégrant la description d'architectures logicielles et un ordonnancement dynamique de tâches dans un contexte de haute performance. Par exemple, il s'agit de combiner les avantages de modèles tels que L²C et StarPU. L'objectif final est de proposer un modèle capable de supporter des applications telles que Gysela5D sur les architectures parallèles actuelles et futures (tel que des clusters très variés et supercalculateurs comportant des accélérateurs). / This thesis aims to define and validate a programming model that combines the description of software architecture with dynamic task scheduling in a high-performance context, for example by integrating the advantages of the L²C and StarPU models. The final goal is to propose a model that enables the description of applications such as Gysela5D on current and future parallel architectures (such as various clusters and supercomputers including accelerators).
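As a rough, hypothetical illustration of pairing a declared assembly with dynamic task scheduling (assumed names, not the L²C or StarPU API): tasks are launched as soon as their declared dependencies complete, in the spirit of StarPU's dependency-driven execution.

```ts
// Sketch of dependency-driven task scheduling over a declared task graph.
type Task = { id: string; deps: string[]; run: () => Promise<void> };

async function schedule(tasks: Task[]): Promise<void> {
  const done = new Map<string, Promise<void>>();
  // Assumes tasks are listed with dependencies before dependents.
  for (const t of tasks) {
    done.set(
      t.id,
      Promise.all(t.deps.map((d) => done.get(d)!)).then(() => t.run())
    );
  }
  await Promise.all(done.values());
}

// b and c start concurrently as soon as a completes.
schedule([
  { id: "a", deps: [], run: async () => console.log("a") },
  { id: "b", deps: ["a"], run: async () => console.log("b") },
  { id: "c", deps: ["a"], run: async () => console.log("c") },
]);
```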
|
63 |
Paralelização de programas SISAL para sistemas MPI / Parallelization of SISAL programs for MPI systems. Raul Junji Nakashima, 15 March 1996 (has links)
Este trabalho teve como finalidade a implementação de um método para a paralelização parcial de programas, escritos na linguagem funcional SISAL, utilizando as bibliotecas do padrão MPI (Message Passing Interface). Para tal, propusemos a transformação dos programas SISAL através do particionamento do loop paralelo forall pelo método de particionamento slice e a utilização do modelo de implementação do paralelismo SPMD (Single Program Multiple Data) no estilo de programas mestre/escravo. A validação de nossa proposta foi obtida através da realização de testes onde foram comparados os resultados obtidos com os programas originais e os programas com as alterações propostas. / This work describes a method for the partial parallelization of SISAL programs into programs with calls to MPI routines. We focused on the parallelization of the forall loop (through slicing of the index range). The generated code is a master/slave SPMD program. The work was validated through the compilation of some simple SISAL programs and comparison of the results with an unmodified version.
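The slice partitioning of a forall loop can be sketched as follows. This hypothetical TypeScript fragment uses Node worker threads in place of MPI ranks, with the main thread as master and one slave per contiguous index slice; the thesis instead generates MPI calls from SISAL source.

```ts
// Slice partitioning of forall(i in 0..N): the master cuts the index range
// into contiguous slices, one slave per slice (run the compiled .js file).
import { Worker, isMainThread, parentPort, workerData } from "node:worker_threads";

const N = 1_000_000;
const SLAVES = 4;

if (isMainThread) {
  const chunk = Math.ceil(N / SLAVES);
  const parts = Array.from({ length: SLAVES }, (_, rank) =>
    new Promise<number>((resolve) => {
      const lo = rank * chunk;
      const hi = Math.min(lo + chunk, N);
      new Worker(__filename, { workerData: { lo, hi } }).on("message", resolve);
    })
  );
  Promise.all(parts).then((p) =>
    console.log("sum =", p.reduce((a, b) => a + b, 0))
  );
} else {
  const { lo, hi } = workerData as { lo: number; hi: number };
  let acc = 0;
  for (let i = lo; i < hi; i++) acc += i; // stand-in for the forall body
  parentPort!.postMessage(acc); // slave reports its partial result
}
```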
|
64 |
Processamento paralelo aplicado em análise não linear de cascas / Parallel processing applied to nonlinear structural analysis. Elias Calixto Carrijo, 20 June 2001 (has links)
Este trabalho tem o intuito de fazer uso do processamento paralelo na análise não linear de cascas pelo método dos elementos finitos. O elemento finito de casca é obtido com o acoplamento de um elemento de placa e um de chapa. O elemento de placa utiliza formulação de Kirchhoff (DKT) para placas delgadas e o elemento de chapa faz uso da formulação livre (FF), introduzindo um grau de liberdade rotacional nos vértices. A análise não-linear com plasticidade utiliza o modelo de plasticidade associada com algoritmo de integração explícito, modelo de escoamento de von Mises com integração em camadas (modelo estratificado), para materiais isotrópicos. A implementação em paralelo é realizada em um sistema com memória distribuída e biblioteca de troca de mensagens PVM (Parallel Virtual Machine). O procedimento não-linear é completamente paralelizado, excetuando a impressão final de resultados. As etapas que constituem o método dos elementos finitos, matriz de rigidez da estrutura e resolução do sistema de equações lineares, são paralelizadas. Para o cálculo da matriz de rigidez utiliza-se um algoritmo com decomposição de domínio explícita. Para a resolução do sistema de equações lineares utiliza-se o método dos gradientes conjugados com implementação em paralelo. É apresentada uma breve revisão bibliográfica sobre o paralelismo, com comentários sobre perspectivas em análise estrutural. / This work aims at using parallel processing for the nonlinear analysis of shells through the finite element method. The shell finite element is obtained by coupling a plate element with a membrane one. The plate element uses Kirchhoff's formulation (DKT) for thin plates and the membrane element makes use of the free formulation (FF), introducing a rotational degree of freedom at the vertexes. Nonlinear plastic analysis uses an associated plasticity model with an explicit integration algorithm and the von Mises yielding model with layer integration (stratified model) for isotropic materials. The parallel implementation runs on a distributed memory system with the message-passing library PVM (Parallel Virtual Machine). The nonlinear procedure is completely parallelised, except for the final printing of results. The steps that constitute the finite element method, the structural stiffness matrix and the solution of the linear equation system, are parallelised. An explicit domain decomposition algorithm is used for the stiffness matrix evaluation. To solve the linear equation system, the conjugate gradient method is used, with a parallel implementation. A brief bibliographic review of parallelism is presented, with comments on structural analysis perspectives.
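For reference, a minimal sequential sketch of the conjugate gradient method used for the equation system. In the parallel version, the matrix-vector product and the dot products are the operations distributed across processes; the dense matrix here is an illustrative assumption, since FEM codes use sparse storage.

```ts
// Conjugate gradient for a symmetric positive definite system A x = b.
type Vec = number[];

const dot = (a: Vec, b: Vec) => a.reduce((s, ai, i) => s + ai * b[i], 0);
const axpy = (alpha: number, x: Vec, y: Vec) =>
  y.map((yi, i) => yi + alpha * x[i]); // y + alpha*x
const matvec = (A: number[][], x: Vec) => A.map((row) => dot(row, x));

function conjugateGradient(A: number[][], b: Vec, tol = 1e-10): Vec {
  let x: Vec = b.map(() => 0);
  let r = b.slice();            // residual r = b - A*0
  let p = r.slice();            // search direction
  let rs = dot(r, r);
  for (let it = 0; it < b.length && Math.sqrt(rs) > tol; it++) {
    const Ap = matvec(A, p);    // dominant cost; parallelised per domain
    const alpha = rs / dot(p, Ap);
    x = axpy(alpha, p, x);
    r = axpy(-alpha, Ap, r);
    const rsNew = dot(r, r);
    p = axpy(rsNew / rs, p, r); // p = r + beta*p
    rs = rsNew;
  }
  return x;
}
```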
|
65 |
Exploração de paralelismo no roteamento global de circuitos VLSI / Parallel computing exploitation applied for VLSI global routing. Tumelero, Diego, January 2015 (has links)
Com o crescente aumento das funcionalidades dos circuitos integrados, existe um aumento consequente da complexidade do projeto dos mesmos. O fluxo de projeto de circuitos integrados inclui em um de seus passos o roteamento, que consiste em criar fios que interconectam as células do circuito. Devido à complexidade, o roteamento é dividido em global e detalhado. O roteamento global de circuitos VLSI é uma das tarefas mais complexas do fluxo de síntese física, sendo classificado como um problema NP-completo. Neste trabalho, além de realizar um levantamento de trabalhos que utilizam as principais técnicas de paralelismo com o objetivo de acelerar o processamento do roteamento global, foram realizadas análises nos arquivos de benchmark do ISPD 2007/08. Com base nestas análises foi proposto um método que agrupa as redes para então verificar a existência de dependência de dados em cada grupo. Esta verificação de dependência de dados, que chamamos neste trabalho de colisor, tem por objetivo criar fluxos de redes independentes umas das outras para o processamento em paralelo, ou seja, ajudar a implementação do roteamento independente de redes. Os resultados demonstram que esta separação em grupos, aliada com a comparação concorrente dos grupos, pode reduzir em 67x o tempo de execução do colisor de redes se comparada com a versão sequencial e sem a utilização de grupos. Também foi obtido um ganho de 10x ao comparar a versão com agrupamentos sequencial com a versão paralela. / With the increasing functionality of integrated circuits, there is a consequent increase in design complexity. The IC design flow includes routing as one of its steps, which consists of creating the wires that interconnect the circuit cells. Because of this complexity, routing is divided into global and detailed routing. The global routing of VLSI circuits is one of the most complex tasks in the physical synthesis flow and is classified as an NP-complete problem. In this work, besides surveying works that apply the main parallel computing techniques to accelerate global routing, analyses were performed on the ISPD 2007/08 benchmark files. Based on these analyses, we propose a method that groups the nets and then checks for data dependences within each group. This data dependency check, which we call the collider, aims to create flows of nets that are independent of one another for parallel processing, that is, to support the independent routing of nets. The results show that this separation into groups, combined with the concurrent comparison of the groups, can reduce the collider's runtime by 67x compared with the sequential version without grouping. A 10x gain was also obtained when comparing the sequential grouped version with the parallel one.
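A minimal sketch of the collider idea (the names and the grid-cell model are assumptions, not the thesis' code): two nets are dependent when they touch common routing cells, and independent nets are packed into groups that can be dispatched to parallel routing workers.

```ts
// A net occupies a set of global-routing grid cells, keyed as "x,y".
type Net = { id: number; cells: Set<string> };

// Two nets "collide" (are data-dependent) when they share a cell.
function collides(a: Net, b: Net): boolean {
  for (const c of a.cells) if (b.cells.has(c)) return true;
  return false;
}

// Greedily pack nets into groups of mutually independent nets; each
// group can then be routed in parallel without data races.
function buildIndependentGroups(nets: Net[]): Net[][] {
  const groups: Net[][] = [];
  for (const net of nets) {
    const g = groups.find((grp) => grp.every((other) => !collides(net, other)));
    if (g) g.push(net);
    else groups.push([net]);
  }
  return groups;
}
```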
|
66 |
A study about differences in performance with parallel and sequential sorting algorithms. Nyholm, Joel, January 2021 (has links)
Background: Sorting algorithms are an essential part of computer science. With the use of parallelism, these algorithms' performance can improve.
Objectives: To assess parallel sorting algorithms' performance compared with their sequential counterparts and to see what contextual factors make a difference in performance.
Methods: An experiment was made with quicksort, merge sort, load-balanced parallel merge sort and hyperquicksort. These algorithms were executed on Ubuntu 20.10 and Windows 10 Home with three data sets: small (10^6 integers), medium (5 × 10^6 integers) and large (10^7 integers). Each algorithm executed 1 000 times per data set within each operating system, resulting in 6 000 executions per sorting algorithm.
Results: With the data from the executions, it was concluded that hyperquicksort had the fastest execution time. On average, load-balanced parallel merge sort had the slowest execution time. The fastest operating system was Ubuntu 20.10: all but one algorithm executed faster on Ubuntu.
Conclusions: The results showed that the fastest algorithm was hyperquicksort, but other conclusions also arose. The data set size correlated with both the execution time and the speedup for a given parallel sorting algorithm. When the data set size increased, both the execution time and the speedup increased.
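To illustrate the winning algorithm, a sequentially simulated sketch of one hyperquicksort round (an illustration of the idea, not the thesis' implementation): processes paired across the top hypercube dimension split their local data around a broadcast pivot and swap halves.

```ts
// One hyperquicksort round over 2^d "processes", simulated sequentially:
// every process splits its local data around a shared pivot, then swaps
// halves with its partner across the highest hypercube dimension.
function hyperquicksortRound(local: number[][], pivot: number): number[][] {
  const p = local.length;        // assumed to be a power of two
  const half = p / 2;
  const low = (a: number[]) => a.filter((v) => v <= pivot);
  const high = (a: number[]) => a.filter((v) => v > pivot);
  const next: number[][] = new Array(p);
  for (let i = 0; i < half; i++) {
    const partner = i + half;    // neighbor across the top dimension
    next[i] = [...low(local[i]), ...low(local[partner])];       // keeps <= pivot
    next[partner] = [...high(local[i]), ...high(local[partner])]; // keeps > pivot
  }
  return next;
}
```

Recursing within each half with fresh pivots, and re-sorting locally after each exchange, leaves the data globally sorted across the processes.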
|
67 |
Performance evaluation of Web Workers API and OpenMP. Hellberg, Linus; Bhamidipati, Bhargava, January 2022 (has links)
Background - Web browsers and the web programs on them are being used now more than ever in a manner similar to traditional software. But with the increase in the demand for performance in bigger and bigger web applications, there is a need to make web applications perform faster and better. Introducing parallelism to a normally single-threaded system is one popular way of introducing more performance.
Objectives - We will implement proven and workable programs, created with OpenMP, that will be translated to JavaScript. These JavaScript applications will use the Web Workers API to achieve similar levels of parallelism as the OpenMP applications.
Methods - To implement and gather results from all of the various programs, we will be using Visual Studio Code and the Live Server extension it hosts to run and compare the JavaScript implementations. The selected OpenMP applications to be measured and translated were primarily taken from a benchmark suite that hosts programs already written in a traditional parallel computing model, such as OpenMP. The performance of these OpenMP programs and Web Workers applications will be analyzed and compared in the second portion of the research, where results will be gathered.
Results - JavaScript was proven to perform worse than OpenMP in every situation tested. Though this was expected, there were also some situations where the JavaScript applications performed close to the OpenMP programs.
Conclusions - Ultimately, using Web Workers is recommended for what they were designed to do: relieving the main thread to keep the web program running smoothly. For the heavy computational tasks we experimented on, JavaScript did not do a sufficient job compared with the OpenMP applications. When measuring the workers, we did not get results for any application that came very close to what OpenMP achieved. Thus Web Workers are really only suited for simple problems that need to be done repeatedly; they lack the efficiency to make complicated algorithms worth implementing.
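The Web Workers pattern under test looks roughly like this two-file sketch (file names and the summation kernel are placeholders): the main thread posts work to a worker and stays responsive while the worker computes.

```ts
// main.ts - runs on the UI thread; stays responsive while the worker grinds.
const worker = new Worker("worker.js");
worker.onmessage = (e: MessageEvent<number>) => console.log("result:", e.data);
worker.postMessage(50_000_000); // iteration count to offload

// worker.ts - compiled separately with lib "webworker".
onmessage = (e: MessageEvent<number>) => {
  let sum = 0;
  for (let i = 0; i < e.data; i++) sum += i; // stand-in for a heavy kernel
  postMessage(sum); // send the result back to the main thread
};
```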
|
68 |
Implementation and evaluation of selected Machine Learning algorithms on a resource constrained telecom hardware platform / Implementation och utvärdering av utvalda maskininlärningsalgoritmer på en resursbegränsad telekom-maskinvaruplattform. Leborg, Sebastian, January 2017 (has links)
The vast majority of computing hardware platforms available today are not desktop PCs. They are embedded systems, sensors and small specialized pieces of hardware present in almost every digital product available today. Due to the massive amount of information available through these devices, we can find new and exciting ways to apply and benefit from machine learning. Many of these computing devices have specialized, resource-constrained architectures, and it might be problematic to perform complicated computations. If such a system is under heavy load or has restricted performance, computational power is a valuable resource and costly algorithms must be avoided.

This master thesis will present an in-depth study investigating the trade-offs between precision, latency and memory consumption of a selected set of machine learning algorithms implemented on a resource-constrained multi-core telecom hardware platform. This report includes motivations for the selected algorithms, discusses the results of the algorithms' execution on the hardware platform and offers conclusions relevant to further developments. / Majoriteten av beräkningsplattformarna som finns tillgängliga idag är inte stationära bordsdatorer. De är inbyggda system, sensorer och små specialiserade hårdvaror som finns i nästan alla digitala produkter tillgängliga idag. På grund av den enorma mängden information som finns tillgänglig via dessa enheter kan vi hitta nya och spännande sätt att dra nytta av maskininlärning. Många av dessa datorer har specialiserade, resursbegränsade arkitekturer och det kan vara problematiskt att utföra de komplicerade beräkningar som behövs. Om ett sådant system är tungt belastat eller har begränsad prestanda, är beräkningskraft en värdefull resurs och kostsamma algoritmer måste undvikas.

Detta masterprojekt kommer att presentera en djupgående studie som undersöker avvägningarna mellan precision, latens och minneskonsumtion av en utvald uppsättning maskininlärningsalgoritmer implementerade på en resursbegränsad flerkärnig telekom-maskinvaruplattform. Denna rapport innehåller motivationer för de valda algoritmerna, diskuterar resultaten av algoritmerna på hårdvaruplattformen och presenterar slutsatser som är relevanta för vidareutveckling.
|
69 |
Optimizing Lempel-Ziv Factorization for the GPU Architecture. Ching, Bryan, 01 June 2014 (has links) (PDF)
Lossless data compression is used to reduce storage requirements, allowing for the relief of I/O channels and better utilization of bandwidth. The Lempel-Ziv lossless compression algorithms form the basis for many of the most commonly used compression schemes. General-purpose computing on graphics processing units (GPGPU) allows us to take advantage of the massively parallel nature of GPUs for computations other than their original purpose of rendering graphics. Our work targets the use of GPUs for general lossless data compression. Specifically, we developed and ported an algorithm that constructs the Lempel-Ziv factorization directly on the GPU. Our implementation bypasses the sequential nature of the LZ factorization and attempts to compute the factorization in parallel. By breaking down the LZ factorization into what we call the PLZ, we are able to outperform the fastest serial CPU implementations by up to 24x and perform comparably to a parallel multicore CPU implementation. To achieve these speeds, our implementation outputted LZ factorizations that were on average only 0.01 percent greater than the optimal solution that could be computed sequentially.
We also reevaluate the fastest GPU suffix array construction algorithm, which is needed to compute the LZ factorization, and find speedups of up to 5x over the fastest CPU implementations.
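For orientation, a naive quadratic reference version of the LZ factorization (an assumption for illustration; the thesis computes it from suffix arrays on the GPU): each factor is either a fresh literal or the longest match starting earlier in the text.

```ts
// Naive O(n^2) Lempel-Ziv factorization; overlapping matches are allowed.
type Factor = { pos: number; len: number } | { literal: string };

function lzFactorize(s: string): Factor[] {
  const out: Factor[] = [];
  let i = 0;
  while (i < s.length) {
    let bestLen = 0, bestPos = 0;
    for (let j = 0; j < i; j++) {            // candidate earlier start
      let len = 0;
      while (i + len < s.length && s[j + len] === s[i + len]) len++;
      if (len > bestLen) { bestLen = len; bestPos = j; }
    }
    if (bestLen === 0) { out.push({ literal: s[i] }); i += 1; }
    else { out.push({ pos: bestPos, len: bestLen }); i += bestLen; }
  }
  return out;
}

// "abababab" factorizes as: a, b, then (pos 0, len 6).
console.log(lzFactorize("abababab"));
```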
|
70 |
Enhanced Capabilities of the Spike Algorithm and a New Spike-OpenMP Solver. Spring, Braegan S, 07 November 2014 (links) (PDF)
SPIKE is a parallel algorithm to solve block tridiagonal matrices. In this work, two useful improvements to the algorithm are proposed. A flexible threading strategy is developed to overcome limitations of the recursive reduced system method. Allocating multiple threads to some tasks created by the SPIKE algorithm removes the previous restriction that recursive SPIKE may only use a number of threads equal to a power of two. Additionally, a method of solving transpose problems is shown. This method matches the performance of the non-transpose solve while reusing the original factorization.
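As context for the local solves that SPIKE runs in parallel, here is a sketch of the Thomas algorithm for one tridiagonal partition (scalar rather than block tridiagonal, for simplicity): each partition performs such a solve independently, and a small reduced system couples the partition boundaries.

```ts
// Thomas algorithm for sub[i]*x[i-1] + diag[i]*x[i] + sup[i]*x[i+1] = rhs[i];
// assumes a well-conditioned (e.g. diagonally dominant) system. sub[0] and
// sup[n-1] are unused. In SPIKE, each partition runs a solve like this in
// parallel, coupled through a small reduced system at the boundaries.
function thomasSolve(sub: number[], diag: number[], sup: number[], rhs: number[]): number[] {
  const n = diag.length;
  const c = sup.slice(), d = rhs.slice();
  c[0] /= diag[0];
  d[0] /= diag[0];
  for (let i = 1; i < n; i++) {
    const m = diag[i] - sub[i] * c[i - 1];  // eliminate the sub-diagonal
    c[i] = sup[i] / m;
    d[i] = (d[i] - sub[i] * d[i - 1]) / m;
  }
  const x = new Array<number>(n).fill(0);
  x[n - 1] = d[n - 1];
  for (let i = n - 2; i >= 0; i--) x[i] = d[i] - c[i] * x[i + 1]; // back-substitute
  return x;
}
```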
|