1 |
Task Level Parallelization of Irregular Computations using OpenMP 3.0
Albalawi, Eid 2013 July 1900
OpenMP is a standard parallel programming language used to develop parallel applications on shared memory machines. OpenMP is well suited to designing parallel algorithms for regular applications, where the amount of work is known a priori and the work can therefore be distributed among the threads at compile time. In irregular applications, the load changes dynamically at runtime, so work can only be distributed among the threads at runtime. The literature has shown that OpenMP produces unsatisfactory performance for irregular applications. In 2008, OpenMP 3.0 introduced new directives and features, such as the "task" directive, to handle irregular computations. Not much work has gone into studying irregular algorithms in OpenMP 3.0. In this thesis, I provide some insight into the usefulness of OpenMP 3.0
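As a rough illustration of the "task" directive mentioned in this abstract, the sketch below sums the values in an unbalanced binary tree, a typical irregular workload whose shape is only known at runtime. It is not taken from the thesis: the `Node` type and the functions `tree_sum_task` and `tree_sum` are hypothetical names chosen for this example. If the code is compiled without OpenMP support, the pragmas are ignored and it runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical node type for an unbalanced binary tree (illustration only). */
typedef struct Node {
    long value;
    struct Node *left;
    struct Node *right;
} Node;

/* Recursively sum a subtree; each child subtree is handed to an OpenMP task. */
static long tree_sum_task(const Node *n)
{
    long left = 0, right = 0;
    if (n == NULL)
        return 0;

    /* The results are written by the child tasks, so they must be shared. */
    #pragma omp task shared(left)
    left = tree_sum_task(n->left);

    #pragma omp task shared(right)
    right = tree_sum_task(n->right);

    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return n->value + left + right;
}

/* One thread creates the root of the task tree; the whole team runs tasks. */
long tree_sum(const Node *root)
{
    long total = 0;
    #pragma omp parallel
    {
        #pragma omp single
        total = tree_sum_task(root);
    }
    return total;
}
```

In practice a cutoff depth is usually added so that very small subtrees are summed without creating new tasks; it is left out here to keep the sketch short.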
|
3 |
A user-centric perspective on parallel programming with focus on OpenMP
Suess, Michael. Unknown Date
Kassel, University, Diss., 2007.
|
4 |
Comparative study of parallel programming models for multicore computing
Ali, Akhtar January 2013
Shared-memory multi-core processor technology has developed drastically, with faster processors and an increasing number of processors per chip. This new architecture challenges programmers to write code that scales over these many cores to exploit the full computational power of these machines. Shared-memory parallel programming paradigms such as OpenMP and Intel Threading Building Blocks (TBB) are two recognized models that offer a higher level of abstraction, shield programmers from the low-level details of thread management, and scale computation over all available resources. At the same time, the need for high-performance, power-efficient computing is compelling developers to exploit GPGPU computing due to the GPU's massive computational power and comparatively faster multi-core growth. This trend leads to systems with heterogeneous architectures containing multicore CPUs and one or more programmable accelerators such as programmable GPUs. There exist different programming models to program these architectures, and code written for one architecture is often not portable to another. OpenCL is a relatively new industry-standard framework, defined by the Khronos Group, which addresses the portability issue. It offers a portable interface to exploit the computational power of a heterogeneous set of processors such as CPUs, GPUs, DSP processors and other accelerators. In this work, we evaluate the effectiveness of OpenCL for programming multi-core CPUs in a comparative case study with two CPU-specific stable frameworks, OpenMP and Intel TBB, for five benchmark applications, namely matrix multiply, LU decomposition, image convolution, Pi value approximation and image histogram generation. The evaluation includes a performance comparison of the three frameworks and a study of the relative effects of applying compiler optimizations on performance numbers. OpenCL performance on two vendor-dependent platforms, Intel and AMD, is also evaluated. Then the same OpenCL code is ported to a modern GPU, and its code correctness and performance portability are investigated. Finally, the usability experience of coding using the three multi-core frameworks is presented.
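To give a sense of what one of the listed benchmarks might look like in OpenMP, here is a minimal sketch of a Pi approximation. The abstract does not say which method the study used; this example assumes midpoint-rule integration of 4/(1+x^2) over [0, 1], and the function name `approx_pi` and its `steps` parameter are hypothetical.

```c
/*
 * Approximate Pi by midpoint-rule integration of 4 / (1 + x^2) on [0, 1].
 * Assumed method for illustration; `steps` must be positive.
 */
double approx_pi(long steps)
{
    const double width = 1.0 / (double)steps;
    double sum = 0.0;
    long i;

    /* Each thread accumulates a private partial sum; the reduction clause
       combines the partial sums when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < steps; i++) {
        double x = ((double)i + 0.5) * width; /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * width;
}
```

Every iteration costs the same here, so this is a regular loop where the default static distribution of iterations is usually adequate.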
|
5 |
Improving Performance and Quality-of-Service through the Task-Parallel Model: Optimizations and Future Directions for OpenMP
Podobas, Artur January 2015
With the failure of Dennard scaling, which stated that shrinking transistors would make them more power-efficient, computer hardware has become very divergent. Initially the change only concerned the number of processors on a chip (multicores), but it has since escalated into complex heterogeneous systems with non-intuitive properties -- properties that can improve performance and power consumption but also strain the programmers expected to develop on them. Answering these challenges is the OpenMP task-parallel model -- a programming model that simplifies writing parallel software. Our focus in the thesis has been to explore performance and quality-of-service directions of the OpenMP task-parallel model, particularly by taking architectural features into account. The first question tackled is: what capabilities do existing state-of-the-art runtime systems have, and how do they perform? We empirically evaluated the performance of several modern task-parallel runtime systems. Performance and power consumption were measured using benchmarks, and we show that the two primary causes of bottlenecks in modern runtime systems lie in either the task management overheads or how tasks are distributed across processors. Next, we consider quality-of-service improvements in task-parallel runtime systems. Striving to improve execution performance, current state-of-the-art runtime systems seldom take dynamic architectural features such as temperature into account when deciding how work should be distributed across the processors, which can lead to overheating. We developed and evaluated two strategies for thermal awareness in task-parallel runtime systems. The first improves performance when the computer system is constrained by temperature, while the second strives to reduce temperature while meeting soft real-time objectives. We end the thesis by focusing on performance. Here we introduce our original contribution called BLYSK -- a prototype OpenMP framework created exclusively for performance research. We found that overheads in current runtime systems can be expensive, which often leads to performance degradation. We introduce a novel way of preserving task graphs throughout application runs: task graphs are recorded, identified and optimized the first time an OpenMP application is executed and are later re-used in subsequent executions, removing unnecessary overheads. Our proposed solution can nearly double the performance compared with other state-of-the-art runtime systems. Performance can also be improved through heterogeneity. Today, manufacturers are placing processors with different capabilities on the same chip. Because they are different, their power consumption characteristics and performance differ. Heterogeneity adds another dimension to the multiprocessing problem: how should work be distributed across the heterogeneous processors? We evaluated the performance of existing, homogeneous scheduling algorithms and found them to be ill-matched to heterogeneous systems. We proposed a novel scheduling algorithm that dynamically adjusts itself to the heterogeneous system in order to improve performance. The thesis ends with a high-level synthesis approach to improving performance in task-parallel applications. Rather than limiting ourselves to off-the-shelf processors -- which often contain a large amount of unused logic -- our approach is to automatically generate the processors ourselves. Our method allows us to generate application-specific hardware from the OpenMP task-parallel source code. Evaluated using FPGAs, our Systems-on-Chip outperformed other soft cores such as the NiosII processor and were also comparable in performance with modern state-of-the-art processors such as the Xeon PHI and the AMD Opteron.
|
6 |
Self-tuned parallel runtimes: a case study for OpenMP
Durán González, Alejandro 22 October 2008
In recent years parallel computing has become ubiquitous. Led by the spread of commodity multicore processors, parallel programming is no longer an obscure discipline mastered by only a few. Unfortunately, the number of able parallel programmers has not increased at the same speed, because it is not easy to write parallel code. Parallel programming is inherently different from sequential programming. Programmers must deal with a whole new set of problems: identification of parallelism, work and data distribution, load balancing, synchronization and communication. Parallel programmers have embraced several languages designed to allow the creation of parallel applications. In these languages, the programmer is responsible not only for identifying the parallelism but also for specifying low-level details of how the parallelism needs to be exploited (e.g. scheduling, thread distribution ...). This is a burden that hampers the productivity of programmers. We demonstrate that it is possible for the runtime component of a parallel environment to adapt itself to the application and the execution environment, and thus reduce the burden placed on the programmer. For this purpose we study three different parameters that are involved in the parallel exploitation of the OpenMP parallel language: parallel loop scheduling, thread allocation in multiple levels of parallelism, and task granularity control. In all cases, we propose a self-tuned algorithm that first performs on-line profiling of the application and, based on the information gathered, adapts the value of the parameter to the one that maximizes the performance of the application. Our goal is not to develop methods that outperform a hand-tuned application for a specific scenario, as this is probably just as difficult as compiler-generated code outperforming hand-tuned assembly code, but methods that get close to that performance with minimal effort from the programmer. In other words, what we want to achieve with our self-tuned algorithms is to maximize the ratio of performance to effort so that the entry barrier to parallelism is lower. The evaluation of our algorithms with different applications shows that we achieve that goal.
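The loop scheduling parameter mentioned in this abstract is one the programmer normally sets by hand. The sketch below is not the thesis's self-tuning algorithm; it only shows the standard OpenMP knob such an algorithm would adjust. With `schedule(runtime)`, the scheduling policy is read at run time (for example from the `OMP_SCHEDULE` environment variable) rather than fixed at compile time. The function `triangular_work` is a hypothetical workload in which iteration `i` performs `i` units of work.

```c
/*
 * Hypothetical uneven workload: iteration i performs i units of work, so a
 * plain static split gives the last thread far more work than the first.
 */
long triangular_work(int n)
{
    long total = 0;
    int i;

    /* schedule(runtime) defers the choice of policy (static, dynamic,
       guided, and chunk size) to run time, e.g. OMP_SCHEDULE="dynamic,4". */
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (i = 0; i < n; i++) {
        long local = 0;
        int j;
        for (j = 0; j < i; j++)
            local += 1; /* stand-in for real per-iteration work */
        total += local;
    }
    return total;
}
```

Running the same binary under different `OMP_SCHEDULE` settings is one manual way to look for a good policy; the abstract describes automating that search inside the runtime.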
|
7 |
Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis
Muddukrishna, Ananya January 2016
The combination of high-performance parallel programming and multi-core processors is the dominant approach to meeting the ever-increasing demand for computing performance today. The thesis is centered around OpenMP, a popular parallel programming API standard that enables programmers to quickly get started with writing parallel programs. However, in contrast to the quickness of getting started, writing high-performance OpenMP programs requires high effort and saps productivity. Part of the reason for impeded productivity is OpenMP’s lack of abstractions and guidance to exploit the strong architectural locality exhibited in NUMA systems and manycore processors. The thesis contributes data distribution abstractions that enable programmers to distribute data portably in NUMA systems and manycore processors without being aware of low-level system topology details. Data distribution abstractions are supported by the runtime system and leveraged by the second contribution of the thesis, an architecture-specific locality-aware scheduling policy that reduces data access latencies incurred by tasks, allowing programmers to obtain, with minimal effort, up to 69% improved performance for scientific programs compared to state-of-the-art work-stealing scheduling. Another reason for reduced programmer productivity is the poor support extended by OpenMP performance analysis tools to visualize, understand, and resolve problems at the level of grains: task and parallel for-loop chunk instances. The thesis contributes a cost-effective and automatic method to extensively profile and visualize grains. Grain properties and hardware performance are profiled at event notifications from the runtime system with less than 2.5% overhead and visualized using a new method called the Grain Graph. The grain graph shows the program structure that unfolded during execution and highlights problems such as low parallelism, work inflation, and poor parallelization benefit directly at the grain level with precise links to problem areas in source code. The thesis demonstrates that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing tools in standard programs from SPEC OMP 2012, Parsec 3.0 and Barcelona OpenMP Tasks Suite (BOTS). Grain profiles are also applied to study the input sensitivity and similarity of BOTS programs. All thesis contributions are assembled together to create an iterative performance analysis and optimization workflow that enables programmers to achieve desired performance systematically and more quickly than is possible using existing tools. This reduces pressure on experts and removes the need for tedious trial-and-error tuning, simplifying OpenMP performance analysis.
|
8 |
Um cluster híbrido com módulos de co-processamento em hardware (FPGAs) para processamento de alto desempenho / A hybrid cluster with hardware co-processing modules (FPGAs) for high-performance processing
BARROS JÚNIOR, Severino José de 10 September 2014
FINEP/Petrobrás (CENPES) / Organizations that work with computing systems increasingly seek to improve the performance of their applications. The main characteristic of these applications is massive data processing. The solution used to run such problems is generally based on general-purpose processor architectures, whose hardware structure follows the Von Neumann paradigm. This paradigm has a weakness known as the "Von Neumann bottleneck", in which instructions that could be executed simultaneously, because they have no data dependences on one another, end up being processed sequentially, limiting the potential performance of this class of applications. To increase the parallel processing capacity of their systems, organizations usually adopt a structure based on several PCs connected by a high-speed network and working together to solve a large problem. This arrangement is called a cluster, in which each member PC, called a node, performs part of the computation of a large problem simultaneously, giving the application as a whole a form of explicit parallelism. Even with a significant increase in the number of independent processing elements, this growth is insufficient to meet the enormous computing demand of complex applications. It requires dividing the work into groups of independent instructions that are distributed among the nodes. This strategy provides parallelism and thus better performance. However, performance within each node remains degraded because of the sequential bottleneck present in the processors. To increase the parallelism of operations within each node, hybrid solutions composed of conventional CPUs and coprocessors have been adopted. One of these coprocessors is the FPGA (Field Programmable Gate Array), which is usually connected to the PC through the PCIe bus. The project described in the dissertation proposes a development methodology for this hybrid cluster, so as to increase the performance of scientific applications that require a large amount of data processing. The methodology is presented and two examples are discussed in detail.
|
9 |
Aplikace využívající paralelní zpracování pro kryptografické výpočty / Applications for parallel processing in cryptography
Šánek, Jaromír January 2014
This thesis is about parallel programming. The first part compares the speed of modular exponentiation functions from various C/C++ libraries for the CPU. In the second part, the LibTomMath library is ported from the CPU to GPU CUDA technology. The speed of the modular exponentiation operation from the modified library is then compared on the CPU and the GPU. Finally, two "Client-Server" applications are created for computing the revocation function of the HM12 protocol.
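For readers unfamiliar with the operation being benchmarked, the sketch below shows modular exponentiation by the binary square-and-multiply method on fixed-width integers. It is an assumption-laden illustration, not code from the thesis or from LibTomMath: cryptographic libraries operate on multi-precision integers, whereas the hypothetical `mod_pow` here works on 64-bit values and requires a nonzero modulus of at most 2^32 so that intermediate products cannot overflow.

```c
#include <stdint.h>

/*
 * Compute (base^exp) mod `mod` with binary square-and-multiply.
 * Assumes 0 < mod <= 2^32 so every product below fits in 64 bits.
 */
uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod; /* gives 0 when mod == 1 */
    base %= mod;

    while (exp > 0) {
        if (exp & 1)                    /* current exponent bit is set */
            result = (result * base) % mod;
        base = (base * base) % mod;     /* square for the next bit */
        exp >>= 1;
    }
    return result;
}
```

The loop runs once per bit of the exponent, so the cost grows with the bit length of the exponent rather than with its value.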
|
10 |
A Banded Spike Algorithm and Solver for Shared Memory Architectures
Mendiratta, Karan 01 January 2011
A new parallel solver based on the SPIKE-TA algorithm has been developed using the OpenMP API for solving diagonally dominant banded linear systems on shared memory architectures. The results of numerical experiments carried out for different test cases demonstrate high performance and scalability on current multi-core platforms and highlight the time savings that SPIKE-TA OpenMP offers in comparison to the LAPACK BLAS-threaded LU model. By exploiting algorithmic parallelism in addition to a threaded implementation, we obtain greater speed-ups than the threaded versions of sequential algorithms. For non-diagonally dominant systems, we implement the SPIKE-RL scheme and a new Spike-calling-Spike (SCS) scheme using OpenMP. The timing results for solving non-diagonally dominant systems using SPIKE-RL show extremely good scaling in comparison to LAPACK and the modified banded-primitive library.
|