1 |
Numerical simulation of oil spills in coastal areas using shallow water equations in generalised coordinates
Novelli, Guillaume 24 November 2011
The pollution generated by accidental marine oil spills can cause persistent ecological disasters and lead to serious social and economic damage. Numerical simulations are a valuable tool for making sound decisions in emergency situations or for planning response actions beforehand.
The main objective of this work was to improve SIMOIL, a computational model developed earlier at URV that is capable of predicting the evaporation and spreading of massive oil spills in coastal areas.
Specifically, a new coastal current model, based on the solution of the shallow water equations in generalised coordinates, has been developed, validated and then coupled to SIMOIL.
The model was specifically designed to describe coastal oceanic flows over topography, accounting for the Coriolis force, eddy viscosity and seabed friction, and to couple with SIMOIL in domains with complex boundaries.
The equations have been discretized over generalised domains by means of second-order accurate finite differences. The code was then implemented in FORTRAN.
The code has been validated extensively against numerical and experimental flow studies from the literature.
Finally, the new complete version of SIMOIL, coupling the shallow water model and the oil slick model, has been applied to the study of two accidental oil spills:
• A massive leakage from Repsol's floating dock in the port of Tarragona
• The biggest oil spill ever to occur in the Eastern Mediterranean Sea: the 2006 Lebanon oil spill.
In both cases, the new version of SIMOIL demonstrated more accurate predictions of the behaviour of the oil spill, especially for moderate winds with complex topography. / The pollution generated by accidental oil spills can be reduced if action is taken and the right decisions are made in time. Numerical simulations of oil spills make it possible to predict the evolution of oil slicks.
The main objective of this work was to improve the accuracy and range of application of the SIMOIL code by developing and integrating into it a model for predicting marine currents in coastal waters.
The shallow water equations were derived in generalised coordinates. The equations were discretized and the code was implemented in FORTRAN 90.
The model and the numerical methods were validated against experimental and numerical flow studies from the literature.
Finally, the new version of SIMOIL was successfully applied to two physical oil spill cases:
• a fictitious spill from Repsol's unloading monobuoy in the port of Tarragona
• a real spill, the largest to have occurred in the Eastern Mediterranean Sea, a consequence of the war in Lebanon in July 2006.
In both cases the new version of SIMOIL provided more accurate predictions, especially for moderate winds and complex topographies.
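For orientation, the shallow water equations named in this abstract can be written in their common textbook Cartesian form with the three effects the abstract lists (Coriolis force, eddy viscosity, seabed friction). This is only a sketch under assumptions: the thesis works in generalised coordinates, and the quadratic friction law and all symbols below ($\eta$ free-surface elevation, $h$ undisturbed depth, $H$ total depth, $u, v$ depth-averaged velocities, $f$ Coriolis parameter, $g$ gravity, $\nu_t$ eddy viscosity, $C_f$ friction coefficient) are assumed here, not taken from the record.

```latex
\begin{aligned}
&\frac{\partial \eta}{\partial t}
  + \frac{\partial (H u)}{\partial x}
  + \frac{\partial (H v)}{\partial y} = 0,
  \qquad H = h + \eta, \\[4pt]
&\frac{\partial u}{\partial t}
  + u \frac{\partial u}{\partial x}
  + v \frac{\partial u}{\partial y}
  - f v
  = - g \frac{\partial \eta}{\partial x}
    + \nu_t \nabla^2 u
    - \frac{C_f}{H}\, u \sqrt{u^2 + v^2}, \\[4pt]
&\frac{\partial v}{\partial t}
  + u \frac{\partial v}{\partial x}
  + v \frac{\partial v}{\partial y}
  + f u
  = - g \frac{\partial \eta}{\partial y}
    + \nu_t \nabla^2 v
    - \frac{C_f}{H}\, v \sqrt{u^2 + v^2}.
\end{aligned}
```

A second-order finite-difference scheme of the kind mentioned in the abstract would typically replace each spatial derivative with a central difference, for example $\partial \eta / \partial x \approx (\eta_{i+1,j} - \eta_{i-1,j}) / (2\,\Delta x)$ on a uniform grid.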
2 |
Aurora: seamless optimization of OpenMP applications / Aurora: Otimização Transparente de Aplicações OpenMP
Lorenzon, Arthur Francisco January 2018
Efficiently exploiting thread-level parallelism has been challenging for software developers. As many parallel applications do not scale with the number of cores, blindly increasing the number of threads may not produce the best results in performance or energy. However, correctly choosing the ideal number of threads is not straightforward: many variables are involved (e.g., off-chip bus saturation and data-synchronization overhead), and these change according to different aspects of the system at hand (e.g., input set, micro-architecture) and even during execution. To address this complex scenario, this thesis presents Aurora. It is capable of automatically finding, at run-time and with minimum overhead, the optimal number of threads for each parallel region of the application, and of re-adapting when the behavior of a region changes during execution. Aurora works with OpenMP and is completely transparent to both the programmer and the end user: given an OpenMP application binary, Aurora optimizes it without any code transformation or recompilation. By executing fifteen well-known benchmarks on four multi-core processors, we show that Aurora improves the trade-off between performance and energy by up to 98% over the standard OpenMP execution, 86% over the built-in feature of OpenMP that dynamically adjusts the number of threads, and 91% over a feedback-driven threading emulation.
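The idea of searching for a per-region thread count at run time can be illustrated with a minimal Python sketch. It is not Aurora's actual algorithm, which the abstract does not describe: the function `find_thread_count`, the stop-on-first-worsening rule (which assumes the cost has a single minimum), and the `toy_cost` model are all hypothetical.

```python
from typing import Callable


def find_thread_count(cost: Callable[[int], float], max_threads: int) -> int:
    """Return the thread count with the lowest measured cost.

    Hypothetical search: try 1, 2, 3, ... threads and stop as soon as the
    cost stops improving. This assumes the cost curve has a single minimum.
    """
    best_n, best_cost = 1, cost(1)
    for n in range(2, max_threads + 1):
        c = cost(n)
        if c >= best_cost:
            break  # cost got worse: more threads no longer pay off
        best_n, best_cost = n, c
    return best_n


def toy_cost(n: int) -> float:
    """Made-up cost model: serial part + parallel part / n + sync overhead."""
    return 10.0 + 80.0 / n + 2.0 * n


if __name__ == "__main__":
    print(find_thread_count(toy_cost, 16))
```

In a real system `cost` would be a measurement (for example, an energy-delay product sampled while the region runs), and the search would be repeated if the region's behavior changed.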
5 |
Optimization of memory management on distributed machine / Optimisation de la gestion mémoire sur machines distribuées
Ha, Viet Hai 05 October 2012
In order to further exploit the capabilities of parallel computing architectures such as grids, clusters, multi-processors and, more recently, clouds and multi-cores, an easy-to-use parallel programming language remains an important challenge. From the programmer's point of view, OpenMP is very easy to use thanks to its support for incremental parallelization, its features for dynamically setting the number of threads, and its scheduling strategies. However, as it was initially designed for shared-memory systems, OpenMP is usually limited on distributed-memory systems to intra-node computations. Many attempts have been made to port OpenMP to distributed systems. The most mature approaches mainly focus on exploiting the capabilities of a special network architecture and therefore cannot provide an open solution. Others are based on an already available software solution such as DMS, MPI or Global Array and, as a consequence, have difficulty becoming fully compliant, high-performance implementations of OpenMP. CAPE, which stands for Checkpointing Aided Parallel Execution, is another attempt to build a compliant OpenMP implementation for distributed-memory systems. It is based on the following idea: when reaching a parallel section, the master thread is dumped and its image is sent to the slaves; then, each slave executes a different thread; at the end of the parallel section, each slave thread extracts the list of all modifications that have been performed locally and returns it to the master thread; the master integrates these modifications and resumes its execution. In order to prove the feasibility of this paradigm, the first version of CAPE was implemented using complete checkpoints. However, preliminary analysis showed that the large amount of data transferred between threads and the extraction of the list of modifications from complete checkpoints lead to weak performance. Furthermore, this version was restricted to parallel problems satisfying Bernstein's conditions, i.e. it did not handle shared data. This thesis presents the approaches we proposed to improve CAPE's performance and to overcome the restrictions on shared data. First, we developed DICKPT, which stands for Discontinuous Incremental Checkpointing, an incremental checkpointing technique that supports saving incremental checkpoints discontinuously during the execution of a process. Based on DICKPT, the execution speed of the new version of CAPE was significantly increased. For example, the time to compute a large matrix-matrix product on a desktop cluster has become very similar to the execution time of an optimized MPI program. Moreover, the speedup of this new version for various numbers of threads is fairly linear for different problem sizes. For shared data, we proposed UHLRC, which stands for Updated Home-based Lazy Release Consistency, a modified version of the Home-based Lazy Release Consistency (HLRC) memory model, to make it better suited to the characteristics of CAPE. Prototypes and algorithms to implement the synchronization and the OpenMP data-sharing clauses and directives are also specified. Together, these two contributions ensure that CAPE can respect OpenMP's shared-data requirements.
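The checkpoint-and-merge idea behind CAPE can be sketched in a few lines of Python. This is a toy, single-process emulation rather than the thesis implementation: process images are modeled as plain dictionaries, the "slaves" run one after another, and the names `run_parallel_region` and `make_square_task` are hypothetical. The tasks write to disjoint keys, which mirrors the Bernstein's-conditions restriction of the first CAPE version.

```python
import copy
from typing import Callable, Dict, List

State = Dict[int, int]


def run_parallel_region(master: State, tasks: List[Callable[[State], None]]) -> State:
    """Emulate one parallel section in the CAPE style (sequentially)."""
    image = copy.deepcopy(master)  # checkpoint ("dump") of the master
    diffs: List[State] = []
    for task in tasks:
        local = copy.deepcopy(image)  # each slave starts from the same image
        task(local)                   # slave executes its share of the work
        # Extract only what changed relative to the image (incremental idea).
        diffs.append({k: v for k, v in local.items()
                      if k not in image or image[k] != v})
    for diff in diffs:
        master.update(diff)           # master integrates the modifications
    return master


def make_square_task(i: int) -> Callable[[State], None]:
    """Hypothetical work item: square entry i, touching no other entry."""
    def task(state: State) -> None:
        state[i] = state[i] ** 2
    return task


if __name__ == "__main__":
    data = {0: 1, 1: 2, 2: 3, 3: 4}
    print(run_parallel_region(data, [make_square_task(i) for i in range(4)]))
```

Sending only the differences instead of whole images is the motivation the abstract gives for moving from complete checkpoints to incremental ones.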
6 |
Parallel paradigms in optimal structural design
Van Huyssteen, Salomon Stephanus 12 1900
Thesis (MScEng)--Stellenbosch University, 2011. / Modern-day processors are not getting any faster. Because power consumption limits frequency scaling, parallel processing is increasingly being used to decrease computation time. In this thesis, several parallel paradigms are used to improve the performance of commonly serial SAO programs. Four novelties are discussed:
First, double precision solvers are replaced with single precision solvers. This is attempted in order to take advantage of the anticipated factor-of-2 speed increase that single precision computations have over double precision computations. However, single precision routines present unpredictable performance characteristics and struggle to converge to the required accuracies, which is unfavourable for optimization solvers.
Second, QP and dual statements are pitted against one another in a parallel environment. This is done because it is not always easy to see a priori which is best. Therefore both are started in parallel and the competing threads are cancelled as soon as one returns a valid KKT point. Parallel QP vs. dual statements prove to be very attractive, converging within the minimum number of outer iterations. The most appropriate solver is selected on each iteration as the problem properties change during the iteration steps. Thread cancellation poses problems, because threads have to wait to arrive at appropriate checkpoints; the best routines thus suffer unnecessarily long wait times because of struggling competing routines.
Third, multiple global searches are started in parallel on a shared memory system. A speed increase of nearly 4x is observed for all problems. Dynamically scheduled threads alleviate the need for a preset number of threads, as required in message passing implementations.
Lastly, existing matrix-vector multiplication routines are replaced with optimized BLAS routines, especially BLAS routines targeted at GPGPU technologies (graphics processing units). The GPU routines prove to be superior when solving large matrix-vector products in an iterative environment. These problems scale well within the hardware capabilities, and speedups of up to 36x are recorded for the largest problems tested.
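The second idea in this abstract, starting competing solvers in parallel and cancelling the losers once one returns, can be sketched in Python as follows. This is an illustrative toy, not the thesis code: the "solvers" just sleep, and `make_solver`, `race` and the cooperative `cancel` flag are hypothetical names. The flag is only checked between steps, which reproduces the checkpoint wait the abstract mentions: the winner cannot fully return until the loser reaches its next check.

```python
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, List, Optional

Solver = Callable[[threading.Event], Optional[str]]


def make_solver(name: str, steps: int, step_time: float) -> Solver:
    """Build a stand-in solver that 'iterates' by sleeping `steps` times."""
    def solve(cancel: threading.Event) -> Optional[str]:
        for _ in range(steps):
            if cancel.is_set():   # checkpoint: cancellation is seen only here
                return None
            time.sleep(step_time)
        return name               # stands in for a valid solution point
    return solve


def race(solvers: List[Solver]) -> Optional[str]:
    """Run all solvers concurrently and return the first finished result."""
    cancel = threading.Event()
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        futures = [pool.submit(s, cancel) for s in solvers]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        cancel.set()              # ask the remaining solvers to stop
        return next(iter(done)).result()


if __name__ == "__main__":
    print(race([make_solver("qp", 2, 0.01), make_solver("dual", 50, 0.01)]))
```

Cooperative cancellation is used here because Python threads cannot be killed from outside; coarser checkpoints would make the wait after the winner finishes correspondingly longer.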