161 |
Estudo do desempenho de aplicações da mecânica dos sólidos em computação paralela / Study of the performance of solid mechanics applications in parallel computing
Pinho, Ronilson Rodrigues 06 October 2014 (has links)
Submitted by Celso Magalhaes (celsomagalhaes@ufrrj.br) on 2017-06-19T12:18:08Z
No. of bitstreams: 1
2014 - Ronilson Rodrigues Pinho.pdf: 623700 bytes, checksum: 7bc5eefc4b9dab2877f833cbdab95b9f (MD5) / Made available in DSpace on 2017-06-19T12:18:08Z (GMT). No. of bitstreams: 1
Previous issue date: 2014-10-06 / The Boundary Element Method (BEM) is a computational method for solving systems of differential equations formulated as integral equations. It is applied in fluid mechanics, acoustics, electromagnetics, fracture studies and other fields. The BEM requires discretization only of the boundary of the problem geometry, not of its interior as a whole, which reduces the computational effort. To reduce that effort further, parallel computing is an efficient form of information processing that emphasizes the exploitation of concurrent events during software execution. It arises primarily from high computational performance requirements and from the difficulty of increasing the speed of a single processor core. Although multiprocessor and multicore CPUs are easily found today, many algorithms are still not suited to run on parallel architectures. The present study aimed to continue research on parallelism by working on a sequential program, written in Fortran 77 (VERA-TUDELA, 2003), that performs numerical analyses of specific problems (stress and strain in 2D) of solid mechanics via the BEM, with the physical representation of a clamped bar under tension. The implementation of the application is intended to exploit parallelism to the maximum. / O Método de Elementos de Contorno (MEC) é um método computacional para a solução de sistemas de equações diferenciais, formuladas em forma de integrais. Aplicado na Mecânica dos fluidos, Acústica, Eletromagnéticos, Estudo de fraturas etc. O MEC requer discretização apenas no contorno da geometria do problema, mas não do seu interior como um todo, diminuindo o esforço computacional. Com o intuito em diminuir o esforço computacional, a Computação paralela é uma forma eficiente de processamento de informação com ênfase na exploração de eventos simultâneos na execução de um software. Ele surge principalmente devido às elevadas exigências de desempenho computacional e à dificuldade em aumentar a velocidade de um único núcleo de processamento. Apesar das CPUs multiprocessadas, ou processadores multicore, serem facilmente encontrados atualmente, diversos algoritmos ainda não são adequados para executar em arquiteturas paralelas. O presente estudo objetivou-se com o intuito de prosseguir na pesquisa sobre paralelismo, atuando num programa sequencial, desenvolvido na linguagem Fortran 77 (VERA-TUDELA, 2003), que efetua análises numéricas de problemas específicos (tensão e deformação em 2D) da Mecânica dos Sólidos via MEC com representação física da barra engastada e tracionada. A implementação da aplicação, visa explorar o máximo o paralelismo
|
162 |
Um método para paralelização automática de workflows intensivos em dados / A method for automatic paralelization of data-intensive workflows
Elaine Naomi Watanabe 22 May 2017 (has links)
A análise de dados em grande escala é um dos grandes desafios computacionais atuais e está presente não somente em áreas da ciência moderna mas também nos setores público e industrial. Nesses cenários, o processamento dos dados geralmente é modelado como um conjunto de atividades interligadas por meio de fluxos de dados os workflows. Devido ao alto custo computacional, diversas estratégias já foram propostas para melhorar a eficiência da execução de workflows intensivos em dados, tais como o agrupamento de atividades para minimizar as transferências de dados e a paralelização do processamento, de modo que duas ou mais atividades sejam executadas ao mesmo tempo em diferentes recursos computacionais. O paralelismo nesse caso é definido pela estrutura descrita em seu modelo de composição de atividades. Em geral, os Sistemas de Gerenciamento de Workflows, responsáveis pela coordenação e execução dessas atividades em um ambiente distribuído, desconhecem o tipo de processamento a ser realizado e por isso não são capazes de explorar automaticamente estratégias para execução paralela. As atividades paralelizáveis são definidas pelo usuário em tempo de projeto e criar uma estrutura que faça uso eficiente de um ambiente distribuído não é uma tarefa trivial. Este trabalho tem como objetivo prover execuções mais eficientes de workflows intensivos em dados e propõe para isso um método para a paralelização automática dessas aplicações, voltado para usuários não-especialistas em computação de alto desempenho. Este método define nove anotações semânticas para caracterizar a forma como os dados são acessados e consumidos pelas atividades e, assim, levando em conta os recursos computacionais disponíveis para a execução, criar automaticamente estratégias que explorem o paralelismo de dados. O método proposto gera réplicas das atividades anotadas e define também um esquema de indexação e distribuição dos dados do workflow que possibilita maior acesso paralelo. 
Avaliou-se sua eficiência em dois modelos de workflows com dados reais, executados na plataforma de nuvem da Amazon. Usou-se um SGBD relacional (PostgreSQL) e um NoSQL (MongoDB) para o gerenciamento de até 20,5 milhões de objetos de dados em 21 cenários com diferentes configurações de particionamento e replicação de dados. Os resultados obtidos mostraram que a paralelização da execução das atividades promovida pelo método reduziu o tempo de execução do workflow em até 66,6% sem aumentar o seu custo monetário. / The analysis of large-scale datasets is one of the major current computational challenges, present not only in fields of modern science but also in industry and the public sector. In these scenarios, data processing is usually modeled as a set of activities interconnected through data flows, known as workflows. Due to their high computational cost, several strategies have been proposed to improve the efficiency of data-intensive workflows, such as clustering activities to minimize data transfers and parallelizing data processing to reduce makespan, so that two or more activities are performed at the same time on different computational resources. The parallelism, in this case, is defined by the structure of the workflow's activity composition model. In general, Workflow Management Systems are responsible for coordinating and executing these activities in a distributed environment. However, they are not aware of the type of processing that each activity will perform and are therefore unable to automatically explore strategies for parallel execution. Parallelizable activities are defined by the user at workflow design time, and creating a structure that makes efficient use of a distributed environment is not a trivial task.
This work aims to provide more efficient executions of data-intensive workflows and, to that end, proposes a method for the automatic parallelization of these applications, aimed at users who are not specialists in high performance computing. The method defines nine semantic annotations that characterize how data is accessed and consumed by activities and, taking into account the available computational resources, automatically creates strategies that exploit data parallelism. The proposed method generates replicas of the annotated activities. It also defines a workflow data indexing and distribution scheme that allows greater parallel access. Its efficiency was evaluated on two workflow models with real data, executed on the Amazon cloud platform. A relational DBMS (PostgreSQL) and a NoSQL DBMS (MongoDB) were used to manage up to 20.5 million data objects in 21 scenarios with different data partitioning and replication settings. The experiments showed that the parallelization of activity execution promoted by the method reduced the workflow makespan by up to 66.6% without increasing its monetary cost.
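The data-parallel idea in this abstract, replicating an activity and giving each replica its own partition of the input, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's method: the names `partition`, `run_replicas` and `double_value`, the hash-based partitioning and the use of threads are all hypothetical choices made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split (key, value) records into n_parts lists by hashing the key (assumed scheme)."""
    parts = [[] for _ in range(n_parts)]
    for key, value in records:
        parts[hash(key) % n_parts].append((key, value))
    return parts


def run_replicas(activity, records, n_replicas):
    """Run one replica of `activity` per partition and concatenate the outputs."""
    parts = partition(records, n_replicas)
    with ThreadPoolExecutor(max_workers=n_replicas) as pool:
        # Each replica processes only its own partition, so replicas share no data.
        results = list(pool.map(lambda part: [activity(r) for r in part], parts))
    return [out for part_out in results for out in part_out]


def double_value(record):
    """Example activity: handles each record independently of all others."""
    key, value = record
    return key, value * 2
```

Calling `run_replicas(double_value, [(i, i) for i in range(10)], 3)` returns the ten processed records in partition order; sorting the result restores key order. The sketch only works because the example activity treats records independently, which is the kind of property the abstract's semantic annotations are meant to declare.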
|
163 |
William Blake's The Chimney Sweeper: A Stylistic and Allegorical Study
Gummesson, Katja January 2011 (has links)
No description available.
|
164 |
Evaluation of Parallel Programming Standards For Embedded High Performance Computing
James Emmanuel Roy, Muggalla, Garimella, Pradeep January 2010 (has links)
The aim of this project is to evaluate parallel programming standards for embedded high performance computing. Present-day radar signal processing demands high computational speed and performance, so more processors are needed to deliver it. One way of getting high performance is to divide the work among multiple processors while keeping communication overhead low and speedup good. This has been done using the parallel programming standards OpenMP and MPI. We apply them to a radar signal benchmark that is similar to many tasks in radar signal processing. For OpenMP, a shared memory system, a SUNFIRE E2900, is used; for MPI, a SUNFIRE E2900 containing 8 nodes and running SUN HPC cluster tools v5 is used. The OpenMP program shows good speedup up to 5 processors; thereafter an increase in communication overhead is observed. MPI showed low communication overhead at first, but its efficiency decreased as the number of processors increased. Both OpenMP and MPI behave similarly: beyond a certain number of processors, efficiency trends downward and communication overhead increases. According to our results, OpenMP is relatively easy to use compared with MPI, where it is up to the programmer to make explicit calls in order to parallelize.
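The speedup and efficiency figures discussed in this abstract follow the usual definitions, speedup as serial time over parallel time and efficiency as speedup per processor. A minimal sketch, assuming those standard definitions; the function names and the timings in the usage note are made up for illustration and are not measurements from the thesis:

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Efficiency E = S / p: the fraction of ideal linear speedup actually achieved."""
    return speedup(t_serial, t_parallel) / n_procs
```

For example, with a hypothetical serial time of 10.0 s and a parallel time of 2.5 s on 5 processors, `speedup` gives 4.0 and `efficiency` gives 0.8. A falling efficiency as processors are added is the trend the abstract attributes to growing communication overhead.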
|
165 |
A Skeleton Programming Library for Multicore CPU and Multi-GPU Systems
Enmyren, Johan January 2010 (has links)
This report presents SkePU, a C++ template library which provides a simple and unified interface for specifying data-parallel computations with the help of skeletons on GPUs using CUDA and OpenCL. The interface is also general enough to support other architectures, and SkePU implements both a sequential CPU and a parallel OpenMP back end. It also supports multi-GPU systems. Benchmarks show that copying data between the host and the GPU is often a bottleneck. Therefore a container which uses lazy memory copying has been implemented to avoid unnecessary memory transfers. SkePU was evaluated with small benchmarks and a larger application, a Runge-Kutta ODE solver. The results show that skeletal parallel programming is indeed a viable approach for GPU computing and that a generalized interface for multiple back ends is also reasonable. The best performance gains are obtained when the computational load is large compared to memory I/O (the lazy memory copying can help to achieve this). SkePU offers good performance on a more complex and realistic task such as ODE solving, with run times up to ten times faster when using SkePU with a GPU back end than with a sequential solver running on a fast CPU. SkePU does, however, have some disadvantages: there is some overhead in using the library, which can be seen in the dot product and LibSolve benchmarks. Although not big, it is still there, and if performance is of the utmost importance, a hand-coded solution would be best. Nor can all calculations be expressed in terms of skeletons; for such problems, specialized routines must still be created.
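The skeleton idea, where the user supplies a small function and the library supplies the parallel pattern and back end, can be illustrated with a toy sketch. This is not SkePU's API (SkePU is a C++ template library). The class names `MapSkeleton` and `ReduceSkeleton`, the `backend` parameter and the thread-based back end are all hypothetical stand-ins chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator


class MapSkeleton:
    """Apply a user function element-wise; the back end is selected by name."""

    def __init__(self, func, backend="sequential"):
        self.func = func
        self.backend = backend

    def __call__(self, *vectors):
        if self.backend == "threads":
            # Same user function, different execution strategy.
            with ThreadPoolExecutor() as pool:
                return list(pool.map(self.func, *vectors))
        return list(map(self.func, *vectors))


class ReduceSkeleton:
    """Combine all elements of a non-empty vector with an associative binary function."""

    def __init__(self, func):
        self.func = func

    def __call__(self, vector):
        return reduce(self.func, vector)


def dot(a, b, backend="sequential"):
    """Dot product expressed as a map (multiply) followed by a reduce (add)."""
    products = MapSkeleton(operator.mul, backend)(a, b)
    return ReduceSkeleton(operator.add)(products)
```

Here `dot([1, 2, 3], [4, 5, 6])` evaluates to 32 with either back end. The calling code does not change when the back end does, which is the property the abstract describes as a unified interface across back ends.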
|
166 |
Parallélisme en programmation par contraintes / Parallelism in constraint programming
Rezgui, Mohamed 08 July 2015 (has links)
Nous étudions la parallélisation de la procédure de recherche de solution d’un problème en Programmation Par Contraintes (PPC). Après une étude de l’état de l’art, nous présentons une nouvelle méthode, nommée Embarrassingly Parallel Search (EPS). Cette méthode est basée sur la décomposition d’un problème en un très grand nombre de sous-problèmes disjoints qui sont ensuite résolus en parallèle par des unités de calcul avec très peu, voire aucune communication. Le principe d’EPS est d’arriver statistiquement à un équilibrage des temps de résolution de chaque unité de calcul afin d’obtenir une bonne répartition de la charge de travail. EPS s’appuie sur la propriété suivante : la somme des temps de résolution de chacun des sous-problèmes est comparable au temps de résolution du problème en entier. Cette propriété est vérifiée en PPC, ce qui nous permet de disposer d’une méthode simple et efficace en pratique. Dans nos expérimentations, nous nous intéressons à la recherche de toutes les solutions d’un problème en PPC, à prouver qu’un problème n’a pas de solution et à la recherche d’une solution optimale d’un problème d’optimisation. Les résultats montrent que la décomposition doit générer au moins 30 sous-problèmes par unité de calcul pour obtenir des charges de travail par unité de calcul équivalentes. Nous évaluons notre approche sur différentes architectures (machine multi-coeurs, centre de calcul et cloud computing) et montrons qu’elle obtient un gain pratiquement linéaire en fonction du nombre d’unités de calcul. Une comparaison avec les méthodes actuelles telles que le work stealing ou le portfolio montre qu’EPS obtient de meilleurs résultats. / We study the parallelization of the search procedure in Constraint Programming (CP). After an overview of the state of the art, we present a new method named Embarrassingly Parallel Search (EPS). This method is based on the decomposition of a problem into a very large number of disjoint subproblems, which are then solved in parallel by computing units with little or no communication. The principle of EPS is to balance the resolution times of the computing units in a statistical sense, so as to obtain a well-balanced workload. EPS relies on the following property: the sum of the resolution times of all subproblems is comparable to the resolution time of the entire problem. This property holds in CP, which gives us a method that is simple and efficient in practice. In our experiments, we consider enumerating all solutions of a problem, proving that a problem has no solution, and finding an optimal solution of an optimization problem. We observe that the decomposition has to generate at least 30 subproblems per computing unit to get equivalent workloads per computing unit. We then evaluate our approach on different architectures (multicore machine, cluster and cloud computing) and observe a nearly linear speedup. A comparison with current methods such as work stealing and portfolio shows that EPS gets better results.
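The decomposition principle can be sketched on a toy problem. The example below counts N-Queens solutions by fixing the first few rows to create disjoint subproblems and then solving each one independently. It is a simplified illustration only: the function names, the choice of N-Queens and the thread pool are assumptions of this example, and it uses plain backtracking rather than a CP solver. CPython threads will not give a real speedup on this CPU-bound work, so a real implementation would hand subproblems to separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def safe(placed, col):
    """True if a queen in the next row at column `col` attacks no queen in `placed`."""
    row = len(placed)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(placed))


def decompose(n, depth):
    """Enumerate every consistent placement of the first `depth` rows.

    Each prefix is one subproblem; prefixes are pairwise disjoint and together
    cover the whole search space.
    """
    prefixes = [()]
    for _ in range(depth):
        prefixes = [p + (c,) for p in prefixes for c in range(n) if safe(p, c)]
    return prefixes


def count_solutions(n, placed=()):
    """Count the complete solutions that extend the partial placement `placed`."""
    if len(placed) == n:
        return 1
    return sum(count_solutions(n, placed + (c,)) for c in range(n) if safe(placed, c))


def eps_count(n, depth, workers):
    """Solve all subproblems independently and sum their solution counts."""
    subproblems = decompose(n, depth)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Workers never communicate; each pulls the next pending subproblem.
        return sum(pool.map(lambda p: count_solutions(n, p), subproblems))
```

Because the subproblems are disjoint, `eps_count(8, 2, 4)` agrees with the sequential `count_solutions(8)`. Increasing `depth` produces more, smaller subproblems, which is the lever the abstract describes for evening out the load across computing units.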
|
167 |
Modeling performance of serial and parallel sections of multi-threaded programs in many-core era / Modélisation de la performance des sections séquentielles et parallèles au sein de programmes multithreadés à l'ère des many-coeurs
Khizakanchery Natarajan, Surya Narayanan 01 June 2015 (has links)
Ce travail a été effectué dans le contexte d'un projet financé par l'ERC, Defying Amdahl's Law (DAL), dont l'objectif est d'explorer les techniques micro-architecturales améliorant la performance des processeurs multi-cœurs futurs. Le projet prévoit que malgré les efforts investis dans le développement de programmes parallèles, la majorité des codes auront toujours une quantité signifiante de code séquentiel. Pour cette raison, il est primordial de continuer à améliorer la performance des sections séquentielles des-dits programmes. Le travail de recherche de cette thèse porte principalement sur l'étude des différences entre les sections parallèles et les sections séquentielles de programmes multithreadés (MT) existants. L'exploration de l'espace de conception des futurs processeurs multi-cœurs est aussi traitée, tout en gardant à l'esprit les exigences concernant ces deux types de sections ainsi que le compromis performance-surface. / This thesis work was done in the general context of the ERC-funded Defying Amdahl's Law (DAL) project, which aims to explore the micro-architectural techniques that will enable high performance on future many-core processors. The project envisions that, despite future huge investments in developing parallel applications and porting them to parallel architectures, most applications will still exhibit a significant amount of sequential code, and hence we should still focus on improving the performance of the serial sections of applications. In this thesis, the research primarily focuses on studying the differences between the parallel and serial sections of existing multi-threaded (MT) programs, and on exploring the design space of future many-core processors with respect to the core requirements of serial and parallel sections, with the area-performance tradeoff as a primary goal.
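The project's concern with serial sections follows from Amdahl's law: if a fraction p of a program's run time can be parallelized over n cores, the speedup is 1 / ((1 - p) + p / n). A minimal sketch of that standard formula; the function names and the example fractions are illustrative and do not come from the thesis:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Speedup predicted by Amdahl's law for a given parallel fraction and core count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the core count grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)
```

With a hypothetical 90% parallel fraction, 10 cores give a speedup of about 5.26, and no number of cores can exceed a factor of 10. The remaining serial 10% dominates, which is why the abstract argues for continuing to improve serial-section performance.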
|
168 |
Characterizing The Vulnerability Of Parallelism To Resource Constraints
Vivekanand, V 01 1900 (has links) (PDF)
No description available.
|
169 |
Solving the Boolean satisfiability problem using the parallel paradigm / Résolution du problème SAT au travers de la programmation parallèle
Hoessen, Benoît 10 December 2014 (has links)
Cette thèse présente différentes techniques permettant de résoudre le problème de satisfaction de formule booléenes utilisant le parallélisme et du calcul distribué. Dans le but de fournir une explication la plus complète possible, une présentation détaillée de l'algorithme CDCL est effectuée, suivi d'un état de l'art. De ce point de départ, deux pistes sont explorées. La première est une amélioration d'un algorithme de type portfolio, permettant d'échanger plus d'informations sans perte d'efficacité. La seconde est une bibliothèque de fonctions avec son interface de programmation permettant de créer facilement des solveurs SAT distribués. / This thesis presents different techniques for solving the Boolean satisfiability problem using parallel and distributed architectures. In order to provide a complete explanation, a careful presentation of the CDCL algorithm is given, followed by the state of the art in this domain. From this starting point, two proposals are made. The first is an improvement to a portfolio algorithm that allows more data to be exchanged without losing efficiency. The second is a complete library, with its API, that makes it easy to create distributed SAT solvers.
|
170 |
Algorithmes d'étiquetage en composantes connexes efficaces pour architectures hautes performances / Efficient Connected Component Labeling Algorithms for High Performance Architectures
Cabaret, Laurent 28 September 2016 (has links)
Ces travaux de thèse, dans le domaine de l'adéquation algorithme architecture pour la vision par ordinateur, ont pour cadre l'étiquetage en composantes connexes (ECC) dans le contexte parallèle des architectures hautes performances. Alors que les architectures généralistes modernes sont multi-coeur, les algorithmes d'ECC sont majoritairement séquentiels, irréguliers et utilisent une structure de graphe pour représenter les relations d'équivalences entre étiquettes ce qui rend complexe leur parallélisation. L'ECC permet à partir d'une image binaire, de regrouper sous une même étiquette tous les pixels connexes, il fait ainsi le pont entre les traitements bas niveaux tels que le filtrage et ceux de haut niveau tels que la reconnaissance de forme ou la prise de décision. Il est donc impliqué dans un grand nombre de chaînes de traitements qui nécessitent l'analyse d'image segmentées. L'accélération de cette étape représente donc un enjeu pour tout un ensemble d'algorithmes.Les travaux de thèse se sont tout d'abord concentrés sur les performances comparées des algorithmes de l'état de l'art tant pour l'ECC que pour l'analyse des caractéristiques des composantes connexes (ACC) afin d'en dégager une hiérarchie et d’identifier les composantes déterminantes des algorithmes. Pour cela, une méthode d'évaluation des performances, reproductible et indépendante du domaine applicatif, a été proposée et appliquée à un ensemble représentatif des algorithmes de l'état de l'art. Les résultats montrent que l'algorithme séquentiel le plus rapide est l'algorithme LSL qui manipule des segments contrairement aux autres algorithmes qui manipulent des pixels.Dans un deuxième temps, une méthode de parallélisation des algorithmes directs utilisant OpenMP a été proposé avec pour objectif principal de réaliser l’ACC à la volée et de diminuer le coût de la communication entre les threads. 
Pour cela, l'image binaire est découpée en bandes traitées en parallèle sur chaque coeur du l'architecture, puis une étape de fusion pyramidale d'ensembles deux à deux disjoint d'étiquettes permet d'obtenir l'image complètement étiquetée sans avoir de concurrence d'accès aux données entre les différents threads. La procédure d'évaluation des performances appliquée a des machines de degré de parallélisme variés, a démontré que la méthode de parallélisation proposée était efficace et qu'elle s'appliquait à tous les algorithmes directs. L'algorithme LSL s'est encore avéré être le plus rapide et le seul adapté à l'augmentation du nombre de coeurs du fait de son approche «segments». Pour une architecture à 60 coeurs, l'algorithme LSL permet de traiter de 42,4 milliards de pixels par seconde pour des images de taille 8192x8192, tandis que le plus rapide des algorithmes pixels est limité par la bande passante et sature à 5,8 milliards de pixels par seconde.Après ces travaux, notre attention s'est portée sur les algorithmes d'ECC itératifs dans le but de développer des algorithmes pour les architectures manycore et GPU. Les algorithmes itératifs se basant sur un mécanisme de propagation des étiquettes de proche en proche, aucune autre structure que l'image n'est nécessaire ce qui permet d'en réaliser une implémentation massivement parallèle (MPAR). Ces travaux ont menés à la création de deux nouveaux algorithmes.- Une amélioration incrémentale de MPAR utilisant un ensemble de mécanismes tels qu'un balayage alternatif, l'utilisation d'instructions SIMD ainsi qu'un mécanisme de tuiles actives permettant de répartir la charge entre les différents coeurs tout en limitant le traitement des pixels aux zones actives de l'image et à leurs voisines.- Un algorithme mettant en œuvre la relation d’équivalence directement dans l’image pour réduire le nombre d'itérations nécessaires à l'étiquetage. 
Une implémentation pour GPU basée sur les instructions atomic avec un pré-étiquetage en mémoire locale a été réalisée et s'est révélée efficace dès les images de petite taille. / This PhD work takes place in the field of algorithm-architecture matching for computer vision, specifically connected component labeling (CCL) for high performance parallel architectures. While modern architectures are overwhelmingly multi-core, CCL algorithms are mostly sequential and irregular, and they use a graph structure to represent the equivalences between labels, which makes their parallelization challenging. CCL processes a binary image and gathers all connected pixels under the same label; in doing so it bridges low-level operations such as filtering and high-level ones such as shape recognition and decision-making. It is therefore involved in a large number of processing chains that require the analysis of segmented images, and accelerating this step matters for a whole range of algorithms. The PhD work first focused on the comparative performance of state-of-the-art algorithms, for both CCL and connected component feature analysis (CCA), in order to establish a hierarchy and identify the critical components of the algorithms. To this end, a benchmarking method, reproducible and independent of the application domain, was proposed and applied to a representative set of state-of-the-art algorithms.
The results show that the fastest sequential algorithm is the LSL algorithm, which manipulates segments, unlike the other algorithms, which manipulate pixels. Secondly, a parallelization framework for direct algorithms based on OpenMP was proposed, with the main objectives of computing the CCA on the fly and reducing the cost of communication between threads. To this end, the binary image is divided into bands processed in parallel, one on each core of the architecture; a pyramidal fusion step over pairwise disjoint sets of labels then yields the fully labeled image without concurrent data access between threads. The benchmarking procedure, applied to machines with various degrees of parallelism, shows that the proposed parallelization framework is efficient and applies to all direct algorithms. The LSL algorithm is once again the fastest, and the only one that scales as the number of cores increases, thanks to its segment-based (run-based) design. On a 60-core architecture, the LSL algorithm can process 42.4 billion pixels per second for images of 8192x8192 pixels, while the fastest pixel-based algorithm is limited by the bandwidth and saturates at 5.8 billion pixels per second. After this work, our attention turned to iterative CCL algorithms, with the aim of developing new algorithms for many-core and GPU architectures. Iterative algorithms rely on a local label propagation mechanism and need no structure other than the image itself, which allows a massively parallel implementation (MPAR). This work led to the creation of two new algorithms.
- An incremental improvement of MPAR using a set of mechanisms such as alternating scans, SIMD instructions and an active-tile mechanism that distributes the load among the cores while limiting pixel processing to the active areas of the image and their neighbors.
- An algorithm that implements the equivalence relation directly in the image to reduce the number of iterations required for labeling.
A GPU implementation based on atomic instructions, with pre-labeling in local memory, was produced and proved efficient even on small images.
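For readers unfamiliar with direct CCL algorithms, the role of the label equivalence structure can be seen in a classic two-pass labeling with union-find. This is a textbook pixel-based sketch with 4-connectivity, not the LSL algorithm and not the thesis's parallel framework; all function names are chosen for this example.

```python
def find(parent, x):
    """Return the representative label of x, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Record that labels a and b belong to the same component."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def label_components(image):
    """Label a binary image (list of rows of 0/1); returns (labels, component count)."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # index 0 is reserved for the background
    # First pass: assign provisional labels and record equivalences.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # new label, initially its own root
                labels[y][x] = len(parent) - 1
            else:
                labels[y][x] = min(lab for lab in (up, left) if lab)
                if up and left:
                    union(parent, up, left)
    # Second pass: replace each provisional label by a consecutive final label.
    remap = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(parent, labels[y][x])
                labels[y][x] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)
```

The `parent` table is the equivalence graph the abstract mentions. Every foreground pixel may read or update it, which is what makes naive parallelization hard and why the thesis labels bands independently before merging their label sets.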
|