Global ETD Search

21	Automatic Parallelization of Loops with Data Dependent Control Flow and Array Access Patterns Ravishankar, Mahesh 12 November 2014 (has links) No description available. Computer Engineering
22	Are you experienced? Contributions towards experience recognition, cognition, and decision making Chada, Daniel de Magalhães 08 December 2016 (has links) Submitted by Daniel Chada (danielc2112@gmail.com) on 2017-01-10T13:25:02Z No. of bitstreams: 1 chada.phd.2017.01.09.pdf: 5177057 bytes, checksum: a6174d9f2ba0b373776e750def2a23aa (MD5) / Approved for entry into archive by ÁUREA CORRÊA DA FONSECA CORRÊA DA FONSECA (aurea.fonseca@fgv.br) on 2017-01-12T14:03:51Z (GMT) No. of bitstreams: 1 chada.phd.2017.01.09.pdf: 5177057 bytes, checksum: a6174d9f2ba0b373776e750def2a23aa (MD5) / Made available in DSpace on 2017-01-23T11:48:10Z (GMT). No. of bitstreams: 1 chada.phd.2017.01.09.pdf: 5177057 bytes, checksum: a6174d9f2ba0b373776e750def2a23aa (MD5) Previous issue date: 2016-12-08 / Este trabalho consiste em três contribuições independentes do âmbito da modelagem cognitiva ao campo de management science. O primeiro aborda Experience Recognition, uma teoria inicialmente introduzida por Linhares e Freitas [91]. Aqui ela é estendida e delineada, além de se discutir suas contribuições para a ciência cognitiva e management science. A segunda contribuição introduz a framework cognitiva chamada Rotational Sparse Distributed Memory, e fornece uma aplicação-exemplo de suas características como substrato para um fortemente relevante campo da management science: redes semânticas. A contribuição final aplica Rotational Sparse Distributed Memory para a modelagem de motifs de rede, flexibilidade dinâmica e organização hierárquica, três resultados de forte impacto na literatura recente de neurociência. A relevância de uma abordagem baseada na modelagem neurocientífica para a decision science é discutida. / This work is comprised of three independent contributions from the realm of cognitive modeling to management science. The first addresses Experience Recognition, a theory first introduced by Linhares and Freitas [91]. Here it is extended and better defined, and also its contribution to cognitive science and management science are discussed. The second contribution introduces a cognitive framework called Rotational Sparse Distributed Memory, and provides a sample application of its characteristics as a substrate for a highly relevant subject in management science: semantic networks. The final contribution applies Rotational Sparse Distributed Memory to modeling network motifs, dynamic flexibility and hierarchical organization, all highly impactful results in recent neuroscience literature. The relevance of a neuroscientific modeling approach towards a cognitive view of decision science are discussed. Cognitive modeling Rotational Sparse Distributed Memory Network motifs Semantic networks Spreading activation Sparse distributed memory Administração de empresas Administração Ciência cognitiva Memória - Simulação por computador Processo decisório Sistemas de reconhecimento de padrões Inteligência artificial distribuida
23	Performance Prediction of Parallel Programs in a Linux Environment Farooq, Mohammad Habibur Rahman & Qaisar January 2010 (has links) Context. Today’s parallel systems are widely used in different computational tasks. Developing parallel programs to make maximum use of the computing power of parallel systems is tricky and efficient tuning of parallel programs is often very hard. Objectives. In this study we present a performance prediction and visualization tool named VPPB for a Linux environment, which had already been introduced by Broberg et.al, [1] for a Solaris2.x environment. VPPB shows the predicted behavior of a multithreaded program using any number of processors and the behavior is shown on two different graphs. The prediction is based on a monitored uni-processor execution. Methods. An experimental evaluation was carried out to validate the prediction reliability of the developed tool. Results. Validation of prediction is conducted, using an Intel multiprocessor with 8 processors and PARSEC 2.0 benchmark suite application programs. The validation shows that the speed-up predictions are +/-7% of a real execution. Conclusions. The experimentation of the VPPB tool showed that the prediction of VPPB is reliable and the incurred overhead into the application programs is low. / contact: +46(0)736368336 VPPB Pthreads parallel programming performance tuning performance prediction distributed memory system shared memory system Computer Sciences Datavetenskap (datalogi)
24	Parallélisation automatique et statique de tâches sous contraintes de ressources : une approche générique / Automatic Resource-Constrained Static Task Parallelization : A Generic Approach Khaldi, Dounia 27 November 2013 (has links) Le but de cette thèse est d'exploiter efficacement le parallélisme présent dans les applications informatiques séquentielles afin de bénéficier des performances fournies par les multiprocesseurs, en utilisant une nouvelle méthodologie pour la parallélisation automatique des tâches au sein des compilateurs. Les caractéristiques clés de notre approche sont la prise en compte des contraintes de ressources et le caractère statique de l'ordonnancement des tâches. Notre méthodologie contient les techniques nécessaires pour la décomposition des applications en tâches et la génération de code parallèle équivalent, en utilisant une approche générique qui vise différents langages et architectures parallèles. Nous implémentons cette méthodologie dans le compilateur source-à-source PIPS. Cette thèse répond principalement à trois questions. Primo, comme l'extraction du parallélisme de tâches des codes séquentiels est un problème d'ordonnancement, nous concevons et implémentons un algorithme d'ordonnancement efficace, que nous nommons BDSC, pour la détection du parallélisme ; le résultat est un SDG ordonnancé, qui est une nouvelle structure de données de graphe de tâches. Secondo, nous proposons une nouvelle extension générique des représentations intermédiaires séquentielles en des représentations intermédiaires parallèles que nous nommons SPIRE, pour la représentation des codes parallèles. Enfin, nous développons, en utilisant BDSC et SPIRE, un générateur de code que nous intégrons dans PIPS. Ce générateur de code cible les systèmes à mémoire partagée et à mémoire distribuée via des codes OpenMP et MPI générés automatiquement. / This thesis intends to show how to efficiently exploit the parallelism present in applications in order to enjoy the performance benefits that multiprocessors can provide, using a new automatic task parallelization methodology for compilers. The key characteristics we focus on are resource constraints and static scheduling. This methodology includes the techniques required to decompose applications into tasks and generate equivalent parallel code, using a generic approach that targets both different parallel languages and architectures. We apply this methodology in the existing tool PIPS, a comprehensive source-to-source compilation platform. This thesis mainly focuses on three issues. First, since extracting task parallelism from sequential codes is a scheduling problem, we design and implement an efficient, automatic scheduling algorithm called BDSC for parallelism detection; the result is a scheduled SDG, a new task graph data structure. In a second step, we design a new generic parallel intermediate representation extension called SPIRE, in which parallelized code may be expressed. Finally, we wrap up our goal of automatic parallelization in a new BDSC- and SPIRE-based parallel code generator, which is integrated within the PIPS compiler framework. It targets both shared and distributed memory systems using automatically generated OpenMP and MPI code. Ordonnancement Pips Graphe de tâches Spire Scheduling Shared and distributed memory systems Pips Task graph Spire
25	Programmation haute performance pour architectures hybrides / High Performance Programming for Hybrid Architectures Habel, Rachid 19 November 2014 (has links) Les architectures parallèles hybrides constituées d'un grand nombre de noeuds de calcul multi-coeurs/GPU connectés en réseau offrent des performances théoriques très élevées, de l'ordre de quelque dizaines de TeraFlops. Mais la programmation efficace de ces machines reste un défi à cause de la complexité de l'architecture et de la multiplication des modèles de programmation utilisés. L'objectif de cette thèse est d'améliorer la programmation des applications scientifiques denses sur les architectures parallèles hybrides selon trois axes: réduction des temps d'exécution, traitement de données de très grande taille et facilité de programmation. Nous avons pour cela proposé un modèle de programmation à base de directives appelé DSTEP pour exprimer à la fois la distribution des données et des calculs. Dans ce modèle, plusieurs types de distribution de données sont exprimables de façon unifiée à l'aide d'une directive "dstep distribute" et une réplication de certains éléments distribués peut être exprimée par un "halo". La directive "dstep gridify" exprime à la fois la distribution des calculs ainsi que leurs contraintes d'ordonnancement. Nous avons ensuite défini un modèle de distribution et montré la correction de la transformation de code du domaine séquentiel au domaine distribué. À partir du modèle de distribution, nous avons dérivé un schéma de compilation pour la transformation de programmes annotés de directives DSTEP en des programmes parallèles hybrides. Nous avons implémenté notre solution sous la forme d'un compilateur intégré à la plateforme de compilation PIPS ainsi qu'une bibliothèque fournissant les fonctionnalités du support d'exécution, notamment les communications. Notre solution a été validée sur des programmes de calcul scientifiques standards tirés des NAS Parallel Benchmarks et des Polybenchs ainsi que sur une application industrielle. / Clusters of multicore/GPU nodes connected with a fast network offer very high therotical peak performances, reaching tens of TeraFlops. Unfortunately, the efficient programing of such architectures remains challenging because of their complexity and the diversity of the existing programming models. The purpose of this thesis is to improve the programmability of dense scientific applications on hybrid architectures in three ways: reducing the execution times, processing larger data sets and reducing the programming effort. We propose DSTEP, a directive-based programming model expressing both data and computation distribution. A large set of distribution types are unified in a "dstep distribute" directive and the replication of some distributed elements can be expressed using a "halo". The "dstep gridify" directive expresses both the computation distribution and the schedule constraints of loop iterations. We define a distribution model and demonstrate the correctness of the code transformation from the sequential domain to the parallel domain. From the distribution model, we derive a generic compilation scheme transforming DSTEP annotated input programs into parallel hybrid ones. We have implemented such a tool as a compiler integrated to the PIPS compilation workbench together with a library offering the runtime functionality, especially the communication. Our solution is validated on scientific programs from the NAS Parallel Benchmarks and the PolyBenchs as well as on an industrial signal procesing application. Compilation Mémoire distribuée Mémoire partagée Gpu Mpi OpenMP Compilation Distributed-Memory Shared-Memory Gpu Mpi OpenMP 004
26	Lokale Realisierung von Vektoroperationen auf Parallelrechnern Groh, U. 30 October 1998 (has links) For the basic algebraic vector operations several variants of a local implementation on distributed memory parallel computers are presented and discussed systematically. In particular necessary and sufficient conditions are shown for the local realizability of the multiplication matrix by vector. info:eu-repo/classification/ddc/510 ddc:510
27	Lattice QCD Optimization and Polytopic Representations of Distributed Memory / Optimisation de LatticeQCD et représentations polytopiques de la mémoire distribuée Kruse, Michael 26 September 2014 (has links) La physique actuelle cherche, à côté des expériences, à vérifier et déduire les lois de la nature en simulant les modèles physiques sur d'énormes ordinateurs. Cette thèse explore comment accélérer ces simulations en améliorant les programmes qui les font tourner. L'application de référence est la chromodynamique quantique sur réseaux (LQCD pour "Lattice Quantum Chromodynamics"), une branche de la théorie quantique des champs, tournant sur le plus récent des supercalculateurs d'IBM, le Blue Gene/Q.Dans un premier temps, on améliore le code source de tmLQCD, un programme de LQCD, dont l'opération clef pour la performance est un stencil à 8 points en dimension 4. On étudie deux stratégies d'optimisation différentes: la première se donne comme priorité d'améliorer la localité spatiale et temporelle; la seconde utilise le préchargement matériel de flux de données. Sur le Blue Gene/Q, la première stratégie permet d'atteindre 20% de la performance crête théorique. La seconde, avec jusqu'à 54% de la performance crête est bien meilleure mais utilise 4 fois plus de mémoire car elle stocke les résultats dans l'ordre où les utilise le stencil suivant, ce qui requiert de dupliquer des données. Les autres techniques exploitées sont la programmation directe du système de communication (appelé MUSPI chez IBM), un mécanisme allégé de gestion des threads, le préchargement explicite de certaines données (à l'aide de l'instruction dcbt) et la vectorisation manuelle (en utilisant les instructions SIMD de largeur 4; appelé QPX par IBM). Le préchargement de liste et la mémoire transactionnelle - deux nouveaux mécanismes du Blue Gene/Q - n'améliorent pas les performances.Dans un second temps, on présente la réalisation d'une extension appelé Molly au compilateur LLVM, pour optimiser automatiquement le programme, et plus précisément la distribution des données et des calculs entre les nœuds d'un cluster tel que le Blue Gene/Q. Molly représente les tableaux par des polyèdres entiers et utilise l'extension existante Polly qui représente les boucles et les instructions par des polyèdres. Partant de la spécification de la distribution des données et de l'emplacement des calculs, Molly ajoute le code qui gère les flots de données entre les nœuds de calcul. Molly peut aussi permuter l'ordre des données en mémoire. La tâche principale de Molly est d'agréger les données dans des ensembles qui sont envoyés dans le même tampon au même destinataire, pour éviter l'overhead des transferts trop petits. Nous présentons un algorithme qui minimise le nombre de transferts pour des boucles non-paramétrées, basé sur les antichaînes du flot des données. De plus, nous implémentons une heuristique qui tient compte de la manière dont le programmeur a écrit son code. Les primitives de communication asynchrone sont insérées juste après que les données soient disponibles - respectivement juste avant qu'elles soient utilisées. Une bibliothèque runtime implémente ces primitives en utilisant MPI. Molly gère la distribution pour tout code représentable dans le modèle polyédrique, mais fonctionne mieux pour du code à stencil tel LQCD. Compilé avec Molly, le code LQCD atteint 2,5% de la performance crête. L'écart de performance est surtout dû au fait que les autres optimisations ne sont pas faites, par exemple la vectorisation. Les versions futures de Molly pourraient aussi gérer efficacement les codes non à stencil et exploiter les autres optimisations qui ont rendu le code LQCD optimisé à la main si rapide. / Motivated by modern day physics which in addition to experiments also tries to verify and deduce laws of nature by simulating the state-of-the-art physical models using oversized computers, this thesis explores means of accelerating such simulations by improving the simulation programs they run. The primary focus is Lattice Quantum Chromodynamics (QCD), a branch of quantum field theory, running on IBM newest supercomputer, the Blue Gene/Q.In a first approach, the source code of tmLQCD, a Lattice QCD program, is improved to run faster on the Blue Gene machine. Its most performance-relevant operation is a 8-point stencil in 4 dimensional space. Two different optimization strategies are perused: One with the priority of improving spatial and temporal locality, and a second making use of the hardware's data stream prefetcher. On Blue Gene/Q the first strategy reaches up to 20% of the peak theoretical floating point operation performance of that machine. The second strategy with up to 54% of peak is much faster at the cost of using 4 times more memory by storing the data in the order they will be used in the next stencil operation, duplicating data where necessary.Other techniques exploited are direct programming of the messaging hardware (called MUSPI by IBM), a low-overhead work distribution mechanism for threads, explicit data prefetching of data (using dcbt instruction) and manual vectorization (using QPX; width-4 SIMD instructions). Hardware-based list prefetching and transactional memory - both distinct and novel features of the Blue Gene/Q system -- did not improve the program's performance.The second approach is the newly-written LLVM compiler extension called Molly which optimizes the program itself, specifically the distribution of data and work between the nodes of a cluster machine such as Blue Gene/Q. Molly represents arrays using integer polyhedra and uses another already existing compiler extension Polly which represents statements and loops using polyhedra. When Molly knows how data is distributed among the nodes and where statements are executed, it adds code that manages the data flow between the nodes. Molly can also permute the order of data in memory. Molly's main task is to cluster data into sets that are sent to the same target into the same buffer because single transfers involve a massive overhead. We present an algorithm that minimizes the number of transfers for unparametrized loops using anti-chains of data flows. In addition, we implement a heuristic that takes into account how the programmer wrote the code. Asynchronous communication primitives are inserted right after the data is available respectively just before it is used. A runtime library implements these primitives using MPI.Molly manages to distribute any code that is representable by the polyhedral model, but does so best for stencils codes such as Lattice QCD. Compiled using Molly, the Lattice QCD stencil reaches 2.5% of the theoretical peak performance. The performance gap is mostly because all the other optimizations are missing, such as vectorization. Future versions of Molly may also effectively handle non-stencil codes and use make use of all the optimizations that make the manually optimized Lattice QCD stencil so fast. Blue Gene Lattice QCD Mémoire distribuée LLVM Clang Modèle polyhédrique Parallélisation Blue Gene Lattice QCD Distributed Memory LLVM Clang Polyhedral Model Parallelization
28	Alinhamento de seqüências biológicas em arquiteturas com memória distribuída Peranconi, Daniela Saccol 04 March 2005 (has links) Made available in DSpace on 2015-03-05T13:53:44Z (GMT). No. of bitstreams: 0 Previous issue date: 4 / Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / A utilização de aglomerados de computadores na solução de problemas que demandam grande quantidade de recursos computacionais vem se mostrando uma alternativa interessante. aglomerados são economicamente viáveis e de fácil manutenção, oferecendo poder computacional equivalente ao de supercomputadores. No entanto, o desenvolvimento de aplicações para este tipo de arquitetura é complexo, uma vez que envolve questões não presentes na programação seqüencial, como a comunicação de dados e a sincronização de tarefas concorrentes, problemas estes que, em geral, são tratados em supercomputadores por pacotes de software especializados. Neste contexto, este trabalho apresenta o desenvolvimento de um mecanismo de suporte à comunicação sobre aglomerados de computadores, focado na exploração desta plataforma de hardware para o processamento de alto desempenho. O mecanismo criado e disponibilizado sob a forma de uma biblioteca de funções em C, é baseado no modelo de Mensagens Ativas. Sua implementação é realizada na cama / The use of cluster of computers for solving problems that require a great quantity of computational resources is becoming an interesting alternative. Clusters are economically feasible and of easy maintenance, offering a computational power equivalent to that of supercomputers. However developing applications for this kind of architecture is complex because it involves issues that are not present in the sequential programming such as data communication and concurrent tasks synchronization, problems that usually are handled by specialized software packages in supercomputers. Considering this context, this work presents the development of a mechanism for supporting communication on clusters of computers focused on exploring this hardware platform for high performance processing. The mechanism was created as a library of functions written in C and it is based on the Active Messages model. Its implementation was performed on the applicative level, using light multiprogramming techniques as programming resou Ciências Exatas e da Terra memória distribuída mensagens ativas multiprogramação leve processamento de alto desempenho alinhamento de seqüências Multiprogramação active aessages distributed memory high performance processing multithreading sequence alignment
29	Design by transformation : from domain knowledge to optimized program generation Marker, Bryan Andrew 20 June 2014 (has links) Expert design knowledge is essential to develop a library of high-performance software. This includes how to implement and parallelize domain operations, how to optimize implementations, and estimates of which implementation choices are best. An expert repeatedly applies his knowledge, often in a rote and tedious way, to develop all of the related functionality expected from a domain-specific library. Expert knowledge is hard to gain and is easily lost over time when an expert forgets or when a new engineer starts developing code. The domain of dense linear algebra (DLA) is a prime example with software that is so well designed that much of experts' important work has become tediously rote in many ways. In this dissertation, we demonstrate how one can encode design knowledge for DLA so it can be automatically applied to generate code as an expert would or to generate better code. Further, the knowledge is encoded for perpetuity, so it can be reused to make implementing functionality on new hardware easier or it can be used to teach how software is designed to a non-expert. We call this approach to software engineering (encoding expert knowledge and automatically applying it) Design by Transformation (DxT). We present our vision, the methodology, a prototype code generation system, and possibilities when applying DxT to the domain of dense linear algebra. / text Automatic program generation Software engineering Code transformation High-performance computing Distributed-memory computing Shared-memory computing Dense linear algebra Design by transformation
30	Scaling the solution of large sparse linear systems using multifrontal methods on hybrid shared-distributed memory architectures / Scalabilité des méthodes multifrontales pour la résolution de grands systèmes linéaires creux sur architectures hybrides à mémoire partagée et distribuée Sid Lakhdar, Mohamed Wissam 01 December 2014 (has links) La résolution de systèmes d'équations linéaires creux est au cœur de nombreux domaines d'applications. De même que la quantité de ressources de calcul augmente dans les architectures modernes, offrant ainsi de nouvelles perspectives, la taille des problèmes rencontré de nos jours dans les applications de simulations numériques augmente aussi et de façon significative. L'exploitation des architectures modernes pour la résolution efficace de problèmes de très grande taille devient ainsi un défit a relever, aussi bien d'un point de vue théorique que d'un point de vue algorithmique. L'objectif de cette thèse est d'adresser les problèmes de scalabilité des solveurs creux directs basés sur les méthodes multifrontales en environnements parallèles asynchrones. Dans la première partie de la thèse, nous nous intéressons a l'exploitation du parallélisme multicoeur sur les architectures a mémoire partagée. Nous introduisons une variante de l'algorithme Geist-Ng afin de gérer aussi bien un parallélisme a grain fin, a travers l'utilisation de librairies BLAS séquentiel et parallèle optimisées, que d'un parallélisme a plus gros grain, a travers l'utilisation de parallélisme a base de directives OpenMP. Nous considérons aussi des aspects mémoire afin d'améliorer les performances sur des architectures NUMA: (i) d'une part, nous analysons l'influence de la localité mémoire et utilisons des stratégies d'allocation mémoire adaptatives pour gérer les espaces de travail privés et partagés; (ii) d'autre part, nous nous intéressons au problème de partages de ressources sur les architectures multicoeurs, qui induisent des pénalités en termes de performances. Enfin, afin d'éviter que des ressources ne reste inertes a la fin de l'exécution de leurs taches, et ainsi, afin d'exploiter au mieux les ressources disponibles, nous proposons un algorithme conceptuellement proche de l'approche dite de vol de travail, et qui consiste a assigner les ressources de calculs inactives au taches de travail actives de façon dynamique. Dans la deuxième partie de cette thèse, nous nous intéressons aux architectures hybrides, a base de mémoire partagées et de mémoire distribuées, pour lesquels un travail particulier est nécessaire afin d'améliorer la scalabilité du traitement de problèmes de grande taille. Nous étudions et optimisons tout d'abord les noyaux d'algèbre linéaire danse utilisé dans les méthodes multifrontales en environnent distribué asynchrone, en repensant les variantes right-looking et left-looking de la factorisation LU avec pivotage partiel dans notre contexte distribué. De plus, du fait du parallélisme multicoeurs, la proportion des communications relativement aux calculs et plus importante. Nous expliquons comment construire des algorithmes de mapping qui minimisent les communications entres nœuds de l'arbre de dépendances de la méthode multifrontale. Nous montrons aussi que les communications asynchrones collectives deviennent christiques sur grand nombres de processeurs, et que les broadcasts asynchrones a base d'arbres de broadcast doivent être utilisés. Nous montrons ensuite que dans un contexte multifrontale complètement asynchrone, où plusieurs instances de tels communications ont lieux, de nouveaux problèmes de synchronisation apparaissent. Nous analysons et caractérisons les situations de deadlock possibles et établissons formellement des propriétés générales simples afin de résoudre ces problèmes de deadlock. Nous établissons par la suite des propriétés nous permettant de relâcher les synchronisations induites par la solutions précédentes, et ainsi, d'améliorer les performances. Enfin, nous montrons que les synchronisations peuvent être relâchées dans un solveur creux danse et illustrons les gains en performances, sur des problèmes de grande taille issue d'applications réelles, dans notre environnement multifrontale complètement asynchrone. / The solution of sparse systems of linear equations is at the heart of numerous applicationfields. While the amount of computational resources in modern architectures increases and offersnew perspectives, the size of the problems arising in today’s numerical simulation applicationsalso grows very much. Exploiting modern architectures to solve very large problems efficiently isthus a challenge, from both a theoretical and an algorithmic point of view. The aim of this thesisis to address the scalability of sparse direct solvers based on multifrontal methods in parallelasynchronous environments.In the first part of this thesis, we focus on exploiting multi-threaded parallelism on sharedmemoryarchitectures. A variant of the Geist-Ng algorithm is introduced to handle both finegrain parallelism through the use of optimized sequential and multi-threaded BLAS libraries andcoarser grain parallelism through explicit OpenMP based parallelization. Memory aspects arethen considered to further improve performance on NUMA architectures: (i) on the one hand,we analyse the influence of memory locality and exploit adaptive memory allocation strategiesto manage private and shared workspaces; (ii) on the other hand, resource sharing on multicoreprocessors induces performance penalties when many cores are active (machine load effects) thatwe also consider. Finally, in order to avoid resources remaining idle when they have finishedtheir share of the work, and thus, to efficiently exploit all computational resources available, wepropose an algorithm wich is conceptually very close to the work-stealing approach and whichconsists in dynamically assigning idle cores to busy threads/activities.In the second part of this thesis, we target hybrid shared-distributed memory architectures,for which specific work to improve scalability is needed when processing large problems. We firststudy and optimize the dense linear algebra kernels used in distributed asynchronous multifrontalmethods. Simulation, experimentation and profiling have been performed to tune parameterscontrolling the algorithm, in correlation with problem size and computer architecture characteristics.To do so, right-looking and left-looking variants of the LU factorization with partialpivoting in our distributed context have been revisited. Furthermore, when computations are acceleratedwith multiple cores, the relative weight of communication with respect to computationis higher. We explain how to design mapping algorithms minimizing the communication betweennodes of the dependency tree of the multifrontal method, and show that collective asynchronouscommunications become critical on large numbers of processors. We explain why asynchronousbroadcasts using standard tree-based communication algorithms must be used. We then showthat, in a fully asynchronous multifrontal context where several such asynchronous communicationtrees coexist, new synchronization issues must be addressed. We analyse and characterizethe possible deadlock situations and formally establish simple global properties to handle deadlocks.Such properties partially force synchronization and may limit performance. Hence, wedefine properties which enable us to relax synchronization and thus improve performance. Ourapproach is based on the observation that, in our case, as long as memory is available, deadlockscannot occur and, consequently, we just need to keep enough memory to guarantee thata deadlock can always be avoided. Finally, we show that synchronizations can be relaxed in astate-of-the-art solver and illustrate the performance gains on large real problems in our fullyasynchronous multifrontal approach. Algèbre linéaire creuse Méthode multifrontale Mémoire partagée Mémoire distribuée Architectures NUMA Asynchronisme Sparse linear algebra Multifrontal method Shared-memory Distributed-memory NUMA architectures Asynchronism

Search results