Global ETD Search

11	Modèles de programmation et supports exécutifs pour architectures hétérogènes / Programming Models and Runtime Systems for Heterogeneous Architectures Henry, Sylvain 14 November 2013 (has links) Le travail réalisé lors de cette thèse s'inscrit dans le cadre du calcul haute performance sur architectures hétérogènes. Pour faciliter l'écriture d'applications exploitant ces architectures et permettre la portabilité des performances, l'utilisation de supports exécutifs automatisant la gestion des certaines tâches (gestion de la mémoire distribuée, ordonnancement des noyaux de calcul) est nécessaire. Une approche bas niveau basée sur le standard OpenCL est proposée ainsi qu'une approche de plus haut niveau basée sur la programmation fonctionnelle parallèle, la seconde permettant de pallier certaines difficultés rencontrées avec la première (notamment l'adaptation de la granularité). / This work takes part in the context of high-performance computing on heterogeneous architectures. Runtime systems are increasingly used to make programming these architectures easier and to ensure performance portability by automatically dealing with some tasks (management of the distributed memory, scheduling of the computational kernels...). We propose a low-level approach based on the OpenCL specification as well as a high-level approach based on parallel functional programming. Calcul haute performance Supports exécutifs Modèles de programmation Architectures hétérogènes High-Performance Computing Runtime Systems Programming Models Heterogeneous Architectures
12	A Modular Platform for Adaptive Heterogeneous Many-Core Architectures Atef, Ahmed Kamaleldin 18 December 2023 (has links) Multi-/many-core heterogeneous architectures are shaping current and upcoming generations of compute-centric platforms which are widely used starting from mobile and wearable devices to high-performance cloud computing servers. Heterogeneous many-core architectures sought to achieve an order of magnitude higher energy efficiency as well as computing performance scaling by replacing homogeneous and power-hungry general-purpose processors with multiple heterogeneous compute units supporting multiple core types and domain-specific accelerators. Drifting from homogeneous architectures to complex heterogeneous systems is heavily adopted by chip designers and the silicon industry for more than a decade. Recent silicon chips are based on a heterogeneous SoC which combines a scalable number of heterogeneous processing units from different types (e.g. CPU, GPU, custom accelerator). This shifting in computing paradigm is associated with several system-level design challenges related to the integration and communication between a highly scalable number of heterogeneous compute units as well as SoC peripherals and storage units. Moreover, the increasing design complexities make the production of heterogeneous SoC chips a monopoly for only big market players due to the increasing development and design costs. Accordingly, recent initiatives towards agile hardware development open-source tools and microarchitecture aim to democratize silicon chip production for academic and commercial usage. Agile hardware development aims to reduce development costs by providing an ecosystem for open-source hardware microarchitectures and hardware design processes. Therefore, heterogeneous many-core development and customization will be relatively less complex and less time-consuming than conventional design process methods. In order to provide a modular and agile many-core development approach, this dissertation proposes a development platform for heterogeneous and self-adaptive many-core architectures consisting of a scalable number of heterogeneous tiles that maintain design regularity features while supporting heterogeneity. The proposed platform hides the integration complexities by supporting modular tile architectures for general-purpose processing cores supporting multi-instruction set architectures (multi-ISAs) and custom hardware accelerators. By leveraging field-programmable-gate-arrays (FPGAs), the self-adaptive feature of the many-core platform can be achieved by using dynamic and partial reconfiguration (DPR) techniques. This dissertation realizes the proposed modular and adaptive heterogeneous many-core platform through three main contributions. The first contribution proposes and realizes a many-core architecture for heterogeneous ISAs. It provides a modular and reusable tilebased architecture for several heterogeneous ISAs based on open-source RISC-V ISA. The modular tile-based architecture features a configurable number of processing cores with different RISC-V ISAs and different memory hierarchies. To increase the level of heterogeneity to support the integration of custom hardware accelerators, a novel hybrid memory/accelerator tile architecture is developed and realized as the second contribution. The hybrid tile is a modular and reusable tile that can be configured at run-time to operate as a scratchpad shared memory between compute tiles or as an accelerator tile hosting a local hardware accelerator logic. The hybrid tile is designed and implemented to be seamlessly integrated into the proposed tile-based platform. The third contribution deals with the self-adaptation features by providing a reconfiguration management approach to internally control the DPR process through processing cores (RISC-V based). The internal reconfiguration process relies on a novel DPR controller targeting FPGA design flow for RISC-V-based SoC to change the types and functionalities of compute tiles at run-time. info:eu-repo/classification/ddc/004 ddc:004
13	Ordonnancement d'applications à flux de données pour les MPSoC embarqués hybrides comprenant des unités de calcul programmables et des accélérateurs matériels / Scheduling of dynamic streaming applications on hybrid embedded MPSoCs comprising programmable computing units and hardware accelerators Arras, Paul-Antoine 03 February 2015 (has links) Bien que de nombreux appareils numériques soient aujourd'hui capables de lire des contenus vidéo en temps réel et d'offrir une restitution de grande qualité, le décodage vidéo dans les systèmes embarqués n'en est pas pour autant devenu une opération anodine. En effet, les codecs récents tels que H.264 et HEVC sont d'une complexité telle que le recours à des architectures mixtes logiciel/matériel est presque incontournable. Or les plateformes de ce type sont notoirement difficiles à programmer efficacement. Cette thèse relève le défi du développement d'applications à flux de données pour les cibles embarquées hybrides et de leur exécution efficace, et propose plusieurs contributions. La première est une extension des heuristiques d'ordonnancement de liste pour tenir compte des contraintes mémorielles. La seconde est un modèle d'exécution à flot de données compatible avec la plupart des modèles existants et avec une large classe de plateformes matérielles, ainsi qu'un ordonnanceur dynamique. Enfin, de nombreux développements ont été menés sur une architecture réelle de STMicroelectronics pour démontrer la faisabilité de l'approche. / Although numerous electronic devices are nowadays able to play video contents in real time and offer high-quality reproduction, video decoding in embedded systems has not become a trivial process yet. As a mater of fact, recent codecs such as H.264 and HEVC exhibit such a complexity that resorting to mixed sofware-hardware architecture is almost unavoidable. However, programming efficiently this kind of platforms is well-known to be tricky. This thesis addresses the issue of developing streaming applications for hybrid embedded targets and executing them efficiently, and proposes several contributions. The first one is an extension of the classical list-scheduling heuristics to take memory constraints into account. Te second one is a datafow execution model compatible with most existing models and with a large set of hardware platforms, as well as a dynamic scheduler. Lastly, numerous developments have been carried out on a real-world architecture from STMicroelectronics so as to demonstrate the feasibility of the approach. Ordonnancement Flot de données Flux de données Modèle d’exécution Architectures hétéroghènes Architectures hybrides Systèmes embarqués Scheduling Datafow Streaming Execution model Heterogeneous architectures Hybrid architectures Embedded systems
14	Accurate Residual-distribution Schemes for Accelerated Parallel Architectures Guzik, Stephen Michael Jan 12 August 2010 (has links) Residual-distribution methods offer several potential benefits over classical methods, such as a means of applying upwinding in a multi-dimensional manner and a multi-dimensional positivity property. While it is apparent that residual-distribution methods also offer higher accuracy than finite-volume methods on similar meshes, few studies have directly compared the performance of the two approaches in a systematic and quantitative manner. In this study, comparisons between residual distribution and finite volume are made for steady-state smooth and discontinuous flows of gas dynamics, governed by hyperbolic conservation laws, to illustrate the strengths and deficiencies of the residual-distribution method. Deficiencies which reduce the accuracy are analyzed and a new nonlinear scheme is proposed that closely reproduces or surpasses the accuracy of the best linear residual-distribution scheme. The accuracy is further improved by extending the scheme to fourth order using established finite-element techniques. Finally, the compact stencil, arithmetic workload, and data parallelism of the fourth-order residual-distribution scheme are exploited to accelerate parallel computations on an architecture consisting of both CPU cores and a graphics processing unit. Numerical experiments are used to assess the gains to efficiency and possible monetary savings that may be provided by accelerated architectures. residual distribution parallel architectures GPGPU heterogeneous architectures computational fluid dynamics CUDA high order partial differential equation multi-dimensional upwind multi-dimensional limiter 0538
15	Accurate Residual-distribution Schemes for Accelerated Parallel Architectures Guzik, Stephen Michael Jan 12 August 2010 (has links) Residual-distribution methods offer several potential benefits over classical methods, such as a means of applying upwinding in a multi-dimensional manner and a multi-dimensional positivity property. While it is apparent that residual-distribution methods also offer higher accuracy than finite-volume methods on similar meshes, few studies have directly compared the performance of the two approaches in a systematic and quantitative manner. In this study, comparisons between residual distribution and finite volume are made for steady-state smooth and discontinuous flows of gas dynamics, governed by hyperbolic conservation laws, to illustrate the strengths and deficiencies of the residual-distribution method. Deficiencies which reduce the accuracy are analyzed and a new nonlinear scheme is proposed that closely reproduces or surpasses the accuracy of the best linear residual-distribution scheme. The accuracy is further improved by extending the scheme to fourth order using established finite-element techniques. Finally, the compact stencil, arithmetic workload, and data parallelism of the fourth-order residual-distribution scheme are exploited to accelerate parallel computations on an architecture consisting of both CPU cores and a graphics processing unit. Numerical experiments are used to assess the gains to efficiency and possible monetary savings that may be provided by accelerated architectures. residual distribution parallel architectures GPGPU heterogeneous architectures computational fluid dynamics CUDA high order partial differential equation multi-dimensional upwind multi-dimensional limiter 0538
16	Localisation et cartographie simultanées par optimisation de graphe sur architectures hétérogènes pour l’embarqué / Embedded graph-based simultaneous localization and mapping on heterogeneous architectures Dine, Abdelhamid 05 October 2016 (has links) La localisation et cartographie simultanées connue, communément, sous le nom de SLAM (Simultaneous Localization And Mapping) est un processus qui permet à un robot explorant un environnement inconnu de reconstruire une carte de celui-ci tout en se localisant, en même temps, sur cette carte. Dans ce travail de thèse, nous nous intéressons au SLAM par optimisation de graphe. Celui-ci utilise un graphe pour représenter et résoudre le problème de SLAM. Une optimisation de graphe consiste à trouver une configuration de graphe (trajectoire et carte) qui correspond le mieux aux contraintes introduites par les mesures capteurs. L'optimisation de graphe présente une forte complexité algorithmique et requiert des ressources de calcul et de mémoire importantes, particulièrement si l'on veut explorer de larges zones. Cela limite l'utilisation de cette méthode dans des systèmes embarqués temps-réel. Les travaux de cette thèse contribuent à l'atténuation de la complexité de calcul du SLAM par optimisation de graphe. Notre approche s’appuie sur deux axes complémentaires : la représentation mémoire des données et l’implantation sur architectures hétérogènes embarquées. Dans le premier axe, nous proposons une structure de données incrémentale pour représenter puis optimiser efficacement le graphe. Dans le second axe, nous explorons l'utilisation des architectures hétérogènes récentes pour accélérer le SLAM par optimisation de graphe. Nous proposons, donc, un modèle d’implantation adéquat aux applications embarquées en mettant en évidence les avantages et les inconvénients des architectures évaluées, à savoir SoCs basés GPU et FPGA. / Simultaneous Localization And Mapping is the process that allows a robot to build a map of an unknown environment while at the same time it determines the robot position on this map.In this work, we are interested in graph-based SLAM method. This method uses a graph to represent and solve the SLAM problem. A graph optimization consists in finding a graph configuration (trajectory and map) that better matches the constraints introduced by the sensors measurements. Graph optimization is characterized by a high computational complexity that requires high computational and memory resources, particularly to explore large areas. This limits the use of graph-based SLAM in real-time embedded systems. This thesis contributes to the reduction of the graph-based computational complexity. Our approach is based on two complementary axes: data representation in memory and implementation on embedded heterogeneous architectures. In the first axis, we propose an incremental data structure to efficiently represent and then optimize the graph. In the second axis, we explore the use of the recent heterogeneous architectures to speed up graph-based SLAM. We propose an efficient implementation model for embedded applications. We highlight the advantages and disadvantages of the evaluated architectures, namely GPU-based and FPGA-based System-On-Chips. : SLAM par optimisation de graphe Complexité de calcul Structure de données Architectures hétérogènes embarquées Evaluation de performances Graph-Based SLAM Computational complexity Data structure Embedded heterogeneous architectures Performances evaluation
17	On the design of sparse hybrid linear solvers for modern parallel architectures / Sur la conception de solveurs linéaires hybrides pour les architectures parallèles modernes Nakov, Stojce 14 December 2015 (has links) Dans le contexte de cette thèse, nous nous focalisons sur des algorithmes pour l’algèbre linéaire numérique, plus précisément sur la résolution de grands systèmes linéaires creux. Nous mettons au point des méthodes de parallélisation pour le solveur linéaire hybride MaPHyS. Premièrement nous considerons l'aproche MPI+threads. Dans MaPHyS, le premier niveau de parallélisme consiste au traitement indépendant des sous-domaines. Le second niveau est exploité grâce à l’utilisation de noyaux multithreadés denses et creux au sein des sous-domaines. Une telle implémentation correspond bien à la structure hiérarchique des supercalculateurs modernes et permet un compromis entre les performances numériques et parallèles du solveur. Nous démontrons la flexibilité de notre implémentation parallèle sur un ensemble de cas tests. Deuxièmement nous considérons un approche plus innovante, où les algorithmes sont décrits comme des ensembles de tâches avec des inter-dépendances, i.e., un graphe de tâches orienté sans cycle (DAG). Nous illustrons d’abord comment une première parallélisation à base de tâches peut être obtenue en composant des librairies à base de tâches au sein des processus MPI illustrer par un prototype d’implémentation préliminaire de notre solveur hybride. Nous montrons ensuite comment une approche à base de tâches abstrayant entièrement le matériel peut exploiter avec succès une large gamme d’architectures matérielles. À cet effet, nous avons implanté une version à base de tâches de l’algorithme du Gradient Conjugué et nous montrons que l’approche proposée permet d’atteindre une très haute performance sur des architectures multi-GPU, multicoeur ainsi qu’hétérogène. / In the context of this thesis, our focus is on numerical linear algebra, more precisely on solution of large sparse systems of linear equations. We focus on designing efficient parallel implementations of MaPHyS, an hybrid linear solver based on domain decomposition techniques. First we investigate the MPI+threads approach. In MaPHyS, the first level of parallelism arises from the independent treatment of the various subdomains. The second level is exploited thanks to the use of multi-threaded dense and sparse linear algebra kernels involved at the subdomain level. Such an hybrid implementation of an hybrid linear solver suitably matches the hierarchical structure of modern supercomputers and enables a trade-off between the numerical and parallel performances of the solver. We demonstrate the flexibility of our parallel implementation on a set of test examples. Secondly, we follow a more disruptive approach where the algorithms are described as sets of tasks with data inter-dependencies that leads to a directed acyclic graph (DAG) representation. The tasks are handled by a runtime system. We illustrate how a first task-based parallel implementation can be obtained by composing task-based parallel libraries within MPI processes throught a preliminary prototype implementation of our hybrid solver. We then show how a task-based approach fully abstracting the hardware architecture can successfully exploit a wide range of modern hardware architectures. We implemented a full task-based Conjugate Gradient algorithm and showed that the proposed approach leads to very high performance on multi-GPU, multicore and heterogeneous architectures. Calcul haute performance Multi-coeur Solveur linéaires creux Méthode hybride Programmation en tâche Architecture hétérogène High Performance Computing (HPC) Multicore Sparse linear solver Hybrid method Task Moteur d’exécution Heterogeneous architectures
18	Task-based multifrontal QR solver for heterogeneous architectures / Solveur multifrontal QR à base de tâches pour architectures hétérogènes Lopez, Florent 11 December 2015 (has links) Afin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. Dans cette étude, nous explorons la conception de solveurs directes creux à base de tâches, qui représentent une charge de travail extrêmement irrégulière, avec des tâches de granularités et de caractéristiques différentes ainsi qu'une consommation mémoire variable, au-dessus d'un moteur d'exécution. Dans le cadre du solveur qr mumps, nous montrons dans un premier temps la viabilité et l'efficacité de notre approche avec l'implémentation d'une méthode multifrontale pour la factorisation de matrices creuses, en se basant sur le modèle de programmation parallèle appelé "flux de tâches séquentielles" (Sequential Task Flow). Cette approche, nous a ensuite permis de développer des fonctionnalités telles que l'intégration de noyaux dense de factorisation de type "minimisation de cAfin de s'adapter aux architectures multicoeurs et aux machines de plus en plus complexes, les modèles de programmations basés sur un parallélisme de tâche ont gagné en popularité dans la communauté du calcul scientifique haute performance. Les moteurs d'exécution fournissent une interface de programmation qui correspond à ce paradigme ainsi que des outils pour l'ordonnancement des tâches qui définissent l'application. / To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design of task-based sparse direct solvers which constitute extremely irregular workloads, with tasks of different granularities and characteristics with variable memory consumption on top of runtime systems. In the context of the qr mumps solver, we prove the usability and effectiveness of our approach with the implementation of a sparse matrix multifrontal factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method allowing for better scalability compared to the original approach used in qr mumps. In addition we introduced a memory-aware algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint for the multifrontal QR factorization with a small impact on performance. Following this approach, we move to heterogeneous architectures where task granularity and scheduling strategies are critical to achieve performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally we present a study on the reproducibility of executions and the use of alternative programming models for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with a detailed performance analysis measuring the impact of several identified effects on the performance and scalability. Thanks to this original analysis, presented in the first part of this study, we are capable of fully understanding the results obtained with our solver. Méthode multifrontale Multicœur Moteurs d'exécutions Architectures hétérogènes Calcul haute performance GPU Sparse direct solvers Multifrontal method Multicores Runtime systems Scheduling Memory-aware algorythms Heterogeneous architectures High-performance computing
19	Programmation des architectures hiérarchiques et hétérogènes / Programming hierarxchical and heterogenous machines Hamidouche, Khaled 10 November 2011 (has links) Les architectures de calcul haute performance de nos jours sont des architectures hiérarchiques et hétérogènes: hiérarchiques car elles sont composées d’une hiérarchie de mémoire, une mémoire distribuée entre les noeuds et une mémoire partagée entre les coeurs d’un même noeud. Hétérogènes due à l’utilisation des processeurs spécifiques appelés Accélérateurs tel que le processeur CellBE d’IBM et les CPUs de NVIDIA. La complexité de maîtrise de ces architectures est double. D’une part, le problème de programmabilité: la programmation doit rester simple, la plus proche possible de la programmation séquentielle classique et indépendante de l’architecture cible. D’autre part, le problème d’efficacité: les performances doivent êtres proches de celles qu’obtiendrait un expert en écrivant le code à la main en utilisant des outils de bas niveau. Dans cette thèse, nous avons proposé une plateforme de développement pour répondre à ces problèmes. Pour cela, nous proposons deux outils : BSP++ est une bibliothèque générique utilisant des templates C++ et BSPGen est un framework permettant la génération automatique de code hybride à plusieurs niveaux de la hiérarchie (MPI+OpenMP ou MPI + Cell BE). Basée sur un modèle hiérarchique, la bibliothèque BSP++ prend les architectures hybrides comme cibles natives. Utilisant un ensemble réduit de primitives et de concepts intuitifs, BSP++ offre une simplicité d'utilisation et un haut niveau d' abstraction de la machine cible. Utilisant le modèle de coût de BSP++, BSPGen estime et génère le code hybride hiérarchique adéquat pour une application donnée sur une architecture cible. BSPGen génère un code hybride à partir d'une liste de fonctions séquentielles et d'une description de l'algorithme parallèle. Nos outils ont été validés sur différentes applications de différents domaines allant de la vérification et du calcul scientifique au traitement d'images en passant par la bioinformatique. En utilisant une large sélection d’architecture cible allant de simple machines à mémoire partagée au machines Petascale en passant par les architectures hétérogènes équipées d’accélérateurs de type Cell BE. / Today’s high-performance computing architectures are hierarchical and heterogeneous. With a hierarchy of memory, they are composed of distributed memory between nodes and shared memory between cores of the same node. heterogeneous due to the use of specific processors called accelerators such as the CellBE IBM processor and/or NVIDIA GPUs. The programming complexity of these architectures is twofold. On the one hand, the problem of programmability: the programming should be simple, as close as possible to the conventional sequential programming and independent of the target architecture. On the other hand, the problem of efficiency: performance should be similar to those obtained by a expert in writing code by hand using low-level tools. In this thesis, we proposed a development platform to address these problems. For this, we propose two tools: BSP++ is a generic library using C++ templates and BSPGen is a framework for the automatic hybrid multi-level hierarchy (MPI + OpenMP or MPI + Cell BE) code generation.Based on a hierarchical model, the BSP++ library takes the hybrid architectures as native targets. Using a small set of primitives and intuitive concepts, BSP++ provides a simple way to use and a high level of abstraction of the target machine. Using the cost model of BSP++, BSPGen predicts and generates the appropriate hierarchical hybrid code for a given application on target architecture. BSPGen generates hybrid code from a sequential list of functions and a description of the parallel algorithm.Our tools have been validated with various applications in different fields ranging from verification to scientific computing and image processing through bioinformatics. Using a wide selection of target architecture ranging from simple shared memory machines to Petascale machines through the heterogeneous architectures equipped with Cell BE accelerators. BSP Génération automatique Programmation parallèle MPI OpenMP Cell BE BSP Automatic code generation Parallel computing MPI OpenMP Cell BE
20	Finite element modeling of electromagnetic radiation and induced heat transfer in the human body Kim, Kyungjoo 24 September 2013 (has links) This dissertation develops adaptive hp-Finite Element (FE) technology and a parallel sparse direct solver enabling the accurate modeling of the absorption of Electro-Magnetic (EM) energy in the human head. With a large and growing number of cell phone users, the adverse health effects of EM fields have raised public concerns. Most research that attempts to explain the relationship between exposure to EM fields and its harmful effects on the human body identifies temperature changes due to the EM energy as the dominant source of possible harm. The research presented here focuses on determining the temperature distribution within the human body exposed to EM fields with an emphasis on the human head. Major challenges in accurately determining the temperature changes lie in the dependence of EM material properties on the temperature. This leads to a formulation that couples the BioHeat Transfer (BHT) and Maxwell equations. The mathematical model is formed by the time-harmonic Maxwell equations weakly coupled with the transient BHT equation. This choice of equations reflects the relevant time scales. With a mobile device operating at a single frequency, EM fields arrive at a steady-state in the micro-second range. The heat sources induced by EM fields produce a transient temperature field converging to a steady-state distribution on a time scale ranging from seconds to minutes; this necessitates the transient formulation. Since the EM material properties depend upon the temperature, the equations are fully coupled; however, the coupling is realized weakly due to the different time scales for Maxwell and BHT equations. The BHT equation is discretized in time with a time step reflecting the thermal scales. After multiple time steps, the temperature field is used to determine the EM material properties and the time-harmonic Maxwell equations are solved. The resulting heat sources are recalculated and the process continued. Due to the weak coupling of the problems, the corresponding numerical models are established separately. The BHT equation is discretized with H¹ conforming elements, and Maxwell equations are discretized with H(curl) conforming elements. The complexity of the human head geometry naturally leads to the use of tetrahedral elements, which are commonly employed by unstructured mesh generators. The EM domain, including the head and a radiating source, is terminated by a Perfectly Matched Layer (PML), which is discretized with prismatic elements. The use of high order elements of different shapes and discretization types has motivated the development of a general 3D hp-FE code. In this work, we present new generic data structures and algorithms to perform adaptive local refinements on a hybrid mesh composed of different shaped elements. A variety of isotropic and anisotropic refinements that preserve conformity of discretization are designed. The refinement algorithms support one- irregular meshes with the constrained approximation technique. The algorithms are experimentally proven to be deadlock free. A second contribution of this dissertation lies with a new parallel sparse direct solver that targets linear systems arising from hp-FE methods. The new solver interfaces to the hierarchy of a locally refined mesh to build an elimination ordering for the factorization that reflects the h-refinements. By following mesh refinements, not only the computation of element matrices but also their factorization is restricted to new elements and their ancestors. The solver is parallelized by exploiting two-level task parallelism: tasks are first generated from a parallel post-order tree traversal on the assembly tree; next, those tasks are further refined by using algorithms-by-blocks to gain fine-grained parallelism. The resulting fine-grained tasks are asynchronously executed after their dependencies are analyzed. This approach effectively reduces scheduling overhead and increases flexibility to handle irregular tasks. The solver outperforms the conventional general sparse direct solver for a class of problems formulated by high order FEs. Finally, numerical results for a 3D coupled BHT with Maxwell equations are presented. The solutions of this Maxwell code have been verified using the analytic Mie series solutions. Starting with simple spherical geometry, parametric studies are conducted on realistic head models for a typical frequency band (900 MHz) of mobile phones. / text hp-FEM Hybrid mesh Local mesh refinement algorithms Electromagnetics Specific Absorption Rate Dielectric heating Gaussian elimination Directed Acyclic Graph Direct method LU Multi-core Multi-frontal OpenMP Sparse matrix Supernodes Task parallelism Unassembled HyperMatrix GPU Dense linear algebra Algorithms-by-blocks Heterogeneous architectures BLAS

Search results