Global ETD Search

11	Efficient LU Factorization for Texas Instruments Keystone Architecture Digital Signal Processors / Effektiv LU-faktorisering för Texas Instruments digitala signalprocessorer med Keystone-arkitektur
Netzer, Gilbert, January 2015
The energy consumption of large-scale high-performance computing (HPC) systems has become one of the foremost concerns of both data-center operators and computer manufacturers. This has renewed interest in alternative computer architectures that could offer substantially better energy efficiency. Yet the well-optimized implementations of typical HPC benchmarks that are needed to evaluate the potential of these architectures are often not available, because the architectures are new to the HPC industry. The LU factorization benchmark implementation presented in this work aims to provide such a high-quality tool, modeled on the HPC industry-standard high-performance LINPACK benchmark (HPL), for the eight-core Texas Instruments TMS320C6678 digital signal processor (DSP). The implementation performs the LU factorization at up to 30.9 GF/s at a core clock frequency of 1.25 GHz by using all eight DSP cores of the System-on-Chip (SoC). This is 77% of the attainable peak double-precision floating-point performance of the DSP, a level of efficiency comparable to that expected on traditional x86-based processor architectures. A detailed performance analysis shows that this is largely due to the optimized implementation of the embedded general matrix-matrix multiplication (GEMM). For this operation, the on-chip direct memory access (DMA) engines were used to transfer the necessary data from the external DDR3 memory to the core-private and shared scratchpad memory, which allowed the data transfers to overlap with computations on the DSP cores. The computations were in turn optimized using software pipelining techniques and were partly implemented in assembly language.
With these optimizations, the performance of the matrix multiplication reached up to 95% of the attainable peak performance. A detailed description of these two key optimization techniques and their application to the LU factorization is included. A specially instrumented Advantech TMDXEVM6678L evaluation module, described in detail in related work, made it possible to measure an SoC energy efficiency of up to 2.92 GF/J while executing the presented benchmark. Results from the verification of the benchmark execution using standard HPL correctness checks and an uncertainty analysis of the experimentally gathered data are also presented.
/ Energiförbrukningen av storskaliga högpresterande datorsystem (HPC) har blivit ett av de främsta problemen för såväl ägare av dessa system som datortillverkare. Det har lett till ett förnyat intresse för alternativa datorarkitekturer som kan vara betydligt mer effektiva ur energiförbrukningssynpunkt. För detaljerade analyser av prestanda och energiförbrukning av dessa för HPC-industrin nya arkitekturer krävs väloptimerade implementationer av standard HPC-bänkmärkningsproblem. Syftet med detta examensarbete är att tillhandahålla ett sådant högkvalitativt verktyg i form av en implementation av ett bänkmärkesprogram för LU-faktorisering för den åttakärniga digitala signalprocessorn (DSP) TMS320C6678 från Texas Instruments. Bänkmärkningsproblemet är samma som för det inom HPC-industrin välkända bänkmärket “high-performance LINPACK” (HPL). Den här presenterade implementationen nådde upp till en prestanda av 30,9 GF/s vid 1,25 GHz klockfrekvens genom att samtidigt använda alla åtta kärnor i DSP:n. Detta motsvarar 77% av den teoretiskt uppnåbara prestandan, vilket är jämförbart med förväntningar på effektiviteten av mer traditionella x86-baserade system. En detaljerad prestandaanalys visar att detta till stor del uppnås genom den högoptimerade implementationen av den ingående matris-matris-multiplikationen.
Användandet av specialiserade “direct memory access” (DMA) hårdvaruenheter för kopieringen av data mellan det externa DDR3-minnet och det interna kärn-privata och delade arbetsminnet tillät att överlappa dessa operationer med beräkningar. Optimerade mjukvaruimplementationer av dessa beräkningar, delvis utförda i maskinspråk, tillät att utföra matris-multiplikationen med upp till 95% av den teoretiskt nåbara prestandan. I rapporten ges en detaljerad beskrivning av dessa två nyckeltekniker. Energiförbrukningen vid exekvering av det implementerade bänkmärket kunde med hjälp av en för ändamålet anpassad Advantech TMDXEVM6678L evalueringsmodul bestämmas till maximalt 2,92 GF/J. Resultat från verifikationen av bänkmärkesimplementationen och en uppskattning av mätosäkerheten vid de experimentella mätningarna presenteras också.
LU factorization digital signal processors Texas Instruments Keystone architecture high-performance LINPACK benchmark performance energy efficiency software-pipelined loops direct memory access optimization Computer Sciences Datavetenskap (datalogi)
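As a rough cross-check of the figures quoted in the abstract above (30.9 GF/s, 77% of attainable peak, eight cores, 1.25 GHz), the peak rate and per-core throughput they imply can be worked out as follows. This is back-of-envelope arithmetic on the abstract's rounded numbers only; the variable names are illustrative, and the derived values are implied estimates, not figures quoted from the thesis.

```python
# Figures taken from the abstract (rounded as published there).
measured_gflops = 30.9   # LU factorization rate, GF/s
efficiency = 0.77        # fraction of attainable double-precision peak
cores = 8                # DSP cores used
clock_ghz = 1.25         # core clock frequency, GHz

# Derived quantities (assumption: the 77% figure refers to the 30.9 GF/s run).
implied_peak_gflops = measured_gflops / efficiency
flops_per_cycle_per_core = implied_peak_gflops / (cores * clock_ghz)

print(f"implied peak: {implied_peak_gflops:.1f} GF/s")
print(f"per core and cycle: {flops_per_cycle_per_core:.2f} flops")
```

The implied peak comes out near 40 GF/s, which corresponds to roughly four double-precision floating-point operations per core per cycle.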
12	Résolution des équations intégrales de surface par une méthode de décomposition de domaine et compression hiérarchique ACA : Application à la simulation électromagnétique des larges plateformes / Resolution of surface integral equations by a domain decomposition method and adaptive cross approximation : Application to the electromagnetic simulation of large platforms
Maurin, Julien, 25 November 2015
Cette étude s’inscrit dans le domaine de la simulation électromagnétique des problèmes de grande taille tels que la diffraction d’ondes planes par de larges plateformes et le rayonnement d’antennes aéroportées. Elle consiste à développer une méthode combinant décomposition en sous-domaines et compression hiérarchique des équations intégrales de frontière. Pour cela, nous rappelons dans un premier temps les points importants de la méthode des équations intégrales de frontière et de leur compression hiérarchique par l’algorithme ACA (Adaptive Cross Approximation). Ensuite, nous présentons la formulation IE-DDM (Integral Equations – Domain Decomposition Method) obtenue à partir d’une représentation intégrale des sous-domaines. Les matrices résultant de la discrétisation de cette formulation sont stockées au format H-matrice (matrice hiérarchique). Un solveur spécialement adapté à la résolution de la formulation IE-DDM et à sa représentation hiérarchique a été conçu. Cette étude met en évidence l’efficacité de la décomposition en sous-domaines en tant que préconditionneur des équations intégrales. De plus, la méthode développée est rapide pour la résolution des problèmes à incidences multiples ainsi que la résolution des problèmes basses fréquences.
/ This thesis addresses the electromagnetic simulation of large-scale problems such as wave scattering from aircraft and radiation from airborne antennas. It develops a method combining domain decomposition and hierarchical compression of the surface integral equations.
First, we recall the principles of the boundary element method and the hierarchical representation of the surface integral equations with the Adaptive Cross Approximation (ACA) algorithm. Then, we present the IE-DDM formulation obtained from a sub-domain integral representation. The matrices resulting from the discretization of this formulation are stored in the H-matrix format. A solver specially adapted to the IE-DDM formulation and its hierarchical representation has been developed. This study highlights the efficiency of the sub-domain decomposition as a preconditioner for the integral equations. Moreover, the method is fast for problems with multiple incidences and for low-frequency problems.
Equations intégrales de surface; Décomposition en sous-domaines; Matrices hiérarchiques; Adaptive cross approximation; Factorisation LU approchée; Solveur itératif; Simulation électromagnétique; Larges plateformes; Surface integral equations; Domain decomposition; Hierarchical matrices; Adaptive cross approximation; Approximate LU factorization; Iterative solver; Computational electromagnetics; Large scale problems
13	Memory-aware Algorithms and Scheduling Techniques for Matrix Computations / Algorithmes orientés mémoire et techniques d'ordonnancement pour le calcul matriciel
Herrmann, Julien, 25 November 2015
Dans cette thèse, nous nous sommes penchés d’un point de vue à la fois théorique et pratique sur la conception d’algorithmes et de techniques d’ordonnancement adaptées aux architectures complexes des superordinateurs modernes. Nous nous sommes en particulier intéressés à l’utilisation mémoire et la gestion des communications des algorithmes pour le calcul haute performance (HPC). Nous avons exploité l’hétérogénéité des superordinateurs modernes pour améliorer les performances du calcul matriciel. Nous avons étudié la possibilité d’alterner intelligemment des étapes de factorisation LU (plus rapide) et des étapes de factorisation QR (plus stable numériquement mais deux fois plus coûteuses) pour résoudre un système linéaire dense. Nous avons amélioré les performances de systèmes d’exécution dynamique à l’aide de pré-calculs statiques prenant en compte l’ensemble du graphe de tâches de la factorisation Cholesky ainsi que l’hétérogénéité de l’architecture. Nous nous sommes intéressés à la complexité du problème d’ordonnancement de graphes de tâches utilisant de gros fichiers d’entrée et de sortie sur une architecture hétérogène avec deux types de ressources, utilisant chacune une mémoire spécifique. Nous avons conçu de nombreuses heuristiques en temps polynomial pour la résolution de problèmes généraux que l’on avait prouvés NP-complets au préalable. Enfin, nous avons conçu des algorithmes optimaux pour ordonnancer un graphe de différentiation automatique sur une plateforme avec deux types de mémoire : une mémoire gratuite mais limitée et une mémoire coûteuse mais illimitée.
/ Throughout this thesis, we have designed memory-aware algorithms and scheduling techniques suited for modern memory architectures.
We have shown special interest in improving the performance of matrix computations on multiple levels. At a high level, we have introduced new numerical algorithms for solving linear systems on large distributed platforms. Most of the time, these linear solvers rely on runtime systems to handle resource allocation and data management. We also focused on improving the dynamic schedulers embedded in these runtime systems by adding static information to their decision process. We proposed new memory-aware dynamic heuristics to schedule workflows that could be implemented in such runtime systems. Altogether, we have dealt with multiple state-of-the-art factorization algorithms used to solve linear systems, like the LU, QR and Cholesky factorizations. We targeted different platforms ranging from multicore processors to distributed memory clusters, and worked with several reference runtime systems tailored for these architectures, such as PaRSEC and StarPU. On the theoretical side, we took special care in modelling convoluted hierarchical memory architectures. We have classified the problems that arise when dealing with these storage platforms. We have designed many efficient polynomial-time heuristics for general problems that had previously been shown to be NP-complete.
Ordonnancement multi-critère; Algorithmes numériques; Factorisation LU; Factorisation QR; Factorisation Cholesky; Calcul haute performance; Systèmes linéaires; Différentiation automatique; Scheduling; Numerical algorithms; LU factorization; QR factorization; Cholesky factorization; High performance computing; Linear systems; Automatic differentiation
14	Dense matrix computations : communication cost and numerical stability / Calculs pour les matrices denses : coût de communication et stabilité numérique
Khabou, Amal, 11 February 2013
Cette thèse traite d’une routine d’algèbre linéaire largement utilisée pour la résolution des systèmes linéaires, il s’agit de la factorisation LU. Habituellement, pour calculer une telle décomposition, on utilise l’élimination de Gauss avec pivotage partiel (GEPP). La stabilité numérique de l’élimination de Gauss avec pivotage partiel est caractérisée par un facteur de croissance qui reste assez petit en pratique. Toutefois, la version parallèle de cet algorithme ne permet pas d’atteindre les bornes inférieures qui caractérisent le coût de communication pour un algorithme donné. En effet, la factorisation d’un bloc de colonnes constitue un goulot d’étranglement en termes de communication. Pour remédier à ce problème, Grigori et al [60] ont développé une factorisation LU qui minimise la communication (CALU) au prix de quelques calculs redondants. En théorie la borne supérieure du facteur de croissance de CALU est plus grande que celle de l’élimination de Gauss avec pivotage partiel, cependant CALU est stable en pratique. Pour améliorer la borne supérieure du facteur de croissance, nous étudions une nouvelle stratégie de pivotage utilisant la factorisation QR avec forte révélation de rang. Ainsi nous développons un nouvel algorithme pour la factorisation LU par blocs. La borne supérieure du facteur de croissance de cet algorithme est plus petite que celle de l’élimination de Gauss avec pivotage partiel. Cette stratégie de pivotage est ensuite combinée avec le pivotage basé sur un tournoi pour produire une factorisation LU qui minimise la communication et qui est plus stable que CALU. Pour les systèmes hiérarchiques, plusieurs niveaux de parallélisme sont disponibles. Cependant, aucune des méthodes précédemment citées n’exploite pleinement ces ressources.
Nous proposons et étudions alors deux algorithmes récursifs qui utilisent les mêmes principes que CALU mais qui sont plus appropriés pour des architectures à plusieurs niveaux de parallélisme. Pour analyser d’une façon précise et réaliste
/ This dissertation focuses on the LU decomposition, a widely used linear algebra kernel for solving linear systems. This computation is usually performed by Gaussian elimination with partial pivoting (GEPP). The backward stability of GEPP depends on a quantity referred to as the growth factor; GEPP is known to lead to modest element growth in practice. However, its parallel version does not attain the communication lower bounds; indeed, the panel factorization represents a bottleneck in terms of communication. To overcome this bottleneck, Grigori et al [60] developed a communication-avoiding LU factorization (CALU), which is asymptotically optimal in terms of communication cost at the price of some redundant computation. In theory, the upper bound of the growth factor of CALU is larger than that of GEPP; however, CALU is stable in practice. To improve the upper bound of the growth factor, we study a new pivoting strategy based on strong rank-revealing QR factorization and use it to develop a new block algorithm for the LU factorization. This algorithm has a smaller growth factor upper bound than GEPP. The strong rank-revealing pivoting is then combined with the tournament pivoting strategy to produce a communication-avoiding LU factorization that is more stable than CALU. For hierarchical systems, multiple levels of parallelism are available. However, none of the previously cited methods fully exploits these hierarchical systems. We propose and study two recursive algorithms based on the communication-avoiding LU algorithm, which are more suitable for architectures with multiple levels of parallelism.
For an accurate and realistic cost analysis of these hierarchical algorithms, we introduce a hierarchical parallel performance model that takes into account processor and network hierarchies. This analysis enables us to accurately predict the performance of the hierarchical LU factorization on an exascale platform.
Factorisation LU; Minimisation de la communication; Algorithmes parallèles; Systèmes hiérarchiques; Modèles de performance; Stratégies de pivotage; LU factorization; Growth factor; Minimizing the communication cost; Parallel algorithms; Hierarchical systems; Performance models; Pivoting strategies
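The abstract above ties the backward stability of GEPP to the growth factor. A minimal pure-Python sketch of Gaussian elimination with partial pivoting that also tracks this quantity is given below. It is a textbook illustration, not the thesis's CALU or block algorithm; the function name `lu_partial_pivoting` and the example matrix are assumptions made for illustration. The growth factor is taken here as the largest magnitude appearing in the reduced matrices during elimination, divided by the largest magnitude in the input.

```python
def lu_partial_pivoting(a):
    """Factor a square matrix (list of rows) in the form P*A = L*U.

    Returns (perm, lu, growth): perm[i] is the original index of row i,
    lu holds the multipliers of L below the diagonal and U on and above
    it, and growth is the element growth factor observed during the run.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]
    perm = list(range(n))
    max_input = max(abs(x) for row in lu for x in row)
    max_seen = max_input
    for k in range(n):
        # Partial pivoting: largest magnitude in column k, rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]          # multiplier l_ik
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]  # Schur complement update
                max_seen = max(max_seen, abs(lu[i][j]))
    return perm, lu, max_seen / max_input


# Classic worst-case family for partial pivoting (n = 4): ones on the
# diagonal and in the last column, -1 below the diagonal.
wilkinson = [
    [1, 0, 0, 1],
    [-1, 1, 0, 1],
    [-1, -1, 1, 1],
    [-1, -1, -1, 1],
]
perm, lu, growth = lu_partial_pivoting(wilkinson)
print(perm, growth)
```

On this 4 × 4 example no rows are exchanged and the sketch reports a growth factor of 8, i.e. 2^(n-1), the theoretical upper bound for partial pivoting; matrices met in practice typically show far smaller growth, which is the "modest element growth" the abstract refers to.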
15	Solving dense linear systems on accelerated multicore architectures / Résoudre des systèmes linéaires denses sur des architectures composées de processeurs multicœurs et d’accélerateurs
Rémy, Adrien, 08 July 2015
Dans cette thèse de doctorat, nous étudions des algorithmes et des implémentations pour accélérer la résolution de systèmes linéaires denses en utilisant des architectures composées de processeurs multicœurs et d'accélérateurs. Nous nous concentrons sur des méthodes basées sur la factorisation LU. Le développement de notre code s'est fait dans le contexte de la bibliothèque MAGMA. Tout d'abord nous étudions différents solveurs CPU/GPU hybrides basés sur la factorisation LU. Ceux-ci visent à réduire le surcoût de communication dû au pivotage. Le premier est basé sur une stratégie de pivotage dite "communication avoiding" (CALU) alors que le deuxième utilise un préconditionnement aléatoire du système original pour éviter de pivoter (RBT). Nous montrons que ces deux méthodes surpassent le solveur utilisant la factorisation LU avec pivotage partiel quand elles sont utilisées sur des architectures hybrides multicœurs/GPUs. Ensuite nous développons des solveurs utilisant des techniques de randomisation appliquées sur des architectures hybrides utilisant des GPU Nvidia ou des coprocesseurs Intel Xeon Phi. Avec cette méthode, nous pouvons éviter l'important surcoût du pivotage tout en restant stable numériquement dans la plupart des cas. L'architecture hautement parallèle de ces accélérateurs nous permet d'effectuer la randomisation de notre système linéaire à un coût de calcul très faible par rapport à la durée de la factorisation. Finalement, nous étudions l'impact d'accès mémoire non uniformes (NUMA) sur la résolution de systèmes linéaires denses en utilisant un algorithme de factorisation LU.
En particulier, nous illustrons comment un placement approprié des processus légers et des données sur une architecture NUMA peut améliorer les performances pour la factorisation du panel et accélérer de manière conséquente la factorisation LU globale. Nous montrons comment ces placements peuvent améliorer les performances quand ils sont appliqués à des solveurs hybrides multicœurs/GPU.
/ In this PhD thesis, we study algorithms and implementations to accelerate the solution of dense linear systems by using hybrid architectures with multicore processors and accelerators. We focus on methods based on the LU factorization, and our code development takes place in the context of the MAGMA library. We study different hybrid CPU/GPU solvers based on the LU factorization which aim at reducing the communication overhead due to pivoting. The first is based on a communication-avoiding pivoting strategy (CALU), while the second uses a random preconditioning of the original system to avoid pivoting (RBT). We show that both of these methods outperform the solver using LU factorization with partial pivoting when implemented on hybrid multicore/GPU architectures. We also present new solvers based on randomization for hybrid architectures with Nvidia GPUs or Intel Xeon Phi coprocessors. With this method, we can avoid the high cost of pivoting while remaining numerically stable in most cases. The highly parallel architecture of these accelerators allows us to perform the randomization of our linear system at a very low computational cost compared to the time of the factorization. Finally, we investigate the impact of non-uniform memory accesses (NUMA) on the solution of dense general linear systems using an LU factorization algorithm. In particular, we illustrate how an appropriate placement of the threads and data on a NUMA architecture can improve the performance of the panel factorization and consequently accelerate the global LU factorization.
We show how these placements can improve the performance when applied to hybrid multicore/GPU solvers.
Systèmes linéaires denses; Factorisation LU; Bibliothèque MAGMA; Calcul hybride multicœur/GPU; Processeurs graphiques; Intel Xeon Phi; ccNUMA; Communication-avoiding; Randomisation; Placement des processus légers; Dense linear systems; LU factorization; Dense linear algebra libraries; MAGMA library; Hybrid multicore/GPU computing; Graphics process units; Intel Xeon Phi; ccNUMA; Communication-avoiding algorithms; Randomization; Thread placement
16	Migrering av en State of Charge-algoritm : Migrering och optimering av State of Charge algoritmen för Nickel-metallhydridbatterier
Jansson, Christoffer; Pettersson, Malte, January 2023
Följande studie är utförd på uppdrag av företaget Nilar som tillverkar Nickel-Metallhydridbatterier (NiMH-batterier) vid sin produktionsanläggning i Gävle. Den nuvarande beräkningen av State of Charge (SoC) sker på deras Battery Management Unit (BMU) och är implementerad i Structured Text i exekveringsmiljön CODESYS. Nilar vill flytta SoC-beräkningen från BMU:n så att den kan exekveras på en Interface Control Unit (ICU). Motiveringen till detta är för att distribuera SoC-beräkningen då ett flertal ICU:er finns tillgängliga per Battery Management System (BMS) men även för att i framtiden helt byta ut CODESYS. Syftet med denna studie är att migrera implementationen av SoC-algoritmen till programmeringsspråket C så att algoritmen senare kan exekveras på ICU:n. Därefter optimeras algoritmen för att sänka exekveringstiden. Studien utforskar kodstrukturella och funktionella skillnader mellan implementationerna samt metoder för att optimera SoC-algoritmen. Migreringen av algoritmen fullföljdes utan större inverkan på noggrannheten. Algoritmen optimerades genom att skapa en variant av en LU-faktorisering som var specifikt anpassad för det aktuella problemet. Optimeringen av algoritmen resulterade i en minskning på 25% av den totala exekveringstiden för algoritmen. De nya implementationerna tar markant längre tid att exekvera då batteriet befinner sig under laddning jämfört när det befinner sig under urladdning, någonting som inte kan noteras för den gamla implementationen.
/ The following study was carried out on behalf of Nilar, which manufactures nickel-metal hydride (NiMH) batteries at its production site in Gävle. The current State of Charge (SoC) calculation is done on their Battery Management Unit (BMU) and is implemented in Structured Text for the CODESYS runtime.
Nilar wants to move the SoC calculation off the BMU so that it is executed on an Interface Control Unit (ICU). The reasons are to distribute the SoC computation, since several ICUs are available per Battery Management System (BMS), and to remove the CODESYS dependency in the future. The purpose of this study is to migrate the implementation of the SoC algorithm to the C programming language so that the algorithm can later be executed on an ICU. The study also aims to optimize the algorithm to lower its execution time. It explores differences in code structure and functionality between the implementations, as well as methods for optimizing the SoC algorithm. The migration was completed without major impact on accuracy. The algorithm was optimized by creating a variant of LU factorization specifically tailored to the problem at hand, which reduced the total execution time by 25%. The new implementations take markedly longer to execute when the battery is charging than when it is discharging, something that is not observed for the old implementation.
State of Charge; Battery Management System; Nickel-metal hydride battery; Migration; Optimization; Execution time; C; CODESYS; LU Factorization; State of Charge; Battery Management System; Nickel-Metallhydridbatteri; Migrering; Optimering; Exekveringstid; C; CODESYS; LU-faktorisering; Computer Sciences; Datavetenskap (datalogi); Embedded Systems; Inbäddad systemteknik; Computational Mathematics; Beräkningsmatematik
17	Contribution à l'analyse mathématique et à la résolution numérique d'un problème inverse de scattering élasto-acoustique / Contribution to the mathematical analysis and to the numerical solution of an inverse elasto-acoustic scattering problem
Estecahandy, Elodie, 19 September 2013
La détermination de la forme d'un obstacle élastique immergé dans un milieu fluide à partir de mesures du champ d'onde diffracté est un problème d'un vif intérêt dans de nombreux domaines tels que le sonar, l'exploration géophysique et l'imagerie médicale. A cause de son caractère non-linéaire et mal posé, ce problème inverse de l'obstacle (IOP) est très difficile à résoudre, particulièrement d'un point de vue numérique. De plus, son étude requiert la compréhension de la théorie du problème de diffraction direct (DP) associé, et la maîtrise des méthodes de résolution correspondantes. Le travail accompli ici se rapporte à l'analyse mathématique et numérique du DP élasto-acoustique et de l'IOP. En particulier, nous avons développé un code de simulation numérique performant pour la propagation des ondes associée à ce type de milieux, basé sur une méthode de type DG qui emploie des éléments finis d'ordre supérieur et des éléments courbes à l'interface afin de mieux représenter l'interaction fluide-structure, et nous l'appliquons à la reconstruction d'objets par la mise en oeuvre d'une méthode de Newton régularisée.
/ The determination of the shape of an elastic obstacle immersed in water from some measurements of the scattered field is an important problem in many technologies such as sonar, geophysical exploration, and medical imaging. This inverse obstacle problem (IOP) is very difficult to solve, especially from a numerical viewpoint, because of its nonlinear and ill-posed character. Moreover, its investigation requires the understanding of the theory for the associated direct scattering problem (DP), and the mastery of the corresponding numerical solution methods.
The work accomplished here pertains to the mathematical and numerical analysis of the elasto-acoustic DP and of the IOP. More specifically, we have developed an efficient numerical simulation code for wave propagation associated with this type of media, based on a DG-type method using higher-order finite elements and curved edges at the interface to better represent the fluid-structure interaction, and we apply it to the reconstruction of objects by implementing a regularized Newton method.
Interaction fluide-solide; Problème de diffraction; Fréquence de Jones; Inégalité de Gärding; Alternative de Fredholm; Espace de Sobolev à poids; Méthode de Galerkin discontinue; Méthode élément fini; Raffinement hp; Effet de pollution; Arêtes de frontière courbes; Factorisation LU; Différentiabilité au sens de Fréchet; Dérivée de domaine; Frontière Lipschitzienne; Théorème des fonctions implicites; Méthode de Newton; Régularisation de Tikhonov; Domaine étoilé; B-splines quadratiques; Fluid-solid interaction; Scattering problem; Jones frequency; Gärding's inequality; Fredholm alternative; Weighted Sobolev space; Discontinuous Galerkin method; Finite element method; Hp-refinement; Pollution effect; Curved boundary edges; LU factorization; Fréchet differentiability; Domain derivative; Lipschitz boundary; Implicit function theorem; Newton method; Tikhonov regularization; Star domain; Quadratic B-splines.
