About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
491

Multiphysics and Large-Scale Modeling and Simulation Methods for Advanced Integrated Circuit Design

Shuzhan Sun (11564611) 22 November 2021 (has links)
The design of advanced integrated circuits (ICs) and systems calls for multiphysics and large-scale modeling and simulation methods. On the one hand, novel devices and materials are emerging in next-generation IC technology, which require multiphysics modeling and simulation. On the other hand, the ever-increasing complexity of ICs requires more efficient numerical solvers.

In this work, we propose a multiphysics modeling and simulation algorithm that co-simulates Maxwell's equations, the dispersion relation of materials, and the Boltzmann equation to characterize emerging devices in IC technology such as Cu-Graphene (Cu-G) hybrid nano-interconnects. We also develop an unconditionally stable time-marching scheme that removes the dependence of the time step on the space step, for an efficient simulation of the multiscale and multiphysics system. Extensive numerical experiments and comparisons with measurements have validated the accuracy and efficiency of the proposed algorithm. Compared to analyses based on simplified steady-state models, a significant difference is observed when the frequency is high and/or the dimensions of the Cu-G structure are small, which necessitates the proposed multiphysics modeling and simulation for the design of advanced Cu-G interconnects.

To address the large-scale simulation challenge, we develop a new split-field domain-decomposition algorithm for solving Maxwell's equations that is amenable to parallelization: it minimizes the communication between subdomains while retaining fast convergence of the global solution, and it is unconditionally stable in the time domain. Unlike prevailing domain decomposition methods, which treat the interface unknown as a whole shared across subdomains, we partition the interface unknown into multiple components and solve each of them from one subdomain. In this way, we transform the original coupled system into fully decoupled subsystems. Only one addition (communication) of the interface unknown needs to be performed after the computation in each subdomain finishes at each time step. More importantly, the algorithm converges quickly and permits the use of a large time step irrespective of the space step. Numerical experiments on large-scale on-chip and package layout analyses have demonstrated the capability of the new domain decomposition algorithm.

To tackle the challenge of efficiently simulating irregular structures, the last part of the thesis develops a method for the stability analysis of unsymmetrical numerical systems in the time domain. Unsymmetrical systems are traditionally avoided in numerical formulations, since a traditional explicit simulation of them is absolutely unstable and no way to control the stability was known. However, unsymmetrical systems are frequently encountered in the modeling and simulation of unstructured meshes and of nonreciprocal electromagnetic and circuit devices. Our method reduces the stability analysis of a large system to the analysis of a single disassembled element, and therefore provides a feasible way to control the stability of large-scale systems regardless of whether the system is symmetrical or unsymmetrical. We then apply the proposed method to prove and control the stability of an unsymmetrical matrix-free method that solves Maxwell's equations on general unstructured meshes without requiring a matrix solution.
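To make the interface-splitting idea concrete, here is a minimal toy sketch: an explicit diffusion-style update on two 1-D subdomains that share one interface point, where each subdomain computes only its component of the new interface unknown and a single addition per step recombines them. The even split, the toy stencil, and all names are illustrative assumptions, not the thesis formulation.

```python
import numpy as np

# Toy sketch of the interface-splitting idea (illustrative, not the thesis
# formulation): an explicit diffusion-style update on two 1-D subdomains
# sharing one interface point. Each subdomain computes only its *component*
# of the new interface unknown; one addition per step recombines them.

dt = 0.1  # toy time step

def step_subdomain(u, left_ghost, right_ghost):
    """Explicit update of one subdomain's interior, given ghost values."""
    ext = np.concatenate(([left_ghost], u, [right_ghost]))
    return u + dt * (ext[:-2] - 2 * u + ext[2:])

uL, uR, u_if = np.ones(8), np.zeros(8), 0.5   # subdomains + interface value

for _ in range(50):
    # Fully decoupled subdomain solves (parallelizable across subdomains):
    uL_new = step_subdomain(uL, 1.0, u_if)
    uR_new = step_subdomain(uR, u_if, 0.0)
    # Each side's component of the interface update; the two sum to the
    # usual stencil  u_if + dt*(uL[-1] - 2*u_if + uR[0]).
    comp_L = 0.5 * u_if + dt * (uL[-1] - u_if)
    comp_R = 0.5 * u_if + dt * (uR[0] - u_if)
    u_if = comp_L + comp_R                     # the one addition/communication
    uL, uR = uL_new, uR_new
```

Because each component is computed entirely inside one subdomain, the only inter-subdomain traffic per time step is the single sum that reassembles the interface value.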
492

A Unified Infrastructure for Monitoring and Tuning the Energy Efficiency of HPC Applications

Schöne, Robert 19 September 2017 (has links)
High Performance Computing (HPC) has become an indispensable tool for the scientific community to perform simulations on models whose complexity would exceed the limits of a standard computer. An unfortunate trend concerning HPC systems is that their power consumption under high-demand workloads increases. To counter this trend, hardware vendors have implemented power-saving mechanisms in recent years, which has increased the variability in the power demands of single nodes. These capabilities provide an opportunity to increase the energy efficiency of HPC applications. To use these hardware power-saving mechanisms efficiently, their overhead must be analyzed. Furthermore, applications have to be examined for performance and energy-efficiency issues, which can give hints for optimizations. This requires an infrastructure that is able to capture both performance and power-consumption information concurrently. The mechanisms that such an infrastructure would inherently support could further be used to implement a tool that is able to both measure and tune energy efficiency. This thesis targets all steps in this process by making the following contributions: First, I provide a broad overview of related fields, listing common performance measurement tools, power measurement infrastructures, hardware power-saving capabilities, and tuning tools. Second, I lay out a model that can be used to define and describe energy-efficiency tuning at program-region scale. This model includes hardware- and software-dependent parameters. Hardware parameters include the runtime overhead and delay for switching power-saving mechanisms, as well as a consideration of their scopes and their possible influence on application performance. Thus, in a third step, I present methods to evaluate common power-saving mechanisms and list findings for different x86 processors. Software parameters include performance and power-consumption characteristics, as well as the influence of power-saving mechanisms on these. To capture software parameters, an infrastructure for measuring performance and power consumption is necessary. With minor additions, the same infrastructure can later be used to tune software and hardware parameters. Thus, I lay out the structure of such an infrastructure and describe the common components required for measuring and tuning. Based on that, I implement interfaces that extend the functionality of contemporary performance measurement tools. Furthermore, I use these interfaces to conflate performance and power measurements and to further process the gathered information for tuning. I conclude this work by demonstrating that the infrastructure can be used to manipulate the power-saving mechanisms of contemporary x86 processors and increase the energy efficiency of HPC applications.
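As a small illustration of capturing runtime and energy for a program region concurrently, the sketch below reads the cumulative package-energy counter that recent Intel CPUs expose through the Linux powercap (RAPL) sysfs interface. The exact sysfs path, the required permissions, and the region-measurement helper are assumptions of this sketch, not the thesis infrastructure.

```python
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"   # package-0 counter

def read_energy_uj():
    """Cumulative package energy in microjoules (Linux powercap/RAPL);
    usually needs root or relaxed file permissions."""
    with open(RAPL) as f:
        return int(f.read())

def measure_region(fn, *args):
    """Capture runtime and energy of one program region concurrently,
    the two quantities a measuring-and-tuning infrastructure must conflate."""
    e0, t0 = read_energy_uj(), time.time()
    result = fn(*args)
    e1, t1 = read_energy_uj(), time.time()
    joules, seconds = (e1 - e0) / 1e6, t1 - t0   # counter wraparound ignored
    print(f"region: {seconds:.3f} s, {joules:.3f} J, {joules / seconds:.1f} W")
    return result

measure_region(sum, range(10_000_000))
```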
493

Advanced Memory Data Structures for Scalable Event Trace Analysis

Knüpfer, Andreas 16 December 2008 (has links)
The thesis presents a contribution to the analysis and visualization of computational performance based on event traces, with a particular focus on parallel programs and High Performance Computing (HPC). Event traces contain detailed information about specified incidents (events) during the run-time of programs and allow minute investigation of dynamic program behavior, various performance metrics, and possible causes of performance flaws. Due to long-running and highly parallel programs and very fine detail resolutions, event traces can accumulate huge amounts of data, which become a challenge for interactive as well as automatic analysis and visualization tools. The thesis proposes a method of exploiting redundancy in event traces in order to reduce the memory requirements and the computational complexity of event trace analysis. The sources of redundancy are repeated segments of the original program, either through iterative or recursive algorithms or through SPMD-style parallel programs, which produce equal or similar repeated event sequences. The data reduction technique is based on the novel Complete Call Graph (CCG) data structure, which allows domain-specific data compression for event traces in a combination of lossless and lossy methods. All deviations due to lossy data compression can be controlled by constant bounds. The compression of the CCG data structure is incorporated into the construction process, such that at no point do substantial uncompressed parts have to be stored. Experiments with real-world example traces reveal the potential for very high data compression. The results range from factors of 3 to 15 for small-scale compression with minimal deviation of the data, to factors > 100 for large-scale compression with moderate deviation. Based on the CCG data structure, new algorithms for the most common evaluation and analysis methods for event traces are presented, which require no explicit decompression. By avoiding repeated evaluation of formerly redundant event sequences, the computational effort of the new algorithms is reduced to the same extent as the memory consumption. The thesis includes a comprehensive discussion of the state of the art and related work, a detailed presentation of the design of the CCG data structure, an elaborate description of algorithms for construction, compression, and analysis of CCGs, and an extensive experimental validation of all components.
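A minimal sketch of the lossy-with-bounds idea at the heart of the CCG: two call subtrees are merged when they have the same call structure and all timing deviations stay within a constant bound, so repeated loop iterations share one node. The class, the merging rule, and the bound value are illustrative assumptions, not the thesis implementation.

```python
# Sketch of bounded lossy call-tree merging (illustrative assumptions,
# not the thesis code): subtrees with identical call structure whose
# timings differ by at most BOUND collapse into one shared node.

BOUND = 5  # maximum absolute timing deviation allowed by lossy merging

class Node:
    def __init__(self, func, duration, children=()):
        self.func, self.duration, self.children = func, duration, list(children)

def mergeable(a, b):
    """Lossless in structure, lossy (bounded) in timing."""
    return (a.func == b.func
            and abs(a.duration - b.duration) <= BOUND
            and len(a.children) == len(b.children)
            and all(mergeable(x, y) for x, y in zip(a.children, b.children)))

def compress(nodes):
    """Replace equal-up-to-bound sibling subtrees by shared references."""
    unique, out = [], []
    for n in nodes:
        n.children = compress(n.children)
        for u in unique:
            if mergeable(u, n):
                out.append(u)          # reuse the existing subtree
                break
        else:
            unique.append(n)
            out.append(n)
    return out

# Two loop iterations with near-identical timings collapse to one subtree:
trace = [Node("iter", 100, [Node("work", 98)]),
         Node("iter", 103, [Node("work", 101)])]
compressed = compress(trace)
print(compressed[0] is compressed[1])  # True: second iteration reuses the first
```

Analysis algorithms can then visit each shared subtree once, which is how the computational effort shrinks along with the memory footprint.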
494

Efficient Implementation of 3D Finite Difference Schemes on Recent Processor Architectures / Effektiv implementering av finita differensmetoder i 3D på senaste processorarkitekturer

Ceder, Frederick January 2015 (has links)
In this thesis a solver is introduced that solves a problem set modelled by the Burgers equation using the finite difference method: forward in time and central in space (FTCS). The solver is parallelized and optimized for the Intel Xeon Phi 7120P as well as Intel Xeon E5-2699v3 processors, to investigate differences in performance between the two architectures. Optimized data access and layout have been implemented to ensure good cache utilization. Loop-tiling strategies are used to adjust data access with respect to the L2 cache size, and compiler hints describing aligned memory access are used to support vectorization on both processors. Additionally, prefetching strategies and streaming stores have been evaluated for the Intel Xeon Phi. Parallelization was done using OpenMP and MPI. The OpenMP-based parallelization for native execution on the Xeon Phi yielded a raw performance of nearly 100 GFLOP/s, reaching a speedup of almost 50 at 83% parallel efficiency. An OpenMP implementation on the E5-2699v3 (Haswell) processors produced up to 292 GFLOP/s, reaching a speedup of almost 31 at 85% parallel efficiency. For comparison, a mixed implementation interleaving communication with computation reached 267 GFLOP/s at a speedup of 28 with 87% parallel efficiency. Running a pure MPI implementation on PDC's Beskow supercomputer with 16 nodes yielded a total performance of 1450 GFLOP/s, and for a larger problem set a total of 2325 GFLOP/s, reaching speedups of 170 and 290 with parallel efficiencies of 33.3% and 56%, respectively. An analysis based on the roofline performance model shows that the computations were memory-bound to the L2 cache bandwidth, suggesting good L2 cache utilization on both the Haswell and Xeon Phi architectures. Xeon Phi performance could probably be improved by also using MPI. Keeping in mind the technological progress of the computational cores in the Haswell processor, both processors perform well. Rewriting the stencil computations in a more compiler-friendly form might improve performance further, as the compiler could then optimize more for the target platform. The experiments on the Cray system Beskow showed the efficiency increase from 33.3% to 56% for the larger problem, illustrating good weak scaling. This suggests that problem sizes should grow with the number of nodes in order to maintain high efficiency.
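A minimal 1-D version of the FTCS scheme, sketched in NumPy below, shows the update the thesis parallelizes in 3-D; the grid size, viscosity, and time step are illustrative assumptions. The closing comment spells out the parallel-efficiency arithmetic behind the percentages above, assuming Beskow's 32 cores per node.

```python
import numpy as np

# Minimal 1-D FTCS step for the viscous Burgers equation
#   u_t + u * u_x = nu * u_xx
# (forward Euler in time, central differences in space). Grid size,
# viscosity, and time step are illustrative assumptions.

nx, dt, nu = 256, 1e-5, 0.01
dx = 1.0 / nx
x = np.linspace(0.0, 1.0, nx, endpoint=False)
u = np.sin(2 * np.pi * x)                       # initial condition

def ftcs_step(u):
    up, um = np.roll(u, -1), np.roll(u, 1)      # periodic neighbors
    advection = u * (up - um) / (2 * dx)        # central first derivative
    diffusion = nu * (up - 2 * u + um) / dx**2  # central second derivative
    return u + dt * (diffusion - advection)     # forward in time

for _ in range(1000):
    u = ftcs_step(u)

# Efficiency arithmetic from the abstract: efficiency = speedup / cores.
# Assuming Beskow's 32 cores per node, 16 nodes give 512 cores, so
# speedups of 170 and 290 yield 170/512 = 33% and 290/512 = 57%.
```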
495

Optimizing array processing on complex I/O stacks using indices and data summarization

Xing, Haoyuan January 2021 (has links)
No description available.
496

Optimisation de transfert de données pour les processeurs pluri-coeurs, appliqué à l'algèbre linéaire et aux calculs sur stencils / Optimization of data transfer on many-core processors, applied to dense linear algebra and stencil computations

Ho, Minh Quan 05 July 2018 (has links)
The upcoming Exascale target in High Performance Computing (HPC) and disruptive achievements in artificial intelligence have given rise to alternative, non-conventional many-core architectures, with the energy efficiency typical of embedded systems and the same software ecosystem as classic HPC platforms. A key enabler of energy-efficient computing on many-core architectures is the exploitation of data locality, specifically the use of scratchpad memories in combination with DMA engines in order to overlap computation and communication. Such a software paradigm raises considerable programming challenges for both the vendor and the application developer. In this thesis, we tackle the memory-transfer and performance issues, as well as the programming challenges, of memory- and compute-intensive HPC applications on the Kalray MPPA many-core architecture. With the first, memory-bound use case of the lattice Boltzmann method (LBM), we provide generic and fundamental techniques for decomposing three-dimensional iterative stencil problems onto clustered many-core processors fitted with scratchpad memories and DMA engines. The developed DMA-based streaming and overlapping algorithm delivers a 33% performance gain over the default cache-based implementation. High-dimensional stencil computation suffers from a serious I/O bottleneck and limited on-chip memory space. We developed a new in-place LBM propagation algorithm, which reduces the memory footprint by half and yields 1.5 times higher performance-per-byte efficiency than the state-of-the-art out-of-place algorithm. On the compute-intensive side, with dense linear algebra computations, we build an optimized matrix multiplication benchmark based on the exploitation of scratchpad memory and efficient asynchronous DMA communication. These techniques are then extended to a DMA module of the BLIS framework, which allows us to instantiate an optimized and portable level-3 BLAS numerical library on any DMA-based architecture in less than 100 lines of code. We achieve 75% of peak performance on the MPPA processor with the matrix multiplication operation (GEMM) from the standard BLAS library, without having to write thousands of lines of laboriously optimized code for the same result.
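The heart of a streaming-and-overlap algorithm is double buffering: while the core computes on one scratchpad buffer, the DMA engine fills the other. The sketch below imitates that pattern with a thread standing in for the DMA engine; the buffer sizes, the reduction kernel, and all names are assumptions for illustration, not the thesis code.

```python
import threading
import numpy as np

# Sketch of the DMA-based streaming-and-overlap pattern (illustrative
# assumptions throughout): while the core computes on one scratchpad
# buffer, a thread standing in for the DMA engine fills the other.

TILE, NTILES = 1024, 16
source = np.random.rand(NTILES, TILE)        # data living in remote DDR
scratch = [np.empty(TILE), np.empty(TILE)]   # two local scratchpad buffers

def dma_get(tile_idx, buf):
    """Stand-in for an asynchronous DMA transfer into local memory."""
    np.copyto(scratch[buf], source[tile_idx])

total = 0.0
dma = threading.Thread(target=dma_get, args=(0, 0))  # prime the pipeline
dma.start()

for i in range(NTILES):
    dma.join()                               # wait for the in-flight transfer
    if i + 1 < NTILES:                       # prefetch the next tile into the
        dma = threading.Thread(              # *other* buffer while computing
            target=dma_get, args=(i + 1, (i + 1) % 2))
        dma.start()
    total += scratch[i % 2].sum()            # compute on the ready buffer

print(abs(total - source.sum()) < 1e-6)      # True: overlap preserved the result
```

The prefetch always writes the buffer that is not being read, so transfer of tile i+1 hides behind the computation on tile i.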
497

Couplage d'algorithmes d'optimisation par un système multi-agents pour l'exploration distribuée de simulateurs complexes : application à l'épidémiologie / Coupling of optimisation algorithms by a multi-agent system for supporting of distributed exploration of complex simulations : an application in epidemiology

Ho, The Nhan 27 June 2016 (has links)
The study of complex systems, such as environmental or urban systems, often requires the use of simulators to understand observed dynamics or to obtain a prospective view of the system's evolution. However, the credit given to the results of a simulation depends heavily on the trust placed in the simulator, and thus on the quality of its validation. This trust is achieved only through an advanced study of the model, a sensitivity analysis of its parameters, and a comparison of simulation results with collected data. All of this requires a plethora of simulations, which is costly in terms of computing resources (CPU time, memory, and processors) and is often impossible given the size of the parameter space. It is therefore important to reduce the domain to explore significantly and intelligently. One particular property of simulators of real phenomena is that the nature and shape of their parameter space depend on: (i) the scientific objectives; (ii) the nature of the manipulated parameters; and (iii) above all, the complex system under study. The choice of an exploration strategy is thus totally dependent on the domain to explore, and the generic algorithms of the literature are not optimal. Because of the singularity of complex simulators and the necessities and difficulties of exploring their parameter spaces, we propose to guide the exploration of complex systems with GRADEA, a stratified cooperative exploration protocol that hybridizes three exploration algorithms of different categories in the same environment: screening search for areas of interest, global search, and local search. The exploration strategies traverse the search space in parallel, both to find the global optimum of the optimization problem and to partially map the solution space in order to understand the emergent behavior of the model. First results of the stratified exploration protocol, with an example set of preselected exploration algorithms, are applied to a simulator in the environmental domain to aid the design of measles vaccination policies in Vietnam. The coupling of exploration algorithms is built on a modular, agent-based architecture that interacts with the computing nodes where the simulations run. This environment facilitates both the interaction between a selection of exploration algorithms and the use of high-performance computing resources. The challenge addressed so far is to offer the community an optimized environment in which users are able: (i) to combine exploration algorithms adapted to their case study; and (ii) to take advantage of available high-performance computing resources to accelerate the exploration.
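A toy sketch of the stratified idea follows: screening narrows the domain to promising strata, a global search samples within them, and a local search refines the incumbent. The objective function and the three stand-in algorithms are assumptions for illustration; in the real protocol the strategies run in parallel on a cluster via agents.

```python
import random

# Toy sketch of GRADEA-style stratification (stand-in algorithms, not the
# thesis implementation). Here the three phases run in sequence for clarity;
# the protocol runs them cooperatively and in parallel.

def f(x):                      # stand-in for one (expensive) simulation run
    return (x - 0.37) ** 2

# 1. Screening: coarse grid over [0, 1], keep the most promising strata.
grid = [i / 10 for i in range(11)]
strata = sorted(grid, key=f)[:3]               # 3 best cells of width 0.1

# 2. Global search: random sampling restricted to the kept strata.
samples = [s + random.random() * 0.1 for s in strata for _ in range(20)]
best = min(samples, key=f)

# 3. Local search: simple hill climbing around the incumbent.
step = 0.01
for _ in range(100):
    candidate = best + random.choice((-step, step))
    if f(candidate) < f(best):
        best = candidate

print(f"best x = {best:.3f}, f = {f(best):.2e}")
```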
498

Approche haut niveau pour l’accélération d’algorithmes sur des architectures hétérogènes CPU/GPU/FPGA. Application à la qualification des radars et des systèmes d’écoute électromagnétique / High-Level Approach for the Acceleration of Algorithms on CPU/GPU/FPGA Heterogeneous Architectures. Application to Radar Qualification and Electromagnetic Listening Systems

Martelli, Maxime 13 December 2019 (has links)
As the semiconductor industry faces major challenges in sustaining its growth, new High-Level Synthesis tools are repositioning FPGAs as a leading technology for algorithm acceleration, in competition with CPU- and GPU-based clusters. As they stand, however, these tools do not guarantee that a software engineer without expertise in the underlying hardware will harness these technologies to their full potential, which can be a game breaker for their democratization. From this observation, we propose a methodology for algorithm acceleration on FPGAs. After presenting a high-level model of this architecture, we detail possible optimizations in OpenCL, and finally define a relevant exploration strategy for accelerating algorithms on FPGAs. Applied to different case studies, from tomographic reconstruction to the modelling of an airborne radar jammer, we evaluate our methodology according to three main performance criteria: development time, execution time, and energy efficiency.
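One representative OpenCL-level optimization for FPGA targets is explicit loop unrolling, which lets the HLS compiler replicate the loop datapath. The sketch below shows a toy kernel with an unroll hint and a minimal host program; it assumes pyopencl and a working OpenCL platform, and the kernel, sizes, and names are illustrative rather than the thesis code (on non-FPGA platforms the pragma is advisory or ignored).

```python
import numpy as np
import pyopencl as cl

# Toy kernel with an explicit unroll hint, one of the OpenCL-level
# optimizations relevant to FPGA HLS flows. Everything here is a sketch.

SRC = """
__kernel void reduce_sum(__global const float *x, __global float *out, int n) {
    float acc = 0.0f;
    #pragma unroll 8          // HLS hint: replicate the loop body 8x
    for (int i = 0; i < n; ++i)
        acc += x[i];
    out[get_global_id(0)] = acc;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
x = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=4)      # one float32
prg = cl.Program(ctx, SRC).build()
prg.reduce_sum(queue, (1,), None, x_buf, out_buf, np.int32(x.size))
out = np.empty(1, dtype=np.float32)
cl.enqueue_copy(queue, out, out_buf)
print(out[0], x.sum())
```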
499

Contribution to High Performance Computing and Big Data Infrastructure Convergence / Contribution à la convergence d'infrastructure entre le calcul haute performance et le traitement de données à large échelle

Mercier, Michael 01 July 2019 (has links)
The amount of data produced, in both the scientific community and the commercial world, is constantly growing. The field of Big Data has emerged to handle large amounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are made for intensive parallel computations, and the HPC community is also facing more and more data because of new high-definition sensors and large physics apparatuses. The convergence of the two fields is currently happening: the HPC community already uses Big Data tools, but they are not integrated correctly, especially at the level of the file system and the Resources and Job Management System (RJMS). In order to understand how we can leverage HPC clusters for Big Data usage, and what the challenges are for HPC infrastructures, we have studied multiple aspects of the convergence. We survey software provisioning methods, with a focus on data-intensive applications. We also propose a new RJMS collaboration technique called BeBiDa, which is based on 50 lines of code, whereas similar solutions use at least 1000 times more. We evaluate this mechanism under real conditions and in simulation with our simulator Batsim.
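The abstract does not detail how BeBiDa's 50 lines are spent, but one plausible shape for such an RJMS collaboration, sketched below purely as an assumption, is a pair of prolog/epilog hooks that evict best-effort Big Data workers from nodes when an HPC job starts and hand the nodes back when it ends. The ssh/systemctl calls and the `bigdata-worker` unit name are hypothetical.

```python
import subprocess

# Hypothetical sketch of an RJMS prolog/epilog collaboration in the spirit
# of BeBiDa (the abstract gives no implementation details; the commands and
# the "bigdata-worker" unit name are assumptions): the HPC resource manager
# reclaims nodes from the best-effort Big Data pool for the duration of an
# HPC job.

def prolog(nodes):
    """Run by the RJMS before an HPC job starts: reclaim the nodes."""
    for node in nodes:
        subprocess.run(["ssh", node, "systemctl", "stop", "bigdata-worker"],
                       check=False)   # best effort: a worker may not be there

def epilog(nodes):
    """Run after the HPC job ends: return the nodes to the Big Data pool."""
    for node in nodes:
        subprocess.run(["ssh", node, "systemctl", "start", "bigdata-worker"],
                       check=False)

prolog(["node01", "node02"])   # example invocation with hypothetical hosts
epilog(["node01", "node02"])
```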
500

Relativistic Causal Ordering: A Memory Model for Scalable Concurrent Data Structures

Triplett, Josh 01 January 2012 (has links)
High-performance programs and systems require concurrency to take full advantage of available hardware. However, the available concurrent programming models force a difficult choice between simple models, such as mutual exclusion, that produce little to no concurrency, and complex models, such as Read-Copy Update, that can scale to all available resources. Simple concurrent programming models enforce atomicity and causality, and this enforcement limits concurrency. Scalable concurrent programming models expose the weakly ordered hardware memory model, requiring careful and explicit enforcement of causality to preserve correctness, as demonstrated in this dissertation through the manual construction of a scalable hash-table item-move algorithm. Recent research on "relativistic programming" aims to standardize the programming model of Read-Copy Update, but thus far these efforts have lacked a generalized memory ordering model, requiring data-structure-specific reasoning to preserve causality. I propose a new memory ordering model, "relativistic causal ordering", which combines the scalability of relativistic programming and Read-Copy Update with the simplicity of reader atomicity and automatic enforcement of causality. Programs written for the relativistic model translate to scalable concurrent programs for weakly-ordered hardware via a mechanical process of inserting barrier operations according to well-defined rules. To demonstrate the relativistic causal ordering model, I walk through the straightforward construction of a novel concurrent hash-table resize algorithm, including the translation of this algorithm from the relativistic model to a hardware memory model, and show through benchmarks that the resulting algorithm scales far better than those based on mutual exclusion.
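A conceptual sketch of the mechanical barrier-insertion rule follows. Python's runtime is far more strongly ordered than weak hardware, so the two barrier functions are no-op stand-ins that only mark where the translation would place real barriers (e.g. smp_wmb/smp_rmb in C); the publish/read pair itself is a classic illustration of the causality requirement, not code from the dissertation.

```python
# Conceptual sketch of the mechanical barrier-insertion rule. The barrier
# functions are stand-ins marking where the rule would insert real ones.

def write_barrier():   # stand-in: orders earlier stores before later ones
    pass

def read_barrier():    # stand-in: orders dependent loads on weak hardware
    pass

node = None            # shared pointer read by lock-free readers

def publish(value):
    """Writer rule: fully initialize, then barrier, then the publishing
    store, so any reader that sees the pointer also sees the fields."""
    global node
    fresh = {"value": value, "next": None}   # initialization stores...
    write_barrier()                          # ...barrier inserted by rule...
    node = fresh                             # ...then the single publish

def read():
    """Reader rule: fetch the pointer, then barrier, then dereference."""
    snapshot = node
    read_barrier()
    return None if snapshot is None else snapshot["value"]

publish(42)
print(read())   # 42
```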
