Global ETD Search

351	A Unified Infrastructure for Monitoring and Tuning the Energy Efficiency of HPC Applications Schöne, Robert 07 November 2017 (has links) (PDF) High Performance Computing (HPC) has become an indispensable tool for the scientific community to perform simulations on models whose complexity would exceed the limits of a standard computer. An unfortunate trend concerning HPC systems is that their power consumption under highly demanding workloads is increasing. To counter this trend, hardware vendors have implemented power saving mechanisms in recent years, which has increased the variability in power demands of single nodes. These capabilities provide an opportunity to increase the energy efficiency of HPC applications. To utilize these hardware power saving mechanisms efficiently, their overhead must be analyzed. Furthermore, applications have to be examined for performance and energy efficiency issues, which can give hints for optimizations. This requires an infrastructure that is able to capture both performance and power consumption information concurrently. The mechanisms that such an infrastructure would inherently support could further be used to implement a tool that is able to both measure and tune energy efficiency. This thesis targets all steps in this process by making the following contributions: First, I provide a broad overview of different related fields. I list common performance measurement tools, power measurement infrastructures, hardware power saving capabilities, and tuning tools. Second, I lay out a model that can be used to define and describe energy efficiency tuning at program-region scale. This model includes hardware and software dependent parameters. Hardware parameters include the runtime overhead and delay for switching power saving mechanisms as well as a consideration of their scopes and the possible influence on application performance. 
Thus, in a third step, I present methods to evaluate common power saving mechanisms and list findings for different x86 processors. Software parameters include their performance and power consumption characteristics as well as the influence of power-saving mechanisms on these. To capture software parameters, an infrastructure for measuring performance and power consumption is necessary. With minor additions, the same infrastructure can later be used to tune software and hardware parameters. Thus, I lay out the structure for such an infrastructure and describe common components that are required for measuring and tuning. Based on that, I implement adequate interfaces that extend the functionality of contemporary performance measurement tools. Furthermore, I use these interfaces to conflate performance and power measurements and further process the gathered information for tuning. I conclude this work by demonstrating that the infrastructure can be used to manipulate power-saving mechanisms of contemporary x86 processors and increase the energy efficiency of HPC applications. Hochleistungsrechnen HPC Energieeffizienz paralleles Rechnen Energieeffizienzsanalyse high performance computing HPC energy efficiency power saving tools ddc:004 rvk:ST 151
352	Transport des rayons cosmiques en turbulence magnétohydrodynamique / Cosmic Ray transport in magnetohydrodynamic turbulence Cohet, Romain 12 February 2015 (has links) Dans cette thèse, nous étudions les propriétés du transport de particules chargées de haute énergie dans des champs électromagnétiques turbulents. Ces champs ont été générés en utilisant le code magnétohydrodynamique (MHD) RAMSES, résolvant les équations de la MHD idéales compressibles. Nous avons développé un module pour générer la turbulence MHD, en utilisant une technique de forçage à grande échelle. Les propriétés des équations de la MHD font cascader l'énergie des grandes échelles vers les petites, développant un spectre en énergie suivant une loi de puissance, appelée zone inertielle. Nous avons développé un module permettant de calculer les trajectoires de particule chargée une fois le spectre turbulent établi. En injectant les particules à une énergie telle que l'inverse du rayon de Larmor des particules corresponde à un mode du spectre de Fourier dans la zone inertielle, nous avons cherché à mettre en évidence un effet systématique lié à la loi de puissance du spectre. Cette méthode a montré que le libre parcours moyen est indépendant de l'énergie des particules jusqu'à des valeurs de rayon de Larmor proches de l'échelle de cohérence de la turbulence. La dépendance du libre parcours moyen avec le nombre de Mach alfvénique des simulations MHD a également produit une loi de puissance. Nous avons également développé une technique pour mesurer l'effet de l'anisotropie de la turbulence MHD sur les propriétés du transport des rayons cosmiques, au travers le calcul de champs magnétiques locaux. Cette étude nous a montré un effet sur coefficient de diffusion angulaire, accréditant l'hypothèse que les particules sont plus sensible aux variations de petites échelles. 
/ In this thesis, we study the transport properties of high-energy charged particles in turbulent electromagnetic fields. These fields were generated using the magnetohydrodynamic (MHD) code RAMSES, which solves the compressible ideal MHD equations. We have developed a module for generating the MHD turbulence using a large-scale forcing technique. The MHD equations induce a cascade of energy from large scales to small ones, developing an energy spectrum that follows a power law over a range called the inertial range. We have developed a module for computing the charged particle trajectories once the turbulent spectrum is established. By injecting the particles at an energy such that the inverse of the particle Larmor radius corresponds to a mode in the inertial range of the Fourier spectrum, we sought to highlight systematic effects related to the power-law spectrum. This method showed that the mean free path is independent of the particle energy until the Larmor radius takes values close to the turbulence coherence scale. The dependence of the mean free path on the Alfvénic Mach number of the MHD simulations also follows a power law. We have also developed a technique to measure the effect of the anisotropy of MHD turbulence on cosmic ray transport properties through the calculation of local magnetic fields. This study has shown an effect on the pitch-angle scattering coefficient, which supports the assumption that the particles are more sensitive to small-scale fluctuations. Rayonnement cosmique Turbulence Transport Magnétohydrodynamique Calculs hautes performances Numériques Transport Turbulence Cosmic ray Magnetohydrodynamic High Performance Computing Numerics
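Illustrative note on entry 352: the abstract says particles are injected so that the inverse of their Larmor radius matches a Fourier mode in the inertial range. A minimal sketch of that matching follows, assuming SI units and the textbook gyroradius expression; the function names are hypothetical and not taken from the thesis, and whether the thesis includes a factor of 2π in the wavenumber is not stated.

```python
def larmor_radius(gamma, mass_kg, v_perp_m_s, charge_c, b_tesla):
    """Gyroradius r_L = gamma * m * v_perp / (|q| * B), in metres (SI units assumed)."""
    return gamma * mass_kg * v_perp_m_s / (abs(charge_c) * b_tesla)


def resonant_wavenumber(r_larmor_m):
    """Wavenumber k = 1 / r_L of the mode the particle is matched to (assumed convention)."""
    return 1.0 / r_larmor_m


# Hypothetical round numbers, chosen only to make the arithmetic easy to follow.
r = larmor_radius(gamma=1.0, mass_kg=2.0, v_perp_m_s=3.0, charge_c=-1.0, b_tesla=6.0)
k = resonant_wavenumber(r)
```

With these inputs the radius is 1 m and the matched wavenumber is 1 per metre; raising the particle energy increases the radius and moves the matched mode toward larger scales.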
353	A Hardware and Software Integrated Approach for Adaptive Thread Management in Multicore Multithreaded Microprocessors Weng, Lichen 23 April 2012 (has links) The Multicore Multithreaded Microprocessor maximizes parallelism on a chip for optimal system performance, and its popularity is growing rapidly in high-performance computing. It increases the complexity of resource distribution on a chip by pulling it in two directions: isolation and unification. On one hand, multiple cores are implemented to deliver computation and memory access resources to more than one thread at the same time. Nevertheless, this limits the threads’ access to resources in different cores, even when those resources are in heavy demand. On the other hand, simultaneous multithreaded architectures unify the domestic execution resources for concurrently running threads. In such an environment, threads are greatly affected by inter-thread interference. Moreover, the impacts of the complicated distribution are amplified by variation in workload behaviors. As a result, the microprocessor requires an adaptive management scheme to schedule threads across different cores and coordinate them within cores. In this study, an adaptive thread management scheme was proposed, integrating both hardware and software approaches. The instruction fetch policy at the hardware level took responsibility for prioritizing domestic threads, while the Operating System scheduler at the software level was used to pair threads dynamically to multiple cores. The tie between them was the proposed online linear model, which was dynamically constructed for every thread based on data misses by the regression algorithm. Consequently, the hardware part of the proposed scheme proactively granted higher priority to the threads with fewer predicted long-latency loads, expecting that they would better utilize the shared execution resources. Meanwhile, the software part was invoked by such a model upon significant changes in the execution phases and paired threads with different demands to the same core to minimize competition on the chip. The proposed scheme was compared to its peer designs, and an overall 43% speedup was achieved by the integrated approach over the combination of two baseline policies in hardware and software, respectively. The overhead was examined carefully regarding power, area, storage and latency, as well as the relationship between the overhead and the performance. Computer architecture High-performance computing Simultaneous Multithreaded Adaptive resource management Multicore multithreaded microprocessor Ordinary least square regression
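Illustrative note on entry 353: the abstract describes a per-thread online linear model built from data misses by regression and used to predict long-latency loads. A minimal sketch of one way such a model could be maintained incrementally with ordinary least squares follows. The class name, the choice of a single input variable, and the sample numbers are assumptions made for illustration; the abstract does not specify the thesis's actual features or update rule.

```python
class OnlineLinearModel:
    """Fits y = intercept + slope * x by ordinary least squares from running sums."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        # x: data misses observed in an interval (assumed feature)
        # y: long-latency loads observed afterwards (assumed target)
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            # Not enough information for a slope: fall back to the mean of y.
            return (self.sy / self.n if self.n else 0.0), 0.0
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

    def predict(self, x):
        intercept, slope = self.coefficients()
        return intercept + slope * x


# Hypothetical samples that lie exactly on y = 1 + 2x.
model = OnlineLinearModel()
for misses, loads in [(1, 3), (2, 5), (3, 7)]:
    model.update(misses, loads)
predicted = model.predict(4)
```

Keeping only running sums means each update costs constant time and storage, which is the kind of property an online per-thread model would need; a fetch policy could then favor the thread whose predicted value is lowest.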
354	Improving memory consumption and performance scalability of HPC applications with multi-threaded network communications / Amélioration de la consommation mémoire et de l'extensibilité des performances des applications HPC par le multi-threading des communications réseaux Didelot, Sylvain 12 June 2014 (has links) La tendance en HPC est à l'accroissement du nombre de coeurs par noeud de calcul pour une quantité totale de mémoire par noeud constante. A large échelle, l'un des principaux défis pour les applications parallèles est de garder une faible consommation mémoire. Cette thèse présente une couche de communication multi-threadée sur Infiniband, laquelle fournie de bonnes performances et une faible consommation mémoire. Nous ciblons les applications scientifiques parallélisées grâce à la bibliothèque MPI ou bien combinées avec un modèle de programmation en mémoire partagée. En partant du constat que le nombre de connexions réseau et de buffers de communication est critique pour la mise à l'échelle des bibliothèques MPI, la première contribution propose trois approches afin de contrôler leur utilisation. Nous présentons une topologie virtuelle extensible et entièrement connectée pour réseaux rapides orientés connexion. Dans un contexte agrégeant plusieurs cartes permettant d'ajuster dynamiquement la configuration des buffers réseau utilisant la technologie RDMA. La seconde contribution propose une optimisation qui renforce le potentiel d'asynchronisme des applications MPI, laquelle montre une accélération de deux des communications. La troisième contribution évalue les performances de plusieurs bibliothèques MPI exécutant une application de modélisation sismique en contexte hybride. Les expériences sur des noeuds de calcul jusqu'à 128 coeurs montrent une économie de 17 % sur la mémoire. De plus, notre couche de communication multi-threadée réduit le temps d'exécution dans le cas où plusieurs threads OpenMP participent simultanément aux communications MPI. 
/ A recent trend in high performance computing shows a rising number of cores per compute node, while the total amount of memory per compute node remains constant. To scale parallel applications on such large machines, one of the major challenges is to keep memory consumption low. This thesis develops a multi-threaded communication layer over Infiniband which provides both good communication performance and low memory consumption. We target scientific applications parallelized using the MPI standard in pure mode or combined with a shared memory programming model. Starting with the observation that network endpoints and communication buffers are critical for the scalability of MPI runtimes, the first contribution proposes three approaches to control their usage. We introduce a scalable and fully-connected virtual topology for connection-oriented high-speed networks. In the context of multirail configurations, we then detail a runtime technique which reduces the number of network connections. We finally present a protocol for dynamically resizing network buffers over the RDMA technology. The second contribution proposes a runtime optimization that strengthens the overlap potential of MPI communications, showing a 2x improvement factor on communications. The third contribution evaluates the performance of several MPI runtimes running a seismic modeling application in a hybrid context. On large compute nodes with up to 128 cores, the introduction of OpenMP in the MPI application saves up to 17 % of memory. Moreover, we show a performance improvement with our multi-threaded communication layer where the OpenMP threads concurrently participate in the MPI communications. Calcul haute performance Multi-threading Réseaux haut débit MPI NUMA High performance computing Multi-threading High-speed networks MPI NUMA
355	Placement d'applications parallèles en fonction de l'affinité et de la topologie / Placement of parallel applications according to the topology and the affinity Tessier, Francois 26 January 2015 (has links) La simulation numérique est un des piliers des Sciences et de l’industrie. La simulation météorologique, la cosmologie ou encore la modélisation du coeur humain sont autant de domaines dont les besoins en puissance de calcul sont sans cesse croissants. Dès lors, comment passer ces applications à l’échelle ? La parallélisation et les supercalculateurs massivement parallèles sont les seuls moyens d’y parvenir. Néanmoins, il y a un prix à payer compte tenu des topologies matérielles de plus en plus complexes, tant en terme de réseau que de hiérarchie mémoire. La question de la localité des données devient ainsi centrale : comment réduire la distance entre une entité logicielle et les données auxquelles elle doit accéder ? Le placement d’applications est un des leviers permettant de traiter ce problème. Dans cette thèse, nous présentons l’algorithme de placement TreeMatch et ses applications dans le cadre du placement statique, c’est-à-dire au lancement de l’application, et du placement dynamique. Pour cette seconde approche, nous proposons la prise en compte de la localité des données dans le cadre d’un algorithme d’équilibrage de charge. Les différentes approches abordées sont validées par des expériences réalisées tant sur des codes d’évaluation de performances que sur des applications réelles. / Computer simulation is one of the pillars of science and industry. Climate simulation, cosmology, and heart modeling are all areas in which computing power needs are constantly growing. How, then, can these applications be scaled? Parallelization and massively parallel supercomputers are the only ways to achieve this. Nevertheless, there is a price to pay given the increasingly complex hardware topologies, in terms of both network and memory hierarchy. 
The issue of data locality thus becomes central: how can the distance between a processing entity and the data it needs to access be reduced? Application placement is one of the levers for addressing this problem. In this thesis, we present the TreeMatch algorithm and its applications to static placement, that is, at application launch time, and to dynamic placement. For this second approach, we propose taking data locality into account within a load balancing algorithm. The different approaches discussed are validated by experiments on both benchmarking codes and real applications. Calcul haute performance Parallélisme Localité Affinité Topologie Placement Équilibrage de charge High performance computing Parallelism Locality Affinity Topology Placement Load balancing
356	Contributions to Software Runtime for Clustered Manycores Applied to Embedded and High-Performance Applications / Contributions aux environnements d’exécution pour processeurs massivement parallèles et clustérisés appliqués aux applications embarquées et hautes performances Hascoët, Julien 14 December 2018 (has links) Le besoin en calculs est toujours plus important et difficile à satisfaire, spécialement dans le domaine de l’informatique embarquée qui inclue les voitures autonomes, drones et téléphones intelligents. Les systèmes embarqués doivent respecter des contraintes fortes de temps, de consommation et de sécurité. Les nouveaux processeurs parallèles et hétérogènes comme le MPPA® de Kalray utilisé dans cette thèse, doivent alors combiner haute performance et basse consommation. Pour cela, le MPPA® intègre 288 coeurs, regroupés en 18 clusters à mémoire locale partagée, un réseau sur puce et des moteurs DMA pour les communications. Ces processeurs sont difficiles à programmer, engendrant des coûts de développement importants. Cette thèse a pour objectif de simplifier leur programmation tout en optimisant les performances finales. Nous proposons pour cela AOS, une librairie de communication et synchronisation haute performance gérant les mémoires locales distribuées des processeurs clustérisés. La librairie atteint 70% de la crête matérielle pour des transferts supérieurs à 8 KB. Nous proposons plusieurs outils de développement basés sur AOS et des modèles de programmation flux-dedonnées pour accélérer le développement d’applications parallèles pour processeurs clustérisés, notamment OpenVX qui est un nouveau standard pour les applications de vision et les réseaux de neurones. Nous automatisons l’optimisation de l’application OpenVX en faisant du pré-chargement de données et en les fusionnants, pour éviter le mur de la bande passante mémoire externe. Les résultats montrent des facteurs d’accélération super linéaires. 
/ The growing need for computing is more and more challenging to satisfy, especially in the embedded system world with autonomous cars, drones, and smartphones. New highly parallel and heterogeneous processors emerge to answer this challenge. They operate in constrained environments with real-time, reduced power consumption, and safety requirements. Programming these new chips is a time-consuming and challenging task leading to huge software development costs. The Kalray MPPA® processor is a competitive example for low-power super-computing on a single chip. It integrates up to 288 VLIW cores grouped in 18 clusters, each fitted with shared local memory. These clusters are interconnected with a high-bandwidth network-on-chip, and DMA engines are used to communicate. This processor is used in this thesis for experimental results. We propose the AOS library, which enables high-performance communication and synchronization of distributed local memories on clustered manycores. AOS provides 70% of the peak hardware throughput for transfers larger than 8 KB. We propose tools for the implementation of static and dynamic dataflow programs based on AOS to accelerate parallel application development on clustered manycores. We propose an implementation of OpenVX for clustered manycores on top of AOS. OpenVX is a standard based on dataflow for the development of computer vision and neural network computing. The proposed OpenVX implementation includes automatic optimizations like data prefetch to overlap communications and computations, or kernel fusion to avoid the main memory bandwidth bottleneck. Results show super-linear speedups. Calcul à hautes performances Parallélismes Communications Flux-de-données Systèmes embarqués High-Performance Computing Parallelisms Communications Dataflow Embedded systems 004.2
357	Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems Haiyang, Shi January 2020 (has links) No description available. Computer Engineering Computer Science High-Performance Computing Erasure Coding In-Network Computing Coherent In-Network Erasure Coding RDMA NIC Offload
358	Řešení pro clusterování serverů / Server clustering techniques Čech, Martin January 2009 (has links) This work analyzes Open Source Software (hereafter OSS) that can be used to build and operate computer clusters. It explores the topic of clustering and cluster construction. All installation, configuration, and cluster management was done on the GNU/Linux operating system. The OSS presented makes it possible to build a storage cluster, a load-balancing cluster, a high-availability cluster, and a computing cluster. Different types of benchmarks were analyzed theoretically and used in practice to measure cluster performance. The results were compared with others, e.g. the TOP500 list of the best clusters, available online. The practical part of the work compares the performance of computing clusters. A cluster of several tens of computational nodes was set up, and the OpenMPI package, which allows calculations to be parallelized, was installed on it. Tests were then performed with High Performance Linpack, which reports overall performance by solving a system of linear equations. The influence of parallelization on the PEA algorithm was also tested. To demonstrate practical usability, the cluster was tested with the program John the Ripper, which is used to crack user passwords. The work includes a number of graphs that clarify the function and, above all, show the results achieved.
359	Runtime MPI Correctness Checking with a Scalable Tools Infrastructure Hilbrich, Tobias 08 June 2015 (has links) Increasing computational demand of simulations motivates the use of parallel computing systems. At the same time, this parallelism poses challenges to application developers. The Message Passing Interface (MPI) is a de facto standard for distributed memory programming in high performance computing. However, its use also enables complex parallel programming errors such as races, communication errors, and deadlocks. Automatic tools can assist application developers in the detection and removal of such errors. This thesis considers tools that detect such errors during an application run and advances them towards a combination of both precise checks (neither false positives nor false negatives) and scalability. This includes novel hierarchical checks that provide scalability, as well as a formal basis for a distributed deadlock detection approach. At the same time, the development of parallel runtime tools is challenging and time consuming, especially if scalability and portability are key design goals. Current tool development projects often create similar tool components, while component reuse remains low. To provide a perspective towards more efficient tool development, which simplifies scalable implementations, component reuse, and tool integration, this thesis proposes an abstraction for a parallel tools infrastructure along with a prototype implementation. This abstraction overcomes the use of multiple interfaces for different types of tool functionality, which limit flexible component reuse. Thus, this thesis advances runtime error detection tools and uses their redesign and their increased scalability requirements to apply and evaluate a novel tool infrastructure abstraction. The new abstraction ultimately allows developers to focus on their tool functionality, rather than on developing or integrating common tool components. 
The use of such an abstraction in a wide range of parallel runtime tool development projects could greatly increase component reuse, thus decreasing tool development time and cost. An application study with up to 16,384 application processes demonstrates the applicability of both the proposed runtime correctness concepts and the proposed tools infrastructure. info:eu-repo/classification/ddc/004 ddc:004
360	Reactive transport modeling at hillslope scale with high performance computing methods He, Wenkui 07 November 2016 (has links) Reactive transport modeling is an important approach to understand water dynamics, mass transport and biogeochemical processes from the hillslope to the catchment scale. It has a wide range of applications in fields such as water resource management, contaminated site remediation and geotechnical engineering. Simulating reactive transport processes at hillslope or larger scales is a challenging task, which involves interactions of complex physical and biogeochemical processes, huge computational expense, and difficulties in numerical precision and stability. The primary goal of the work is to develop a practical, accurate and efficient tool that advances simulation techniques for reactive transport problems towards hillslope or larger scales. The first part of the work deals with the simulation of water flow in saturated and unsaturated porous media. The capability and accuracy of different numerical approaches were analyzed and compared by using benchmark tests. The second part of the work introduces the coupling of the scientific software packages OpenGeoSys and IPhreeqc by using a character-string-based interface. The accuracy and computational efficiency of the coupled tool were discussed based on three benchmarks. The results show that OGS#IPhreeqc provides sufficient numerical accuracy to simulate reactive transport problems for both equilibrium and kinetic reactions in variably saturated porous media. The third part of the work describes the algorithm of a parallelization scheme using the MPI (Message Passing Interface) grouping concept, which enables a flexible allocation of computational resources for calculating geochemical reactions and physical processes such as groundwater flow and transport. The parallel performance of the approach was tested with three examples. 
The tests show that the new approach has advantages over the conventional ones for the calculation of geochemically dominated problems, especially when only limited benefit can be obtained through parallelization for solving flow or solute transport. The comparison between the character-string-based and the file-based coupling shows that the former approach produces less computational overhead in a distributed-memory system such as a computing cluster. The last part of the work shows the application of OGS#IPhreeqc to the simulation of the water dynamics and denitrification process in the groundwater aquifer of a study site in Northern Germany. It demonstrates that OGS#IPhreeqc is able to simulate heterogeneous reactive transport problems at a hillslope scale within an acceptable time span. The model results show the importance of functional zones for natural attenuation processes. / Modellierung des reaktiven Stofftransports ist ein wichtiger Ansatz um die Wasserströmung, den Stofftransport und die biogeochemischen Prozesse von der Hang- bis zur Einzugsgebietsskala zu verstehen. Es gibt umfangreiche Anwendungsgebiete, z.B. in der Wasserwirtschaft, Umweltsanierung und Geotechnik. Die Simulation der reaktiven Stofftransportprozesse auf der Hangskala oder auf größeren Maßstäbe ist eine anspruchsvolle Aufgabe, da es sich um die Wechselwirkungen komplexer physikalischer und biogeochemischen Prozesse handelt, die riesigen Berechnungsaufwand sowie numerischen Schwierigkeiten bezogen auf die Genauigkeit und die Stabilität nach sich ziehen. Das Hauptziel dieser Arbeit besteht darin, ein praktisches, genaues und effizientes Werkzeug zu entwickeln, um die Simulationstechnik für reaktiven Stofftransport auf der Hangskala und auf größeren Skalen zu verbessern. Der erste Teil der Arbeit behandelt die Simulation der Wasserströmung in gesättigten und ungesättigten porösen Medien. 
Das Anwendungspotential und die Genauigkeit verschiedener numerischer Ansätze wurden mittels einiger Benchmarks analysiert und miteinander verglichen. Der zweite Teil der Arbeit stellt die Kopplung der wissenschaftlichen Softwarepakete OpenGeoSys und IPhreeqc mit einer stringbasierten Schnittstelle dar. Die Genauigkeit und die Recheneffizienz des gekoppelten Tools OGS#IPhreeqc wurden basierend auf drei Benchmark-Tests diskutiert. Das Ergebnis zeigt, dass OGS#IPhreeqc die ausreichende numerische Genauigkeit für die Simulation reaktiven Stofftransports liefert, welcher sich sowohl auf die Gleichgewichtsreaktion als auch auf die kinetische Reaktion in variabel gesättigten porösen Medien beziehen. Der dritte Teil der Arbeit beschreibt zuerst den Algorithmus der Parallelisierung des OGS#IPhreeqc basierend auf dem MPI (Message Passing Interface) Gruppierungskonzept, welcher eine flexible Verteilung der Rechenressourcen für die Berechnung der geochemischen Reaktion und der physikalischen Prozesse wie z.B. Wasserströmung oder Stofftransport ermöglicht. Danach wurde die Leistungsfähigkeit des Algorithmus anhand von drei Beispielen getestet. Es zeigt sich, dass der neue Ansatz Vorteile gegenüber die konventionellen Ansätzen für die Berechnung von geochemisch dominierten Problemen bringt. Dies ist vor allem dann der Fall, wenn nur eingeschränkter Nutzen aus der Parallelisierung für die Berechnung der Wasserströmung oder des Stofftransportes gezogen werden kann. Der Vergleich zwischen der string- und der dateibasierten Kopplung zeigt, dass die erstere weniger Rechenoverhead in einem verteilten Rechnersystem, wie z.B. Cluster erzeugt. Der letzte Teil der Arbeit zeigt die Anwendung von OGS#IPhreeqc für die Simulation der Wasserdynamik und der Denitrifikation im Grundwasserleiter eines Untersuchungsgebietes in NordDeutschland. Es beweist, dass OGS#IPhreeqc in der Lage ist, reaktiven Stofftransport auf der Hangskala innerhalb akzeptabler Zeitspanne zu simulieren. 
Die Simulationsergebnisse zeigen die Bedeutung der funktionalen Zonen für die natürlichen Selbstreinigungsprozesse. info:eu-repo/classification/ddc/550 ddc:550
