Global ETD Search

351	A Hardware and Software Integrated Approach for Adaptive Thread Management in Multicore Multithreaded Microprocessors Weng, Lichen 23 April 2012 (has links) The Multicore Multithreaded Microprocessor maximizes parallelism on a chip for the optimal system performance, such that its popularity is growing rapidly in high-performance computing. It increases the complexity in resource distribution on a chip by leading it to two directions: isolation and unification. On one hand, multiple cores are implemented to deliver the computation and memory accessing resources to more than one thread at the same time. Nevertheless, it limits the threads’ access to resources in different cores, even if extensively demanded. On the other hand, simultaneous multithreaded architectures unify the domestic execu- tion resources together for concurrently running threads. In such an environment, threads are greatly affected by the inter-thread interference. Moreover, the impacts of the complicated distribution are enlarged by variation in workload behaviors. As a result, the microprocessor requires an adaptive management scheme to schedule threads throughout different cores and coordinate them within cores. In this study, an adaptive thread management scheme was proposed, integrating both hardware and software approaches. The instruction fetch policy at the hardware level took the responsibility by prioritizing domestic threads, while the Operating System scheduler at the software level was used to pair threads dynami- vi cally to multiple cores. The tie between them was the proposed online linear model, which was dynamically constructed for every thread based on data misses by the regression algorithm. Consequently, the hardware part of the proposed scheme proactively granted higher priority to the threads with less predicted long-latency loads, expecting they would better utilize the shared execution resources. Mean- while, the software part was invoked by such a model upon significant changes in the execution phases and paired threads with different demands to the same core to minimize competition on the chip. The proposed scheme was compared to its peer designs and overall 43% speedup was achieved by the integrated approach over the combination of two baseline policies in hardware and software, respectively. The overhead was examined carefully regarding power, area, storage and latency, as well as the relationship between the overhead and the performance. Computer architecture High-performance computing Simultaneous Multithreaded Adaptive resource management Multicore multithreaded microprocessor Ordinary least square regression
352	Improving memory consumption and performance scalability of HPC applications with multi-threaded network communications / Amélioration de la consommation mémoire et de l'extensibilité des performances des applications HPC par le multi-threading des communications réseaux Didelot, Sylvain 12 June 2014 (has links) La tendance en HPC est à l'accroissement du nombre de coeurs par noeud de calcul pour une quantité totale de mémoire par noeud constante. A large échelle, l'un des principaux défis pour les applications parallèles est de garder une faible consommation mémoire. Cette thèse présente une couche de communication multi-threadée sur Infiniband, laquelle fournie de bonnes performances et une faible consommation mémoire. Nous ciblons les applications scientifiques parallélisées grâce à la bibliothèque MPI ou bien combinées avec un modèle de programmation en mémoire partagée. En partant du constat que le nombre de connexions réseau et de buffers de communication est critique pour la mise à l'échelle des bibliothèques MPI, la première contribution propose trois approches afin de contrôler leur utilisation. Nous présentons une topologie virtuelle extensible et entièrement connectée pour réseaux rapides orientés connexion. Dans un contexte agrégeant plusieurs cartes permettant d'ajuster dynamiquement la configuration des buffers réseau utilisant la technologie RDMA. La seconde contribution propose une optimisation qui renforce le potentiel d'asynchronisme des applications MPI, laquelle montre une accélération de deux des communications. La troisième contribution évalue les performances de plusieurs bibliothèques MPI exécutant une application de modélisation sismique en contexte hybride. Les expériences sur des noeuds de calcul jusqu'à 128 coeurs montrent une économie de 17 % sur la mémoire. De plus, notre couche de communication multi-threadée réduit le temps d'exécution dans le cas où plusieurs threads OpenMP participent simultanément aux communications MPI. / A recent trend in high performance computing shows a rising number of cores per compute node, while the total amount of memory per compute node remains constant. To scale parallel applications on such large machines, one of the major challenges is to keep a low memory consumption. This thesis develops a multi-threaded communication layer over Infiniband which provides both good performance of communications and a low memory consumption. We target scientific applications parallelized using the MPI standard in pure mode or combined with a shared memory programming model. Starting with the observation that network endpoints and communication buffers are critical for the scalability of MPI runtimes, the first contribution proposes three approaches to control their usage. We introduce a scalable and fully-connected virtual topology for connection-oriented high-speed networks. In the context of multirail configurations, we then detail a runtime technique which reduces the number of network connections. We finally present a protocol for dynamically resizing network buffers over the RDMA technology. The second contribution proposes a runtime optimization to enforce the overlap potential of MPI communications, showing a 2x improvement factor on communications. The third contribution evaluates the performance of several MPI runtimes running a seismic modeling application in a hybrid context. On large compute nodes up to 128 cores, the introduction of OpenMP in the MPI application saves up to 17 % of memory. Moreover, we show a performance improvement with our multi-threaded communication layer where the OpenMP threads concurrently participate to the MPI communications Calcul haute performance Multi-threading Réseaux haut débit MPI NUMA High performance computing Multi-threading High-speed networks MPI NUMA
353	Placement d'applications parallèles en fonction de l'affinité et de la topologie / Placement of parallel applications according to the topology and the affinity Tessier, Francois 26 January 2015 (has links) La simulation numérique est un des piliers des Sciences et de l’industrie. La simulationmétéorologique, la cosmologie ou encore la modélisation du coeur humain sont autantde domaines dont les besoins en puissance de calcul sont sans cesse croissants. Dès lors,comment passer ces applications à l’échelle ? La parallélisation et les supercalculateurs massivementparallèles sont les seuls moyens d’y parvenir. Néanmoins, il y a un prix à payercompte tenu des topologies matérielles de plus en plus complexes, tant en terme de réseauque de hiérarchie mémoire. La question de la localité des données devient ainsi centrale :comment réduire la distance entre une entité logicielle et les données auxquelles elle doitaccéder ? Le placement d’applications est un des leviers permettant de traiter ce problème.Dans cette thèse, nous présentons l’algorithme de placement TreeMatch et ses applicationsdans le cadre du placement statique, c’est-à-dire au lancement de l’application, et duplacement dynamique. Pour cette seconde approche, nous proposons la prise en comptede la localité des données dans le cadre d’un algorithme d’équilibrage de charge. Les différentesapproches abordées sont validées par des expériences réalisées tant sur des codesd’évaluation de performances que sur des applications réelles. / Computer simulation is one of the pillars of Sciences and industry. Climate simulation,cosmology, or heart modeling are all areas in which computing power needs are constantlygrowing. Thus, how to scale these applications ? Parallelization and massively parallel supercomputersare the only ways to do achieve. Nevertheless, there is a price to pay consideringthe hardware topologies incessantly complex, both in terms of network and memoryhierarchy. The issue of data locality becomes central : how to reduce the distance betweena processing entity and data to which it needs to access ? Application placement is one ofthe levers to address this problem. In this thesis, we present the TreeMatch algorithmand its application for static mapping, that is to say at the lauchtime of the application,and the dynamic placement. For this second approach, we propose the awareness of datalocality within a load balancing algorithm. The different approaches discussed are validatedby experiments both on benchmarking codes and on real applications. Calcul haute performance Parallélisme Localité Affinité Topologie Placement Équilibrage de charge High performance computing Parallelism Locality Affinity Topology Placement Load balancing
354	Contributions to Software Runtime for Clustered Manycores Applied to Embedded and High-Performance Applications / Contributions aux environnements d’exécution pour processeurs massivement parallèles et clustérisés appliqués aux applications embarquées et hautes performances Hascoët, Julien 14 December 2018 (has links) Le besoin en calculs est toujours plus important et difficile à satisfaire, spécialement dans le domaine de l’informatique embarquée qui inclue les voitures autonomes, drones et téléphones intelligents. Les systèmes embarqués doivent respecter des contraintes fortes de temps, de consommation et de sécurité. Les nouveaux processeurs parallèles et hétérogènes comme le MPPA® de Kalray utilisé dans cette thèse, doivent alors combiner haute performance et basse consommation. Pour cela, le MPPA® intègre 288 coeurs, regroupés en 18 clusters à mémoire locale partagée, un réseau sur puce et des moteurs DMA pour les communications. Ces processeurs sont difficiles à programmer, engendrant des coûts de développement importants. Cette thèse a pour objectif de simplifier leur programmation tout en optimisant les performances finales. Nous proposons pour cela AOS, une librairie de communication et synchronisation haute performance gérant les mémoires locales distribuées des processeurs clustérisés. La librairie atteint 70% de la crête matérielle pour des transferts supérieurs à 8 KB. Nous proposons plusieurs outils de développement basés sur AOS et des modèles de programmation flux-dedonnées pour accélérer le développement d’applications parallèles pour processeurs clustérisés, notamment OpenVX qui est un nouveau standard pour les applications de vision et les réseaux de neurones. Nous automatisons l’optimisation de l’application OpenVX en faisant du pré-chargement de données et en les fusionnants, pour éviter le mur de la bande passante mémoire externe. Les résultats montrent des facteurs d’accélération super linéaires. / The growing need for computing is more and more challenging, especially in the embedded system world with autonomous cars, drones, and smartphones. New highly parallel and heterogeneous processors emerge to answer this challenge. They operate in constrained environments with real-time requirements, reduced power consumption, and safety. Programming these new chips is a time-consuming and challenging task leading to huge software development costs. The Kalray MPPA® processor is a competitive example for low-power super-computing on a single chip. It integrates up to 288 VLIW cores grouped in 18 clusters, each fitted with shared local memory. These clusters are interconnected with a high-bandwidth network-on-chip, and DMA engines are used to communicate. This processor is used in this thesis for experimental results. We propose the AOS library enabling highperformance communications and synchronizations of distributed local memories on clustered manycores. AOS provides 70% of the peak hardware throughput for transfers larger than 8 KB. We propose tools for the implementation of static and dynamic dataflow programs based on AOS to accelerate the parallel application developments onto clustered manycores. We propose an implementation of OpenVX for clustered manycores on top of AOS. OpenVX is a standard based on dataflow for the development of computer vision and neural network computing. The proposed OpenVX implementation includes automatic optimizations like data prefetch to overlap communications and computations, or kernel fusion to avoid the main memory bandwidth bottleneck. Results show super-linear speedups. Calcul à hautes performances Parallélismes Communications Flux-de-données Systèmes embarqués High-Performance Computing Parallelisms Communications Dataflow Embedded systems 004.2
355	Designing High-Performance Erasure Coding Schemes for Next-Generation Storage Systems Haiyang, Shi January 2020 (has links) No description available. Computer Engineering Computer Science High-Performance Computing Erasure Coding In-Network Computing Coherent In-Network Erasure Coding RDMA NIC Offload
356	Řešení pro clusterování serverů / Server clustering techniques Čech, Martin January 2009 (has links) The work is given an analysis of Open Source Software (further referred as OSS), which allows use and create computer clusters. It explored the issue of clustering and construction of clusters. All installations, configuration and cluster management have been done on the operating system GNU / Linux. Presented OSS makes possible to compile a storage cluster, cluster with load distribution, cluster with high availability and computing cluster. Different types of benchmarks was theoretically analyzed, and practically used for measuring cluster’s performance. Results were compared with others, eg. the TOP500 list of the best clusters available online. Practical part of the work deals with comparing performance computing clusters. With several tens of computational nodes has been established cluster, where was installed package OpenMPI, which allows parallelization of calculations. Subsequently, tests were performed with the High Performance Linpack, which by calculation of linear equations provides total performance. Influence of the parallelization to algorithm PEA was also tested. To present practical usability, cluster has been tested by program John the Ripper, which serves to cracking users passwords. The work shall include the quantity of graphs clarifying the function and mainly showing the achieved results.
357	Runtime MPI Correctness Checking with a Scalable Tools Infrastructure Hilbrich, Tobias 08 June 2015 (has links) Increasing computational demand of simulations motivates the use of parallel computing systems. At the same time, this parallelism poses challenges to application developers. The Message Passing Interface (MPI) is a de-facto standard for distributed memory programming in high performance computing. However, its use also enables complex parallel programing errors such as races, communication errors, and deadlocks. Automatic tools can assist application developers in the detection and removal of such errors. This thesis considers tools that detect such errors during an application run and advances them towards a combination of both precise checks (neither false positives nor false negatives) and scalability. This includes novel hierarchical checks that provide scalability, as well as a formal basis for a distributed deadlock detection approach. At the same time, the development of parallel runtime tools is challenging and time consuming, especially if scalability and portability are key design goals. Current tool development projects often create similar tool components, while component reuse remains low. To provide a perspective towards more efficient tool development, which simplifies scalable implementations, component reuse, and tool integration, this thesis proposes an abstraction for a parallel tools infrastructure along with a prototype implementation. This abstraction overcomes the use of multiple interfaces for different types of tool functionality, which limit flexible component reuse. Thus, this thesis advances runtime error detection tools and uses their redesign and their increased scalability requirements to apply and evaluate a novel tool infrastructure abstraction. The new abstraction ultimately allows developers to focus on their tool functionality, rather than on developing or integrating common tool components. The use of such an abstraction in wide ranges of parallel runtime tool development projects could greatly increase component reuse. Thus, decreasing tool development time and cost. An application study with up to 16,384 application processes demonstrates the applicability of both the proposed runtime correctness concepts and of the proposed tools infrastructure. info:eu-repo/classification/ddc/004 ddc:004
358	Reactive transport modeling at hillslope scale with high performance computing methods He, Wenkui 07 November 2016 (has links) Reactive transport modeling is an important approach to understand water dynamics, mass transport and biogeochemical processes from the hillslope to the catchment scale. It has a wide range of applications in the fields of e.g. water resource management, contaminanted site remediation and geotechnical engineering. To simulate reactive transport processes at a hillslope or larger scales is a challenging task, which involves interactions of complex physical and biogeochemical processes, huge computational expenses as well as difficulties in numerical precision and stability. The primary goal of the work is to develop a practical, accurate and efficient tool to facilitate the simulation techniques for reactive transport problems towards hillslope or larger scales. The first part of the work deals with the simulation of water flow in saturated and unsaturated porous media. The capability and accuracy of different numerical approaches were analyzed and compared by using benchmark tests. The second part of the work introduces the coupling of the scientific software packages OpenGeoSys and IPhreeqc by using a character-string-based interface. The accuracy and computational efficiency of the coupled tool were discussed based on three benchmarks. It shows that OGS#IPhreeqc provides sufficient numerical accuracy to simulate reactive transport problems for both equilibrium and kinetic reactions in variably saturated porous media. The third part of the work describes the algorithm of a parallelization scheme using MPI (Message Passing Interface) grouping concept, which enables a flexible allocation of computational resources for calculating geochemical reaction and the physical processes such as groundwater flow and transport. The parallel performance of the approach was tested by three examples. It shows that the new approach has more advantages than the conventional ones for the calculation of geochemically-dominated problems, especially when only limited benefit can be obtained through parallelization for solving flow or solute transport. The comparison between the character-string-based and the file-based coupling shows, that the former approach produces less computational overhead in a distributed-memory system such as a computing cluster. The last part of the work shows the application of OGS#IPhreeqc for the simulation of the water dynamic and denitrification process in the groundwater aquifer of a study site in Northern Germany. It demonstrates that OGS#IPhreeqc is able to simulate heterogeneous reactive transport problems at a hillslope scale within an acceptable time span. The model results shows the importance of functional zones for natural attenuation process. / Modellierung des reaktiven Stofftranports ist ein wichtiger Ansatz um die Wasserströmung, den Stofftransport und die biogeochemischen Prozesse von der Hang- bis zur Einzugsgebietsskala zu verstehen. Es gibt umfangreiche Anwendungsgebiete, z.B. in der Wasserwirtschaft, Umweltsanierung und Geotechnik. Die Simulation der reaktiven Stofftransportprozesse auf der Hangskala oder auf größeren Maßstäbe ist eine anspruchsvolle Aufgabe, da es sich um die Wechselwirkungen komplexer physikalischer und biogeochemischen Prozesse handelt, die riesigen Berechnungsaufwand sowie numerischen Schwierigkeiten bezogen auf die Genauigkeit und die Stabilität nach sich ziehen. Das Hauptziel dieser Arbeit besteht darin, ein praktisches, genaues und effizientes Werkzeug zu entwickeln, um die Simulationstechnik für reaktiven Stofftransport auf der Hangskala und auf größeren Skalen zu verbessern. Der erste Teil der Arbeit behandelt die Simulation der Wasserströmung in gesättigten und ungesättigten porösen Medien. Das Anwendungspotential und die Genauigkeit verschiedener numerischer Ansätze wurden mittels einiger Benchmarks analysiert und miteinander verglichen. Der zweite Teil der Arbeit stellt die Kopplung der wissenschaftlichen Softwarepakete OpenGeoSys und IPhreeqc mit einer stringbasierten Schnittstelle dar. Die Genauigkeit und die Recheneffizienz des gekoppelten Tools OGS#IPhreeqc wurden basierend auf drei Benchmark-Tests diskutiert. Das Ergebnis zeigt, dass OGS#IPhreeqc die ausreichende numerische Genauigkeit für die Simulation reaktiven Stofftransports liefert, welcher sich sowohl auf die Gleichgewichtsreaktion als auch auf die kinetische Reaktion in variabel gesättigten porösen Medien beziehen. Der dritte Teil der Arbeit beschreibt zuerst den Algorithmus der Parallelisierung des OGS#IPhreeqc basierend auf dem MPI (Message Passing Interface) Gruppierungskonzept, welcher eine flexible Verteilung der Rechenressourcen für die Berechnung der geochemischen Reaktion und der physikalischen Prozesse wie z.B. Wasserströmung oder Stofftransport ermöglicht. Danach wurde die Leistungsfähigkeit des Algorithmus anhand von drei Beispielen getestet. Es zeigt sich, dass der neue Ansatz Vorteile gegenüber die konventionellen Ansätzen für die Berechnung von geochemisch dominierten Problemen bringt. Dies ist vor allem dann der Fall, wenn nur eingeschränkter Nutzen aus der Parallelisierung für die Berechnung der Wasserströmung oder des Stofftransportes gezogen werden kann. Der Vergleich zwischen der string- und der dateibasierten Kopplung zeigt, dass die erstere weniger Rechenoverhead in einem verteilten Rechnersystem, wie z.B. Cluster erzeugt. Der letzte Teil der Arbeit zeigt die Anwendung von OGS#IPhreeqc für die Simulation der Wasserdynamik und der Denitrifikation im Grundwasserleiter eines Untersuchungsgebietes in NordDeutschland. Es beweist, dass OGS#IPhreeqc in der Lage ist, reaktiven Stofftransport auf der Hangskala innerhalb akzeptabler Zeitspanne zu simulieren. Die Simulationsergebnisse zeigen die Bedeutung der funktionalen Zonen für die natürlichen Selbstreinigungsprozesse. info:eu-repo/classification/ddc/550 ddc:550
359	Toward Resilience in High Performance Computing:: A Prototype to Analyze and Predict System Behavior Ghiasvand, Siavash 16 October 2020 (has links) Following the growth of high performance computing systems (HPC) in size and complexity, and the advent of faster and more complex Exascale systems, failures became the norm rather than the exception. Hence, the protection mechanisms need to be improved. The most de facto mechanisms such as checkpoint/restart or redundancy may also fail to support the continuous operation of future HPC systems in the presence of failures. Failure prediction is a new protection approach that is beneficial for HPC systems with a short mean time between failure. The failure prediction mechanism extends the existing protection mechanisms via the dynamic adjustment of the protection level. This work provides a prototype to analyze and predict system behavior using statistical analysis to pave the path toward resilience in HPC systems. The proposed anomaly detection method is noise-tolerant by design and produces accurate results with as little as 30 minutes of historical data. Machine learning models complement the main approach and further improve the accuracy of failure predictions up to 85%. The fully automatic unsupervised behavior analysis approach, proposed in this work, is a novel solution to protect future extreme-scale systems against failures.:1 Introduction 1.1 Background and Statement of the Problem 1.2 Purpose and Significance of the Study 1.3 Jam–e Jam: A System Behavior Analyzer 2 Review of the Literature 2.1 Syslog Analysis 2.2 Users and Systems Privacy 2.3 Failure Detection and Prediction 2.3.1 Failure Correlation 2.3.2 Anomaly Detection 2.3.3 Prediction Methods 2.3.4 Prediction Accuracy and Lead Time 3 Data Collection and Preparation 3.1 Taurus HPC Cluster 3.2 Monitoring Data 3.2.1 Data Collection 3.2.2 Taurus System Log Dataset 3.3 Data Preparation 3.3.1 Users and Systems Privacy 3.3.2 Storage and Size Reduction 3.3.3 Automation and Improvements 3.3.4 Data Discretization and Noise Mitigation 3.3.5 Cleansed Taurus System Log Dataset 3.4 Marking Potential Failures 4 Failure Prediction 4.1 Null Hypothesis 4.2 Failure Correlation 4.2.1 Node Vicinities 4.2.2 Impact of Vicinities 4.3 Anomaly Detection 4.3.1 Statistical Analysis (frequency) 4.3.2 Pattern Detection (order) 4.3.3 Machine Learning 4.4 Adaptive resilience 5 Results 5.1 Taurus System Logs 5.2 System-wide Failure Patterns 5.3 Failure Correlations 5.4 Taurus Failures Statistics 5.5 Jam-e Jam Prototype 5.6 Summary and Discussion 6 Conclusion and Future Works Bibliography List of Figures List of Tables Appendix A Neural Network Models Appendix B External Tools Appendix C Structure of Failure Metadata Databse Appendix D Reproducibility Appendix E Publicly Available HPC Monitoring Datasets Appendix F Glossary Appendix G Acronyms info:eu-repo/classification/ddc/004 ddc:004
360	Modèles de performance pour l'adaptation des méthodes numériques aux architectures multi-coeurs vectorielles. Application aux schémas Lagrange-Projection en hydrodynamique compressible / Improving numerical methods on recent multi-core processors. Application to Lagrange-Plus-Remap hydrodynamics solver. Gasc, Thibault 06 December 2016 (has links) Ces travaux se concentrent sur la résolution de problèmes de mécanique des fluides compressibles. De nombreuses méthodes numériques ont depuis plusieurs décennies été développées pour traiter ce type de problèmes. Cependant, l'évolution et la complexité des architectures informatiques nous poussent à actualiser et repenser ces méthodes numériques afin d'utiliser efficacement les calculateurs massivement parallèles. Au moyen de modèles de performance, nous analysons une méthode numérique de référence de type Lagrange-Projection afin de comprendre son comportement sur les supercalculateurs récents et d'en optimiser l'implémentation pour ces architectures. Grâce au bilan de cet analyse, nous proposons une formulation alternative de la phase de projection ainsi qu'une nouvelle méthode numérique plus performante baptisée Lagrange-Flux. Les développements de cette méthode ont permis d'obtenir des résultats d'une précision comparable à la méthode de référence. / This works are dedicated to hydrodynamics. For decades, numerous numerical methods has been developed to deal with this type of problems. However, both the evolution and the complexity of computing make us rethink or redesign our numerical solver in order to use efficiently massively parallel computers. Using performance modeling, we perform an analysis of a reference Lagrange-Remap solver in order to deeply understand its behavior on current supercomputer and to optimize its implementation. Thanks to the conclusions of this analysis, we derive a new numerical solver which by design has a better performance. We call it the Lagrange-Flux solver. The accuracy obtained with this solver is similar to the reference one. The derivation of this method also leads to rethink the Remap step. Hydrodynamique Lagrange-Projection Modèles de performance Informatique haute performance Hydrodynamics Lagrange-Plus-Remap Performance modeling High performance computing

Search results