1

Monte Carlo parallel simulations applied to High Energy Physics for manycore and multicore platforms: development, optimisation, reproducibility

Schweitzer, Pierre 19 October 2015 (has links)
This thesis focuses on High Performance Computing, specifically on Monte Carlo simulations applied to High Energy Physics, and in particular on simulations of the propagation of particles through matter. Monte Carlo simulations demand significant CPU time and memory. In the case we studied, the existing Monte Carlo simulation took longer to simulate the physical phenomenon than the phenomenon itself took to unfold under experimental conditions, which posed a severe performance problem. The minimal technical objective was a simulation taking no more time than the observed phenomenon; the maximal objective was a much faster one. These simulations are critical for assessing our understanding of what is observed during experiments: the more simulated statistical samples we have, the better our results. This initial state of the simulation therefore opened numerous opportunities for optimisation and high performance computing. Furthermore, in our case any performance gain would be pointless if it came at the cost of losing result reproducibility, so the numerical reproducibility of the simulation was an aspect we had to take into account.
After a state of the art on profiling, optimisation and reproducibility, we proposed several strategies to obtain more performance from our simulations. In every case, the proposed optimisations were preceded by a profiling step: one never optimises without having profiled first. We then turned to the design of a parallel profiler based on aspect-oriented programming for our very specific needs. Finally, we approached the problem from a new angle: rather than optimising an existing simulation, we proposed methods for building a new, domain-specific simulation that is reproducible, statistically sound and scalable by design. Across all these proposals, we targeted Intel multicore and manycore architectures to evaluate performance on both a server-oriented architecture and a High Performance Computing oriented one. By applying our proposals, we optimised one of the Monte Carlo simulations and obtained a speedup of about 400x once optimised and parallelised on a compute node with 32 physical cores. We also implemented a profiler, programmed with aspects, capable of handling both the parallelism of the machine it runs on and that of the application it profiles; because it uses aspects, it is portable and not tied to a particular hardware architecture. Finally, we implemented the simulation designed to be reproducible and performant while producing statistically sound results, and verified that these goals were met regardless of the target execution architecture. In particular, this allowed us to validate our method for checking the numerical reproducibility of a simulation.
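To make the reproducibility concern concrete, below is a minimal sketch (an illustration of the standard technique, not the thesis's actual code) of a reproducible parallel Monte Carlo loop: each sample draws from a generator seeded by its global index, so the random stream does not depend on the thread schedule, and the final reduction runs in a fixed order so the floating-point sum is bit-identical for any thread count.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

double simulate_one(std::uint64_t sample_id) {
    // Per-sample generator: the same stream regardless of which thread runs it.
    // (Production codes often prefer a counter-based RNG such as Philox for
    // stronger independence between streams; mt19937_64 keeps the sketch simple.)
    std::mt19937_64 rng(0x9E3779B97F4A7C15ULL ^ sample_id);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double x = u(rng), y = u(rng);
    return (x * x + y * y <= 1.0) ? 1.0 : 0.0;  // toy estimator of pi/4
}

int main() {
    const std::int64_t n = 10'000'000;
    std::vector<double> results(n);

    #pragma omp parallel for schedule(static)
    for (std::int64_t i = 0; i < n; ++i)
        results[i] = simulate_one(static_cast<std::uint64_t>(i));

    // Deterministic reduction: fixed summation order, independent of thread count.
    double sum = 0.0;
    for (std::int64_t i = 0; i < n; ++i) sum += results[i];

    std::printf("estimate = %.9f\n", 4.0 * sum / static_cast<double>(n));
}
```

Storing one partial result per sample is the simplest way to pin down the summation order; real codes usually keep one accumulator per fixed-size block instead, which preserves reproducibility at a fraction of the memory cost.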
2

Streaming system scheduling for Xeon Phi

Faltín, Tomáš January 2016 (has links)
Task scheduling is a well-known operating systems problem on traditional architectures (NUMA, SMP). However, existing schedulers do not perform well on emerging many-core architectures, especially the Intel Xeon Phi. We collected all publicly available information about the Xeon Phi's architecture and then benchmarked the processor to fill in the missing details, focusing in particular on the architecture of its cores and memory controllers, the parts that matter most when designing a scheduler. Based on the measured results, we proposed improvements to the scheduling algorithm in Bobox, an experimental streaming system.
3

Streaming system scheduling for Xeon Phi

Faltín, Tomáš January 2017 (has links)
Task scheduling is a well-known operating systems problem on traditional architectures (NUMA, SMP). Unfortunately, existing schedulers do not perform well on emerging many-core architectures, especially the Intel Xeon Phi. We collected all publicly available information about the Xeon Phi's architecture and then benchmarked the processor to find the missing details, focusing especially on the cores and memory controllers, the most important parts when designing a scheduler. Based on the results, we proposed improvements to the scheduling algorithm in Bobox (an experimental streaming system). However, we found that the biggest problem lies not in the scheduling algorithm but in the design of operator parallelization; we therefore proposed improvements to the parallelization and tested one of them.
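One scheduling concern specific to this hardware can be illustrated with a small sketch (my assumption-laden illustration, not Bobox code). On Knights Corner each physical core exposes four hardware threads, and a streaming scheduler typically pins each worker to one physical core so that a task's working set stays in that core's L2 cache. The linear core-to-CPU-ID mapping below is an assumption; real deployments should query the topology, for example with hwloc.

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

constexpr int kHwThreadsPerCore = 4;  // Knights Corner: 4 SMT threads per core

void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    // Allow all four hardware threads of this physical core, none of the others.
    for (int t = 0; t < kHwThreadsPerCore; ++t)
        CPU_SET(core * kHwThreadsPerCore + t, &set);  // assumed linear mapping
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

void worker(int core) {
    pin_to_core(core);
    // ... pop tasks from this core's local queue; steal from other cores only
    // as a last resort, since migrating a task discards its warm L2 state.
}

int main() {
    const int cores = 8;  // illustrative; a 61-core Phi would use ~60 workers
    std::vector<std::thread> pool;
    for (int c = 0; c < cores; ++c) pool.emplace_back(worker, c);
    for (auto& t : pool) t.join();
}
```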
4

Preparing depth imaging applications for Exascale: challenges and impacts

Farjallah, Asma 16 December 2014 (has links)
As Exascale systems are expected in the 2018-2020 time frame, performance analysis and characterization of applications for new processor architectures and large-scale systems are important tasks that make it possible to anticipate the changes required to exploit future HPC systems efficiently. This thesis focuses on seismic imaging applications used for modeling complex physical phenomena, in particular the depth imaging application called Reverse Time Migration (RTM), which is heavily used in seismic exploration by the oil and gas industry. My first contribution consists of characterizing and modeling the performance of the computational core of RTM, which is based on finite-difference time-domain (FDTD) computations. I identify and explore the major tuning parameters influencing performance, notably caches, memory bandwidth and prefetching, and the interaction between the architecture and the application; this study led to a performance model able to predict the DRAM traffic of FDTD kernels. The second contribution is an analysis of the challenges of a hybrid and heterogeneous FDTD implementation for manycore architectures, covering both a "native" and a heterogeneous "symmetric" implementation. We target Intel's first Xeon Phi coprocessor, Knights Corner, an interesting proxy for our study since it exhibits some of the expected features of an Exascale system: concurrency and heterogeneity. My third contribution extends the performance analysis and modeling to the full RTM, adding communications and I/O to the computation part; RTM is a data-intensive application that requires storing intermediate values of the computational field, resulting in expensive I/O accesses. My fourth contribution is the final measurement and model validation of my hybrid RTM implementation on a large system. This was done on Stampede, a machine at the Texas Advanced Computing Center (TACC), which allowed us to test scalability on up to 64 nodes, each containing one 61-core Xeon Phi and two 8-core CPUs, for a total of close to 5,000 heterogeneous cores.
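The FDTD kernel such a performance model targets has a simple canonical shape; here is a minimal sketch (my illustration, not the thesis's code) of a second-order-in-time, eighth-order-in-space acoustic wavefield update. The DRAM-traffic question is precisely how many of these four arrays stream from memory per time step rather than being served from cache:

```cpp
#include <cstddef>
#include <vector>

// Standard 8th-order central-difference coefficients for the second derivative.
static const double c[5] = {-205.0 / 72, 8.0 / 5, -1.0 / 5, 8.0 / 315, -1.0 / 560};

// One time step: next = 2*cur - prev + (v^2 dt^2 / h^2) * laplacian(cur).
// The 1/h^2 factor is assumed folded into vel2dt2.
void fdtd_step(const std::vector<double>& prev, const std::vector<double>& cur,
               std::vector<double>& next, const std::vector<double>& vel2dt2,
               std::size_t nx, std::size_t ny, std::size_t nz) {
    auto at = [&](std::size_t i, std::size_t j, std::size_t k) {
        return (i * ny + j) * nz + k;
    };
    #pragma omp parallel for collapse(2)
    for (std::size_t i = 4; i < nx - 4; ++i)
        for (std::size_t j = 4; j < ny - 4; ++j)
            for (std::size_t k = 4; k < nz - 4; ++k) {
                double lap = 3.0 * c[0] * cur[at(i, j, k)];
                for (int r = 1; r <= 4; ++r)
                    lap += c[r] * (cur[at(i + r, j, k)] + cur[at(i - r, j, k)] +
                                   cur[at(i, j + r, k)] + cur[at(i, j - r, k)] +
                                   cur[at(i, j, k + r)] + cur[at(i, j, k - r)]);
                next[at(i, j, k)] = 2.0 * cur[at(i, j, k)] - prev[at(i, j, k)]
                                  + vel2dt2[at(i, j, k)] * lap;
            }
}
```

If the cache captures all stencil reuse of `cur`, this loop moves roughly four double-precision values per grid point to or from DRAM (reads of `prev`, `cur`, `vel2dt2`, a write of `next`), about 32 bytes per point per step; a DRAM-traffic model for FDTD quantifies how far a given cache hierarchy falls short of that lower bound.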
5

Exploring vectorisation for parallel breadth-first search on an advanced vector processor

Paredes Lopez, Mireya January 2017 (has links)
Modern applications generate massive amounts of data that are challenging to process and analyse. Graph algorithms have emerged as a solution for the analysis of this data because they can represent the entities participating in the generation of large-scale datasets as vertices and their relationships as edges. Graph analysis algorithms search for patterns within these relationships, aiming to extract information for further analysis. Breadth-first search (BFS) is one of the main graph search algorithms used for graph analysis, and its optimisation has been widely researched on different parallel computers. However, parallelising BFS has proven challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, which limit its scalability. This thesis investigates the optimisation of BFS on the Xeon Phi, a modern parallel architecture equipped with an advanced vector processor, using a self-created development framework integrated with the Graph 500 benchmark. As a result, optimised parallel versions of two high-level BFS algorithms were created using vectorisation: first the conventional top-down BFS algorithm and, building on this, the hybrid BFS algorithm. The best implementations achieved speedups of 1.37x and 1.33x respectively over the state of the art for a graph of one million vertices. The hybrid BFS algorithm can be reused by other graph analysis algorithms, and the lessons learned from vectorisation apply to other algorithms targeting existing and future models of the Xeon Phi and other advanced vector architectures.
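The hybrid algorithm referred to here is commonly known as direction-optimising BFS. Below is a minimal scalar sketch of the idea (an illustration, not the thesis's vectorised implementation): top-down steps expand a small frontier outward, while bottom-up steps let every unvisited vertex search its own neighbours for a parent once the frontier grows large, avoiding a scatter through most of the edge list.

```cpp
#include <cstdint>
#include <vector>

struct Graph {                          // CSR adjacency
    std::vector<std::int64_t> offsets;  // size n + 1
    std::vector<std::int32_t> edges;
    std::int32_t n;
};

std::vector<std::int32_t> hybrid_bfs(const Graph& g, std::int32_t root) {
    std::vector<std::int32_t> parent(g.n, -1);
    std::vector<char> in_frontier(g.n, 0);
    std::vector<std::int32_t> frontier{root};
    parent[root] = root;
    in_frontier[root] = 1;
    while (!frontier.empty()) {
        std::vector<std::int32_t> next;
        if (frontier.size() < static_cast<std::size_t>(g.n) / 20) {  // tunable threshold
            for (std::int32_t u : frontier)                          // top-down step
                for (auto e = g.offsets[u]; e < g.offsets[u + 1]; ++e) {
                    std::int32_t v = g.edges[e];
                    if (parent[v] < 0) { parent[v] = u; next.push_back(v); }
                }
        } else {
            for (std::int32_t v = 0; v < g.n; ++v) {                 // bottom-up step
                if (parent[v] >= 0) continue;
                for (auto e = g.offsets[v]; e < g.offsets[v + 1]; ++e)
                    if (in_frontier[g.edges[e]]) {
                        parent[v] = g.edges[e];
                        next.push_back(v);
                        break;                                        // one parent suffices
                    }
            }
        }
        for (std::int32_t u : frontier) in_frontier[u] = 0;
        for (std::int32_t v : next) in_frontier[v] = 1;
        frontier.swap(next);
    }
    return parent;
}
```

The bottom-up inner loop is also the natural vectorisation target: checking many neighbours of a vertex against the frontier bitmap maps well onto wide SIMD compare-and-mask operations of the kind the Xeon Phi provides.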
6

Parallel computing platform "Design for Demise"

Plazolles, Bastien 10 January 2017 (has links)
The risks posed by space debris are now considered critical by governments and international space agencies. Over the last decade, space agencies have developed tools to simulate the atmospheric re-entry of satellites and orbital stations in order to assess the casualty risk on the ground. Nevertheless, current tools provide deterministic results even though their models rely on parameter values that are poorly known, so the results obtained depend strongly on the assumptions made. One way to obtain relevant, exploitable results is to take into account the uncertainties on the various model parameters and perform Monte Carlo analyses. Such a study is, however, very demanding in computing time because of the large parameter space to explore, which requires hundreds of thousands of numerical simulations. In this thesis we propose a new satellite atmospheric re-entry simulation tool that natively takes the uncertainties on the model parameters into account in order to perform statistical analyses. To keep computing time under control, the tool exploits the Taguchi method to reduce the number of parameters to study, as well as computing accelerators such as Graphics Processing Units (GPUs) and the Intel Xeon Phi.
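As an illustration of how the Taguchi method cuts down the parameter space, the sketch below (my example; the thesis's factors and responses are richer) screens seven two-level factors, stand-ins for uncertain re-entry inputs such as a low/high drag coefficient, in the eight runs of an L8 orthogonal array instead of the 2^7 = 128 runs of a full factorial, then estimates each factor's main effect:

```cpp
#include <array>
#include <cstdio>

// Standard Taguchi L8 orthogonal array: 8 runs, 7 two-level factors.
static const int L8[8][7] = {
    {0,0,0,0,0,0,0}, {0,0,0,1,1,1,1}, {0,1,1,0,0,1,1}, {0,1,1,1,1,0,0},
    {1,0,1,0,1,0,1}, {1,0,1,1,0,1,0}, {1,1,0,0,1,1,0}, {1,1,0,1,0,0,1}};

// Stand-in for one expensive re-entry simulation; levels[f] is factor f's level.
double run_simulation(const int* levels) {
    double out = 0.0;
    for (int f = 0; f < 7; ++f) out += (f + 1) * (levels[f] ? 1.0 : -1.0);
    return out;  // hypothetical response, e.g. demise altitude
}

int main() {
    std::array<double, 8> y{};
    for (int r = 0; r < 8; ++r) y[r] = run_simulation(L8[r]);

    // Main effect of a factor = mean(response at high) - mean(response at low);
    // each level appears in exactly 4 of the 8 runs, hence the division by 4.
    for (int f = 0; f < 7; ++f) {
        double lo = 0.0, hi = 0.0;
        for (int r = 0; r < 8; ++r) (L8[r][f] ? hi : lo) += y[r] / 4.0;
        std::printf("factor %d main effect: %+.3f\n", f, hi - lo);
    }
}
```

Factors with negligible main effects can then be frozen at nominal values, so the expensive Monte Carlo stage explores only the uncertainties that actually matter.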
7

Parallel Evaluation of Numerical Models for Algorithmic Trading

Ligr, David January 2016 (has links)
This thesis addresses the problem of the parallel evaluation of algorithmic trading models based on multiple kernel support vector regression. Various approaches to parallelising the evaluation of these models are proposed, and their suitability for highly parallel architectures, namely the Intel Xeon Phi coprocessor, is analysed with regard to the specifics of this coprocessor and of its programming. Based on this analysis, a prototype is implemented and its performance compared against serial and multi-core baselines in a series of experiments.
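The evaluation kernel of such a model has a structure worth spelling out. Below is a minimal sketch (an assumed form of multiple kernel SVR prediction, not the thesis's prototype): the prediction is a weighted combination of kernels summed over all support vectors, and that outer sum is what parallelises naturally, across cores via OpenMP and across the Xeon Phi's wide SIMD lanes in the inner loop.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct MklSvr {
    std::vector<std::vector<double>> sv;  // support vectors
    std::vector<double> alpha;            // dual coefficients
    double w_rbf, w_poly;                 // kernel mixture weights
    double gamma, degree, bias;
};

double predict(const MklSvr& m, const std::vector<double>& x) {
    const std::size_t n = m.sv.size(), d = x.size();
    double acc = 0.0;
    #pragma omp parallel for reduction(+ : acc)
    for (std::size_t i = 0; i < n; ++i) {
        double dot = 0.0, dist2 = 0.0;
        #pragma omp simd reduction(+ : dot, dist2)   // vectorisable inner loop
        for (std::size_t j = 0; j < d; ++j) {
            double diff = m.sv[i][j] - x[j];
            dot += m.sv[i][j] * x[j];
            dist2 += diff * diff;
        }
        // Weighted combination of an RBF and a polynomial kernel (assumed mix).
        double k = m.w_rbf * std::exp(-m.gamma * dist2)
                 + m.w_poly * std::pow(dot + 1.0, m.degree);
        acc += m.alpha[i] * k;
    }
    return acc + m.bias;
}
```

In a trading setting the same model is evaluated on many query points per tick, so a second level of parallelism over queries is usually layered on top of this per-prediction loop.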
8

Unstructured Computations on Emerging Architectures

Al Farhan, Mohammed 05 May 2019 (has links)
This dissertation describes detailed performance engineering and optimization of an unstructured computational aerodynamics software system with irregular memory accesses on various multi- and many-core emerging high performance computing architectures, which are expected to be the building blocks of energy-austere exascale systems and on which algorithmic and architecture-oriented optimizations are essential for achieving worthy performance. We investigate several state-of-the-practice shared-memory optimization techniques applied to key kernels for the important problem class of unstructured meshes. We illustrate these techniques on a broad spectrum of emerging microprocessor architectures, representative of the compute units in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating-point numerics of the original code. While the linear algebraic kernels are bottlenecked by memory bandwidth for even modest numbers of hardware cores sharing a common address space, the edge-based loop kernels, which arise in the control volume discretization of the conservation law residuals and in the formation of the preconditioner for the Jacobian by finite-differencing the conservation law residuals, are compute-intensive and effectively exploit contemporary multi- and many-core processing hardware. We therefore employ low- and high-level algorithmic and architecture-specific code optimizations and tuning in light of thread- and data-level parallelism, with a focus on strong thread scaling at the node level. Our approaches are based upon novel multi-level hierarchical workload distribution mechanisms that spread data across different compute units (from the address space down to the registers) within every hardware core. We analyze the aerodynamics application on specific computing architectures to develop performance metrics and models that bound the achievable performance from above and below. We present significant full-application speedups relative to the baseline code on a succession of many-core processor architectures: Intel Xeon Phi Knights Corner (5.0x) and Knights Landing (2.9x). In addition, Knights Landing outperforms Intel Xeon Skylake by nearly a factor of two at significantly lower power consumption. These optimizations are expected to be of value for many other unstructured-mesh scientific applications based on partial differential equations as multi- and many-core architectures evolve.
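The edge-based loop pattern named above deserves a concrete sketch (my illustration, not the dissertation's code). Each mesh edge updates both of its endpoint vertices, so a naive parallel loop would race on shared vertices; grouping edges into colors such that no two edges of a color share a vertex makes every color's loop safely parallel:

```cpp
#include <cstddef>
#include <vector>

struct Edge { int a, b; };  // endpoint vertex indices

// Edges are assumed pre-sorted by color; color c spans
// indices [color_start[c], color_start[c + 1]).
void edge_residual(const std::vector<Edge>& edges,
                   const std::vector<std::size_t>& color_start,
                   const std::vector<double>& u, std::vector<double>& res) {
    for (std::size_t c = 0; c + 1 < color_start.size(); ++c) {  // colors run serially
        #pragma omp parallel for
        for (std::size_t e = color_start[c]; e < color_start[c + 1]; ++e) {
            const Edge& ed = edges[e];
            double flux = 0.5 * (u[ed.a] - u[ed.b]);  // toy flux across the edge
            res[ed.a] -= flux;  // safe: no other edge of this color touches a or b
            res[ed.b] += flux;
        }
    }
}
```

Coloring is one of several standard remedies (alternatives include atomic updates and vertex-owned gather loops); which one wins depends on the architecture, a trade-off of exactly the kind such node-level tuning has to weigh.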
9

Power, Performance, and Energy Management of Heterogeneous Architectures

January 2019 (has links)
Modern many-core multiprocessor systems-on-chip (SoCs) offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex, and as the basic blocks that make up an application change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to the large number of workloads and configurations, so there is a strong need to evaluate the metrics of interest as a function of the supported configurations. This thesis focuses on two types of modern multiprocessor SoCs: mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture. For mobile heterogeneous systems, it presents a novel methodology that accurately instruments different types of applications with performance-monitoring calls, providing a rich set of performance statistics at the basic-block level while the application runs on the target platform. The target platform used for this work (Odroid XU3) can run at 4940 different frequency and core combinations. With the instrumented applications, a vast amount of characterization data was collected, detailing performance, power and CPU state at every instrumented basic block across 19 application types. This data enabled two runtime schemes. The first provides a methodology for finding optimal configurations in a heterogeneous architecture using classifiers, demonstrating average increases of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively. The second, using the same data, presents a novel imitation-learning framework for dynamically controlling the type, number and frequencies of the active cores, achieving an average 109% improvement in performance per watt compared to the default governors. This work also shows how to accurately profile the tile-based Intel Xeon Phi architecture while training different types of neural networks on an open image dataset with a deep-learning framework; the collected data allows deep exploratory analysis and showcases how different hardware parameters affect the Xeon Phi's performance. / Dissertation/Thesis / Masters Thesis Engineering 2019
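The metric both runtime schemes optimise, performance per watt (PPW), reduces to a simple lookup once characterization data exists. Here is a minimal sketch with hypothetical names and illustrative numbers (not the thesis's classifier, which predicts the best configuration rather than exhaustively measuring every one):

```cpp
#include <cstdio>
#include <limits>
#include <vector>

struct Config {
    int big_cores, little_cores;  // big.LITTLE core counts, as on the Odroid XU3
    int freq_mhz;
    double throughput;            // measured work/sec for the current phase
    double power_watts;           // measured power for the current phase
};

const Config* best_ppw(const std::vector<Config>& table) {
    const Config* best = nullptr;
    double best_score = -std::numeric_limits<double>::infinity();
    for (const Config& c : table) {
        double ppw = c.throughput / c.power_watts;  // performance per watt
        if (ppw > best_score) { best_score = ppw; best = &c; }
    }
    return best;
}

int main() {
    std::vector<Config> table = {
        {4, 0, 2000, 9.0e9, 6.0},  // illustrative numbers, not measured data
        {2, 2, 1400, 6.5e9, 3.2},
        {0, 4, 1400, 3.0e9, 1.1},
    };
    const Config* c = best_ppw(table);
    std::printf("pick: %d big / %d LITTLE @ %d MHz\n",
                c->big_cores, c->little_cores, c->freq_mhz);
}
```

A runtime governor cannot afford to measure all 4940 combinations per phase, which is why the thesis trains classifiers on offline characterization data to predict the argmax instead of searching for it.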
10

OPTIMIZATIONS ON FINITE THREE DIMENSIONAL LARGE EDDY SIMULATIONS

Phadke, Nandan Neelkanth 17 August 2015 (has links)
No description available.
