221
Investigating the effect of implementing Data-Oriented Design principles on performance and cache utilization
Nyberg, Frank, January 2021
Game engines process large amounts of data under strict deadlines, so measures to increase performance are important in this area. Data-Oriented Design (DOD) promotes principles meant to increase performance through better cache utilization. The purpose of this thesis is to examine a selection of these principles to give a better understanding of how DOD affects CPU time and the rate of cache misses, with a focus on game development. Specifically, the principles examined are removal of run-time polymorphism, iteration over contiguous data, and reducing the amount of data in hot loops. The Entity-Component-System (ECS) pattern, which is built on DOD principles, is also examined. The approach was to first present a theoretical background on the subject and then to conduct tests by implementing a simulation of movement and collision detection utilizing said principles. The tests were written in C++ and executed on an Intel Core i7-4770K with no rendering. CPU time was measured in updated entities per μs, and cache utilization was measured as cache miss rate. The results showed that the DOD principles did increase performance. Cache miss rate was also lower, except when removing run-time polymorphism. The conclusion is that Data-Oriented Design, used in game development, is likely to result in better performance, mostly as a result of better cache utilization.
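To make the first two principles concrete, here is a minimal C++ sketch, not taken from the thesis, contrasting run-time polymorphism with iteration over contiguous data; all type and function names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Run-time polymorphism: entities are typically heap-allocated and updated
// through virtual dispatch, scattering hot data across memory.
struct Entity {
    virtual ~Entity() = default;
    virtual void update(float dt) = 0;
};

// Data-oriented alternative: hot data lives in contiguous arrays
// (structure-of-arrays), so the update loop streams through memory,
// which is friendly to both the cache and SIMD vectorization.
struct Movement {
    std::vector<float> x, y;    // positions
    std::vector<float> vx, vy;  // velocities
};

void update_movement(Movement& m, float dt) {
    for (std::size_t i = 0; i < m.x.size(); ++i) {
        m.x[i] += m.vx[i] * dt;  // contiguous access, no virtual dispatch
        m.y[i] += m.vy[i] * dt;
    }
}
```

With the contiguous layout, each cache line fetched during the loop is fully used, which is the mechanism behind the lower miss rates reported above.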
222
Multiphysics and Large-Scale Modeling and Simulation Methods for Advanced Integrated Circuit Design
Shuzhan Sun (11564611), 22 November 2021
The design of advanced integrated circuits (ICs) and systems calls for multiphysics and large-scale modeling and simulation methods. On the one hand, novel devices and materials are emerging in next-generation IC technology, which require multiphysics modeling and simulation. On the other hand, the ever-increasing complexity of ICs demands more efficient numerical solvers.

In this work, we propose a multiphysics modeling and simulation algorithm that co-simulates Maxwell's equations, the dispersion relation of materials, and the Boltzmann equation to characterize emerging devices in IC technology such as Cu-Graphene (Cu-G) hybrid nano-interconnects. We also develop an unconditionally stable time-marching scheme that removes the dependence of the time step on the space step, enabling efficient simulation of this multiscale, multiphysics system. Extensive numerical experiments and comparisons with measurements have validated the accuracy and efficiency of the proposed algorithm. Compared to analyses based on simplified steady-state models, a significant difference is observed when the frequency is high and/or the dimensions of the Cu-G structure are small, which necessitates the proposed multiphysics modeling and simulation for the design of advanced Cu-G interconnects.

To address the large-scale simulation challenge, we develop a new split-field domain-decomposition algorithm for solving Maxwell's equations that is amenable to parallelization: it minimizes communication between subdomains while achieving fast convergence of the global solution, and it is unconditionally stable in the time domain. Unlike prevailing domain-decomposition methods, which treat the interface unknown as a whole shared across subdomains, we partition the interface unknown into multiple components and solve each of them from one subdomain. In this way, we transform the original coupled system into fully decoupled subsystems. Only one addition (communication) of the interface unknown needs to be performed after the computation in each subdomain finishes at each time step. More importantly, the algorithm converges quickly and permits a large time step irrespective of the space step. Numerical experiments on large-scale on-chip and package layout analyses have demonstrated the capability of the new domain-decomposition algorithm.

To tackle the challenge of efficiently simulating irregular structures, the last part of the thesis develops a method for the stability analysis of unsymmetrical numerical systems in the time domain. Unsymmetrical systems are traditionally avoided in numerical formulations, since a traditional explicit simulation of them is absolutely unstable and it was unknown how to control their stability. However, unsymmetrical systems are frequently encountered when modeling and simulating unstructured meshes and nonreciprocal electromagnetic and circuit devices. Our method reduces the stability analysis of a large system to the analysis of a single disassembled element, thereby providing a feasible way to control the stability of large-scale systems regardless of whether the system is symmetrical or unsymmetrical. We then apply the proposed method to prove and control the stability of an unsymmetrical matrix-free method that solves Maxwell's equations on general unstructured meshes without requiring a matrix solution.
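The decoupled time stepping described above can be sketched as follows, assuming an MPI setting; `local_solve` is a hypothetical stand-in for the subdomain update, and a global reduction stands in for the (in practice neighbor-only) summation of interface components. This is an illustration of the control flow, not the thesis code.

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical local solver: advances the subdomain-interior unknowns and this
// rank's component of the interface unknown by one time step. A real solver
// would apply the unconditionally stable split-field update here.
static void local_solve(std::vector<double>& interior,
                        std::vector<double>& iface_local) {
    (void)interior; (void)iface_local;  // stub: no-op for illustration
}

// One time step of the decoupled scheme. Each rank owns one subdomain and
// solves only its own component of the partitioned interface unknown; a single
// summation of those components after the local solves is the only
// inter-subdomain communication per step.
void time_step(std::vector<double>& interior,
               std::vector<double>& iface_local, MPI_Comm comm) {
    local_solve(interior, iface_local);  // fully local, no communication

    std::vector<double> iface_full(iface_local.size());
    MPI_Allreduce(iface_local.data(), iface_full.data(),
                  static_cast<int>(iface_local.size()),
                  MPI_DOUBLE, MPI_SUM, comm);  // one addition of the interface unknown
    iface_local.swap(iface_full);  // complete interface value for the next step
}
```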
223
Efficient Implementation of 3D Finite Difference Schemes on Recent Processor Architectures / Effektiv implementering av finita differensmetoder i 3D på senaste processorarkitekturer
Ceder, Frederick, January 2015
In this paper a solver is introduced that solves a problem set modelled by the Burgers equation using the finite difference method: forward in time and central in space (FTCS). The solver is parallelized and optimized for the Intel Xeon Phi 7120P as well as the Intel Xeon E5-2699v3 processor to investigate performance differences between the two architectures. Optimized data access and layout have been implemented to ensure good cache utilization, and loop tiling strategies are used to adjust data access with respect to the L2 cache size. Compiler hints describing aligned memory access are used to support vectorization on both processors. Additionally, prefetching strategies and streaming stores have been evaluated for the Xeon Phi. Parallelization was done using OpenMP and MPI. The OpenMP-based parallelization for native execution on the Xeon Phi yielded a raw performance of nearly 100 GFLOP/s, reaching a speedup of almost 50 at 83% parallel efficiency. An OpenMP implementation on the E5-2699v3 (Haswell) processors produced up to 292 GFLOP/s, reaching a speedup of almost 31 at 85% parallel efficiency. For comparison, a hybrid implementation interleaving communication with computation reached 267 GFLOP/s at a speedup of 28 with 87% parallel efficiency. A pure MPI implementation on PDC's Beskow supercomputer with 16 nodes yielded a total performance of 1450 GFLOP/s, and 2325 GFLOP/s for a larger problem set, reaching speedups of 170 and 290 at parallel efficiencies of 33.3% and 56%, respectively. An analysis based on the roofline performance model shows that the computations were bound by L2 cache bandwidth, suggesting good L2 cache utilization on both the Haswell and Xeon Phi architectures. Xeon Phi performance could probably be improved further by also using MPI. Taking into account the technological progress of the computational cores in the Haswell processor, both processors perform well. Rewriting the stencil computations in a more compiler-friendly form might improve performance further, as the compiler could then optimize more aggressively for the target platform. The experiments on the Cray system Beskow showed efficiency increasing from 33.3% to 56% for the larger problem, illustrating good weak scaling; this suggests that problem sizes should grow with the number of nodes in order to maintain high efficiency.
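For reference, the FTCS scheme at the core of the solver can be sketched in C++ for the 1D viscous Burgers equation; the thesis works in 3D and adds tiling, alignment hints, OpenMP scheduling, and MPI on top of this basic update. The sketch below is illustrative, not the thesis code.

```cpp
#include <cstddef>
#include <vector>

// One FTCS step (forward in time, central in space) for the 1D viscous
// Burgers equation u_t + u*u_x = nu*u_xx, using explicit Euler in time.
void ftcs_step(const std::vector<double>& u, std::vector<double>& u_next,
               double dt, double dx, double nu) {
    const std::size_t n = u.size();
    if (n < 3) return;  // need at least one interior point
    #pragma omp parallel for
    for (std::size_t i = 1; i < n - 1; ++i) {
        const double ux  = (u[i + 1] - u[i - 1]) / (2.0 * dx);             // central first derivative
        const double uxx = (u[i + 1] - 2.0 * u[i] + u[i - 1]) / (dx * dx); // central second derivative
        u_next[i] = u[i] + dt * (nu * uxx - u[i] * ux);
    }
    u_next[0] = u[0]; u_next[n - 1] = u[n - 1];  // fixed boundary values
}
```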
224
Optimisation de transfert de données pour les processeurs pluri-coeurs, appliqué à l'algèbre linéaire et aux calculs sur stencils / Optimization of data transfer on many-core processors, applied to dense linear algebra and stencil computations
Ho, Minh Quan, 5 July 2018
The upcoming exascale target in High Performance Computing (HPC) and disruptive achievements in artificial intelligence are giving rise to alternative, non-conventional many-core architectures with the energy efficiency typical of embedded systems, while providing the same software ecosystem as classic HPC platforms. A key enabler of energy-efficient computing on many-core architectures is the exploitation of data locality, specifically the use of scratchpad memories in combination with DMA engines to overlap computation and communication. This software paradigm raises considerable programming challenges for both the vendor and the application developer.

In this thesis, we tackle the memory-transfer and performance issues, as well as the programming challenges, of memory- and compute-intensive HPC applications on the Kalray MPPA many-core architecture. With a first, memory-bound use case, the lattice Boltzmann method (LBM), we provide generic and fundamental techniques for decomposing three-dimensional iterative stencil problems onto clustered many-core processors fitted with scratchpad memories and DMA engines. The developed DMA-based streaming and overlapping algorithm delivers a 33% performance gain over the default cache-based implementation. High-dimensional stencil computation suffers from a serious I/O bottleneck and limited on-chip memory space; we therefore developed a new in-place LBM propagation algorithm, which halves the memory footprint and yields 1.5 times higher performance-per-byte efficiency than the state-of-the-art out-of-place algorithm. On the compute-intensive side, with dense linear algebra computations, we build an optimized matrix multiplication benchmark based on the exploitation of scratchpad memory and efficient asynchronous DMA communication. These techniques are then extended to a DMA module for the BLIS framework, which allows us to instantiate an optimized and portable level-3 BLAS (Basic Linear Algebra Subprograms) numerical library on any DMA-based architecture in less than 100 lines of code. We achieve 75% of peak performance on the MPPA processor with the matrix multiplication operation (GEMM) from the standard BLAS library, without having to write thousands of lines of laboriously optimized code for the same result.
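The streaming-and-overlapping pattern can be sketched as below. The DMA primitives are stand-in stubs rather than the actual Kalray MPPA API: a real engine would transfer asynchronously, which the stub replaces with a synchronous copy to keep the example self-contained, and `compute_tile` is a placeholder for the stencil or LBM kernel.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in DMA primitives -- NOT the Kalray MPPA API. A real DMA engine would
// perform these transfers asynchronously; the stub uses a synchronous copy and
// ignores the tag, which is enough to show the control flow.
inline void dma_get_async(void* dst, const void* src, std::size_t bytes, int /*tag*/) {
    std::memcpy(dst, src, bytes);
}
inline void dma_wait(int /*tag*/) {}

// Placeholder for the local compute kernel operating on a scratchpad tile.
inline void compute_tile(double* /*tile*/, std::size_t /*n*/) {}

// Double-buffered streaming: while the core computes on one scratchpad buffer,
// the DMA engine fills the other, overlapping communication with computation.
void stream_tiles(const double* ddr, std::size_t tiles, std::size_t tile_elems,
                  double* scratch0, double* scratch1) {
    double* buf[2] = { scratch0, scratch1 };
    dma_get_async(buf[0], ddr, tile_elems * sizeof(double), 0);
    for (std::size_t t = 0; t < tiles; ++t) {
        const int cur = static_cast<int>(t % 2);
        if (t + 1 < tiles)  // prefetch the next tile into the other buffer
            dma_get_async(buf[1 - cur], ddr + (t + 1) * tile_elems,
                          tile_elems * sizeof(double), 1 - cur);
        dma_wait(cur);                   // ensure the current tile has arrived
        compute_tile(buf[cur], tile_elems);
    }
}
```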
225
MVAPICH2-AutoTune: An Automatic Collective Tuning Framework for the MVAPICH2 MPI Library
Srivastava, Siddhartha, January 2021
No description available.
226
Congenital amegakaryocytic thrombocytopenia iPS cells exhibit defective MPL-mediated signaling / 先天性無巨核球性血小板減少症患者由来のiPS細胞はMPLを介した細胞内シグナルが欠落している
Hirata, Shinji, 26 March 2018
Kyoto University / 0048 / New system, doctorate by dissertation / Doctor of Medical Science / Otsu No. 13159 / Thesis Doctorate (Medicine) No. 2146 / New system || Medicine || 1029 (University Library) / (Chief Examiner) Professor Hiroshi Kawamoto; Professor Taira Maekawa; Professor Akifumi Takaori / Qualified under Article 4, Paragraph 2 of the Degree Regulations / Doctor of Medical Science / Kyoto University / DFAM
227
High Performance and Scalable Cooperative Communication Middleware for Next Generation Architectures
Chakraborty, Sourav, 10 October 2019
No description available.
228
Towards an Efficient Spectral Element Solver for Poisson's Equation on Heterogeneous Platforms / Mot en effektiv spektrala element-lösare för Poissons ekvation på heterogena plattformar
Nylund, Jonas, January 2022
Neko is a project at KTH to refactor the widely used fluid dynamics solver Nek5000 to support modern hardware. Many aspects of the solver need adapting for use on GPUs, one of which is the main communication kernel, the gather-scatter (GS) routine. To avoid race conditions in this kernel, atomic operations are used, which can be inefficient. To avoid atomics, elements were instead grouped such that multiple writes to the same address always come in contiguous blocks; each block can then be assigned to a single thread and handled sequentially, avoiding the need for atomic operations altogether. Within the scope of the thesis, a Poisson solver was also ported from CPU to Nvidia GPUs. To optimize it, a batched matrix multiplication kernel was developed that performs many small matrix multiplications in bulk to better utilize the GPU, with further optimizations using shared memory and kernel unification. The performance of the different implementations was tested on two systems using an Nvidia GTX 1660 and dual Nvidia A100 GPUs, respectively. The results show only small differences in performance between the two versions of the GS kernel when considering computational cost alone, with a slight advantage of roughly 5% for the grouped variant on high-resolution domains, and in a multi-rank setup the communication time completely overwhelms any potential difference. The shared memory matrix multiplication kernel yielded around a 20% performance boost for the Poisson solver, and both versions vastly outperformed cuBLAS. The unified kernel also had a large positive impact on performance, yielding up to a 50% increase in throughput.
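The grouping idea can be sketched as follows, assuming a CSR-like layout in which contributions are pre-sorted by destination; plain C++ with OpenMP stands in for the CUDA kernel, where each destination block would map to one GPU thread. Names and layout are illustrative, not Neko's actual code.

```cpp
#include <cstddef>
#include <vector>

// Gather-scatter without atomics. Contributions are assumed pre-sorted so that
// all values targeting the same destination form one contiguous block, indexed
// CSR-style by `offsets` (offsets[d]..offsets[d+1] is destination d's block).
// Each destination is reduced by exactly one thread, so no two threads ever
// write the same address and no atomic operations are needed.
void gather_scatter_grouped(std::vector<double>& dst,
                            const std::vector<double>& contrib,
                            const std::vector<std::size_t>& offsets) {
    const std::size_t ndest = offsets.size() - 1;  // offsets has ndest+1 entries
    #pragma omp parallel for
    for (std::size_t d = 0; d < ndest; ++d) {
        double sum = 0.0;
        for (std::size_t k = offsets[d]; k < offsets[d + 1]; ++k)
            sum += contrib[k];  // sequential reduction of one block
        dst[d] += sum;          // the only write to dst[d]
    }
}
```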
229
High Performance Computing as a Service in the Cloud Using Software-Defined Networking
Jamaliannasrabadi, Saba, 27 July 2015
No description available.
230
Modeling, Detection, and Prevention of Electricity Theft for Enhanced Performance and Security of Power Grid
Depuru, Soma Shekara, 24 September 2012
No description available.