41 |
Exploiting Multigrain Parallelism in Pairwise Sequence Search on Emergent CMP Architectures. Aji, Ashwin Mandayam. 25 August 2008.
Emerging hybrid multi-core and many-core compute platforms deliver unprecedented performance within a single chip and are making rapid strides toward the commodity processor market; they are widely expected to replace the multi-core processors in existing High-Performance Computing (HPC) infrastructures such as large-scale clusters, grids, and supercomputers. Meanwhile, in the realm of bioinformatics, the size of genomic databases is doubling every 12 months, making novel approaches to parallelizing sequence-search algorithms increasingly important. This thesis takes a significant step toward bridging the gap between software and hardware by presenting an efficient and scalable model that accelerates a popular sequence alignment algorithm by exploiting the multigrain parallelism exposed by emerging multiprocessor architectures. Specifically, we parallelize the Smith-Waterman dynamic programming algorithm both within and across multiple Cell Broadband Engines and within an nVIDIA GeForce General-Purpose Graphics Processing Unit (GPGPU).
Cell Broadband Engine: We parallelize the Smith-Waterman algorithm within a Cell node by performing a blocked data decomposition of the dynamic programming matrix, followed by pipelined execution of the blocks across the synergistic processing elements (SPEs) of the Cell. We also introduce novel optimization methods that fully utilize the vector processing power of the SPE. As a result, we achieve near-linear scalability, or near-constant efficiency, for up to 16 SPEs on dual-Cell QS20 blades, and our design scales readily to more cores, if available. We further extend this design to accelerate the Smith-Waterman algorithm across nodes on both the IBM QS20 and the PlayStation 3 Cell cluster platforms, achieving a maximum speedup of 44 over the execution times on a single Cell node. We then introduce an analytical model that accurately estimates the execution times of parallel sequence alignments, and of wavefront algorithms in general, on Cell cluster platforms. Lastly, we contribute and evaluate TOSS -- a Throughput-Oriented Sequence Scheduler -- which leverages the performance prediction model and dynamically partitions the available processing elements to align multiple sequences simultaneously. This scheme aligns more sequences per unit time, a 33.5% improvement over a naive first-come, first-served (FCFS) scheduler.
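For readers unfamiliar with the wavefront pattern underlying this design, the sketch below (plain Python, not the Cell/SPE code) scores a Smith-Waterman matrix one anti-diagonal at a time; every cell on a diagonal depends only on the two preceding diagonals, so all cells of a diagonal can be computed in parallel, which is what the blocked, pipelined SPE execution exploits at a coarser granularity.

```python
# Illustrative sketch (not the thesis implementation): Smith-Waterman
# scored by anti-diagonals ("wavefronts"). Cells of one anti-diagonal are
# mutually independent, so a parallel version assigns them to SPEs/threads.
def smith_waterman_wavefront(a, b, match=3, mismatch=-3, gap=-2):
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):          # anti-diagonal d: cells with i + j == d
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # from diagonal d - 2
                          H[i - 1][j] + gap,    # from diagonal d - 1
                          H[i][j - 1] + gap)    # from diagonal d - 1
            best = max(best, H[i][j])
    return best

# Classic textbook pair; the best local alignment scores 13.
print(smith_waterman_wavefront("TGTTACGG", "GGTTGACTA"))
```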
nVIDIA GPGPU: We parallelize the Smith-Waterman algorithm on the GPGPU by optimizing the code in stages, including optimal data layout strategies, coalesced memory accesses, and blocked data decomposition techniques. Results show that our methods provide a maximum speedup of 3.6 on the nVIDIA GPGPU compared to a naive Smith-Waterman implementation. / Master of Science
|
42 |
Comparison of GPGPU frameworks and AES methods: determining which GPGPU framework and which AES method should be recommended for AES encryption with GPGPU. Berggren, Emil; Gustafson, Tobias. January 2017.
Background – Today's processors are approaching the limit of how high their clock frequencies can go. Instead of running faster, they now ship with multiple cores so that several threads can execute in parallel in multithreaded applications. Aside from the CPU, however, the graphics processor (GPU) holds a large amount of computing power that often sits unused for long stretches while the computer is on. Because a GPU can run thousands more threads simultaneously than a CPU, frameworks for general-purpose computation on the GPU, so-called GPGPU frameworks, have been developed. Since an individual GPU core is not as strong as a CPU core, the gain lies in algorithms with high parallelization potential; one such algorithm is the AES encryption algorithm, one of the most widely used and most secure encryption algorithms today. Purpose – With GPGPU acceleration, AES encryption is faster than with a traditional CPU approach. To make the GPU acceleration as effective as possible, this study investigates which AES method and which GPGPU framework should be chosen during development. Method – Two literature studies were conducted to determine which AES methods and which GPGPU frameworks are viable for GPU acceleration of AES. Experiments were then carried out to determine which of the chosen GPGPU frameworks is the most effective with the chosen AES method. Findings – The literature studies identified the CTR method as preferable among the AES modes, owing to its parallelization potential and strong security properties, and CUDA and OpenCL as the only GPGPU frameworks satisfying the criteria. The experiments showed that CUDA is a more effective GPGPU framework than OpenCL for AES-CTR on the tested graphics card, a GTX 560. Implications – CUDA is faster at larger file sizes because OpenCL is more limited by data-transfer speed than CUDA on a GTX 560. Limitations – The experiments were conducted on a single Nvidia graphics card, since CUDA runs only on Nvidia hardware. Because Nvidia's tools no longer support debugging or profiling for OpenCL, the OpenCL test results could not be verified with external tools. Keywords – Processor, GPGPU, AES, CTR, OpenCL, CUDA, GPGPU framework
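As a side note on why CTR is the parallel-friendly mode: each keystream block depends only on the key, the nonce, and the block index, never on the previous block. The sketch below shows that structure; the block cipher is a SHA-256-based stand-in (an assumption, flagged again in the code) so the example is self-contained where real code would apply AES to each counter block, and the thread pool merely dramatizes the independence a GPU exploits with one thread per block.

```python
# Illustrative sketch of CTR-mode structure. block_encrypt is a SHA-256
# stand-in so the example is self-contained: it is NOT AES; real code
# would encrypt each counter block with AES under the same key.
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16                                  # AES block size in bytes

def block_encrypt(key: bytes, block: bytes) -> bytes:
    return hashlib.sha256(key + block).digest()[:BLOCK]

def keystream_block(key: bytes, nonce: bytes, i: int) -> bytes:
    # Depends only on (key, nonce, i): no chaining between blocks, so any
    # thread can compute any block in any order -- the GPU-friendly property.
    return block_encrypt(key, nonce + i.to_bytes(8, "big"))

def ctr_xcrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    nblocks = -(-len(data) // BLOCK)        # ceil division
    with ThreadPoolExecutor() as pool:      # stands in for GPU threads
        blocks = pool.map(lambda i: keystream_block(key, nonce, i),
                          range(nblocks))
    ks = b"".join(blocks)[:len(data)]
    return bytes(p ^ k for p, k in zip(data, ks))

key, nonce = b"0" * 16, b"fixed-iv"         # 8-byte nonce + 8-byte counter
msg = b"the same function encrypts and decrypts"
assert ctr_xcrypt(key, nonce, ctr_xcrypt(key, nonce, msg)) == msg
```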
|
43 |
Parallelization and optimization of an organ morphogenesis simulator. Application to the elements of the kidney. Caux, Jonathan. 30 November 2012.
For several decades, modeling living matter has been a major challenge requiring ever more work in simulation. Models of living systems open the door to a whole range of applications: decision support in environmental science and ecology, teaching aids, decision support for physicians, support in the search for new pharmaceutical treatments, and so-called "predictive" biology, among others. Before such a problem can be tackled, the biological system concerned must be modeled precisely, making explicit the questions the model is meant to answer. Manipulating and studying complex systems, of which biological systems are the archetype, raises general problems of modeling and simulation. It is in this context that the company Integrative BioComputing (IBC) has been developing, since the early 2000s, the prototype of a Generic Modeling and Simulation Platform (PGMS), whose goal is to provide an environment for modeling and simulating, more simply, the biological processes and functions of a complete organism together with its constituent organs. As a generic platform still under development, the PGMS lacked the performance needed to model and simulate large biological components in sufficiently short times. To drastically improve its performance, it was therefore decided to parallelize and optimize its implementation, the goal being to model and simulate complete organs in acceptable times. The work of this thesis thus addressed several aspects of the modeling and simulation of biological systems in order to accelerate their processing. The most computation-intensive step in a PGMS run, the computation of physicochemical fields, was the subject of a parallelization feasibility study. Among the architectures available for parallelizing such an application, we chose general-purpose computation on graphics processing units (GPGPU), motivated among other things by the low cost of the hardware relative to its massive raw computing power, which makes it one of the most accessible parallel architectures on the market. As the feasibility study was particularly conclusive, the parallelized field computation was then integrated into the PGMS. In parallel, we also carried out optimization work on the sequential performance of the PGMS. Together, these efforts increased execution speed by a factor of 18.12 on the longest simulations (from 16 minutes for the unoptimized version using a single CPU core to 53 seconds for the optimized version, still using a single CPU core but also a GTX500 GPU). The other major aspect of this work was improving the algorithmic performance of simulating three-dimensional cellular automata, which can both simulate biological behavior and implement modeling mechanisms such as multi-scale interactions. The research consisted mainly of original algorithmic proposals to improve the simulations IBC runs on the PGMS. Sequential acceleration, through a three-dimensional implementation of the Hash-Life algorithm, and parallelization on GPGPU were studied together and yielded very significant reductions in computation time.
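For readers unfamiliar with the kind of kernel involved, the sketch below (assuming numpy; the birth/survive rules are hypothetical, chosen only for illustration, and are not IBC's biological rules) shows one synchronous update step of a three-dimensional cellular automaton: every cell updates independently from its 26-cell Moore neighborhood, exactly the regularity that both Hash-Life memoization and GPGPU execution exploit.

```python
# Illustrative sketch (assumes numpy): one synchronous step of a 3-D
# cellular automaton with a 26-cell Moore neighborhood. The birth/survive
# rule sets are hypothetical examples.
import numpy as np

def step(grid, birth={6}, survive={5, 6, 7}):
    neighbors = np.zeros_like(grid)
    for dx in (-1, 0, 1):                  # 26 shifted copies of the grid
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                if (dx, dy, dz) != (0, 0, 0):
                    neighbors += np.roll(grid, (dx, dy, dz), axis=(0, 1, 2))
    born = (grid == 0) & np.isin(neighbors, list(birth))
    stays = (grid == 1) & np.isin(neighbors, list(survive))
    return (born | stays).astype(grid.dtype)

rng = np.random.default_rng(0)
g = (rng.random((32, 32, 32)) < 0.1).astype(np.int8)
g = step(g)   # every cell updates independently: one GPU thread per cell
```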
|
44 |
Software design for flexible radio: integration of heterogeneous computing units. Horrein, Pierre-Henri. 10 January 2012.
Flexible radio makes it possible to envisage evolving wireless networks that are more efficient and smarter. "Flexible radio" is a very general term: it can refer to a software implementation of the radio operations, to an adaptable hardware implementation based on hardware accelerators, or to mixed implementations; it covers, in fact, any radio terminal implementation that is not fixed. The work in this thesis revolves around two points. The first is the use of graphics processors for flexible radio, and in particular for software radio. These targets offer impressive raw compute throughput thanks to their massively parallel architecture, but the data parallelism of these processors differs from the task parallelism typically exploited in software-radio applications. Different approaches to resolving this mismatch are studied. The results on this point yield a clear improvement in the compute throughput attainable with a software implementation, while freeing the processor for other tasks. The second point addressed is the definition of an environment able to support different implementation choices for flexible radio. This environment encompasses support for heterogeneous platforms and the management of applications on those platforms. Although still experimental, the first results obtained with the environment demonstrate its adaptability and make it usable for varied radio applications on heterogeneous platforms.
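As an illustration of the mismatch discussed above and of one standard batching remedy, the sketch below (assuming numpy; the FIR filter and the FM-style discriminator are invented stages, not the thesis pipeline) turns a task-parallel set of per-channel operations into wide data-parallel array operations over a batch of channels, the form that maps naturally onto a GPU.

```python
# Illustrative sketch (assumes numpy): batching task-parallel channels
# into data-parallel array operations. The filtering and discriminator
# stages are invented examples.
import numpy as np

def demodulate_batch(samples, taps):
    """samples: (channels, n) complex baseband; taps: shared FIR taps."""
    n = samples.shape[1]
    # Stage 1: filter every channel at once via batched FFTs -- one wide
    # data-parallel operation instead of one task per channel.
    filtered = np.fft.ifft(np.fft.fft(samples, axis=1) *
                           np.fft.fft(taps, n)[None, :], axis=1)
    # Stage 2: phase-difference discriminator, again over the whole batch.
    return np.angle(filtered[:, 1:] * np.conj(filtered[:, :-1]))

rng = np.random.default_rng(0)
phase = np.cumsum(0.1 * rng.standard_normal((64, 4096)), axis=1)
out = demodulate_batch(np.exp(1j * phase), np.ones(8) / 8)
print(out.shape)   # 64 channels demodulated in two array calls
```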
|
45 |
A hybrid FPGA-based platform for accelerating a biological sequence alignment algorithm. FIGUEIRÔA, Luiz Henrique Alves. 17 August 2015.
The revelation of the double-helix structure of DNA by James D. Watson and Francis H. C. Crick in 1953 opened the way to understanding the mechanisms that encode the construction and development instructions of the cells of living beings, and DNA sequencing is one of the first steps in that process. The new generation of sequencers (NGS) has poured enormous volumes of data into biological databases, and compiling that information can demand intense computational activity. However, the performance of the tools employed in computational biology has not grown at the same rate as these databases, which may constrain advances in this research field. One of the primary techniques used is sequence alignment, which, by identifying similarities, enables the analysis of conserved regions in homologous sequences and serves as a starting point for studying protein secondary structures and building phylogenetic trees, among other uses. Because exact alignment algorithms have quadratic complexity in time and space, the computational cost can be high, demanding acceleration strategies. High-Performance Computing (HPC), structured around supercomputers and clusters, has been employed in this context; however, the initial investment and the requirements for maintenance, floor space, and cooling, in addition to energy consumption, can represent significant costs. Hybrid parallel architectures based on the joint action of PCs and accelerator devices such as VLSI chips, GPGPUs, and FPGAs have emerged as more affordable alternatives with promising results. The project described in this dissertation aims to accelerate the global optimal-alignment algorithm known as Needleman-Wunsch on a hybrid platform based on a PC, acting as host, and an FPGA. Acceleration comes from exploiting the parallelism opportunities offered by the algorithm and implementing them in hardware. The resulting architecture, based on a Linear Systolic Array, delivers high performance and good scalability.
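To make the systolic idea concrete, the following sketch (plain Python, a software emulation rather than the FPGA design) advances one anti-diagonal of the Needleman-Wunsch matrix per simulated clock cycle, with each processing element (PE) owning one query row; on the FPGA, all PEs fire simultaneously on every cycle.

```python
# Illustrative sketch: software emulation of a linear systolic array for
# Needleman-Wunsch. PE i owns query row i; each simulated clock cycle
# advances one full anti-diagonal, so in hardware all PEs update at once.
def nw_systolic(query, db, match=1, mismatch=-1, gap=-1):
    n, m = len(query), len(db)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        score[i][0] = i * gap              # boundary: leading gaps
    for j in range(m + 1):
        score[0][j] = j * gap
    for cycle in range(2, n + m + 1):      # one wavefront per cycle
        for pe in range(max(1, cycle - m), min(n, cycle - 1) + 1):
            i, j = pe, cycle - pe          # the cell PE `pe` updates now
            s = match if query[i - 1] == db[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[n][m]

print(nw_systolic("GATTACA", "GCATGCU"))   # classic pair: global score 0
```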
|
46 |
Adaptation of Fast Fourier Transform processing on a hybrid integrated CPU/GPU architecture. Bergach, Mohamed Amine. 02 October 2015.
Intel Core multicore architectures (IvyBridge, Haswell, ...) contain both general-purpose CPU cores (4) and dedicated GPU cores embedded on the same chip (16 and 40, respectively). As part of the activity of Kontron (the company partially funding this CIFRE scholarship), an important objective is to compute efficiently on this architecture arrays and sequences of Fast Fourier Transforms (FFTs), such as those found in radar applications. While native (but proprietary) Intel libraries exist for the CPU, nothing comparable is currently available for the GPU part. The aim of the thesis was therefore to define the efficient placement of FFT modules and to study, at the theoretical level, the optimal way of grouping the computation stages of such an FFT according to data locality on a single compute core. This a-priori choice promises efficient processing by matching the size of the required data to the size of the available memory. The multiplicity of cores then remains exploitable to compute several FFTs in parallel without interference (except for possible bus contention between the CPU and the GPU). We obtained significant results both for the implementation of a 1024-point FFT on a SIMD CPU core, expressed in C, and for the implementation of an FFT of the same size on a SIMT GPU core, expressed in OpenCL. Moreover, our results make it possible to define rules for automatically synthesizing such solutions, based solely on the size of the FFT (more precisely, its number of stages) and the size of the local memory of a given compute core. The performance obtained exceeds that of the native Intel library for the CPU and demonstrates a significant gain in energy consumption on the GPU. All these points are detailed in the thesis document, and the results should be put to use within Kontron.
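To make "stages" and the grouping rule concrete, the sketch below shows an iterative radix-2 FFT written as explicit butterfly stages, followed by a sizing rule in the spirit of the thesis (the 64 KB local-memory figure and 8 bytes per complex point are illustrative assumptions, not Kontron or Intel parameters): fuse as many stages as keep the working set resident in a core's local memory.

```python
# Illustrative sketch: an iterative radix-2 FFT as explicit butterfly
# stages, plus an assumed local-memory sizing rule for grouping stages.
import cmath, math

def fft_stages(x):
    x, n = list(x), len(x)                 # n must be a power of two
    j = 0                                  # bit-reversal permutation
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    length = 2
    while length <= n:                     # log2(n) butterfly stages
        w_len = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            w = 1.0
            for k in range(start, start + length // 2):
                u, v = x[k], x[k + length // 2] * w
                x[k], x[k + length // 2] = u + v, u - v
                w *= w_len
        length <<= 1
    return x

def stages_per_pass(fft_points, local_mem_bytes, bytes_per_point=8):
    # Fuse all stages whose working set stays resident in local memory,
    # avoiding a round trip to global memory between stages.
    resident = local_mem_bytes // bytes_per_point
    return min(int(math.log2(fft_points)), int(math.log2(resident)))

print(fft_stages([1, 0, 0, 0]))            # impulse -> flat spectrum of 1s
print(stages_per_pass(1024, 64 * 1024))    # all 10 stages fit in one pass
```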
|
47 |
Efficient Dynamic Automatic Memory Management and Concurrent Kernel Execution for General-Purpose Programs on Graphics Processing Units. Pai, Sreepathi. 11 1900.
Modern supercomputers now use accelerators to achieve their performance, the most widely used accelerator being the Graphics Processing Unit (GPU). However, achieving the performance potential of systems that combine a GPU and a CPU is an arduous task that could be made easier with the assistance of the compiler or runtime. In particular, exploiting two features of GPU architectures -- distributed memory and concurrent kernel execution -- is critical for good performance, but in current GPU programming systems programmers must exploit them manually, which can lead to poor performance. In this thesis, we propose automatic techniques that: i) perform data transfers between the CPU and GPU, ii) allocate resources for concurrent kernels, and iii) schedule concurrent kernels efficiently, all without programmer intervention.
Most GPU programs access data in GPU memory for performance. Manually inserting the data transfers that move data to and from this GPU memory is an error-prone and tedious task. In this work, we develop a software coherence mechanism to fully automate all data transfers between the CPU and GPU without any assistance from the programmer. Our mechanism uses compiler analysis to identify potential stale-data accesses and uses a runtime to initiate transfers as necessary, avoiding the redundant transfers exhibited by all other existing automatic memory management proposals for general-purpose programs. We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers. Tested on eight programs ported from the Rodinia benchmark suite, it achieves (i) a 1.06x speedup over hand-tuned manual memory management, and (ii) a 1.29x speedup over another recently proposed compiler-runtime automatic memory management system. Compared to other existing runtime-only (ADSM) and compiler-only (OpenMPC) proposals, it also transfers 2.2x to 13.3x less data on average.
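The following sketch (a toy model, not the X10 compiler/runtime implementation) illustrates the coherence idea: track which copy of each buffer is fresh and transfer only on a stale access, so repeated reads on the same side cost nothing.

```python
# Illustrative sketch of stale-access-driven CPU<->GPU coherence: the
# runtime records which copies are fresh and transfers only when the
# other side is about to read stale data.
class ManagedBuffer:
    def __init__(self, host_data):
        self.host, self.device = host_data, None
        self.fresh = {"host"}          # which copies are up to date

    def _sync(self, side, copy_fn):
        if side not in self.fresh:     # stale -> transfer; fresh -> skip
            copy_fn(self)
            self.fresh.add(side)

    def read_on_device(self):
        self._sync("device", lambda b: setattr(b, "device", list(b.host)))
        return self.device

    def write_on_device(self, data):
        self.device = data
        self.fresh = {"device"}        # host copy is now stale

    def read_on_host(self):
        self._sync("host", lambda b: setattr(b, "host", list(b.device)))
        return self.host

buf = ManagedBuffer([1, 2, 3])
buf.read_on_device()          # first GPU read: one host->device transfer
buf.read_on_device()          # second read: no transfer (still fresh)
buf.write_on_device([4, 5])   # GPU write invalidates the host copy
print(buf.read_on_host())     # device->host transfer happens here -> [4, 5]
```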
Each new generation of GPUs vastly increases the resources available to GPGPU programs. GPU programming models (like CUDA) were designed to scale to use these resources, but we find that CUDA programs in practice do not scale to utilize all available resources, with over 30% of resources going unused on average for programs of the Parboil2 suite. Current GPUs therefore allow concurrent execution of kernels to improve utilization. We study concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs. On two-program workloads from Parboil2, we find that concurrent execution is often no better than serialized execution, and we identify the lack of control over resource allocation to kernels as a major serialization bottleneck. We propose transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage. We then propose several elastic-kernel-aware runtime concurrency policies that offer significantly better performance and concurrency than the current CUDA policy. We evaluate our proposals on real hardware using multiprogrammed workloads constructed from benchmarks in the Parboil2 suite. On average, our proposals increase system throughput (STP) by 1.21x and improve the average normalized turnaround time (ANTT) by 3.73x for two-program workloads over the current CUDA concurrency implementation.
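A minimal sketch of the elastic-kernel transformation (illustrative; real elastic kernels rewrite CUDA block and thread indexing): the logical grid the programmer wrote is decoupled from the physical grid actually launched, so the runtime can shrink or grow a kernel's resource footprint without changing its results.

```python
# Illustrative sketch: decouple the logical grid from the physical grid.
# Each physical block strides over many logical block indices, so the
# same kernel works for whatever physical_blocks the scheduler picks.
def run_elastic(kernel, logical_blocks, physical_blocks):
    for phys in range(physical_blocks):
        for logical in range(phys, logical_blocks, physical_blocks):
            kernel(logical)

out = []
run_elastic(lambda b: out.append(b * b), logical_blocks=8, physical_blocks=3)
print(sorted(out))  # same work as an 8-block launch, using only 3 "blocks"
```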
Recent NVIDIA GPUs use a FIFO policy in their thread block scheduler (TBS) to schedule thread blocks of concurrent kernels. We show that FIFO leaves performance to chance, resulting in significant loss of performance and fairness. We therefore propose using the Shortest Remaining Time First (SRTF) policy instead. Since SRTF requires an estimate of runtime (i.e., execution time), we introduce Structural Runtime Prediction, which uses the grid structure of GPU programs to predict runtimes. Using a novel Staircase model of GPU kernel execution, we show that kernel runtime can be predicted by profiling only the first few thread blocks. We evaluate an online predictor based on this model on benchmarks from ERCBench and find that predictions made after the execution of a single thread block are between 0.48x and 1.08x of actual runtime. We implement the SRTF policy for concurrent kernels using this predictor and evaluate it on two-program workloads from ERCBench. SRTF improves STP by 1.18x and ANTT by 2.25x over FIFO. Compared to MPMax, a state-of-the-art resource allocation policy for concurrent kernels, SRTF improves STP by 1.16x and ANTT by 1.3x. To improve fairness, we also propose SRTF/Adaptive, which controls the resource usage of concurrently executing kernels to maximize fairness; it improves STP by 1.12x, ANTT by 2.23x, and fairness by 2.95x compared to FIFO. Overall, our implementation of SRTF achieves STP within 12.64% of Shortest Job First (SJF, an oracle-optimal scheduling policy), bridging 49% of the gap between FIFO and SJF.
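The sketch below (with invented block counts, timings, and wave width) shows a staircase-style prediction and the resulting SRTF ordering; with a static workload and no new arrivals, SRTF reduces to running the job with the shortest predicted remaining time first.

```python
# Illustrative sketch (invented numbers): staircase-style runtime
# prediction from one profiled thread block, then SRTF ordering.
import heapq

def predict(first_block_time, total_blocks, blocks_per_wave=8):
    waves = -(-total_blocks // blocks_per_wave)    # ceil division
    return first_block_time * waves                # the "staircase" model

def srtf_order(kernels):
    """kernels: (name, time of first profiled block, total thread blocks)."""
    heap = [(predict(t, b), name) for name, t, b in kernels]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

jobs = [("sgemm", 2.0, 64),     # predicted 8 waves x 2.0  = 16.0
        ("spmv", 0.5, 16),      # predicted 2 waves x 0.5  =  1.0
        ("stencil", 1.0, 96)]   # predicted 12 waves x 1.0 = 12.0
print(srtf_order(jobs))         # -> ['spmv', 'stencil', 'sgemm']
```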
|
48 |
Modeling and implementing multi-agent based simulations on massively parallel architectures. Hermellin, Emmanuel. 18 November 2016.
Multi-agent-based simulation (MABS) is a relevant approach for engineering and studying complex systems in many domains (artificial life, biology, economics, etc.). However, MABS can require enormous computational resources, a major technological barrier that restricts how the envisaged models can be studied (scalability, expressiveness of the proposed models, real-time interaction, etc.). Among the technologies available for High-Performance Computing (HPC), General-Purpose computing on Graphics Processing Units (GPGPU) uses the massively parallel architecture of graphics cards (GPUs) as a computation accelerator. Yet while many domains benefit from GPGPU performance (meteorology, aerodynamics, molecular modeling, finance, etc.), multi-agent systems (MAS), and MABS in particular, hardly enjoy its benefits: GPGPU is little used, and few works take an interest in it. The reason is that GPGPU comes with a very specific development context that requires a deep, non-trivial transformation of multi-agent models; despite pioneering works demonstrating the interest of GPGPU, this difficulty explains the low enthusiasm of the multi-agent community for the technology. In this thesis, we show that among the works aiming to ease the use of GPGPU in an agent context, most do so through a transparent use of the technology. That approach, however, requires abstracting away parts of the model, which greatly limits the scope of the proposed solutions. To address this problem, and in contrast to existing solutions, we propose a hybrid approach (execution of the simulation is shared between the processor and the graphics card) that emphasizes accessibility and reusability through a modeling process allowing direct, simplified use of GPU programming. More precisely, this approach rests on a design principle, called GPU delegation of agent perceptions, which consists in reifying part of the computations performed in agent behaviors into new structures (e.g., in the environment): agent behaviors stay with the processor while environmental dynamics are handled by the graphics card, distributing the complexity of the code and modularizing its implementation. The study of this principle, together with the various experiments conducted, shows the interest of the approach from both a conceptual and a performance standpoint. We therefore propose to generalize it as a comprehensive modeling-and-implementation methodology for multi-agent simulations specifically adapted to massively parallel architectures.
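A minimal sketch of the delegation principle (assuming numpy; the pheromone-diffusion dynamic and the 3x3 perception window are invented examples, not the thesis models): the environment computes a perception field for all cells in one regular array pass, the part handed to the graphics card, while agent behaviors remain simple sequential reads of that field.

```python
# Illustrative sketch (assumes numpy): GPU delegation of agent perceptions.
import numpy as np

def diffuse(field, rate=0.25):
    # Environmental dynamic as one whole-grid stencil pass: regular,
    # branch-free work, i.e. the part delegated to the graphics card.
    neigh = sum(np.roll(field, s, axis=ax) for s in (-1, 1) for ax in (0, 1))
    return field + rate * (neigh / 4 - field)

rng = np.random.default_rng(1)
pheromone = diffuse(rng.random((128, 128)))

# Agent behaviours stay sequential and simple: each agent just reads the
# precomputed field instead of scanning its neighbourhood itself.
agents = [(5, 7), (40, 40), (100, 12)]
moves = [divmod(int(np.argmax(pheromone[x - 1:x + 2, y - 1:y + 2])), 3)
         for x, y in agents]
print(moves)   # each agent's preferred cell within its 3x3 window
```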
|
49 |
Comparison of Technologies for General-Purpose Computing on Graphics Processing Units. Sörman, Torbjörn. January 2016.
The computational capacity of graphics cards for general-purpose computing has progressed fast over the last decade. A major reason is computationally heavy computer games, where the standard of performance and high-quality graphics constantly rises. Another reason is better-suited technologies for programming the graphics cards. Combined, the product is devices with high raw performance and the means to access that performance. This thesis investigates some of the current technologies for general-purpose computing on graphics processing units. Technologies are compared primarily by benchmarking performance and secondarily by factors concerning programming and implementation. The choice of technology can have a large impact on performance: the benchmark application found the execution times of the fastest technology, CUDA, and the slowest, OpenCL, to differ by up to a factor of two. The benchmark application also found that the older technologies, OpenGL and DirectX, are competitive with CUDA and OpenCL in terms of resulting raw performance.
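Comparisons like this one rest on careful measurement methodology; the sketch below shows the general shape of such a harness (the two implementations are plain-Python stand-ins, not the thesis' CUDA/OpenCL/OpenGL/DirectX kernels): identical inputs, warm-up runs excluded, and the median of repeated timings.

```python
# Illustrative sketch of a cross-technology benchmark harness: same
# input, warm-up excluded, median of repeats.
import statistics, time

def benchmark(fn, data, warmup=3, reps=10):
    for _ in range(warmup):                # discard JIT/driver warm-up
        fn(data)
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(data)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

impl_a = lambda xs: [x * x for x in xs]             # stand-in "technology A"
impl_b = lambda xs: list(map(lambda x: x * x, xs))  # stand-in "technology B"
data = list(range(100_000))
for name, fn in (("A", impl_a), ("B", impl_b)):
    print(name, f"{benchmark(fn, data) * 1e3:.2f} ms")
```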
|
50 |
Using GPU for HPC. Máček, Branislav. Unknown Date.
Recently there has been significant growth in building HPC systems, which are nowadays built from mainstream computer components. One of these components is the graphics accelerator with its GPU. This thesis describes graphics accelerators and examines their usage possibilities. A GPU chip contains hundreds of simple processors, and the thesis examines how to benefit from these parallel processors. It describes several test applications, discusses experimental results, and compares them with other components used for HPC.
|