1 |
CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures
Martinez Arroyo, Gabriel Ernesto 02 September 2011
The use of graphics processing units (GPUs) in high-performance parallel computing continues to steadily become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and run-time system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a "write once, run anywhere" ecosystem for heterogeneous computing.
Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is almost straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator with a novel design that reuses the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA Runtime API, and we have successfully translated several applications from the CUDA SDK and the Rodinia benchmark suite. CU2CL's translation times are reasonable, allowing many applications to be translated at once. The number of manual changes required after running our translator on CUDA source is minimal, and some applications compile and work with no changes at all. The performance of applications translated automatically by CU2CL is on par with that of their manually ported counterparts. / Master of Science
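To make the "many similarities" concrete, the sketch below rewrites a handful of kernel-side CUDA tokens into their widely documented OpenCL counterparts. It is a toy text substitution written for this listing, not CU2CL's method: the abstract states that CU2CL is built on Clang, and the table `CUDA_TO_OPENCL` and the function `translate_kernel_text` are hypothetical names covering only the x dimension.

```python
import re

# Well-known CUDA -> OpenCL kernel-side equivalents (x dimension only).
# Illustrative table; it is NOT CU2CL's actual rule set.
CUDA_TO_OPENCL = {
    "__global__": "__kernel",
    "__shared__": "__local",
    "threadIdx.x": "get_local_id(0)",
    "blockIdx.x": "get_group_id(0)",
    "blockDim.x": "get_local_size(0)",
    "gridDim.x": "get_num_groups(0)",
    "__syncthreads()": "barrier(CLK_LOCAL_MEM_FENCE)",
}

# Longest keys first so that no key is shadowed by a shorter one.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(CUDA_TO_OPENCL, key=len, reverse=True))
)


def translate_kernel_text(cuda_source: str) -> str:
    """Rewrite the listed CUDA tokens into their OpenCL counterparts."""
    return _PATTERN.sub(lambda m: CUDA_TO_OPENCL[m.group(0)], cuda_source)


cuda_kernel = (
    "__global__ void scale(float *data, float factor) {\n"
    "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    data[i] *= factor;\n"
    "}\n"
)
opencl_kernel = translate_kernel_text(cuda_kernel)
print(opencl_kernel)
```

The output is still not valid OpenCL: the pointer parameter lacks a `__global` address-space qualifier, and host-side calls such as memory allocation and kernel launch are untouched. Gaps of this kind are one reason a purely textual rewrite is tedious and error-prone and a compiler-based translator is attractive.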
|
2 |
On the Complexity of Robust Source-to-Source Translation from CUDA to OpenCL
Sathre, Paul Daniel 12 June 2013
The use of hardware accelerators in high-performance computing has grown increasingly prevalent, particularly due to the growth of graphics processing units (GPUs) as general-purpose (GPGPU) accelerators. Much of this growth has been driven by NVIDIA's CUDA ecosystem for developing GPGPU applications on NVIDIA hardware. However, with the increasing diversity of GPUs (including those from AMD, ARM, and Qualcomm), OpenCL has emerged as an open and vendor-agnostic environment for programming GPUs as well as other parallel computing devices such as the CPU (central processing unit), APU (accelerated processing unit), FPGA (field programmable gate array), and DSP (digital signal processor).
The above, coupled with the broader array of devices supporting OpenCL and the significant conceptual and syntactic overlap between CUDA and OpenCL, motivated the creation of a CUDA-to-OpenCL source-to-source translator. However, the two differ enough to make translation non-trivial, which imposes practical limitations on both manual and automatic translation efforts. In this thesis, the performance, coverage, and reliability of a prototype CUDA-to-OpenCL source translator are addressed via extensive profiling of a large body of sample CUDA applications. An analysis of this body of applications is provided, which identifies and characterizes general CUDA source constructs and programming practices that obstruct our translation efforts. This characterization then led to more robust support in the translator, followed by an evaluation demonstrating that the performance of our automatically translated OpenCL is on par with the original CUDA for a subset of sample applications when executed on the same NVIDIA device. / Master of Science
|
3 |
Cumulus - translating CUDA to sequential C++ : Simplifying the process of debugging CUDA programs / Cumulus - översätter CUDA till sekventiell C++ : En studie i hur felsökande av CUDA-program kan förenklas
Blomkvist Karlsson, Vera January 2021
Due to their highly parallel architecture, Graphics Processing Units (GPUs) offer increased performance for programs benefiting from parallel execution. A range of technologies exist which allow GPUs to be used for general-purpose programming; NVIDIA's CUDA platform is one example. CUDA makes it possible to combine source code written for GPUs and Central Processing Units (CPUs) in the same program. Those sections that benefit from parallel execution can be written as CUDA kernels and will be executed on the GPU. With CUDA it is common to have tens, or even hundreds, of thousands of threads running in parallel. While the high level of parallelism can offer significant performance increases for executed programs, it can also make CUDA programs hard to debug. Although debuggers for CUDA exist, they cannot be used in the same way as standard debuggers, and they do not reduce the difficulties of reasoning about parallel execution. As a result, developers may feel compelled to fall back to inefficient debugging methods, such as relying on print statements. This project examines two possible approaches for creating a tool which simplifies the process of debugging CUDA programs by transforming a parallel CUDA program into a sequential program in another high-level language: one method centered around the Clang Abstract Syntax Tree (AST), and the other centered around LLVM Intermediate Representation (IR) code. The method using Clang was found to be the more suitable for translating CUDA, as it enables modifying only select parts of the input program, such as kernels. Thus, the tool Cumulus was developed as a Clang plugin. Cumulus translates parallel CUDA code into sequential C++ code, allowing developers to use any method available for C++ debugging to debug their CUDA program. The evaluation indicates that Cumulus is a potential aid in debugging CUDA programs, as it provides developers with increased flexibility.
/ Thanks to their highly parallel architecture, graphics processors can offer increased performance for programs that benefit from parallel execution. A range of technologies make it possible to use graphics processors not only for graphics computations but also for general-purpose computations. NVIDIA's CUDA platform is one such technology. CUDA makes it possible to combine, in the same program, source code written to execute on a central processor with source code written to execute on a graphics processor. Sections of a program that benefit from running in parallel can be written as a CUDA kernel, which is a function executed on the graphics processor. With CUDA it is not unusual to have tens of thousands, or even hundreds of thousands, of threads running in parallel. This very high level of parallelism can offer markedly increased performance for executed programs, but it can also make CUDA programs hard to debug. Dedicated debuggers for CUDA exist, but they cannot be used in the same way as standard debuggers, and they do not reduce the difficulty of reasoning about parallel computations. Because of this, developers may feel compelled to use inefficient debugging methods, such as relying on print statements. This project examines two possible methods for creating a tool that simplifies debugging of CUDA programs by translating a parallel CUDA program into a sequential program in a classical high-level programming language. One method is centered on Clang's AST, the other on LLVM IR code. The method using Clang was found to be the more suitable for translating CUDA code, since it allows only selected parts of the original program, such as kernels, to be translated. The tool Cumulus was therefore developed as a Clang plugin.
Cumulus translates parallel CUDA code into serialized C++ code, which lets developers use all the methods available for debugging C++ programs to debug their CUDA programs. The evaluation of Cumulus indicates that the tool can serve as a possible aid in debugging CUDA programs by offering developers increased flexibility.
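The idea of running a parallel kernel sequentially can be sketched in a few lines. The example below is written in Python for brevity and is not output produced by Cumulus, which emits C++ through a Clang plugin; `run_kernel_sequentially` and `vector_add` are hypothetical names. It replaces the implicit grid of GPU threads with explicit nested iteration, so an ordinary debugger can step through one logical thread at a time.

```python
from itertools import product


def run_kernel_sequentially(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per (block, thread) pair, one after another."""
    for block_idx, thread_idx in product(range(grid_dim), range(block_dim)):
        kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Body of a CUDA-style kernel; the thread indices are explicit parameters."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):  # guard: more logical threads than elements
        out[i] = a[i] + b[i]


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
out = [0] * len(a)

# Counterpart of a launch with 2 blocks of 4 threads: 8 logical threads, 5 elements.
run_kernel_sequentially(vector_add, 2, 4, a, b, out)
print(out)  # [11, 22, 33, 44, 55]
```

This simple loop only works for kernels whose threads are independent. A kernel that synchronizes threads within a block would need its body split at each barrier, a complication this sketch leaves out.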
|
4 |
Lattice QCD Optimization and Polytopic Representations of Distributed Memory / Optimisation de LatticeQCD et représentations polytopiques de la mémoire distribuée
Kruse, Michael 26 September 2014
Alongside experiments, present-day physics seeks to verify and deduce the laws of nature by simulating physical models on enormous computers. This thesis explores how to accelerate these simulations by improving the programs that run them. The reference application is lattice quantum chromodynamics (LQCD, for "Lattice Quantum Chromodynamics"), a branch of quantum field theory, running on IBM's most recent supercomputer, the Blue Gene/Q. First, we improve the source code of tmLQCD, an LQCD program whose performance-critical operation is an 8-point stencil in 4 dimensions. We study two different optimization strategies: the first prioritizes improving spatial and temporal locality; the second uses hardware data-stream prefetching. On the Blue Gene/Q, the first strategy reaches 20% of theoretical peak performance. The second, with up to 54% of peak performance, is much better but uses 4 times more memory because it stores results in the order in which the next stencil uses them, which requires duplicating data. The other techniques exploited are direct programming of the communication system (called MUSPI by IBM), a lightweight thread-management mechanism, explicit prefetching of certain data (using the dcbt instruction), and manual vectorization (using width-4 SIMD instructions, called QPX by IBM). List prefetching and transactional memory, two new mechanisms of the Blue Gene/Q, do not improve performance. Second, we present the implementation of an extension to the LLVM compiler, called Molly, that automatically optimizes the program, and more precisely the distribution of data and computation among the nodes of a cluster such as the Blue Gene/Q.
Molly represents arrays as integer polyhedra and uses the existing extension Polly, which represents loops and statements as polyhedra. Starting from the specification of the data distribution and the placement of computations, Molly adds the code that manages data flows between compute nodes. Molly can also permute the order of data in memory. Molly's main task is to aggregate data into sets that are sent in the same buffer to the same recipient, to avoid the overhead of transfers that are too small. We present an algorithm that minimizes the number of transfers for non-parameterized loops, based on the antichains of the data flow. In addition, we implement a heuristic that takes into account how the programmer wrote the code. Asynchronous communication primitives are inserted right after the data become available and, respectively, just before they are used. A runtime library implements these primitives using MPI. Molly handles distribution for any code representable in the polyhedral model, but works best for stencil code such as LQCD. Compiled with Molly, the LQCD code reaches 2.5% of peak performance. The performance gap is mainly due to the fact that the other optimizations, such as vectorization, are not performed. Future versions of Molly could also handle non-stencil codes efficiently and exploit the other optimizations that made the hand-optimized LQCD code so fast. / Motivated by modern-day physics, which in addition to experiments also tries to verify and deduce laws of nature by simulating state-of-the-art physical models on very large computers, this thesis explores means of accelerating such simulations by improving the simulation programs they run.
The primary focus is Lattice Quantum Chromodynamics (QCD), a branch of quantum field theory, running on IBM's newest supercomputer, the Blue Gene/Q. In a first approach, the source code of tmLQCD, a Lattice QCD program, is improved to run faster on the Blue Gene machine. Its most performance-relevant operation is an 8-point stencil in 4-dimensional space. Two different optimization strategies are pursued: one with the priority of improving spatial and temporal locality, and a second making use of the hardware's data stream prefetcher. On Blue Gene/Q the first strategy reaches up to 20% of the machine's theoretical peak floating-point performance. The second strategy, with up to 54% of peak, is much faster at the cost of using 4 times more memory, because it stores the data in the order in which they will be used in the next stencil operation, duplicating data where necessary. Other techniques exploited are direct programming of the messaging hardware (called MUSPI by IBM), a low-overhead work distribution mechanism for threads, explicit prefetching of data (using the dcbt instruction), and manual vectorization (using QPX, IBM's width-4 SIMD instructions). Hardware-based list prefetching and transactional memory, both distinct and novel features of the Blue Gene/Q system, did not improve the program's performance. The second approach is the newly written LLVM compiler extension called Molly, which optimizes the program itself, specifically the distribution of data and work between the nodes of a cluster machine such as Blue Gene/Q. Molly represents arrays using integer polyhedra and uses an existing compiler extension, Polly, which represents statements and loops using polyhedra. When Molly knows how data is distributed among the nodes and where statements are executed, it adds code that manages the data flow between the nodes. Molly can also permute the order of data in memory.
Molly's main task is to cluster data into sets that are sent to the same target in the same buffer, because single transfers involve a massive overhead. We present an algorithm that minimizes the number of transfers for unparametrized loops using anti-chains of data flows. In addition, we implement a heuristic that takes into account how the programmer wrote the code. Asynchronous communication primitives are inserted right after the data becomes available and just before it is used, respectively. A runtime library implements these primitives using MPI. Molly manages to distribute any code that is representable by the polyhedral model, but does so best for stencil codes such as Lattice QCD. Compiled using Molly, the Lattice QCD stencil reaches 2.5% of the theoretical peak performance. The performance gap is mostly because all the other optimizations, such as vectorization, are missing. Future versions of Molly may also effectively handle non-stencil codes and make use of all the optimizations that make the manually optimized Lattice QCD stencil so fast.
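For readers unfamiliar with the term, an 8-point stencil in 4-dimensional space updates every lattice site from its eight nearest neighbours, one step forward and one step backward along each of the four axes. The sketch below is a deliberately simplified illustration and not code from tmLQCD: it uses a plain sum with periodic boundaries, whereas a real Lattice QCD operator also multiplies each neighbour by gauge-link and spin matrices. The name `stencil_8pt` and the dictionary-based field layout are assumptions made for readability.

```python
from itertools import product


def stencil_8pt(field, shape):
    """Sum the 8 nearest neighbours (+/-1 along each of 4 axes), periodic."""
    result = {}
    for site in product(*(range(n) for n in shape)):
        total = 0.0
        for axis in range(4):
            for step in (-1, 1):
                neighbour = list(site)
                # Wrap around at the lattice boundary.
                neighbour[axis] = (neighbour[axis] + step) % shape[axis]
                total += field[tuple(neighbour)]
        result[site] = total
    return result


shape = (2, 2, 2, 4)  # small 4-D lattice, 32 sites
field = {site: 1.0 for site in product(*(range(n) for n in shape))}
out = stencil_8pt(field, shape)
```

Each output site reads eight input sites that are far apart in memory for most array layouts, which is why the abstract puts so much weight on locality, prefetching, and data ordering.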
|
5 |
Bobox Runtime Optimization
Krížik, Lukáš January 2015
The goal of this thesis is to create a tool that optimizes code for the task-based parallel framework Bobox. The optimizer reduces the number of short-running and long-running tasks based on static code analysis. Some short-running tasks cause unnecessary scheduling overhead: the Bobox scheduler may schedule a task even though the task does not yet have all of its input data, unless the scheduler has enough information not to schedule such a task. To remove such short-running tasks, the tool analyzes how a task uses its inputs and informs the scheduler. Long-running tasks inhibit parallel execution in some cases. A bigger task granularity can significantly improve execution times in a parallel environment. To remove a long-running task, the tool has to be able to evaluate the runtime complexity of the code and yield the task's execution at an appropriate place.
|
6 |
Exploring the Mental Lexicon of Pakistani L2 Learners : the Role of Culture and L2 Knowledge in Organizing the Mental Lexicon
Qadir, Abdul January 2011
There are different types of psycholinguistic approaches which attempt to examine the quality and the organization of the human mental lexicon; the word association experiment is one of them. The word association experiment can be used to probe the development of human vocabulary. The current investigation was carried out in order to trace the influence of cultural background and L2 knowledge on the mental lexicon of undergraduate Pakistani L2 learners of English. It was hypothesized that an individual's culture and knowledge of L2 bear a direct relation to their mental lexicon. Influenced by culture, learners may connect different words with attitudinal bonds, whereas L2 knowledge is accountable for the growth of vocabulary. The motivation stems from the fact that none of the previous studies has targeted Pakistani L2 learners for the word association test in order to investigate their mental lexicon. The data was gathered through a word association test. The results supported the hypothesis. A considerable number of attitudinal responses emerged, and the number of paradigmatic responses found in the data was the highest of all. Therefore, it was concluded that Pakistani L2 learners' vocabulary was considerably influenced by their cultural milieu, as shown by the attitudinal responses to the stimulus words, and that their vocabulary is developing toward a native-like pattern, since paradigmatic relations with the stimulus words outnumbered every other type of relation. The findings carry important implications for didactics.
|
7 |
Refaktoring a verifikace kódu mkfs xfs / Refactoring and Verification of the Code of mkfs xfs
Ťulák, Jan January 2017
This thesis describes the refactoring of the mkfs.xfs program, undertaken to make its code clearer and to clear the technical debt accumulated over the twenty years of the program's existence, followed by its static analysis. The tools used (CppCheck, Coverity, Codacy, GCC, Clang) are compared in terms of the number and type of errors found.
|
8 |
Context-aware automated refactoring for unified memory allocation in NVIDIA CUDA programs
Nejadfard, Kian 25 June 2021
No description available.
|
9 |
Funqual: User-Defined, Statically-Checked Call Graph Constraints in C++
Nelson, Andrew P 01 June 2018
Static analysis tools can aid programmers by reporting potential programming mistakes prior to the execution of a program. Funqual is a static analysis tool that reads C++17 code "in the wild" and checks that the function call graph follows a set of rules which can be defined by the user. This sort of analysis can help the programmer to avoid errors such as accidentally calling blocking functions in time-sensitive contexts or accidentally allocating memory in heap-sensitive environments. To accomplish this, we create a type system whereby functions can be given user-defined type qualifiers and where users can define their own restrictions on the call graph based on these type qualifiers. We demonstrate that this tool, when used with hand-crafted rules, can catch certain types of errors which commonly occur in the wild. We claim that this tool can be used in a production setting to catch certain kinds of errors in code before that code is even run.
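The kind of check described here can be illustrated with a small call-graph walk. The sketch below is an assumption-laden toy, not Funqual's implementation or rule syntax: the function `find_violations`, the qualifier names `realtime` and `blocking`, and the hand-written call graph are all hypothetical. It reports every function tagged with one qualifier that can reach, directly or indirectly, a function tagged with a forbidden qualifier.

```python
def find_violations(call_graph, qualifiers, forbidden):
    """Return sorted (caller, callee) pairs that break a rule.

    A rule (caller_tag, callee_tag) is broken when a function tagged
    caller_tag can reach, directly or indirectly, a function tagged
    callee_tag.
    """
    violations = []
    for caller_tag, callee_tag in forbidden:
        for start, tags in qualifiers.items():
            if caller_tag not in tags:
                continue
            # Depth-first walk over everything reachable from `start`.
            seen, stack = set(), list(call_graph.get(start, ()))
            while stack:
                fn = stack.pop()
                if fn in seen:
                    continue
                seen.add(fn)
                if callee_tag in qualifiers.get(fn, ()):
                    violations.append((start, fn))
                stack.extend(call_graph.get(fn, ()))
    return sorted(violations)


# Hypothetical program: an audio callback that indirectly writes a file.
call_graph = {
    "audio_callback": ["mix", "log_status"],
    "mix": [],
    "log_status": ["write_file"],
    "write_file": [],
    "main": ["write_file"],
}
qualifiers = {"audio_callback": {"realtime"}, "write_file": {"blocking"}}
forbidden = [("realtime", "blocking")]

print(find_violations(call_graph, qualifiers, forbidden))
# [('audio_callback', 'write_file')]
```

The call from `main` to `write_file` is not reported because `main` carries no `realtime` qualifier; only the indirect path through `log_status` breaks the rule.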
|
10 |
Comparing Android Runtime with native : Fast Fourier Transform on Android / Jämförelse av Android Runtime och native : Fast Fourier Transform på Android
Danielsson, André January 2017
This thesis investigates the performance differences between Java code compiled by Android Runtime and C++ code compiled by Clang on Android. To test the differences, the Fast Fourier Transform (FFT) algorithm was chosen as an example of when high-performance computing is relevant on a mobile device. Different aspects that could affect the execution time of a program were examined. One test measured the overhead related to the Java Native Interface (JNI). The results showed that the overhead was insignificant for FFT sizes larger than 64. Another test compared matching implementations of FFTs between Java and native code. The conclusion drawn from this test was that, of the converted algorithms, Columbia Iterative FFT performed the best in both Java and C++. A third test, evaluating the performance of vectorization, showed it to be an efficient option for native optimization. Finally, tests examined the effect of using single-precision (float) versus double-precision (double) data types. Choosing float could improve performance by using the cache in an efficient manner. / This study examined performance differences between Java code compiled by Android Runtime and C++ code compiled by Clang on Android. A Fast Fourier Transform (FFT) was used in the experiments to show which use cases require high performance on a mobile device. Various aspects that affect the use of an FFT were examined. One test examined how much impact the Java Native Interface (JNI) had on a program as a whole. The results of these tests showed that the impact was not significant for FFT sizes larger than 64. Another test examined performance differences between FFT algorithms translated from Java to C++. The conclusion from these tests was that, of the translated algorithms, Columbia Iterative FFT performed best in both Java and C++.
Vectorization proved to be an effective optimization technique for architecture-specific code written in C++. Finally, tests were carried out that examined performance differences between floating-point precisions for the data types float and double. float could improve performance by making efficient use of the processor's cache.
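As background for the iterative FFT variants mentioned in this abstract, the following is a generic iterative radix-2 Cooley-Tukey FFT. It is a textbook-style sketch written for this listing, not the Columbia Iterative FFT or any other implementation benchmarked in the thesis; `fft_iterative` and the reference `dft_naive` are hypothetical names, and the input length is assumed to be a power of two.

```python
import cmath


def fft_iterative(values):
    """Iterative radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in values]

    # Bit-reversal permutation: reorder inputs so butterflies work in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            data[i], data[j] = data[j], data[i]

    # Butterfly passes over blocks of size 2, 4, ..., n.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                even = data[start + k]
                odd = data[start + k + half] * w
                data[start + k] = even + odd
                data[start + k + half] = even - odd
                w *= w_step
        size *= 2
    return data


def dft_naive(values):
    """O(n^2) reference transform used only to check the FFT."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


sample = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, 0.5, 2.5]
max_error = max(abs(x - y) for x, y in zip(fft_iterative(sample), dft_naive(sample)))
```

Python's `complex` type is always double precision, so this sketch cannot reproduce the float-versus-double comparison; in C++ or Java the same loop structure could be written over either type, halving the memory touched per element when float is used.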
|