11

Automatic Parallelization of Equation-Based Simulation Programs

Aronsson, Peter, January 2006
Modern equation-based object-oriented modeling languages, which have emerged during the past decades, make it easier to build models of large and complex systems. The increasing size and complexity of modeled systems requires high-performance execution of the simulation code derived from such models. More efficient compilation and code optimization techniques can help to some extent, but a number of heavy-duty simulation applications require high-performance parallel computers to obtain acceptable execution times. Unfortunately, the additional performance offered by parallel computer architectures is available only if the simulation program is expressed in a way that makes the potential parallelism accessible to the parallel computer. Manual parallelization of computer programs is generally a tedious and error-prone process, so achieving automatic parallelization of simulation programs is very attractive.

This thesis presents solutions to the research problem of finding practically usable methods for automatic parallelization of simulation codes produced from models in typical equation-based object-oriented languages. The methods have been implemented in a tool that automatically translates models in the Modelica modeling language into parallel code that can be executed efficiently on parallel computers, and the tool has been evaluated on several application models. The research problem includes extracting a sufficient amount of parallelism from equations represented as a data dependency graph (task graph), which requires analyzing the code down to the level of individual expressions, and it also requires efficient clustering algorithms for building clusters of tasks from the task graph. One of the major contributions of this thesis is a new approach for merging fine-grained tasks using a graph rewrite system. Results show that this method efficiently merges task graphs, decreasing their size while still retaining a reasonable amount of parallelism. Moreover, the new task-merging approach is applicable to any program that can be represented as a static (or almost static) task graph, not only to code from equation-based models.

An early prototype called DSBPart was developed to parallelize code produced by the Dymola tool. The final research prototype is the ModPar tool, which is part of the OpenModelica framework. Results from the DSBPart and ModPar tools show that the amount of parallelism in complex models varies substantially between application models, and that parallelization can in some cases produce reasonable speedups. The optimization techniques applied to a model's system of equations also affect the amount of parallelism of the model and thus influence how much is gained by parallelization.
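A minimal sketch, in Python for illustration, of one task-merging rewrite rule of the kind this abstract describes: a task with a single predecessor is folded into that predecessor when the communication cost between them outweighs what is gained by keeping them apart. The graph representation and the cost test are assumptions made for this sketch, not the rules implemented in ModPar.

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        cost: float                              # estimated execution time
        preds: set = field(default_factory=set)
        succs: set = field(default_factory=set)

    def merge_single_pred(graph: dict, comm_cost: float) -> None:
        # Repeatedly apply the rewrite rule: fold a task into its only
        # predecessor when sending its inputs between processors would
        # cost more than running the two tasks in sequence.
        changed = True
        while changed:
            changed = False
            for tid, task in list(graph.items()):
                if len(task.preds) != 1:
                    continue
                (pid,) = task.preds
                pred = graph[pid]
                if comm_cost >= task.cost:       # communication dominates
                    pred.cost += task.cost
                    pred.succs.discard(tid)
                    for sid in task.succs:       # reattach successors to
                        graph[sid].preds.discard(tid)   # the merged task
                        graph[sid].preds.add(pid)
                        pred.succs.add(sid)
                    del graph[tid]
                    changed = True
                    break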
12

Automatic Task Formation Techniques for the Multi-level Computing Architecture

Stewart, Kirk, 30 July 2008
The Multi-Level Computing Architecture (MLCA) is a multiprocessor system-on-chip architecture designed for multimedia applications. It provides a programming model that simplifies the writing of parallel applications by eliminating the need for explicit synchronization. However, developers must still invest effort to design applications that fully exploit the MLCA's multiprocessing capabilities. We present a set of compiler techniques to streamline the development of MLCA applications: an algorithm that automatically partitions a sequential application into tasks that can be executed in parallel, and code generation algorithms that translate annotated, sequential C code to the MLCA's programming model. We provide an experimental evaluation of these techniques, performed with a prototype compiler based on the open-source ORC compiler and integrated with the MLCA Optimizing Compiler. This evaluation shows that the performance of automatically generated code compares favourably to that of manually written code.
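A hypothetical sketch of the task-formation step, assuming statements are summarized by their read and write sets: statements with overlapping data accesses are grouped into the same task (here via a small union-find), leaving the resulting tasks free to run in parallel. This illustrates the idea only; it is not the thesis's actual algorithm.

    def form_tasks(statements):
        # statements: list of (reads, writes) pairs of variable-name sets.
        # Returns a list of tasks, each a list of statement indices whose
        # data accesses overlap and must therefore stay together.
        parent = list(range(len(statements)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]   # path halving
                i = parent[i]
            return i

        def union(i, j):
            parent[find(i)] = find(j)

        for i, (ri, wi) in enumerate(statements):
            for j in range(i + 1, len(statements)):
                rj, wj = statements[j]
                # Flow, anti, or output dependence forces the same task.
                if (wi & (rj | wj)) or (wj & ri):
                    union(i, j)

        tasks = {}
        for i in range(len(statements)):
            tasks.setdefault(find(i), []).append(i)
        return list(tasks.values())

For example, form_tasks([({"a"}, {"x"}), ({"b"}, {"y"}), ({"x"}, {"z"})]) groups statements 0 and 2 (which communicate through x) into one task and leaves statement 1 as an independent task.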
14

Automatic Parallelization for Graphics Processing Units in JikesRVM

Leung, Alan Chun Wai, January 2008
Accelerated graphics cards, or graphics processing units (GPUs), have become ubiquitous in recent years. On the right kinds of problems, GPUs greatly surpass CPUs in raw performance. However, GPUs are currently used only for a narrow class of special-purpose applications, and the raw processing power available in a typical desktop PC sits unused most of the time. The goal of this work is to present an extension to JikesRVM that automatically executes suitable code on the GPU instead of the CPU. Both static and dynamic features are used to decide whether it is feasible and beneficial to offload a piece of code to the GPU. Feasible code is discovered by an implementation of data dependence analysis. A cost model that balances the speedup available from the GPU against the cost of transferring input and output data between main memory and GPU memory is used to determine whether a feasible parallelization is indeed beneficial. The cost model is parameterized so that it can be applied to different hardware combinations. We also present ways to overcome several obstacles to parallelization inherent in the design of the Java bytecode language: unstructured control flow, the lack of multi-dimensional arrays, precise exception semantics, and the proliferation of indirect references.
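A minimal sketch of the kind of parameterized cost model described above, assuming a simple linear transfer-time model; the parameter names and the model's form are illustrative assumptions, not the formulas used in the JikesRVM extension.

    def should_offload(cpu_time_est: float,
                       gpu_time_est: float,
                       bytes_in: int,
                       bytes_out: int,
                       bandwidth_bytes_per_s: float,
                       transfer_latency_s: float) -> bool:
        # Offload only when the GPU speedup outweighs the cost of copying
        # inputs to GPU memory and copying results back to main memory.
        transfer_time = (2 * transfer_latency_s
                         + (bytes_in + bytes_out) / bandwidth_bytes_per_s)
        return gpu_time_est + transfer_time < cpu_time_est

Retuning the bandwidth and latency parameters is what would let the same decision procedure be applied to different hardware combinations.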
16

Automatic Parallelization using Pipelining for Equation-Based Simulation Languages

Lundvall, Håkan, January 2008
During the most recent decades, modern equation-based object-oriented modeling and simulation languages, such as Modelica, have become available. This has made it easier to build complex and more detailed models for use in simulation. To simulate such large and complex systems it is sometimes not enough to rely on the ability of a compiler to optimize the simulation code and reduce the size of the underlying set of equations to speed up the simulation on a single processor. Instead we must look for ways to utilize the increasing number of processing units available in modern computers. However, to gain any increased performance from a parallel computer, the simulation program must be expressed in a way that exposes the potential parallelism to the computer. Doing this manually is not a simple task, and most modelers are not experts in parallel computing, so it is very appealing to let the compiler parallelize the simulation code automatically. This thesis investigates techniques for automatically translating models in typical equation-based languages, such as Modelica, into parallel simulation code that enables high utilization of the processors available in a parallel computer. The two main ideas investigated here are the following: first, to apply parallelization simultaneously to both the system equations and the numerical solver; and second, to use software pipelining to further reduce the time processors spend waiting for the results of other processors. Prototype implementations of the investigated techniques have been developed as part of the OpenModelica open-source compiler for Modelica. The prototype has been used to evaluate the parallelization techniques by measuring the execution time of test models on a few parallel architectures and comparing the results to sequential code as well as to the results achieved in earlier work. A measured speedup of 6.1 on eight processors on a shared-memory machine has been reached. It still remains to evaluate the methods for a wider range of test models and parallel architectures.
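A toy sketch of the software-pipelining idea, assuming the work of one simulation step can be split into an equation-evaluation stage and a solver-update stage handed over through a bounded queue. The two-stage split is an illustrative simplification, not the OpenModelica implementation, and CPython threads are used here only to show the pipeline structure.

    import threading, queue

    def pipelined_simulation(steps, eval_equations, advance_solver):
        handoff = queue.Queue(maxsize=1)     # boundary between stages

        def stage1():                        # evaluate the model equations
            for k in range(steps):
                handoff.put(eval_equations(k))
            handoff.put(None)                # end-of-stream marker

        def stage2():                        # advance the numerical solver
            while (rhs := handoff.get()) is not None:
                advance_solver(rhs)

        workers = [threading.Thread(target=stage1),
                   threading.Thread(target=stage2)]
        for w in workers: w.start()
        for w in workers: w.join()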
17

Source-to-Source Automatic Program Transformations for GPU-like Hardware Accelerators

Amini, Mehdi, 13 December 2012
Since the beginning of the 2000s, the raw performance of processor cores has stopped increasing exponentially. Modern graphics processing units (GPUs) have been designed as arrays of hundreds or thousands of compute units. Their compute capacity quickly led to their being diverted from their original rendering purpose and used as accelerators for general-purpose computation. However, programming a GPU efficiently for computations other than 3D rendering remains challenging. The current jungle in the hardware ecosystem is mirrored in the software world, with more and more programming models, new languages, and different APIs, but no one-size-fits-all solution has emerged.

This thesis proposes a compiler-based solution to partially address the three "P" properties: Performance, Portability, and Programmability. The goal is to automatically transform a sequential program into an equivalent program accelerated with a GPU. A prototype, Par4All, is implemented and validated with numerous experiments. Programmability and portability are enforced by definition, and while the performance may not match what an expert programmer can obtain, it has been measured to be excellent over a wide range of kernels and applications.

A survey of GPU architectures and of trends in language and framework design is presented. Data movement between the host and the accelerator is managed without involving the developer: an algorithm is proposed to optimize communication by sending data to the GPU as early as possible and keeping it there as long as it is not required by the host. Loop transformation techniques for kernel code generation are employed, and even well-known ones have to be adapted to specific GPU constraints. They are combined in a coherent and flexible way and scheduled within the compilation flow of an interprocedural compiler. Preliminary work on extending the approach toward multiple GPUs is also presented.
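A simplified sketch of the communication-optimization idea, assuming a runtime that tracks which arrays are resident on the GPU: data is copied in only if not already there, kernel outputs stay on the device, and copy-back is delayed until the host actually touches the data. The to_gpu/from_gpu hooks are hypothetical placeholders, not the Par4All API.

    class ResidencyMap:
        def __init__(self, to_gpu, from_gpu):
            self.on_gpu = set()              # arrays currently valid on GPU
            self.to_gpu, self.from_gpu = to_gpu, from_gpu

        def before_kernel(self, reads, writes):
            for a in reads:
                if a not in self.on_gpu:     # send as early as possible,
                    self.to_gpu(a)           # but never redundantly
                    self.on_gpu.add(a)
            self.on_gpu.update(writes)       # kernel results stay on GPU

        def before_host_read(self, a):
            if a in self.on_gpu:
                self.from_gpu(a)             # lazy copy-back on host use

        def before_host_write(self, a):
            self.on_gpu.discard(a)           # GPU copy becomes stale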
18

Automated Recognition of Algorithmic Patterns in DSP Programs

Shafiee Sarvestani, Amin, January 2011
We introduce an extensible knowledge-based tool for idiom (pattern) recognition in DSP (digital signal processing) programs. Our tool utilizes functionality provided by the Cetus compiler infrastructure for detecting certain computation patterns that frequently occur in DSP code. We focus on recognizing patterns for for-loops and the statements in their bodies, as these are often the performance-critical constructs in DSP applications, for which replacement by highly optimized, target-specific parallel algorithms will be most profitable. For better structuring and efficiency of pattern recognition, we classify patterns by levels of complexity such that patterns at higher levels are defined in terms of lower-level patterns. The tool works statically on the intermediate representation (IR). It traverses the abstract syntax tree IR in post-order and applies bottom-up pattern matching, at each IR node utilizing information about the patterns already matched for its children or siblings. For better extensibility and abstraction, most of the structural part of the recognition rules is specified in XML form to separate the tool implementation from the pattern specifications. Information about detected patterns will later be used for optimized code generation by local algorithm replacement, e.g. for the low-power high-throughput multicore DSP architecture ePUMA.
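A minimal sketch of the bottom-up recognition scheme described above: the AST is traversed in post-order and each node is labeled using rules that may refer to labels already assigned to its children, so higher-level patterns are built from lower-level ones. The two toy rules (a multiply-accumulate, and a dot product built on top of it) are illustrative assumptions; the actual tool reads its rules from XML.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        kind: str                            # e.g. "for", "assign-add", "mul"
        children: list = field(default_factory=list)
        label: str | None = None             # pattern matched at this node

    def recognize(node: Node) -> None:
        for child in node.children:          # post-order: children first
            recognize(child)
        labels = [c.label for c in node.children]
        if node.kind == "mul":
            node.label = "mul"               # level-0 pattern
        elif node.kind == "assign-add" and labels == ["mul"]:
            node.label = "mac"               # level-1: s += a[i] * b[i]
        elif node.kind == "for" and "mac" in labels:
            node.label = "dot-product"       # level-2, built from level-1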
19

Garbage Collection supporting automatic JIT parallelization in JVM

Österlund, Erik, January 2012
With increasing clock rates in CPUs coming to an end, a need for parallelization has emerged. This thesis proposes a dynamic purity analysis of objects that detects independent execution paths which may be run in parallel. The analysis relies on speculative guesses and may be rolled back when a guess is proven wrong. It piggybacks on an efficient replicating garbage collector integrated into the JVM. The efficiency of the algorithms is shown in benchmarks and is comparable to the speed of state-of-the-art garbage collectors in HotSpot's JVM. With this dynamic purity analysis accessible to Java programs, automatic JIT parallelization of pure methods becomes possible.
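A conceptual sketch of the speculation described above, assuming a validation hook that stands in for the write-barrier and garbage-collector machinery the thesis builds into the JVM: a method guessed to be pure runs in parallel with the caller's independent work, and if the guess turns out wrong the speculative result is discarded and the method re-executed sequentially.

    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor()

    def call_speculatively(method, args, independent_work, still_pure):
        future = pool.submit(method, *args)  # optimistic parallel execution
        independent_work()                   # caller keeps going meanwhile
        result = future.result()
        if still_pure():                     # purity guess confirmed?
            return result                    # commit the speculative result
        return method(*args)                 # roll back: redo sequentially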
20

Evaluating the Scalability of SDF Single-chip Multiprocessor Architecture Using Automatically Parallelizing Code

Zhang, Yuhua, 12 1900
Advances in integrated circuit technology continue to provide more and more transistors on a chip. Computer architects are faced with the challenge of finding the best way to translate these resources into high performance. The challenge in the design of the next generation of CPUs (central processing units) lies not in trying to use up the silicon area, but in finding smart ways to make use of the wealth of transistors now available. In addition, a next-generation architecture should offer high throughput, scalability, modularity, and low energy consumption, instead of being suitable for only one class of applications or users, or only emphasizing a faster clock rate. A program exhibits different types of parallelism: instruction-level parallelism (ILP), thread-level parallelism (TLP), or data-level parallelism (DLP). Likewise, architectures can be designed to exploit one or more of these types of parallelism, but it is generally not possible to design architectures that take advantage of all three without very complex hardware structures and complex compiler optimizations. We present the state-of-the-art SDF (scheduled dataflow) architecture, which exploits as much TLP as the application supplies. We implement an SDF single-chip multiprocessor constructed from simpler processors and execute automatically parallelized applications on it. SDF has many desirable features, such as high throughput, scalability, and low power consumption, which meet the requirements of next-generation CPU design. Compared with superscalar, VLIW (very long instruction word), and SMT (simultaneous multithreading) architectures, the experimental results show that for applications with very little parallelism SDF is comparable to the other architectures, while for applications with large amounts of parallelism SDF outperforms them.
