  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Improving performance of sequential code through automatic parallelization

Sundlöf, Claudius January 2018 (has links)
Automatic parallelization is the conversion of sequential code into multi-threaded code with little or no supervision. An ideal implementation of automatic parallelization would allow programmers to fully utilize available hardware resources to deliver optimal performance when writing code. Automatic parallelization has been studied for a long time; one result is that modern compilers support vectorization without any input. In this study, contemporary parallelizing compilers are examined to determine whether they can easily be used in modern software development, and how the code they generate compares to manually parallelized code. Five compilers are included in the study: ICC, Cetus, autoPar, PLUTO, and the TC Optimizing Compiler. Benchmarks are used to measure the speedup of parallelized code; these benchmarks are executed on three different sets of hardware. The NAS Parallel Benchmarks (NPB) suite is used for ICC, Cetus, and autoPar, and PolyBench for those compilers as well as PLUTO and the TC Optimizing Compiler. Results show that parallelizing compilers outperform serial code in most cases, although certain coding styles hinder their ability to parallelize code. In the NPB suite, manually parallelized code is outperformed by Cetus and ICC for one benchmark. In the PolyBench suite, PLUTO outperforms the other compilers by a wide margin, producing code optimized not only for parallel execution but also for vectorization. Limitations in the code generated by Cetus and autoPar prevent them from being used in legacy projects, while PLUTO and TC do not offer fully automated parallelization. ICC was found to offer the most complete automatic parallelization solution, although its speedups were not as great as those offered by the other tools.
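As a concrete illustration of the coding-style issue mentioned above, here is a minimal C sketch; it is not taken from the thesis, the flag name is the classic ICC auto-parallelization flag (e.g. icc -parallel), and actual compiler behavior varies:

    #include <stddef.h>

    /* Easily auto-parallelized: `restrict` rules out aliasing between
     * a, b, and c, so the compiler can prove the iterations independent. */
    void saxpy_restrict(size_t n, float s,
                        const float *restrict a,
                        const float *restrict b,
                        float *restrict c) {
        for (size_t i = 0; i < n; i++)
            c[i] = s * a[i] + b[i];
    }

    /* A coding style that hinders parallelization: without `restrict`,
     * the compiler must assume c may overlap a or b, creating a possible
     * loop-carried dependence; it may keep the loop serial or guard it
     * with runtime alias checks. */
    void saxpy_aliased(size_t n, float s,
                       const float *a, const float *b, float *c) {
        for (size_t i = 0; i < n; i++)
            c[i] = s * a[i] + b[i];
    }

The two loop bodies are identical; only the aliasing information differs, which is one way a coding style can defeat an otherwise capable parallelizing compiler.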
22

A Distributed Memory Implementation of LOCI

George, Thomas 14 December 2001 (has links)
Distributed memory systems have gained immense popularity due to their favorable price/performance ratios. This study seeks to reduce the complexities involved in developing parallel applications for distributed memory systems. The Loci system is a coordination framework developed to eliminate most of the accidental complexities involved in numerical simulation software development. A distributed memory version of Loci was developed, then tested and validated using a finite-rate chemically reacting flow solver developed in the sequential Loci framework. The application developed in the original sequential version of Loci was parallelized with minimal changes in its source code. A comparison with the results from the original sequential version verifies the correctness of the implementation. The performance measurements indicate that an efficient implementation has been achieved.
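Loci itself is a rule-based coordination framework, so the following is only a rough, hypothetical illustration of the distributed-memory execution model such a system targets: a minimal MPI program (not Loci's actual API) that block-partitions a simulation loop across ranks and combines the partial results.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Block-partition the iteration space across ranks, as a
         * distributed runtime would partition a simulation domain. */
        int lo = (int)((long)N * rank / size);
        int hi = (int)((long)N * (rank + 1) / size);

        double local = 0.0;
        for (int i = lo; i < hi; i++)
            local += 1.0 / (1.0 + (double)i);  /* stand-in for per-cell work */

        /* Combine partial results from all ranks. */
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("result = %f\n", global);
        MPI_Finalize();
        return 0;
    }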
23

Analysis of automatic program parallelization based on bytecode

Brabec, Michal January 2013 (has links)
There are many algorithms for automatic parallelization, and this work explores their possible application to programs based on their bytecode or similar intermediate code. All these algorithms require the identification of independent code segments, because if two parts of code do not interfere with one another, they can be run in parallel without any danger of data corruption. Dependence testing is an extremely complicated problem, and in the general case it is not algorithmically solvable. However, independences can be discovered in special cases, and these can then serve as a basis for automatic parallelization, such as the use of vector instructions. The first step is function inlining, which allows the compiler to analyze the code more precisely, without unnecessary dependences caused by unknown functions. Next, it is necessary to identify all control flow constructs, such as loops; after that the compiler can attempt to locate dependences between statements or instructions. Parallelization can be achieved only if the analysis discovers independent parts in the code. This work is accompanied by an implementation of function inlining and code analysis for the .NET framework.
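A minimal sketch of the independence property the analysis looks for; the two loops are generic examples, not code from the thesis:

    /* Independent iterations: a[i] is computed from inputs no other
     * iteration touches, so the loop can be vectorized or parallelized. */
    void independent(int n, int *a, const int *b) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] * 2;
    }

    /* Loop-carried dependence: iteration i reads a[i-1], written by the
     * previous iteration, so the iterations cannot safely run in parallel. */
    void dependent(int n, int *a) {
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + 1;
    }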
24

Automatic Parallelization using Pipelining for Equation-Based Simulation Languages

Lundvall, Håkan January 2008 (has links)
During recent decades, modern equation-based object-oriented modeling and simulation languages, such as Modelica, have become available. This has made it easier to build complex and more detailed models for use in simulation. To simulate such large and complex systems, it is sometimes not enough to rely on the ability of a compiler to optimize the simulation code and reduce the size of the underlying set of equations to speed up the simulation on a single processor. Instead, we must look for ways to utilize the increasing number of processing units available in modern computers. However, to gain any increased performance from a parallel computer, the simulation program must be expressed in a way that exposes the potential parallelism to the computer. Doing this manually is not a simple task, and most modelers are not experts in parallel computing. It is therefore very appealing to let the compiler parallelize the simulation code automatically. This thesis investigates techniques for automatically translating models in typical equation-based languages, such as Modelica, into parallel simulation code that enables high utilization of the available processors in a parallel computer. The two main ideas investigated are: first, to apply parallelization simultaneously to both the system equations and the numerical solver; and second, to use software pipelining to further reduce the time processors spend waiting for the results of other processors. Prototype implementations of the investigated techniques have been developed as part of the OpenModelica open-source compiler for Modelica. The prototype has been used to evaluate the parallelization techniques by measuring the execution time of test models on a few parallel architectures and comparing the results to sequential code as well as to the results achieved in earlier work. A measured speedup of 6.1 on eight processors on a shared-memory machine has been reached. It remains to evaluate the methods for a wider range of test models and parallel architectures.
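As a hedged sketch of the first idea (parallelizing the equation evaluations within each solver step), here is a minimal OpenMP example; the model, block structure, and solver are invented for illustration, and the thesis's pipelined scheme is considerably more elaborate:

    #include <omp.h>

    #define NEQ 4096

    /* One explicit-Euler step. Assuming the equation blocks are decoupled,
     * the derivative evaluations are mutually independent and can run in
     * parallel; the per-component state update is also independent. */
    void euler_step(double *x, double *dx, double h) {
        #pragma omp parallel for
        for (int i = 0; i < NEQ; i++)
            dx[i] = -0.5 * x[i];        /* stand-in for block i's equations */

        #pragma omp parallel for
        for (int i = 0; i < NEQ; i++)
            x[i] += h * dx[i];          /* state update */
    }

Software pipelining would go further, letting a processor begin work on the next step's equation blocks while waiting for other processors' results from the current step.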
25

Analysis of the execution paths of programs for automatic parallelization of binary code on the Intel x86 platform

Eberle, André Mantini 06 October 2015 (has links)
Traditionally, computer programs have been developed using the sequential programming paradigm. With the advent of parallel computing systems, such as multi-core processors and distributed environments, the sequential paradigm has become a barrier to utilizing the available resources, since a program is restricted to a single processing unit. To address this issue, this master's work introduces a transparent, automatic parallelization methodology using a binary rewriter. The steps involved in the approach are: disassembly of an Intel x86 application and its translation into an intermediate language; analysis of this intermediate code to obtain flow and dependency graphs; partitioning of the application into parallel units using the obtained graphs; and subsequent reassembly of the application, writing it back to the original Intel x86 architecture. By transforming the compiled application, the aim is to obtain a program that can exploit the parallel resources, with no extra effort required from either users or developers.
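The partitioning step relies on knowing when two instructions are independent. Here is a minimal sketch of such a test over read/write register sets; it is a hypothetical simplification, since real binary code also requires memory dependence analysis:

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified intermediate-language instruction: bitmasks of the
     * registers it reads and writes (bit k = register k). */
    typedef struct {
        uint32_t reads;
        uint32_t writes;
    } Instr;

    /* Two instructions may be placed in different parallel units only if
     * no RAW, WAR, or WAW hazard links them; this is the independence
     * test behind the dependency graph used for partitioning. */
    bool independent(Instr a, Instr b) {
        bool raw = (a.writes & b.reads)  != 0;  /* b reads what a wrote   */
        bool war = (a.reads  & b.writes) != 0;  /* b overwrites a's input */
        bool waw = (a.writes & b.writes) != 0;  /* both write same reg    */
        return !(raw || war || waw);
    }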
27

Automatic Parallelization of Simulation Code from Equation Based Simulation Languages

Aronsson, Peter January 2002 (has links)
Modern state-of-the-art equation-based object-oriented modeling languages such as Modelica have enabled easy modeling of large and complex physical systems. When such complex models are to be simulated, simulation tools typically perform a number of optimizations on the underlying set of equations in the modeled system, with the goal of gaining better simulation performance by decreasing the equation system size and complexity. The tools then typically generate efficient code to obtain fast execution of the simulations. However, with the increasing complexity of modeled systems, the number of equations and variables is increasing. Therefore, to simulate these large and complex systems efficiently, parallel computing can be exploited.
This thesis presents the work of building an automatic parallelization tool that produces an efficient parallel version of the simulation code by building a data dependency graph (task graph) from the simulation code and applying efficient scheduling and clustering algorithms to the task graph. Various scheduling and clustering algorithms, adapted to the requirements of this type of simulation code, have been implemented and evaluated. The scheduling and clustering algorithms presented and evaluated can also be used for functional dataflow languages in general, since the algorithms work on a task graph with dataflow edges between nodes.
Results are given in the form of speedup measurements and task graph statistics produced by the tool. The conclusion drawn is that some of the algorithms investigated and adapted in this work give reasonable measured speedup results for some specific Modelica models; e.g., a model of a thermofluid pipe gave a speedup of about 2.5 on 8 processors in a PC cluster. However, future work lies in finding a good algorithm that works well in general. / Report code: LiU-Tek-Lic-2002:06.
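As a rough sketch of the kind of list scheduling the thesis builds on, here is a toy C example over a hard-coded task graph; communication costs and the clustering algorithms central to the thesis are omitted, and a real implementation would order ready tasks by a priority such as critical-path length:

    #include <stdio.h>

    #define NT 6   /* tasks */
    #define NP 2   /* processors */

    /* Toy task graph: cost[i] is task i's execution time; pred[i][j] != 0
     * means task j must finish before task i starts. */
    static const int cost[NT] = {2, 3, 2, 4, 1, 2};
    static const int pred[NT][NT] = {
        {0}, {1,0}, {1,0}, {0,1,1,0}, {0,0,1,0,0}, {0,0,0,1,1,0}
    };

    int main(void) {
        int finish[NT] = {0}, done[NT] = {0}, proc_free[NP] = {0};

        /* List scheduling: repeatedly pick a ready task and place it on
         * the processor where it can start earliest. */
        for (int scheduled = 0; scheduled < NT; scheduled++) {
            for (int t = 0; t < NT; t++) {
                if (done[t]) continue;
                int ready = 1, avail = 0;
                for (int p = 0; p < NT; p++) {
                    if (pred[t][p]) {
                        if (!done[p]) { ready = 0; break; }
                        if (finish[p] > avail) avail = finish[p];
                    }
                }
                if (!ready) continue;
                int best = 0;
                for (int q = 1; q < NP; q++)
                    if (proc_free[q] < proc_free[best]) best = q;
                int start = avail > proc_free[best] ? avail : proc_free[best];
                finish[t] = start + cost[t];
                proc_free[best] = finish[t];
                done[t] = 1;
                printf("task %d on proc %d: start %d, finish %d\n",
                       t, best, start, finish[t]);
                break;
            }
        }
        return 0;
    }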
28

Efficient search-based strategies for polyhedral compilation : algorithms and experience in a production compiler

Trifunovic, Konrad 04 July 2011 (has links) (PDF)
To exploit the performance of current multicore and heterogeneous architectures, compilers are required to perform more and more complex program transformations. The search space of possible program optimizations is huge and unstructured. Selecting the best transformation and predicting its potential performance benefit is a major problem in today's optimizing compilers. A promising approach to handling program optimization is to focus on automatic loop optimizations expressed in the polyhedral model. Current approaches for optimizing programs in the polyhedral model broadly fall into two classes. The first class of methods is based on linear optimization of an analytical cost function. The second class is based on exhaustive iterative search. While the first approach is fast, it can easily miss the optimal solution. The iterative approach is more precise, but its running time might be prohibitively expensive. In this thesis we present a novel search-based approach to program transformations in the polyhedral model. The new method combines the benefits of the current approaches, their effectiveness and precision, while trying to minimize their drawbacks. Our approach is based on enumerating evaluations of a precise, nonlinear performance-predicting cost function. The current practice is to use the polyhedral model in the context of source-to-source compilers. We have implemented our techniques in a GCC framework based on the low-level three-address-code representation. We show that the chosen level of abstraction for the intermediate representation poses scalability challenges, and we show ways to overcome those problems. On the other hand, it is shown that the low-level IR abstraction opens new degrees of freedom that are beneficial for search-based transformation strategies and for polyhedral compilation in general.
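For background on the search space being enumerated, a standard polyhedral-model formulation (textbook material, not the thesis's specific cost function): a statement S with iteration vector \vec{i} and parameter vector \vec{n} is assigned a one-dimensional affine schedule, and a schedule is legal when no dependence is reversed:

    \theta_S(\vec{i}) = \vec{c}_S \cdot \vec{i} + \vec{d}_S \cdot \vec{n} + e_S
    \forall (\vec{s}, \vec{t}) \in \mathcal{D}_{S \to T} : \quad \theta_T(\vec{t}) - \theta_S(\vec{s}) \ge 0

A search-based method enumerates integer coefficient vectors satisfying the legality constraint and ranks the candidates with a cost function, here a precise nonlinear one rather than a linear objective.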
29

Adapting the polytope model for dynamic and speculative parallelization

Jimborean, Alexandra 14 September 2012 (has links)
In this thesis, we present the design and implementation of a software Thread-Level Speculation (TLS) platform called VMAD, for "Virtual Machine for Advanced Dynamic analysis and transformation", whose main feature is to speculatively parallelize a sequential loop nest in various ways, by reordering its iterations, to maximize performance. The transformation to apply is selected at runtime with the goals of minimizing the number of rollbacks and maximizing performance. We perform code transformations by applying the polyhedral model, which we adapted for speculative and runtime code parallelization. For this purpose, we build in advance a parallel code pattern which is patched by our runtime system according to profiling information collected on some execution samples. Adaptability is ensured by considering chunks of code of different sizes, executed successively, each parallelized differently or run sequentially, depending on the observed behavior of the memory accesses. We show on several benchmarks that our framework yields good performance on codes which could not be handled efficiently by previously proposed TLS systems.
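A heavily simplified sketch of the speculation pattern described above (invented for illustration; VMAD's patched skeletons, instrumentation, and rollback machinery are far more elaborate). The runtime predicts an affine write pattern from profiling; the chunk runs in parallel under that assumption, and the caller is assumed to keep a backup of the touched memory so it can restore it and re-execute the chunk sequentially when validation fails:

    #include <stdbool.h>

    /* Speculatively execute one chunk of iterations [lo, hi). Profiling
     * predicted that iteration i writes a[base + stride*i]; idx[i] holds
     * the target the loop body actually computes. Each iteration checks
     * its actual target against the prediction; any mismatch flags
     * misspeculation, and the caller rolls back and re-runs serially. */
    bool run_chunk_speculatively(double *a, const long *idx,
                                 long lo, long hi, long base, long stride) {
        bool ok = true;
        #pragma omp parallel for reduction(&&:ok)
        for (long i = lo; i < hi; i++) {
            long predicted = base + stride * i;
            if (idx[i] != predicted) {  /* prediction violated */
                ok = false;
                continue;
            }
            a[predicted] += 1.0;        /* stand-in for the loop body */
        }
        return ok;  /* false: caller restores backup, re-executes serially */
    }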
