71 |
Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)Munafo, Robert P. 24 May 2024 (has links)
While Field-Programmable Gate Arrays (FPGAs) exist in many design configurations throughout the data center, cloud, and edge, the promise of performance and flexibility offered by the FPGA often remains unrealized for lack of hardware design expertise, with most computation remaining in fixed hardware such as CPUs, GPUs, and ASICs e.g. tensor processors. Identifying programmability as a barrier to FPGA usage, we seek to augment High Level Synthesis (HLS) design flows with machine learning. The overall goal of this dissertation is to advance the art of using unmodified high-level language (HLL) programs to create FPGA configurations that are performant, programmable, and portable.
The problems in using HLL code to program FPGAs arise from the serial execution model of the target application codes, in particular, the mismatch between that model and the arbitrary data flow model of the target hardware. However, a variety of code transformation techniques, tedious to perform by a human but readily and effortlessly done by CPU compilers, allow many compute-intensive operations to be transformed into a form that is highly or massively parallel. A challenge then exists in selecting the best set of optimizations and an order in which to perform them, a choice among staggeringly many options. Brute-force and automated orthogonal search techniques have failed to produce solutions that lie within the realm of practicality.
We evaluate the suitability of machine learning (ML) models to address this challenge. We develop and assess designs for systems that use ML and present feedback to these models based on their recommendations. In support of using ML models, we begin by developing and assessing methods for preparing the original program as input for processing by the model. We next develop and assess methods to apply the model's output to a compilation system and evaluate the results for use as feedback. Machine learning experiments are performed to demonstrate their potential. Feasibility is studied through simulations and specialized evaluation techniques as appropriate. Specific new artifacts are created, including extensions to the open-source GCC compiler.
The significance of this work is of several types: the many options and variants to the overall design that are considered, with their distinct advantages and disadvantages; the application of proper units and baselines to experimental measurements; and the numerous contributions in the form of published extensions and improvements to open-source projects.
|
72 |
Applied logic : its use and implementation as a programming toolWarren, David H. D. January 1978 (has links)
The first Part of the thesis explains from first principles the concept of "logic programming" and its practical application in the programming language Prolog. Prolog is a simple but powerful language which encourages rapid, error-free programming and clear, readable, concise programs. The basic computational mechanism is a pattern matching process ("unification") operating on general record structures ("terms" of logic). IThe ideas are illustrated by describing in detail one sizable Prolog program which implements a simple compiler. The advantages and practicability of using Prolog for "real" compiler implementation are discussed. The second Part of the thesis describes techniques for implementing Prolog efficiently. In particular it is shown how to compile the patterns involved in the matching process into instructions of a low-level language. This idea has actually been implemented in a compiler (written in Prolog) from Prolog to DECsystem-10 assembly language. However the principles involved are explained more abstractly in terms of a "Prolog Machine". The code generated is comparable in speed with that produced by existing DEC10 Lisp compilers. Comparison is possible since pure Lisp can be viewed as a (rather restricted) subset of Prolog. It is argued that structured data objects, such as lists and trees, can be manipulated by pattern matching using a "structure 'sharing" representation as efficiently as by conventional selector and constructor functions operating on linked records in "heap" storage. Moreover the pattern matching formulation actually helps the implementor to produce a better implementation.
|
73 |
Modelica-Compiler Methoden für zeitkritische SimulationenWaurich, Volker 01 December 2021 (has links)
Die Modellierung und Simulation dynamischer Vorgänge ist ein etabliertes Werkzeug bei der Entwicklung und Untersuchung technisch-physikalischer Systeme. Gleichungsbasierte und objektorientierte Modellbeschreibungssprachen wie Modelica ermöglichen eine effiziente Modellierung bei gleichzeitig guter Wiederverwendbarkeit und Übersichtlichkeit. Der Modelica-Compiler übersetzt das Modell in eine mathematisch lösbare Form und erzeugt ein ausführbares Simulationsprogramm. Mithilfe symbolischer Verfahren kann der Berechnungsaufwand und die Zeit zur Auswertung der Modellgleichungen positiv beeinflusst werden. Die vorliegende Arbeit identifiziert Potenziale zur automatisierten Laufzeitverbesserung von Modelica-Modellen und untersucht geeignete Verfahren, welche in den open source Compiler OpenModelica implementiert werden. Die vorgeschlagenen Methoden basieren auf einer graphentheoretischen Repräsentation des Gleichungssystems. Graphenalgorithmen werden zur symbolischen Manipulation des Gleichungssystem benutzt, um die Ausführungszeit zu reduzieren und ein nicht-deterministisches Laufzeitverhalten zu vermeiden. Es werden Methoden aufgezeigt, welche automatisiert und generalisiert ohne manuelles Eingreifen in den Übersetzungsprozess zu einer beschleunigten Modellauswertung führen. Die vorliegende Arbeit leistet somit einen Beitrag für die Echtzeitfähigkeit von Modellen mit steigender Komplexität, effizienter Modellerstellung und leistungsfähigen, automatisierten Simulationsläufen, wie bspw. Parameterstudien oder virtuellen Testläufen. / Modelling and simulation of dynamic processes is a well-established tool for the development of technical and physical systems. Equation-based and object-oriented modelling languages such as Modelica enable an efficient way of modelling as well as good reusability and structuring of the model. The Modelica-Compiler translates the model to a solvable representation and compiles an executable simulation program. With the help of symbolic compiler methods, the calculation costs and executation time can be reduced. The present paper identifies potentials for the automated improvement of execution times of Modelica-based simulation models and investigates suitable methods which have been implemented in the OpenModelica Compiler. The proposed compiler methods are based on a graph representation of the underlying equation system. Graph algorithms are used for the symbolic manipulation of the system of equations to reduce execuation time and non-deterministic runtime behaviour. All presented methods can be applied automatically without addtional manual modifications during the compilation process. Hence, the present paper is a contribution to achieve faster simulation times of models with increasing complexity. This is an eminent requirement for a productive and highly automated virtual development process.
|
74 |
Methodology of Construction Compiler Front-End and Its Integration into the GNU Compiler Collection / Methodology of Construction Compiler Front-End and Its Integration into the GNU Compiler CollectionMachata, Petr Unknown Date (has links)
Tato diplomová práce vznikla za podpory ANF DATA s.r.o., Brno. Diplomová práce je vypracována v angličtině. Vstupní bariéra pro vývoj uvnitř GCC se během posledních let znatelně snížila. Na konferencích, v časopisech a na webu se objevují články s architektonickými přehledy a návody. Věci se nadále zjednodušují použitím oficiálního vnitřního jazyka GENERIC: Pro komunikaci mezi přední částí a zbytkem překladače již není nutné zabývat se obtížným a nepřehledným RTL. Přesto je práce se souborem zdrojových kódů velikosti GCC nutně složitá. Je třeba napsat určité soubory a provést určitá nastavení, oboje jen s poměrně malým množstvím dokumentace. Cílem této práce je pomoci s posledním zmíněným bodem. Práce popisuje ukázkovou přední část: Vše od vytvoření zdrojových souborů, přes různé konstrukce jazyka GENERIC, až k problémům s kompilací běhové podpory nebo používání nativního preprocesoru.
|
75 |
Automatic Discovery and Exposition of Parallelism in Serial Applications for Compiler-Inserted Runtime AdaptationGreenland, David A. 25 May 2012 (has links) (PDF)
Compiler-Inserted Runtime Adaptation (CIRA) is a compilation and runtime adaptation strategy which has great potential for increasing performance in multicore systems. In this strategy, the compiler inserts directives into the application which will adapt the application at runtime. Its ability to overcome the obstacles of architectural and environmental diversity coupled with its flexibility to work with many programming languages and styles of applications make it a very powerful tool. However, it is not complete. In fact, there are many pieces still needed to accomplish these lofty goals. This work describes the automatic discovery of parallelism inherent in an application and the generation of an intermediate representation to expose that parallelism. This work shows on six benchmark applications that a significant amount of parallelism which was not initially apparent can be automatically discovered. This work also shows that the parallelism can then be exposed in a representation which is also automatically generated. This is accomplished by a series of analysis and transformation passes with only minimal programmer-inserted directives. This series of passes forms a necessary part of the CIRA toolchain called the concurrency compiler. This concurrency compiler proves that a representation with exposed parallelism and locality can be generated by a compiler. It also lays the groundwork for future, more powerful concurrency compilers. This work also describes the extension of the intermediate representation to support hierarchy, a prerequisite characteristic to the creation of the concurrency compiler. This extension makes it capable of representing many more applications in a much more effective way. This extension to support hierarchy allows much more of the parallelism discovered by the concurrency compiler to be stored in the representation.
|
76 |
Compiler Testing of C11 Atomics for Arm and RISC-VAdolfsson, Hampus January 2022 (has links)
The C11 standard introduced atomic types and operations, with an accompanying memory model, to enable the use of shared variables in concurrent programs. In this thesis, I demonstrate how compilers can be tested, in a way that is deterministic and covers the entire set of atomic operations, to ensure they correctly implement C11 atomics and the C11 memory model. I use a large set of short concurrent programs (”litmus tests”), generated from a model written in a specification language and based on a formalized C11 memory model. Each test program is compiled and run with a model checker, to determine the possible outcomes; any program with an outcome that is possible after compilation but not allowed by C11 is a failed test case. As an alternative to model checking, I also test a nondeterministic, hardware-based method for running tests, but I find that this method is too inaccurate to be useful. I test IAR and gcc compilers for Arm and RISC-V; all of these compilers pass all tests. Out of three compilers with purposefully inserted bugs, all are correctly identified as faulty. This testing process thus shows some promise, but further evaluation is needed.
|
77 |
A development environment and static analyses for GUARDOL - a language for the specification of high assurance guardsDodds, Josiah January 1900 (has links)
Master of Science / Department of Computing and Information Sciences / John M. Hatcliff / There are a number of network situations where different networks have different security policies and still need to share information. While it is important to allow some data to flow between the two networks, it is just as important that they don't share any data that violates the respective security policies of the networks. Constraints on data sharing are often phrased in terms of classification levels of data (e.g. top secret, secret, public). They might also be stated in terms of the contents of the data (e.g. are there military base names,
is the location correct). The software and hardware that works to solve these problems is called Cross Domain Solutions (CDS). There are a variety of hardware platforms capable of implementing CDS. These platforms are all configured in different ways and they are often proprietary. Not only are there a
number of platforms on the market, many are difficult to understand, verify, or even specify.
The Guardol project provides an open, non-proprietary, and domain-specific language
for specifying CDS security policies and implementing CDS. Guardol is designed to be easy to understand and verify.
This thesis describes the design and implementation of primary Guardol components. It includes a description of the Eclipse GUI plug-ins that have been developed for the project as well as a description of new formal analyses and translations that have been developed for the language. The translation is used to plug into external tools for model checking and the analyses help to make the translation clean and efficient. The analyses are also useful
tools to help make the use of Guardol easier for developers.
|
78 |
Efficient Execution Paradigms for Parallel Heterogeneous ArchitecturesKoukos, Konstantinos January 2016 (has links)
This thesis proposes novel, efficient execution-paradigms for parallel heterogeneous architectures. The end of Dennard scaling is threatening the effectiveness of DVFS in future nodes; therefore, new execution paradigms are required to exploit the non-linear relationship between performance and energy efficiency of memory-bound application-regions. To attack this problem, we propose the decoupled access-execute (DAE) paradigm. DAE transforms regions of interest (at program-level) in two coarse-grain phases: the access-phase and the execute-phase, which we can independently DVFS. The access-phase is intended to prefetch the data in the cache, and is therefore expected to be predominantly memory-bound, while the execute-phase runs immediately after the access-phase (that has warmed-up the cache) and is therefore expected to be compute-bound. DAE, achieves good energy savings (on average 25% lower EDP) without performance degradation, as opposed to other DVFS techniques. Furthermore, DAE increases the memory level parallelism (MLP) of memory-bound regions, which results in performance improvements of memory-bound applications. To automatically transform application-regions to DAE, we propose compiler techniques to automatically generate and incorporate the access-phase(s) in the application. Our work targets affine, non-affine, and even complex, general-purpose codes. Furthermore, we explore the benefits of software multi-versioning to optimize DAE in dynamic environments, and handle codes with statically unknown access-phase overheads. In general, applications automatically-transformed to DAE by our compiler, maintain (or even exceed in some cases) the good performance and energy efficiency of manually-optimized DAE codes. Finally, to ease the programming environment of heterogeneous systems (with integrated GPUs), we propose a novel system-architecture that provides unified virtual memory with low overhead. The underlying insight behind our work is that existing data-parallel programming models are a good fit for relaxed memory consistency models (e.g., the heterogeneous race-free model). This allows us to simplify the coherency protocol between the CPU – GPU, as well as the GPU memory management unit. On average, we achieve 45% speedup and 45% lower EDP over the corresponding SC implementation.
|
79 |
Optimalizace rozsáhlých aplikací / Optimizing large applicationsLiška, Martin January 2013 (has links)
Both uppermost open source compilers, GCC and LLVM, are mature enough to link-time optimize large applications. In case of large applications, we must take into account, except standard speed efficiency and memory consumption, different aspects. We focus on size of the code, cold start-up time, etc. Developers of applications often come up with ad-hoc solutions such as Elfhack utility, start-up of an application via a pre-loading utility and dlopen; prelinking and variety of different tools that reorder functions to fit the order of execution. The goal of the thesis is to analyse all existing techniques of optimization, evaluate their efficiency and design new solutions based on the link-time optimization platform. Powered by TCPDF (www.tcpdf.org)
|
80 |
RISC-V Compiler Performance:A Comparison between GCC and LLVM/clangBjäreholt, Johan January 2017 (has links)
RISC-V is a new open-source instruction set architecture (ISA) that in De-cember 2016 manufactured its rst mass-produced processors. It focuses onboth eciency and performance and diers from other open-source architec-tures by not having a copyleft license permitting vendors to freely design,manufacture and sell RISC-V chips without any fees nor having to sharetheir modications on the reference implementations of the architecture.The goal of this thesis is to evaluate the performance of the GCC andLLVM/clang compilers support for the RISC-V target and their ability tooptimize for the architecture. The performance will be evaluated from ex-ecuting the CoreMark and Dhrystone benchmarks are both popular indus-try standard programs for evaluating performance on embedded processors.They will be run on both the GCC and LLVM/clang compilers on dierentoptimization levels and compared in performance per clock to the ARM archi-tecture which is mature yet rather similar to RISC-V. The compiler supportfor the RISC-V target is still in development and the focus of this thesis willbe the current performance dierences between the GCC and LLVM com-pilers on this architecture. The platform we will execute the benchmarks onwil be the Freedom E310 processor on the SiFive HiFive1 board for RISC-Vand a ARM Cortex-M4 processor by Freescale on the Teensy 3.6 board. TheFreedom E310 is almost identical to the reference Berkeley Rocket RISC-Vdesign and the ARM Coretex-M4 processor has a similar clock speed and isaimed at a similar target audience.The results presented that the -O2 and -O3 optimization levels on GCCfor RISC-V performed very well in comparison to our ARM reference. Onthe lower -O1 optimization level and -O0 which is no optimizations and -Oswhich is -O0 with optimizations for generating a smaller executable code sizeGCC performs much worse than ARM at 46% of the performance at -O1,8.2% at -Os and 9.3% at -O0 on the CoreMark benchmark with similar resultsin Dhrystone except on -O1 where it performed as well as ARM. When turn-ing o optimizations (-O0) GCC for RISC-V was 9.2% of the performanceon ARM in CoreMark and 11% in Dhrystone which was unexpected andneeds further investigation. LLVM/clang on the other hand crashed whentrying to compile our CoreMark benchmark and on Dhrystone the optimiza-tion options made a very minor impact on performance making it 6.0% theperformance of GCC on -O3 and 5.6% of the performance of ARM on -O3, soeven with optimizations it was still slower than GCC without optimizations.In conclusion the performance of RISC-V with the GCC compiler onthe higher optimization levels performs very well considering how young theRISC-V architecture is. It does seems like there could be room for improvement on the lower optimization levels however which in turn could also pos-sibly increase the performance of the higher optimization levels. With theLLVM/clang compiler on the other hand a lot of work needs to be done tomake it competetive in both performance and stability with the GCC com-piler and other architectures. Why the -O0 optimization is so considerablyslower on RISC-V than on ARM was also very unexpected and needs furtherinvestigation.
|
Page generated in 0.068 seconds