161 |
Scratch-pad memory management for static data aggregates. Li, Lian, Computer Science & Engineering, Faculty of Engineering, UNSW, January 2007 (has links)
Scratch-pad memory (SPM), a fast on-chip SRAM managed by software, is widely used in embedded systems. Compared to a hardware-managed cache, an SPM can be more efficient in performance, power and area cost, and has the added advantage of better time predictability. In this thesis, SPMs are understood in a general sense. For example, in stream processors, a software-managed stream register file is usually used to stage data to and from off-chip memory. In IBM's Cell architecture, each co-processor has a software-managed local store for keeping data and instructions. SPM management is critical for SPM-based embedded systems. In this thesis, we propose two novel methodologies, the memory colouring methodology and the perfect colouring methodology, for placing the static data aggregates of a program, such as arrays and structs, in SPM. Our methodologies are dynamic in the sense that some data aggregates can be swapped into and out of SPM during program execution. To this end, a live range splitting heuristic is introduced in order to create potential data transfer statements between SPM and off-chip memory. The memory colouring methodology is a general-purpose compiler approach. The novelty of this approach lies in partitioning an SPM into a pseudo register file and then generalising existing graph colouring algorithms for register allocation to colour data aggregates. In this thesis, a scheme for partitioning an SPM into a pseudo register file is introduced. This methodology is inter-procedural and therefore operates on the interference graph for the data aggregates in the whole program. Different graph colouring algorithms may give rise to different results due to the live range splitting and spilling heuristics used. As a result, two representative graph colouring algorithms, George and Appel's iterative-coalescing and Park and Moon's optimistic-coalescing, are generalised and evaluated for SPM allocation. Like memory colouring, perfect colouring is also inter-procedural. The novelty of this second methodology lies in formulating the SPM allocation problem as an interval colouring problem. The interval colouring problem is NP-hard, and no widely accepted approximation algorithms exist. The key observation is that the interference graphs for data aggregates in many embedded applications form a special class of superperfect graphs. This has led to the development of two additional SPM allocation algorithms. While differing in whether live range splits and spills are done sequentially or together, both algorithms place data aggregates in SPM based on the cliques in an interference graph. In both cases, we guarantee optimality in the sense that all data aggregates in an interference graph can be placed in SPM whenever the given SPM size is no smaller than the chromatic number of the graph. We have developed two memory colouring algorithms and two perfect colouring algorithms for SPM allocation and evaluated them on a set of embedded applications. Our results show that both methodologies are efficient and effective in handling large-scale embedded applications. While neither methodology outperforms the other consistently, perfect colouring has yielded better overall results on the set of benchmarks used in our experiments. All these algorithms are expected to be valuable; for example, they can be made available as part of the same compiler framework to assist the embedded designer in exploring a large number of optimisation opportunities for a particular embedded application.
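As an illustration of the memory colouring idea, the sketch below greedily places interfering data aggregates into an SPM that has been partitioned into fixed-size pseudo registers. The partition size, aggregate sizes and the greedy placement rule are assumptions made for the example; they do not reproduce the generalised register-allocation algorithms of the thesis.

```c
/* Sketch: greedy placement of data aggregates into a scratch-pad memory
 * partitioned into NUM_REGS pseudo registers of REG_BYTES each. Aggregates
 * that interfere (are live at the same time) must not overlap in SPM;
 * anything that cannot be placed is "spilled" to off-chip memory. */
#include <stdio.h>

#define NUM_AGGS  5
#define NUM_REGS  16          /* pseudo registers in the SPM */
#define REG_BYTES 64          /* bytes per pseudo register   */

int size_bytes[NUM_AGGS] = { 256, 512, 128, 320, 448 };    /* aggregate sizes    */
int interferes[NUM_AGGS][NUM_AGGS] = {                     /* interference graph */
    {0,1,1,0,0}, {1,0,1,1,0}, {1,1,0,0,1}, {0,1,0,0,1}, {0,0,1,1,0}
};
int start_reg[NUM_AGGS];      /* first assigned pseudo register, or -1 if spilled */

static int regs_needed(int i) { return (size_bytes[i] + REG_BYTES - 1) / REG_BYTES; }

/* Would placing aggregate i at pseudo register 'at' overlap an interfering,
 * already-placed aggregate? */
static int overlaps(int i, int at)
{
    for (int j = 0; j < i; j++) {
        if (!interferes[i][j] || start_reg[j] < 0) continue;
        int lo = start_reg[j], hi = start_reg[j] + regs_needed(j);
        if (at < hi && at + regs_needed(i) > lo) return 1;
    }
    return 0;
}

int main(void)
{
    for (int i = 0; i < NUM_AGGS; i++) {
        start_reg[i] = -1;
        for (int at = 0; at + regs_needed(i) <= NUM_REGS; at++)
            if (!overlaps(i, at)) { start_reg[i] = at; break; }

        if (start_reg[i] >= 0)
            printf("aggregate %d -> SPM registers [%d, %d)\n",
                   i, start_reg[i], start_reg[i] + regs_needed(i));
        else
            printf("aggregate %d -> off-chip memory (spilled)\n", i);
    }
    return 0;
}
```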
162 |
Compiler Directed Codesign for FPGA-based Embedded Systems. Hauff, Martin Anthony, marty@extendabilities.com.au, January 2008 (has links)
As embedded systems designers increasingly turn to programmable logic technologies in place of off-the-shelf microprocessors, there is a growing interest in the development of optimised custom processing cores that can be designed on a per-application basis. FPGAs blur the traditional distinction between hardware and software and offer the promise of application-specific hardware acceleration. But realizing this in a general sense requires a significant departure from traditional embedded systems development flows. Whereas off-the-shelf processors have a fixed architecture, the same cannot be said of purpose-built FPGA-based processors. With this freedom comes the challenge of empirically determining the optimal boundary point between hardware and software. The fluidity of the hardware/software partition also poses an interesting challenge for compiler developers. This thesis presents a tool and methodology that address these codesign challenges in a new way. Described as 'compiler-directed codesign', the approach makes use of a suitably modified compiler to help direct the development of a custom processor core on a per-application basis. By exposing the compiler's internal representation of a compiled target program, visibility is gained into the instructions and hardware resources that are most sought after by the compiler. This information is then used to inform further processor development and to determine the optimal partition between hardware and software. At each design iteration, the machine model is updated to reflect the available hardware resources, the compiler is rebuilt, and the target application is compiled once again. By including the compiler 'in-the-loop' of custom processor design, developers can accurately quantify the impact on performance caused by the addition or removal of specific hardware resources and iteratively converge on an optimal solution. Compiler Directed Codesign has advantages over existing codesign methodologies because it offers both a concrete point from which to begin the partitioning process and quantifiable, rapid feedback on the merits of different partitioning choices. When applied to an Adaptive PCM Encoder/Decoder case study, the Compiler Directed Codesign technique yielded a custom processor core that was between 36% and 73% smaller, consumed between 11% and 19% less memory, and performed up to 10X faster than comparable general-purpose FPGA-based processor cores. The conclusion of this work is that a suitably modified compiler can serve a valuable role in directing hardware/software partitioning on a per-application basis.
163 |
Debugging Equation-Based Languages in OpenModelica Environment. Sjöholm, Klas, January 2009 (has links)
The need for debugging tools for declarative programming languages has increased due to the rapid development of modeling and simulation tools. Declarative equation-based programming languages have the problem of equation systems being over- or under-constrained: the system of equations has more equations than variables, or more variables than equations, respectively, making it unsolvable. In this study a static debugger is implemented in the OpenModelica compiler for the equation-based programming language Modelica, to make it easier for the programmer or modeler to locate the equation(s) causing the over- or under-constrained system of equations. The debugging techniques used by the debugger were developed by Peter Bunus. Those techniques are able to detect such systems of equations and give solutions by identifying the minimal set of equation(s) that should be removed, or the variable(s) that should be added to an equation, to make the system solvable. In this study the debugging techniques for detecting and giving a solution for over-constrained systems of equations are shown to be suitable for the programming language Modelica in the OpenModelica compiler.
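To give a flavour of this kind of structural analysis, the sketch below matches each equation to one of the variables it mentions and flags any equation that cannot be matched as belonging to an over-constrained part of the system. It is a generic, simplified illustration under assumed data, not the algorithm or data structures of the OpenModelica debugger.

```c
/* Sketch: structural detection of an over-constrained equation system via
 * maximum bipartite matching between equations and variables. An equation
 * left unmatched points to an over-constrained subsystem. */
#include <stdio.h>
#include <string.h>

#define NEQ  4
#define NVAR 3

/* incidence[e][v] = 1 if equation e mentions variable v */
int incidence[NEQ][NVAR] = {
    {1, 1, 0},   /* e0: f(x, y)    = 0 */
    {0, 1, 1},   /* e1: g(y, z)    = 0 */
    {1, 0, 1},   /* e2: h(x, z)    = 0 */
    {1, 1, 1},   /* e3: k(x, y, z) = 0  -- one equation too many */
};

int matched_eq[NVAR];   /* equation each variable is matched to, or -1 */

/* Try to match equation e to some variable, re-matching other equations
 * along an augmenting path if necessary. */
static int try_match(int e, int visited[NVAR])
{
    for (int v = 0; v < NVAR; v++) {
        if (!incidence[e][v] || visited[v]) continue;
        visited[v] = 1;
        if (matched_eq[v] < 0 || try_match(matched_eq[v], visited)) {
            matched_eq[v] = e;
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    memset(matched_eq, -1, sizeof matched_eq);
    for (int e = 0; e < NEQ; e++) {
        int visited[NVAR] = {0};
        if (!try_match(e, visited))
            printf("equation e%d is part of an over-constrained subsystem\n", e);
    }
    printf("%d equations, %d variables\n", NEQ, NVAR);
    return 0;
}
```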
164 |
En optimierande kompilator för SMV till CLP(B) / An optimising SMV to CLP(B) compiler. Asplund, Mikael, January 2005 (has links)
This thesis describes an optimising compiler for translating from SMV to CLP(B). The optimisation is aimed at reducing the number of required variables in order to decrease the size of the resulting BDDs. In addition, a partitioning of the transition relation is performed. The compiler uses an internal representation of an FSM that is built up from the SMV description. A number of rewrite steps are performed on the problem description, such as encoding it into a Boolean domain and applying the optimisations.

The variable reduction heuristic is based on finding sub-circuits that are suitable for reduction, and a state-space search is performed on those groups. An evaluation of the results shows that in some cases the compiler is able to greatly reduce the size of the resulting BDDs.
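As a small illustration of what encoding into a Boolean domain means for variable counts, the sketch below computes how many Boolean (CLP(B)) variables a finite-domain SMV variable needs under a plain log encoding. The encoding choice and the example domains are assumptions; the compiler's sub-circuit reduction heuristic is not shown.

```c
/* Sketch: a variable with a domain of n values needs only ceil(log2 n)
 * Boolean variables under a log encoding, which is what keeps the number of
 * BDD variables down. */
#include <stdio.h>

static int bits_needed(unsigned domain_size)
{
    int bits = 0;
    while ((1u << bits) < domain_size)
        bits++;
    return bits ? bits : 1;   /* even a one-value domain gets one Boolean */
}

int main(void)
{
    unsigned domains[] = { 2, 3, 10, 256 };   /* example variable domains */
    for (unsigned i = 0; i < sizeof domains / sizeof domains[0]; i++)
        printf("domain of %u values -> %d Boolean variables\n",
               domains[i], bits_needed(domains[i]));
    return 0;
}
```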
165 |
Compiling the parallel programming language NestStep to the CELL processor. Holm, Magnus, January 2010 (has links)
The goal of this project is to create a source-to-source compiler which translates NestStep code to C code. The compiler's job is to replace NestStep constructs with a series of function calls to the NestStep runtime system. NestStep is a parallel programming language extension based on the BSP model. It adds constructs for parallel programming on top of an imperative programming language. For this project, only constructs extending the C language are relevant. The output code compiles to form an executable program that runs on the multicore processor Cell Broadband Engine (Cell BE). The NestStep runtime system has been ported to the Cell BE and was available from the start of this project.
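The sketch below shows the general shape of such a translation: a BSP-style superstep over a shared variable becomes plain C plus calls into the runtime. The runtime function names used here (NestStep_step_begin, NestStep_combine_int, NestStep_step_end, NestStep_my_rank) are placeholders invented for the example, not the actual NestStep runtime API on the Cell BE.

```c
/* Sketch of the translation the compiler performs: a NestStep-like superstep
 * becomes ordinary C with generated calls into the runtime system. */
#include <stdio.h>

/* --- stand-ins for the runtime (assumption: real names and signatures differ) --- */
static void NestStep_step_begin(void)         { /* enter superstep */ }
static void NestStep_combine_int(int *shared) { (void)shared; /* combine + broadcast */ }
static void NestStep_step_end(void)           { /* barrier + commit */ }
static int  NestStep_my_rank(void)            { return 0; }

/* Original (NestStep-like) source:
 *
 *     sh int sum = 0;
 *     step {
 *         sum = sum + local_part;    // combined with '+' at the end of the step
 *     }
 */
int main(void)
{
    int sum = 0;                       /* local copy of the shared variable */
    int local_part = NestStep_my_rank() + 1;

    NestStep_step_begin();             /* generated: open the superstep     */
    sum = sum + local_part;            /* user code, unchanged              */
    NestStep_combine_int(&sum);        /* generated: request the combine    */
    NestStep_step_end();               /* generated: barrier, commit shared */

    printf("sum after superstep: %d\n", sum);
    return 0;
}
```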
166 |
Transactions Everywhere. Kuszmaul, Bradley C., Leiserson, Charles E., 01 1900 (has links)
Arguably, one of the biggest deterrents for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is enforcing atomicity so that the partial results of one task do not inadvertently corrupt another task. Atomicity is typically enforced through locking protocols, but these protocols can introduce other complications, such as deadlock, unless restrictive methodologies in their use are adopted. We have recently begun a research project focusing on transactional memory [18] as an alternative mechanism for enforcing atomicity, since it allows the user to avoid many of the complications inherent in locking protocols. Rather than viewing transactions as infrequent occurrences in a program, as has generally been done in the past, we have adopted the point of view that all user code should execute in the context of some transaction. To make this viewpoint viable requires the development of two key technologies: effective hardware support for scalable transactional memory, and linguistic and compiler support. This paper describes our preliminary research results on making “transactions everywhere” a practical reality. / Singapore-MIT Alliance (SMA)
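The sketch below contrasts the two styles of enforcing atomicity. The txn_begin()/txn_commit() interface is a hypothetical placeholder for hardware- or compiler-supported transactional memory, invented for this example rather than taken from the paper.

```c
/* Sketch: a shared update written as a transaction that retries on conflict,
 * compared (in the comment) with the lock-based equivalent. */
#include <stdbool.h>
#include <stdio.h>

static long balance = 100;

/* --- placeholder transactional-memory interface (assumption) --- */
static void txn_begin(void)  { /* start speculative execution             */ }
static bool txn_commit(void) { return true; /* false would signal conflict */ }

static void deposit(long amount)
{
    do {
        txn_begin();
        balance += amount;     /* atomic with respect to other transactions */
    } while (!txn_commit());   /* on conflict, the whole block re-executes  */
}

/* Lock-based equivalent for contrast: correct, but the programmer must pick
 * the right lock(s) and a consistent locking order to avoid deadlock.
 *
 *     pthread_mutex_lock(&balance_lock);
 *     balance += amount;
 *     pthread_mutex_unlock(&balance_lock);
 */
int main(void)
{
    deposit(25);
    printf("balance = %ld\n", balance);
    return 0;
}
```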
167 |
Run-time optimization of adaptive irregular applications. Yu, Hao, 15 November 2004 (has links)
Compared to traditional compile-time optimization, run-time optimization can offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of the input data, the program's dynamic behavior, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on the program's input data and its dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of a program's dynamic memory access patterns with the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework that automatically and adaptively selects, at run time, the best-performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on prediction models generated automatically off-line and on characteristics of the algorithm's input data collected and analyzed dynamically. In this dissertation, we specialize this framework for the automatic selection of reduction algorithms. We have identified a small set of machine-independent, high-level characterization parameters and deployed an off-line, systematic experimental process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and dynamic program behavior. Specifically, for reduction algorithm selection, the performance of the selected algorithm is within 2% of optimal and is on average 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
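The sketch below illustrates the shape of run-time algorithm selection for parallel reductions: cheap characteristics of the access pattern are measured at run time and fed to a decision rule produced off-line. The characteristics, thresholds, and candidate algorithm list here are assumptions made for the example, not the dissertation's actual characterization parameters or prediction models.

```c
/* Sketch: choose a parallel reduction algorithm at run time from a simple
 * profile of the access pattern, standing in for an off-line trained model. */
#include <stdio.h>

enum reduction_algo { REPLICATED_BUFFER, SELECTIVE_PRIVATIZATION, LOCAL_WRITE };

struct access_profile {
    double density;       /* fraction of reduction elements actually touched */
    double contention;    /* average number of threads touching an element   */
};

/* Hypothetical decision rule; real models are generated from experiments. */
static enum reduction_algo select_reduction(const struct access_profile *p)
{
    if (p->density > 0.5)     return REPLICATED_BUFFER;     /* dense updates   */
    if (p->contention < 1.5)  return LOCAL_WRITE;           /* little sharing  */
    return SELECTIVE_PRIVATIZATION;                         /* sparse + shared */
}

int main(void)
{
    struct access_profile p = { .density = 0.08, .contention = 1.2 };
    static const char *names[] = { "replicated buffer",
                                   "selective privatization",
                                   "local write" };
    printf("selected reduction algorithm: %s\n", names[select_reduction(&p)]);
    return 0;
}
```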
168 |
Design and remote control of a Gantry mechanism for the SCARA robot. Surinder Pal, 15 May 2009 (has links)
Remote experimentation and control have led researchers to develop new technologies as well as implement existing techniques. The multidisciplinary nature of research in electromechanical systems has led to the synergy of mechanical engineering, electrical engineering and computer science. This work describes the design of a model of a Gantry Mechanism which maneuvers a web-cam. The user virtually controls the position of the end-effector of the Gantry Mechanism through a Graphical User Interface (GUI), which is accessed over the Internet. In order to reduce the unbalanced vibrations of the Gantry Mechanism, we investigate the development of an input-shaping algorithm. A model of the Gantry Mechanism is built and controlled over the Internet to view experiments on the SCARA Robot. The system performance is studied by comparing inputs, such as distances and angles, with the corresponding outputs, and methods to improve the performance are suggested.
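As one common form of input shaping, the sketch below applies a standard two-impulse zero-vibration (ZV) shaper to a step command: the command is convolved with two impulses whose amplitudes and spacing follow from a dominant natural frequency and damping ratio. The numeric parameters are placeholders, not identified values for this gantry, and the thesis's own algorithm may differ.

```c
/* Sketch: ZV input shaper -- convolve the motion command with two impulses
 * so that the residual vibration of the dominant mode largely cancels. */
#include <math.h>
#include <stdio.h>

#define N 200

int main(void)
{
    const double PI   = 3.14159265358979323846;
    const double wn   = 12.0;   /* assumed natural frequency [rad/s] */
    const double zeta = 0.05;   /* assumed damping ratio             */
    const double dt   = 0.01;   /* sample time [s]                   */

    /* ZV shaper: two impulses, half a damped period apart */
    double K  = exp(-zeta * PI / sqrt(1.0 - zeta * zeta));
    double A1 = 1.0 / (1.0 + K);
    double A2 = K / (1.0 + K);
    int    d2 = (int)(PI / (wn * sqrt(1.0 - zeta * zeta)) / dt + 0.5);

    double cmd[N], shaped[N];
    for (int k = 0; k < N; k++)          /* unshaped command: a unit step   */
        cmd[k] = 1.0;
    for (int k = 0; k < N; k++)          /* convolve with the two impulses  */
        shaped[k] = A1 * cmd[k] + A2 * (k >= d2 ? cmd[k - d2] : 0.0);

    printf("impulse amplitudes A1=%.3f A2=%.3f, second impulse after %d samples\n",
           A1, A2, d2);
    printf("shaped command at t=0: %.3f, after the delay: %.3f\n",
           shaped[0], shaped[d2]);
    return 0;
}
```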
169 |
NoGAP: Novel Generator of Accelerators and Processors. Karlström, Per Axel, January 2010 (has links)
ASIPs are needed to handle the future demand for flexible yet high-performance embedded computing. The flexibility of ASIPs makes them preferable to fixed-function ASICs, and a well-designed ASIP has power consumption comparable to that of an ASIC. However, the cost associated with ASIP design is a limiting factor for more widespread adoption. A number of different tools have been proposed, promising to ease this design process, but all current state-of-the-art tools limit the designer through a template-based design process, which blocks design freedom and limits the I/O bandwidth of the template. We have therefore proposed the Novel Generator of Accelerators and Processors (NoGAP). NoGAP is a design automation tool for ASIP and accelerator design that puts very few limits on what can be designed, yet gives support by automating many of the tedious and error-prone tasks associated with ASIP design. This thesis presents NoGAP and many of its key concepts, such as NoGAP-CL, the language used to implement processors in NoGAP. It also exposes NoGAP's key technologies, which include automatic bus and wire sizing, instruction decoder and pipeline management, generation of PC-FSMs, assembler generation, and generation of cycle-accurate simulators. We have so far demonstrated NoGAP's strengths in three extensive case studies: a pipelined floating-point data path, a simple RISC processor, and an advanced RISC-style DSP, all designed using NoGAP. These case studies all point to the same conclusion: NoGAP speeds up development time, clarifies complex pipeline architectures, retains design flexibility and, most importantly, does not incur much performance penalty compared to hand-optimized RTL code. We believe that the work presented in this thesis shows that NoGAP, using our proposed novel approach to microarchitecture design, can have a significant impact on both academic and industrial hardware design. To the best of our knowledge, NoGAP is the first system to demonstrate that a template-free processor construction framework can be developed and can generate high-performance hardware solutions.
170 |
Automatic Task Formation Techniques for the Multi-level Computing Architecture. Stewart, Kirk, 30 July 2008 (has links)
The Multi-Level Computing Architecture (MLCA) is a multiprocessor system-on-chip architecture designed for multimedia applications. It provides a programming model that simplifies the process of writing parallel applications by eliminating the need for explicit synchronization. However, developers must still invest effort to design applications that fully exploit the MLCA’s multiprocessing capabilities. We present a set of compiler techniques to streamline the process of developing applications for the MLCA. We present an algorithm to automatically partition a sequential application into tasks that can be executed in parallel. We also present code generation algorithms to translate annotated, sequential C code to the MLCA’s programming model. We provide an experimental evaluation of these techniques, performed with a prototype compiler based upon the open-source ORC compiler and integrated with the MLCA Optimizing Compiler. This evaluation shows that the performance of automatically generated code compares favourably to that of manually written code.
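The sketch below illustrates one naive way task formation can work: statements of a sequential program are grouped greedily, and a new task starts whenever a statement does not consume data produced inside the current task. The statements, dependence representation, and greedy rule are assumptions for the example and are not the partitioning algorithm developed in this thesis.

```c
/* Sketch: greedy task formation from a sequential statement list based on
 * def-use dependences between statements. */
#include <stdio.h>
#include <string.h>

#define NSTMT 6

/* For each statement: the variable it writes and the variables it reads. */
struct stmt { const char *writes; const char *reads[2]; };

static struct stmt prog[NSTMT] = {
    { "a", { NULL, NULL } },     /* a = input()     */
    { "b", { "a",  NULL } },     /* b = decode(a)   */
    { "c", { NULL, NULL } },     /* c = input()     */
    { "d", { "c",  NULL } },     /* d = decode(c)   */
    { "e", { "b",  "d"  } },     /* e = merge(b, d) */
    { "f", { "e",  NULL } },     /* f = output(e)   */
};

/* Does statement s read anything written by a statement already in the task? */
static int depends_on_task(int s, const int *task, int task_len)
{
    for (int i = 0; i < task_len; i++)
        for (int r = 0; r < 2; r++)
            if (prog[s].reads[r] &&
                strcmp(prog[s].reads[r], prog[task[i]].writes) == 0)
                return 1;
    return 0;
}

int main(void)
{
    int task[NSTMT], task_len = 0, task_id = 0, task_start = 0;
    for (int s = 0; s < NSTMT; s++) {
        if (task_len > 0 && !depends_on_task(s, task, task_len)) {
            printf("task %d: statements %d..%d\n", task_id, task_start, s - 1);
            task_id++;
            task_len = 0;
            task_start = s;
        }
        task[task_len++] = s;
    }
    printf("task %d: statements %d..%d\n", task_id, task_start, NSTMT - 1);
    printf("formed %d tasks; independent tasks can be dispatched in parallel\n",
           task_id + 1);
    return 0;
}
```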