Spelling suggestions: "subject:"coprocessors."" "subject:"posprocessors.""
1 |
Methodologies and tools for computation offloading on heterogeneous multicoresBhagwat, Ashwini. January 2009 (has links)
Thesis (M. S.)--Computing, Georgia Institute of Technology, 2009. / Committee Chair: Pande, Santosh; Committee Member: Clark, Nate; Committee Member: Yalamanchili, Sudhakar. Part of the SMARTech Electronic Thesis and Dissertation Collection.
|
2 |
Enhancing Query Support in HBase via an extended Coprocessor FrameworkVashishtha, Himanshu Unknown Date
No description available.
|
3 |
Methodologies and tools for computation offloading on heterogeneous multicoresBhagwat, Ashwini 18 May 2009 (has links)
Frequency scaling in traditional computing systems has hit the power wall and multicore computing is here to stay. Unlike homogeneous multicores which have uniform architecture and instruction set across cores, heterogenous multicores have differentially capable cores to provide optimal performance for specialized functionality. However, this heterogeneity also translates into difficult programming models, and extracting its potential is not trivial. The Cell Broadband Engine by the Sony Toshiba IBM(STI) consortium was amongst the first heterogenous multicore systems with a single Power Processing Unit(PPU) and 8 Synergistic Processor Units (SPUs).
We address the issue of porting an existing sequential C/C++ codebase on to the Cell BE through compiler driven program analysis and profiling. Until parallel programming models evolve, the "interim" solution to performance involves speeding up legacy code by offloading computationally intense parts of a sequential thread to the co-processor; thus using it as an accelerator. Unique architectural characteristics of an accelerator makes this problem quite challenging. On the Cell, these characteristics include limited local store of the SPU, high latency of data transfer between PPU and SPU, lack of branch prediction unit, limited SIMDizability, expensive scalar code etc. In particular, the designers of the Cell have opted for software controlled memory on its SPUs to reduce power consumption and to give programmers more control over the predictability of latency. The lack of a hardware cache on the SPU can create performance bottlenecks because any data that needs to be brought in to the SPU must be brought in using a DMA call. The need for supporting a software controlled cache is thus evident for irregular memory accesses on the SPU. For such a cache to result in improved performance, the amount of time spent in book-keeping and tracking at run-time should be minimal. Traditional algorithms like LRU, when implemented in software incur overheads on every cache hit because appropriate data structures need to be updated. Such overheads are on off critical path for traditional hardware cache but on the critical path for a software controlled cache. Thus there is a need for better management of "data movement" for the code that is offloaded on to the SPU.
This thesis addresses the "code partitioning" problem as well as the "data movement" problem. We present
GLIMPSES - a compiler driven profiling tool that analyzes existing C/C++ code for its suitability for porting to the Cell, and presents its results in an interactive visualizer.
Software Controlled Cache - an improved eviction policy that exploits information gleaned from memory traces generated through offline profiling. The trace is analyzed to provide guidance for a run-time state machine within the cache manager; resulting in reduced run-time overhead and better performance. The design tradeoffs and several pros and cons of this approach are brought forth as well. It is shown that with just about the right amount of runtime book-keeping and decision making, one can get to the difficult solution space of the right balance to achieve high performance.
|
4 |
An FPGA coprocessor for real-time bathymetric synthetic aperture sonar : a thesis submitted in partial fulfilment of the requirements for the degree of Master of Engineering in Electrical and Computer Engineering at the University of Canterbury, Christchurch, New Zealand /Mulligan, David J. January 1900 (has links)
Thesis (M.E.)--University of Canterbury, 2007. / Typescript (photocopy). "February 2007." Includes bibliographical references (p. [85]-88). Also available via the World Wide Web.
|
5 |
Coprocessador para aceleração de aplicações desenvolvidas utilizando paradigma orientado a notificaçõesPeters, Eduardo 31 July 2012 (has links)
Este trabalho apresenta um novo hardware coprocessador para acelerar aplicações desenvolvidas utilizando-se o Paradigma Orientado a Notificações (PON), cuja essência se constitui em uma nova forma de influência causal baseada na colaboração pontual entre entidades granulares e notificantes. Uma aplicação PON apresenta as vantagens da programação baseada em eventos e da programação declarativa, possibilitando um desenvolvimento de alto nível, auxiliando o reuso de código e reduzindo o processamento desnecessário existente das aplicações desenvolvidas com os paradigmas atuais. Como uma aplicação PON é composta de uma cadeia de pequenas entidades computacionais, comunicando-se somente quando necessário, é um bom candidato a implementação direta em hardware. Para investigar este pressuposto, criou-se um coprocessador capaz de executar aplicações PON existentes. O coprocessador foi desenvolvido utilizando-se linguagem VHDL e testado em FPGAs, mostrando um decréscimo de 96% do número de ciclos de clock utilizados por um programa se comparado a implementação puramente em software da mesma aplicação, considerando uma dada materialização em um framework em PON. / This work presents a new hardware coprocessor to accelerate applications developed using the Notification-Oriented Paradigm (NOP). A NOP application has the advantages of both event-based programming and declarative programming, enabling higher level software development, improving code reuse, and reducing the number of unnecessary computations. Because a NOP application is composed of a network of small computational entities communicating only when needed, it is a good candidate for a direct hardware implementation. In order to investigate this assumption, a coprocessor that is able to run existing NOP applications was created. The coprocessor was developed in VHDL and tested in FPGAs, providing a decrease of 96% in the number of clock cycles compared to a purely software implementation.
|
6 |
Coprocessador para aceleração de aplicações desenvolvidas utilizando paradigma orientado a notificaçõesPeters, Eduardo 31 July 2012 (has links)
Este trabalho apresenta um novo hardware coprocessador para acelerar aplicações desenvolvidas utilizando-se o Paradigma Orientado a Notificações (PON), cuja essência se constitui em uma nova forma de influência causal baseada na colaboração pontual entre entidades granulares e notificantes. Uma aplicação PON apresenta as vantagens da programação baseada em eventos e da programação declarativa, possibilitando um desenvolvimento de alto nível, auxiliando o reuso de código e reduzindo o processamento desnecessário existente das aplicações desenvolvidas com os paradigmas atuais. Como uma aplicação PON é composta de uma cadeia de pequenas entidades computacionais, comunicando-se somente quando necessário, é um bom candidato a implementação direta em hardware. Para investigar este pressuposto, criou-se um coprocessador capaz de executar aplicações PON existentes. O coprocessador foi desenvolvido utilizando-se linguagem VHDL e testado em FPGAs, mostrando um decréscimo de 96% do número de ciclos de clock utilizados por um programa se comparado a implementação puramente em software da mesma aplicação, considerando uma dada materialização em um framework em PON. / This work presents a new hardware coprocessor to accelerate applications developed using the Notification-Oriented Paradigm (NOP). A NOP application has the advantages of both event-based programming and declarative programming, enabling higher level software development, improving code reuse, and reducing the number of unnecessary computations. Because a NOP application is composed of a network of small computational entities communicating only when needed, it is a good candidate for a direct hardware implementation. In order to investigate this assumption, a coprocessor that is able to run existing NOP applications was created. The coprocessor was developed in VHDL and tested in FPGAs, providing a decrease of 96% in the number of clock cycles compared to a purely software implementation.
|
7 |
GPF : a framework for general packet classification on GPU co-processors / GPU Packet Filter : framework for general packet classification on Graphics Processing Unit co-processorsNottingham, Alastair January 2012 (has links)
This thesis explores the design and experimental implementation of GPF, a novel protocol-independent, multi-match packet classification framework. This framework is targeted and optimised for flexible, efficient execution on NVIDIA GPU platforms through the CUDA API, but should not be difficult to port to other platforms, such as OpenCL, in the future. GPF was conceived and developed in order to accelerate classification of large packet capture files, such as those collected by Network Telescopes. It uses a multiphase SIMD classification process which exploits both the parallelism of packet sets and the redundancy in filter programs, in order to classify packet captures against multiple filters at extremely high rates. The resultant framework - comprised of classification, compilation and buffering components - efficiently leverages GPU resources to classify arbitrary protocols, and return multiple filter results for each packet. The classification functions described were verified and evaluated by testing an experimental prototype implementation against several filter programs, of varying complexity, on devices from three GPU platform generations. In addition to the significant speedup achieved in processing results, analysis indicates that the prototype classification functions perform predictably, and scale linearly with respect to both packet count and filter complexity. Furthermore, classification throughput (packets/s) remained essentially constant regardless of the underlying packet data, and thus the effective data rate when classifying a particular filter was heavily influenced by the average size of packets in the processed capture. For example: in the trivial case of classifying all IPv4 packets ranging in size from 70 bytes to 1KB, the observed data rate achieved by the GPU classification kernels ranged from 60Gbps to 900Gbps on a GTX 275, and from 220Gbps to 3.3Tbps on a GTX 480. In the less trivial case of identifying all ARP, TCP, UDP and ICMP packets for both IPv4 and IPv6 protocols, the effective data rates ranged from 15Gbps to 220Gbps (GTX 275), and from 50Gbps to 740Gbps (GTX 480), for 70B and 1KB packets respectively. / LaTeX with hyperref package
|
8 |
High Level Power Estimation and Reduction Techniques for Power Aware Hardware DesignAhuja, Sumit 14 June 2010 (has links)
The unabated continuation of the Moore's law has allowed the doubling of the number of transistors per unit area of a silicon die every 2 years or so. At the same time, an increasing demand on consumer electronics and computing equipments to run sophisticated applications has led to an unprecedented complexity of hardware designs. These factors have necessitated the abstraction level of design-entry of hardware systems to be raised beyond the Register-Transfer-Level (RTL) to Electronic System Level (ESL). However, power envelope on the designs due to packaging and other thermal limitations, and the energy envelope due to battery life-time considerations have also created a need for power/energy efficient design. The confluence of these two technological issues has created an urgent need for solving two problems: (i) How do we enable a power-aware design flow with a design entry point at the Electronic System Level? (ii) How do we enable power aware High Level Synthesis to automatically synthesize RTL implementation from ESL?
This dissertation distinguishes itself by addressing the following two issues: (i) Since power/energy consumption of electronic systems largely depends on implementation details, and high-level models abstract away from such details, power/energy estimation at such levels has not been addressed thoroughly. (ii) A lot of work has been done in applying various techniques on control-data-flow graphs (CDFG) to find power/area/latency pareto points during behavioral synthesis. However, high level C-based functional models of various compute-intensive components, which could be easily synthesized as co-processors, have many opportunities to reduce power. Some of these savings opportunities are traditional such as clock-gating, operand-isolation etc. The exploration of alternate granularities of these techniques with target applications in mind, opens the door for traditional power reduction opportunities at the high-level.
This work therefore concentrates on the aforementioned two areas of inadequacy of hardware design methodologies. Our proposed solutions include utilizing ESL simulation traces and mapping those to lower abstraction levels for power estimation, derivation of statistical power models using regression based learning for power estimation at early design stages, etc. On the HLS front, techniques that insert the power saving features during the synthesis process using exploration of granularity and scope of clock-gating, sequential clock-gating are proposed. Finally, this work shows how to marry two domains, that is estimation and reduction. In this regard, a power model is proposed, which helps in predicting power savings obtained using clock-gating and further guiding HLS to selectively insert clock-gating. / Ph. D.
|
9 |
Μεθοδολογίες μεταγλώττισης σε επαναπροσδιοριζόμενα συστήματα αρχιτεκτονικών πίνακαΓεωργιόπουλος, Σταύρος 01 February 2013 (has links)
Το αντικείμενο της παρούσας διδακτορικής διατριβής εστιάζεται στην ανάπτυξη αποδοτικών τεχνικών μεταγλώττισης για επαναπροσδιοριζόμενα ολοκληρωμένα συστήματα αρχιτεκτονικών πίνακα. Χρησιμοποιήθηκαν εφαρμογές που κυριαρχούνται από δεδομένα για τον έλεγχο των μεθοδολογιών. Σκοπός είναι να βελτιστοποιηθεί η εκτέλεση των εφαρμογών ως προς χαρακτηριστικά των επαναπροσδιοριζόμενων συστημάτων όπως η απόδοση, ο αριθμός εντολών ανά κύκλο ρολογιού, η επιφάνεια ολοκλήρωσης και ο βαθμός χρησιμοποίησης των επεξεργαστικών πόρων. Αυτό επιτυγχάνεται με την εισαγωγή πρωτότυπων τεχνικών χαρτογράφησης αλλά και την εύρεση βέλτιστων αρχιτεκτονικών.
Στο πρώτο τμήμα της διατριβής υλοποιήθηκε η έρευνα, ανάπτυξη και αυτοματοποίηση τεχνικών μεταγλώττισης για επαναπροσδιοριζόμενα συστήματα αρχιτεκτονικών πίνακα. Κύριο χαρακτηριστικό αυτών των αρχιτεκτονικών είναι ύπαρξη μεγάλου αριθμού επεξεργαστικών στοιχείων που δουλεύουν παράλληλα με αποτέλεσμα να επιταχύνουν την εκτέλεση εφαρμογών που εμφανίζουν παραλληλία πράξεων. Η λειτουργία τους σε ενσωματωμένα συστήματα είναι αυτή ενός συνεπεξεργαστή.
Η έρευνα πάνω σε επαναπροσδιοριζόμενες αρχιτεκτονικές πίνακα έχει αποκτήσει μεγάλο ενδιαφέρον λόγω της ευελιξίας, της επεκτασιμότητας και της απόδοσής τους, ιδιαίτερα σε εφαρμογές που κυριαρχούνται από δεδομένα. Η μεταγλώττιση, όμως, εφαρμογών πάνω σε αυτές χαρακτηρίζεται από υψηλή πολυπλοκότητα. Απαιτούνται κατάλληλα εργαλεία και ειδικές μεθοδολογίες χαρτογράφησης για την εκμετάλλευση των χαρακτηριστικών αυτών των αρχιτεκτονικών. Με αυτό το σκεπτικό, προτάθηκε μια πρωτότυπη επαναστοχεύσιμη μεθοδολογία χαρτογράφησης εφαρμογών, η οποία επιπλέον έχει αυτοματοποιηθεί με τη χρήση ενός πρότυπου εργαλείου μεταγλώττισης που στοχεύει σε ένα αρχιτεκτονικό παραμετρικό πρότυπο. Αποτέλεσμα ήταν η εύρεση των βέλτιστων αρχιτεκτονικών με βάσει την απόδοση, τον αριθμό των εντολών ανά κύκλο ρολογιού και το χρόνο εκτέλεσης του εργαλείου, για μια ομάδα εφαρμογών.
Η αποδοτικότητα μιας επαναπροσδιοριζόμενης αρχιτεκτονικής πίνακα ως προς την ταχύτητα και το κόστος σε υλικό είναι δύσκολο να μετρηθεί, για αυτό έχουν υπάρξει λίγες έρευνες που μελετούν την επίδραση αρχιτεκτονικών παραμέτρων πάνω σε παράγοντες όπως η επιφάνεια ολοκλήρωσης και ο αριθμός εντολών ανά κύκλο ρολογιού. Επιπλέον, καμιά εργασία δεν έχει εξετάσει την επίδραση πολλαπλασιαστών ενσωματωμένων στα επεξεργαστικά στοιχεία των επαναπροσδιοριζόμενων αρχιτεκτονικών. Χρησιμοποιώντας την υπάρχουσα επαναστοχεύσιμη μεθοδολογία μεταγλώττισης και μια παραμετρική υλοποίηση της αρχιτεκτονικής σε γλώσσα περιγραφής υλικού, εξετάζουμε την επίδραση των πολλαπλασιαστών από τη μεριά της χαρτογράφησης και της αρχιτεκτονικής.
Επίσης, περιγράφεται η πρωτότυπη μεθοδολογία χαρτογράφησης που εισήχθη με σκοπό την αποδοτική λειτουργία του αλγορίθμου Fast Fourier Transform (FFT) πάνω σε επαναπροσδιοριζόμενα συστήματα αρχιτεκτονικών πίνακα. Ο αλγόριθμος FFT χαρακτηρίζεται από μεγάλο αριθμό πράξεων κυρίως πολλαπλασιασμών που επιβραδύνουν την απόδοση μιας επαναπροσδιοριζόμενης αρχιτεκτονικής. Εκμεταλλευόμενοι την ύπαρξη εσωτερικής επαναληπτικής δομής μέσα στον αλγόριθμο και χρησιμοποιώντας μια επαναπροσδιοριζόμενη αρχιτεκτονική 16 επεξεργαστικών στοιχείων, αναπτύξαμε μια πρωτότυπη τεχνική χαρτογράφησης. Επιπρόσθετα, η τεχνική μας λαμβάνει υπόψη την ιεραρχία μνήμης μεταξύ κύριας μνήμης και επαναπροσδιοριζόμενης αρχιτεκτονικής για την περαιτέρω επιτάχυνση εκτέλεσης του αλγορίθμου FFT. Η χρήση της προτεινόμενης τεχνικής χαρτογράφησης οδηγεί σε επίτευξη βαθμού χρησιμοποίησης των επεξεργαστικών στοιχείων άνω του 90%, τιμή που είναι τουλάχιστον 37% υψηλότερη από την καλύτερη τιμή της βιβλιογραφίας. / The object of this PhD thesis focuses on developing efficient mapping techniques for coarse grain reconfigurable build arrays. Data intensive applications were used to evaluate the proposed methodologies. The aim is to optimize the applications’ performance on characteristics targeting reconfigurable characteristics such as performance, instructions per cycle, area of integration and processing resource utilization. This is achieved by introducing novel mapping techniques and finding optimal architectures.
In the first part of the thesis research, development and automation of mapping techniques was carried out targeting coarse grain reconfigurable arrays. The main feature of these architectures is the presence of a large number of processing elements working in parallel thus speeding up the execution of applications featuring parallel operations. The function of these processing elements in embedded systems resembles that of a coprocessor.
The research on reconfigurable array architectures has gained considerable interest because of their flexibility, scalability and performance, particularly in data intensive applications. Nevertheless, compiling these applications on reconfigurable architectures is characterized by high degree of complexity. Appropriate tools and special mapping methodologies are needed to exploit the characteristics of these architectures. Bearing this in mind, we proposed a novel reconfigurable methodology for mapping applications, which has also been automated with the use of a prototype compiler tool aiming at a parametric architectural model. The result was finding the best architectures on the basis of performance, the instructions per cycle term and the tool execution time for a sample set of applications.
It is difficult to evaluate the efficiency of a reconfigurable array architecture table in terms of speed and area of integration, so there have been few cases studying the effect of architectural parameters on factors such as surface integration and the number of instructions per clock cycle. Moreover, no work has examined the multipliers’ impact embedded in reconfigurable architectures processing elements. Using the existing reconfigurable mapping methodology and a parametric implementation of the architecture in hardware description language, we examine the effect of multipliers on the part of the mapping phase and architecture.
We also describe an original mapping methodology introduced for the purpose of efficiently mapping the Fast Fourier Transform (FFT) algorithm on reconfigurable array architectures. The FFT algorithm is characterized by a large number of operations primarily multiplications that slow the performance of a reconfigurable architecture. Exploiting the existence of an internal structure inside the FFT algorithm and by the use of a reconfigurable architecture template of 16 processing elements, we developed a novel mapping technique. Additionally, our technique takes into account the memory hierarchy between main memory and reconfigurable architecture in order to further accelerate the implementation of the FFT algorithm. Using the proposed mapping technique results in processing elements utilization of over 90% value which is at least 37% better than the best value of the related literature.
|
Page generated in 0.0364 seconds