  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
41

Energy efficient instruction decoding in application-specific instruction-set processors / Αποκωδικοποίηση εντολών για χαμηλή κατανάλωση ενέργειας σε επεξεργαστές συνόλου εντολών ειδικού σκοπού

Κάργας, Χρήστος 04 September 2013
With commercial processor design tools, a designer can quickly design a C-programmable ASIP (Application-Specific Instruction-set Processor) for a specific application domain. Several such ASIPs are available for wireless applications (UWB baseband processing), encryption, and biomedical processing (particularly ECG beat detection). In traditional CPUs and DSPs, the impact of the instruction-set definition and the complexity of the instruction decoder can be substantial, especially in terms of power consumption. One possible solution is the orthogonal very long instruction word (VLIW) processor. An orthogonal processor is here a processor with a horizontal instruction set, that is, one in which every combination of the available instructions and the addressing modes for accessing memory and the register file is possible. Fully orthogonal VLIW processors do not incur the cost of an instruction decoder as severely. Instead, the instruction word becomes very large, shifting the (power) cost to the program memory or instruction cache. For this thesis, a SIMD processor is developed and compared with a soft-SIMD processor to observe its area, performance and energy efficiency on a bioimaging benchmark, and to see how the processor description in the ASIP language nML determines the generated HDL. The SIMD processor is then made orthogonal, and iterative experiments investigate the impact on power of manipulating the instruction-set architecture in combination with the program memory size. The thesis also investigates how instruction-set reconfiguration can be exploited to improve power efficiency. From this investigation, guidelines for low-power ASIP design can be derived.
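The trade-off described in this abstract (a simpler decoder in exchange for a wider instruction word, and hence a larger program memory) can be illustrated with a minimal sketch. The three-slot machine, its operation counts and its operand widths are hypothetical and are not taken from the thesis.

```python
from math import ceil, log2

# Hypothetical 3-slot machine: (slot name, number of operations, operand bits).
SLOTS = [("alu", 8, 12), ("mul", 2, 12), ("mem", 4, 10)]

def field_layout(slots):
    """Return [(name, offset, width)] for a fully orthogonal word.

    Each slot gets its own fixed field holding its opcode and operands.
    """
    layout, offset = [], 0
    for name, n_ops, operand_bits in slots:
        width = ceil(log2(n_ops)) + operand_bits
        layout.append((name, offset, width))
        offset += width
    return layout

def word_width(slots):
    """Total instruction word width in bits."""
    return sum(width for _, _, width in field_layout(slots))

def decode(word, slots):
    """Decoding an orthogonal word is plain bit slicing, with no lookup table."""
    return {name: (word >> off) & ((1 << width) - 1)
            for name, off, width in field_layout(slots)}

def program_memory_bits(slots, n_instructions):
    """The cost moves to the program memory: width times instruction count."""
    return word_width(slots) * n_instructions
```

With these assumed numbers the word is 40 bits wide, so a 1024-instruction program needs 40960 bits of program memory, while `decode` stays a handful of shifts and masks.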
42

Proposta de um processador multithreading com características de previsibilidade / Proposal of predictable multithreading processor

Siqueira, Hadley Magno da Costa 18 August 2015
The design of real-time embedded systems requires precise control of the passage of time in the computation performed by the modules and in the communication between them. Generally, these systems consist of several modules, each designed for a specific task and with restricted communication with the other modules, in order to obtain the required timing. This strategy, called federated architecture, is becoming unviable in the face of current demands on the cost, performance and quality of embedded systems. To address this problem, integrated architectures have been proposed; they consist of one or a few circuits performing multiple tasks in parallel more efficiently and at reduced cost. However, the integrated architecture must have temporal composability, i.e. the ability to design each task temporally isolated from the others so that each task keeps its individual characteristics. Precision Timed Machines are an integrated-architecture approach that advocates the use of multithreaded processors to ensure temporal composability. This work presents the implementation of a Precision Timed Machine named Hivek-RT. This processor, a VLIW with support for Simultaneous Multithreading, executes real-time tasks efficiently when compared to a traditional processor. In addition to efficient execution, the architecture facilitates the implementation of real-time tasks from a programming point of view.
43

Optimalizace v překladači C pro VLIW architektury / Optimizations in C Compiler for VLIW Architectures

Baručák, Robert January 2014
This thesis presents an implementation of an alias analysis algorithm that was integrated into the LLVM framework. The properties and limitations of various alias analysis algorithms are discussed. Different approaches to working with predicates are demonstrated, along with the integration of these principles into LLVM. One outcome of this master's thesis is the design and implementation of an algorithm for profile-guided if-conversion.
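As a rough illustration of what profile-guided if-conversion involves, the sketch below shows a branch, its predicated equivalent, and a simple profitability check driven by a profiled branch probability. The cost model (both sides are executed after conversion, and a fixed branch penalty is paid otherwise) is an assumption made for this example and is not the heuristic used in the thesis.

```python
def branchy(x, y):
    """Original form: control flow selects which side runs."""
    if x > y:
        r = x - y
    else:
        r = y - x
    return r

def if_converted(x, y):
    """If-converted form: both sides are computed, a predicate selects."""
    p = x > y               # predicate
    t = x - y               # "then" side, always executed
    f = y - x               # "else" side, always executed
    return t if p else f    # stands in for a predicated move or select

def should_if_convert(p_taken, then_cycles, else_cycles, branch_penalty):
    """Hypothetical cost model: convert when predication is expected to be cheaper.

    p_taken comes from profiling; the cycle counts and penalty are assumptions.
    """
    branch_cost = (p_taken * then_cycles
                   + (1 - p_taken) * else_cycles
                   + branch_penalty)
    predicated_cost = then_cycles + else_cycles  # conservative: no overlap
    return predicated_cost < branch_cost
```

Under this model a balanced branch with two short sides is converted, while a branch whose rarely taken side is long stays a branch, which is where the profile information matters.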
44

Porting the GCC-Backend to a VLIW-Architecture: Portierung des GCC-Backends auf eine VLIW-Architektur

Parthey, Jan 01 March 2004
This diploma thesis discusses the implementation of a GCC target for the Texas Instruments TMS320C6000 DSP platform. To this end, it makes use of the mechanisms GCC offers for porting to new target architectures. GCC internals such as the handling of conditional jumps and the layout of stack frames are investigated and applied to the new architecture.
45

Optimizing the GCC Suite for a VLIW Architecture: Optimierung der GCC Suite für eine VLIW Architektur

Strätling, Adrian 18 November 2004
This diploma thesis discusses the applicability of the various GCC optimization algorithms to the TI TMS320C6x processor family. Conditional and parallel execution are used to speed up the resulting code. The thesis describes the optimization framework of GCC version 4.0 and the implementation details.
46

Implementace generického procesoru v FPGA / Implementation of Generic Processor in FPGA

Mikušek, Petr Unknown Date
This thesis studies processor architectures suitable for embedded processors, including Transport Triggered Architectures (TTA). A TTA is programmed by specifying data transports; operations are triggered as a side effect of those transports. In traditional Operation Triggered Architectures (OTA), the program specifies the operations and the data transports are handled internally by hardware, so the compiler cannot control or optimize them. The TTA approach brings advantages in both hardware and software aspects. The aim of this thesis is to design and implement a sample TTA processor in VHDL and then realize it in an FPGA. The processor is designed in a generic manner, i.e. it is customized by a set of generic parameters such as data width, number of buses, etc.
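The transport-triggered idea in this abstract can be sketched in a few lines: the program consists only of moves, and writing to a unit's trigger port starts the operation. The port names, the single adder unit and the string-based move format are hypothetical choices for this example; the processor in the thesis is written in VHDL and is not modeled here.

```python
class AddUnit:
    """Function unit with an operand port, a trigger port and a result port."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def write(self, port, value):
        if port == "operand":
            self.operand = value                 # plain latch, nothing fires
        elif port == "trigger":
            self.result = self.operand + value   # the move itself starts the add
        else:
            raise ValueError(f"unknown port: {port}")

def run(moves, regs, units):
    """Execute a list of (source, destination) moves.

    Names containing a dot address a unit port ("add.trigger");
    all other names address registers.
    """
    for src, dst in moves:
        if "." in src:
            unit, _port = src.split(".")
            value = units[unit].result           # only result ports are readable
        else:
            value = regs[src]
        if "." in dst:
            unit, port = dst.split(".")
            units[unit].write(port, value)
        else:
            regs[dst] = value
    return regs
```

For example, the three moves `r1 -> add.operand`, `r2 -> add.trigger`, `add.result -> r3` compute `r3 = r1 + r2` without any explicit add instruction, which is what gives the compiler control over every data transport.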
47

Integrated Optimal Code Generation for Digital Signal Processors

Bednarski, Andrzej January 2006
In this thesis we address the problem of optimal code generation for irregular architectures such as Digital Signal Processors (DSPs). Code generation consists mainly of three interrelated optimization tasks: instruction selection (with resource allocation), instruction scheduling and register allocation. These tasks have been shown to be NP-hard for most architectures and most situations. A common approach to code generation consists in solving each task separately, i.e. in a decoupled manner, which is easier from a software engineering point of view. Phase-decoupled compilers produce good code quality for regular architectures, but if applied to DSPs the resulting code is of significantly lower performance due to strong interdependences between the different tasks. We developed a novel method for fully integrated code generation at the basic block level, based on dynamic programming. It handles the most important tasks of code generation in a single optimization step and produces an optimal code sequence. Our dynamic programming algorithm is applicable to small but nontrivial problem instances with up to 50 instructions per basic block if data locality is not an issue, and up to 20 instructions if we take data locality with optimal scheduling of data transfers on irregular processor architectures into account. For larger problem instances we have developed heuristic relaxations. In order to obtain a retargetable framework we developed a structured architecture specification language, xADML, which is based on XML. We implemented such a framework, called OPTIMIST, which is parameterized by an xADML architecture specification. The thesis further provides an Integer Linear Programming (ILP) formulation of fully integrated optimal code generation for VLIW architectures with a homogeneous register file. Where it terminates successfully, the ILP-based optimizer mostly works faster than the dynamic programming approach; on the other hand, it fails for several larger examples where dynamic programming still provides a solution. Hence, the two approaches complement each other. In particular, we show how the dynamic programming approach can be used to precondition the ILP formulation. As far as we know from the literature, this is the first time that the main tasks of code generation are solved optimally in a single and fully integrated optimization step that additionally considers data placement in register sets and optimal scheduling of data transfers between different register sets.
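To give a feel for dynamic programming over a basic block, the sketch below finds the minimum number of cycles needed to schedule a dependence graph on a machine that issues a fixed number of instructions per cycle. It is a deliberately reduced toy: it assumes unit latency, covers scheduling only (no instruction selection or register allocation), and is not the algorithm implemented in OPTIMIST. The function name and the input format are assumptions for this example.

```python
from functools import lru_cache
from itertools import combinations

def min_cycles(deps, width):
    """Minimum schedule length for a basic block.

    deps maps each instruction name to the set of instructions it depends on.
    width is the number of instructions that can issue per cycle.
    Unit latency is assumed for every instruction.
    """
    all_instrs = frozenset(deps)

    @lru_cache(maxsize=None)
    def best(done):
        # State of the dynamic program: the set of already issued instructions.
        if done == all_instrs:
            return 0
        ready = sorted(i for i in all_instrs - done if deps[i] <= done)
        k = min(width, len(ready))
        # With unit latency, filling the cycle never hurts, so only
        # subsets of exactly k ready instructions need to be tried.
        return 1 + min(best(done | frozenset(choice))
                       for choice in combinations(ready, k))

    return best(frozenset())
```

Memoizing on the set of issued instructions means that different issue orders reaching the same set are explored only once. The number of such sets still grows exponentially with block size, which is consistent with the abstract's limit of a few dozen instructions per basic block.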
48

Integrated Optimal Code Generation for Digital Signal Processors

Bednarski, Andrzej January 2006
In this thesis we address the problem of optimal code generation for irregular architectures such as Digital Signal Processors (DSPs). Code generation consists mainly of three interrelated optimization tasks: instruction selection (with resource allocation), instruction scheduling and register allocation. These tasks have been shown to be NP-hard for most architectures and most situations. A common approach to code generation consists in solving each task separately, i.e. in a decoupled manner, which is easier from a software engineering point of view. Phase-decoupled compilers produce good code quality for regular architectures, but if applied to DSPs the resulting code is of significantly lower performance due to strong interdependences between the different tasks. We developed a novel method for fully integrated code generation at the basic block level, based on dynamic programming. It handles the most important tasks of code generation in a single optimization step and produces an optimal code sequence. Our dynamic programming algorithm is applicable to small but nontrivial problem instances with up to 50 instructions per basic block if data locality is not an issue, and up to 20 instructions if we take data locality with optimal scheduling of data transfers on irregular processor architectures into account. For larger problem instances we have developed heuristic relaxations. In order to obtain a retargetable framework we developed a structured architecture specification language, xADML, which is based on XML. We implemented such a framework, called OPTIMIST, which is parameterized by an xADML architecture specification. The thesis further provides an Integer Linear Programming (ILP) formulation of fully integrated optimal code generation for VLIW architectures with a homogeneous register file. Where it terminates successfully, the ILP-based optimizer mostly works faster than the dynamic programming approach; on the other hand, it fails for several larger examples where dynamic programming still provides a solution. Hence, the two approaches complement each other. In particular, we show how the dynamic programming approach can be used to precondition the ILP formulation. As far as we know from the literature, this is the first time that the main tasks of code generation are solved optimally in a single and fully integrated optimization step that additionally considers data placement in register sets and optimal scheduling of data transfers between different register sets.
49

Custom floating-point arithmetic for integer processors : algorithms, implementation, and selection

Jourdan, Jingyan 15 November 2012
Media processing applications typically involve numerical blocks that exhibit regular floating-point computation patterns. For processors whose architecture supports only integer arithmetic, these patterns can be profitably turned into custom operators, which come in addition to the five basic ones (+, -, ×, / and √) but achieve better performance by treating more operations. This thesis addresses the design of such custom operators as well as the techniques developed in the compiler to select them in application codes. We have designed optimized implementations for a set of custom operators which includes squaring, scaling, adding two nonnegative terms, fused multiply-add, fused square-add (x*x+z, with z >= 0), two-dimensional dot products (DP2), sums of two squares, as well as simultaneous addition/subtraction and sine/cosine. With novel algorithms targeting high instruction-level parallelism and detailed here for squaring, scaling, DP2, and sin/cos, we achieve speedups of up to 4.2x for individual custom operators even when subnormal numbers are fully supported. Furthermore, we introduce the optimizations developed in the ST231 C/C++ compiler for selecting such operators. Most of the selections are achieved at a high level, using syntactic criteria. However, for fused square-add, we also enhance the framework of integer range analysis to support floating-point variables in order to prove the required positivity condition z >= 0. Finally, we provide quantitative evidence of the benefits of selecting these custom operators: on DSP kernels and benchmarks, our approach is up to 1.59x faster than using the basic operators alone.
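The positivity check mentioned for fused square-add can be pictured as a small interval analysis: each expression gets a range, and the custom operator is selected only when the lower bound of z is provably nonnegative. The sketch below is a simplified stand-in, not the ST231 compiler's analysis; the function names are hypothetical, and rounding effects and NaN values are ignored.

```python
INF = float("inf")
UNKNOWN = (-INF, INF)   # range of a value about which nothing is known

def square(iv):
    """Range of x*x given the range of x."""
    lo, hi = iv
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))

def absolute(iv):
    """Range of |x| given the range of x."""
    lo, hi = iv
    if lo >= 0:
        return (lo, hi)
    if hi <= 0:
        return (-hi, -lo)
    return (0.0, max(-lo, hi))

def add(a, b):
    """Range of a sum given the ranges of its terms."""
    return (a[0] + b[0], a[1] + b[1])

def can_use_fused_square_add(z_range):
    """The fused operator x*x+z is only valid when z >= 0 is proven."""
    return z_range[0] >= 0
```

For instance, if z is itself computed as `a*a + |b|`, its range is `[0, inf)` even when nothing is known about a and b, so the fused operator could be selected; a z with range `[-1, 2]` would fall back to separate operations.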
50

Mikroarchitektur eines digitalen Signalprozessors mit Datenflusserweiterung

Fiedler, Rolf 27 June 2002
This dissertation presents the results of research into a new computer-architecture approach for the construction of digital signal processors. The new approach is based on a transport triggered architecture (TTA) and allows for a dataflow processing mode. The proposed architecture has been called TAD (Transport triggered Architecture with Dataflow-extension). The designed machine is able to execute limited dataflow graphs using a single assembly instruction. The size of the dataflow graph is limited by the number of available execution units and communication resources. To undertake the research, a cycle-accurate simulator of the proposed microarchitecture was designed. Benchmark results for the new microarchitecture were obtained by executing typical DSP programs on the simulator. The properties of the new architecture and the variants of its parameters are discussed in the text. Performance data is given on a per-cycle basis. Because cycle counts cannot be compared without information on the maximum achievable clock frequency, a demonstration machine for the TAD was synthesized in a 0.35 µm CMOS technology. Data for area and maximum clock frequency were extracted from the routed chip design, the latter by way of the extracted wiring capacitances.
