1

An improved method for register file verification

Quan, Tong, August 2009 (has links)
Register file logic verification has historically involved comparing two human-generated logic sources, such as a VHDL code file and a circuit schematic, for logic equivalence. This method is valid in most cases, but it does not account for instances in which both logic sources are equivalent yet incorrect. This report proposes a method to eliminate this problem by testing the logic coherency of the various sources against a golden logic source. The golden logic source is generated by a register file simulation program developed to produce accurate regfile I/O port data. Implementing this simulation program for logic verification eliminates the accuracy problem stated above; in addition, the logic simulation time for the new method is reduced by 36% compared with the former method.
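To make the methodology concrete, the following is a minimal Python sketch of golden-model checking, not the report's actual tool: a behavioral register file acts as the golden source, and two independently derived logic models (e.g., a VHDL simulation and a schematic-level simulation) are each checked against it, so a bug shared by the two human-generated sources can no longer pass as equivalence. All class and function names here are invented for illustration.

```python
class GoldenRegFile:
    """Behavioral reference model: the 'golden' source of regfile I/O data."""
    def __init__(self, num_regs):
        self.regs = [0] * num_regs

    def step(self, write_en, waddr, wdata, raddr):
        # One cycle: optional write, then the read-port output.
        if write_en:
            self.regs[waddr] = wdata
        return self.regs[raddr]

def verify(golden, dut_a, dut_b, stimulus):
    """Both DUT models must match the golden model on every cycle.
    Matching each other is not sufficient: they could share a bug."""
    for cycle, (we, wa, wd, ra) in enumerate(stimulus):
        ref = golden.step(we, wa, wd, ra)
        out_a = dut_a.step(we, wa, wd, ra)
        out_b = dut_b.step(we, wa, wd, ra)
        if out_a != ref or out_b != ref:
            return f"mismatch at cycle {cycle}: ref={ref} a={out_a} b={out_b}"
    return "all cycles match the golden model"

# Stand-ins for the two logic sources; in practice these would wrap
# simulations of the VHDL and the circuit schematic.
stim = [(1, 0, 42, 0), (0, 0, 0, 0), (1, 3, 7, 3)]
print(verify(GoldenRegFile(8), GoldenRegFile(8), GoldenRegFile(8), stim))
```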
2

Fully Distributed Register Files for Heterogeneous Clustered Microarchitectures

Bunchua, Santithorn 09 July 2004 (has links)
Conventional processor designs utilize a central register file and a bypass network to deliver operands to and from the functional units, an approach that cannot scale to a large number of functional units. As more functional units are integrated into a processor, the number of ports on the register file grows linearly, while area, delay, and energy consumption grow even more rapidly; the physical properties of a bypass network scale in a similar manner. In this dissertation, a fully distributed register file organization is presented to overcome this limitation by relying on small register files with fewer ports and localized operand bypasses. Unlike other clustered microarchitectures, each cluster features a small single-issue functional unit coupled with a small local register file. Several clusters are used, and each of them can be different. All register files are connected through a register transfer network that supports multicast communication. Techniques to support distributed register file operation are presented for both dynamically and statically scheduled processors: the eager and multicast register transfer mechanisms in the dynamic approach, and the global data routing with multicasting algorithm in the static approach. Although this organization requires additional cycles to execute a program, the cost is offset by significant savings in area, operand access time, and energy consumption. With a faster operating frequency and a more efficient hardware implementation, overall performance can be improved. Additionally, the fully distributed register file organization is applied to an ILP-SIMD processing element, the major building block of a massively parallel media processor array. The results show a reduction in die area, which can be used to implement additional processing elements; consequently, performance is improved through a higher degree of data parallelism in a larger processor array. In summary, the fully distributed register file architecture permits future processors to scale to a large number of functional units. This is especially desirable in high-throughput designs such as wide-issue and multithreaded processors. Moreover, localized communication is highly desirable in the transition to future deep-submicron technologies, since long wires are a critical issue in processes with extremely small feature sizes.
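As a rough illustration of the multicast register transfer idea (the class names and sizes below are invented, not taken from the dissertation), a value produced in one cluster can be delivered to the local register files of all of its consumer clusters in a single network transaction, instead of one point-to-point copy per consumer:

```python
class Cluster:
    """One cluster: a single-issue functional unit with a small local RF."""
    def __init__(self, name, rf_size=8):
        self.name = name
        self.local_rf = [0] * rf_size

class TransferNetwork:
    """Register transfer network connecting all local register files."""
    def __init__(self, clusters):
        self.clusters = {c.name: c for c in clusters}

    def multicast(self, value, destinations):
        # destinations: (cluster_name, local_reg_index) pairs; one
        # transaction updates several local register files at once.
        for cname, reg in destinations:
            self.clusters[cname].local_rf[reg] = value

clusters = [Cluster("alu0"), Cluster("alu1"), Cluster("mem0")]
net = TransferNetwork(clusters)
net.multicast(99, [("alu1", 2), ("mem0", 5)])  # alu0's result, two consumers
print(clusters[1].local_rf[2], clusters[2].local_rf[5])  # 99 99
```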
3

Register File Organization for Coarse-Grained Reconfigurable Architectures: Compiler-Microarchitecture Perspective

January 2014 (has links)
abstract: Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising fabric for improving the performance and power efficiency of computing devices. CGRAs are composed of components that are well optimized to execute loops, and the rotating register file is one example of such a component. Due to the rotating nature of the register indexes in a rotating register file, it is very challenging, if possible at all, to hold and properly index memory addresses (pointers) and static values. In this thesis, different structures for CGRA register files are investigated and experimentally compared in terms of the performance of mapped applications, design frequency, and area. It is shown that a register file that can be logically partitioned into rotating and non-rotating regions is an excellent choice, because it imposes minimal restrictions on the underlying CGRA mapping algorithm while resulting in efficient resource utilization. / Masters Thesis, Computer Science, 2014
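A minimal sketch of the partitioned indexing this abstract argues for, with all sizes and the rotation policy invented for illustration: register numbers below a configurable boundary rotate with the iteration counter, while registers at or above it are addressed directly, so pointers and static values keep a stable index across iterations.

```python
class PartitionedRF:
    """Register file logically split into rotating and non-rotating regions."""
    def __init__(self, size, rotating_size):
        assert rotating_size <= size
        self.regs = [0] * size
        self.rotating_size = rotating_size  # boundary between the regions
        self.rrb = 0                        # rotating register base

    def physical(self, logical):
        if logical < self.rotating_size:    # rotating region
            return (logical + self.rrb) % self.rotating_size
        return logical                      # non-rotating: direct indexing

    def read(self, r):
        return self.regs[self.physical(r)]

    def write(self, r, v):
        self.regs[self.physical(r)] = v

    def next_iteration(self):
        # Decrementing the base makes last iteration's r(n) visible as r(n+1).
        self.rrb = (self.rrb - 1) % self.rotating_size

rf = PartitionedRF(size=8, rotating_size=4)
rf.write(6, 0xBEEF)   # a pointer, held in the non-rotating region
rf.write(0, 1)        # a loop-carried value in the rotating region
rf.next_iteration()
print(hex(rf.read(6)))  # 0xbeef: unaffected by rotation
print(rf.read(1))       # 1: last iteration's r0, seen as r1 this iteration
```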
4

Register File Size Reduction through Instruction Pre-Execution Incorporating Value Prediction

ANDO, Hideki, TANAKA, Yusuke 01 December 2010 (has links)
No description available.
5

Energy-efficient mechanisms for managing on-chip storage in throughput processors

Gebhart, Mark Alan 05 July 2012 (has links)
Modern computer systems are power or energy limited. While the number of transistors per chip continues to increase, classic Dennard voltage scaling has come to an end. Architects must therefore improve a design's energy efficiency to continue increasing performance at historical rates while staying within a system's power limit. Throughput processors, which use a large number of threads to tolerate memory latency, have emerged as an energy-efficient platform for achieving high performance on diverse workloads and are found in systems ranging from cell phones to supercomputers. This work focuses on graphics processing units (GPUs), which contain thousands of threads per chip. In this dissertation, I redesign the on-chip storage system of a modern GPU to improve energy efficiency. Modern GPUs contain very large register files that consume between 15% and 20% of the processor's dynamic energy. Most values written into the register file are read only a single time, often within a few instructions of being produced. To optimize for these patterns, we explore various designs for register file hierarchies. We study both a hardware-managed register file cache and a software-managed operand register file, and we evaluate the energy tradeoffs in varying the number of levels and the capacity of each level in the hierarchy. Our most efficient design reduces register file energy by 54%. Beyond the register file, GPUs also contain on-chip scratchpad memories and caches; traditional systems have a fixed partitioning between these three structures. Applications have diverse requirements, and often a single resource is most critical to performance. We propose to unify the register file, primary data cache, and scratchpad memory into a single structure that is dynamically partitioned on a per-kernel basis to match the application's needs. The techniques proposed in this dissertation improve the utilization of on-chip memory, a scarce resource for systems with a large number of hardware threads. Making more efficient use of on-chip memory both improves performance and reduces energy. Future efficient systems will be achieved by combining several such techniques.
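As a toy model of the register file cache idea explored here (capacity, policy, and cost accounting are all invented for the example), a small upper level in front of the main register file can capture the write-once/read-once-soon pattern, so the large, expensive structure is touched only on misses and write-backs:

```python
from collections import OrderedDict

class RegFileCache:
    """Tiny LRU register file cache in front of a large main RF."""
    def __init__(self, capacity=6):
        self.cache = OrderedDict()   # reg -> value, in LRU order
        self.capacity = capacity
        self.main_rf = {}            # backing main register file
        self.main_accesses = 0       # crude proxy for energy cost

    def write(self, reg, value):
        if reg in self.cache:
            self.cache.move_to_end(reg)
        elif len(self.cache) >= self.capacity:
            victim, v = self.cache.popitem(last=False)  # evict LRU entry
            self.main_rf[victim] = v                    # write it back
            self.main_accesses += 1
        self.cache[reg] = value      # newly produced values land in the cache

    def read(self, reg):
        if reg in self.cache:        # cheap upper-level hit
            self.cache.move_to_end(reg)
            return self.cache[reg]
        self.main_accesses += 1      # expensive main RF access on a miss
        return self.main_rf[reg]

rfc = RegFileCache()
rfc.write(5, 123)                    # value written...
print(rfc.read(5))                   # ...and read once, shortly after: a hit
print("main RF accesses:", rfc.main_accesses)  # 0
```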
6

A Structured Design Methodology for High Performance VLSI Arrays

January 2012 (has links)
abstract: With the geometric growth in integrated circuit technology due to transistor scaling, together with the system-on-chip design strategy, the complexity of integrated circuits has increased manifold. Short time to market with high reliability and performance is one of the most competitive challenges. Both custom and ASIC design methodologies have evolved over time to cope with this, but the heavy manual labor in custom design and the statistical design in ASIC flows remain causes of concern. This work proposes a new circuit design strategy, focused mostly on arrayed structures such as TLBs, register files, caches, and IPCAMs, that greatly reduces manual effort while making the design regular and repetitive and still achieving high performance. The method builds the complete design as a custom schematic using standard cells, which requires adding some custom cells to the already extensive library to optimize the design for performance. Once the schematic is finalized, the designer places the standard cells in a spreadsheet, placing the cells in the critical paths close together. A Perl script then generates a Cadence Encounter-compatible placement file, and the design is routed in Encounter. Since the designer is the best judge of the circuit architecture, designer-driven placement allows the most optimized design to be achieved. Several designs, including IPCAM, issue logic, TLB, register file, and cache designs, were carried out, and their performance was compared against fully custom and ASIC flows. The TLB, register file, and cache were part of the HEMES microprocessor. / Ph.D. Dissertation, Electrical Engineering, 2012
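As a rough sketch of the flow's scripting step (the original used Perl; this Python version and the output format are purely illustrative and are not Encounter's actual placement syntax), the designer's spreadsheet of cell coordinates can be mechanically converted into placement directives for the router:

```python
import csv
import io

def spreadsheet_to_placement(csv_text):
    """Convert rows of (instance, cell, x, y) into placement directives.
    The output format below is a hypothetical simplification."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        out.append(f"place {row['instance']} {row['cell']} "
                   f"({float(row['x']):.2f}, {float(row['y']):.2f})")
    return "\n".join(out)

# Critical-path cells placed close together by the designer:
sheet = """instance,cell,x,y
u_cmp0,XNOR2_X1,10.0,20.0
u_cmp1,XNOR2_X1,12.5,20.0
u_match,AND4_X2,15.0,20.0
"""
print(spreadsheet_to_placement(sheet))
```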
7

Scalable Register File Architecture for CGRA Accelerators

January 2016 (has links)
abstract: Coarse-grained Reconfigurable Arrays (CGRAs) are promising accelerators capable of accelerating even non-parallel loops and loops with low trip counts. One challenge in compiling for CGRAs is managing both recurring and nonrecurring variables in the register file (RF) of the CGRA. Although prior works have managed recurring variables via a rotating RF, they access the nonrecurring variables through either a global RF or a constant memory; the former does not scale well, and the latter degrades mapping quality. This work proposes a hardware-software codesign approach to manage all the variables in a local, nonrotating RF. The hardware provides a modulo-addition-based indexing mechanism to enable correct addressing of recurring variables in a nonrotating RF. The compiler determines the number of registers required for each recurring variable and configures the boundary between the registers used for recurring and nonrecurring variables; it also pre-loads the read-only variables and constants into the local registers in the prologue of the schedule. Synthesis and place-and-route results for the previous and proposed RF designs show that the proposed solution achieves 17% better cycle time. Experiments mapping several important, performance-critical loops collected from MiBench show that the proposed approach improves performance (through better mapping) by 18% compared with using a constant memory. / Masters Thesis, Computer Science, 2016
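The following sketch illustrates the hardware/software split described in this abstract, with the register counts and the loop entirely made up: the compiler assigns each recurring variable a base address and a window of k registers inside an ordinary non-rotating RF, the hardware indexes the window by modulo addition, and nonrecurring values and pre-loaded constants live above the boundary at stable addresses.

```python
class LocalRF:
    """Non-rotating local RF with modulo-addition indexing for windows."""
    def __init__(self, size):
        self.regs = [0] * size

    def addr(self, base, k, iteration):
        # The k registers starting at `base` hold the last k iterations'
        # values of one recurring variable.
        return base + (iteration % k)

rf = LocalRF(16)
BASE, K = 0, 3            # compiler-chosen: variable 'acc' gets regs 0..2
CONST_REG = 12            # nonrecurring region: a pre-loaded constant
rf.regs[CONST_REG] = 10   # loaded once in the loop prologue

for it in range(5):       # acc[it] = acc[it-1] + 10
    cur = rf.addr(BASE, K, it)
    prev_val = rf.regs[rf.addr(BASE, K, it - 1)] if it > 0 else 0
    rf.regs[cur] = prev_val + rf.regs[CONST_REG]

print(rf.regs[rf.addr(BASE, K, 4)])  # 50
```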
8

Specialized hardware support for automatic generation of loop pipelining in FPGAs / Suporte especializado de hardware para geração automática de loop pipelining em FPGAS

Souza, Guilherme Stefano Silva de 19 November 2014 (has links)
Loop pipelining is a technique that can offer significant performance improvements; it is employed not only in conventional compilation targeting microprocessors but also by High-Level Synthesis (HLS) tools targeting heterogeneous architectures and hardware accelerators. This work presents specialized hardware structures designed to solve the value-overlap problem that arises in loop pipelining, to ease compilation tasks for HLS tools, and to reduce code repetition, with potential gains in execution performance and total silicon area. Two specialized hardware modules are presented: a queue-based register file and a control module for predicated instruction execution.
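To illustrate why FIFO semantics help with the value-overlap problem (the interface below is invented; the dissertation's actual module is a hardware design), note that with loop pipelining iteration i+1 may produce a value before iteration i has consumed its own, so a single register would be overwritten. Giving the register queue semantics lets several in-flight values coexist without register renaming or code duplication:

```python
from collections import deque

class QueueRegister:
    """A 'register' with FIFO semantics, as in a queue-based register file."""
    def __init__(self, depth=4):
        self.q = deque()
        self.depth = depth

    def push(self, value):
        # Producer stage writes; a full queue would stall the producer.
        assert len(self.q) < self.depth, "queue full: stall producer"
        self.q.append(value)

    def pop(self):
        # Consumer stage reads values strictly in production order.
        return self.q.popleft()

qr = QueueRegister()
# Two overlapped iterations produce before the first consumer fires:
qr.push("iter0 value")
qr.push("iter1 value")
print(qr.pop())   # iter0 value -- nothing was overwritten
print(qr.pop())   # iter1 value
```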
9

Register Caching for Energy Efficient GPGPU Tensor Core Computing / Registrera cachelagring för energieffektiv GPGPU Tensor Core Computing

Qian, Qiran January 2023 (has links)
The General-Purpose GPU (GPGPU) has emerged as the predominant computing device for extensive parallel workloads in the fields of Artificial Intelligence (AI) and Scientific Computing, primarily owing to its Single Instruction Multiple Thread architecture, which not only provides a wealth of thread context but also effectively hides the latencies exposed in single-thread execution. As computational demands have evolved, modern GPGPUs have incorporated specialized matrix engines, e.g., NVIDIA's Tensor Core (TC), to deliver substantially higher throughput for dense matrix computations than traditional scalar or vector architectures. Beyond mere throughput, energy efficiency is a pivotal concern in GPGPU computing. The register file is the largest memory structure on the GPGPU die and typically accounts for over 20% of the dynamic power consumption. To enhance energy efficiency, GPGPUs incorporate a technique named register caching, borrowed from the realm of CPUs: register caching captures temporal locality among register operands to reduce energy consumption within a two-level register file structure. The presence of TC raises new challenges for Register Cache (RC) design, as each matrix instruction imposes intensive operand-delivery traffic on the register file banks. In this study, we delve into the RC design trade-offs in GPGPUs, undertaking a comprehensive exploration of the design space across a range of workloads. Our experiments not only reveal the basic design considerations of RC but also clarify that conventional caching strategies underperform, particularly when dealing with TC computations, primarily due to poor temporal locality and the substantial register operand traffic involved. Based on these findings, we propose an enhanced caching strategy featuring a look-ahead allocation policy to minimize unnecessary cache allocations for destination register operands. Furthermore, to leverage the energy efficiency of Tensor Core computing, we highlight an alternative instruction scheduling framework for Tensor Core instructions that collaborates with a specialized caching policy, resulting in a reduction of up to 50% in dynamic energy consumption within the register file during Tensor Core GEMM computations.
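A hedged sketch of the look-ahead allocation policy (the window size and instruction encoding are invented for the example): a destination operand gets a register-cache entry only if some instruction within the next W instructions reads it; otherwise the result bypasses the cache and goes straight to the main register file, avoiding an allocation that would never be reused.

```python
LOOKAHEAD_W = 4  # how many upcoming instructions to inspect

def should_allocate(dst_reg, instr_stream, pc):
    """instr_stream: list of (dst, [srcs]); pc: index of current instruction.
    Allocate a register cache entry only if dst_reg is read in the window."""
    window = instr_stream[pc + 1 : pc + 1 + LOOKAHEAD_W]
    return any(dst_reg in srcs for _, srcs in window)

# Toy instruction stream: (destination register, source registers).
stream = [
    ("r1", ["r0", "r0"]),   # pc 0: r1 produced...
    ("r2", ["r1", "r0"]),   # pc 1: ...and read immediately
    ("r3", ["r0", "r0"]),   # pc 2: r3 is never read in the window
    ("r4", ["r2", "r1"]),
    ("r5", ["r4", "r4"]),
]
print(should_allocate("r1", stream, 0))  # True:  cache r1
print(should_allocate("r3", stream, 2))  # False: write r3 to the main RF
```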
10

The transactional HW/SW stack for fault tolerant embedded computing / Pilha HW/SW transacional para computação embarcada tolerante a falhas

Ferreira, Ronaldo Rodrigues January 2015 (has links)
Fault tolerance implementation in embedded systems is challenging because of the physical constraints of area occupation, power dissipation, and energy consumption in these systems. The need to optimize these three physical constraints while computing within the available performance goals and real-time deadlines creates a conundrum that is hard to solve. Classical fault tolerance solutions such as triple and dual modular redundancy are not feasible due to their high power overhead or their lack of efficient and deterministic error recovery. Existing techniques, although some of them reduce the power and area overhead, incur heavy performance penalties and most of the time do not assume a feasible fault model. This dissertation introduces the Transactional HW/SW Stack, or simply Stack, to efficiently manage the area, power, fault-coverage, and performance conundrum. The Stack introduces a new compilation strategy that assembles programs into Transactional Basic Blocks (TBBs), together with a novel microprocessor, the TransactiOnal Basic Block Architecture (ToBBA), which provides fine-grained error detection and deterministic error rollback and elimination, using the TBBs both as a container for errors and as a small unit of data checkpointing. Two solutions to sustain the TBB semantics in hardware are introduced: one software-based and one hardware-based. The Stack's area, power, performance, and coverage were evaluated using ToBBA's hardware implementation model. The Stack attains an error correction coverage of 99.35% with a power overhead of 2.05 and an area overhead of 2.65. It presents a performance overhead of 1.33 or 1.54, depending on the hardware model adopted to support the TBB semantics.
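The following is a simplified software model of the TBB rollback idea (the fault model and detection mechanism here are stand-ins, not ToBBA's hardware): architectural state is checkpointed at each transactional basic block boundary, and if an error is detected before commit, the block is discarded and re-executed, so faults never escape the block that produced them.

```python
import copy
import random

def fault_detected():
    # Stand-in for the hardware's error detection: a transient fault
    # is observed during the block with some probability.
    return random.random() < 0.3

def run_tbb(state, block):
    """Execute one transactional basic block with rollback on error."""
    checkpoint = copy.deepcopy(state)   # cheap here; dedicated HW in ToBBA
    while True:
        trial = copy.deepcopy(checkpoint)
        for op in block:
            op(trial)                   # speculatively execute the block
        if not fault_detected():
            return trial                # commit: the TBB becomes visible
        # error contained inside the TBB: discard `trial` and re-execute

random.seed(1)
state = {"r0": 0}
block = [lambda s: s.__setitem__("r0", s["r0"] + 5)]
state = run_tbb(state, block)
print(state["r0"])  # 5, regardless of transient faults during execution
```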
