1

A Superscalar Dual Core Architecture for ARM9 ISA

Chou, Yu-Liang 21 August 2005 (has links)
The design of recent embedded processors must deliver high performance in addition to low power consumption and manageable design complexity. With advanced logic CMOS technology, integrating multiple processor cores on a single chip has become the prevailing way to improve performance. This thesis presents a superscalar dual-core architecture for the ARM9 ISA, targeted at embedded processors. The processor offers three operation modes: single-core mode, superscalar mode, and multithreading mode. It can switch dynamically among the three modes via new extended instructions while executing a program. In superscalar mode, the designer can use the processor without changing any settings of the original environment. According to our simulation results, the superscalar dual-core architecture achieves a 52% performance speedup when executing an MPEG2 decoder trace, and an average 41% speedup, compared with a five-stage pipelined ARM9 architecture.
2

Compiler-directed energy savings in superscalar processors

Jones, Timothy M. January 2006 (has links)
Superscalar processors contain large, complex structures to hold data and instructions as they wait to be executed. However, many of these structures consume large amounts of energy, making them hotspots requiring sophisticated cooling systems. With the trend towards larger, more complex processors, this will become more of a problem, with important implications for future technology. This thesis uses compiler-based optimisation schemes to target the issue queue and register file, two of the most energy-consuming structures in the processor. The algorithms and hardware techniques developed in this work dynamically adapt the processor's resources to changing program phases, turning off parts of each structure when they are unused to save dynamic and static energy. To optimise the issue queue, the compiler analysis tracks data dependences through each program procedure. It identifies the critical path through each program region and informs the hardware of the minimum number of queue entries required to prevent it from slowing down. This reduces the occupancy of the queue and increases the opportunities to save energy. With just a 1.3% performance loss, 26% dynamic and 32% static energy savings are achieved. Registers can be idle for many cycles after they are last read, before they are released and put back on the free-list to be reused by another instruction. Alternatively, they can be turned off for energy savings. Early register releasing can be used to perform this operation sooner than usual, but hardware schemes must wait for the instruction redefining the relevant logical register to enter the pipeline. This thesis presents an exploration of compiler-directed early register releasing: based on simple data-flow and liveness analysis, the compiler can exactly identify the last use of each register and pass that information to the hardware. The best scheme achieves 15% dynamic and 19% static energy savings. Finally, the issue queue limiting and early register releasing schemes are combined for energy savings in both processor structures. Four different configurations are evaluated, bringing 25% to 31% dynamic and 19% to 34% static issue queue energy savings, and reductions of 18% to 25% dynamic and 20% to 21% static energy in the register file.
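The last-use identification Jones describes rests on a standard compiler building block: liveness analysis over a basic block. A minimal sketch follows, assuming a simplified two-source instruction format; the structure and function names are hypothetical illustrations, not taken from the thesis.

```c
#include <stdbool.h>
#include <string.h>

#define NUM_REGS 32

typedef struct {
    int  dest;         /* destination logical register, -1 if none */
    int  src[2];       /* source logical registers, -1 if unused   */
    bool last_use[2];  /* filled in by the analysis                */
} Insn;

/* Single backward pass over a basic block: for each instruction,
 * mark which source operands read their register for the last time.
 * Scanning backwards, a source is a last use exactly when no later
 * instruction reads that register before it is redefined. */
static void mark_last_uses(Insn *block, int n)
{
    bool read_later[NUM_REGS];   /* is r read after the scan point? */
    memset(read_later, 0, sizeof read_later);

    for (int i = n - 1; i >= 0; i--) {
        /* A write kills the old value: later reads see the new one. */
        if (block[i].dest >= 0)
            read_later[block[i].dest] = false;

        for (int s = 0; s < 2; s++) {
            int r = block[i].src[s];
            if (r < 0)
                continue;
            block[i].last_use[s] = !read_later[r];
            read_later[r] = true;
        }
    }
}
```

A compiler pass like this could annotate each last use so the hardware releases the register immediately, rather than waiting for the redefining instruction to enter the pipeline.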
3

Revisiting Wide Superscalar Microarchitecture

Mondelli, Andrea 12 September 2017 (has links)
For several decades, the clock frequency of general-purpose processors grew thanks to faster transistors and microarchitectures with deeper pipelines. However, about 10 years ago, technology hit leakage-power and temperature walls, and since then the clock frequency of high-end processors has not increased. Instead of raising the clock frequency, processor makers integrated more cores on a single chip, enlarged the cache hierarchy, and improved energy efficiency. Putting more cores on a single chip has increased total chip throughput and benefits applications with thread-level parallelism. However, most applications have low thread-level parallelism, so having more cores is not sufficient; it is also important to accelerate individual threads. Moreover, reducing energy consumption has become a major objective when designing a high-performance microarchitecture. Some microarchitecture features have been introduced in superscalar cores mainly to reduce energy. One example is the loop buffer, now implemented in several superscalar microarchitectures, whose purpose is to save energy in the core's front-end (instruction cache, branch predictor, decoder, etc.) when executing a loop with a body small enough to fit in the buffer. If the clock frequency remains constant, the only possibility left for higher single-thread performance in future processors is to exploit more ILP. Certain microarchitecture improvements (e.g., a better branch predictor) improve performance and energy efficiency simultaneously. In general, however, exploiting more ILP has a cost in silicon area, energy consumption, design effort, etc. Therefore, the microarchitecture is modified slowly and incrementally, taking advantage of technology scaling; indeed, processor makers have made continuous efforts to exploit more ILP, with better branch predictors, better data prefetchers, larger instruction windows, more physical registers, and so forth. In this thesis, we try to depict what future superscalar cores may look like in 10 years and explore the possibility of exploiting loop behaviors to reduce energy consumption beyond the front-end. Some propositions have been published for loop accelerators or for unconventional superscalar core back-ends. I argue that the instruction window and the issue width can be augmented by combining clustering and register write specialization. A major difference from past research on clustered microarchitectures is that I assume wide-issue clusters, whereas past research mostly focused on narrow-issue clusters. Going from narrow-issue to wide-issue clusters is not just a quantitative change; it has a qualitative impact on the clustering problem, in particular on the steering policy. In the second part of this thesis, we propose two independent and orthogonal energy optimizations exploiting loops. The first optimization detects redundant micro-ops producing the same result on every iteration and removes these micro-ops completely.
The second optimization focuses on the energy consumed by load micro-ops, detecting situations where a load does not need to access the store queue or does not need to access the level-1 data cache.
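The redundant micro-op optimization can be pictured as a small table that watches each static micro-op across loop iterations and flags those whose result never changes. The sketch below is an illustrative model under assumed table sizes and a stability threshold of my own choosing, not the mechanism as specified in the thesis.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE   64   /* tracked static micro-ops (illustrative) */
#define STABLE_LIMIT 4    /* iterations before a result counts as stable */

typedef struct {
    uint64_t pc;        /* static micro-op address            */
    uint64_t result;    /* result seen on the last iteration  */
    unsigned stable;    /* consecutive identical results      */
    bool     valid;
} RedundancyEntry;

static RedundancyEntry table[TABLE_SIZE];

/* Called once per executed micro-op while a loop is being captured.
 * Returns true once the micro-op has produced the same result often
 * enough that later iterations may skip executing it and read the
 * cached value instead. */
static bool observe_result(uint64_t pc, uint64_t result)
{
    RedundancyEntry *e = &table[pc % TABLE_SIZE];

    if (!e->valid || e->pc != pc) {       /* new micro-op: (re)allocate */
        e->pc = pc;
        e->result = result;
        e->stable = 0;
        e->valid = true;
        return false;
    }
    if (e->result == result) {
        if (e->stable < STABLE_LIMIT)
            e->stable++;
    } else {
        e->result = result;               /* result changed: restart */
        e->stable = 0;
    }
    return e->stable >= STABLE_LIMIT;
}
```

Any real implementation would also need to invalidate entries when the loop exits or when an input of the micro-op changes; the sketch only shows the detection idea.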
4

Parallel Instruction Decoding for DSP Controllers with Decoupled Execution Units

Pettersson, Andreas January 2019 (has links)
Applications run on embedded processors are constantly evolving. They are, for the most part, growing more complex, and the processors have to increase their performance to keep up. This thesis investigates an embedded DSP SIMT processor with decoupled execution units. A SIMT processor exploits the parallelism gained from issuing instructions to functional units or to decoupled execution units; in its basic form, only a single instruction is issued per cycle. If the control of the decoupled execution units becomes too fine-grained, or if the control burden of the master core grows sufficiently high, the fetching and decoding of instructions can become a bottleneck of the system. This thesis investigates how to parallelize the instruction fetch, decode, and issue process. Traditional parallel fetch and decode methods in superscalar and VLIW architectures are examined, and the benefits and drawbacks of the two are presented and discussed. One superscalar design and one VLIW design are implemented in RTL, and their costs and performance are compared using a benchmark program and synthesis. Both the superscalar and the VLIW designs outperform a baseline scalar processor as expected, with the VLIW design performing slightly better than the superscalar design. The VLIW design can also achieve a higher clock frequency, with an area comparable to that of the superscalar design. The thesis further investigates how instructions can be encoded to lower decode complexity and increase the speed of issue to decoupled execution units. A number of possible encodings are proposed and discussed. Simulations show that these encodings can considerably lower the time spent issuing to decoupled execution units.
5

Design of the Superscalar Dual-Core Architecture using Single-Issue Out-of-Order Instruction Pipe for Embedded System

Lai, Yu-ren 29 July 2009 (has links)
With improvements in VLSI technology, realizing multiple processor cores on a single chip has become easier, and more and more users run their applications on multi-core architectures. A multi-core system performs brilliantly on multi-threaded applications, but it gains nothing on single-threaded ones. This paper proposes a multi-core architecture for enhancing single-threaded performance in embedded systems, focusing on four points: 1. Construct a simple out-of-order execution core. 2. Design a dynamically scheduled instruction analyzer. 3. Design a mechanism for sharing operands between two cores. 4. Design a mechanism for committing instructions synchronously between two cores. Each core is a single-issue out-of-order instruction pipeline. The instruction analyzer first fetches instructions and generates dependence tags by detecting the dependencies among the fetched instructions, then schedules the instructions dynamically and dispatches them to the cores. Within a core, an instruction knows where to obtain its required operands from the information in its tags; this mechanism enables data to be shared between the two cores. Instructions execute in a data-driven fashion but complete in order to maintain the correctness of the program order. Based on the ARM instruction set, this paper explores control mechanisms for interaction between the two cores and ways to accelerate a single thread on the dual-core architecture. We wrote a simulation model of the proposed architecture in C as our trace-driven simulation framework and selected the MediaBench suite for the experiments. According to the simulation results, the architecture achieves an average 40% performance speedup compared with a five-stage pipelined architecture.
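The analyzer's tag generation can be sketched as follows: for each fetched instruction, record which earlier in-flight instruction produces each source operand, so a consumer knows whether to read the register file or wait on a producer (possibly on the other core). The two-source format and all names are assumptions for illustration, not the paper's actual design.

```c
#define NUM_REGS 16   /* ARM logical registers */

typedef struct {
    int dest;        /* destination register, -1 if none           */
    int src[2];      /* source registers, -1 if unused             */
    int src_tag[2];  /* index of the producing instruction in the
                        window, or -1 if the value is already in
                        the register file                          */
} Insn;

/* Generate dependence tags for a fetched window of n instructions:
 * each source is tagged with the most recent earlier instruction
 * that writes its register.  A consumer whose producer was
 * dispatched to the other core must fetch the operand through the
 * inter-core sharing mechanism rather than local forwarding. */
static void generate_tags(Insn *win, int n)
{
    int last_writer[NUM_REGS];
    for (int r = 0; r < NUM_REGS; r++)
        last_writer[r] = -1;

    for (int i = 0; i < n; i++) {
        for (int s = 0; s < 2; s++) {
            int r = win[i].src[s];
            win[i].src_tag[s] = (r >= 0) ? last_writer[r] : -1;
        }
        if (win[i].dest >= 0)
            last_writer[win[i].dest] = i;   /* i now produces dest */
    }
}
```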
6

Generating RTL for microprocessors from architectural and microarchitectural description

Bansal, Ankit Sajjan Kumar 17 June 2011 (has links)
Designing a modern processor is a very complex task. Writing the entire design in a hardware description language (like Verilog) is time-consuming and difficult to verify. There exists a split architecture/microarchitecture description technique in which the description of any hardware can be divided into two orthogonal descriptions: (a) an architectural contract between the user and the implementation, and (b) a microarchitecture that describes the implementation of the architecture. The main aim of this thesis is to build realistic processors using this technique. We have designed an in-order and an out-of-order superscalar processor using the split-description compiler. The backend of this compiler is another contribution of this thesis.
7

A platform to evaluate the fault sensitivity of superscalar processors

Tonetto, Rafael Billig January 2017 (has links)
Aggressive transistor scaling, which has brought reductions in operating voltage, has provided enormous benefits in computational power while keeping energy consumption at an acceptable level. However, as feature sizes and voltages shrink, susceptibility to faults tends to increase, and fault evaluation grows in importance. Superscalar processors, which dominate today's market, are a significant example of systems that benefit from these technological improvements and are more susceptible to errors. There are several methods of fault injection, which is an efficient means of evaluating the resilience of these processors. However, traditional fault-injection methods, such as hardware-based techniques, require the processor to be physically implemented before tests can be conducted and do not provide reasonable levels of controllability. Techniques based on software-implemented simulators, on the other hand, offer high controllability. Yet while high-level SW simulators (which are fast) can lead to an incomplete, or even misleading, evaluation of system resilience, since they do not model internal hardware components (such as pipeline registers), low-level SW simulators are extremely slow and are rarely available at RTL (Register-Transfer Level). Given this scenario, we propose a platform that fills the gap between the HW and SW approaches for evaluating faults in superscalar processors: it is fast, highly controllable, available in software, flexible and, most importantly, models the processor at RTL. The tool was implemented on top of the platform used to generate The Berkeley Out-of-Order Machine (BOOM), a highly scalable and parameterizable superscalar processor. This property allowed us to experiment with three different processor configurations, single-, dual-, and quad-issue, and by analyzing how fault resilience is influenced by the complexity of different processors, we used the processors to validate our tool.
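At its core, a fault-injection campaign of this kind amounts to drawing a random injection site and flipping one bit of the targeted state. A minimal illustration follows; the real platform drives an RTL simulation of BOOM, whereas this sketch only models drawing the fault and the bit-flip itself, with hypothetical names throughout.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    unsigned long cycle;   /* simulated cycle at which to inject */
    unsigned reg_index;    /* which observable register to hit   */
    unsigned bit;          /* which bit of that register to flip */
} Fault;

/* Draw one uniformly random fault for a campaign over `cycles`
 * simulated cycles, `nregs` observable registers of `width` bits. */
static Fault draw_fault(unsigned long cycles, unsigned nregs, unsigned width)
{
    Fault f;
    f.cycle     = (unsigned long)rand() % cycles;
    f.reg_index = (unsigned)rand() % nregs;
    f.bit       = (unsigned)rand() % width;
    return f;
}

/* Model a single-event upset: flip the chosen bit of the register's
 * current value when the injection cycle is reached. */
static uint64_t inject_bit_flip(uint64_t value, unsigned bit)
{
    return value ^ (1ULL << bit);
}
```

A campaign then repeats this per run: draw a fault, simulate to `f.cycle`, apply `inject_bit_flip` to the chosen register, and compare the program's final outcome against a fault-free golden run.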
8

Architectural Enhancements for Color Image and Video Processing on Embedded Systems

Kim, Jongmyon 21 April 2005 (has links)
As emerging portable multimedia applications demand more and more computational throughput with limited energy consumption, the need for high-efficiency, high-throughput embedded processing is becoming an important challenge in computer architecture. In this regard, this dissertation addresses application-, architecture-, and technology-level issues in existing processing systems to provide efficient processing of multimedia in many, or ideally all, of its forms. In particular, it explores color imaging in multimedia while focusing on two architectural enhancements for memory- and performance-hungry embedded applications: (1) a pixel-truncation technique and (2) a color-aware instruction set (CAX) for embedded multimedia systems. The pixel-truncation technique differs from previous techniques (e.g., 4:2:2 and 4:2:0 subsampling) used in image and video compression applications (e.g., JPEG and MPEG) in that it reduces the information content in individual pixel word sizes rather than in each dimension. It thus drastically reduces the bandwidth and memory required to transport and store color images without perceivable distortion in color, while maintaining the pixel storage format of color image processing, in which each pixel computation is performed simultaneously on 3-D YCbCr components, widely used in the image and video processing community. CAX supports parallel operations on two-packed 16-bit (6:5:5) YCbCr data in a 32-bit datapath processor, providing greater concurrency and efficiency for processing color image sequences. This dissertation presents the impact of CAX on processing performance and on both area and energy efficiency for color imaging applications in three major processor architectures: dynamically scheduled (superscalar), statically scheduled (very long instruction word, VLIW), and embedded single-instruction multiple-data (SIMD) array processors. Unlike typical multimedia extensions, CAX obtains substantial performance and code density improvements through direct support for color data processing rather than depending solely on generic subword parallelism. In addition, the ability to reduce the data format size cuts system cost, and the reduction in data bandwidth simplifies system design. In summary, CAX, coupled with the pixel-truncation technique, provides an efficient mechanism that meets the computational requirements and cost goals of future embedded multimedia products.
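The 6:5:5 truncation is easy to make concrete: each pixel drops from 24 bits to 16, and two truncated pixels share one 32-bit word, which is the layout a color-aware instruction can then operate on in parallel. The sketch below assumes a particular bit ordering (Y in the high bits of each halfword), which the dissertation does not necessarily use.

```c
#include <stdint.h>

/* Truncate one 8-bit-per-component YCbCr pixel to 6:5:5 format:
 * Y keeps its 6 most significant bits, Cb and Cr keep 5 each,
 * giving 16 bits per pixel instead of 24. */
static uint16_t pack_655(uint8_t y, uint8_t cb, uint8_t cr)
{
    return (uint16_t)(((y  >> 2) << 10) |   /* bits 15..10: Y  */
                      ((cb >> 3) <<  5) |   /* bits  9..5 : Cb */
                       (cr >> 3));          /* bits  4..0 : Cr */
}

/* Two truncated pixels fit one 32-bit datapath word. */
static uint32_t pack_pair(uint16_t p0, uint16_t p1)
{
    return ((uint32_t)p1 << 16) | p0;
}
```

With pixels stored this way, a single 32-bit operation touches two complete YCbCr pixels at once, which is the source of the concurrency CAX exploits.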
9

Study of the Hyperscalar Multi-core Architecture

Chou, Yu-Liang 07 September 2011 (has links)
Current trends in processor design have migrated toward chip multiprocessors (CMPs). CMPs are designed to exploit both instruction-level parallelism (ILP) within processors and thread-level parallelism (TLP) within and across processors. However, the conventional design of current CMPs forces a choice between high single-thread performance and high peak throughput; this inability to adjust to varying levels of ILP and TLP results in processor inefficiency. To cope with this dilemma, this dissertation proposes the hyperscalar concept for multi-core designs. The hyperscalar concept enables a multi-core architecture to dynamically group many scalar in-order cores into a superscalar processor to accelerate a sequential thread. The reconfiguration feature of the hyperscalar architecture gives it high flexibility in adapting to different types of applications, providing high single-thread performance when TLP is low and high throughput when TLP is high. Based on this concept, the dissertation first proposes a hyperscalar dual-core architecture that can play three different roles: a 2-issue statically scheduled superscalar processor, a homogeneous dual-core processor, or a standalone single-core processor. An Instruction-dependency Analyzer (IA) connecting the two scalar in-order cores handles the role switching, and its design lets the two cores work together like a 2-issue statically scheduled superscalar processor. The IA dispatches instructions with data dependencies to the same core so that the dependencies can be resolved by the core's existing forwarding paths. Simulation results show that when the proposed architecture works in a statically scheduled superscalar manner, it achieves 30.3% higher instructions per cycle (IPC) than a traditional five-stage pipelined core across 35 benchmarks from the MiBench suite. The increases in area and power for extending a homogeneous dual-core processor to a hyperscalar dual-core processor are only 1.8% and 1.75%, respectively, in 90nm CMOS technology. The dissertation then extends the design to a hyperscalar multi-core architecture capable of flexibly providing high throughput for uniform parallel applications as well as high performance for more general workloads: it can dynamically unite many scalar cores into a larger out-of-order (OOO) superscalar processor to accelerate a thread. To accomplish this, the Virtual Shared Register File (VSRF) concept was proposed so that the instructions of a thread running on different cores logically see a uniform register file. Simulation results show that the 2-, 4-, 8-, 16-, and 32-core-united configurations of the hyperscalar multi-core architecture achieve 95%, 84%, 82%, 85%, and 90% of the performance of monolithic 2-, 4-, 8-, 16-, and 32-issue OOO superscalar processors on the SPEC2000 benchmarks. Finally, the dissertation proposes a new technology, called multi-streaming SIMD, applicable to the hyperscalar architecture for efficiently exploiting data-level parallelism (DLP). Multi-streaming SIMD enables current multimedia extensions to manipulate multiple data streams simultaneously. Simulation results show that a multi-streaming SIMD computing engine with four 4-register multimedia operation storage units provides a 3.3x to 5.5x performance improvement over traditional MMX extensions on twelve multimedia kernels. Together, these contributions realize a promising architecture for future multi-core designs.
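A minimal model of the IA's steering decision might look like the following: an instruction whose source operand is still being produced in-flight is sent to the producer's core so the existing forwarding paths resolve the dependence, while independent instructions are balanced round-robin. This is an illustrative reconstruction under assumed structures, not the dissertation's actual IA logic.

```c
#define NUM_REGS  16   /* ARM logical registers */
#define NUM_CORES 2

typedef struct {
    int dest;      /* destination register, -1 if none */
    int src[2];    /* source registers, -1 if unused   */
} Insn;

static int producer_core[NUM_REGS]; /* -1 = value already committed */
static int next_core;

static void steer_init(void)
{
    for (int r = 0; r < NUM_REGS; r++)
        producer_core[r] = -1;
    next_core = 0;
}

/* Pick the core for an instruction.  If a source is still being
 * produced by an in-flight instruction, follow it to that core
 * (when sources conflict, this sketch simply takes the last one);
 * otherwise round-robin for load balance. */
static int steer(const Insn *in)
{
    int core = -1;
    for (int s = 0; s < 2; s++) {
        int r = in->src[s];
        if (r >= 0 && producer_core[r] >= 0)
            core = producer_core[r];      /* follow the dependence */
    }
    if (core < 0) {
        core = next_core;                 /* independent: balance  */
        next_core = (next_core + 1) % NUM_CORES;
    }
    if (in->dest >= 0)
        producer_core[in->dest] = core;   /* record new producer   */
    return core;
}
```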
10

RST: Reuse through Speculation on Traces

Pilla, Mauricio Lima January 2004 (has links)
In this thesis, we present a novel approach to combine both reuse and prediction of dynamic sequences of instructions, called Reuse through Speculation on Traces (RST). Our technique allows the dynamic identification of instruction traces that are redundant or predictable, and the reuse (speculative or not) of these traces. RST addresses an issue present in Dynamic Trace Memoization (DTM): traces not being reused because some of their inputs are not ready for the reuse test. These traces were measured to be 69% of all reusable traces in previous studies. One of the main advantages of RST over simply combining a value prediction technique with an unrelated reuse technique is that RST does not require extra tables to store the values to be predicted; applying reuse and value prediction in unrelated but simultaneous mechanisms may require a prohibitive amount of table storage. In RST, the values are already stored in the Trace Memoization Table, and there is no extra cost in reading them compared with a non-speculative trace reuse technique.
The input context of each trace (the input values of all instructions in the trace) already stores the values for the reuse test, and these may also be used for prediction. Our main contributions include: (i) a speculative trace reuse framework that can be adapted to different processor architectures; (ii) specification of the modifications needed in a superscalar, superpipelined processor to implement our mechanism; (iii) a study of implementation issues related to this architecture; (iv) a study of the performance limits of our technique; (v) a performance study of a realistic, constrained implementation of RST; and (vi) simulation tools, representing a superscalar, superpipelined processor in detail, that can be used in other studies. In a constrained architecture with realistic confidence estimation, our RST technique achieves average speedups (harmonic means) of 1.29 over the baseline architecture without reuse and 1.09 over a non-speculative trace reuse technique (DTM).
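The reuse test at the heart of RST can be sketched as a comparison between the memoized input context and the current register values, with unready inputs turning reuse into speculation that must later be validated. The following is an illustrative model with assumed structures and sizes, not the implementation evaluated in the thesis.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_IO 8

typedef struct {
    int      in_reg[MAX_IO];   /* input registers of the trace  */
    uint64_t in_val[MAX_IO];   /* memoized input context        */
    int      n_in;
    int      out_reg[MAX_IO];  /* output registers              */
    uint64_t out_val[MAX_IO];  /* memoized results              */
    int      n_out;
} TraceEntry;

/* Non-speculative reuse requires every memoized input to match the
 * current architectural value.  RST additionally allows a trace to
 * be reused speculatively when some inputs are not yet ready: the
 * memoized values serve as predictions, validated when the inputs
 * finally arrive (squashing the trace on a mismatch). */
static bool try_reuse(const TraceEntry *t, const uint64_t regs[],
                      const bool ready[], bool *speculative)
{
    *speculative = false;
    for (int i = 0; i < t->n_in; i++) {
        int r = t->in_reg[i];
        if (!ready[r])
            *speculative = true;   /* predict: assume it will match */
        else if (regs[r] != t->in_val[i])
            return false;          /* definite mismatch: no reuse   */
    }
    return true;                   /* write out_val to out_reg      */
}
```

On success, the processor writes the memoized outputs and skips executing the trace body; a speculative reuse additionally holds the trace in a validation window until the late inputs confirm the memoized context.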
