Global ETD Search

1	Language extensions for array processor and multi-processor configurations Orr, Rodney Alister January 1986 (has links) No description available. 621.39 Parallel processor design
2	Analysis of Hardware Sorting Units in Processor Design Furlan, Carmelo C. 01 June 2019 (has links) Sorting is often computationally intensive and can cause the application in which it is used to run slowly. To date, the quickest software sorting implementations for an N element sorting problem runs at O(nlogn). Current techniques, beyond developing better algorithms, used to accelerate sorting include the use of multiple processors or moving the sorting operation to a GPU. The use of multiple processors or a GPU can lead to increased energy consumption and heat produced by the device as compared to a single-core GPU-less implementation. To address these problems, specialized instructions and hardware units can be added to the processors to accelerate the sorting operation directly. This thesis studies and records the performance implications from implementing a sorting accelerator into a modern RISC-V processor pipeline. This thesis also explores the additional energy and area costs of implementing such hardware units in the processor. Sorting Units Processor Design
3	Efficient design-space exploration of custom instruction-set extensions Zuluaga, Marcela January 2010 (has links) Customization of processors with instruction set extensions (ISEs) is a technique that improves performance through parallelization with a reasonable area overhead, in exchange for additional design effort. This thesis presents a collection of novel techniques that reduce the design effort and cost of generating ISEs by advancing automation and reconfigurability. In addition, these techniques maximize the perfomance gained as a function of the additional commited resources. Including ISEs into a processor design implies development at many levels. Most prior works on ISEs solve separate stages of the design: identification, selection, and implementation. However, the interations between these stages also hold important design trade-offs. In particular, this thesis addresses the lack of interaction between the hardware implementation stage and the two previous stages. Interaction with the implementation stage has been mostly limited to accurately measuring the area and timing requirements of the implementation of each ISE candidate as a separate hardware module. However, the need to independently generate a hardware datapath for each ISE limits the flexibility of the design and the performance gains. Hence, resource sharing is essential in order to create a customized unit with multi-function capabilities. Previously proposed resource-sharing techniques aggressively share resources amongst the ISEs, thus minimizing the area of the solution at any cost. However, it is shown that aggressively sharing resources leads to large ISE datapath latency. Thus, this thesis presents an original heuristic that can be parameterized in order to control the degree of resource sharing amongst a given set of ISEs, thereby permitting the exploration of the existing implementation trade-offs between instruction latency and area savings. In addition, this thesis introduces an innovative predictive model that is able to quickly expose the optimal trade-offs of this design space. Compared to an exhaustive exploration of the design space, the predictive model is shown to reduce by two orders of magnitude the number of executions of the resource-sharing algorithm that are required in order to find the optimal trade-offs. This thesis presents a technique that is the first one to combine the design spaces of ISE selection and resource sharing in ISE datapath synthesis, in order to offer the designer solutions that achieve maximum speedup and maximum resource utilization using the available area. Optimal trade-offs in the design space are found by guiding the selection process to favour ISE combinations that are likely to share resources with low speedup losses. Experimental results show that this combined approach unveils new trade-offs between speedup and area that are not identified by previous selection techniques; speedups of up to 238% over previous selection thecniques were obtained. Finally, multi-cycle ISEs can be pipelined in order to increase their throughput. However, it is shown that traditional ISE identification techniques do not allow this optimization due to control flow overhead. In order to obtain the benefits of overlapping loop executions, this thesis proposes to carefully insert loop control flow statements into the ISEs, thus allowing the ISE to control the iterations of the loop. The proposed ISEs broaden the scope of instruction-level parallelism and obtain higher speedups compared to traditional ISEs, primarily through pipelining, the exploitation of spatial parallelism, and reducing the overhead of control flow statements and branches. A detailed case study of a real application shows that the proposed method achieves 91% higher speedups than the state-of-the-art, with an area overhead of less than 8% in hardware implementation. 004.1
4	Energy efficient branch prediction Hicks, Michael Andrew January 2010 (has links) Energy efficiency is of the utmost importance in modern high-performance embedded processor design. As the number of transistors on a chip continues to increase each year, and processor logic becomes ever more complex, the dynamic switching power cost of running such processors increases. The continual progression in fabrication processes brings a reduction in the feature size of the transistor structures on chips with each new technology generation. This reduction in size increases the significance of leakage power (a constant drain that is proportional to the number of transistors). Particularly in embedded devices, the proportion of an electronic product’s power budget accounted for by the CPU is significant (often as much as 50%). Dynamic branch prediction is a hardware mechanism used to forecast the direction, and target address, of branch instructions. This is essential to high performance pipelined and superscalar processors, where the direction and target of branches is not computed until several stages into the pipeline. Accurate branch prediction also acts to increase energy efficiency by reducing the amount of time spent executing mis-speculated instructions. ‘Stalling’ is no longer a sensible option when the significance of static power dissipation is considered. Dynamic branch prediction logic typically accounts for over 10% of a processor’s global power dissipation, making it an obvious target for energy optimisation. Previous approaches at increasing the energy efficiency of dynamic branch prediction logic has focused on either fully dynamic or fully static techniques. Dynamic techniques include the introduction of a new cache-like structure that can decide whether branch prediction logic should be accessed for a given branch, and static techniques tend to focus on scheduling around branch instructions so that a prediction is not needed (or the branch is removed completely). This dissertation explores a method of combining static techniques and profiling information with simple hardware support in order to reduce the number of accesses made to a branch predictor. The local delay region is used on unconditional absolute branches to avoid prediction, and, for most other branches, Adaptive Branch Bias Measurement (through profiling) is used to assign a static prediction that is as accurate as a dynamic prediction for that branch. This information is represented as two hint-bits in branch instructions, and then interpreted by simple hardware logic that bypasses both the lookup and update phases for appropriate branches. The global processor power saving that can be achieved by this Combined Algorithm is around 6% on the experimental architectures shown. These architectures are based upon real contemporary embedded architecture specifications. The introduction of the Combined Algorithm also significantly reduces the execution time of programs on Multiple Instruction Issue processors. This is attributed to the increase achieved in global prediction accuracy. 621.31
5	A Domain Specific DSP Processor / En domänspecifik DSP-processor Tell, Eric January 2001 (has links) <p>This thesis describes the design of a domain specific DSP processor. The thesis is divided into two parts. The first part gives some theoretical background, describes the different steps of the design process (both for DSP processors in general and for this project) and motivates the design decisions made for this processor. </p><p>The second part is a nearly complete design specification. </p><p>The intended use of the processor is as a platform for hardware acceleration units. Support for this has however not yet been implemented.</p> Datorteknik DSP processor design CPU design Datorteknik Computer engineering Datorteknik
6	Modeling and Optimization of Delay and Power for Key Components of Modern High-performance Processors Safi, Elham 13 April 2010 (has links) In designing a new processor, computer architects consider a myriad of possible organizations and designs to decide which best meets the constraints on performance, power and cost for each particular processor. To identify practical designs, architects need to have insight into the physical-level characteristics (delay, power and area) of various components of modern processors implemented in recent fabrication technologies. During early stages of design exploration, however, developing physical-level implementations for various design options (often in the order of thousands) is impractical or undesirable due to time and/or cost constraints. In lieu of actual measurements, analytical and/or empirical models can offer reasonable estimates of these physical-level characteristics. However, existing models tend to be out-dated for three reasons: (i) They have been developed based on old circuits in old fabrication technologies; (ii) The high-level designs of the components have evolved and older designs may no longer be representative; and, (iii) The overall architecture of processors has changed significantly, and new components for which no models exist have been introduced or are being considered. This thesis studies three key components of modern high-performance processors: Counting Bloom Filters (CBFs), Checkpointed Register Alias Tables (RATs), and Compacted Matrix Schedulers (CMSs). CBFs optimize membership tests (e.g., whether a block is cached). RAT and CMS increase the opportunities for exploiting instruction-level parallelism; RAT is the core of the renaming stage, and CMS is an implementation for the instruction scheduler. Physical-level studies or models for these components have been limited or non-existent. In addition to investigating these components at the physical level, this thesis (i) proposes a novel speed- and energy-efficient CBF implementation; (ii) studies how the number of RAT checkpoints affects its latency and energy, and overall processor performance; and, (iii) studies the CMS and its accompanying logic at the physical level. This thesis also develops empirical and analytical latency and energy models that can be adapted for newer fabrication technologies. Additionally, this thesis proposes physical-level latency and energy optimizations for these components motivated by design inefficiencies exposed during the physical-level study phase. Processor design Computer architecture Physical-level implementation 0544
7	Modeling and Optimization of Delay and Power for Key Components of Modern High-performance Processors Safi, Elham 13 April 2010 (has links) In designing a new processor, computer architects consider a myriad of possible organizations and designs to decide which best meets the constraints on performance, power and cost for each particular processor. To identify practical designs, architects need to have insight into the physical-level characteristics (delay, power and area) of various components of modern processors implemented in recent fabrication technologies. During early stages of design exploration, however, developing physical-level implementations for various design options (often in the order of thousands) is impractical or undesirable due to time and/or cost constraints. In lieu of actual measurements, analytical and/or empirical models can offer reasonable estimates of these physical-level characteristics. However, existing models tend to be out-dated for three reasons: (i) They have been developed based on old circuits in old fabrication technologies; (ii) The high-level designs of the components have evolved and older designs may no longer be representative; and, (iii) The overall architecture of processors has changed significantly, and new components for which no models exist have been introduced or are being considered. This thesis studies three key components of modern high-performance processors: Counting Bloom Filters (CBFs), Checkpointed Register Alias Tables (RATs), and Compacted Matrix Schedulers (CMSs). CBFs optimize membership tests (e.g., whether a block is cached). RAT and CMS increase the opportunities for exploiting instruction-level parallelism; RAT is the core of the renaming stage, and CMS is an implementation for the instruction scheduler. Physical-level studies or models for these components have been limited or non-existent. In addition to investigating these components at the physical level, this thesis (i) proposes a novel speed- and energy-efficient CBF implementation; (ii) studies how the number of RAT checkpoints affects its latency and energy, and overall processor performance; and, (iii) studies the CMS and its accompanying logic at the physical level. This thesis also develops empirical and analytical latency and energy models that can be adapted for newer fabrication technologies. Additionally, this thesis proposes physical-level latency and energy optimizations for these components motivated by design inefficiencies exposed during the physical-level study phase. Processor design Computer architecture Physical-level implementation 0544
8	Verification-Aware Processor Design Lungu, Anita January 2009 (has links) <p>As technological advances enable computers to permeate many of our society's critical application domains (such as medicine, finances, transportation), the requirement for computers to always behave correctly becomes critical as well. Currently, ensuring that processor designs are correct represents a major challenge for the computing industry consuming the majority (up to 70%) of the resources allocated for the creation of a new processor. Looking towards the future, we see that with each new processor generation, even more transistors fit on the same chip area and more complex designs become possible, which makes it unlikely that the difficulty of the design verification problem will decrease by itself.</p><p>We believe that the difficulty of the design verification problem is compounded by the current processor design flow. In most design cycles, a design's verifiability is not explicitly considered at an early stage - when decisions are most influential - because that initial focus is exclusively on improving the design on more traditional metrics like performance, power, and area. It is thus possible for the resulting design to be very difficult to verify in the end, specifically because its verifiability was not ranked high on the priority list in the beginning. </p><p>In this thesis we propose to view verifiability as a critical design constraint to be considered, together with other established metrics, like performance and power, from the initial stages of design. Our high level goal is for this approach to make designs more verifiable, which would both decrease the resources invested in the verification step and lead to more robust designs. </p><p>More specifically, we make five main contributions in this thesis. The first is our proposal for a change in design perspective towards considering verifiability as a first class constraint. Second, we use formal verification (through a combination of theorem proving, model checking, and probabilistic model checking ) to quantitatively evaluate the impact on verifiability of various design choices like the organization of caches, TLBs, pipeline, operand bypass network, and dynamic power management mechanisms. Our third contribution is to evaluate design trade-offs between verifiability and other established metrics, like performance and power, in the context of multi-core dynamic power management schemes. Fourth, we re-design several components for increasing their verifiability. Finally, we propose design guidelines for increasing verifiability. In the context of single core processors our guidelines refer to the organization of caches and translation lookaside buffers (TLBs), the depth of the core's pipeline, the type of ALUs used, while for multi-core processors we refer to dynamic power management schemes (DPMs) for power capping. </p><p>Our results confirm that making design choices with verifiability as a first class design constraint has the capacity to decrease the verification effort. Furthermore, making explicit trade-offs between verifiability, performance and power helps identify better design points for given verification, performance, and power goals.</p> / Dissertation Computer Science Computer architecture Hardware verification Processor design
9	A Domain Specific DSP Processor / En domänspecifik DSP-processor Tell, Eric January 2001 (has links) This thesis describes the design of a domain specific DSP processor. The thesis is divided into two parts. The first part gives some theoretical background, describes the different steps of the design process (both for DSP processors in general and for this project) and motivates the design decisions made for this processor. The second part is a nearly complete design specification. The intended use of the processor is as a platform for hardware acceleration units. Support for this has however not yet been implemented. Datorteknik DSP processor design CPU design Datorteknik Computer Engineering Datorteknik
10	WvFEv3: An FPGA-based general purpose digital signal processor for space applications Mokrzycki, Brian Thomas 01 July 2011 (has links) The Waves instruments aboard the Juno and Radiation Belt Storm Probe (RBSP) spacecraft represents the next generation of space radio and plasma wave instrumentation developed by the University of Iowa's Radio and Plasma Wave group. The previous generation of such instruments on the Cassini spacecraft utilized several analog signal-conditioning techniques to compress and condense scientific data. Compression techniques are necessary because the plasma wave instruments can often generate significantly more science data than can be transmitted using the narrow telemetry channel of the hosting spacecraft. The next generation of plasma wave instrumentation represents a major shift of analog signal conditioning functionality to the digital domain, drastically reducing the amount of power and mass required by the instrument while simultaneously further condensing scientific data, increasing the precision of plasma emission measurements, and adding flexibility. The solution presented in this thesis is to utilize a low-cost radiation tolerant field programmable gate array (FPGA) that serves as a space qualified implementation platform for a custom designed general-purpose digital signal processor, called the WvFEv3. computer architecture DSP FPGA Juno processor design RBSP Electrical and Computer Engineering

Search results