1 |
Design of the extended fixed-length instruction set for 32-bit X86 ISA
Lin, Jyun-Ji 04 August 2008
High-performance microprocessors based on the x86 complex instruction set are widely used, and architectures are gradually moving from single-core to multi-core designs. However, variable-length instructions still make instruction fetching difficult and degrade overall execution performance. Existing mechanisms that support split-line access and fast fetching of variable-length instructions rely on additional hardware, which costs time and increases hardware complexity. This work therefore proposes a fixed-length instruction set that is compatible with, and extends, the x86 instruction set, using a fixed-length instruction format to remove the difficulty of fetching variable-length instructions. Taking the overall layout of memory space into account, we chose instruction lengths of 4 bytes and 8 bytes for the fixed-length instruction set. The encoding of the fixed-length instructions follows six translation rules:
(1) Auxiliary registers hold intermediate values to reduce data dependences among the original registers.
(2) If a translation can be completed in a few instructions using only the original registers, the original registers are used.
(3) Complex instructions are encoded in eight bytes.
(4) Displacements and immediates are sign-extended automatically when they are moved to the auxiliary registers.
(5) Auxiliary registers marked with the distinguishing prefix are encoded only in the r/m field or the index field.
(6) Of the displacement field and the immediate field, the longer one is moved first.
To balance memory-space savings against the hardware complexity of instruction fetching, we analyzed the categories of instruction packages in order to compress program space and reduce the space lost to fixed-length encoding. For verification and experiments, the CINT2006 suite was used as the benchmark. The translation to fixed-length instructions executed successfully, and the instruction package mechanism compressed program space efficiently.
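Rule (4) above relies on sign extension when a short displacement or immediate is widened into an auxiliary register. The following is a minimal sketch of that operation only, assuming a 32-bit register width; the function name `sign_extend` and the sample values are illustrative and are not taken from the thesis.

```python
def sign_extend(value: int, bits: int, width: int = 32) -> int:
    """Widen a `bits`-wide two's-complement field to `width` bits."""
    if not 0 < bits <= width:
        raise ValueError("bits must be in 1..width")
    value &= (1 << bits) - 1            # keep only the field's own bits
    sign_bit = 1 << (bits - 1)
    if value & sign_bit:                # negative: fill the upper bits with ones
        value |= ((1 << width) - 1) & ~((1 << bits) - 1)
    return value


# Hypothetical sample fields, chosen only to show both cases.
disp8 = sign_extend(0xFF, 8)   # 8-bit displacement holding -1
imm8 = sign_extend(0x7F, 8)    # positive 8-bit immediate, left unchanged
```

Returning the result as an unsigned bit pattern keeps the sketch close to what a register would hold, rather than converting to a negative Python integer.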
|
2 |
Microarchitecture techniques to improve the design of superscalar microprocessors
Chamdani, Joseph Irawan 05 1900
No description available.
|
3 |
Efficient design-space exploration of custom instruction-set extensions
Zuluaga, Marcela January 2010
Customization of processors with instruction set extensions (ISEs) is a technique that improves performance through parallelization with a reasonable area overhead, in exchange for additional design effort. This thesis presents a collection of novel techniques that reduce the design effort and cost of generating ISEs by advancing automation and reconfigurability. In addition, these techniques maximize the performance gained as a function of the additional committed resources. Including ISEs into a processor design implies development at many levels. Most prior works on ISEs solve separate stages of the design: identification, selection, and implementation. However, the interactions between these stages also hold important design trade-offs. In particular, this thesis addresses the lack of interaction between the hardware implementation stage and the two previous stages. Interaction with the implementation stage has been mostly limited to accurately measuring the area and timing requirements of the implementation of each ISE candidate as a separate hardware module. However, the need to independently generate a hardware datapath for each ISE limits the flexibility of the design and the performance gains. Hence, resource sharing is essential in order to create a customized unit with multi-function capabilities. Previously proposed resource-sharing techniques aggressively share resources amongst the ISEs, thus minimizing the area of the solution at any cost. However, it is shown that aggressively sharing resources leads to large ISE datapath latency. Thus, this thesis presents an original heuristic that can be parameterized in order to control the degree of resource sharing amongst a given set of ISEs, thereby permitting the exploration of the existing implementation trade-offs between instruction latency and area savings. In addition, this thesis introduces an innovative predictive model that is able to quickly expose the optimal trade-offs of this design space.
Compared to an exhaustive exploration of the design space, the predictive model is shown to reduce by two orders of magnitude the number of executions of the resource-sharing algorithm that are required in order to find the optimal trade-offs. This thesis presents a technique that is the first one to combine the design spaces of ISE selection and resource sharing in ISE datapath synthesis, in order to offer the designer solutions that achieve maximum speedup and maximum resource utilization using the available area. Optimal trade-offs in the design space are found by guiding the selection process to favour ISE combinations that are likely to share resources with low speedup losses. Experimental results show that this combined approach unveils new trade-offs between speedup and area that are not identified by previous selection techniques; speedups of up to 238% over previous selection techniques were obtained. Finally, multi-cycle ISEs can be pipelined in order to increase their throughput. However, it is shown that traditional ISE identification techniques do not allow this optimization due to control flow overhead. In order to obtain the benefits of overlapping loop executions, this thesis proposes to carefully insert loop control flow statements into the ISEs, thus allowing the ISE to control the iterations of the loop. The proposed ISEs broaden the scope of instruction-level parallelism and obtain higher speedups compared to traditional ISEs, primarily through pipelining, the exploitation of spatial parallelism, and reducing the overhead of control flow statements and branches. A detailed case study of a real application shows that the proposed method achieves 91% higher speedups than the state-of-the-art, with an area overhead of less than 8% in hardware implementation.
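The trade-off exploration described in this abstract can be pictured as keeping only the design points that no other point beats in both area and latency. The sketch below is a generic Pareto filter, not the thesis's heuristic or its predictive model; the function name `pareto_front` and the `(area, latency)` tuples are made-up values for illustration.

```python
def pareto_front(points):
    """Return the points not dominated in (area, latency), both minimized."""
    front = []
    for p in points:
        # p is dominated if some other point is no worse on both axes.
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical (area, latency) results for five resource-sharing settings.
designs = [(10, 5), (8, 7), (12, 4), (9, 9), (10, 6)]
front = pareto_front(designs)
```

An exhaustive exploration would have to run the resource-sharing algorithm once per setting to obtain such points, which is the cost the abstract says the predictive model reduces.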
|
4 |
Customising compilers for customisable processors
Murray, Alastair Colin January 2012
The automatic generation of instruction set extensions to provide application-specific acceleration for embedded processors has been a productive area of research in recent years. There have been incremental improvements in the quality of the algorithms that discover and select which instructions to add to a processor. The use of automatic algorithms, however, results in instructions which are radically different from those found in conventional, human-designed, RISC or CISC ISAs. This has resulted in a gap between the hardware’s capabilities and the compiler’s ability to exploit them. This thesis proposes and investigates the use of a high-level compiler pass that uses graph-subgraph isomorphism checking to exploit these complex instructions. Operating in a separate pass permits techniques to be applied that are uniquely suited for mapping complex instructions, but unsuitable for conventional instruction selection. The existing, mature, compiler back-end can then handle the remainder of the compilation. With this method, the high-level pass was able to use 1965 different automatically produced instructions to obtain an initial average speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate simulator. This result was improved following an investigation of how the produced instructions were being used by the compiler. It was established that the models the automatic tools were using to develop instructions did not take account of how well the compiler could realistically use them. Adding additional parameters to the search heuristic to account for compiler issues increased the speed-up from 1.11x to 1.24x. An alternative approach using a re-designed hardware interface was also investigated and this achieved a speed-up of 1.26x while reducing hardware and compiler complexity.
A complementary, high-level, method of exploiting dual memory banks was created to increase memory bandwidth to accommodate the increased data-processing bandwidth provided by extension instructions. Finally, the compiler was considered for use in a non-conventional role where rather than generating code it is used to apply source-level transformations prior to the generation of extension instructions and thus affect the shape of the instructions that are generated.
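The idea of matching an extension instruction against a program's data-flow graph can be illustrated with a brute-force subgraph search. This is only a sketch under simplifying assumptions: it checks that operation labels and pattern edges are preserved (a non-induced match), it tries every assignment rather than using an efficient matcher, and the names `find_matches`, `graph_ops` and `pat_ops` are hypothetical rather than taken from the thesis.

```python
from itertools import permutations


def find_matches(graph_ops, graph_edges, pat_ops, pat_edges):
    """Return every injective pattern-node -> graph-node mapping that
    preserves operation labels and all pattern edges."""
    pat_nodes = list(pat_ops)
    matches = []
    for candidate in permutations(graph_ops, len(pat_nodes)):
        mapping = dict(zip(pat_nodes, candidate))
        # Each pattern node must land on a graph node with the same operation.
        if any(pat_ops[p] != graph_ops[g] for p, g in mapping.items()):
            continue
        # Each producer -> consumer edge of the pattern must exist in the graph.
        if all((mapping[u], mapping[v]) in graph_edges for u, v in pat_edges):
            matches.append(mapping)
    return matches


# Hypothetical data-flow graph: n1 = mul, n2 = add(n1, ...), n3 = sub(n2, ...).
graph_ops = {"n1": "mul", "n2": "add", "n3": "sub"}
graph_edges = {("n1", "n2"), ("n2", "n3")}

# Hypothetical multiply-accumulate extension instruction: mul feeding add.
pat_ops = {"p_mul": "mul", "p_add": "add"}
pat_edges = [("p_mul", "p_add")]

matches = find_matches(graph_ops, graph_edges, pat_ops, pat_edges)
```

The search is factorial in the worst case, so it is suitable only for very small patterns; it is meant to show what a match is, not how a production compiler pass would find one.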
|
5 |
Increasing the efficacy of automated instruction set extension
Bennett, Richard Vincent January 2011
The use of Instruction Set Extension (ISE) in customising embedded processors for a specific application has been studied extensively in recent years. The addition of a set of complex arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting design performance requirements. This thesis proposes and evaluates a reconfigurable ISE implementation called “Configurable Flow Accelerators” (CFAs), a number of refinements to an existing Automated ISE (AISE) algorithm called “ISEGEN”, and the effects of source form on AISE. The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation. A temporal partitioning algorithm called “staggering” is proposed and demonstrated on average to reduce the area of CFA implementation by 37% for only an 8% reduction in acceleration. This thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated. Up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN early-termination is introduced and shown to improve the runtime of the algorithm by up to 7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energy-aware heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation of a set of ISEs by an average of 1.6x, up to 3.6x. This result directly contradicts the frequently espoused notion that “bigger is better” in ISE. The last stretch of work in this thesis is concerned with source-level transformation: the effect of changing the representation of the application on the quality of the combined hardware-software solution. A methodology for combined exploration of source transformation and ISE is presented, and demonstrated to improve the acceleration of the result by an average of 35% versus ISE alone.
Floating point is demonstrated to perform worse than fixed point, for all design concerns and applications studied here, regardless of ISEs employed.
|
6 |
Very large register file for BLAS-3 operations.
January 1995
by Aylwin Chung-Fai, Yu. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. / Includes bibliographical references (leaves 117-118). / Abstract / Acknowledgement / List of Tables / List of Figures / Chapter 1 --- Introduction / Chapter 1.1 --- BLAS-3 Operations / Chapter 1.2 --- Organization of Thesis / Chapter 1.3 --- Contribution / Chapter 2 --- Background Studies / Chapter 2.1 --- Registers & Cache Memory / Chapter 2.2 --- Previous Research / Chapter 2.3 --- Problem of Register & Cache / Chapter 2.4 --- BLAS-3 Operations On RISC Microprocessor / Chapter 3 --- Compiler Optimization Techniques for BLAS-3 Operations / Chapter 3.1 --- One-Dimensional Q-Way J-Loop Unrolling / Chapter 3.2 --- Two-Dimensional P×Q-Ways I×J-Loops Unrolling / Chapter 3.3 --- Addition of Code to Remove Redundant Code / Chapter 3.4 --- Simulation Result / Chapter 3.5 --- Summary / Chapter 4 --- Architectural Model of Very Large Register File / Chapter 4.1 --- Architectural Model / Chapter 4.2 --- Traditional Register File vs. Very Large Register File / Chapter 5 --- Ideal Case Study of Very Large Register File / Chapter 5.1 --- Matrix Multiply / Chapter 5.2 --- LU Decomposition / Chapter 5.3 --- Convolution / Chapter 6 --- Worst Case Study of Very Large Register File / Chapter 6.1 --- Matrix Multiply / Chapter 6.2 --- LU Decomposition / Chapter 6.3 --- Convolution / Chapter 7 --- Proposed Case Study of Very Large Register File / Chapter 7.1 --- Matrix Multiply / Chapter 7.2 --- LU Decomposition / Chapter 7.3 --- Convolution / Chapter 7.4 --- Comparison / Chapter 8 --- Conclusion & Future Work / Chapter 8.1 --- Summary / Chapter 8.2 --- Future Work / Bibliography
|
7 |
An integrated multiprocessor for matrix algorithms / Warren Marwood.
Marwood, Warren January 1994
Bibliography: leaves 237-251. / xxi, 251 leaves : ill. ; 30 cm. / Title page, contents and abstract only. The complete thesis in print form is available from the University Library. / The work in this thesis is devoted to the architecture, implementation and performance of a MATRISC processing mode. Simulation results for the MATRISC processor are provided which give performance estimates for systems which can be implemented in current technologies. It is concluded that the extremely high performance of MATRISC processors makes possible the construction of parallel computers with processing capabilities in excess of one teraflops. / Thesis (Ph.D.)--University of Adelaide, Dept. of Electrical and Electronic Engineering, 1994
|
8 |
MatRISC : a RISC multiprocessor for matrix applications / Andrew James Beaumont-Smith.
Beaumont-Smith, Andrew James January 2001
"November, 2001" / Errata on back page. / Includes bibliographical references (p. 179-183) / xxii, 193 p. : ill. (some col.), plates (col.) ; 30 cm. / Title page, contents and abstract only. The complete thesis in print form is available from the University Library. / This thesis proposes a highly integrated SOC (system on a chip) matrix-based parallel processor which can be used as a co-processor when integrated into the on-chip cache memory of a microprocessor in a workstation environment. / Thesis (Ph.D.)--University of Adelaide, Dept. of Electrical and Electronic Engineering, 2002
|
9 |
Non-blocking synchronization and system design
Greenwald, Michael Barry. January 1900
Thesis (Ph.D)--Stanford University, 1999. / Title from PDF t.p. (viewed May 9, 2002). "August 1999." "Adminitrivia V1/Prg/19990826"--Metadata.
|
10 |
A new RISC architecture for high speed data acquisition
Gribble, Donald L. 12 November 1991
This thesis describes the design of a RISC architecture
for high speed data acquisition. The structure of existing
data acquisition systems is first examined. An instruction
set is created to allow the data acquisition system to serve
a wide variety of applications. The architecture is designed
to allow the execution of an instruction each clock cycle.
The utility of the RISC system is illustrated by
implementing several representative applications.
Performance of the system is analyzed and future enhancements
discussed. / Graduation date: 1992
|