About

The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations (NDLTD).

Our metadata is collected from universities around the world. If you manage a university, consortium, or country archive and want to be added, details can be found on the NDLTD website.
1

Design and implementation of a general purpose macroprocessor for software conversion

Schmidt, David A. (1953 May 10-) January 2010 (has links)
Typescript, etc. / Digitized by Kansas Correctional Industries
2

Compiler-directed energy savings in superscalar processors

Jones, Timothy M. January 2006 (has links)
Superscalar processors contain large, complex structures to hold data and instructions as they wait to be executed. However, many of these structures consume large amounts of energy, making them hotspots requiring sophisticated cooling systems. With the trend towards larger, more complex processors, this will become more of a problem, with important implications for future technology.

This thesis uses compiler-based optimisation schemes to target the issue queue and register file, two of the most energy-consuming structures in the processor. The algorithms and hardware techniques developed in this work dynamically adapt the processor's resources to the changing program phases, turning off parts of each structure when they are unused to save dynamic and static energy.

To optimise the issue queue, the compiler analysis tracks data dependences through each program procedure. It identifies the critical path through each program region and informs the hardware of the minimum number of queue entries required to prevent it slowing down. This reduces the occupancy of the queue and increases the opportunities to save energy. With just a 1.3% performance loss, 26% dynamic and 32% static energy savings are achieved.

Registers can be idle for many cycles after they are last read, before they are released and put back on the free-list to be reused by another instruction. Alternatively, they can be turned off for energy savings. Early register releasing can perform this operation sooner than usual, but hardware schemes must wait for the instruction redefining the relevant logical register to enter the pipeline. This thesis presents an exploration of compiler-directed early register releasing: based on simple data-flow and liveness analysis, the compiler can exactly identify the last use of each register and pass the information to the hardware. The best scheme achieves 15% dynamic and 19% static energy savings.

Finally, the issue queue limiting and early register releasing schemes are combined for energy savings in both processor structures. Four different configurations are evaluated, bringing 25% to 31% dynamic and 19% to 34% static issue queue energy savings, and reductions of 18% to 25% dynamic and 20% to 21% static energy in the register file.
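The compiler-directed early register releasing described in this abstract rests on identifying each register's last use via liveness analysis. A minimal sketch of such a last-use pass, assuming a toy instruction format of (destination, sources) tuples — this is an illustrative reconstruction, not the thesis's implementation:

```python
# Hedged sketch: compiler-side last-use detection for early register release.
# The (dest, srcs) instruction encoding and register names are invented.

def find_last_uses(block):
    """For each instruction index, return the set of registers whose last
    read in this block occurs at that instruction (backward liveness scan)."""
    seen = set()                       # registers read later in the block
    last_uses = [set() for _ in block]
    for i in range(len(block) - 1, -1, -1):
        dest, srcs = block[i]
        for r in srcs:
            if r not in seen:          # first read seen walking backward
                last_uses[i].add(r)    # => last read walking forward
                seen.add(r)
        seen.discard(dest)             # a write kills earlier liveness of dest
    return last_uses

# Toy block: r1 is last read at instruction 2, so hardware could be told
# to release it there rather than waiting for r1's next redefinition.
block = [("r1", ["r2", "r3"]),   # 0: r1 = r2 op r3
         ("r4", ["r1"]),         # 1: r4 = op r1
         ("r5", ["r1", "r4"])]   # 2: r5 = r1 op r4
```

In the thesis's scheme, this per-register last-use information would be communicated to the hardware so registers can be turned off or released early; the sketch covers only the analysis step within a single basic block.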
3

Conjoint component designs for high performance dependable single chip multithreading systems

Wang, Hui. January 2007 (has links)
Thesis (Ph.D.)--University of Texas at Dallas, 2007. / Includes vita. Includes bibliographical references (leaves 94-103)
4

Combinator reduction on networks of small processors

Tunmer, Michael Luke January 1990 (has links)
No description available.
5

Instruction scheduling for a family of multiple instruction issue architectures

Wang, Liang January 1993 (has links)
No description available.
6

A Branch-Directed Data Cache Prefetching Technique for Inorder Processors

Panda, Reena December 2011 (has links)
The increasing gap between processor and main memory speeds has become a serious bottleneck to further improvement in system performance. Data prefetching techniques have been proposed to hide the performance impact of such long memory latencies. However, most currently proposed data prefetchers predict future memory accesses based on current memory misses, which limits the opportunity that can be exploited to guide prefetching. In this thesis, we propose a branch-directed data prefetcher that uses the high prediction accuracies of current-generation branch predictors to predict a future basic block trace that the program will execute, and issues prefetches for all the identified memory instructions contained therein. We also propose a novel technique to generate prefetch addresses by exploiting the correlation between the addresses generated by memory instructions and the values of the corresponding source registers at prior branch instances. We evaluate the impact of our prefetcher using a cycle-accurate simulation of an in-order processor in the M5 simulator. The results of the evaluation show that the branch-directed prefetcher improves the performance on a set of 18 SPEC CPU2006 benchmarks by an average of 38.789% over a no-prefetching implementation and 2.148% over a system that employs a Spatial Memory Streaming prefetcher.
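The core idea — following the branch predictor's predicted path a few basic blocks ahead and prefetching the addresses of the memory instructions found there — can be sketched roughly as follows. This is a simplified illustration, not the thesis's design: the CFG encoding, the address-prediction table, and the predictor interface are all invented stand-ins:

```python
# Hedged sketch of branch-directed prefetching. Each CFG entry maps a
# basic block to (taken successor, not-taken successor, memory-inst PCs).
# Effective addresses come from a hypothetical per-PC prediction table.

def branch_directed_prefetch(cfg, addr_pred, predict_taken, start, depth=3):
    """Walk the predicted basic-block trace `depth` blocks ahead and
    return the list of addresses to prefetch along the way."""
    prefetches = []
    block = start
    for _ in range(depth):
        taken, not_taken, mem_insts = cfg[block]
        for pc in mem_insts:              # memory instructions in this block
            addr = addr_pred.get(pc)      # predicted effective address
            if addr is not None:
                prefetches.append(addr)
        block = taken if predict_taken(block) else not_taken
        if block is None:                 # predicted path leaves known CFG
            break
    return prefetches

# Toy example: a two-block loop whose predicted path is always "taken".
cfg = {"A": ("B", "C", [0x10]), "B": ("A", None, [0x20]), "C": (None, None, [])}
addr_pred = {0x10: 1000, 0x20: 2000}
```

The thesis additionally correlates predicted addresses with source-register values observed at prior branch instances; the table lookup above is a placeholder for that mechanism.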
7

Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors

Li, Dong 10 July 2014 (has links)
Throughput processors such as GPUs continue to provide higher peak arithmetic capability. Designing a high-throughput memory system to keep the computational units busy is very challenging. Future throughput processors must continue to exploit data locality and utilize the on-chip and off-chip resources in the memory system more effectively to further improve the memory system throughput. This dissertation advocates orchestrating the thread scheduler with the cache management algorithms to alleviate GPU cache thrashing and pollution, avoid bandwidth saturation and maximize GPU memory system throughput. Based on this principle, this thesis work proposes three mechanisms to improve cache efficiency and memory throughput.

First, this work enhances the thread throttling mechanism with the Priority-based Cache Allocation mechanism (PCAL). By estimating the cache miss ratio with a variable number of cache-feeding threads and monitoring the usage of key memory system resources, PCAL determines the number of threads to share the cache and the minimum number of threads bypassing the cache that saturate memory system resources. This approach reduces the cache thrashing problem and effectively employs chip resources that would otherwise go unused by a pure thread throttling approach. We observe a 67% improvement over the original as-is benchmarks and an 18% improvement over a better-tuned warp-throttling baseline.

Second, this work proposes the AgeLRU and Dynamic-AgeLRU mechanisms to address the inter-thread cache thrashing problem. AgeLRU prioritizes cache blocks at replacement based on the scheduling priority of their fetching warp. Dynamic-AgeLRU selects between the AgeLRU algorithm and the LRU algorithm adaptively to avoid degrading the performance of non-thrashing applications. There are three variants of the AgeLRU algorithm: (1) replacement-only, (2) bypassing, and (3) bypassing with traffic optimization. Compared to the LRU algorithm, these three variants enable performance increases of 4%, 8% and 28% respectively across a set of cache-sensitive benchmarks.

Third, this work develops the Reuse-Prediction-based cache Replacement scheme (RPR) for the GPU L1 data cache to address the intra-thread cache pollution problem. By combining the GPU thread scheduling priority with the fetching Program Counter (PC) to generate a signature as the index of the prediction table, RPR identifies and prioritizes the near-reuse blocks and high-reuse blocks to maximize cache efficiency. Compared to the AgeLRU algorithm, the experimental results show that RPR yields a throughput improvement of 5% on average for regular applications, and a speedup of 3.2% on average across a set of cache-sensitive benchmarks.

The techniques proposed in this dissertation alleviate the cache thrashing, cache pollution and resource saturation problems effectively. We believe that when these techniques are combined, they will synergistically further improve GPU cache efficiency and the overall memory system throughput.
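As a rough illustration of the AgeLRU idea — not the dissertation's implementation; the block metadata fields are invented — victim selection can order blocks first by the scheduling priority of their fetching warp and then by recency:

```python
# Hedged sketch of AgeLRU-style victim selection: on a miss, prefer to
# evict a block fetched by a low-scheduling-priority warp, breaking ties
# by least-recent use. Field names are invented stand-ins.

def agelru_victim(cache_set):
    """cache_set: list of blocks, each a dict with 'warp_prio' (higher =
    older, more protected warp) and 'last_used' (higher = more recent).
    Returns the index of the block to evict."""
    return min(range(len(cache_set)),
               key=lambda i: (cache_set[i]["warp_prio"],
                              cache_set[i]["last_used"]))
```

Plain LRU would consider only `last_used`; ordering on warp priority first is what lets blocks belonging to older warps survive thrashing by younger ones, which is the behaviour Dynamic-AgeLRU enables or disables per application.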
8

Modeling and evaluation of multi-core multithreading processor architectures in SystemC

Ma, Nicholas 13 August 2007 (has links)
Processor design has evolved over the years to take advantage of new technology and innovative concepts in order to improve performance. Diminishing returns for improvements based on current techniques such as exploiting instruction-level parallelism have caused designers to shift their focus. Rather than focusing on single-threaded architectures, designers have increasingly sought to improve system performance and increase overall throughput by exploiting thread-level parallelism through multithreaded multi-core architectures.

Software modeling and simulation are common techniques used to aid hardware design. Through simulation, different architectures can be explored and verified before hardware is actually built. An appropriate choice for the level of abstraction can reduce the complexity and the time required to create and simulate software models.

The first contribution of this thesis is a transaction-level simulation model of a multithreaded multi-core processor. The transaction level is a high level of abstraction that hides computational details from the designer, allowing key architectural elements to be quickly explored. The processor model implemented for this thesis is flexible and can be used to explore various designs by simulating different processor and cache configurations. The processor model is written in SystemC, a standard design and verification language built on C++ that can be used to model hardware systems.

The second contribution of this thesis is the development of an application model that seeks to characterize the behavior of instruction execution and data accesses in a program. An application's instruction trace can be profiled to produce a model that can be used to generate a synthetic trace with similar characteristics. The synthetic trace can then be used in place of large trace files to drive the SystemC-based processor model. The application model can also produce various workload scenarios for multiprocessor simulation.

From experimentation, various processor configurations and different workload scenarios were simulated to explore the potential benefits of a multi-core multithreaded processor architecture. Performance increased, with diminishing returns, with additional multi-core multithreading support. However, these improvements were limited by the utilization of the shared bus. / Thesis (Master, Electrical & Computer Engineering) -- Queen's University, 2007-08-09 11:45:46.749
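The profile-then-synthesize idea behind the application model can be sketched as follows. This is a hedged, simplified reconstruction: the thesis's model also characterizes data-access behaviour, whereas this sketch reproduces only the instruction-type mix, and all names are invented:

```python
import random

# Hedged sketch: profile the instruction-type mix of a real trace, then
# sample a synthetic trace with a statistically similar mix. A real
# application model would also capture address and dependence behaviour.

def profile_mix(trace):
    """Return each instruction type's relative frequency in the trace."""
    counts = {}
    for inst in trace:
        counts[inst] = counts.get(inst, 0) + 1
    total = len(trace)
    return {inst: c / total for inst, c in counts.items()}

def synthesize(mix, length, seed=0):
    """Draw a synthetic trace of `length` instructions from the mix."""
    rng = random.Random(seed)          # seeded for reproducible runs
    kinds = list(mix)
    weights = [mix[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=length)

# Toy trace: 60% ALU, 30% loads, 10% stores.
trace = ["alu"] * 6 + ["load"] * 3 + ["store"]
```

A generator like this keeps the simulator's stimulus small and parameterizable, which is the stated motivation for replacing large trace files with a synthetic model.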
9

Emitter identification using optical processors

Hartup, David Carl 12 1900 (has links)
No description available.
10

Efficient use of Multi-core Technology in Interactive Desktop Applications

Karlsson, Johan January 2015 (has links)
The emergence of multi-core processors has ended the era in which applications could enjoy free and regular performance improvements without source code modifications. This thesis aims to gather experiences from the work of retrofitting parallelism into a desktop application originally written for sequential execution. The main contribution is the underlying theory and the performance evaluation, experiments and tests of the parallel software regions compared to their sequential counterparts. Feasibility is demonstrated by putting the theory into use as a complex, commercially active desktop application is rewritten to support parallelism. The thesis finds no simple guaranteed solution to the problem of making a serial application execute in parallel. However, experiments and tests prove that many of the evaluated methods offer tangible performance advantages over sequential execution.
