  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Multi-core processors and the future of parallelism in software

Youngman, Ryan Christopher 01 January 2007 (has links)
The purpose of this thesis is to examine multi-core technology. Multi-core architecture provides benefits such as lower power consumption, scalability, and improved application performance enabled by thread-level parallelism.
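To make the thread-level parallelism mentioned above concrete, here is a minimal sketch (not from the thesis) of splitting an independent, CPU-bound workload across worker processes so that chunks run concurrently on separate cores; the worker count, chunk size, and workload are purely illustrative.

```python
# Illustrative sketch of thread-level parallelism on a multi-core CPU
# (not taken from the thesis). Each chunk of work is independent, so a
# pool of workers can run chunks on separate cores concurrently.
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(chunk):
    # CPU-bound work that benefits from running on its own core.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4  # hypothetical core count
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

    # Processes sidestep the CPython GIL, so chunks genuinely run in parallel.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        total = sum(pool.map(sum_of_squares, chunks))
    print(total)
```

For embarrassingly parallel work of this kind, the decomposition approaches a speedup equal to the core count, which is the application-performance benefit the abstract alludes to.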
22

Dynamic Register Allocation for Network Processors

Collins, Ryan 22 May 2006 (has links)
Network processors are custom high-performance embedded processors deployed for a variety of tasks that must operate at high line speeds (Gbits/sec) to prevent packet loss. With the increase in complexity of application domains and larger code stores on modern network processors, network processor programming goes beyond simply exploiting parallelism in packet processing. Unlike the traditional homogeneous threading model, modern network processor programming must support heterogeneous threads that execute simultaneously on a microengine. In order to support such demands, we first propose hardware management of registers across multiple threads. In their PLDI 2004 paper, Zhuang and Pande first proposed a compiler-based scheme to support register allocation across threads; in this work, we extend their static allocation method to support aggressive register allocation that takes dynamic context into account. We also remove the loads/stores due to aliased memory accesses, converting them into register moves that exploit dead registers. This results in tremendous savings in latency and higher throughput, mainly due to the removal of high-latency accesses as well as idle cycles. The dynamic register allocator is designed to be lightweight and low-latency by undertaking many tradeoffs. In the second part of this work, our goal is to design an automatic register allocation scheme that makes the dual-bank register file design of network processors transparent to the compiler. By design, network processors mandate that the operands of an instruction be allocated to registers belonging to two different banks. The key goal in this work is to take dynamic context into account to balance register pressure across the banks. Key decisions involve how and where to map an incoming virtual register onto a physical register in a bank, how to evict dead ones, and how to minimize bank-to-bank copies and swaps. It is shown that both of these problems can be solved by simple hardware designs that exploit dynamic context. The performance gains are substantial, and due to the simplicity of the designs (which are also off the critical path), such schemes may be attractive in practice.
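As a rough illustration of the dual-bank constraint described in the abstract, the following toy allocator (hypothetical; not the thesis' algorithm) places virtual registers into two banks, steers the second source operand of each instruction toward the opposite bank, and counts the bank-to-bank copies needed when both sources collide in the same bank. All names and the greedy policy are assumptions for illustration only.

```python
# Toy illustration of dual-bank register allocation (hypothetical, not the
# thesis' algorithm). Each instruction's two source operands must live in
# different banks; the allocator greedily places new virtual registers in
# the less-pressured bank and charges a bank-to-bank copy when both
# sources of an instruction end up in the same bank.

def allocate(instructions, bank_size=16):
    bank_of = {}                 # virtual register -> 'A' or 'B'
    pressure = {"A": 0, "B": 0}  # live registers per bank
    copies = 0

    def place(vreg, preferred=None):
        if vreg in bank_of:
            return bank_of[vreg]
        # Prefer the requested bank if it has room, else the emptier bank.
        bank = (preferred if preferred and pressure[preferred] < bank_size
                else min(pressure, key=pressure.get))
        bank_of[vreg] = bank
        pressure[bank] += 1
        return bank

    for dst, src1, src2 in instructions:
        b1 = place(src1)
        # Steer the second source toward the opposite bank.
        b2 = place(src2, preferred="B" if b1 == "A" else "A")
        if b1 == b2:
            # Constraint violated: a bank-to-bank copy is needed for src2.
            copies += 1
        place(dst)
    return bank_of, copies

# Example: three-address instructions (dst, src1, src2)
banks, extra_copies = allocate([("v3", "v1", "v2"), ("v5", "v3", "v4")])
print(banks, extra_copies)
```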
23

An analysis of software interface issues for SMT processors

Redstone, Joshua Abram. January 2002 (has links)
Thesis (Ph. D.)--University of Washington, 2002. Vita. Includes bibliographical references (p. 116-124).
24

Architectures and limits of GPU-CPU heterogeneous systems

Wong, Henry Ting-Hei 11 1900 (has links)
As we continue to be able to put an increasing number of transistors on a single chip, the answer to the perpetual question of what the best processor to build with those transistors would be remains uncertain. Past work has shown that heterogeneous multiprocessor systems provide benefits in performance and efficiency. This thesis explores heterogeneous systems composed of a traditional sequential processor (CPU) and highly parallel graphics processors (GPU). This thesis presents a tightly-coupled heterogeneous chip multiprocessor architecture for general-purpose non-graphics computation and a limit study exploring the potential benefits of GPU-like cores for accelerating a set of general-purpose workloads. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with GMA X4500 GPU cores. Pangaea introduces a resource partitioning of the GPU, where 3D graphics-specific hardware is removed to reduce area or add more processing cores, and a 3-instruction extension to the IA32 ISA that supports fast communication between CPU and GPU by building user-level interrupts on top of existing cache coherency mechanisms. By removing graphics-specific hardware on a 65 nm process, the area saved is equivalent to 9 GPU cores, while the power saved is equivalent to 5 cores. Our FPGA prototype shows thread spawn latency improvements from thousands of clock cycles to 26 cycles. A set of non-graphics workloads demonstrates speedups of up to 8.8x. This thesis also presents a limit study, in which we measure the limit of algorithm parallelism that can be usefully extracted from a set of general-purpose applications in the context of a heterogeneous system. We measure sensitivity to the sequential performance (register read-after-write latency) of the low-cost parallel cores, and to the latency and bandwidth of the communication channel between the two cores. Using these measurements, we propose system characteristics that maximize area and power efficiencies. As in previous limit studies, we find a high amount of parallelism. We show, however, that the potential speedup on GPU-like systems is low (2.2x-12.7x) due to poor sequential performance. Communication latency and bandwidth have comparatively small performance effects (<25%). Optimal area efficiency requires a lower-cost parallel processor than today's GPUs, while optimal power efficiency requires a higher-performance one.
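A back-of-the-envelope model (illustrative only; not the thesis' methodology or data) of why abundant parallelism still yields modest speedups when the parallel cores have poor per-thread sequential performance, which is the central finding of the limit study:

```python
# Back-of-the-envelope model (illustrative, not the thesis' methodology) of
# why poor per-thread performance on GPU-like cores limits whole-program
# speedup even when parallelism is abundant.

def heterogeneous_speedup(parallel_fraction, gpu_core_perf, n_gpu_cores):
    """Speedup versus running everything on the CPU.

    parallel_fraction: fraction of CPU-time that is parallelizable
    gpu_core_perf:     per-thread performance of a GPU-like core,
                       relative to the CPU core (e.g. 0.1 = 10x slower)
    n_gpu_cores:       number of GPU-like cores available
    """
    serial_time = 1.0 - parallel_fraction
    parallel_time = parallel_fraction / (gpu_core_perf * n_gpu_cores)
    return 1.0 / (serial_time + parallel_time)

# Even with 90% parallel work and 64 slow cores, the serial remainder and
# the cores' low per-thread performance keep the overall speedup modest.
for perf in (0.05, 0.1, 0.25):
    print(perf, round(heterogeneous_speedup(0.9, perf, 64), 2))
```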
25

Vector processing as a soft-core processor accelerator

Yu, Jason Kwok Kwun 11 1900 (has links)
Soft processors simplify hardware design by being able to implement complex control strategies using software. However, they are not fast enough for many intensive data-processing tasks, such as highly data-parallel embedded applications. This thesis suggests adding a vector processing core to the soft processor as a general-purpose accelerator for these types of applications. The approach has the benefits of a purely software-oriented development model, a fixed ISA allowing parallel software and hardware development, a single accelerator that can accelerate multiple functions in an application, and scalable performance with a single source code. With no hardware design experience needed, a software programmer can make area-versus-performance tradeoffs by scaling the number of functional units and register file bandwidth with a single parameter. The soft vector processor can be further customized by a number of secondary parameters to add and remove features for the specific application to optimize resource utilization. This thesis shows that a vector processing architecture maps efficiently into an FPGA and provides a scalable amount of performance for a reasonable amount of area. Configurations of the soft vector processor with different performance levels are estimated to achieve speedups of 2-24x for 5-26x the area of a Nios II/s processor on three benchmark kernels.
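A toy cycle model (assumed numbers, not from the thesis) of the single scaling parameter the abstract describes: striping a vector operation across a configurable number of lanes, so doubling the lane count roughly halves the data-processing time of each vector instruction while costing additional FPGA area.

```python
# Toy cycle model (hypothetical, not from the thesis) of how a soft vector
# processor with a configurable number of lanes executes one vector
# instruction: the vector length is striped across the lanes, so an
# N-element operation occupies the lanes for roughly ceil(N / lanes) cycles.

import math

def vector_add_cycles(n_elements, lanes, issue_overhead=2):
    """Cycles to execute one N-element vector add (overheads are assumed)."""
    return issue_overhead + math.ceil(n_elements / lanes)

# Doubling the lanes parameter roughly halves the data-processing time,
# which is the area-versus-performance knob the abstract describes.
for lanes in (1, 2, 4, 8, 16):
    print(lanes, vector_add_cycles(64, lanes))
```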
26

Optically interconnected parallel processor arrays

Drabik, Timothy J. 12 1900 (has links)
No description available.
27

XTHREAD: a flexible concurrency analysis framework

Ressia, Jorge Luis. January 2006 (has links)
Many different methodologies have been developed for analyzing multithreaded programs. These analyses take a wide variety of approaches and tend to be rather complicated, because they work on applications formed by several threads executed in a nondeterministic order. To address these issues, this thesis introduces XTHREAD, a flexible and modular framework for developing different concurrency analyses over multithreaded applications. The main objective of XTHREAD is to reduce the complexity of developing concurrency analyses by providing high-level abstractions that bridge the gap between the language spoken by the researcher and the language the framework provides. Moreover, the framework provides different tools that are often required for solving issues common to many concurrency analyses. XTHREAD's modular organization also delivers a flexible environment for developing and testing different analysis implementations. In order to demonstrate the usefulness of the framework, a client analysis representing a known but non-trivial multithreaded analysis, itself composed of several other concurrency analyses, is developed. A substantial number of benchmarks are used to test the implementations, showing that complex programs are accepted and correctly handled by the abstractions provided by the framework. Using the XTHREAD framework, we demonstrate implementations that have both comparable accuracy and much better generality than is typically found in existing, research-level implementations of concurrency analyses.
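As an example of the kind of client analysis such a framework might host, here is a minimal lockset-style data race check over a trace of shared-memory accesses; this is purely illustrative and does not use or reflect the XTHREAD API.

```python
# Minimal lockset-style race check (illustrative only; this is not the
# XTHREAD API, just the kind of client analysis such a framework hosts).
# A shared location is flagged when two different threads access it, at
# least one access is a write, and no lock is held in common.

def find_races(trace):
    """trace: list of (thread_id, location, is_write, frozenset_of_held_locks)."""
    seen = {}     # location -> list of (thread, is_write, locks)
    races = set()
    for thread, loc, is_write, locks in trace:
        for prev_thread, prev_write, prev_locks in seen.get(loc, []):
            if (thread != prev_thread and (is_write or prev_write)
                    and not (locks & prev_locks)):
                races.add(loc)
        seen.setdefault(loc, []).append((thread, is_write, locks))
    return races

trace = [
    ("T1", "counter", True,  frozenset({"L"})),
    ("T2", "counter", True,  frozenset({"L"})),   # protected: no race
    ("T1", "flag",    True,  frozenset()),
    ("T2", "flag",    False, frozenset()),        # unprotected write/read: race
]
print(find_races(trace))   # {'flag'}
```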
28

The application of real-time software in the implementation of low-cost satellite return links

Slader, James Tom January 2001 (has links)
Digital Signal Processors (DSPs) have evolved to a level where it is feasible for digital modems with relatively low data rates to be implemented entirely with software algorithms. With current technology, analogue processing is still necessary between the RF input and a low-frequency IF but, as DSP technology advances, it will become possible to shift the interface between analogue and digital domains ever closer towards the RF input. The software radio concept is a long-term goal which aims to realise software-based digital modems which are completely flexible in terms of operating frequency, bandwidth, modulation format and source coding. The ideal software radio cannot be realised until DSP, Analogue to Digital (A/D) and Digital to Analogue (D/A) technology has advanced sufficiently. Until these advances have been made, it is often necessary to sacrifice optimum performance in order to achieve real-time operation. This thesis investigates practical real-time algorithms for carrier frequency synchronisation, symbol timing synchronisation, modulation, demodulation and FEC. Included in this work are novel software-based transceivers for continuous-mode transmission, burst-mode transmission, frequency modulation, phase modulation and orthogonal frequency division multiplexing (OFDM). Ideal applications for this work combine the requirement for flexible baseband signal processing and a relatively low data rate. Suitable applications were identified in low-cost satellite return links, and specifically in asymmetric satellite Internet delivery systems. These systems employ a high-speed (>>2 Mbps) DVB channel from service provider to customer and a low-cost, low-speed (32-128 kbps) return channel. This thesis also discusses asymmetric satellite Internet delivery systems, practical considerations for their implementation and the techniques that are required to map TCP/IP traffic to low-cost satellite return links.
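To illustrate the OFDM modulation step mentioned in the abstract, here is a minimal baseband sketch (illustrative only; not the thesis' implementation) that maps bits to QPSK symbols, places them on subcarriers via an IFFT, and prepends a cyclic prefix; the FFT size and prefix length are assumed values.

```python
# Minimal OFDM modulator sketch (illustrative only, not the thesis'
# implementation): map bits to QPSK symbols, place them on subcarriers,
# take an IFFT to get the time-domain symbol, and prepend a cyclic prefix.
import numpy as np

N_SUBCARRIERS = 64     # assumed FFT size
CP_LEN = 16            # assumed cyclic prefix length

def ofdm_symbol(bits):
    assert len(bits) == 2 * N_SUBCARRIERS, "two bits per QPSK subcarrier"
    b = np.asarray(bits).reshape(-1, 2)
    # Gray-mapped QPSK: 00 -> 1+j, 01 -> 1-j, 11 -> -1-j, 10 -> -1+j (scaled)
    symbols = ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)
    time_domain = np.fft.ifft(symbols, N_SUBCARRIERS)
    return np.concatenate([time_domain[-CP_LEN:], time_domain])  # add CP

rng = np.random.default_rng(0)
tx = ofdm_symbol(rng.integers(0, 2, 2 * N_SUBCARRIERS))
print(tx.shape)   # (80,) = 64-sample symbol plus 16-sample cyclic prefix
```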
