Global ETD Search

61	A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor Einemo, Jonas, Lundqvist, Magnus January 2010 H.264 is a video coding standard which offers a high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processor architecture called ePUMA. The thesis investigates if real-time encoding of high definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder. ePUMA DSP SIMD H.264 Parallel Programming Motion Estimation DCT Computer engineering Datorteknik
62	Optimizing VLIW architectures for multimedia applications Salamí San Juan, Esther 01 June 2007 The growing interest that multimedia processing has experienced during the last decade is motivating processor designers to reconsider which execution paradigms are the most appropriate for general-purpose processors. On the other hand, as the size of transistors decreases, power dissipation has become a relevant limitation to increases in the frequency of operation. Thus, the efficient exploitation of the different sources of parallelism is a key point to investigate in order to sustain the performance improvement rate of processors and face the growing requirements of future multimedia applications. We believe that a promising option arises from the combination of the Very Long Instruction Word (VLIW) and the vector processing paradigms together with other ways of exploiting coarser grain parallelism, such as Chip MultiProcessing (CMP). As part of this thesis, we analyze the problem of memory disambiguation in multimedia applications, as it represents a serious restriction for exploiting Instruction Level Parallelism (ILP) in VLIW architectures. We state that the real handicap for memory disambiguation in multimedia is the extensive use of pointers and indirect references usually found in those codes, together with the limited static information available to the compiler on certain occasions. Based on the observation that the input and output multimedia streams are commonly disjoint memory regions, we propose and implement a memory disambiguation technique that dynamically analyzes the region domain of every load and store before entering a loop, evaluates whether or not the full loop is disambiguated and executes the corresponding loop version. This mechanism does not require any additional hardware or instructions and has negligible effects on compilation time and code size.
The performance achieved is comparable to that of advanced interprocedural pointer analysis techniques, with considerably less software complexity. We also demonstrate that both techniques can be combined to improve performance. In order to deal with the inherent Data Level Parallelism (DLP) of multimedia kernels without disrupting the existing core designs, major processor manufacturers have chosen to include MMX-like µSIMD extensions. By analyzing the scalability of the DLP and non-DLP regions of code separately in VLIW processors with µSIMD extensions, we observe that the performance of the overall application is dominated by the performance of the non-DLP regions, which in fact exhibit only modest amounts of ILP. As a result, the performance achieved by very wide issue configurations does not compensate for the related cost. To exploit the DLP of the vector regions in a more efficient way, we propose enhancing the µSIMD-VLIW core with conventional vector processing capabilities. The combination of conventional and sub-word level vector processing results in a 2-dimensional extension that combines the best of each one, including a reduction in the number of operations, lower fetch bandwidth requirements, simplicity of the control unit, power efficiency, scalability, and support for multimedia specific features such as saturation or reduction. This enhancement has a minimal impact on the VLIW core and reaches more parallelism than wider issue µSIMD implementations at a lower cost. Similar proposals have been successfully evaluated for superscalar cores. In this thesis, we demonstrate that 2-dimensional Vector-µSIMD extensions are also effective with static scheduling, allowing for high-performance cost-effective implementations. memory disambiguation VLIW SIMD extensions ILP vector processing DLP multimedia computer architecture 004
63	Extracting Data-Level Parallelism from Sequential Programs for SIMD Execution Baumstark, Lewis Benton, Jr. 29 October 2004 The goal of this research is to retarget multimedia programs written in sequential languages (e.g., C) to architectures with data-parallel execution capabilities. Image processing algorithms often have a high potential for data-level parallelism, but the artifacts imposed by the sequential programming language (e.g., loops, pointer variables) can obscure the parallelism and prohibit generation of efficient parallel code. This research presents a program representation and recognition approach for generating a data parallel program specification from sequential source code and retargeting it to data parallel execution mechanisms. The representation is based on an extension of the multi-dimensional synchronous dataflow model of computation. A partial recognition approach identifies and transforms only those program elements that hinder parallelization while leaving other computational elements intact. This permits flexibility in the types of programs that can be retargeted, while avoiding the complexity of complete program recognition. This representation and recognition process is implemented in the PARRET system, which is used to extract the high-level specification of a set of image-processing programs. From this specification, code is generated for Intel's SSE2 instruction set and for the SIMPil processor. The results demonstrate that PARRET can exploit, given sufficient parallel resources, the maximum available parallelism in the retargeted applications. Similarly, the results show PARRET can also exploit parallelism on architectures with hardware-limited parallel resources. It is desirable to estimate potential parallelism before undertaking the expensive process of reverse engineering and retargeting. The goal is to narrow down the search space to a select set of loops which have a high likelihood of being data-parallel.
This work also presents a hybrid static/dynamic approach, called DLPEST, for estimating the data-level parallelism in sequential program loops. We demonstrate the correctness of DLPEST's estimates, show that estimates for programs of 25 to 5000 lines of code can be performed in under 10 minutes and that estimation time scales sub-linearly with input program size. Program recognition Data-level parallelization SIMD processors Reengineering
64	Storage Management for Embedded SIMD Processors Ryu, Soojung 17 December 2003 SIMD parallelism offers a high performance and efficient execution approach for today's broad range of portable multimedia consumer products. However, new methods are needed to meet the complex demands of high performance, embedded systems. This research explores new storage management techniques for this focused but critical application. These techniques include memory design exploration based on the application retargeting technique, storage-based systolic instruction broadcast, and systolic virtual memory to improve both the performance and efficiency of embedded SIMD systems. To make storage usage efficient through memory design space exploration in embedded SIMD systems, an analysis method was developed for assessing the storage needs and costs of a given application automatically retargeted across a spectrum of storage configuration designs. Using this technique, a SIMD processing element achieves optimal area and energy efficiency with a register file containing between 8 and 12 words for a given workload. This configuration is between 15% and 25% more area and energy efficient than the other memory configurations considered. Systolic instruction broadcast is a high-performance, area-efficient instruction broadcasting scheme that uses short-wire interconnects, eliminating the wire latency bottleneck found in global instruction broadcast. Three implementation methods are defined and evaluated: a software method, a 2-write-port register file method, and a bypass method. In our evaluations, due to the system's short clock cycle time and scheduler, a speedup in system performance of up to 7.5 can be achieved by the year 2010. In addition, an area efficiency gain of up to 7.2 can be achieved for a given workload.
The ability to minimize off-chip memory access latency while maximizing access frequency, using scheduling techniques along with data prefetch techniques in the systolic virtual memory mechanism, was evaluated using our SIMD-systolic architecture simulator. Results show that systolic virtual off-chip memory with a shared address space can achieve over 50% higher area efficiency than an on-chip-only system for a matrix multiplication application. Storage management SIMD architectures Embedded systems Embedded computer systems Memory management (Computer science)
65	The Optimal Design for Face Detection Algorithm on Cell Processor Architecture Ku, Po-Yu 24 August 2011 With the advance of facial recognition technology, many related applications have emerged, such as clearance for specific facilities, airport security, video camera surveillance, and personnel recognition. To maximize working efficiency and reduce the human resources required, the platform used for facial recognition should offer low cost, multimedia performance, and ease of use. Among the available platforms, an IBM CELL multi-core based platform that features the aforementioned advantages is used for our work. To meet the demand for recognition accuracy, recognition algorithms featuring a low error rate and regular data patterns are adopted. These algorithms are carried out in two parts: Modified Census Transform (MCT) and hypotheses of human facial calculation. The multi-point average value required by the MCT is obtained through parallel processing, and potential improvement in recognition efficiency is possible if wider data paths are used. A PlayStation 3 (PS3) platform equipped with the IBM CELL multi-core processor is used in this thesis. The IBM CELL multi-core processor consists of a PowerPC Processor Element (PPE) and 8 Synergistic Processor Elements (SPEs), which form a heterogeneous multi-core system. This system supports both thread-level and data-level parallelism, which can meet the demand for high data bandwidth and data parallelization. By using this platform to accelerate the processing of facial recognition, simulation results suggest that the execution efficiency is improved by a factor of 24 compared with a single SPE core. The simulation also reveals that parallelizing the processing of facial recognition data is feasible. In the future, improved algorithms can be applied to improve the accuracy of facial recognition.
Multiple Buffering Modified Census Transform (MCT) SIMD Synergistic Processor (SPE) Heterogeneous PowerPC Processor Element (PPE)
66	Design of a Multi-Core Multi-thread Floating-Point Processor and Its Application in Computer Graphics Yeh, Chia-Yu 06 September 2011 Graphics processing unit (GPU) designs usually adopt various computer architecture techniques to boost the computation speed, including single-instruction multiple data (SIMD), very-long-instruction word (VLIW), multi-threading, and/or multi-core. In OpenGL ES 2.0, the user-programmable vertex shader (VS) hardware unit can be designed using a vector SIMD computation unit so that it can efficiently compute the matrix-vector multiplication, one of the key operations in vertex transformation. Recently, high-performance GPUs, such as the Tesla series from NVIDIA, have been designed with many-core architectures, with each core responsible for scalar operations. The intention is to allow for efficient execution of general-purpose computations in addition to the specialized graphics computations. In this thesis, we present a scalar-based multi-threaded GPU design that is composed of four scalar processors and one special-function unit and can execute multi-threaded instructions. We use the example of vertex transformation to demonstrate the execution efficiency of the scalar-based multi-threaded GPU. We also make a comparison with the vector-based SIMD GPU. multi-threading graphics processing unit (GPU) vertex shader SIMD matrix-vector multiplication OpenGL ES 2.0
67	Study of the Hyperscalar Multi-core Architecture Chou, Yu-Liang 07 September 2011 Current trends in processor design have migrated toward chip multiprocessors (CMPs). CMPs are designed to exploit both instruction-level parallelism (ILP) within processors and thread-level parallelism (TLP) within and across processors. However, the conventional design of current CMPs is forced to make a choice between high single-thread performance and high peak throughput. This inability to adjust to varying levels of ILP and TLP results in processor inefficiency. To cope with this dilemma confronting processor designers, this dissertation proposed the hyperscalar concept for current multi-core designs. The hyperscalar concept enables multi-core architectures to dynamically group many scalar in-order cores as a superscalar processor to accelerate a sequential thread. The reconfigurability of the hyperscalar architecture contributes to its high flexibility in adapting to different types of applications, providing high single-thread performance when thread-level parallelism (TLP) is low and high throughput when TLP is high. Based on the hyperscalar concept, this dissertation first proposed a hyperscalar dual-core architecture. It can play three different roles (a 2-issue statically scheduled superscalar processor, a homogeneous dual-core processor, or a standalone single-core processor). An Instruction-dependency Analyzer (IA) that connects two scalar in-order cores is designed to handle the role switching. The design of the IA makes it possible for the two cores to work together like a 2-issue statically scheduled superscalar processor. The IA dispatches instructions with data dependencies to the same core so that the data dependencies can be resolved by existing forwarding paths in the core.
Simulation results show that when the proposed architecture works in a statically scheduled superscalar manner, it achieves 30.3% higher instructions per cycle (IPC) than the traditional five-stage pipelined core based on 35 benchmarks from the MiBench suite. The increases in area and power for extending a homogeneous dual-core processor to a hyperscalar dual-core processor are only 1.8% and 1.75%, respectively, using 90nm CMOS technology. On top of that, this dissertation further extended the hyperscalar dual-core architecture to a hyperscalar multi-core architecture capable of flexibly providing high throughput for uniform parallel applications as well as high performance for more general workloads. It can dynamically unite many scalar cores as a larger out-of-order (OOO) superscalar processor to accelerate a thread. To accomplish this, the Virtual Shared Register File (VSRF) concept was proposed so that the instructions of a thread in different cores logically see a uniform register file. Simulation results show that the 2, 4, 8, 16, and 32-core-united configurations of the hyperscalar multi-core architecture achieve 95%, 84%, 82%, 85%, and 90% of the performance of the monolithic 2, 4, 8, 16, and 32-issue OOO superscalar processors based on the SPEC2000 benchmarks. Finally, this dissertation proposed a new technology, called multi-streaming SIMD, applicable to the hyperscalar architecture to efficiently exploit data-level parallelism (DLP). The multi-streaming SIMD technology enables current multimedia extensions to simultaneously manipulate multiple data streams. Simulation results show that when a multi-streaming SIMD computing engine has four 4-register multimedia operation storage units, it provides a 3.3x to 5.5x performance enhancement for traditional MMX extensions on twelve multimedia kernels. After exploring the above research topics discussed in this dissertation, a promising architecture for future multi-core designs was realized.
SIMD chip multiprocessors superscalar dynamic multi-core reconfigurable hardware multimedia processing hyperscalar
68	FPGA-based Soft Vector Processors Yiannacouras, Peter 23 February 2010 FPGAs are increasingly used to implement embedded digital systems because of their low time-to-market and low costs compared to integrated circuit design, as well as their superior performance and area over a general purpose microprocessor. However, the hardware design necessary to achieve this superior performance and area is very difficult to perform, causing long design times and preventing wide-spread adoption of FPGA technology. The amount of hardware design can be reduced by employing a microprocessor for less-critical computation in the system. Often this microprocessor is implemented using the FPGA reprogrammable fabric as a soft processor which can preserve the benefits of a single-chip FPGA solution without specializing the device with dedicated hard processors. Current soft processors have simple architectures that provide performance adequate for only the least-critical computations. Our goal is to improve soft processors by scaling their performance and expanding their suitability to more critical computation. To this end we focus on the data parallelism found in many embedded applications and propose that soft processors be augmented with vector extensions to exploit this parallelism. We support this proposal through experimentation with a parameterized soft vector processor called VESPA (Vector Extended Soft Processor Architecture) which is designed, implemented, and evaluated on real FPGA hardware. The scalability of VESPA combined with several other architectural parameters can be used to finely span a large design space and derive a custom architecture for exactly matching the needs of an application. Such customization is a key advantage for soft processors since their architectures can be easily reconfigured by the end-user. Specifically, customizations can be made to the pipeline, functional units, and memory system within VESPA.
In addition, general purpose overheads can be automatically eliminated from VESPA. Comparing VESPA to manual hardware design, we observe a 13x speed advantage for hardware over our fastest VESPA, though this is significantly less than the 500x speed advantage over scalar soft processors. The performance-per-area of VESPA is also observed to be significantly higher than a scalar soft processor suggesting that the addition of vector extensions makes more efficient use of silicon area for data parallel workloads. soft processor soft Vector processor FPGA custom embedded SIMD data parallel 0544
69	Harnessing Data Parallel Hardware for Server Workloads Agrawal, Sandeep R. January 2015 Trends in increasing web traffic demand an increase in server throughput while preserving energy efficiency and total cost of ownership. Present work in optimizing data center efficiency primarily focuses on using general purpose processors; however, these might not be the most efficient platforms for server workloads. Data parallel hardware achieves high energy efficiency by amortizing instruction costs across multiple data streams, and high throughput by enabling massive parallelism across independent threads. These benefits are considered traditionally applicable to scientific workloads, and common server tasks like page serving or search are considered unsuitable for a data parallel execution model. Our work builds on the observation that server workload execution patterns are not completely unique across multiple requests. For a high enough arrival rate, a server has the opportunity to launch cohorts of similar requests on data parallel hardware, improving server performance and power/energy efficiency. We present a framework, called Rhythm, for high throughput servers that can exploit similarity across requests to improve server performance and power/energy efficiency by launching data parallel executions for request cohorts. An implementation of the SPECWeb Banking workload using Rhythm on NVIDIA GPUs provides a basis for evaluation. Similarity search is another ubiquitous server workload that involves identifying the nearest neighbors to a given query across a large number of points. We explore the performance, power and dollar benefits of using accelerators to perform similarity search for query cohorts in very high dimensions under tight deadlines, and demonstrate an implementation on GPUs that searches across a corpus of billions of documents and is significantly cheaper than commercial deployments.
We show that with software and system modifications, data parallel designs can greatly outperform common task parallel implementations. Dissertation Computer science Computer engineering Datacenters Energy efficiency GPU Server workloads SIMD execution Text Search
70	DESIGN AND IMPLEMENTATION OF THE INSTRUCTION SET ARCHITECTURE FOR DATA LARS Ponnala, Kalyan 01 January 2010 The ideal memory system assumed by most programmers is one which has high capacity, yet allows any word to be accessed instantaneously. To make the hardware approximate this performance, an increasingly complex memory hierarchy, using caches and techniques like automatic prefetch, has evolved. However, as the gap between processor and memory speeds continues to widen, these programmer-visible mechanisms are becoming inadequate. Part of the recent increase in processor performance has been due to the introduction of programmer/compiler-visible SWAR (SIMD Within A Register) parallel processing on increasingly wide DATA LARs (Line Associative Registers) as a way to both improve data access speed and increase efficiency of SWAR processing. Although the base concept of DATA LARs predates this thesis, this thesis presents the first instruction set architecture specification complete enough to allow construction of a detailed prototype hardware design. This design was implemented and tested using a hardware simulator. Line Associative Registers DATA LARs SIMD Within a Register (SWAR) Cache Registers (CRegs) Associativity Electrical and Computer Engineering
