Global ETD Search

71	LINE ASSOCIATIVE REGISTERS Melarkode, Krishna 01 January 2004 (has links) As technological advances have improved processor speed, main memory speed has lagged behind. Even with advanced RAM technologies, it has not been possible to close the gap in speeds. Ideally, a CPU can deliver good performance when the right data is made available to it at the right time. Caches and Registers solved the problem to an extent. This thesis takes the approach of trying to create a new memory access model that is more efficient and simple instead of using various add on mechanisms to mask high memory latency. The Line Associative Registers have the functionality of a cache, scalar registers and vector registers built into them. This new model qualitatively changes how the processor accesses memory.
72	FPGA-based Soft Vector Processors Yiannacouras, Peter 23 February 2010 (has links) FPGAs are increasingly used to implement embedded digital systems because of their low time-to-market and low costs compared to integrated circuit design, as well as their superior performance and area over a general purpose microprocessor. However, the hardware design necessary to achieve this superior performance and area is very difficult to perform causing long design times and preventing wide-spread adoption of FPGA technology. The amount of hardware design can be reduced by employing a microprocessor for less-critical computation in the system. Often this microprocessor is implemented using the FPGA reprogrammable fabric as a soft processor which can preserve the benefits of a single-chip FPGA solution without specializing the device with dedicated hard processors. Current soft processors have simple architectures that provide performance adequate for only the least-critical computations. Our goal is to improve soft processors by scaling their performance and expanding their suitability to more critical computation. To this end we focus on the data parallelism found in many embedded applications and propose that soft processors be augmented with vector extensions to exploit this parallelism. We support this proposal through experimentation with a parameterized soft vector processor called VESPA (Vector Extended Soft Processor Architecture) which is designed, implemented, and evaluated on real FPGA hardware. The scalability of VESPA combined with several other architectural parameters can be used to finely span a large design space and derive a custom architecture for exactly matching the needs of an application. Such customization is a key advantage for soft processors since their architectures can be easily reconfigured by the end-user. Specifically, customizations can be made to the pipeline, functional units, and memory system within VESPA. In addition, general purpose overheads can be automatically eliminated from VESPA. Comparing VESPA to manual hardware design, we observe a 13x speed advantage for hardware over our fastest VESPA, though this is significantly less than the 500x speed advantage over scalar soft processors. The performance-per-area of VESPA is also observed to be significantly higher than a scalar soft processor suggesting that the addition of vector extensions makes more efficient use of silicon area for data parallel workloads. soft processor soft Vector processor FPGA custom embedded SIMD data parallel 0544
73	Occlusion culling and hardware accelerated volume rendering Meißner, Michael. Unknown Date (has links) (PDF) University, Diss., 2000--Tübingen. / Erscheinungsjahr an der Haupttitelstelle: 2000.
74	A study on SSE optimisation regarding initialisation and evaluation of the Fast Multipole Method Hjerpe, Daniel January 2016 (has links) The following study examines whether the initialisation (multipole expansions at the finest level) and evaluation of the numerical method Fast Multipole Method (FMM) can benefit from implementing SSE instructions. The implementation of SSE-instructions have been studied and compared to the serial case. Moreover, studied parts of the algorithm include arithmetics on complex numbers, and the usage of applying SSE instructions to complex numbers of double precision. In conclusion, the initialisation has not experienced any improvement in terms of throughput by appliying SSE instructions. However, the evaluation reached almost the double speed-up when SSE instructions were applied. The difference in results are most likely due to the structure of the both algorithms. The initialisation is simple, but the evaluation which involves more operations can benefit from SSE instructions. Furthermore, a scheme is proposed for how SSE instructions can be applied to data sets which are not divisable by the unroll factor and to data sets of varying size. SSE SIMD Fast Multipole Method AVX N-body problem Computer Sciences Datavetenskap (datalogi)
75	Stereoseende i realtid / Real-time Stereo Vision Arvidsson, Lars January 2007 (has links) In this thesis, two real-time stereo methods have been implemented and evaluated. The first one is based on blockmatching and the second one is based on local phase. The goal was to be able to run the algorithms at real-time and examine which one is best. The blockmatching method performed better than the phase based method, both in speed and accuracy. SIMD operations (Single Instruction Multiple Data) have been used in the processor giving a speed boost by a factor of two. / I det här exjobbet har två stereometoder för realtidstillämpningar implementerats och utvärderats. Den ena bygger på blockmatchning och den andra på lokal fas. Målet var att kunna köra metoderna i realtid och undersöka vilken av dem som fungerar bäst. Blockmatchningsmetoden gav gott resultat medan den fasbaserade fungerade sämre, både vad gäller hastighet och precision. SIMD-operationer (Single Instruction Multiple Data) användes hos processorn vilket resulterade en i fördubbling av prestandan. stereo disparitet realtid SIMD fas blockmatchning
76	A Multimedia DSP Processor Design / Design av en Multimedia DSP Processor Gnatyuk, Vladimir, Runesson, Christian January 2004 (has links) This Master Thesis presents the design of the core of a fixed point general purpose multimedia DSP processor (MDSP) and its instruction set. This processor employs parallel processing techniques and specialized addressing models to speed up the processing of multimedia applications. The MDSP has a dual MAC structure with one enhanced MAC that provides a SIMD, Single Instruction Multiple Data, unit consisting of four parallel data paths that are optimized for accelerating multimedia applications. The SIMD unit performs four multimedia- oriented 16- bit operations every clock cycle. This accelerates computationally intensive procedures such as video and audio decoding. The MDSP uses a memory bank of four memories to provide multiple accesses of source data each clock cycle. Datorteknik DSP processor multimedia SIMD Dual MAC assembler simulator Datorteknik Computer Engineering Datorteknik
77	Implementation of Elementary Functions for a Fixed Point SIMD DSP Coprocessor Tomasson, Orri January 2010 (has links) This thesis is about implementing the functions for reciprocal, square root, inverse square root and logarithms on a DSP platform. A multi-core DSP platform that consists of one master processor core and several SIMD coprocessor cores is currently being designed by a team at the Computer Engineering Department of Linköping University. The SIMD coprocessors’ arithmetic logic unit (ALU) has 16 multipliers to support vector multiplication instructions. By efficiently using the 16 multipliers, it is possible to evaluate polynomials very fast. The ALU does not have (hardware) support for floating point arithmetic, so the challenge is to get good precision by using fixed point arithmetic. Precise and fast solutions to implement the mathematical functions are found by converting the fixed point input to a soft floating point format before polynomial approximation, choosing a polynomial based on an error analysis of the polynomial approximation, and using Newton-Raphson or Goldschmidt iterations to improve the precision of the polynomial approximations. Finally, suggestions are made of changes and additions to the instruction set architecture, in order to make the implementations faster, by efficiently using the currently existing hardware. SIMD DSP mathematical functions elementary functions polynomial approximation fixed-point arithmetic Computer Engineering Datorteknik
78	A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor Einemo, Jonas, Lundqvist, Magnus January 2010 (has links) H.264 is a video coding standard which offers high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called ePUMA. The thesis investigates if real-time encoding of high definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder. ePUMA DSP SIMD H.264 Parallel Programming Motion Estimation DCT Computer Engineering Datorteknik
79	Design and implementation of massively parallel fine-grained processor arrays Walsh, Declan January 2015 (has links) This thesis investigates the use of massively parallel fine-grained processor arrays to increase computational performance. As processors move towards multi-core processing, more energy-efficient processors can be designed by increasing the number of processor cores on a single chip rather than increasing the clock frequency of a single processor. This can be done by making processor cores less complex, but increasing the number of processor cores on a chip. Using this philosophy, a processor core can be reduced in complexity, area, and speed to form a very small processor which can still perform basic arithmetic operations. Due to the small area occupation this can be multiplied and scaled to form a large scale parallel processor array to offer a significant performance. Following this design methodology, two fine-grained parallel processor arrays are designed which aim to achieve a small area occupation with each individual processor so that a larger array can be implemented over a given area. To demonstrate scalability and performance, SIMD parallel processor array is designed for implementation on an FPGA where each processor can be implemented using four ‘slices’ of a Xilinx FPGA. With such small area utilization, a large fine-grained processor can be implemented on these FPGAs. A 32 × 32 processor array is implemented and fast processing demonstrated using image processing tasks. An event-driven MIMD parallel processor array is also designed which occupies a small amount of area and can be scaled up to form much larger arrays. The event-driven approach allows the processor to enter an idle mode when no events are occurring local to the processor, reducing power consumption. The processor can switch to operational mode when events are detected. The processor core is designed with a multi-bit data path and ALU and contains its own instruction memory making the array a multi-core processor array. With area occupation of primary concern, the processor is relatively simple and connects with its four nearest direct neighbours. A small 8 × 8 prototype chip is implemented in a 65 nm CMOS technology process which can operate at a clock frequency of 80 MHz and offer a peak performance of 5.12 GOPS which can be scaled up to larger arrays. An application of the event-driven processor array is demonstrated using a simulation model of the processor. An event-driven algorithm is demonstrated to perform distributed control of distributed manipulator simulator by separating objects based on their physical properties. 621.3
80	Paralelizace výpočtů pro zpracování obrazu / Paralelized image processing library Fuksa, Tomáš January 2011 (has links) This work deals with parallel computing on modern processors - multi-core CPU and GPU. The goal is to learn about computing on this devices suitable for parallelization, define their advantages and disadvantages, test their properties in examples and select appropriate tools to implement a library for parallel image processing. This library is going to be used for the vanishing point estimation in the path finding mobile robot.

Search results