31

The Cell Processor

Hoefler, Torsten 07 March 2006 (has links) (PDF)
Mainstream processor development is mostly targeted at compatibility and continuity. Thus, the processor market has been dominated by x86-compatible CPUs for more than two decades. Several new concepts have tried to gain market share, but none has managed to overtake the established compatibility-driven designs. A group of three companies is trying another way into the market with a new idea, the Cell design. The Cell processor is a new attempt to leverage the increasing number of transistors per die in an efficient way. The new processor is targeted at the game console and consumer electronics market to enhance the quality of these devices. This will lead to wide adoption, and once everybody has two or more Cell processors in their TV, game console, or PDA, an interesting question arises: what can be done with these processors? This paper gives a short overview of the architecture and of several programming ideas that help to exploit the full processing power of the Cell processor.
32

An Application Developed for Simulation of Electrical Excitation and Conduction in a 3D Human Heart

Yu, Di 01 January 2013 (has links)
This thesis first reviews the history of general-purpose computing on graphics processing units (GPGPU) and then introduces the fundamental classes of problems that are suited to GPGPU algorithms. The GPGPU architecture is compared against modern CPU architecture, and the fundamental differences are outlined. The programming challenges faced by GPGPU and the techniques used to overcome them are evaluated and discussed. The second part of the thesis presents an application developed with GPGPU technology to simulate electrical excitation and conduction in a 3D human heart model based on a cellular automata model. The algorithm and implementation are discussed in detail, and the performance of the GPU is compared against the CPU.
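As a rough illustration of the cellular-automata approach summarized above (a sketch, not code from the thesis), the C fragment below updates a 3D grid in which each cell is resting, excited, or refractory: a resting cell fires when a face neighbour is excited, and excited/refractory cells advance through fixed phases. Grid size, state encoding, and phase lengths are assumptions made for the example.

```c
#include <string.h>

#define NX 64
#define NY 64
#define NZ 64
#define IDX(x, y, z) ((size_t)(z) * NY * NX + (size_t)(y) * NX + (x))

/* Assumed state encoding: 0 = resting, 1..EXCITED_STEPS = excited,
 * EXCITED_STEPS+1..EXCITED_STEPS+REFRACT_STEPS = refractory.        */
enum { EXCITED_STEPS = 3, REFRACT_STEPS = 5 };

/* Is this state value in the excited phase? */
static int excited(unsigned char s) { return s >= 1 && s <= EXCITED_STEPS; }

/* One cellular-automaton time step over the interior of the grid (6-connected neighbourhood). */
static void ca_step(const unsigned char *cur, unsigned char *next)
{
    memcpy(next, cur, (size_t)NX * NY * NZ);   /* keep border cells unchanged */

    for (int z = 1; z < NZ - 1; z++)
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++) {
                unsigned char s = cur[IDX(x, y, z)];
                if (s == 0) {
                    /* Resting cell becomes excited if any face neighbour is excited. */
                    int fire = excited(cur[IDX(x - 1, y, z)]) || excited(cur[IDX(x + 1, y, z)]) ||
                               excited(cur[IDX(x, y - 1, z)]) || excited(cur[IDX(x, y + 1, z)]) ||
                               excited(cur[IDX(x, y, z - 1)]) || excited(cur[IDX(x, y, z + 1)]);
                    next[IDX(x, y, z)] = fire ? 1 : 0;
                } else if (s < EXCITED_STEPS + REFRACT_STEPS) {
                    next[IDX(x, y, z)] = s + 1;   /* advance through excited/refractory phases */
                } else {
                    next[IDX(x, y, z)] = 0;       /* return to resting */
                }
            }
}
```

Because each cell's new state depends only on the previous time step, every cell can be updated independently, which is what makes this kind of model map naturally onto one GPU thread per cell.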
33

Low Cost Floating-Point Extensions to a Fixed-Point SIMD Datapath

Kolumban, Gaspar January 2013 (has links)
The ePUMA architecture is a novel master-multi-SIMD DSP platform aimed at low-power computing, for example in embedded or hand-held devices. It is a configurable and scalable platform designed for multimedia and communications. Numbers with both integer and fractional parts are common in computing because many important algorithms, such as those in signal and image processing, make use of them. A good way of representing such numbers is a floating-point representation. The ePUMA platform currently supports a fixed-point representation, so the goal of this thesis is to implement twelve basic floating-point arithmetic operations and two conversion operations on an existing datapath, conforming as closely as possible to the IEEE 754-2008 standard for floating-point arithmetic. The implementation should come at a low cost in hardware and power consumption, with a target frequency of 500 MHz. The implementation is compared with dedicated DesignWare components and with floating-point arithmetic implemented in software on ePUMA. This thesis presents a solution that on average increases the VPE datapath hardware cost by 15% and the power consumption by 15%. The highest clock frequency achieved with the solution is 473 MHz. The target clock frequency of 500 MHz is thus not reached, but given the lack of register retiming in the synthesis step, 500 MHz can most likely be achieved with this design.
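For reference, the IEEE 754-2008 binary32 format targeted by such an extension packs one sign bit, an 8-bit biased exponent, and a 23-bit fraction into 32 bits. The small C sketch below (illustrative only, not part of the ePUMA datapath) unpacks a float into those fields.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Unpack an IEEE 754-2008 binary32 value into sign, biased exponent, and fraction. */
static void unpack_binary32(float f, uint32_t *sign, uint32_t *exponent, uint32_t *fraction)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret the bit pattern */
    *sign     = bits >> 31;                /* 1 bit  */
    *exponent = (bits >> 23) & 0xFFu;      /* 8 bits, biased by 127 */
    *fraction = bits & 0x7FFFFFu;          /* 23 bits, implicit leading 1 for normal numbers */
}

int main(void)
{
    uint32_t s, e, m;
    unpack_binary32(-1.5f, &s, &e, &m);
    /* -1.5 = (-1)^1 * 1.5 * 2^0  ->  sign 1, biased exponent 127, fraction 0x400000 */
    printf("sign=%u exponent=%u fraction=0x%06X\n", s, e, m);
    return 0;
}
```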
34

Performance- and energy-efficient compilation for digital SIMD signal processors using genetic algorithms / Performance- und energieeffiziente Compilierung für digitale SIMD-Signalprozessoren mittels genetischer Algorithmen

Lorenz, Markus. Unknown Date (has links) (PDF)
Universität Dortmund, Diss., 2003.
35

Modeling and algorithm adaptation for a novel parallel DSP processor / Modellering och algorithm-anpassning för en ny parallell DSP-processor

Kraigher, Olof, Olsson, Johan January 2009 (has links)
The P3RMA (Programmable, Parallel, and Predictable Random Memory Access) processor, currently being developed at Linköping University, Sweden, is an attempt to address the problems of parallel computing by utilizing a parallel memory subsystem and by separating the complexity of address computations from that of data computations. It is targeted at embedded low-power, low-cost computing for mobile phones, handsets, and base stations, among many other devices. By studying the radix-2 FFT using the P3RMA concept, we have shown that even algorithms with a complex addressing pattern can be adapted to fully utilize a parallel datapath while requiring only simple additional addressing hardware. By supporting this algorithm with a SIMT instruction, almost 100% utilization of the datapath can be achieved. A simulator framework for this processor has been proposed and implemented. The simulator has a very flexible structure, featuring modular addition of new instructions and configurable hardware parameters, and may be used by hardware and firmware developers in the future.
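To make the "complex addressing pattern" of the radix-2 FFT concrete: in an in-place decimation-in-time FFT (after bit-reversal reordering), the stage with butterfly span h pairs indices that differ only in the bit corresponding to h, a regular pattern that simple dedicated addressing hardware can generate. The C sketch below enumerates those address pairs; it is illustrative and unrelated to the actual P3RMA hardware.

```c
#include <stdio.h>

/* Print the butterfly address pairs of an in-place radix-2 DIT FFT of length n
 * (n must be a power of two).  Each stage pairs indices that differ in exactly
 * one bit, which is the regular pattern a parallel memory subsystem must serve. */
static void print_fft_addressing(unsigned n)
{
    for (unsigned half = 1; half < n; half <<= 1) {          /* butterfly half-span per stage */
        printf("stage with span %u:\n", half);
        for (unsigned group = 0; group < n; group += 2 * half)
            for (unsigned k = 0; k < half; k++)
                printf("  butterfly (%u, %u)\n", group + k, group + k + half);
    }
}

int main(void)
{
    print_fft_addressing(8);
    return 0;
}
```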
36

On SIMD code generation for the CELL SPE processor

Pettersson, Magnus January 2010 (has links)
This thesis project attempts to answer the question of whether it is possible to gain performance by using SIMD instructions when generating code for scalar computations. The current trend in processor architecture is to equip processors with multi-way SIMD units to form so-called throughput cores. This project uses the Cell SPE processor for a concrete implementation. To obtain good code quality, the thesis continues work on the code generator by Mattias Eriksson and Andrzej Bednarski, which is based on integer linear programming. The code generator is extended to handle generation of SIMD code for 32-bit operands. The results show, for some basic blocks, a positive impact on the execution time of the generated schedule. However, further work is needed to obtain a feasible run time for the code generator itself.
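The basic idea of gaining performance from SIMD instructions on scalar code is to pack several independent scalar operations from the schedule into the lanes of one vector instruction. The toy C example below (using the GCC/Clang vector_size extension, not the thesis' ILP-based code generator) shows four independent scalar additions and the single 4-wide operation a scheduler could legally replace them with.

```c
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));   /* GCC/Clang 4-lane float vector */

/* Four independent scalar additions: no data dependences between them. */
void scalar_version(float *r, const float *a, const float *b)
{
    r[0] = a[0] + b[0];
    r[1] = a[1] + b[1];
    r[2] = a[2] + b[2];
    r[3] = a[3] + b[3];
}

/* The same work expressed as one 4-wide SIMD addition. */
void simd_version(float *r, const float *a, const float *b)
{
    v4sf va, vb, vr;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;                /* one vector instruction covers all four lanes */
    memcpy(r, &vr, sizeof vr);
}
```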
37

Verification and FPGA implementation of a floating point SIMD processor for MIMO processing / Verifiering och FPGA-implementering av en flyttalsbaserad SIMD processor för MIMO-bearbetning

Hussain, Sajid January 2010 (has links)
The rapidly increasing capabilities of digital electronics have increased the demand for Software Defined Radio (SDR), which was not feasible with earlier special-purpose hardware. These enhanced capabilities come at the cost of time, owing to the complex operations involved in multi-antenna wireless communications; one of these operations is complex matrix inversion. This thesis presents the verification and FPGA implementation of a SIMD processor developed at the Computer Engineering division of Linköping University, Sweden. The SIMD processor was designed specifically for performing complex matrix inversion efficiently, but it can also be reused for other operations. The processor is fully verified using all possible combinations of instructions. An optimized firmware for this processor is implemented for efficiently inverting 4×4 matrices. Because of the large number of subtractions involved, the direct analytical approach loses stability for 4×4 matrices. Instead, a blockwise subdivision is used, in which the 4×4 matrix is subdivided into four 2×2 matrices. Based on these 2×2 matrices, the inverse of the 4×4 matrix is computed using the direct analytical approach together with some additional computations. Finally, the SIMD processor is integrated with the Senior processor (a control processor) and synthesized on a Xilinx Virtex-4 FPGA, after which the performance of the proposed architecture is evaluated. A firmware is implemented for the Senior that uploads and downloads data and program code to and from the SIMD unit using both I/O and DMA.
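The blockwise scheme described above writes the 4×4 matrix as [[A, B], [C, D]] with 2×2 blocks and combines the analytic 2×2 inverse with the Schur complement S = D − C·A⁻¹·B. The sketch below shows the idea for real-valued matrices; the thesis operates on complex matrices and the actual firmware differs, so this is illustrative only.

```c
/* 2x2 real matrices stored row-major as {m00, m01, m10, m11}. */
typedef struct { float m[4]; } Mat2;

static Mat2 mul2(Mat2 a, Mat2 b)          /* A * B */
{
    Mat2 c = {{ a.m[0]*b.m[0] + a.m[1]*b.m[2],  a.m[0]*b.m[1] + a.m[1]*b.m[3],
                a.m[2]*b.m[0] + a.m[3]*b.m[2],  a.m[2]*b.m[1] + a.m[3]*b.m[3] }};
    return c;
}

static Mat2 add2(Mat2 a, Mat2 b)          /* A + B */
{
    Mat2 c = {{ a.m[0]+b.m[0], a.m[1]+b.m[1], a.m[2]+b.m[2], a.m[3]+b.m[3] }};
    return c;
}

static Mat2 sub2(Mat2 a, Mat2 b)          /* A - B */
{
    Mat2 c = {{ a.m[0]-b.m[0], a.m[1]-b.m[1], a.m[2]-b.m[2], a.m[3]-b.m[3] }};
    return c;
}

static Mat2 neg2(Mat2 a)                  /* -A */
{
    Mat2 c = {{ -a.m[0], -a.m[1], -a.m[2], -a.m[3] }};
    return c;
}

static Mat2 inv2(Mat2 a)                  /* analytic 2x2 inverse */
{
    float det = a.m[0]*a.m[3] - a.m[1]*a.m[2];
    Mat2 c = {{ a.m[3]/det, -a.m[1]/det, -a.m[2]/det, a.m[0]/det }};
    return c;
}

/* Blockwise inverse of the 4x4 matrix M = [[A, B], [C, D]]:
 *   S    = D - C*inv(A)*B                                 (Schur complement of A)
 *   Minv = [[ inv(A) + inv(A)*B*inv(S)*C*inv(A),  -inv(A)*B*inv(S) ],
 *           [            -inv(S)*C*inv(A),                 inv(S)  ]]            */
static void inv4_blockwise(Mat2 A, Mat2 B, Mat2 C, Mat2 D,
                           Mat2 *TL, Mat2 *TR, Mat2 *BL, Mat2 *BR)
{
    Mat2 Ai  = inv2(A);
    Mat2 S   = sub2(D, mul2(mul2(C, Ai), B));
    Mat2 Si  = inv2(S);
    Mat2 AiB = mul2(Ai, B);
    Mat2 CAi = mul2(C, Ai);

    *TL = add2(Ai, mul2(mul2(AiB, Si), CAi));
    *TR = neg2(mul2(AiB, Si));
    *BL = neg2(mul2(Si, CAi));
    *BR = Si;
}
```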
38

Performance Evaluation of Digital Image Processing on the Web using WebAssembly

Nyberg, Christoffer January 2023 (has links)
JavaScript has been the de facto standard programming language for Web browsers for some time. Although it has enabled the interactive and complex web pages we have today, it has long been characterized by performance issues. A promising newer technology, WebAssembly, aims to enable near-native performance on the Web. WebAssembly is a binary instruction format designed as a compilation target for programming languages such as C/C++, allowing developers to deploy their applications for execution in a Web browser environment. Previous benchmarks have examined the performance of WebAssembly and observed a varying slowdown of 10% to around 55% relative to native code. Recent additions to the WebAssembly standard, such as support for SIMD instructions and multithreading, enable even greater performance, and new benchmarks need to be constructed. This thesis explores the performance implications of these new features by applying them in the domain of digital image processing, which is particularly well suited to such optimizations. The OpenCV library was used to construct two benchmark suites, one running natively and one running in two different Web browsers using WebAssembly. The results indicate that, although performance in some cases approached native performance, the mean slowdown was approximately a factor of two compared to native code.
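As an illustration of the kind of kernel that benefits from WebAssembly's 128-bit SIMD extension (a generic example, not one of the thesis' OpenCV benchmarks), the sketch below scales the brightness of a single-channel float image four pixels at a time using the wasm_simd128.h intrinsics that clang/Emscripten expose when compiling with -msimd128; the function name and parameters are made up for the example.

```c
#include <wasm_simd128.h>

/* Multiply every pixel of a single-channel float image by a gain factor,
 * processing four pixels per 128-bit SIMD operation. */
void scale_brightness(float *pixels, int count, float gain)
{
    v128_t vgain = wasm_f32x4_splat(gain);
    int i = 0;
    for (; i + 4 <= count; i += 4) {
        v128_t p = wasm_v128_load(&pixels[i]);
        wasm_v128_store(&pixels[i], wasm_f32x4_mul(p, vgain));
    }
    for (; i < count; i++)               /* scalar tail for leftover pixels */
        pixels[i] *= gain;
}
```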
39

Design and Implementation of a Multithreaded Associative SIMD Processor

Schaffer, Kevin 30 November 2011 (has links)
No description available.
40

Optimising IIR Filters Using ARM NEON

Bentmar Holgersson, Sebastian January 2012 (has links)
The ARM Cortex-A9 CPU has a SIMD extension called NEON MPE, which provides vector instructions that perform operations on multiple elements in a single instruction. While this usually improves performance, a particular class of IIR filters called biquads poses problems: only five operations are needed per sample, and every iteration depends on the result of the previous one, so samples must be processed one at a time. A brief overview is given of IIR filters, the NEON extension, and fixed-point processing. To analyse NEON optimisation of biquad filters, an audio effect is implemented in four variants, comparing floating-point and fixed-point arithmetic, each with and without NEON optimisation. The problems posed by biquad filters are solved by processing the audio channels in parallel: since the channels are independent, the left and right samples can be computed simultaneously, which approximately doubles performance. Further performance improvement comes from more efficient memory operations and from fixed-point processing. The results show that the fixed-point NEON implementation is the fastest, while the floating-point NEON implementation is only marginally slower and simpler to write. The use of NEON instructions improves performance by a factor of 1.7 to 2.8 in the cases tested.
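A minimal sketch of the channel-parallel idea described above, using ARM NEON intrinsics with one stereo channel per lane of a two-lane float vector; coefficient handling and state initialization are simplified, and this is not the thesis' implementation.

```c
#include <arm_neon.h>

/* Direct-form-I biquad: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2],
 * computed for the left and right channel at the same time in the two lanes of a
 * float32x2_t vector.  `in` and `out` hold interleaved stereo samples: L0 R0 L1 R1 ... */
void biquad_stereo_neon(const float *in, float *out, int frames,
                        float b0, float b1, float b2, float a1, float a2)
{
    float32x2_t x1 = vdup_n_f32(0.0f), x2 = vdup_n_f32(0.0f);  /* previous inputs  */
    float32x2_t y1 = vdup_n_f32(0.0f), y2 = vdup_n_f32(0.0f);  /* previous outputs */

    for (int n = 0; n < frames; n++) {
        float32x2_t x = vld1_f32(&in[2 * n]);    /* load L and R sample          */
        float32x2_t y = vmul_n_f32(x, b0);       /* b0*x[n]                      */
        y = vmla_n_f32(y, x1, b1);               /* + b1*x[n-1]                  */
        y = vmla_n_f32(y, x2, b2);               /* + b2*x[n-2]                  */
        y = vmls_n_f32(y, y1, a1);               /* - a1*y[n-1]                  */
        y = vmls_n_f32(y, y2, a2);               /* - a2*y[n-2]                  */
        vst1_f32(&out[2 * n], y);                /* store filtered L and R       */
        x2 = x1; x1 = x;                         /* shift delay lines            */
        y2 = y1; y1 = y;
    }
}
```

Because both channels share the same coefficients but have independent state, the two lanes never interact, which is exactly why the approach roughly doubles throughput despite the per-sample dependency within each channel.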
