Global ETD Search

1	Design, Implementation, And Verification Of A Programmable Floating- And Fixed-Point Vertex Shader Huang, Kuan-min 01 September 2009 (has links) The 3D graphics pipeline can be divided into two subsystems: the geometry subsystem and the rendering subsystem. Hardware implementations of transformation and lighting in the geometry subsystem fall into two categories: the fixed-function pipeline and the programmable vertex shader. This thesis proposes a programmable vertex shader design based on the OpenGL ES 2.0 specification. We start from the design of the instruction set and use a multiplier-accumulator (MAC)-based SIMD (Single-Instruction Multiple-Data) structure. The vertex shader supports both floating-point and fixed-point operations in both scalar and vector formats. In addition, a special function unit for calculating complicated functions is integrated into the vertex shader. We also do our best to minimize cost, power, and delay throughout the design process. Vertex Shader SIMD Programmable
2	SWAR systems and communications applications Spracklen, Lawrence A. January 2001 (has links) In recent years, the instruction sets of the majority of present-day general-purpose processors have been extended to include a variety of SWAR (SIMD Within A Register) instructions. These operations, which make possible the processing of multiple data elements with a single instruction, have been proven to facilitate the acceleration of a wide range of graphics and multimedia applications. The application of these instructions is not, however, limited to this type of program, and current research is concerned with developing high-performance implementations of a wide range of new applications. The initial part of this thesis presents a number of innovative strategies for accelerating previously unaddressed classes of application and illustrates that a significant degree of acceleration can frequently be achieved. However, for all of the applications analysed, several SWAR-specific problems were repeatedly encountered. For the efficient operation of implementations developed using these SWAR instructions, the organisation of the data to be processed is of critical importance, and the required arrangement frequently fails to match that normally encountered in applications. SWAR ISEs (Instruction Set Extensions) usually contain functionality to address these problems, but it is shown by the author that this current functionality is insufficient and significantly curtails the performance achievable with these ISEs. The VMM (VIS Manipulation Matrix) was developed to address this problem and provide a methodology whereby the performance obtained using SWAR methodologies could be made significantly more independent of the underlying data organisation. The functionality of the VMM is presented and the effectiveness of this methodology is highlighted by considering its application to a number of important algorithms. 
Finally, the feasibility of integrating the VMM into a general purpose processor is revealed, by highlighting its compatibility with the UltraSPARC processor. 005 SIMD; Register
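As an illustration of the SWAR idea mentioned in entry 2 (several data elements processed by one ordinary register-wide instruction), the following minimal sketch adds four unsigned 8-bit lanes packed into a 32-bit word. It is a general textbook technique, not the VMM or any code from the thesis; the names `swar_add_u8x4`, `pack`, and `unpack` are hypothetical.

```python
LOW7 = 0x7F7F7F7F   # low seven bits of each 8-bit lane
HIGH1 = 0x80808080  # top bit of each 8-bit lane


def swar_add_u8x4(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping per lane."""
    # Add only the low 7 bits of each lane: the largest partial sum is
    # 0x7F + 0x7F = 0xFE, so no carry can spill into the neighbouring lane.
    low = (a & LOW7) + (b & LOW7)
    # Fold the top bit of each lane back in with XOR (a carry-less add),
    # which discards the per-lane carry-out and gives wrap-around behaviour.
    return (low ^ ((a ^ b) & HIGH1)) & 0xFFFFFFFF


def pack(lanes):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


result = unpack(swar_add_u8x4(pack([1, 200, 255, 16]), pack([2, 100, 1, 16])))
```

The sketch also hints at the data-organisation issue the abstract raises: the trick only works once the operands are already packed lane by lane, which is the arrangement applications often do not provide.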
3	A reconfigurable SIMD architecture on-chip Andersson, Johan, Mohlin, Mikael, Nilsson, Artur January 2006 (has links) This project targets the problems with design and implementation of Single Instruction Multiple Data (SIMD) architectures in System-on-Chip (SoC), with the goal of constructing a reconfigurable framework in VHDL to ease this process. The resulting framework should be implemented on an FPGA and its usability tested. The main parts of a SIMD architecture were identified to be the Control Unit (CU), the Processing Elements (PE) and the Interconnection Network (ICN), and a framework was constructed with these parts as the main building blocks. The constructed framework is reconfigurable in data width, memory size, number of PEs, topology and instruction set. To test the ease of use and performance of the system, a FIR-filter application was implemented. The scalability of the system and its different parts has been measured and comparisons are illustrated. Single Instruction SIMD System-on-Chip
4	Investigation of NoGap : SIMD Datapath Implementation Chan, Chun-Jung January 2011 (has links) Nowadays, many ASIP systems with high computational capabilities are designed to meet the increasing demands of technical applications. However, designing an ASIP system usually takes many man-hours. A number of EDA tools have therefore been developed to ease the design effort, but they limit design freedom because of their predefined design templates. Consequently, designers are forced to use lower-level HDLs, which offer high design flexibility but require substantial design hours. A novel design automation tool called NoGap was proposed to balance these issues. The NoGap system, which is used especially in ASIP and accelerator design, provides high design flexibility while saving design effort. This thesis investigates the efficiency and design capability of NoGap. NoGap was used to implement an eight-way SIMD datapath of an ASIP called Sleipnir, which was devised by the Division of Computer Engineering at Linköping University. The manually crafted HDL implementation of Sleipnir served as the comparison. The critical-path implementations produced by both design approaches were synthesized for the Altera Stratix IV FPGA. The synthesis results showed that, although the NoGap design used 1.358 times as many hardware units as the original HDL design, the timing performance of the two is comparable (HDL/NoGap: 60.042/58.156 MHz). Based on the experience of designing the SIMD datapath, the thesis suggests guidelines for future users who implement SIMD structures with NoGap. In addition, hidden bugs and missing features of NoGap were discovered, and suggestions are provided to help the developers improve the NoGap system. ASIP NoGap ePUMA Sleipnir SIMD
5	A reconfigurable SIMD architecture on-chip Andersson, Johan, Mohlin, Mikael, Nilsson, Artur January 2006 (has links) This project targets the problems with design and implementation of Single Instruction Multiple Data (SIMD) architectures in System-on-Chip (SoC), with the goal of constructing a reconfigurable framework in VHDL to ease this process. The resulting framework should be implemented on an FPGA and its usability tested. The main parts of a SIMD architecture were identified to be the Control Unit (CU), the Processing Elements (PE) and the Interconnection Network (ICN), and a framework was constructed with these parts as the main building blocks. The constructed framework is reconfigurable in data width, memory size, number of PEs, topology and instruction set. To test the ease of use and performance of the system, a FIR-filter application was implemented. The scalability of the system and its different parts has been measured and comparisons are illustrated. Single Instruction SIMD System-on-Chip
6	The Optimal Design for Action Recognition Algorithm on Cell Processor Architecture Pan, Po-Hsun 23 August 2011 (has links) In recent years, automatic human action recognition has been widely researched within the computer vision and image processing communities. Identifying human behavior automatically from video is a great help to surveillance for home care, personal property protection, and homeland security. Achieving action recognition involves many factors, primarily accuracy and real-time performance. Parallelizing the action recognition algorithm would greatly improve its real-time processing capability. To meet the real-time demand, we study how to parallelize the action recognition algorithm on the CELL B.E. platform. With our design, the action recognition algorithm achieves a 231-fold speedup over the original algorithm. We found that the algorithm contains many operations repeated across blocks, which can be parallelized using a single-instruction multiple-data (SIMD) architecture. The action recognition algorithm comprises four major algorithms: DMASKS, HMHHb, MGD, and SVM. The SIMD instructions on the CELL B.E. platform operate on 128 bits of data at once. SIMD parallelism reaches 16-fold for DMASKS, up to 128-fold for HMHHb, up to 8-fold for MGD, and 4-fold for SVM. Based on the CELL B.E. acceleration mechanisms, we build high-performance computing models with multi-threading and multiple streaming. Our study shows that the action recognition algorithm is well suited to multi-core systems with a parallel-processing SIMD architecture. Parallelization lets the algorithm respond more quickly when identifying human actions. With this real-time headroom, more complex algorithms can be included in the future to improve accuracy, achieving both immediacy and accuracy. 
action recognition SIMD CELL parallelize
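The per-algorithm parallelism figures in entry 6 are what one would expect from dividing a 128-bit register into equal-width lanes. The short sketch below works through that arithmetic. The element widths (8, 1, 16, and 32 bits) are an assumption inferred from the reported figures; the abstract does not state them, and the names `lanes` and `ASSUMED_ELEMENT_BITS` are hypothetical.

```python
REGISTER_BITS = 128  # SIMD register width stated in the abstract


def lanes(element_bits: int, register_bits: int = REGISTER_BITS) -> int:
    """Number of equal-width elements that fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


# ASSUMPTION: element widths chosen so that the lane counts match the
# parallelism reported in the abstract; the thesis may use other layouts.
ASSUMED_ELEMENT_BITS = {"DMASKS": 8, "HMHHb": 1, "MGD": 16, "SVM": 32}

parallelism = {name: lanes(bits) for name, bits in ASSUMED_ELEMENT_BITS.items()}
```

Under these assumed widths, `parallelism` reproduces the 16-, 128-, 8-, and 4-fold figures quoted for the four algorithms.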
7	Překlad OpenCL aplikací pro vestavěné systémy / Compilation of OpenCL Applications for Embedded Systems Šnobl, Pavel January 2016 (has links) This master's thesis deals with support for compiling and executing programs written with the OpenCL framework on embedded systems. OpenCL is a framework for programming heterogeneous systems comprising processors, graphics accelerators, and other computing devices. It is also useful on systems with just one computing unit, where it allows writing parallel programs (task and data parallelism) and working with a hierarchical memory system. In this thesis, several available open-source OpenCL implementations are compared, and the selected one is integrated into the LLVM compiler infrastructure. This compiler is generated as part of the toolchain provided by Codasip Studio, a development environment for application-specific instruction-set processors. Optimizations for architectures with SIMD instructions and for VLIW architectures are also designed and implemented. The result is tested and demonstrated on a set of test applications. Codasip; OpenCL; VLIW; LLVM; SIMD
8	Implementation study of radar signal processing Using SIMD architectures Ekström, Mikael, Westerberg, Martin January 2006 (has links) The aim of this project was to evaluate the use of SIMD array architectures in radar signal processing. This has been done by implementing one of the most demanding parts of the radar signal processing chain for airborne radar on the CSX600 architecture developed by Clearspeed Technologies. The CSX600 architecture is a SIMD processor with 96 processing elements, which can be arranged either as a linear array or as a ring. The QR-decomposition, which was the part chosen for implementation, is the most performance-demanding part of the STAP stage. To create a relevant test case, the well-known RT STAP benchmark from Mitre Corporation has been used. Two different algorithms for performing QR-decompositions have been implemented and verified. In both cases it has been concluded that either longer (> ≈256) or shorter (< ≈32) processor array lengths would, in general, yield a higher utilization ratio. The FLOP count and utilization have been measured for both algorithms, and it has been concluded that at least eight CSX600 processors are needed to meet the real-time demand of the benchmark. SIMD array architectures Radar signal processing
9	Parallel algorithms for real-time peptide-spectrum matching Zhang, Jian 16 December 2010 Tandem mass spectrometry is a powerful experimental tool used in molecular biology to determine the composition of protein mixtures. It has become a standard technique for protein identification. Due to the rapid development of mass spectrometry technology, instruments can now produce a large number of mass spectra, which are used for peptide identification. The increasing data size demands efficient software tools to perform peptide identification. In a tandem mass experiment, peptide ion selection algorithms generally select only the most abundant peptide ions for further fragmentation. Because of this, the low-abundance proteins in a sample rarely get identified. To address this problem, researchers developed the notion of a 'dynamic exclusion list', which maintains a list of newly selected peptide ions and ensures that these ions are not selected again for a certain time. In this way, other peptide ions get more opportunity to be selected and identified, allowing peptides of lower abundance to be identified. However, a better method is to also feed the identification results into the 'dynamic exclusion list' approach. To do this, a real-time peptide identification algorithm is required. In this thesis, we introduce methods to improve the speed of peptide identification so that the 'dynamic exclusion list' approach can use the peptide identification results without affecting the throughput of the instrument. Our work is based on RT-PSM, a real-time program for peptide-spectrum matching with statistical significance. We profile the speed of RT-PSM and find that the peptide-spectrum scoring module is the most time-consuming portion. Given the profiling results, we introduce methods to parallelize the peptide-spectrum scoring algorithm. In this thesis, we propose two parallel algorithms using different technologies. 
The first parallel algorithm performs peptide-spectrum matching using SIMD instructions. We implemented and tested it on the Intel SSE architecture. The test results show that an 18-fold speedup of the entire process is obtained. The second parallel algorithm is developed using NVIDIA CUDA technology. We describe two CUDA kernels based on different algorithms and compare the performance of the two kernels. The more efficient algorithm is integrated into RT-PSM. The time measurements show that a 190-fold speedup of the scoring module and a 26-fold speedup of the entire process are obtained. We profile the CUDA version again to show that the scoring module has been optimized to the point where it is no longer the most time-consuming module in the CUDA version of RT-PSM. In addition, we evaluate the feasibility of creating a metric index to reduce the number of candidate peptides. We describe evaluation methods and show that general indexing methods are not likely to be feasible for RT-PSM. Bioinformatics SIMD Parallel GPU Computer Science
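The 'dynamic exclusion list' described in entry 9 can be sketched as a small data structure: each selected ion is barred from reselection for a fixed time, so less abundant ions get a turn. This is a minimal illustration under assumed details, not code from RT-PSM; the class name, the `exclusion_seconds` parameter, and the ion identifiers are all hypothetical.

```python
class DynamicExclusionList:
    """Keeps recently selected ions from being selected again for a fixed time."""

    def __init__(self, exclusion_seconds: float):
        self.exclusion_seconds = exclusion_seconds
        self._excluded_until = {}  # ion id -> time at which its exclusion ends

    def is_excluded(self, ion_id, now: float) -> bool:
        # An ion never seen before has no entry and is therefore selectable.
        return self._excluded_until.get(ion_id, float("-inf")) > now

    def select(self, abundances: dict, now: float):
        """Return the most abundant non-excluded ion and exclude it, or None."""
        candidates = {
            ion: abundance
            for ion, abundance in abundances.items()
            if not self.is_excluded(ion, now)
        }
        if not candidates:
            return None
        chosen = max(candidates, key=candidates.get)
        self._excluded_until[chosen] = now + self.exclusion_seconds
        return chosen


# Hypothetical scan: ion ids mapped to abundances, sampled at several times.
scan = {"A": 900.0, "B": 120.0, "C": 15.0}
dex = DynamicExclusionList(exclusion_seconds=30.0)
picks = [dex.select(scan, t) for t in (0.0, 10.0, 20.0, 25.0, 30.0)]
```

In this run the abundant ion "A" is chosen first, the weaker ions "B" and "C" are reached while "A" is excluded, and "A" becomes selectable again once its 30-second window has passed.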
10	Implementation study of radar signal processing Using SIMD architectures Ekström, Mikael, Westerberg, Martin January 2006 (has links) The aim of this project was to evaluate the use of SIMD array architectures in radar signal processing. This has been done by implementing one of the most demanding parts of the radar signal processing chain for airborne radar on the CSX600 architecture developed by Clearspeed Technologies. The CSX600 architecture is a SIMD processor with 96 processing elements, which can be arranged either as a linear array or as a ring. The QR-decomposition, which was the part chosen for implementation, is the most performance-demanding part of the STAP stage. To create a relevant test case, the well-known RT STAP benchmark from Mitre Corporation has been used. Two different algorithms for performing QR-decompositions have been implemented and verified. In both cases it has been concluded that either longer (> ≈256) or shorter (< ≈32) processor array lengths would, in general, yield a higher utilization ratio. The FLOP count and utilization have been measured for both algorithms, and it has been concluded that at least eight CSX600 processors are needed to meet the real-time demand of the benchmark. SIMD array architectures Radar signal processing
