Global ETD Search

1	Solving Hyperbolic PDEs using Accelerator Architectures Rostrup, Scott 15 July 2009 (has links) Accelerator architectures are used to accelerate the simulation of nonlinear hyperbolic PDEs. Three different architectures, a multicore CPU using threading, IBM’s Cell Processor, and Nvidia’s Tesla GPUs are investigated. Speed-ups of between 40-75× relative to a single CPU core in single precision are obtained using the Cell processor and the GPU. The three implementations are extended to parallel computing clusters by making use of the Message Passing Interface (MPI). The resulting hybrid-parallel code is investigated for performance and scalability on both a GPU and Cell computing cluster. GPU Cell Processor Hyperbolic PDEs Hardware Optimization Applied Mathematics
2	Solving Hyperbolic PDEs using Accelerator Architectures Rostrup, Scott 15 July 2009 (has links) Accelerator architectures are used to accelerate the simulation of nonlinear hyperbolic PDEs. Three different architectures, a multicore CPU using threading, IBM’s Cell Processor, and Nvidia’s Tesla GPUs are investigated. Speed-ups of between 40-75× relative to a single CPU core in single precision are obtained using the Cell processor and the GPU. The three implementations are extended to parallel computing clusters by making use of the Message Passing Interface (MPI). The resulting hybrid-parallel code is investigated for performance and scalability on both a GPU and Cell computing cluster. GPU Cell Processor Hyperbolic PDEs Hardware Optimization Applied Mathematics
3	Extended architectural enhancements for minimizing message delivery latency on cache-less architectures (e.g., Cell BE) Kroeker, Anthony 12 January 2012 (has links) This thesis proposes to reduce the latency of MPI receive operations on cacheless architectures, by removing the delay of copying messages when they are first received. This is achieved by copying the messages directly into buffers in the lowest level of the memory hierarchy (e.g., scratchpad memory). The previously proposed solution introduced an Indirection Cache which would map between the receive variables and the buffered message payload locations. This proved somewhat beneficial, but the lookup penalty of the Indirection Cache limited its effectiveness. Therefore this thesis proposes that a most recently used buffer (i.e., an Indirection Buffer) be placed in front of the Indirection Cache to eliminate this penalty and speed up access. The tests conducted demonstrated that this method was indeed effective and improved over the original method by at least an order of magnitude. Finally, examination of implementation feasibility showed that this could be implemented with a small Cache, and that even with access times 6x slower than initially assumed, the approach with the Indirection Buffer would still be effective. / Graduate computer engineering computer architecture cell processor mpi cache injection cacheless Indirection Cache Indirection Buffer
4	H.264 Baseline Real-time High Definition Encoder on CELL Wei, Zhengzhe January 2010 (has links) <p>In this thesis a H.264 baseline high definition encoder is implemented on CELL processor. The target video sequence is YUV420 1080p at 30 frames per second in our encoder. To meet real-time requirements, a system architecture which reduces DMA requests is designed for large memory accessing. Several key computing kernels: Intra frame encoding, motion estimation searching and entropy coding are designed and ported to CELL processor units. A main challenge is to find a good tradeoff between DMA latency and processing time. The limited 256K bytes on-chip memory of SPE has to be organized efficiently in SIMD way. CAVLC is performed in non-real-time on the PPE.</p><p> </p><p>The experimental results show that our encoder is able to encode I frame in high quality and encode common 1080p video sequences in real-time. With the using of five SPEs and 63KB executable code size, 20.72M cycles are needed to encode one P frame partitions for one SPE. The average PSNR of P frames increases a maximum of 1.52%. In the case of fast speed video sequence, 64x64 search range gets better frame qualities than 16x16 search range and increases only less than two times computing cycles of 16x16. Our results also demonstrate that more potential power of the CELL processor can be utilized in multimedia computing.</p><p> </p><p>The H.264 main profile will be implemented in future phases of this encoder project. Since the platform we use is IBM Full-System Simulator, DMA performance in a real CELL processor is an interesting issue. Real-time entropy coding is another challenge to CELL.</p> Video coding H.264 CELL processor Real-time coding Intra prediction Parallel programming SIMD Computer engineering Datorteknik
5	H.264 Baseline Real-time High Definition Encoder on CELL Wei, Zhengzhe January 2010 (has links) In this thesis a H.264 baseline high definition encoder is implemented on CELL processor. The target video sequence is YUV420 1080p at 30 frames per second in our encoder. To meet real-time requirements, a system architecture which reduces DMA requests is designed for large memory accessing. Several key computing kernels: Intra frame encoding, motion estimation searching and entropy coding are designed and ported to CELL processor units. A main challenge is to find a good tradeoff between DMA latency and processing time. The limited 256K bytes on-chip memory of SPE has to be organized efficiently in SIMD way. CAVLC is performed in non-real-time on the PPE. The experimental results show that our encoder is able to encode I frame in high quality and encode common 1080p video sequences in real-time. With the using of five SPEs and 63KB executable code size, 20.72M cycles are needed to encode one P frame partitions for one SPE. The average PSNR of P frames increases a maximum of 1.52%. In the case of fast speed video sequence, 64x64 search range gets better frame qualities than 16x16 search range and increases only less than two times computing cycles of 16x16. Our results also demonstrate that more potential power of the CELL processor can be utilized in multimedia computing. The H.264 main profile will be implemented in future phases of this encoder project. Since the platform we use is IBM Full-System Simulator, DMA performance in a real CELL processor is an interesting issue. Real-time entropy coding is another challenge to CELL. Video coding H.264 CELL processor Real-time coding Intra prediction Parallel programming SIMD Computer Engineering Datorteknik
6	Jack Rabbit : an effective Cell BE programming system for high performance parallelism Ellis, Apollo Isaac Orion 08 July 2011 (has links) The Cell processor is an example of the trade-offs made when designing a mass market power efficient multi-core machine, but the machine-exposing architecture and raw communication mechanisms of Cell are hard to manage for a programmer. Cell's design is simple and causes software complexity to go up in the areas of achieving low threading overhead, good bandwidth efficiency, and load balance. Several attempts have been made to produce efficient and effective programming systems for Cell, but the attempts have been too specialized and thus fall short. We present Jack Rabbit, an efficient thread pool work queue implementation, with load balancing mechanisms and double buffering. Our system incurs low threading overhead, gets good load balance, and achieves bandwidth efficiency. Our system represents a step towards an effective way to program Cell and any similar current or future processors. / text Cell processor Multi-core systems High performance computing Runtime Barnes Hut LU factorization Mandelbrot Double buffering Thread pool Work queue Load balance
7	Optimisation multi-niveau d’une application de traitement d’images sur machines parallèles / Multi-level optimisation of an image processing application on parallel machines Saidani, Tarik 06 November 2012 (has links) Cette thèse vise à définir une méthodologie de mise en œuvre d’applications performantes sur les processeurs embarqués du futur. Ces architectures nécessitent notamment d’exploiter au mieux les différents niveaux de parallélisme (grain fin, gros grain) et de gérer les communications et les accès à la mémoire. Pour étudier cette méthodologie, nous avons utilisé un processeur cible représentatif de ces architectures émergentes, le processeur CELL. Le détecteurde points d’intérêt de Harris est un exemple de traitement régulier nécessitant des unités de calcul intensif. En étudiant plusieurs schémas de mise en oeuvre sur le processeur CELL, nous avons ainsi pu mettre en évidence des méthodes d’optimisation des calculs en adaptant les programmes aux unités spécifiques de traitement SIMD du processeur CELL. L’utilisation efficace de la mémoire nécessite par ailleurs, à la fois une bonne exploitation des transferts et un arrangement optimal des données en mémoire. Nous avons développé un outil d’abstraction permettant de simplifier et d’automatiser les transferts et la synchronisation, CELL MPI. Cette expertise nous a permis de développer une méthodologie permettant de simplifier la mise en oeuvre parallèle optimisée de ces algorithmes. Nous avons ainsi conçu un outil de programmation parallèle à base de squelettes algorithmiques : SKELL BE. Ce modèle de programmation propose une solution originale de génération d’applications à base de métaprogrammation. Il permet, de manière automatisée, d’obtenir de très bonnes performances et de permettre une utilisation efficace de l’architecture, comme le montre la comparaison pour un ensemble de programmes test avec plusieurs autres outils dédiés à ce processeur. / This thesis aims to define a design methodology for high performance applications on future embedded processors. These architectures require an efficient usage of their different level of parallelism (fine-grain, coarse-grain), and a good handling of the inter-processor communications and memory accesses. In order to study this methodology, we have used a target processor which represents this type of emerging architectures, the Cell BE processor.We have also chosen a low level image processing application, the Harris points of interest detector, which is representative of a typical low level image processing application that is highly parallel. We have studied several parallelisation schemes of this application and we could establish different optimisation techniques by adapting the software to the specific SIMD units of the Cell processor. We have also developped a library named CELL MPI that allows efficient communication and synchronisation over the processing elements, using a simplified and implicit programming interface. This work allowed us to develop a methodology that simplifies the design of a parallel algorithm on the Cell processor.We have designed a parallel programming tool named SKELL BE which is based on algorithmic skeletons. This programming model providesan original solution of a meta-programming based code generator. Using SKELL BE, we can obtain very high performances applications that uses the Cell architecture efficiently when compared to other tools that exist on the market. Programmation parallèle Processeur CELL Traitement d’images Squelettes algorithmiques Calcul hautes performances Méta-programmation Processeur embarqué Parallel programming CELL processor Image processing Algorithmic skeletons High performance computing Meta-programming Embedded processors

1

Page generated in 0.0805 seconds