• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 7
  • 1
  • Tagged with
  • 11
  • 11
  • 11
  • 5
  • 4
  • 4
  • 4
  • 4
  • 3
  • 3
  • 3
  • 2
  • 2
  • 2
  • 2
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
1

A Tiger Compiler for the Cell Broadband Engine Architecture

2013 August 1900 (has links)
The modern computing industry tends to build integrated circuits with multiple energy-efficient cores instead of ramping up the clock speed for each single processing unit. While each core may not run as fast as the single core model, such architecture allows more jobs to be handled in parallel and also provides better overall performance. Asymmetric Multiprocessing, also known as Heterogeneous Multiprocessing, involves multiple processors that differ architecturally from one another, especially where each processor has its own memory space. Under power limitations, this design could provide better performance than that attained through symmetric multiprocessing. However, the heterogeneous nature adds difficulty to programming. Each specific architecture requires its own program code. Programmers also need to explicitly transfer code and data between processors. This study describes the implementation of a compiler of the pedagogic Tiger language for the Cell Broadband Engine, an asymmetric multiprocessing platform jointly developed by Sony, Toshiba and IBM. The problem above is solved by introducing multiple backends for the Tiger language, along with a remote call stub (RCS) generator. Functions are compiled into different architectures, and calls across architectures are linked automatically through the stubs. RCS takes care of the execution context switch and hides details of the argument data/return value transfer. TigC simplifies the programming and building procedures. It also provides a high-level view of the whole program execution for future optimization because all of the source files are processed by a single compiler. As an example of this procedure, the possible optimization of data transfer during remote calls is investigated here.
2

Exploiting Multigrain Parallelism in Pairwise Sequence Search on Emergent CMP Architectures

Aji, Ashwin Mandayam 25 August 2008 (has links)
With the emerging hybrid multi-core and many-core compute platforms delivering unprecedented high performance within a single chip, and making rapid strides toward the commodity processor market, they are widely expected to replace the multi-core processors in the existing High-Performance Computing (HPC) infrastructures, such as large scale clusters, grids and supercomputers. On the other hand in the realm of bioinformatics, the size of genomic databases is doubling every 12 months, and hence the need for novel approaches to parallelize sequence search algorithms has become increasingly important. This thesis puts a significant step forward in bridging the gap between software and hardware by presenting an efficient and scalable model to accelerate one of the popular sequence alignment algorithms by exploiting multigrain parallelism that is exposed by the emerging multiprocessor architectures. Specifically, we parallelize a dynamic programming algorithm called Smith-Waterman both within and across multiple Cell Broadband Engines and within an nVIDIA GeForce General Purpose Graphics Processing Unit (GPGPU). Cell Broadband Engine: We parallelize the Smith-Waterman algorithm within a Cell node by performing a blocked data decomposition of the dynamic programming matrix followed by pipelined execution of the blocks across the synergistic processing elements (SPEs) of the Cell. We also introduce novel optimization methods that completely utilize the vector processing power of the SPE. As a result, we achieve near-linear scalability or near-constant efficiency for up to 16 SPEs on the dual-Cell QS20 blades, and our design is highly scalable to more cores, if available. We further extend this design to accelerate the Smith-Waterman algorithm across nodes on both the IBM QS20 and the PlayStation3 Cell cluster platforms and achieve a maximum speedup of 44, when compared to the execution times on a single Cell node. We then introduce an analytical model to accurately estimate the execution times of parallel sequence alignments and wavefront algorithms in general on the Cell cluster platforms. Lastly, we contribute and evaluate TOSS -- a Throughput-Oriented Sequence Scheduler, which leverages the performance prediction model and dynamically partitions the available processing elements to simultaneously align multiple sequences. This scheme succeeds in aligning more sequences per unit time with an improvement of 33.5% over the naive first-come, first-serve (FCFS) scheduler. nVIDIA GPGPU: We parallelize the Smith-Waterman algorithm on the GPGPU by optimizing the code in stages, which include optimal data layout strategies, coalesced memory accesses and blocked data decomposition techniques. Results show that our methods provide a maximum speedup of 3.6 on the nVIDIA GPGPU when compared to the performance of the naive implementation of Smith-Waterman. / Master of Science
3

Parallelization of Animation Blending on the PlayStation®3 / Parallellisering av Animationssystem på PlayStation®3

Jakobsson, Teodor January 2012 (has links)
An animation system gives a dynamic and life-like feel to character motions, allowing motion behaviour that far transcends the mere spatial translations of classic computer games. This increase in behavioural complexity however does not come for free as animation systems often are haunted by considerable performance overhead, the extent of which reflecting the complexity of the desired system.  In game development performance optimization is key, the pursuit of which is aided by the static hardware configuration of modern gaming consoles. These allow extensive optimization through specializing the application, at whole or in part, to the underlying hardware architecture. In this master's theses a method, that efficiently utilizes the parallel architecture of the PlayStation®3, is proposed in order to migrate the process of animation evaluation and blending from a single-thread implementation on the main processor to a fully parallelized multi-thread solution on the associated coprocessors. This method is further complimented with an in-depth study of the underlying theoretical foundations, as well as a reflection on similar works and approaches as used by other contemporary game development companies.
4

Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor

Li, Yi-Hsien January 2007 (has links)
<p>Space-Time Adaptive Processing (STAP) has been widely used in modern radar systems such as Ground Moving Target Indication (GMTI) systems in order to suppress jamming and interference. However, the high performance comes at a price of higher computational complexity, which requires extensive powerful hardware.</p><p>The new STI Cell Broadband Engine (CBE) processor combines PowerPC core augmented with eight streamlined high-performance SIMD processing engine offers an opportunity to implement the STAP baseband signal processing without any full custom hardware. This paper presents the implementation of an STAP baseband signal processing flow on the state-of-the-art STI CELL multiprocessor, which enables the concept of Software-Defined Radar (SDR). The potential of the Cell BE processor is studied so that kernel subroutine such as QR decomposition, Fast Fourier Transform (FFT), and FIR filtering of STAP are mapped to the SPE co-processors of Cell BE processor with variety of architectural specific optimization techniques.</p><p>This report starts with an overview of airborne radar technique and then the standard, specifically the third-order Doppler-factored STAP are introduced. Next, it goes with the thorough description of Cell BE architecture, its programming tool chain and parallel programming methods for Cell BE. In later chapter, how the STAP is implemented on the Cell BE processor is discussed and the simulation results are presented. Furthermore, based on the result of earlier benchmarking, an optimized task partition and scheduling method is proposed to improve the overall performance.</p>
5

CellPilot: An extension of the Pilot library for Cell Broadband Engine processors and heterogeneous clusters

Girard, Natalie 13 January 2012 (has links)
The CellPilot library provides a uniform communication programming model, based on Pilot's process/channel approach, for clusters of Cell Broadband Engine processors. Pilot, a thin layer on top of the Message Passing Interface (MPI) library, allows processes to read/write messages on channels defined between pairs of processes on the cluster, but Pilot alone does not help a Cell programmer cope with the considerable complexities of intra-Cell communication. With CellPilot, programmers still design software in terms of processes, but they can now be located on a Cell node's Power Processor Elements (PPEs), Synergistic Processing Elements (SPEs), or non-Cell node within a heterogeneous Cell cluster, and communication is accomplished via channels between process pairs. Programs are coded in terms of reading and writing on those channels, whereupon CellPilot transparently applies whichever communication mechanisms are required to transport the message, regardless of its endpoints. This gives the programmer a way to handle inter-process communication while avoiding low-level I/O operations and the use of multiple libraries.
6

Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor

Li, Yi-Hsien January 2007 (has links)
Space-Time Adaptive Processing (STAP) has been widely used in modern radar systems such as Ground Moving Target Indication (GMTI) systems in order to suppress jamming and interference. However, the high performance comes at a price of higher computational complexity, which requires extensive powerful hardware. The new STI Cell Broadband Engine (CBE) processor combines PowerPC core augmented with eight streamlined high-performance SIMD processing engine offers an opportunity to implement the STAP baseband signal processing without any full custom hardware. This paper presents the implementation of an STAP baseband signal processing flow on the state-of-the-art STI CELL multiprocessor, which enables the concept of Software-Defined Radar (SDR). The potential of the Cell BE processor is studied so that kernel subroutine such as QR decomposition, Fast Fourier Transform (FFT), and FIR filtering of STAP are mapped to the SPE co-processors of Cell BE processor with variety of architectural specific optimization techniques. This report starts with an overview of airborne radar technique and then the standard, specifically the third-order Doppler-factored STAP are introduced. Next, it goes with the thorough description of Cell BE architecture, its programming tool chain and parallel programming methods for Cell BE. In later chapter, how the STAP is implemented on the Cell BE processor is discussed and the simulation results are presented. Furthermore, based on the result of earlier benchmarking, an optimized task partition and scheduling method is proposed to improve the overall performance.
7

MRI Velocity Quantification Implementation and Evaluation of Elementary Functions for the Cell Broadband Engine

Li, Wei 27 June 2007 (has links)
<p> Magnetic Resonance Imaging (MRI) velocity quantification is addressed in part I of this thesis. In simple MR imaging, data is collected and tissue densities are displayed as images. Moving tissue creates signals which appear as artifacts in the images. In velocity imaging, more data is collected and phase differences are used to quantify the velocity of tissue components. The problem is described and a novel formulation of a regularized, nonlinear inverse problem is proposed. Both Tikhonov and Total Variation Regularization are discussed. Results of numerical simulations show that significant noise reduction is possible.</p> <p> The method is firstly verified with MATLAB. A number of experiments are carried out with different regularization parameters, different magnetic fields and different noise levels. The experiments show that the stronger the complex noise is, the stronger the magnetic field requires for estimating the velocity. The regularization parameter also plays an important role in the experiments. Given the noise level and with an appropriate value of regularization parameter, the estimated velocity converges to ideal velocity very quickly. A proof-of-concept implementation on the Cell BE processor is described, quantifying the performance potential of this platform.</p> <p> The second part of this thesis concerns the evaluation of an elementary function library. Since CBE SPU is designed for compute intensive applications, the well developed Math functions can help developer program and save time to take care other details. Dr. Anand's research group in McMaster developed 28 math functions for CBE SPU. The test tools for accuracy and performance were developed on CBE. The functions were tuned while testing. The functions are either competitive or an addition to the existing SDK1.1 SPU math functions.</p> / Thesis / Master of Applied Science (MASc)
8

Enhanced SAR Image Processing Using A Heterogeneous Multiprocessor

SHI, YU January 2008 (has links)
<p>Synthetic antenna aperture (SAR) is a pulses focusing airborne radar which can achieve high resolution radar image. A number of image process algorithms have been developed for this kind of radar, but the calculation burden is still heavy. So the image processing of SAR is normally performed “off-line”.</p><p>The Fast Factorized Back Projection (FFBP) algorithm is considered as a computationally efficient algorithm for image formation in SAR, and several applications have been implemented which try to make the process “on-line”.</p><p>CELL Broadband Engine is one of the newest multi-core-processor jointly developed by Sony, Toshiba and IBM. CELL is good at parallel computation and floating point numbers, which all fit the demands of SAR image formation.</p><p>This thesis is going to implement FFBP algorithm on CELL Broadband Engine, and compare the results with pre-projects. In this project, we try to make it possible to perform SAR image formation in real-time.</p>
9

Exploiting parallelism of irregular problems and performance evaluation on heterogeneous multi-core architectures

Xu, Meilian 04 October 2012 (has links)
In this thesis, we design, develop and implement parallel algorithms for irregular problems on heterogeneous multi-core architectures. Irregular problems exhibit random and unpredictable memory access patterns, poor spatial locality and input dependent control flow. Heterogeneous multi-core processors vary in: clock frequency, power dissipation, programming model (MIMD vs. SIMD), memory design and computing units, scalar versus vector units. The heterogeneity of the processors makes designing efficient parallel algorithms for irregular problems on heterogeneous multicore processors challenging. Techniques of mapping tasks or data on traditional parallel computers can not be used as is on heterogeneous multi-core processors due to the varying hardware. In an attempt to understand the efficiency of futuristic heterogeneous multi-core architectures on applications we study several computation and bandwidth oriented irregular problems on one heterogeneous multi-core architecture, the IBM Cell Broadband Engine (Cell BE). The Cell BE consists of a general processor and eight specialized processors and addresses vector/data-level parallelism and instruction-level parallelism simultaneously. Through these studies on the Cell BE, we provide some discussions and insight on the performance of the applications on heterogeneous multi-core architectures. Verifying these experimental results require some performance modeling. Due to the diversity of heterogeneous multi-core architectures, theoretical performance models used for homogeneous multi-core architectures do not provide accurate results. Therefore, in this thesis we propose an analytical performance prediction model that considers the multitude architectural features of heterogeneous multi-cores (such as DMA transfers, number of instructions and operations, the processor frequency and DMA bandwidth). We show that the execution time from our prediction model is comparable to the execution time of the experimental results for a complex medical imaging application.
10

Exploiting parallelism of irregular problems and performance evaluation on heterogeneous multi-core architectures

Xu, Meilian 04 October 2012 (has links)
In this thesis, we design, develop and implement parallel algorithms for irregular problems on heterogeneous multi-core architectures. Irregular problems exhibit random and unpredictable memory access patterns, poor spatial locality and input dependent control flow. Heterogeneous multi-core processors vary in: clock frequency, power dissipation, programming model (MIMD vs. SIMD), memory design and computing units, scalar versus vector units. The heterogeneity of the processors makes designing efficient parallel algorithms for irregular problems on heterogeneous multicore processors challenging. Techniques of mapping tasks or data on traditional parallel computers can not be used as is on heterogeneous multi-core processors due to the varying hardware. In an attempt to understand the efficiency of futuristic heterogeneous multi-core architectures on applications we study several computation and bandwidth oriented irregular problems on one heterogeneous multi-core architecture, the IBM Cell Broadband Engine (Cell BE). The Cell BE consists of a general processor and eight specialized processors and addresses vector/data-level parallelism and instruction-level parallelism simultaneously. Through these studies on the Cell BE, we provide some discussions and insight on the performance of the applications on heterogeneous multi-core architectures. Verifying these experimental results require some performance modeling. Due to the diversity of heterogeneous multi-core architectures, theoretical performance models used for homogeneous multi-core architectures do not provide accurate results. Therefore, in this thesis we propose an analytical performance prediction model that considers the multitude architectural features of heterogeneous multi-cores (such as DMA transfers, number of instructions and operations, the processor frequency and DMA bandwidth). We show that the execution time from our prediction model is comparable to the execution time of the experimental results for a complex medical imaging application.

Page generated in 0.0795 seconds