131 |
A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor
Einemo, Jonas, Lundqvist, Magnus January 2010
H.264 is a video coding standard that offers a high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented on ePUMA, a new multi-core digital signal processor architecture, and investigates whether real-time encoding of high-definition video sequences can be performed. The implementation covers the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization, and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator, and the results are compared to an existing H.264 encoder implementation for another multi-core processor architecture, the STI Cell. The results show that the selected parts of the H.264 encoder can run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.
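As a rough illustration of the real-time question, the reported cycle budget can be converted into a minimum clock rate. This is a back-of-the-envelope sketch: the frame rates are assumptions chosen for illustration, and it assumes the 5 million cycles are elapsed cycles with the six cores working in parallel. The abstract itself states only the cycles-per-frame figure.

```python
# Back-of-the-envelope cycle budget. The frame rates are illustrative
# assumptions; only CYCLES_PER_FRAME comes from the abstract.
CYCLES_PER_FRAME = 5_000_000


def required_clock_hz(cycles_per_frame: int, frames_per_second: int) -> int:
    """Minimum clock rate needed to finish every frame within its time slot."""
    return cycles_per_frame * frames_per_second


if __name__ == "__main__":
    for fps in (25, 30, 60):  # assumed frame rates, not taken from the thesis
        mhz = required_clock_hz(CYCLES_PER_FRAME, fps) / 1e6
        print(f"{fps} fps -> at least {mhz:.0f} MHz")
```

Whether a given clock rate is achievable on ePUMA is not stated in the abstract, so the sketch only expresses the relationship between cycle count, frame rate, and clock frequency.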
|
132 |
Compiling the parallel programming language NestStep to the CELL processor
Holm, Magnus January 2010
The goal of this project is to create a source-to-source compiler that translates NestStep code to C code. The compiler's job is to replace NestStep constructs with a series of function calls to the NestStep runtime system. NestStep is a parallel programming language extension based on the BSP model. It adds constructs for parallel programming on top of an imperative programming language. For this project, only the constructs extending the C language are relevant. The output code compiles to an executable program that runs on the multicore processor Cell Broadband Engine (Cell BE). The NestStep runtime system has been ported to the Cell BE and is available from the start of this project.
|
133 |
A Scalable Run-Time System for NestStep on Cluster Supercomputers
Sohl, Joar January 2006
NestStep is a collection of parallel extensions to existing programming languages. These extensions support a shared memory model and nested parallelism. NestStep is based on the Bulk-Synchronous Parallel (BSP) programming model. Most of the communication of data in NestStep takes place in a combine/commit phase, which is essentially a reduction followed by a broadcast.
The primary aim of the project that this thesis is based on was to develop a runtime system for NestStep-C, the extensions for the C programming language. The secondary aim was to find which tree structure among a selected few is the best for communicating data in the combine/commit phase.
This thesis includes information about NestStep, how to interface with the NestStep runtime system, some example applications, and benchmarks for determining the best tree structure. A binomial tree structure and trees similar to it were empirically found to yield the best performance.
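The "reduction followed by a broadcast" pattern over a binomial tree can be sketched as follows. This is a single-process simulation written for illustration only: the function names are hypothetical, the list index stands in for a process rank, and nothing here reflects the actual NestStep runtime API or its message passing.

```python
from operator import add
from typing import Callable, List, Optional


def tree_combine(values: List[int], op: Callable[[int, int], int]) -> int:
    """Reduce one value per rank to rank 0 along a binomial tree.

    Requires at least one value. Each pass of the while loop is one
    communication round; the number of rounds grows as log2(len(values)).
    """
    partial = list(values)
    n = len(partial)
    step = 1
    while step < n:
        # In this round, rank (receiver + step) sends its partial result.
        for receiver in range(0, n, 2 * step):
            sender = receiver + step
            if sender < n:
                partial[receiver] = op(partial[receiver], partial[sender])
        step *= 2
    return partial[0]


def tree_commit(result: int, n: int) -> List[Optional[int]]:
    """Broadcast the combined result from rank 0 back down the same tree."""
    local: List[Optional[int]] = [None] * n
    local[0] = result
    step = 1
    while step * 2 < n:
        step *= 2  # start with the widest hop, mirroring the last combine round
    while step >= 1:
        for sender in range(0, n, 2 * step):
            receiver = sender + step
            if receiver < n:
                local[receiver] = local[sender]
        step //= 2
    return local


def combine_commit(values: List[int], op: Callable[[int, int], int]) -> List[Optional[int]]:
    """Every rank contributes a value and ends up with the combined result."""
    return tree_commit(tree_combine(values, op), len(values))


if __name__ == "__main__":
    print(combine_commit([1, 2, 3, 4, 5], add))
```

The sketch shows only the communication shape. Which tree performs best in practice is an empirical question, which is what the thesis benchmarks.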
|
134 |
Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor
Li, Yi-Hsien January 2007
Space-Time Adaptive Processing (STAP) is widely used in modern radar systems, such as Ground Moving Target Indication (GMTI) systems, to suppress jamming and interference. However, its high performance comes at the price of higher computational complexity, which requires extensive, powerful hardware.
The new STI Cell Broadband Engine (CBE) processor, which combines a PowerPC core with eight streamlined high-performance SIMD processing engines, offers an opportunity to implement STAP baseband signal processing without any full-custom hardware. This paper presents the implementation of a STAP baseband signal processing flow on the state-of-the-art STI Cell multiprocessor, which enables the concept of Software-Defined Radar (SDR). The potential of the Cell BE processor is studied by mapping STAP kernel subroutines, such as QR decomposition, the Fast Fourier Transform (FFT), and FIR filtering, to the SPE co-processors of the Cell BE processor using a variety of architecture-specific optimization techniques.
This report starts with an overview of airborne radar techniques and then introduces standard STAP, specifically the third-order Doppler-factored variant. Next, it gives a thorough description of the Cell BE architecture, its programming tool chain, and parallel programming methods for the Cell BE. A later chapter discusses how STAP is implemented on the Cell BE processor and presents the simulation results. Finally, based on the results of earlier benchmarking, an optimized task partitioning and scheduling method is proposed to improve overall performance.
|
135 |
Transactions Everywhere
Kuszmaul, Bradley C., Leiserson, Charles E. 01 1900
Arguably, one of the biggest deterrents for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is enforcing atomicity, so that the partial results of one task do not inadvertently corrupt another task. Atomicity is typically enforced through locking protocols, but these protocols can introduce other complications, such as deadlock, unless restrictive methodologies in their use are adopted. We have recently begun a research project focusing on transactional memory [18] as an alternative mechanism for enforcing atomicity, since it allows the user to avoid many of the complications inherent in locking protocols. Rather than viewing transactions as infrequent occurrences in a program, as has generally been done in the past, we have adopted the point of view that all user code should execute in the context of some transaction. Making this viewpoint viable requires the development of two key technologies: effective hardware support for scalable transactional memory, and linguistic and compiler support. This paper describes our preliminary research results on making “transactions everywhere” a practical reality. / Singapore-MIT Alliance (SMA)
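To make the contrast with locking protocols concrete, the following is a minimal software sketch of optimistic transactions. It is an illustration under stated assumptions, not the hardware or compiler support the paper describes: the names `TVar`, `Transaction`, `atomically`, and `transfer` are hypothetical, a single global lock is held only during commit, and a running transaction is not guaranteed to see a consistent snapshot (it is simply retried if validation fails).

```python
import threading

# Hypothetical toy design: one global lock, taken only while committing.
_commit_lock = threading.Lock()


class TVar:
    """A transactional variable: a value plus a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0


class Transaction:
    """Records the versions it has read and buffers its writes until commit."""

    def __init__(self):
        self.reads = {}   # TVar -> version observed at first read
        self.writes = {}  # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.writes:
            return self.writes[tvar]
        # Record the version before reading the value, so that a commit
        # landing in between is caught by validation.
        self.reads.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value

    def commit(self):
        with _commit_lock:
            # Validate: fail if anything we read has changed since.
            if any(t.version != seen for t, seen in self.reads.items()):
                return False
            for t, value in self.writes.items():
                t.value = value
                t.version += 1
            return True


def atomically(body):
    """Run body(tx) repeatedly until its transaction commits."""
    while True:
        tx = Transaction()
        result = body(tx)
        if tx.commit():
            return result


def transfer(src, dst, amount):
    """Move amount between two TVars without naming any lock."""
    def body(tx):
        tx.write(src, tx.read(src) - amount)
        tx.write(dst, tx.read(dst) + amount)
    atomically(body)
```

The caller of `transfer` never decides which lock to take or in what order, which is the kind of complication the abstract attributes to locking protocols. A production system would need much more, including consistent reads inside a running transaction and contention management.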
|
136 |
On-the-fly Race Detection for Programs with Recursive Spawn-Sync Parallelism
He, Yuxiong, Wang, Junqing 01 1900
Detecting data races is very important for debugging shared-memory parallel programs, because data races result in unintended nondeterministic execution of the program. We propose a dynamic on-the-fly race detection mechanism called the Parallel Nondeterminator to check for determinacy races during the parallel execution of a program with recursive spawn-sync parallelism. A modified version of the Nested Region Labeling scheme is developed for the concurrency relationship test in the spawn-sync parallel structure. By identifying the Least Common Ancestor in the spawn tree, the Parallel Nondeterminator needs to keep only two read access records and one write access record for each shared location. The work and critical path of the instrumented code are analyzed, as well as its time complexity and space requirements. Let N denote the maximum depth of the recursion in the parallel program. The worst-case time added to each spawn and sync operation is O(N), and the time required to monitor any shared memory location is O(lg N). Moreover, the Parallel Nondeterminator is able to execute the race detection code without loss of parallelism of the original program. In summary, the Parallel Nondeterminator represents a provably efficient strategy for detecting data races in shared-memory parallel programs. / Singapore-MIT Alliance (SMA)
|
137 |
Enhancing MPI with modern networking mechanisms in cluster interconnects
Yu, Weikuan, January 2006
Thesis (Ph. D.)--Ohio State University, 2006. / Title from first page of PDF file. Includes bibliographical references (p. 161-168).
|
138 |
E-AMOM: An Energy-Aware Modeling and Optimization Methodology for Scientific Applications on Multicore Systems
Lively, Charles 2012 May 1900
Power consumption is an important constraint in achieving efficient execution on High Performance Computing Multicore Systems. As the number of cores available on a chip continues to increase, the importance of power consumption will continue to grow. To achieve improved performance on multicore systems, scientific applications must make use of efficient methods for reducing power consumption and must be further refined to reduce execution time.
In this dissertation, we introduce a performance modeling framework, E-AMOM, to enable improved execution of scientific applications on parallel multicore systems under a limited power budget. We develop models for each application based upon hardware performance counters. Our models utilize different performance counters for each application and for each performance component (runtime, system power consumption, CPU power consumption, and memory power consumption); the counters are selected via our performance-tuned principal component analysis method. Models developed through E-AMOM provide insight into the characteristics of each application that affect performance for each component on a parallel multicore system. Our models are more than 92% accurate across both Hybrid (MPI/OpenMP) and MPI implementations for six scientific applications.
E-AMOM includes an optimization component that utilizes our models to employ run-time Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Concurrency Throttling to reduce the power consumption of the scientific applications. Further, we optimize our applications based upon insights provided by the performance models to reduce their runtime. Our methods and techniques save up to 18% in energy consumption for Hybrid (MPI/OpenMP) and MPI scientific applications and reduce the runtime of the applications by up to 11% on parallel multicore systems.
|
139 |
Architectural support for high-performing hardware transactional memory systems
Lupon Navazo, Marc 23 December 2011
Parallel programming presents an efficient solution to exploit future multicore processors.
Unfortunately, traditional programming models depend on the programmer's skill in synchronizing
concurrent threads, which makes the development of parallel software a hard and error-prone
task. In addition, current synchronization techniques serialize the execution of
those critical sections that conflict in shared memory and thus limit the scalability of multithreaded
applications.
Transactional Memory (TM) has emerged as a promising programming model that solves
the trade-off between high performance and ease of use. In TM, the system is in charge of
scheduling transactions (atomic blocks of instructions) and guaranteeing that they are executed
in isolation, which simplifies writing parallel code and, at the same time, enables high concurrency
when atomic regions access different data. Among all forms of TM environments,
Hardware TM (HTM) systems are the only ones that offer fast execution, at the cost of adding
dedicated logic to the processor.
Existing HTM systems suffer considerable delays when they execute complex transactional
workloads, especially when they deal with large and contending transactions, because they lack
adaptability. Furthermore, most HTM implementations are ad hoc and require cumbersome
hardware structures to be effective, which undermines the feasibility of the design. This thesis
makes several contributions to the design and analysis of low-cost HTM systems that yield good
performance for any kind of TM program.
Our first contribution, FASTM, introduces a novel mechanism to elegantly manage speculative
(and already validated) versions of transactional data by slightly modifying the on-chip memory
engine. This approach permits fast recovery when a transaction that fits in private caches is discarded.
At the same time, it keeps non-speculative values in software, which allows in-place
memory updates. Thus, FASTM neither suffers from capacity issues nor slows down when it has to
undo transactional modifications.
Our second contribution includes two different HTM systems that integrate deferred resolution
of conflicts in a conventional multicore processor, which reduces the complexity of the
system with respect to previous proposals. The first one, FUSETM, combines different-mode
transactions under a unified infrastructure to gracefully handle resource overflow. As a result,
FUSETM brings fast transactional computation without requiring additional hardware or extra
communication at the end of speculative execution. The second one, SPECTM, introduces a
two-level data versioning mechanism to resolve conflicts in a speculative fashion even in the
case of overflow.
Our third and last contribution presents a couple of truly flexible HTM systems that can
dynamically adapt their underlying mechanisms according to the characteristics of the program.
DYNTM records statistics of previously executed transactions to select the best-suited strategy
each time a new instance of a transaction starts. SWAPTM takes a different approach: it tracks
information about the current transactional instance to change its priority level at runtime. Both
alternatives outperform existing proposals that employ fixed transactional
policies, especially in applications with phase changes.
|
140 |
Efficient Conditional Synchronization for Transactional Memory Based System
Naik, Aniket Dilip 10 April 2006
Multi-threaded applications are needed to realize the full potential
of new chip-multi-threaded machines. Such applications are
very difficult to program and orchestrate correctly, and transactional
memory has been proposed as a way of alleviating some of the programming
difficulties. However, transactional memory can directly
be applied only to critical sections, while conditional synchronization
remains difficult to implement correctly and efficiently.
This dissertation describes EasySync, a simple and inexpensive extension
to transactional memory that allows arbitrary conditional
synchronization to be expressed in a simple and composable way.
Transactional memory eliminates the need to use locks and provides
composability for critical sections: atomicity of a transaction is
guaranteed regardless of how other code is written. EasySync provides
the same benefits for conditional synchronization: it eliminates
the need to use condition variables, and it guarantees wakeup
of the waiting transaction when the real condition it is waiting for
is satisfied, regardless of whether other code correctly signals that
change. EasySync also allows transactional memory systems to efficiently
provide lock-free and condition variable-free conditional
critical regions and even more advanced synchronization primitives,
such as guarded execution with arbitrary conditional or guard code.
Because EasySync informs the hardware that a thread is
waiting, it allows simple and effective optimizations, such as stopping
the execution of a thread until there is a change in the condition
it is waiting for. Like transactional memory, EasySync is backward compatible
with existing code, which we confirm by running unmodified
Splash-2 applications linked with an EasySync-based synchronization
library. We also rewrite some of the synchronization
in three Splash-2 applications to improve code readability
and to replace spin-waiting with its more efficient EasySync
equivalents.
Our experimental evaluation shows that EasySync successfully
eliminates processor activity while waiting, reducing the number of
executed instructions by 8.6% on average in a 16-processor CMP.
We also show that these savings increase with the number of processors,
and also for applications written for transactional memory
systems. Finally, EasySync imposes virtually no performance overheads,
and can in fact improve performance.
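The idea of a conditional critical region, where a waiter names the condition it needs and writers never signal a specific condition, can be sketched in software as follows. This is only an analogy built on Python's standard `threading.Condition`: the `GuardedState` class and its methods are hypothetical names, and nothing here models EasySync's hardware mechanism or its transactional memory integration.

```python
import threading


class GuardedState:
    """Shared state with conditional critical regions (software analogy only)."""

    def __init__(self, **fields):
        self._cond = threading.Condition()
        self._state = dict(fields)

    def atomic(self, action):
        """Run action(state) as a critical section, then wake every waiter.

        The writer does not know or name the conditions others wait on.
        """
        with self._cond:
            result = action(self._state)
            self._cond.notify_all()
            return result

    def when(self, guard, action):
        """Block (without spinning) until guard(state) holds, then run action."""
        with self._cond:
            self._cond.wait_for(lambda: guard(self._state))
            result = action(self._state)
            self._cond.notify_all()
            return result


if __name__ == "__main__":
    box = GuardedState(items=[])
    received = []

    def consumer():
        # Wait until the list is non-empty, then take the first item.
        item = box.when(lambda s: len(s["items"]) > 0,
                        lambda s: s["items"].pop(0))
        received.append(item)

    t = threading.Thread(target=consumer)
    t.start()
    box.atomic(lambda s: s["items"].append(42))
    t.join()
    print(received)
```

In this sketch every update wakes all waiters, which then re-check their guards. The abstract describes hardware that instead stops a waiting thread until the condition it waits for changes, so this software version illustrates the programming model rather than the efficiency claim.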
|