Global ETD Search

131	A Selection of H.264 Encoder Components Implemented and Benchmarked on a Multi-core DSP Processor Einemo, Jonas, Lundqvist, Magnus January 2010 (has links) <p>H.264 is a video coding standard which offers high data compression rate at the cost of a high computational load. This thesis evaluates how well parts of the H.264 standard can be implemented for a new multi-core digital signal processing processor architecture called ePUMA. The thesis investigates if real-time encoding of high definition video sequences could be performed. The implementation consists of the motion estimation, motion compensation, discrete cosine transform, inverse discrete cosine transform, quantization and rescaling parts of the H.264 standard. Benchmarking is done using the ePUMA system simulator and the results are compared to an implementation of an existing H.264 encoder for another multi-core processor architecture called STI Cell. The results show that the selected parts of the H.264 encoder could be run on 6 calculation cores in 5 million cycles per frame. This setup leaves 2 calculation cores to run the remaining parts of the encoder.</p> ePUMA DSP SIMD H.264 Parallel Programming Motion Estimation DCT Computer engineering Datorteknik
132	Compiling the parallel programming language NestStep to the CELL processor Holm, Magnus January 2010 (has links) <p>The goal of this project is to create a source-to-source compiler which will translate NestStep code to C code. The compiler's job is to replace NestStep constructs with a series of function calls to the NestStep runtime system. NestStep is a parallel programming language extension based on the BSP model. It adds constructs for parallel programming on top of an imperative programming language. For this project, only constructs extending the C language are relevant. The output code will compile to form an executable program that runs on the multicore processor Cell Broadband Engine (Cell BE). The NestStep runtime system has been ported to the Cell BE and is available from start of this project.</p> NestStep Cell parallel programming source-to-source compiler Cetus ANTLR Software engineering Programvaruteknik
133	A Scalable Run-Time System for NestStep on Cluster Supercomputers Sohl, Joar January 2006 (has links) <p>NestStep is a collection of parallel extensions to existing programming languages. These extensions supports a shared memory model and nested parallelism. NestStep is based the Bulk-Synchronous Programming model. Most of the communication of data in NestStep takes place in a</p><p>combine/commit phase, which is essentially a reduction followed by a broadcast.</p><p>The primary aim of the project that this thesis is based on was to develop a runtime system for NestStep-C, the extensions for the C programming language. The secondary aim was to find which tree structure among a selected few is the best for communicating data in the combine/commit phase.</p><p>This thesis includes information about NestStep, how to interface with the NestStep runtime system, some example applications and benchmarks for determining the best tree structure. A binomial tree structure and trees similar to it was empirically found to yield the best performance.</p> NestStep parallel programming programming languages communication tree structures nested parallelism Computer science Datalogi
134	Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor Li, Yi-Hsien January 2007 (has links) <p>Space-Time Adaptive Processing (STAP) has been widely used in modern radar systems such as Ground Moving Target Indication (GMTI) systems in order to suppress jamming and interference. However, the high performance comes at a price of higher computational complexity, which requires extensive powerful hardware.</p><p>The new STI Cell Broadband Engine (CBE) processor combines PowerPC core augmented with eight streamlined high-performance SIMD processing engine offers an opportunity to implement the STAP baseband signal processing without any full custom hardware. This paper presents the implementation of an STAP baseband signal processing flow on the state-of-the-art STI CELL multiprocessor, which enables the concept of Software-Defined Radar (SDR). The potential of the Cell BE processor is studied so that kernel subroutine such as QR decomposition, Fast Fourier Transform (FFT), and FIR filtering of STAP are mapped to the SPE co-processors of Cell BE processor with variety of architectural specific optimization techniques.</p><p>This report starts with an overview of airborne radar technique and then the standard, specifically the third-order Doppler-factored STAP are introduced. Next, it goes with the thorough description of Cell BE architecture, its programming tool chain and parallel programming methods for Cell BE. In later chapter, how the STAP is implemented on the Cell BE processor is discussed and the simulation results are presented. Furthermore, based on the result of earlier benchmarking, an optimized task partition and scheduling method is proposed to improve the overall performance.</p> Space Time Adaptive Processing Cell Broadband Engine Parallel Programming Computer engineering Datorteknik
135	Transactions Everywhere Kuszmaul, Bradley C., Leiserson, Charles E. 01 1900 (has links) Arguably, one of the biggest deterrants for software developers who might otherwise choose to write parallel code is that parallelism makes their lives more complicated. Perhaps the most basic problem inherent in the coordination of concurrent tasks is the enforcing of atomicity so that the partial results of one task do not inadvertently corrupt another task. Atomicity is typically enforced through locking protocols, but these protocols can introduce other complications, such as deadlock, unless restrictive methodologies in their use are adopted. We have recently begun a research project focusing on transactional memory [18] as an alternative mechanism for enforcing atomicity, since it allows the user to avoid many of the complications inherent in locking protocols. Rather than viewing transactions as infrequent occurrences in a program, as has generally been done in the past, we have adopted the point of view that all user code should execute in the context of some transaction. To make this viewpoint viable requires the development of two key technologies: effective hardware support for scalable transactional memory, and linguistic and compiler support. This paper describes our preliminary research results on making “transactions everywhere” a practical reality. / Singapore-MIT Alliance (SMA) transactional memory scalable transactional memory atomicity parallel programming hardware transactional memory linguistic and compiler support
136	On-the-fly Race Detection for Programs with Recursive Spawn-Sync Parallelism He, Yuxiong, Wang, Junqing 01 1900 (has links) Detecting data race is very important for debugging shared-memory parallel programs, because data races result in unintended nondeterministic execution of the program. We propose a dynamic on-the-fly race detection mechanism called Parallel Nondeterminator to check for determinacy races during the parallel execution of a program with recursive spawn-sync parallelism. A modified version of Nested Region Labeling scheme is developed for the concurrency relationship test in the spawn-sync parallel structure. Through the identification of Least Common Ancestor in the spawn tree, the Parallel Nondeterminator only needs to keep two read access records and one write access record for each shared location. The work and critical path in the instrumented codes are analyzed as well as time complexity and space requirements. Let N denote the maximum depth of the recursion in the parallel program. The worst case time increased for each spawn and sync operation is O(N) and the time required to monitor any shared memory location is O(lgN). Moreover, Parallel Nondeterminator is able to execute the race detection code without loss of parallelism of the original program. In summary, the Parallel Non-determinator represents a provably efficient strategy for detecting data races for shared-memory parallel programs. / Singapore-MIT Alliance (SMA) data race detection Parallel Nondeterminator shared-memory parallel programming debugging Nested Region Labeling
137	Enhancing MPI with modern networking mechanisms in cluster interconnects Yu, Weikuan, January 2006 (has links) Thesis (Ph. D.)--Ohio State University, 2006. / Title from first page of PDF file. Includes bibliographical references (p. 161-168).
138	E-AMOM: An Energy-Aware Modeling and Optimization Methodology for Scientific Applications on Multicore Systems Lively, Charles 2012 May 1900 (has links) Power consumption is an important constraint in achieving efficient execution on High Performance Computing Multicore Systems. As the number of cores available on a chip continues to increase, the importance of power consumption will continue to grow. In order to achieve improved performance on multicore systems scientific applications must make use of efficient methods for reducing power consumption and must further be refined to achieve reduced execution time. In this dissertation, we introduce a performance modeling framework, E-AMOM, to enable improved execution of scientific applications on parallel multicore systems with regards to a limited power budget. We develop models for each application based upon performance hardware counters. Our models utilize different performance counters for each application and for each performance component (runtime, system power consumption, CPU power consumption, and memory power consumption) that are selected via our performance-tuned principal component analysis method. Models developed through E-AMOM provide insight into the performance characteristics of each application that affect performance for each component on a parallel multicore system. Our models are more than 92% accurate across both Hybrid (MPI/OpenMP) and MPI implementations for six scientific applications. E-AMOM includes an optimization component that utilizes our models to employ run-time Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Concurrency Throttling to reduce power consumption of the scientific applications. Further, we optimize our applications based upon insights provided by the performance models to reduce runtime of the applications. Our methods and techniques are able to save up to 18% in energy consumption for Hybrid (MPI/OpenMP) and MPI scientific applications and reduce the runtime of the applications up to 11% on parallel multicore systems. Performance Modeling Power consumption Multicore Parallel Programming MPI Hybrid Power prediction
139	Architectural support for high-performing hardware transactional memory systems Lupon Navazo, Marc 23 December 2011 (has links) Parallel programming presents an efficient solution to exploit future multicore processors. Unfortunately, traditional programming models depend on programmer’s skills for synchronizing concurrent threads, which makes the development of parallel software a hard and errorprone task. In addition to this, current synchronization techniques serialize the execution of those critical sections that conflict in shared memory and thus limit the scalability of multithreaded applications. Transactional Memory (TM) has emerged as a promising programming model that solves the trade-off between high performance and ease of use. In TM, the system is in charge of scheduling transactions (atomic blocks of instructions) and guaranteeing that they are executed in isolation, which simplifies writing parallel code and, at the same time, enables high concurrency when atomic regions access different data. Among all forms of TM environments, Hardware TM (HTM) systems is the only one that offers fast execution at the cost of adding dedicated logic in the processor. Existing HTMsystems suffer considerable delays when they execute complex transactional workloads, especially when they deal with large and contending transactions because they lack adaptability. Furthermore, most HTM implementations are ad hoc and require cumbersome hardware structures to be effective, which complicates the feasibility of the design. This thesis makes several contributions in the design and analysis of low-cost HTMsystems that yield good performance for any kind of TM program. Our first contribution, FASTM, introduces a novel mechanism to elegantly manage speculative (and already validated) versions of transactional data by slightly modifying on-chip memory engine. This approach permits fast recovery when a transaction that fits in private caches is discarded. At the same time, it keeps non-speculative values in software, which allows in-place x memory updates. Thus, FASTM is not hurt from capacity issues nor slows down when it has to undo transactional modifications. Our second contribution includes two different HTM systems that integrate deferred resolution of conflicts in a conventional multicore processor, which reduces the complexity of the system with respect to previous proposals. The first one, FUSETM, combines different-mode transactions under a unified infrastructure to gracefully handle resource overflow. As a result, FUSETM brings fast transactional computation without requiring additional hardware nor extra communication at the end of speculative execution. The second one, SPECTM, introduces a two-level data versioning mechanism to resolve conflicts in a speculative fashion even in the case of overflow. Our third and last contribution presents a couple of truly flexible HTM systems that can dynamically adapt their underlying mechanisms according to the characteristics of the program. DYNTM records statistics of previously executed transactions to select the best-suited strategy each time a new instance of a transaction starts. SWAPTM takes a different approach: it tracks information of the current transactional instance to change its priority level at runtime. Both alternatives obtain great performance over existing proposals that employ fixed transactional policies, especially in applications with phase changes. Hardware transactional memory Parallel programming Dynamically adaptive systems Non-blocring synchronization FASTM DYNTM SWPTM 208 004
140	Efficient Conditional Synchronization for Transactional Memory Based System Naik, Aniket Dilip 10 April 2006 (has links) Multi-threaded applications are needed to realize the full potential of new chip-multi-threaded machines. Such applications are very difficult to program and orchestrate correctly, and transactional memory has been proposed as a way of alleviating some of the programming difficulties. However, transactional memory can directly be applied only to critical sections, while conditional synchronization remains difficult to implement correctly and efficiently. This dissertation describes EasySync, a simple and inexpensive extension to transactional memory that allows arbitrary conditional synchronization to be expressed in a simple and composable way. Transactional memory eliminates the need to use locks and provides composability for critical sections: atomicity of a transaction is guaranteed regardless of how other code is written. EasySync provides the same benefits for conditional synchronizations: it eliminates the need to use conditional variables, and it guarantees wakeup of the waiting transaction when the real condition it is waiting for is satisfied, regardless of whether other code correctly signals that change. EasySync also allows transactional memory systems to efficiently provide lock-free and condition variable-free conditional critical regions and even more advanced synchronization primitives, such as guarded execution with arbitrary conditional or guard code. Because EasySync informs the hardware the that a thread is waiting, it allows simple and effective optimizations, such as stopping the execution of a thread until there is a change in the condition it is waiting for. Like transactional memory, EasySync is backward compatible with existing code, which we confirm by running unmodified Splash-2 applications linked with an EasySync-based synchronization library. We also re-write some of the synchronization in three Splash-2 applications, to take advantage of better code readability, and to replace spin-waiting with its more efficient EasySync equivalents. Our experimental evaluation shows that EasySync successfully eliminates processor activity while waiting, reducing the number of executed instructions by 8.6% on average in a 16-processor CMP. We also show that these savings increase with the number of processors, and also for applications written for transactional memory systems. Finally, EasySync imposes virtually no performance overheads, and can in fact improve performance. Parallel programming Transactional memory Synchronization Transaction systems (Computer systems) Parallel programming (Computer science) Threads (Computer programs)

Search results