Global ETD Search

121	Architectural Enhancements for Color Image and Video Processing on Embedded Systems Kim, Jongmyon 21 April 2005 (has links) As emerging portable multimedia applications demand more and more computational throughput with limited energy consumption, the need for high-efficiency, high-throughput embedded processing is becoming an important challenge in computer architecture. In this regard, this dissertation addresses application-, architecture-, and technology-level issues in existing processing systems to provide efficient processing of multimedia in many, or ideally all, of its form. In particular, this dissertation explores color imaging in multimedia while focusing on two architectural enhancements for memory- and performance-hungry embedded applications: (1) a pixel-truncation technique and (2) a color-aware instruction set (CAX) for embedded multimedia systems. The pixel-truncation technique differs from previous techniques (e.g., 4:2:2 and 4:2:0 subsampling) used in image and video compression applications (e.g., JPEG and MPEG) in that it reduces the information content in individual pixel word sizes rather than in each dimension. Thus, this technique drastically reduces the bandwidth and memory required to transport and store color images without perceivable distortion in color. At the same time, it maintains the pixel storage format of color image processing in which each pixel computation is performed simultaneously on 3-D YCbCr components, which are widely used in the image and video processing community. CAX supports parallel operations on two-packed 16-bit (6:5:5) YCbCr data in a 32-bit datapath processor, providing greater concurrency and efficiency for processing color image sequences. This dissertation presents the impact of CAX on processing performance and on both area and energy efficiency for color imaging applications in three major processor architectures: dynamically scheduled (superscalar), statically scheduled (very long instruction word, VLIW), and embedded single instruction multiple data (SIMD) array processors. Unlike typical multimedia extensions, CAX obtains substantial performance and code density improvements through direct support for color data processing rather than depending solely on generic subword parallelism. In addition, the ability to reduce data format size reduces system cost. The reduction in data bandwidth also simplifies system design. In summary, CAX, coupled with the pixel-truncation technique, provides an efficient mechanism that meets the computational requirements and cost goals for future embedded multimedia products. Data parallel architectures Superscalar processors Embedded systems Subword parallelism Computer architecture Color image and video processing
122	Task Parallelism For Ray Tracing On A Gpu Cluster Unlu, Caglar 01 February 2008 (has links) (PDF) Ray tracing is a computationally complex global illumination algorithm that is used for producing realistic images. In addition to parallel implementations on commodity PC clusters, recently, Graphics Processing Units (GPU) have also been used to accelerate ray tracing. In this thesis, ray tracing is accelerated on a GPU cluster where the viewing plane is divided into unit tiles. Slave processes work on these tiles in a task parallel manner which are dynamically assigned to them. To decrease the number of ray-triangle intersection tests, Bounding Volume Hierarchies (BVH) are used. It is shown that almost linear speedup can be achieved. On the other hand, it is observed that API and network overheads are obstacles for scalability. T General Technology. 1-995
123	On the design of architecture-aware algorithms for emerging applications Kang, Seunghwa 30 January 2011 (has links) This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology. We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in the problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also find several limitations of current system software and architectures and directions to improve those. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation participates in the efforts by providing benchmarks and suggestions to improve system software and architectures. MapReduce Nested parallelism Parallel algorithm Algorithm engineering Performance tuning GPU Transactional memory Algorithms Parallel algorithms Multiprocessors
124	A Concurrent IFDS Dataflow Analysis Algorithm Using Actors Rodriguez, Jonathan David January 2010 (has links) There has recently been a resurgence in interest in techniques for effective programming of multi-core computers. Most programmers find general-purpose concurrent programming to be extremely difficult. This difficulty severely limits the number of applications that currently benefit from multi-core computers. There already exist many concurrent solutions for the class of regular applications, which include various algorithms for linear algebra. For the class of irregular applications, which operate on dynamic and pointer- and graph-based structures, efficient concurrent solutions have so far remained elusive. Dataflow analysis applications, which are often found in compilers and program analysis tools, have received particularly little attention with regard to execution on multi-core machines. Operating on the theory that the Actor model, which structures computations as systems of asynchronously-communicating entities, is a more appropriate method for representing irregular algorithms than the shared-memory model, this work presents a concurrent Actor-based formulation of the IFDS, or Interprocedural Finite Distributive Subset, dataflow analysis algorithm. The implementation of this algorithm is done using the Scala language and its Actors library. This algorithm achieves significant speedup on multi-core machines without using any optimistic execution. This work contributes to Actor research by showing how the Actor model can be practically applied to a dataflow analysis problem. This work contributes to static analysis research by showing how a dataflow analysis algorithm can effectively make use of multi-core machines, allowing the possibility of faster and more precise analyses. concurrency static analysis actors irregular parallelism actor model dataflow analysis IFDS Computer Science
125	Compiler Support for Fine-grain Software-only Checkpointing Zhao, Cheng Yan 13 August 2013 (has links) Checkpointing support allows program execution to roll-back to an earlier program point, discarding any modifications made since that point. Existing software-based checkpointing methods are mainly libraries that snapshot all of working-memory, and hence have prohibitive overhead for many potential applications. In this thesis we present a light-weight, fine-grain checkpointing framework implemented entirely in software through compiler transformations and optimizations. A programmer can specify arbitrary checkpoint regions via a simple API, and the compiler automatically transforms the code to implement the checkpoint at the granularity of individual stores, optimizing to remove redundancy. We explore three application areas for this support. First, we investigate its application to debugging, in particular by providing the ability to rewind to an arbitrarily-placed point in a buggy program’s execution. A study using BugBench applications shows that our compiler-based approach is more than 100X less overhead than full-process checkpointing. Second, we demonstrate that compiler-based checkpointing support can be leveraged to free the programmer from manually implementing and maintaining software rollback mechanisms when coding a backtracking algorithm, with runtime overhead of only 15% compared to the manual implementation. Finally, we utilize the efficient software-only checkpointing to support overlapping execution with delinquent loads through the demonstrations both control-speculation and data-speculation transformations. We further propose a theoretical speculative timing model and confirm its prediction effectiveness with real-machine workload. checkpoinitng compiler program analysis optimization speculation parallelism programming transactional memory 0984
126	Efficient Multi-ported Memories for FPGAs LaForest, Charles Eric 15 February 2010 (has links) Multi-ported memories are challenging to implement on FPGAs since the provided block RAMs typically have only two ports. In this dissertation we present a thorough exploration of the design space of FPGA multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implemen- tation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2-fold smaller. FPGA computer architecture multi-ported memory multipumping soft processor parallelism FIFO buffers 0544
127	Efficient Multi-ported Memories for FPGAs LaForest, Charles Eric 15 February 2010 (has links) Multi-ported memories are challenging to implement on FPGAs since the provided block RAMs typically have only two ports. In this dissertation we present a thorough exploration of the design space of FPGA multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implemen- tation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2-fold smaller. FPGA computer architecture multi-ported memory multipumping soft processor parallelism FIFO buffers 0544
128	Compiler Support for Fine-grain Software-only Checkpointing Zhao, Cheng Yan 13 August 2013 (has links) Checkpointing support allows program execution to roll-back to an earlier program point, discarding any modifications made since that point. Existing software-based checkpointing methods are mainly libraries that snapshot all of working-memory, and hence have prohibitive overhead for many potential applications. In this thesis we present a light-weight, fine-grain checkpointing framework implemented entirely in software through compiler transformations and optimizations. A programmer can specify arbitrary checkpoint regions via a simple API, and the compiler automatically transforms the code to implement the checkpoint at the granularity of individual stores, optimizing to remove redundancy. We explore three application areas for this support. First, we investigate its application to debugging, in particular by providing the ability to rewind to an arbitrarily-placed point in a buggy program’s execution. A study using BugBench applications shows that our compiler-based approach is more than 100X less overhead than full-process checkpointing. Second, we demonstrate that compiler-based checkpointing support can be leveraged to free the programmer from manually implementing and maintaining software rollback mechanisms when coding a backtracking algorithm, with runtime overhead of only 15% compared to the manual implementation. Finally, we utilize the efficient software-only checkpointing to support overlapping execution with delinquent loads through the demonstrations both control-speculation and data-speculation transformations. We further propose a theoretical speculative timing model and confirm its prediction effectiveness with real-machine workload. checkpoinitng compiler program analysis optimization speculation parallelism programming transactional memory 0984
129	Nested pessimistic transactions for both atomicity and synchronization in concurrent software Chammah, Tarek January 2011 (has links) Existing atomic section interface proposals, thus far, have tended to only isolate transactions from each other. Less considered is the coordination of threads performing transactions with respect to one another. Synchronization of nested sections is typically relegated to outside of and among the top-level flattened sections. However existing models do not permit the composition of even simple synchronization constructs such as barriers. The proposed model integrates synchronization as a first-class construct in a truly nested atomic block implementation. The implementation is evaluated on quantitative benchmarks, with qualitative examples of the atomic section interface???s expressive power compared with conventional transactional memory implementations. concurrency parallelism transactions locks transactional memory atomic sections lock inference condition synchronization
130	Scientific Computing on Multicore Architectures Tillenius, Martin January 2014 (has links) Computer simulations are an indispensable tool for scientists to gain new insights about nature. Simulations of natural phenomena are usually large, and limited by the available computer resources. By using the computer resources more efficiently, larger and more detailed simulations can be performed, and more information can be extracted to help advance human knowledge. The topic of this thesis is how to make best use of modern computers for scientific computations. The challenge here is the high level of parallelism that is required to fully utilize the multicore processors in these systems. Starting from the basics, the primitives for synchronizing between threads are investigated. Hardware transactional memory is a new construct for this, which is evaluated for a new use of importance for scientific software: atomic updates of floating point values. The evaluation includes experiments on real hardware and comparisons against standard methods. Higher level programming models for shared memory parallelism are then considered. The state of the art for efficient use of multicore systems is dynamically scheduled task-based systems, where tasks can depend on data. In such systems, the software is divided up into many small tasks that are scheduled asynchronously according to their data dependencies. This enables a high level of parallelism, and avoids global barriers. A new system for managing task dependencies is developed in this thesis, based on data versioning. The system is implemented as a reusable software library, and shown to be as efficient or more efficient than other shared-memory task-based systems in experimental comparisons. The developed runtime system is then extended to distributed memory machines, and used for implementing a parallel version of a software for global climate simulations. By running the optimized and parallelized version on eight servers, an equally sized problem can be solved over 100 times faster than in the original sequential version. The parallel version also allowed significantly larger problems to be solved, previously unreachable due to memory constraints. / UPMARC / eSSENCE multicore scientific computing shared memory parallelism task-based programming parallel programming model task scheduling data versioning

Search results