
Task Parallelism for Ray Tracing on a GPU Cluster

Unlu, Caglar 01 February 2008 (has links) (PDF)
Ray tracing is a computationally complex global illumination algorithm used to produce realistic images. In addition to parallel implementations on commodity PC clusters, Graphics Processing Units (GPUs) have recently been used to accelerate ray tracing. In this thesis, ray tracing is accelerated on a GPU cluster: the viewing plane is divided into unit tiles, which are dynamically assigned to slave processes that work on them in a task-parallel manner. To decrease the number of ray-triangle intersection tests, Bounding Volume Hierarchies (BVHs) are used. It is shown that almost linear speedup can be achieved; on the other hand, API and network overheads are observed to be obstacles to scalability.
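
The dynamic tile assignment described above can be made concrete with a short sketch. This is not the thesis's code: the tile size, the worker count, and the render_tile stand-in are assumptions, and Python's multiprocessing pool plays the role of the slave processes; the point is only the pull-based, task-parallel hand-out of tiles.

```python
# Minimal sketch of dynamic tile assignment to workers. render_tile() is a
# hypothetical stand-in for a slave invoking the GPU ray tracer on one tile.
from multiprocessing import Pool

TILE = 32  # tile edge in pixels (assumed; the unit-tile size is not given here)

def make_tiles(width, height):
    """Split the viewing plane into unit tiles (x, y, w, h)."""
    return [(x, y, min(TILE, width - x), min(TILE, height - y))
            for y in range(0, height, TILE)
            for x in range(0, width, TILE)]

def render_tile(tile):
    x, y, w, h = tile
    # Placeholder: a real slave traces rays for this tile on its GPU,
    # using a BVH to prune ray-triangle intersection tests.
    return tile, bytes(w * h * 3)

if __name__ == "__main__":
    tiles = make_tiles(1024, 768)
    with Pool(processes=4) as pool:
        # imap_unordered hands tiles out dynamically: a worker that finishes
        # early immediately pulls the next tile, balancing the load.
        for tile, pixels in pool.imap_unordered(render_tile, tiles):
            pass  # assemble pixels into the framebuffer here
```

Because workers pull the next tile as soon as they finish, the load stays balanced even when some tiles trace far more rays than others, which is what makes the near-linear speedup plausible.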

On the design of architecture-aware algorithms for emerging applications

Kang, Seunghwa 30 January 2011 (has links)
This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology. We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in the problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also identify several limitations of current system software and architectures, and directions for improving them. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation contributes to these efforts by providing benchmarks and suggestions to improve system software and architectures.

A Concurrent IFDS Dataflow Analysis Algorithm Using Actors

Rodriguez, Jonathan David January 2010 (has links)
There has recently been a resurgence in interest in techniques for effective programming of multi-core computers. Most programmers find general-purpose concurrent programming to be extremely difficult. This difficulty severely limits the number of applications that currently benefit from multi-core computers. There already exist many concurrent solutions for the class of regular applications, which include various algorithms for linear algebra. For the class of irregular applications, which operate on dynamic and pointer- and graph-based structures, efficient concurrent solutions have so far remained elusive. Dataflow analysis applications, which are often found in compilers and program analysis tools, have received particularly little attention with regard to execution on multi-core machines. Operating on the theory that the Actor model, which structures computations as systems of asynchronously-communicating entities, is a more appropriate method for representing irregular algorithms than the shared-memory model, this work presents a concurrent Actor-based formulation of the IFDS, or Interprocedural Finite Distributive Subset, dataflow analysis algorithm. The implementation of this algorithm is done using the Scala language and its Actors library. This algorithm achieves significant speedup on multi-core machines without using any optimistic execution. This work contributes to Actor research by showing how the Actor model can be practically applied to a dataflow analysis problem. This work contributes to static analysis research by showing how a dataflow analysis algorithm can effectively make use of multi-core machines, allowing the possibility of faster and more precise analyses.
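
As an illustration of the Actor structure described above: one actor per procedure owns its dataflow facts and propagates them by asynchronous message passing, so no shared-memory locking is needed. The thesis uses Scala and its Actors library; the Python rendering below, with a thread and a mailbox queue per actor, is only a structural sketch, and the three-procedure call graph and the seed fact are invented for the example.

```python
# Structural sketch of an Actor-style concurrent dataflow worklist for
# IFDS-like propagation of (variable, property) facts between procedures.
import queue
import threading
import time

class ProcedureActor(threading.Thread):
    """One actor per procedure: it alone touches its fact set."""
    def __init__(self, name):
        super().__init__(daemon=True)
        self.name = name
        self.successors = []
        self.inbox = queue.Queue()
        self.seen = set()  # facts already processed at this procedure

    def run(self):
        while True:
            fact = self.inbox.get()
            if fact in self.seen:      # distributive subset: set union
                continue
            self.seen.add(fact)
            for succ in self.successors:
                succ.inbox.put(fact)   # asynchronous message send

# Hypothetical call graph: main -> f -> g
main, f, g = ProcedureActor("main"), ProcedureActor("f"), ProcedureActor("g")
main.successors, f.successors = [f], [g]
for a in (main, f, g):
    a.start()

main.inbox.put(("x", "tainted"))  # seed an initial dataflow fact
time.sleep(0.2)                   # crude settle; a real analysis detects quiescence
print(g.seen)                     # {('x', 'tainted')}
```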

Compiler Support for Fine-grain Software-only Checkpointing

Zhao, Cheng Yan 13 August 2013 (has links)
Checkpointing support allows program execution to roll back to an earlier program point, discarding any modifications made since that point. Existing software-based checkpointing methods are mainly libraries that snapshot all of working memory, and hence have prohibitive overhead for many potential applications. In this thesis we present a light-weight, fine-grain checkpointing framework implemented entirely in software through compiler transformations and optimizations. A programmer can specify arbitrary checkpoint regions via a simple API, and the compiler automatically transforms the code to implement the checkpoint at the granularity of individual stores, optimizing to remove redundancy. We explore three application areas for this support. First, we investigate its application to debugging, in particular by providing the ability to rewind to an arbitrarily-placed point in a buggy program's execution. A study using BugBench applications shows that our compiler-based approach incurs more than 100X less overhead than full-process checkpointing. Second, we demonstrate that compiler-based checkpointing support can be leveraged to free the programmer from manually implementing and maintaining software rollback mechanisms when coding a backtracking algorithm, with runtime overhead of only 15% compared to the manual implementation. Finally, we utilize the efficient software-only checkpointing to support overlapping execution with delinquent loads, demonstrating both control-speculation and data-speculation transformations. We further propose a theoretical speculative timing model and confirm its predictive effectiveness on real-machine workloads.
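
The per-store undo logging that the compiler inserts can be rendered as a toy Python object; the Checkpoint class below is an invented stand-in, not the thesis's API. Each store first logs the location's old value, rollback replays the log in reverse, and commit simply discards it.

```python
# Toy model of fine-grain (per-store) checkpointing via an undo log.
class Checkpoint:
    def __init__(self):
        self.undo_log = []  # (object, key, old_value), one entry per store

    def store(self, obj, key, value):
        # Log the old value before every store; the thesis's compiler
        # additionally removes redundant logs to the same location.
        self.undo_log.append((obj, key, obj[key]))
        obj[key] = value

    def rollback(self):
        # Restore in reverse order so the oldest logged value wins.
        for obj, key, old in reversed(self.undo_log):
            obj[key] = old
        self.undo_log.clear()

    def commit(self):
        self.undo_log.clear()  # keep the new values, drop the log

state = {"x": 1, "y": 2}
cp = Checkpoint()
cp.store(state, "x", 10)
cp.store(state, "y", 20)
cp.rollback()
print(state)  # {'x': 1, 'y': 2}
```

The contrast with full-process checkpointing is visible even in the toy: only locations actually written inside the region cost anything, which is where the reported 100X overhead gap comes from.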

Efficient Multi-ported Memories for FPGAs

LaForest, Charles Eric 15 February 2010 (has links)
Multi-ported memories are challenging to implement on FPGAs since the provided block RAMs typically have only two ports. In this dissertation we present a thorough exploration of the design space of FPGA multi-ported memories by evaluating conventional solutions to this problem, and introduce a new design that efficiently combines block RAMs into multi-ported memories with arbitrary numbers of read and write ports and true random access to any memory location, while achieving significantly higher operating frequencies than conventional approaches. For example, we build a 256-location, 32-bit, 12-ported (4-write, 8-read) memory that operates at 281 MHz on Altera Stratix III FPGAs while consuming an area equivalent to 3679 ALMs: a 43% speed improvement and 84% area reduction over a pure ALM implementation, and a 61% speed improvement over a pure "multipumped" implementation, although the pure multipumped implementation is 7.2-fold smaller.
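
One well-known way to combine two-ported block RAMs into a memory with many ports, and the general idea associated with this work, is the Live Value Table (LVT): replicate storage per write port and keep a small table recording which write port last wrote each address, so any read can pick the correct replica. The behavioral sketch below assumes that scheme and ignores banking, multipumping, and timing entirely.

```python
# Behavioral (not cycle-accurate) sketch of an LVT multi-ported memory.
class LVTMemory:
    def __init__(self, locations, n_write_ports):
        # One replicated bank per write port; block RAMs would back these.
        self.banks = [[0] * locations for _ in range(n_write_ports)]
        self.lvt = [0] * locations  # last writer (port index) per address

    def write(self, port, addr, value):
        self.banks[port][addr] = value  # each write port updates only its bank
        self.lvt[addr] = port           # record who wrote last

    def read(self, addr):
        # Any number of read ports: consult the LVT, then the winning bank.
        return self.banks[self.lvt[addr]][addr]

mem = LVTMemory(locations=256, n_write_ports=4)
mem.write(0, addr=7, value=111)
mem.write(3, addr=7, value=333)  # the later write, from port 3, wins
print(mem.read(7))               # 333
```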

Nested pessimistic transactions for both atomicity and synchronization in concurrent software

Chammah, Tarek January 2011 (has links)
Existing atomic section interface proposals have, thus far, tended only to isolate transactions from each other. Less considered is the coordination of threads performing transactions with respect to one another. Synchronization of nested sections is typically relegated to outside of, and among, the top-level flattened sections. However, existing models do not permit the composition of even simple synchronization constructs such as barriers. The proposed model integrates synchronization as a first-class construct in a truly nested atomic block implementation. The implementation is evaluated on quantitative benchmarks, with qualitative examples of the atomic section interface's expressive power compared with conventional transactional memory implementations.
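
The barrier-composition problem is easy to see in a pessimistic rendering: if an atomic section keeps its isolation while blocked at a barrier, no other thread can enter its own section to reach that barrier, and the program deadlocks. The sketch below shows one simple-minded escape in the spirit of synchronization as a first-class construct, releasing and reacquiring a (here deliberately coarse) lock at the rendezvous; this is an invented illustration, not the thesis's mechanism.

```python
import threading

atomic_lock = threading.RLock()   # one coarse lock: a pessimistic "atomic"
barrier = threading.Barrier(2)

def tx_barrier():
    # Holding the lock across barrier.wait() would deadlock: the other
    # thread could never enter its atomic section to reach the barrier.
    atomic_lock.release()
    barrier.wait()
    atomic_lock.acquire()          # resume a fresh isolated phase

def worker(n, results, totals):
    with atomic_lock:              # enter the atomic section
        results[n] = n             # phase 1, isolated
        tx_barrier()               # rendezvous: all phase-1 writes now visible
        totals[n] = sum(results.values())  # phase 2 sees both writes

results, totals = {}, {}
threads = [threading.Thread(target=worker, args=(i, results, totals))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals)                      # {0: 1, 1: 1} (key order may vary)
```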

Scientific Computing on Multicore Architectures

Tillenius, Martin January 2014 (has links)
Computer simulations are an indispensable tool for scientists to gain new insights about nature. Simulations of natural phenomena are usually large, and limited by the available computer resources. By using the computer resources more efficiently, larger and more detailed simulations can be performed, and more information can be extracted to help advance human knowledge. The topic of this thesis is how to make the best use of modern computers for scientific computations. The challenge here is the high level of parallelism that is required to fully utilize the multicore processors in these systems. Starting from the basics, the primitives for synchronizing between threads are investigated. Hardware transactional memory is a new construct for this, which is evaluated for a new use of importance for scientific software: atomic updates of floating-point values. The evaluation includes experiments on real hardware and comparisons against standard methods. Higher-level programming models for shared-memory parallelism are then considered. The state of the art for efficient use of multicore systems is dynamically scheduled task-based systems, where tasks can depend on data. In such systems, the software is divided into many small tasks that are scheduled asynchronously according to their data dependencies. This enables a high level of parallelism and avoids global barriers. A new system for managing task dependencies is developed in this thesis, based on data versioning. The system is implemented as a reusable software library, and shown to be as efficient as, or more efficient than, other shared-memory task-based systems in experimental comparisons. The developed runtime system is then extended to distributed-memory machines, and used to implement a parallel version of a global climate simulation code. By running the optimized and parallelized version on eight servers, an equally sized problem can be solved over 100 times faster than with the original sequential version. The parallel version also allowed significantly larger problems to be solved, previously unreachable due to memory constraints.
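
The data-versioning idea admits a very small sketch: every datum carries a version counter, a submitted task records the version it requires of each datum it touches, and a task becomes runnable once its data have reached those versions. The miniature below is invented for illustration, not the thesis's library; among other simplifications it runs tasks sequentially, assumes an acyclic dependence graph, and ignores write-after-read ordering.

```python
# Toy task scheduler with data versioning for dependence tracking.
class Handle:
    def __init__(self, value=None):
        self.value, self.version = value, 0  # version counts scheduled writes

class Scheduler:
    def __init__(self):
        self.pending = []  # (required_versions, writes, fn)

    def submit(self, fn, reads=(), writes=()):
        # The task requires the currently scheduled version of each datum;
        # each of its writes then bumps that datum's version for later tasks.
        required = {h: h.version for h in (*reads, *writes)}
        for h in writes:
            h.version += 1
        self.pending.append((required, writes, fn))

    def run(self):
        completed = {}  # completed writes per handle
        while self.pending:
            for task in list(self.pending):
                required, writes, fn = task
                if all(completed.get(h, 0) == v for h, v in required.items()):
                    fn()  # all inputs have reached the required versions
                    for h in writes:
                        completed[h] = completed.get(h, 0) + 1
                    self.pending.remove(task)

a, b = Handle(1), Handle(2)
s = Scheduler()
s.submit(lambda: setattr(a, "value", a.value + 10), writes=(a,))
s.submit(lambda: setattr(b, "value", a.value + b.value), reads=(a,), writes=(b,))
s.run()
print(a.value, b.value)  # 11 13
```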

Exploiting Parallelism in GPUs

Hechtman, Blake Alan January 2014 (has links)
Heterogeneous processors with accelerators provide an opportunity to improve performance within a given power budget. Many of these heterogeneous processors contain Graphics Processing Units (GPUs) that can perform graphics and embarrassingly parallel computation orders of magnitude faster than a CPU while using less energy. Beyond these obvious applications for GPUs, a larger variety of applications can benefit from a GPU's large computation and memory bandwidth. However, many of these applications are irregular and, as a result, require synchronization and scheduling that are commonly believed to perform poorly on GPUs. The basic building block of synchronization and scheduling is memory consistency, which is, therefore, the first place to look for improving performance on irregular applications. In this thesis, we approach the programmability of irregular applications on GPUs by thinking across traditional boundaries of the compute stack. We think about architecture, microarchitecture, and runtime systems from the programmer's perspective. To this end, we study architectural memory consistency on future GPUs with cache coherence. In addition, we design a GPU memory system microarchitecture that can support fine-grain and coarse-grain synchronization without sacrificing throughput. Finally, we develop a task runtime that embraces the GPU microarchitecture to perform well on the fork/join parallelism desired by many programmers. Overall, this thesis contributes non-intuitive solutions to improve the performance and programmability of irregular applications from the programmer's perspective.
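
The fork/join shape that such a task runtime targets can be sketched against an ordinary thread pool; the code below is written in Python purely to show the pattern and has nothing GPU-specific in it. Note the caveat in the comments: blocking joins inside a fixed-size pool can deadlock, which is one reason real fork/join runtimes use work-stealing schedulers instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Plenty of workers for this shallow recursion: a worker that blocks in
# left.result() occupies its thread, so deep fork trees on a small pool
# can deadlock, which is the problem work-stealing runtimes avoid.
pool = ThreadPoolExecutor(max_workers=8)

def fork_join_sum(data, lo, hi, cutoff=2500):
    if hi - lo <= cutoff:
        return sum(data[lo:hi])         # small enough: run sequentially
    mid = (lo + hi) // 2
    left = pool.submit(fork_join_sum, data, lo, mid, cutoff)  # fork
    right = fork_join_sum(data, mid, hi, cutoff)              # one half inline
    return left.result() + right                              # join

data = list(range(10_000))
print(fork_join_sum(data, 0, len(data)))  # 49995000
```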
