Global ETD Search

31	Methodologies and tools for computation offloading on heterogeneous multicores Bhagwat, Ashwini 18 May 2009 (has links) Frequency scaling in traditional computing systems has hit the power wall and multicore computing is here to stay. Unlike homogeneous multicores which have uniform architecture and instruction set across cores, heterogenous multicores have differentially capable cores to provide optimal performance for specialized functionality. However, this heterogeneity also translates into difficult programming models, and extracting its potential is not trivial. The Cell Broadband Engine by the Sony Toshiba IBM(STI) consortium was amongst the first heterogenous multicore systems with a single Power Processing Unit(PPU) and 8 Synergistic Processor Units (SPUs). We address the issue of porting an existing sequential C/C++ codebase on to the Cell BE through compiler driven program analysis and profiling. Until parallel programming models evolve, the "interim" solution to performance involves speeding up legacy code by offloading computationally intense parts of a sequential thread to the co-processor; thus using it as an accelerator. Unique architectural characteristics of an accelerator makes this problem quite challenging. On the Cell, these characteristics include limited local store of the SPU, high latency of data transfer between PPU and SPU, lack of branch prediction unit, limited SIMDizability, expensive scalar code etc. In particular, the designers of the Cell have opted for software controlled memory on its SPUs to reduce power consumption and to give programmers more control over the predictability of latency. The lack of a hardware cache on the SPU can create performance bottlenecks because any data that needs to be brought in to the SPU must be brought in using a DMA call. The need for supporting a software controlled cache is thus evident for irregular memory accesses on the SPU. For such a cache to result in improved performance, the amount of time spent in book-keeping and tracking at run-time should be minimal. Traditional algorithms like LRU, when implemented in software incur overheads on every cache hit because appropriate data structures need to be updated. Such overheads are on off critical path for traditional hardware cache but on the critical path for a software controlled cache. Thus there is a need for better management of "data movement" for the code that is offloaded on to the SPU. This thesis addresses the "code partitioning" problem as well as the "data movement" problem. We present GLIMPSES - a compiler driven profiling tool that analyzes existing C/C++ code for its suitability for porting to the Cell, and presents its results in an interactive visualizer. Software Controlled Cache - an improved eviction policy that exploits information gleaned from memory traces generated through offline profiling. The trace is analyzed to provide guidance for a run-time state machine within the cache manager; resulting in reduced run-time overhead and better performance. The design tradeoffs and several pros and cons of this approach are brought forth as well. It is shown that with just about the right amount of runtime book-keeping and decision making, one can get to the difficult solution space of the right balance to achieve high performance. Parallel Computing Coprocessors Coding theory
32	PERFORMANCE ESTIMATION AND SCHEDULING FOR PARALLEL PROGRAMS WITH CRITICAL SECTIONS Dutta, Sourav 01 May 2017 (has links) A fundamental problem in multithreaded parallel programs is the partial serialization that is imposed due to the presence of mutual exclusion variables or critical sections. In this work we investigate a model that considers the threads consisting of an equal number L of functional blocks, where each functional block has the same duration and either accesses a critical section or executes non-critical code. We derived formulas to estimate the average time spent in a critical section in presence of synchronization barrier and in absence of it. We also develop and establish the optimality of a fast polynomial-time algorithm to find a schedule with the shortest makespan for any number of threads and for any number of critical sections for the case of L = 2. For the general case L > 2, which is NP-complete, we present a competitive heuristic and provide experimental comparisons with the ideal integer linear programming (ILP) formulation. Distributed computing Parallel computing Scheduling
33	Branch-level scheduling in Aurora : the Dharma scheduler Sindaha, Raed Yousef Saba January 1995 (has links) No description available. 005
34	Efficient scheduling of parallel applications on workstation clusters Dantas, Mario A. R. January 1996 (has links) No description available. 005
35	FADI : a fault-tolerant environment for distributed processing systems Osman, Taha Mohammed January 1998 (has links) No description available. 005 Parallel computing; Error detection; PVM
36	Practical structured parallelism using BMF Crooke, David January 1998 (has links) This thesis concerns the use of the Bird- Meertens Formalism as a mechanism to control parallelism in an imperative programming language. One of the main reasons for the failure of parallelism to enter mainstream computing is the difficulty of developing software and the lack of the portability and performance predictability enjoyed by sequential systems. A key objetive should be to minimise costs by abstracting much of the complexity away from the programmer. Criteria for a suitable parallel programming paradigm to meet this goal are defined. The Bird-Meertens Formalism, which has in the past been shown to be a suitable vehicle for expressing parallel algorithms, is used as the basis for a proposed imperative parallel programming paradigm which meets these criteria. A programming language is proposed which is an example of this paradigm, based on the BMF Theory of Lists and the sequential language C. A concurrent operational semantics is outlined, with the emphasis on its use as a practical tool for imcreasing confidence in program correctness, rather than on full and rigorous formality. A prototype implementation of a subset of this language for a distributed memory, massively parallel computer is produced in the form of a C subroutine library. Although not offering realistic absolute performance, it permits measurements of scalability and relative performance to be undertaken. A case study is undertaken which implements a simple but realistic algoritm in the language, and considers how well the the criteria outlined at the start of the project are met. The prototype library implementation is used for performance measurements. A range of further possibilities is examinedm, in particular ways in which the paradigm language may be extended, and the possibility of using alternative BMF-like type theories. Pragmatic considerations for achieving performance in a production implementation are discussed. 004.01
37	Design, development and evaluation of an efficient hierarchical interconnection network. Campbell, Stuart M. January 1999 (has links) Parallel computing has long been an area of research interest because exploiting parallelism in difficult problems has promised to deliver orders of magnitude speedups. Processors are now both powerful and cheap, so that systems incorporating tens, hundreds or even thousands of powerful processors need not be prohibitively expensive. The weak link in exploiting parallelism is the means of communication between the processors. Shared memory systems are fundamentally limited in the number of processors they can utilise. To achieve high levels of parallelism it is still necessary to use distributed memory and some form of interconnection network. But interconnection networks can be costly, slow, difficult to build and expand, vulnerable to faults and limited in the range of problems they can be used to solve effectively. As a result there has been extensive research into developing interconnection networks which overcome some or all of these difficulties. In this thesis it is argued that a new interconnection network, Hierarchical Cliques (HiC), and a derivative, FatHiC, possesses many desirable properties and are worthy of consideration for use in building parallel computers. A fundamental element of an interconnection network is its topology. After defining the topology of HiC, expressions are derived for the various parameters which define its underlying limits of performance and fault tolerance. A second element of an interconnection network is an addressing and routing scheme. The addressing scheme and routing algorithms of HiC are described. The flexibility of HiC is demonstrated by developing embeddings of popular, regular interconnection networks. Some embeddings into HiC suffer from high congestion, however the FatHiC network is shown to have low congestion for those embeddings. The performance of some important, regular, data parallel problems on HiC and ++ / FatHiC are determined by analysis and simulation, using the 2D-mesh as a means of comparison. But performance alone does not tell the whole story. Any parallel computer system must be cost effective. In order to analyse the cost effectiveness of HiCs an existing measure was expanded to provide a more realistic model and a more accurate means of comparison. One aim of this thesis is to demonstrate the suitability of HiC for parallel computing systems which execute irregular algorithms requiring dynamic load balancing. A new dynamic load balancing algorithm is proposed which takes advantage of the hierarchical structure of the HiC to reduce communication overheads incurred when distributing work. To demonstrate performance of an irregular problem, a novel parallel algorithm was developed to detect subgraph isomorphism from many model graphs to a single input graph. The use of the new load balancing algorithm in conjunction with the subgraph isomorphism algorithm is discussed.
38	Computational Structure of the N-body Problem Katzenelson, Jacob 01 April 1988 (has links) This work considers the organization and performance of computations on parallel computers of tree algorithms for the N-body problem where the number of particles is on the order of a million. The N-body problem is formulated as a set of recursive equations based on a few elementary functions, which leads to a computational structure in the form of a pyramid-like graph, where each vertex is a process, and each arc a communication link. The pyramid is mapped to three different processor configurations: (1) A pyramid of processors corresponding to the processes pyramid graph; (2) A hypercube of processors, e.g., a connection-machine like architecture; (3) A rather small array, e.g., $2 \\times 2 \\ times 2$, of processors faster than the ones considered in (1) and (2) above. Simulations of this size can be performed on any of the three architectures in reasonable time. N-body problem parallel computing particle simulation
39	Parallel VLSI Circuit Analysis and Optimization Ye, Xiaoji 2010 December 1900 (has links) The prevalence of multi-core processors in recent years has introduced new opportunities and challenges to Electronic Design Automation (EDA) research and development. In this dissertation, a few parallel Very Large Scale Integration (VLSI) circuit analysis and optimization methods which utilize the multi-core computing platform to tackle some of the most difficult contemporary Computer-Aided Design (CAD) problems are presented. The first CAD application that is addressed in this dissertation is analyzing and optimizing mesh-based clock distribution network. Mesh-based clock distribution network (also known as clock mesh) is used in high-performance microprocessor designs as a reliable way of distributing clock signals to the entire chip. The second CAD application addressed in this dissertation is the Simulation Program with Integrated Circuit Emphasis (SPICE) like circuit simulation. SPICE simulation is often regarded as the bottleneck of the design flow. Recently, parallel circuit simulation has attracted a lot of attention. The first part of the dissertation discusses circuit analysis techniques. First, a combination of clock network specific model order reduction algorithm and a port sliding scheme is presented to tackle the challenges in analyzing large clock meshes with a large number of clock drivers. Our techniques run much faster than the standard SPICE simulation and existing model order reduction techniques. They also provide a basis for the clock mesh optimization. Then, a hierarchical multi-algorithm parallel circuit simulation (HMAPS) framework is presented as an novel technique of parallel circuit simulation. The inter-algorithm parallelism approach in HMAPS is completely different from the existing intra-algorithm parallel circuit simulation techniques and achieves superlinear speedup in practice. The second part of the dissertation talks about parallel circuit optimization. A modified asynchronous parallel pattern search (APPS) based method which utilizes the efficient clock mesh simulation techniques for the clock driver size optimization problem is presented. Our modified APPS method runs much faster than a continuous optimization method and effectively reduces the clock skew for all test circuits. The third part of the dissertation describes parallel performance modeling and optimization of the HMAPS framework. The performance models and runtime optimization scheme improve the speed of HMAPS further more. The dynamically adapted HMAPS becomes a complete solution for parallel circuit simulation. Parallel computing circuit simulation circuit optimization
40	A Phase Based Dense Stereo Algorithm Implemented in CUDA Macomber, Brent David 2011 May 1900 (has links) Stereo imaging is routinely used in Simultaneous Localization and Mapping (SLAM) systems for the navigation and control of autonomous spacecraft proximity operations, advanced robotics, and robotic mapping and surveying applications. A key step (and generally the most computationally expensive step) in the generation of high fidelity geometric environment models from image data is the solution of the dense stereo correspondence problem. A novel method for solving the stereo correspondence problem to sub-pixel accuracy in the Fourier frequency domain by exploiting the Convolution Theorem is developed. The method is tailored to challenging aerospace applications by incorporation of correction factors for common error sources. Error-checking metrics verify correspondence matches to ensure high quality depth reconstructions are generated. The effect of geometric foreshortening caused by the baseline displacement of the cameras is modeled and corrected, drastically improving correspondence matching on highly off-normal surfaces. A metric for quantifying the strength of correspondence matches is developed and implemented to recognize and reject weak correspondences, and a separate cross-check verification provides a final defense against erroneous matches. The core components of this phase based dense stereo algorithm are implemented and optimized in the Compute Uni ed Device Architecture (CUDA) parallel computation environment onboard an NVIDIA Graphics Processing Unit (GPU). Accurate dense stereo correspondence matching is performed on stereo image pairs at a rate of nearly 10Hz. Dense Stereo Parallel Computing CUDA SLAM

Search results