81

Knowledge support for parallel performance data mining /

Huck, Kevin A., January 2009 (has links)
Typescript. Includes vita and abstract. Includes bibliographical references (leaves 218-231). Also available online in Scholars' Bank; and in ProQuest, free to University of Oregon users.
82

Parallel lossless data compression based on the Burrows-Wheeler Transform /

Gilchrist, Jeffrey S. January 1900 (has links)
Thesis (M.App.Sc.) - Carleton University, 2007. / Includes bibliographical references (p. 99-103). Also available in electronic format on the Internet.
83

Pointer analysis: building a foundation for effective program analysis /

Hardekopf, Benjamin Charles. January 1900 (has links)
Thesis (Ph. D.)--University of Texas at Austin, 2009. / Title from PDF title page (University of Texas Digital Repository, viewed on Sept. 16, 2009). Vita. Includes bibliographical references.
84

Parallel computational geometry on Analog Hopfield Networks.

Valiveti, Natana (Carleton University, Computer Science) January 1992 (has links)
Thesis (M.C.S.)--Carleton University, 1993. / Also available in electronic format on the Internet.
85

Solving combinatorial based chemical engineering problems via parallel evolutionary approaches /

Wong, King Hei. January 2009 (has links)
Includes bibliographical references (p. 80-88).
86

Memory consistency directed cache coherence protocols for scalable multiprocessors

Elver, Marco Iskender January 2016 (has links)
The memory consistency model, which formally specifies the behavior of the memory system, is used by programmers to reason about parallel programs. From a hardware design perspective, weaker consistency models permit various optimizations in a multiprocessor system: this thesis focuses on designing and optimizing the cache coherence protocol for a given target memory consistency model. Traditional directory coherence protocols are designed to be compatible with the strictest memory consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that provide more relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of scalability in terms of per-core storage due to sharer tracking, which poses a problem with the increasing number of cores in today’s CMPs, most of which are no longer sequentially consistent.

The recent convergence towards programming-language-based relaxed memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking, which benefits scalability. On the downside, they are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86) cannot readily make use of lazy coherence protocols without either adapting the protocol to satisfy the stricter consistency model, or changing the architecture's consistency model to (a variant of) RC, typically at the expense of backward compatibility. The first part of this thesis explores both these options, with a focus on a practical approach that preserves backward compatibility.

Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and the existing parallel programs written for these architectures, we first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers; instead it relies on self-invalidation and detection of potential acquires (in the absence of explicit synchronization), using per-cache-line timestamps to efficiently and lazily satisfy the TSO memory consistency model. Our results show that TSO-CC achieves, on average, performance comparable to a MESI directory protocol, while TSO-CC's storage overhead per cache line scales logarithmically with increasing core count.

Next, we propose an approach for the x86-64 architecture that is a compromise between retaining the original consistency model and using a more storage-efficient lazy coherence protocol. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backward compatibility with legacy codes and older microarchitectures. Second, we propose RC3 (based on TSO-CC), a scalable cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC-optimized programs. RC3's storage overhead per cache line scales logarithmically with increasing core count and reduces on-chip coherence storage overheads by 45% compared to TSO-CC.

Finally, it is imperative that hardware adheres to the promised memory consistency model. Consistency-directed coherence protocols cannot be verified against conventional coherence definitions (e.g. SWMR), and few existing verification methodologies apply. Furthermore, as the full consistency model is used as the specification, the interaction with other components of the system (e.g. the pipeline) must not be neglected in the verification process. Therefore, verifying a system with such protocols in the context of interacting components is even more important than before. One common way to do this is by executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the consistency model. It would be extremely beneficial to execute such tests under simulation, i.e. while the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which would be too slow when used for simulation-based memory consistency verification. We propose McVerSi, a test generation framework for fast memory consistency verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to memory consistency test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering memory consistency bugs. To guide tests towards exercising as much logic as possible, the simulator's reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the gem5 cycle-accurate simulator in full-system mode with Ruby (with configurations that use gem5's MESI protocol, and our proposed TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI protocol due to the faulty interaction of the pipeline and the cache coherence protocol, highlighting that even conventional protocols should be verified rigorously in the context of a full system. Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, providing much higher guarantees than alternative approaches (pseudo-random test generation and litmus tests).
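As a generic illustration of the kind of test exercised by memory consistency verification (a sketch for exposition only, not part of the McVerSi framework), the classic store-buffering litmus test below separates TSO from sequential consistency: the final outcome r1 == 0 && r2 == 0 is forbidden under SC but permitted under TSO, because each core's store may be buffered past its subsequent load.

    // Store-buffering (SB) litmus test sketch. Under SC the outcome
    // r1 == 0 && r2 == 0 is forbidden; under TSO it is allowed, because each
    // hardware thread may buffer its store past the following load.
    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    void t0() {
        x.store(1, std::memory_order_relaxed);   // may linger in the store buffer
        r1 = y.load(std::memory_order_relaxed);  // may overtake the buffered store
    }

    void t1() {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    }

    int main() {
        const int runs = 100000;
        int relaxed = 0;
        for (int i = 0; i < runs; ++i) {
            x = 0; y = 0; r1 = -1; r2 = -1;
            std::thread a(t0), b(t1);
            a.join(); b.join();
            if (r1 == 0 && r2 == 0) ++relaxed;   // SC-forbidden, TSO-allowed
        }
        std::printf("relaxed SB outcome seen %d of %d runs\n", relaxed, runs);
    }

McVerSi's contribution, per the abstract, is generating and scheduling such tests automatically under simulation, biasing the GP crossover towards memory operations that contribute to non-determinism, rather than relying on a fixed litmus suite.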
87

Tracing de aplicações paralelas com informações de alto nível de abstração / Tracing of parallel applications with information of high level abstraction

Thatyana de Faria Piola 06 July 2007 (has links)
Parallel computing has become an essential tool for achieving the performance required by applications in many scientific areas, so it is important to evaluate the factors that limit the performance of a parallel application. This work presents the development and implementation of a tool called Hierarchical Analyses, which supports data collection for the performance analysis of parallel programs in a hierarchical fashion, i.e. information is collected at the abstraction levels used by the programmer in the application. The tool consists of a collection module and a transformation module. The collection module, HieraCollector, collects the data and stores it in XML files, without requiring the user to change the application's source code. The transformation module, HieraTransform, processes the collected data, extracting measurements used in the analysis of the parallel program. To validate the collection and transformation modules, implementations for the MPI library and the object-oriented OOPS framework were developed. Another contribution of this work is the development of a visual tool called HieraOLAP, which assists the user in the performance analysis of parallel programs.
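For illustration only (the class name, API and XML layout below are assumptions, not HieraCollector's actual format), a minimal sketch of hierarchical trace collection: each event is tagged with the abstraction level it belongs to and written to XML for a later transformation step.

    // Minimal sketch of hierarchical trace collection: each event carries the
    // abstraction level it was recorded at and is dumped to XML. Names and the
    // XML layout are illustrative assumptions, not the tool's real format.
    #include <fstream>
    #include <string>
    #include <utility>
    #include <vector>

    struct TraceEvent {
        std::string level;  // e.g. "application", "object", "method", "message"
        std::string name;   // what ran: a method, an MPI call, ...
        double start_s;
        double end_s;
    };

    class TraceCollector {
    public:
        void record(std::string level, std::string name, double start_s, double end_s) {
            events_.push_back({std::move(level), std::move(name), start_s, end_s});
        }
        // Write one <event> element per record for the transformation module to consume.
        void write_xml(const std::string& path) const {
            std::ofstream out(path);
            out << "<trace>\n";
            for (const auto& e : events_)
                out << "  <event level=\"" << e.level << "\" name=\"" << e.name
                    << "\" start=\"" << e.start_s << "\" end=\"" << e.end_s << "\"/>\n";
            out << "</trace>\n";
        }
    private:
        std::vector<TraceEvent> events_;
    };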
88

A Skeleton library for Cell Broadband Engine / Ett Skelettbibliotek för Cell Broadband Engine

Ålind, Markus January 2008 (has links)
The Cell Broadband Engine is a powerful processor capable of over 220 GFLOPS. It is highly specialized and can be controlled in detail by the programmer, but it is significantly more complicated to program than a standard homogeneous multi-core processor such as the Intel Core 2 Duo or Quad. This thesis explores the possibility of abstracting away some of the complexities of Cell programming while maintaining high performance. The abstraction is achieved through a library of parallel skeletons implemented in the bulk-synchronous parallel programming environment NestStep. The library includes constructs for user-defined, SIMD-optimized data-parallel skeletons such as map and reduce. The evaluation of the library includes porting a vector-based scientific computation program from sequential C code to the Cell using the library and the NestStep environment. The ported program performs well compared with the original sequential code run on a high-end x86 processor. The evaluation also shows that a dot product implemented with the skeleton library is faster than the dot product in the IBM BLAS library for the Cell when more than two slave processors are used.
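A minimal sequential sketch of the map and reduce skeleton idea, with a dot product expressed through them (the real library distributes the SIMD-optimized user functions across Cell SPEs under NestStep; the names here are illustrative, not the library's API):

    // Sequential sketch of map/reduce skeletons and a dot product built from them.
    #include <cstddef>
    #include <vector>

    // map: apply a binary user function element-wise to two input vectors.
    template <typename T, typename F>
    std::vector<T> skeleton_map(const std::vector<T>& a, const std::vector<T>& b, F f) {
        std::vector<T> out(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) out[i] = f(a[i], b[i]);
        return out;
    }

    // reduce: combine all elements with an associative user function.
    template <typename T, typename F>
    T skeleton_reduce(const std::vector<T>& v, T init, F f) {
        T acc = init;
        for (const T& x : v) acc = f(acc, x);
        return acc;
    }

    // Dot product = map (element-wise multiply) followed by reduce (sum).
    float dot(const std::vector<float>& a, const std::vector<float>& b) {
        auto products = skeleton_map(a, b, [](float x, float y) { return x * y; });
        return skeleton_reduce(products, 0.0f, [](float s, float x) { return s + x; });
    }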
89

Parallel computation techniques for virtual acoustics and physical modelling synthesis

Webb, Craig Jonathan January 2014 (has links)
The numerical simulation of large-scale virtual acoustics and physical modelling synthesis is a computationally expensive process. Time stepping methods, such as finite difference time domain, can be used to simulate wave behaviour in models of three-dimensional room acoustics and virtual instruments. In the absence of any form of simplifying assumptions, and at high audio sample rates, this can lead to simulations that require many hours of computation on a standard Central Processing Unit (CPU). In recent years the video game industry has driven the development of Graphics Processing Units (GPUs) that are now capable of multi-teraflop performance using highly parallel architectures. Whilst these devices are primarily designed for graphics calculations, they can also be used for general purpose computing. This thesis explores the use of such hardware to accelerate simulations of three-dimensional acoustic wave propagation, and embedded systems that create physical models for the synthesis of sound. Test case simulations of virtual acoustics are used to compare the performance of workstation CPUs to that of Nvidia’s Tesla GPU hardware. Using representative multicore CPU benchmarks, such simulations can be accelerated in the order of 5X for single precision and 3X for double precision floating-point arithmetic. Optimisation strategies are examined for maximising GPU performance when using single devices, as well as for multiple device codes that can compute simulations using billions of grid points. This allows the simulation of room models of several thousand cubic metres at audio rates such as 44.1kHz, all within a useable time scale. The performance of alternative finite difference schemes is explored, as well as strategies for the efficient implementation of boundary conditions. Creating physical models of acoustic instruments requires embedded systems that often rely on sparse linear algebra operations. The performance efficiency of various sparse matrix storage formats is detailed in terms of the fundamental operations that are required to compute complex models, with an optimised storage system achieving substantial performance gains over more generalised formats. An integrated instrument model of the timpani drum is used to demonstrate the performance gains that are possible using the optimisation strategies developed through this thesis.
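For reference, a scalar sketch of the explicit finite difference time domain update at the core of such simulations, here for the 1-D wave equation (the thesis targets 3-D room acoustics on GPUs; this shows only the stencil, under the usual Courant stability condition lambda = c*dt/dx <= 1):

    // One leapfrog FDTD step for the 1-D wave equation u_tt = c^2 u_xx.
    // lambda = c*dt/dx is the Courant number; boundaries are held at zero.
    #include <cstddef>
    #include <vector>

    void fdtd_step(std::vector<double>& u_next,
                   const std::vector<double>& u_curr,
                   const std::vector<double>& u_prev, double lambda) {
        const double l2 = lambda * lambda;
        for (std::size_t i = 1; i + 1 < u_curr.size(); ++i) {
            u_next[i] = 2.0 * u_curr[i] - u_prev[i]
                      + l2 * (u_curr[i + 1] - 2.0 * u_curr[i] + u_curr[i - 1]);
        }
    }

Each grid point's update depends only on the previous two time levels, so all points in a step can be computed independently, which is what makes the scheme map well onto thousands of parallel GPU threads.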
90

Parallel process placement

Handler, Caroline January 1989 (has links)
This thesis investigates methods of automatic allocation of processes to available processors in a given network configuration. The research described covers the investigation of various algorithms for optimal process allocation. Among those researched were an algorithm which used a branch and bound technique, an algorithm based on graph theory, and a heuristic algorithm involving cluster analysis. These have been implemented and tested in conjunction with the gathering of performance statistics during program execution, for use in improving subsequent allocations. The system has been implemented on a network of loosely-coupled microcomputers using multi-port serial communication links to simulate a transputer network. The concurrent programming language occam has been implemented, replacing the explicit process allocation constructs with an automatic placement algorithm. This enables the source code to be completely separated from hardware considerations.
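As a toy illustration of the placement problem itself (not one of the thesis's branch-and-bound, graph-theoretic or clustering algorithms), a greedy heuristic can assign each process to the processor that minimizes communication with the processes already placed elsewhere, subject to a capacity limit:

    // Toy greedy heuristic for process placement: put each process on the
    // processor minimizing traffic to already-placed processes on other
    // processors, subject to a per-processor capacity. Illustrative only.
    #include <cstddef>
    #include <limits>
    #include <vector>

    // comm[i][j] = communication volume between processes i and j.
    std::vector<int> greedy_placement(const std::vector<std::vector<double>>& comm,
                                      std::size_t num_procs, std::size_t capacity) {
        const std::size_t n = comm.size();
        std::vector<int> place(n, -1);
        std::vector<std::size_t> load(num_procs, 0);
        for (std::size_t i = 0; i < n; ++i) {
            double best_cost = std::numeric_limits<double>::max();
            int best_p = -1;
            for (std::size_t p = 0; p < num_procs; ++p) {
                if (load[p] >= capacity) continue;
                // Cost of putting process i on p: traffic to neighbours placed elsewhere.
                double cost = 0.0;
                for (std::size_t j = 0; j < i; ++j)
                    if (place[j] != static_cast<int>(p)) cost += comm[i][j];
                if (cost < best_cost) { best_cost = cost; best_p = static_cast<int>(p); }
            }
            place[i] = best_p;
            if (best_p >= 0) ++load[static_cast<std::size_t>(best_p)];
        }
        return place;
    }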
