121

Solving combinatorial based chemical engineering problems via parallel evolutionary approaches

Wong, King Hei. January 2009
Includes bibliographical references (p. 80-88).
122

Concurrent system-software via soft-instructions

Montague, Bruce R. January 1998
Thesis (Ph. D.)--University of California, Santa Cruz, 1998. / Typescript. Includes bibliographical references (leaves 327-354) and index.
123

High-level programming language abstractions for advanced and dynamic parallel computations

Deitz, Steven J. January 2005
Thesis (Ph. D.)--University of Washington, 2005. / Vita. Includes bibliographical references (p. 157-163).
124

Memory consistency directed cache coherence protocols for scalable multiprocessors

Elver, Marco Iskender January 2016
The memory consistency model, which formally specifies the behavior of the memory system, is used by programmers to reason about parallel programs. From a hardware design perspective, weaker consistency models permit various optimizations in a multiprocessor system: this thesis focuses on designing and optimizing the cache coherence protocol for a given target memory consistency model. Traditional directory coherence protocols are designed to be compatible with the strictest memory consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that provide more relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of scalability, in terms of per-core storage due to sharer tracking, which poses a problem with increasing number of cores in today’s CMPs, most of which no longer are sequentially consistent. The recent convergence towards programming language based relaxed memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86) cannot readily make use of lazy coherence protocols without either: adapting the protocol to satisfy the stricter consistency model; or changing the architecture’s consistency model to (a variant of) RC, typically at the expense of backward compatibility. The first part of this thesis explores both these options, with a focus on a practical approach satisfying backward compatibility. 
Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers and instead relies on self-invalidation and detection of potential acquires (in the absence of explicit synchronization) using per cache line timestamps to efficiently and lazily satisfy the TSO memory consistency model. Our results show that TSO-CC achieves, on average, performance comparable to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales logarithmically with increasing core count. Next, we propose an approach for the x86-64 architecture, which is a compromise between retaining the original consistency model and using a more storage-efficient lazy coherence protocol. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backward compatibility with legacy codes and older microarchitectures. Second, we propose RC3 (based on TSO-CC), a scalable cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC-optimized programs. RC3’s storage overhead per cache line scales logarithmically with increasing core count and reduces on-chip coherence storage overheads by 45% compared to TSO-CC. Finally, it is imperative that hardware adheres to the promised memory consistency model. Indeed, consistency directed coherence protocols cannot use conventional coherence definitions (e.g. SWMR) to be verified against, and few existing verification methodologies apply. 
Furthermore, as the full consistency model is used as a specification, their interaction with other components (e.g. pipeline) of a system must not be neglected in the verification process. Therefore, verifying a system with such protocols in the context of interacting components is even more important than before. One common way to do this is via executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the consistency model. It would be extremely beneficial to execute such tests under simulation, i.e. when the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which when used for simulation-based memory consistency verification would be too slow. We propose McVerSi, a test generation framework for fast memory consistency verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to memory consistency test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering memory consistency bugs. To guide tests towards exercising as much logic as possible, the simulator’s reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the Gem5 cycle accurate simulator in full-system mode with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI protocol due to the faulty interaction of the pipeline and the cache coherence protocol, highlighting that even conventional protocols should be verified rigorously in the context of a full-system. 
Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, therefore providing much higher guarantees compared to alternative approaches (pseudo-random test generation and litmus tests).
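The abstract above mentions litmus tests, in which short multi-threaded instruction sequences are executed and their outcomes checked against the consistency model. The following is a minimal sketch of that idea, not McVerSi's checker: it enumerates every sequentially consistent interleaving of the classic store-buffering test and reports whether an observed outcome is allowed. Sequential consistency is used as the reference model only for brevity, and all names (`THREAD0`, `sc_outcomes`, `violates_sc`) are hypothetical.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


# Store-buffering litmus test: each thread stores to one location,
# then loads the other location into a register.
THREAD0 = [("store", "x", 1), ("load", "y", "r0")]
THREAD1 = [("store", "y", 1), ("load", "x", "r1")]


def run(schedule):
    """Execute one interleaving against a single shared memory; return (r0, r1)."""
    memory = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            memory[loc] = arg
        else:
            regs[arg] = memory[loc]
    return regs["r0"], regs["r1"]


def sc_outcomes():
    """Set of (r0, r1) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD0, THREAD1)}


def violates_sc(observed):
    """True if an observed (r0, r1) pair is not explainable by any SC interleaving."""
    return observed not in sc_outcomes()
```

Under this model the outcome `(0, 0)` is flagged as a violation, whereas TSO permits it because each store may still sit in its core's store buffer when the other core's load executes. A checker for TSO would therefore need a different reference model than the one sketched here.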
125

Interrupt-generating active data objects

Clayton, Peter Graham January 1990
An investigation is presented into an interrupt-generating object model which is designed to reduce the effort of programming distributed memory multicomputer networks. The object model is aimed at the natural modelling of problem domains in which a number of concurrent entities interrupt one another as they lay claim to shared resources. The proposed computational model provides for the safe encapsulation of shared data, and incorporates inherent arbitration for simultaneous access to the data. It supplies a predicate triggering mechanism for use in conditional synchronization and as an alternative mechanism to polling. Linguistic support for the proposal requires a novel form of control structure which is able to interface sensibly with interrupt-generating active data objects. The thesis presents the proposal as an elemental language structure, with axiomatic guarantees which enforce safety properties and aid in program proving. The established theory of CSP is used to reason about the object model and its interface. An overview is presented of a programming language called HUL, whose semantics reflect the proposed computational model. Using the syntax of HUL, the application of the interrupt-generating active data object is illustrated. A range of standard concurrent problems is presented to demonstrate the properties of the interrupt-generating computational model. Furthermore, the thesis discusses implementation considerations which enable the model to be mapped precisely onto multicomputer networks, and which sustain the abstract programming level provided by the interrupt-generating active data object in the wider programming structures of HUL.
126

Tracing de aplicações paralelas com informações de alto nível de abstração / Tracing of parallel applications with information of high level abstraction

Thatyana de Faria Piola 06 July 2007
Parallel computing has become an essential tool for achieving the performance required by applications in many scientific fields, so it is important to evaluate the factors that limit the performance of a parallel application. This work presents the development and implementation of a tool called Hierarchical Analyses, which collects data for analyzing performance factors in parallel programs hierarchically, i.e. the information is collected at the abstraction levels used by the programmer. The tool consists of a collection module and a transformation module. The collection module (HieraCollector) collects the data and stores it in files in XML format, without requiring the user to modify the application's source code. The transformation module (HieraTransform) processes the collected data, extracting measurements to be used in the analysis of the parallel program. To validate the collection and transformation modules, the MPI library and the object-oriented OOPS framework were used. Another contribution of this work is the development of a visual tool called HieraOLAP, which helps the user analyze the performance of parallel programs.
127

Scalable applications in a distributed environment

Andersson, Filip, Norberg, Simon January 2011
As the number of simultaneous users of distributed systems increases, scalability is becoming an important factor to consider during software development. Without sufficient scalability, systems may struggle to manage high loads and may not be able to support a large number of users. We have determined how scalability can best be implemented, and what extra costs this leads to. Our research is based both on a literature review, in which we looked at what others in the field of computer engineering think about scalability, and on implementing a highly scalable system of our own. In the end we arrived at a few general pointers that can help developers determine whether they should focus on scalable development, and what they should consider if they choose to do so.
128

A Skeleton library for Cell Broadband Engine / Ett Skelettbibliotek för Cell Broadband Engine

Ålind, Markus January 2008
The Cell Broadband Engine processor is a powerful processor capable of over 220 GFLOPS. It is highly specialized and can be controlled in detail by the programmer. The Cell is significantly more complicated to program than a standard homogeneous multicore processor such as the Intel Core2 Duo and Quad. This thesis explores the possibility of abstracting away some of the complexities of Cell programming while maintaining high performance. The abstraction is achieved through a library of parallel skeletons implemented in the bulk synchronous parallel programming environment NestStep. The library includes constructs for user-defined, SIMD-optimized data-parallel skeletons such as map, reduce and more. The evaluation of the library includes porting a vector-based scientific computation program from sequential C code to the Cell using the library and the NestStep environment. The ported program shows good performance when compared to the sequential original code run on a high-end x86 processor. The evaluation also shows that a dot product implemented with the skeleton library is faster than the dot product in the IBM BLAS library for the Cell processor when more than two slave processors are used.
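The map and reduce skeletons and the dot-product benchmark named in the abstract above can be illustrated with a small sketch. This is not the NestStep library's API: the function names, the `workers` parameter and the chunking scheme are assumptions made for illustration, and the partitions are processed sequentially here, whereas a real skeleton library would hand each partition to a separate processing element.

```python
from functools import reduce
from operator import add, mul


def chunks(xs, n):
    """Split xs into at most n contiguous partitions of near-equal size."""
    size = max(1, -(-len(xs) // n))  # ceiling division, never zero
    return [xs[i:i + size] for i in range(0, len(xs), size)]


def map_skeleton(f, *arrays):
    """Apply f elementwise across one or more equally long arrays."""
    return [f(*args) for args in zip(*arrays)]


def reduce_skeleton(f, xs, init):
    """Fold xs with the associative operator f, starting from init."""
    return reduce(f, xs, init)


def dot(a, b, workers=4):
    """Dot product built from the two skeletons above.

    Each partition is mapped and reduced locally, then the partial
    sums are combined in a final reduce step.
    """
    partials = [
        reduce_skeleton(add, map_skeleton(mul, ca, cb), 0.0)
        for ca, cb in zip(chunks(a, workers), chunks(b, workers))
    ]
    return reduce_skeleton(add, partials, 0.0)
```

For example, `dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], workers=2)` combines two partial sums into the full product. The two-level structure (local reduce, then global combine) is the part that maps naturally onto a machine with several worker processors.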
129

Parallel computation techniques for virtual acoustics and physical modelling synthesis

Webb, Craig Jonathan January 2014
The numerical simulation of large-scale virtual acoustics and physical modelling synthesis is a computationally expensive process. Time stepping methods, such as finite difference time domain, can be used to simulate wave behaviour in models of three-dimensional room acoustics and virtual instruments. In the absence of any form of simplifying assumptions, and at high audio sample rates, this can lead to simulations that require many hours of computation on a standard Central Processing Unit (CPU). In recent years the video game industry has driven the development of Graphics Processing Units (GPUs) that are now capable of multi-teraflop performance using highly parallel architectures. Whilst these devices are primarily designed for graphics calculations, they can also be used for general purpose computing. This thesis explores the use of such hardware to accelerate simulations of three-dimensional acoustic wave propagation, and embedded systems that create physical models for the synthesis of sound. Test case simulations of virtual acoustics are used to compare the performance of workstation CPUs to that of Nvidia’s Tesla GPU hardware. Using representative multicore CPU benchmarks, such simulations can be accelerated in the order of 5X for single precision and 3X for double precision floating-point arithmetic. Optimisation strategies are examined for maximising GPU performance when using single devices, as well as for multiple device codes that can compute simulations using billions of grid points. This allows the simulation of room models of several thousand cubic metres at audio rates such as 44.1kHz, all within a useable time scale. The performance of alternative finite difference schemes is explored, as well as strategies for the efficient implementation of boundary conditions. Creating physical models of acoustic instruments requires embedded systems that often rely on sparse linear algebra operations. 
The performance efficiency of various sparse matrix storage formats is detailed in terms of the fundamental operations that are required to compute complex models, with an optimised storage system achieving substantial performance gains over more generalised formats. An integrated instrument model of the timpani drum is used to demonstrate the performance gains that are possible using the optimisation strategies developed through this thesis.
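The time stepping methods mentioned in the abstract above can be illustrated with a minimal finite difference time domain update. This sketch is an assumption-laden simplification, not the thesis code: it solves the one-dimensional wave equation with fixed (zero) boundaries in plain Python, whereas the thesis targets three-dimensional room models on GPUs. The function names and the `courant` parameter are hypothetical.

```python
def fdtd_step(u_prev, u, courant):
    """Advance the 1D wave equation by one time step (leapfrog scheme).

    u_prev and u hold the grid at the two previous time levels;
    courant is c * dt / dx and should not exceed 1 for stability in 1D.
    """
    lam2 = courant * courant
    n = len(u)
    u_next = [0.0] * n  # end points stay at zero: fixed boundaries
    for i in range(1, n - 1):
        u_next[i] = (2.0 * u[i] - u_prev[i]
                     + lam2 * (u[i + 1] - 2.0 * u[i] + u[i - 1]))
    return u_next


def simulate(u_prev, u, courant, steps):
    """Run the update for a number of steps and return the final grid."""
    for _ in range(steps):
        u_prev, u = u, fdtd_step(u_prev, u, courant)
    return u
```

Every interior grid point is updated independently from the two previous time levels, which is why this kind of scheme parallelises well: on a GPU, each point (or small block of points) can be assigned to its own thread.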
130

Analyzing communication flow and process placement in Linda programs on transputers

De-Heer-Menlah, Frederick Kofi 28 November 2012
With the evolution of parallel and distributed systems, users from diverse disciplines have looked to these systems as a solution to their ever increasing needs for computer processing resources. Because parallel processing systems currently require a high level of expertise to program, many researchers are investing effort into developing programming approaches which hide some of the difficulties of parallel programming from users. Linda is one such parallel paradigm: it is intuitive to use and provides a high-level decoupling between distributable components of parallel programs. In Linda, efficiency becomes a concern of the implementation rather than of the programmer. There is a substantial overhead in implementing Linda, an inherently shared memory model, on a distributed system. This thesis describes a compile-time analysis of tuple space interactions which reduces the run-time matching costs and permits the distribution of the tuple space data. A language-independent module is presented which partitions the tuple space data and suggests appropriate storage schemes for the partitions so as to optimise Linda operations. The thesis also discusses hiding the network topology from the user by automatically allocating Linda processes and tuple space partitions to nodes in the network of transputers. This is done by introducing a fast placement algorithm developed for Linda.
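The tuple space operations and partitioning discussed in the abstract above can be sketched as follows. This is a single-process illustration under stated assumptions, not the thesis implementation: the class and the `ANY` wildcard are hypothetical, only the non-blocking `inp` and `rdp` variants are shown, and tuples are partitioned simply by arity, which stands in for the more detailed compile-time analysis the thesis describes.

```python
from collections import defaultdict

ANY = object()  # hypothetical wildcard standing in for a Linda formal field


class TupleSpace:
    """Minimal tuple space with out, rdp (read) and inp (withdraw)."""

    def __init__(self):
        # Partition by arity so a match only scans tuples of the same length.
        self._partitions = defaultdict(list)

    def out(self, *fields):
        """Deposit a tuple into the space."""
        self._partitions[len(fields)].append(tuple(fields))

    @staticmethod
    def _matches(pattern, candidate):
        return all(p is ANY or p == c for p, c in zip(pattern, candidate))

    def rdp(self, *pattern):
        """Return the first matching tuple without removing it, or None."""
        for candidate in self._partitions[len(pattern)]:
            if self._matches(pattern, candidate):
                return candidate
        return None

    def inp(self, *pattern):
        """Remove and return the first matching tuple, or None."""
        partition = self._partitions[len(pattern)]
        for index, candidate in enumerate(partition):
            if self._matches(pattern, candidate):
                return partition.pop(index)
        return None
```

A producer might call `space.out("task", 1)` and a worker `space.inp("task", ANY)`; neither needs to know about the other, which is the decoupling the abstract refers to. Narrowing each lookup to one partition is the kind of run-time matching saving that partitioning aims for, and separate partitions could in principle be placed on different nodes.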
