41

A High-performance, Reconfigurable Architecture for Restricted Boltzmann Machines

Ly, Daniel Le 15 February 2010
Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications has been limited. A primary cause of this lack of adoption is that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can take advantage of the inherent parallelism in neural networks is desired. This thesis investigates how the Restricted Boltzmann Machine, a popular type of neural network, can be effectively mapped to a high-performance hardware architecture on FPGA platforms. The proposed modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. The framework is tested on a platform of four Xilinx Virtex II-Pro XC2VP70 FPGAs running at 100 MHz through a variety of different configurations. The maximum performance was obtained by instantiating a Restricted Boltzmann Machine of 256x256 nodes distributed across four FPGAs, which results in a computational speed of 3.13 billion connection-updates-per-second and a speed-up of 145-fold over an optimized C program running on a 2.8 GHz Intel processor.
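
As a rough illustration of the computation this abstract describes, the sketch below evaluates the hidden-unit activation probabilities of a small Restricted Boltzmann Machine in plain Python. It is not the thesis's hardware design. The function names (`hidden_probabilities`, `sample_units`, `connection_count`) and the weight layout are assumptions made for this example only.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function for unit activation probabilities."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hidden_probabilities(visible, weights, hidden_bias):
    """Return P(h_j = 1 | v) for every hidden unit j.

    weights[i][j] is the (assumed) weight between visible unit i
    and hidden unit j; an RBM has no visible-visible or hidden-hidden links.
    """
    probs = []
    for j in range(len(hidden_bias)):
        activation = hidden_bias[j]
        for i, v_i in enumerate(visible):
            activation += v_i * weights[i][j]
        probs.append(sigmoid(activation))
    return probs


def sample_units(probs, rng):
    """Draw binary unit states from their activation probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def connection_count(n_visible, n_hidden):
    """Number of weights in a fully connected bipartite RBM."""
    return n_visible * n_hidden
```

Each hidden unit's sum is independent of the others, which is the kind of parallelism the abstract says hardware can exploit. For the 256x256 configuration, `connection_count(256, 256)` gives 65,536 weights. Taken at face value, the reported figures imply a software baseline of roughly 3.13 billion / 145 ≈ 21.6 million connection updates per second.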
42

Acceleration of streaming applications on FPGAs from high level constructs

Mitra, Abhishek. January 2008
Thesis (Ph. D.)--University of California, Riverside, 2008. / Includes abstract. Title from first page of PDF file (viewed March 8, 2010). Available via ProQuest Digital Dissertations. Includes bibliographical references (p. 150-168). Also issued in print.
43

Performance analysis and evaluation of dynamic loop scheduling techniques in a competitive runtime environment for distributed memory architectures

Balasubramaniam, Mahadevan. January 2003
Thesis (M.S.)--Mississippi State University. Department of Computer Science. / Title from title screen. Includes bibliographical references.
44

Performance studies of high-speed communication on commodity cluster

Tam, Tat-chun, Anthony. January 2001
Thesis (Ph. D.)--University of Hong Kong, 2002. / Includes bibliographical references (leaves 147-156).
45

Computational process networks : a model and framework for high-throughput signal processing

Allen, Gregory Eugene 16 June 2011
Many signal and image processing systems for high-throughput, high-performance applications require concurrent implementations in order to realize desired performance. Developing software for concurrent systems is widely acknowledged to be difficult, with common industry practice leaving the burden of preventing concurrency problems on the programmer. The Kahn Process Network model provides the mathematically provable property of determinism of a program result regardless of the execution order of its processes, including concurrent execution. This model is also natural for describing streams of data samples in a signal processing system, where processes transform streams from one data type to another. However, a Kahn Process Network may require infinite memory to execute. I present the dynamic distributed deadlock detection and resolution (D4R) algorithm, which permits execution of Process Networks in bounded memory if it is possible. It detects local deadlocks in a Process Network, determines whether the deadlock can be resolved and, if so, identifies the process that must take action to resolve the deadlock. I propose the Computational Process Network (CPN) model which is based on the formalisms of Kahn’s PN model, but with enhancements that are designed to make it efficiently implementable. These enhancements include multi-token transactions to reduce execution overhead, multi-channel queues for multi-dimensional synchronous data, zero-copy semantics, and consumer and producer firing thresholds for queues. Firing thresholds enable memoryless computation of sliding window algorithms, which are common in signal processing systems. I show that the Computational Process Network model preserves the formal properties of Process Networks, while reducing the operations required to implement sliding window algorithms on continuous streams of data. 
I also present a high-throughput software framework that implements the Computational Process Network model using C++, and which maps naturally onto distributed targets. This framework uses POSIX threads, and can exploit parallelism in both multi-core and distributed systems. Finally, I present case studies to exercise this framework and demonstrate its performance and utility. The final case study is a three-dimensional circular convolution sonar beamformer and replica correlator, which demonstrates the high throughput and scalability of a real-time signal processing algorithm using the CPN model and framework.
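
The consumer firing thresholds mentioned in this abstract can be pictured with a small single-threaded sketch: a consumer may read a full window of samples once the threshold is met, but consumes only part of it, so overlapping windows need no separate history buffer. The class `ThresholdQueue` and the function `moving_average` are hypothetical names invented here and are not taken from the CPN framework, which is written in C++ and is concurrent.

```python
class ThresholdQueue:
    """Hypothetical FIFO with a consumer firing threshold (illustrative only)."""

    def __init__(self):
        self._items = []

    def enqueue(self, samples):
        """Producer side: append new samples to the stream."""
        self._items.extend(samples)

    def can_fire(self, threshold):
        """True once at least `threshold` samples are available to read."""
        return len(self._items) >= threshold

    def peek(self, threshold):
        """Return the first `threshold` samples without consuming them."""
        if not self.can_fire(threshold):
            raise ValueError("consumer threshold not met")
        return self._items[:threshold]

    def dequeue(self, count):
        """Consume `count` samples; the rest stay visible for the next window."""
        del self._items[:count]


def moving_average(queue, window, hop=1):
    """Fire while a full window is visible, consuming only `hop` samples each time."""
    results = []
    while queue.can_fire(window):
        block = queue.peek(window)
        results.append(sum(block) / window)
        queue.dequeue(hop)
    return results
```

For example, after enqueuing `[1, 2, 3, 4, 5]`, `moving_average(queue, 3)` fires three times and leaves two samples queued for the next window. Note that the list slice in `peek` copies data, so this sketch does not reproduce the zero-copy semantics the abstract describes.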
46

Virtual application appliances on clusters

Unal, Erkan Unknown Date
No description available.
47

Creating dynamic application behavior for distributed performance analysis

Lepler, Joerg January 1998
No description available.
48

Overlapping Computation and Communication through Offloading in MPI over InfiniBand

Inozemtsev, Grigori 30 May 2014
As the demands of computational science and engineering simulations increase, the size and capabilities of High Performance Computing (HPC) clusters are also expected to grow. Consequently, the software providing the application programming abstractions for the clusters must adapt to meet these demands. Specifically, the increased cost of interprocessor synchronization and communication in larger systems must be accommodated. Non-blocking operations that allow communication latency to be hidden by overlapping it with computation have been proposed to mitigate this problem. In this work, we investigate offloading a portion of the communication processing to dedicated hardware in order to support communication/computation overlap efficiently. We work with the Message Passing Interface (MPI), the de facto standard for parallel programming in HPC environments. We investigate both point-to-point non-blocking communication and collective operations; our work with collectives focuses on the allgather operation. We develop designs for both flat and hierarchical cluster topologies and examine both eager and rendezvous communication protocols. We also develop a generalized primitive operation with the aim of simplifying further research into non-blocking collectives. We propose a new algorithm for the non-blocking allgather collective and implement it using this primitive. The algorithm has constant resource usage even when executing multiple operations simultaneously. We implemented these designs using CORE-Direct offloading support in Mellanox InfiniBand adapters. We present an evaluation of the designs using microbenchmarks and an application kernel that shows that offloaded non-blocking communication operations can provide latency that is comparable to that of their blocking counterparts while allowing most of the duration of the communication to be overlapped with computation and remaining resilient to process arrival and scheduling variations. 
/ Thesis (Master, Electrical & Computer Engineering) -- Queen's University, 2014-05-29 11:55:53.87
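
For readers unfamiliar with the allgather collective that this abstract focuses on, the following sketch simulates the textbook ring allgather inside a single Python process. It is not the offloaded non-blocking algorithm the thesis proposes, and it makes no MPI or InfiniBand calls. The function name `ring_allgather` and the buffer layout are assumptions for illustration.

```python
def ring_allgather(local_values):
    """Simulate a ring allgather among len(local_values) ranks.

    Each rank starts with one value; after P - 1 steps every rank
    holds all P values in rank order.
    """
    p = len(local_values)
    # buffers[r][k] is rank r's copy of rank k's contribution (None = not received yet).
    buffers = [[None] * p for _ in range(p)]
    for rank, value in enumerate(local_values):
        buffers[rank][rank] = value

    for step in range(p - 1):
        # In step s, rank r forwards the block that originated at rank (r - s) mod p.
        messages = []
        for rank in range(p):
            block = (rank - step) % p
            dest = (rank + 1) % p
            messages.append((dest, block, buffers[rank][block]))
        # Deliver only after every send is posted, mimicking simultaneous transfers.
        for dest, block, value in messages:
            buffers[dest][block] = value
    return buffers
```

In a real non-blocking implementation, each of the P - 1 transfer steps is an opportunity to overlap communication with application computation, which is the effect the thesis pursues by offloading progress to the network adapter.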
49

Scheduling in STAPL

Sharma, Shishir 03 October 2013
Writing efficient parallel programs is a difficult and error-prone process. The Standard Template Adaptive Parallel Library (STAPL) is being developed to make this task easier for programmers with little experience in parallel programming. STAPL is a C++ library for writing parallel programs using a generic programming approach similar to writing sequential programs using the C++ Standard Template Library (STL). STAPL provides a collection of parallel containers (pContainers) to store data in a distributed fashion and a collection of pViews to abstract details of the data distribution. STAPL algorithms are written in terms of PARAGRAPHs, which are high-level descriptions of task dependence graphs. Scheduling plays a very important role in the efficient execution of parallel programs. In this thesis, we present our work to enable efficient scheduling of parallel programs written using STAPL. We abstract the scheduling activities associated with PARAGRAPHs in a software module called the scheduler, which is customizable and extensible. We provide support for static scheduling of PARAGRAPHs and develop mechanisms based on migration of tasks and data to support dynamic scheduling strategies for PARAGRAPHs with arbitrary dependencies. We also provide implementations of different scheduling strategies that can be used to improve the performance of applications suffering from load imbalance. The scheduling infrastructure developed in this thesis is highly customizable and can be used to execute a variety of parallel computations. We demonstrate its usefulness by improving the performance of two applications: a widely used synthetic benchmark (UTS) and a Parallel Motion Planning application. The experiments are conducted on an Opteron cluster and a massively parallel Cray XE6 machine. Experimental results up to 6144 processors are presented.
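
To make the idea of scheduling a task dependence graph concrete, here is a minimal static scheduling sketch in Python. It orders tasks so that each runs after its dependencies and deals ready tasks out to workers round-robin. It does not use STAPL or its scheduler interface. The function `schedule_tasks` and its input format are hypothetical and chosen only for this example.

```python
from collections import deque


def schedule_tasks(dependencies, n_workers):
    """Assign tasks of a dependence graph to workers in dependency order.

    dependencies maps each task name to the list of tasks it depends on;
    every dependency must itself appear as a key.
    Returns a list of (task, worker) pairs.
    """
    remaining = {task: len(deps) for task, deps in dependencies.items()}
    successors = {task: [] for task in dependencies}
    for task, deps in dependencies.items():
        for dep in deps:
            successors[dep].append(task)

    # Tasks with no unmet dependencies are ready to run.
    ready = deque(sorted(t for t, n in remaining.items() if n == 0))
    assignment = []
    next_worker = 0
    while ready:
        task = ready.popleft()
        assignment.append((task, next_worker))
        next_worker = (next_worker + 1) % n_workers
        # Completing a task may release its successors.
        for succ in successors[task]:
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)

    if len(assignment) != len(dependencies):
        raise ValueError("dependence graph contains a cycle")
    return assignment
```

Round-robin placement ignores task cost, so it can leave workers unevenly loaded. That is the kind of imbalance the dynamic, migration-based strategies described in the abstract are meant to address.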
50

Scalable mining on emerging architectures

Buehrer, Gregory T. January 2007
Thesis (Ph. D.)--Ohio State University, 2007.
