Global ETD Search

51	Virtual application appliances on clusters Unal, Erkan 06 1900 (has links) Variations between the software environments(e.g., installed applications, versions of libraries) on different high-performance computing (HPC) systems lead to a heterogeneity problem. Therefore, we design an optimized, homogeneous virtual machine (VM) called a virtual application appliance (VAA). Scientists can package scientific applications, and all supporting software components, as VAAs and run them independently from the underlying heterogeneous HPC systems. However, securely moving data in and out of the VAA and controlling the execution of applications are not trivial for a non-computer scientist. Consequently, we develop two automated stage-in/stage-out secure data movement mechanisms. We also explore a migration mechanism to further simplify the control of the VAA execution. Empirical evaluation results show that VAAs achieve near-native performance in widely used bioinformatics applications that we tested. Data movement, VM boot up, shutdown and migration overheads of VAAs are negligible with respect to total run-times. virtual machines system software clusters high performance computing scientific applications
52	GPU Implementation of a Novel Approach to Cramer’s Algorithm for Solving Large Scale Linear Systems West, Rosanne Lane 01 May 2010 (has links) Scientific computing often requires solving systems of linear equations. Most software pack- ages for solving large-scale linear systems use Gaussian elimination methods such as LU- decomposition. An alternative method, recently introduced by K. Habgood and I. Arel, involves an application of Cramer’s Rule and Chio’s condensation to achieve a better per- forming system for solving linear systems on parallel computing platforms. This thesis describes an implementation of this algorithm on an nVidia graphics processor card us- ing the CUDA language. Increased performance, relative to the serial implementation, is demonstrated, paving the way for future parallel realizations of the scheme. high performance computing GPU CUDA Other Computer Engineering
53	Performance Projections of HPC Applications on Chip Multiprocessor (CMP) Based Systems Shawky Sharkawi, Sameh Sh 2011 May 1900 (has links) Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement and application refinements. In this dissertation, we present an efficient method to project the performance of HPC applications onto Chip Multiprocessor (CMP) based systems using widely available standard benchmark data. The main advantage of this method is the use of published data about the target machine; the target machine need not be available. With the current trend in HPC platforms shifting towards cluster systems with chip multiprocessors (CMPs), efficient and accurate performance projection becomes a challenging task. Typically, CMP-based systems are configured hierarchically, which significantly impacts the performance of HPC applications. The goal of this research is to develop an efficient method to project the performance of HPC applications onto systems that utilize CMPs. To provide for efficiency, our projection methodology is automated (projections are done using a tool) and fast (with small overhead). Our method, called the surrogate-based workload application projection method, utilizes surrogate benchmarks to project an HPC application performance on target systems where computation component of an HPC application is projected separately from the communication component. Our methodology was validated on a variety of systems utilizing different processor and interconnect architectures with high accuracy and efficiency. The average projection error on three target systems was 11.22 percent with standard deviation of 1.18 percent for twelve HPC workloads. High Performance Computing Parallel Systems Parallel Applications Performance Projection Prediction
54	A High-performance, Reconfigurable Architecture for Restricted Boltzmann Machines Ly, Daniel Le 15 February 2010 (has links) Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications have been limited. A primary cause of this lack of adoption is due to the fact that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can take advantage of the inherent parallelism in neural networks is desired. This thesis investigates how the Restricted Boltzmann machine, a popular type of neural network, can be effectively mapped to a high-performance hardware architecture on FPGA platforms. The proposed, modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. The framework is tested on a platform of four Xilinx Virtex II-Pro XC2VP70 FPGAs running at 100MHz through a variety of different configurations. The maximum performance was obtained by instantiating a Restricted Boltzmann Machine of 256x256 nodes distributed across four FPGAs, which results in a computational speed of 3.13 billion connection-updates-per-second and a speed-up of 145-fold over an optimized C program running on a 2.8GHz Intel processor. Reconfigurable Architecture Neural Networks FPGAs High Performance Computing 0984 0800
55	A High-performance, Reconfigurable Architecture for Restricted Boltzmann Machines Ly, Daniel Le 15 February 2010 (has links) Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications have been limited. A primary cause of this lack of adoption is due to the fact that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can take advantage of the inherent parallelism in neural networks is desired. This thesis investigates how the Restricted Boltzmann machine, a popular type of neural network, can be effectively mapped to a high-performance hardware architecture on FPGA platforms. The proposed, modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. The framework is tested on a platform of four Xilinx Virtex II-Pro XC2VP70 FPGAs running at 100MHz through a variety of different configurations. The maximum performance was obtained by instantiating a Restricted Boltzmann Machine of 256x256 nodes distributed across four FPGAs, which results in a computational speed of 3.13 billion connection-updates-per-second and a speed-up of 145-fold over an optimized C program running on a 2.8GHz Intel processor. Reconfigurable Architecture Neural Networks FPGAs High Performance Computing 0984 0800
56	GPU Implementation of a Novel Approach to Cramer’s Algorithm for Solving Large Scale Linear Systems West, Rosanne Lane 01 May 2010 (has links) Scientific computing often requires solving systems of linear equations. Most software pack- ages for solving large-scale linear systems use Gaussian elimination methods such as LU- decomposition. An alternative method, recently introduced by K. Habgood and I. Arel, involves an application of Cramer’s Rule and Chio’s condensation to achieve a better per- forming system for solving linear systems on parallel computing platforms. This thesis describes an implementation of this algorithm on an nVidia graphics processor card us- ing the CUDA language. Increased performance, relative to the serial implementation, is demonstrated, paving the way for future parallel realizations of the scheme. high performance computing GPU CUDA Other Computer Engineering
57	Acceleration of streaming applications on FPGAs from high level constructs Mitra, Abhishek. January 2008 (has links) Thesis (Ph. D.)--University of California, Riverside, 2008. / Includes abstract. Title from first page of PDF file (viewed March 8, 2010). Available via ProQuest Digital Dissertations. Includes bibliographical references (p. 150-168). Also issued in print.
58	Performance analysis and evaluation of dynamic loop scheduling techniques in a competitive runtime environment for distributed memory architectures Balasubramaniam, Mahadevan. January 2003 (has links) Thesis (M.S.)--Mississippi State University. Department of Computer Science. / Title from title screen. Includes bibliographical references.
59	Performance studies of high-speed communication on commodity cluster / Tam, Tat-chun, Anthony. January 2001 (has links) Thesis (Ph. D.)--University of Hong Kong, 2002. / Includes bibliographical references (leaves 147-156).
60	Computational process networks : a model and framework for high-throughput signal processing Allen, Gregory Eugene 16 June 2011 (has links) Many signal and image processing systems for high-throughput, high-performance applications require concurrent implementations in order to realize desired performance. Developing software for concurrent systems is widely acknowledged to be difficult, with common industry practice leaving the burden of preventing concurrency problems on the programmer. The Kahn Process Network model provides the mathematically provable property of determinism of a program result regardless of the execution order of its processes, including concurrent execution. This model is also natural for describing streams of data samples in a signal processing system, where processes transform streams from one data type to another. However, a Kahn Process Network may require infinite memory to execute. I present the dynamic distributed deadlock detection and resolution (D4R) algorithm, which permits execution of Process Networks in bounded memory if it is possible. It detects local deadlocks in a Process Network, determines whether the deadlock can be resolved and, if so, identifies the process that must take action to resolve the deadlock. I propose the Computational Process Network (CPN) model which is based on the formalisms of Kahn’s PN model, but with enhancements that are designed to make it efficiently implementable. These enhancements include multi-token transactions to reduce execution overhead, multi-channel queues for multi-dimensional synchronous data, zero-copy semantics, and consumer and producer firing thresholds for queues. Firing thresholds enable memoryless computation of sliding window algorithms, which are common in signal processing systems. I show that the Computational Process Network model preserves the formal properties of Process Networks, while reducing the operations required to implement sliding window algorithms on continuous streams of data. I also present a high-throughput software framework that implements the Computational Process Network model using C++, and which maps naturally onto distributed targets. This framework uses POSIX threads, and can exploit parallelism in both multi-core and distributed systems. Finally, I present case studies to exercise this framework and demonstrate its performance and utility. The final case study is a three-dimensional circular convolution sonar beamformer and replica correlator, which demonstrates the high throughput and scalability of a real-time signal processing algorithm using the CPN model and framework. / text Concurrency Distributed systems High-performance computing Signal processing Deadlock

Search results