  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
81

Improving energy efficiency of reliable massively-parallel architectures

Krimer, Evgeni 12 July 2012
While transistor size continues to shrink with every technology generation, increasing the number of transistors on a die, the reduction in energy consumption is less significant. Furthermore, newer technologies induce fabrication challenges resulting in uncertainties in transistor and wire properties. Therefore, to ensure correctness, design margins are introduced, resulting in significantly sub-optimal energy efficiency. While increasing parallelism and the use of gating methods contribute to energy consumption reduction, ultimately, more radical changes to the architecture and better integration of architectural and circuit techniques will be necessary. This dissertation explores one such approach, combining a highly-efficient massively-parallel processor architecture with a design methodology that reduces energy by trimming design margins. Using a massively-parallel GPU-like (graphics processing unit) baseline architecture, we discuss the different components of process variation and design microarchitectural approaches supporting efficient margins reduction. We evaluate our design using a cycle-based GPU simulator, describe the conditions where efficiency improvements can be obtained, and explore the benefits of decoupling across a wide range of parameters. We architect a test-chip that was fabricated and show these mechanisms to work. We also discuss why previously developed related approaches fall short when process variation is very large, such as in low-voltage operation or as expected for future VLSI technology. We therefore develop and evaluate a new approach specifically for high-variation scenarios. To summarize, in this work, we address the emerging challenges of modern massively parallel architectures including energy-efficient, reliable operation and high process variation.
We believe that the results of this work are essential for breaking through the energy wall, continuing to improve the efficiency of future generations of the massively parallel architectures. / text
82

Core-characteristic-aware off-chip memory management in a multicore system-on-chip

Jeong, Min Kyu 30 January 2013
Future processors will integrate an increasing number of cores because the scaling of single-thread performance is limited and because smaller cores are more power efficient. Off-chip memory bandwidth that is shared between those many cores, however, scales slower than the transistor (and core) count does. As a result, in many future systems, off-chip bandwidth will become the bottleneck of heavy demand from multiple cores. Therefore, optimally managing the limited off-chip bandwidth is critical to achieving high performance and efficiency in future systems. In this dissertation, I will develop techniques to optimize the shared use of limited off-chip memory bandwidth in chip-multiprocessors. I focus on issues that arise from the sharing and exploit the differences in memory access characteristics, such as locality, bandwidth requirement, and latency sensitivity, between the applications running in parallel and competing for the bandwidth. First, I investigate how the shared use of memory by many cores can result in reduced spatial locality in memory accesses. I propose a technique that partitions the internal memory banks between cores in order to isolate their access streams and eliminate locality interference. The technique compensates for the reduced bank-level parallelism of each thread by employing memory sub-ranking to effectively increase the number of independent banks. For three different workload groups that consist of benchmarks with high spatial locality, low spatial locality, and mixes of the two, the average system efficiency improves by 10%, 7%, 9% for 2-rank systems, and 18%, 25%, 20% for 1-rank systems, respectively, over the baseline shared-bank system. Next, I improve the performance of a heterogeneous system-on-chip (SoC) in which cores have distinct memory access characteristics. I develop a deadline-aware shared memory bandwidth management scheme for SoCs that have both CPU and GPU cores. 
I show that statically prioritizing the CPU can severely constrict GPU performance, and propose to dynamically adapt the priority of CPU and GPU memory requests based on the progress of GPU workload. The proposed dynamic bandwidth management scheme provides the target GPU performance while prioritizing CPU performance as much as possible, for any CPU-GPU workload combination with different complexities. / text
83

Improving the throughput of novel cluster computing systems

Wu, Jiadong 21 September 2015
Traditional cluster computing systems such as supercomputers are equipped with specially designed high-performance hardware, which escalates the manufacturing cost and the energy cost of those systems. Due to such drawbacks and the diversified demand in computation, two new types of clusters have been developed: the GPU clusters and the Hadoop clusters. The GPU cluster combines a traditional CPU-only computing cluster with general purpose GPUs to accelerate the applications. Thanks to the massively-parallel architecture of the GPU, this type of system can deliver much higher performance-per-watt than the traditional computing clusters. The Hadoop cluster is another popular type of cluster computing system. It uses inexpensive off-the-shelf components and standard Ethernet to minimize manufacturing cost. The Hadoop systems are widely used throughout the industry. Along with the lowered cost, these new systems also bring their unique challenges. According to our study, the GPU clusters are prone to severe under-utilization due to the heterogeneous nature of their computation resources, and the Hadoop clusters are vulnerable to network congestion due to their limited network resources. In this research, we aim to improve the throughput of these novel cluster computing systems by increasing the workload parallelism and network I/O parallelism.
84

Multilevel multidimensional scaling on the GPU

Ingram, Stephen F. 05 1900
We present Glimmer, a new multilevel visualization algorithm for multidimensional scaling designed to exploit modern graphics processing unit (GPU) hardware. We also present GPU-SF, a parallel, force-based subsystem used by Glimmer. Glimmer organizes input into a hierarchy of levels and recursively applies GPU-SF to combine and refine the levels. The multilevel nature of the algorithm helps avoid local minima while the GPU parallelism improves speed of computation. We propose a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function. We demonstrate the benefits of Glimmer in terms of speed, normalized stress, and visual quality against several previous algorithms for a range of synthetic and real benchmark datasets. We show that the performance of Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm. We also propose a novel texture paging strategy called distance paging for working with precomputed distance matrices too large to fit in texture memory.
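As a rough illustration of the normalized stress metric mentioned in this abstract, the sketch below uses one commonly cited form: the sum of squared differences between the input dissimilarities and the layout distances, divided by the sum of squared input dissimilarities. The exact definition used in the thesis may differ, and the function names `pairwise_distances` and `normalized_stress` are hypothetical, chosen only for this example.

```python
import math


def pairwise_distances(points):
    """Return the full Euclidean distance matrix for a list of coordinate tuples."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def normalized_stress(target, layout):
    """Normalized stress of a layout against a target dissimilarity matrix.

    Assumed form (not taken from the thesis):
        sum_{i<j} (target_ij - layout_ij)^2 / sum_{i<j} target_ij^2
    A value of 0 means the layout reproduces every target distance exactly.
    """
    embedded = pairwise_distances(layout)
    numerator = 0.0
    denominator = 0.0
    n = len(target)
    for i in range(n):
        for j in range(i + 1, n):
            numerator += (target[i][j] - embedded[i][j]) ** 2
            denominator += target[i][j] ** 2
    return numerator / denominator


# Three collinear points; a layout identical to the input has zero stress,
# while a layout that collapses all points onto one spot has stress 1.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
target = pairwise_distances(points)
perfect = normalized_stress(target, points)
collapsed = normalized_stress(target, [(0.0, 0.0)] * 3)
```

A force-based solver could stop iterating once this value stops improving; the abstract describes a filtered approximation of the metric for that purpose, which this sketch does not attempt to reproduce.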
85

Dynamic warp formation : exploiting thread scheduling for efficient MIMD control flow on SIMD graphics hardware

Fung, Wilson Wai Lun 11 1900
Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high performance with minimal overhead for control hardware. Scalar threads running the same computing kernel are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set architecture and programs using these instructions may experience reduced performance due to the way branch execution is supported by hardware. One solution is to add a stack to allow different SIMD processing elements to execute distinct program paths after a branch instruction. The occurrence of diverging branch outcomes for different processing elements significantly degrades performance using this approach. In this thesis, we propose dynamic warp formation and scheduling, a mechanism for more efficient SIMD branch execution on GPUs. It dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes. We show that a realistic hardware implementation of this mechanism improves performance by an average of 47% for an estimated area increase of 8%.
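The effect of branch divergence described in this abstract can be illustrated with a toy model. The sketch below is a simplification under stated assumptions, not the hardware mechanism from the thesis: each thread is represented only by the program path it takes after a branch, a warp with mixed paths is assumed to issue once per distinct path, and regrouping is assumed to be free of lane or register-file constraints. The warp size of 4 and the function names are hypothetical.

```python
from collections import defaultdict

WARP_SIZE = 4  # hypothetical, kept tiny so the example is easy to follow


def issue_slots_static(warps):
    """Issue slots when each fixed warp serializes over its distinct paths."""
    return sum(len(set(paths)) for paths in warps)


def issue_slots_regrouped(warps, warp_size=WARP_SIZE):
    """Issue slots when threads on the same path are pooled across warps
    and repacked into new warps of at most warp_size threads."""
    threads_per_path = defaultdict(int)
    for paths in warps:
        for path in paths:
            threads_per_path[path] += 1
    # Ceiling division: a partially filled warp still costs one issue slot.
    return sum(-(-count // warp_size) for count in threads_per_path.values())


# Three warps whose threads diverge between path "A" and path "B".
warps = [
    ["A", "A", "B", "B"],
    ["A", "B", "B", "B"],
    ["A", "A", "A", "B"],
]
static_slots = issue_slots_static(warps)        # every warp runs both paths
regrouped_slots = issue_slots_regrouped(warps)  # 6 "A" threads and 6 "B" threads repacked
```

In this toy case the fixed grouping needs 6 issue slots and the regrouped one needs 4. Real hardware faces additional constraints that this model ignores, so the numbers say nothing about the 47% figure reported in the abstract.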
86

Lattice Boltzmann Method for Simulating Turbulent Flows

Koda, Yusuke January 2013
The lattice Boltzmann method (LBM) is a relatively new method for fluid flow simulations, and has recently been gaining popularity due to its simple algorithm and parallel scalability. Although the method has been successfully applied to a wide range of flow physics, its capabilities in simulating turbulent flow are still under-validated. Hence, in this project, a 3D LBM program was developed to investigate the validity of the LBM for turbulent flow simulations through large eddy simulations (LES). In achieving this goal, the 3D LBM code was first applied to compute the laminar flow over two tandem cylinders. After validating against literature data, the program was used to study the aerodynamic effects of the early 3D flow structures by comparing 2D and 3D simulations. It was found that the span-wise instabilities have a profound impact on the lift and drag forces, as well as on the vortex shedding frequency. The LBM code was then modified to allow for a massively parallel execution using graphics processing units (GPUs). The GPU-enabled program was used to study a benchmark test case involving the flow over a square cylinder in a square channel, to validate its accuracy, as well as measure its performance gains compared to a typical serial implementation. The flow results showed good agreement with literature, and speedups of over 150 times were observed when two GPUs were used in parallel. Turbulent flow simulations were then conducted using LES with the Smagorinsky subgrid model. The methodology was first validated by computing the fully developed turbulent channel flow, and comparing the results against direct numerical simulation results. The results were in good agreement despite the relatively coarse grid. The code was then used to simulate the turbulent flow over a square cylinder confined in a channel.
In order to emulate a realistic inflow at the channel inlet, an auxiliary simulation consisting of a fully developed turbulent channel flow was run in conjunction, and its velocity profile was used to enforce the inlet boundary condition for the cylinder flow simulation. Comparison of the results with experimental and numerical results revealed that the presence of the turbulent flow structures at the inlet can significantly influence the resulting flow field around the cylinder.
87

Biomechanically Constrained Groupwise Statistical Shape Model to Ultrasound Registration of the Lumbar Spine

Khallaghi, Siavash 28 September 2010
Spinal needle injections for back pain management are frequently carried out in hospitals and radiological clinics. Currently, these procedures are performed under fluoroscopy or CT guidance in specialized interventional radiology facilities. As an alternative, the use of inexpensive ultrasound image guidance promises to improve the efficacy and safety of these procedures. We propose to eliminate or reduce the need for ionizing radiation by creating and registering a statistical shape model of the lumbar vertebrae to 3D ultrasound volumes of the patient, using a groupwise registration algorithm. From a total of 35 patient CT volumes, a statistical shape model of the L2, L3 and L4 vertebrae is built, including the mean shape and principal modes of variation. The statistical shape model is registered to the 3D ultrasound by interchangeably optimizing the model parameters and their relative poses. We also use a biomechanical model to constrain the relative motion of the models throughout the registration process. Validation is performed on three tissue-mimicking phantoms designed to preserve realistic curvature of the spine. We compare pairwise and groupwise registration of the statistical shape model of the spine and demonstrate that a clinically acceptable mean target registration error of 2.4 mm can be achieved with the proposed method. Registration results also show that the groupwise registration outperforms the pairwise in terms of success rate. / Thesis (Master, Electrical & Computer Engineering) -- Queen's University, 2010-09-27 20:08:01.828
88

Multi-population PSO-GA hybrid techniques: integration, topologies, and parallel composition

Franz, Wayne January 2014
Recent work in metaheuristic algorithms has shown that solution quality may be improved by composing algorithms with orthogonal characteristics. In this thesis, I study multi-population particle swarm optimization (MPSO) and genetic algorithm (GA) hybrid strategies. I begin by investigating the behaviour of MPSO with crossover, mutation, swapping, and all three, and show that the latter is able to solve the most difficult benchmark functions. Because GAs converge slowly and MPSO provides a large degree of parallelism, I also develop several parallel hybrid algorithms. A composite approach executes PSO and GAs simultaneously in different swarms, and shows advantages when arranged in a star topology, particularly with a central GA. A static scheme executes in series, with a GA performing the exploration followed by MPSO for exploitation. Finally, the last approach dynamically alternates between algorithms. Hybrid algorithms are well-suited for parallelization, but exhibit tradeoffs between performance and solution quality.
89

An Implementation of the Discontinuous Galerkin Method on Graphics Processing Units

Fuhry, Martin 10 April 2013
Computing highly-accurate approximate solutions to partial differential equations (PDEs) requires both a robust numerical method and a powerful machine. We present a parallel implementation of the discontinuous Galerkin (DG) method on graphics processing units (GPUs). In addition to being flexible and highly accurate, DG methods accommodate parallel architectures well, as their discontinuous nature produces entirely element-local approximations. While GPUs were originally intended to compute and display computer graphics, they have recently become a popular general purpose computing device. These cheap and extremely powerful devices have a massively parallel structure. With the recent addition of double precision floating point number support, GPUs have matured as serious platforms for parallel scientific computing. In this thesis, we present an implementation of the DG method applied to systems of hyperbolic conservation laws in two dimensions on a GPU using NVIDIA’s Compute Unified Device Architecture (CUDA). Numerous computed examples from linear advection to the Euler equations demonstrate the modularity and usefulness of our implementation. Benchmarking our method against a single core, serial implementation of the DG method reveals a speedup of a factor of over fifty times using a USD $500.00 NVIDIA GTX 580.
