  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Colorization in Gabor space and realistic surface rendering on GPUs. / 基於Gabor特徵空間的染色技術與真實感表面GPU繪製 / CUHK electronic theses & dissertations collection / Ji yu Gabor te zheng kong jian de ran se ji shu yu zhen shi gan biao mian GPU hui zhi

January 2011 (has links)
Based on the construction of a Gabor feature space, which is central to pixel-similarity computations, we formalize the space using rotation-invariant Gabor filter banks and apply optimizations in the texture feature space. In image colorization, pixels with similar Gabor features receive similar colors, so our approach can colorize natural images globally, without being restricted to disjoint regions with similar texture-like appearances. Our approach supports a two-pass colorization process: coloring optimization in Gabor space and color detailing for progressive effects. We further address video colorization using optimized Gabor flow computation, including coloring keyframes, color propagation by Gabor filtering, and optimized parallel computing over the video. Our video colorization works in a spatiotemporal manner to preserve temporal coherence, and provides simple closed-form solutions to the energy optimization that yield fast colorizations. Moreover, we develop parallel surface texturing of geometric models on the GPU, generating spatially-varying visual appearances. We incorporate the Gabor feature space into the search for 2D exemplars to determine the k-coherence candidate pixels. The multi-pass correction in synthesis is applied to local neighborhoods for parallel processing. The iso/aniso-scale texture synthesis leverages the strengths of GPU computing to synthesize iso/aniso-scale texturing appearances in parallel over arbitrary surfaces. Our experimental results show that our approach produces simply controllable texturing effects in surface synthesis, generating texture-similar and spatially-varying visual appearances with GPU-accelerated performance. / Texture feature similarity has long been a crucial topic in VR/graphics applications such as image and video colorization, surface texture synthesis, and geometry image applications. Generally, image features are highly subjective, depending not only on the image pixels but also on interactive users. Existing colorization and surface texture synthesis methods pay little attention to generating conforming colors/textures that accurately reflect exemplar structures or the user's intention. Realistic surface synthesis remains a challenging task in VR/graphics research. In this dissertation, we focus on encoding Gabor filter banks into texture feature similarity computations and faithful GPU-parallel surface rendering, including image/video colorization, parallel texturing of geometric surfaces, and multiresolution rendering on sole-cube maps (SCMs). / We further explore GPU-based multiresolution rendering on sole-cube maps (SCMs). Our SCMs on the GPU generate adaptive mesh surfaces dynamically, and are fully parallelized for large-scale and complex VR environments. We also encapsulate differential coordinates in SCMs, reflecting local geometric characteristics for geometric modeling and interactive animation applications. In future work, we will improve the image/video feature analysis framework in VR/graphics applications. Further work on surface texture synthesis includes interactive control of texture orientations through surface vector fields edited by sketching, so as to widen the gamut of interactive tools available for texturing artists and end users. / Sheng, Bin. / Adviser: Hanqin Sun. / Source: Dissertation Abstracts International, Volume: 73-04, Section: B, page: .
/ Thesis (Ph.D.)--Chinese University of Hong Kong, 2011. / Includes bibliographical references (leaves 128-142). / Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [201-] System requirements: Adobe Acrobat Reader. Available via World Wide Web. / Abstract also in Chinese.
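To illustrate the Gabor-feature similarity described in the abstract above, here is a minimal Python sketch (an illustration of the general idea only, not the dissertation's code; the function names, filter parameters, and similarity kernel are assumptions) that builds rotation-invariant Gabor features for a grayscale image and turns them into a pixel-similarity weight of the kind an optimization-based colorizer can use.

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(freq, theta, sigma=2.0, size=9):
        """Real part of a 2D Gabor kernel at spatial frequency `freq` and orientation `theta`."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)               # rotate coordinates
        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))     # Gaussian envelope
        return envelope * np.cos(2.0 * np.pi * freq * xr)        # modulated carrier

    def gabor_features(gray, freqs=(0.1, 0.2, 0.4), n_orient=6):
        """Per-pixel feature vector; max over orientations gives rotation invariance."""
        feats = []
        for f in freqs:
            responses = [np.abs(convolve2d(gray, gabor_kernel(f, t), mode="same"))
                         for t in np.linspace(0, np.pi, n_orient, endpoint=False)]
            feats.append(np.max(responses, axis=0))   # rotation-invariant per scale
        return np.stack(feats, axis=-1)               # H x W x len(freqs)

    def similarity(features, p, q, sigma_f=0.5):
        """Affinity between two pixels: similar Gabor features -> weight near 1."""
        d2 = np.sum((features[p] - features[q]) ** 2)
        return float(np.exp(-d2 / (2.0 * sigma_f**2)))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        gray = rng.random((64, 64))                   # stand-in for a luminance channel
        F = gabor_features(gray)
        print("feature shape:", F.shape)
        print("self-similarity:", similarity(F, (10, 10), (10, 10)))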
22

Acceleration of Transient Stability Simulation for Large-Scale Power Systems on Parallel and Distributed Hardware

Jalili-Marandi, Vahid 11 1900 (has links)
Transient stability analysis is necessary for the planning, operation, and control of power systems. However, its mathematical modeling and time-domain solution are computationally onerous and have attracted the attention of power systems experts and simulation specialists for decades. The ultimate goal has always been to perform this simulation as fast as real time for realistic-sized systems. In this thesis, methods to speed up transient stability simulation for large-scale power systems are investigated. The research reported in this thesis can be divided into two parts. First, real-time simulation on a general-purpose simulator composed of CPU-based computational nodes is considered. A novel approach called Instantaneous Relaxation (IR) is proposed for real-time transient stability simulation on such a simulator. The motivation for this technique comes from the inherent parallelism in the transient stability problem, which allows a coarse-grained decomposition of the resulting system equations. Comparison of the real-time results with off-line results shows both the accuracy and the efficiency of the proposed method. In the second part of this thesis, Graphics Processing Units (GPUs) are used for the first time for the transient stability simulation of power systems. Data-parallel programming techniques are used on the single-instruction multiple-data (SIMD) architecture of the GPU to implement the transient stability simulations. Several test cases of varying sizes are used to investigate the GPU-based simulation. The simulation results reveal the clear advantage of using GPUs instead of CPUs for large-scale problems. Continuing the second part of this thesis, the application of multiple GPUs running in parallel is investigated. Two different parallel-processing-based techniques are implemented: the IR method and an incomplete-LU-factorization-based approach. Practical information is provided on how to use multi-threaded programming to manage multiple GPUs running simultaneously for the implementation of the transient stability simulation. The implementation of the IR method on multiple GPUs combines data parallelism and program-level parallelism, which makes possible the simulation of very large-scale systems with 7020 buses and 1800 synchronous generators. / Energy Systems
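The relaxation-style decomposition behind this kind of approach can be sketched in a few lines. The Python below is a simplified illustration in the spirit of the idea described above, not the thesis's power-system formulation: a linear test system x' = Ax is split into blocks, each block is integrated with backward Euler using only its own equations, and coupling terms use the other blocks' values from the previous relaxation sweep, so every block could be solved concurrently. All matrices, block choices, and tolerances are assumptions.

    import numpy as np

    def relaxed_step(A, x, h, blocks, tol=1e-10, max_iter=50):
        """One backward-Euler step of x' = A x, solved block-by-block (Gauss-Jacobi style)."""
        x_new = x.copy()
        for _ in range(max_iter):
            x_prev = x_new.copy()
            for blk in blocks:
                others = [i for i in range(len(x)) if i not in blk]
                A_bb = A[np.ix_(blk, blk)]
                A_bo = A[np.ix_(blk, others)]
                # this block solves (I - h*A_bb) x_b = x_old_b + h*A_bo x_others
                rhs = x[blk] + h * A_bo @ x_prev[others]
                x_new[blk] = np.linalg.solve(np.eye(len(blk)) - h * A_bb, rhs)
            if np.max(np.abs(x_new - x_prev)) < tol:
                break
        return x_new

    if __name__ == "__main__":
        # Weakly coupled 4-state test system, split into two 2-state subsystems.
        A = np.array([[-1.0, 0.2, 0.05, 0.0],
                      [0.1, -1.5, 0.0, 0.05],
                      [0.05, 0.0, -2.0, 0.3],
                      [0.0, 0.05, 0.1, -1.0]])
        x = np.ones(4)
        blocks = [[0, 1], [2, 3]]
        for _ in range(100):                      # 100 steps of h = 0.01
            x = relaxed_step(A, x, 0.01, blocks)
        exact = np.ones(4)
        for _ in range(100):                      # monolithic backward Euler for comparison
            exact = np.linalg.solve(np.eye(4) - 0.01 * A, exact)
        print("relaxed :", np.round(x, 6))
        print("monolith:", np.round(exact, 6))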
23

Modeling performance and power for energy-efficient GPGPU computing

Hong, Sunpyo 12 November 2012 (has links)
The objective of the proposed research is to develop an analytical model that predicts performance and power for many-core architectures and, further, to propose a mechanism that leverages the analytical model to enable energy-efficient execution of an application. The key insight of the model is to investigate and quantify the complex relationship between thread-level parallelism and memory-level parallelism for an application on a given many-core architecture. Two metrics are proposed: memory warp parallelism (MWP), which refers to the number of overlapping memory accesses per core, and computation warp parallelism (CWP), which characterizes an application type. By using these metrics in addition to the architectural and application parameters, the overall application performance is produced. The model uses statically available parameters such as instruction-mix information and input-data size, and the prediction error is 13.3% for the GPU-computing benchmarks. Another important aspect of using a many-core architecture is reducing peak power and achieving energy savings. By using the proposed integrated power and performance (IPP) framework, the results showed that different optimization points exist for the GPU architecture depending on the application type. The work shows that by activating fewer cores, 10.99% of run-time energy consumption can be saved for the bandwidth-limited benchmarks, and a projected 25.8% energy savings is predicted when power gating at the core level is employed. Finally, the model is extended to throughput prediction using OpenCL, targeting a wider variety of processors. First, multiple performance-related outputs are predicted, including upper-bound and lower-bound values. Second, by using the model parameters, an application can be placed into one of several categories, each with its own suggestions for improving performance and energy efficiency. Third, the accuracy of the bandwidth saturation point is significantly improved by considering independent memory accesses and updating the performance model. Furthermore, a trade-off analysis using architectural and application parameters is straightforward, which provides more insight into improving energy efficiency. In the future, a computer system will contain hundreds of heterogeneous cores, so it is essential that a workload be scheduled to an efficient core or distributed across both types of cores. Preliminary work using the analytical model to schedule between CPU and GPU is demonstrated in the appendix. Since a profiling phase is not required, the kernel code can be transformed to run more efficiently on the specific architecture. Another extension of the work, on the relationship between speed-up and energy efficiency, is derived mathematically. Finally, future research ideas are presented regarding the use of the model by programmers, compilers, and runtimes for future heterogeneous systems.
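The MWP/CWP idea can be made concrete with a rough sketch. The Python below is a deliberate simplification of the kind of model the abstract describes, not the thesis's exact equations; the parameter names (mem_latency, departure_delay, active_warps) and the two regimes are assumptions chosen to illustrate how the two metrics interact.

    def predict_cycles(comp_cycles, mem_cycles, mem_latency, departure_delay, active_warps):
        """Estimate execution cycles for one batch of warps on one core.
        comp_cycles / mem_cycles: computation and memory cycles of a single warp."""
        # Memory warp parallelism: how many warps can overlap their memory accesses,
        # limited by latency/issue spacing and by the number of active warps.
        mwp = min(mem_latency / departure_delay, active_warps)
        # Computation warp parallelism: how many warps' computation fits under one
        # memory waiting period; characterizes the application type.
        cwp = min((mem_cycles + comp_cycles) / comp_cycles, active_warps)
        if mwp >= cwp:
            # enough memory-level parallelism: memory latency is largely hidden
            total = mem_cycles + comp_cycles * active_warps
        else:
            # memory-bound: memory waiting periods only partially overlap
            total = mem_cycles * (active_warps / mwp) + comp_cycles
        return total

    if __name__ == "__main__":
        # hypothetical kernel: 20 compute cycles and 200 memory cycles per warp
        print(predict_cycles(comp_cycles=20, mem_cycles=200, mem_latency=400,
                             departure_delay=40, active_warps=24))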
24

Hardware Acceleration of Electronic Design Automation Algorithms

Gulati, Kanupriya December 2009 (has links)
With the advances in very large scale integration (VLSI) technology, hardware is going parallel. Software, which was traditionally designed to execute on single-core microprocessors, now faces the tough challenge of taking advantage of this parallelism made available by the scaling of hardware. The work presented in this dissertation studies the acceleration of electronic design automation (EDA) software on several hardware platforms such as custom integrated circuits (ICs), field programmable gate arrays (FPGAs), and graphics processors. This dissertation concentrates on a subset of EDA algorithms which are heavily used in the VLSI design flow and also have varying degrees of inherent parallelism. In particular, Boolean satisfiability, Monte Carlo based statistical static timing analysis, circuit simulation, fault simulation, and fault table generation are explored. The architectural and performance tradeoffs of implementing the above applications on these alternative platforms (in comparison to their implementation on a single-core microprocessor) are studied. In addition, this dissertation presents an automated approach to accelerating uniprocessor code using a graphics processing unit (GPU). The key idea is to partition the software application into kernels in an automated fashion, such that multiple instances of these kernels, when executed in parallel on the GPU, can maximally benefit from the GPU's hardware resources. The work presented in this dissertation demonstrates that several EDA algorithms can be successfully rearchitected to maximally harness their performance on alternative platforms such as custom-designed ICs, FPGAs, and graphics processors, obtaining speedups of up to 800X. The approaches in this dissertation collectively aim to enable the computer-aided design (CAD) community to accelerate EDA algorithms on arbitrary hardware platforms.
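One of the EDA kernels named above, Monte Carlo based statistical static timing analysis, shows why such algorithms map well onto data-parallel hardware: every sample is independent. The Python sketch below is illustrative only (the tiny netlist, delay figures, and sample count are made up); it processes all samples as one vectorized batch, mirroring a one-thread-per-sample GPU formulation.

    import numpy as np

    rng = np.random.default_rng(42)
    N = 100_000                                  # Monte Carlo samples ("threads")

    # netlist: gate -> (list of fanin gates, nominal delay, delay std-dev)
    netlist = {
        "g1": ([], 1.0, 0.10),                   # primary-input gates
        "g2": ([], 1.2, 0.15),
        "g3": (["g1", "g2"], 0.8, 0.08),
        "g4": (["g2"], 0.5, 0.05),
        "out": (["g3", "g4"], 0.6, 0.06),
    }

    arrival = {}
    for gate, (fanin, mean, std) in netlist.items():     # already in topological order
        delay = rng.normal(mean, std, size=N)             # one sampled delay per trial
        if fanin:
            # arrival time = max over fanin arrivals + this gate's sampled delay
            arrival[gate] = np.maximum.reduce([arrival[f] for f in fanin]) + delay
        else:
            arrival[gate] = delay

    samples = arrival["out"]
    print(f"mean delay {samples.mean():.3f}, 99th percentile {np.percentile(samples, 99):.3f}")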
25

An enhanced GPU architecture for not-so-regular parallelism with special implications for database search

Narasiman, Veynu Tupil 27 June 2014 (has links)
Graphics Processing Units (GPUs) have become a popular platform for executing general purpose (i.e., non-graphics) applications. To run efficiently on a GPU, applications must be parallelized into many threads, each of which performs the same task but operates on different data (i.e., data parallelism). Previous work has shown that some applications experience significant speedup when executed on a GPU instead of a CPU. The applications that benefit most tend to have certain characteristics such as high computational intensity, regular control-flow and memory access patterns, and little to no communication among threads. However, not all parallel applications have these characteristics. Applications with a more balanced compute to memory ratio, divergent control flow, irregular memory accesses, and/or frequent communication (i.e., not-so-regular applications) will not take full advantage of the GPU's resources, resulting in performance far short of what could be delivered. The goal of this dissertation is to enhance the GPU architecture to better handle not-so-regular parallelism. This is accomplished in two parts. First, I analyze a diverse set of data parallel applications that suffer from divergent control-flow and/or significant stall time due to memory. I propose two microarchitectural enhancements to the GPU called the Large Warp Microarchitecture and Two-Level Warp Scheduling to address these problems respectively. When combined, these mechanisms increase performance by 19% on average. Second, I examine one of the most important and fundamental applications in computing: database search. Database search is an excellent example of an application that is rich in parallelism, but rife with not-so-regular characteristics. I propose enhancements to the GPU architecture including new instructions that improve intra-warp thread communication and decision making, and also a row-buffer locality hint bit to better handle the irregular memory access patterns of index-based tree search. These proposals improve performance by 21% for full table scans, and 39% for index-based search. The result of this dissertation is an enhanced GPU architecture that better handles not-so-regular parallelism. This increases the scope of applications that run efficiently on the GPU, making it a more viable platform not only for current parallel workloads such as databases, but also for future and emerging parallel applications. / text
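The two-level warp scheduling mentioned above can be pictured with a toy selection routine. The Python below is my own simplification for illustration, not the dissertation's microarchitecture: warps are split into fetch groups, the scheduler round-robins within the currently active group so its warps hit long-latency memory operations at about the same time, and it falls back to another group, which still has computation to overlap with that latency.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Warp:
        wid: int
        stalled: bool = False                    # waiting on a memory request

    @dataclass
    class TwoLevelScheduler:
        groups: List[List[Warp]]
        active: int = 0                          # currently preferred fetch group
        rr: List[int] = field(default_factory=list)   # round-robin pointer per group

        def __post_init__(self):
            self.rr = [0] * len(self.groups)

        def pick(self) -> Optional[Warp]:
            """Return the next warp to issue, preferring the active fetch group."""
            order = [(self.active + i) % len(self.groups) for i in range(len(self.groups))]
            for g in order:
                warps = self.groups[g]
                for step in range(len(warps)):
                    w = warps[(self.rr[g] + step) % len(warps)]
                    if not w.stalled:
                        self.rr[g] = (self.rr[g] + step + 1) % len(warps)
                        self.active = g          # stay on this group while it has ready warps
                        return w
            return None                          # every warp is stalled on memory

    if __name__ == "__main__":
        groups = [[Warp(0), Warp(1)], [Warp(2), Warp(3)]]
        sched = TwoLevelScheduler(groups)
        print(sched.pick().wid)                  # 0
        groups[0][0].stalled = groups[0][1].stalled = True
        print(sched.pick().wid)                  # group 0 fully stalled -> warp 2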
26

Acceleration of Transient Stability Simulation for Large-Scale Power Systems on Parallel and Distributed Hardware

Jalili-Marandi, Vahid Unknown Date
No description available.
27

Power-constrained performance optimization of GPU graph traversal

McLaughlin, Adam Thomas 13 January 2014 (has links)
Graph traversal represents an important class of graph algorithms that is the nucleus of many large-scale graph analytics applications. While improving the performance of such algorithms using GPUs has received attention, understanding and managing performance under power constraints has not yet received similar attention. This thesis first explores the power and performance characteristics of breadth-first search (BFS) via measurements on a commodity GPU. We utilize this analysis to address the problem of minimizing execution time below a predefined power limit, or power cap, exposing key relationships between graph properties and power consumption. We modify the firmware on a commodity GPU to measure power usage and use the GPU as an experimental system to evaluate future architectural enhancements for the optimization of graph algorithms. Specifically, we propose and evaluate power management algorithms that scale i) the GPU frequency or ii) the number of active GPU compute units for a diverse set of real-world and synthetic graphs. Compared to scaling either frequency or compute units individually, our proposed schemes reduce execution time by an average of 18.64% by adjusting the configuration based on inter- and intra-graph characteristics.
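The core optimization problem described above reduces to choosing an operating configuration that is fastest while staying under the cap. The Python sketch below shows that selection step only; the table of (frequency, compute units, predicted power, predicted time) entries is invented for illustration, and a real system would populate it from measurements or a model such as the one studied in the thesis.

    from typing import List, NamedTuple, Optional

    class Config(NamedTuple):
        freq_mhz: int
        compute_units: int
        power_w: float     # predicted average power for the BFS run
        time_ms: float     # predicted execution time

    def best_under_cap(configs: List[Config], power_cap_w: float) -> Optional[Config]:
        """Fastest configuration whose predicted power respects the cap."""
        feasible = [c for c in configs if c.power_w <= power_cap_w]
        return min(feasible, key=lambda c: c.time_ms) if feasible else None

    if __name__ == "__main__":
        table = [
            Config(1000, 32, 140.0, 9.0),
            Config(800, 32, 110.0, 11.5),
            Config(1000, 16, 100.0, 13.0),
            Config(600, 32, 85.0, 15.0),
        ]
        print(best_under_cap(table, power_cap_w=115.0))   # -> the 800 MHz, 32-CU entry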
28

Many-core architecture for programmable hardware accelerator

Lee, Junghee 13 January 2014 (has links)
As the further development of single-core architectures faces seemingly insurmountable physical and technological limitations, computer designers have turned their attention to alternative approaches. One such promising alternative is the use of several smaller cores working in unison as a programmable hardware accelerator. It is clear that the vast – and, as yet, largely untapped – potential of hardware accelerators is coming to the forefront of computer architecture. Many challenges must be addressed for the programmable hardware accelerator to be realized in practice. In this thesis, load balancing, on-chip communication, and an execution model are studied. Imbalanced distribution of workloads across the processing elements constitutes wasteful use of resources and degrades the performance of the system. In this thesis, a hardware-based load-balancing technique is proposed and demonstrated to be more scalable than state-of-the-art load-balancing techniques. To facilitate efficient communication among an ever-increasing number of cores, a scalable communication network is imperative. Packet-switched networks-on-chip (NoCs) are considered a viable candidate for a scalable communication fabric. The size of the flit, the unit of flow control in a NoC, is one of the important design parameters that determine the latency, throughput, and cost of NoC routers. How to determine an optimal flit size is studied in this thesis, and a novel router architecture is proposed that overcomes a problem related to the flit size. This thesis also includes a new execution model and its supporting architecture. An event-driven model that extends a hardware description language is employed as the execution model. The dynamic scheduling and module-level prefetching that support the event-driven execution model are evaluated.
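The flit-size trade-off mentioned above can be seen with a standard first-order wormhole-routing estimate. This back-of-the-envelope Python sketch is not the thesis's model; the per-hop cycle count, packet size, and buffer parameters are assumptions, chosen only to show that wider flits cut serialization latency while inflating router buffer cost.

    import math

    def zero_load_latency(packet_bits: int, flit_bits: int, hops: int,
                          router_cycles: int = 2) -> int:
        """Head flit pays per-hop router delay; the packet body serializes behind it."""
        serialization = math.ceil(packet_bits / flit_bits)      # flits per packet
        return hops * router_cycles + serialization

    def buffer_cost_bits(flit_bits: int, vcs: int = 4, depth: int = 4, ports: int = 5) -> int:
        """Rough router buffer cost: wider flits mean proportionally wider buffers."""
        return flit_bits * vcs * depth * ports

    if __name__ == "__main__":
        for flit in (32, 64, 128, 256):
            lat = zero_load_latency(packet_bits=512, flit_bits=flit, hops=6)
            print(f"flit {flit:>3} bits: latency {lat:>2} cycles, "
                  f"buffer {buffer_cost_bits(flit)} bits per router")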
29

Shared resource management for efficient heterogeneous computing

Lee, Jaekyu 13 January 2014 (has links)
The demand for heterogeneous computing, driven by its performance and energy efficiency, has made on-chip heterogeneous chip multi-processors (HCMPs) the mainstream computing platform, as the recent trend shows across a wide spectrum of platforms, from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have integrated more powerful GPUs and this trend will continue in upcoming processors. In this architecture, several system resources are shared between CPUs and GPUs. The sharing of system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention between cores. The resource-sharing problem has existed since the homogeneous (CPU-only) chip multi-processor (CMP) was introduced. However, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. In order to solve the resource-sharing problem in HCMPs, we consider efficient shared-resource management schemes, in particular tackling the problem in the shared last-level cache and the interconnection network. In this thesis, we propose four resource-sharing mechanisms. First, we propose an efficient cache-sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for the on-chip interconnection network is proposed to isolate inter-application interference. By partitioning virtual channels between CPUs and GPUs, we can prevent the interference problem while guaranteeing quality-of-service (QoS) for both kinds of cores. Third, we propose a dynamic frequency-control mechanism to efficiently share system resources. When both kinds of cores are active, the degree of resource contention as well as the system throughput is affected by the operating frequencies of the CPUs and GPUs. The proposed mechanism tries to find optimal operating frequencies for both, reducing resource contention while improving system throughput. Finally, we propose a second cache-sharing mechanism that exploits GPU-semantic information. The programming and execution models of GPUs are stricter and simpler than those of CPUs, and programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can exercise the cache energy-efficiently, and simpler but more effective cache partitioning can be enabled for HCMPs.
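For a feel of the shared last-level cache decision discussed above, here is a textbook-style utility-based way-partitioning sketch in Python. It is not the dissertation's specific mechanism, and the hit-rate curves are made up; it only illustrates why a streaming GPU workload tends to be granted few ways while a cache-sensitive CPU workload keeps most of them.

    from typing import Dict, List

    def partition_ways(curves: Dict[str, List[float]], total_ways: int) -> Dict[str, int]:
        """Greedily give each additional way to the core type whose hit rate improves most.
        curves[name][w] = hit rate with w ways (index 0 = no ways)."""
        alloc = {name: 0 for name in curves}
        for _ in range(total_ways):
            best = max(curves, key=lambda n: curves[n][alloc[n] + 1] - curves[n][alloc[n]])
            alloc[best] += 1
        return alloc

    if __name__ == "__main__":
        curves = {
            # CPU workload: cache-sensitive, keeps gaining from extra ways
            "cpu": [0.00, 0.30, 0.50, 0.62, 0.70, 0.76, 0.80, 0.83, 0.85,
                    0.86, 0.87, 0.88, 0.885, 0.89, 0.895, 0.90, 0.905],
            # GPU workload: streaming, saturates after a couple of ways
            "gpu": [0.00, 0.20, 0.25, 0.27, 0.28, 0.285, 0.29, 0.29, 0.29,
                    0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29, 0.29],
        }
        print(partition_ways(curves, total_ways=16))   # most ways go to the CPU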
30

Energy conservation techniques for GPU computing

Mei, Xinxin 29 August 2016 (has links)
The emergence of general-purpose graphics processing unit (GPGPU) computing has tremendously sped up a great variety of commercial and scientific applications, and GPUs have become prevalent accelerators in current high-performance clusters. Though the computational capacity per watt of GPUs is much higher than that of CPUs, hybrid GPU clusters still consume enormous power, so conserving energy on this kind of cluster is of critical significance. In this thesis, we seek energy-conservative computing on GPU-accelerated servers. We introduce our studies as follows. First, we dissect the GPU memory hierarchy, because most GPU applications suffer from the GPU memory bottleneck. We find that conventional CPU cache models cannot be applied to modern GPU caches, and that the microbenchmarks used to study conventional CPU caches become invalid for the GPU. We propose GPU-specific microbenchmarks to examine the GPU memory structures and properties. Our benchmark results verify that the design goal of the GPU has shifted from pure computational performance to better energy efficiency. Second, we investigate the impact of dynamic voltage and frequency scaling (DVFS), a successful energy management technique for CPUs, on GPU platforms. Our experimental results suggest that GPU DVFS is still promising for conserving energy, but the patterns that save energy differ strongly from those of the CPU, and the effect of GPU DVFS depends on the characteristics of the individual application. Third, we derive GPU DVFS power and performance models from our experimental results, based on which we find the optimal GPU voltage and frequency setting that minimizes the energy consumption of a single GPU task. We then study the problem of scheduling multiple tasks on a hybrid CPU-GPU cluster to minimize the total energy consumption through GPU DVFS. We design an effective offline scheduling algorithm which reduces energy consumption significantly. Finally, we combine GPU DVFS with dynamic resource sleep (DRS), another energy management technique, to further conserve energy for online task scheduling on hybrid clusters. Though the idle energy consumption increases significantly compared to the offline problem, our online scheduling algorithm still achieves more than 30% energy conservation with appropriate runtime GPU DVFS readjustments.
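A minimal sketch of the single-task DVFS decision described above: pick a supported voltage/frequency pair that minimizes energy, given a power model (static plus dynamic power scaling roughly with C*V^2*f) and a runtime split into a frequency-scaled compute part and a memory-bound part. The constants, the voltage/frequency table, and the task parameters below are illustrative assumptions, not the thesis's measured models.

    def energy(v: float, f_ghz: float, comp_cycles: float, mem_seconds: float,
               p_static: float = 30.0, c_eff: float = 25.0) -> float:
        power = p_static + c_eff * v * v * f_ghz              # Watts
        runtime = comp_cycles / (f_ghz * 1e9) + mem_seconds   # seconds
        return power * runtime                                # Joules

    if __name__ == "__main__":
        # supported (voltage, frequency) operating points -- hypothetical table
        vf_table = [(0.85, 0.6), (0.95, 0.8), (1.05, 1.0), (1.15, 1.2)]
        task = dict(comp_cycles=4e9, mem_seconds=1.5)         # a memory-heavy kernel
        best = min(vf_table, key=lambda vf: energy(*vf, **task))
        for v, f in vf_table:
            print(f"V={v:.2f} f={f:.1f} GHz -> {energy(v, f, **task):6.1f} J")
        print("lowest-energy setting:", best)

For this made-up workload the minimum falls at an intermediate operating point rather than the highest or lowest frequency, which is the kind of application-dependent optimum the thesis's DVFS study points to.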
