41

Design of a Multi-Core Multi-thread Floating-Point Processor and Its Application in Computer Graphics

Yeh, Chia-Yu 06 September 2011 (has links)
Graphics processing unit (GPU) designs usually adopt various computer architecture techniques to boost computation speed, including single-instruction multiple-data (SIMD), very long instruction word (VLIW), multi-threading, and/or multi-core. In OpenGL ES 2.0, the user-programmable vertex shader (VS) hardware unit can be designed around a vectored SIMD computation unit so that it efficiently computes matrix-vector multiplication, one of the key operations in vertex transformation. Recently, high-performance GPUs, such as the Tesla series from NVIDIA, have been designed with many-core architectures in which each core is responsible for scalar operations. The intention is to allow efficient execution of general-purpose computations in addition to the specialized graphics computations. In this thesis, we present a scalar-based multi-threaded GPU design that is composed of four scalar processors and one special-function unit and can execute multi-threaded instructions. We use the example of vertex transformation to demonstrate the execution efficiency of the scalar-based multi-threaded GPU, and we compare it against a vector-based SIMD GPU.
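To make the key operation concrete, the sketch below shows vertex transformation (the matrix-vector multiplication named in the abstract) written in the scalar, one-thread-per-vertex style the thesis describes. This is an illustrative CUDA kernel, not the thesis's hardware or ISA; all names and the data layout are invented.

```cuda
// Hypothetical sketch: each thread transforms one vertex using only scalar
// multiply-adds, the style of a scalar-based multi-threaded GPU design.
__global__ void transform_vertices(const float* __restrict__ m,    // 4x4 row-major matrix
                                   const float4* __restrict__ in,  // input vertices
                                   float4* __restrict__ out,
                                   int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 v = in[i];
    // Each output component is a scalar dot product: four multiply-adds.
    out[i].x = m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w;
    out[i].y = m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w;
    out[i].z = m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w;
    out[i].w = m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w;
}
// Example launch: transform_vertices<<<(n + 255) / 256, 256>>>(m, in, out, n);
```

A vectored SIMD vertex shader would instead compute all four components of one vertex in lock-step lanes; the scalar-threaded form trades that per-vertex parallelism for flexibility across many threads.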
42

A High Performance Register Allocator for Vector Architectures with a Unified Register-Set

Su, Yu-Dan 29 June 2012 (has links)
This thesis describes a compiler optimization targeted at machines with unified, vector-based register sets. The optimization combines register allocation with instruction scheduling. It examines places where the code performs computations on scalar variables, and its goal is to identify instances where the same operation is performed. For example, a program might calculate "base+offset" and then calculate "i+j". Even though these computations are unrelated, they use the same operator; if "base" and "i" are packed into one vector register while "offset" and "j" are packed into another, then the two computations can be performed simultaneously through the vectors' parallel addition operation, reducing the execution time of the compiled code. Although other researchers have considered similar packing methods, their work has been limited by the hardware they were studying, which usually imposed high costs for moving data between scalar and vector register banks. This thesis, however, considers a novel hardware architecture that imposes no such costs; as a consequence, we are able to obtain significant speedups. The architecture we consider is a graphics processing unit (GPU) for embedded systems that is under development at this university. This GPU has a single register set for integers, floats, and vectors.
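A minimal sketch of the packing idea, assuming a hypothetical two-lane vector type. This illustrates the transformation the compiler performs, not the thesis's actual compiler or the university's GPU instruction set.

```cuda
// Two unrelated scalar additions share one vector addition once their
// operands are packed into the lanes of a single vector register.
#include <cstdio>

struct vec2f { float lane0, lane1; };

static vec2f vadd(vec2f a, vec2f b)              // one parallel addition
{
    return { a.lane0 + b.lane0, a.lane1 + b.lane1 };
}

int main()
{
    float base = 100.0f, offset = 8.0f;          // first scalar computation
    float i = 3.0f, j = 4.0f;                    // second, unrelated computation
    vec2f r = vadd({ base, i }, { offset, j });  // base+offset and i+j together
    printf("base+offset=%g  i+j=%g\n", r.lane0, r.lane1);
    return 0;
}
```

On hardware with separate scalar and vector banks, the two packing moves would typically cost more than the saved addition; the unified register set is what makes the transformation profitable.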
43

Hardware Acceleration of Electronic Design Automation Algorithms

Gulati, Kanupriya 2009 December 1900 (has links)
With the advances in very large scale integration (VLSI) technology, hardware is going parallel. Software, which was traditionally designed to execute on single-core microprocessors, now faces the tough challenge of taking advantage of this parallelism made available by the scaling of hardware. The work presented in this dissertation studies the acceleration of electronic design automation (EDA) software on several hardware platforms such as custom integrated circuits (ICs), field-programmable gate arrays (FPGAs), and graphics processors. This dissertation concentrates on a subset of EDA algorithms which are heavily used in the VLSI design flow and also have varying degrees of inherent parallelism. In particular, Boolean satisfiability, Monte Carlo based statistical static timing analysis, circuit simulation, fault simulation, and fault table generation are explored. The architectural and performance tradeoffs of implementing the above applications on these alternative platforms (in comparison to their implementation on a single-core microprocessor) are studied. In addition, this dissertation presents an automated approach to accelerate uniprocessor code using a graphics processing unit (GPU). The key idea is to partition the software application into kernels in an automated fashion, such that multiple instances of these kernels, when executed in parallel on the GPU, can maximally benefit from the GPU's hardware resources. The work presented in this dissertation demonstrates that several EDA algorithms can be successfully rearchitected to maximally harness their performance on alternative platforms such as custom-designed ICs, FPGAs, and graphics processors, and obtain speedups of up to 800X. The approaches in this dissertation collectively aim to contribute towards enabling the computer-aided design (CAD) community to accelerate EDA algorithms on arbitrary hardware platforms.
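As one concrete instance of the kernel pattern the abstract describes (many independent instances of one kernel running in parallel), here is a hedged CUDA sketch of Monte Carlo statistical static timing analysis, one of the listed applications. The toy delay model and all names are invented for illustration; the dissertation's kernels are derived automatically from the application code.

```cuda
#include <curand_kernel.h>

// Each thread evaluates one independent Monte Carlo sample of a toy timing
// model: a path delay as the sum of normally distributed stage delays.
__global__ void mc_ssta(float* delays, int n_samples, int n_stages,
                        float mean, float sigma, unsigned long long seed)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= n_samples) return;
    curandState rng;
    curand_init(seed, s, 0, &rng);               // one RNG stream per sample
    float d = 0.0f;
    for (int k = 0; k < n_stages; ++k)
        d += mean + sigma * curand_normal(&rng); // one sampled stage delay
    delays[s] = d;  // statistics (mean, 3-sigma) are reduced on the host
}
// Example launch: mc_ssta<<<(n_samples + 255) / 256, 256>>>(d, n_samples, 20, 1.0f, 0.1f, 42ULL);
```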
44

An enhanced GPU architecture for not-so-regular parallelism with special implications for database search

Narasiman, Veynu Tupil 27 June 2014 (has links)
Graphics Processing Units (GPUs) have become a popular platform for executing general purpose (i.e., non-graphics) applications. To run efficiently on a GPU, applications must be parallelized into many threads, each of which performs the same task but operates on different data (i.e., data parallelism). Previous work has shown that some applications experience significant speedup when executed on a GPU instead of a CPU. The applications that benefit most tend to have certain characteristics such as high computational intensity, regular control flow and memory access patterns, and little to no communication among threads. However, not all parallel applications have these characteristics. Applications with a more balanced compute-to-memory ratio, divergent control flow, irregular memory accesses, and/or frequent communication (i.e., not-so-regular applications) will not take full advantage of the GPU's resources, resulting in performance far short of what could be delivered. The goal of this dissertation is to enhance the GPU architecture to better handle not-so-regular parallelism. This is accomplished in two parts. First, I analyze a diverse set of data parallel applications that suffer from divergent control flow and/or significant stall time due to memory. I propose two microarchitectural enhancements to the GPU called the Large Warp Microarchitecture and Two-Level Warp Scheduling to address these problems respectively. When combined, these mechanisms increase performance by 19% on average. Second, I examine one of the most important and fundamental applications in computing: database search. Database search is an excellent example of an application that is rich in parallelism, but rife with not-so-regular characteristics. I propose enhancements to the GPU architecture including new instructions that improve intra-warp thread communication and decision making, and also a row-buffer locality hint bit to better handle the irregular memory access patterns of index-based tree search. These proposals improve performance by 21% for full table scans, and 39% for index-based search. The result of this dissertation is an enhanced GPU architecture that better handles not-so-regular parallelism. This increases the scope of applications that run efficiently on the GPU, making it a more viable platform not only for current parallel workloads such as databases, but also for future and emerging parallel applications.
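The new instructions proposed in the dissertation are specific to its enhanced architecture, but today's CUDA warp-vote primitives give a rough present-day analogue of the intra-warp communication it targets. The sketch below uses them to aggregate per-row predicate results in a full table scan; names and structure are illustrative only, and the block size is assumed to be a multiple of the 32-thread warp size.

```cuda
// One thread evaluates one row's predicate; the warp votes, and a single
// lane publishes the warp's match count with one atomic instead of 32.
__global__ void scan_filter(const int* __restrict__ keys, int n,
                            int lo, int hi, int* __restrict__ match_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool hit = (i < n) && keys[i] >= lo && keys[i] <= hi;  // per-row predicate
    unsigned mask = __ballot_sync(0xffffffffu, hit);       // one vote bit per lane
    if ((threadIdx.x & 31) == 0 && mask)                   // lane 0 of each warp
        atomicAdd(match_count, __popc(mask));
}
```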
45

Acceleration of Transient Stability Simulation for Large-Scale Power Systems on Parallel and Distributed Hardware

Jalili-Marandi, Vahid Unknown Date
No description available.
46

Power-constrained performance optimization of GPU graph traversal

McLaughlin, Adam Thomas 13 January 2014 (has links)
Graph traversal represents an important class of graph algorithms that is the nucleus of many large-scale graph analytics applications. While improving the performance of such algorithms using GPUs has received attention, understanding and managing performance under power constraints has not yet received similar attention. This thesis first explores the power and performance characteristics of breadth-first search (BFS) via measurements on a commodity GPU. We use this analysis to address the problem of minimizing execution time below a predefined power limit, or power cap, exposing key relationships between graph properties and power consumption. We modify the firmware on a commodity GPU to measure power usage and use the GPU as an experimental system to evaluate future architectural enhancements for the optimization of graph algorithms. Specifically, we propose and evaluate power management algorithms that scale (i) the GPU frequency or (ii) the number of active GPU compute units for a diverse set of real-world and synthetic graphs. Compared to scaling either frequency or compute units individually, our proposed schemes reduce execution time by an average of 18.64% by adjusting the configuration based on inter- and intra-graph characteristics.
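A minimal host-side sketch of the configuration-selection step: pick the fastest (frequency, compute-unit) pair whose predicted power stays under the cap. The power and time models here are invented placeholders; the thesis derives its models from firmware-level measurement and graph characteristics.

```cuda
#include <cstdio>

struct Config { int mhz; int cus; };  // candidate operating point

int main()
{
    const Config cfgs[] = {{700, 8}, {700, 16}, {900, 8}, {900, 16}, {1100, 16}};
    const double cap_watts = 120.0;
    Config best{};
    double best_time = 1e30;
    for (const Config& c : cfgs) {
        double power = 40.0 + 0.005 * c.mhz * c.cus / 16.0;  // placeholder model
        double time  = 1000.0 / (c.mhz * 1e-3 * c.cus);      // placeholder model
        if (power <= cap_watts && time < best_time) { best = c; best_time = time; }
    }
    printf("chosen: %d MHz, %d CUs\n", best.mhz, best.cus);
    return 0;
}
```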
47

Many-core architecture for programmable hardware accelerator

Lee, Junghee 13 January 2014 (has links)
As the further development of single-core architectures faces seemingly insurmountable physical and technological limitations, computer designers have turned their attention to alternative approaches. One such promising alternative is the use of several smaller cores working in unison as a programmable hardware accelerator. It is clear that the vast, and as yet largely untapped, potential of hardware accelerators is coming to the forefront of computer architecture. Many challenges must be addressed for the programmable hardware accelerator to be realized in practice; in this thesis, load balancing, on-chip communication, and an execution model are studied. An imbalanced distribution of workloads across the processing elements constitutes wasteful use of resources and degrades system performance. In this thesis, a hardware-based load-balancing technique is proposed and demonstrated to be more scalable than state-of-the-art load-balancing techniques. To facilitate efficient communication among an ever-increasing number of cores, a scalable communication network is imperative. Packet-switched networks-on-chip (NoC) are considered a viable candidate for a scalable communication fabric. The size of the flit, the unit of flow control in a NoC, is one of the important design parameters that determine the latency, throughput, and cost of NoC routers. How to determine an optimal flit size is studied in this thesis, and a novel router architecture is proposed that overcomes a problem related to flit size. This thesis also includes a new execution model and its supporting architecture: an event-driven model that is an extension of hardware description languages. The dynamic scheduling and module-level prefetching that support the event-driven execution model are evaluated.
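The flit-size tradeoff can be seen in the standard first-order wormhole-routing latency model (the textbook model, not the thesis's proposed router): head latency grows with hop count, while serialization latency grows with packet size divided by flit width. A small sketch, with all parameter values invented:

```cuda
#include <cstdio>
#include <initializer_list>

// latency = hops * per-router delay + number of flits (serialization)
static int noc_latency_cycles(int hops, int router_cycles,
                              int packet_bits, int flit_bits)
{
    int flits = (packet_bits + flit_bits - 1) / flit_bits;  // ceil division
    return hops * router_cycles + flits;
}

int main()
{
    for (int w : {32, 64, 128, 256})  // candidate flit widths
        printf("flit=%3d bits -> %d cycles\n", w,
               noc_latency_cycles(/*hops=*/6, /*router_cycles=*/2,
                                  /*packet_bits=*/512, w));
    return 0;
}
```

Wider flits cut serialization cycles but raise router cost, which is why the optimal flit size is a genuine design question rather than "as wide as possible".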
48

Shared resource management for efficient heterogeneous computing

Lee, Jaekyu 13 January 2014 (has links)
The demand for heterogeneous computing, driven by its performance and energy efficiency, has made the on-chip heterogeneous chip multiprocessor (HCMP) the mainstream computing platform, as the recent trend shows in a wide spectrum of platforms from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have integrated more powerful GPUs and this trend will continue in upcoming processors. In this architecture, several system resources are shared between CPUs and GPUs. The sharing of system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention problems between cores. The resource sharing problem has existed since the homogeneous (CPU-only) chip multiprocessor (CMP) was introduced. However, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. To solve the resource sharing problem in HCMPs, we consider efficient shared resource management schemes, in particular tackling the problem in the shared last-level cache and the interconnection network. In this thesis, we propose four resource sharing mechanisms. First, we propose an efficient cache sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for the on-chip interconnection network is proposed to isolate inter-application interference: by partitioning virtual channels between CPUs and GPUs, we can prevent interference while guaranteeing quality-of-service (QoS) for both types of cores. Third, we propose a dynamic frequency control mechanism for sharing system resources efficiently. When both types of cores are active, the degree of resource contention as well as the system throughput is affected by the operating frequencies of the CPUs and GPUs; the proposed mechanism tries to find optimal operating frequencies for both, reducing resource contention while improving system throughput. Finally, we propose a second cache sharing mechanism that exploits GPU-semantic information. The programming and execution models of GPUs are stricter and simpler than those of CPUs, and programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can exercise the cache energy-efficiently, and simpler but more effective cache partitioning can be enabled for HCMPs.
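For the last-level-cache side of the problem, way partitioning is one common baseline mechanism; the sketch below is a hedged illustration of victim selection restricted to the requester's assigned ways, not the thesis's adaptive scheme. The structures, mask values, and 16-way geometry are all invented.

```cuda
#include <cstdint>

enum class Requester { CPU, GPU };

// On a miss in a 16-way set, evict only among the ways assigned to the
// requesting core type (e.g., CPU owns ways 0-11, GPU owns ways 12-15),
// choosing the oldest entry by per-way LRU age.
int pick_victim(Requester who, const uint8_t lru_age[16],
                uint16_t cpu_mask = 0x0FFF, uint16_t gpu_mask = 0xF000)
{
    uint16_t mask = (who == Requester::CPU) ? cpu_mask : gpu_mask;
    int victim = -1, oldest = -1;
    for (int w = 0; w < 16; ++w)
        if (((mask >> w) & 1) && lru_age[w] > oldest) {
            oldest = lru_age[w];
            victim = w;
        }
    return victim;
}
```

A static split like this prevents the GPU's streaming accesses from flushing CPU-resident data; an adaptive scheme, as the thesis pursues, would resize the masks based on each core type's observed cache sensitivity.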
49

Delaunay Triangulation: A GPU-Based Implementation and Its Use in Real-Time Computer Vision and Graphics Problems

Βασιλείου, Πέτρος 01 February 2013 (has links)
A fast solver for the Delaunay triangulation (DT) problem constitutes one of the basic ingredients of many practical and scientific applications. Existing graphics processing unit (GPU) based implementations of DT algorithms suffer from two serious drawbacks. The first is related to the algorithm's dependence on CPU guidance of the GPU computations: although modern GPUs have high computational throughput, if feedback from the CPU is necessary for the algorithm to progress, the overhead caused by CPU-GPU communication can seriously degrade performance. The second, more serious drawback is their dependence on the distribution of the input point-set: most GPU-based implementations run optimally only on uniformly distributed point-sets, which in many practical applications is not the case. We propose a new algorithm that does not suffer from these problems.
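At the core of any DT implementation, GPU-based or not, is the incircle predicate, sketched below in its standard determinant form. Double precision is a simplification here: robust implementations use exact or adaptive-precision arithmetic, and the qualifiers merely allow use from CUDA kernels.

```cuda
struct Pt { double x, y; };

// Is point d strictly inside the circumcircle of triangle (a, b, c)?
// Assumes (a, b, c) is in counterclockwise order.
__host__ __device__ bool in_circumcircle(Pt a, Pt b, Pt c, Pt d)
{
    double ax = a.x - d.x, ay = a.y - d.y;
    double bx = b.x - d.x, by = b.y - d.y;
    double cx = c.x - d.x, cy = c.y - d.y;
    double det = (ax*ax + ay*ay) * (bx*cy - cx*by)
               - (bx*bx + by*by) * (ax*cy - cx*ay)
               + (cx*cx + cy*cy) * (ax*by - bx*ay);
    return det > 0.0;  // > 0: inside; < 0: outside; == 0: cocircular
}
```

A triangulation is Delaunay exactly when this test fails for every triangle against every other point, which is why edge-flipping and insertion algorithms are built around it.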
50

Design and Implementation of Scalable High-Performance Network Functions

Hsieh, Cheng-Liang 01 August 2017 (has links)
Service Function Chaining (SFC) enriches network functionality to fulfill the increasing demand for value-added services. By leveraging SDN and NFV for SFC, it becomes possible to meet demand fluctuations and construct a dynamic SFC. However, the integration of SDN with NFV requires packet header modifications, generates excessive network traffic, and induces additional I/O overheads for packet processing. These additional overheads result in lower system performance, scalability, and agility. To improve system performance, a co-optimized solution is proposed to implement network functions with better performance in software. To improve system scalability, a many-field packet classification scheme is proposed to support more complex rulesets. To improve system agility, a network function-enabled switch is proposed to lower the network function content switching time. The experimental results show that the performance of a network function is improved by 8 times by leveraging the GPU as a parallel computation platform. Moreover, the matching speed for steering network traffic with many-field rulesets is improved by 4 times with the proposed many-field packet classification algorithm. Finally, the proposed SFC implementation using SDN and NFV improves system bandwidth by 5 times compared with the native solution while maintaining the content switching time.
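As a baseline for what a many-field classifier improves upon, here is a hedged CUDA sketch of linear-scan packet classification with one thread per packet. The rule layout and field set are invented for illustration; the thesis's algorithm handles many more fields and avoids the linear scan.

```cuda
struct Rule {
    unsigned src, src_mask;           // source IP prefix
    unsigned dst, dst_mask;           // destination IP prefix
    unsigned short port_lo, port_hi;  // destination port range
    int action;
};

struct Packet { unsigned src, dst; unsigned short dport; };

// Each thread classifies one packet: first matching rule wins.
__global__ void classify(const Packet* __restrict__ pkts, int n_pkts,
                         const Rule* __restrict__ rules, int n_rules,
                         int* __restrict__ actions)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pkts) return;
    Packet p = pkts[i];
    int act = -1;                     // default: no match
    for (int r = 0; r < n_rules && act < 0; ++r) {
        const Rule& q = rules[r];
        if ((p.src & q.src_mask) == q.src &&
            (p.dst & q.dst_mask) == q.dst &&
            p.dport >= q.port_lo && p.dport <= q.port_hi)
            act = q.action;
    }
    actions[i] = act;
}
```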
