11

SSAGA: Streaming Multiprocessors (SMs) Sculpted for Asymmetric General Purpose Graphics Processing Unit (GPGPU) Applications

Saha, Shamik 01 May 2016 (has links)
The evolution of Graphics Processing Units (GPUs) over the last decade has reinforced general-purpose computing while sustaining steady performance growth in graphics-intensive applications. However, the immense performance improvement is generally associated with a steep rise in GPU power consumption. Consequently, GPUs are already close to the abominable power wall. With the massive popularity of mobile devices running general-purpose GPU (GPGPU) applications, it is of utmost importance to ensure high energy efficiency while meeting strict performance requirements. In this work, we demonstrate that customizing a Streaming Multiprocessor (SM) of a GPU for a lower frequency is significantly more energy efficient than employing Dynamic Voltage and Frequency Scaling (DVFS) on an SM designed for high-frequency operation. Using a system-level Computer Aided Design (CAD) technique, we propose SSAGA (Streaming Multiprocessors Sculpted for Asymmetric GPGPU Applications), an energy-efficient GPU design paradigm. SSAGA creates architecturally identical SM cores customized for different voltage-frequency domains.
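The intuition can be made concrete with a first-order CMOS energy model (dynamic power proportional to C·V²·f). The sketch below is illustrative only: the capacitance, voltage, and frequency values are assumptions chosen for the example and are not figures from the thesis.

    # First-order CMOS dynamic energy model: P_dyn = a * C * V^2 * f
    # Values below are illustrative assumptions, not figures from the thesis.

    def dynamic_energy(work_cycles, c_eff, voltage, frequency, activity=0.5):
        """Energy (J) to execute a fixed number of cycles at a given V/f point."""
        power = activity * c_eff * voltage**2 * frequency   # watts
        runtime = work_cycles / frequency                   # seconds
        return power * runtime

    WORK = 1e9  # cycles for a fixed GPGPU workload

    # SM designed for a high frequency, then scaled down by DVFS to 1.0 GHz:
    # the high-frequency design carries extra effective capacitance.
    e_dvfs = dynamic_energy(WORK, c_eff=1.2e-9, voltage=0.95, frequency=1.0e9)

    # SM sculpted for 1.0 GHz operation from the start:
    # leaner circuits, lower effective capacitance and supply voltage.
    e_sculpted = dynamic_energy(WORK, c_eff=0.9e-9, voltage=0.85, frequency=1.0e9)

    print(f"DVFS-scaled SM: {e_dvfs:.3f} J")
    print(f"Sculpted SM:    {e_sculpted:.3f} J")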
12

Cellular matrix for parallel k-means and local search to Euclidean grid matching

Wang, Hongjian 03 December 2015 (has links)
In this thesis, we propose a parallel computing model, called the cellular matrix, to address the issues that arise when parallel computation is applied to Euclidean graph matching problems. These NP-hard optimization problems involve data distributed in the plane and elastic structures, represented by graphs, that must match the data. They include problems known under various names, such as geometric k-means, elastic net, topographic mapping, and elastic image matching. The Euclidean traveling salesman problem (TSP), the median cycle problem, and the image matching problem are further examples that can be modeled as graph matching. The contribution is divided into three parts. In the first part, we present the cellular matrix model, which partitions the data and defines the level of granularity of the parallel computation, together with a generic parallel computation loop that models the projection between graphs and their matching. In the second part, we apply the model to k-means algorithms in the plane extended with topology. The proposed algorithms are applied to the TSP, structured mesh generation, and image segmentation following the superpixel concept; the approach is called superpixel adaptive segmentation map (SPASM). In the third part, we propose a parallel local search algorithm, called distributed local search (DLS). The solution results from many local operations, including local evaluations, neighborhood searches, and structured moves, performed on the data distributed in the plane. The algorithm is applied to Euclidean graph matching problems including stereo matching and optical flow.
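As a rough illustration of the cell-based decomposition described above (a sketch under simplifying assumptions, not the thesis implementation), the assignment step of k-means can be restricted so that each point only examines centroids registered in its own grid cell and the neighboring cells:

    import numpy as np

    # Sketch: k-means assignment step where the plane is partitioned into a
    # cellular grid and each point only searches centroids in nearby cells.
    # This mirrors the spirit of a cellular-matrix decomposition; it is not
    # the thesis algorithm.

    rng = np.random.default_rng(0)
    points = rng.random((2000, 2))
    centroids = rng.random((50, 2))

    GRID = 8  # 8x8 cellular matrix over the unit square

    def cell_of(xy):
        return tuple(np.minimum((xy * GRID).astype(int), GRID - 1))

    # Register each centroid in its cell.
    cells = {}
    for k, c in enumerate(centroids):
        cells.setdefault(cell_of(c), []).append(k)

    def assign(p):
        ci, cj = cell_of(p)
        # Gather candidate centroids from the 3x3 neighborhood of cells.
        cand = [k for di in (-1, 0, 1) for dj in (-1, 0, 1)
                for k in cells.get((ci + di, cj + dj), [])]
        if not cand:                      # fall back to a global search
            cand = range(len(centroids))
        d = np.linalg.norm(centroids[list(cand)] - p, axis=1)
        return list(cand)[int(np.argmin(d))]

    labels = np.array([assign(p) for p in points])
    print(np.bincount(labels, minlength=len(centroids))[:10])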
13

Computational Medical Image Analysis : With a Focus on Real-Time fMRI and Non-Parametric Statistics

Eklund, Anders January 2012 (has links)
Functional magnetic resonance imaging (fMRI) is a prime example of multi-disciplinary research. Without the beautiful physics of MRI, there would not be any images to look at in the first place. To obtain images of good quality, it is necessary to fully understand the concepts of the frequency domain. The analysis of fMRI data requires understanding of signal processing, statistics and knowledge about the anatomy and function of the human brain. The resulting brain activity maps are used by physicians, neurologists, psychologists and behaviourists, in order to plan surgery and to increase their understanding of how the brain works. This thesis presents methods for real-time fMRI and non-parametric fMRI analysis. Real-time fMRI places high demands on the signal processing, as all the calculations have to be made in real-time in complex situations. Real-time fMRI can, for example, be used for interactive brain mapping. Another possibility is to change the stimulus that is given to the subject, in real-time, such that the brain and the computer can work together to solve a given task, yielding a brain computer interface (BCI). Non-parametric fMRI analysis, for example, concerns the problem of calculating significance thresholds and p-values for test statistics without a parametric null distribution. Two BCIs are presented in this thesis. In the first BCI, the subject was able to balance a virtual inverted pendulum by thinking of activating the left or right hand or resting. In the second BCI, the subject in the MR scanner was able to communicate with a person outside the MR scanner, through a virtual keyboard. A graphics processing unit (GPU) implementation of a random permutation test for single subject fMRI analysis is also presented. The random permutation test is used to calculate significance thresholds and p-values for fMRI analysis by canonical correlation analysis (CCA), and to investigate the correctness of standard parametric approaches. The random permutation test was verified by using 10 000 noise datasets and 1484 resting state fMRI datasets. The random permutation test is also used for a non-local CCA approach to fMRI analysis.
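The random permutation test mentioned above is a standard non-parametric procedure. The minimal sketch below shows how permutation resampling yields a significance threshold and a p-value; it uses a plain correlation test statistic on synthetic data and ignores the temporal autocorrelation that a real fMRI analysis must handle, so it conveys only the principle, not the thesis's CCA-based implementation.

    import numpy as np

    # Sketch of a random permutation test for a single voxel's activation,
    # using correlation with the stimulus paradigm as the test statistic.
    # The thesis uses CCA-based statistics; plain correlation is assumed
    # here only to keep the example short.

    rng = np.random.default_rng(1)
    n_scans = 200
    paradigm = np.tile([0] * 10 + [1] * 10, n_scans // 20).astype(float)
    voxel = 0.4 * paradigm + rng.standard_normal(n_scans)   # synthetic time series

    def statistic(signal, design):
        return abs(np.corrcoef(signal, design)[0, 1])

    observed = statistic(voxel, paradigm)

    # Build the null distribution by permuting the time series.
    n_perm = 5000
    null = np.array([statistic(rng.permutation(voxel), paradigm)
                     for _ in range(n_perm)])

    threshold = np.quantile(null, 0.95)          # 5% significance threshold
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    print(f"observed={observed:.3f} threshold={threshold:.3f} p={p_value:.4f}")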
14

Design of a Multi-Core Multi-thread Floating-Point Processor and Its Application in Computer Graphics

Yeh, Chia-Yu 06 September 2011 (has links)
Graphics processing unit (GPU) designs usually adopt various computer architecture techniques to boost computation speed, including single-instruction multiple-data (SIMD), very-long-instruction-word (VLIW), multi-threading, and/or multi-core. In OpenGL ES 2.0, the user-programmable vertex shader (VS) hardware unit can be designed using a vectored SIMD computation unit so that it can efficiently compute the matrix-vector multiplication, one of the key operations in vertex transformation. Recently, high-performance GPUs, such as the Tesla series from NVIDIA, have been designed with many-core architectures in which each core is responsible for scalar operations. The intention is to allow efficient execution of general-purpose computations in addition to the specialized graphics computations. In this thesis, we present a scalar-based multi-threaded GPU design that is composed of four scalar processors and one special-function unit and can execute multi-threaded instructions. We use the example of vertex transformation to demonstrate the execution efficiency of the scalar-based multi-threaded GPU and make a comparison with a vector-based SIMD GPU.
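Vertex transformation, the key operation cited above, is a 4x4 matrix times 4-vector product per vertex. A small sketch (with an assumed model-view-projection matrix) of the computation that a vectored SIMD unit or a group of scalar threads must carry out:

    import numpy as np

    # Sketch of vertex transformation: each vertex (x, y, z, 1) is multiplied
    # by a 4x4 model-view-projection matrix. A vectored SIMD unit performs one
    # row-vector dot product per lane; a scalar many-core GPU spreads the same
    # multiply-accumulate work across threads. Matrix values are assumed.

    mvp = np.array([[1.0, 0.0, 0.0, 2.0],     # translation by (2, 3, 0)
                    [0.0, 1.0, 0.0, 3.0],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])

    vertices = np.array([[0.0, 0.0, 0.0, 1.0],
                         [1.0, 1.0, 0.0, 1.0],
                         [0.5, 0.5, 0.5, 1.0]])

    transformed = vertices @ mvp.T    # one matrix-vector product per vertex
    print(transformed)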
15

A High Performance Register Allocator for Vector Architectures with a Unified Register-Set

Su, Yu-Dan 29 June 2012 (has links)
This thesis describes a compiler optimization targeted at machines with unified, vector-based register sets. This optimization combines register allocation and instruction scheduling. It examines places where the code performs computations on scalar variables, with the goal of identifying instances where the same operation is performed. For example, a program might calculate "base+offset" and then calculate "i+j". Even though these computations are unrelated, they use the same operator; if "base" and "i" are packed into one vector register, while "offset" and "j" are packed into another, then these two computations can be performed simultaneously through the vectors' parallel addition operation. This reduces the execution time of the compiled code. Although other researchers have considered similar packing methods, their work has been limited by the hardware that they were studying. Such hardware usually imposed high costs for moving data between scalar and vector register banks. The present thesis, however, considers a novel hardware architecture that imposes no such costs. As a consequence, we are able to obtain significant speedups. The architecture that we consider is a Graphics Processing Unit (GPU) for embedded systems that is under development at this university. This GPU has a single register set for integers, floats, and vectors.
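The packing idea can be illustrated outside the compiler: two unrelated scalar additions that share an operator become a single vector addition once their operands are packed into vector registers. The toy sketch below mirrors the transformation with arbitrary values; it is not the register allocator itself.

    import numpy as np

    # Illustration of the packing idea: two unrelated scalar additions,
    # base + offset and i + j, share the '+' operator, so their operands
    # can be packed into two vector registers and executed as a single
    # vector add. Values are arbitrary; this mirrors the transformation
    # the register allocator performs, not the allocator itself.

    base, offset = 0x1000, 0x24
    i, j = 7, 5

    # Scalar code: two separate additions.
    addr = base + offset
    k = i + j

    # Packed code: one vector addition does both.
    vreg_a = np.array([base, i])        # pack base and i into one register
    vreg_b = np.array([offset, j])      # pack offset and j into another
    vreg_c = vreg_a + vreg_b            # single parallel add

    assert vreg_c[0] == addr and vreg_c[1] == k
    print(vreg_c)   # [4132   12] -> both results produced by one instruction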
16

Delaunay triangulation: a GPU-based implementation and its use in real-time problems of computer vision and graphics

Βασιλείου, Πέτρος 01 February 2013 (has links)
A fast solver of the Delaunay Triangulation (DT) problem constitutes one of the basic ingredients in many practical and scientific applications. Existing Graphics Processing Unit (GPU) based implementations of DT algorithms suffer from two serious drawbacks. The first is related to the dependency of the GPU computation on guidance from the CPU. Although modern GPUs have high computational throughput, if feedback from the CPU is necessary for the algorithmic evolution, the overhead caused by CPU-GPU communication can seriously degrade performance. The second, more serious, drawback is their dependency on the distribution of the given point-set. Most of the GPU-based implementations can run optimally only on uniformly distributed point-sets; however, in many practical applications this is not the case. We propose a new algorithm that does not suffer from these drawbacks.
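For reference, a CPU-side Delaunay triangulation of a non-uniform (clustered) point set, the kind of input the abstract says many GPU implementations handle poorly, can be computed in a few lines. This is only a baseline sketch using SciPy, not the GPU algorithm proposed in the thesis.

    import numpy as np
    from scipy.spatial import Delaunay

    # CPU baseline: triangulate a deliberately non-uniform (clustered) point
    # set. This is only a reference sketch, not the proposed GPU algorithm.

    rng = np.random.default_rng(2)
    clusters = [rng.normal(loc=c, scale=0.05, size=(300, 2))
                for c in [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]]
    points = np.vstack(clusters)

    tri = Delaunay(points)
    print(f"{len(points)} points -> {len(tri.simplices)} triangles")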
17

Design and Implementation of Scalable High-Performance Network Functions

Hsieh, Cheng-Liang 01 August 2017 (has links)
Service Function Chaining (SFC) enriches network functionality to fulfill the increasing demand for value-added services. By leveraging SDN and NFV for SFC, it becomes possible to meet demand fluctuations and construct dynamic SFCs. However, the integration of SDN with NFV requires packet header modifications, generates excessive network traffic, and induces additional I/O overheads for packet processing. These additional overheads result in lower system performance, scalability, and agility. To improve system performance, a co-optimized solution is proposed to implement network functions, achieving better performance for software-based network functions. To improve system scalability, a many-field packet classification scheme is proposed to support more complex rulesets. To improve system agility, a network-function-enabled switch is proposed to lower the network function content switching time. The experimental results show that the performance of a network function is improved 8 times by leveraging the GPU as a parallel computation platform. Moreover, the matching speed for steering network traffic with a many-field ruleset is improved 4 times with the proposed many-field packet classification algorithm. Finally, the proposed SFC implementation using SDN and NFV improves system bandwidth 5 times over the native solution while maintaining the content switching time.
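Many-field packet classification matches a packet's header fields against a ruleset and returns the action of the highest-priority matching rule. The sketch below is the simplest possible linear-scan baseline over a made-up ruleset, not the accelerated algorithm proposed in the thesis.

    # Sketch of many-field packet classification: each rule constrains several
    # header fields (here: source prefix, destination prefix, protocol, and a
    # destination-port range) and carries a priority. The ruleset and packets
    # below are made-up examples; real accelerators replace this linear scan.

    RULES = [
        # (priority, src_prefix, dst_prefix, proto, dport_range, action)
        (10, "10.0.0.",  "192.168.1.", "tcp", (80, 80),   "forward"),
        (20, "10.0.",    "192.168.",   "tcp", (0, 1023),  "inspect"),
        (30, "",         "",           "udp", (53, 53),   "forward"),
        (99, "",         "",           "",    (0, 65535), "drop"),     # default
    ]

    def classify(src, dst, proto, dport):
        matches = []
        for prio, sp, dp, pr, (lo, hi), action in RULES:
            if (src.startswith(sp) and dst.startswith(dp)
                    and (pr == "" or pr == proto) and lo <= dport <= hi):
                matches.append((prio, action))
        return min(matches)[1]          # lowest number = highest priority

    print(classify("10.0.0.5", "192.168.1.7", "tcp", 80))   # forward
    print(classify("172.16.0.9", "8.8.8.8", "udp", 53))     # forward
    print(classify("172.16.0.9", "8.8.8.8", "tcp", 443))    # drop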
18

Automatic digital surface model generation using graphics processing unit

Van der Merwe, Dirk Jacobus 05 June 2012 (has links)
M. Ing. / Digital Surface Models (DSMs) are widely used in the earth sciences for research, visualizations, construction, etc. In order to generate a DSM for a specific area, specialized equipment and personnel are always required, which makes it a costly and time-consuming exercise. Image processing has become a viable technique for generating terrain models since improvements in hardware have provided adequate processing power to complete such a task. Digital Surface Models can be generated from stereo imagery, usually obtained from a remote sensing platform. The core component of a DSM generating system is the image matching algorithm. Even though there are a variety of algorithms to date which can generate DSMs, it is a computationally complex calculation and tends to take some time to complete. In order to generate DSMs faster, an investigation into an alternative processing platform for the generation of terrain models has been done. The Graphics Processing Unit (GPU) is usually used in the gaming industry to manipulate display data and render it to a computer screen. The architecture is designed to manipulate large amounts of floating-point data. The scientific community has begun using the GPU processing power available for technical computing, hence the term General Purpose computing on a Graphics Processing Unit (GPGPU). The GPU is investigated as an alternative processing platform for the image matching procedure, since the processing capability of the GPU is much higher than that of the CPU, but only for a conditioned set of input data. A matching algorithm derived from the GC3 algorithm has been implemented on both a CPU platform and a GPU platform in order to investigate the viability of a GPU processing alternative. The algorithm makes use of a Normalized Cross Correlation similarity measurement and the geometry of the image acquisition contained in the sensor model to obtain conjugate point matches in the two source images. The results of the investigation indicated an improvement of up to 70% in the processing time required to generate a DSM. The improvements varied from 70% down to some cases where the GPU took longer to generate the DSM. The accuracy of the automatic DSM generating system could not be clearly determined since only poor quality reference data was available. It is, however, shown that the DSMs generated using both the CPU and GPU platforms relate to the reference data and correlate with each other. The discrepancies between the CPU and GPU results are low enough to show that GPU processing is beneficial, with negligible drawbacks in terms of accuracy. The GPU will definitely provide superior processing capabilities for DSM generation over a CPU implementation if the matching algorithm is specifically designed to cater for the benefits and limitations of the GPU.
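The similarity measure mentioned above, Normalized Cross Correlation between an image patch and a candidate patch, is compact enough to sketch directly. The version below uses synthetic patches and is illustrative only; it is not the thesis's GC3-based matcher.

    import numpy as np

    # Sketch of Normalized Cross Correlation (NCC) between two image patches,
    # the similarity measure used when searching for conjugate points in a
    # stereo pair. Patches here are synthetic; a real matcher slides the
    # template over a search window and keeps the best score.

    def ncc(a, b):
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return float((a * b).sum() / denom) if denom > 0 else 0.0

    rng = np.random.default_rng(3)
    template = rng.random((11, 11))
    same = template + 0.05 * rng.standard_normal((11, 11))  # nearly identical patch
    other = rng.random((11, 11))                            # unrelated patch

    print(f"NCC with matching patch:  {ncc(template, same):.3f}")   # close to 1
    print(f"NCC with unrelated patch: {ncc(template, other):.3f}")  # near 0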
19

Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

Lutz, Thibaut January 2015 (has links)
Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous and are getting increasingly complex. The single-processor era has been replaced with multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because the flexibility and portability necessary to optimize for each target individually are unachievable, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non-intrusive methods in the form of compiler tools, and implementing efficient abstractions that automatically tune parameters for a restricted domain, are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces significant software engineering effort spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain: stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
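Stencil computation, the restricted domain targeted by the library-based approach, updates each grid point from a fixed neighborhood. The sketch below shows a generic 5-point Jacobi-style stencil to convey the access pattern; it is not the framework's own abstraction.

    import numpy as np

    # Sketch of a 5-point stencil sweep (Jacobi-style averaging), the class of
    # computation the library-based approach abstracts. A tuned OpenCL version
    # would map the same update to work-groups and tile it for local memory;
    # this NumPy version only shows the data-access pattern.

    def stencil_step(grid):
        new = grid.copy()
        new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                  grid[1:-1, :-2] + grid[1:-1, 2:])
        return new

    grid = np.zeros((128, 128))
    grid[0, :] = 100.0                 # fixed hot boundary on one edge

    for _ in range(200):               # repeated sweeps propagate the boundary
        grid = stencil_step(grid)

    print(f"interior mean after 200 sweeps: {grid[1:-1, 1:-1].mean():.2f}")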
20

Pencil beam dose calculation for proton therapy on graphics processing units

da Silva, Joakim January 2016 (has links)
Radiotherapy delivered using scanned beams of protons enables greater conformity between the dose distribution and the tumour than conventional radiotherapy using X rays. However, the dose distributions are more sensitive to changes in patient anatomy, and tend to deteriorate in the presence of motion. Online dose calculation during treatment delivery offers a way of monitoring the delivered dose in real time, and could be used as a basis for mitigating the effects of motion. The aim of this work has therefore been to investigate how the computational power offered by graphics processing units can be harnessed to enable fast analytical dose calculation for online monitoring in proton therapy. The first part of the work consisted of a systematic investigation of various approaches to implementing the most computationally expensive step of the pencil beam algorithm to run on graphics processing units. As a result, it was demonstrated how the kernel superposition operation, or convolution with a spatially varying kernel, can be efficiently implemented using a novel scatter-based approach. For the intended application, this outperformed the conventional gather-based approach suggested in the literature, permitting faster pencil beam dose calculation and potential speedups of related algorithms in other fields. In the second part, a parallelised proton therapy dose calculation engine employing the scatter-based kernel superposition implementation was developed. Such a dose calculation engine, running all of the principal steps of the pencil beam algorithm on a graphics processing unit, had not previously been presented in the literature. The accuracy of the calculation in the high- and medium-dose regions matched that of a clinical treatment planning system whilst the calculation was an order of magnitude faster than previously reported. Importantly, the calculation times were short, both compared to the dead time available during treatment delivery and to the typical motion period, making the implementation suitable for online calculation. In the final part, the beam model of the dose calculation engine was extended to account for the low-dose halo caused by particles travelling at large angles with the beam, making the algorithm comparable to those in current clinical use. By reusing the workflow of the initial calculation but employing a lower resolution for the halo calculation, it was demonstrated how the improved beam model could be included without prohibitively prolonging the calculation time. Since the implementation was based on a widely used algorithm, it was further predicted that by careful tuning, the dose calculation engine would be able to reproduce the dose from a general beamline with sufficient accuracy. Based on the presented results, it was concluded that, by using a single graphics processing unit, dose calculation using the pencil beam algorithm could be made sufficiently fast for online dose monitoring, whilst maintaining the accuracy of current clinical systems.
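The central operation discussed above, superposition of a spatially varying kernel, can be organised either as a gather (each output point collects contributions) or as a scatter (each pencil-beam element spreads its contribution). The 1-D sketch below contrasts the two with an assumed depth-dependent Gaussian kernel; it is illustrative only and not the thesis's implementation.

    import numpy as np

    # 1-D sketch of kernel superposition with a spatially varying Gaussian
    # kernel: the scatter formulation loops over sources and spreads dose,
    # the gather formulation loops over output positions and collects it.
    # Kernel widths and weights are assumed for illustration only.

    n = 200
    x = np.arange(n, dtype=float)
    weights = np.exp(-x / 80.0)            # pencil-beam element weights
    sigma = 1.0 + 0.02 * x                 # kernel width grows with depth

    def kernel(center, s):
        return np.exp(-0.5 * ((x - center) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    # Scatter: each source element writes its (weighted) kernel to the output.
    dose_scatter = np.zeros(n)
    for i in range(n):
        dose_scatter += weights[i] * kernel(x[i], sigma[i])

    # Gather: each output point sums contributions from all source elements.
    dose_gather = np.array([
        np.sum(weights * np.exp(-0.5 * ((xi - x) / sigma) ** 2)
               / (sigma * np.sqrt(2 * np.pi)))
        for xi in x
    ])

    print(np.allclose(dose_scatter, dose_gather))  # same result, different loop order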
