Global ETD Search

481	MPI WITHIN A GPU Young, Bobby Dalton 01 January 2009 GPUs offer high-performance floating-point computation at commodity prices, but their usage is hindered by programming models which expose the user to irregularities in the current shared-memory environments and require learning new interfaces and semantics. This thesis will demonstrate that the message-passing paradigm can be conceptually cleaner than the current data-parallel models for programming GPUs because it can hide the quirks of current GPU shared-memory environments, as well as GPU-specific features, behind a well-established and well-understood interface. This will be shown by demonstrating a proof-of-concept MPI implementation which provides cleaner, simpler code with a reasonable performance cost. This thesis will also demonstrate that, although there is a virtualization constraint imposed by MPI, this constraint is harmless as long as the virtualization was already chosen to be optimal in terms of a strong execution model and nearly-optimal execution time. This will be demonstrated by examining execution times with varying virtualization using a computationally-expensive micro-kernel. message-passing virtualization data-parallel virtualization MPI GPU Electrical and Computer Engineering
482	COLLECTIVE COMMUNICATION AND BARRIER SYNCHRONIZATION ON NVIDIA CUDA GPU Rivera-Polanco, Diego Alejandro 01 January 2009 GPUs (Graphics Processing Units) employ a multi-threaded execution model using multiple SIMD cores. Compared to use of a single SIMD engine, this architecture can scale to more processing elements. However, GPUs sacrifice the timing properties which made barrier synchronization implicit and collective communication operations fast. This thesis demonstrates efficient methods by which these aggregate functions can be implemented using unmodified NVIDIA CUDA GPUs. Although NVIDIA's highest “compute capability” GPUs provide atomic memory functions, these functions have order N execution time. In contrast, the methods proposed here take advantage of basic properties of the GPU architecture to make implementations that are both efficient and portable to all CUDA-capable GPUs. A variety of coordination operations are synthesized, and the algorithm, CUDA code, and performance of each are discussed in detail. GPU barrier synchronization CUDA constant time race resolution global block synchronization Electrical and Computer Engineering
483	Nonnegative matrix factorization for clustering Kuang, Da 27 August 2014 This dissertation shows that nonnegative matrix factorization (NMF) can be extended to a general and efficient clustering method. Clustering is one of the fundamental tasks in machine learning. It is useful for unsupervised knowledge discovery in a variety of applications such as text mining and genomic analysis. NMF is a dimension reduction method that approximates a nonnegative matrix by the product of two lower rank nonnegative matrices, and has shown great promise as a clustering method when a data set is represented as a nonnegative data matrix. However, challenges in the widespread use of NMF as a clustering method lie in its correctness and efficiency: First, we need to know why and when NMF can detect the true clusters and is guaranteed to deliver good clustering quality; second, existing algorithms for computing NMF are expensive and often take longer than other clustering methods. We show that the original NMF can be improved in both respects in the context of clustering. Our new NMF-based clustering methods can achieve better clustering quality and run orders of magnitude faster than the original NMF and other clustering methods. Like other clustering methods, NMF places an implicit assumption on the cluster structure. Thus, the success of NMF as a clustering method depends on whether the representation of data in a vector space satisfies that assumption. Our approach to extending the original NMF to a general clustering method is to switch from the vector space representation of data points to a graph representation. The new formulation, called Symmetric NMF, takes a pairwise similarity matrix as an input and can be viewed as a graph clustering method. We evaluate this method on document clustering and image segmentation problems and find that it achieves better clustering accuracy.
In addition, for the original NMF, it is difficult but important to choose the right number of clusters. We show that consensus NMF, widely used in genomic analysis for choosing the number of clusters, has critical flaws and can produce misleading results. We propose a variation of the prediction strength measure arising from statistical inference to evaluate the stability of clusters and select the right number of clusters. Our measure shows promising performance in artificial simulation experiments. Large-scale applications bring substantial efficiency challenges to existing algorithms for computing NMF. An important example is topic modeling, where users want to uncover the major themes in a large text collection. Our strategy for accelerating NMF-based clustering is to design algorithms that better suit the computer architecture and exploit the computing power of parallel platforms such as graphics processing units (GPUs). A key observation is that recursively applying rank-2 NMF, which partitions a data set into two clusters, is much faster than applying the original NMF to obtain a flat clustering. We take advantage of a special property of rank-2 NMF and design an algorithm that runs faster than existing algorithms due to continuous memory access. Combined with a criterion to stop the recursion, our hierarchical clustering algorithm runs significantly faster and achieves even better clustering quality than existing methods. Another bottleneck of NMF algorithms, which is also a common bottleneck in many other machine learning applications, is multiplying a large sparse data matrix by a tall-and-skinny dense matrix. We use GPUs to accelerate this routine for sparse matrices with an irregular sparsity structure. Overall, our algorithm shows significant improvement over popular topic modeling methods such as latent Dirichlet allocation, and runs more than 100 times faster on data sets with millions of documents.
Nonnegative matrix factorization Cluster analysis Hierarchical clustering Cancer subtype discovery GPU computing Sparse matrix multiplication
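As an illustration of the factorization described in entry 483, here is a minimal sketch of NMF in plain Python. It uses the classic multiplicative-update rules, which are a swapped-in textbook technique and not the dissertation's rank-2 or Symmetric NMF algorithms. The function names (`nmf`, `frob_error`, `row_labels`), the toy matrix, and all parameters are assumptions made for this example.

```python
import random


def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Approximate nonnegative X (m x n) by W (m x k) times H (k x n).

    Multiplicative updates keep every entry of W and H nonnegative.
    """
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


def row_labels(W):
    """Assign each row (data point) to the component with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]


# Hypothetical toy data: rows 0-1 and rows 2-3 use disjoint sets of columns.
X = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 4, 3]]
W, H = nmf(X, k=2)
labels = row_labels(W)
```

Reading a cluster label off the largest entry in each row of `W` is the usual way a factorization is turned into a flat clustering; applying a rank-2 factorization recursively, as the abstract describes, would instead split the data in two at each step.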
484	Optimized Graphics for Handheld Real-time CG Applications Powell, Robin January 2014 The purpose of this theoretical thesis is to research the problem of producing optimized content for handheld real-time CG applications. This thesis is based on studies of the existing literature on the subject. It explores how to deal with draw calls, shader optimizations, and hard/smooth edges, how engines deal with polygons and textures, and how this correlates with the technical limitations of current game engines for handheld real-time applications such as those on the iPhone. On these grounds, the thesis expands on how those topics affect the hardware, more specifically the GPU and CPU on mobile devices. Today, the hardware used in mobile devices on the market is limited; it is important to understand that many factors come into play, and that making optimization a part of the design, not a final step in the creation of content, is vital when creating CG applications for mobile devices. A closer look at the available evidence suggests that, when dealing with the limited resources available, every performance aspect should be closely examined. It is vital to optimize the content and align the game budget with the targeted device. In the end it will make for a better game. This thesis sheds light on the development strategies for CG applications on handheld devices. 3D game art mobile devices graphic optimization 3D models GPU CPU
485	Détection d'évènements impulsionnels en environnement radioélectrique perturbé : application à l'observation des pulsars intermittents avec un système temps réel de traitement du signal Ait Allal, Dalal 16 November 2012 The work presented in this thesis concerns the detection of intermittent impulsive events originating from pulsars. These astrophysical objects are highly magnetized, rapidly rotating neutron stars that emit a radio beam sweeping through space like the lamp of a lighthouse. They can be detected with dedicated instrumentation. In recent years, new categories of pulsars with extreme characteristics have been discovered, in particular with individual pulses that are more intense and more irregular than average. These pulses must be detected in real time in a radio environment disturbed by telecommunication signals. This study proposes radio frequency interference (RFI) mitigation algorithms suited to this context. Several RFI mitigation methods are presented and compared. Two of them were selected and compared by means of Monte Carlo simulations, using a set of parameters simulating the pulsar and a BPSK signal with various powers and durations. For the search for new pulsars, an alternative method (SIPSFAR) is proposed that combines real-time search capability with robustness against RFI. It is based on the 2D Fourier transform and the Radon transform. A theoretical comparative study compared the sensitivity of this new method with that of the method commonly used by radio astronomers. The method was implemented on a GTX285 GPU and tested on a large sky survey carried out at the Nançay radio telescope. The results obtained led to a complementary statistical comparison based on real data.
[SDU:OTHER] Planet and Universe/Other Radio frequency interference Detection Radon GPU Pulsar
486	Traitement STAP en environnement hétérogène. Application à la détection radar et implémentation sur GPU Degurse, Jean-François 15 January 2014 Space-time adaptive processing (STAP) jointly exploits the spatial and temporal dimensions of the signals received on an antenna array in order to filter them, unlike conventional array processing, which exploits only the spatial dimension. Such processing is particularly attractive for filtering the ground echoes received by an airborne radar, for which there is a direct relationship between direction of arrival and Doppler frequency. However, although the principles of STAP are now well established, its practical implementation in a real environment runs into difficulties that remain unresolved in the operational radar context. The first obstacle, addressed in the first part of the thesis, is theoretical: it consists in defining procedures for estimating the clutter covariance matrix from a selection of representative training data, in a context of both non-homogeneous clutter and a sometimes high density of targets of interest. The second obstacle is technological and lies in the physical implementation of the algorithms, owing to the heavy computational load required. This point, which is crucial for airborne systems, is explored in the second part of the thesis through an analysis of the feasibility of a GPU implementation of the most demanding steps of a STAP algorithm. Radar STAP GPU Adaptive filtering
487	Power-constrained performance optimization of GPU graph traversal McLaughlin, Adam Thomas 13 January 2014 Graph traversal represents an important class of graph algorithms that is the nucleus of many large-scale graph analytics applications. While improving the performance of such algorithms using GPUs has received attention, understanding and managing performance under power constraints has not yet received similar attention. This thesis first explores the power and performance characteristics of breadth-first search (BFS) via measurements on a commodity GPU. We utilize this analysis to address the problem of minimizing execution time below a predefined power limit, or power cap, exposing key relationships between graph properties and power consumption. We modify the firmware on a commodity GPU to measure power usage and use the GPU as an experimental system to evaluate future architectural enhancements for the optimization of graph algorithms. Specifically, we propose and evaluate power management algorithms that scale i) the GPU frequency or ii) the number of active GPU compute units for a diverse set of real-world and synthetic graphs. Compared to scaling either frequency or compute units individually, our proposed schemes reduce execution time by an average of 18.64% by adjusting the configuration based on the inter- and intra-graph characteristics. GPU architecture Graph algorithms Power-constrained environments Graphics processing units
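To make the traversal studied in entry 487 concrete, the following is a minimal CPU-side sketch of level-synchronous breadth-first search. On a GPU, one thread would typically expand each frontier vertex at every level; this Python version only illustrates that frontier structure and is not the thesis's implementation. The function name `bfs_levels` and the adjacency-dictionary input format are assumptions made for this example.

```python
def bfs_levels(adj, source):
    """Return a dict mapping each reachable vertex to its BFS depth.

    adj maps a vertex to an iterable of its neighbours (hypothetical format).
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # Each frontier vertex is expanded independently; this loop is the
        # part a data-parallel implementation would spread across threads.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Hypothetical four-vertex graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_levels(graph, 0)
```

The size of `frontier` changes from level to level and from graph to graph, which is one plausible reason a traversal's resource demand varies with graph properties, as the abstract reports.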
488	Parallel algorithm design and implementation of regular/irregular problems: an in-depth performance study on graphics processing units Solomon, Steven 16 January 2012 Recently, interest in the Graphics Processing Unit (GPU) for general purpose parallel applications development and research has grown. Much of the current research on the GPU focuses on the acceleration of regular problems, as irregular problems typically do not provide the same level of performance on the hardware. We explore the potential of the GPU by investigating four problems on the GPU with regular and/or irregular properties: lookback option pricing (regular), single-source shortest path (irregular), maximum flow (irregular), and the task matching problem using multi-swarm particle swarm optimization (regular with elements of irregularity). We investigate the design, implementation, optimization, and performance of these algorithms on the GPU, and compare the results. Our results show that the regular problem achieves greater performance and requires less development effort than the irregular problems. However, we find the GPU to still be capable of providing high levels of acceleration for irregular problems. Parallel Computing GPU CUDA Combinatorial Optimization Regular/Irregular Problems Option Pricing Particle Swarm Optimization
489	Exploiting Parallelism in GPUs Hechtman, Blake Alan January 2014 Heterogeneous processors with accelerators provide an opportunity to improve performance within a given power budget. Many of these heterogeneous processors contain Graphics Processing Units (GPUs) that can perform graphics and embarrassingly parallel computation orders of magnitude faster than a CPU while using less energy. Beyond these obvious applications for GPUs, a larger variety of applications can benefit from a GPU's large computation and memory bandwidth. However, many of these applications are irregular and, as a result, require synchronization and scheduling that are commonly believed to perform poorly on GPUs. The basic building block of synchronization and scheduling is memory consistency, which is, therefore, the first place to look for improving performance on irregular applications. In this thesis, we approach the programmability of irregular applications on GPUs by thinking across traditional boundaries of the compute stack. We think about architecture, microarchitecture and runtime systems from the programmer's perspective. To this end, we study architectural memory consistency on future GPUs with cache coherence. In addition, we design a GPU memory system microarchitecture that can support fine-grain and coarse-grain synchronization without sacrificing throughput. Finally, we develop a task runtime that embraces the GPU microarchitecture to perform well on fork/join parallelism desired by many programmers. Overall, this thesis contributes non-intuitive solutions to improve the performance and programmability of irregular applications from the programmer's perspective. / Dissertation Computer engineering Computer science Cache Coherence GPU Memory Consistency Task Parallelism
490	MR-CUDASW - GPU accelerated Smith-Waterman algorithm for medium-length (meta)genomic data 2014 November 1900 The idea of using a graphics processing unit (GPU) for more than simply graphic output purposes has been around for quite some time in scientific communities. However, it is only recently that its benefits for a range of compute-intensive bioinformatics and life sciences tasks have been recognized. This thesis investigates the possibility of improving the performance of the overlap determination stage of an Overlap Layout Consensus (OLC)-based assembler by using a GPU-based implementation of the Smith-Waterman algorithm. In this thesis an existing GPU-accelerated sequence alignment algorithm is adapted and expanded to reduce its completion time. A number of improvements and changes are made to the original software. Workload distribution, query profile construction, and thread scheduling techniques implemented by the original program are replaced by custom methods specifically designed to handle medium-length reads. Accordingly, this algorithm is the first highly parallel solution that has been specifically optimized to process medium-length nucleotide reads (DNA/RNA) from modern sequencing machines (i.e. Ion Torrent). Results show that the software reaches up to 82 GCUPS (Giga Cell Updates Per Second) on a single-GPU graphics card running in commodity desktop hardware. As a result it is the fastest GPU-based implementation of the Smith-Waterman algorithm tailored for processing medium-length nucleotide reads. Despite being designed for performing the Smith-Waterman algorithm on medium-length nucleotide sequences, this program also presents great potential for improving heterogeneous computing with CUDA-enabled GPUs in general and is expected to make contributions to other research problems that require sensitive pairwise alignment to be applied to a large number of reads.
Our results show that it is possible to improve the performance of bioinformatics algorithms by taking full advantage of the compute resources of the underlying commodity hardware. These results are especially encouraging since GPU performance grows faster than that of multi-core CPUs. Bioinformatics Sequence Alignment Smith-Waterman Algorithm GPU Computing CUDA Sequence Assembly Metagenomics Next-Generation-Sequencing
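For readers unfamiliar with the algorithm named in entry 490, here is a minimal sketch of Smith-Waterman local alignment scoring in plain Python. It uses a simple linear gap penalty and returns only the best score. The scoring values and the function name `smith_waterman` are assumptions made for this example, and none of the thesis's GPU-specific techniques (query profiles, workload distribution, thread scheduling) are reproduced.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Each inner iteration is one "cell update" in the GCUPS sense.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score_same = smith_waterman("ACGT", "ACGT")
score_none = smith_waterman("AAAA", "TTTT")
```

Aligning two reads of lengths m and n fills m × n cells, so a throughput figure in GCUPS is the total number of cells filled divided by the run time in seconds, in units of 10^9.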
