Global ETD Search

91	Performance Optimization of Memory-Bound Programs on Data Parallel Accelerators Sedaghati Mokhtari, Naseraddin 08 June 2016 No description available. Computer Science Computer Engineering Engineering
92	A New Representation of Structured Grids for Matrix-vector Operation and Optimization of Doitgen Kernel Murugandi, Iyyappa Thirunavukkarasu 27 September 2010 No description available. Computer Science Structure Grid Doitgen Matrix Vector multiplication vectorization Orio SIMD Tuning Optimization MADNESS
93	Automatic Code Generation for Stencil Computations on GPU Architectures Holewinski, Justin A. 19 December 2012 No description available. Computer Engineering Computer Science GPU SIMD stencils CUDA OpenCL code generation dynamic analysis
94	Performance Optimization of Public Key Cryptography on Embedded Platforms Pabbuleti, Krishna Chaitanya 23 May 2014 Embedded systems are so ubiquitous that they account for almost 90% of all computing devices. They range from very small-scale devices with an 8-bit microcontroller and a few kilobytes of RAM to large-scale devices featuring PC-like performance with full-blown 32-bit or 64-bit processors, special-purpose acceleration hardware and several gigabytes of RAM. Each of these classes of embedded systems has a unique set of challenges in terms of hardware utilization, performance and power consumption. As network connectivity becomes a standard feature in these devices, security becomes an important concern. Public Key Cryptography (PKC) is an indispensable tool for implementing the security features these embedded platforms need. In this thesis, we provide optimized PKC solutions on platforms belonging to two extreme classes of the embedded system spectrum. First, we target two high-end embedded platforms, Qualcomm Snapdragon and Intel Atom. Each of these platforms features a dual-core processor, a GPU and a gigabyte of RAM. We use the SIMD coprocessor built into these processors to accelerate the modular arithmetic, which accounts for the majority of execution time in Elliptic Curve Cryptography. We exploit the structure of NIST primes to perform the reduction step as we perform the multiplication. Our implementation runs more than twice as fast as the OpenSSL implementations on the respective platforms. The second platform we targeted is an energy-harvested wireless sensor node which has a 16-bit MSP430 microcontroller and a low-power RF interface. The system derives its power from a solar panel and is constrained in terms of available energy and computational power. We analyze the computation and communication energy requirements for different signature schemes, each with a different trade-off between computation and communication.
We investigate the Elliptic Curve Digital Signature Algorithm (ECDSA), the Lamport-Diffie one-time hash-based signature scheme (LD-OTS) and the Winternitz one-time hash-based signature scheme (W-OTS). We demonstrate that there is a trade-off between energy needs, security level and algorithm selection. However, when we consider the energy needs of the overall system, we show that all schemes are within one order of magnitude of one another. / Master of Science Elliptic Curve Cryptography Modular Arithmetic SIMD Hash-based Signatures MSP430 Wireless Sensor Node
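To illustrate the communication cost that record 94 weighs against computation, here is a minimal Python sketch of a Lamport one-time signature, the construction behind LD-OTS. It is not the thesis's MSP430 code: the use of SHA-256, the 256-bit digest length, and the helper names `keygen`, `sign` and `verify` are all assumptions chosen for illustration.

```python
import hashlib
import secrets

N_BITS = 256  # assumed digest length (SHA-256), one key pair per digest bit


def H(data: bytes) -> bytes:
    """Hash function used for both the keys and the message digest (assumed SHA-256)."""
    return hashlib.sha256(data).digest()


def keygen():
    """Create 256 pairs of random secrets and publish the hash of each secret."""
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(N_BITS)]
    pk = [(H(s0), H(s1)) for s0, s1 in sk]
    return sk, pk


def message_bits(message: bytes):
    """Return the bits of H(message), most significant bit first."""
    digest = int.from_bytes(H(message), "big")
    return [(digest >> (N_BITS - 1 - i)) & 1 for i in range(N_BITS)]


def sign(sk, message: bytes):
    """Reveal, for each digest bit, the secret that corresponds to that bit's value."""
    return [sk[i][bit] for i, bit in enumerate(message_bits(message))]


def verify(pk, message: bytes, signature) -> bool:
    """Check that each revealed secret hashes to the published value for its bit."""
    bits = message_bits(message)
    return len(signature) == N_BITS and all(
        H(s) == pk[i][bit] for i, (s, bit) in enumerate(zip(signature, bits))
    )
```

Under these assumed parameters a signature is 256 × 32 = 8192 bytes, which shows why a hash-based scheme can be cheap to compute yet expensive to transmit. A key pair must be used for one message only.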
95	Vector Instruction Set Extensions for Efficient and Reliable Computation of Keccak Rawat, Hemendra Kumar 27 August 2016 Recent processor architectures such as Intel Westmere (and later) and ARMv8 include instruction-level support for the Advanced Encryption Standard (AES), for the Secure Hashing Standard (SHA-1, SHA-2) and for carry-less multiplication. These crypto-instructions are optimized for a single algorithm and provide significant performance improvements over software written using a general-purpose instruction set. However, today's secure systems and protocols rely not on just one but on a suite of many cryptographic applications that are expected to work in a correct and reliable manner. In this work, we propose a new instruction set for supporting efficient and reliable cryptography on modern processors. For efficiency, we propose flexible instruction set extensions for Keccak, a cryptographic kernel for hashing, authenticated encryption, key-stream generation and random-number generation. Keccak is the basis of the SHA-3 standard and the newly proposed Keyak and Ketje authenticated ciphers. For reliability, we propose a set of trusted instructions to verify the integrity of a cryptographic software library. These instructions are aimed at detecting tampering in the software or in the configurable hardware. We develop the instruction extensions for a 128-bit interface, commonly available in the vector processing unit of many modern processors. Simulation results on the GEM5 architectural simulator show that the proposed instructions not only improve the performance of Keccak applications by 2 times (over NEON programming) and 6 times (over assembly programming), but also improve the reliability of applications at a performance overhead of just 6%. / Master of Science SIMD Instruction Set Extensions SHA-3 Hashing Authenticated Encryption Software Integrity
96	Isolating Drone Frequencies in a Real-Time Drone Detection System Teglund, Jonas January 2024 The problems caused by commercial drones in air traffic, airports, and vital and military installations have increased the demand for drone detection and tracking systems. An acoustic beamforming system that tracks audio sources using 256 microphones in real time was extended to detect and track drones. This thesis studied software-defined, multi-channel, real-time filtering solutions to improve the system's drone detection and tracking capabilities. The sound frequencies of drone sound and disturbance noise were analyzed to create a suitable filter. Methods for implementing this filter on all channels while still operating in real time were studied. SIMD intrinsics were used to create a few candidate algorithms, and a GPU algorithm was created as well. These algorithms were compared to each other based on execution time metrics, and the system was also analyzed for performance degradation and placement of the filtering algorithms. Measured in isolation, the best SIMD algorithm took 0.41 milliseconds and the GPU algorithm 0.12 milliseconds to filter 256 samples from all 256 channels. The real-time constraint was around 5.2 milliseconds, meaning both solutions operated well below the limit. When the system was placed outdoors in a windy environment, it clearly found the drone 48% of the time without any filtering and 89% of the time with filtering. The filter also improved the signal-to-noise ratio by 21 dB. The results show that a software-defined, multi-channel, real-time filter operating on a large data stream is a viable solution for real-time DSP applications. Filtering proved to be a good solution when specializing a beamforming application to track a desired frequency.
Digital filtering Real-time computing Embedded system SIMD GPU Computer and Information Sciences Data- och informationsvetenskap
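The timing figures reported in record 96 can be put in proportion with a few lines of arithmetic. The Python sketch below uses only the numbers quoted in the abstract. The variable names are illustrative, and the implied sample rate rests on an assumption the abstract does not state: that the 5.2 ms limit is the time needed to capture one 256-sample block.

```python
BLOCK_SAMPLES = 256   # samples filtered per channel in each block (from the abstract)
DEADLINE_MS = 5.2     # approximate real-time constraint (from the abstract)
SIMD_MS = 0.41        # best SIMD filter time (from the abstract)
GPU_MS = 0.12         # GPU filter time (from the abstract)


def budget_share(runtime_ms: float, deadline_ms: float = DEADLINE_MS) -> float:
    """Fraction of the real-time budget consumed by one filtering pass."""
    return runtime_ms / deadline_ms


simd_share = budget_share(SIMD_MS)
gpu_share = budget_share(GPU_MS)

# Assumption: the deadline equals the capture time of one block,
# so the sample rate is block size divided by block duration.
implied_rate_hz = BLOCK_SAMPLES / (DEADLINE_MS / 1000.0)

print(f"SIMD filter share of budget: {simd_share:.1%}")
print(f"GPU filter share of budget:  {gpu_share:.1%}")
print(f"Implied sample rate:         {implied_rate_hz:.0f} Hz")
```

Both filters use less than a tenth of the available time, which leaves the rest of the budget for beamforming and tracking.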
97	Pic-Vert : une implémentation de la méthode particulaire pour architectures multi-coeurs / Pic-Vert : a particle-in-cell implementation for multi-core architectures Barsamian, Yann 31 October 2018 In this thesis, we are interested in solving the Vlasov–Poisson system of equations (useful in the domain of plasma physics, for example within the ITER project) using classical Particle-in-Cell (PIC) and semi-Lagrangian methods. The main contribution of our thesis is an efficient implementation of the PIC method on multi-core architectures, written in C, called Pic-Vert. Our implementation (a) achieves a close-to-minimal number of memory transfers with the main memory, (b) exploits SIMD instructions for numerical computations, and (c) exhibits a high degree of shared-memory parallelism.
To put our work in perspective with respect to the state of the art, we propose a metric to compare the efficiency of different PIC implementations when using different multi-core architectures. Our implementation is 3 times faster than other recent implementations on the same architecture (Intel Haswell). Computer science Parallelism Particle-in-cell Semi-Lagrangian Plasma physics Multi-core SIMD architecture Shared memory 005.4
98	From a Comprehensive Experimental Survey to a Cost-based Selection Strategy for Lightweight Integer Compression Algorithms Damme, Patrick, Ungethüm, Annett, Hildebrandt, Juliana, Habich, Dirk, Lehner, Wolfgang 11 January 2023 Lightweight integer compression algorithms are frequently applied in in-memory database systems to tackle the growing gap between processor speed and main memory bandwidth. In recent years, the vectorization of basic techniques such as delta coding and null suppression has considerably enlarged the corpus of available algorithms. As a result, today there is a large number of algorithms to choose from, while different algorithms are tailored to different data characteristics. However, a comparative evaluation of these algorithms with different data and hardware characteristics has never been sufficiently conducted in the literature. To close this gap, we conducted an exhaustive experimental survey by evaluating several state-of-the-art lightweight integer compression algorithms as well as cascades of basic techniques. We systematically investigated the influence of data as well as hardware properties on the performance and the compression rates. The evaluated algorithms are based on publicly available implementations as well as our own vectorized reimplementations. We summarize our experimental findings leading to several new insights and to the conclusion that there is no single-best algorithm. Moreover, in this article, we also introduce and evaluate a novel cost model for the selection of a suitable lightweight integer compression algorithm for a given dataset. info:eu-repo/classification/ddc/510 ddc:510 info:eu-repo/classification/ddc/004 ddc:004
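Record 98 names delta coding and null suppression as basic techniques that can be cascaded. The following is a minimal scalar Python sketch of such a cascade, not one of the vectorized algorithms evaluated in the article. The function names are hypothetical, and the sketch assumes a non-decreasing input of at least two integers so that every delta is non-negative.

```python
def delta_encode(values):
    """Delta coding: keep the first value and the differences between neighbours."""
    base = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return base, deltas


def delta_decode(base, deltas):
    """Rebuild the original values by accumulating the deltas onto the base."""
    out = [base]
    for d in deltas:
        out.append(out[-1] + d)
    return out


def pack(values):
    """Null suppression: store every value with only as many bits as the largest needs."""
    width = max(1, max(values).bit_length())
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return width, packed.to_bytes(nbytes, "little")


def unpack(width, count, data):
    """Extract `count` fixed-width values from the packed bytes."""
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


def compress(values):
    """Cascade: delta coding first, then null suppression of the deltas."""
    base, deltas = delta_encode(values)
    width, data = pack(deltas)
    return base, len(deltas), width, data


def decompress(base, count, width, data):
    """Invert the cascade in reverse order."""
    return delta_decode(base, unpack(width, count, data))
```

For the sorted input `[1000, 1003, 1004, 1010, 1011, 1020]` the deltas are `[3, 1, 6, 1, 9]`, which fit in 4 bits each, so the five deltas occupy 3 bytes plus the base value. Delta coding pays off on data like this because it turns large, slowly growing values into small ones that null suppression can store compactly.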
99	SIMD Optimizations of Software Rendering in 2D Video Games / SIMD optimeringar i mjukvarurendering av 2D spel Mendel, Oskar, Bergström, Jesper January 2019 Optimizing rendering is one of the greatest challenges faced by game developers. Most game engines make use of hardware rendering, which uses technology specifically built for rendering. Before such hardware existed, game developers had to rely on the CPU to render their games. This is known as software rendering. Software rendering is not commonly used nowadays, but it still serves as a fallback when the end user's machine does not support the application's hardware-based renderer. Since the CPU, unlike the GPU, is not purpose-built for rendering, the developer has to perform optimizations to make the renderer faster. In this thesis, we present an approach based on Single Instruction, Multiple Data (SIMD), a form of parallel programming. This technique operates on vector-based registers, which means operations can be performed on multiple pieces of data at once. This is applied to an already built game engine in order to optimize its rendering. The results show a speed-up of 90.5% and a framerate increase from 30 frames per second to 133 frames per second within the rendering routine. SIMD AVX SSE CPU GPU Parallel programming Optimization Game development Game engine x86 Haswell Rendering Computer Sciences Datavetenskap (datalogi)
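The idea in record 99 of operating on multiple pieces of data at once can be imitated in plain Python by packing four 8-bit colour channels into one 32-bit integer and averaging all four with a single expression. This "SIMD within a register" trick is only a stand-in for the SSE/AVX registers the thesis uses. The choice of pixel averaging as the operation, and the function names, are assumptions made for illustration.

```python
LOW_BITS_CLEARED = 0xFEFEFEFE  # mask that clears the lowest bit of each byte


def average_packed(a: int, b: int) -> int:
    """Average four 8-bit channels of two 32-bit pixels in one pass.

    Uses the identity floor((x + y) / 2) == (x & y) + ((x ^ y) >> 1) per byte.
    The mask stops the shift from leaking a bit into the neighbouring channel.
    """
    return (a & b) + (((a ^ b) & LOW_BITS_CLEARED) >> 1)


def average_scalar(a: int, b: int) -> int:
    """Reference version: unpack each channel, average it, and repack."""
    result = 0
    for shift in (0, 8, 16, 24):
        channel_a = (a >> shift) & 0xFF
        channel_b = (b >> shift) & 0xFF
        result |= ((channel_a + channel_b) // 2) << shift
    return result
```

The packed version handles four channels with a handful of integer operations, whereas the scalar version loops four times. Hardware SIMD registers apply the same principle to 16 or 32 bytes per instruction.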
100	Architecture et validation comportementale en VHDL d'un calculateur parallèle dédié à la vision / Architecture and behavioral validation in VHDL of a parallel computer dedicated to vision Collette, Thierry 14 September 1992 Currently, image processing operations are accelerated mainly by using parallel computers. Such single-instruction, multiple-data (SIMD) processors have been developed, but although they prove efficient for so-called low-level image processing operations, where the data structure remains the same, they run into many problems with medium- and high-level operations. In particular, medium-level operations require a random reorganization of the data across the processors, a task that is difficult to execute on synchronous parallel structures with distributed memory. The aim of this thesis was to extend the capabilities of a SIMD computer so that it can efficiently execute medium-level image processing operations. A study of algorithms representative of this class of operations identifies the limitations of this computer, which architectural modifications make it possible to overcome. This led to the proposal of Sympatix, the new SIMD computer. To validate it, a behavioral model was written in VHDL, a hardware description language. With this model, the performance of the new structure is measured directly by simulating image processing algorithms. The VHDL modeling approach also supports top-down electronic design of the system, which in turn makes it easy to relate architectural modifications of the system to their electronic cost.
The results show that Sympatix is suited to low- and medium-level image processing operations, that it is open to a high-level computer, and that it can support other vision applications. This manuscript also presents a top-down design methodology, based on VHDL, intended for architects of electronic systems. vision image processing parallelism network multiprocessor behavioral simulation electronic design SIMD VHDL SYMPATIX

Search results