• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 86
  • 13
  • 10
  • 8
  • 7
  • 4
  • 4
  • 2
  • 1
  • Tagged with
  • 148
  • 64
  • 36
  • 32
  • 24
  • 23
  • 20
  • 19
  • 18
  • 17
  • 16
  • 14
  • 14
  • 14
  • 14
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
121

A model of dynamic compilation for heterogeneous compute platforms

Kerr, Andrew 10 December 2012 (has links)
Trends in computer engineering place renewed emphasis on increasing parallelism and heterogeneity. The rise of parallelism adds an additional dimension to the challenge of portability, as different processors support different notions of parallelism, whether vector parallelism executing in a few threads on multicore CPUs or large-scale thread hierarchies on GPUs. Thus, software experiences obstacles to portability and efficient execution beyond differences in instruction sets; rather, the underlying execution models of radically different architectures may not be compatible. Dynamic compilation applied to data-parallel heterogeneous architectures presents an abstraction layer decoupling program representations from optimized binaries, thus enabling portability without encumbering performance. This dissertation proposes several techniques that extend dynamic compilation to data-parallel execution models. These contributions include: - characterization of data-parallel workloads - machine-independent application metrics - framework for performance modeling and prediction - execution model translation for vector processors - region-based compilation and scheduling We evaluate these claims via the development of a novel dynamic compilation framework, GPU Ocelot, with which we execute real-world workloads from GPU computing. This enables the execution of GPU computing workloads to run efficiently on multicore CPUs, GPUs, and a functional simulator. We show data-parallel workloads exhibit performance scaling, take advantage of vector instruction set extensions, and effectively exploit data locality via scheduling which attempts to maximize control locality.
122

Développement d'algorithmes d'imagerie et de reconstruction sur architectures à unités de traitements parallèles pour des applications en contrôle non destructif

Pedron, Antoine 28 May 2013 (has links) (PDF)
La problématique de cette thèse se place à l'interface entre le domaine scientifique du contrôle non destructif par ultrasons (CND US) et l'adéquation algorithme architecture. Le CND US comprend un ensemble de techniques utilisées pour examiner un matériau, qu'il soit en production ou maintenance. Afin de détecter d'éventuels défauts, de les positionner et les dimensionner, des méthodes d'imagerie et de reconstruction ont été développées au CEA-LIST, dans la plateforme logicielle CIVA.L'évolution du matériel d'acquisition entraine une augmentation des volumes de données et par conséquent nécessite toujours plus de puissance de calcul pour parvenir à des reconstructions en temps interactif. L'évolution multicoeurs des processeurs généralistes (GPP), ainsi que l'arrivée de nouvelles architectures comme les GPU rendent maintenant possible l'accélération de ces algorithmes.Le but de cette thèse est d'évaluer les possibilités d'accélération de deux algorithmes de reconstruction sur ces architectures. Ces deux algorithmes diffèrent dans leurs possibilités de parallélisation. Pour un premier, la parallélisation sur GPP est relativement immédiate, contrairement à celle sur GPU qui nécessite une utilisation intensive des instructions atomiques. Quant au second, le parallélisme est plus simple à exprimer, mais l'ordonnancement des nids de boucles sur GPP, ainsi que l'ordonnancement des threads et une bonne utilisation de la mémoire partagée des GPU sont nécessaires pour obtenir un fonctionnement efficace. Pour ce faire, OpenMP, CUDA et OpenCL ont été utilisés et comparés. L'intégration de ces prototypes dans la plateforme CIVA a mis en évidence un ensemble de problématiques liées à la maintenance et à la pérennisation de codes sur le long terme.
123

Reducing Communication Through Buffers on a SIMD Architecture

Choi, Jee W. 13 May 2004 (has links)
Advances in wireless technology and the growing popularity of multimedia applications have brought about a need for energy efficient and cost effective portable supercomputers capable of delivering performance beyond the capabilities of current microprocessors and DSP chips. The SIMPil architecture currently being developed at Georgia Institute of Technology is a promising candidate for this task. In order to develop applications for SIMPil, a high level language and an optimizing compiler for the language are essential. However, with the recent trend of interconnect latency becoming a major bottleneck on computer systems, optimizations focusing on reducing latency are becoming more important, especially with SIMPil, as it is highly scalable. The compiler tracks the path of data through the network and buffers data in each processor to eliminate redundant communication. With a buffer size of 5, the compiler was able to eliminate 96 percent of the redundant communication for a 9x9 convolution and 8x8 DCT algorithms. With 5x5 convolution, only 89 percent elimination was observed. In terms of performance, 106 percent speedup was observed with 9x9 convolution at buffer size of 5 while 5x5 convolution and 8x8 DCT which have a much lower number of communication showed only 101 percent speedup.
124

Development and implementation of highly parallel algorithms for decoding perfect space-time block codes .

Amani, Kikongo Elie. January 2012 (has links)
M. Tech. Electrical Engineering. / Applies conditional optimisation to ML decoding of perfect STBCs, it is hypothesised that the obtained algorithms have reduced complexity and exhibit high DLP and TLP that can be exploited to map them on low-power multi-core SIMD processors, and possibly to reduce their runtimes and allow their real-time execution in a 4G wireless system.
125

SEPARATING INSTRUCTION FETCHES FROM MEMORY ACCESSES : ILAR (INSTRUCTION LINE ASSOCIATIVE REGISTERS)

Lim, Nien Yi 01 January 2009 (has links)
Due to the growing mismatch between processor performance and memory latency, many dynamic mechanisms which are “invisible” to the user have been proposed: for example, trace caches and automatic pre-fetch units. However, these dynamic mechanisms have become inadequate due to implicit memory accesses that have become so expensive. On the other hand, compiler-visible mechanisms like SWAR (SIMD Within A Register) and LARs (Line Associative Registers) are potentially more effective at improving data access performance. This thesis investigates applying the same ideas to improve instruction access. ILAR (Instruction LARs) store instructions in wide registers. Instruction blocks are explicitly loaded into ILAR, using block compression to enhance memory bandwidth. The control flow of the program then refers to instructions directly by their position within an ILAR, rather than by lengthy memory addresses. Because instructions are accessed directly from within registers, there is no implicit instruction fetch from memory. This thesis proposes an instruction set architecture for ILAR, investigates a mechanism to load ILAR using the best available block compression algorithm and also develop hardware descriptions for both ILAR and a conventional memory cache model so that performance comparisons could be made on the instruction fetch stage.
126

Intel Integrated Performance Primitives a jejich využití při vývoji aplikací / Intel Integrated Performance Primitives and their use in application development

Machač, Jiří January 2008 (has links)
The aim of the presented work is to demonstrate and evaluate the contribution of computing system SIMD especially units MMX, SSE, SSE2, SSE3, SSSE3 and SSE4 from Intel company, by creation of demostrating applications with using Intel Integrated Performance Primitives library. At first, possibilities of SIMD programming using intrinsic function, vektorization and libraries Intel Integrated Performance Primitives are presented, as next are descibed options of evaluation of particular algorithms. Finally procedure of programing by using Intel Integrated Performance Primitives library are ilustrated.
127

Detektor obličejů pro platformu Android / Face Detector For Android Platform

Slavík, Roman January 2011 (has links)
This master's thesis deals with face detection on mobile phones with Android OS. The introduction describes some algorithms used for pattern detection from image, as well as various techniques of features extracting. After that Android platform development specifics, including basic description of development tools, are described. Architecture of SIMD is introduced in next part of this work. After acquiring basic knowleage analysis and implementation of final app are descrited. Performance tests are conducted whose results are summarized in the conclusion.
128

Efektivní implementace genetického algoritmu s využitím vícejádrových CPU / The Efficient Implementation of the Genetic Algorithm Using Multicore Processors

Kouřil, Miroslav January 2010 (has links)
This diploma thesis deals with acceleration of advanced genetic algorithm. For implementation, discrete and continuos versions of UMDA genetic algorithm were chosen. The main part of the acceleration is the utilization of SSE instruction set. Using this set, the functions for calculating fitness and new population sampling were accelerated in particular. Then the pseudorandom number generator that also uses SSE instruction set was implemented.  The discrete algorithm reached the speed of up to 4,6 after this implementation. Finally, the algorithms were modified so that the system  OpenMP could be used, which enables the running of blocks of code in more threads. The continuous version of algorithm is not convenient for parallelization, because computational complexity of that algorithm is low. In comparison, the discrete versions of algorithm are really appropriate for parallelization. Both the implemented versions reached the total acceleration of up to 4,9 and 7,2.
129

Jádra schématu lifting pro vlnkovou transformaci / Lifting Scheme Cores for Wavelet Transform

Bařina, David Unknown Date (has links)
Práce se zaměřuje na efektivní výpočet dvourozměrné diskrétní vlnkové transformace. Současné metody jsou v práci rozšířeny v několika směrech a to tak, aby spočetly tuto transformaci v jediném průchodu, a to případně víceúrovňově, použitím kompaktního jádra. Tohle jádro dále může být vhodně přeorganizováno za účelem minimalizace užití některých prostředků. Představený přístup krásně zapadá do běžně používaných rozšíření SIMD, využívá hierarchii cache pamětí moderních procesorů a je vhodný k paralelnímu výpočtu. Prezentovaný přístup je nakonec začleněn do kompresního řetězce formátu JPEG 2000, ve kterém se ukázal být zásadně rychlejší než široce používané implementace.
130

Akcelerace vektorových a krytografických operací na platformě x86-64 / Acceleration of Vector and Cryptographic Operations on x86-64 Platform

Šlenker, Samuel January 2017 (has links)
The aim of this thesis was to study and subsequently process a comparison of older and newer SIMD processing units of modern microprocessors on the x86-64 platform. The thesis provides an overview of the fastest computations of vector operations with matrices and vectors, including corresponding source codes. Furthermore, the thesis is focused on authenticated encryption, specifically on block cipher AES operating in Galois Counter Mode, and on a discussion of possibilities of instruction sets for cryptographic support.

Page generated in 0.0453 seconds