121

Développement d'algorithmes d'imagerie et de reconstruction sur architectures à unités de traitements parallèles pour des applications en contrôle non destructif / Development of imaging and reconstruction algorithms on parallel processing architectures for non-destructive testing applications

Pedron, Antoine 28 May 2013 (has links) (PDF)
This thesis sits at the interface between the scientific field of ultrasonic non-destructive testing (UT NDT) and the problem of matching algorithms to architectures. UT NDT covers a set of techniques used to examine a material, whether in production or in maintenance. In order to detect possible flaws, locate them and size them, imaging and reconstruction methods have been developed at CEA-LIST within the CIVA software platform. Advances in acquisition hardware lead to growing data volumes and therefore demand ever more computing power to reach interactive-time reconstructions. The move of general-purpose processors (GPPs) to multicore designs, together with the arrival of new architectures such as GPUs, now makes it possible to accelerate these algorithms. The goal of this thesis is to evaluate how far two reconstruction algorithms can be accelerated on these architectures. The two algorithms differ in how they can be parallelized. For the first, parallelization on a GPP is relatively immediate, whereas on a GPU it requires intensive use of atomic instructions. For the second, the parallelism is easier to express, but loop-nest scheduling on the GPP, as well as thread scheduling and good use of shared memory on the GPU, are needed to run efficiently. OpenMP, CUDA and OpenCL were used and compared for this purpose. Integrating these prototypes into the CIVA platform brought out a set of issues related to maintaining and sustaining code over the long term.
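As a rough illustration of why the first algorithm parallelizes easily on a multicore GPP, here is a minimal gather-style delay-and-sum sketch in C with OpenMP. The function and array names are hypothetical and do not come from CIVA; each thread owns its output pixel, so no atomics are needed, whereas a scatter formulation (looping over samples and writing into pixels) would need the atomic operations mentioned for the GPU version.

```c
/*
 * Hypothetical gather-style delay-and-sum reconstruction kernel.
 * Parallelizing over output pixels keeps every write private to one thread,
 * so a multicore CPU needs no atomic operations.
 */
#include <omp.h>

void reconstruct(float *image, int n_pixels,
                 const float *const *ascan,   /* one A-scan per array element   */
                 const int *const *delay,     /* precomputed per-pixel sample index */
                 int n_elements, int n_samples)
{
    #pragma omp parallel for schedule(static)
    for (int p = 0; p < n_pixels; ++p) {
        float acc = 0.0f;
        for (int e = 0; e < n_elements; ++e) {
            int s = delay[e][p];              /* time of flight mapped to a sample */
            if (s >= 0 && s < n_samples)
                acc += ascan[e][s];           /* gather: read-only shared data     */
        }
        image[p] = acc;                       /* each thread owns its pixel        */
    }
}
```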
122

Reducing Communication Through Buffers on a SIMD Architecture

Choi, Jee W. 13 May 2004 (has links)
Advances in wireless technology and the growing popularity of multimedia applications have brought about a need for energy-efficient and cost-effective portable supercomputers capable of delivering performance beyond the capabilities of current microprocessors and DSP chips. The SIMPil architecture currently being developed at the Georgia Institute of Technology is a promising candidate for this task. In order to develop applications for SIMPil, a high-level language and an optimizing compiler for that language are essential. With interconnect latency becoming a major bottleneck in computer systems, optimizations that reduce communication latency are increasingly important, especially for SIMPil, as it is highly scalable. The compiler tracks the path of data through the network and buffers data in each processor to eliminate redundant communication. With a buffer size of 5, the compiler eliminated 96 percent of the redundant communication for the 9x9 convolution and 8x8 DCT algorithms; for the 5x5 convolution, only 89 percent elimination was observed. In terms of performance, a 106 percent speedup was observed for the 9x9 convolution at a buffer size of 5, while the 5x5 convolution and 8x8 DCT, which involve far less communication, showed only a 101 percent speedup.
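A conceptual sketch of the buffering idea follows, assuming a per-processor FIFO of the most recently received values (the abstract reports experiments with a buffer size of 5). This models the compiler's bookkeeping in plain C and is not SIMPil or Georgia Tech code; all names are illustrative.

```c
/*
 * Conceptual model of per-PE communication buffering: before emitting a new
 * transfer, look the value up by its source coordinate; a hit replaces the
 * communication with a local buffer read.
 */
#include <stdbool.h>

#define BUF_SIZE 5   /* buffer depth matching the reported experiments */

typedef struct { int src_x, src_y; float value; bool valid; } BufEntry;
typedef struct { BufEntry slot[BUF_SIZE]; int next; } CommBuffer;

/* Returns true (and the buffered value) if the transfer would be redundant. */
static bool buffer_lookup(const CommBuffer *b, int sx, int sy, float *out)
{
    for (int i = 0; i < BUF_SIZE; ++i)
        if (b->slot[i].valid && b->slot[i].src_x == sx && b->slot[i].src_y == sy) {
            *out = b->slot[i].value;
            return true;
        }
    return false;
}

/* Record a value that really had to be communicated (FIFO replacement). */
static void buffer_record(CommBuffer *b, int sx, int sy, float v)
{
    b->slot[b->next] = (BufEntry){ sx, sy, v, true };
    b->next = (b->next + 1) % BUF_SIZE;
}
```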
123

Development and implementation of highly parallel algorithms for decoding perfect space-time block codes.

Amani, Kikongo Elie. January 2012 (has links)
M. Tech. Electrical Engineering. / This work applies conditional optimisation to ML decoding of perfect STBCs. It is hypothesised that the resulting algorithms have reduced complexity and exhibit high DLP and TLP, which can be exploited to map them onto low-power multi-core SIMD processors and, possibly, to reduce their runtimes and allow real-time execution in a 4G wireless system.
124

SEPARATING INSTRUCTION FETCHES FROM MEMORY ACCESSES: ILAR (INSTRUCTION LINE ASSOCIATIVE REGISTERS)

Lim, Nien Yi 01 January 2009 (has links)
Due to the growing mismatch between processor performance and memory latency, many dynamic mechanisms that are “invisible” to the user have been proposed, for example trace caches and automatic prefetch units. However, these dynamic mechanisms have become inadequate because implicit memory accesses have become so expensive. On the other hand, compiler-visible mechanisms like SWAR (SIMD Within A Register) and LARs (Line Associative Registers) are potentially more effective at improving data access performance. This thesis investigates applying the same ideas to improve instruction access. ILARs (Instruction LARs) store instructions in wide registers. Instruction blocks are explicitly loaded into an ILAR, using block compression to enhance memory bandwidth. The control flow of the program then refers to instructions directly by their position within an ILAR, rather than by lengthy memory addresses. Because instructions are accessed directly from within registers, there is no implicit instruction fetch from memory. This thesis proposes an instruction set architecture for ILAR, investigates a mechanism for loading ILARs using the best available block compression algorithm, and develops hardware descriptions for both ILAR and a conventional memory cache model so that performance comparisons can be made on the instruction fetch stage.
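A toy software model may make the ILAR concept concrete: instruction blocks are loaded explicitly into wide registers, and control flow then names an instruction by (line, offset) instead of a memory address. The constants and names below are illustrative only, not the thesis ISA or hardware description.

```c
/*
 * Toy model of instruction line associative registers: explicit line loads,
 * and instruction fetch that indexes a register file instead of memory.
 */
#include <stdint.h>
#include <string.h>

#define ILAR_COUNT 8      /* number of instruction lines, illustrative */
#define ILAR_WIDTH 32     /* instruction words per line, illustrative  */

typedef struct { uint32_t word[ILAR_WIDTH]; } ILAR;

static ILAR ilar_file[ILAR_COUNT];

/* Explicit "load instruction line" operation; block decompression would go here. */
void ilar_load(int lar, const uint32_t *block)
{
    memcpy(ilar_file[lar].word, block, sizeof ilar_file[lar].word);
}

/* Fetch never touches memory: it indexes a register by line number and offset. */
uint32_t ilar_fetch(int lar, int offset)
{
    return ilar_file[lar].word[offset];
}
```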
125

Intel Integrated Performance Primitives a jejich využití při vývoji aplikací / Intel Integrated Performance Primitives and their use in application development

Machač, Jiří January 2008 (has links)
The aim of this work is to demonstrate and evaluate the contribution of SIMD computing, in particular Intel's MMX, SSE, SSE2, SSE3, SSSE3 and SSE4 units, by building demonstration applications with the Intel Integrated Performance Primitives library. First, the options for SIMD programming using intrinsic functions, vectorization and the Intel Integrated Performance Primitives library are presented, followed by a description of how particular algorithms can be evaluated. Finally, the process of programming with the Intel Integrated Performance Primitives library is illustrated.
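For context, this is the level of code that a library such as IPP hides: a minimal SSE sketch of an element-wise vector add, which in IPP would be a single call (e.g. ippsAdd_32f). The function name here is illustrative.

```c
/*
 * Minimal SSE example: add two float arrays four elements at a time,
 * with a scalar loop for the remaining tail.
 */
#include <xmmintrin.h>   /* SSE intrinsics */

void add_f32(const float *a, const float *b, float *dst, int len)
{
    int i = 0;
    for (; i + 4 <= len; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          /* load 4 floats (unaligned ok) */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (; i < len; ++i)                          /* scalar tail */
        dst[i] = a[i] + b[i];
}
```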
126

Detektor obličejů pro platformu Android / Face Detector For Android Platform

Slavík, Roman January 2011 (has links)
This master's thesis deals with face detection on mobile phones running Android OS. The introduction describes some algorithms used for detecting patterns in images, as well as various techniques for feature extraction. The specifics of development for the Android platform, including a basic description of the development tools, are then covered. The SIMD architecture is introduced in the next part of the work. Building on this background, the analysis and implementation of the final application are described. Finally, performance tests are conducted and their results are summarized in the conclusion.
127

Efektivní implementace genetického algoritmu s využitím vícejádrových CPU / The Efficient Implementation of the Genetic Algorithm Using Multicore Processors

Kouřil, Miroslav January 2010 (has links)
This diploma thesis deals with the acceleration of an advanced genetic algorithm. Discrete and continuous versions of the UMDA genetic algorithm were chosen for implementation. The main part of the acceleration is the use of the SSE instruction set; with it, the functions for computing fitness and for sampling the new population in particular were accelerated. A pseudorandom number generator that also uses the SSE instruction set was then implemented. After this step the discrete algorithm reached a speedup of up to 4.6. Finally, the algorithms were modified so that OpenMP could be used, which allows blocks of code to run in multiple threads. The continuous version of the algorithm is not well suited to parallelization, because its computational complexity is low. The discrete versions, by contrast, are very well suited to parallelization. The two implemented versions reached total speedups of up to 4.9 and 7.2.
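A sketch of the UMDA sampling step parallelized with OpenMP, in the spirit of the abstract: each new individual's bits are drawn from the estimated marginal probabilities. The names are illustrative, and a per-thread rand_r() state stands in for the SSE pseudorandom generator described above.

```c
/*
 * Discrete UMDA sampling: generate pop_size individuals of chrom_len bits,
 * where bit j is 1 with probability prob[j].  Each thread keeps its own
 * rand_r() seed, so the parallel loop has no shared RNG state.
 */
#include <omp.h>
#include <stdlib.h>

void umda_sample(unsigned char *pop,      /* pop_size x chrom_len bit matrix */
                 const float *prob,       /* marginal P(bit_j = 1)           */
                 int pop_size, int chrom_len)
{
    #pragma omp parallel
    {
        unsigned int seed = 12345u + 977u * (unsigned)omp_get_thread_num();
        #pragma omp for schedule(static)
        for (int i = 0; i < pop_size; ++i)
            for (int j = 0; j < chrom_len; ++j) {
                float u = (float)rand_r(&seed) / (float)RAND_MAX;
                pop[i * chrom_len + j] = (u < prob[j]) ? 1 : 0;
            }
    }
}
```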
128

Jádra schématu lifting pro vlnkovou transformaci / Lifting Scheme Cores for Wavelet Transform

Bařina, David Unknown Date (has links)
The thesis focuses on the efficient computation of the two-dimensional discrete wavelet transform. Existing methods are extended in several directions so that the transform is computed in a single pass, possibly over multiple decomposition levels, using a compact core. This core can furthermore be suitably reorganized to minimize the use of certain resources. The presented approach fits naturally into commonly used SIMD extensions, exploits the cache hierarchy of modern processors, and is well suited to parallel computation. Finally, the approach is integrated into the JPEG 2000 compression chain, where it proves to be substantially faster than widely used implementations.
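For reference, here is a single 1-D lifting pass of the reversible CDF 5/3 wavelet used by JPEG 2000, written as plain C. This is only the textbook starting point; the thesis builds a fused 2-D, possibly multi-level core on top of steps like this. Boundary handling is the usual symmetric extension, and floor division is written with arithmetic right shifts (assumed here for signed ints).

```c
/*
 * In-place forward 1-D lifting of the reversible CDF 5/3 wavelet.
 * After the pass, even indices hold the low-pass band, odd the high-pass.
 * n is assumed even and >= 4.
 */
void cdf53_forward_1d(int *x, int n)
{
    /* predict: d[k] = x[2k+1] - floor((x[2k] + x[2k+2]) / 2) */
    for (int i = 1; i < n - 1; i += 2)
        x[i] -= (x[i - 1] + x[i + 1]) >> 1;
    x[n - 1] -= x[n - 2];                     /* symmetric extension at the end   */

    /* update: s[k] = x[2k] + floor((d[k-1] + d[k] + 2) / 4) */
    x[0] += (x[1] + 1) >> 1;                  /* symmetric extension at the start */
    for (int i = 2; i < n; i += 2)
        x[i] += (x[i - 1] + x[i + 1] + 2) >> 2;
}
```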
129

Akcelerace vektorových a krytografických operací na platformě x86-64 / Acceleration of Vector and Cryptographic Operations on x86-64 Platform

Šlenker, Samuel January 2017 (has links)
The aim of this thesis was to study and compare the older and newer SIMD processing units of modern microprocessors on the x86-64 platform. The thesis provides an overview of the fastest ways to compute vector operations on matrices and vectors, including the corresponding source code. Furthermore, the thesis focuses on authenticated encryption, specifically on the AES block cipher operating in Galois/Counter Mode, and on a discussion of the instruction-set options available for cryptographic support.
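A minimal AES-NI sketch of the block-encryption core that AES-GCM builds on, assuming an already expanded AES-128 key schedule (11 round keys). The key expansion is omitted, and GHASH would additionally use the PCLMULQDQ carry-less multiply (_mm_clmulepi64_si128); this is an illustration, not the thesis implementation.

```c
/*
 * Encrypt one 128-bit block with AES-128 using AES-NI intrinsics.
 * rk[0..10] must hold the expanded round keys.
 * Compile with -maes (and -mpclmul if GHASH is added).
 */
#include <wmmintrin.h>   /* AES-NI and PCLMULQDQ intrinsics */

__m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);          /* initial AddRoundKey          */
    for (int r = 1; r < 10; ++r)
        block = _mm_aesenc_si128(block, rk[r]);   /* 9 full rounds                */
    return _mm_aesenclast_si128(block, rk[10]);   /* final round, no MixColumns   */
}
```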
130

Efficient Implementation of 3D Finite Difference Schemes on Recent Processor Architectures / Effektiv implementering av finita differensmetoder i 3D på senaste processorarkitekturer

Ceder, Frederick January 2015 (has links)
In this thesis a solver is introduced that solves a problem set modelled by the Burgers equation using the finite difference method: forward in time and central in space. The solver is parallelized and optimized for the Intel Xeon Phi 7120P as well as the Intel Xeon E5-2699v3 processor to investigate differences in performance between the two architectures. Optimized data access and layout have been implemented to ensure good cache utilization. Loop tiling strategies are used to adjust data access with respect to the L2 cache size. Compiler hints describing aligned memory access are used to support vectorization on both processors. Additionally, prefetching strategies and streaming stores have been evaluated for the Intel Xeon Phi. Parallelization was done using OpenMP and MPI. The parallelization for native execution on the Xeon Phi is based on OpenMP and yielded a raw performance of nearly 100 GFLOP/s, reaching a speedup of almost 50 at 83% parallel efficiency. An OpenMP implementation on the E5-2699v3 (Haswell) processors produced up to 292 GFLOP/s, reaching a speedup of almost 31 at 85% parallel efficiency. For comparison, a mixed implementation interleaving communication with computation reached 267 GFLOP/s at a speedup of 28 with 87% parallel efficiency. Running a pure MPI implementation on PDC's Beskow supercomputer with 16 nodes yielded a total performance of 1450 GFLOP/s, and for a larger problem set a total of 2325 GFLOP/s, reaching speedups and parallel efficiencies of 170 at 33.3% and 290 at 56%, respectively. An analysis based on the roofline performance model shows that the computations were memory bound to the L2 cache bandwidth, suggesting good L2 cache utilization on both the Haswell and Xeon Phi architectures. Xeon Phi performance can probably be improved by also using MPI. Keeping in mind the technological progress of the computational cores in the Haswell processor, both processors perform well in the comparison. Recasting the stencil computations in a more compiler-friendly form might improve performance further, as the compiler could then optimize more for the target platform. The experiments on the Cray system Beskow showed an increase in efficiency from 33.3% to 56% for the larger problem, illustrating good weak scaling. This suggests that problem sizes should grow accordingly with the number of nodes in order to maintain high efficiency.
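As an illustration of the kind of loop structure the abstract describes, here is a generic 7-point FTCS-style 3-D stencil sweep with OpenMP. It is not the thesis' Burgers solver; the grid sizes, coefficient and array names are placeholders, and tiling and alignment hints would be layered on top of this basic form.

```c
/*
 * Generic explicit 3-D stencil update (FTCS-style): one time step of a
 * 7-point scheme, parallelized over the two outer loops with the inner,
 * unit-stride loop left for the vectorizer.
 */
#include <stddef.h>

#define NX 256
#define NY 256
#define NZ 256
#define IDX(i, j, k) ((size_t)(i) * NY * NZ + (size_t)(j) * NZ + (size_t)(k))

void ftcs_step(const double *restrict u, double *restrict u_new, double c)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 1; i < NX - 1; ++i)
        for (int j = 1; j < NY - 1; ++j) {
            #pragma omp simd                    /* innermost, unit-stride loop */
            for (int k = 1; k < NZ - 1; ++k)
                u_new[IDX(i, j, k)] = u[IDX(i, j, k)]
                    + c * (u[IDX(i + 1, j, k)] + u[IDX(i - 1, j, k)]
                         + u[IDX(i, j + 1, k)] + u[IDX(i, j - 1, k)]
                         + u[IDX(i, j, k + 1)] + u[IDX(i, j, k - 1)]
                         - 6.0 * u[IDX(i, j, k)]);
        }
}
```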
