Global ETD Search

131	Smith-Waterman Sequence Alignment For Massively Parallel High-Performance Computing Architectures Steinfadt, Shannon Irene 19 April 2010 (has links) No description available. Bioinformatics Computer Science Molecular Biology Sequence Alignment Smith-Waterman Parallel Computing Associative Computing ASC SIMD MIMD ClearSpeed JumboMem SWAMP SWAMP+
132	Supporting Applications Involving Dynamic Data Structures and Irregular Memory Access on Emerging Parallel Platforms Ren, Bin 09 September 2014 (has links) No description available. Computer Science Irregular Data Structure Fine Grained Parallelism SIMD MIMD SSE GPUs Xeon Phi Static Analysis Runtime Analysis Offloading Python Redundancy Elimination
133	Performance Optimization of Stencil Computations on Modern SIMD Architectures Henretty, Thomas Steel January 2014 (has links) No description available. Computer Science stencil SIMD PDE DLT SDSL domain specific language stream alignment conflict split tiling nested split tiling hybrid split tiling
134	COMPARISON OF THE PERFORMANCE OF NVIDIA ACCELERATORS WITH SIMD AND ASSOCIATIVE PROCESSORS ON REAL-TIME APPLICATIONS Shaker, Alfred M. 27 July 2017 (has links) No description available. Computer Science
135	Parallélisation de simulations interactives de champs ultrasonores pour le contrôle non destructif / Parallelization of ultrasonic field simulations for non destructive testing Lambert, Jason 03 July 2015 (has links) La simulation est de plus en plus utilisée dans le domaine industriel du Contrôle Non Destructif. Elle est employée tout au long du processus de contrôle, que ce soit pour en accélérer la mise au point ou en comprendre les résultats. Les travaux menés au cours de cette thèse présentent une méthode de calcul rapide de champ ultrasonore rayonné par un capteur multi-éléments dans une pièce isotrope, permettant un usage interactif des simulations. Afin de tirer parti des architectures parallèles communément disponibles, un modèle régulier (qui limite au maximum les branchements divergents) dérivé du modèle générique présent dans la plateforme logicielle CIVA a été mis au point. Une première implémentation de référence a permis de le valider par rapport aux résultats CIVA et d'analyser son comportement en termes de performances. Le code a ensuite été porté et optimisé sur trois classes d'architectures parallèles aujourd'hui disponibles dans les stations de calcul : le processeur généraliste central (GPP), le coprocesseur manycore (Intel MIC) et la carte graphique (nVidia GPU). Concernant le processeur généraliste et le coprocesseur manycore, l'algorithme a été réorganisé et le code implémenté afin de tirer parti des deux niveaux de parallélisme disponibles, le multithreading et les instructions vectorielles. Sur la carte graphique, les différentes étapes de simulation de champ ont été découpées en une série de noyaux CUDA. Enfin, des bibliothèques de calculs spécifiques à ces architectures, Intel MKL et nVidia cuFFT, ont été utilisées pour effectuer les opérations de Transformées de Fourier Rapides. Les performances et la bonne adéquation des codes produits ont été analysées en détail pour chaque architecture. Dans plusieurs cas, sur des configurations de contrôle réalistes, des performances autorisant l'interactivité ont été atteintes. Des perspectives pour traiter des configurations plus complexes sont dressées. Enfin la problématique de l'industrialisation de ce type de code dans la plateforme logicielle CIVA est étudiée. / The Non Destructive Testing field increasingly uses simulation.It is used at every step of the whole control process of an industrial part, from speeding up control development to helping experts understand results. During this thesis, a simulation tool dedicated to the fast computation of an ultrasonic field radiated by a phase array probe in an isotropic specimen has been developped. Its performance enables an interactive usage. To benefit from the commonly available parallel architectures, a regular model (aimed at removing divergent branching) derived from the generic CIVA model has been developped. First, a reference implementation was developped to validate this model against CIVA results, and to analyze its performance behaviour before optimization. The resulting code has been optimized for three kinds of parallel architectures commonly available in workstations: general purpose processors (GPP), manycore coprocessors (Intel MIC) and graphics processing units (nVidia GPU). On the GPP and the MIC, the algorithm was reorganized and implemented to benefit from both parallelism levels, multhreading and vector instructions. On the GPU, the multiple steps of field computing have been divided in multiple successive CUDA kernels.Moreover, libraries dedicated to each architecture were used to speedup Fast Fourier Transforms, Intel MKL on GPP and MIC and nVidia cuFFT on GPU. Performance and hardware adequation of the produced algorithms were thoroughly studied for each architecture. On multiple realistic control configurations, interactive performance was reached. Perspectives to adress more complex configurations were drawn. Finally, the integration and the industrialization of this code in the commercial NDT plateform CIVA is discussed. Contrôle non destructif Programmation parallèle Simulation de champ ultrasonore Processeurs généralistes multicoeurs Processeurs graphiques GPGPU SIMD Parallélisme (informatique) Xeon Phi CUDA Manycore Non destructive testing Parallel programming Ultrasonic field simulation Multicore general purpose processors Graphic processing units GPGPU SIMD Parallelism Xeon Phi CUDA Manycore
136	Introducing Machine Learning in a Vectorized Digital Signal Processor / Introduktion av Maskininlärning på en Vektoriserad Digital Signalprocessor Ridderström, Linnéa January 2023 (has links) Machine learning is rapidly being integrated into all areas of society, however, that puts a lot of pressure on resource costraint hardware such as embedded systems. The company Ericsson is gradually integrating machine learning based on neural networks, so-called deep learning, into their radio products. One promising product is their vectorized Digital Signal Processor (DSP) that are based upon the machine learning suitable Single Instruction, Multiple Data (SIMD) paradigm and Very Long Instruction Word (VLIW) architecture. However, despite the suitability of the SIMD paradigm, the embedded system needs to efficiently execute a computation-intensive deep learning algorithm with proper use of its limited resources. Therefore commonly used methods of implementing each layer of the computation-intensive Convolutional Neural Network (CNN), a type of Deep Neural Network (DNN), have been used and evaluated its implementation on the hardware and to assess the vectorized DSP’s deep learning suitability and capabilities. Despite the suitability of the hardware, the implementation utilized less than half of the available resources at all times during the execution. The main limitations were identified to be the limited 16-bit element instructions. To enhance the performance and improve the utilization of the available resources, easy-to-implement hardware instructions have been suggested. This work has made the first steps of implementing an efficiently performing CNN implementation on the examined vectorized DSP. / Integreringen av maskininlärning in i alla samhällsområden sker idag i rusande fart, men det sätter stor press på begränsad hårdvara som inbyggda system. Företaget Ericsson integrerar successivt maskininlärning baserad på neurala nätverk, så kallad djupinlärning, i sina radioprodukter. En lovande produkt är deras vektoriserade DSP som är baserade på maskininlärningspasset SIMD-paradigm och VLIW-arkitektur. Men trots lämpligheten av SIMD-paradigmet, är den största utmaningen att utnyttja de begränsade resurserna i inbyggda systemet för att effektivt exekvera en beräkningsintensiv djupinlärningsalgoritm. Därför har vanligt använda metoder för att implementera varje lager av den beräkningsintensiva CNN, en typ av DNN, använts och utvärderats på hårdvaran för att bedöma den vektoriserade DSP:s djupinlärningslämplighet samt förmågor. Trots hårdvarans lämplighet använde alla implementeringar mindre än hälften av de tillgängliga resurserna vid alla tidpunkter under exekveringen. De huvudsakliga begränsningarna identifierades vara den begränsade tillgången på 16-bitars element instruktioner. För att förbättra prestandan för ett närmare fullt utnyttjande av tillgängliga resurser har hårdvaruinstruktioner som är enkla att implementera föreslagits. Detta arbete har tagit de första stegen för att implementera ett effektivt förformande CNN på den undersökta vekotriserade DSP. Digital Signal Processor (DSP) Machine Learning Deep Learning Convolutional Neural Network (CNN) Very Long Instruction Word (VLIM) Single Instruction Multiple Data (SIMD) Digital Signalprocessor (DSP) Maskininlärning Djupinlärning Konvolutionella Neurala Nätverk (CNN) Very Long Instruction Word (VLIW) Single Instruction Multiple Data (SIMD) Elektroteknik och elektronik
137	Optimalizované sledování paprsku / Optimized Ray Tracing Brich, Radek Unknown Date (has links) Goal of this work is to write an optimized program for visualization of 3D scenes using ray tracing method. First, the theory of ray tracing together with particular techniques are presented. Next part focuses on different approaches to accelerate the algorithm. These are space partitioning structures, fast ray-triangle intersection technique and possibilities to parallelize the whole ray tracing method. A standalone chapter addresses the design and implementation of the ray tracing program.
138	Développement d’algorithmes d’imagerie et de reconstruction sur architectures à unités de traitements parallèles pour des applications en contrôle non destructif / Development of imaging and reconstructions algorithms on parallel processing architectures for applications in non-destructive testing Pedron, Antoine 28 May 2013 (has links) La problématique de cette thèse se place à l’interface entre le domaine scientifique du contrôle non destructif par ultrasons (CND US) et l’adéquation algorithme architecture. Le CND US comprend un ensemble de techniques utilisées pour examiner un matériau, qu’il soit en production ou maintenance. Afin de détecter d’éventuels défauts, de les positionner et les dimensionner, des méthodes d’imagerie et de reconstruction ont été développées au CEA-LIST, dans la plateforme logicielle CIVA.L’évolution du matériel d’acquisition entraine une augmentation des volumes de données et par conséquent nécessite toujours plus de puissance de calcul pour parvenir à des reconstructions en temps interactif. L’évolution multicoeurs des processeurs généralistes (GPP), ainsi que l’arrivée de nouvelles architectures comme les GPU rendent maintenant possible l’accélération de ces algorithmes.Le but de cette thèse est d’évaluer les possibilités d’accélération de deux algorithmes de reconstruction sur ces architectures. Ces deux algorithmes diffèrent dans leurs possibilités de parallélisation. Pour un premier, la parallélisation sur GPP est relativement immédiate, contrairement à celle sur GPU qui nécessite une utilisation intensive des instructions atomiques. Quant au second, le parallélisme est plus simple à exprimer, mais l’ordonnancement des nids de boucles sur GPP, ainsi que l’ordonnancement des threads et une bonne utilisation de la mémoire partagée des GPU sont nécessaires pour obtenir un fonctionnement efficace. Pour ce faire, OpenMP, CUDA et OpenCL ont été utilisés et comparés. L’intégration de ces prototypes dans la plateforme CIVA a mis en évidence un ensemble de problématiques liées à la maintenance et à la pérennisation de codes sur le long terme. / This thesis work is placed between the scientific domain of ultrasound non-destructive testing and algorithm-architecture adequation. Ultrasound non-destructive testing includes a group of analysis techniques used in science and industry to evaluate the properties of a material, component, or system without causing damage. In order to characterize possible defects, determining their position, size and shape, imaging and reconstruction tools have been developed at CEA-LIST, within the CIVA software platform.Evolution of acquisition sensors implies a continuous growth of datasets and consequently more and more computing power is needed to maintain interactive reconstructions. General purprose processors (GPP) evolving towards parallelism and emerging architectures such as GPU allow large acceleration possibilities than can be applied to these algorithms.The main goal of the thesis is to evaluate the acceleration than can be obtained for two reconstruction algorithms on these architectures. These two algorithms differ in their parallelization scheme. The first one can be properly parallelized on GPP whereas on GPU, an intensive use of atomic instructions is required. Within the second algorithm, parallelism is easier to express, but loop ordering on GPP, as well as thread scheduling and a good use of shared memory on GPU are necessary in order to obtain efficient results. Different API or libraries, such as OpenMP, CUDA and OpenCL are evaluated through chosen benchmarks. An integration of both algorithms in the CIVA software platform is proposed and different issues related to code maintenance and durability are discussed. Controle non destructif Reconstruction d'image Programmation parallèle Processeurs graphiques PGPU Précision numérique Stabilité numérique Non destructive évaluation Image reconstruction Parallel programming Multicore general purpose processors Graphic processing units GPGPU Numerical precision Numerical stability
139	Conception et validation d'un processeur programmable de traitement du signal à faible consommation et à faible empreinte silicium : application à la vidéo HD sur téléphone mobile Thevenin, Mathieu 16 October 2009 (has links) (PDF) Les capteurs CMOS sont de plus en plus présents dans les produits de grande consommation comme les téléphones portables. Les images issues de ces capteurs nécessitent divers traitements numériques avant d'être afﬁchées. Aujourd'hui, seuls des composants dédiés sont utilisés pour maintenir un niveau de consom- mation électrique faible. Toutefois leur ﬂexibilité est fortement limitée et elle ne permet pas l'intégration de nouveaux algorithmes de traitement d'image. Ce manuscrit présente la conception et la validation de l'archi- tecture de calcul eISP entièrement programmable et originale capable de supporter la vidéo HD 1080p qui sera intégrée dans les futures générations de téléphones portables. Architecture calcul programmable eISP SIMD traitement d'image ﬂux vidéo haute déﬁni- tion 1080p capteur CMOS CCD bayer brutes ﬁltre traitement du signal numérique ﬂexible ﬂot données processeur SplitWay DSP bus TDMA tuile de calcul 65nm Programmable computing architecture
140	An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations Pananilath, Irshad Muhammed January 2014 (has links) (PDF) Lattice-Boltzmann method(LBM), a promising new particle-based simulation technique for complex and multiscale fluid flows, has seen tremendous adoption in recent years in computational fluid dynamics. Even with a state-of-the-art LBM solver such as Palabos, a user still has to manually write his program using the library-supplied primitives. We propose an automated code generator for a class of LBM computations with the objective to achieve high performance on modern architectures. Tiling is a very important loop transformation used to improve the performance of stencil computations by exploiting locality and parallelism. In the first part of the work, we explore diamond tiling, a new tiling technique to exploit the inherent ability of most stencils to allow tile-wise concurrent start. This enables perfect load-balance during execution and reduces the frequency of synchronization required. Few studies have looked at time tiling for LBM codes. We exploit a key similarity between stencils and LBM to enable polyhedral optimizations and in turn time tiling for LBM. Besides polyhedral transformations, we also describe a number of other complementary transformations and post processing necessary to obtain good parallel and SIMD performance on modern architectures. We also characterize the performance of LBM with the Roofline performance model. Experimental results for standard LBM simulations like Lid Driven Cavity, Flow Past Cylinder, and Poiseuille Flow show that our scheme consistently outperforms Palabos–on average by3 x while running on 16 cores of a n Intel Xeon Sandy bridge system. We also obtain a very significant improvement of 2.47 x over the native production compiler on the SPECLBM benchmark. Lattice-Boltzmann Computations Computational Fluid Dynamics Tiling Stencil Computations Single Instruction Multiple Data (SIMD) Parallel Computers Parallel Processing Loop Transformations Lattice-Boltzman Method (LBM) Lattice Boltzman Method Lattice-Boltzmann Equation Computer Science

Search results