51

Performance Optimization of Signal Processing Algorithms for SIMD Architectures

Yagneswar, Sharan January 2017
Digital signal processing (DSP) algorithms are widely implemented in real-time systems. In fields such as digital music technology, many of these algorithms are implemented, often in combination, to achieve the desired functionality. DSP implementations are performance critical because they have tight deadlines. In this thesis, performance optimization using the Single Instruction Multiple Data (SIMD) vectorization technique is performed on the ARM Cortex-A15 architecture for six commonly used DSP algorithms: Gain, Mix, Gain and Mix, Complex Number Multiplication, Envelope Detection, and Cascaded IIR Filter. To ensure optimal performance, the instructions should be scheduled with minimal pipeline stalls, which requires execution time to be measured with fine time granularity. First, a technique for accurately measuring execution time using the cycle counter of the processor’s Performance Management Unit (PMU), along with synchronization barriers, is developed. Execution times measured using operating system calls were found to have high variation and very coarse time granularity, whereas the cycle counter method was accurate and produced reliable results. The cost associated with the cycle counter method is 75 clock cycles. Using this technique, the contribution of each SIMD instruction to the execution time is measured and used to schedule the instructions. The thesis also presents a guideline on how to schedule instructions that have data dependencies, using the cycle counter measurement technique, so that pipeline stalls are minimized. The algorithms are modified where needed to favor vectorization and are implemented using ARM architecture-specific SIMD instructions. These implementations are then compared with those produced automatically by the compiler’s auto-vectorization feature. The execution times of the SIMD implementations were much lower than those of the compiler-generated code, with speedups ranging from 2.47 to 5.11. The performance increase is also significant when the instructions are scheduled in an optimal way. The thesis concludes that auto-vectorized code does poorly for complex algorithms and contains many data dependencies that cause pipeline stalls, even with full optimizations enabled. Using the scheduling guidelines presented in the thesis, the performance of the DSP algorithms improves significantly compared with their auto-vectorized counterparts.
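The measurement idea described in this abstract (read a fine-grained counter around the code under test and account for the cost of the measurement itself) can be sketched portably. The sketch below is only an analogy: it uses Python's `time.perf_counter_ns` as a stand-in for the PMU cycle counter, it does not touch the Cortex-A15 PMU or any synchronization barriers, and the function names `measurement_overhead_ns`, `measure_ns`, and `gain` are hypothetical.

```python
import time


def measurement_overhead_ns(samples=1000):
    """Estimate the cost of the measurement itself: two back-to-back timer reads."""
    best = None
    for _ in range(samples):
        start = time.perf_counter_ns()
        stop = time.perf_counter_ns()
        delta = stop - start
        if best is None or delta < best:
            best = delta
    return best


def measure_ns(func, overhead_ns, repeats=50):
    """Return the smallest observed run time of func(), minus the timer overhead."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func()
        stop = time.perf_counter_ns()
        delta = stop - start - overhead_ns
        if best is None or delta < best:
            best = delta
    # Clamp at zero in case the overhead estimate exceeds a very short run.
    return max(best, 0)


def gain(samples, g):
    """Toy stand-in for the Gain kernel: scale every sample by g."""
    return [g * s for s in samples]


data = list(range(1024))
overhead = measurement_overhead_ns()
elapsed = measure_ns(lambda: gain(data, 0.5), overhead)
```

Taking the minimum over many repeats is one common way to reduce the influence of scheduling noise; whether the thesis uses minimum, mean, or another statistic is not stated in the abstract.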
52

Evaluating the Vector Supercomputer SX-Aurora TSUBASA as a Co-Processor for In-Memory Database Systems

Pietrzyk, Johannes, Habich, Dirk, Damme, Patrick, Focht, Erich, Lehner, Wolfgang 16 June 2023
In-memory column-store database systems are state of the art for the efficient processing of analytical workloads. In these systems, data compression and vectorization play an important role. Currently, vectorized processing is done using the regular SIMD (Single Instruction Multiple Data) extensions of modern processors. For example, Intel’s latest SIMD extension supports 512-bit vector registers, which allow the parallel processing of 8 × 64-bit values. From a database system perspective, this vectorization technique is of interest not only for compression and decompression, where it reduces the computational overhead, but also for database operators such as joins, scans, and groupings. In contrast to these SIMD extensions, NEC Corporation has recently introduced a novel pure vector engine (supercomputer) as a co-processor, called SX-Aurora TSUBASA. This vector engine features a vector length of 16,384 bits and the world’s highest bandwidth of up to 1.2 TB/s, which suits data-intensive applications such as in-memory database systems. In this paper, we therefore describe the unique architecture and properties of this novel vector engine. Moreover, we present selected evaluation results specific to in-memory column stores to show the benefits of this vector engine compared with regular SIMD extensions. Finally, we conclude the paper with an outlook on our ongoing research activities in this direction.
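The register widths quoted in this abstract translate directly into lane counts. As a small worked example, the following sketch computes how many 64-bit values fit in one vector and how many vector steps a column of a given length needs. The function names and the one-million-value column are illustrative assumptions, and the step count says nothing about actual run time, which also depends on memory bandwidth and clock rate.

```python
def lanes(register_bits, element_bits):
    """Number of elements processed in parallel by one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def vector_steps(num_values, register_bits, element_bits):
    """Vector operations needed to touch every value once (ceiling division)."""
    per_step = lanes(register_bits, element_bits)
    return -(-num_values // per_step)


simd_lanes = lanes(512, 64)        # 512-bit SIMD register, 64-bit values
vector_lanes = lanes(16384, 64)    # 16,384-bit vector, 64-bit values
simd_steps = vector_steps(1_000_000, 512, 64)
engine_steps = vector_steps(1_000_000, 16384, 64)
```

With these inputs, a 512-bit register holds 8 values and a 16,384-bit vector holds 256, so the hypothetical column needs 125,000 and 3,907 vector steps respectively.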
53

Partition-based SIMD Processing and its Application to Columnar Database Systems

Hildebrandt, Juliana, Pietrzyk, Johannes, Krause, Alexander, Habich, Dirk, Lehner, Wolfgang 19 March 2024
The Single Instruction Multiple Data (SIMD) paradigm has become a core principle for optimizing query processing in columnar database systems. Until now, only the LOAD/STORE instructions have been considered efficient enough to achieve the expected speedups, while avoiding GATHER/SCATTER has been considered almost imperative. However, the GATHER instruction offers a very flexible way to populate SIMD registers with data elements coming from non-consecutive memory locations. As we discuss in this article, the GATHER instruction can achieve the same performance as the LOAD instruction if applied properly. To enable proper usage, we outline a novel access pattern allowing fine-grained, partition-based SIMD implementations. We then apply this partition-based SIMD processing to two representative examples from columnar database systems to experimentally demonstrate the applicability and efficiency of our new access pattern.
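To make the LOAD versus GATHER distinction concrete, the sketch below models both as plain list operations and uses each to sum a column. The abstract does not define the partition-based access pattern, so the strided variant here (each lane walks its own contiguous partition of the column) is only one plausible reading, and all function names are hypothetical.

```python
def load(column, base, width):
    """Contiguous LOAD: lanes read column[base], column[base + 1], ..."""
    return column[base:base + width]


def gather(column, base, offsets):
    """GATHER: lane i reads column[base + offsets[i]] (non-consecutive)."""
    return [column[base + off] for off in offsets]


def sum_with_load(column, width):
    """Sum a column using contiguous loads and one accumulator per lane."""
    assert len(column) % width == 0
    acc = [0] * width
    for base in range(0, len(column), width):
        acc = [a + v for a, v in zip(acc, load(column, base, width))]
    return sum(acc)


def sum_with_partitioned_gather(column, width):
    """Sum a column where each lane owns one partition, read via strided gather."""
    assert len(column) % width == 0
    part = len(column) // width                   # elements per partition
    offsets = [lane * part for lane in range(width)]
    acc = [0] * width
    for step in range(part):
        acc = [a + v for a, v in zip(acc, gather(column, step, offsets))]
    return sum(acc)


column = list(range(32))
total_load = sum_with_load(column, 8)
total_gather = sum_with_partitioned_gather(column, 8)
```

Both variants visit every element exactly once and produce the same total; the model says nothing about relative hardware cost, which is the subject of the article itself.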
54

SAP HANA: The Evolution from a Modern Main-Memory Data Platform to an Enterprise Application Platform

Sikka, Vishal, Färber, Franz, Goel, Anil, Lehner, Wolfgang 10 January 2023
SAP HANA is a pioneering, and one of the best performing, data platforms, designed from the ground up to heavily exploit modern hardware capabilities, including SIMD and large memory and CPU footprints. As a comprehensive data management solution, SAP HANA supports the complete data life cycle, encompassing modeling, provisioning, and consumption. This extended abstract outlines the vision and planned next step of the SAP HANA evolution, growing from a core data platform into an innovative enterprise application platform as the foundation for current as well as novel business applications in both on-premise and on-demand scenarios. We argue that only a holistic system design rigorously applying co-design at different levels may yield a highly optimized and sustainable platform for modern enterprise applications.
55

Implementation of LTE Baseband Algorithms for a Highly Parallel DSP Platform

Keller, Markus January 2016
The Division of Computer Engineering at Linköping University is currently developing an innovative parallel DSP processor architecture called ePUMA. One possible future use of the ePUMA is in base stations for mobile communication. To investigate the performance and potential of the ePUMA as a processing unit in base stations, a model of the LTE physical layer uplink receiving chain was simulated in Matlab and then partially mapped onto the ePUMA processor. The project work included researching and understanding the LTE standard and simulating the uplink processing chain in Matlab for a transmission bandwidth of 5 MHz. Major tasks of the DSP implementation included the development of a 300-point FFT algorithm and a channel equalization algorithm for the SIMD units of the ePUMA platform. This thesis provides the reader with an introduction to the LTE standard as well as an introduction to the ePUMA processor. Furthermore, it can serve as guidance for developing mixed-radix FFTs in general, or the 300-point FFT in particular, and can help with a basic understanding of channel equalization. The work of the thesis covered the whole development chain, from understanding the algorithms, through simplifying them and mapping them onto a DSP platform, to testing and verifying the results.
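A 300-point transform is not a power of two, so it needs a mixed-radix decomposition (300 = 2 × 2 × 3 × 5 × 5). The following is a minimal sketch of a generic recursive decimation-in-time mixed-radix FFT, checked against a direct DFT. It is not the ePUMA implementation, whose factorization and data layout the abstract does not describe; the function names are hypothetical.

```python
import cmath


def smallest_factor(n):
    """Return the smallest factor of n greater than 1, or n itself if n is prime or 1."""
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            return p
    return n


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used for prime lengths and as reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def mixed_radix_fft(x):
    """Recursive decimation-in-time FFT for lengths with arbitrary small factors."""
    n = len(x)
    p = smallest_factor(n)
    if p == n:
        return dft(x)
    m = n // p
    # Split into p interleaved sub-sequences of length m and transform each.
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    out = [0j] * n
    for k in range(n):
        # Combine the sub-transforms with twiddle factors W_n^(r*k).
        out[k] = sum(
            subs[r][k % m] * cmath.exp(-2j * cmath.pi * r * k / n)
            for r in range(p)
        )
    return out


signal = [complex((i * 7) % 11, (i * 3) % 5) for i in range(300)]
fast = mixed_radix_fft(signal)
reference = dft(signal)
max_error = max(abs(a - b) for a, b in zip(fast, reference))
```

A production implementation would precompute twiddle factors and use dedicated radix-3, radix-4, and radix-5 butterflies rather than generic sums; the sketch favors clarity over speed.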
56

Parallel computing techniques for computed tomography

Deng, Junjun 01 May 2011
X-ray computed tomography is a widely adopted medical imaging method that uses projections to recover the internal image of a subject. Since the invention of X-ray computed tomography in the 1970s, several generations of CT scanners have been developed. As 3D image reconstruction increases in popularity, the long processing time associated with these machines has to be significantly reduced before they can be practically employed in everyday applications. Parallel computing is a technique that uses multiple computing resources to process a computational task simultaneously; each resource computes only a part of the whole task, thereby greatly reducing computation time. In this thesis, we use parallel computing to speed up reconstruction while preserving image quality. Three representative reconstruction algorithms, namely the Katsevich, EM, and Feldkamp algorithms, are investigated in this work. For the Katsevich algorithm, a distributed-memory PC cluster is used to conduct the experiment. This parallel algorithm partitions the projection data and distributes it to different computer nodes to perform the computation. Upon completion of each sub-task, the results are collected by the master computer to produce the final image. This parallel algorithm uses the same reconstruction formula as its sequential counterpart and therefore gives an identical image. The parallel version of the iterative CT algorithm uses the same PC cluster as the first one. However, because it is based on a local CT reconstruction algorithm, which differs from the sequential EM algorithm, its image results differ from those of the sequential counterpart. Moreover, a special strategy using inhomogeneous resolution was used to further speed up the computation. The results showed that image quality was largely preserved while computation time was greatly reduced. Unlike the two previous approaches, the third type of parallel implementation uses a shared-memory computer. Three major acceleration methods, SIMD (single instruction, multiple data), multi-threading, and OS (ordered subsets), were employed to speed up the computation. Initial investigations showed that the image quality was comparable to that of the conventional approach, while the computation speed was significantly increased.
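The partition, distribute, and collect scheme described for the cluster implementation can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: threads play the role of cluster nodes, the "back-projection" is a simple element-wise accumulation rather than any of the Katsevich, EM, or Feldkamp formulas, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def backproject(projections, size):
    """Toy back-projection: accumulate every projection into one image row."""
    image = [0] * size
    for proj in projections:
        for i in range(size):
            image[i] += proj[i]
    return image


def split(items, parts):
    """Partition the projection list into `parts` interleaved chunks."""
    return [items[i::parts] for i in range(parts)]


def parallel_backproject(projections, size, workers=4):
    """Distribute chunks to workers, then let the master sum the partial images."""
    chunks = split(projections, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda chunk: backproject(chunk, size), chunks))
    image = [0] * size
    for partial in partials:
        for i in range(size):
            image[i] += partial[i]
    return image


projections = [[(angle * 7 + i) % 5 for i in range(16)] for angle in range(10)]
sequential_image = backproject(projections, 16)
parallel_image = parallel_backproject(projections, 16)
```

Because each worker applies the same accumulation to its share of the data, the collected result matches the sequential one, which mirrors the abstract's remark that the parallel Katsevich variant yields an identical image. In CPython, threads will not speed up this pure-Python loop; the example shows the structure only.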
57

Conception et analyse d'algorithmes numériques parallèles

Delesalle, Denis 12 February 1993
This thesis presents the limits of the SIMD mode in the parallel programming of linear algebra algorithms. More specifically, it examines the limits of the golden rule of massive parallelism: one matrix element per processor. Experiments are carried out on a Connection Machine 2. The first part nevertheless shows how communication procedures built on a new algorithm for constructing balanced trees, combined with judicious data placement, make it possible to reach performance close to peak. This kind of work cannot be done for every algorithm, however, and not everything adapts as well. In the second part, we present the advantages of decomposition into BLAS for building massively parallel algorithms. Chapter 4 highlights the synchronization barrier in the conjugate gradient method. For this particular case, we propose as a solution an older method which, although it converges more slowly sequentially, is faster in parallel. The structure of the matrices is also an important factor: it makes it possible to speed up the computations and to increase the size of the problems that can be solved. The architecture of current machines still limits its use too much. The last part is devoted entirely to permutations and to the communications they entail. For Burg's algorithm, we propose a solution that computes both the reflection coefficients and the autoregression coefficients at no additional cost.
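The synchronization barrier mentioned for the conjugate gradient method comes from its inner products: each one is a global reduction whose result every processor needs before the next update can start. The following minimal sequential sketch marks those reduction points in a textbook conjugate gradient loop. It is not the thesis's implementation or its proposed alternative method, and the reduction counter is an illustrative addition.

```python
def dot(u, v):
    """Inner product; on a parallel machine this is a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    """Dense matrix-vector product."""
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-12, max_iter=100):
    """Solve A x = b for symmetric positive definite A; also count reductions."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x with x = 0
    p = list(r)                 # initial search direction
    rs_old = dot(r, r)
    reductions = 1
    for _ in range(max_iter):
        if rs_old <= tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)       # synchronization point 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)                # synchronization point 2
        reductions += 2
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, reductions


solution, reduction_count = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

The vector updates between the two inner products are fully element-wise and map well onto SIMD hardware; it is the two scalar results per iteration that force all processors to wait.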
58

Intégration sur tranche d'une architecture massivement parallèle tolérant les défauts de fin de fabrication

Patry, Jean-Luc 04 March 1992
This thesis presents methods and tools for designing systems integrated on a whole wafer (wafer-scale integration). The application treated (as part of a European ESPRIT project) is an architecture consisting of a 2D array of 6720 single-bit processing elements (PEs), intended for low-level image processing. To tolerate end-of-fabrication defects, a hierarchical approach was implemented. At the subsystem level, a fixed-redundancy technique consisted of implementing a spare column of PEs intended to replace defective PEs. At the whole-wafer level, a technique for constructing a maximal target using only subsystems relies on a network of switches that makes it possible to avoid defective subsystems. An original architecture for the switch networks, controlled from the external pads, together with efficient algorithms for defining and constructing the operational network, are the strong points of this thesis.
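The spare-column idea can be illustrated with a small remapping model. The abstract states only that a spare column of PEs replaces defective ones; the shifting rule below (each row skips its defective PE and uses the spare, so a row survives at most one defect) is an assumption made for illustration, and the function names are hypothetical.

```python
def remap_row(defective, logical_width):
    """Map logical columns to working physical columns in one row.

    The row has logical_width + 1 physical PEs (one spare). Returns the list of
    physical column indices to use, or None if too many PEs are defective.
    """
    working = [c for c in range(logical_width + 1) if c not in defective]
    if len(working) < logical_width:
        return None
    return working[:logical_width]


def repair_array(defects_by_row, rows, logical_width):
    """Remap every row; the subsystem is usable only if all rows can be repaired."""
    mapping = []
    for r in range(rows):
        row_map = remap_row(defects_by_row.get(r, set()), logical_width)
        if row_map is None:
            return None
        mapping.append(row_map)
    return mapping


repaired = repair_array({1: {0}}, rows=2, logical_width=3)
unrepairable = repair_array({0: {1, 3}}, rows=2, logical_width=4)
```

A subsystem that cannot be repaired this way would be one of those the wafer-level switch network has to route around.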
59

Algorithmique parallèle : réseaux d'automates, architectures systoliques, machines SIMD et MIMD

Robert, Yves 06 January 1986
.
60

A Multimedia DSP Processor Design / Design av en Multimedia DSP Processor

Gnatyuk, Vladimir, Runesson, Christian January 2004
This Master Thesis presents the design of the core of a fixed-point general-purpose multimedia DSP processor (MDSP) and its instruction set. This processor employs parallel processing techniques and specialized addressing models to speed up the processing of multimedia applications.
The MDSP has a dual MAC structure with one enhanced MAC that provides a SIMD (Single Instruction Multiple Data) unit consisting of four parallel data paths that are optimized for accelerating multimedia applications. The SIMD unit performs four multimedia-oriented 16-bit operations every clock cycle. This accelerates computationally intensive procedures such as video and audio decoding. The MDSP uses a memory bank of four memories to provide multiple accesses of source data each clock cycle.
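As a rough model of what "four 16-bit operations every clock cycle" means for a multiply-accumulate workload, the sketch below processes a dot product four lanes at a time and counts the steps. The abstract does not specify the actual operations, accumulator width, or overflow behaviour of the MDSP, so the signed 16-bit inputs, the unbounded accumulator, and all names here are assumptions.

```python
LANES = 4
MASK16 = 0xFFFF


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def simd_mul16(a, b):
    """Lane-wise multiply of up to four signed 16-bit operands."""
    return [to_signed16(x) * to_signed16(y) for x, y in zip(a, b)]


def dot_product(xs, ys):
    """Accumulate a dot product four lanes per step; return (result, steps)."""
    acc = 0
    steps = 0
    for i in range(0, len(xs), LANES):
        acc += sum(simd_mul16(xs[i:i + LANES], ys[i:i + LANES]))
        steps += 1
    return acc, steps


result, steps = dot_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
```

In this model, eight multiply-accumulates take two steps instead of eight, which is the kind of saving the four parallel data paths are meant to provide for audio and video kernels.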
