Global ETD Search

Return to search

Performance Optimization of Signal Processing Algorithms for SIMD Architectures

Digital Signal Processing(DSP) algorithms are widely implemented in real time systems.In fields such as digital music technology, many of these said algorithms areimplemented, often in combination, to achieve the desired functionality. When itcomes to implementation, DSP algorithms are performance critical as they havetight deadlines. In this thesis, performance optimization using Single InstructionMultiple Data(SIMD) vectorization technique is performed on the ARM Cortex-A15 architecture for six commonly used DSP algorithms; Gain, Mix, Gain and Mix,Complex Number Multiplication, Envelope Detection and Cascaded IIR Filter. Toensure optimal performance, the instructions should be scheduled with minimalpipeline stalls. This requires execution time to be measured with fine time granularity.First, a technique of accurately measuring the execution time using thecycle counter of the processor’s Performance Management Unit(PMU) along withsynchronization barriers is developed. It was found that the execution time measuredby using the operating system calls have high variations and very low timegranularity, whereas the cycle counter method was accurate and produced reliableresults. The cost associated with the cycle counter method is 75 clock cycles. Usingthis technique, the contribution by each SIMD instruction towards the executiontime is measured and is used to schedule the instructions. This thesis also presentsa guideline on how to schedule instructions which have data dependencies usingthe cycle counter timing execution time measurement technique, to ensure that thepipeline stalls are minimized. The algorithms are also modified, if needed, to favorvectorization and are implemented using ARM architecture specific SIMD instructions.These implementations are then compared to that which are automaticallyproduced by the compiler’s auto-vectorization feature. The execution times of theSIMD implementations was much lower compared to that produced by the compilerand the speedup ranged from 2.47 to 5.11. Also, the performance increaseis significant when the instructions are scheduled in an optimal way. This thesisconcludes that the auto-vectorized code does poorly for complex algorithms andproduces code with a lot of data dependencies causing pipeline stalls, even with fulloptimizations enabled. Using the guidelines presented in this thesis for schedulingthe instructions, the performance of the DSP algorithms have significant improvementscompared to their auto-vectorized counterparts. / Digitala signalbehandlingsalgoritmer(DSP) implementeras ofta i realtidssystem. Inomfält som exempelvis digital musikteknik används dessa algoritmer, ofta i olika kombinationer,för att ge önskad funktionalitet. Implementationen av DSP-algoritmerär prestandakritisk eftersom systemen ofta har små tidsmarginaler. I det härexamensarbetet genomförs prestandaoptimering med Single Instruction MultipleData(SIMD)-vektorisering på en ARM A15-arkitektur för 6 vanliga DSP-algoritmer;volym, mix, volym och mix, multiplikation av komplexa tal, amplituddetektering,och seriekopplade IIR-filter. Maximal optimering av algoritmerna kräver ocksåatt antalet pipeline stalls i processorn minimeras. För att kunna observera dettakrävs att exekveringstiden kan mätas med hög tidsupplösning. I det här examensarbeteutvecklas först en teknik för att mäta exekveringstiden med hjälp aven klockcykelräknare i processorns Performance Management Unit(PMU) tillsammansmed synkroniseringsbarriärer. Tidsmätning med hjälp av operativsystemsfunktionervisade sig ha sämre noggrannhet och tidsupplösning än metoden medatt räkna klockcykler, som gav tillförlitliga resultat. Den extra exekveringstidenför klockcykelräkning uppmättes till 75 klockcykler. Med den här tekniken är detmöjligt att mäta hur mycket varje SIMD-instruktion bidrar till den totala exekveringstiden.Examensarbete presenterar också en metod att ordna instruktioner somhar databeroenden sinsemellan med hjälp av ovanstående tidsmätningsmetod, såatt antalet pipeline stalls minimeras. I de fall det behövdes, skrevs koden till algoritmernaom för att bättre kunna utnyttja ARM-arkitekturens specifika SIMDinstruktioner.Dessa jämfördes sedan med resultaten från kompilatorns automatgenereradevektoriseringkod. Exekveringstiden för SIMD-implementationerna varsignifikant kortare än för de kompilatorgenererade och visade på en förbättring påmellan 2,47 och 5,11 gånger, mätt i exekveringstid. Resultaten visade också på entydlig förbättring när instruktionerna exekveras i en optimal ordning. Resultatenvisar att automatgenererad vektorisering presterar sämre för komplexa algoritmeroch producerar maskinkod med signifikanta databeroenden som orsakar pipelinestalls, även med optimeringsflaggor påslagna. Med hjälp av metoder presenteradei det här examensarbete kan prestandan i DSP-algoritmer förbättras betydligt ijämförelse med automatgenererad vektorisering.

http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-213720

Performance Optimization

Elektroteknik och elektronik

Identifer	oai:union.ndltd.org:UPSALLA1/oai:DiVA.org:kth-213720
Date	January 2017
Creators	Yagneswar, Sharan
Publisher	KTH, Reglerteknik
Source Sets	DiVA Archive at Upsalla University
Language	English
Detected Language	English
Type	Student thesis, info:eu-repo/semantics/bachelorThesis, text
Format	application/pdf
Rights	info:eu-repo/semantics/openAccess
Relation	TRITA-EE, 1653-5146 ; 2017:088

Page generated in 0.0075 seconds

Performance Optimization of Signal Processing Algorithms for SIMD Architectures

Description

Links & Downloads

Tags

Additional Fields