Spelling suggestions: "subject:"5oftware pipelined loops"" "subject:"5oftware pipelined hoops""
1 |
Spill Code Minimization And Buffer And Code Size Aware Instruction Scheduling TechniquesNagarakatte, Santosh G 08 1900 (has links)
Instruction scheduling and Software pipelining are important compilation techniques which reorder instructions in a program to exploit instruction level parallelism. They are essential for enhancing instruction level parallelism in architectures such as very Long Instruction Word and tiled processors. This thesis addresses two important problems in the context of these instruction reordering techniques. The first problem is for general purpose applications and architectures, while the second is for media and graphics applications for tiled and multi-core architectures. The first problem deals with software pipelining which is an instruction scheduling technique that overlaps instructions from multiple iterations. Software pipelining increases the register pressure and hence it may be required to introduce spill instructions.
In this thesis, we model the problem of register allocation with optimal spill code generation and scheduling in software pipelined loops as a 0-1 integer linear program. By minimizing the amount of spill code produced, the formulation ensures that the initiation interval (II) between successive iterations of the loop is not increased unnecessarily. Experimental results show that our formulation performs better than the existing heuristics by preventing an increase in the II and also generating less spill code on average among loops extracted from Perfect Club and SPEC benchmarks.
The second major contribution of the thesis deals with the code size aware scheduling of stream programs. Large scale synchronous dataflow graphs (SDF’s) and StreamIt have emerged as powerful programming models for high performance streaming applications. In these models, a program is represented as a dataflow graph where each node represents an autonomous filter and the edges represent the channels through which the nodes communicate. In constructing static schedules for programs in these models, it is important to optimize the execution time buffer requirements of the data channel and the space required to store the encoded schedule. Earlier approaches have either given priority to one of the requirements or proposed ad-hoc methods for generating schedules with good trade-offs. In this thesis, we propose a genetic algorithm framework based on non-dominated sorting for generating serial schedules which have good trade-off between code size and buffer requirement. We extend the framework to generate software pipelined schedules for tiled architectures. From our experiments, we observe that the genetic algorithm framework generates schedules with good trade-off and performs better than the earlier approaches.
|
2 |
Efficient LU Factorization for Texas Instruments Keystone Architecture Digital Signal Processors / Effektiv LU-faktorisering för Texas Instruments digitala signalprocessorer med Keystone-arkitekturNetzer, Gilbert January 2015 (has links)
The energy consumption of large-scale high-performance computer (HPC) systems has become one of the foremost concerns of both data-center operators and computer manufacturers. This has renewed interest in alternative computer architectures that could offer substantially better energy-efficiency.Yet, the for the evaluation of the potential of these architectures necessary well-optimized implementations of typical HPC benchmarks are often not available for these for the HPC industry novel architectures. The in this work presented LU factorization benchmark implementation aims to provide such a high-quality tool for the HPC industry standard high-performance LINPACK benchmark (HPL) for the eight-core Texas Instruments TMS320C6678 digitalsignal processor (DSP). The presented implementation could perform the LU factorization at up to 30.9 GF/s at 1.25 GHz core clock frequency by using all the eight DSP cores of the System-on-Chip (SoC). This is 77% of the attainable peak double-precision floating-point performance of the DSP, a level of efficiency that is comparable to the efficiency expected on traditional x86-based processor architectures. A presented detailed performance analysis shows that this is largely due to the optimized implementation of the embedded generalized matrix-matrix multiplication (GEMM). For this operation, the on-chip direct memory access (DMA) engines were used to transfer the necessary data from the external DDR3 memory to the core-private and shared scratchpad memory. This allowed to overlap the data transfer with computations on the DSP cores. The computations were in turn optimized by using software pipeline techniques and were partly implemented in assembly language. With these optimization the performance of the matrix multiplication reached up to 95% of attainable peak performance. A detailed description of these two key optimization techniques and their application to the LU factorization is included. Using a specially instrumented Advantech TMDXEVM6678L evaluation module, described in detail in related work, allowed to measure the SoC’s energy efficiency of up to 2.92 GF/J while executing the presented benchmark. Results from the verification of the benchmark execution using standard HPL correctness checks and an uncertainty analysis of the experimentally gathered data are also presented. / Energiförbrukningen av storskaliga högpresterande datorsystem (HPC) har blivit ett av de främsta problemen för såväl ägare av dessa system som datortillverkare. Det har lett till ett förnyat intresse för alternativa datorarkitekturer som kan vara betydligt mer effektiva ur energiförbrukningssynpunkt. För detaljerade analyser av prestanda och energiförbrukning av dessa för HPC-industrin nya arkitekturer krävs väloptimerade implementationer av standard HPC-bänkmärkningsproblem. Syftet med detta examensarbete är att tillhandhålla ett sådant högkvalitativt verktyg i form av en implementation av ett bänkmärkesprogram för LU-faktorisering för den åttakärniga digitala signalprocessorn (DSP) TMS320C6678 från Texas Instruments. Bänkmärkningsproblemet är samma som för det inom HPC-industrin välkända bänkmärket “high-performance LINPACK” (HPL). Den här presenterade implementationen nådde upp till en prestanda av 30,9 GF/s vid 1,25 GHz klockfrekvens genom att samtidigt använda alla åtta kärnor i DSP:n. Detta motsvarar 77% av den teoretiskt uppnåbara prestandan, vilket är jämförbart med förväntningar på effektivteten av mer traditionella x86-baserade system. En detaljerad prestandaanalys visar att detta tillstor del uppnås genom den högoptimerade implementationen av den ingående matris-matris-multiplikationen. Användandet av specialiserade “direct memory access” (DMA) hårdvaruenheter för kopieringen av data mellan det externa DDR3 minnet och det interna kärn-privata och delade arbetsminnet tillät att överlappa dessa operationer med beräkningar. Optimerade mjukvaruimplementationer av dessa beräkningar, delvis utförda i maskinspåk, tillät att utföra matris-multiplikationen med upp till 95% av den teoretiskt nåbara prestandan. I rapporten ges en detaljerad beskrivning av dessa två nyckeltekniker. Energiförbrukningen vid exekvering av det implementerade bänkmärket kunde med hjälp av en för ändamålet anpassad Advantech TMDXEVM6678L evalueringsmodul bestämmas till maximalt 2,92 GF/J. Resultat från verifikationen av bänkmärkesimplementationen och en uppskattning av mätosäkerheten vid de experimentella mätningarna presenteras också.
|
Page generated in 1.1076 seconds