Global ETD Search

1	A reconfigurable SIMD architecture on-chip Andersson, Johan, Mohlin, Mikael, Nilsson, Artur January 2006 (has links) <p>This project targets the problems with design and implementation of Single Instruction </p><p>Multiple Data (SIMD) architectures in System-on-Chip (SoC), with the goal to construct </p><p>a reconfigurable framework in VHDL to ease this process. The resulting framework should </p><p>be implemented on an FPGA and its usability tested. The main parts of a SIMD archi- </p><p>tecture was identified to be the Control Unit (CU), the Processing Elements (PE) and </p><p>the Interconnection Network (ICN), and a framework was constructed with these parts </p><p>as the main building blocks. The constructed framework is reconfigurable in data width, </p><p>memory size, number of PEs, topology and instruction set. To test ease of use and per- </p><p>formance of the system a FIR-filter application was implemented. The scalability of the </p><p>system and its different parts has been measured and comparisons are illustrated.</p> Single Instruction SIMD System-on-Chip
2	A reconfigurable SIMD architecture on-chip Andersson, Johan, Mohlin, Mikael, Nilsson, Artur January 2006 (has links) This project targets the problems with design and implementation of Single Instruction Multiple Data (SIMD) architectures in System-on-Chip (SoC), with the goal to construct a reconfigurable framework in VHDL to ease this process. The resulting framework should be implemented on an FPGA and its usability tested. The main parts of a SIMD archi- tecture was identified to be the Control Unit (CU), the Processing Elements (PE) and the Interconnection Network (ICN), and a framework was constructed with these parts as the main building blocks. The constructed framework is reconfigurable in data width, memory size, number of PEs, topology and instruction set. To test ease of use and per- formance of the system a FIR-filter application was implemented. The scalability of the system and its different parts has been measured and comparisons are illustrated. Single Instruction SIMD System-on-Chip
3	Genomic variation detection using dynamic programming methods Zhao, Mengyao January 2014 (has links) Thesis advisor: Gabor T. Marth / Background: Due to the rapid development and application of next generation sequencing (NGS) techniques, large amounts of NGS data have become available for genome-related biological research, such as population genetics, evolutionary research, and genome wide association studies. A crucial step of these genome-related studies is the detection of genomic variation between different species and individuals. Current approaches for the detection of genomic variation can be classified into alignment-based variation detection and assembly-based variation detection. Due to the limitation of current NGS read length, alignment-based variation detection remains the mainstream approach. The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Though various fast Smith-Waterman implementations are developed, they are either designed as monolithic protein database searching tools, which do not return detailed alignment, or they are embedded into other tools. These issues make reusing these efficient Smith-Waterman implementations impractical. After the alignment step in the traditional variation detection pipeline, the afterward variation detection using pileup data and the Bayesian model is also facing great challenges especially from low-complexity genomic regions. Sequencing errors and misalignment problems still influence variation detection (especially INDEL detection) a lot. The accuracy of genomic variation detection still needs to be improved, especially when we work on low- complexity genomic regions and low-quality sequencing data. Results: To facilitate easy integration of the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm into third-party software, we wrote a C/C++ library, which extends Farrar's Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost of efficiency. This improvement makes the fast Single-Instruction-Multiple-Data Smith-Waterman become really useful in genomic applications. SSW is available both as a C/C++ software library, as well as a stand-alone alignment tool at: https://github.com/mengyao/Complete- Striped-Smith-Waterman-Library. The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TAN- GRAM, and the read-overlap graph generation program RZMBLR. The speeds of the mentioned software are improved significantly by replacing their ordinary Smith-Waterman or banded Smith-Waterman module with the SSW Library. To improve the accuracy of genomic variation detection, especially in low-complexity genomic regions and on low-quality sequencing data, we developed PHV, a genomic variation detection tool based on the profile hidden Markov model. PHV also demonstrates a novel PHMM application in the genomic research field. The banded PHMM algorithms used in PHV make it a very fast whole-genome variation detection tool based on the HMM method. The comparison of PHV to GATK, Samtools and Freebayes for detecting variation from both simulated data and real data shows PHV has good potential for dealing with sequencing errors and misalignments. PHV also successfully detects a 49 bp long deletion that is totally misaligned by the mapping tool, and neglected by GATK and Samtools. Conclusion: The efforts made in this thesis are very meaningful for methodology development in studies of genomic variation detection. The two novel algorithms stated here will also inspire future work in NGS data analysis. / Thesis (PhD) — Boston College, 2014. / Submitted to: Boston College. Graduate School of Arts and Sciences. / Discipline: Biology. genomic variation detection profile hidden Markov model sequence alignment single instruction multiple data single nucleotide polymorphism Smith-Watherman algorithm
4	Predicting Critical Warps in Near-Threshold GPGPU Applications Using a Dynamic Choke Point Analysis Sanyal, Sourav 01 August 2019 (has links) General purpose graphics processing units (GP-GPU), owing to their enormous thread-level parallelism, can significantly improve the power consumption at the near-threshold (NTC) operating region, while offering close to a super-threshold performance. However, process variation (PV) can drastically reduce the GPU performance at NTC. In this work, choke points—a unique device-level characteristic of PV at NTC—that can exacerbate the warp criticality problem in GPUs have been explored. It is shown that the modern warp schedulers cannot tackle the choke point induced critical warps in an NTC GPU. Additionally, Choke Point Aware Warp Speculator, a circuit-architectural solution is proposed to dynamically predict the critical warps in GPUs, and accelerate them in their respective execution units. The best scheme achieves an average improvement of ∼39% in performance, and ∼31% in energy-efficiency, over one state-of-the-art warp scheduler, across 15 GPGPU applications, while incurring marginal hardware overheads. Process Variation Near-Threshold Computing General Purpose Graphics Processing Unit Warp Criticality Problem Single Instruction Multiple Data Computer Engineering
5	Introducing Machine Learning in a Vectorized Digital Signal Processor / Introduktion av Maskininlärning på en Vektoriserad Digital Signalprocessor Ridderström, Linnéa January 2023 (has links) Machine learning is rapidly being integrated into all areas of society, however, that puts a lot of pressure on resource costraint hardware such as embedded systems. The company Ericsson is gradually integrating machine learning based on neural networks, so-called deep learning, into their radio products. One promising product is their vectorized Digital Signal Processor (DSP) that are based upon the machine learning suitable Single Instruction, Multiple Data (SIMD) paradigm and Very Long Instruction Word (VLIW) architecture. However, despite the suitability of the SIMD paradigm, the embedded system needs to efficiently execute a computation-intensive deep learning algorithm with proper use of its limited resources. Therefore commonly used methods of implementing each layer of the computation-intensive Convolutional Neural Network (CNN), a type of Deep Neural Network (DNN), have been used and evaluated its implementation on the hardware and to assess the vectorized DSP’s deep learning suitability and capabilities. Despite the suitability of the hardware, the implementation utilized less than half of the available resources at all times during the execution. The main limitations were identified to be the limited 16-bit element instructions. To enhance the performance and improve the utilization of the available resources, easy-to-implement hardware instructions have been suggested. This work has made the first steps of implementing an efficiently performing CNN implementation on the examined vectorized DSP. / Integreringen av maskininlärning in i alla samhällsområden sker idag i rusande fart, men det sätter stor press på begränsad hårdvara som inbyggda system. Företaget Ericsson integrerar successivt maskininlärning baserad på neurala nätverk, så kallad djupinlärning, i sina radioprodukter. En lovande produkt är deras vektoriserade DSP som är baserade på maskininlärningspasset SIMD-paradigm och VLIW-arkitektur. Men trots lämpligheten av SIMD-paradigmet, är den största utmaningen att utnyttja de begränsade resurserna i inbyggda systemet för att effektivt exekvera en beräkningsintensiv djupinlärningsalgoritm. Därför har vanligt använda metoder för att implementera varje lager av den beräkningsintensiva CNN, en typ av DNN, använts och utvärderats på hårdvaran för att bedöma den vektoriserade DSP:s djupinlärningslämplighet samt förmågor. Trots hårdvarans lämplighet använde alla implementeringar mindre än hälften av de tillgängliga resurserna vid alla tidpunkter under exekveringen. De huvudsakliga begränsningarna identifierades vara den begränsade tillgången på 16-bitars element instruktioner. För att förbättra prestandan för ett närmare fullt utnyttjande av tillgängliga resurser har hårdvaruinstruktioner som är enkla att implementera föreslagits. Detta arbete har tagit de första stegen för att implementera ett effektivt förformande CNN på den undersökta vekotriserade DSP. Digital Signal Processor (DSP) Machine Learning Deep Learning Convolutional Neural Network (CNN) Very Long Instruction Word (VLIM) Single Instruction Multiple Data (SIMD) Digital Signalprocessor (DSP) Maskininlärning Djupinlärning Konvolutionella Neurala Nätverk (CNN) Very Long Instruction Word (VLIW) Single Instruction Multiple Data (SIMD) Elektroteknik och elektronik
6	Μονάδες επεξεργασίας δεδομένων για μικροεπεξεργαστές υψηλών αποδόσεων Δημητρακόπουλος, Γεώργιος 16 March 2009 (has links) Οι μονάδες επεξεργασίας δεδομένων αποτελούν τις βασικές δομικές μονάδες όλων των μικροεπεξεργαστών. Κάποια από τα κυκλώματα αυτής της κατηγορίας υλοποιούν τις βασικές αριθμητικές πράξεις πάνω σε δεδομένα τόσο σταθερής όσο και κινητής υποδιαστολής, ενώ κάποια άλλα αναλαμβάνουν την αναδιοργάνωση των δεδομένων αυτών για την επιτάχυνση του υπολογισμού. Σε επεξεργαστές ειδικού σκοπού, όπως οι επεξεργαστές πολυμέσων και γραφικών, οι μονάδες επεξεργασίας δεδομένων καταλαμβάνουν περισσότερο από το 30% του ολοκληρωμένου και η αποτελεσματική σχεδίαση τους έχει άμεσο αντίκτυπο στην απόδοση ολόκληρου του συστήματος. Στο μέλλον, αναμένεται πως ακόμα και οι επεξεργαστές γενικού σκοπού, θα είναι εξοπλισμένοι από εξειδικευμένους επιταχυντές, οι οποίοι θα εκτελούν απ’ ευθείας σε υλικό σύνθετους αλγορίθμους με μεγάλες υπολογιστικές απαιτήσεις. Η βάση όλων των προτεινόμενων λύσεων σ’ αυτή τη διατριβή είναι η αναλυτική εύρεση ενός εγγενώς απλούστερου αλγορίθμου, ο οποίος θα επιτρέπει την αποτελεσματική υλοποίηση των αντίστοιχων κυκλωμάτων ανεξάρτητα από την τεχνολογία που θα χρησιμοποιηθεί και από τους επιπλέον περιορισμούς που τυχόν θα επιβληθούν στο μέλλον κατά την κατασκευή των κυκλωμάτων αυτών. Η ανάλυση και τα πειραματικά αποτελέσματα που συλλέξαμε βασίζονται τόσο σε υλοποιήσεις σε επίπεδο τρανζίστορ, που είναι η κύρια μέχρι τώρα πρακτική σχεδίασης των μικροεπεξεργαστών υψηλών επιδόσεων, όσο και σε πλήρως αυτοματοποιημένες υλοποιήσεις. Φυσικά, στη δεύτερη περίπτωση η απόδοση των κυκλωμάτων επιβαρύνεται, τόσο σε καθυστέρηση όσο και σε ενέργεια, εξαιτίας των περιορισμών των αυτοματοποιημένων εργαλείων και την αναγκαστική χρήση των προσχεδιασμένων βιβλιοθηκών βασικών πυλών. Η μελέτη που πραγματοποιήσαμε στοχεύει στην πλήρη εξερεύνηση του χώρου λύσεων των κυκλωμάτων αυτών. Η ανάλυση της συμπεριφοράς τους πραγματοποιήθηκε χρησιμοποιώντας τις βέλτιστες καμπύλες της ενέργειας ως προς την καθυστέρηση, οι οποίες αποτελούν τον πιο έγκυρο τρόπο περιγραφής της απόδοσης ενός κυκλώματος. Τα κυκλώματα που παρουσιάζονται ανήκουν σε τρεις βασικές κατηγορίες. Στην πρώτη ανήκουν οι αθροιστές παράλληλου προθέματος, που χρησιμοποιούν τα κρατούμενα του Ling για την υλοποίηση της δυαδικής πρόσθεσης. Τα κρατούμενα που προτάθηκαν από τον Ling αποτελούν απλοποιημένες μορφές των κλασικών σχέσεων πρόβλεψης κρατουμένου και χρησιμοποιούνται αυτή τη στιγμή στην πλειοψηφία των εμπορικών επεξεργαστών. Το νέο κύκλωμα, που προτείναμε, αποτελεί ουσιαστικά τη γενίκευση των σχέσεων αυτών, επιτρέποντας την υλοποίηση τους με απλοποιημένες δομές παράλληλου προθέματος, με αποτέλεσμα τη μείωση τόσο της καθυστέρησης όσο και της απαιτούμενης ενέργειας. Η νέα τεχνική οδηγεί σε γρηγορότερα κυκλώματα ανεξάρτητα από τη λογική οικογένεια που θα χρησιμοποιηθεί (στατική ή δυναμική CMOS λογική) και το δένδρο παράλληλου προθέματος που θα επιλεγεί. Η δεύτερη κατηγορία αναφέρεται σε κυκλώματα αναδιάταξης των δεδομένων που είναι αποθηκευμένα μέσα στους καταχωρητές του επεξεργαστή. Η αποδοτική αναδιάταξη των δεδομένων καταλήγει να είναι σε πολλούς αλγορίθμους (κρυπτογραφία, ψηφιακή επεξεργασία σήματος, πολυμέσα) τόσο αναγκαία όσο και η γρήγορη υλοποίηση των βασικών αριθμητικών πράξεων, αλλά και η ταχεία επικοινωνία με τη μνήμη. H προσπάθεια μας εστιάστηκε στην αποδοτική υλοποίηση μιας γενικής εντολής αναδιάταξης δεδομένων, στοχεύοντας σε όσο το δυνατόν ταχύτερες υλοποιήσεις. Όλες οι εκδοχές που προτείναμε στηρίζονται σε μια νέα μορφή δικτύων ταξινόμησης, η οποία μας επιτρέπει να παρέχουμε λύσεις που είναι σημαντικά πιο αποδοτικές σε σχέση με τις ήδη υπάρχουσες. Τα κυκλώματα που προτείνουμε κατασκευάζονται με τη χρήση ενός μόνο κελιού υπολογισμού (διαφορετικό για κάθε δίκτυο ταξινόμησης) και διατηρούν μια πλήρως κανονική δομή. Το στοιχείο αυτό, συμβάλλει, πέρα από τη βελτίωση της απόδοσης, στην αποτελεσματικότερη χωροθέτηση του κυκλώματος και στη μείωση των αρνητικών επιδράσεων των γραμμών διασύνδεσης. Η τελευταία κατηγορία κυκλωμάτων αναφέρεται σε κυκλώματα που χρησιμοποιούνται για την υλοποίηση της πρόσθεσης αριθμών κινητής υποδιαστολής. Τα κυκλώματα που προτείνουμε χρησιμοποιούνται στα πιο κρίσιμα στάδια, από πλευράς καθυστέρησης, του υπολογισμού του αθροίσματος και αφορούν στην πρόσθεση των μεγεθών και στην κανονικοποίηση του αποτελέσματος. Αρχικά, περιγράφουμε μια εναλλακτική προσέγγιση για την υλοποίηση των αθροιστών μεγέθους των αριθμών κινητής υποδιαστολής. Οι νέες μονάδες εκμεταλλεύονται την αναπαράσταση συμπληρώματος ως προς ένα και τις γρήγορες μονάδες υπολογισμού του κρατουμένου, που βασίζονται στην τεχνική παράλληλου προθέματος. Προτείνουμε μια ενοποιημένη μεθοδολογία για το πως μπορούμε να παράγουμε δομές παράλληλου προθέματος ανεξάρτητα από το μέγεθος της λέξης εισόδου, ενώ καταφέρνουμε να ενώσουμε για πρώτη φορά τις απλοποιημένες σχέσεις κρατουμένου του Ling με την πρόσθεση αριθμών που ακολουθούν την αναπαράσταση συμπληρώματος ως προς ένα. Στη συνέχεια, περιγράφεται ένας νέος απλός τρόπος για την υλοποίηση της πρόβλεψης και της μέτρησης των προπορευόμενων μηδενικών που εμφανίζονται στα αποτελέσματα των πράξεων αριθμών κινητής υποδιαστολής. Με τη χρήση των νέων κυκλωμάτων η κανονικοποίηση του αποτελέσματος μπορεί να πραγματοποιηθεί σε λιγότερο χρόνο και με σημαντικά μικρότερη ενέργεια. / Data processing units (or simply datapath) constitute a major part of all microprocessors. They take over the execution of all arithmetic operations either of fixed point or floating-point data, while they are also responsible for the execution of the needed data rearrangements in order to speed up the computation. In application-specific processors used for media and graphics applications, datapath circuits occupy more than one third of the processor’s core area and their efficient design directly affects the energy-delay behavior of the whole circuit. In the near future, it is expected that even general-purpose processors will be equipped we specialized accelerators that will execute directly in hardware complex algorithms with large computational demands. The basis of all circuits presented in this thesis is the derivation of an inherently simpler algorithm that would allow their efficient implementation irrespective the technology used and the constraints that would be imposed in the future, concerning the reliable and more predictable circuit fabrication in very deep submicron technologies. Our analysis relies on full-custom transistor-level designs that is the most common technique employed in high-performance microprocessor design. The performance of some of the presented circuits has also been investigated using an automated design flow. It is expected that, in these cases, the performance of the presented circuits will be aggravated due to the limitations imposed by the design automation tools and the available standard cell library. In this study, we aim at fully exploring the design space of our circuits. For this reason, we derived an optimal energy-delay curve for each one of the examined circuits in order to analyze its behavior. An energy-delay curve is the most reliable metric for presenting the performance of a circuit and allows the designer to perform a fair comparison among various design alternatives and circuit topologies. The new circuits presented in this thesis belong to three categories. In the first class, we find the parallel prefix adders that adopt the carries proposed by Ling. These carries are a simplified form of the classic carry lookahead equations and they are used at the moment in the majority of commercial high-speed microprocessors. The newly proposed circuits are based on a transformation of the Ling carries that leads to more efficient parallel prefix structures, which are better suited for Ling-carry computation. This new technique offers faster implementations irrespective the logic family used (either static or dynamic CMOS) and the prefix structure selected for the implementation. The second class refers to circuits that rearrange the data stored inside one or more of the processor’s registers. Efficient data rearrangement ends up being, in many cases, such as cryptography, digital signal processing, and multimedia applications, as essential as the fast implementation of basic arithmetic operations and the high bandwidth processor-memory communication. Our effort has focused on the efficient implementation of one of the most versatile permutation instruction, aiming to the reduction of the delay of the corresponding circuit. The design of the proposed permutation units is put under a common framework and their functionality resembles that of sorting networks. All the presented variants are designed using a single processing element (different for each sorting network) and have a very regular structure. This fact significantly contributes to the delay reduction because of the regular placement of the circuits’ cells that also alleviates the interconnect delay overhead. The last class of circuits is used for the implementation of high-speed floating-point units. The proposed circuits participate in two of the most time critical parts of any floating-point adder that is the significand (or fraction) adder and the result normalization unit. At first, we describe an alternative implementation of the significant adder that employs the one’s complement representation in order to reduce the delay of the circuit. The proposed parallel-prefix structures are derived using a general design methodology that leads to efficient designs irrespective the wordlength of the input operands. Also, we managed for the first time to produce simplified parallel-prefix carry computation units for the case of one’s complement addition that rely on the definition of Ling carries. Secondly, we describe a simple and practical algorithm for counting the number of leading zeros that may appear in the result of floating-point addition. New circuits are also presented that simplify the design of the corresponding leading zero anticipation logic. Using the proposed structures, normalization can be performed with less delay and significantly reduced power dissipation compared to already known implementations. Μικροεπεξεργαστές 621.39 Very large scale integration (VLSI) Datapath design Floating point units
7	An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations Pananilath, Irshad Muhammed January 2014 (has links) (PDF) Lattice-Boltzmann method(LBM), a promising new particle-based simulation technique for complex and multiscale fluid flows, has seen tremendous adoption in recent years in computational fluid dynamics. Even with a state-of-the-art LBM solver such as Palabos, a user still has to manually write his program using the library-supplied primitives. We propose an automated code generator for a class of LBM computations with the objective to achieve high performance on modern architectures. Tiling is a very important loop transformation used to improve the performance of stencil computations by exploiting locality and parallelism. In the first part of the work, we explore diamond tiling, a new tiling technique to exploit the inherent ability of most stencils to allow tile-wise concurrent start. This enables perfect load-balance during execution and reduces the frequency of synchronization required. Few studies have looked at time tiling for LBM codes. We exploit a key similarity between stencils and LBM to enable polyhedral optimizations and in turn time tiling for LBM. Besides polyhedral transformations, we also describe a number of other complementary transformations and post processing necessary to obtain good parallel and SIMD performance on modern architectures. We also characterize the performance of LBM with the Roofline performance model. Experimental results for standard LBM simulations like Lid Driven Cavity, Flow Past Cylinder, and Poiseuille Flow show that our scheme consistently outperforms Palabos–on average by3 x while running on 16 cores of a n Intel Xeon Sandy bridge system. We also obtain a very significant improvement of 2.47 x over the native production compiler on the SPECLBM benchmark. Lattice-Boltzmann Computations Computational Fluid Dynamics Tiling Stencil Computations Single Instruction Multiple Data (SIMD) Parallel Computers Parallel Processing Loop Transformations Lattice-Boltzman Method (LBM) Lattice Boltzman Method Lattice-Boltzmann Equation Computer Science
8	SIMD-aware word length optimization for floating-point to fixed-point conversion targeting embedded processors / Optimisation SIMD de la largeur des mots pour la conversion de virgule flottante en virgule fixe pour des processeurs embarqués El Moussawi, Ali Hassan 16 December 2016 (has links) Afin de limiter leur coût et/ou leur consommation électrique, certains processeurs embarqués sacrifient le support matériel de l'arithmétique à virgule flottante. Pourtant, pour des raisons de simplicité, les applications sont généralement spécifiées en utilisant l'arithmétique à virgule flottante. Porter ces applications sur des processeurs embarqués de ce genre nécessite une émulation logicielle de l'arithmétique à virgule flottante, qui peut sévèrement dégrader la performance. Pour éviter cela, l'application est converti pour utiliser l'arithmétique à virgule fixe, qui a l'avantage d'être plus efficace à implémenter sur des unités de calcul entier. La conversion de virgule flottante en virgule fixe est une procédure délicate qui implique des compromis subtils entre performance et précision de calcul. Elle permet, entre autre, de réduire la taille des données pour le coût de dégrader la précision de calcul. Par ailleurs, la plupart de ces processeurs fournissent un support pour le calcul vectoriel de type SIMD (Single Instruction Multiple Data) afin d'améliorer la performance. En effet, cela permet l'exécution d'une opération sur plusieurs données en parallèle, réduisant ainsi le temps d'exécution. Cependant, il est généralement nécessaire de transformer l'application pour exploiter les unités de calcul vectoriel. Cette transformation de vectorisation est sensible à la taille des données ; plus leurs tailles diminuent, plus le taux de vectorisation augmente. Il apparaît donc un compromis entre vectorisation et précision de calcul. Plusieurs travaux ont proposé des méthodologies permettant, d'une part la conversion automatique de virgule flottante en virgule fixe, et d'autre part la vectorisation automatique. Dans l'état de l'art, ces deux transformations sont considérées indépendamment, pourtant elles sont fortement liées. Dans ce contexte, nous étudions la relation entre ces deux transformations, dans le but d'exploiter efficacement le compromis entre performance et précision de calcul. Ainsi, nous proposons d'abord un algorithme amélioré pour l'extraction de parallélisme SLP (Superword Level Parallelism ; une technique de vectorisation). Puis, nous proposons une nouvelle méthodologie permettant l'application conjointe de la conversion de virgule flottante en virgule fixe et de l'exploitation du SLP. Enfin, nous implémentons cette approche sous forme d'un flot de compilation source-à-source complètement automatisé, afin de valider ces travaux. Les résultats montrent l'efficacité de cette approche, dans l'exploitation du compromis entre performance et précision, vis-à-vis d'une approche classique considérant ces deux transformations indépendamment. / In order to cut-down their cost and/or their power consumption, many embedded processors do not provide hardware support for floating-point arithmetic. However, applications in many domains, such as signal processing, are generally specified using floating-point arithmetic for the sake of simplicity. Porting these applications on such embedded processors requires a software emulation of floating-point arithmetic, which can greatly degrade performance. To avoid this, the application is converted to use fixed-point arithmetic instead. Floating-point to fixed-point conversion involves a subtle tradeoff between performance and precision ; it enables the use of narrower data word lengths at the cost of degrading the computation accuracy. Besides, most embedded processors provide support for SIMD (Single Instruction Multiple Data) as a mean to improve performance. In fact, this allows the execution of one operation on multiple data in parallel, thus ultimately reducing the execution time. However, the application should usually be transformed in order to take advantage of the SIMD instruction set. This transformation, known as Simdization, is affected by the data word lengths ; narrower word lengths enable a higher SIMD parallelism rate. Hence the tradeoff between precision and Simdization. Many existing work aimed at provide/improving methodologies for automatic floating-point to fixed-point conversion on the one side, and Simdization on the other. In the state-of-the-art, both transformations are considered separately even though they are strongly related. In this context, we study the interactions between these transformations in order to better exploit the performance/accuracy tradeoff. First, we propose an improved SLP (Superword Level Parallelism) extraction (an Simdization technique) algorithm. Then, we propose a new methodology to jointly perform floating-point to fixed-point conversion and SLP extraction. Finally, we implement this work as a fully automated source-to-source compiler flow. Experimental results, targeting four different embedded processors, show the validity of our approach in efficiently exploiting the performance/accuracy tradeoff compared to a typical approach, which considers both transformations independently. Optimisation de la largeur des mots Vectorisation Processeurs embarqués Compilation source-À-Source Génération de code C Embedded processors Source-To-Source compilation Floating-Point to fixed-Point conversion Single Instruction Multiple Data (SIMD) Superword Level Parallelism Word length conversion C code generation

Search results