Global ETD Search

1	On the automated compilation of UML notation to a VLIW chip multiprocessor Stevens, David January 2013 (has links) With the availability of more and more cores within architectures the process of extracting implicit and explicit parallelism in applications to fully utilise these cores is becoming complex. Implicit parallelism extraction is performed through the inclusion of intelligent software and hardware sections of tool chains although these reach their theoretical limit rather quickly. Due to this the concept of a method of allowing explicit parallelism to be performed as fast a possible has been investigated. This method enables application developers to perform creation and synchronisation of parallel sections of an application at a finer-grained level than previously possible, resulting in smaller sections of code being executed in parallel while still reducing overall execution time. Alongside explicit parallelism, a concept of high level design of applications destined for multicore systems was also investigated. As systems are getting larger it is becoming more difficult to design and track the full life-cycle of development. One method used to ease this process is to use a graphical design process to visualise the high level designs of such systems. One drawback in graphical design is the explicit nature in which systems are required to be generated, this was investigated, and using concepts already in use in text based programming languages, the generation of platform-independent models which are able to be specialised to multiple hardware architectures was developed. The explicit parallelism was performed using hardware elements to perform thread management, this resulted in speed ups of over 13 times when compared to threading libraries executed in software on commercially available processors. This allowed applications with large data dependent sections to be parallelised in small sections within the code resulting in a decrease of overall execution time. The modelling concepts resulted in the saving of between 40-50% of the time and effort required to generate platform-specific models while only incurring an overhead of up to 15% the execution cycles of these models designed for specific architectures. 621.3
2	Performance optimization mechanisms for fault-resilient VLIW processors / Mécanismes d'optimisation des performances des processeurs VLIW à tolérance de fautes Psiakis, Rafail 21 December 2018 (has links) Les processeurs intégrés dans des domaines critiques exigent une combinaison de fiabilité, de performances et de faible consommation d'énergie. Very Large Instruction Word (VLIW) processeurs améliorent les performances grâce à l'exploitation ILP (Instruction Level Parallelism), tout en maintenant les coûts et la puissance à un niveau bas. L’ILP étant fortement dépendant de l'application, le processeur n'utilise pas toutes ses ressources en permanence et ces ressources peuvent donc être utilisées pour l'exécution d'instructions redondantes. Cette thèse présente une méthodologie d’injection fautes pour processeurs VLIW et trois mécanismes matériels pour traiter les pannes légères, permanentes et à long terme menant à trois contributions.La première contribution présente un schéma d’analyse du facteur de vulnérabilité architecturale et du facteur de vulnérabilité d’instruction pour les processeurs VLIW. Une méthodologie d’injection de fautes au niveau de différentes structures de mémoire est proposée pour extraire les capacités de masquage architecture / instruction du processeur. Un schéma de classification des défaillances de haut niveau est présenté pour catégoriser la sortie du processeur. La deuxième contribution explore les ressources inactives hétérogènes au moment de l'exécution, à l'intérieur et à travers des ensembles d'instructions consécutifs. Pour ce faire, une technique d’ordonnancement des instructions optimisée pour le matériel est appliquée en parallèle avec le pipeline afin de contrôler efficacement la réplication et l’ordonnancement des instructions. Suivant les tendances à la parallélisation croissante, une conception basée sur les clusters est également proposée pour résoudre les problèmes d’évolutivité, tout en maintenant une pénalité surface/énergie raisonnable. La technique proposée accélère la performance de 43,68% avec une surcoût en surface et en énergie de ~10% par rapport aux approches existantes. Les analyses AVF et IVF évaluent la vulnérabilité du processeur avec le mécanisme proposé.La troisième contribution traite des défauts persistants. Un mécanisme matériel est proposé, qui réplique au moment de l'exécution les instructions et les planifie aux emplacements inactifs en tenant compte des contraintes de ressources. Si une ressource devient défaillante, l'approche proposée permet de relier efficacement les instructions d'origine et les instructions répliquées pendant l'exécution. Les premiers résultats de performance d’évaluation montrent un gain de performance jusqu’à 49% sur les techniques existantes.Afin de réduire davantage le surcoût lié aux performances et de prendre en charge l’atténuation des erreurs uniques et multiples sur les transitoires de longue durée (LDT), une quatrième contribution est présentée. Nous proposons un mécanisme matériel qui détecte les défauts toujours actifs pendant l'exécution et réorganise les instructions pour utiliser non seulement les unités fonctionnelles saines, mais également les composants sans défaillance des unités fonctionnelles concernées. Lorsque le défaut disparaît, les composants de l'unité fonctionnelle concernés peuvent être réutilisés. La fenêtre de planification du mécanisme proposé comprend deux ensembles d'instructions pouvant explorer des solutions d'atténuation lors de l'exécution de l'instruction en cours et de l'instruction suivante. Les résultats obtenus sur l'injection de fautes montrent que l'approche proposée peut atténuer un grand nombre de fautes avec des performances, une surface et une surcharge de puissance faibles. / Embedded processors in critical domains require a combination of reliability, performance and low energy consumption. Very Long Instruction Word (VLIW) processors provide performance improvements through Instruction Level Parallelism (ILP) exploitation, while keeping cost and power in low levels. Since the ILP is highly application dependent, the processor does not use all its resources constantly and, thus, these resources can be utilized for redundant instruction execution. This thesis presents a fault injection methodology for VLIW processors and three hardware mechanisms to deal with soft, permanent and long-term faults leading to three contributions. The first contribution presents an Architectural Vulnerability Factor (AVF) and Instruction Vulnerability Factor (IVF) analysis schema for VLIW processors. A fault injection methodology at different memory structures is proposed to extract the architectural/instruction masking capabilities of the processor. A high-level failure classification schema is presented to categorize the output of the processor. The second contribution explores heterogeneous idle resources at run-time both inside and across consecutive instruction bundles. To achieve this, a hardware optimized instruction scheduling technique is applied in parallel with the pipeline to efficiently control the replication and the scheduling of the instructions. Following the trends of increasing parallelization, a cluster-based design is also proposed to tackle the issues of scalability, while maintaining a reasonable area/power overhead. The proposed technique achieves a speed-up of 43.68% in performance with a ~10% area and power overhead over existing approaches. AVF and IVF analysis evaluate the vulnerability of the processor with the proposed mechanism.The third contribution deals with persistent faults. A hardware mechanism is proposed which replicates at run-time the instructions and schedules them at the idle slots considering the resource constraints. If a resource becomes faulty, the proposed approach efficiently rebinds both the original and replicated instructions during execution. Early evaluation performance results show up to 49\% performance gain over existing techniques.In order to further decrease the performance overhead and to support single and multiple Long-Duration Transient (LDT) error mitigation a fourth contribution is presented. We propose a hardware mechanism, which detects the faults that are still active during execution and re-schedules the instructions to use not only the healthy function units, but also the fault-free components of the affected function units. When the fault faints, the affected function unit components can be reused. The scheduling window of the proposed mechanism is two instruction bundles being able to explore mitigation solutions in the current and the next instruction execution. The obtained fault injection results show that the proposed approach can mitigate a large number of faults with low performance, area, and power overhead. Very Long Instruction Word (VLIW) Tolérance aux fautes Exploitation des ressources inactives Erreurs légères Long Duration Transients (LDTs) Erreurs permanentes Planification des instructions en ligne Very Long Instruction Word (VLIW) Fault Tolerance Idle resource exploitation Soft errors Long Duration Transients(LDTs) Permanent Errors Hardware online re-Scheduling
3	Evaluation Of Register Allocation And Instruction Scheduling Methods In Multiple Issue Processors Valluri, Madhavi Gopal 01 1900 (has links) (PDF) No description available. Compiling (Electronic Computers) Multiprocessors Instruction Scheduling Compilers Register Allocation Machine Models Instruction-Level Parallelism (ILP) Modulo-Variable Expansion (MVE) Sensitive Scheduling Computer Science
4	Introducing Machine Learning in a Vectorized Digital Signal Processor / Introduktion av Maskininlärning på en Vektoriserad Digital Signalprocessor Ridderström, Linnéa January 2023 (has links) Machine learning is rapidly being integrated into all areas of society, however, that puts a lot of pressure on resource costraint hardware such as embedded systems. The company Ericsson is gradually integrating machine learning based on neural networks, so-called deep learning, into their radio products. One promising product is their vectorized Digital Signal Processor (DSP) that are based upon the machine learning suitable Single Instruction, Multiple Data (SIMD) paradigm and Very Long Instruction Word (VLIW) architecture. However, despite the suitability of the SIMD paradigm, the embedded system needs to efficiently execute a computation-intensive deep learning algorithm with proper use of its limited resources. Therefore commonly used methods of implementing each layer of the computation-intensive Convolutional Neural Network (CNN), a type of Deep Neural Network (DNN), have been used and evaluated its implementation on the hardware and to assess the vectorized DSP’s deep learning suitability and capabilities. Despite the suitability of the hardware, the implementation utilized less than half of the available resources at all times during the execution. The main limitations were identified to be the limited 16-bit element instructions. To enhance the performance and improve the utilization of the available resources, easy-to-implement hardware instructions have been suggested. This work has made the first steps of implementing an efficiently performing CNN implementation on the examined vectorized DSP. / Integreringen av maskininlärning in i alla samhällsområden sker idag i rusande fart, men det sätter stor press på begränsad hårdvara som inbyggda system. Företaget Ericsson integrerar successivt maskininlärning baserad på neurala nätverk, så kallad djupinlärning, i sina radioprodukter. En lovande produkt är deras vektoriserade DSP som är baserade på maskininlärningspasset SIMD-paradigm och VLIW-arkitektur. Men trots lämpligheten av SIMD-paradigmet, är den största utmaningen att utnyttja de begränsade resurserna i inbyggda systemet för att effektivt exekvera en beräkningsintensiv djupinlärningsalgoritm. Därför har vanligt använda metoder för att implementera varje lager av den beräkningsintensiva CNN, en typ av DNN, använts och utvärderats på hårdvaran för att bedöma den vektoriserade DSP:s djupinlärningslämplighet samt förmågor. Trots hårdvarans lämplighet använde alla implementeringar mindre än hälften av de tillgängliga resurserna vid alla tidpunkter under exekveringen. De huvudsakliga begränsningarna identifierades vara den begränsade tillgången på 16-bitars element instruktioner. För att förbättra prestandan för ett närmare fullt utnyttjande av tillgängliga resurser har hårdvaruinstruktioner som är enkla att implementera föreslagits. Detta arbete har tagit de första stegen för att implementera ett effektivt förformande CNN på den undersökta vekotriserade DSP. Digital Signal Processor (DSP) Machine Learning Deep Learning Convolutional Neural Network (CNN) Very Long Instruction Word (VLIM) Single Instruction Multiple Data (SIMD) Digital Signalprocessor (DSP) Maskininlärning Djupinlärning Konvolutionella Neurala Nätverk (CNN) Very Long Instruction Word (VLIW) Single Instruction Multiple Data (SIMD) Elektroteknik och elektronik

1

Page generated in 0.0912 seconds