01 January 2009
Time predictability is one of the most important design considerations for real-time systems. In this dissertation, time predictability of the instruction cache is studied on both single core processors and multi-core processors. It is observed that many features in modern microprocessor architecture such as cache memories and branch prediction are in favor of average-case performance, which can significantly compromise the time predictability and make accurate worst-case performance analysis extremely difficult if not impossible. Therefore, the time predictability of VLIW (Very Long Instruction Word) processors and its compiler support is studied. The impediments to time predictability for VLIW processors are analyzed and compiler-based techniques to address these problems with minimal modifications to the VLIW hardware design are proposed. Specifically, the VLIW compiler is enhanced to support full if conversion, hyperblock scheduling, and intra-block nop insertion to enable efficient WCET (Worst Case Execution Time) analysis for VLIW processors. Our time-predictable processor incorporates the instruction caches which can mitigate the latency of fetching instructions that hit in the cache. For instruction missing from the cache, instruction prefetching is a useful technique to boost the average-case performance. However, it is unclear whether or not instruction prefetching can benefit the worst-case performance as well. Thus, the impact of instruction prefetching on the worst-case performance of instruction caches is studied. Extension of the static cache simulation technique is applied to model and compute the worst-case instruction cache performance with prefetching. It is shown that instruction prefetching can be reasonably bound, however, the time variation of computing is increased by instruction prefetching. As the technology advances, it is projected that multi-core chips will be increasingly adopted by microprocessor industry. For real-time systems to safely harness the potential of multi-core computing, designers must be able to accurately obtain the worst-case execution time (WCET) of applications running on multi-core platforms, which is very challenging due to the possible runtime inter-core interferences in using shared resources such as the shared L2 caches. As the first step toward time-predictable multi-core computing, this dissertation presents novel approaches to bounding the worst-case performance for threads running on multi-core processors with shared L2 instruction caches. CF (Control Flow) based approach. This approach computes the worst-case instruction access interferences between different threads based on the program control flow information of each thread, which can be statically analyzed. Extended ILP (Integer Linear Programming) based approach. This approach uses constraint programming to model the worst-case instruction access interferences between different threads. In the context of timing analysis in many core architecture, static approaches may also face the scalability issue. Thus, it is important and challenging to design time predictable caches in multi-core architecture. We propose an approach to leverage the prioritized shared L2 caches to improve time predictability for real-time threads running on multi-core processors. The prioritized shared L2 caches give higher priority to real-time threads while allowing low-priority threads to use shared L2 cache space that is available. Detailed implementation and experimental results discussion are presented in this dissertation.
Very Long Instruction Word (VLIW) processors are wide-issue statically scheduled processors. Instruction scheduling for these processors is performed by the compiler and is therefore a critical factor for its operation. Some VLIWs are clustered, a design that improves scalability to higher issue widths while improving energy efficiency and frequency. Their design is based on physically partitioning the shared hardware resources (e.g., register file). Such designs further increase the challenges of instruction scheduling since the compiler has the additional tasks of deciding on the placement of the instructions to the corresponding clusters and orchestrating the data movements across clusters. In this thesis we propose instruction scheduling optimizations for energy-efficient VLIW processors. Some of the techniques aim at improving the existing state-of-theart scheduling techniques, while others aim at using compiler techniques for closing the gap between lightweight hardware designs and more complex ones. Each of the proposed techniques target individual features of energy efficient VLIW architectures. Our first technique, called Aligned Scheduling, makes use of a novel scheduling heuristic for hiding memory latencies in lightweight VLIW processors without hardware load-use interlocks (Stall-On-Miss). With Aligned Scheduling, a software-only technique, a SOM processor coupled with non-blocking caches can better cope with the cache latencies and it can perform closer to the heavyweight designs. Performance is improved by up to 20% across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites. The rest of the techniques target a class of VLIW processors known as clustered VLIWs, that are more scalable and more energy efficient and operate at higher frequencies than their monolithic counterparts. The second scheme (LUCAS) is an improved scheduler for clustered VLIW processors that solves the problem of the existing state-of-the-art schedulers being very susceptible to the inter-cluster communication latency. The proposed unified clustering and scheduling technique is a hybrid scheme that performs instruction by instruction switching between the two state-of-the-art clustering heuristics, leading to better scheduling than either of them. It generates better performing code compared to the state-of-the-art for a wide range of inter-cluster latency values on the Mediabench II benchmarks. The third technique (called CAeSaR) is a scheduler for clustered VLIW architectures that minimizes inter-cluster communication by local caching and reuse of already received data. Unlike dynamically scheduled processors, where this can be supported by the register renaming hardware, in VLIWs it has to be done by the code generator. The proposed instruction scheduler unifies cluster assignment, instruction scheduling and communication minimization in a single unified algorithm, solving the phase ordering issues between all three parts. The proposed scheduler shows an improvement in execution time of up to 20.3% and 13.8% on average across a range of benchmarks from the Mediabench II and SPEC CINT2000 benchmark suites. The last technique, applies to heterogeneous clustered VLIWs that support dynamic voltage and frequency scaling (DVFS) independently per cluster. In these processors there are no hardware interlocks between clusters to honor the data dependencies. Instead, the scheduler has to be aware of the DVFS decisions to guarantee correct execution. Effectively controlling DVFS, to selectively decrease the frequency of clusters with slack in their schedule, can lead to significant energy savings. The proposed technique (called UCIFF) solves the phase ordering problem between frequency selection and scheduling that is present in existing algorithms. The results show that UCIFF produces better code than the state-of-the-art and very close to the optimal across the Mediabench II benchmarks. Overall, the proposed instruction scheduling techniques lead to either better efficiency on existing designs or allow simpler lightweight designs to be competitive against ones with more complex hardware.
VLIW processors are parallel computing devices that are used in embedded devices as well as in servers. My thesis contains description of this architecture. It is aimed at making and subsequently implementing design of custom general-purpose VLIW processor with wide range of configurable parameters. Operational implementation of such processor in VHDL which can be tested on FITkit platform is an integral part.
Překlad OpenCL aplikací pro vestavěné systémy / Compilation of OpenCL Applications for Embedded SystemsŠnobl, Pavel January 2016 (has links)
This master's thesis deals with the support for compilation and execution of programs written using OpenCL framework on embedded systems. OpenCL is a system for programming heterogeneous systems comprising processors, graphic accelerators and other computing devices. But it also finds usage on systems composed of just one computing unit, where it allows to write parallel programs (task and data parallelism) and work with hierarchical system of memories. In this thesis, various available open source OpenCL implementations are compared and one selected is then integrated into LLVM compiler infrastructure. This compiler is generated as a part of toolchain provided by application specific instruction set architecture processor developement environment called Codasip Studio. Designed and implemented are also optimizations for architectures with SIMD instructions and VLIW architectures. The result is tested and demonstrated on a set of testing applications.
23 August 2005
The main goal of this thesis is to design and implement the high performance processor core for completing those digital signal processing algorithms applied at the DVB-T systems. The DSP must support the signal flow in time. Completing the FFT algorithm at 8192 input signal points instantaneously is the most important key. In order to achieve the time demand of FFT and the DSP frequency must be as lower as possible, the way is to increase the degree of instruction level parallelism (ILP). The thesis designs a VLIW architecture processing core called DVB-T DSP to support instruction parallelism with enough execution units. The thesis also uses the software pipelining to schedule the loop to achieve the highest ILP when used to execute FFT butterfly operations. Furthermore, in order to provide the smooth data stream for pipeline, the thesis designs a mechanism to improve the modulo addressing, called extended modulo addressing, will collect the discrete vectors into one continuous vector. This is a big problem that the program size is bigger than other processor architecture at the VLIW processor architecture. In order to solve the problem, this thesis proposes an instruction compression mechanism, which can increase double program density and does not affect the processor execution efficiency. The simulation result shows that DVB-T DSP can achieve the time demand of FFT at 133Mhz. DVB-T DSP also has good performance for other digital signal processing algorithms.
Universiẗat, Diss.--Paderborn, 1995.
25 June 2003
In order to improving the performance for real-time application, current digital signal processors use VLIW architectures to increase the degree of instruction level parallelism (ILP). Two factors will limit the ILP, one is enough hardware resource for all parallel instructions. Another is the dependence relations between instructions. This thesis designs a VLIW architecture processing core called DVBTDSP molded by FFT algorithm and uses the software pipelining mechanism to schedule the loop to achieve the highest ILP degree when used to execute FFT butterfly operations. Furthermore, in order to provide the smooth data stream for pipeline operations, we design a mechanism to improve the modulo addressing, which will collect the discrete vectors into one continuous vector. The simulation results show that the DVBTDSP has double performance of the C6200 for the FFT processing, and has good performance for FIR, IIR and DCT algorithm computing.
Simulation Native des Systèmes Multiprocesseurs sur Puce à l'aide de la Virtualisation Assistée par le Matériel / Native Simulation of Multiprocessor System-on-Chip using Hardware-Assisted VirtualizationHamayun, Mian Muhammad 04 July 2013 (has links)
L'intégration de plusieurs processeurs hétérogènes en un seul système sur puce (SoC) est une tendance claire dans les systèmes embarqués. La conception et la vérification de ces systèmes nécessitent des plateformes rapides de simulation, et faciles à construire. Parmi les approches de simulation de logiciels, la simulation native est un bon candidat grâce à l'exécution native de logiciel embarqué sur la machine hôte, ce qui permet des simulations à haute vitesse, sans nécessiter le développement de simulateurs d'instructions. Toutefois, les techniques de simulation natives existantes exécutent le logiciel de simulation dans l'espace de mémoire partagée entre le matériel modélisé et le système d'exploitation hôte. Il en résulte de nombreux problèmes, par exemple les conflits l'espace d'adressage et les chevauchements de mémoire ainsi que l'utilisation des adresses de la machine hôte plutôt des celles des plates-formes matérielles cibles. Cela rend pratiquement impossible la simulation native du code existant fonctionnant sur la plate-forme cible. Pour surmonter ces problèmes, nous proposons l'ajout d'une couche transparente de traduction de l'espace adressage pour séparer l'espace d'adresse cible de celui du simulateur de hôte. Nous exploitons la technologie de virtualisation assistée par matériel (HAV pour Hardware-Assisted Virtualization) à cet effet. Cette technologie est maintenant disponibles sur plupart de processeurs grande public à usage général. Les expériences montrent que cette solution ne dégrade pas la vitesse de simulation native, tout en gardant la possibilité de réaliser l'évaluation des performances du logiciel simulé. La solution proposée est évolutive et flexible et nous fournit les preuves nécessaires pour appuyer nos revendications avec des solutions de simulation multiprocesseurs et hybrides. Nous abordons également la simulation d'exécutables cross- compilés pour les processeurs VLIW (Very Long Instruction Word) en utilisant une technique de traduction binaire statique (SBT) pour généré le code natif. Ainsi il n'est pas nécessaire de faire de traduction à la volée ou d'interprétation des instructions. Cette approche est intéressante dans les situations où le code source n'est pas disponible ou que la plate-forme cible n'est pas supporté par les compilateurs reciblable, ce qui est généralement le cas pour les processeurs VLIW. Les simulateurs générés s'exécutent au-dessus de notre plate-forme basée sur le HAV et modélisent les processeurs de la série C6x de Texas Instruments (TI). Les résultats de simulation des binaires pour VLIW montrent une accélération de deux ordres de grandeur par rapport aux simulateurs précis au cycle près. / Integration of multiple heterogeneous processors into a single System-on-Chip (SoC) is a clear trend in embedded systems. Designing and verifying these systems require high-speed and easy-to-build simulation platforms. Among the software simulation approaches, native simulation is a good candidate since the embedded software is executed natively on the host machine, resulting in high speed simulations and without requiring instruction set simulator development effort. However, existing native simulation techniques execute the simulated software in memory space shared between the modeled hardware and the host operating system. This results in many problems, including address space conflicts and overlaps as well as the use of host machine addresses instead of the target hardware platform ones. This makes it practically impossible to natively simulate legacy code running on the target platform. To overcome these issues, we propose the addition of a transparent address space translation layer to separate the target address space from that of the host simulator. We exploit the Hardware-Assisted Virtualization (HAV) technology for this purpose, which is now readily available on almost all general purpose processors. Experiments show that this solution does not degrade the native simulation speed, while keeping the ability to accomplish software performance evaluation. The proposed solution is scalable as well as flexible and we provide necessary evidence to support our claims with multiprocessor and hybrid simulation solutions. We also address the simulation of cross-compiled Very Long Instruction Word (VLIW) executables, using a Static Binary Translation (SBT) technique to generated native code that does not require run-time translation or interpretation support. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on top of our HAV based platform and model the Texas Instruments (TI) C6x series processors. Simulation results for VLIW binaries show a speed-up of around two orders of magnitude compared to the cycle accurate simulators.
Contributions à la traduction binaire dynamique : support du parallélisme d'instructions et génération de traducteurs optimisés / Contributions to dynamic binary translation : instruction parallelism support and optimized translators generatorMichel, Luc 18 December 2014 (has links)
Les unités de calculs qui composent les systèmes intégrés numériques d'aujourd'hui sont complexes, hétérogènes, et en nombre toujours croissant.La simulation, largement utilisée tant dans les phases de conception logicielle que matérielle de ces systèmes devient donc un vrai défi.Lors de la simulation du système, la performance est en grande partie édictée par la stratégie de simulation des jeux d'instructions des processeurs.La traduction binaire dynamique (DBT) est une technique qui a fait ses preuves dans ce contexte.Le principe de cette solution est de traduire au fur et à mesure les instructions du programme simulé (la cible), en instructions compréhensibles par la machine exécutant la simulation (l'hôte).C'est une technique rapide, mais la réalisation de simulateurs fondée sur cette technologie reste complexe.Elle est d'une part limitée en terme d'architectures cibles supportées, et d'autre part compliquée dans sa mise en œuvre effective qui requiert de longs et délicats développements.Les travaux menés dans cette thèse s'articulent autour de deux contributions majeures.La première s'attaque au support des architectures cibles de type Very Long Instruction Word (VLIW), en étudiant leurs particularités vis-à-vis de la DBT.Certaines de ces spécificités, tel le parallélisme explicite entre instructions, rendent la traduction vers un processeur hôte scalaire non triviale.La solution que nous proposons apporte des gains en vitesse de simulation d'environ deux ordres de grandeur par rapport à des simulateurs basés sur des techniques d'interprétation.La seconde contribution s'intéresse à la génération automatique de simulateurs basés sur la DBT.À partir d'une description architecturale de la cible et de l'hôte, nous cherchons à produire un simulateur qui soit optimisé pour ce couple.L'optimisation est faite grâce au processus de mise en correspondance des instructions du couple afin de sélectionner la ou les meilleures instructions hôtes pour simuler une instruction cible.Bien qu'expérimental, le générateur réalisé donne des résultats très prometteurs puisqu'il est à même de produire un simulateur pour l'architecture MIPS aux performances comparables à celles d'une implémentation manuelle. / Computing units embedded into modern integrated systems are com-plex, heterogeneous and numerous. Simulation widely used during both software and hardware designof these systems is becoming a real challenge. The simulator performance ismainly driven by the processors instruction set simulation approach, among which Dynamic BinaryTranslation (DBT) is one of the most promising technique. DBT aims at transla-ting on the fly instructions of the simulated processor (the target) into instructions that canbe understood by the computer running the simulation (the host). This technique is fast,but designing a simulator based on it is complex. Indeed, the number of target architecturesis limited, and furthermore, implementing a simulator is a complicated process because oflong and error prone development.This PhD contributes to solve two major issues. The first contribution tackles the problem ofsupporting Very Long Instruction Word (VLIW) architectures as simulation targets,by studying their architecture peculiarities with regards to DBT. Some of these specificities,like explicit instruction parallelism make the translation to scalar hosts nontrivial. Thesolutions we propose bring simulation speed gains of two orders of magnitude compared tointerpreter based simulators. The second contribution addresses the problem of automaticgeneration of DBT based simulators. With both target and host architectural descriptions,we produce a simulator optimised for this pair. This optimisation is done with an instructionsmatching process that finds host instruction candidates to simulate a target instruction.Although being experimental, our generator gives very promising results. It is able toproduce a simulator for the MIPS architecture whose performances are close to a hand writtenimplementation.
Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). Various clustered VLIW configurations, connectivity types, and inter-cluster communication models present different performance trade-offs to a scheduler. The scheduler is responsible for resolving the conflicting requirements of exploiting the parallelism offered by the hardware and limiting the communication among clusters to achieve better performance. Earlier proposals for cluster scheduling fall into two main categories, viz., phase-decoupled scheduling and phase-coupled scheduling and they focus on clustered architectures which provide inter-cluster communication by an explicit inter-cluster copy operation. However, modern commercial clustered architectures provide snooping capabilities (apart from the support for inter-cluster communication using an explicit MV operation) by allowing some of the functional units to read operands from the register file of some of the other clusters without any extra delay. The phase-decoupled approach of scheduling suffers from the well known phase-ordering problem which becomes severe for such a machine model (with snooping) because communication and resource constraints are tightly coupled and thus are exposed only during scheduling. Tight integration of communication and resource constraints further requires taking into account the resource and communication requirements of other instructions ready to be scheduled in the current cycle while binding an instruction, in order to carry out effective binding. However, earlier proposals on integrated scheduling consider instructions and clusters for binding using a fixed order and thus they show different widely varying performance characteristics in terms of execution time and code size. Other shortcomings of earlier integrated algorithms (that lead to suboptimal cluster scheduling decisions) are due to non-consideration of future communication (that may arise due to a binding) and functional unit binding. In this thesis, we propose a pragmatic scheme and also a generic graph matching based framework for cluster scheduling based on a generic and realistic clustered machine model. The proposed scheme effectively utilizes the exact knowledge of available communication slots, functional units, and load on different clusters as well as future resource and communication requirements known only at schedule time to attain significant performance improvement without code size penalty over earlier algorithms. The proposed graph matching based framework for cluster scheduling resolves the phase-ordering and fixed-ordering problem associated with scheduling on clustered VLIW architectures. The framework provides a mechanism to exploit the slack of instructions by dynamically varying the freedom available in scheduling an instruction and hence the cost of scheduling an instruction using different alternatives to reduce the inter-cluster communication. An experimental evaluation of the proposed framework and some of the earlier proposals is presented in the context of a state-of-art commercial clustered architecture.
Page generated in 0.133 seconds