Global ETD Search

31	Uso da técnica VLIW para aumento de performance e redução do consumo de potência em sistemas embarcados baseados em Java / Using the VLIW technique to increase performance and to reduce power comsumption in embedded systems based on Java Beck Filho, Antonio Carlos Schneider January 2004 (has links) A contribuição deste trabalho foi orientada principalmente ao desenvolvimento de alternativas de hardware para a execução nativa de bytecodes Java em sistemas embarcados que naturalmente possuem restrições quanto à potência consumida, ao desempenho e à área ocupada. Primeiramente, o desenvolvimento do Femtojava Low- Power demonstra que a utilização de um pipeline e de um banco de registradores interno em arquiteturas de pilha resultam em uma redução significativa no consumo de potência. Após, a técnica de folding, que basicamente transforma várias operações de pilha em uma operação tipo RISC, é avaliada. A análise de uma segunda solução arquitetural, baseada em VLIW (Very Long Instruction Word), também traz resultados satisfatórios na redução do consumo de potência, sendo que a paralelização do código, feita por um analisador desenvolvido, é facilitada devido à utilização de uma arquitetura de pilha. O desempenho e a potência consumida de todas as arquiteturas propostas neste trabalho foram validadas utilizando-se o simulador CACO-PS, também desenvolvido no contexto desta dissertação. Os estudos de caso adotados para a validação das alternativas arquiteturais compreenderam algoritmos matemáticos, de ordenação, busca e processamento de sinais, bastante utilizados no domínio de sistemas embarcados. Resultados promissores principalmente em termos de energia consumida são alcançados, assim como na disponibilização de diferentes arquiteturas para a execução nativa de Java, principal proposta deste trabalho. / The main contribution of this work was the development of hardware alternatives for native execution of Java bytecodes for embedded systems that have power, performance and area constraints. Firstly, the development of the Femtojava Low- Power shows that the use of a pipeline and an internal register bank in stack architectures brings a significant reduction in the power consumption. After that, the folding technique, that basically changes a set of stack operations into a simple RISC one, is evaluated. Then, the analysis of a second architectural solution, based on VLIW (Very Long Instruction Word), demonstrates also good results concerning power consumption. Moreover, it is shown that the parallelization of the code is facilitated due to the specific stack architecture. The power consumption and performance of all architectures here proposed were evaluated using the CACO-PS simulator, which was also developed in this work. The case studies adopted for the validation of the architectures were mathematic, sort, search and DSP algorithms, widely used in the embedded system domain. Promising results mainly in energy consumption were achieved, as well as the disponibilization of different architectures for native execution of Java, the main objective of this work. Microeletrônica Sistemas embarcados Java (Linguagem de programação) Java Power Energy Folding VLIW Embedded systems Simulator
32	Adaptive and polymorphic VLIW processor to dynamically balance performance, energy consumption, and fault tolerance / Processador VLIW adaptativo e polimórfico para equilibrar de forma dinâmica o desempenho, o consumo de energia e a tolerância a falhas Sartor, Anderson Luiz January 2018 (has links) Ao se projetar um novo processador, o desempenho não é mais o único objetivo de otimização. Reduzir o consumo de energia também é essencial, pois, enquanto a maior parte dos dispositivos embarcados depende fortemente de bateria, os processadores de propósito geral (GPPs) são restringidos pelos limites da energia térmica de projeto (TDP – thermal design power). Além disso, devido à evolução da tecnologia, a taxa de falhas transientes tem aumentado nos processadores modernos, o que afeta a confiabilidade de sistemas tanto no espaço quanto no nível do mar. Adicionalmente, a maioria dos processadores homogêneos e heterogêneos tem um design fixo, o que limita a adaptação em tempo de execução. Nesse cenário, nós propomos dois designs de processadores que são capazes de realizar o trade-off entre esses eixos de acordo com a aplicação alvo e os requisitos do sistema. Ambos designs baseiam-se em um mecanismo de duplicação de instruções com rollback que detecta e corrige falhas, um módulo de power gating para reduzir o consumo de energia das unidades funcionais. O primeiro é chamado de processador adaptativo e usa thresholds, definidos em tempo de projeto, para adaptar a execução da aplicação Adicionalmente, ele controla o ILP da aplicação para criar mais oportunidade de duplicação e de power gating. O segundo design é chamado processador polimórfico e ele avalia (em tempo de execução) a melhor configuração de hardware a ser usada para cada aplicação. Ele também explora o hardware disponível para maximizar o número de aplicações que são executadas em paralelo. Para a versão adaptativa usando uma configuração orientada a otimização de energia, é possível, em média, economizar 37,2% de energia com um overhead de apenas 8,2% em performance, mantendo baixos níveis de defeito, quando comparado a um design tolerante a falhas. Para a versão polimórfica, os resultados mostram que a reconfiguração dinâmica do processador é capaz de adaptar eficientemente o hardware ao comportamento da aplicação, de acordo com os requisitos especificados pelo designer, chegando a 94.88% do resultado de um processador oráculo quando o trade-off entre os três eixos é considerado. Por outro lado, a melhor configuração estática apenas atinge 28.24% do resultado do oráculo. / Performance is no longer the only optimization goal when designing a new processor. Reducing energy consumption is also mandatory: while most of the embedded devices are heavily dependent on battery power, General-Purpose Processors (GPPs) are being pulled back by the limits of Thermal Design Power (TDP). Moreover, due to technology scaling, soft error rate (i.e., transient faults) has been increasing in modern processors, which affects the reliability of both space and ground-level systems. In addition, most traditional homogeneous and heterogeneous processors have a fixed design, which limits its runtime adaptability. Therefore, they are not able to cope with the changing application behavior when one considers the axes of fault tolerance, performance, and energy consumption altogether. In this context, we propose two processor designs that are able to trade-off these three axes according to the application at hand and system requirements. Both designs rely on an instruction duplication with rollback mechanism that can detect and correct errors and a power gating module to reduce the energy consumption of the functional units The former design, called adaptive processor, uses thresholds defined at design time to allow runtime adaptation of the application’s execution and controls the application’s Instruction-Level Parallelism (ILP) to create more slots for duplication or power gating. The latter design (polymorphic processor) takes the former one step further by dynamically reconfiguring the hardware and evaluating different processor configurations for each application, and it also exploits the available pipelanes to maximize the number of applications that are executed concurrently. For the adaptive processor using an energy-oriented configuration, it is possible, on average, to reduce energy consumption by 37.2% with an overhead of only 8.2% in performance, while maintaining low levels of failure rate, when compared to a fault-tolerant design. For the polymorphic processor, results show that the dynamic reconfiguration of the processor is able to efficiently match the hardware to the behavior of the application, according to the requirements of the designer, achieving 94.88% of the result of an oracle processor when the trade-off between the three axes is considered. On the other hand, the best static configuration only achieves 28.24% of the oracle’s result. Tolerância a falhas Processamento paralelo Adaptive processor Fault tolerance Energy consumption Performance VLIW
33	Uso da técnica VLIW para aumento de performance e redução do consumo de potência em sistemas embarcados baseados em Java / Using the VLIW technique to increase performance and to reduce power comsumption in embedded systems based on Java Beck Filho, Antonio Carlos Schneider January 2004 (has links) A contribuição deste trabalho foi orientada principalmente ao desenvolvimento de alternativas de hardware para a execução nativa de bytecodes Java em sistemas embarcados que naturalmente possuem restrições quanto à potência consumida, ao desempenho e à área ocupada. Primeiramente, o desenvolvimento do Femtojava Low- Power demonstra que a utilização de um pipeline e de um banco de registradores interno em arquiteturas de pilha resultam em uma redução significativa no consumo de potência. Após, a técnica de folding, que basicamente transforma várias operações de pilha em uma operação tipo RISC, é avaliada. A análise de uma segunda solução arquitetural, baseada em VLIW (Very Long Instruction Word), também traz resultados satisfatórios na redução do consumo de potência, sendo que a paralelização do código, feita por um analisador desenvolvido, é facilitada devido à utilização de uma arquitetura de pilha. O desempenho e a potência consumida de todas as arquiteturas propostas neste trabalho foram validadas utilizando-se o simulador CACO-PS, também desenvolvido no contexto desta dissertação. Os estudos de caso adotados para a validação das alternativas arquiteturais compreenderam algoritmos matemáticos, de ordenação, busca e processamento de sinais, bastante utilizados no domínio de sistemas embarcados. Resultados promissores principalmente em termos de energia consumida são alcançados, assim como na disponibilização de diferentes arquiteturas para a execução nativa de Java, principal proposta deste trabalho. / The main contribution of this work was the development of hardware alternatives for native execution of Java bytecodes for embedded systems that have power, performance and area constraints. Firstly, the development of the Femtojava Low- Power shows that the use of a pipeline and an internal register bank in stack architectures brings a significant reduction in the power consumption. After that, the folding technique, that basically changes a set of stack operations into a simple RISC one, is evaluated. Then, the analysis of a second architectural solution, based on VLIW (Very Long Instruction Word), demonstrates also good results concerning power consumption. Moreover, it is shown that the parallelization of the code is facilitated due to the specific stack architecture. The power consumption and performance of all architectures here proposed were evaluated using the CACO-PS simulator, which was also developed in this work. The case studies adopted for the validation of the architectures were mathematic, sort, search and DSP algorithms, widely used in the embedded system domain. Promising results mainly in energy consumption were achieved, as well as the disponibilization of different architectures for native execution of Java, the main objective of this work. Microeletrônica Sistemas embarcados Java (Linguagem de programação) Java Power Energy Folding VLIW Embedded systems Simulator
34	Evaluating the Scalability of SDF Single-chip Multiprocessor Architecture Using Automatically Parallelizing Code Zhang, Yuhua 12 1900 (has links) Advances in integrated circuit technology continue to provide more and more transistors on a chip. Computer architects are faced with the challenge of finding the best way to translate these resources into high performance. The challenge in the design of next generation CPU (central processing unit) lies not on trying to use up the silicon area, but on finding smart ways to make use of the wealth of transistors now available. In addition, the next generation architecture should offer high throughout performance, scalability, modularity, and low energy consumption, instead of an architecture that is suitable for only one class of applications or users, or only emphasize faster clock rate. A program exhibits different types of parallelism: instruction level parallelism (ILP), thread level parallelism (TLP), or data level parallelism (DLP). Likewise, architectures can be designed to exploit one or more of these types of parallelism. It is generally not possible to design architectures that can take advantage of all three types of parallelism without using very complex hardware structures and complex compiler optimizations. We present the state-of-art architecture SDF (scheduled data flowed) which explores the TLP parallelism as much as that is supplied by that application. We implement a SDF single-chip multiprocessor constructed from simpler processors and execute the automatically parallelizing application on the single-chip multiprocessor. SDF has many desirable features such as high throughput, scalability, and low power consumption, which meet the requirements of the next generation of CPU design. Compared with superscalar, VLIW (very long instruction word), and SMT (simultaneous multithreading), the experiment results show that for application with very little parallelism SDF is comparable to other architectures, for applications with large amounts of parallelism SDF outperforms other architectures. Multiprocessors. Computer architecture. SDF single-chip multiprocessor multithreading automatic parallelization scalability superscalar VLIW SMT
35	VARIATIONS ON ROTATION SCHEDULING Richter, Michael Edwin 13 September 2007 (has links) No description available. Computer Science static scheduling high-level synthesis rotation scheduling list scheduling VLIW
36	Custom floating-point arithmetic for integer processors : algorithms, implementation, and selection / Arithmétique à virgule flottante spécifique pour processeurs entiers : algorithmes, implémentation et sélection Jourdan, Jingyan 15 November 2012 (has links) Les applications multimédia se composent généralement de blocs numériques exhibant des schémas de calcul flottant réguliers. Sur les processeurs sans support architectural pour l'arithmétique flottante, ils peuvent être profitablement transformés en opérateurs dédiés, s'ajoutant aux 5 opérateurs élémentaires (+, -, X, / et √) : en traitant plus d'opérations simultanément, ils permettent d'obtenir de meilleures performances. Cette thèse porte sur la conception de tels opérateurs, et les techniques de compilation mises en œuvre pour les sélectionner. Nous avons réalisé des implémentations optimisées pour un ensemble d'opérateurs dédiés : élévation au carré, mise à l'échelle, fused multiply-add, produit scalaire en dimension deux (DP2), addition/soustraction simultané et sinus/cosinus simultanés. En proposant de nouveaux algorithmes cherchant à maximiser le parallélisme d'instructions et détaillés ici, nous obtenons des accélérations d'un facteur allant jusqu'à 4.2 par appel. Nous détaillons également les changements apportés dans le compilateur pour effectuer la sélection. La plupart des opérateurs sont sélectionnés au niveau syntaxique. Cependant, pour certains opérateurs, nous avons dû améliorer l'analyse d'intervalles entiers pour prendre en compte les variables de type flottant, afin de prouver certaines conditions de positivité requises à leur sélection. Enfin, nous apportons la preuve en pratique de la pertinence de cette approche : sur des noyaux typiques du traitement du signal et sur certaines applications, nous mesurons une amélioration de performance allant jusqu'à 1.59x en comparaison avec la performance obtenue avec les seuls opérateurs élémentaires. / Media processing applications typically involve numerical blocks that exhibit regular floating-point computation patterns. For processors whose architecture supports only integer arithmetic, these patterns can be profitably turned into custom operators, coming in addition to the five basic ones (+, -, X, / and √), but achieving better performance by treating more operations. This thesis addresses the design of such custom operators as well as the techniques developed in the compiler to select them in application codes. We have designed optimized implementations for a set of custom operators which includes squaring, scaling, adding two nonnegative terms, fused multiply-add, fused square-add (x*x+z, with z>=0), two-dimensional dot products (DP2), sums of two squares, as well as simultaneous addition/subtraction and sine/cosine. With novel algorithms targeting high instruction-level parallelism and detailed here for squaring, scaling, DP2, and sin/cos, we achieve speedups of up to 4.2x for individual custom operators even when subnormal numbers are fully supported. Furthermore, we introduce the optimizations developed in the ST231 C/C++ compiler for selecting such operators. Most of the selections are achieved at high level, using syntactic criteria. However, for fused square-add, we also enhance the framework of integer range analysis to support floating-point variables in order to prove the required positivity condition z>= 0. Finally, we provide quantitative evidence of the benefits to support this selection of custom operations: on DSP kernels and benchmarks, our approach allows us to be up to 1.59x faster compared to the sole usage of basic ones. Arithmétique virgule flottante Opérateur dédié Processeur embarqué entier Architecture VLIW Optimisation du compilateur Sélection du code par le compilateur IEEE floating-point arithmetic Custom operator Embedded integer processor VLIW architecture Compiler optimization Compiler code selection
37	Compiler-Assisted Energy Optimization For Clustered VLIW Processors Nagpal, Rahul 03 1900 (has links) Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Although clustering helps by improving clock speed, reducing energy consumption of the logic, and making the design simpler, it introduces extra overheads by way of inter-cluster communication. This communication happens over long wires having high load capacitance which leads to delay in execution and significantly high energy consumption. Inter-cluster communication also introduces many short idle cycles, therby significantly increasing the overall leakage energy consumption in the functional units. The trend towards miniatrurization of devices (and associated reduction in threshold voltage) makes energy consumption in interconnects and functional units even worse and limits the usability of clustered architectures in smaller technologies. In the past, study of leakage energy management at the architectural level has mostly focused on storage structures such as cache. Relatively, little work has been done on architecture level leakage energy management in functional units in the context of superscalar processors and energy efficient scheduling in the context of VLIW architectures. In the absence of any high level model for interconnect energy estimation, the primary focus of research in the context of interconnects has been to reduce the latency of communication and evaluation of various inter-cluster communication models. To the best of our knowledge, there has been no such work in the past from the point of view of enegy efficiency targeting clustered VLIW architectures specifically focusing on smaller technologies. Technological advancements now permit design of interconnects and functional units With varying performance and power modes. In thesis we people scheduling algorithms that aggregate the scheduling slack of instructions and communication slack of data values to exploit the low power modes of interconnects and functional units . We also propose a high level model for estimation of interconnect delay and energy (in contrast to low-level circuit level model proposed earlier) that makes it possible to carry out architectural and compiler optimizations specifically targeting the inter connect, Finally we present synergistic combination of these algorithms that simultaneously saves energy in functional units and interconnects to improve the usability of clustered architectures by archiving better overall energy-performance trade-offs. Our compiler assisted leakage energy management scheme for functional units reduces the energy consumption of functional units approximately by 15% and 17% in the context of a 2-clustered and a 4-clustered VLIW architecture respectively with negligible performance degradation over and above that offered by a hardware-only scheme. The interconnect energy optimization scheme improves the energy consumption of interconnects on an average by 41% and 46% for a 2-clustered and a 4-clustered machine respectively with 2% and 1.5% performance degradation. The combined scheme options slightly better energy benefit in functional units and 37% and 43% energy benefit in interconnect with slightly higher performance degradation. Even with the conservative estimates of contribution of functional unit interconnect to overall processor energy consumption the proposed combined scheme obtains on an average 8% and 10% improvement in overall energy delay product with 3.5% and 2% performance degradation for a 2-clustered and a 4-clustered machine respectively. We present a detailed experimental evaluation of the proposed schemes using the Trimaran compiler infrastructure. Embedded Systems Computer Architecture Algorithms Very Long Instruction Word Architecture Energy Optimization Energy Management Interconnect Modeling Clustered VLIW Architectures Clustered VLIW Processors Power Optimization Micro-architetural Explorations Computer Science
38	Architectures flot de données dédiées au traitement d'images par morphologie mathématique Clienti, Christophe 30 September 2009 (has links) (PDF) Nous abordons ici la thématique des opérateurs et processeurs flot de données dédiés au traitement d'images et orientés vers la morphologie mathématique. L'objectif principal est de proposer des architectures performantes capables de réaliser les opérations simples de ce corpus mathématique afin de proposer des opérateurs morphologiques avancés. Ces dernières années, des algorithmes astucieux ont été proposés avec comme objectif de réduire la quantité des calculs nécessaires à la réalisation de transformations telle que la ligne de partage des eaux. Toutefois, les mises en œuvre proposées font souvent appel à des structures de données complexes qui sont difficiles à employer sur des machines différentes des processeurs généralistes monocœurs. Les processeurs standard poursuivant aujourd'hui leur évolution vers une augmentation du parallélisme, ces implémentations ne nous permettent pas d'obtenir les gains de performance escomptés à chaque nouvelle génération de machine. Nous proposons alors des mises en œuvre rapides des opérations complexes de la morphologie mathématique par des machines exploitant fortement le parallélisme intrinsèque des opérations basiques. Nous étudions dans une première partie les processeurs de voisinage travaillant directement sur un flot de pixels et nous proposons différentes méthodologies de conception rapide de pipelines dédiés à une application. Nous proposons également une structure de pipeline programmable via l'utilisation de processeurs vectoriels avec différentes possibilités de chaînage. Enfin, une étude avec des machines est proposée afin d'observer la pertinence de notre approche. [MATH] Mathematics Traitement image Morphologie mathématique Parallélisme Processeur haute performance Processeur VLIW Processeur vectoriel Processeur pipeline Calcul intensif
39	Simulation Native des Systèmes Multiprocesseurs sur Puce à l'aide de la Virtualisation Assistée par le Matériel Hamayun, Mian Muhammad 04 July 2013 (has links) (PDF) L'intégration de plusieurs processeurs hétérogènes en un seul système sur puce (SoC) est une tendance claire dans les systèmes embarqués. La conception et la vérification de ces systèmes nécessitent des plateformes rapides de simulation, et faciles à construire. Parmi les approches de simulation de logiciels, la simulation native est un bon candidat grâce à l'exécution native de logiciel embarqué sur la machine hôte, ce qui permet des simulations à haute vitesse, sans nécessiter le développement de simulateurs d'instructions. Toutefois, les techniques de simulation natives existantes exécutent le logiciel de simulation dans l'espace de mémoire partagée entre le matériel modélisé et le système d'exploitation hôte. Il en résulte de nombreux problèmes, par exemple les conflits l'espace d'adressage et les chevauchements de mémoire ainsi que l'utilisation des adresses de la machine hôte plutôt des celles des plates-formes matérielles cibles. Cela rend pratiquement impossible la simulation native du code existant fonctionnant sur la plate-forme cible. Pour surmonter ces problèmes, nous proposons l'ajout d'une couche transparente de traduction de l'espace adressage pour séparer l'espace d'adresse cible de celui du simulateur de hôte. Nous exploitons la technologie de virtualisation assistée par matériel (HAV pour Hardware-Assisted Virtualization) à cet effet. Cette technologie est maintenant disponibles sur plupart de processeurs grande public à usage général. Les expériences montrent que cette solution ne dégrade pas la vitesse de simulation native, tout en gardant la possibilité de réaliser l'évaluation des performances du logiciel simulé. La solution proposée est évolutive et flexible et nous fournit les preuves nécessaires pour appuyer nos revendications avec des solutions de simulation multiprocesseurs et hybrides. Nous abordons également la simulation d'exécutables cross- compilés pour les processeurs VLIW (Very Long Instruction Word) en utilisant une technique de traduction binaire statique (SBT) pour généré le code natif. Ainsi il n'est pas nécessaire de faire de traduction à la volée ou d'interprétation des instructions. Cette approche est intéressante dans les situations où le code source n'est pas disponible ou que la plate-forme cible n'est pas supporté par les compilateurs reciblable, ce qui est généralement le cas pour les processeurs VLIW. Les simulateurs générés s'exécutent au-dessus de notre plate-forme basée sur le HAV et modélisent les processeurs de la série C6x de Texas Instruments (TI). Les résultats de simulation des binaires pour VLIW montrent une accélération de deux ordres de grandeur par rapport aux simulateurs précis au cycle près. MPSoC Simulation Natif Estimation de Performance Tlm Vliw Hav
40	Estimation de la consommation dans la conception système des applications embarquées temps réels Laurent, Johann 09 December 2002 (has links) (PDF) Aujourd'hui, les consommations de puissance et d'énergie sont devenues des contraintes incontournables lors de la conception d'un système, au même titre que le temps ou la surface. En effet, les applications modernes utilisent de plus en plus de ressources de calcul et de mémoires ce qui entraîne une augmentation significative de leur consommation (multiplication 4 tous les 3 ans). De plus, comme la place du logiciel embarqué devient prépondérante dans les systèmes temps réel, l'optimisation de code a un impact important sur la maîtrise de la consommation. Cependant, mesurer l'impact des optimisations réalisées nécessite l'utilisation d'outils d'estimation rapides, précis et ayant un point d'entrée à haut niveau (par exemple le code C). Un tel point d'entrée permet au concepteur de déterminer la cible la mieux adaptée sans avoir à acquérir les différents outils de développement constructeurs.<br /><br />Plusieurs équipes de recherche ont déjà développé des méthodes d'estimation de la consommation pour processeur. La plupart d'entre elles sont des méthodes au niveau instructions (Instruction Level Power Analysis). Dans cette méthode, la consommation de chacune des instructions est mesurée ainsi que la consommation inter-instructions (passage d'une instruction à une autre) afin de développer le modèle global de consommation de la cible. Le principal inconvénient de ces méthodes est le temps d'obtention du modèle de puissance pour les architectures complexes (VLIW ou super scalaire). En effet, pour une architecture VLIW, le modèle ILPA requiert N2k mesures où N représente le nombre d'instructions du jeu et k le nombre d'instructions pouvant être exécutées en parallèle. En conséquence, le temps de modélisation de telles architectures devient, par cette méthode, prohibitif. De plus, la prise en compte de l'environnement extérieur est problématique (défauts de cache, ruptures de pipeline). Pour les architectures actuelles, il faut donc développer une nouvelle approche permettant de réduire le temps d'obtention du modèle tout en conservant une précision acceptable. La réduction du temps de modélisation ne peut se faire que par l'élévation du niveau d'abstraction.<br />Nous proposons dans cette thèse une nouvelle approche basée sur une analyse fonctionnelle et architecturale de la cible d'un point de vue de la consommation (Functional Level Power Analysis). Notre méthodologie est constituée de deux étapes : une étape de modélisation et une étape d'estimation. [SPI:OTHER] Engineering Sciences/Other estimation consommation processeur DSP VLIW puissance énergie applications TdSI

Search results