1

Predicated execution and register windows for out-of-order processors

Quiñones Moreno, Eduardo, 18 November 2008
ISA extensions are a very powerful approach to implementing new hardware techniques that require or benefit from compiler support: decisions made at compile time can be complemented at runtime, achieving a synergistic effect between the compiler and the processor. This thesis focuses on two ISA extensions: predicated execution and register windows. Predicated execution is exploited by the if-conversion compiler technique. If-conversion removes control dependences by transforming them into data dependences, which helps to exploit ILP beyond a single basic block. Register windows reduce the number of loads and stores required to save and restore registers across procedure calls by storing multiple contexts in a large architectural register file.

In-order processors especially benefit from both ISA extensions to overcome the limitations that control dependences and the memory hierarchy impose on static scheduling: predicated execution allows control-dependent instructions to be moved past branches, and register windows reduce the number of memory operations across procedure calls. Although if-conversion and register windows were not developed exclusively for in-order processors, their use in out-of-order processors has received very little study. In this thesis we show that if-conversion and register windows introduce new performance opportunities, and new challenges, for out-of-order processors.

The use of if-conversion in out-of-order processors helps to eliminate hard-to-predict branches, alleviating the severe performance penalties caused by branch mispredictions. However, removing some conditional branches through if-conversion may adversely affect the predictability of the remaining branches, because it may reduce the amount of correlation information available to the branch predictor. Moreover, predicated execution in out-of-order processors must deal with two performance issues. First, multiple definitions of the same logical register, each guarded by a different predicate, can be merged into a single control flow. Second, instructions whose guarding predicate evaluates to false consume resources unnecessarily. This thesis proposes a branch prediction scheme based on predicate prediction that solves the three problems mentioned above. The scheme, built on top of a predicated ISA that implements a compare-and-branch model such as the one considered in this thesis, has two advantages. First, branch accuracy improves because the correlation information is not lost after if-conversion, and the proposed mechanism can use the computed value of the branch predicate when it is available, achieving 100% accuracy. Second, it avoids the problems of out-of-order predicate execution.

Regarding register windows, we propose a mechanism that reduces the physical register requirements of an out-of-order processor to the bare minimum with almost no performance loss. The mechanism is based on identifying which architectural registers are in use by the current in-flight instructions; registers that are not in use, i.e. that no in-flight instruction references, can be released early.

In summary, this thesis proposes efficient, low-cost hardware implementations of predicated execution and register windows that provide important benefits to out-of-order processors.
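To make the if-conversion idea concrete, here is a minimal C++ sketch (illustrative only, not code from the thesis): the branchy loop carries a control dependence on a hard-to-predict condition, while the if-converted loop turns it into a data dependence on a predicate. On a predicated ISA the second loop would compile to a compare that sets p followed by an add guarded by p, with no branch left to mispredict. Function and variable names are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Branchy version: the accumulation is control-dependent on (a[i] > t),
// so a mispredicted branch flushes the pipeline.
std::int64_t sum_above_branchy(const std::int32_t* a, std::size_t n, std::int32_t t) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] > t)              // hard to predict for irregular data
            sum += a[i];
    }
    return sum;
}

// If-converted version: the control dependence becomes a data dependence
// on the predicate p; both "paths" are merged into straight-line code and
// the predicate selects the contribution (a predicated ISA would emit a
// compare that defines p and an add guarded by p).
std::int64_t sum_above_predicated(const std::int32_t* a, std::size_t n, std::int32_t t) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        bool p = a[i] > t;         // predicate definition
        sum += p ? a[i] : 0;       // guarded update, branch-free
    }
    return sum;
}
```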
2

Améliorer la performance séquentielle à l’ère des processeurs massivement multicœurs / Increase Sequential Performance in the Manycore Era

Prémillieu, Nathanaël, 03 December 2013
Computers are everywhere, and the demand for ever more computing power pushes processor architects to find new ways to increase performance. The current trend is to replicate execution cores on the same die to parallelize execution. If it continues, processors will become manycores with several hundred to a thousand cores. However, Amdahl's law reminds us that increasing sequential performance will always be necessary to improve overall performance. One essential way to increase sequential performance is to improve how branches are handled, because branches limit instruction-level parallelism. Branch prediction is the most studied solution, and its benefit depends essentially on the predictor's accuracy. In recent years this accuracy has been continuously improved and has reached a level that now seems difficult to exceed. Another solution is to eliminate branches and replace them with constructs based on predicated instructions. However, executing predicated instructions on an out-of-order processor raises several problems, in particular the multiple-definition problem. The work presented in this thesis explores these two aspects of branch handling. The first part addresses branch prediction. One way to improve it without increasing accuracy is to reduce the cost of a misprediction. This is possible by exploiting control-flow reconvergence and control independence: part of the work done by the processor on the wrong path, on instructions common to both paths, is saved and reused on the correct path rather than redone. The second part addresses predicated instructions. We propose a solution to the multiple-definition problem based on selectively predicting the values of predicates. A selective replay mechanism is used to reduce the cost of a predicate misprediction.
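As a hedged illustration of the multiple-definition problem mentioned above (this is not the selective predicate-prediction mechanism proposed in the thesis), the sketch below shows, in comments, two predicated definitions of the same logical register and, in C++, the classic way of viewing each guarded definition as a select that also reads the previous value. That extra input is what makes renaming and executing predicated code costly in an out-of-order core; names and the pseudo-assembly syntax are illustrative.

```cpp
#include <cstdio>

// After if-conversion, both definitions of logical register r1 sit in
// the same straight-line code, each guarded by a predicate:
//
//     p  = cmp.gt a, b
//     r1 = x    (p)      // definition 1, guarded by p
//     r1 = y    (!p)     // definition 2, guarded by !p
//     ...use r1...
//
// At rename time the consumer of r1 cannot know which of the two
// physical registers holds the live value until p resolves.  One
// classic workaround is to treat each guarded definition as a select
// that also reads the previous mapping of r1, as modelled below.

int use_r1(bool p, int x, int y) {
    int r1 = 0;              // previous value of r1 (don't-care here)
    r1 = p ? x : r1;         // def 1 rewritten as a select on p
    r1 = !p ? y : r1;        // def 2 rewritten as a select on !p
    return r1;               // the use now has a single data dependence
}

int main() {
    std::printf("%d %d\n", use_r1(true, 1, 2), use_r1(false, 1, 2)); // prints "1 2"
    return 0;
}
```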
3

Suporte especializado de hardware para geração automática de loop pipelining em FPGAS / Specialized hardware support for automatic loop pipelining generation in FPGAs

Souza, Guilherme Stefano Silva de, 19 November 2014
Loop pipelining is a technique that exploits instruction-level parallelism (ILP) and can offer significant performance improvements; it is employed not only in conventional compilation targeting microprocessors, but also by High Level Synthesis (HLS) tools targeting heterogeneous architectures and hardware accelerators. This work presents the design and implementation of specialized hardware structures intended to solve the value-overwriting problem that arises in loop pipelining, to simplify compilation tasks for HLS tools, and to reduce code duplication, with potential gains in execution performance and total silicon area. Two specialized hardware modules are presented: a queue-based register file and an instruction predication control module.
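As a rough illustration of the queue-based register file idea (a sketch under assumed semantics, not the dissertation's actual design), the C++ model below shows how letting successive loop iterations push values into a small FIFO, instead of overwriting a single register, avoids the value-overwriting problem of overlapped (software-pipelined) iterations. The class name, depth, and read interface are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Minimal model of one "queue register": producer stages push at the
// tail, consumers read values relative to the newest entry, so a value
// produced by iteration i survives until the stage belonging to an
// earlier iteration has consumed it.  Depth would be chosen from the
// pipeline's initiation interval and the dependence distance.
template <std::size_t Depth>
class QueueRegister {
    std::array<std::int32_t, Depth> buf{};
    std::size_t head = 0, count = 0;
public:
    void push(std::int32_t v) {               // producer stage writes here
        buf[(head + count) % Depth] = v;
        if (count < Depth) ++count; else head = (head + 1) % Depth;
    }
    std::int32_t read(std::size_t age) const { // value produced 'age' pushes ago
        return buf[(head + (count - 1 - age)) % Depth];
    }
};

int main() {
    QueueRegister<4> q;
    for (int i = 0; i < 6; ++i) q.push(i);   // overlapping iterations produce values
    return q.read(2) == 3 ? 0 : 1;           // value from two iterations ago still live
}
```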
