11

The Study of Double Level Branch Buffer

Chen, Yi-Chang 12 October 2001 (has links)
Pipelining is the major organizational technique by which computers execute several instructions simultaneously to achieve higher single-processor performance. Branches are recognized as a major impediment to reaching the maximum performance of pipelined and superscalar processors, owing to the stalls caused by unresolved branches. Branch prediction is an effective strategy for reducing the branch penalty: the predictor guesses the branch outcome, and speculative instructions are prefetched and executed before the branch is resolved. A branch target buffer (BTB) [13] can reduce the performance loss caused by branches by predicting the direction of each branch and caching information about it. When the prediction is incorrect, the processor must flush the speculative instructions, undo the effects of the improperly initiated speculative execution, and resume on the correct path. This flushing and refilling significantly degrades processor performance. In this thesis we propose a mechanism, the Double Level Branch Buffer, which reduces both the branch penalty and the performance loss caused by incorrect predictions. We cache branch information for both the taken and the not-taken directions, making the pipeline less dependent on branch prediction accuracy.
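For background, a conventional branch target buffer pairs each predicted branch with a cached target and a small saturating counter. Below is a minimal sketch of such a single-level, direct-mapped BTB; the table size, field layout, and function names are illustrative assumptions, not the design proposed in the thesis, whose Double Level Branch Buffer additionally caches information for the not-taken direction.

```c
#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 1024  /* illustrative size; must be a power of two */

typedef struct {
    uint32_t tag;      /* identifies the branch PC stored in this slot */
    uint32_t target;   /* cached branch target address */
    uint8_t  counter;  /* 2-bit saturating counter: 0-1 not taken, 2-3 taken */
    bool     valid;
} btb_entry;

static btb_entry btb[BTB_ENTRIES];

/* Look up a branch PC at fetch time; on a hit that predicts "taken",
 * return true and the cached target. */
bool btb_predict(uint32_t pc, uint32_t *target) {
    btb_entry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    if (e->valid && e->tag == pc >> 2 && e->counter >= 2) {
        *target = e->target;
        return true;
    }
    return false;  /* predict not taken: fall through */
}

/* Update the entry once the branch resolves. */
void btb_update(uint32_t pc, uint32_t target, bool taken) {
    btb_entry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    if (!e->valid || e->tag != pc >> 2) {  /* allocate on miss */
        e->valid = true;
        e->tag = pc >> 2;
        e->counter = taken ? 2 : 1;        /* start weakly biased */
    } else if (taken && e->counter < 3) {
        e->counter++;
    } else if (!taken && e->counter > 0) {
        e->counter--;
    }
    if (taken)
        e->target = target;
}
```

On a misprediction, a structure like this forces a pipeline flush and refetch from the correct path; caching both directions, as the thesis proposes, aims to shorten exactly that recovery.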
12

Enabling high-performance, mixed-signal approximate computing

St Amant, Renee Marie 07 July 2014 (has links)
For decades, the semiconductor industry enjoyed exponential improvements in microprocessor power and performance through the device scaling of successive technology generations. Scaling limitations at sub-micron technologies, however, have ceased to provide these historical performance improvements within a limited power budget. While device scaling provides a larger number of transistors per chip, for the same chip area, a growing percentage of the chip must be powered off at any given time due to power constraints. As such, the architecture community has focused on energy-efficient designs and is looking to specialized hardware to provide gains in performance. This focus on energy efficiency, along with increasingly less reliable transistors due to device scaling, has led to research in the area of approximate computing, where accuracy is traded for energy efficiency when precise computation is not required. There is a growing body of approximation-tolerant applications that, for example, compute on noisy or incomplete data, such as real-world sensor inputs, or make approximations to decrease the computational load in the analysis of cumbersome data sets. These approximation-tolerant applications span domains such as machine learning, image processing, robotics, and financial analysis, among others. Since the advent of the modern processor, computing models have largely presumed the attribute of accuracy. A willingness to relax accuracy requirements, however, with the goal of gaining energy efficiency, warrants the re-investigation of the potential of analog computing. Analog hardware offers the opportunity for fast and low-power computation, but it presents challenges in the form of accuracy. Whereas analog compute blocks have been applied to solve fixed-function problems, general-purpose computing has relied on digital hardware implementations that provide generality and programmability. The work presented in this thesis aims to answer the following questions: Can analog circuits be successfully integrated into general-purpose computing to provide performance and energy savings? And what is required to address the historical analog challenges of inaccuracy, programmability, and a lack of generality to enable such an approach? This thesis investigates a neural approach as a means to address these historical analog challenges and to enable the use of analog circuits in general-purpose, high-performance computing. The first piece of this work investigates the use of analog circuits at the microarchitecture level in the form of an analog neural branch predictor. The task of branch prediction can tolerate imprecision, as roll-back mechanisms correct branch mispredictions and application-level accuracy remains unaffected. We show that analog circuits enable the implementation of a highly accurate neural prediction algorithm that is infeasible to implement in the digital domain. The second piece presents a neural accelerator that targets approximation-tolerant code. Analog neural acceleration provides an application speedup of 3.3x and energy savings of 12.1x with a quality loss below 10% for all but one approximation-tolerant benchmark. These results show that, using a neural approach, analog circuits can be applied to provide performance and energy efficiency in high-performance, general-purpose computing.
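The neural prediction algorithm in this line of work is a perceptron-style predictor: a dot product between a per-branch table of small signed weights and the global branch history decides the prediction, and an analog implementation can perform that summation in the analog domain instead of with a wide digital adder tree. A minimal digital sketch follows; the table sizes, threshold formula, and function names are conventional values from the perceptron-predictor literature, not the thesis's exact configuration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define HIST_LEN 32                              /* history bits per prediction */
#define NUM_ROWS 256                             /* perceptron table rows */
#define THRESHOLD ((int)(1.93 * HIST_LEN + 14))  /* training threshold from the literature */

static int8_t weights[NUM_ROWS][HIST_LEN + 1];   /* [0] is the bias weight */
static int8_t history[HIST_LEN];                 /* +1 = taken, -1 = not taken */

/* Dot product of the branch's weight row with the global history. */
static int perceptron_output(uint32_t pc) {
    const int8_t *w = weights[pc % NUM_ROWS];
    int y = w[0];
    for (int i = 0; i < HIST_LEN; i++)
        y += w[i + 1] * history[i];
    return y;
}

bool predict_taken(uint32_t pc) { return perceptron_output(pc) >= 0; }

/* Saturating weight update, keeping weights inside int8_t range. */
static void bump(int8_t *w, int delta) {
    int v = *w + delta;
    if (v >= INT8_MIN && v <= INT8_MAX) *w = (int8_t)v;
}

/* Train on the resolved outcome, then shift it into the history. */
void train(uint32_t pc, bool taken) {
    int y = perceptron_output(pc);
    int t = taken ? 1 : -1;
    int8_t *w = weights[pc % NUM_ROWS];
    if ((y >= 0) != taken || abs(y) <= THRESHOLD) {
        bump(&w[0], t);
        for (int i = 0; i < HIST_LEN; i++)
            bump(&w[i + 1], t * history[i]);
    }
    for (int i = HIST_LEN - 1; i > 0; i--)  /* shift global history */
        history[i] = history[i - 1];
    history[0] = (int8_t)t;
}
```

Training only on a misprediction or a low-confidence output (|y| at or below the threshold) keeps the weights small, which is what makes narrow, analog-friendly weight storage feasible.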
13

Static Branch Prediction through Representation Learning

Alovisi, Pietro January 2020 (has links)
In the context of compilers, branch probability prediction deals with estimating the probability that a branch is taken in a program. In the absence of profiling information, compilers rely on statically estimated branch probabilities, and state-of-the-art branch probability predictors are based on heuristics. Recent machine learning approaches learn directly from source code using natural language processing algorithms. A representation-learning word-embedding algorithm is built and evaluated to predict branch probabilities on LLVM's intermediate representation (IR) language. The predictor is trained and tested on SPEC's CPU 2006 benchmark and compared to state-of-the-art branch probability heuristics. The predictor obtains a better miss rate and accuracy in branch prediction than all the evaluated heuristics, but produces on average no performance speedup over LLVM's branch predictor on the benchmark. This investigation shows that it is possible to predict branch probabilities using representation learning, but more effort must be put into obtaining a predictor with practical advantages over the heuristics.
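The heuristic baselines referred to here are typically in the style of the classic Ball-Larus static predictors, which assign a fixed taken-probability to any branch matching a syntactic pattern. Below is a sketch over a toy IR summary; the struct fields, rule order, and probability values are illustrative assumptions, not LLVM's actual implementation.

```c
#include <stdbool.h>

/* Toy summary of a conditional branch in some IR; the fields are
 * assumptions made for illustration, not LLVM's real data structures. */
typedef struct {
    bool back_edge;           /* branch jumps backwards (loop heuristic) */
    bool compares_ptr_eq;     /* pointer == pointer, or == NULL */
    bool taken_path_returns;  /* taken successor leads straight to a return */
    bool compares_lt_zero;    /* "x < 0" style comparison */
} branch_info;

/* Estimated probability that the branch is taken. Rules fire in order;
 * the numbers are illustrative, in the spirit of Ball-Larus heuristics. */
double static_taken_probability(const branch_info *b) {
    if (b->back_edge)          return 0.88;  /* loops usually iterate again */
    if (b->compares_ptr_eq)    return 0.40;  /* pointers are rarely equal */
    if (b->taken_path_returns) return 0.28;  /* early exits are uncommon */
    if (b->compares_lt_zero)   return 0.34;  /* negative values less likely */
    return 0.50;                             /* no heuristic applies */
}
```

The embedding-based predictor studied in the thesis replaces such hand-written pattern rules with probabilities learned from the IR text itself.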
14

Realistic analysis of standard algorithms

Auger, Nicolas 20 December 2018 (has links)
At first, we were interested in TimSort, a sorting algorithm designed in 2002, at a time when it was hard to imagine new results on sorting. Although it is used in many programming languages, the efficiency of this algorithm had not been formally analyzed before our work. The fine-grained study of TimSort led us to take into account, in our theoretical models, some modern features of computer architecture. In particular, we studied the branch prediction mechanism. This theoretical analysis allowed us to design variants of some elementary algorithms (like binary search or exponentiation by squaring) that rely on this feature to achieve significantly better performance on recent computers. Even if uniform distributions are usually considered for the average-case analysis of algorithms, they may not be the best framework for studying sorting algorithms. The choice of using TimSort in the standard libraries of languages such as Java and Python is probably driven by its efficiency on almost-sorted input. To conclude this dissertation, we propose a mathematical model of non-uniform distributions on permutations, for which almost-sorted permutations are more likely, and we provide a detailed probabilistic analysis.
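One concrete instance of the binary-search variants mentioned above is the branchless formulation: the data-dependent comparison, which a predictor can only guess at random on uniform input, is turned into a conditional move. A sketch under the assumption of a sorted int array (this is the standard formulation from the literature, not necessarily the thesis's exact variant):

```c
#include <stddef.h>

/* Branchless binary search: index of the first element >= key in a sorted
 * array. The loop body compiles to a conditional move on typical compilers,
 * so the data-dependent comparison never becomes a mispredicted jump; only
 * the perfectly predictable loop branch remains. */
size_t lower_bound(const int *a, size_t n, int key) {
    size_t base = 0;
    while (n > 1) {
        size_t half = n / 2;
        if (a[base + half - 1] < key)  /* becomes cmov, not a branch */
            base += half;
        n -= half;
    }
    return base + (n == 1 && a[base] < key);
}
```

On random queries the classical version suffers roughly one misprediction per level of the search; the branchless version trades those pipeline flushes for a slightly longer dependency chain, which is typically a win on recent machines.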
15

Instruction Timing Analysis for Linux/x86-based Embedded and Desktop Systems

John, Tobias 19 October 2005 (has links) (PDF)
Real-time aspects are becoming more important in standard desktop PC environments, and x86-based processors are being used in embedded systems more often. While these processors were not designed for hard real-time systems, they are fast and inexpensive and can be used if the worst-case execution time can be determined. Information on CPU caches (L1, L2) and the branch prediction architecture is necessary to simulate best and worst cases in execution timing, but it is often not detailed enough and sometimes not published at all. This document describes how the underlying hardware can be analysed to obtain this information.
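A typical building block of such an analysis is a timestamp-counter microbenchmark: time the same loop over data that makes a branch predictable versus unpredictable, and the difference exposes the misprediction penalty. A minimal sketch for GCC/Clang on x86 (serialization, warm-up, and frequency-scaling details are deliberately omitted):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>  /* __rdtsc(), GCC/Clang on x86 */

#define N 1000000

/* Time one pass of a branch-heavy loop over the data, in TSC ticks. */
static uint64_t time_pass(const int *data, int n) {
    uint64_t start = __rdtsc();
    volatile long sum = 0;      /* volatile: keep the loop from being folded */
    for (int i = 0; i < n; i++)
        if (data[i] >= 128)     /* unpredictable iff the data is random */
            sum += data[i];
    return __rdtsc() - start;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    static int data[N];
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    uint64_t random_ticks = time_pass(data, N);  /* ~50% mispredictions */
    qsort(data, N, sizeof data[0], cmp_int);
    uint64_t sorted_ticks = time_pass(data, N);  /* branch almost always predicted */

    printf("random: %llu ticks, sorted: %llu ticks\n",
           (unsigned long long)random_ticks, (unsigned long long)sorted_ticks);
    /* the per-branch difference approximates the misprediction penalty */
    return 0;
}
```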
16

Computational and storage based power and performance optimizations for highly accurate branch predictors relying on neural networks

Aasaraai, Kaveh 09 August 2007 (has links)
In recent years, highly accurate branch predictors have been proposed primarily for high-performance processors. Unfortunately, such predictors are extremely energy-consuming and in some cases impractical, as they come with excessive prediction latency. Perceptron and O-GEHL are two examples of such predictors. To achieve high accuracy, these predictors rely on large tables and extensive computations, and therefore require high energy and long prediction delay. In this thesis we propose power optimization techniques that aim at reducing both the computational complexity and the storage size of these predictors. We show that by eliminating unnecessary data from the computations, we can reduce both the predictor's energy consumption and its prediction latency. Moreover, we apply findings from information theory to remove ineffective storage used by O-GEHL, without any significant accuracy penalty. We reduce the dynamic and static power dissipated in the computational parts of the predictors while improving performance, as we make faster prediction possible.
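As one plausible reading of "eliminating unnecessary data from computations" (a hedged illustration; the thesis's exact technique may differ), a perceptron-style summation can skip weights whose magnitude is too small to meaningfully influence the sign of the output, reducing adder activity and latency at a small accuracy cost:

```c
#include <stdint.h>
#include <stdlib.h>

#define HIST_LEN 64
#define SKIP_THRESHOLD 2  /* illustrative cut-off for "unnecessary" weights */

/* Perceptron-style output that ignores weights with |w| < SKIP_THRESHOLD.
 * Near-zero weights correlate weakly with the outcome, so dropping them
 * saves summation work with little effect on the predicted sign. This is
 * an illustration of the general idea, not the thesis's exact mechanism. */
int reduced_output(const int8_t w[HIST_LEN + 1], const int8_t hist[HIST_LEN]) {
    int y = w[0];
    for (int i = 0; i < HIST_LEN; i++)
        if (abs(w[i + 1]) >= SKIP_THRESHOLD)  /* skip weak weights */
            y += w[i + 1] * hist[i];
    return y;  /* predict taken iff y >= 0 */
}
```

In hardware, the equivalent is gating the adder inputs for weak weights rather than executing a loop, but the software form shows where the savings come from.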
17

DCE: the dynamic conditional execution in a multipath control independent architecture

Santos, Rafael Ramos dos January 2003 (has links)
This thesis presents DCE, or Dynamic Conditional Execution, as an alternative to reduce the cost of mispredicted branches. The basic idea is to fetch all paths produced by a branch that obeys certain restrictions regarding complexity and size. As a result, fewer predictions are performed and, therefore, fewer branches are mispredicted. DCE fetches through selected branches, avoiding disruptions in the fetch flow when these branches are fetched. Both paths of a selected branch are executed, but only the correct path commits. In this thesis we propose an architecture to execute multiple paths of selected branches. Branches are selected based on their size and other conditions. Simple and complex branches can be dynamically predicated without requiring a special instruction set or special compiler optimizations. Furthermore, a technique is proposed to reduce part of the overhead generated by the execution of multiple paths. The performance gain reaches up to 12% when a Local predictor used in DCE is compared against a Global predictor used in the reference machine. When both machines use a Local predictor, the average speedup is 3-3.5%.
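The dynamic predication DCE performs in hardware has a simple software analogue: for a branch that meets the size and complexity restrictions, execute both sides and let the condition select the result that commits. A conceptual sketch (the real mechanism operates on fetched instruction paths inside the pipeline, not on C expressions):

```c
/* Software analogue of dynamically predicating a short hammock branch:
 *
 *   if (c) x = a + b; else x = a - b;
 *
 * Instead of predicting the branch, compute both paths and let the
 * condition select the committed result, as DCE does for branches that
 * satisfy its size/complexity restrictions. */
int predicated_select(int c, int a, int b) {
    int taken_path     = a + b;  /* result of the taken path */
    int not_taken_path = a - b;  /* result of the not-taken path */
    return c ? taken_path : not_taken_path;  /* selection, no misprediction */
}
```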
