161

Instruction Timing Analysis for Linux/x86-based Embedded and Desktop Systems

John, Tobias 19 October 2005 (has links) (PDF)
Real-time aspects are becoming more important in standard desktop PC environments, and x86-based processors are being used in embedded systems more often. While these processors were not created for use in hard real-time systems, they are fast and inexpensive and can be used if it is possible to determine the worst-case execution time. Information on CPU caches (L1, L2) and the branch prediction architecture is necessary to simulate best and worst cases in execution timing, but it is often not detailed enough and sometimes not published at all. This document describes how the underlying hardware can be analysed to obtain this information.
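The kind of hardware analysis this abstract refers to can be started from user space. Below is a minimal sketch, assuming an x86 processor with the rdtsc instruction and a GCC-style toolchain, that estimates cache-level boundaries by timing pointer-chasing reads over working sets of growing size; a jump in cycles per access marks an L1/L2 boundary. The stride and iteration counts are illustrative, not taken from the thesis.

```c
/* Sketch: estimate cache-level boundaries by timing pointer-chasing
 * reads over working sets of increasing size. Assumes an x86 CPU with
 * rdtsc and a GCC-style compiler; parameters are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    for (size_t size = 4096; size <= (32u << 20); size *= 2) {
        size_t n = size / sizeof(void *);
        void **buf = malloc(n * sizeof(void *));
        /* Chain with stride 17 (coprime with n) so the walk touches
         * every element and simple prefetchers are less effective. */
        for (size_t i = 0; i < n; i++)
            buf[i] = &buf[(i + 17) % n];
        void **p = buf;
        uint64_t start = rdtsc();
        for (long i = 0; i < 1000000; i++)
            p = (void **)*p;
        uint64_t cycles = rdtsc() - start;
        /* A jump in cycles/access marks a cache-level boundary. */
        printf("%6zu KiB: %5.1f cycles/access\n",
               size / 1024, cycles / 1e6);
        free(buf);
        if (p == NULL)   /* keep the chase from being optimized out */
            return 1;
    }
    return 0;
}
```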
162

A Branch Predictor Directed Data Cache Prefetcher for Out-of-order and Multicore Processors

Sharma, Prabal 16 December 2013 (has links)
Modern superscalar pipelines have tremendous capacity to consume the instruction stream. This has been possible owing to improvements in process technology, technology scaling, and microarchitectural design that allow programs to speculate past control and data dependencies. However, the speed of the memory subsystem lags behind due to the physical constraints of bringing large amounts of data to the processor core. Cache hierarchies have subdued the impact of this speed gap, but there is still much that can be done in the microarchitecture. Data prefetching techniques bring in memory content significantly before the instruction stream actually witnesses demand misses; however, a majority of the techniques proposed so far depend on an initial demand miss that initiates a stream of previously identified prefetches. In this thesis, we propose a novel prefetching algorithm that leverages branch prediction to facilitate deep memory-system speculation. The branch-predictor-directed lookahead mechanism builds a speculative control-flow path for the instruction stream about to be fetched by the main superscalar pipeline. Prefetches are generated along this speculative path from a condensed representation of the memory instructions, leveraging register-index-based correlation. The technique integrates cleanly with the main pipeline's branch predictor to filter out prefetches along invalid speculative paths. The impact of the prefetching scheme is analyzed using the out-of-order model of the gem5 cycle-accurate simulator. Evaluation shows that on a set of 13 memory-intensive SPEC CPU2006 benchmarks, our prefetching technique improves performance by an average of 5.6% over the baseline out-of-order processor.
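As a rough illustration of the lookahead idea described above (a simplified toy, not the thesis's actual algorithm or its condensed memory-instruction representation), the following sketch walks the predicted control-flow path ahead of fetch and issues prefetches for the loads it encounters; the instruction encoding and the always-taken predictor are stand-ins.

```c
/* Toy model of branch-predictor-directed lookahead prefetching.
 * Structures, the always-taken predictor, and the tiny program are
 * illustrative stand-ins, not the thesis's design. */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define N_INSTS   16
#define LOOKAHEAD  8

typedef struct {
    bool     is_branch, is_load;
    int      target;   /* branch target (instruction index)  */
    uint64_t addr;     /* condensed load-address information */
} Inst;

/* Stand-in for the pipeline's real branch predictor. */
static bool predict_taken(int pc) { (void)pc; return true; }

static void issue_prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Walk the predicted control-flow path ahead of fetch, issuing
 * prefetches for the loads found along it. */
static void lookahead(const Inst *prog, int pc) {
    for (int d = 0; d < LOOKAHEAD && pc < N_INSTS; d++) {
        const Inst *i = &prog[pc];
        if (i->is_load)
            issue_prefetch(i->addr);
        pc = (i->is_branch && predict_taken(pc)) ? i->target : pc + 1;
    }
}

int main(void) {
    Inst prog[N_INSTS] = {
        [0] = { .is_load = true,   .addr = 0x1000 },
        [1] = { .is_branch = true, .target = 4 },
        [4] = { .is_load = true,   .addr = 0x2000 },
    };
    lookahead(prog, 0);   /* prefetches 0x1000 then 0x2000 */
    return 0;
}
```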
163

Shared resource management for efficient heterogeneous computing

Lee, Jaekyu 13 January 2014 (has links)
The demand for heterogeneous computing, driven by its performance and energy efficiency, has made the on-chip heterogeneous chip multiprocessor (HCMP) the mainstream computing platform, as the recent trend shows across a wide spectrum of platforms from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have been integrating more powerful GPUs and this trend will continue in upcoming processors. In this architecture, several system resources are shared between CPUs and GPUs. Sharing system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention between cores. The resource-sharing problem has existed since the homogeneous (CPU-only) chip multiprocessor (CMP) was introduced; however, resource sharing in HCMPs shows different aspects because of the different nature of CPU and GPU cores. To solve the resource-sharing problem in HCMPs, we consider efficient shared-resource management schemes, in particular tackling the problem in the shared last-level cache and the interconnection network. In this thesis, we propose four resource-sharing mechanisms. First, we propose an efficient cache-sharing mechanism that exploits the different characteristics of CPU and GPU cores to effectively share cache space between them. Second, adaptive virtual channel partitioning for the on-chip interconnection network is proposed to isolate inter-application interference; by partitioning virtual channels between CPUs and GPUs, we can prevent interference while guaranteeing quality-of-service (QoS) for both kinds of cores. Third, we propose a dynamic frequency-control mechanism to share system resources efficiently: when both kinds of cores are active, the degree of resource contention as well as the system throughput is affected by the operating frequencies of the CPUs and GPUs, and the proposed mechanism tries to find optimal operating frequencies for both, reducing resource contention while improving system throughput. Finally, we propose a second cache-sharing mechanism that exploits GPU-semantic information: the programming and execution models of GPUs are stricter and simpler than those of CPUs, and programmers are asked to provide more information to the hardware. By exploiting these characteristics, GPUs can exercise the cache energy-efficiently, and simpler but more effective cache partitioning can be enabled for HCMPs.
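One common building block for the kind of CPU/GPU cache sharing described above is way partitioning, where each requester class fills into a disjoint subset of the ways. The sketch below shows victim selection under such a scheme; the structures, field names, and the static partition point are illustrative assumptions, not the thesis's mechanism.

```c
/* Sketch of way-partitioning in a shared last-level cache: CPU and GPU
 * fills are restricted to disjoint subsets of the ways. The structures
 * and the static partition point are illustrative assumptions. */
#include <stdio.h>
#include <stdint.h>

#define WAYS 16

typedef enum { SRC_CPU, SRC_GPU } Source;

typedef struct { uint64_t tag; int valid; int lru; } Line;

typedef struct {
    Line way[WAYS];
    int  cpu_ways;   /* ways 0..cpu_ways-1 reserved for CPU fills */
} CacheSet;

/* Pick a victim inside the requester's partition (LRU within it). */
static int choose_victim(const CacheSet *set, Source src) {
    int lo = (src == SRC_CPU) ? 0 : set->cpu_ways;
    int hi = (src == SRC_CPU) ? set->cpu_ways : WAYS;
    int victim = lo;
    for (int w = lo; w < hi; w++) {
        if (!set->way[w].valid)
            return w;                       /* free way first */
        if (set->way[w].lru < set->way[victim].lru)
            victim = w;
    }
    return victim;
}

int main(void) {
    CacheSet set = { .cpu_ways = 10 };      /* 10 CPU ways, 6 GPU ways */
    printf("GPU victim: way %d\n", choose_victim(&set, SRC_GPU));
    return 0;
}
```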
164

Extended architectural enhancements for minimizing message delivery latency on cache-less architectures (e.g., Cell BE)

Kroeker, Anthony 12 January 2012 (has links)
This thesis proposes to reduce the latency of MPI receive operations on cache-less architectures by removing the delay of copying messages when they are first received. This is achieved by copying messages directly into buffers in the lowest level of the memory hierarchy (e.g., scratchpad memory). A previously proposed solution introduced an Indirection Cache, which maps between receive variables and the buffered message payload locations. This proved somewhat beneficial, but the lookup penalty of the Indirection Cache limited its effectiveness. This thesis therefore proposes that a most-recently-used buffer (an Indirection Buffer) be placed in front of the Indirection Cache to eliminate this penalty and speed up access. The tests conducted demonstrated that this method was indeed effective, improving over the original method by at least an order of magnitude. Finally, an examination of implementation feasibility showed that the scheme could be implemented with a small cache, and that even with access times 6x slower than initially assumed, the approach with the Indirection Buffer would still be effective. / Graduate
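A minimal sketch of the Indirection Buffer idea as described: a single most-recently-used entry is checked before the Indirection Cache, so repeated accesses to the same receive variable skip the cache's lookup penalty. The stub cache lookup and its toy address mapping are illustrative assumptions.

```c
/* Sketch of the Indirection Buffer: a single most-recently-used entry
 * checked before the Indirection Cache, so repeated accesses to the
 * same receive variable skip the cache's lookup penalty. The stub
 * cache lookup and its toy mapping are illustrative assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct { uintptr_t recv_var, payload; bool valid; } Entry;

static Entry mru;   /* the Indirection Buffer */

/* Stand-in for the slower Indirection Cache lookup. */
static bool icache_lookup(uintptr_t recv_var, uintptr_t *payload) {
    *payload = recv_var + 0x100;   /* toy mapping, for illustration */
    return true;
}

/* Map a receive variable to its buffered message payload location. */
static bool translate(uintptr_t recv_var, uintptr_t *payload) {
    if (mru.valid && mru.recv_var == recv_var) {   /* fast MRU hit */
        *payload = mru.payload;
        return true;
    }
    if (icache_lookup(recv_var, payload)) {        /* slower path  */
        mru = (Entry){ recv_var, *payload, true }; /* refresh MRU  */
        return true;
    }
    return false;
}

int main(void) {
    uintptr_t pay;
    translate(0x4000, &pay);   /* fills the Indirection Buffer */
    translate(0x4000, &pay);   /* hits it, skipping the cache  */
    printf("payload at 0x%lx\n", (unsigned long)pay);
    return 0;
}
```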
165

Συμπίεση με πρόγνωση αποστάσεων επαναχρησιμοποίησης σε κρυφές μνήμες δευτέρου επιπέδου / Compression with reuse-distance prediction in second-level caches

Σταυρόπουλος, Νικόλαος 03 October 2011 (has links)
The rapid growth in processor speed has created a gap between the processor and main memory. Computer architecture is called on to solve this problem by applying new techniques in the memory hierarchy, hiding this latency while respecting design constraints on area and power consumption. To this end, we propose a new technique that combines compression with reuse-distance prediction: compression increases the effective storage capacity of the cache, while reuse-distance prediction helps select the right block to compress. The thesis investigates a model combining the FPC compression algorithm with instruction-based reuse-distance prediction (IbRDP) in second-level caches, with respect to the speedup it can bring to program execution as well as other parameters. Several models were explored; the best one achieved significant speedups on the benchmarks (a 16% increase in geometric-mean IPC at 1 MB), while only one benchmark showed a pronounced slowdown, on the order of 17%.
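For reference, FPC-style compression classifies each 32-bit word against a small set of frequent patterns, and the matched pattern determines the word's compressed size. The sketch below is a simplified classifier covering only a subset of the published FPC patterns, not the thesis's implementation.

```c
/* Simplified FPC-style word classifier: each 32-bit word is matched
 * against a few frequent patterns, and the pattern determines its
 * compressed size. This covers only a subset of the published FPC
 * patterns and is not the thesis's implementation. */
#include <stdio.h>
#include <stdint.h>

/* Compressed size in bits for one word (3-bit prefix excluded). */
static int fpc_pattern_bits(uint32_t w) {
    int32_t s = (int32_t)w;
    if (w == 0)                         return 0;   /* zero word        */
    if (s >= -8 && s <= 7)              return 4;   /* 4-bit sign-ext.  */
    if (s >= -128 && s <= 127)          return 8;   /* byte sign-ext.   */
    if (s >= -32768 && s <= 32767)      return 16;  /* halfword s-ext.  */
    if ((w & 0xFFFFu) == 0)             return 16;  /* low half zero    */
    if ((w & 0xFFu) * 0x01010101u == w) return 8;   /* repeated bytes   */
    return 32;                                      /* uncompressible   */
}

int main(void) {
    uint32_t line[4] = { 0, 5, 0xFFFFFF80u, 0xDEADBEEFu };
    int bits = 0;
    for (int i = 0; i < 4; i++)
        bits += 3 + fpc_pattern_bits(line[i]);      /* add the prefix   */
    printf("compressed: %d bits (was %d)\n", bits, 4 * 32);
    return 0;
}
```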
166

Vylepšení přesnosti výkonnostních modelů software na vícejádrových platformách se sdílenými cache / Improving Accuracy of Software Performance Models on Multicore Platforms with Shared Caches

Babka, Vlastimil January 2012 (has links)
The context of this work is performance models of software systems, which are used for predicting the performance of a system in its design phase. For this purpose, performance models capture the explicit interactions of the software components that make up the system and the resource demands of the primitive actions performed by the components. On contemporary hardware platforms, however, software components also interact through implicit sharing of numerous resources, such as processor caches, which influences the performance of the primitive actions. Implicit resource sharing is often omitted from performance models, which hurts their prediction accuracy. In this work we introduce two methods for including resource-sharing models in performance models. Next, we propose an approximate resource-sharing model based on linear regression, and a detailed model for predicting the performance impact of cache sharing. The cache model is validated on a real processor, and its design is preceded by extensive experiments that investigate the performance aspects of cache sharing. In addition, we introduce a method for robust validation of performance models using many automatically generated applications.
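The approximate linear-regression model mentioned above can be pictured as fitting the slowdown of a primitive action against a measure of resource pressure from co-running components. A minimal least-squares sketch follows; the data points are made up for illustration.

```c
/* Minimal least-squares sketch of an approximate resource-sharing
 * model: regress the slowdown of a primitive action on a measure of
 * cache pressure from co-running components. The data points are
 * made up for illustration. */
#include <stdio.h>

int main(void) {
    double x[] = { 0.00, 0.25, 0.50, 0.75, 1.00 };   /* cache pressure */
    double y[] = { 1.00, 1.08, 1.19, 1.27, 1.41 };   /* slowdown       */
    int n = 5;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);   /* slope     */
    double a = (sy - b * sx) / n;                           /* intercept */
    printf("slowdown ~ %.3f + %.3f * pressure\n", a, b);
    return 0;
}
```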
167

Représentation dynamique de la liste des copies pour le passage à l'échelle des protocoles de cohérence de cache / Dynamic sharing set for scalable cache coherence protocols

Dumas, Julie 13 December 2017 (has links)
The scalability problem of cache coherence protocols, long known from parallel machines, now also arises on chip with the emergence of manycore architectures. There are fundamentally two classes of protocols: snooping-based and directory-based. Snooping protocols, which must broadcast coherence information to all caches, generate a large number of messages of which few are actually useful. Directory-based protocols, in contrast, aim to send messages only to the caches that need them. The most obvious implementation uses a full bit vector whose size depends only on the number of cores; this bit vector represents the sharing set (the list of copies). To scale, a coherence protocol must issue a reasonable number of coherence messages and limit the hardware devoted to coherence, in particular to the sharing set. To evaluate and compare protocols and their sharing-set representations, we first propose a simulation method based on injecting traces into a high-level cache model, which enables fast architectural exploration of cache coherence protocols. We then propose a new, scalable dynamic representation of the sharing set. On a 64-core architecture, 93% of cache lines are shared by at most 8 cores, and the operating system tries to place communicating tasks close to one another. Our dynamic sharing set exploits these two observations by combining a bit vector for a subset of the copies with a linked list. The bit vector corresponds to a rectangle inside which the sharing set is represented exactly; the position and shape of this rectangle evolve over the application's lifetime. Several algorithms for placing the coherent rectangle are proposed and evaluated. Finally, we compare against state-of-the-art sharing-set representations.
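A sketch of the hybrid representation just described: an exact bit vector covers a movable rectangle of cores, and sharers falling outside the rectangle go to a linked list. The 8x8 mesh, field names, and fixed placement are illustrative assumptions; the thesis's rectangle-placement algorithms are not reproduced here.

```c
/* Sketch of the hybrid sharing set: an exact bit vector covers a
 * movable rectangle of cores; sharers outside it go to a linked list.
 * The 8x8 mesh, field names, and fixed placement are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>

#define MESH_W 8   /* 8x8 mesh = 64 cores */

typedef struct Overflow { int core; struct Overflow *next; } Overflow;

typedef struct {
    int      x0, y0, w, h;   /* rectangle position and shape */
    uint64_t bits;           /* one bit per core inside it   */
    Overflow *outside;       /* sharers the rectangle misses */
} SharingSet;

static bool in_rect(const SharingSet *s, int x, int y) {
    return x >= s->x0 && x < s->x0 + s->w &&
           y >= s->y0 && y < s->y0 + s->h;
}

static void add_sharer(SharingSet *s, int core) {
    int x = core % MESH_W, y = core / MESH_W;
    if (in_rect(s, x, y)) {                /* exact, cheap case  */
        int idx = (y - s->y0) * s->w + (x - s->x0);
        s->bits |= 1ULL << idx;
    } else {                               /* rare overflow case */
        Overflow *o = malloc(sizeof *o);
        o->core = core;
        o->next = s->outside;
        s->outside = o;
    }
}

int main(void) {
    SharingSet s = { .x0 = 2, .y0 = 2, .w = 4, .h = 2 };
    add_sharer(&s, 2 + 2 * MESH_W);   /* core (2,2): inside  */
    add_sharer(&s, 63);               /* core (7,7): outside */
    printf("bits=%llx, overflow=%s\n", (unsigned long long)s.bits,
           s.outside ? "yes" : "no");
    return 0;
}
```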
168

Avaliação do compartilhamento das memórias cache no desempenho de arquiteturas multi-core / Performance evaluation of shared cache memory for multi-core architectures

Alves, Marco Antonio Zanata January 2009 (has links)
In the current context of multi-core innovation, where new integration technologies provide an increasing number of transistors per chip, techniques for increasing data throughput are of great importance for current and future multi-core and many-core processors. With the continuous demand for performance, cache memories have been widely adopted across computer architectural designs, and processors on the market point toward the use of shared L2 caches. However, the gains and costs inherent in these shared-cache models are not yet clear, so studies that address the various aspects of cache sharing in multi-core processors are important. This dissertation therefore evaluates different cache-sharing organizations, modeling and applying workloads to the different organizations in order to obtain significant results on the performance and influence of cache sharing in multi-core processors. Several sharing organizations were evaluated using traditional performance techniques, such as higher associativity, larger line size, larger cache capacity, and additional cache levels, investigating the correlation between these cache architectures and the various workload applications. The results show the importance of integrating the cache architecture design with the memory's physical design in order to obtain the best trade-off between cache access time and miss reduction. Within the evaluated design space, due to physical and performance constraints, the 1Core/L2 and 2Cores/L2 organizations with a total size of 32 MB (shared 2 MB banks) and a 128-byte line size represent a good choice for physical implementation in general-purpose systems, performing well across all evaluated applications without large overheads in area occupation and power consumption. Furthermore, this dissertation shows that, for current and future integration technologies, the traditional performance techniques applied to caches (larger capacity, higher associativity, longer lines, and so on) should not yield real performance gains unless the additional latency they introduce is reduced, so as to balance the reduction in miss rate against the data access time.
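The trade-off this abstract emphasizes, between cache access time and miss reduction, is conventionally captured by the average memory access time (AMAT). A tiny worked sketch with illustrative numbers (not measurements from the dissertation) shows how a larger but slower cache can win or lose.

```c
/* The access-time vs. miss-rate trade-off, expressed as average memory
 * access time (AMAT). Numbers are illustrative, not measurements from
 * the dissertation. */
#include <stdio.h>

static double amat(double hit_cycles, double miss_rate, double penalty) {
    return hit_cycles + miss_rate * penalty;
}

int main(void) {
    /* A bigger-but-slower cache vs. a smaller-but-faster one. */
    printf("small/fast: %.1f cycles\n", amat( 8.0, 0.10, 200.0));
    printf("big/slow:   %.1f cycles\n", amat(14.0, 0.06, 200.0));
    return 0;
}
```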
169

Understanding the emergence of adaptive water governance: a case study of the Cache River watershed of Southern Illinois

Hancock, Jodie 01 August 2017 (has links)
The sustainable management of coupled social-ecological systems, such as water resource systems, requires institutional mechanisms for managing uncertainties and building more resilient social-ecological systems. Adaptive governance is an outcome of the search for a way to manage uncertainties and complexities within social-ecological systems. The concept emerged as a product of resilience theory and theoretical insights into common-pool resource management. Adaptive governance refers to flexible multi-level institutions that connect state and non-state actors to facilitate a collaborative, learning-based approach to ecosystem management. As such, it has the potential to integrate social considerations into the decision process while also dealing with uncertainties in complex water resource systems. However, little is understood about how transitions toward adaptive governance systems take place and what criteria qualify a given institutional mechanism as an adaptive governance regime. This thesis presents results from a study aimed at understanding the process and outcomes of transitions toward adaptive water governance, using the Cache River Wetlands Joint Venture Partnership (CRWJVP) within the Cache River watershed in Southern Illinois as a case study. Qualitative data for the study were generated through key informant interviews with members of the CRWJVP and other knowledgeable actors, document review, and participant observation. The results revealed that the transformation of the governance of the Cache River watershed through the emergence of the CRWJVP was the result of ecological crises that sparked a citizen-led effort to preserve the Cache River wetlands. Additionally, the transition process was facilitated through trust-building, incentives, leadership, enabling legislation, and the role of bridging organizations. The results also showed that, when compared to the attributes of an adaptive governance system, the current governance system of the Cache River watershed does not fully exhibit all the ideal attributes. However, the CRWJVP is moving toward an adaptive governance regime through its recent use of decision-making processes for recognizing and managing conflicts and uncertainties in the management of the watershed. Barriers in the transition process and recommendations for overcoming them are also discussed in the thesis. In all, findings from this study should be of relevance to scientists and decision-makers interested in understanding and enhancing transitions toward adaptive governance for the sustainable management of land and water resources in the Cache River watershed and elsewhere.
