Global ETD Search

31	Design And Analysis Of Time-Predicatable Single-Core And Multi-Core Processors Yan, Jun 01 January 2009 (has links) (PDF) Time predictability is one of the most important design considerations for real-time systems. In this dissertation, time predictability of the instruction cache is studied on both single core processors and multi-core processors. It is observed that many features in modern microprocessor architecture such as cache memories and branch prediction are in favor of average-case performance, which can significantly compromise the time predictability and make accurate worst-case performance analysis extremely difficult if not impossible. Therefore, the time predictability of VLIW (Very Long Instruction Word) processors and its compiler support is studied. The impediments to time predictability for VLIW processors are analyzed and compiler-based techniques to address these problems with minimal modifications to the VLIW hardware design are proposed. Specifically, the VLIW compiler is enhanced to support full if conversion, hyperblock scheduling, and intra-block nop insertion to enable efficient WCET (Worst Case Execution Time) analysis for VLIW processors. Our time-predictable processor incorporates the instruction caches which can mitigate the latency of fetching instructions that hit in the cache. For instruction missing from the cache, instruction prefetching is a useful technique to boost the average-case performance. However, it is unclear whether or not instruction prefetching can benefit the worst-case performance as well. Thus, the impact of instruction prefetching on the worst-case performance of instruction caches is studied. Extension of the static cache simulation technique is applied to model and compute the worst-case instruction cache performance with prefetching. It is shown that instruction prefetching can be reasonably bound, however, the time variation of computing is increased by instruction prefetching. As the technology advances, it is projected that multi-core chips will be increasingly adopted by microprocessor industry. For real-time systems to safely harness the potential of multi-core computing, designers must be able to accurately obtain the worst-case execution time (WCET) of applications running on multi-core platforms, which is very challenging due to the possible runtime inter-core interferences in using shared resources such as the shared L2 caches. As the first step toward time-predictable multi-core computing, this dissertation presents novel approaches to bounding the worst-case performance for threads running on multi-core processors with shared L2 instruction caches. CF (Control Flow) based approach. This approach computes the worst-case instruction access interferences between different threads based on the program control flow information of each thread, which can be statically analyzed. Extended ILP (Integer Linear Programming) based approach. This approach uses constraint programming to model the worst-case instruction access interferences between different threads. In the context of timing analysis in many core architecture, static approaches may also face the scalability issue. Thus, it is important and challenging to design time predictable caches in multi-core architecture. We propose an approach to leverage the prioritized shared L2 caches to improve time predictability for real-time threads running on multi-core processors. The prioritized shared L2 caches give higher priority to real-time threads while allowing low-priority threads to use shared L2 cache space that is available. Detailed implementation and experimental results discussion are presented in this dissertation. cache multicore VLIW WCET
32	Techniques for Improving Uniformity in Direct Mapped Caches Nwachukwu, Izuchukwu Udochi 05 1900 (has links) Directly mapped caches are an attractive option for processor designers as they combine fast lookup times with reduced complexity and area. However, directly-mapped caches are prone to higher miss-rates as there are no candidates for replacement on a cache miss, hence data residing in a cache set would have to be evicted to the next level cache. Another issue that inhibits cache performance is the non-uniformity of accesses exhibited by most applications: some sets are under-utilized while others receive the majority of accesses. This implies that increasing the size of caches may not lead to proportionally improved cache hit rates. Several solutions that address cache non-uniformity have been proposed in the literature. These techniques have been proposed over the past decade and each proposal independently claims the benefit of reduced conflict misses. However, because the published results use different benchmarks and different experimental setups, (there is no established frame of reference for comparing these results) it is not easy to compare them. In this work we report a side-by-side comparison of these techniques. Finally, we propose and Adaptive-Partitioned cache for multi-threaded applications. This design limits inter-thread thrashing while dynamically reducing traffic to heavily accessed sets. cache memory indexing
33	Cache Coherence State Based Replacement Policies Agarwal, Tanuj Kumar January 2015 (has links) (PDF) Cache replacement policies can play a pivotal role in the overall performance of a system by preserving data locality and thus limiting the o -chip accesses. In a shared memory system, a cache coherence protocol is necessary to ensure correctness of data computations by maintaining the state of entries in the cache. In this work we attempt to build and investigate the effect of cache replacement policies using the information provided by cache coherence protocol states. The cache coherence protocol states give us an idea about the state of entry with respect to other cores in the system. State based analysis of SPLASH-2 and PARSEC benchmark suites show that this information hints us towards the locality patterns of cache blocks, which can be used to prioritize the order of replacement of a cache states in a replacement policy. We model ten di erent cache state based replacement policies, three having xed priorities and seven whose priorities vary dynamically over the most recently used state. We compare these policies against the standard replacement policies (LRU, FIFO and Random) in terms of system performance and ease of implementation. We develop our simulation framework using the Multi2Sim simulator, where we model cache state based replacement policies. We simulate SPLASH-2 and PARSEC benchmark suites over a variety of con gurations, where we vary the number of cores, associatively for each level of cache, private/shared L2 cache. We characterize the programs to find out critical components for performance. For an 8-core system we observe that the best case among these state based replacement policies shows marginal improvements in IPC over the Random and FIFO policies, falling slightly short of LRU. We design the state based replacement policies using a smaller cache (CSL-cache), which is used to store the state information of the blocks in the main cache. The CSL cache communicates with the controller to provide the replacement entry. The complexity associated with the system is equal to FIFO and is independent of the associatively of the cache. Cache Replacement Policy Cache Coherence Computer Architecture Cache Memory Computer Storage Devices Cache Replacement Policies Cache Coherence Protocol Computer Science
34	Energy efficient cache architectures for single, multi and many core processors Thucanakkenpalayam Sundararajan, Karthik January 2013 (has links) With each technology generation we get more transistors per chip. Whilst processor frequencies have increased over the past few decades, memory speeds have not kept pace. Therefore, more and more transistors are devoted to on-chip caches to reduce latency to data and help achieve high performance. On-chip caches consume a significant fraction of the processor energy budget but need to deliver high performance. Therefore cache resources should be optimized to meet the requirements of the running applications. Fixed configuration caches are designed to deliver low average memory access times across a wide range of potential applications. However, this can lead to excessive energy consumption for applications that do not require the full capacity or associativity of the cache at all times. Furthermore, in systems where the clock period is constrained by the access times of level-1 caches, the clock frequency for all applications is effectively limited by the cache requirements of the most demanding phase within the most demanding application. This motivates the need for dynamic adaptation of cache configurations in order to optimize performance while minimizing energy consumption, on a per-application basis. First, this thesis proposes an energy-efficient cache architecture for a single core system, along with a run-time support framework for dynamic adaptation of cache size and associativity through the use of machine learning. The machine learning model, which is trained offline, profiles the application’s cache usage and then reconfigures the cache according to the program’s requirement. The proposed cache architecture has, on average, 18% better energy-delay product than the prior state-of-the-art cache architectures proposed in the literature. Next, this thesis proposes cooperative partitioning, an energy-efficient cache partitioning scheme for multi-core systems that share the Last Level Cache (LLC), with a core to LLC cache way ratio of 1:4. The proposed cache partitioning scheme uses small auxiliary tags to capture each core’s cache requirements, and partitions the LLC according to the individual cores cache requirement. The proposed partitioning uses a way-aligned scheme that helps in the reduction of both dynamic and static energy. This scheme, on an average offers 70% and 30% reduction in dynamic and static energy respectively, while maintaining high performance on par with state-of-the-art cache partitioning schemes. Finally, when Last Level Cache (LLC) ways are equal to or less than the number of cores present in many-core systems, cooperative partitioning cannot be used for partitioning the LLC. This thesis proposes a region aware cache partitioning scheme as an energy-efficient approach for many core systems that share the LLC, with a core to LLC way ratio of 1:2 and 1:1. The proposed partitioning, on an average offers 68% and 33% reduction in dynamic and static energy respectively, while again maintaining high performance on par with state-of-the-art LLC cache management techniques. 004.5
35	Exploration d'architecture d'accélérateurs à mémoire distribuée / Design space exploration of distributed-memory accelerators Busseuil, Rémi 04 December 2012 (has links) Bien que le développement actuel d'accélérateurs se concentre principalement sur la création de puces Multiprocesseurs (MPSoC) hétérogènes, c'est-à-dire composés de processeurs spécialisées, de nombreux acteurs de la microélectronique s'intéressent au développement d'un autre type de MPSoC, constitué d'une grille de processeurs identiques. Ces MPSoC homogènes, bien que composés de processeurs énergétiquement moins efficaces, possèdent une programmabilité et une flexibilité plus importante que les MPSoC hétérogènes, ce qui favorise notamment l'adaptation du système à la charge demandée, et offre un espace de solutions de configuration potentiellement plus vaste et plus simple à contrôler. C'est dans ce contexte que s'inscrit cette thèse, en exposant la création d'une architecture MPSoC homogène scalable (c'est-à-dire dont la mise à l'échelle des performances est linéaire), ainsi que le développement de différents systèmes d'adaptation et de programmation sur celle-ci.Cette architecture, constituée d'une grille de processeurs de type MicroBlaze, possédant chacun sa propre mémoire, au sein d'un Réseau sur Puce 2D, a été développée conjointement avec un système d'exploitation temps réel (RTOS) spécialisé et modulaire. Grâce à la création d'une pile de communication complexe, plusieurs mécanismes d'adaptation ont été mis en œuvre : une migration de tâche « avec redirection de données », permettant de diminuer l'impact de cette migration avec des applications de type flux de données, ainsi qu'un mécanisme dit « d'exécution distante ». Ce dernier consiste non plus à migrer le code instruction d'une mémoire à une autre, mais de conserver le code dans sa mémoire initiale et de le faire exécuter par un processeur distinct. Les différentes expériences réalisées avec ce mécanisme ont permis de souligner la meilleure réactivité de celui-ci face à la migration de tâche, tout en possédant des performances d'adaptation plus faible.Ce dernier mécanisme a conduit naturellement à la création d'un modèle de programmation de type « mémoire partagée » au sein de l'architecture. La mise en place de ce dernier nécessitait la création d'un mécanisme de cohérence mémoire, qui a été réalisé de façon matérielle/logicielle et scalable par l'intermédiaire du développement de la librairie PThread. Les performances ainsi obtenues mettent en évidence les avantages d'un MPSoC homogène tout en utilisant une programmation « classique » de type multiprocesseur. / Although the accelerators market is dominated by heterogeneous MultiProcessor Systems-on-Chip (MPSoC), i.e. with different specialized processors, a growing interest is put on another type of MPSoC, composed by an array of identical processors. Even if these processors achieved lower performance to power ratio, the better flexibility and programmability of these homogeneous MPSoC allow an easier adaptation to the load, and offer a wider space of configurations. In this context, this thesis exposes the development of a scalable homogeneous MPSoC – i.e. with linear performance scaling – and different kind of adaptive mechanisms and programming model on it.This architecture is based on an array of MicroBlaze-like processors, each having its own memory, and connected through a 2D NoC. A modular RTOS was build on top of it. Thanks to a complex communication stack, different adaptive mechanisms were made: a “redirected data” task migration mechanism, reducing the impact of the migration mechanism for data-flow applications, and a “remote execution” mechanism. Instead of migrate the instruction code from a memory to another, this last consists in only migrate the execution, keeping the code in its initial memory. The different experiments shows faster reactivity but lower performance of this mechanism compared to migration.This development naturally led to the creation of a shared memory programming model. To achieve this, a scalable hardware/software memory consistency and cache coherency mechanism has been made, through the PThread library development. Experiments show the advantage of using NoC based homogeneous MPSoC with a brand programming model. Cohérence de cache Mémoire Distribuée MPSoC Cache coherency Distributed memory MPSoC
36	Banking and Finance in Cache Valley, 1856-1956 Hurren, Patricia Kaye 01 May 1956 (has links) For one hundred years Cache Valley has been a growing segment of the American economy. Alternating periods of national financial stress and prosperity have been reflected in this Valley. In addition, Cache Valley has been confronted with certain distinct economic and financial problems and has produced her own solutions to these problems. Although her economy is based in large part on agriculture, the existence of the state agricultural college and a number of industries in her midst has induced economic growth and influenced economic development. Cache Valley provides fertile ground for study of a representative segment of the American economy. Likewise, it offers a fruitful case-study of Utah, Western, and Mountain economic development. Few of the men who lived during the making of Cache Valley's early history are living today to tell its story with their own lips. With the help of those who are today living a new Cache Valley financial story, the writer has completed a history of finance using both primary and secondary sources. banking finance cache valley cache valley 1856 1956 Economics
37	The History of South Cache High School Gustaveson, Robert C. 01 May 1954 (has links) The History of Education enables teachers and administrators to gain profitable ideas from the past as well as it enable them to avoid the mistakes of the past. With this in mind numerous historical studies have been made in the field of education. The aim of this study is to add to this area of research by presenting a history of South Cache High School from its beginning until 1953. South Cache High School History Cache History of Education Education
38	Adaptive Cache-Oblivious All-to-All Operation Chung, Shin Yee, Hsu, Wen Jing 01 1900 (has links) Modern processors rely on cache memories to reduce the latency of data accesses. Extensive cache misses would thus compromise the usefulness of the scheme. Cache-aware algorithms make use of the knowledge about the cache, such as the cache line size, L, and cache size, Z, to be cache efficient. However, careful tuning of these parameters for these algorithms is needed for different hardware platforms. Cache-oblivious (CO) algorithms were first introduced by Leiserson to work without the knowledge of the cache parameters mentioned earlier, but still achieve optimal work complexity and optimal cache complexity. Here we present CO algorithms for all-to-all operations (analogous to the cross-product operation). Its applications include Convolution, Polynomial Arithmetic, Multiple Sequence Alignment, N-Body Simulation, etc. Given two lists each with n elements, a naive implementation of all-to-all operation incurs O(n²/L) cache misses. Our CO version incurs only O(n²/L²√Z) cache misses. Preliminary experiments on Opteron 1.4GHz and MIPS 250MHz show that the CO implementation achieves two times faster. The profiling tool further confirms that the amount of cache misses is significantly lower. We also consider various situations where (a) the elements have non-uniform sizes, (b) an element cannot fit into the cache, (c) the lengths of the lists vary, and (d) an element is linked list. In addition, we study the extension to K-lists All-to-All Operation and its application. Finally, we will present the empirical results and compare with cache-aware algorithms. / Singapore-MIT Alliance (SMA) cache-oblivious algorithms all-to-all operations cache misses
39	Asymmetric clustering using a register cache Morrison, Roger Allen 18 April 2006 (has links) Graduation date: 2006 / Conventional register files spread porting resources uniformly across all registers. This paper proposes a method called Asymmetric Clustering using a Register Cache (ACRC). ACRC utilizes a fast register cache that concentrates valuable register file ports to the most active registers thereby reducing the total register file area and power consumption. A cluster of functional units and a highly ported register cache execute the majority of instructions, while a second cluster with a full register file having fewer read ports processes instructions with source registers not found in the register cache. An ‘in-cache’ marking system tracks the contents of the register cache and routes instructions to the correct cluster. This system utilizes logic similar to the ‘ready’ bit system found in wake-up and select logic keeping the additional logic required to a minimum. When using a 256-entry register file, this design reduces the total register file area by an estimated 65% while exhibiting similar IPC performance compared to a non-clustered 8-way processor. As the feature size becomes smaller and processor clocks become faster, the number of clock cycles needed to access the register file will increase. Therefore, the smaller register file area requirement and subsequent smaller register file delay of ACRC will lead to better IPC performance than conventional processors. register cluster cache architecture computer processor Cache memory Registers (Computers)
40	Analyse und Erweiterung des Paradyn Performance Tools Arndt, Michael 12 May 2006 (has links) (PDF) Das kostenfrei erhältliche Performanz Analyse Werkzeug Paradyn wird im Hinblick auf die Tauglichkeit zur Performanzanalyse quantenmechanischer Anwendungen (konkret Abinit) untersucht. Zusätzlich wird Paradyn so erweitert, dass eine Analyse mittels vorhandener Hardwarecounter möglich ist. Da Paradyn plattformunabhängig ist werden Performance Counter Bibliotheken wie PCL oder PAPI verwendet. Cache Misses Dynamische Instrumentierung Paradyn ddc:004 Cache-Speicher Instrumentation

Search results