111

Extending branch prediction information to effective caching.

January 1996
by Chung-Leung, Chiu.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 110-113).

Contents:
Abstract --- p.i
Acknowledgement --- p.iii
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Partial Basic Block Storing Mechanism --- p.1
Chapter 1.2 --- Data-Tagged Mechanism in Branch Target Buffer --- p.4
Chapter 1.3 --- Organization of the dissertation --- p.5
Chapter 2 --- Related Research --- p.7
Chapter 2.1 --- Branch Prediction --- p.7
Chapter 2.2 --- Branch History Table --- p.8
Chapter 2.2.1 --- Performance of Branch History Table in reducing the Branch Penalty --- p.10
Chapter 2.3 --- Branch Target Cache --- p.10
Chapter 2.4 --- Early Resolution of Branch --- p.11
Chapter 2.5 --- Software Inter-block Reorganization --- p.12
Chapter 2.6 --- Branch Target Buffer --- p.13
Chapter 2.7 --- Data Prefetching --- p.16
Chapter 2.7.1 --- Software-Directed Prefetching --- p.16
Chapter 2.7.2 --- Hardware-based prefetching --- p.17
Chapter 3 --- New Branch Target Buffer Design --- p.19
Chapter 3.1 --- Alternate Line Storing --- p.22
Chapter 3.2 --- Storing More Than One Line On Entering The Dynamic Basic Block --- p.27
Chapter 4 --- Simulation Environment for New Branch Target Buffer Design --- p.30
Chapter 4.1 --- Architectural Models and Assumptions --- p.30
Chapter 4.2 --- Memory Models --- p.33
Chapter 4.3 --- Evaluation Methodology and Measurement Criteria --- p.34
Chapter 4.4 --- Description of the Traces --- p.35
Chapter 4.5 --- Effect of the limitation of ATOM on the statistics of SPEC92 Benchmarks --- p.35
Chapter 4.6 --- Environments for collecting relevant statistics of SPEC92 Benchmarks --- p.36
Chapter 5 --- Results for New Branch Target Buffer Design --- p.38
Chapter 5.1 --- Statistical Results and Analysis for SPEC92 Benchmarks --- p.38
Chapter 5.2 --- Overall Performance --- p.39
Chapter 5.3 --- Bus Latency Effect --- p.42
Chapter 5.4 --- Effect of Cache Size --- p.45
Chapter 5.5 --- Effect of Line Size --- p.47
Chapter 5.6 --- Cache Set Associativity --- p.50
Chapter 5.7 --- Partial Hits --- p.50
Chapter 5.8 --- Prefetch Accuracy --- p.53
Chapter 5.9 --- Effect of Prefetch Buffer Size --- p.54
Chapter 5.10 --- Effect of Storing More Than One Line on Entry of New Dynamic Basic Block --- p.56
Chapter 6 --- Data References Tagged into Branch Target Buffer --- p.60
Chapter 6.1 --- Branch History Table Tagged Mechanism --- p.60
Chapter 6.2 --- Lookahead Technique --- p.65
Chapter 6.3 --- Default Prefetches vs Data-tagged Prefetches --- p.71
Chapter 6.4 --- New Priority Scheme --- p.73
Chapter 7 --- Architectural Model for Data-Tagged References in Branch Target Buffer --- p.74
Chapter 7.1 --- Architectural Models and Assumptions --- p.76
Chapter 7.2 --- Memory Models --- p.79
Chapter 7.3 --- Evaluation Methodology and Measurement Criteria --- p.79
Chapter 7.4 --- Description of the Traces --- p.80
Chapter 7.5 --- Environments for collecting relevant statistics of SPEC92 Benchmarks --- p.80
Chapter 8 --- Results for Data References Tagged into Branch Target Buffer --- p.82
Chapter 8.1 --- Statistical Results and Analysis --- p.82
Chapter 8.2 --- Overall Performance --- p.83
Chapter 8.3 --- Effect of Branch Prediction --- p.85
Chapter 8.4 --- Effect of Number of Tagged Registers --- p.87
Chapter 8.5 --- Effect of Different Tagged Positions in Basic Block --- p.90
Chapter 8.6 --- Effect of Lookahead Size --- p.91
Chapter 8.7 --- Prefetch Accuracy --- p.93
Chapter 8.8 --- Cache Size --- p.95
Chapter 8.9 --- Line Size --- p.96
Chapter 8.10 --- Set Associativity --- p.97
Chapter 8.11 --- Size of Branch History Table --- p.99
Chapter 8.12 --- Set Associativity of Branch History Table --- p.99
Chapter 8.13 --- New Priority Scheme vs Default Priority Scheme --- p.102
Chapter 8.14 --- Effect of Prefetch-On-Miss --- p.103
Chapter 8.15 --- Memory Latency --- p.104
Chapter 9 --- Conclusions and Future Research --- p.106
Chapter 9.1 --- Conclusions --- p.106
Chapter 9.2 --- Future Research --- p.108
Bibliography --- p.110
Appendix --- p.114
Chapter A --- Statistical Results - SPEC92 Benchmarks --- p.114
Chapter A.1 --- Definition of Abbreviations and Terms --- p.114
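The contents above show the two ideas this thesis combines: a branch target buffer (BTB) that stores partial basic blocks, and data references tagged into the BTB so that a predicted branch can also trigger a data prefetch (Chapters 3 and 6). As a rough illustration of the second idea (a sketch only: the entry layout, table size, and the prefetch_line hook are assumptions, not the thesis's actual design), a direct-mapped BTB lookup in C might look like this:

```c
#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 512          /* illustrative size; the thesis evaluates several */

/* One BTB entry, extended with a tagged data address: on a
   predicted-taken hit, the data line can be prefetched alongside
   the instruction fetch redirect. */
typedef struct {
    uint32_t tag;                /* high bits of the branch PC */
    uint32_t target;             /* predicted branch target */
    bool     valid;
    bool     predict_taken;      /* 1-bit predictor for brevity */
    uint32_t tagged_data_addr;   /* data reference tagged into the entry */
    bool     has_data_tag;
} btb_entry_t;

static btb_entry_t btb[BTB_ENTRIES];

extern void prefetch_line(uint32_t addr);   /* memory-system hook (assumed) */

/* Look up the BTB; on a taken-predicted hit, redirect fetch and issue
   a data prefetch for the tagged reference. Returns true on redirect. */
bool btb_lookup(uint32_t pc, uint32_t *next_pc)
{
    btb_entry_t *e = &btb[(pc >> 2) % BTB_ENTRIES];
    if (e->valid && e->tag == (pc >> 2) / BTB_ENTRIES && e->predict_taken) {
        if (e->has_data_tag)
            prefetch_line(e->tagged_data_addr);
        *next_pc = e->target;
        return true;
    }
    return false;
}
```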
112

Online thread and data mapping using the memory management unit

Cruz, Eduardo Henrique Molina da, January 2016
As thread-level parallelism increases in modern architectures due to larger numbers of cores per chip and chips per system, the complexity of their memory hierarchies also increases. Such memory hierarchies include several private or shared cache levels, and Non-Uniform Memory Access nodes with different access times. One important challenge for these architectures is the data movement between cores, caches, and main memory banks, which occurs when a core performs a memory transaction. In this context, reducing data movement is an important goal for future architectures, both to keep performance scaling and to decrease energy consumption. One of the solutions is to improve memory access locality through sharing-aware thread and data mapping. State-of-the-art mapping mechanisms increase locality by keeping threads that share a high volume of data close together in the memory hierarchy (sharing-aware thread mapping), and by mapping data close to the threads that access it (sharing-aware data mapping). Many approaches focus on either thread mapping or data mapping and perform them separately, losing opportunities to improve performance. Some mechanisms rely on execution traces to perform a static mapping, which imposes a high overhead and cannot be used when the application's memory access behavior changes between executions. Other approaches use sampling or indirect information about the memory access pattern, resulting in imprecise memory access information. In this thesis, we propose novel solutions that identify an optimized sharing-aware mapping by using the processor's memory management unit (MMU) to monitor memory accesses. Our solutions work online, in parallel with the execution of the application, and detect the memory access pattern for both thread and data mapping. With this information, the operating system can perform sharing-aware thread and data mapping during the execution of the application, without any prior knowledge of its behavior. Since they work directly in the memory management unit, our solutions can track most memory accesses performed by the parallel application with very low overhead. They can be implemented in architectures with hardware-managed TLBs with little additional hardware, and some can be implemented in architectures with software-managed TLBs without any hardware changes. Our solutions are more accurate than previous mechanisms because they have access to more detailed information about the memory access behavior. To demonstrate the benefits of the proposed solutions, we evaluate them with a wide variety of applications using a full-system simulator, a real machine with software-managed TLBs, and a trace-driven evaluation on two real machines with hardware-managed TLBs. In the experimental evaluation, our proposals reduced execution time by up to 39%. The improvements came from a substantial reduction in cache misses and inter-chip interconnection traffic.
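The mechanism described above amounts to building a picture of which threads share data from events the MMU already sees, and letting the OS act on it. A minimal C sketch of that idea follows, assuming a per-page history of recent sharers and a greedy pairing step; the hook names (mmu_observe, map_threads) and the two-sharer history are illustrative, not the thesis's actual interface:

```c
#include <stdint.h>

#define MAX_THREADS 64
#define SHARERS_PER_PAGE 2       /* track the last two accessing threads */

/* Sharing matrix: share[i][j] counts observed accesses by thread i to
   pages recently touched by thread j. Updated from the MMU/TLB path. */
static uint64_t share[MAX_THREADS][MAX_THREADS];

/* Per-page history of recent sharers; entries start at -1 (no sharer). */
typedef struct { int sharers[SHARERS_PER_PAGE]; } page_hist_t;

/* Called on a TLB miss (software-managed TLB) or from a page-table
   walker extension (hardware-managed TLB); both hooks stand in for
   the mechanisms the thesis adds to the MMU. */
void mmu_observe(page_hist_t *ph, int tid)
{
    for (int s = 0; s < SHARERS_PER_PAGE; s++)
        if (ph->sharers[s] >= 0 && ph->sharers[s] != tid)
            share[tid][ph->sharers[s]]++;
    ph->sharers[1] = ph->sharers[0];     /* shift in the new sharer */
    ph->sharers[0] = tid;
}

/* Greedy pairing: repeatedly co-locate the pair of unmapped threads
   with the highest sharing count on adjacent cores (assumed to share
   a cache level). An odd leftover thread keeps its default placement. */
void map_threads(int nthreads, int core_of[])
{
    int done[MAX_THREADS] = {0};
    int next_core = 0;
    for (int placed = 0; placed + 1 < nthreads; placed += 2) {
        int bi = -1, bj = -1;
        uint64_t best = 0;
        for (int i = 0; i < nthreads; i++)
            for (int j = 0; j < nthreads; j++)
                if (i != j && !done[i] && !done[j] &&
                    share[i][j] + share[j][i] >= best) {
                    best = share[i][j] + share[j][i];
                    bi = i; bj = j;
                }
        done[bi] = done[bj] = 1;
        core_of[bi] = next_core++;
        core_of[bj] = next_core++;
    }
}
```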
113

Efficient shared cache management in multicore processors

Xie, Yuejian 20 May 2011
In modern multicore processors, various resources (such as memory bandwidth and caches) are designed to be shared by concurrently running threads. Although running multiple programs on a single chip at the same time is beneficial, contention for these shared resources can degrade system performance, while naive hard-partitioning between threads can leave resources underutilized. This research shows that the shared cache can be managed dynamically with simple and effective approaches. The contributions of this work are: (1) a technique for dynamic online classification of application memory access behaviors to predict the usefulness of cache partitioning, together with a simple shared-cache management approach based on the classification; (2) a cache pseudo-partitioning technique that manipulates insertion and promotion policies; (3) a scalable algorithm to quickly decide per-core cache allocations; (4) a pseudo-LRU approximation of cache partitioning; (5) a dynamic shared-cache compression technique that accounts for different thread behaviors.
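Contribution (2), pseudo-partitioning via insertion and promotion policies, can be illustrated with a small C sketch. The quota array, the insert-near-LRU position, and the single-step promotion below are assumptions chosen to show the mechanism, not the dissertation's exact policy:

```c
#include <stdint.h>

#define WAYS 8

typedef struct {
    uint64_t tag[WAYS];     /* position 0 = MRU, WAYS-1 = LRU */
    int      owner[WAYS];   /* core that inserted each line */
} cache_set_t;

/* Per-core way quotas decided by the allocation algorithm
   (the name and mechanism are illustrative). */
extern int target[];

static int ways_owned(const cache_set_t *s, int core)
{
    int n = 0;
    for (int w = 0; w < WAYS; w++)
        if (s->owner[w] == core) n++;
    return n;
}

/* Pseudo-partitioned insertion: a core over its quota inserts near the
   LRU end, so its line is evicted soon unless it is re-referenced. */
void insert_line(cache_set_t *s, uint64_t tag, int core)
{
    int pos = (ways_owned(s, core) >= target[core]) ? WAYS - 2 : 0;
    for (int w = WAYS - 1; w > pos; w--) {   /* evict LRU, shift down */
        s->tag[w] = s->tag[w - 1];
        s->owner[w] = s->owner[w - 1];
    }
    s->tag[pos] = tag;
    s->owner[pos] = core;
}

/* Pseudo-partitioned promotion: move a hit line up one position rather
   than straight to MRU, so a line must be re-referenced repeatedly to
   reach MRU and a thrashing core cannot monopolize recency. */
void promote_on_hit(cache_set_t *s, int way)
{
    if (way == 0) return;
    uint64_t t = s->tag[way];
    int      o = s->owner[way];
    s->tag[way] = s->tag[way - 1];  s->owner[way] = s->owner[way - 1];
    s->tag[way - 1] = t;            s->owner[way - 1] = o;
}
```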
114

Micro-scheduling and its interaction with cache partitioning

Choudhary, Dhruv 05 July 2011
The thesis explores the sources of energy inefficiency in asymmetric multi-core architectures, where energy efficiency is measured by the energy-delay squared product. The insights gathered from this study drive the development of optimized thread scheduling and coordinated cache management strategies for an important class of asymmetric shared-memory architectures. The proposed techniques are founded on well-known mathematical optimization methods, yet are lightweight enough to be implemented in practical systems.
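For reference, the energy-delay squared product mentioned above is simply E x D^2, so halving delay at equal energy improves the metric fourfold. A tiny C helper, with joules and seconds assumed as the units:

```c
#include <stdio.h>

/* Energy-delay-squared product (ED^2P): lower is better. Squaring the
   delay weights performance more heavily than the plain energy-delay
   product does. */
static double ed2p(double energy_joules, double delay_seconds)
{
    return energy_joules * delay_seconds * delay_seconds;
}

int main(void)
{
    /* Example: config B spends 20% more energy but is 30% faster,
       so it wins under ED^2P despite losing on energy alone. */
    printf("A: %.3f\n", ed2p(10.0, 1.0));   /* 10.000 */
    printf("B: %.3f\n", ed2p(12.0, 0.7));   /*  5.880 */
    return 0;
}
```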
115

Enhanced font services for X Window system

Tsang, Pong-fan, Dex. January 2000
Thesis (M. Phil.)--University of Hong Kong, 2001. / Includes bibliographical references (leaves 80-84).
116

Algorithms and data structures for cache-efficient computation: theory and experimental evaluation

Chowdhury, Rezaul Alam 28 August 2008
Not available
117

Algorithms for distributed caching and aggregation

Tiwari, Mitul 29 August 2008
Not available
118

Cache controller based on AHB bus

Γερακάρης, Δημήτρης 16 May 2014
This thesis is an effort to build a cache controller based on the AHB bus. It was developed mostly in the VLSI Laboratory of the Department of Computer Engineering and Informatics, with the prospect of integrating it into a larger existing system based on ARM's open-source Cortex M0 CPU. It was tested successfully on a laboratory FPGA but has not yet been used under real conditions. The ultimate goal is to use it in the laboratory to speed up applications that need external memory, that is, more memory than the FPGA's embedded memory. Although it has not been tested in any other system, it was built according to the AHB standard, so it should in principle integrate into any bus-compatible system without problems. The rationale behind the implementation is that certain parameters can be changed relatively easily, so the controller can be adapted to individual needs. The specifications are given in the thesis, although the controller may be redesigned within 2014, outside the scope of this work, to become fully modular. The controller is written in SystemVerilog and is compatible with the AHB bus.
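As a behavioral sketch of the configurability goal described above (the real controller is SystemVerilog attached to the AHB bus; the parameter names and direct-mapped organization here are assumptions, not the thesis's design), the address split that such parameters control can be modeled in a few lines of C:

```c
#include <stdint.h>

/* Illustrative knobs, in the spirit of making the controller
   retargetable by editing a few constants. */
#define LINE_BYTES 32
#define NUM_LINES  256            /* direct-mapped, for brevity */

typedef struct { uint32_t tag; int valid; } line_t;
static line_t lines[NUM_LINES];

/* Split an incoming AHB address (HADDR) into offset/index/tag and
   report whether it hits the cache. */
int cache_hit(uint32_t haddr)
{
    uint32_t block = haddr / LINE_BYTES;   /* drop the byte offset */
    uint32_t index = block % NUM_LINES;    /* select the line      */
    uint32_t tag   = block / NUM_LINES;    /* remaining high bits  */
    return lines[index].valid && lines[index].tag == tag;
}
```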
119

Enabling scalable online user interaction management through data warehousing of interaction histories

Thomas, Helen 12 1900
No description available.
120

Shared resource management for efficient heterogeneous computing

Lee, Jaekyu 13 January 2014
The demand for heterogeneous computing, driven by its performance and energy efficiency, has made on-chip heterogeneous chip multiprocessors (HCMPs) the mainstream computing platform, as the recent trend shows across a wide spectrum of platforms from smartphone application processors to desktop and low-end server processors. The performance of on-chip GPUs is not yet comparable to that of discrete GPU cards, but vendors have been integrating more powerful GPUs and this trend will continue in upcoming processors. In this architecture, several system resources are shared between CPUs and GPUs. Sharing system resources enables easier and cheaper data transfer between CPUs and GPUs, but it also causes resource contention between cores. The resource sharing problem has existed since the homogeneous (CPU-only) chip multiprocessor (CMP) was introduced, but resource sharing in HCMPs has different aspects because of the different nature of CPU and GPU cores. To solve the resource sharing problem in HCMPs, we consider efficient shared-resource management schemes, in particular for the shared last-level cache and the interconnection network. The thesis proposes four resource sharing mechanisms. First, an efficient cache sharing mechanism that exploits the different characteristics of CPU and GPU cores to share cache space effectively between them. Second, adaptive virtual channel partitioning for the on-chip interconnection network to isolate inter-application interference: by partitioning virtual channels between CPUs and GPUs, interference is prevented while quality-of-service (QoS) is guaranteed for both kinds of cores. Third, a dynamic frequency control mechanism for sharing system resources efficiently: when both kinds of cores are active, the degree of resource contention and the system throughput depend on the operating frequencies of the CPUs and GPUs, and the proposed mechanism searches for operating frequencies that reduce contention while improving throughput. Finally, a second cache sharing mechanism that exploits GPU-semantic information: the programming and execution models of GPUs are stricter and simpler than those of CPUs, and programmers provide more information to the hardware; by exploiting these characteristics, GPUs can use the cache energy-efficiently, and a simpler but more effective cache partitioning becomes possible for HCMPs.
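The first cache-sharing mechanism rests on the observation that streaming GPU workloads gain little from last-level cache space, while CPU workloads suffer when their working sets are flushed. A hedged C sketch of that idea follows; the sampling thresholds and the insertion-position policy are illustrative assumptions, not the thesis's exact algorithm:

```c
typedef enum { CORE_CPU, CORE_GPU } core_kind_t;

/* Per-application sampling counters (illustrative): hits that GPU lines
   receive after insertion indicate whether the GPU workload reuses the
   cache or merely streams through it. */
typedef struct {
    unsigned gpu_inserted;
    unsigned gpu_reused;
} sample_t;

/* Decide the insertion position for a new line in a shared LLC.
   A GPU application with little measured reuse is inserted at the LRU
   position so it cannot flush CPU working sets; everything else is
   inserted at MRU. */
int insertion_position(core_kind_t who, const sample_t *s, int ways)
{
    if (who == CORE_GPU && s->gpu_inserted > 1000 &&
        s->gpu_reused * 4 < s->gpu_inserted)   /* under 25% reuse: streaming */
        return ways - 1;                        /* LRU position */
    return 0;                                   /* MRU position */
}
```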
