• Refine Query
  • Source
  • Publication year
  • to
  • Language
  • 345
  • 54
  • 41
  • 39
  • 23
  • 16
  • 15
  • 13
  • 8
  • 8
  • 4
  • 3
  • 3
  • 3
  • 3
  • Tagged with
  • 745
  • 291
  • 279
  • 144
  • 100
  • 93
  • 90
  • 87
  • 79
  • 70
  • 65
  • 46
  • 44
  • 43
  • 38
  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
431

Mapping HW resource usage towards SW performance

Suljevic, Benjamin January 2019 (has links)
With the software applications increasing in complexity, description of hardware is becoming increasingly relevant. To ensure the quality of service for specific applications, it is imperative to have an insight into hardware resources. Cache memory is used for storing data closer to the processor needed for quick access and improves the quality of service of applications. The description of cache memory usually consists of the size of different cache levels, set associativity, or line size. Software applications would benefit more from a more detailed model of cache memory.In this thesis, we offer a way of describing the behavior of cache memory which benefits software performance. Several performance events are tested, including L1 cache misses, L2 cache misses, and L3 cache misses. With the collected information, we develop performance models of cache memory behavior. Goodness of fit is tested for these models and they are used to predict the behavior of the cache memory during future runs of the same application.Our experiments show that L1 cache misses can be modeled to predict the future runs. L2 cache misses model is less accurate but still usable for predictions, and L3 cache misses model is the least accurate and is not feasible to predict the behavior of the future runs.
432

An efficient and scalable core allocation strategy for multicore systems

Unknown Date (has links)
Multiple threads can run concurrently on multiple cores in a multicore system and improve performance/power ratio. However, effective core allocation in multicore and manycore systems is very challenging. In this thesis, we propose an effective and scalable core allocation strategy for multicore systems to achieve optimal core utilization by reducing both internal and external fragmentations. Our proposed strategy helps evenly spreading the servicing cores on the chip to facilitate better heat dissipation. We introduce a multi-stage power management scheme to reduce the total power consumption by managing the power states of the cores. We simulate three multicore systems, with 16, 32, and 64 cores, respectively, using synthetic workload. Experimental results show that our proposed strategy performs better than Square-shaped, Rectangle-shaped, L-Shaped, and Hybrid (contiguous and non-contiguous) schemes in multicore systems in terms of fragmentation and completion time. Among these strategies, our strategy provides a better heat dissipation mechanism. / by Manira S. Rani. / Thesis (M.S.C.S.)--Florida Atlantic University, 2011. / Includes bibliography. / Electronic reproduction. Boca Raton, Fla., 2011. Mode of access: World Wide Web.
433

Caching HTTP : A comparative study of caching reverse proxies Varnish and Nginx

Logren Dély, Tobias January 2014 (has links)
With the amount of users on the web steadily increasing websites must at times endure heavy loads and risk grinding to a halt beneath the flood of visitors. One solution to this problem is by using HTTP reverse proxy caching, which acts as an intermediate between web application and user. Content from the application is stored and passed on, avoiding the need for the application produce it anew for every request. One popular application designed solely for this task is Varnish; another interesting application for the task is Nginx which is primarily designed as a web server. This thesis compares the performance of the two applications in terms of number of requests served in relation to response time, as well as system load and free memory. With both applications using their default configuration, the experiments find that Nginx performs better in the majority of tests performed. The difference is however very slightly in tests with low request rate.
434

High-performance memory safety : optimizing the CHERI capability machine

Joannou, Alexandre Jean-Michel Procopi January 2018 (has links)
This work presents optimizations for modern capability machines and specifically for the CHERI architecture, a 64-bit MIPS instruction set extension for security, supporting fine-grained memory protection through hardware-enforced capabilities. The original CHERI model uses 256-bit capabilities to carry information required for various checks helping to enforce memory safety, leading to increased memory bandwidth requirements and cache pressure when using CHERI capabilities in place of conventional 64-bit pointers. In order to mitigate this cost, I present two new 128-bit CHERI capability formats, using different compression techniques, while preserving C-language compatibility lacking in previous pointer compression schemes. I explore the trade-offs introduced by these new formats over the 256-bit format. I produce an implementation in the L3 ISA modeling language, collaborate on the hardware implementation, and provide an evaluation of the mechanism. Another cost related to CHERI capabilities is the memory traffic increase due to capability-validity tags: to provide unforgeable capabilities, CHERI uses a tagged memory that preserves validity tags for every 256-bit memory word in a shadowspace inaccessible to software. The CHERI hardware implementation of this shadowspace uses a capability-validity-tag table in memory and caches it at the end of the cache hierarchy. To efficiently implement such a shadowspace and improve on CHERI’s current approach, I use sparse data structures in a hierarchical tag-cache that filters unnecessary memory accesses. I present an in-depth study of this technique through a Python implementation of the hierarchical tag-cache, and also provide a hardware implementation and evaluation. I find that validity-tag traffic is reduced for all applications and scales with tag use. For legacy applications that do not use tags, there is near zero overhead. Removing these costs through the use of the proposed optimizations makes the CHERI architecture more affordable and appealing for industrial adoption.
435

Griddler : uma estratégia configurável para armazenamento distribuído de objetos peer-to-peer que combina replicação e erasure coding com sistema de cache /

Caetano, André Francisco Morielo. January 2017 (has links)
Orientador: Carlos Roberto Valêncio / Banca: Geraldo Francisco Donega Zafalon / Banca: Pedro Luiz Pizzigatti Correa / Resumo: Sistemas de gerenciamento de banco de dados, na sua essência, almejam garantir o armazenamento confiável da informação. Também é tarefa de um sistema de gerenciamento de banco de dados oferecer agilidade no acesso às informações. Nesse contexto, é de grande interesse considerar alguns fenômenos recentes: a progressiva geração de conteúdo não-estruturado, como imagens e vídeo, o decorrente aumento do volume de dados em formato digital nas mais diversas mídias e o grande número de requisições por parte de usuários cada vez mais exigentes. Esses fenômenos fazem parte de uma nova realidade, denominada Big Data, que impõe aos projetistas de bancos de dados um aumento nos requisitos de flexibilidade, escalabilidade, resiliência e velocidade dos seus sistemas. Para suportar dados não-estruturados foi preciso se desprender de algumas limitações dos bancos de dados convencionais e definir novas arquiteturas de armazenamento. Essas arquiteturas definem padrões para gerenciamento dos dados, mas um sistema de armazenamento deve ter suas especificidades ajustadas em cada nível de implementação. Em termos de escalabilidade, por exemplo, cabe a escolha entre sistemas com algum tipo de centralização ou totalmente descentralizados. Por outro lado, em termos de resiliência, algumas soluções utilizam um esquema de replicação para preservar a integridade dos dados por meio de cópias, enquanto outras técnicas visam a otimização do volume de dados armazenados. Por fim, ao mesmo tempo que são... / Abstract: Database management systems, in essence, aim to ensure the reliable storage of information. It is also the task of a database management system to provide agility in accessing information. In this context, it is of great interest to consider some recent phenomena: the progressive generation of unstructured content such as images and video, the consequent increase in the volume of data in digital format in the most diverse media and the large number of requests by users increasingly demanding. These phenomena are part of a new reality, named Big Data, that imposes on database designers an increase in the flexibility, scalability, resiliency, and speed requirements of their systems. To support unstructured data, it was necessary to get rid of some limitations of conventional databases and define new storage architectures. These architectures define standards for data management, but a storage system must have its specificities adjusted at each level of implementation. In terms of scalability, for example, it is up to the choice between systems with some type of centralization or totally decentralized. On the other hand, in terms of resiliency, some solutions utilize a replication scheme to preserve the integrity of the data through copies, while other techniques are aimed at optimizing the volume of stored data. Finally, at the same time that new network and disk technologies are being developed, one might think of using caching to optimize access to what is stored. This work explores and analyzes the different levels in the development of distributed storage systems. This work objective is to present an architecture that combines different resilience techniques. The scientific contribution of this work is, in addition to a totally decentralized suggestion of data allocation, the use of an access cache structure with adaptive algorithms in this environment / Mestre
436

Cache-conscious off-line real-time scheduling for multi-core platforms : algorithms and implementation / Ordonnanceur hors-ligne temps-réel et conscient du cache ciblant les architectures multi-coeurs : algorithmes et implémentations

Nguyen, Viet Anh 22 February 2018 (has links)
Les temps avancent et les applications temps-réel deviennent de plus en plus gourmandes en ressources. Les plate-formes multi-cœurs sont apparues dans le but de satisfaire les demandes des applications en ressources, tout en réduisant la taille, le poids, et la consommation énergétique. Le challenge le plus pertinent, lors du déploiement d'un système temps-réel sur une plate-forme multi-cœur, est de garantir les contraintes temporelles des applications temps réel strict s'exécutant sur de telles plate-formes. La difficulté de ce challenge provient d'une interdépendance entre les analyses de prédictabilité temporelle. Cette interdépendance peut être figurativement liée au problème philosophique de l'œuf et de la poule, et expliqué comme suit. L'un des pré-requis des algorithmes d'ordonnancement est le Pire Temps d'Exécution (PTE) des tâches pour déterminer leur placement et leur ordre d'exécution. Mais ce PTE est lui aussi influencé par les décisions de l'ordonnanceur qui va déterminer quelles sont les tâches co-localisées ou concurrentes propageant des effets sur les caches locaux et les ressources physiquement partagées et donc le PTE. La plupart des méthodes d'analyse pour les architectures multi-cœurs supputent un seul PTE par tâche, lequel est valide pour toutes conditions d'exécutions confondues. Cette hypothèse est beaucoup trop pessimiste pour entrevoir un gain de performance sur des architectures dotées de caches locaux. Pour de telles architectures, le PTE d'une tâche est dépendant du contenu du cache au début de l'exécution de la dite tâche, qui est lui-même dépendant de la tâche exécutée avant et ainsi de suite. Dans cette thèse, nous proposons de prendre en compte des PTEs incluant les effets des caches privés sur le contexte d’exécution de chaque tâche. Nous proposons dans cette thèse deux techniques d'ordonnancement ciblant des architectures multi-cœurs équipées de caches locaux. Ces deux techniques ordonnancent une application parallèle modélisée par un graphe de tâches, et génèrent un planning statique partitionné et non-préemptif. Nous proposons une méthode optimale à base de Programmation Linéaire en Nombre Entier (PLNE), ainsi qu'une méthode de résolution par heuristique basée sur de l'ordonnancement par liste. Les résultats expérimentaux montrent que la prise en compte des effets des caches privés sur les PTE des tâches réduit significativement la longueur des ordonnancements générés, ce comparé à leur homologue ignorant les caches locaux. Afin de parfaire les résultats ainsi obtenus, nous avons réalisé l'implémentation de nos ordonnancements dirigés par le temps et conscients du cache pour un déploiement sur une machine Kalray MPPA-256, une plate-forme multi-cœur en grappes (clusters). En premier lieu, nous avons identifié les challenges réels survenant lors de ce type d'implémentation, tel que la pollution des caches, la contention induite par le partage du bus, les délais de lancement d'une tâche introduits par la présence de l'ordonnanceur, et l'absence de cohérence des caches de données. En second lieu, nous proposons des stratégies adaptées et incluant, dans la formulation PLNE, les contraintes matérielles ; ainsi qu'une méthode permettant de générer le code final de l'application. Enfin, l'évaluation expérimentale valide la correction fonctionnelle et temporelle de notre implémentation pendant laquelle nous avons pu observé le facteur le plus impactant la longueur de l'ordonnancement: la contention. / Nowadays, real-time applications are more compute-intensive as more functionalities are introduced. Multi-core platforms have been released to satisfy the computing demand while reducing the size, weight, and power requirements. The most significant challenge when deploying real-time systems on multi-core platforms is to guarantee the real-time constraints of hard real-time applications on such platforms. This is caused by interdependent problems, referred to as a chicken and egg situation, which is explained as follows. Due to the effect of multi-core hardware, such as local caches and shared hardware resources, the timing behavior of tasks are strongly influenced by their execution context (i.e., co-located tasks, concurrent tasks), which are determined by scheduling strategies. Symetrically, scheduling algorithms require the Worst-Case Execution Time (WCET) of tasks as prior knowledge to determine their allocation and their execution order. Most schedulability analysis techniques for multi-core architectures assume a single WCET per task, which is valid in all execution conditions. This assumption is too pessimistic for parallel applications running on multi-core architectures with local caches. In such architectures, the WCET of a task depends on the cache contents at the beginning of its execution, itself depending on the task that was executed before the task under study. In this thesis, we address the issue by proposing scheduling algorithms that take into account context-sensitive WCETs of tasks due to the effect of private caches. We propose two scheduling techniques for multi-core architectures equipped with local caches. The two techniques schedule a parallel application modeled as a task graph, and generate a static partitioned non-preemptive schedule. We propose an optimal method, using an Integer Linear Programming (ILP) formulation, as well as a heuristic method based on list scheduling. Experimental results show that by taking into account the effect of private caches on tasks’ WCETs, the length of generated schedules are significantly reduced as compared to schedules generated by cache-unaware scheduling methods. Furthermore, we perform the implementation of time-driven cache-conscious schedules on the Kalray MPPA-256 machine, a clustered many-core platform. We first identify the practical challenges arising when implementing time-driven cache-conscious schedules on the machine, including cache pollution cause by the scheduler, shared bus contention, delay to the start time of tasks, and data cache inconsistency. We then propose our strategies including an ILP formulation for adapting cache-conscious schedules to the identified practical factors, and a method for generating the code of applications to be executed on the machine. Experimental validation shows the functional and the temporal correctness of our implementation. Additionally, shared bus contention is observed to be the most impacting factor on the length of adapted cache-conscious schedules.
437

A cooperative and incentive-based proxy-and-client caching system for on-demand media streaming.

January 2005 (has links)
Ip Tak Shun. / Thesis (M.Phil.)--Chinese University of Hong Kong, 2005. / Includes bibliographical references (leaves 95-101). / Abstracts in English and Chinese. / Abstract --- p.i / Acknowledgement --- p.iv / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Background --- p.1 / Chapter 1.1.1 --- Media Streaming --- p.1 / Chapter 1.1.2 --- Incentive Mechanism --- p.2 / Chapter 1.2 --- Cooperative and Incentive-based Proxy-and-Client Caching --- p.4 / Chapter 1.2.1 --- Cooperative Proxy-and-Client Caching --- p.4 / Chapter 1.2.2 --- Revenue-Rewarding Mechanism --- p.5 / Chapter 1.3 --- Thesis Contribution --- p.6 / Chapter 1.4 --- Thesis Organization --- p.7 / Chapter 2 --- Related Work --- p.9 / Chapter 2.1 --- Media Streaming --- p.9 / Chapter 2.2 --- Incentive Mechanism --- p.11 / Chapter 2.3 --- Resource Pricing --- p.14 / Chapter 3 --- Cooperative Proxy-and-Client Caching --- p.16 / Chapter 3.1 --- Overview of the COPACC System --- p.16 / Chapter 3.2 --- Optimal Cache Allocation (CAP) --- p.21 / Chapter 3.2.1 --- Single Proxy with Client Caching --- p.21 / Chapter 3.2.2 --- Multiple Proxies with Client Caching --- p.24 / Chapter 3.2.3 --- Cost Function with Suffix Multicast --- p.26 / Chapter 3.3 --- Cooperative Proxy-Client Caching Protocol --- p.28 / Chapter 3.3.1 --- Cache Allocation and Organization --- p.29 / Chapter 3.3.2 --- Cache Lookup and Retrieval --- p.30 / Chapter 3.3.3 --- Client Access and Integrity Verification --- p.30 / Chapter 3.4 --- Performance Evaluation --- p.33 / Chapter 3.4.1 --- Effectiveness of Cooperative Proxy and Client Caching --- p.34 / Chapter 3.4.2 --- Robustness --- p.37 / Chapter 3.4.3 --- Scalability and Control Overhead --- p.38 / Chapter 3.4.4 --- Sensitivity to Network Topologies --- p.40 / Chapter 4 --- Revenue-Rewarding Mechanism --- p.43 / Chapter 4.1 --- System Model --- p.44 / Chapter 4.1.1 --- System Overview --- p.44 / Chapter 4.1.2 --- System Formulation --- p.47 / Chapter 4.2 --- Resource Allocation Game --- p.50 / Chapter 4.2.1 --- Non-Cooperative Game --- p.50 / Chapter 4.2.2 --- Profit Maximizing Game --- p.52 / Chapter 4.2.3 --- Utility Maximizing Game --- p.61 / Chapter 4.3 --- Performance Evaluation --- p.74 / Chapter 4.3.1 --- Convergence --- p.76 / Chapter 4.3.2 --- Participation Incentive --- p.77 / Chapter 4.3.3 --- Cost effectiveness --- p.85 / Chapter 5 --- Conclusion --- p.87 / Chapter A --- NP-Hardness of the CAP problem --- p.90 / Chapter B --- Optimality of the Greedy Algorithm --- p.92 / Bibliography --- p.95
438

Extending branch prediction information to effective caching.

January 1996 (has links)
by Chung-Leung, Chiu. / Thesis (M.Phil.)--Chinese University of Hong Kong, 1996. / Includes bibliographical references (leaves 110-113). / Abstract --- p.i / Acknowledgement --- p.iii / Chapter 1 --- Introduction --- p.1 / Chapter 1.1 --- Partial Basic Block Storing Mechanism --- p.1 / Chapter 1.2 --- Data-Tagged Mechanism in Branch Target Buffer --- p.4 / Chapter 1.3 --- Organization of the dissertation --- p.5 / Chapter 2 --- Related Research --- p.7 / Chapter 2.1 --- Branch Prediction --- p.7 / Chapter 2.2 --- Branch History Table --- p.8 / Chapter 2.2.1 --- Performance of Branch History Table in reducing the Branch Penalty --- p.10 / Chapter 2.3 --- Branch Target Cache --- p.10 / Chapter 2.4 --- Early Resolution of Branch --- p.11 / Chapter 2.5 --- Software Inter-block Reorganization --- p.12 / Chapter 2.6 --- Branch Target Buffer --- p.13 / Chapter 2.7 --- Data Prefetching --- p.16 / Chapter 2.7.1 --- Software-Directed Prefetching --- p.16 / Chapter 2.7.2 --- Hardware-based prefetching --- p.17 / Chapter 3 --- New Branch Target Buffer Design --- p.19 / Chapter 3.1 --- Alternate Line Storing --- p.22 / Chapter 3.2 --- Storing More Than One Line On Entering The Dynamic Basic Block --- p.27 / Chapter 4 --- Simulation Environment for New Branch Target Buffer Design --- p.30 / Chapter 4.1 --- Architectural Models and Assumptions --- p.30 / Chapter 4.2 --- Memory Models --- p.33 / Chapter 4.3 --- Evaluation Methodology and Measurement Criteria --- p.34 / Chapter 4.4 --- Description of the Traces --- p.35 / Chapter 4.5 --- Effect of the limitation of ATOM on the statistics of SPEC92 Bench- marks --- p.35 / Chapter 4.6 --- Environments for collecting relevant statistics of SPEC92 Benchmarks --- p.36 / Chapter 5 --- Results for New Branch Target Buffer Design --- p.38 / Chapter 5.1 --- Statistical Results and Analysis for SPEC92 Benchmarks --- p.38 / Chapter 5.2 --- Overall Performance --- p.39 / Chapter 5.3 --- Bus Latency Effect --- p.42 / Chapter 5.4 --- Effect of Cache Size --- p.45 / Chapter 5.5 --- Effect of Line Size --- p.47 / Chapter 5.6 --- Cache Set Associativity --- p.50 / Chapter 5.7 --- Partial Hits --- p.50 / Chapter 5.8 --- Prefetch Accuracy --- p.53 / Chapter 5.9 --- Effect of Prefetch Buffer Size --- p.54 / Chapter 5.10 --- Effect of Storing More Than One Line on Entry of New Dynamic Basic Block --- p.56 / Chapter 6 --- Data References Tagged into Branch Target Buffer --- p.60 / Chapter 6.1 --- Branch History Table Tagged Mechanism --- p.60 / Chapter 6.2 --- Lookahead Technique --- p.65 / Chapter 6.3 --- Default Prefetches Vs Data-tagged Prefetches --- p.71 / Chapter 6.4 --- New Priority Scheme --- p.73 / Chapter 7 --- Architectural Model for Data-Tagged References in Branch Target Buffer --- p.74 / Chapter 7.1 --- Architectural Models and Assumptions --- p.76 / Chapter 7.2 --- Memory Models --- p.79 / Chapter 7.3 --- Evaluation Methodology and Measurement Criteria --- p.79 / Chapter 7.4 --- Description of the Traces --- p.80 / Chapter 7.5 --- Environments for collecting relevant statistics of SPEC92 Benchmarks --- p.80 / Chapter 8 --- Results for Data References Tagged into Branch Target Buffer --- p.82 / Chapter 8.1 --- Statistical Results and Analysis --- p.82 / Chapter 8.2 --- Overall Performance --- p.83 / Chapter 8.3 --- Effect of Branch Prediction --- p.85 / Chapter 8.4 --- Effect of Number of Tagged Registers --- p.87 / Chapter 8.5 --- Effect of Different Tagged Positions in Basic Block --- p.90 / Chapter 8.6 --- Effect of Lookahead Size --- p.91 / Chapter 8.7 --- Prefetch Accuracy --- p.93 / Chapter 8.8 --- Cache Size --- p.95 / Chapter 8.9 --- Line Size --- p.96 / Chapter 8.10 --- Set Associativity --- p.97 / Chapter 8.11 --- Size of Branch History Table --- p.99 / Chapter 8.12 --- Set Associativity of Branch History Table --- p.99 / Chapter 8.13 --- New Priority Scheme Vs Default Priority Scheme --- p.102 / Chapter 8.14 --- Effect of Prefetch-On-Miss --- p.103 / Chapter 8.15 --- Memory Latency --- p.104 / Chapter 9 --- Conclusions and Future Research --- p.106 / Chapter 9.1 --- Conclusions --- p.106 / Chapter 9.2 --- Future Research --- p.108 / Bibliography --- p.110 / Appendix --- p.114 / Chapter A --- Statistical Results - SPEC92 Benchmarks --- p.114 / Chapter A.1 --- Definition of Abbreviations and Terms --- p.114
439

Cache-oblivious Algorithms / Cache-oblivious Algorithms

Vaner, Michal January 2012 (has links)
In this work, we study the cache-oblivious computation model, which is inspired by the behaviour of the memory hierarchy of current computers. We study several graph algorithms and techniques of their design in this model. We consider graph searching, identifying connected components and computing maximal matching. We also study sorting and matrix multiplication as subproblems of many graph algorithms. In ad- dition to previously known algorithms, we present several new ones. We study their efficiency both by the means of asymptotic complexity and by benchmarking them on real hardware and we compare them with classical algorithms.
440

Online thread and data mapping using the memory management unit / Mapeamento dinâmico de threads e dados usando a unidade de gerência de memória

Cruz, Eduardo Henrique Molina da January 2016 (has links)
Conforme o paralelismo a nível de threads aumenta nas arquiteturas modernas devido ao aumento do número de núcleos por processador e processadores por sistema, a complexidade da hierarquia de memória também aumenta. Tais hierarquias incluem diversos níveis de caches privadas ou compartilhadas e tempo de acesso não uniforme à memória. Um desafio importante em tais arquiteturas é a movimentação de dados entre os núcleos, caches e bancos de memória primária, que ocorre quando um núcleo realiza uma transação de memória. Neste contexto, a redução da movimentação de dados é um dos pilares para futuras arquiteturas para manter o aumento de desempenho e diminuir o consumo de energia. Uma das soluções adotadas para reduzir a movimentação de dados é aumentar a localidade dos acessos à memória através do mapeamento de threads e dados. Mecanismos de mapeamento do estado-da-arte aumentam a localidade de memória mapeando threads que compartilham um grande volume de dados em núcleos próximos na hierarquia de memória (mapeamento de threads), e mapeando os dados em bancos de memória próximos das threads que os acessam (mapeamento de dados). Muitas propostas focam em mapeamento de threads ou dados separadamente, perdendo oportunidades de ganhar desempenho. Outras propostas dependem de traços de execução para realizar um mapeamento estático, que podem impor uma sobrecarga alta e não podem ser usados em aplicações cujos comportamentos de acesso à memória mudam em diferentes execuções. Há ainda propostas que usam amostragem ou informações indiretas sobre o padrão de acesso à memória, resultando em informação imprecisa sobre o acesso à memória. Nesta tese de doutorado, são propostas soluções inovadoras para identificar um mapeamento que otimize o acesso à memória fazendo uso da unidade de gerência de memória para monitor os acessos à memória. As soluções funcionam dinamicamente em paralelo com a execução da aplicação, detectando informações para o mapeamento de threads e dados. Com tais informações, o sistema operacional pode realizar o mapeamento durante a execução das aplicações, não necessitando de conhecimento prévio sobre o comportamento da aplicação. Como as soluções funcionam diretamente na unidade de gerência de memória, elas podem monitorar a maioria dos acessos à memória com uma baixa sobrecarga. Em arquiteturas com TLB gerida por hardware, as soluções podem ser implementadas com pouco hardware adicional. Em arquiteturas com TLB gerida por software, algumas das soluções podem ser implementadas sem hardware adicional. As soluções aqui propostas possuem maior precisão que outros mecanismos porque possuem acesso a mais informações sobre o acesso à memória. Para demonstrar os benefícios das soluções propostas, elas são avaliadas com uma variedade de aplicações usando um simulador de sistema completo, uma máquina real com TLB gerida por software, e duas máquinas reais com TLB gerida por hardware. Na avaliação experimental, as soluções reduziram o tempo de execução em até 39%. O ganho de desempenho se deu por uma redução substancial da quantidade de faltas na cache, e redução do tráfego entre processadores. / As thread-level parallelism increases in modern architectures due to larger numbers of cores per chip and chips per system, the complexity of their memory hierarchies also increase. Such memory hierarchies include several private or shared cache levels, and Non-Uniform Memory Access nodes with different access times. One important challenge for these architectures is the data movement between cores, caches, and main memory banks, which occurs when a core performs a memory transaction. In this context, the reduction of data movement is an important goal for future architectures to keep performance scaling and to decrease energy consumption. One of the solutions to reduce data movement is to improve memory access locality through sharing-aware thread and data mapping. State-of-the-art mapping mechanisms try to increase locality by keeping threads that share a high volume of data close together in the memory hierarchy (sharing-aware thread mapping), and by mapping data close to where its accessing threads reside (sharing-aware data mapping). Many approaches focus on either thread mapping or data mapping, but perform them separately only, losing opportunities to improve performance. Some mechanisms rely on execution traces to perform a static mapping, which have a high overhead and can not be used if the behavior of the application changes between executions. Other approaches use sampling or indirect information about the memory access pattern, resulting in imprecise memory access information. In this thesis, we propose novel solutions to identify an optimized sharing-aware mapping that make use of the memory management unit of processors to monitor the memory accesses. Our solutions work online in parallel to the execution of the application and detect the memory access pattern for both thread and data mappings. With this information, the operating system can perform sharing-aware thread and data mapping during the execution of the application, without any prior knowledge of their behavior. Since they work directly in the memory management unit, our solutions are able to track most memory accesses performed by the parallel application, with a very low overhead. They can be implemented in architectures with hardwaremanaged TLBs with little additional hardware, and some can be implemented in architectures with software-managed TLBs without any hardware changes. Our solutions have a higher accuracy than previous mechanisms because they have access to more accurate information about the memory access behavior. To demonstrate the benefits of our proposed solutions, we evaluate them with a wide variety of applications using a full system simulator, a real machine with software-managed TLBs, and a trace-driven evaluation in two real machines with hardware-managed TLBs. In the experimental evaluation, our proposals were able to reduce execution time by up to 39%. The improvements happened to a substantial reduction in cache misses and interchip interconnection traffic.

Page generated in 0.0664 seconds