  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
21

Impact of Increased Cache Misses on Runtime Performance of MPX-enabled Programs

Sharma, Niti 10 June 2019 (has links)
Low-level languages like C and C++ provide high performance and direct control over memory management, but they are prone to memory safety violations. Intel introduced a new ISA extension, Memory Protection Extensions (MPX), a hardware-assisted full-stack solution, to protect against these violations. While MPX efficiently prevents memory errors such as buffer overflows and out-of-bounds memory accesses, it comes at the cost of high performance overheads. Cache locality also worsens in MPX-protected applications. In our research, we analyze whether there is a correlation between the increase in cache misses and runtime degradation in programs compiled with MPX support. We analyze 15 SPEC CPU benchmark programs for different input sizes on the Windows platform, compiled with Intel's ICC compiler. We find that for the input sizes train (medium) and ref (large), the average performance overheads are 140% and 144%, respectively. We find that 5 out of 15 benchmarks have no runtime overhead and no change in cache misses at any level. For the remaining 10 benchmarks, however, we find a strong correlation between runtime overheads and cache-miss overheads, with correlation coefficients ranging from 0.8 to 0.36 for different input sizes. Based on our findings, we conclude that there is a direct correlation between runtime overheads and the increase in cache misses. We also find that instruction overheads and runtime overheads are positively correlated, with coefficients ranging from 0.7 to 0.33 for different input sizes. / Master of Science / Low-level programming languages like C and C++ are primary choices for writing low-level systems software such as operating systems, virtual machines, embedded software, and performance-critical applications. But these languages are considered unsafe and prone to memory safety errors. Intel introduced a new technique, Memory Protection Extensions (MPX), to protect against these memory errors.
But prior research found that applications supported with MPX have increased runtimes (slowdowns). In our research, we analyze these slowdowns for different input sizes (medium and large) in 15 benchmark applications. Depending on the input size, the average slowdowns range from 140% to 144%. We then examine whether there is a correlation between the increase in cache misses under MPX and the slowdowns. A hardware cache is a component that stores data so that future requests for that data can be served faster. A cache miss is the state in which the data requested by a component or application is not found in the cache. Whenever a cache miss happens, the processor waits for the data to be fetched from the next cache level or from main memory before it can continue to execute. This wait affects the runtime performance of the application. Our evaluations find that the 10 out of 15 applications that have increased runtimes also have an increase in cache misses. This shows a positive correlation between these two parameters. We also find that the increase in instruction size in MPX-protected applications correlates directly with the runtime degradation. We quantify these relationships with a statistical measure called the correlation coefficient.
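The abstract above quantifies its findings with a correlation coefficient. As a minimal sketch, assuming the Pearson coefficient is the measure meant (the abstract does not say which one is used), the calculation looks like this. The overhead values are hypothetical placeholders, not data from the thesis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-benchmark overheads in percent (NOT taken from the thesis).
cache_miss_overhead = [10.0, 35.0, 60.0, 120.0, 200.0]
runtime_overhead = [15.0, 30.0, 90.0, 110.0, 260.0]

r = pearson(cache_miss_overhead, runtime_overhead)
print(f"correlation coefficient r = {r:.2f}")
```

A value near 1 indicates that benchmarks with larger cache-miss overheads also tend to have larger runtime overheads; it does not by itself show that one causes the other.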
22

Sur des modèles pour l’évaluation de performance des caches dans un réseau cœur et de la consommation d’énergie dans un réseau d’accès sans-fil / On models for performance analysis of a core cache network and power save of a wireless access network

Choungmo Fofack, Nicaise Éric 21 February 2014 (has links)
Internet est un véritable écosystème. Il se développe, évolue et s’adapte aux besoins des utilisateurs en termes de communication, de connectivité et d’ubiquité. Dans la dernière décennie, les modèles de communication ont changé passant des interactions machine-à-machine à un modèle machine-à-contenu. Cependant, différentes technologies sans-fil et de réseaux (tels que les smartphones et les réseaux 3/4G, streaming en ligne des médias, les réseaux sociaux, réseaux-orientés contenus) sont apparues pour améliorer la distribution de l’information. Ce développement a mis en lumière les problèmes liés au passage à l’échelle et à l’efficacité énergétique; d’où la question: Comment concevoir ou optimiser de tels systèmes distribués qui garantissent un accès haut débit aux contenus tout en (i) réduisant la congestion et la consommation d’énergie dans le réseau et (ii) s’adaptant à la demande des utilisateurs dans un contexte connectivité quasi-permanente? Dans cette thèse, nous nous intéressons à deux solutions proposées pour répondre à cette question: le déploiement des réseaux de caches et l’implantation des protocoles économes en énergie. Précisément, nous proposons des modèles analytiques pour la conception de ces réseaux de stockage et la modélisation de la consommation d’énergie dans les réseaux d’accès sans fil. Nos études montrent que la prédiction de la performance des réseaux de caches réels peut être faite avec des erreurs relatives absolues de l’ordre de 1% à 5% et qu’une proportion importante soit 70% à 90% du coût de l’énergie dans les cellules peut être économisée au niveau des stations de base et des mobiles sous des conditions réelles de trafic. / Internet is a real ecosystem. It grows, evolves and adapts to the needs of users in terms of communication, connectivity and ubiquity of users. 
In the last decade, the communication paradigm has shifted from traditional host-to-host interactions to the recent host-to-content model, while various wireless and networking technologies (such as 3/4G smartphones and networks, online media streaming, social networks, clouds, Big Data, information-centric networks) have emerged to enhance content distribution. This development has shed light on scalability and energy efficiency issues, which can be formulated as follows. How can we design or optimize such large-scale distributed systems in order to achieve and maintain high-speed access to content while (i) reducing congestion and energy consumption in the network and (ii) adapting to the temporal locality of users' demand in a continuous connectivity paradigm? In this thesis we focus on two solutions proposed to answer this question: in-network caching for the scalability issue and power-save protocols for the energy efficiency issue. Specifically, we propose analytic models for designing core cache networks and for modeling energy consumption in wireless access networks. Our studies show that the performance of general core cache networks in real application cases can be predicted with absolute relative errors on the order of 1%–5%; meanwhile, dramatic energy savings can be achieved by mobile devices and base stations, e.g., as much as 70%–90% of the energy cost in cells with realistic traffic load and the considered parameter settings.
23

Hybrid caches: design and data management

Valero Bresó, Alejandro 07 October 2013 (has links)
Cache memories have usually been implemented with Static Random-Access Memory (SRAM) technology since it is the fastest electronic memory technology. However, this technology suffers from high leakage currents, which is a major design concern because leakage energy consumption increases as the transistor size shrinks. Alternative technologies are being considered to reduce this consumption. Among them, embedded Dynamic RAM (eDRAM) technology provides minimal area and leakage by design, but its reads are destructive and it is not as fast as SRAM. In this thesis, SRAM and eDRAM technologies are combined to take advantage of what each of them offers. First, they are combined at cell level to implement an n-bit macrocell consisting of one SRAM cell and n-1 eDRAM cells. The macrocell is used to build n-way set-associative hybrid first-level (L1) data caches having one SRAM way and n-1 eDRAM ways. A single SRAM way is enough to achieve good performance given the high data locality of L1 caches. Architectural mechanisms such as way-prediction, swaps, and scrub operations are considered to avoid unnecessary eDRAM reads, to maintain the Most Recently Used (MRU) data in the fast SRAM way, and to completely avoid refresh logic. Experimental results show that, compared to a conventional SRAM cache, leakage and area are largely reduced with little impact on performance. The benefits of hybrid caches have also been studied in second-level (L2) caches acting as Last-Level Caches (LLCs). In this case, the technologies are combined at bank level, and the ratio of SRAM to eDRAM banks that achieves the best trade-off among performance, energy, and area is identified. As in L1 caches, the MRU blocks are kept in the SRAM banks and are accessed first to avoid unnecessary destructive reads. Nevertheless, refresh logic is not removed, since data locality differs widely at this cache level.
Experimental results show that a hybrid LLC with an eighth of its banks built with SRAM technology is enough to achieve the best target trade-off. This dissertation also deals with the performance of replacement policies in heterogeneous LLCs, mainly focusing on the energy overhead incurred by refresh operations. The thesis defines a new concept, the MRU-Tour (MRUT), that helps estimate reuse information of cache blocks. Based on this concept, it proposes a family of MRUT-based replacement algorithms that randomly select the victim block among those having a single MRUT. These policies are enhanced to leverage recency information for a few blocks and to adapt to changes in the working set of the benchmarks. Results show that the proposed MRUT policies, with simpler hardware complexity, outperform the Least Recently Used (LRU) policy and a set of the most representative state-of-the-art replacement policies for LLCs. Refresh operations represent an important fraction of the overall dynamic energy consumption of eDRAM LLCs. This fraction increases with the cache capacity, since more blocks have to be refreshed in a given period of time. Prior works have attacked the refresh energy by taking into account inter-cell feature variations. Unlike these works, this thesis proposes a selective refresh policy based on the MRUT concept. The devised policy takes into account the number of MRUTs of a block to decide whether the block is refreshed. In this way, many refreshes done in a typical distributed refresh policy are skipped (i.e., in those blocks having a single MRUT). This refresh mechanism is applied in the hybrid LLC memory. Results show that refresh energy consumption is largely reduced with respect to a conventional eDRAM cache, while the performance degradation is minimal with respect to a conventional SRAM cache. / Valero Bresó, A. (2013). Hybrid caches: design and data management [Tesis doctoral]. Editorial Universitat Politècnica de València. 
https://doi.org/10.4995/Thesis/10251/32663 / Premios Extraordinarios de tesis doctorales
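The MRUT-based replacement idea in the abstract above can be illustrated with a small simulation of one cache set. This is only a sketch under stated assumptions: it assumes an MRU-Tour is counted each time a block enters the MRU position of its set, and that when no block has a single MRUT the victim is drawn from all non-MRU blocks. Neither detail is spelled out in the abstract, and the class name `MRUTSet` is hypothetical.

```python
import random


class MRUTSet:
    """One cache set under a simplified MRU-Tour (MRUT) style policy."""

    def __init__(self, ways, rng=None):
        if ways < 2:
            raise ValueError("this sketch needs at least two ways")
        self.ways = ways
        self.mrut = {}   # block tag -> number of MRU-Tours observed so far
        self.mru = None  # tag currently in the MRU position
        self.rng = rng or random.Random(0)

    def access(self, tag):
        """Access `tag`; return the evicted tag, or None if nothing was evicted."""
        victim = None
        if tag not in self.mrut:
            if len(self.mrut) >= self.ways:
                victim = self._pick_victim()
                del self.mrut[victim]
            self.mrut[tag] = 0
        if self.mru != tag:
            # The block enters the MRU position, so a new tour starts (assumption).
            self.mrut[tag] += 1
            self.mru = tag
        return victim

    def _pick_victim(self):
        # Prefer blocks that have had exactly one MRU-Tour, as in the abstract.
        candidates = [t for t, n in self.mrut.items() if n == 1 and t != self.mru]
        if not candidates:
            # Assumed fallback: any block other than the current MRU block.
            candidates = [t for t in self.mrut if t != self.mru]
        return self.rng.choice(candidates)


cache_set = MRUTSet(ways=3)
for tag in ["A", "B", "A", "C"]:
    cache_set.access(tag)
evicted = cache_set.access("D")  # A has two tours, C is MRU, so B is the only candidate
print("evicted:", evicted)
```

The same per-block tour counter could, in principle, drive the selective refresh decision the abstract mentions (skip refreshing blocks with a single MRUT), but that part is not modeled here.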
24

Exécution d'applications stockées dans la mémoire non-adressable d'une carte à puce / Executing applications stored in the non-addressable memory of a smart card

Cogniaux, Geoffroy 13 December 2012 (has links) (PDF)
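La dernière génération de cartes à puce permet le téléchargement d'applications après leur mise en circulation. Outre les problèmes que cela implique, cette capacité d'extension applicative reste encore aujourd'hui bridée par un espace de stockage adressable restreint. La thèse défendue dans ce mémoire est qu'il est possible d'exécuter efficacement des applications stockées dans la mémoire non-adressable des cartes à puce, disponible en plus grande quantité, et ce, malgré ses temps de latences très longs, donc peu favorables a priori à l'exécution de code. Notre travail consiste d'abord à étudier les forces et faiblesse de la principale réponse proposée par l'état de l'art qu'est un cache. Cependant, dans notre contexte, il ne peut être implémenté qu'en logiciel, avec alors une latence supplémentaire. De plus, ce cache doit respecter les contraintes mémoires des cartes à puce et doit donc avoir une empreinte mémoire faible. Nous montrons comment et pourquoi ces deux contraintes réduisent fortement les performances d'un cache, qui devient alors une réponse insuffisante pour la résolution de notre challenge. Nous appliquons notre démonstration aux caches de code natif, puis de code et méta-données Java et JavaCard2. Forts de ces constats, nous proposons puis validons une solution reposant sur une pré-interprétation de code, dont le but est à la fois de détecter précocement les données manquantes en cache pour les charger à l'avance et en parallèle, mais aussi grouper des accès au cache et réduire ainsi l'impact de son temps de latence logiciel, démontré comme son principal coût. Le tout produit alors une solution efficace, passant l'échelle des cartes à puce. / The latest generation of smart cards allows applications to be downloaded after the cards have been issued. Beyond the problems this raises, this capacity for application extension is still limited today by a restricted addressable storage space. The thesis defended in this dissertation is that applications stored in the non-addressable memory of smart cards, which is available in larger quantity, can be executed efficiently despite its very long latencies, which are a priori unfavorable to code execution. Our work first studies the strengths and weaknesses of the main answer proposed by the state of the art, namely a cache. In our context, however, the cache can only be implemented in software, which adds further latency. Moreover, this cache must respect the memory constraints of smart cards and must therefore have a small memory footprint. We show how and why these two constraints sharply reduce the performance of a cache, which then becomes an insufficient answer to our challenge. We apply our demonstration to caches of native code, and then of Java and JavaCard2 code and metadata. Building on these observations, we propose and then validate a solution based on pre-interpretation of code, whose goal is both to detect early the data missing from the cache so that it can be loaded ahead of time and in parallel, and to group cache accesses and thus reduce the impact of the cache's software latency, shown to be its main cost. The result is an efficient solution that scales to smart cards.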
25

Understanding Multicore Performance : Efficient Memory System Modeling and Simulation

Sandberg, Andreas January 2014 (has links)
To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications are simulated multiple times with different interleaving to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by only simulating a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results. In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, that can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleaving without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. 
By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution. / CoDeR-MP / UPMARC
26

Τρόποι διαχείρισης κρυφών μνημών με ανομοιογενείς χρόνους πρόσβασης / Management methods for cache memories with non-uniform access times

Αβραμόπουλος, Γεώργιος 20 September 2010 (has links)
Η εργασία αποτελεί μελέτη της λειτουργίας των caches, χρησιμοποιώντας μια συγκεκριμένη cache δομή. Η εργασία αυτή έχει σα σκοπό τη μελέτη των κρυφών μνημών με μη ομοιογενή χρόνο προσπέλασης στα διάφορα «φυσικά» σημεία της επιφάνειάς της. Αντικειμενικός σκοπός των κρυφών αυτών μνημών, είναι να τοποθετούνται τα δεδομένα που χρησιμοποιούνται συχνότερα, σε θέσεις που βρίσκονται κοντύτερα στον επεξεργαστή και έχουν λιγότερες διασυνδέσεις καλωδίων, άρα έχουν και το μικρότερο χρόνο προσπέλασης. Όταν αυτό είναι επιτεύξιμο, τα δεδομένα που χρησιμοποιούνται περισσότερες φορές, χρειάζονται τον ελάχιστο χρόνο για την προσπέλασή τους. Για το σκοπό αυτό επιλέξαμε έναν ήδη προτεινόμενο μηχανισμό, τον οποίο αναλύσαμε εκτενώς. Η επιλογή αυτή δεν έγινε τυχαία, αλλά επιλέξαμε έναν μηχανισμό που διαφέρει στη λογική από τη γενική έννοια των εν λόγω κρυφών μνημών (NUCA), έχοντας σαν κύρια διαφορά ότι διαφοροποιεί εντελώς τη διαχείριση του tag από εκείνη του data array, αντίθετα με τις γενικότερης έννοιας NUCA μνήμες. Εκτός από τη λειτουργία της δομής αυτής όπως είχε προταθεί, εισάγουμε στη διαχείριση των δεδομένων και την πληροφορία της πρόβλεψης για να δούμε πως μπορεί να επιδράσει στην απόδοση και αν μπορούμε να καταφέρουμε κάποια βελτίωση. / This work is a study of cache memories, using a specific cache structure. Its goal is to study cache memories whose access time is not uniform across the blocks of the cache surface (NUCA). The objective of these caches is to place the most frequently used data in the positions (blocks) closest to the processor, which have fewer wire connections and therefore shorter access times. Whenever this is feasible, the data used most often are accessed in the least possible amount of time. For this purpose we chose an already proposed mechanism, which we analyzed extensively.
The selection was not random: we chose a structure that differs from the usual NUCA structure, the main difference being that it completely decouples tag array management from data array management, contrary to the general concept of NUCA memories. Apart from this structure's function as originally proposed, we introduced prediction into both tag and data array management to see how it affects performance and whether some performance improvement can be achieved.
27

Cache-Related Delay Server for Aperiodic Job Handling in Real-Time Systems

Pukhraj Jain, Vardhman Jain 01 December 2010 (has links)
Embedded/real-time systems are becoming ubiquitous in today's world, and their pervasive nature is increasing with the advent of cyber-physical systems. Providing temporal guarantees is paramount in such systems. Most of the normal operation in real-time systems is modelled using periodic tasks. Event-driven behaviour is modelled using aperiodic jobs. To ensure an acceptable Quality of Service for aperiodic jobs without jeopardizing the safety of periodic tasks, aperiodic servers were introduced [2], [3]. Aperiodic servers are used to reserve a quota for the execution of aperiodic jobs. However, they do not take into account the cache-related delays that the execution of aperiodic jobs could impose on periodic tasks, which makes their use in systems with caches unsafe. In this thesis, we introduce Cache-Related Delay Servers to solve this problem. Statically, every periodic task's worst-case execution time includes a pre-determined delay quota for the delay caused by aperiodic jobs. During system operation, the aperiodic server is allowed to execute only if the periodic jobs that may be affected by it have sufficient delay quota to accommodate its execution. Otherwise, the priority of the aperiodic server is temporarily decreased to the level of the lowest-priority periodic job with insufficient quota, thereby ensuring safe execution of periodic tasks.
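The run-time rule described in the abstract above (run the aperiodic server only if every affected periodic job can absorb the delay, otherwise demote it) can be sketched as follows. The names `PeriodicJob` and `server_priority`, the convention that a larger number means higher priority, and the use of a single scalar delay cost are all assumptions made for illustration, not details from the thesis.

```python
from dataclasses import dataclass


@dataclass
class PeriodicJob:
    name: str
    priority: int       # larger value = higher priority (assumed convention)
    delay_quota: float  # remaining cache-related delay budget, e.g. in microseconds


def server_priority(base_priority, affected_jobs, delay_cost):
    """Return the priority at which the aperiodic server may run.

    If every affected periodic job can absorb `delay_cost`, the server keeps
    its base priority. Otherwise it is demoted to the priority of the
    lowest-priority job whose quota is insufficient (never raised above base).
    """
    short = [job for job in affected_jobs if job.delay_quota < delay_cost]
    if not short:
        return base_priority
    return min(base_priority, min(job.priority for job in short))


jobs = [
    PeriodicJob("t1", priority=3, delay_quota=50.0),
    PeriodicJob("t2", priority=2, delay_quota=10.0),
    PeriodicJob("t3", priority=1, delay_quota=30.0),
]

print(server_priority(4, jobs, delay_cost=5.0))   # all quotas suffice, stays at 4
print(server_priority(4, jobs, delay_cost=20.0))  # t2 is short, demoted to 2
```

A complete scheme would also deduct the delay from each affected job's quota when the server does run, and replenish quotas at each job release; both are omitted here.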
28

Memory Interference Characterization and Mitigation for Heterogeneous Smartphones

January 2016 (has links)
The availability of a wide range of general-purpose as well as accelerator cores on modern smartphones means that a significant number of applications can be executed on a smartphone simultaneously, resulting in an ever-increasing demand on the memory subsystem. While the increased computation capability is intended to improve user experience, memory requests from each concurrent application exhibit unique memory access patterns as well as specific timing constraints. If not considered, this could lead to significant memory contention and result in a degraded user experience. This work first analyzes the impact of the degradation caused by interference at the memory system for a broad range of commonly used smartphone applications. The real-system characterization results show that smartphone applications, such as web browsing and media playback, suffer significant performance degradation. This is caused by shared resource contention at the application processor's last-level cache, the communication fabric, and the main memory. Based on the detailed characterization results, the rest of this thesis focuses on the design of an effective memory interference mitigation technique. Since web browsing is one of the most commonly used smartphone applications and represents many HTML-based smartphone applications, my thesis focuses on meeting the performance requirement of a web browser on a smartphone in the presence of background processes and co-scheduled applications. My thesis proposes a lightweight user-space frequency governor to mitigate the degradation caused by interfering applications, by predicting the performance and power consumption of web browsing.
The governor periodically selects an optimal energy-efficient frequency setting by using statically trained performance and power models together with dynamically varying architecture and system conditions, such as the memory access intensity of background processes and/or co-scheduled applications, and the temperature of the cores. The governor has been extensively evaluated on a Nexus 5 smartphone over a diverse range of mobile workloads. By operating at the most energy-efficient frequency setting in the presence of interference, energy efficiency is improved by as much as 35%, and by 18% on average, compared to the existing interactive governor, while maintaining satisfactory web page loading performance of under 3 seconds. / Dissertation/Thesis / Masters Thesis Electrical Engineering 2016
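The selection step described above can be sketched as a search over candidate frequencies: predict load time and power at each one, discard those that miss the 3-second target, and keep the one with the lowest predicted energy. Everything below is an assumption for illustration: the frequency list, the toy performance and power models in `make_models`, and the fallback to the highest frequency are hypothetical and are not the trained models or the policy details from the thesis.

```python
def pick_frequency(freqs_mhz, predict_load_time, predict_power, deadline_s=3.0):
    """Pick the frequency with the lowest predicted energy that meets the deadline."""
    best = None  # (frequency, predicted energy in joules)
    for f in freqs_mhz:
        t = predict_load_time(f)
        if t > deadline_s:
            continue  # too slow at this frequency
        energy = predict_power(f) * t
        if best is None or energy < best[1]:
            best = (f, energy)
    # Assumed fallback: if no setting meets the deadline, run as fast as possible.
    return best[0] if best else max(freqs_mhz)


def make_models(interference):
    """Toy stand-ins for trained models; `interference` scales the CPU-bound work."""
    def load_time(f_mhz):
        return 0.5 + 1500.0 * interference / f_mhz  # seconds

    def power(f_mhz):
        return 0.3 + 0.5 * (f_mhz / 1000.0) ** 3    # watts

    return load_time, power


FREQS_MHZ = [300, 960, 1500, 2265]  # hypothetical frequency steps

for level in (1.0, 2.0):
    load_time, power = make_models(level)
    chosen = pick_frequency(FREQS_MHZ, load_time, power)
    print(f"interference x{level}: run at {chosen} MHz")
```

With these toy models, heavier interference pushes the choice to a higher frequency because the lower settings no longer meet the deadline, which is the qualitative behavior the abstract describes.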
29

Emulating Variable Block Size Caches

Muthulaxmi, S 05 1900 (has links) (PDF)
No description available.
30

ZipThru: A software architecture that exploits Zipfian skew in datasets for accelerating Big Data analysis

Ejebagom J Ojogbo (9529172) 16 December 2020 (has links)
In the past decade, Big Data analysis has become a central part of many industries including entertainment, social networking, and online commerce. MapReduce, pioneered by Google, is a popular programming model for Big Data analysis, famous for its easy programmability due to automatic data partitioning, fault tolerance, and high performance. The majority of MapReduce workloads are summarizations, where the final output is a per-key "reduced" version of the input, highlighting a shared property of each key in the input dataset.
While MapReduce was originally proposed for massive data analyses on networked clusters, the model is also applicable to datasets small enough to be analyzed on a single server. In this single-server context the intermediate tuple state generated by mappers is saved to memory, and only after all Map tasks have finished are reducers allowed to process it. This sequential Map-then-Reduce mode of execution leads to distant reuse of the intermediate state, resulting in poor locality for memory accesses. In addition, the intermediate state is often too large to fit in the on-chip caches, leading to numerous cache misses as the state grows during execution, further degrading performance. It is well known, however, that many large datasets used in these workloads possess a Zipfian/Power Law skew, where a minority of keys (e.g., 10%) appear in a majority of tuples/records (e.g., 70%).
I propose ZipThru, a novel MapReduce software architecture that exploits this skew to keep the tuples for the popular keys on-chip, processing them on the fly and thus improving reuse of their intermediate state and curtailing off-chip misses. ZipThru achieves this using four key mechanisms: 1) Concurrent execution of both Map and Reduce phases; 2) Holding only the small, reduced state of the minority of popular keys on-chip during execution; 3) Using a lookup table built from pre-processing a subset of the input to distinguish between popular and unpopular keys; and 4) Load balancing the concurrently executing Map and Reduce phases to efficiently share on-chip resources.
Evaluations using Phoenix, a shared-memory MapReduce implementation, on 16- and 32-core servers reveal that ZipThru incurs 72% fewer cache misses on average than traditional MapReduce while achieving average speedups of 2.75x and 1.73x on the two machines, respectively.
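Mechanisms 2 and 3 above (a lookup table of popular keys built from a sample, with on-the-fly reduction for those keys only) can be sketched for a simple counting workload. This is a single-threaded illustration under assumptions: the function names, the use of a prefix of the input as the sample, and the deferred list for unpopular keys are hypothetical. It does not model the concurrent Map and Reduce execution, load balancing, or on-chip placement that ZipThru relies on.

```python
from collections import Counter


def find_popular_keys(sample, top_k):
    """Build the popular-key lookup table from a sample of the input."""
    return {key for key, _ in Counter(sample).most_common(top_k)}


def skew_aware_count(keys, sample_size, top_k):
    """Count keys, reducing popular keys on the fly and deferring the rest."""
    popular = find_popular_keys(keys[:sample_size], top_k)
    hot = dict.fromkeys(popular, 0)  # small reduced state for popular keys
    cold_tuples = []                 # intermediate tuples for unpopular keys
    for key in keys:
        if key in hot:
            hot[key] += 1            # reduce immediately, no tuple is stored
        else:
            cold_tuples.append((key, 1))
    # Conventional reduce pass over the deferred (unpopular) tuples.
    cold = Counter()
    for key, value in cold_tuples:
        cold[key] += value
    result = dict(cold)
    result.update(hot)
    return result, popular


words = ["a", "b", "a", "a", "c", "a", "a", "d", "a", "a"]
counts, popular = skew_aware_count(words, sample_size=4, top_k=1)
print(popular, counts)
```

In this toy input the single popular key accounts for 7 of the 10 records, so only three intermediate tuples are ever buffered; the rest are folded directly into the small reduced state.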
