401

Arquitetura PDCCM em hardware para compressão/descompressão de instruções em sistemas embarcados / PDCCM hardware architecture for instruction compression/decompression in embedded systems

Dias, Wanderson Roger Azevedo 30 April 2009 (has links)
Fundação de Amparo à Pesquisa do Estado do Amazonas / In the design of embedded systems, many factors must be taken into account: physical size, weight, mobility, energy consumption, memory, cooling, security requirements, and reliability, all combined with low cost and ease of use. As these systems become more heterogeneous, their development grows more complex. Several techniques exist to optimize execution time and energy consumption in embedded systems; one of them is code compression. Most existing proposals, however, focus only on decompression and assume that the code is compressed at compile time. This work therefore proposes a dedicated architecture, prototyped in hardware (using VHDL and FPGAs), for the code compression/decompression process, through a technique called PDCCM (Processor Decompressor Cache Compressor Memory). Results were obtained via simulation and prototyping, using programs from the MiBench benchmark suite for the analysis. A compression method named MIC (Middle Instruction Compression) was also proposed and compared with the traditional Huffman method. In the PDCCM architecture, MIC outperformed Huffman on several of the analyzed MiBench programs that are widely used in embedded systems: it required 26% fewer FPGA logic elements, ran at a 71% higher clock frequency (MHz), and achieved 36% better instruction compression, while also allowing compression/decompression at run time.
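The abstract does not spell out MIC's encoding, but the general idea of trading frequent instruction words for short codes between memory and cache can be illustrated with a generic dictionary scheme. The sketch below is an assumption-based illustration, not the thesis's actual method: frequent 32-bit instructions are replaced by one-byte indices, and everything else is emitted verbatim behind an escape byte.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: frequent 32-bit instructions become 8-bit dictionary
// indices; rare ones are emitted verbatim behind an escape byte. MIC's actual
// encoding is not described in the abstract above.
class DictCompressor {
    static constexpr uint8_t kEscape = 0xFF;           // marks an uncompressed word
    std::vector<uint32_t> dict_;                       // at most 255 hot instructions
    std::unordered_map<uint32_t, uint8_t> index_;

public:
    explicit DictCompressor(const std::vector<uint32_t>& hot) : dict_(hot) {
        for (uint8_t i = 0; i < dict_.size() && i < kEscape; ++i)
            index_[dict_[i]] = i;
    }

    std::vector<uint8_t> compress(const std::vector<uint32_t>& code) const {
        std::vector<uint8_t> out;
        for (uint32_t insn : code) {
            auto it = index_.find(insn);
            if (it != index_.end()) {
                out.push_back(it->second);             // 1 byte instead of 4
            } else {
                out.push_back(kEscape);                // escape + raw word
                for (int b = 0; b < 4; ++b)
                    out.push_back((insn >> (8 * b)) & 0xFF);
            }
        }
        return out;
    }
};
```

A hardware decompressor would apply the inverse mapping between cache and processor, which is what makes run-time decompression cheap in such designs.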
402

Cache memory aware priority assignment and scheduling simulation of real-time embedded systems

Tran, Hai Nam 23 January 2017 (has links)
Real-time embedded systems (RTES) are subject to timing constraints. In these systems, correctness depends not only on the logical correctness of the computation but also on the time at which the result is produced (Stankovic, 1988). The systems must be highly predictable in the sense that the worst-case execution time of each task must be determined. Scheduling analysis is then performed on the system to ensure that there are enough resources to schedule all of the tasks. Cache memory is a crucial hardware component used to reduce the performance gap between processor and main memory. Integrating cache memory in an RTES generally enhances overall performance in terms of execution time, but it can also increase preemption cost and execution-time variability. In systems with cache memory, multiple tasks can share this hardware resource, which can introduce cache-related preemption delay (CRPD). By definition, CRPD is the delay added to the execution time of the preempted task because it has to reload cache blocks evicted by the preemption. It is therefore important to account for CRPD when performing schedulability analysis. This thesis studies the effects of CRPD on uniprocessor systems and uses this understanding to extend classical scheduling analysis methods. We propose several priority assignment algorithms that take CRPD into account while assigning priorities to tasks. We investigate problems related to scheduling simulation with CRPD and establish two results that allow scheduling simulation to be used as a verification method. The work in this thesis is made available in Cheddar, an open-source scheduling analyzer. Several CRPD analysis features are also implemented in Cheddar beyond the work presented in this thesis.
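The UCB/ECB formulation from the CRPD literature makes the definition above concrete: one preemption can cost at most the cache block reload time multiplied by the number of useful cache blocks (UCBs) of the preempted task that the preempting task's evicting cache blocks (ECBs) may displace. The sketch below encodes that classical bound; it is background from the literature rather than the thesis's own algorithms, and the parameters are illustrative.

```cpp
#include <cstddef>
#include <set>

// Classical UCB/ECB bound on cache-related preemption delay: count the
// useful blocks of the preempted task that the preempting task may evict,
// and charge one block reload time (BRT) for each.
double crpd_ucb_ecb_bound(const std::set<int>& ucb_preempted,
                          const std::set<int>& ecb_preempting,
                          double block_reload_time) {
    std::size_t evicted_useful = 0;
    for (int block : ucb_preempted)
        if (ecb_preempting.count(block)) ++evicted_useful;
    return block_reload_time * static_cast<double>(evicted_useful);
}
```

A schedulability test then adds such a bound to the preempted task's worst-case execution time for each potential preemption, which is where CRPD-aware priority assignment starts to matter.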
403

Designing a scalable architecture for massively parallel computing

Kaci, Ania 14 December 2016 (has links)
In response to the growing demand for performance from a wide variety of applications (e.g., financial modeling, sub-atomic simulation, bioinformatics), computer systems are becoming more complex and growing in size (number of computing components, memory, and storage capacity). This increased complexity drives their architecture toward heterogeneous computing technologies and programming models. Managing this heterogeneity harmoniously, optimizing resources, and minimizing power consumption are major technical challenges in the design of future computer systems. This thesis addresses one facet of this complexity by focusing on shared-memory subsystems in which all processors share a common address space. The work centers on the implementation of a cache coherence and memory consistency protocol on a scalable architecture, and on the methodology for validating this implementation. In our approach, we selected 64-bit ARM processors and generic co-processors (GPU, DSP, etc.) as computing components, with the AMBA/ACE and AMBA/ACE-Lite shared-memory protocols and the associated CoreLink CCN architecture as a starting point. The generalization and parameterization of this architecture, and its validation in the Gem5 simulation environment, form the backbone of this thesis. The results obtained at the end of the thesis tend to demonstrate that the stated objectives were achieved.
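As a rough intuition for the kind of protocol machinery involved, the sketch below shows a highly simplified MESI-style response to a snooped read. AMBA/ACE defines a richer set of line states and transactions than this; the code is an illustrative assumption, not the thesis's implementation.

```cpp
// Highly simplified MESI-style transition on a snooped bus read, to suggest
// the flavor of state machine being implemented and validated; AMBA/ACE has
// more states and transaction types than shown here.
enum class State { Modified, Exclusive, Shared, Invalid };

struct SnoopResult {
    State next;        // next state of the snooped line
    bool supply_data;  // this cache must provide the line (dirty intervention)
};

SnoopResult on_snoop_read(State current) {
    switch (current) {
        case State::Modified:  return {State::Shared, true};   // forward dirty data
        case State::Exclusive: return {State::Shared, false};  // demote silently
        case State::Shared:    return {State::Shared, false};
        case State::Invalid:   return {State::Invalid, false};
    }
    return {State::Invalid, false};  // unreachable; silences compiler warnings
}
```

Validating an implementation in Gem5 then amounts to checking that every observed transition sequence stays within the protocol's legal state space under randomized and targeted memory traffic.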
404

Adaptive and intelligent memory systems

Sridharan, Aswinkumar 15 December 2016 (has links)
In this thesis, we have focused on addressing interference at the shared memory-hierarchy resources: the last-level cache and off-chip memory access, in the context of large-scale multicore systems. The first work focused on shared last-level caches in situations where the number of applications sharing the cache can exceed its associativity. To manage caches in such situations, our solution estimates the cache footprint of applications to approximate how well they could utilize the cache. This quantitative estimate of cache utility makes it possible to explicitly enforce different priorities across applications. The second part brings prefetch awareness into cache management. In particular, we observe that prefetched cache blocks exhibit good reuse behavior in the context of larger caches. Our third work addresses interference between on-demand and prefetch requests at the shared off-chip memory interface. It is based on two fundamental observations concerning the fraction of prefetch requests generated and its correlation with prefetch usefulness and prefetcher-caused interference. Together, these observations lead to a mechanism that controls the flow of prefetch requests between the LLC and off-chip memory.
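One common way to act on such observations is to throttle the prefetcher based on measured accuracy. The sketch below is a generic accuracy-driven throttle, offered as an illustration only; the thesis's actual control mechanism and the thresholds used here are assumptions.

```cpp
#include <cstdint>

// Hedged sketch of accuracy-driven prefetch throttling: track how many issued
// prefetches turn out useful and scale the prefetch degree accordingly.
class PrefetchThrottle {
    uint64_t issued_ = 0, useful_ = 0;
    int degree_ = 4;                        // current prefetch aggressiveness

public:
    void on_prefetch_issued() { ++issued_; }
    void on_prefetch_hit()    { ++useful_; }  // demand hit on a prefetched block

    // Called periodically, e.g., every N cache accesses (an epoch).
    void adjust() {
        if (issued_ == 0) return;
        double accuracy = static_cast<double>(useful_) / issued_;
        if (accuracy > 0.75 && degree_ < 8) ++degree_;  // helpful: ramp up
        if (accuracy < 0.40 && degree_ > 1) --degree_;  // interfering: back off
        issued_ = useful_ = 0;              // start a fresh measurement epoch
    }

    int degree() const { return degree_; }
};
```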
405

Workload Driven Designs for Cost-Effective Non-Volatile Memory Hierarchies

Timothy A Pritchett (9179468) 28 July 2020 (has links)
Compared to traditional hard-disk drives (HDDs), non-volatile (NV) memory technologies offer significant performance advantages on one hand, but incur significant cost and asymmetric write-performance on the other. A common strategy to manage such cost and performance differentials is to use hierarchies such that a small, but intensely accessed, working set is staged in the NV storage (selective caching). However, when this working set includes write-heavy data, the low write-lifetime of NV storage necessitates significant over-provisioning to maintain required lifespans (e.g., storage lifespan must match or exceed a 3-year server lifespan). One may think that employing DRAM-based write-buffers can filter the writes that trickle through to the NV storage and thus alleviate the write pressure felt there. Unfortunately, selective caches, when used with common recency-based or frequency-based replacement, have access patterns that require large write buffers (e.g., 100 MB+ relative to a 12 GB cache) to filter writes adequately. Further, these large DRAM write-buffers also require backup power to ensure the durability of disk writes. More sophisticated replacement policies that combine recency and frequency can reduce the size of the DRAM buffer (while preserving write-filtering), but are so computationally expensive that they can limit the I/O rate, especially for simple controllers (e.g., a RAID controller).

My first contribution is the design and implementation of WriteGuard, a self-tuning sieving write-buffer algorithm that filters writes as well as the highly effective (but computationally expensive) algorithms do, while requiring lightweight computation comparable to a simple LRU-based write-buffer. While WriteGuard reduces the capacity needed for DRAM buffering (to approx. 64 MB), it does not eliminate the need for DRAM buffers (and the corresponding power backup).

For my second thrust, I identify two specific application characteristics: (1) the vast majority of the write-buffer's contents is composed of write-dominant blocks, and (2) the vast majority of blocks in the write-buffer are overwritten within a period of 28 hours. I show that these characteristics help enable a high-density, optimized STT-MRAM as a replacement for DRAM, which enables durable write-buffers (thus eliminating the cost of power backup for the write-buffer). My optimized STT-MRAM-based write buffer achieves higher density by (a) trading off superfluous durability by exploiting characteristic (2), and (b) deoptimizing the read-performance of STT-MRAM by leveraging characteristic (1). Together, the techniques increase the density of STT-MRAM by 20% with low or no impact on write-buffer performance.
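The abstract names WriteGuard a "sieving" write buffer without detailing the algorithm. A minimal sketch of the sieving idea, under the assumption that buffer admission is what is being gated, might look like the following; it illustrates the general technique, not WriteGuard itself.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Illustrative "sieving" admission policy for a write buffer: a block's first
// write only registers it in a cheap candidate set, and only a re-written
// block is admitted to the buffer, so write-once traffic never occupies
// buffer space. WriteGuard's actual self-tuning algorithm is more refined;
// everything below is an assumption for illustration.
class SievingWriteBuffer {
    std::unordered_set<uint64_t> candidates_;        // blocks written once
    std::unordered_map<uint64_t, uint32_t> buffer_;  // block -> write count

public:
    // Returns true if the write was absorbed by the buffer,
    // false if it should trickle through to NV storage.
    bool on_write(uint64_t block) {
        auto it = buffer_.find(block);
        if (it != buffer_.end()) { ++it->second; return true; }
        if (candidates_.erase(block)) {              // second write: admit
            buffer_[block] = 1;
            return true;
        }
        candidates_.insert(block);                   // first write: sieve it
        return false;
    }
};
```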
406

Design and Implementation of a Graceful Degradation Approach for Polymorphic Role Invocation in Object Teams

Kummer, Cornelius 07 September 2021 (has links)
In the ever-evolving world of modern software engineering, dynamic and context-dependent adaptability becomes increasingly important. A promising paradigm that has been proposed is role-oriented programming, an extension of object-oriented programming that allows collaborative relationships between objects to be modeled. Through the introduction of roles and contexts, the behavior of objects can be adapted at run time via the addition or modification of attributes and methods. This dynamism, however, incurs a high overhead, especially in the area of role function invocation. Recent research has found a remedy inspired by polymorphic inline caches, allowing the reuse of so-called dispatch plans, which encode the steps required to execute adaptations. With this optimization, an average speedup of 4.0× was achieved in static contexts and 1.1× in variable contexts. Still, performance drops off sharply at a certain degree of volatility as a consequence of cache capacity exhaustion. This thesis presents a fallback mechanism to be used at highly variable call sites, which would otherwise cause a significant slowdown with the new approach. In addition, an optimized reuse mechanism is proposed, further improving execution efficiency. Evaluation through benchmarking shows complete elimination of the aforementioned overhead, meaning a speedup of 16.5×, while the previously achieved speedup is maintained.
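For readers unfamiliar with polymorphic inline caches, the sketch below shows the bare mechanism: a small per-call-site cache of dispatch plans keyed by context, with a generic slow path once capacity is exhausted. The names, sizes, and plan representation are assumptions for illustration; Object Teams dispatch plans are far richer than a plain callable.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>

// Per-call-site polymorphic inline cache: look up a cached dispatch plan by
// context key; on a miss, cache the resolved plan while capacity remains,
// and otherwise fall back to generic (slow) dispatch.
struct CallSiteCache {
    struct Entry { uint64_t context_key; std::function<void()> plan; };
    static constexpr std::size_t kWays = 4;  // typical small PIC capacity
    std::array<Entry, kWays> entries{};
    std::size_t filled = 0;

    void dispatch(uint64_t context_key, const std::function<void()>& resolved) {
        for (std::size_t i = 0; i < filled; ++i)
            if (entries[i].context_key == context_key) {
                entries[i].plan();           // fast path: reuse cached plan
                return;
            }
        if (filled < kWays)
            entries[filled++] = Entry{context_key, resolved};  // cache new plan
        resolved();                          // first-time or overflow dispatch
    }
};
```

The slowdown the thesis targets arises exactly when `filled` hits capacity at a volatile call site, so every call takes the miss path; a graceful-degradation fallback replaces that repeated futile probing.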
407

Investigating the effect of implementing Data-Oriented Design principles on performance and cache utilization

Nyberg, Frank January 2021 (has links)
Game engines process large amounts of data under strict deadlines, so measures to increase performance are important in this area. Data-Oriented Design (DOD) promotes principles that are meant to increase performance through better cache utilization. The purpose of this thesis is to examine a selection of these principles to give a better understanding of how DOD affects CPU time and the rate of cache misses, with a focus on game development. More specifically, the principles examined are the removal of run-time polymorphism, iteration over contiguous data, and reducing the amount of data in hot loops. The Entity-Component-System pattern, which is based on DOD principles, is also examined. The approach was to first present a theoretical background on the subject and then to conduct tests by implementing a simulation of movement and collision detection utilizing said principles. The tests were written in C++ and executed on an Intel Core i7 4770k with no rendering. CPU time was measured in updated entities per μs, and cache utilization was measured as the cache miss rate. The results showed that the DOD principles did increase performance. The cache miss rate was also lower, except when removing run-time polymorphism. The conclusion is that Data-Oriented Design, used in game development, is likely to result in better performance, mostly as a result of better cache utilization.
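As a flavor of the contrast being measured, here is a minimal C++ sketch of the two styles: a virtual-dispatch, pointer-chasing update loop versus a contiguous, hot-data-only loop. It illustrates the principles named above, not the thesis's actual test code.

```cpp
#include <cstddef>
#include <vector>

// OOP-style: large heap-allocated objects, virtual dispatch per entity.
struct Entity {
    virtual ~Entity() = default;
    virtual void update(float dt) = 0;
    float x = 0, y = 0, vx = 0, vy = 0;
    char other_state[64];  // cold data dragged into cache alongside hot data
};

// DOD-style: only the hot fields, each stored contiguously (SoA layout).
struct Positions {
    std::vector<float> x, y, vx, vy;
};

void update_oop(std::vector<Entity*>& entities, float dt) {
    for (Entity* e : entities) e->update(dt);  // pointer chase + vtable call
}

void update_dod(Positions& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {  // linear, prefetch-friendly
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}
```

The second loop touches only the bytes it needs, in address order, which is what drives both the lower miss rate and the higher updated-entities-per-μs figures the thesis reports.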
408

Lifting Scheme Cores for Wavelet Transform

Bařina, David Unknown Date (has links)
This work focuses on the efficient computation of the two-dimensional discrete wavelet transform. Current methods are extended in several directions so that the transform is computed in a single pass, possibly over multiple scales, using a compact core. This core can further be suitably reorganized to minimize the use of certain resources. The presented approach fits naturally with commonly used SIMD extensions, exploits the cache hierarchy of modern processors, and is well suited to parallel computation. Finally, the approach is integrated into the JPEG 2000 compression chain, where it proves to be substantially faster than widely used implementations.
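The lifting scheme referred to in the title builds a wavelet transform from predict and update steps. The sketch below shows one level of the standard integer CDF 5/3 lifting (used in lossless JPEG 2000) in its textbook two-pass form over a pre-split 1-D signal; the thesis's contribution is precisely to fuse such steps into single-pass 2-D cores, which this illustration does not attempt. Border handling is simplified to clamping.

```cpp
#include <cstddef>
#include <vector>

// One level of integer CDF 5/3 forward lifting over a signal already split
// into even samples s[] and odd samples d[] (assume s.size() == d.size()).
void lifting_53_forward(std::vector<int>& s /* even samples */,
                        std::vector<int>& d /* odd samples  */) {
    const std::size_t n = d.size();
    // Predict: detail = odd - floor((left even + right even) / 2)
    for (std::size_t i = 0; i < n; ++i) {
        int right = (i + 1 < n) ? s[i + 1] : s[i];  // clamp at the border
        d[i] -= (s[i] + right) >> 1;
    }
    // Update: smooth = even + floor((left detail + right detail + 2) / 4)
    for (std::size_t i = 0; i < n; ++i) {
        int left = (i > 0) ? d[i - 1] : d[i];       // clamp at the border
        s[i] += (left + d[i] + 2) >> 2;
    }
}
```

A 2-D transform applies this kernel to rows and then columns; fusing those two applications into one pass over the data is what keeps the working set inside the cache hierarchy.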
409

Performance Analysis of Complex Shared Memory Systems

Molka, Daniel 10 March 2017 (has links)
Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing; on the other hand, the individual processors are getting more and more powerful. In recent years, the latter has largely been achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance, so performance analysis tools that help to localize and fix performance problems are indispensable. Large-scale systems for high performance computing typically consist of multiple compute nodes connected via a network. Tools that analyze performance problems arising from the use of multiple nodes are readily available. However, the increasing number of cores per processor observed within the last decade represents a major change in node architecture. This work therefore concentrates on the analysis of node performance.

The goal of this thesis is to improve the understanding of the application performance achieved on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from their scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors, as well as remote accesses in multi-processor systems, are investigated and their respective impact on application performance is analyzed.

As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. However, in order to localize performance problems, it also has to be determined to what extent the application performance is limited by certain resources. In the second step, a methodology is therefore developed to derive metrics for the utilization of individual components in the memory hierarchy, as well as for the waiting times caused by memory accesses. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which makes it possible to identify the events that provide a reasonable assessment of the utilization of the respective component and of the amount of time spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory-related performance issues in existing performance analysis tools.

The results of the micro-benchmarks reveal that the increasing number of cores per processor and the use of multiple processors per node lead to complex systems whose memory-access performance varies greatly depending on the location of the accessed data. Furthermore, the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for identifying meaningful hardware performance counters yields useful metrics for localizing memory-related performance limitations.
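A classic kernel behind such location-dependent latency measurements is a pointer chase over a random single-cycle permutation, which serializes the loads and defeats hardware prefetching. The sketch below is a generic illustration under stated assumptions; the thesis's benchmarks additionally control data placement and coherence state, which is omitted here.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const std::size_t n = 1 << 22;  // elements; scale to target a cache level
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: a random permutation with a single cycle, so the
    // chase visits every element before repeating.
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    std::size_t pos = 0;
    const std::size_t hops = 50'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < hops; ++i)
        pos = next[pos];            // dependent loads: latency, not bandwidth
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
    std::printf("%.1f ns per access (pos=%zu)\n", ns, pos);  // pos defeats DCE
}
```

Varying `n` sweeps the working set through L1, L2, L3, and DRAM; pinning the thread and the allocation to different NUMA nodes exposes the local-versus-remote access differences the thesis characterizes.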
410

Cache-Efficient Aggregation: Hashing Is Sorting

Müller, Ingo, Sanders, Peter, Lacurie, Arnaud, Lehner, Wolfgang, Färber, Franz 14 June 2022 (has links)
For decades, researchers have studied the duality of hashing and sorting for the implementation of relational operators, especially for efficient aggregation. Depending on the underlying hardware and software architecture, the specific algorithms implemented, and the data sets used in the experiments, different authors have come to different conclusions about which is the better approach. In this paper we argue that, in terms of cache efficiency, the two paradigms are actually the same. We support our claim by showing that the complexity of hashing is the same as the complexity of sorting in the external memory model. Furthermore, we make the similarity of the two approaches evident by designing an algorithmic framework that allows switching seamlessly between hashing and sorting during execution; mixing the two routines in the same framework also lets us leverage the advantages of both. On a more practical note, we show how to achieve very low constant factors by tuning both the hashing and the sorting routines to modern hardware. Since we observe a complementary dependency of the two routines' constant factors on the locality of the input, we exploit our framework to switch to the faster routine where appropriate. The result is a novel relational aggregation algorithm that is cache-efficient (independently of, and without prior knowledge of, input skew and output cardinality), highly parallelizable on modern multi-core systems, and operating at a speed close to the memory bandwidth, thus outperforming the state of the art by up to 3.7×.
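A toy rendition of the unifying idea: aggregate with a hash table while the group set fits a cache-sized budget, and otherwise hash-partition the input (the sorting-like step) and recurse on smaller partitions. The real algorithm tunes both routines to the hardware and switches adaptively; the budget, fan-out, and sum aggregate below are illustrative assumptions.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<uint64_t, int64_t>;  // (group key, value)

// Hash-aggregate rows into `out`; if the number of distinct groups exceeds
// `budget`, radix-partition by the next hash bits and recurse per partition.
void aggregate(const std::vector<Row>& in, std::size_t budget,
               unsigned shift, std::unordered_map<uint64_t, int64_t>& out) {
    std::unordered_map<uint64_t, int64_t> local;
    for (const Row& r : in) {
        local[r.first] += r.second;
        if (local.size() > budget && shift < 64) {  // too many groups: partition
            constexpr unsigned kBits = 4, kFanOut = 1u << kBits;
            std::vector<std::vector<Row>> part(kFanOut);
            for (const Row& x : in)
                part[(x.first >> shift) & (kFanOut - 1)].push_back(x);
            for (auto& p : part)
                aggregate(p, budget, shift + kBits, out);  // smaller subproblems
            return;
        }
    }
    for (const auto& kv : local) out[kv.first] += kv.second;  // merge results
}
```

A caller might invoke `aggregate(rows, /*budget=*/1 << 15, /*shift=*/0, result)` so that each per-partition hash table stays roughly cache-resident, which is the property the paper's cache-efficiency argument rests on.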
