171
Návrh a implementace prostředků pro zvýšení výkonu procesoru / Design and Implementation of Mechanisms for Enhancing Performance of CPU. Zlatohlávková, Lucie, January 2007
This master's thesis focuses on processor architecture. Its core is the design of a simple processor, enriched with modern architectural features such as pipelining, cache memory, and branch prediction. The processor was implemented in the VHDL hardware description language and simulated in the ModelSim simulation tool.
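The abstract does not say which prediction scheme the design uses; as a minimal illustration of the kind of branch prediction such a processor typically adds, here is a sketch of a classic 2-bit saturating-counter predictor (written in Python for brevity; the table size and branch address are purely illustrative):

class TwoBitPredictor:
    """Classic 2-bit saturating-counter branch predictor (illustrative sketch only)."""
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)   # start in "weakly not-taken"

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2          # True means "predict taken"

    def update(self, pc, taken):
        i = pc & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:    # a loop branch taken nine times, then not taken
    hits += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(hits, "of 10 predicted correctly")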
172
Critical Words Cache Memory. Gieske, Edmund Joseph, 28 August 2008
No description available.
173
Deterministic Object Management in Large Distributed Systems. Mikhailov, Mikhail, 05 March 2003
Caching is a widely used technique to improve the scalability of distributed systems. A central issue with caching is maintaining object replicas consistent with their master copies. Large distributed systems, such as the Web, typically deploy heuristic-based consistency mechanisms, which increase delay and place extra load on the servers, while not providing guarantees that cached copies served to clients are up-to-date. Server-driven invalidation has been proposed as an approach to strong cache consistency, but it requires servers to keep track of which objects are cached by which clients.
We propose an alternative approach to strong cache consistency, called MONARCH, which does not require servers to maintain per-client state. Our approach builds on a few key observations. Large and popular sites, which attract the majority of the traffic, construct their pages from distinct components with various characteristics. Components may have different content types, change characteristics, and semantics. These components are merged together to produce a monolithic page, and the information about their uniqueness is lost. In our view, pages should serve as containers holding distinct objects with heterogeneous type and change characteristics while preserving the boundaries between these objects. Servers compile object characteristics and information about relationships between containers and embedded objects into explicit object management commands. Servers piggyback these commands onto existing request/response traffic so that client caches can use these commands to make object management decisions.
The use of explicit content control commands is a deterministic, rather than heuristic, object management mechanism that gives content providers more control over their content. The deterministic object management with strong cache consistency offered by MONARCH allows content providers to make more of their content cacheable. Furthermore, MONARCH enables content providers to expose internal structure of their pages to clients.
We evaluated MONARCH using simulations with content collected from real Web sites. The results show that MONARCH provides strong cache consistency for all objects, even for unpredictably changing ones, and incurs smaller byte and message overhead than heuristic policies. The results also show that as the request arrival rate or the number of clients increases, the amount of server state maintained by MONARCH remains the same while the amount of server state incurred by server invalidation mechanisms grows.
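As a rough illustration of the piggybacking idea described above (the command names and structure below are invented for the example, not MONARCH's actual command set or wire format), a server response might carry per-object management commands that the client cache obeys instead of applying heuristics:

# Hypothetical response: a container page plus explicit per-object commands.
page = {
    "container": "/news/front.html",
    "objects": [
        {"id": "masthead.png", "command": "cache"},                 # rarely changes
        {"id": "headlines",    "command": "invalidate-on-update"},  # server will notify
        {"id": "stock-ticker", "command": "always-fetch"},          # changes unpredictably
    ],
}

def apply_commands(cache, response):
    """Client-side cache obeys the piggybacked commands instead of guessing."""
    for obj in response["objects"]:
        if obj["command"] == "cache":
            cache[obj["id"]] = obj
        elif obj["command"] == "always-fetch":
            cache.pop(obj["id"], None)
        # "invalidate-on-update": keep the copy until a server-driven invalidation arrives

cache = {}
apply_commands(cache, page)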
174
Adaptive Prefetching for Visual Data Exploration. Doshi, Punit Rameshchandra, 31 January 2003
Loading data from slow persistent storage (disk) into main memory is a bottleneck for current interactive visual data exploration applications, especially when they are applied to huge volumes of data. Semantic caching of queries at the client side is a recently emerging technology that can significantly improve the performance of such systems, though it may not in all cases fully achieve the near-real-time responsiveness required by such interactive applications. We hence propose to augment semantic caching techniques with prefetching: the system predicts the user's next requested data and loads it into the cache as a background process before the next user request is made. Our experimental studies confirm that prefetching indeed achieves performance improvements for interactive visual data exploration. However, a given prefetching technique is not always able to correctly predict changes in a user's navigation pattern, especially since different users may have different navigation patterns, meaning the same strategy might fail for a new user. In this research, we tackle this shortcoming by using adaptive strategy selection to allow the choice of prefetching strategy to change over time, both across and within user sessions. While other adaptive prefetching research has focused on refining a single strategy, we instead have developed a framework that facilitates strategy selection. For this, we explored various metrics to measure the performance of prefetching strategies in action and thus guide the adaptive selection process. This work is the first to study caching and prefetching in the context of visual data exploration. In particular, we have implemented and evaluated our proposed approach within XmdvTool, a freeware visualization system for visually exploring hierarchical multivariate data. We have tested our technique on real user traces gathered by the logging tool of our system as well as on synthetic user traces. Our results confirm that our adaptive approach improves system performance by selecting a good combination of prefetching strategies that adapts to the user's changing navigation patterns.
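A minimal sketch of the strategy-selection idea (the metrics and strategies below are stand-ins, not the ones used in XmdvTool): track how often each candidate prefetcher would have predicted the user's next request, and keep switching to whichever one is doing best.

from collections import defaultdict

class StrategySelector:
    """Pick, at any point, the prefetching strategy with the best recent accuracy."""
    def __init__(self, strategies):
        self.strategies = strategies                 # name -> predict(history) callable
        self.hits = defaultdict(int)
        self.tries = defaultdict(int)

    def record(self, history, actual_next):
        for name, predict in self.strategies.items():
            self.tries[name] += 1
            self.hits[name] += (predict(history) == actual_next)

    def best(self):
        return max(self.strategies,
                   key=lambda n: self.hits[n] / self.tries[n] if self.tries[n] else 0.0)

# Two toy strategies over a 1-D navigation trace: repeat the last position,
# or keep panning in the same direction.
strategies = {
    "repeat-last":    lambda h: h[-1] if h else None,
    "same-direction": lambda h: 2 * h[-1] - h[-2] if len(h) > 1 else None,
}
selector = StrategySelector(strategies)
trace = [0, 1, 2, 3, 4, 5]
for i in range(2, len(trace)):
    selector.record(trace[:i], trace[i])
print(selector.best())    # "same-direction" wins on this monotonic trace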
175
Hierarquia de memória configurável para redução energética no codificador de vídeo HEVC / Configurable memory hierarchy for energy reduction in HEVC video encoder. Martins, Anderson da Silva, 29 September 2017
Recent data show that there is a growing demand for video applications on mobile devices, which is a major challenge for research into high-performance video encoder architectures such as the HEVC standard. In an embedded system, power consumption and performance are directly tied to the memory system. The video encoder is no different, and in HEVC the motion estimation (ME) step is known to be responsible for most of the processing time and memory accesses. Therefore, this work presents a design space exploration to define energy-efficient cache memory configurations for the ME process and proposes a configurable cache memory hierarchy, considering different video sequences and HEVC encoder configurations. The evaluation considered the widely used TZ Search algorithm, 23 video sequences with distinct resolutions, and four Quantization Parameters (QPs) under 32 different cache configurations. A cache simulator was developed and the CACTI tool was used to obtain timing and energy parameters.
Thus, it was possible to identify optimal cache configurations for each scenario, since no single cache configuration satisfies all scenarios at the same time when the goal is to reduce energy. Considering the optimal cache configuration for each scenario, cache usage can lead to external memory bandwidth savings of up to 97.37%, corresponding in one case to a reduction from 25.48 GB/s to 548.53 MB/s. The energy reduction reaches 93.95%, corresponding to a reduction from 5.02 mJ to 0.30 mJ when comparing different cache configurations. These results made it possible to propose a configurable cache memory hierarchy for the motion estimation process that is capable of efficiently satisfying all scenarios tested. For the proposed configurable architecture, energy savings of up to 78.09% were found when the optimal configurations were compared to the worst case within the configurable cache (16KB-8). When compared to Level-C, energy savings of up to 86.91% were achieved. In addition, the external memory bandwidth savings achieved were between 90.21% and 96.84%, with an average of 94.97%.
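A toy version of such a design-space sweep (purely illustrative: the per-access energy costs below are made up, whereas the thesis derives them from CACTI, and a real motion-estimation trace would replace the synthetic one):

def simulate(trace, size_bytes, assoc, line=64, hit_energy=0.1, miss_energy=2.0):
    """LRU set-associative cache; returns total (made-up) access energy for the trace."""
    sets = size_bytes // (line * assoc)
    cache = [[] for _ in range(sets)]              # each set: tags in LRU order
    energy = 0.0
    for addr in trace:
        block = addr // line
        tag, s = block // sets, block % sets
        if tag in cache[s]:
            cache[s].remove(tag)
            energy += hit_energy
        else:
            energy += miss_energy
            if len(cache[s]) >= assoc:
                cache[s].pop(0)                    # evict least recently used
        cache[s].append(tag)
    return energy

trace = list(range(0, 1 << 16, 64)) * 4            # stand-in for an ME reference-area trace
configs = [(size, assoc) for size in (16384, 32768, 65536) for assoc in (1, 2, 4, 8)]
best = min(configs, key=lambda c: simulate(trace, *c))
print("lowest-energy configuration:", best)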
176
Managing dynamic non-uniform cache architectures. Lira Rueda, Javier, 25 November 2011
Researchers from both academia and industry agree that future CMPs will accommodate large shared on-chip last-level caches. However, the exponential increase in multicore processor cache sizes, accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Access (NUCA) designs have been proposed to address this situation. A NUCA cache divides the whole cache memory into smaller banks that are distributed across the chip and can be accessed independently. Response time in a NUCA cache depends not only on the latency of the bank itself, but also on the time required to reach the bank that holds the requested data and to send it to the core. Banks located next to the cores therefore have lower access latencies than banks that are further away, which mitigates the effect of the cache's internal wires.
These cache architectures have traditionally been classified by their placement decisions as static (S-NUCA) or dynamic (D-NUCA). This thesis focuses on D-NUCA, as it exploits the dynamic features that NUCA caches offer, such as data migration. The flexibility that D-NUCA provides, however, raises new challenges that make this kind of cache architecture harder to manage in CMP systems. We identify these challenges and tackle them from the point of view of the four NUCA policies: replacement, access, placement, and migration.
First, we focus on the challenges introduced by the replacement policy in D-NUCA. Data migration concentrates the most frequently accessed data blocks in the banks closest to the processors. This creates large differences in the average usage of the NUCA banks: the banks close to the processors are accessed most, while the banks further away are accessed far less often. Upon a replacement in a particular bank, the probability that the evicted block will be reused by the program therefore depends on whether its last location was a bank close to the processors or not. The decentralized nature of NUCA, however, prevents one bank from knowing that another bank is constantly evicting data blocks that are later reused. We propose three different techniques to deal with the replacement policy, The Auction being the most successful one.
Then, we deal with the challenges in the access policy. Because data blocks can be mapped to multiple banks within the NUCA cache, finding the requested data in a D-NUCA cache is a difficult task. In addition, data can move freely between these banks, so the search scheme must look up every bank where the requested block could be mapped to ascertain whether it is in the NUCA cache. We propose HK-NUCA, a search scheme that uses home knowledge to effectively reduce the average number of messages injected into the on-chip network to satisfy a memory request.
With regard to the placement policy, this thesis presents the implementation of a hybrid NUCA cache. We propose a novel placement policy that accommodates both memory technologies, SRAM and eDRAM, in a single NUCA cache.
Finally, to deal with the migration policy in D-NUCA caches, we propose The Migration Prefetcher, a technique that anticipates data migrations.
Summarizing, this thesis proposes different techniques to efficiently manage future D-NUCA cache architectures on CMPs. We demonstrate the effectiveness of our techniques in dealing with the challenges introduced by D-NUCA caches. Our techniques outperform existing solutions in the literature and are, in most cases, more energy efficient.
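The following sketch illustrates the general idea behind a home-knowledge lookup (it is not the HK-NUCA protocol itself): each block has a statically assigned home bank whose directory records which banks may currently hold the block, so a request queries only those banks rather than broadcasting to every candidate bank.

NUM_BANKS = 16

def home_bank(block):
    return block % NUM_BANKS               # static mapping from block address to home bank

class HomeDirectory:
    """Per-home-bank knowledge of where a block's copies may currently reside."""
    def __init__(self):
        self.holders = {}                   # block -> set of candidate banks

    def on_insert(self, block, bank):
        self.holders.setdefault(block, set()).add(bank)

    def on_migrate(self, block, src, dst):
        s = self.holders.setdefault(block, set())
        s.discard(src)
        s.add(dst)

    def lookup_targets(self, block):
        # Banks to query for this block; an empty set means an off-chip miss.
        return self.holders.get(block, set())

home = HomeDirectory()
home.on_insert(0x2a, bank=7)
home.on_migrate(0x2a, src=7, dst=3)         # block migrated closer to its requester
print(home.lookup_targets(0x2a))            # {3}: only one bank needs to be probed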
177
Towards Efficient Delivery of Dynamic Web Content. Ramaswamy, Lakshmish Macheeri, 26 August 2005
Advantages of cache cooperation on edge cache networks serving dynamic web content were studied. The design of a cooperative edge cache grid, a large-scale cooperative edge cache network for delivering highly dynamic web content with varying server update frequencies, was presented. A cache-clouds-based architecture was proposed to promote low-cost cache cooperation in the cooperative edge cache grid. An Internet-landmarks-based scheme, called the selective landmarks-based server-distance-sensitive clustering scheme, for grouping edge caches into cooperative clouds was presented. A dynamic hashing technique for efficient, load-balanced, and reliable document lookups and updates was presented. A utility-based scheme for cooperative document placement in cache clouds was proposed. The proposed architecture and techniques were evaluated through trace-based simulations using both real-world and synthetic traces. Results showed that the proposed techniques provide significant performance benefits.
A framework for automatically detecting cache-effective fragments in dynamic web pages was presented. Two types of fragments in web pages, namely shared fragments and lifetime-personalization fragments, were identified and formally defined. A hierarchical fragment-aware web page model called the augmented-fragment tree model was proposed. An efficient algorithm to detect maximal fragments that are shared among multiple documents was proposed. A practical algorithm for detecting fragments based on their lifetime and personalization characteristics was designed. The proposed framework and algorithms were evaluated through experiments on real web sites. The effect of adopting the detected fragments on web caches and origin servers was experimentally studied.
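A toy illustration of shared-fragment detection (much simpler than the augmented-fragment-tree algorithm described above; the segmentation rule here is a stand-in): hash candidate segments of each page and report the ones that appear in more than one document, since those are candidates for being cached once and reused across pages.

import hashlib
from collections import defaultdict

def segments(html):
    # Stand-in for real fragment candidates: split crudely on <div> boundaries.
    return [s.strip() for s in html.split("<div") if s.strip()]

def shared_fragments(pages):
    seen = defaultdict(set)                         # fragment digest -> pages containing it
    body = {}
    for page_id, html in pages.items():
        for seg in segments(html):
            digest = hashlib.sha1(seg.encode()).hexdigest()
            seen[digest].add(page_id)
            body[digest] = seg
    return [body[d] for d, ids in seen.items() if len(ids) > 1]

pages = {
    "/sports": "<div>site header</div><div>scores table</div>",
    "/news":   "<div>site header</div><div>headlines</div>",
}
print(shared_fragments(pages))                      # the shared "site header" fragment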
178
Design Space Exploration and Optimization of Embedded Memory Systems. Rabbah, Rodric Michel, 11 July 2006
Recent years have witnessed the emergence of microprocessors that are embedded within a plethora of devices used in everyday life. Embedded architectures are customized through a meticulous and time-consuming design process to satisfy stringent constraints with respect to performance, area, power, and cost. In embedded systems, the cost of the memory hierarchy limits its ability to play as central a role. This is due to stringent constraints that fundamentally limit the physical size and complexity of the memory system. Ultimately, application developers and system engineers are charged with the heavy burden of reducing the memory requirements of an application.
This thesis offers the intriguing possibility that compilers can play a significant role in the automatic design space exploration and optimization of embedded memory systems. This insight is founded upon a new analytical model and novel compiler optimizations that are specifically designed to increase the synergy between the processor and the memory system. The analytical model serves to characterize intrinsic program properties, quantify the impact of compiler optimizations on the memory system, and provide deep insight into the trade-offs that affect memory system design.
179
High-performance memory system architectures using data compression. Baek, Seungcheol, 22 May 2014
The Chip Multi-Processor (CMP) paradigm has cemented itself as the archetypal philosophy of future microprocessor design. Rapidly diminishing technology feature sizes have enabled the integration of ever-increasing numbers of processing cores on a single chip die. This abundance of processing power has magnified the venerable processor-memory performance gap, known as the "memory wall". To bridge this performance gap, a high-performing memory structure is needed. An attractive solution to overcoming the processor-memory performance gap is using compression in the memory hierarchy. In this thesis, to use compression techniques more efficiently, compressed cacheline size information is studied, and size-aware cache management techniques and a hot-cacheline prediction technique for dynamic early decompression are proposed. The work proposed in this thesis also attempts to mitigate the limitations of phase-change memory (PCM), such as low write performance and limited long-term endurance. One promising solution is the deployment of hybridized memory architectures that fuse dynamic random access memory (DRAM) and PCM, combining the best attributes of each technology by using the DRAM as an off-chip cache. A dual-phase compression technique is proposed for high-performing DRAM/PCM hybrid environments, and a multi-faceted wear-leveling technique is proposed for the long-term endurance of compressed PCM. This thesis also includes a new compression-based hybrid multi-level cell (MLC)/single-level cell (SLC) PCM management technique that aims to combine the performance edge of SLCs with the higher capacity of MLCs in a hybrid environment.
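A small sketch of what "size-aware" management can mean in a compressed cache (an assumed policy for illustration, not the thesis's exact algorithm): when a set lacks room for an incoming compressed line, evict the coldest lines first, preferring one large cold line over several small ones so fewer blocks are lost per insertion.

def choose_victims(set_lines, needed_bytes):
    """set_lines: list of dicts with 'size' (compressed bytes) and 'last_use' (tick)."""
    victims, freed = [], 0
    # Oldest first; ties broken toward larger lines, which free space faster.
    for line in sorted(set_lines, key=lambda l: (l["last_use"], -l["size"])):
        if freed >= needed_bytes:
            break
        victims.append(line)
        freed += line["size"]
    return victims

lines = [{"size": 24, "last_use": 5}, {"size": 64, "last_use": 2}, {"size": 12, "last_use": 7}]
print(choose_victims(lines, 60))   # evicts only the old 64-byte line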
180
Co-scheduling for large-scale applications: memory and resilience / Ordonnancement concurrent d’applications à grande échelle : mémoire et résilience. Pottier, Loïc, 18 September 2018
This thesis explores co-scheduling problems in the context of large-scale applications, with two main focuses: the memory side, in particular the cache memory, and the resilience side. With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units is increasing. In this context, the benefits of co-scheduling techniques have been demonstrated. Recall that the main idea behind co-scheduling is to execute applications concurrently rather than in sequence in order to improve the global throughput of the platform. But sharing resources often generates interferences. With the growing number of processing units accessing the same last-level cache, these interferences among co-scheduled applications become critical. In addition, with that increasing number of processors, the probability of a failure increases too. Resilience aspects must be taken into account, especially for co-scheduling, because failure-prone resources might be shared between applications.
On the memory side, we focus on the interferences in the last-level cache; one solution used to reduce these interferences is cache partitioning. Extensive simulations demonstrate the usefulness of co-scheduling when our efficient cache partitioning strategies are deployed. We also investigate the same problem on a real cache-partitioned chip multiprocessor, using the Cache Allocation Technology recently provided by Intel. Still on the memory side, we then study how to model and schedule task graphs on new many-core architectures, such as the Knights Landing architecture. These architectures offer a new level in the memory hierarchy through a new on-package high-bandwidth memory. Current approaches usually do not take this new memory level into account, yet new scheduling algorithms and data partitioning schemes are needed to take advantage of this deep memory hierarchy.
On the resilience side, we explore the impact of failures on co-scheduling performance. The co-scheduling approach has been demonstrated in a fault-free context, but large-scale computer systems are confronted with frequent failures, and resilience techniques must be employed for large applications to execute efficiently. Indeed, failures may create severe imbalance between applications and significantly degrade performance. We aim at minimizing the expected completion time of a set of co-scheduled applications in a failure-prone context by redistributing processors.
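As a simple illustration of cache partitioning between co-scheduled applications (a generic greedy heuristic, not the strategy developed in the thesis), each cache way can be handed to whichever application's miss count would drop the most by receiving it:

def partition_ways(total_ways, miss_curves):
    """miss_curves[app][w] = misses if app owns w ways (w = 0..total_ways)."""
    alloc = {app: 0 for app in miss_curves}
    for _ in range(total_ways):
        # Give the next way to the application with the largest marginal miss reduction.
        best = max(alloc, key=lambda a: miss_curves[a][alloc[a]] - miss_curves[a][alloc[a] + 1])
        alloc[best] += 1
    return alloc

# Hypothetical per-application miss curves for a 4-way partitioned cache.
curves = {"A": [100, 60, 40, 35, 34], "B": [80, 70, 45, 20, 10]}
print(partition_ways(4, curves))    # {'A': 2, 'B': 2} for these curves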