461

Improving and Extending a High Performance Processor Optimized for FPGAs

Källming, Daniel, Hultenius, Kristoffer January 2010 (has links)
This thesis concerns a number of improvements and additions made to a soft CPU optimized for field-programmable gate arrays (FPGAs). The goal has been to implement the changes without substantially lowering the CPU's ability to operate at high clock frequencies. The result of the thesis is a number of high-clock-frequency modules which, when added, complete the CPU's hardware functionality in certain areas. The maximum frequency of the CPU is, however, somewhat lowered once the modules have been added.
462

Generational Garbage Collection with Explicit Control of the Hardware Cache

Karlsson, Karl-Johan January 2006 (has links)
This report evaluates whether an interpreted, high-level, garbage-collected language has enough information about its memory behaviour to make better cache decisions than modern general-purpose CPU hardware.

With a generational garbage collector, depending on the promotion algorithm and generation size, around 90% of all objects never leave the first generation. This report is based on the hypothesis that, because of the low promotion rate, accesses to higher generations are sufficiently rare not to benefit from caching.

To test this hypothesis, we built an operating system with a Scheme interpreter in kernel mode, where the interpreter controls the cache. Generic x86 PC hardware was used, since it allows fine-grained control of cache decisions.

Measurements of execution time in this interpreter show that disabling the cache for generations higher than the first does not give any performance gain, but rather a performance loss of up to 50%.

We conclude that this interpreter design is not an improvement, but cannot conclude that the hypothesis is false in general. We suggest building a better CPU simulator to gather more data from which to make better caching decisions, moving internal interpreter data structures into the garbage-collected heap, and modifying the hardware to allow control over where data is cached, which is currently rigid: for example, separate control of instruction and data caches, and separate data caches for different areas of memory.
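On x86, cache behaviour can be set per physical memory range through the MTRRs, which is the kind of fine-grained control the thesis exploits from kernel mode. As a rough user-space analogue, the sketch below marks a hypothetical old-generation heap region uncacheable through Linux's /proc/mtrr interface; the addresses, the region size, and the use of Linux at all are assumptions for illustration, not the thesis's custom-kernel setup.

    /* Sketch (assumed setup): mark a hypothetical old-generation heap
       region uncacheable via Linux's /proc/mtrr interface. base/size
       must describe a physical range and size must be a power of two;
       the values below are placeholders. */
    #include <stdio.h>

    static int mark_uncacheable(unsigned long phys_base, unsigned long size)
    {
        FILE *f = fopen("/proc/mtrr", "w");
        if (!f)
            return -1;
        /* entry format as documented in the kernel's MTRR documentation */
        fprintf(f, "base=0x%lx size=0x%lx type=uncachable\n", phys_base, size);
        return fclose(f);
    }

    int main(void)
    {
        /* e.g. a 4 MiB region holding generations older than the first */
        if (mark_uncacheable(0x40000000UL, 0x400000UL) != 0)
            perror("marking region uncacheable failed");
        return 0;
    }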
463

A Dynamically Partitionable Compressed Cache

Chen, David, Peserico, Enoch, Rudolph, Larry 01 1900 (has links)
The effective size of an L2 cache can be increased by using a dictionary-based compression scheme. Naive application of this idea performs poorly, since the data values in a cache vary greatly in their “compressibility.” The novelty of this paper is a scheme that dynamically partitions the cache into sections of different compressibilities. While compression is often researched in the context of a large stream, in this work it is applied repeatedly to smaller, cache-line-sized blocks so as to preserve the random-access requirement of a cache. When a cache line is brought into the L2 cache, or a cache line is about to be modified, the line is compressed using a dynamic LZW dictionary. Depending on how well it compresses, it is placed into the relevant partition. The partitioning is dynamic in that the ratio of space allocated to compressed and uncompressed lines varies depending on the actual performance. Certain SPEC-2000 benchmarks using a compressed L2 cache show an 80% reduction in L2 miss rate when compared to an uncompressed L2 cache of the same area, taking into account all area overhead associated with the compression circuitry. For other SPEC-2000 benchmarks, the compressed cache performs as well, in terms of hit rate, as a traditional cache 4.3 times its size. The adaptivity ensures that, in terms of miss rate, the compressed cache never performs worse than a traditional cache. / Singapore-MIT Alliance (SMA)
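A minimal sketch of the per-line decision described above: LZW applied to one 64-byte line, with the line routed to the compressed partition only if its coded size fits in half a line. The 64-byte line size, 9-bit codes, and the halving threshold are assumptions for illustration; the paper's actual dictionary management and partition-ratio adaptation are more involved.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE    64    /* cache-line size in bytes (assumed) */
    #define MAXDICT 512   /* 9-bit codes: 256 literals + 256 learned */

    /* One learned dictionary entry: an existing code extended by a byte. */
    struct entry { int prefix; uint8_t ch; };

    /* Number of LZW output codes needed to encode one line. */
    static int lzw_code_count(const uint8_t *line)
    {
        struct entry dict[MAXDICT];
        int ndict = 256;                 /* codes 0..255 are single bytes */
        int w = line[0];                 /* current match, as a code */
        int ncodes = 0;
        for (int i = 1; i < LINE; i++) {
            uint8_t c = line[i];
            int found = -1;
            for (int j = 256; j < ndict; j++)  /* linear search is fine at this size */
                if (dict[j].prefix == w && dict[j].ch == c) { found = j; break; }
            if (found >= 0) { w = found; continue; }
            ncodes++;                          /* emit code for w */
            if (ndict < MAXDICT)
                dict[ndict].prefix = w, dict[ndict].ch = c, ndict++;
            w = c;
        }
        return ncodes + 1;                     /* final code for w */
    }

    int main(void)
    {
        uint8_t line[LINE];
        memset(line, 0, LINE);                 /* a highly compressible line */
        int ncodes = lzw_code_count(line);
        int bits = ncodes * 9;                 /* 9-bit codes */
        printf("%d codes, %d bits -> %s partition\n", ncodes, bits,
               bits <= LINE * 8 / 2 ? "compressed" : "uncompressed");
        return 0;
    }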
464

Advanced Texture Unit Design for 3D Rendering System

Lin, Huang-lun 05 September 2007 (has links)
To achieve more realistic visual effects, texture mapping has become a very important and popular technique in three-dimensional (3D) graphics. Many advanced rendering effects, including shadow, environment, and bump mapping, depend on various applications of the texturing function, so designing an efficient texture unit is very important for a 3D graphics rendering system. This thesis proposes an advanced texture unit design targeted at a rendering system with a fill rate of two fragments per cycle. The unit supports various filtering functions, including nearest-neighbor, bilinear, and trilinear filtering, and provides mip-mapping to automatically select the best texture images for rendering. To meet the high texel throughput required by the more complex filtering functions, the texture cache has been divided into four banks so that up to eight texels can be delivered every cycle. The datapath design of the filtering unit adopts common-expression sharing to reduce the required arithmetic units. The proposed texture unit architecture has been implemented and embedded into a 3D rendering accelerator, which has been integrated with an OpenGL-ES software module, the Linux operating system, and a geometry module, and successfully prototyped on the ARM Versatile platform. In a 0.18 µm technology, the unit runs at up to 150 MHz and provides a peak throughput of 1.2 Gtexels/s.
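Bilinear filtering blends the 2x2 texel footprint around the sample point, which is why banking the texture cache pays off: if banks are interleaved on the low bits of the texel coordinates, the four texels of a footprint always land in four different banks and can be fetched in parallel. The C sketch below shows both ideas; the texture size, single 8-bit channel, and clamp addressing are assumptions for illustration, and the thesis's four-bank arrangement serving two fragments (eight texels) per cycle is necessarily more elaborate.

    #include <stdio.h>

    #define TEX_W 256
    #define TEX_H 256

    static unsigned char tex[TEX_H][TEX_W];   /* one 8-bit channel, for brevity */

    /* Interleave banks on the low coordinate bits: the 2x2 footprint of a
       bilinear sample then always spans all four banks. */
    static int bank(int x, int y) { return ((y & 1) << 1) | (x & 1); }

    /* Bilinear filter at texture coordinates (u, v) in [0,1). */
    static float bilinear(float u, float v)
    {
        float x = u * (TEX_W - 1), y = v * (TEX_H - 1);
        int x0 = (int)x, y0 = (int)y;
        int x1 = x0 + 1 < TEX_W ? x0 + 1 : x0;   /* clamp addressing */
        int y1 = y0 + 1 < TEX_H ? y0 + 1 : y0;
        float fx = x - x0, fy = y - y0;          /* fractional weights */
        float t00 = tex[y0][x0], t10 = tex[y0][x1];
        float t01 = tex[y1][x0], t11 = tex[y1][x1];
        /* common-expression sharing: two horizontal lerps, one vertical */
        float top = t00 + fx * (t10 - t00);
        float bot = t01 + fx * (t11 - t01);
        return top + fy * (bot - top);
    }

    int main(void)
    {
        for (int y = 0; y < TEX_H; y++)
            for (int x = 0; x < TEX_W; x++)
                tex[y][x] = (unsigned char)(x ^ y);
        printf("sample = %.2f, banks %d %d %d %d\n", bilinear(0.3f, 0.7f),
               bank(76, 178), bank(77, 178), bank(76, 179), bank(77, 179));
        return 0;
    }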
465

Ranking Services and Data by Their Usage

Constantin, Camelia 27 November 2007 (has links) (PDF)
The emergence of peer-to-peer systems and the possibility of performing computations and exchanging data through web services lead to large-scale data integration systems in which query evaluation and other complex processing are carried out by service composition. An important problem in this type of system is the absence of global knowledge. It is difficult, for example, to choose the best peer for query routing or the best service during service composition, or to decide which of a peer's local data to refresh, to cache, and so on. The notion of choice implies that of ranking. Although entities can be compared and ranked by their content or other associated metadata, such techniques generally rely on homogeneous and semantically rich descriptions. An interesting alternative in a large-scale system is link-based ranking, which exploits the relationships between entities and allows choices grounded in global information. This thesis presents a new generic model for ranking services based on their collaboration links. We define a global importance for each service by exploiting specific knowledge about its contribution to other services through the calls it receives and the data it exchanges. The importance can be computed efficiently by an asynchronous algorithm that generates no additional messages. The notion of contribution is abstract, and we study its instantiation in three applications: (i) call-based service ranking, where the contribution reflects the semantics of services as well as their usage over time; (ii) service ranking by data usage, where the contribution of services is based on the use of their data during query evaluation in a distributed warehouse; and (iii) the definition of distributed caching strategies based on the contribution that caching a piece of data makes toward reducing the system load.
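The global importance described above is the fixed point of a link-based propagation: each service's importance is the sum of the importances of the services it contributes to, weighted by its share of their contribution. A minimal synchronous sketch follows; the thesis computes this asynchronously, piggybacked on ordinary service calls, and the contribution matrix and PageRank-style damping factor here are invented for illustration.

    #include <stdio.h>

    #define N 4   /* number of services */

    /* contrib[i][j]: fraction of service j's importance attributed to i,
       i.e. how much j's work depends on calls to / data from i.
       Columns sum to 1; the values are illustrative only. */
    static const double contrib[N][N] = {
        {0.0, 0.5, 0.3, 0.0},
        {0.4, 0.0, 0.3, 0.5},
        {0.6, 0.0, 0.0, 0.5},
        {0.0, 0.5, 0.4, 0.0},
    };

    int main(void)
    {
        double imp[N] = {0.25, 0.25, 0.25, 0.25}, next[N];
        double damping = 0.85;           /* assumed, PageRank-style */
        for (int it = 0; it < 50; it++) {   /* iterate to the fixed point */
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < N; j++)
                    s += contrib[i][j] * imp[j];
                next[i] = (1.0 - damping) / N + damping * s;
            }
            for (int i = 0; i < N; i++) imp[i] = next[i];
        }
        for (int i = 0; i < N; i++)
            printf("service %d: importance %.3f\n", i, imp[i]);
        return 0;
    }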
466

An Extended Iterative Location Management Schema for Load-Balancing in a Cellular Network

Subramanian, Shanthi Sridhar 12 May 2005 (has links)
Location management is the process of tracking the position of a mobile terminal as it moves between location areas within the network, allowing the network to follow the mobile user's path for the purpose of call delivery. The location management schema in a public land mobile network (IS-41 and GSM) is based on a centralized two-tier database architecture. The root level is the Home Location Register (HLR) and the second level is the Visitor Location Register (VLR). The HLR permanently stores all mobile users' location information and the types of services subscribed to in each user's profile database. The VLR stores the location information whenever a user registered in the HLR moves into one of the VLR's location areas. By contacting the HLR, the VLR authenticates and updates the mobile user's current position when a mobile terminal moves from one location area to another. The HLR then records the mobile terminal's new location and removes the mobile terminal from its previous VLR. There can be multiple VLRs under each HLR in a network.

In the current location management schema, all information requests, queries, and acknowledgements have to go through the HLR, which results in excessive load at the HLR. This load grows as the number of mobile terminals in the network increases. The heavy traffic at the root (HLR) may cause congestion and degrade the bandwidth at the root, making it a major bottleneck for the entire network. To solve this congestion and bottleneck problem, a modified iterative protocol with a VLR cache was introduced, in which the VLRs handle all deregistration, registration, and acknowledgement messages, while the HLR only updates the location information of the mobile terminal in its database. This reduced the excess load and traffic at the HLR, improving the network's performance. The modified protocol was tested with different cache replacement policies, such as First-In First-Out (FIFO), Random, and Least Frequently Visited (LFV), under uniform traffic with random mobile terminal movement.

In this thesis report, we extend the previous work on the modified iterative protocol by 1) enlarging the network topology, to analyze the impact of network size on performance, and 2) changing the mobile terminal traffic pattern from uniform traffic with random movement to non-uniform traffic with unbalanced-probability movement. With these changes, we analyzed the modified protocol's performance with the different cache replacement policies (FIFO, LFV, and Random) under both traffic patterns.
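A sketch of the VLR-cache idea: a small fixed-size table of recently seen terminals, consulted before querying the HLR, with FIFO eviction when full. The entry layout, cache size, and lookup trace are invented for illustration; an LFV variant would add a per-entry visit counter and evict the entry with the smallest count.

    #include <stdio.h>

    #define CACHE_SLOTS 4

    /* A VLR cache entry: which mobile terminal, and where it was last seen. */
    struct entry { int terminal_id; int location_area; };

    static struct entry cache[CACHE_SLOTS];
    static int used = 0, fifo_head = 0;

    /* Look up a terminal; on a miss, insert it, evicting FIFO-style. */
    static int lookup(int terminal_id, int location_area)
    {
        for (int i = 0; i < used; i++)
            if (cache[i].terminal_id == terminal_id)
                return cache[i].location_area;        /* hit: no HLR query */
        int slot = used < CACHE_SLOTS ? used++ : fifo_head;
        if (slot == fifo_head && used == CACHE_SLOTS)
            fifo_head = (fifo_head + 1) % CACHE_SLOTS; /* advance FIFO head */
        cache[slot] = (struct entry){ terminal_id, location_area };
        return -1;                                     /* miss: ask the HLR */
    }

    int main(void)
    {
        int trace[][2] = {{7,1},{8,2},{7,1},{9,3},{10,4},{11,5},{7,1}};
        int hits = 0, n = sizeof trace / sizeof trace[0];
        for (int i = 0; i < n; i++)
            if (lookup(trace[i][0], trace[i][1]) >= 0) hits++;
        printf("%d hits out of %d lookups\n", hits, n);
        return 0;
    }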
467

Fault-tolerant Cache Coherence Protocols for CMPs

Fernández Pascual, Ricardo 23 July 2007 (has links)
We propose a way to deal with transient faults in the interconnection network of many-core CMPs that differs from the classic approach of building a fault-tolerant interconnection network. In particular, we provide fault tolerance mechanisms at the level of the cache coherence protocol, so that it guarantees the correct execution of programs even when the underlying interconnection network does not deliver all messages correctly. This way, we can take advantage of the different meaning of each message to achieve fault tolerance with lower overhead than at the level of the interconnection network, which has to treat all messages alike with respect to reliability. We design several fault-tolerant cache coherence protocols using these techniques and evaluate them. The evaluation shows that, in the absence of faults, our techniques do not significantly increase the execution time of applications; their major cost is an increase in network traffic due to the acknowledgment messages that ensure the reliable transfer of ownership between coherence nodes, which are sent out of the critical path of cache misses. In addition, a system using our protocols degrades gracefully when transient faults actually happen, and it can sustain fault rates much higher than those expected in the real world with only a small performance degradation.
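The key mechanism, reliable ownership transfer, can be sketched as follows: the old owner keeps a backup copy of the line until the new owner acknowledges receipt, and retries on a timeout, so a lost message can never destroy the only copy of the data. This toy model is illustrative only; the message names, retry policy, and single-line state are not taken from the thesis.

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy model of reliable ownership transfer between coherence nodes. */
    struct node { bool owner; int data; int backup; bool awaiting_ack; };

    static bool unreliable_send(int attempt)
    {
        return attempt > 0;               /* pretend the first try is dropped */
    }

    static void transfer(struct node *from, struct node *to)
    {
        from->backup = from->data;        /* retained until acked */
        from->awaiting_ack = true;
        for (int attempt = 0; from->awaiting_ack; attempt++) {
            if (!unreliable_send(attempt)) {
                printf("attempt %d: ownership message lost, retrying\n", attempt);
                continue;                 /* timeout: resend from the backup */
            }
            to->data = from->backup;
            to->owner = true;             /* new owner replies with an ack */
            from->owner = false;
            from->awaiting_ack = false;   /* ack received: backup can be dropped */
            printf("attempt %d: ownership transferred, ack received\n", attempt);
        }
    }

    int main(void)
    {
        struct node a = { true, 42, 0, false }, b = { false, 0, 0, false };
        transfer(&a, &b);
        printf("b owns line: %s, data=%d\n", b.owner ? "yes" : "no", b.data);
        return 0;
    }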
468

Software caching techniques and hardware optimizations for on-chip local memories

Vujic, Nikola 05 June 2012 (has links)
Although caches are the most viable choice for L1 memories in processors, on-chip local memories have lately received considerable attention. Local memories are an interesting design option due to their many benefits: smaller area, reduced energy consumption, and fast, constant access time. These benefits are especially interesting for the design of modern multicore processors, since power and latency are critical concerns in computer architecture today. Local memories also generate no coherency traffic, which matters for the scalability of multicore systems. Unfortunately, local memories have not yet been widely adopted in modern processors, mainly because of their poor programmability. Systems with on-chip local memories have no hardware support for transparent data transfers between local and global memories, and this lack of ease of programming is one of the main impediments to the broad acceptance of such systems. This thesis addresses software and hardware optimizations for the programmability and use of on-chip local memories in both single-core and multicore systems. The software optimizations concern software caching techniques. A software cache is a robust way to give the user a transparent view of the memory architecture, but it can suffer from poor performance. We start by optimizing the traditional software cache, proposing a hierarchical, hybrid software-cache architecture, and then develop several further optimizations to speed it up as much as possible. As a result of these software optimizations, our hybrid software cache performs 4 to 10 times faster than a traditional software cache on a set of NAS parallel benchmarks. We do not stop at software caching: to improve the performance of these architectures further, we also examine other aspects of architectures with on-chip local memories, such as the quality of the generated code and its correspondence with the quality of buffer management in local memories. Once the software-level gains are exhausted, we propose optimizations at the hardware level. Two hardware proposals are presented in this thesis: one relaxes the alignment constraints imposed by architectures with on-chip local memories, and the other accelerates the management of local memories by providing hardware support for the majority of the actions performed in our software cache.
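The basic operation such a software cache must make fast is the lookup: map a global address to a line in the local store and fault the line in when the tag does not match. A minimal direct-mapped sketch follows, with memcpy standing in for the DMA transfer a real runtime would issue; the line size, line count, and direct-mapped organization are assumptions for illustration, not the thesis's hybrid hierarchical design.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define LINE   128               /* line size in bytes (assumed) */
    #define NLINES 64                /* local-store lines (assumed)  */

    static uint8_t global_mem[1 << 20];          /* "main" memory     */
    static uint8_t local_store[NLINES][LINE];    /* on-chip local mem */
    static uintptr_t tags[NLINES];
    static int valid[NLINES];

    /* Return a local pointer for a global address, faulting the line in
       on a miss; memcpy stands in for the DMA a real runtime would issue. */
    static uint8_t *sc_lookup(uintptr_t gaddr)
    {
        uintptr_t line = gaddr / LINE;
        int set = line % NLINES;                 /* direct-mapped */
        if (!valid[set] || tags[set] != line) {  /* miss path */
            memcpy(local_store[set], &global_mem[line * LINE], LINE);
            tags[set] = line;
            valid[set] = 1;
        }
        return &local_store[set][gaddr % LINE];  /* hit path */
    }

    int main(void)
    {
        global_mem[1000] = 99;
        printf("value = %d\n", *sc_lookup(1000));  /* miss, then load */
        printf("value = %d\n", *sc_lookup(1000));  /* hit             */
        return 0;
    }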
469

Cache-Oblivious Searching and Sorting in Multisets

Farzan, Arash January 2004 (has links)
We study three problems related to searching and sorting in multisets in the cache-oblivious model: finding the most frequent element (the mode), duplicate elimination, and multi-sorting. We are interested in minimizing the cache complexity (the number of cache misses) of algorithms for these problems when the cache size and block size are unknown. We start by showing lower bounds in the comparison model, then present lower bounds in the cache-aware model, which are also lower bounds in the cache-oblivious model. Consider an input multiset of size N with multiplicities N_1, ..., N_k. The lower bound on the cache complexity of determining the mode is Ω((N/B) log_{M/B} (N/(fB))), where f is the frequency of the mode and M and B are the cache size and block size respectively. The cache complexities of duplicate removal and multi-sorting have lower bounds of Ω((N/B) log_{M/B} (N/B) − Σ_{i=1}^{k} (N_i/B) log_{M/B} (N_i/B)). We present two deterministic approaches to designing algorithms: selection and distribution. The algorithms obtained with these deterministic approaches differ from the lower bounds by at most an additive term of (N/B) log log M. However, since log log M is very small in real applications, the gap is tiny. The ideas of our deterministic algorithms can also be used to design cache-aware algorithms for these problems, and these turn out to be simpler than the previously known cache-aware algorithms. Another approach is probabilistic. In contrast to the deterministic algorithms, our randomized cache-oblivious algorithms are all optimal: their cache complexities exactly match the lower bounds. All of our algorithms are within a constant factor of optimal in the number of comparisons they perform.
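To give the mode bound some scale, here is a worked instantiation in LaTeX with assumed parameter values (none of these numbers come from the thesis):

    % Worked instantiation of the mode lower bound (parameter values assumed)
    \[
      \Omega\!\Big(\tfrac{N}{B}\,\log_{M/B}\tfrac{N}{fB}\Big),
      \qquad N = 2^{26},\; B = 2^{6},\; M = 2^{20},\; f = 2
    \]
    \[
      \tfrac{N}{B} = 2^{20},\qquad
      \tfrac{M}{B} = 2^{14},\qquad
      \tfrac{N}{fB} = 2^{19},\qquad
      \log_{2^{14}} 2^{19} = \tfrac{19}{14} \approx 1.36
    \]
    \[
      \Rightarrow\ \text{lower bound} \approx 1.36 \times 2^{20}
      \approx 1.4 \times 10^{6}\ \text{cache misses.}
    \]

Note that as f grows the bound shrinks: a very frequent mode can be identified with fewer block transfers.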
470

Design and Evaluation of High Density 5T SRAM Cache for Advanced Microprocessors

Carlson, Ingvar January 2004 (has links)
This thesis presents a five-transistor SRAM intended for the advanced microprocessor cache market. The goal is to reduce the area of the cache memory array while maintaining competitive performance. Various existing technologies are briefly discussed, with their strengths and weaknesses. The design metrics for the five-transistor cell are discussed in detail, and its performance and stability are evaluated. Finally, a 128 Kb memory in an existing six-transistor technology is compared with the proposed technology in terms of area, performance, and stability. It is shown that the area of the memory array can be reduced by 23% while maintaining comparable performance. The new cell also has 43% lower total leakage current. As a trade-off for these advantages, some of the stability margin is lost, but the cell remains stable in all process corners. The performance and stability have been validated through post-layout simulations using Cadence Spectre.
