Global ETD Search

391	Spare Block Cache Architecture to Enable Low-Voltage Operation Siddique, Nafiul Alam 01 January 2011 Power consumption is a major concern for modern processors. Voltage scaling is one of the most effective mechanisms to reduce power consumption. However, voltage scaling is limited by large memory structures, such as caches, where many cells can fail at low voltage. As a result, voltage scaling is bounded by a minimum voltage (Vccmin), below which the processor may not operate reliably. Researchers have proposed architectural mechanisms, error detection and correction techniques, and circuit solutions to allow the cache to operate reliably at low voltages. Architectural solutions reduce cache capacity at low voltages at the expense of logic complexity. Circuit solutions change the SRAM cell organization and have the disadvantage of reducing the cache capacity (for the same area) even when the system runs at a high voltage. Error detection and correction mechanisms use Error Correction Codes (ECC) to keep cache operation reliable at low voltage, but have the disadvantage of increasing cache access time. In this thesis, we propose a novel architectural technique that uses spare cache blocks to back up a set-associative cache at low voltage. In our mechanism, we perform memory tests at low voltage to detect errors in all cache lines and tag them as faulty or fault-free. We have designed shifter and adder circuits for our architecture, and evaluated our design using the SimpleScalar simulator. We constructed a fault model for our design to find the cache set failure probability at low voltage. Our evaluation shows that, at 485mV, our designed cache operates with an equivalent bit failure probability to a conventional cache operating at 782mV. We have compared instructions per cycle (IPC), miss rates, and cache accesses of our design with a conventional cache operating at nominal voltage. 
We have also compared our cache performance with a cache using the previously proposed Bit-Fix mechanism. Our results show that our spare cache mechanism is 15% more area efficient than Bit-Fix. Our proposed approach provides a significant improvement in power and EPI (energy per instruction) over a conventional cache and Bit-Fix, at the expense of lower performance at high voltage. Low Voltage Cache Architecture Power consumption Cache capacity Electronic systems -- Energy consumption Low voltage integrated circuits Microprocessors -- Power supply
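As a rough illustration of the kind of fault model mentioned in this abstract, the following sketch estimates a set failure probability from a bit failure probability. It is not the thesis's model: it assumes independent bit failures, treats a block as faulty if any of its bits fails, and declares a set failed when more than `spares` of its `ways + spares` blocks are faulty. All function names and parameter values are hypothetical.

```python
from math import comb


def block_failure_probability(p_bit: float, block_bits: int) -> float:
    """Probability that at least one bit of a block fails (independent bits assumed)."""
    return 1.0 - (1.0 - p_bit) ** block_bits


def set_failure_probability(p_bit: float, block_bits: int, ways: int, spares: int) -> float:
    """Probability that a set cannot supply `ways` fault-free blocks.

    Assumption (not from the thesis): the set holds ways + spares blocks and
    survives as long as at most `spares` of them are faulty.
    """
    p_block = block_failure_probability(p_bit, block_bits)
    total = ways + spares
    # Binomial probability that the number of faulty blocks is 0..spares.
    p_survive = sum(
        comb(total, k) * p_block ** k * (1.0 - p_block) ** (total - k)
        for k in range(spares + 1)
    )
    return 1.0 - p_survive


if __name__ == "__main__":
    # Hypothetical numbers: 512-bit blocks, 8-way sets, bit failure probability 1e-4.
    for spares in (0, 1, 2):
        print(spares, set_failure_probability(1e-4, 512, 8, spares))
```

Under these assumptions, adding spare blocks lowers the set failure probability for a fixed bit failure probability, which is the trade-off a spare-block design exploits.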
392	Managing dynamic non-uniform cache architectures Lira Rueda, Javier 25 November 2011 Researchers from both academia and industry agree that future CMPs will accommodate large shared on-chip last-level caches. However, the exponential increase in multicore processor cache sizes, accompanied by growing on-chip wire delays, makes it difficult to implement traditional caches with a single, uniform access latency. Non-Uniform Cache Access (NUCA) designs have been proposed to address this situation. A NUCA cache divides the whole cache memory into smaller banks that are distributed across the chip and can be accessed independently. Response time in NUCA caches depends not only on the latency of the bank itself, but also on the time required to reach the bank that holds the requested data and to send the data to the core. The NUCA cache therefore allows banks located next to the cores to have lower access latencies than banks that are further away, thus mitigating the effects of the cache’s internal wires. These cache architectures have traditionally been classified by their placement decisions as static (S-NUCA) or dynamic (D-NUCA). In this thesis, we focus on D-NUCA, as it exploits the dynamic features that NUCA caches offer, such as data migration. The flexibility that D-NUCA provides, however, raises new challenges that complicate the management of this kind of cache architecture in CMP systems. We have identified these new challenges and tackled them from the point of view of the four NUCA policies: replacement, access, placement and migration. First, we focus on the challenges introduced by the replacement policy in D-NUCA. Data migration causes the most frequently accessed data blocks to concentrate in the banks that are closest to the processors. 
This creates large differences in the average usage rate of the NUCA banks: the banks close to the processors are the most accessed, while the banks further away are accessed less often. Upon a replacement in a particular bank of the NUCA cache, the probability that the evicted data block will be reused by the program differs depending on whether its last location in the NUCA cache was a bank close to the processors. The decentralized nature of NUCA, however, prevents a NUCA bank from knowing that another bank is constantly evicting data blocks that are later reused. We propose three different techniques to deal with the replacement policy, The Auction being the most successful one. Then, we deal with the challenges in the access policy. Because data blocks can be mapped to multiple banks within the NUCA cache, finding the requested data in a D-NUCA cache is a difficult task. In addition, data can move freely between these banks, so the search scheme must look up all banks where the requested data block can be mapped to ascertain whether it is in the NUCA cache. We have proposed HK-NUCA, a search scheme that uses home knowledge to effectively reduce the average number of messages introduced into the on-chip network to satisfy a memory request. With regard to the placement policy, this thesis shows the implementation of a hybrid NUCA cache. We have proposed a novel placement policy that accommodates both memory technologies, SRAM and eDRAM, in a single NUCA cache. Finally, to deal with the migration policy in D-NUCA caches, we propose The Migration Prefetcher, a technique that anticipates data migrations. In summary, in this thesis we propose different techniques to efficiently manage future D-NUCA cache architectures on CMPs. We demonstrate the effectiveness of our techniques in dealing with the challenges introduced by D-NUCA caches. 
Our techniques outperform existing solutions in the literature, and are in most cases more energy efficient. / Current CMPs integrate ever larger last-level cache memories on the chip. Industry roadmaps and academic work show that this trend will continue in the coming years. However, the high delays in the interconnection network and the wiring make it increasingly difficult to implement traditional cache memories with a single, uniform access latency. NUCA (Non-Uniform Cache Access) designs appeared to resolve this situation. A NUCA cache divides a large memory into smaller blocks that are distributed across the chip and can be accessed independently. As a result, the response time of a NUCA cache depends not only on the latency of a bank, but also on the time needed to route the request to and from the NUCA bank that responds. The physical position of a bank on the chip is key to determining the NUCA access latency, so banks located closer to the cores have lower access latencies than banks that are further away. NUCA caches can be classified as static (S-NUCA) or dynamic (D-NUCA), based on their placement decisions. This thesis focuses on D-NUCA. This design allows a data block to migrate from bank to bank in order to reduce the latency of future accesses to it, but it also poses other challenges that must be investigated to manage these caches efficiently. We have identified and explored these challenges from the point of view of the four NUCA policies: replacement, access, placement and migration. First, we focused on the replacement policy. Data migration allows the most frequently used data to concentrate in the banks that are closest to the cores. 
This creates large differences in the average usage of the NUCA banks: the banks near the cores are the most accessed, while the distant banks are accessed less often. Because the frequency of replacements differs between banks, the probability that an evicted data block will be reused in the future increases or decreases depending on the bank where the replacement took place. Moreover, previous work on replacement policies is not effective in this type of cache, since the banks operate independently. We propose three replacement techniques for NUCA, with The Auction being the most beneficial. As for the challenges of the access policy, since data can be mapped to several banks within the NUCA cache, finding it becomes a complicated and costly task. Here we propose HK-NUCA, an access algorithm that uses the knowledge held in the "home" banks to efficiently reduce the average number of accesses needed to resolve a memory request. To analyze the placement policy, this thesis shows the implementation of a hybrid NUCA cache. Our placement policy makes it possible to integrate both technologies, SRAM and eDRAM, in a single NUCA cache level. Finally, regarding migration in D-NUCA, we have proposed The Migration Prefetcher, a technique that anticipates data migrations using the knowledge gained from the access history. In summary, this thesis proposes different techniques to efficiently manage future D-NUCA cache memory architectures in a CMP environment. Throughout the thesis, we demonstrate the effectiveness of the proposed techniques in mitigating the effects induced by the use of D-NUCA caches. These techniques, besides achieving higher performance than other mechanisms in the literature, are in many cases more energy efficient. 
Memòria cache NUCA Política d'emplaçament Política de cerca Política de reemplaçament Política de migració Cache memory Placement policy Access policy Replacement policy Migration policy 004
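The dependence of NUCA response time on bank position, and the effect of D-NUCA migration, can be sketched as follows. This is a minimal model under assumptions that are not taken from the thesis: banks sit on a 2D mesh, the routing cost is the Manhattan distance times a fixed per-hop latency (paid once for the request and once for the reply), and a hit migrates the block one hop toward the requesting core. The function names and latency values are hypothetical.

```python
def nuca_access_latency(core, bank, bank_latency=4, hop_latency=1):
    """Round-trip latency in cycles for `core` to read from `bank`.

    `core` and `bank` are (row, col) positions on an assumed 2D mesh.
    """
    hops = abs(core[0] - bank[0]) + abs(core[1] - bank[1])
    return 2 * hops * hop_latency + bank_latency


def migrate_one_hop(core, bank):
    """Return the bank position after moving the block one hop toward `core`."""
    row, col = bank
    if row != core[0]:
        row += 1 if core[0] > row else -1
    elif col != core[1]:
        col += 1 if core[1] > col else -1
    return (row, col)


if __name__ == "__main__":
    core, bank = (0, 0), (2, 3)
    # Each repeated hit moves the block closer, so later accesses get cheaper.
    for _ in range(6):
        print(bank, nuca_access_latency(core, bank))
        bank = migrate_one_hop(core, bank)
```

In this toy model the latency falls with every migration until the block reaches the bank next to the core, which is also why frequently used blocks end up concentrated in the nearby banks.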
393	DNSSEC en säkerhetsförbättring av DNS : en studie om Svenska kommuners syn på DNSSEC Telling, Henric, Gunnarsson, Anders January 2010 The purpose of this thesis is to investigate why few Swedish municipalities have chosen to install DNSSEC on their domains. DNS is one of the most important protocols on the Internet and is needed to link IP addresses with addresses that are easier for people to understand. DNS was created without security in mind; to make DNS more secure, a security extension named DNSSEC was developed. We used a literature study, experiments and interviews to build deeper knowledge and understanding of how DNS and DNSSEC work and to answer why few municipalities have chosen to install DNSSEC. During our literature study we read about several vulnerabilities in DNS and how these can be exploited to expose an organization to attacks such as cache poisoning and MITM. We tested these vulnerabilities and confirmed them. After the installation of DNSSEC, the attacks could no longer be carried out in our test environment. From the interviews we concluded that the most common reason municipalities do not install DNSSEC is a lack of knowledge about the installation procedure, and that they consider their current DNS to work well, so it does not become a priority. The municipalities that have installed DNSSEC are satisfied with their installation, and only one municipality experienced problems during the rollout. For us to be able to continue developing the Internet, checking security is a necessity, and DNSSEC shows the way. Municipalities should lead by example and be among the first to adopt DNSSEC, so that visitors to their websites can be confident that the information on their pages is correct. / The purpose of this paper is to investigate why few Swedish municipalities have chosen to install DNSSEC on their domains. DNS is one of the most important protocols on the Internet and is used to link IP addresses to addresses that users can understand. 
DNS was created without security in mind; to make DNS more secure, a security extension named DNSSEC was developed. We have used a literature review, experiments and interviews to build deeper knowledge and understanding of DNS and DNSSEC, how they work, and why few municipalities have chosen to install DNSSEC. In the literature we read about several vulnerabilities in DNS and how it can easily be exposed to attacks such as cache poisoning and MITM. We tested these vulnerabilities and confirmed them. After installing DNSSEC, we could no longer attack our DNS deployment in our test environment. From the interviews, we concluded that the most common reason municipalities do not install DNSSEC is lack of knowledge about the installation procedure; they also think that their current DNS works well, so it does not become a priority. The municipalities that have installed DNSSEC are satisfied with their installation, and only one municipality has experienced difficulties during the implementation. For us to continue developing the Internet, checking security is a necessity, and DNSSEC is a good example. Local authorities should lead by example and be among the first to implement DNSSEC, so that users of their websites can be assured that the information on their pages is accurate. DNS DNSSEC IP Security Cache MITM Internet .SE ARP Intrusion Windows Server DoS. DNS DNSSEC IP Säkerhet Cache MITM Internet .SE ARP Intrång Windows Server DoS.
394	Design Space Exploration and Optimization of Embedded Memory Systems Rabbah, Rodric Michel 11 July 2006 Recent years have witnessed the emergence of microprocessors that are embedded within a plethora of devices used in everyday life. Embedded architectures are customized through a meticulous and time-consuming design process to satisfy stringent constraints with respect to performance, area, power, and cost. In embedded systems, the cost of the memory hierarchy limits its ability to play as central a role. This is due to stringent constraints that fundamentally limit the physical size and complexity of the memory system. Ultimately, application developers and system engineers are charged with the heavy burden of reducing the memory requirements of an application. This thesis offers the intriguing possibility that compilers can play a significant role in the automatic design space exploration and optimization of embedded memory systems. This insight is founded upon a new analytical model and novel compiler optimizations that are specifically designed to increase the synergy between the processor and the memory system. The analytical models serve to characterize intrinsic program properties, quantify the impact of compiler optimizations on the memory systems, and provide deep insight into the trade-offs that affect memory system design. Cache architecture Temporal locality Embedded systems Design space exploration Compilers Spatial locality Prefetching Data remapping Embedded computer systems Programming Cache memory Compilers (Computer programs) Memory hierarchy
395	A Peer To Peer Web Proxy Cache For Enterprise Networks Ravindranath, C K 06 1900 In this thesis, we propose a decentralized peer-to-peer (P2P) Web proxy cache for enterprise networks (ENs). Currently, enterprises use a centralized proxy-based Web cache, where a dedicated proxy server does the caching. A dedicated proxy Web cache has to be over-provisioned to handle peak loads. It is expensive, a single point of failure, and a bottleneck. In a P2P Web cache, the clients themselves cooperate in caching the Web objects without any dedicated proxy cache. The resources of the client machines are pooled together to form a Web cache. This eliminates the need for extra hardware and the single point of failure, and improves the average response time, since all the machines serve the request queue. The most important attraction of the P2P scheme is its inherent scalability. Squirrel was the earliest P2P Web cache. Squirrel is built upon a structured P2P protocol called Pastry. Pastry is based on consistent hashing, a special form of hashing that performs well in the presence of client membership changes. Consistent hashing based protocols are designed for Internet-wide environments to handle very large membership sizes and high rates of membership change. To minimize the protocol bandwidth, the membership state maintained at each peer is very small. This state consists of information about the peer’s immediate neighbours, and about a few other P2P members, to achieve faster look-up. This scheme has the following disadvantages: (i) since peers do not maintain information about all the other peers in the system, any peer needing an object has to find the peer responsible for the object through a multi-hop lookup, thereby increasing the latency, and (ii) the number of objIds assigned to a peer depends on the hashing used, and this can be skewed, which affects the load distribution. The popular applications of the P2P paradigm have been file-sharing systems. 
These systems are deployed across the Internet. Hence, the existing P2P protocols were designed to operate within the constraints of Internet environments. The P2P proxy Web cache has been a recent application of the P2P paradigm. P2P Web proxy caches operate across the entire network of an enterprise. An enterprise network (EN) comprises all the computing and communications capabilities of an institution. Institutions typically consist of many departments, with each department having and managing its own local area network (LAN). The available bandwidth in LANs is very high. LANs have low latency and low error rates. EN environments have smaller membership size, less frequent membership changes and more available bandwidth. Hence, in such environments, the P2P protocol can afford to store more membership information. This thesis explores the significant differences between EN and Internet environments. It proposes a new P2P protocol designed to exploit these differences, and a P2P Web proxy caching scheme based on this new protocol. Specifically, it shows that it is possible to maintain complete and consistent membership information on ENs. The thesis then presents a load distribution policy for a P2P system with complete and consistent membership information to achieve (i) load balance and (ii) minimum object migrations subsequent to each node join or node leave event. The proposed system incurs extra storage and bandwidth costs. We have seen that the necessary storage is available in general workstations and the required bandwidth is feasible in modern networks. We then evaluated the improvement in performance achieved by the system over existing consistent hashing based systems. We have shown that without investing in any special hardware, the P2P system can match the performance of dedicated proxy caches. We have further shown that the buddy based P2P scheme has a better load distribution, especially under heavy loads when load balancing becomes critical. 
We have also shown that for large P2P systems, the buddy based scheme has a lower latency than the consistent hashing based schemes. Further, we have compared the costs of the proposed scheme and the existing consistent hashing based scheme for different loads (i.e., rates of Web object requests), and identified the situations in which the proposed scheme is likely to perform best. In summary, the thesis shows that (i) the membership dynamics of P2P systems on ENs differ from those of Internet file-sharing systems and (ii) it is feasible in ENs to maintain a complete and consistent view of the P2P membership at all the peers. We have designed a structured P2P protocol for LANs that maintains a complete and consistent view of membership information at all peers. P2P Web caches achieve single-hop routing and a better balanced load distribution using this scheme. The complete and consistent view of membership information enables single-hop lookup and flexible load assignment. Computer Networks Web Proxy Cache Peer-to-Peer Web Proxy Cache Web Caching Enterprise Networks (ENs) P2P Protocols Web Proxy Caching Computer Science
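The difference between multi-hop lookup and lookup with complete membership can be illustrated with a small sketch. It assumes plain consistent hashing over a fully known peer list, so that any peer can compute an object's owner locally in a single step; it is not the buddy-based protocol proposed in the thesis, and the class, peer and URL names are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the hash ring (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class FullMembershipRing:
    """Consistent-hash ring where every peer knows the whole membership list."""

    def __init__(self, peers):
        self._ring = sorted((_hash(p), p) for p in peers)

    def owner(self, url: str) -> str:
        """Return the peer responsible for `url`: its successor on the ring.

        With the full ring available locally, this is a single local
        computation rather than a chain of network hops.
        """
        points = [h for h, _ in self._ring]
        index = bisect.bisect_right(points, _hash(url)) % len(self._ring)
        return self._ring[index][1]

    def remove(self, peer: str) -> None:
        """Drop a departed peer; only the objects it owned change owner."""
        self._ring = [(h, p) for h, p in self._ring if p != peer]


if __name__ == "__main__":
    ring = FullMembershipRing([f"peer{i}" for i in range(8)])
    print(ring.owner("http://example.com/index.html"))
```

The sketch also shows the property that limits object migration on a node leave event: removing one peer leaves the owner of every object that the departed peer did not hold unchanged.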
396	Native simulation of MPSoC : instrumentation and modeling of non-functional aspects / Simulation native des MPSoC : instrumentation et modélisation des aspects non fonctionnels Matoussi, Oumaima 30 November 2017 Modern embedded systems integrate tens or even hundreds of cores on a single chip, communicating through networks on chip, to meet the performance requirements set by the market. These are known as massively multi-core or many-core systems. The complexity of these systems makes architectural design space exploration, hardware/software co-verification and performance estimation a real challenge. This complexity is generally offset by the flexibility of the embedded software. The dominance of software in these architectures requires hardware and software development and verification to begin in the earliest stages of the design flow, well before a hardware prototype is available. An abstract model that reproduces the behavior of the target chip in a reasonable time is therefore needed. Such a model is known as a virtual platform or simulation platform. Software is commonly executed on such a platform by means of an instruction set simulator (ISS). This type of simulator, which interprets instructions one by one, is unfortunately characterized by a very slow simulation speed, which only worsens as the number of cores grows. Native simulation is considered a suitable candidate for reducing the simulation time of many-core systems. The principle of native simulation is to compile and then execute almost the entire software stack directly on the host machine while communicating with realistic models of the hardware components of the target architecture, thereby shortening simulation times. 
Native simulation is much faster than an ISS, but it does not take into account non-functional aspects, such as execution time, that depend on the real hardware architecture, which prevents performance estimation of the software. This sets the context for the work carried out in this thesis, which focuses on native simulation and is organized around two major contributions. The first tackles the introduction of non-functional information into the compiler's intermediate representation (IR). The precise insertion of such information into the functional model is achieved by an algorithm whose goal is to find correspondences between the target binary code and the IR code while taking compiler optimizations into account. The second contribution concerns the modeling of an instruction cache and an instruction buffer of a VLIW architecture to generate accurate performance estimates. Thus, the native simulation platform, combined with accurate performance models and an efficient annotation technique, makes it possible, despite its high level of abstraction, not only to verify that the software functions correctly but also to provide accurate performance estimates within reasonable simulation times. / Modern embedded systems are endowed with a high level of parallelism and significant processing capabilities as they integrate hundreds of cores on a single chip communicating through a network on chip. The complexity of these systems and their dedicated software should not be an excuse for long design cycles, even though the design space is enormous and the underlying design decisions are critical. 
Thus, design space exploration, hardware/software co-verification and performance estimation need to be conducted within a reasonable amount of time and early enough in the design process to avoid any tardy detection of functional or performance deficiencies. Co-simulation platforms are becoming an increasingly important part of design and verification steps. With instruction interpretation-based software simulation platforms being too slow as they model low-level details of the target system, an alternative software simulation approach known as native simulation or host-compiled simulation has gained momentum this past decade. Native simulation consists of compiling the embedded software to the host binary format and executing it directly on the host machine. However, this technique fails to reflect the performance of the embedded software and its actual interaction with the target hardware. So, the speedup gained by native simulation comes at a price, which is the absence of non-functional information (such as time and energy) needed for estimating the performance of the entire system and ensuring its proper functioning. Without such information, native simulation approaches are limited to functional validation. Yielding accurate estimates entails the integration of high-level abstract models that mimic the behavior of target-specific micro-architectural components in the simulation platform and the accurate placement of the obtained non-functional information in the high-level code. Back-annotating non-functional information at the right place requires a mapping between the binary instructions and the high-level code statements, which can be challenging, particularly when compiler optimizations are enabled. In this thesis, we propose an annotation framework working at the compiler intermediate representation level to accurately annotate performance metrics extracted from the binary code, thanks to a dedicated mapping algorithm. 
This mapping algorithm is further enhanced to deal with aggressive compiler optimizations, such as loop unrolling, that radically alter the structure of the code. Our target architecture being a VLIW processor, we also model at a high level its instruction buffer to faithfully reproduce its timing behavior. The experiments we conducted to validate our mapping algorithm and component models yielded accurate results and high simulation speed compared to a cycle-accurate ISS of the target platform. Simulation Vliw Aspects non fonctionnels Estimation de performance Représentation intermédiaire (IR) Modèle de cache d'instruction Simulation Vliw Non-Functional aspects Performance estimation Intermediate representation (IR) Instruction cache model 004
397	Système distribué à adressage global et cohérence logicielle pour l’exécution d’un modèle de tâche à flot de données / Distributed runtime system with global address space and software cache coherence for a data-flow task model Gindraud, François 11 January 2018 Distributed architectures are frequently used for high performance computing (HPC). To reduce energy consumption, some processor manufacturers have moved from shared-memory multi-core architectures to MPSoCs. MPSoCs (Multi-Processor System On Chip) are architectures that include a distributed system on a chip. Programming distributed architectures is harder than programming shared-memory systems, mainly because of the distributed nature of the memory. A family of tools called DSM (Distributed Shared Memory) has been developed to simplify the programming of distributed architectures. This family includes NUMA architectures, PGAS languages, and distributed runtimes for task graphs. The strategy used by DSMs is to create a global address space for the program's objects and to automatically perform the necessary network transfers when these objects are used. DSM systems vary widely in the interface provided, the features, the semantics of globally addressable objects, the type of support (hardware or software), ... This thesis presents a new software DSM system called Givy. The goal of Givy is to execute, on MPSoCs (MPPA), programs expressed as dynamic task graphs with data-flow dependencies. Givy's global address space (GAS) is indexed by real pointers, unlike many other software DSM systems: raw C pointers are valid across the whole distributed system. 
In Givy, global objects are the memory blocks returned by malloc(). These blocks are replicated across the nodes of the distributed system and are managed by a software cache coherence protocol called Owner Writable Memory. The protocol is able to move its own metadata, which should allow irregular programs to run efficiently. The programming model requires splitting the program into dynamically created tasks annotated with their memory accesses. These annotations are used to generate requests to the coherence protocol and to provide information to the task scheduler (spatial and temporal). The first result of this thesis is the overall organization of Givy. A second contribution is the formalization of the Owner Writable Memory protocol. The third result is the translation of this formalization into the language of a model checker (Cubicle), along with attempts to validate the protocol. The last result is the implementation and detailed explanation of the memory allocation subsystem: choosing raw pointers as global indexes requires tight integration between the memory allocator and the cache coherence protocol. / Distributed systems are widely used in HPC (High Performance Computing). Owing to rising energy concerns, some chip manufacturers moved from multi-core CPUs to MPSoC (Multi-Processor System on Chip), which includes a distributed system on one chip. However, distributed systems, with their distributed memories, are hard to program compared to friendlier shared memory systems. A family of solutions called DSM (Distributed Shared Memory) systems has been developed to simplify the programming of distributed systems. DSM systems include NUMA architectures, PGAS languages, and distributed task runtimes. The common strategy of these systems is to create a global address space of some kind, and automate network transfers on accesses to global objects. 
DSM systems usually differ in their interfaces, capabilities, semantics on global objects, implementation levels (hardware / software), ... This thesis presents a new software DSM system called Givy. The motivation of Givy is to execute programs modeled as dynamic task graphs with data-flow dependencies on MPSoC architectures (MPPA). Contrary to many software DSM systems, the global address space of Givy is indexed by real pointers: raw C pointers are made global to the distributed system. Givy global objects are memory blocks returned by malloc(). Data is replicated across nodes, and all these copies are managed by a software cache coherence protocol called Owner Writable Memory. This protocol can relocate coherence metadata, and thus should help execute irregular applications efficiently. The programming model cuts the program into tasks which are annotated with memory accesses, and created dynamically. Memory annotations are used to drive coherence requests, and provide useful information for scheduling and load-balancing. The first contribution of this thesis is the overall design of the Givy runtime. A second contribution is the formalization of the Owner Writable Memory coherence protocol. A third contribution is its translation into a model checker language (Cubicle), and correctness validation attempts. The last contribution is the detailed allocator subsystem implementation: the choice of real pointers for global references requires a tight integration between memory allocator and coherence protocol. Système distribué Protocole de cohérence de cache Support d'exécution Multi-Coeurs Modèle mémoire Distributed systems Cache coherence protocol Runtime Manycore Memory model 004
398	Um modelo de memória transacional para arquiteturas heterogêneas baseado em software Cache / A transactional memory model for heterogeneous architectures based in Software Cache Goldstein, Felipe Portavales 17 August 2018 (has links) Advisor: Rodolfo Jardim de Azevedo / Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica / Made available in DSpace on 2018-08-17T02:02:14Z (GMT). No. of bitstreams: 1 Goldstein_FelipePortavales_M.pdf: 2303926 bytes, checksum: c44512059a990654552904a0f94d74f2 (MD5) Previous issue date: 2010 / Summary: The adoption of multi-core processors by industry has created the need for new techniques to ease the programming of parallel software. The technique known as transactional memory is one of the most promising. It can execute tasks concurrently in an optimistic fashion, which allows good performance. Another advantage is that it is much simpler to use than the classic technique of mutual exclusion. This work proposes the first transactional memory model for hybrid architectures; the target architecture here is the Cell BE processor. The Cell BE processor is especially complex because of the difficulties its architecture imposes on the programmer when access to the shared global memory is needed. The proposed model acts as a layer between the program and main memory, allowing transparent access to data, guaranteeing coherence, and performing concurrency control automatically. The proposed model uses a Software Cache combined with transactional memory to ease access to external memory from the SPEs. It was implemented and tested using 8 different benchmark applications, showing its feasibility for real use cases. A detailed analysis was made of each part of the proposed architecture with respect to its impact on overall system performance. 
This model achieved performance up to twice that of the implementation using a global mutex. Its advantages lie mainly in ease of use, coherence guarantees, and the avoidance of some kinds of bugs that would be common in a mutex-based implementation, such as deadlocks. This work received the best paper award at SBAC-PAD 2008 / Abstract: The adoption of multi-core processors by the industry has pushed towards the development of new techniques to simplify programming parallel software. The technique called transactional memory is one of the most promising. This technique is able to execute multiple tasks concurrently in an optimistic way to achieve better performance. Another advantage is that this technique is simpler to use than classic mutual exclusion. This work proposes the first transactional memory model for hybrid architectures; in this case the target architecture is the Cell BE processor. The Cell BE is especially complex because of the difficulties of accessing the main shared memory from one of the SPEs. The proposed model acts as a layer between the running program and the main shared memory, allowing transparent access to the data and guaranteeing coherency and automatic concurrency control. The proposed model uses a Software Cache combined with a transactional memory to facilitate access to the main memory from the SPEs. This model was implemented and tested using 8 benchmark applications, showing its feasibility in real use cases. A detailed analysis of its internal parts has been made to show the impact of each part on overall system performance. The model achieved performance up to two times better than a similar implementation using a global mutex. The advantages of this model lie in its usability, its coherency guarantees, and its ability to avoid concurrency bugs such as deadlocks, which are common in mutex-based implementations. 
This work won the best paper award at SBAC-PAD 2008 / Master's / Computer Architecture / Master in Computer Science Memória cache Memória hierárquica (Computação) Processadores multicore Arquitetura de computador Transactional memory Cache memory Hierarchical memory (Computer science) Multicore processors Computer architecture
399	Arquitetura pdccm em hardware para compressão/descompressão de instruções em sistemas embarcados Dias, Wanderson Roger Azevedo 30 April 2009 (has links) Made available in DSpace on 2015-04-11T14:03:12Z (GMT). No. of bitstreams: 1 DISSERTACAO - WANDERSON ROGER.pdf: 2032449 bytes, checksum: f75ada58e34bb5da29e9716bc5899cab (MD5) Previous issue date: 2009-04-30 / Fundação de Amparo à Pesquisa do Estado do Amazonas / In the design of embedded systems, several factors must be taken into account, such as physical size, weight, mobility, energy consumption, memory, cooling, security requirements, and reliability, all combined with low cost and ease of use. However, as systems become more heterogeneous, their development becomes more complex. There are several techniques for optimizing execution time and power usage in embedded systems. One of these techniques is code compression; however, most existing proposals focus on decompression and assume that the code is compressed at compile time. Therefore, this work proposes the development of a specific architecture, with a hardware prototype (using VHDL and FPGAs), dedicated to code compression/decompression. To that end, a technique called PDCCM (Processor Decompressor Cache Compressor Memory) is proposed. The results are obtained via simulation and prototyping. Programs from the MiBench benchmark were used in the analysis. A compression method called MIC (Middle Instruction Compression) was also proposed and compared with the traditional Huffman compression method. 
In the PDCCM architecture, the MIC method outperformed the Huffman method for some of the analyzed MiBench programs that are widely used in embedded systems: it used 26% fewer FPGA logic elements, reached a 71% higher clock frequency (in MHz), and achieved 36% more instruction compression than Huffman, in addition to allowing compression/decompression at run time. / In the design of embedded systems, several factors have to be taken into account, such as physical size, weight, mobility, energy consumption, memory, cooling, security requirements, and reliability, all combined with low cost and ease of use. However, as systems become more heterogeneous, their development becomes more complex. There are several techniques for optimizing execution time and energy consumption in embedded systems. One of them is code compression; nevertheless, most existing proposals focus on decompression and assume that the code is compressed at compile time. This work therefore proposes the development of an architecture, with a corresponding hardware prototype (using VHDL and FPGAs), for the code compression/decompression process. To that end, the technique named PDCCM (Processor Decompressor Cache Compressor Memory) is proposed. The results are obtained via simulation and prototyping. Programs from the MiBench benchmark were used in the analysis. A compression method named MIC (Middle Instruction Compression) was also proposed and compared with the traditional Huffman compression method. 
In the PDCCM architecture, the MIC method therefore showed better computational performance than the Huffman method for some of the analyzed MiBench programs that are widely used in embedded systems, with 26% fewer FPGA logic elements, a 71% higher clock frequency in MHz, and 36% more instruction compression than the Huffman method, in addition to allowing compression/decompression at run time. Sistemas Embarcados Compressão/Descompressão de código Processador Memória, Cache. Embedded Systems Compression/Decompression of code Processor Memory Cache.
400	Cache memory aware priority assignment and scheduling simulation of real-time embedded systems / Affectation de priorité et simulation d’ordonnancement de systèmes temps réel embarqués avec prise en compte de l'effet des mémoires cache Tran, Hai Nam 23 January 2017 (has links) Real-time embedded systems (RTES) are subject to timing constraints. In these systems, the correctness of a result depends not only on the logical correctness of the computation but also on the time at which the result is produced (Stankovic, 1988). The systems must be highly predictable, in the sense that the worst-case execution time of each task must be determined. A scheduling analysis is performed on the system to ensure that there are enough resources to schedule all tasks. Cache memory is a hardware component used to reduce the performance gap between the processor and main memory. Integrating cache memory into an RTES generally improves performance in terms of execution time, but unfortunately it can increase preemption cost and execution time variability. In systems with cache memory, several tasks share this hardware resource, which introduces a cache related preemption delay (CRPD). By definition, CRPD is the delay added to the execution time of the preempted task because it must reload the cache blocks evicted by the preemption. It is therefore important to be able to account for CRPD during scheduling analysis. This thesis focuses on studying the effects of CRPD in uniprocessor systems and extends classical scheduling analysis methods accordingly. We propose several priority assignment algorithms that take CRPD into account. 
In addition, we study the problems related to scheduling simulation with CRPD and establish two theoretical results that allow it to be used as a verification method. The work in this thesis led to the extension of the Cheddar tool, an open-source scheduling analyzer. Several CRPD analysis methods were also implemented in Cheddar to complement the work presented in this thesis. / Real-time embedded systems (RTES) are subject to timing constraints. In these systems, overall correctness depends not only on the logical correctness of the computation but also on the time at which the result is produced (Stankovic, 1988). The systems must be highly predictable in the sense that the worst-case execution time of each task must be determined. Then, scheduling analysis is performed on the system to ensure that there are enough resources to schedule all of the tasks. Cache memory is a crucial hardware component used to reduce the performance gap between processor and main memory. Integrating cache memory into an RTES generally enhances overall performance in terms of execution time, but unfortunately, it can lead to an increase in preemption cost and execution time variability. In systems with cache memory, multiple tasks can share this hardware resource, which can introduce cache related preemption delay (CRPD). By definition, CRPD is the delay added to the execution time of the preempted task because it has to reload cache blocks evicted by the preemption. It is important to be able to account for CRPD when performing schedulability analysis. This thesis focuses on studying the effects of CRPD on uniprocessor systems and uses that understanding to extend classical scheduling analysis methods. We propose several priority assignment algorithms that take CRPD into account while assigning priorities to tasks. 
We investigate problems related to scheduling simulation with CRPD and establish two results that allow the use of scheduling simulation as a verification method. The work in this thesis is made available in Cheddar, an open-source scheduling analyzer. Several CRPD analysis features are also implemented in Cheddar in addition to the work presented in this thesis. Mémoire cache CRPD Affectation de priorité Simulation d’ordonnancement Systèmes temps réel embarqués Cache memory Priority assignment Scheduling simulation CRPD Real-time embedded systems
