Global ETD Search

1	Exploring Computational Sprinting in New Domains Saravanan, Indrajeet 28 August 2019 (has links) No description available. Computer Science Intel Cache Allocation Technology Service Level Objectives Browsers Computational Sprinting Dark Silicon
2	Adaptive Prefetching and Cache Partitioning for Multicore Processors Selfa Oliver, Vicent 13 November 2018 (has links) El acceso a la memoria principal en los procesadores actuales supone un importante cuello de botella para las prestaciones, dado que los diferentes núcleos compiten por el limitado ancho de banda de memoria, agravando la brecha entre las prestaciones del procesador y las de la memoria principal. Distintas técnicas atacan este problema, siendo las más relevantes el uso de jerarquías de caché multinivel y la prebúsqueda. Las cachés jerárquicas aprovechan la localidad temporal y espacial que en general presentan los programas en el acceso a los datos, para mitigar las enormes latencias de acceso a memoria principal. Para limitar el número de accesos a la memoria DRAM, fuera del chip, los procesadores actuales cuentan con grandes cachés de último nivel (LLC). Para mejorar su utilización y reducir costes, estas cachés suelen compartirse entre todos los núcleos del procesador. Este enfoque mejora significativamente el rendimiento de la mayoría de las aplicaciones en comparación con el uso de cachés privados más pequeños. Compartir la caché, sin embargo, presenta una problema importante: la interferencia entre aplicaciones. La prebúsqueda, por otro lado, trae bloques de datos a las cachés antes de que el procesador los solicite, ocultando la latencia de memoria principal. Desafortunadamente, dado que la prebúsqueda es una técnica especulativa, si no tiene éxito puede contaminar la caché con bloques que no se usarán. Además, las prebúsquedas interfieren con los accesos a memoria normales, tanto los del núcleo que emite las prebúsquedas como los de los demás. Esta tesis se centra en reducir la interferencia entre aplicaciones, tanto en las caché compartidas como en el acceso a la memoria principal. Para reducir la interferencia entre aplicaciones en el acceso a la memoria principal, el mecanismo propuesto en esta disertación regula la agresividad de cada prebuscador, activando o desactivando selectivamente algunos de ellos, dependiendo de su rendimiento individual y de los requisitos de ancho de banda de memoria principal de los otros núcleos. Con respecto a la interferencia en cachés compartidos, esta tesis propone dos técnicas de particionado para la LLC, las cuales otorgan más espacio de caché a las aplicaciones que progresan más lentamente debido a la interferencia entre aplicaciones. La primera propuesta de particionado de caché requiere hardware específico no disponible en procesadores comerciales, por lo que se ha evaluado utilizando un entorno de simulación. La segunda propuesta de particionado de caché presenta una familia de políticas que superan las limitaciones en el número de particiones y en el número de vías de caché disponibles mediante la agrupación de aplicaciones en clústeres y la superposición de particiones de caché, por lo que varias aplicaciones comparten las mismas vías. Dado que se ha implementado utilizando los mecanismos para el particionado de la LLC que presentan algunos procesadores Intel modernos, esta propuesta ha sido evaluada en una máquina real. Los resultados experimentales muestran que el mecanismo de prebúsqueda selectiva propuesto en esta tesis reduce el número de solicitudes de memoria principal en un 20%, cosa que se traduce en mejoras en la equidad del sistema, el rendimiento y el consumo de energía. Por otro lado, con respecto a los esquemas de partición propuestos, en comparación con un sistema sin particiones, ambas propuestas reducen la iniquidad del sistema en un promedio de más del 25%, independientemente de la cantidad de aplicaciones en ejecución, y esta reducción en la injusticia no afecta negativamente al rendimiento. / Accessing main memory represents a major performance bottleneck in current processors, since the different cores compete among them for the limited offchip bandwidth, aggravating even more the so called memory wall. Several techniques have been applied to deal with the core-memory performance gap, with the most preeminent ones being prefetching and hierarchical caching. Hierarchical caches leverage the temporal and spacial locality of the accessed data, mitigating the huge main memory access latencies. To limit the number of accesses to the off-chip DRAM memory, current processors feature large Last Level Caches. These caches are shared between all the cores to improve the utilization of the cache space and reduce cost. This approach significantly improves the performance of most applications compared to using smaller private caches. Cache sharing, however, presents an important shortcoming: the interference between applications. Prefetching, on the other hand, brings data blocks to the caches before they are requested, hiding the main memory latency. Unfortunately, since prefetching is a speculative technique, inaccurate prefetches may pollute the cache with blocks that will not be used. In addition, the prefetches interfere with the regular memory requests, both the ones from the application running on the core that issued the prefetches and the others. This thesis focuses on reducing the inter-application interference, both in the shared cache and in the access to the main memory. To reduce the interapplication interference in the access to main memory, the proposed approach regulates the aggressiveness of each core prefetcher, and selectively activates or deactivates some of them, depending on their individual performance and the main memory bandwidth requirements of the other cores. With respect to interference in shared caches, this thesis proposes two LLC partitioning techniques that give more cache space to the applications that have their progress diminished due inter-application interferences. The first cache partitioning proposal requires dedicated hardware not available in commercial processors, so it has been evaluated using a simulation framework. The second proposal dealing with cache partitioning presents a family of partitioning policies that overcome the limitations in the number of partitions and the number of available ways by grouping applications and overlapping cache partitions, so multiple applications share the same ways. Since it has been implemented using the cache partitioning features of modern Intel processors it has been evaluated in a real machine. Experimental results show that the proposed selective prefetching mechanism reduces the number of main memory requests by 20%, which translates to improvements in unfairness, performance, and energy consumption. On the other hand, regarding the proposed partitioning schemes, compared to a system with no partitioning, both reduce unfairness more than 25% on average, regardless of the number of applications running in the multicore, and this reduction in unfairness does not negatively affect the performance. / L'accés a la memòria principal en els processadors actuals suposa un important coll d'ampolla per a les prestacions, ja que els diferents nuclis competeixen pel limitat ample de banda de memòria, agreujant la bretxa entre les prestacions del processador i les de la memòria principal. Diferents tècniques ataquen aquest problema, sent les més rellevants l'ús de jerarquies de memòria cau multinivell i la prebusca. Les memòries cau jeràrquiques aprofiten la localitat temporal i espacial que en general presenten els programes en l'accés a les dades per mitigar les enormes latències d'accés a memòria principal. Per limitar el nombre d'accessos a la memòria DRAM, fora del xip, els processadors actuals compten amb grans caus d'últim nivell (LLC). Per millorar la seva utilització i reduir costos, aquestes memòries cau solen compartir-se entre tots els nuclis del processador. Aquest enfocament millora significativament el rendiment de la majoria de les aplicacions en comparació amb l'ús de caus privades més menudes. Compartir la memòria cau, no obstant, presenta una problema important: la interferencia entre aplicacions. La prebusca, per altra banda, porta blocs de dades a les memòries cau abans que el processador els sol·licite, ocultant la latència de memòria principal. Desafortunadament, donat que la prebusca és una técnica especulativa, si no té èxit pot contaminar la memòria cau amb blocs que no fan falta. A més, les prebusques interfereixen amb els accessos normals a memòria, tant els del nucli que emet les prebusques com els dels altres. Aquesta tesi es centra en reduir la interferència entre aplicacions, tant en les cau compartides com en l'accés a la memòria principal. Per reduir la interferència entre aplicacions en l'accés a la memòria principal, el mecanismo proposat en aquesta dissertació regula l'agressivitat de cada prebuscador, activant o desactivant selectivament alguns d'ells, en funció del seu rendiment individual i dels requisits d'ample de banda de memòria principal dels altres nuclis. Pel que fa a la interferència en caus compartides, aquesta tesi proposa dues tècniques de particionat per a la LLC, les quals atorguen més espai de memòria cau a les aplicacions que progressen més lentament a causa de la interferència entre aplicacions. La primera proposta per al particionat de memòria cau requereix hardware específic no disponible en processadors comercials, per la qual cosa s'ha avaluat utilitzant un entorn de simulació. La segona proposta de particionat per a memòries cau presenta una família de polítiques que superen les limitacions en el nombre de particions i en el nombre de vies de memòria cau disponibles mitjan¿ cant l'agrupació d'aplicacions en clústers i la superposició de particions de memòria cau, de manera que diverses aplicacions comparteixen les mateixes vies. Atès que s'ha implementat utilitzant els mecanismes per al particionat de la LLC que ofereixen alguns processadors Intel moderns, aquesta proposta s'ha avaluat en una màquina real. Els resultats experimentals mostren que el mecanisme de prebusca selectiva proposat en aquesta tesi redueix el nombre de sol·licituds a la memòria principal en un 20%, cosa que es tradueix en millores en l'equitat del sistema, el rendiment i el consum d'energia. Per altra banda, pel que fa als esquemes de particiónat proposats, en comparació amb un sistema sense particions, ambdues propostes redueixen la iniquitat del sistema en més d'un 25% de mitjana, independentment de la quantitat d'aplicacions en execució, i aquesta reducció en la iniquitat no afecta negativament el rendiment. / Selfa Oliver, V. (2018). Adaptive Prefetching and Cache Partitioning for Multicore Processors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112423
3	An Evaluation of Intel Cache Allocation Technology for Data- Intensive Applications / En utvärdering av Intel Cache Allocation Technology för dataintensiva applikationer Ihre Sherif, Alan January 2021 (has links) On certain CPUs part of the Intel Xeon Scalable CPU family, the level three (L3) cache is shared among the CPU cores residing on the same CPU socket. This has benefits in that a larger and more scalable cache space is available to the CPU cores. However, when the L3 cache is shared between CPU cores and thereby by the applications running there, the applications can affect the performance of each other if some of them have high L3 cache usage. This can be particularly problematic if an application is over-utilizing the L3 cache and effectively evicting the data of other applications, which are more prioritized, from the L3 cache. Such applications are called L3 cache noisy neighbors. The experiments in this thesis study the effect L3 cache noisy neighbors have on other, more prioritized, applications and if Intel Cache Allocation Technology (CAT) can be used to limit the performance impact the noisy neighbors have. Intel CAT provides functionality to control the amount of L3 cache allocated to a CPU core and by allocating less L3 cache to a noisy neighbor it no longer shares as much L3 cache with the prioritized applications and thus the prioritized applications can again utilize more of the L3 cache and regain their performance. The research question of this thesis is to investigate in what cases Intel CAT can provide advantages and where it is a disadvantage to use it by studying its use for three commonly used applications; bzip2, Redis, and Graph500. All the three applications were significantly impacted when running simultaneously with a noisy neighbor and for the Redis application there was a decrease of 49.2% in the number of ’GET’ requests per second that the Redis server could handle and an 18.2% decrease for ’SET’ requests. For the bzip2 and Graph500 applications, there was a 14.7% and 28.1% increase in execution time respectively. Intel CAT was successfully used to limit the impact of the noisy neighbor on the three applications. For the Redis application, the number of requests per second increased by 8.6% for the ’GET’ operation and by 4.2% for the ’SET’ operation. For the bzip2 and Graph500 applications, there was a 5.8% and 12.0% decrease in execution time respectively. Moreover, the thesis studies the scenario when only prioritized applications are running and if their performance can be increased by isolating the L3 cache for each one of them so that they cannot cause L3 cache evictions for each other. The use case of Intel CAT in such a scenario is not as clear as when mitigating the impact of a noisy neighbor but some performance benefits can be observed when running multiple Redis instances on the same machine and isolating some of the L3 cache available to them. / För vissa processorer som tillhör familjen Intel Xeon Scalable är den tredje nivåns cache (L3-cache) delad mellan CPU-kärnorna som befinner sig på samma CPU-sockel. Detta har fördelen att ett större och mer skalbart cacheutrymme blir tillgängligt för CPU-kärnorna. Att L3-cache är delat mellan kärnorna innebär däremot att applikationerna som kör där kan påverka varandras prestanda om någon av dem överutnyttjar L3-cache. När en applikation överutnyttjar L3-cache leder det till att data från andra applikationer, som kan vara mer prioriterade, inte längre får plats i cachen. Sådana applikationer kallas för ”L3-cache noisy neigbors”. Experimenten i denna studie undersöker effekterna av L3-cache noisy neigbors på mer prioriterade applikationer och om Intel Cache Allocation Technology (CAT) kan användas för att begränsa den påverkan som L3-cache noisy neigbors har. Intel CAT har funktionalitet för att kontrollera mängden L3-cache som allokeras till en CPU-kärna och genom att allokera mindre L3-cache till en noisy neigbor så delar den inte lika mycket L3-cache med de prioriterade applikationerna och därmed kan de prioriterade applikationerna återfå sin prestanda. Frågeställningen för denna studie är att undersöka i vilka användningsområden Intel CAT har fördelar och när det är en nackdel att använda det genom att studera dess användning för tre välanvända applikationer, bzip2, Redis och Graph500. Prestandan för alla av dessa tre applikationer blev tydligt påverkad när de kördes samtidigt som en noisy neigbor och Intel CAT kunde användas för att minska den påverkan. För Redis ökade antalet frågor som hanterades av Redis med 8.6% för GET-operationer och 4.2% för SET-operationer. För bzip2 och Graph500 observerades en minskning i exekveringstid på 5.8% och 12.0% respektive. Denna uppsats undersöker även scenariot där bara prioriterade applikationer körs och om deras prestanda kan ökas genom att isolera L3-cache för var och en av dem så att de inte tar plats från varandra i L3-cachen. När Intel CAT användes i ett sådant scenario är fördelarna inte lika tydliga som när påverkan av en noisy neighbor begränsades men en viss förbättring i prestanda går att observera när flera Redisservrar körs på samma maskin och en del av L3-cachen isoleras till var och en av dem. L3 cache Last Level Cache Intel Cache Allocation Technology L3 Cache allocation L3 cache performance improvement Redis performance Graph500 performance bzip2 performance L3-cache Sista nivåns cache Intel Cache Allocation Technology L3 Cache-allokering L3-cache prestandaförbättring Redis prestanda Graph500 prestanda bzip2 prestanda Computer Sciences Datavetenskap (datalogi)

1

Page generated in 0.1236 seconds