Global ETD Search

1	Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors Lodde, Mario 04 February 2014 (has links) La jerarquía de caches y la red en el chip (NoC) son dos componentes clave de los chip multiprocesadores (CMPs). La mayoría del trafico en la NoC se debe a mensajes que las caches envían según lo que establece el protocolo de coherencia. La cantidad de trafico, el porcentaje de mensajes cortos y largos y el patrón de trafico en general varían dependiendo de la geometría de las caches y del protocolo de coherencia. La arquitectura de la NoC y la jerarquía de caches están de hecho firmemente acopladas, y estos dos componentes deben ser diseñados y evaluados conjuntamente para estudiar como el variar uno afecta a las prestaciones del otro. Además, cada componente debe ajustarse a los requisitos y a las oportunidades del otro, y al revés. Normalmente diferentes clases de mensajes se envían por diferentes redes virtuales o por NoCs con diferente ancho de banda, separando mensajes largos y cortos. Sin embargo, otra clasificación de los mensajes se puede hacer dependiendo del tipo de información que proveen: algunos mensajes, como las peticiones de datos, necesitan campos para almacenar información (dirección del bloque, tipo de petición, etc.); otros, como los mensajes de reconocimiento (ACK), no proporcionan ninguna información excepto por el ID del nodo destino: solo proveen una información de tipo temporal, en el sentido que la recepción de un ACK indica que el nodo fuente ha recibido el mensaje al que está contestando con el ACK y completado todas las operaciones determinadas por el protocolo de coherencia. Esta segunda clase de mensaje no necesita de mucho ancho de banda: la latencia es mucho mas importante, dado que el nodo destino esta típicamente bloqueado esperando la recepción de ellos. En este trabajo de tesis se desarrolla una red dedicada para trasmitir la segunda clase de mensajes; la red es muy sencilla y rápida, y permite la entrega de los ACKs con una latencia de pocos ciclos de reloj. Reduciendo la latencia y el trafico en la NoC debido a los ACKs, es posible: -acelerar la fase de invalidación en fase de escritura en un sistema que usa un protocolo de coherencia basado en directorios -mejorar las prestaciones de un protocolo de coerencia basado en broadcast, hasta llegar a prestaciones comparables con las de un protocolo de directorios pero sin el coste de área debido a la necesidad de almacenar el directorio -implementar un mapeado dinámico de bloques a las caches de ultimo nivel de forma eficiente, con el objetivo de acercar cuanto al máximo los bloques a los cores que los utilizan El objetivo final es obtener un co-diseño de NoC y jerarquía de caches que minimice los problemas de escalabilidad de los protocolos de coherencia. Como gran objetivo final, se pretende la implementación de un CMP con ubicación dinámica de los recursos de cache y red, tal que estos recursos se puedan particionar de forma eficiente e independiente para asignar diferentes particiones a diferentes aplicaciones en un entorno virtualizado. / Lodde, M. (2014). Smart Memory and Network-On-Chip Design for High-Performance Shared-Memory Chip Multiprocessors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/35325 Chip multiprocessors Computer architecture Cache hierarchy Cache coherence protocols Network-on-chip
2	Efficient graph algorithm execution on data-parallel architectures Bangalore Lakshminarayana, Nagesh 12 January 2015 (has links) Mechanisms for improving the execution efficiency of graph algorithms on Data-Parallel Architectures were proposed and identified. Execution of graph algorithms on GPGPU architectures, the prevalent data-parallel architectures was considered. Irregular and data dependent accesses in graph algorithms were found to cause significant idle cycles in GPGPU cores. A prefetching mechanism that reduced the amount of idle cycles by prefetching a data-dependent access pattern found in graph algorithms was proposed. Storing prefetches in unused spare registers in addition to storing them in the cache was shown to be more effective by the prefetching mechanism. The design of the cache hierarchy for graph algorithms was explored. First, an exclusive cache hierarchy was shown to be beneficial at the cost of increased traffic; a region based exclusive cache hierarchy was shown to be similar in performance to an exclusive cache hierarchy while reducing on-chip traffic. Second, bypassing cache blocks at both the level one and level two caches was shown to be beneficial. Third, the use of fine-grained memory accesses (or cache sub-blocking) was shown to be beneficial. The combination of cache bypassing and fine-grained memory accesses was shown to be more beneficial than applying the two mechanisms individually. Finally, the impact of different implementation strategies on algorithm performance was evaluated for the breadth first search algorithm using different input graphs and heuristics to identify the best performing implementation for a given input graph were also discussed. Graph algorithms Data-parallel architectures GPGPU architectures Prefetching Cache hierarchy Inclusion property Cache bypass Fine-grained accesses BFS characterization
3	Managing the memory hierarchy in GPUs Dublish, Saumay Kumar January 2018 (has links) Pervasive use of GPUs across multiple disciplines is a result of continuous adaptation of the GPU architectures to address the needs of upcoming application domains. One such vital improvement is the introduction of the on-chip cache hierarchy, used primarily to filter the high bandwidth demand to the off-chip memory. However, in contrast to traditional CPUs, the cache hierarchy in GPUs is presented with significantly different challenges such as cache thrashing and bandwidth bottlenecks, arising due to small caches and high levels of memory traffic. These challenges lead to severe congestion across the memory hierarchy, resulting in high memory access latencies. In memory-intensive applications, such high memory access latencies often get exposed and can no longer be hidden through multithreading, and therefore adversely impact system performance. In this thesis, we address the inefficiencies across the memory hierarchy in GPUs that lead to such high levels of congestion. We identify three major factors contributing to poor memory system performance: first, disproportionate and insufficient bandwidth resources in the cache hierarchy; second, poor cache management policies; and third, high levels of multithreading. In order to revitalize the memory hierarchy by addressing the above limitations, we propose a three-pronged approach. First, we characterize the bandwidth bottlenecks present across the memory hierarchy in GPUs and identify the architectural parameters that are most critical in alleviating congestion. Subsequently, we explore the architectural design space to mitigate the bandwidth bottlenecks in a cost-effective manner. Second, we identify significant inter-core reuse in GPUs, presenting an opportunity to reuse data among the L1s. We exploit this reuse by connecting the L1 caches with a lightweight ring network to facilitate inter-core communication of shared data. We show that this technique reduces traffic to the L2 cache, freeing up the bandwidth for other accesses. Third, we present Poise, a machine learning approach to mitigate cache thrashing and bandwidth bottlenecks by altering the levels of multi-threading. Poise comprises a supervised learning model that is trained offline on a set of profiled kernels to make good warp scheduling decisions. Subsequently, a hardware inference engine is used to predict good warp scheduling decisions at runtime using the model learned during training. In summary, we address the problem of bandwidth bottlenecks across the memory hierarchy in GPUs by exploring how to best scale, supplement and utilize the existing bandwidth resources. These techniques provide an effective and comprehensive methodology to mitigate the bandwidth bottlenecks in the GPU memory hierarchy.
4	Performance Optimisation of Discrete-Event Simulation Software on Multi-Core Computers / Prestandaoptimering av händelsestyrd simuleringsmjukvara på flerkärniga datorer Kaeslin, Alain E. January 2016 (has links) SIMLOX is a discrete-event simulation software developed by Systecon AB for analysing logistic support solution scenarios. To cope with ever larger problems, SIMLOX's simulation engine was recently enhanced with a parallel execution mechanism in order to take advantage of multi-core processors. However, this extension did not result in the desired reduction in runtime for all simulation scenarios even though the parallelisation strategy applied had promised linear speedup. Therefore, an in-depth analysis of the limiting scalability bottlenecks became necessary and has been carried out in this project. Through the use of a low-overhead profiler and microarchitecture analysis, the root causes were identified: atomic operations causing a high communication overhead, poor locality leading to translation lookaside buffer thrashing, and hot spots that consume significant amounts of CPU time. Subsequently, appropriate optimisations to overcome the limiting factors were implemented: eliminating the expensive operations, more efficient handling of heap memory through the use of a scalable memory allocator, and data structures that make better use of caches. Experimental evaluation using real world test cases demonstrated a speedup of at least 6.75x on an eight-core processor. Most cases even achieve a speedup of more than 7.2x. The various optimisations implemented further helped to lower run times for sequential execution by 1.5x or more. It can be concluded that achieving nearly linear speedup on a multi-core processor is possible in practice for discrete-event simulation. / SIMLOX är en kommersiell mjukvara utvecklad av Systecon AB, vars huvudsakliga funktion är en händelsestyrd simuleringskärna för analys av underhållslösningar för komplexa tekniska system. För hantering av stora problem så används parallellexekvering för simuleringen, vilket i teorin borde ge en nästan linjär skalning med antal trådar. Prestandaförbättringen som observerats i praktiken var dock ytterst begränsad, varför en ordentlig analys av skalbarheten har gjorts i detta projekt. Genom användandet av ett profileringsverktyg med liten overhead och mikroarkitektur-analys, så kunde orsakerna hittas: atomiska operationer som skapar mycket overhead för kommunikation, dålig lokalitet ger fragmentering vid översättning till fysiska adresser och dåligt utnyttjande av TLB-cachen, och vissa flaskhalsar som kräver mycket CPU-kraft. Därefter implementerades och testade optimeringar för att undvika de identifierade problem. Testade lösningar inkluderar eliminering av dyra operationer, ökad effektivitet i minneshantering genom skalbara minneshanteringsalgoritmer och implementation av datastrukturer som ger bättre lokalitet och därmed bättre användande av cache-strukturen. Verifiering på verkliga testfall visade på uppsnabbningar på åtminstone 6.75 gånger på en processor med 8 kärnor. De flesta fall visade på en uppsnabbning med en faktor större än 7.2. Optimeringarna gav även en uppsnabbning med en faktor på åtminstone 1.5 vid sekventiell exekvering i en tråd. Slutsatsen är därmed att det är möjligt att uppnå nästan linjär skalning med antalet kärnor för denna typ av händelsestyrd simulering. cache hierarchy caches communication overhead data structures discrete-event simulation heap memory linear speedup logistic support low-overhead profiler memory allocator microarchitecture microarchitecture analysis multi-core optimisation parallel execution profiler runtime scalability scalability bottlenecks scalable memory allocator simulation translation lookaside buffer translation lookaside buffer thrashing TLB Computer Sciences Datavetenskap (datalogi)
5	Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine Lei, Jie 18 November 2024 (has links) [ES] Esta tesis presenta un estudio exhaustivo sobre la implementación de una realización eficiente de GEMM en procesadores de bajo consumo y en una plataforma heterogénea de AMD. Esta investigación está inspirada por la creciente demanda de inferencias de bajo consumo, baja latencia y alto rendimiento con modelos complejos de Deep Learning (DL) que surgen, por ejemplo, en Natural Language Processing (NLP) y Convolutional Neural Networks (CNN). Esto llevó a la oportunidad de explorar la aplicabilidad de la aceleración de hardware y software para GEMM en plataformas ARM, RISC-V y AMD Versal AI Engine (AIE). Establecimos los objetivos de nuestra investigación de la siguiente manera: Primero, desarrollar kernels de precisión mixta eficientes para GEMM en arquitecturas ARM y RISC-V explotando las unidades Single-Instruction, Multiple-Data (SIMD) en estas arquitecturas. En segundo lugar, explorar la aplicabilidad del algoritmo convencional para GEMM en plataformas de hardware no convencionales como el AIE en el sistema AMD Versal. Por último, investigar la escalabilidad del diseño paralelo de GEMM a múltiples AIE en sistemas AMD Versal. En mayor detalle, la investigación comienza implementando GEMM en las arquitecturas ARM y RISC-V, donde propusimos una herramienta de generación de código de micro-kernels basada en plantillas para ARM Neon, la extensión vectorial RISC-V (RVV) 0.7.1 y RVV 1.0. La herramienta de generación de código también permite configurar las dimensiones del micro-kernel, un parámetro crítico desde el punto de vista del rendimiento. Este trabajo indica que esta generación de código de kernels mejoró drásticamente la productividad y la portabilidad de los diseños de GEMM basados en intrínsecos. También incorporamos aritmética de precisión mixta INT8\|INT32, mostrando la aceleración sobre los enfoques FP32. Basándonos en el éxito de la implementación de GEMM en sistemas convencionales de bajo costo, extendimos nuestros intereses a plataformas heterogéneas no convencionales, en particular, la arquitectura AMD Versal AIE. Para esta plataforma, diseñamos micro-kernels específicos de la arquitectura de 8x8 utilizando intrínsecos flexibles de bajo nivel, implementando aritmética de precisión mixta y rutinas de empaquetado de datos, todo orientado a la inferencia de DL de alto rendimiento. Más importante aún, propusimos un diseño de jerarquía de memoria personalizada para esta arquitectura, crucial para operaciones de GEMM de baja latencia. Los resultados muestran que los micro-kernels propuestos lograron el 86.7% del rendimiento máximo de la implementación de un solo AIE. Fuimos un paso más allá al evaluar el diseño de GEMM en el modelo de DL ResNet-50 v1.5+ImageNet, donde convertimos los operadores de convolución a kernels de GEMM. Tras la implementación exitosa de GEMM en un solo tile de AIE, extendimos nuestra investigación a múltiples tiles de AIE, donde introdujimos la paralelización en el algoritmo. Rediseñamos el GEMM específico de la arquitectura acomodando hasta 32 tiles de AIE. Para lograr esto, optimizamos el diseño de la jerarquía de memoria personalizada y propusimos una nueva topología para un mayor rendimiento de comunicación. Los resultados muestran una gran escalabilidad del diseño paralelo de GEMM, reduciendo drásticamente el tiempo de computación en 31.5x en comparación con el diseño de un solo tile de AIE. / [CA] Aquesta tesi presenta un estudi complet sobre la implementació d'una realització eficient de GEMM en processadors de baix consum i una plataforma heterogènia d'AMD. Aquesta investigació s'inspira en la creixent demanda d'inferències de baix consum, baixa latència i alt rendiment amb models complexos de Deep Learning (DL), com per exemple, en Natural Language Processing (NLP) i Convolutional Neural Networks (CNN). Això va portar a l'oportunitat d'explorar l'aplicabilitat de l'acceleració de maquinari i programari per a GEMM en plataformes ARM, RISC-V i AMD Versal AI Engine (AIE). Els objectius de la nostra investigació són els següents: En primer lloc, desenvolupar nuclis de precisió mixta eficients per a GEMM en arquitectures ARM i RISC-V explotant les unitats Single-Instruction, Multiple-Data (SIMD) en aquestes arquitectures. En segon lloc, explorar l'aplicabilitat de l'algorisme convencional per a GEMM en plataformes de maquinari no convencionals com l'AIE en el sistema AMD Versal. Finalment, investigar l'escalabilitat del disseny paral·lel de GEMM a múltiples AIE en sistemes AMD Versal. En més detall, la investigació comença implementant GEMM en arquitectures ARM i RISC-V, on hem proposat una eina de generació de codi de micro-nuclis basada en plantilles per a ARM Neon, l'extensió vectorial RISC-V (RVV) 0.7.1 i RVV 1.0. L'eina de generació de codi també permet configurar les dimensions del micro-nucli, un paràmetre crític des del punt de vista del rendiment. Aquest treball indica que aquesta generació de codi de nucli va millorar dràsticament la productivitat i portabilitat dels dissenys de GEMM basats en intrínsecs. També incorporem aritmètica de precisió mixta INT8\|INT32, mostrant la millora de velocitat respecte als enfocaments FP32. Sobre la base de l'èxit de la implementació de GEMM en sistemes convencionals de consum, vam ampliar els nostres interessos a arquitectures heterogènies no convencionals, en particular, l'arquitectura AMD Versal AIE. Per a aquesta plataforma, vam dissenyar micro-nuclis específics d'arquitectura de 8x8 utilitzant els intrínsecs de baix nivell flexibles, implementant aritmètica de precisió mixta i rutines d'embalatge de dades, totes destinades a inferència de DL d'alt rendiment. Més important encara, vam proposar un disseny de jerarquia de memòria personalitzat per a aquesta arquitectura, que és crucial per a operacions GEMM de baixa latència. Els resultats mostren que els micro-nuclis proposats van aconseguir el 86,7% del rendiment màxim d'una implementació d'AIE única. Vam anar un pas més enllà avaluant el disseny de GEMM en el model de DL ResNet-50 v1.5+ImageNet, on vam convertir els operadors de convolució en nuclis GEMM. Després de la implementació exitosa de GEMM en una sola rajola AIE, vam ampliar la nostra investigació a múltiples rajoles AIE, on vam introduir la paral·lelització a l'algorisme. Vam redissenyar el GEMM específic d'arquitectura per a acomodar fins a 32 rajoles AIE. Per aconseguir-ho, vam optimitzar el disseny de la jerarquia de memòria personalitzada i vam proposar una nova topologia per a un major ample de banda de comunicació. / [EN] This thesis presents a comprehensive study on implementing an efficient realization of GEMM on low-power commodity processors and a heterogeneous platform from AMD. This research is inspired by the increasing demand for low-power, low-latency, high-performance inference with complex Deep Learning (DL) models arising, for instance, in Natural Language Processing (NLP) and Convolutional Neural Networks (CNN). This led to the opportunity to explore the applicability of hardware and software acceleration for GEMM on ARM, RISC-V, and AMD Versal AI Engine (AIE) platforms. We set up the objectives of our research as follows: Firstly, to develop efficient mixed precision kernels for GEMM on ARM and RISC-V architectures exploiting the Single-Instruction, Multiple-Data (SIMD) units in these architectures. Secondly, to explore the applicability of the conventional algorithm for GEMM to non-conventional hardware platforms such as the AIE in the AMD Versal system. Lastly, to investigate the scalability of the parallel design of GEMM to multiple AIEs on AMD Versal systems. In greater detail, the research starts by implementing GEMM on ARM and RISC-V architectures, where we proposed template-based micro-kernels code generation tool for ARM Neon, RISC-V vector (RVV) extension 0.7.1, and RVV 1.0. The code generation tool also allows configuring the micro-kernel dimensions, a critical parameter from the point of performance. This work indicates this kernel code generation drastically improved the productivity and portability of intrinsic-based GEMM designs. We also incorporate mixed-precision INT8\|INT32 arithmetic, showing the speedup over FP32 approaches. Building upon the success of GEMM implementation on conventional commodity systems, we extended our interests to non-conventional heterogeneous platforms, in particular, the AMD Versal AIE architecture. For this platform, we designed architecture-specific 8x8 micro-kernels utilizing the flexible low-level intrinsic, implementing mixed-precision arithmetic and data-packing routines, all aimed for high-performance DL inference. More importantly, we proposed a customized memory hierarchy design for this architecture, which is crucial for low-latency GEMM operations. The results show that the proposed micro-kernels achieved 86.7% of the peak performance of a single AIE implementation. We went a step further by benchmarking the GEMM design on the DL model ResNet-50 v1.5+ImageNet, where we converted the convolution operators to GEMM kernels. Following the successful implementation of GEMM on a single AIE tile, we extended our research to multiple AIE tiles, where we introduced parallelization to the algorithm. We redesigned the architecture-specific GEMM accommodating up to 32 AIE tiles. To achieve this, we optimized the customized memory hierarchy design and proposed a new topology for higher communication throughput. The results show great scalability of the parallel GEMM design, drastically reducing computational time by 31.5x compared to the single AIE tile design. / I would like to express my sincere appreciation to Horizon 2020 of the European Union for their generous funding. This project has been supported by the European Union’s Horizon 2020 (H2020) Marie Sklodowska-Curie Innovative Training Networks H2020-MSCA-ITN-2020 call, under Grant Agreement no. 956090. This funding has been crucial in enabling the success of this research. / Lei, J. (2024). Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/212297 General Matrix Multiply (GEMM) Matrix Multiplication High Performance Computing AMD Versal Cache Hierarchy SIMD Units Loop Parallelism Deep Learning Artificial Intelligence Inteligencia artificial Aprendizaje profundo Multiplicación de matrices Computación de alto rendimiento

1

Page generated in 0.0703 seconds