1 |
Register Caching for Energy Efficient GPGPU Tensor Core Computing. Qian, Qiran. January 2023
The General-Purpose GPU (GPGPU) has emerged as the predominant computing device for extensive parallel workloads in the fields of Artificial Intelligence (AI) and Scientific Computing, primarily owing to its adoption of the Single Instruction Multiple Thread architecture, which not only provides a wealth of thread context but also effectively hides the latencies exposed in single-thread executions. As computational demands have evolved, modern GPGPUs have incorporated specialized matrix engines, e.g., NVIDIA's Tensor Core (TC), in order to deliver substantially higher throughput for dense matrix computations compared with traditional scalar or vector architectures. Beyond mere throughput, energy efficiency is a pivotal concern in GPGPU computing. The register file is the largest memory structure on the GPGPU die and typically accounts for over 20% of the dynamic power consumption. To enhance energy efficiency, GPGPUs incorporate a technique named register caching, borrowed from the realm of CPUs. Register caching captures temporal locality among register operands to reduce energy consumption within a 2-level register file structure. The presence of TC raises new challenges for Register Cache (RC) design, as each matrix instruction imposes intensive operand-delivery traffic on the register file banks. In this study, we delve into the RC design trade-offs in GPGPUs. We undertake a comprehensive exploration of the design space, encompassing a range of workloads. Our experiments not only reveal the basic design considerations of RC but also clarify that conventional caching strategies underperform, particularly when dealing with TC computations, primarily due to poor temporal locality and the substantial register operand traffic involved. Based on these findings, we propose an enhanced caching strategy featuring a look-ahead allocation policy to minimize unnecessary cache allocations for destination register operands. Furthermore, to leverage the energy efficiency of Tensor Core computing, we highlight an alternative instruction scheduling framework for Tensor Core instructions that collaborates with a specialized caching policy, resulting in a remarkable reduction of up to 50% in dynamic energy consumption within the register file during Tensor Core GEMM computations.
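To make the look-ahead allocation policy concrete, the sketch below models a tiny 2-level register file in which source operands are always cached, while a destination register is allocated in the RC only if an instruction inside a short look-ahead window reads it again. This is a toy model under assumed parameters (FIFO eviction, a 4-entry cache, an invented trace format), not the simulator used in the thesis:

```cpp
// Toy trace-driven model of a register cache (RC) with look-ahead allocation.
// Illustrative sketch only: eviction policy, window size, and instruction
// format are assumptions, not the thesis's design.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

struct Inst { std::vector<int> srcs; int dst; };

struct RegisterCache {
    std::size_t capacity;
    std::deque<int> entries;              // FIFO order, front = oldest
    long hits = 0, misses = 0;

    bool access(int reg) {                // returns true on hit
        if (std::find(entries.begin(), entries.end(), reg) != entries.end()) {
            ++hits; return true;
        }
        ++misses;
        return false;
    }
    void allocate(int reg) {
        if (std::find(entries.begin(), entries.end(), reg) != entries.end()) return;
        if (entries.size() == capacity) entries.pop_front();   // evict oldest
        entries.push_back(reg);
    }
};

// Look-ahead: cache a destination register only if an instruction within the
// next `window` instructions reads it; otherwise it bypasses the RC and is
// written straight to the main register file.
bool reusedSoon(const std::vector<Inst>& trace, std::size_t i, int reg,
                std::size_t window) {
    for (std::size_t j = i + 1; j < trace.size() && j <= i + window; ++j)
        for (int s : trace[j].srcs)
            if (s == reg) return true;
    return false;
}

int main() {
    std::vector<Inst> trace = {
        {{1, 2}, 3}, {{3, 4}, 5}, {{6, 7}, 8}, {{5, 1}, 9},
    };
    RegisterCache rc{4};
    const std::size_t window = 2;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        for (int s : trace[i].srcs)
            if (!rc.access(s)) rc.allocate(s); // source operands always cached
        if (reusedSoon(trace, i, trace[i].dst, window))
            rc.allocate(trace[i].dst);         // look-ahead allocation
    }
    std::printf("RC hits: %ld, misses: %ld\n", rc.hits, rc.misses);
}
```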
|
2 |
Evaluation of FPGA-based High Performance Computing Platforms. Frick-Lundgren, Martin. January 2023
High performance computing is a topic that has risen to the top in the era of digitalization, AI and automation. Therefore, the search for more cost- and time-effective ways to implement HPC work is a subject of extensive research. One part of this is having hardware capable of improving on these criteria. Different hardware usually requires different code languages, though cross-platform solutions like Intel's oneAPI framework are gaining popularity. In this thesis, the capability of Intel's oneAPI framework to implement and execute HPC benchmarks on different hardware platforms is discussed. Using the hardware available through Intel's DevCloud services, Intel's Xeon Gold 6128, Intel's UHD Graphics P630 and the Arria 10 FPGA board were chosen for implementation. The benchmarks chosen were GEMM (General Matrix Multiplication) and BUDE (Bristol University Docking Engine). They were implemented using DPC++ (Data Parallel C++), Intel's own SYCL-based C++ extension, and improved upon with HPC speed-up methods like loop unrolling and some hardware manipulation. The performance for CPU and GPU was recorded and compared; the FPGA implementation could not be performed because of technical difficulties. The results compare well with related work but do not improve much upon it, because the hardware used is quite weak compared to the industry standard. Further research on the topic would be interesting: comparing a working FPGA implementation to the other results and to results from other studies. That implementation probably also has the biggest improvement potential, so seeing how good one could make it would be worthwhile, as would testing other, more complex benchmarks.
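To illustrate the programming model, a naive GEMM in DPC++/SYCL boils down to one parallel_for over the output matrix. The sketch below is a minimal assumed example (function and buffer names are illustrative, not the benchmark code used in the thesis); the inner k-loop is where speed-up methods such as loop unrolling and tiling would be applied:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// Naive GEMM: C (MxN) = A (MxK) * B (KxN), one work-item per C element.
void gemm(sycl::queue& q, const std::vector<float>& A,
          const std::vector<float>& B, std::vector<float>& C,
          size_t M, size_t N, size_t K) {
    sycl::buffer<float, 2> bufA(A.data(), sycl::range<2>(M, K));
    sycl::buffer<float, 2> bufB(B.data(), sycl::range<2>(K, N));
    sycl::buffer<float, 2> bufC(C.data(), sycl::range<2>(M, N));
    q.submit([&](sycl::handler& h) {
        sycl::accessor a(bufA, h, sycl::read_only);
        sycl::accessor b(bufB, h, sycl::read_only);
        sycl::accessor c(bufC, h, sycl::write_only, sycl::no_init);
        h.parallel_for(sycl::range<2>(M, N), [=](sycl::id<2> idx) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)   // candidate for loop unrolling
                acc += a[idx[0]][k] * b[k][idx[1]];
            c[idx] = acc;
        });
    });
    q.wait();  // results write back to C when the buffers go out of scope
}
```

Compiled with icpx -fsycl, the same source can target a Xeon CPU or the UHD Graphics GPU simply by constructing the queue with a different device selector, which is the cross-platform appeal discussed above.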
|
3 |
Use of genetically engineered mouse models in preclinical drug development. Creedon, Helen. January 2015
The paucity of well-validated preclinical models is frequently cited as a contributing factor to the high attrition rates seen in clinical oncological trials. There remains a critical need to develop models which are accurately able to recapitulate the features of human disease. The aims of this study were to use genetically engineered mouse models (GEMMs) to explore the efficacy of novel treatment strategies in HER2-positive breast cancer and to further develop the model to facilitate the study of mechanisms underpinning drug resistance. Using the BLG-HER2KI-PTEN+/- model, we demonstrated that Src plays an important role in the early stages of tumour development. Chemopreventative treatment with dasatinib delayed tumour initiation (p=0.046, Wilcoxon signed rank test) and prolonged overall survival (OS) (p=0.06, Wilcoxon signed rank test). Dasatinib treatment also induced squamous metaplasia in 66% of drug-treated tumours. We used two cell lines derived from this model to further explore dasatinib's mechanism of action and demonstrated reduced proliferation, migration and invasion following in vitro treatment. Due to the prolonged tumour latency and the low metastatic rate seen in this model, further studies were undertaken with the MMTV-NIC model, which also allowed us to study the impact of PTEN loss on therapeutic response. We validated this model by treating a cohort of MMTV-NIC PTEN+/- mice with paclitaxel and demonstrated prolonged OS (p=0.035, Gehan-Breslow-Wilcoxon test). AZD8931 is an equipotent signalling inhibitor of HER2, HER3 and EGFR. We observed heterogeneity in tumour response, but overall AZD8931 treatment prolonged OS in both the MMTV-NIC PTEN FL/+ and MMTV-NIC PTEN+/- models. PTEN loss was associated with reduced sensitivity to AZD8931 and failure to suppress Src activity, suggesting these may be suitable predictive biomarkers of AZD8931 response. To facilitate further studies exploring resistance, we transplanted MMTV-NIC PTEN+/- fragments into syngeneic mice and generated three tumours with acquired resistance to AZD8931. These tumours displayed differing resistance strategies: one tumour continued to express HER2, whilst the remaining two underwent EMT and lost HER2 expression, reflecting to a limited degree the heterogeneity of resistance strategies seen in human disease. To further explore resistance to HER2-targeting tyrosine kinase inhibitors, we generated a panel of human cell lines with acquired resistance to AZD8931 and lapatinib. Western blotting demonstrated loss of HER2, HER3 and PTEN in all resistant lines. Acquisition of resistance was associated with a marked change in phenotype, and western blotting confirmed all lines had undergone EMT. We used a combination of RPPA and mass spectrometry to further characterise the AZD8931-resistant lines and identified multiple potential novel proteins involved in the resistant phenotype, including several implicated in EMT. In conclusion, when coupled with appropriate in vitro techniques, the MMTV-NIC model is a valuable tool for selection of emerging drugs to carry forward into clinical trials of HER2-positive breast cancer.
|
4 |
Recursive Blocked Algorithms, Data Structures, and High-Performance Software for Solving Linear Systems and Matrix Equations. Jonsson, Isak. January 2003
This thesis deals with the development of efficient and reliable algorithms and library software for factorizing matrices and solving matrix equations on high-performance computer systems. The architectures of today's computers consist of multiple processors, each with multiple functional units. The memory systems are hierarchical with several levels, each having different speed and size. The practical peak performance of a system is reached only by considering all of these characteristics. One portable method for achieving good system utilization is to express a linear algebra problem in terms of level 3 BLAS (Basic Linear Algebra Subprogram) transformations. The most important operation is GEMM (GEneral Matrix Multiply), which typically defines the practical peak performance of a computer system. There are efficient GEMM implementations available for almost any platform, thus an algorithm using this operation is highly portable. The dissertation focuses on how recursion can be applied to solve linear algebra problems. Recursive linear algebra algorithms have the potential to automatically match the size of subproblems to the different memory hierarchies, leading to much better utilization of the memory system. Furthermore, recursive algorithms expose level 3 BLAS operations and reveal task parallelism. The first paper handles the Cholesky factorization for matrices stored in packed format. Our algorithm uses a recursive packed matrix data layout that enables the use of high-performance matrix-matrix multiplication, in contrast to the standard packed format. The resulting library routine requires half the memory of full storage, yet the performance is better than for full storage routines. Papers two and three introduce recursive blocked algorithms for solving triangular Sylvester-type matrix equations. For these problems, recursion together with superscalar kernels produces new algorithms that give 10-fold speedups compared to existing routines in the SLICOT and LAPACK libraries. We show that our recursive algorithms also have a significant impact on the execution time of solving unreduced problems and when used in condition estimation. By recursively splitting several problem dimensions simultaneously, parallel algorithms for shared memory systems are obtained. The fourth paper introduces a library, RECSY, consisting of a set of routines implemented in Fortran 90 using the ideas presented in papers two and three. Using performance monitoring tools, the last paper evaluates the possible gain of using different matrix blocking layouts and the impact of superscalar kernels in the RECSY library.
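For the Cholesky case, the recursive blocking idea can be sketched directly: split the matrix in halves, factor the leading block, solve a triangular system for the off-diagonal block, apply a GEMM-like symmetric update to the trailing block, and recurse. The following minimal sketch (the base-case block size and the naive stand-in kernels are assumptions; a real library would call tuned level 3 BLAS) shows how the flops concentrate in the matrix-multiply-like update:

```cpp
// Recursive blocked Cholesky factorization (lower triangular), illustrating
// the recursive-blocking idea; column-major n x n matrix, A[i + j*ld].
#include <cmath>
#include <cstddef>

static void chol_base(double* A, std::size_t n, std::size_t ld) {
    for (std::size_t j = 0; j < n; ++j) {               // left-looking Cholesky
        for (std::size_t k = 0; k < j; ++k)
            for (std::size_t i = j; i < n; ++i)
                A[i + j * ld] -= A[i + k * ld] * A[j + k * ld];
        A[j + j * ld] = std::sqrt(A[j + j * ld]);
        for (std::size_t i = j + 1; i < n; ++i)
            A[i + j * ld] /= A[j + j * ld];
    }
}

// Solve X * L11^T = A21 in place (A21 is m x n, L11 is n x n lower triangular).
static void trsm_rlt(const double* L11, double* A21,
                     std::size_t m, std::size_t n, std::size_t ld) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t k = 0; k < j; ++k)
            for (std::size_t i = 0; i < m; ++i)
                A21[i + j * ld] -= A21[i + k * ld] * L11[j + k * ld];
        for (std::size_t i = 0; i < m; ++i)
            A21[i + j * ld] /= L11[j + j * ld];
    }
}

// A22 -= L21 * L21^T (lower part only): the GEMM-like update that dominates
// the flop count and runs at near-GEMM speed in a tuned library.
static void syrk_ln(const double* L21, double* A22,
                    std::size_t m, std::size_t k, std::size_t ld) {
    for (std::size_t j = 0; j < m; ++j)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t i = j; i < m; ++i)
                A22[i + j * ld] -= L21[i + p * ld] * L21[j + p * ld];
}

void chol_recursive(double* A, std::size_t n, std::size_t ld) {
    if (n <= 64) { chol_base(A, n, ld); return; }   // block size: an assumption
    std::size_t n1 = n / 2, n2 = n - n1;
    chol_recursive(A, n1, ld);                      // A11 = L11 * L11^T
    trsm_rlt(A, A + n1, n2, n1, ld);                // L21 = A21 * L11^-T
    syrk_ln(A + n1, A + n1 + n1 * ld, n2, n1, ld);  // A22 -= L21 * L21^T
    chol_recursive(A + n1 + n1 * ld, n2, ld);       // trailing block
}
```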
|
5 |
Biological functions of microRNA-216 and microRNA-217 during the development of pancreatic cancer. Azevedo-Pouly, Ana Clara P. 17 October 2013
No description available.
|
6 |
Defining Mutation-Specific NRAS Functions that Drive Melanomagenesis. Murphy, Brandon M. January 2021
No description available.
|
7 |
Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine. Lei, Jie. 18 November 2024
This thesis presents a comprehensive study on implementing an efficient realization of GEMM on low-power commodity processors and a heterogeneous platform from AMD. This research is inspired by the increasing demand for low-power, low-latency, high-performance inference with complex Deep Learning (DL) models arising, for instance, in Natural Language Processing (NLP) and Convolutional Neural Networks (CNN). This led to the opportunity to explore the applicability of hardware and software acceleration for GEMM on ARM, RISC-V, and AMD Versal AI Engine (AIE) platforms.
We set out the objectives of our research as follows: firstly, to develop efficient mixed-precision kernels for GEMM on ARM and RISC-V architectures, exploiting the Single-Instruction, Multiple-Data (SIMD) units in these architectures; secondly, to explore the applicability of the conventional algorithm for GEMM to non-conventional hardware platforms such as the AIE in the AMD Versal system; and lastly, to investigate the scalability of the parallel design of GEMM to multiple AIEs on AMD Versal systems.
In greater detail, the research starts by implementing GEMM on ARM and RISC-V architectures, where we proposed a template-based micro-kernel code generation tool for ARM Neon, the RISC-V vector extension (RVV) 0.7.1, and RVV 1.0. The code generation tool also allows configuring the micro-kernel dimensions, a critical parameter from the performance point of view. This work indicates that kernel code generation drastically improves the productivity and portability of intrinsics-based GEMM designs. We also incorporate mixed-precision INT8|INT32 arithmetic, showing the speedup over FP32 approaches.
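A representative instance of such a generated kernel is a 4x4 FP32 micro-kernel built on Neon fused multiply-add intrinsics; the micro-kernel dimensions are exactly the parameter the generator exposes. The sketch below is an assumed example in that style (the kernel name and packed-operand layout are illustrative, not the thesis's generated code); a mixed-precision INT8|INT32 variant would instead widen INT8 products into INT32 accumulators:

```cpp
// 4x4 FP32 GEMM micro-kernel with ARM Neon intrinsics. Assumes packed
// operands: Ap holds kc micro-columns of 4 floats, Bp holds kc micro-rows.
#include <arm_neon.h>

// C (4x4, column-major, leading dimension ldc) += Ap * Bp.
void ukernel_4x4_fp32(int kc, const float* Ap, const float* Bp,
                      float* C, int ldc) {
    float32x4_t c0 = vdupq_n_f32(0.f), c1 = vdupq_n_f32(0.f);
    float32x4_t c2 = vdupq_n_f32(0.f), c3 = vdupq_n_f32(0.f);
    for (int k = 0; k < kc; ++k) {
        float32x4_t a = vld1q_f32(Ap + 4 * k);   // one packed column of A
        float32x4_t b = vld1q_f32(Bp + 4 * k);   // one packed row of B
        c0 = vfmaq_laneq_f32(c0, a, b, 0);       // C[:,0] += A[:,k] * B[k,0]
        c1 = vfmaq_laneq_f32(c1, a, b, 1);
        c2 = vfmaq_laneq_f32(c2, a, b, 2);
        c3 = vfmaq_laneq_f32(c3, a, b, 3);
    }
    vst1q_f32(C + 0 * ldc, vaddq_f32(vld1q_f32(C + 0 * ldc), c0));
    vst1q_f32(C + 1 * ldc, vaddq_f32(vld1q_f32(C + 1 * ldc), c1));
    vst1q_f32(C + 2 * ldc, vaddq_f32(vld1q_f32(C + 2 * ldc), c2));
    vst1q_f32(C + 3 * ldc, vaddq_f32(vld1q_f32(C + 3 * ldc), c3));
}
```

A template-based generator varies the register tile (here 4x4) and emits the corresponding intrinsic sequence per ISA, which is what makes the design both configurable and portable across Neon and RVV targets.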
Building upon the success of the GEMM implementation on conventional commodity systems, we extended our interest to non-conventional heterogeneous platforms, in particular the AMD Versal AIE architecture. For this platform, we designed architecture-specific 8x8 micro-kernels utilizing flexible low-level intrinsics, implementing mixed-precision arithmetic and data-packing routines, all aimed at high-performance DL inference. More importantly, we proposed a customized memory hierarchy design for this architecture, which is crucial for low-latency GEMM operations. The results show that the proposed micro-kernels achieved 86.7% of the peak performance of a single-AIE implementation. We went a step further by benchmarking the GEMM design on the DL model ResNet-50 v1.5 + ImageNet, where we converted the convolution operators to GEMM kernels.
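The data-packing step mentioned above can be illustrated in portable C++: a panel of the input matrix is copied into the contiguous layout the micro-kernel streams through, which is also what lets a customized memory hierarchy feed the compute tile with unit-stride reads. A generic sketch (names and layout are assumptions; the AIE buffer and DMA plumbing is not shown):

```cpp
#include <cstddef>

// Pack an mr x kc panel of column-major A (leading dimension lda) into a
// contiguous buffer Ap, one mr-element micro-column per k iteration, so the
// micro-kernel reads its operands sequentially.
void pack_A_panel(const float* A, std::size_t lda,
                  std::size_t mr, std::size_t kc, float* Ap) {
    for (std::size_t k = 0; k < kc; ++k)
        for (std::size_t i = 0; i < mr; ++i)
            Ap[k * mr + i] = A[i + k * lda];
}
```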
Following the successful implementation of GEMM on a single AIE tile, we extended our research to multiple AIE tiles, introducing parallelization into the algorithm. We redesigned the architecture-specific GEMM to accommodate up to 32 AIE tiles. To achieve this, we optimized the customized memory hierarchy design and proposed a new topology for higher communication throughput. The results show great scalability of the parallel GEMM design, drastically reducing computation time by 31.5x compared to the single-AIE-tile design.
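At a high level, scaling to many tiles amounts to giving each tile a disjoint block of the output matrix while the memory hierarchy and topology keep the operand streams fed. A host-side sketch of that partitioning under assumed names and a 4x8 tile grid (the actual AIE graph and routing are not shown):

```cpp
#include <cstddef>

constexpr int TILE_ROWS = 4, TILE_COLS = 8;   // 32 tiles, as above

struct TileTask { std::size_t row0, col0, rows, cols; int tile_id; };

// Assign each tile a disjoint block of the M x N output C; every tile then
// runs the same single-tile GEMM kernel on its block.
void partition_gemm(std::size_t M, std::size_t N,
                    TileTask tasks[TILE_ROWS * TILE_COLS]) {
    const std::size_t bm = M / TILE_ROWS;     // assumes M, N divide evenly
    const std::size_t bn = N / TILE_COLS;
    for (int r = 0; r < TILE_ROWS; ++r)
        for (int c = 0; c < TILE_COLS; ++c)
            tasks[r * TILE_COLS + c] =
                { r * bm, c * bn, bm, bn, r * TILE_COLS + c };
}
```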
/ I would like to express my sincere appreciation to Horizon 2020 of the European Union for their generous funding. This project has been supported by the European Union's Horizon 2020 (H2020) Marie Sklodowska-Curie Innovative Training Networks H2020-MSCA-ITN-2020 call, under Grant Agreement no. 956090. This funding has been crucial in enabling the success of this research. / Lei, J. (2024). Deep Learning Inference on Low-Power Commodity Processors and the AMD Versal AI Engine [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/212297
|